This is an implementation of the retrospective experiments in the paper "Active Learning for Protein Structure Prediction".

- The DEWDROP selection strategy is available in the `alien` module, which is a modified version of ALIEN.
- The Nanofold model is an adaptation of the original Equifold repository and is defined in `model.py`.
The repo can be divided into data pre-processing (`data/`), active learning retrospective experiments (`al_selection/` & `synthetic_example/`), the ALIEN implementation of the selection strategies (`alien/`), and model training/inference scripts.
We use the following GPU-enabled setup with `conda` and `pip`: `conda` manages the GPU and bioconda/conda-forge dependencies, and `pip` manages the rest.

Note: `CUDA_VERSION` is the CUDA version your PyTorch build was compiled against, which should match your compute environment. For example, `CUDA_VERSION=118` corresponds to CUDA 11.8.
```bash
$ sudo apt-get update
$ sudo apt-get install hmmer
$ conda env create -f environment.yml
$ conda activate NanoFold
$ pip install transformers[torch] accelerate
$ pip install ml_collections
$ pip install e3nn
$ pip install torch_scatter torch_cluster -f https://data.pyg.org/whl/torch-2.4.0+cu${CUDA_VERSION}.html
$ python -m ipykernel install --user --name NanoFold --display-name "Python (NanoFold)"
```
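To choose `CUDA_VERSION`, you can check which CUDA version your PyTorch build was compiled against; this one-liner is a quick sanity check, not part of the original setup:

```bash
# Prints the CUDA version PyTorch was built with, e.g. "11.8" -> CUDA_VERSION=118
$ python -c "import torch; print(torch.version.cuda)"
```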
`configs/` contains `.ini` files that provide high-level configuration for all steps in the model lifecycle, from preprocessing data to benchmarking the final model. Each section corresponds to a different step in the process.
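As a rough sketch of how such a file might be organized (the `preprocessing` and `training` section names and the `n_seeds` parameter appear elsewhere in this README, but the keys and values below are hypothetical placeholders; see the shipped files in `configs/` for the real layout):

```ini
; Hypothetical sketch only -- consult configs/ for the actual keys.
[preprocessing]
csv_path = data/sequences.csv

[training]
log_name = my_experiment

[inference]
n_seeds = 5
```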
`models/` contains checkpoints, model weights, and model configuration files that can be used for fine-tuning and inference. There are several sets of model weights and configurations: the original Equifold version under `ab/`, the recycling version under `models_with_recycling/`, and the nanobody fine-tuned version under `nanobody_finetune_200`. For the retrospective experiments, we start with the original weights and configuration.
`data/` contains code for pre-processing `.csv` and corresponding `.pdb` or `.cif` files into the correct format.

If you want to run any training, first run `python data/prepare_pdb_1700.py` to obtain the `.pdb` files for all the entry names; otherwise, just having the sequences in a `.csv` file is fine. After that, run `python data/equifold_process_input.py` to generate the dataset `.pickle` file, then run `python data/train_val_test_split.py` to split the dataset. The `preprocessing` section in `config.ini` should point to the corresponding files and directories. The output files can later be used for training.
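Putting the steps together, a typical preprocessing run might look like this (the script names come from this README; the order is as described above):

```bash
$ python data/prepare_pdb_1700.py        # only needed for training: fetches the .pdb files
$ python data/equifold_process_input.py  # builds the dataset .pickle file
$ python data/train_val_test_split.py    # splits into train/validation/test sets
```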
`al_selection` contains scripts for the different selection strategies. After processing the data and curating the configuration file, call `python -m al_selection.[name-of-script]` to run the experiments. All training and validation statistics are stored under `model_logs/[name-of-configuration]`, with the name specified in the configuration file.
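For example, to run one strategy and inspect its logs (the bracketed names are placeholders you fill in, exactly as above):

```bash
$ python -m al_selection.[name-of-script]
$ ls model_logs/[name-of-configuration]   # training and validation statistics
```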
`python -m run_training` will begin training according to the configuration set in the `training` section. Logging is done with TensorBoard by default. Parameters can be set in `config.ini`.
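Since logging goes through TensorBoard, a run can be monitored with something like the following (the log directory name depends on your configuration, as noted above):

```bash
$ python -m run_training
$ tensorboard --logdir model_logs/[name-of-configuration]
```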
`python -m run_inference` will perform inference on a given set of structures from a `.csv` file. Parameters can be set in `config.ini`. The `n_seeds` parameter in `config.ini` determines how many predictions are made for each input. `inference_demo.ipynb` also walks through the process with visualizations of predicted structures.
Use `python -m run_inference_parallel` to perform inference leveraging multiple GPUs.
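For instance, to get several predictions per input and optionally fan out across GPUs (the `n_seeds` value here is illustrative):

```bash
# With n_seeds set in config.ini (e.g. n_seeds = 5), each input gets 5 predictions.
$ python -m run_inference            # single-process inference
$ python -m run_inference_parallel   # same inference across multiple GPUs
```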
```bibtex
@article{Xue2024,
  author      = {Xue, Zexin and Bailey, Michael and Gupta, Abhinav and Li, Ruijiang and Corrochano-Navarro, Alejandro and Li, Sizhen and Kogler-Anele, Lorenzo and Yu, Qui and Bar-Joseph, Ziv and Jager, Sven},
  title       = {Active Learning for Protein Structure Prediction},
  note        = {Under review},
  year        = {2024},
  institution = {R\&D Data \& Computational Science, Sanofi, Cambridge, MA, United States}
}
```