Skip to content

Latest commit

 

History

History
166 lines (111 loc) · 6.79 KB

README.md

File metadata and controls

166 lines (111 loc) · 6.79 KB

Benchmarking and Property Prediction Framework (PREFER)

The PREFER framework automatizes the evaluation of different combinations of molecular representations and machine learning models for predicting molecular properties. It covers different molecular representation from classical, e.g. Fingerprints and 2D Descriptors, to data-driven representations, e.g. Continuous and Data Driven representations (CDDD) [1] or MoLeR[2]. PREFER uses AutoSklearn [3] to implement the ML model selection and the hyperparameter tuning.

caption

General overview of the PREFER framework where the Model Selection part is based on [3].

Getting Started

Installation

Python Environment

The main conda environment for using PREFER can be installed from prefer-environment.yml, as follows:

conda env create -f prefer-environment.yml

Depending to the models employed to generate model-based molecular representations, other environments need to be installed (one for each model). The supported models in the current PREFER code are CDDD [1] and MoLeR [2]. The corresponding environments can be found in moler-environment-light.yml and cddd-environment-light.yml and can be installed as follows:

conda env create -f moler-environment-light.yml
OR
conda env create -f cddd-environment-light.yml

Before running any experiments, relevant paths need to be set (including cddd and moler folders which are integrated in PREFER as git submodules), as follows:

PYTHONPATH="path_to/PREFER/prefer/model_based_representations/models/cddd/:path_to/PREFER/prefer/model_based_representations/models/molecule-generation/:path_to/PREFER/:$PYTHONPATH"
export PYTHONPATH

New models should be included as git submodules and add in the PYTHONPATH.

Conda Environments in Jupyter

To use the PREFER conda environment in a Jupyter notebook, the environment needs to be added to Jupyter's kernelspec:

conda activate prefer-env
python -m ipykernel install --user --name prefer-env --display-name "Python (prefer-env)"

Check that Jupyter has access to this environment by running

jupyter kernelspec list

The recently added env Python (prefer-env) should be available now in Jupyter.

Prerequisites

In order to run PREFER, we provide one notebook (Run-PREFER.ipynb) and one python script (run_prefer_automation.py)

Main steps are as follows:

STEP 0: clone the repository and unpack the git submodules

Once you have cloned this repository, please go into your cloned folder and run the following commands:

git submodule update --init --recursive

This is needed to unpack the git submodules used to connect PREFER to the models used to compute the model-based representations.

STEP 1: download public test datasets

Two public datasets can be used to test the code:

STEP 2: download models for calculating data-based molecular representations

Two models are supported currenlty as submodules in PREFER: CDDD and MOLER. Pre-trained models can be dowloaded from:

Save these trained models locally, since they will be used afterwards.

STEP 3: set the configuration files

For each PREFER job a yaml config file need to be prepared as follows:

  1. Main settings:
path_to_df: 'path_to_df'
experiment_name: 'experiment_name'
id_column_name:  'id_column_name'
smiles_column_name:  'smiles_column_name'
properties_column_name_list: 
      - 'property_1_col_name'
      - 'property_2_col_name'
problem_type: 'regression' # or 'classification'
splitting_strategy: 'random' # or 'cluster' or 'temporal'
temporal_info_column_name: 'temporal_info_column_name'

Examples are provided in ./config_files.

  1. Settings for model based representations:
model_based_representations:
    'model_name': 
        'path_to_model': 'path to model folder'(see STEP2)
        'conda_env': 'name of the conda env installed for this model'
        'submodule_path': 'path to the submodule folder included in PREFER for running the model'(e.g. path_to/prefer/model_based_representations/models/cddd/)
    
prefer_path: 'path_to_/PREFER/'

Examples of configuration file for the representations is provided in ./config_files/config_model_based_representations.yaml.

STEP 4: run Run-PREFER.ipynb notebook

To run the notebook Run_PREFER.ipynb, first of all select the correct kernel (Python (prefer-env)) and then change the needed paths, in particular:

  • sys.path.append('path_to/PREFER/')
  • sys.path.append('path_to/models/cddd/') # to connect CDDD model
  • sys.path.append('path_to/models/molecule-generation/') # to connect MOLER model

By running the notebook a folder (PREFER_results) will be created with the main results (benchmarking object and models). Moreover different folders with structure {model_name}representations{experiment_name} will be created containing the model_based representations.

In the notebook one can also find an example of how to use the stored PREFER-model-wrapper to predict new samples. This way the best model found for each molecular representation can be used later to predict the property under analysis.

An automatized version of the notebook can be found in run_prefer_automation.py. You can run it from the terminal with the following commands:

conda activate prefer-env

PYTHONPATH="path_to/PREFER/prefer/model_based_representations/models/cddd/:path_to/PREFER/prefer/model_based_representations/models/molecule-generation/:path_to/PREFER/:$PYTHONPATH"
export PYTHONPATH

python run_prefer_automation.py --prefer_args path_to_yaml_configuration_file(see STEP3) --model_based_representations_args path_to_yaml_configuration_file_for_models_used_to_compute_the_representations(see STEP4)

WARNING:

Please make sure that you select the right model type according to the dataset used (e.g.for a classification model binary labels should be provided in the dataset).

Authors

  • Jessica Lanini

With the contribution of

  • Nadine Schneider
  • Gianluca Santarossa
  • Sarah Lewis
  • Krzysztof Maziarz
  • Marwin Segler
  • Hubert Misztela

References

[1] Winter, Robin, et al. "Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations." Chemical science 10.6 (2019): 1692-1701.

[2] Maziarz, Krzysztof, et al. "Learning to extend molecular scaffolds with structural motifs." arXiv preprint arXiv:2103.03864 (2021).

[3] Feurer, Matthias, et al. "Efficient and robust automated machine learning." Advances in neural information processing systems 28 (2015).