Continuous and Data-Driven Descriptors (CDDD)

Implementation of the Paper "Learning Continuous and Data-Driven Molecular Descriptors by Translating Equivalent Chemical Representations" by Robin Winter, Floriane Montanari, Frank Noe and Djork-Arne Clevert.¹

Installing

Prerequisites

python 3
tensorflow 1.10
numpy
rdkit
scikit-learn

Conda

Create a new enviorment:

git clone https://github.com/jrwnter/cddd.git
cd cddd
conda env create -f environment.yml
source activate cddd

Install tensorflow without GPU support:

pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.10.0-cp36-cp36m-linux_x86_64.whl

Or with GPU support:

pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.10.0-cp36-cp36m-linux_x86_64.whl

And install the cddd package:

pip install .

Downloading Pretrained Model

A pretrained model as described in ref. 1 is available on Google Drive. Download and unzip by execuiting the bash script "download_default_model.sh":

./download_default_model.sh

The default_model.zip file can also be downloaded manualy under https://drive.google.com/open?id=1oyknOulq_j0w9kzOKKIHdTLo5HphT99h

Testing

Extract molecular descriptors from two QSAR datasets (ref. 2,3) and evaluate the perfromance of a SVM trained on these descriptors.

cd example
python3 run_qsar_test.py --model_dir ../default_model

or with gpu support on e.g. device 0:

python3 run_qsar_test.py --model_dir ../default_model --use_gpu --device 0

The accuracy on the Ames dataset should be arround 0.814 +/- 0.006.

The r2 on the Lipophilicity dataset should be arround 0.731 +/- 0.029.

Getting Started

Extracting Molecular Descripotrs

Run the script run_cddd.py to extract molecular descripotrs of your provided SMILES:

cddd --input smiles.smi --output descriptors.csv  --smiles_header smiles

Supported input:

.csv-file with one SMILES per row
.smi-file with one SMILES per row

For .csv: Specify the header of the SMILES column with the flag --smiles_header (default: smiles)

Inference Module

The pretrained model can also be imported and used directly in python via the inference class:

import pandas as pd
from cddd.inference import InferenceModel
from cddd.preprocessing import preprocess_smiles

Load and preprocess data:

ames_df = pd.read_csv("example/ames.csv", index_col=0)
ames_df["smiles_preprocessed"] = ames_df.smiles.map(preprocess_smiles)
ames_df = ames_df.dropna()
smiles_list = ames_df["smiles_preprocessed"].tolist()

Create a instance of the inference class:

inference_model = InferenceModel()

Encode all SMILES into the continuous embedding (molecular descriptor):

smiles_embedding = inference_model.seq_to_emb(smiles_list)

The infernce model instance can also be used to decode a molecule embedding back to a interpretable SMILES string:

decoded_smiles_list = inference_model.emb_to_seq(smiles_embedding)

References

[1] R. Winter, F. Montanari, F. Noe and D. Clevert, Chem. Sci, 2019, https://pubs.rsc.org/en/content/articlelanding/2019/sc/c8sc04175j#!divAbstract

[2] K. Hansen, S. Mika, T. Schroeter, A. Sutter, A. Ter Laak, T. Steger-Hartmann, N. Heinrich and K.-R. MuÌ´Lller, J. Chem. Inf. Model., 2009, 49, 2077–2081.

[3] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing and V. Pande, Chemical Science, 2018, 9, 513–530.

Name		Name	Last commit message	Last commit date
Latest commit History 114 Commits
cddd		cddd
example		example
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
download_default_model.sh		download_default_model.sh
environment.yml		environment.yml
environment_pypi.yml		environment_pypi.yml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Continuous and Data-Driven Descriptors (CDDD)

Installing

Prerequisites

Conda

Downloading Pretrained Model

Testing

Getting Started

Extracting Molecular Descripotrs

Inference Module

References

About

Releases

Packages

Languages

License

healx/cddd

Folders and files

Latest commit

History

Repository files navigation

Continuous and Data-Driven Descriptors (CDDD)

Installing

Prerequisites

Conda

Downloading Pretrained Model

Testing

Getting Started

Extracting Molecular Descripotrs

Inference Module

References

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages