COBS

Classification of biochemical sequences

The goal of this project is to develop a framework for classifying biochemical sequences, such as those stored in FASTA files.

Models available:

KNN (knn)
Logistic regression (logreg)
RandomForestClassifier (rf)
SVC (svc)
Isolation Forest (if)
ResidualNN (residual)
Perceptron (perceptron)
Multilayer perceptron (mperceptron)

Models in progress:

LSTM
RNN

Use cobs/config.ini to configure the models.

rparams (type: dictionary) for basic configuration
gparams (type: dictionary) for randomized search configuration
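
As an illustration of how two dictionary-valued options could be read from an INI file, here is a minimal Python 3 sketch. The section name `logreg`, the parameter names and values, and the helper `load_params` are all hypothetical; the real cobs/config.ini may be laid out differently, so check the file itself before relying on this.

```python
import ast
import configparser

# Hypothetical config text; the real cobs/config.ini may use other
# section names, keys, and values.
EXAMPLE_CONFIG = """
[logreg]
rparams = {"C": 1.0, "penalty": "l2"}
gparams = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}
"""


def load_params(config_text, section):
    """Return (rparams, gparams) dictionaries for one config section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # literal_eval accepts only Python literals, so it is safer than eval.
    rparams = ast.literal_eval(parser[section]["rparams"])
    gparams = ast.literal_eval(parser[section]["gparams"])
    return rparams, gparams


rparams, gparams = load_params(EXAMPLE_CONFIG, "logreg")
print(rparams)  # single values for a basic run
print(gparams)  # lists of candidate values for a randomized search
```

In this sketch rparams holds one value per parameter, while gparams holds a list of candidates per parameter, which is the shape a randomized search typically samples from.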

KNOWN BUG in Parallel: the script must be restarted after a Keras model has been used in the experiments table.

Install with Conda

Linux
Python 3.6 or 2.7
Install https://github.com/DentonJC/virtual_screening
source virtual_screening/env.sh
- or add virtual_screening to PATH
Conda (https://www.anaconda.com/download/#linux)
conda install --file requirements

Already installed for virtual_screening:

Python3: pip install configparser
Python2: pip install ConfigParser
pip install argparse

Usage

usage: Classification of biochemical sequences
              [-h] [--output OUTPUT]
              [--configs CONFIGS]
              [--n_iter N_ITER]
              [--n_jobs N_JOBS]
              [--patience PATIENCE]
              [--gridsearch]
              [--experiments_file EXPERIMENTS_FILE]
              [--length LENGTH]
              select_model [select_model ...]
              dataset_path [dataset_path ...]

positional arguments:
select_model          name of the model, select from list in README
dataset_path          path to dataset

optional arguments:
-h, --help            show this help message and exit
--output OUTPUT       path to output directory
--configs CONFIGS     path to config file
--n_iter N_ITER       number of iterations in RandomizedSearchCV
--n_jobs N_JOBS       number of jobs
--patience PATIENCE, -p PATIENCE
                      patience of fit
--gridsearch, -g      use RandomizedSearchCV
--experiments_file EXPERIMENTS_FILE, -e EXPERIMENTS_FILE
                      address where to write results of experiments
--length LENGTH, -l LENGTH
                      maximum length of sequences
--targets TARGETS, -t TARGETS
                      set number of target column
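
To make the option list easier to check, the sketch below rebuilds an argparse parser from the usage text alone. It is not the actual code in cobs/run_model.py: the `int` types, the missing defaults, and the `prog` name are assumptions, and only the option names and help strings come from the text above.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the usage text (a sketch, not the real one)."""
    parser = argparse.ArgumentParser(
        prog="run_model.py",  # assumed program name
        description="Classification of biochemical sequences",
    )
    parser.add_argument("select_model", nargs="+",
                        help="name of the model, select from list in README")
    parser.add_argument("dataset_path", nargs="+", help="path to dataset")
    parser.add_argument("--output", help="path to output directory")
    parser.add_argument("--configs", help="path to config file")
    # The int types below are assumptions; the usage text does not state them.
    parser.add_argument("--n_iter", type=int,
                        help="number of iterations in RandomizedSearchCV")
    parser.add_argument("--n_jobs", type=int, help="number of jobs")
    parser.add_argument("--patience", "-p", type=int, help="patience of fit")
    parser.add_argument("--gridsearch", "-g", action="store_true",
                        help="use RandomizedSearchCV")
    parser.add_argument("--experiments_file", "-e",
                        help="address where to write results of experiments")
    parser.add_argument("--length", "-l", type=int,
                        help="maximum length of sequences")
    parser.add_argument("--targets", "-t", type=int,
                        help="set number of target column")
    return parser


# Parse an argument list with the same shape as a single-experiment command.
args = build_parser().parse_args(
    ["logreg", "data/dataset.csv", "--n_jobs", "-1",
     "--n_iter", "6", "--length", "256", "-g"]
)
print(args.select_model, args.dataset_path, args.n_jobs, args.gridsearch)
```

Because both positional arguments accept one or more values, each one is returned as a list even when a single model and a single dataset are given.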

Example input

Single experiment:

python cobs/run_model.py logreg data/dataset.csv --n_jobs -1 --n_iter 6 --length 256 -g

Table of experiments:

Fill in the table with the experiment parameters (see the examples in /etc; False means an empty cell)
Run python run.py
The experiments are run one at a time, and each one writes its results into the table's columns

Example output

2018-01-05 19:55:57,028 [main] INFO: GRID SEARCH
2018-01-05 19:55:57,028 [main] INFO: FIT
Fitting 10 folds for each of 6 candidates, totalling 60 fits
...
[Parallel(n_jobs=-1)]: Done 60 out of 60 | elapsed: 5.7min finished
2018-01-05 20:01:53,124 [main] INFO: Accuracy test: 86.59%
2018-01-05 20:01:54,589 [main] INFO: 0:06:07.959393
Can't create history plot for this type of experiment
Report complete, you can see it in the results folder
2018-01-05 20:01:54,720 [main] INFO: Done
2018-01-05 20:01:54,720 [main] INFO: Results path: /cobs/tmp/2018-01-05 19:55:46.630191/

Datasets

Generate dataset from local files

Put FASTA files into data/ folder
Run data/create_dataset.py
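
For readers unfamiliar with the FASTA format, the following Python 3 sketch shows one way FASTA records could be turned into labelled rows. It is not the code in data/create_dataset.py; the helpers `read_fasta` and `fasta_to_rows`, the label `demo_class`, and the idea of assigning one label per file are assumptions made for illustration.

```python
import csv
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_rows(handle, label):
    """Attach one class label (assumed: one label per file) to every record."""
    return [(header, seq, label) for header, seq in read_fasta(handle)]


# In-memory stand-in for a FASTA file placed in the data/ folder.
example = io.StringIO(">seq1 demo\nACGT\nTTGA\n>seq2\nGGCC\n")
rows = fasta_to_rows(example, label="demo_class")

# Write the rows as index, sequence, class with a header row.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["index", "sequence", "class"])
for i, (_, seq, lab) in enumerate(rows):
    writer.writerow([i, seq, lab])
print(out.getvalue())
```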

Download dataset from NCBI server

Configure search.ini: set the search requests and the label names
Run data/load_dataset.py

Use dataset from the "wild"

First row is headers
First column is indexes
Second column is sequences
Third column is classes
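
The layout above can be checked with a short Python 3 sketch before a dataset is passed to a model. The comma delimiter, the helper name `load_dataset`, and the sample rows are assumptions; only the column order comes from the list above.

```python
import csv
import io


def load_dataset(handle):
    """Read rows laid out as index, sequence, class; skip the header row."""
    reader = csv.reader(handle)  # comma-separated input is an assumption
    next(reader, None)  # first row holds the column headers
    sequences, classes = [], []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) < 3:
            raise ValueError("expected at least 3 columns, got %d" % len(row))
        sequences.append(row[1])  # second column: sequence
        classes.append(row[2])    # third column: class
    return sequences, classes


# Made-up sample in the expected layout.
sample = io.StringIO("id,sequence,class\n0,ACGTAC,positive\n1,TTGACA,negative\n")
sequences, classes = load_dataset(sample)
print(sequences, classes)
```

A row with fewer than three columns raises a ValueError, which surfaces a malformed file early rather than partway through training.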

Results

DNA classification: Promoter Gene Sequences

Class Distribution:

positive instances: 53 (50%)
negative instances: 53 (50%)

Random split:

Train 70%
Val 9%
Test 21%
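
A random split with these proportions can be sketched in Python 3 as follows. This is not necessarily how COBS performs the split; the helper `split_indices`, the fixed seed, and the rounding down of the train and validation sizes are assumptions.

```python
import random


def split_indices(n_samples, seed=0):
    """Shuffle sample indices and cut them into 70% / 9% / 21% parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n_samples * 70 // 100
    n_val = n_samples * 9 // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder, about 21%
    return train, val, test


train, val, test = split_indices(100)
print(len(train), len(val), len(test))
```

Note that 9% and 21% are what results from holding out 30% of the data and then splitting that hold-out 30/70, which may be how the proportions arose.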

Model	train accuracy (%)	test accuracy (%)
regression	89.56	88.34
random forest	100	93.27
SVC	100	89.38
IF	17.73	20.62
KNN	100	87.44

DNA classification: Splice-junction Gene Sequences

Class Distribution:

EI: 767 (25%)
IE: 768 (25%)
Neither: 1655 (50%)

Random split:

Train 70%
Val 9%
Test 21%

Model	train accuracy (%)	test accuracy (%)
regression	100	77.27
random forest	97.29	86.36
SVC	100	72.72
IF	48.64	27.27
KNN	100	77.27

Resources

Used:

Tested:

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plastid/
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/
https://www.ncbi.nlm.nih.gov/unigene/?term=human[organism]
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/
