This repository contains the code and data for the second ML project on dynamic protein localization, by Leonardo Bocchieri, Emilia Farina, and Riccardo Rota. It covers the full pipeline we followed to solve the problem, from data cleaning to postprocessing and interpretation of the results.
This repository contains a selection of models for static and dynamic protein localization on datasets of budding yeast proteins. The task is to predict the position of a protein inside the cell among 15 possible classes, taking as input the protein sequence together with information about its concentration and its interactions with other proteins. Static predictions target a single label, i.e. the most frequent position of the given protein in the cell, while dynamic predictions are made over 5 time steps, corresponding to 5 different phases of the eukaryotic cell cycle.
For a deeper insight, refer to the report of the project.
To set up all the needed packages, run in your shell:
pip install -r requirements.txt
adding the full path to requirements.txt if necessary.
You can create the datasets by running data_cleaning.ipynb. This notebook contains our data cleaning and data inspection pipeline, together with plots and remarks.
All the created datasets are saved to the datasets folder. Feel free to skip this preprocessing step and go directly to training the models.
You can go through our entire workflow in run_all.ipynb. In this Jupyter Notebook, we create and run all the models we implemented for the static and dynamic problems, and also cross-validate the main ones.
The architectures used allow for a wide range of hyperparameters: to tune them easily, we provide JSON files you can play with. In particular, you can choose which architecture to use by setting the model_type variable in the JSON files. For more information, please refer to the tables shown below.
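As an illustration, a hyperparameter file could look like the fragment below. Only model_type is documented in this README; the other field names are hypothetical placeholders, so check the actual JSON files in the repository for the real keys:

```json
{
  "model_type": 2,
  "learning_rate": 0.001,
  "batch_size": 32,
  "epochs": 50
}
```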
The main difficulty of our problem is the small number of samples in our datasets, which pushes us towards pre-trained state-of-the-art architectures. We decided to use ESM (Evolutionary Scale Modeling), which provides large models pre-trained on huge protein datasets from different species. These models take as input protein sequences of variable length and output a corresponding fixed-dimensional embedding. In all our implementations we use the model esm2_t30_150M_UR50D, which outputs embeddings of dimension 640.
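The key property used here is that a variable-length sequence maps to a fixed-size vector. A minimal sketch of that idea, with random per-residue features standing in for the actual ESM model outputs (mean pooling is one common way to collapse per-residue representations, not necessarily the exact reduction used in this repo):

```python
import numpy as np

EMBED_DIM = 640  # output dimension of esm2_t30_150M_UR50D

def pool_embedding(per_residue: np.ndarray) -> np.ndarray:
    """Collapse a (sequence_length, EMBED_DIM) matrix of per-residue
    features into a single fixed-size vector via mean pooling."""
    return per_residue.mean(axis=0)

# Two sequences of different lengths map to vectors of the same size.
short = pool_embedding(np.random.rand(50, EMBED_DIM))
long = pool_embedding(np.random.rand(700, EMBED_DIM))
assert short.shape == long.shape == (EMBED_DIM,)
```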
For our static localization problem, we tried 3 different architectures:
Name | model_type | Sequences of the Extremities | Description |
---|---|---|---|
MLP | 0 | Not used | Simple fully connected neural network |
XGBoost | 1 | Not used | Model built on the xgboost library: only hyperparameter tuning |
StaticModelBranch | 2 | Used | Combines transformers, which embed the extremity sequences, with linear layers for classification |
For our dynamic localization problem, we tried 4 different architectures:
Name | model_type | Description |
---|---|---|
LSTM Dynamic Model | 0 | Dynamic and static data are combined and fed to bidirectional LSTM layers for feature extraction; linear layers handle classification |
TCN Model | 1 | Temporal convolutional layers extract features from the dynamic data, which are then combined with the static model output through linear layers |
Simple Model | 2 | Linear layers only, no temporal architecture involved |
Modulable LSTM Dynamic Model | 3 | Same architecture as the LSTM Dynamic Model, but lets you choose which data to use for training |
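A hypothetical sketch of how the model_type value from a JSON file could be dispatched to a model class (the class names below are illustrative placeholders, not necessarily the repo's actual classes):

```python
import json

# Placeholder classes standing in for the dynamic model architectures.
class LSTMDynamicModel: ...
class TCNModel: ...
class SimpleModel: ...
class ModulableLSTMDynamicModel: ...

# Map model_type (as in the tables above) to the class to instantiate.
DYNAMIC_MODELS = {
    0: LSTMDynamicModel,
    1: TCNModel,
    2: SimpleModel,
    3: ModulableLSTMDynamicModel,
}

config = json.loads('{"model_type": 1}')
model_cls = DYNAMIC_MODELS[config["model_type"]]
assert model_cls is TCNModel
```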
Apart from the two Jupyter Notebooks, the repository is organized in folders which handle different tasks:
- configs: for creating datasets and dataloaders
- datasets: with all the datasets created by run_data_cleaning.ipynb
- hyperparameters: with all the JSON files (for our best models, refer to hyperparameters_cross_val.json)
- losses: with the definitions of all the losses we can use (and the map that handles them for function passing)
- models:
  - static models: with all the static model classes
  - dynamic models: with all the dynamic model classes
- naive model: for creating and running the required naive model
- plot outputs: with the data inspection figures
- trainers: with all the functions for training and validation
- utils: with all the helper functions used in our data cleaning
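The loss map mentioned above (used for function passing) might be sketched as a name-to-callable dictionary. The names and implementations below are hypothetical, purely to illustrate the pattern:

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# The map lets a trainer receive a loss by name from a config file.
LOSS_MAP = {"cross_entropy": cross_entropy, "mse": mse}

loss_fn = LOSS_MAP["cross_entropy"]
value = loss_fn([0.1, 0.7, 0.2], 1)  # -log(0.7), about 0.357
```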