This repository contains the code and data for the second ML project on dynamic protein localization, by Leonardo Bocchieri, Emilia Farina, and Riccardo Rota. It covers the full pipeline we followed to solve the problem, from data cleaning to postprocessing and interpretation of the results.
This repository contains a selection of models for static and dynamic protein localization on datasets of budding yeast proteins. The task is to predict the position of a protein inside the cell among 15 possible classes, taking as input the protein sequence together with information about its concentration and its interactions with other proteins. Static predictions target a single label, i.e. the most frequent position of the given protein in the cell, while dynamic predictions are made over 5 time steps, corresponding to 5 different phases of the eukaryotic cell cycle.
For a deeper insight, refer to the report of the project.
To set up all the needed packages, run in your shell:
pip install -r requirements.txt
adding the full path to requirements.txt if necessary.
You can create the datasets by running data_cleaning.ipynb. This notebook contains our data cleaning and data inspection pipeline, together with plots and remarks.
All the created datasets are saved to the datasets folder. Feel free to skip this preprocessing step and go directly to training the models.
You can go through our entire workflow in run_all.ipynb. In this Jupyter Notebook, we create and run all the models we implemented for the static and dynamic problems, and also cross-validate the main ones.
The architectures used allow for a wide range of hyperparameters: to tune them easily, we provide JSON files you can play with. In particular, you can choose which architecture to use by setting the model_type variable in the JSON files. For more information, please refer to the tables shown below.
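As an illustration, a hyperparameter file could look like the fragment below. Only model_type is documented in this README; the other field names are hypothetical placeholders, so check the actual JSON files in the repository for the real keys:

```json
{
  "model_type": 2,
  "learning_rate": 0.001,
  "batch_size": 32,
  "epochs": 50
}
```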
The main difficulty of our problem is the small number of samples in our datasets, which pushes us towards pre-trained state-of-the-art architectures. We decided to use ESM (Evolutionary Scale Modeling), which provides large models pre-trained on huge protein datasets from different species. These models take as input protein sequences of variable length and output a corresponding fixed-dimensional embedding. In all our implementations we use the model esm2_t30_150M_UR50D, which outputs embeddings of dimension 640.
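The key property used here is that a variable-length sequence maps to a fixed-size vector. A minimal sketch of that idea, with random per-residue features standing in for the actual ESM model outputs (mean pooling is one common way to collapse per-residue representations, not necessarily the exact reduction used in this repo):

```python
import numpy as np

EMBED_DIM = 640  # output dimension of esm2_t30_150M_UR50D

def pool_embedding(per_residue: np.ndarray) -> np.ndarray:
    """Collapse a (sequence_length, EMBED_DIM) matrix of per-residue
    features into a single fixed-size vector via mean pooling."""
    return per_residue.mean(axis=0)

# Two sequences of different lengths map to vectors of the same size.
short = pool_embedding(np.random.rand(50, EMBED_DIM))
long = pool_embedding(np.random.rand(700, EMBED_DIM))
assert short.shape == long.shape == (EMBED_DIM,)
```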
For our static localization problem, we tried 3 different architectures:
Name | model_type | Sequences of the Extremities | Description |
---|---|---|---|
MLP | 0 | Not used | Simple fully connected neural network |
XGBoost | 1 | Not used | Model built on the xgboost library: only hyperparameter tuning |
StaticModelBranch | 2 | Used | Combines transformers, which embed the extremity sequences, with linear layers for classification |
For our dynamic localization problem, we tried 4 different architectures:
Name | model_type | Description |
---|---|---|
LSTM Dynamic Model | 0 | Dynamic and static data are combined and fed to bidirectional LSTM layers for feature extraction; linear layers handle classification |
TCN Model | 1 | Temporal convolutional layers extract features from the dynamic data, which are then combined with the static model output through linear layers |
Simple Model | 2 | Linear layers only, no temporal architecture involved |
Modulable LSTM Dynamic Model | 3 | Same architecture as the LSTM Dynamic Model, but lets you choose which data to use for training |
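A hypothetical sketch of how the model_type value from a JSON file could be dispatched to a model class (the class names below are illustrative placeholders, not necessarily the repo's actual classes):

```python
import json

# Placeholder classes standing in for the dynamic model architectures.
class LSTMDynamicModel: ...
class TCNModel: ...
class SimpleModel: ...
class ModulableLSTMDynamicModel: ...

# Map model_type (as in the tables above) to the class to instantiate.
DYNAMIC_MODELS = {
    0: LSTMDynamicModel,
    1: TCNModel,
    2: SimpleModel,
    3: ModulableLSTMDynamicModel,
}

config = json.loads('{"model_type": 1}')
model_cls = DYNAMIC_MODELS[config["model_type"]]
assert model_cls is TCNModel
```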
Apart from the two Jupyter Notebooks, the repository is organized in folders which handle different tasks:
- configs: for creating datasets and dataloaders
- datasets: with all the datasets created by run_data_cleaning.ipynb
- hyperparameters: with all the JSON files (for our best models, refer to hyperparameters_cross_val.json)
- losses: with the definitions of all the losses we can use (and the map that handles them for function passing)
- models:
  - static models: with all the static model classes
  - dynamic models: with all the dynamic model classes
- naive model: for creating and running the required naive model
- plot outputs: with the data inspection figures
- trainers: with all the functions for training and validation
- utils: with all the helper functions used in our data cleaning
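The loss map mentioned above (used for function passing) might be sketched as a name-to-callable dictionary. The names and implementations below are hypothetical, purely to illustrate the pattern:

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# The map lets a trainer receive a loss by name from a config file.
LOSS_MAP = {"cross_entropy": cross_entropy, "mse": mse}

loss_fn = LOSS_MAP["cross_entropy"]
value = loss_fn([0.1, 0.7, 0.2], 1)  # -log(0.7), about 0.357
```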