In this folder you will find a couple of examples of how to use the pipeline or its outputs.
You can refer to the `parameters_blueprint.yml` file for all pipeline parameters, including details about their functionality.
Ready to run pipeline examples:
For each of the following examples, `cd` into the directory (e.g. `cd use_case_one`) and execute `bio_embeddings config.yml`. All example outputs will soon be available for comparison purposes as well.
In brief:
Folder | Purpose |
---|---|
use_case_one | Embedding generation & visualization |
use_case_two | Embedding generation & visualization |
use_case_three | Embedding visualization |
Supervised annotation extraction | Prediction of localization and secondary structure |
Unsupervised annotation extraction | Transfer of annotations |
goPredSim | Transfer of annotations |
goPredSim using ProtTrans BERT BFD | Transfer of annotations |
deeploc | Embedding generation & visualization |
cath | Embedding generation & visualization |
disprot | Embedding generation & visualization |
docker | Pipeline use through Docker |
advanced_use_case | Embedding generation & transformation |
deepblast | Use DeepBLAST to align sequences |
tucker | Shows how tucker embeddings separate sequences by CATH class better than plain Bert embeddings |
light attention | Embedding generation & Prediction of subcellular localization |
mutagenesis | Prediction of mutation effects |
Use case: you have a set of proteins (in FASTA format) and want to create amino acid-level embeddings, as well as protein-level embeddings.
Additionally, you have an annotation file with some property for a subset of the proteins in your dataset. For these, you want to produce a visualization of the sequences and how they separate in space.
In this example, the sequence-level embeddings (`reduced_embeddings_file`) are leveraged to compute t-SNE projections, which are then used to generate a 3D plot of the "sequence space".
Noteworthy files produced:
- The `embed` stage produces an `embeddings_file` and a `reduced_embeddings_file`. The former contains embeddings for each AA in each sequence, while the latter contains (fixed sized) embeddings for each sequence in your set. You can use the notebooks to check out how to open these.
- The `project` stage produces an h5 `projected_reduced_embeddings_file`, which contains `(x,y,z)` coordinates for each sequence in your set.
- The `visualize` stage produces an HTML `plot_file` containing the plot of the sequences derived from the projection's coordinates.
Use case: you have a set of proteins (in FASTA format) and want to create amino acid-level embeddings, as well as protein-level embeddings. Additionally, you have an annotation file with some property for a subset of the proteins in your dataset. For these, you want to produce a visualization of the sequences and how they separate in space. This time around: you downloaded the models locally (faster execution) and want to provide the path to the model's weights and options. You also annotated your proteins using an md5 hash of the sequence instead of arbitrary identifiers.
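To illustrate what annotating proteins by an md5 hash of the sequence means, here is a minimal sketch using only the Python standard library. It hashes the sequence string exactly as given; whether the pipeline normalizes case or whitespace before hashing is not stated here, so treat that detail as an assumption. The identifiers and sequences are made up for illustration.

```python
import hashlib


def sequence_md5(sequence: str) -> str:
    """Return the hexadecimal md5 digest of a protein sequence string."""
    # Assumption: the sequence is hashed as-is, without any normalization.
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


# Hypothetical toy sequences; protein_a and protein_b are identical on purpose.
sequences = {
    "protein_a": "MKTAYIAKQR",
    "protein_b": "MKTAYIAKQR",
    "protein_c": "GSHMLEDPVD",
}

hashed = {name: sequence_md5(seq) for name, seq in sequences.items()}

for name, digest in hashed.items():
    print(name, digest)
```

Identical sequences always receive the same hash, which is why hash-based identifiers avoid ambiguity when the same sequence appears under several names.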
Use case: you already have per-protein embeddings of a certain dataset and want to produce various t-SNE plots, using both different annotation files and different t-SNE parameters.
Files you need:
- Reduced embeddings file: a per-protein embedding file in HDF5 format
- Mapping file: a file containing a mapping from md5 hash of the sequence to an arbitrary identifier (e.g. the one used in an annotation file)
- Annotation file(s): CSV files containing annotations for the proteins in the reduced embeddings file
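The way the mapping file and an annotation file fit together can be sketched as a simple join. This is only an illustration: the column names (`md5`, `original_id`, `identifier`, `label`), the hash placeholders and the labels are all hypothetical and not taken from the pipeline's actual file formats.

```python
import csv
import io

# Hypothetical file contents; real files would be read from disk instead.
mapping_csv = """md5,original_id
hash_aaa,seq_1
hash_bbb,seq_2
hash_ccc,seq_3
"""

annotations_csv = """identifier,label
seq_1,Nucleus
seq_2,Membrane
"""


def read_rows(text):
    """Parse CSV text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


# md5 hash -> arbitrary identifier, as provided by the mapping file.
id_by_md5 = {row["md5"]: row["original_id"] for row in read_rows(mapping_csv)}

# Arbitrary identifier -> annotation, as provided by the annotation file.
label_by_id = {row["identifier"]: row["label"] for row in read_rows(annotations_csv)}

# Join the two: md5 hash -> annotation, skipping proteins without annotation.
label_by_md5 = {
    md5: label_by_id[identifier]
    for md5, identifier in id_by_md5.items()
    if identifier in label_by_id
}

print(label_by_md5)
```

Proteins that appear in the mapping but not in the annotation file (here `seq_3`) simply end up without a label.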
Note: While it is possible to use the pipeline to produce many visualizations for many different annotations, it may be more efficient to use a Notebook for this.
We include a notebook (`project_visualize_pipeline_embeddings`) covering the same use case as the one presented here in the `notebooks` folder at the root of this project.
Trained supervised models: get protein structure and function annotations, supervised_annotation_extraction
Use case: you have a set of proteins (in FASTA format) and want to extract features using the supervised models published during evaluation of ProtT5 (namely: DSSP3, DSSP8, disorder, localization and membrane vs. soluble).
Noteworthy files produced:
- The `extract` stage produces `DSSP3_predictions_file`, `DSSP8_predictions_file`, and `disorder_predictions_file`, which are FASTA files containing the respective per-AA annotations.
- Additionally, a CSV `per_sequence_predictions_file` contains per-sequence annotations, namely localization and whether a sequence is predicted to be membrane-bound or not.
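Since the per-AA predictions are stored as FASTA files, they can be summarized with a few lines of standard-library Python. The sketch below assumes each record body holds one state letter per residue; the letters `H`, `E` and `C` and the record names are placeholders chosen for illustration, not a statement about the exact alphabet the pipeline writes.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a dict of header -> concatenated record body."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def state_fractions(states):
    """Return the fraction of residues assigned to each predicted state."""
    counts = Counter(states)
    return {state: count / len(states) for state, count in counts.items()}


# Hypothetical predictions file content (one state letter per residue).
example = """>seq_1
CCHHHHHHCC
>seq_2
CEEEEC
"""

fractions = {
    header: state_fractions(states)
    for header, states in read_fasta(example).items()
}

for header, per_state in fractions.items():
    print(header, per_state)
```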
Transfer annotations from labeled sequences to unlabeled sequences: unsupervised_annotation_extraction
Use case: you have a set of proteins with known properties (we call this "reference"), and you have a set of proteins for which you would like to infer these properties.
Unsupervised annotation extraction (also annotation transfer) happens through k-nearest-neighbour search of the closest embeddings in a reference, annotated dataset.
Distances between input sequences and reference dataset are calculated via pairwise distances between target (your input sequences) and reference embeddings (e.g. SwissProt).
The pipeline's implementation is inspired by goPredSim and offers standard distance metrics (e.g. euclidean, manhattan) as well as pseudo-distances (e.g. cosine).
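The idea of annotation transfer via nearest-neighbour search can be sketched in plain Python. This is a toy illustration under stated assumptions, not the pipeline's implementation: the function names, the two-dimensional embeddings and the annotation labels are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Cosine pseudo-distance: 1 minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def transfer_annotations(target, references, annotations, k=1, metric=euclidean):
    """Return the k closest references as (identifier, distance, annotation)."""
    hits = sorted(
        ((ref_id, metric(target, emb)) for ref_id, emb in references.items()),
        key=lambda hit: hit[1],
    )
    return [(ref_id, dist, annotations[ref_id]) for ref_id, dist in hits[:k]]


# Hypothetical reference embeddings with known annotations.
references = {"ref_1": [1.0, 0.0], "ref_2": [0.0, 1.0], "ref_3": [0.9, 0.1]}
annotations = {"ref_1": "kinase", "ref_2": "transporter", "ref_3": "kinase"}

# Hypothetical embedding of an unlabeled input sequence.
target = [0.8, 0.2]

nearest = transfer_annotations(target, references, annotations, k=2)
for ref_id, dist, annotation in nearest:
    print(ref_id, round(dist, 3), annotation)
```

Passing `metric=cosine_distance` switches the search to the cosine pseudo-distance without changing the rest of the logic.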
In this example, we use the `reduced_embeddings_file` calculated in `disprot`, and annotations from the CSV file there to transfer annotations onto an unknown dataset.
Noteworthy files produced:
- The `extract` stage produces:
  - a CSV `pairwise_distances_matrix_file`, which contains all pairwise distances (euclidean and cosine in this example) between input sequences/embeddings and reference embeddings;
  - a CSV `transferred_annotations_file`, which contains a column with the transferred annotations and, for each of the k closest reference elements, columns with its distance, identifier and annotations.
Use case: You have a set of proteins for which you would like to infer GO annotations (as is done in goPredSim). This uses the unsupervised_annotation_extraction idea.
Prerequisites:
- For this example, you have to download the `seqvec_reference_embeddings.h5` and `annotations.csv` files from http://data.bioembeddings.com/public/embeddings/reference/goa/ , and store them in the same folder as the `config.yml`.
- Our tests show that this experiment consumed 16.3GB of system RAM at peak computation (average: 2.5GB). The RAM requirements will increase with the number of sequences in your set.
Noteworthy files produced:
- The `extract` stage produces:
  - a CSV `transferred_annotations_file`, which contains a column with the transferred GO terms.
The same as goPredSim, but using `prottrans-bert-bfd` instead. You can find the reference protbert embeddings at: http://data.bioembeddings.com/public/embeddings/reference/goa/
This example includes sequences pulled directly from DeepLoc 1.0 (http://www.cbs.dtu.dk/services/DeepLoc/data.php). Annotations (`deeploc_annotations.csv`) were extracted from the FASTA headers.
Note: since the FASTA file pulled from DeepLoc contains duplicate sequences, the remapping has been set to "simple". This is generally discouraged, but for the sake of simplicity of this example has been kept "as-is".
This example includes sequences pulled directly from the CATH database (http://www.cathdb.info) and annotations for structural folds. To reduce the embeddings, UMAP was used.
Note: since the FASTA file pulled from CATH contains duplicate sequences, the remapping has been set to "simple". This is discouraged, as it may lead to higher computation times (embedding the same sequence multiple times), and could lead to conflicts when overlaying annotations.
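Before deciding on a remapping strategy, it can help to check whether a dataset actually contains duplicate sequences. A minimal sketch, assuming the sequences were already loaded into a dictionary of identifier to sequence (the identifiers and sequences below are made up):

```python
from collections import defaultdict


def find_duplicate_sequences(records):
    """Group identifiers by sequence and keep only sequences seen more than once."""
    ids_by_sequence = defaultdict(list)
    for identifier, sequence in records.items():
        ids_by_sequence[sequence].append(identifier)
    return {seq: ids for seq, ids in ids_by_sequence.items() if len(ids) > 1}


# Hypothetical records; domain_1 and domain_3 share the same sequence.
records = {"domain_1": "MKV", "domain_2": "GSH", "domain_3": "MKV"}

duplicates = find_duplicate_sequences(records)
print(duplicates)
```

Each reported group is a sequence that would be embedded more than once, and whose annotations could conflict when overlaid.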
Similar to the `cath` example, but using the DisProt (https://www.disprot.org) database instead. Annotations contain "highly disorder" for proteins with >80% disorder, and "low disorder" for proteins with <20% disordered AA content.
Note: in this example we exclude proteins with unknown annotation (see visualize stage in config).
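The labeling rule described above can be written down as a small function. The two label strings and the thresholds come from the description; the `UNKNOWN` placeholder for proteins in between, the function name and the toy counts are assumptions made for this sketch.

```python
def disorder_label(disordered_residues, length):
    """Assign a coarse disorder label from the fraction of disordered residues."""
    fraction = disordered_residues / length
    if fraction > 0.8:
        return "highly disorder"
    if fraction < 0.2:
        return "low disorder"
    # Assumption: everything in between is treated as unknown.
    return "UNKNOWN"


# Hypothetical proteins given as (disordered residues, sequence length).
proteins = {"p1": (90, 100), "p2": (10, 100), "p3": (50, 100)}

labels = {name: disorder_label(*counts) for name, counts in proteins.items()}

# Mirror the exclusion of unknown annotations before plotting.
plotted = {name: label for name, label in labels.items() if label != "UNKNOWN"}
print(plotted)
```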
This example is similar to `use_case_one`, except that the paths are changed to work with docker (i.e. `/mnt` prefixes everywhere). From the project root, you can run it with:
docker run --rm --gpus all \
-v "$(pwd)/examples/docker":/mnt \
-u $(id -u ${USER}):$(id -g ${USER}) \
rostlab/bio_embeddings /mnt/config.yml
In general, you should mount all input files into `/mnt`, e.g. you might need to add something like `-v /nfs/my_sequence_storage/proteomes.fasta:/mnt/sequences.fasta`. The `--gpus all` lets docker use the GPU and `-u $(id -u ${USER}):$(id -g ${USER})` makes sure that the results are owned by the current user and not by root.
You'll find the results in `examples/docker/output`.
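If you launch the container from a script, the same command can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the pipeline; it only rebuilds the flags shown above and relies on `os.getuid` and `os.getgid`, which are available on Unix-like systems only.

```python
import os
from pathlib import Path


def docker_command(example_dir, extra_mounts=None, use_gpu=True):
    """Build the argument list for running the pipeline image on one example."""
    command = ["docker", "run", "--rm"]
    if use_gpu:
        command += ["--gpus", "all"]
    # Mount the example folder (containing config.yml) at /mnt.
    command += ["-v", f"{Path(example_dir).resolve()}:/mnt"]
    # Mount any additional input files below /mnt.
    for host_path, container_path in (extra_mounts or {}).items():
        command += ["-v", f"{host_path}:{container_path}"]
    # Run as the current user so results are not owned by root.
    command += ["-u", f"{os.getuid()}:{os.getgid()}"]
    command += ["rostlab/bio_embeddings", "/mnt/config.yml"]
    return command


command = docker_command(
    "examples/docker",
    extra_mounts={
        "/nfs/my_sequence_storage/proteomes.fasta": "/mnt/sequences.fasta"
    },
)
print(" ".join(command))
# To actually start the container: subprocess.run(command, check=True)
```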
We compare tucker to plain Bert for separating CATH classes. As a baseline, we embed a small test set of CATH domains (which were excluded from tucker training) with Bert (`prottrans_bert_bfd`, to be exact) and plot those with umap and plotly. We then project the bert embeddings with tucker (`pb_tucker`) and plot them in the same way.
Noteworthy files produced:
- The `visualize` stages produce:
  - `tucker_cath/visualize_bert_class/plot_file.html` for plain Bert
  - `tucker_cath/visualize_tucker_class/plot_file.html` for Bert projected with tucker
By comparing the two, you can observe that tucker separates the mainly alpha and the mainly beta classes much more clearly than plain Bert.
In some cases, you are interested in doing something other than mean pooling the embeddings for per-sequence representations. The pipeline has an experimental feature that lets you directly transform the per-amino acid embeddings into a different format (e.g. you can max pool, or apply other types of transformations). A small example extracting the first LSTM layer and mean pooling that (in the case of SeqVec) and max pooling instead of mean pooling (in the case of ProtTrans-BERT-BFD) is available in the advanced use case folder.
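The difference between mean pooling and max pooling can be shown on a toy example. This sketch uses plain Python lists and hypothetical function names; it is not the pipeline's transformation interface, only an illustration of the two reductions from per-residue to per-sequence embeddings.

```python
def mean_pool(per_residue):
    """Average each embedding dimension over all residues."""
    length = len(per_residue)
    return [sum(column) / length for column in zip(*per_residue)]


def max_pool(per_residue):
    """Take the maximum of each embedding dimension over all residues."""
    return [max(column) for column in zip(*per_residue)]


# Hypothetical per-residue embeddings: 3 residues, 2 dimensions each.
per_residue = [
    [1.0, 4.0],
    [3.0, 0.0],
    [2.0, 2.0],
]

mean_embedding = mean_pool(per_residue)
max_embedding = max_pool(per_residue)

print("mean:", mean_embedding)
print("max:", max_embedding)
```

Both reductions yield a fixed-size vector regardless of sequence length, but they emphasize different aspects of the per-residue signal.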
Using a light attention mechanism to aggregate residue embeddings for protein sequences, we trained supervised models to predict subcellular localization.
Noteworthy files produced:
- `la_prott5` creates `per_sequence_predictions_file.csv`
In-silico mutagenesis using ProtTrans-Bert-BFD. This computes the likelihood that, according to Bert, a residue in a protein can be a certain amino acid. This can be used as an estimate for the effect of a mutation.
Noteworthy files produced:
- The `protbert_bfd_mutagenesis` protocol writes `residue_probabilities_file.csv` with probabilities for all sequences
- `plot_mutagenesis` creates a separate interactive plot for each sequence as an HTML file
Note: Mutagenesis is much slower than embedding, so only use it with a few sequences. Internally, we have to rerun the entire model for each residue we want to predict, so the computation is done once per residue instead of once per protein.
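To see why the cost scales with sequence length, consider how the model inputs for such a scan could be generated: one masked copy of the sequence per residue. The sketch below is an illustration only; the `[MASK]` token and the space-separated residue format are assumptions about the tokenizer and are not taken from this README.

```python
MASK = "[MASK]"  # Assumed mask token, not confirmed by this README.


def masked_variants(sequence):
    """Yield (position, model input) pairs with exactly one residue masked."""
    residues = list(sequence)
    for position in range(len(residues)):
        masked = residues.copy()
        masked[position] = MASK
        # Assumption: the model expects residues separated by spaces.
        yield position, " ".join(masked)


# Hypothetical toy sequence.
variants = list(masked_variants("MKV"))
for position, model_input in variants:
    print(position, model_input)

# One model run per residue: the number of runs equals the sequence length.
print("model runs needed:", len(variants))
```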