pretrained LookingGlass language model for biological read-length DNA sequences, and related models derived from transfer learning

ahoarfrost/LookingGlass

Welcome to LookingGlass

LookingGlass is a general-purpose 'universal language of life' deep learning model for read-length biological sequences. It can be used for diverse downstream transfer learning tasks for biological data, some of which are described in the paper.

This is the main repository for these pretrained models. Static URLs for downloading these models are available in release v1 of this repo. See more detailed descriptions of the models contained in this repository below.

A Python package, fastBio, is also available for loading and manipulating biological sequence data, training deep learning models on it, and using the pretrained models in this release.

If you find LookingGlass, LookingGlass-derived models, or fastBio helpful, please cite the paper:

Hoarfrost, A., Aptekmann, A., Farfañuk, G. et al. Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter. Nat Commun 13, 2606 (2022). https://doi.org/10.1038/s41467-022-30070-8.

Models in the most recent release

The most recent release of LookingGlass provides static URLs for available pretrained models.

  • LookingGlass

    LookingGlass is a 'universal language of life', producing contextually-aware, functionally and evolutionarily relevant representations of short DNA reads. As a general purpose 'biological language' representation model, it is broadly useful for training diverse downstream transfer learning tasks. A number of exports are available:

  • Functional Classifier

    The functional classifier can classify DNA reads into one of 1247 experimentally-validated functional annotations with 81.5% accuracy. The following exports are available:

  • Oxidoreductase Classifier

    The oxidoreductase classifier can classify whether a DNA read originates from an oxidoreductase (EC number 1.-.-.-) with 82.3% accuracy. The following exports are available:

  • Optimal Temp Classifier

    The optimal temperature classifier can identify whether a DNA read originates from an enzyme with a psychrophilic (<15 °C), mesophilic (20-40 °C), or thermophilic (>50 °C) optimal temperature with 70.1% accuracy. The following exports are available:

  • Reading Frame Classifier

    The reading frame classifier identifies the correct reading frame start position (1, 2, 3, -1, -2, or -3) for DNA reads (and thus the true amino acid sequence). Note it is currently only intended for prokaryotic sequences with low proportions of noncoding DNA. The following exports are available:

LookingGlass vocabulary

The vocabulary used for training LookingGlass and all LookingGlass-derived models above is also available for direct download:

  • ngs_vocab_k1_withspecial.npy - maintains the vocabulary token order for correct numericalization of DNA sequences for LookingGlass-derived models.
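
As a rough illustration of why token order matters, the sketch below maps each base of a read to its index in a vocabulary list. It assumes the .npy file holds a one-dimensional array of token strings whose position is the token id; the helper names (load_vocab, numericalize) and the demo vocabulary, including its special tokens, are hypothetical stand-ins rather than the actual contents of the released file.

```python
import numpy as np


def load_vocab(path):
    """Load a vocabulary array; a token's position in the array is its id.

    allow_pickle is only needed if the array was saved with object dtype,
    and should only be enabled for files you trust.
    """
    return [str(tok) for tok in np.load(path, allow_pickle=True)]


def numericalize(seq, vocab, unk_token=None):
    """Convert a DNA string into integer ids, one id per base (k=1)."""
    stoi = {tok: i for i, tok in enumerate(vocab)}
    ids = []
    for base in seq:
        if base in stoi:
            ids.append(stoi[base])
        elif unk_token is not None and unk_token in stoi:
            # Fall back to the unknown token for bases outside the vocabulary.
            ids.append(stoi[unk_token])
        else:
            raise KeyError(f"base {base!r} is not in the vocabulary")
    return ids


# Hypothetical vocabulary for demonstration only; the real token list and
# order come from ngs_vocab_k1_withspecial.npy, e.g.
#   vocab = load_vocab("ngs_vocab_k1_withspecial.npy")
demo_vocab = ["xxunk", "xxpad", "G", "A", "C", "T"]
demo_ids = numericalize("GATTACA", demo_vocab)
print(demo_ids)  # [2, 3, 5, 5, 3, 4, 3]
```

Reordering the vocabulary would change every id, so a pretrained model would then look up the wrong embedding for each base; this is why the released file should be used as-is.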

Tutorial

The fastBio Python package can be used to directly create LookingGlass and LookingGlass-derived models (with or without pretrained weights). See the fastBio tutorial.

Alternatively, the models described above were saved with PyTorch's torch.save() and can be loaded directly into your own scripts with torch.load().
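
A minimal sketch of that save/load round trip follows. It uses a small stand-in module and an in-memory buffer so it runs without downloading anything; the load_checkpoint helper is hypothetical, and whether a given release file holds a state dict or a whole pickled model object is an assumption to check per file.

```python
import io

import torch
from torch import nn


def load_checkpoint(source, device="cpu"):
    """Load a checkpoint from a path or file-like object onto `device`."""
    # map_location lets a file saved on a GPU be opened on a CPU-only machine.
    return torch.load(source, map_location=device)


# Stand-in model; the released LookingGlass files would take the place of
# this buffer, e.g. load_checkpoint("path/to/downloaded_file").
model = nn.Linear(4, 2)
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

state = load_checkpoint(buffer)

# A state dict is loaded into a freshly constructed module of the same shape.
restored = nn.Linear(4, 2)
restored.load_state_dict(state)
```

If a file turns out to contain a whole pickled model rather than a state dict, recent PyTorch versions may require passing weights_only=False to torch.load(), which should only be done for files from a trusted source, and the classes the model was built from need to be importable in the loading script.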
