PubTator provides automated annotations of biomedical entities in scientific publications. Here we present recent results of applying PubTator to the literature on COVID-19 and other coronaviruses. In particular, we feature results for two data collections: LitCovid and CORD-19. PubTator annotations are provided for six entity types (gene/protein, drug/chemical, disease, cell line, species, and genomic variant) in two formats (BioC JSON and BioC XML).
LitCovid is a curated literature hub, built and maintained by the National Library of Medicine (NLM), for tracking up-to-date scientific information about SARS-CoV-2 and COVID-19 [1]. It is a comprehensive resource specific to SARS-CoV-2 and COVID-19, with new PubMed articles added daily. PubTator annotations for LitCovid are updated daily for article titles and abstracts, and for the full text of PMC Open Access articles where available.
Download PubTator annotations for LitCovid from here.
CORD-19, the COVID-19 Open Research Dataset, provided by the Allen Institute for AI in partnership with the NLM and many others, contains (mostly) full-text publications on COVID-19 and coronavirus-related research [2]. PubTator annotations are kept in sync with the weekly CORD-19 updates.
Download the PubTator annotations for CORD-19 from here.
The publications in LitCovid focus on COVID-19, while CORD-19 also includes other coronaviruses (e.g. SARS and MERS) and covers a wider time period (i.e. it includes publications from before the current outbreak). As of early May 2020, LitCovid and CORD-19 contain ~9,000 and ~59,000 articles, respectively. About 1,500 articles appear in both datasets.
PubTator provides automatic annotations of biomedical concepts such as genes and mutations in PubMed abstracts and PMC full-text articles [3-4]. Annotations can be viewed in a web interface or downloaded via a RESTful API or FTP. Downloaded annotations are provided in BioC JSON and BioC XML formats [5] (full-text articles) and in PubTator format (title and abstract), as described here.
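As a rough illustration of working with a downloaded BioC JSON file, the sketch below counts annotations by entity type. The `SAMPLE` document is a hand-written, hypothetical miniature, and the key layout (`passages`, `annotations`, `infons` with `type` and `identifier`) is an assumption about the BioC JSON structure that should be checked against an actual download. The helper names `iter_annotations` and `count_by_type` are made up for this example.

```python
import json
from collections import Counter

# Hypothetical miniature BioC JSON document for illustration only.
# Real files are much larger; verify the key names against a real download.
SAMPLE = """
{
  "id": "0000001",
  "passages": [
    {
      "infons": {"type": "title"},
      "offset": 0,
      "text": "ACE2 is the receptor for SARS-CoV-2.",
      "annotations": [
        {"infons": {"type": "Gene", "identifier": "59272"},
         "text": "ACE2",
         "locations": [{"offset": 0, "length": 4}]},
        {"infons": {"type": "Species", "identifier": "2697049"},
         "text": "SARS-CoV-2",
         "locations": [{"offset": 25, "length": 10}]}
      ]
    }
  ]
}
"""


def iter_annotations(document):
    """Yield (entity type, linked identifier, mention text) for each annotation."""
    for passage in document.get("passages", []):
        for annotation in passage.get("annotations", []):
            infons = annotation.get("infons", {})
            yield infons.get("type"), infons.get("identifier"), annotation.get("text")


def count_by_type(document):
    """Count annotations per entity type across all passages of one document."""
    return Counter(entity_type for entity_type, _, _ in iter_annotations(document))


document = json.loads(SAMPLE)
counts = count_by_type(document)
for entity_type, n in sorted(counts.items()):
    print(entity_type, n)
```

The same two helpers can be applied to each document in a downloaded collection to build corpus-level statistics, provided the real files follow the assumed layout.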
PubTator annotations are created by machine-learning concept recognition systems and disambiguated with deep learning for improved accuracy. Each entity type is handled by a dedicated tool and linked to a biomedical resource:
- Genes and proteins are annotated by GNormPlus and linked to NCBI Gene.
- Chemicals are annotated by a concept recognition system using BlueBERT, an extension of the BERT deep learning transformer model, and linked to Medical Subject Headings (MeSH).
- Diseases are annotated by TaggerOne and linked to the MEDIC disease vocabulary, which includes both Medical Subject Headings (MeSH) and OMIM.
- Cell lines are annotated by TaggerOne and linked to Cellosaurus.
- Species are annotated by SR4GN and linked to NCBI Taxonomy.
- Genomic variants are annotated by tmVar and linked to dbSNP.

NOTE: While our annotation tools are state-of-the-art, all automated tools are imperfect and their annotations will contain some errors.
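The mapping from entity type to annotation tool and linked resource listed above can be captured in a small lookup table, which is handy when interpreting the identifiers attached to annotations. This is only a sketch: the dictionary keys (`"Gene"`, `"CellLine"`, `"Variant"`, and so on) are hypothetical labels chosen for this example and may not match the type strings used in the actual annotation files, and `describe` is an illustrative helper.

```python
from typing import NamedTuple


class Linking(NamedTuple):
    tagger: str    # tool that recognizes the entity mentions
    resource: str  # vocabulary or database the identifiers come from


# Keys are hypothetical labels; check the type strings in the real files.
ENTITY_LINKING = {
    "Gene": Linking("GNormPlus", "NCBI Gene"),
    "Chemical": Linking("BlueBERT-based tagger", "MeSH"),
    "Disease": Linking("TaggerOne", "MEDIC (MeSH and OMIM)"),
    "CellLine": Linking("TaggerOne", "Cellosaurus"),
    "Species": Linking("SR4GN", "NCBI Taxonomy"),
    "Variant": Linking("tmVar", "dbSNP"),
}


def describe(entity_type: str, identifier: str) -> str:
    """Explain which resource an identifier belongs to, given its entity type."""
    linking = ENTITY_LINKING.get(entity_type)
    if linking is None:
        return f"{identifier} (unknown entity type {entity_type!r})"
    return f"{identifier} in {linking.resource} (tagged by {linking.tagger})"


print(describe("Species", "2697049"))
```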
Related projects:
- PubAnnotation: https://covid19.pubannotation.org
- CORD-19-on-FHIR: https://github.com/fhircat/CORD-19-on-FHIR
If you have a related project, please let us know by opening an issue or submitting a pull request.
[1] Chen, Q., Allot, A., & Lu, Z. (2020). Keep up with the latest coronavirus research. Nature, 579(7798), 193. doi: 10.1038/d41586-020-00694-1
[2] Wang, L. L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., et al. (2020). CORD-19: The Covid-19 Open Research Dataset. arXiv preprint arXiv:2004.10706.
[3] Wei, C. H., Allot, A., Leaman, R., & Lu, Z. (2019). PubTator Central: automated concept annotation for biomedical full text articles. Nucleic Acids Research, 47(W1), W587-W593. doi: 10.1093/nar/gkz389
[4] Wei, C. H., Leaman, R., & Lu, Z. (2016). Beyond accuracy: creating interoperable and scalable text-mining web services. Bioinformatics, 32(12), 1907-1910.
[5] Comeau, D. C., Wei, C. H., Islamaj, R., & Lu, Z. (2019). PMC text mining subset in BioC: about three million full-text articles and growing. Bioinformatics, 35(18), 3533-3535. doi: 10.1093/bioinformatics/btz070