This repository provides a low-friction interface to the COVID-19 Open Research Dataset Challenge (CORD-19) data through an installable data-package, similar to the way trained NLP/ML models are tracked and distributed by the various corresponding libraries (spaCy, gensim, flair, nltk, etc.). It is intended to give researchers and analysts smooth access to the collection of publications, along with validated tools for preliminary preprocessing and interfaces to various formats.
This package is meant to help analysts access and process COVID-19 publication data as quickly as possible. Please keep in mind that it is a work in progress; as the data situation evolves rapidly, the codebase and user interface likely will as well.
Roughly, features are split into resources (data and models to use in your analyses) and tools (utilities that help get data into models and into analysis).
Some of the already supported features:
Structured access to CORD19 challenge tasks as typed dataclasses:
```python
from cv_py.resources.builtins import cord19tasks

cord19tasks()[-2].question
>>> 'What do we know about COVID-19 risk factors?'
```
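Since the tasks come back as a collection, they can also be scanned programmatically. Below is a minimal sketch, assuming only that the return value of `cord19tasks()` is iterable and that each task exposes a `.question` attribute, as in the example above:

```python
from cv_py.resources.builtins import cord19tasks

# Print every challenge task question; only the .question attribute is assumed here.
for task in cord19tasks():
    print(task.question)
```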
Make use of powerful, state-of-the-art neural question answering to search for relevant CORD19 passages, with a single query to Korea University's covidAsk model:
```python
from cv_py.tools.remote import covid_ask

ans = covid_ask(cord19tasks()[-2].question)
ans['ret'][0]['answer']
>>> 'patients with cancer had a higher risk of COVID-19'
```
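The response can hold more than one candidate passage. Here is a small sketch of inspecting several of them; it assumes only that `ans['ret']` is a ranked list of result dicts carrying an `'answer'` field, as shown above:

```python
from cv_py.resources.builtins import cord19tasks
from cv_py.tools.remote import covid_ask

ans = covid_ask(cord19tasks()[-2].question)

# Look at the top few candidate answers rather than only the first one.
# Only the 'answer' key is assumed here; other fields in each result dict may vary.
for result in ans['ret'][:3]:
    print(result['answer'])
```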
Backed by the NIST-curated COVID-19 Data Repository, and versioned to ensure your pipelines don't break as the data changes:
```python
from cv_py.resources.datapackage import load

df = load("cord19_cdcs")
```
It's backed by Dask, using the read-optimized Apache Parquet storage format. Need to get back to a more familiar pandas framework? Each parallel, lazy partition is itself a DataFrame, only a `.compute()` away.
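As a quick illustration of that, here is a minimal sketch, assuming `df` is the Dask DataFrame loaded above (`npartitions` and `get_partition()` are standard Dask DataFrame attributes/methods):

```python
from cv_py.resources.datapackage import load

df = load("cord19_cdcs")

# Number of lazy, parallel partitions backing the dataset
print(df.npartitions)

# Any single partition materializes to a familiar pandas.DataFrame
first = df.get_partition(0).compute()
print(type(first))  # <class 'pandas.core.frame.DataFrame'>
```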
Data archival is achieved through cached archives of the CDCS instance. If you are looking for more DIY access to the data in its raw form, head over to the releases page and download the desired raw XML, JSON, or CSV files.
For now, this repository is installable via `pip` directly from its GitHub location:

```
pip install cv-py
```

This will provide access to `cv-py` modules via the `cv_py` namespace, and includes `dask` and `scikit-learn` by default. `cv-py` is built to provide smooth access to existing tools, as needed for analysis. Consequently, dependencies to use the various tools supported are split into groups for convenience, installed with brackets as `pip install cv-py[extras]`:
| extras alias | dependencies |
|---|---|
| `spacy` | `spacy`, `textacy`, `scispacy` |
| `flair` | `flair` |
| `viz` | `seaborn`, `holoviews[recommended]` |
These can be combined, e.g. `pip install cv-py[flair,viz]`.
In addition to the `cv-py` modules, installation provides a `cv-download` command to be used in a terminal (in the same environment where `cv-py` was installed). The default options will automatically install the latest compatible version of the curated CORD19 dataset. Use the `-h` flag for more information.
After downloading, the data can be loaded directly:
```python
from cv_py.resources import datapackage

df = datapackage.load("cord19_cdcs")
```
The `load()` function returns an out-of-memory Dask DataFrame, ensuring scalability on a wide range of devices. Most Pandas functions are available, and the underlying `pandas.DataFrame` object is exposed upon executing the `.compute()` method. See the Dask documentation for more details. More data interfaces are to come!
There are many excellent packages in the Python ecosystem for completing NLP tasks. In fact, `cv-py` depends on many of them.
However, one of the key barriers to rapid NLP analysis and tool development lies in pipeline construction... namely, cleaning, data munging, and "gluing together" all the libraries that, united, achieve the decision support we needed in the first place.
This is not an attempt at collecting techniques to create yet another "NLP framework"; rather, `cv-py` provides some of the "glue" needed to allow rapid data and method interaction between the excellent tools that already exist.
This is done with the express purpose of contextualizing the work within the problem domain of the CORD-19 challenge, and of assisting others in the public who are willing and able to apply their data-science skills, but who might otherwise spend far more effort applying "glue" than building solutions.
- `resources`: data and models
  - `builtins`: lightweight data/models contained natively
    - Tasks: from Kaggle
    - TREC topics (?)
  - `datapackage`: installable with `cv-download`
    - CORD19: papers and NER tags from NIST CDCS
    - Sci-spaCy models: see their documentation for usage
- `tools`: to assist in moving CORD19 data around
  - `process`: to spacy, flair, etc. (WIP)
    - parallelism and scale (pandas -> dask)
    - ease of use: built-in pipeline tools
  - `remote`: external tools, accessible via API
    - Korea University neural NER APIs (covidAsk, BERN(?))
    - TextAE pubannotation visualizer (WIP)
More to come, but the primary requirement is the use of Poetry.
Notebooks are kept nicely git-ified thanks to Jupytext.