Proposed quality control pipeline:
- Standardize chromosome names (TODO: Only for humans? Maybe just remove chr)
- Verify positions are positive
- Investigate other error conditions for pytabix and check them in QC
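The first two checks in the proposed pipeline could be sketched roughly as follows. This is a minimal illustration and not the service's actual QC code: the helper names `standardize_chromosome` and `qc_record` are hypothetical, and stripping a leading `chr` is only the option floated in the TODO above.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration; the service's real QC code may differ.
_CHR_PREFIX = re.compile(r"^chr", re.IGNORECASE)


def standardize_chromosome(name: str) -> str:
    """Strip surrounding whitespace and a leading 'chr', e.g. 'chr1' -> '1'."""
    return _CHR_PREFIX.sub("", name.strip())


def qc_record(chrom: str, pos: str) -> Optional[str]:
    """Return an error message for a failing record, or None if it passes."""
    if not standardize_chromosome(chrom):
        return "empty chromosome name"
    try:
        position = int(pos)
    except ValueError:
        return f"position is not an integer: {pos!r}"
    if position < 1:
        return f"position is not positive: {position}"
    return None
```

A strictly positive check like this one would also reject position 0, which the VCF spec permits for telomeres, so a real pipeline would need to decide how to treat such records.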
The workflows exposed by this service currently depend on:
- htslib 1.10 or later
- bcftools 1.10 or later
The service itself depends on the following non-Python utilities:
- bcftools 1.10 or later
The Bento Variant Service is copyright © 2019-2020 the Canadian Centre for Computational Genomics, McGill University.
Portions of this codebase (namely, test VCF data) come from the 1000 Genomes Project and are thus subject to its copyright and terms of use.
VCFs, per the spec, use 1-based coordinates:
POS - position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig. (Integer, Required)
Beacon, on the other hand, specifies that 0-based coordinates should be used:
... Precise start coordinate position, allele locus (0-based, inclusive).
All Beacon endpoints use 0-based coordinates with closed ranges.
All other endpoints use 1-based coordinates with half-open ranges.
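Converting between the two conventions is easy to get wrong by one, so a small sketch may help. The function names below are hypothetical and are not part of the service's API; the arithmetic follows directly from the two conventions stated above.

```python
from typing import Tuple


def beacon_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based closed range [start, end] to 1-based half-open [start, end)."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    # Shift by one for the base change; add one more to `end` to open the range.
    return start + 1, end + 2


def one_based_to_beacon(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based half-open range [start, end) to 0-based closed [start, end]."""
    if start < 1 or end <= start:
        raise ValueError("expected 1 <= start < end")
    # Shift by one for the base change; subtract one more from `end` to close the range.
    return start - 1, end - 2
```

For example, the first base of a sequence is the closed range [0, 0] in Beacon terms and the half-open range [1, 2) for the other endpoints.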
Default values for environment variables are listed on the right-hand side.
TABLE_MANAGER=drs
DRS_URL_BASE_PATH=/api/drs
SERVICE_ID=ca.c3g.bento:variant:VERSION
INITIALIZE_IMMEDIATELY=true
DATA=/path/to/data/directory
CHORD_URL=http://localhost/ # URL for the Bento node or standalone service
WORKERS= # If set and more than one, a multiprocessing pool will be used.
- TABLE_MANAGER is used to specify how data will be stored in the service instance, and what types of data files the service will look for in the DATA folder. The available options are:
  - drs: expects data as Data Repository Service (DRS) object links to .vcf.gz and .vcf.gz.tbi files
  - memory: stores data in memory for the duration of the service's process uptime
  - vcf: expects data as .vcf.gz and .vcf.gz.tbi files directly
- INITIALIZE_IMMEDIATELY is used to specify whether to initialize the service table manager immediately, rather than waiting for a GET request to /private/post-start-hook.
- DRS_URL is used to specify an optional override DRS base URL for the Bento container-internal-network DRS instance, with no trailing slash. If left unset, a chord_singularity-compatible value is assumed: http+unix://%2Fchord%2Ftmp%2Fnginx_internal.sock/api/drs
- If left unset, SERVICE_ID will default to ca.c3g.bento:variant:VERSION, where VERSION is the current version of the service package.
- CHORD_URL is used to construct the reverse domain-name notation identifier for the GA4GH Beacon endpoints.
- WORKERS sets the number of processes used to parallel-search tables with large numbers of VCFs. If set to 1, this will not use any multiprocessing, which may be better in some situations.
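As a rough illustration of how these variables and their defaults fit together, the sketch below reads them from an environment mapping. It is not the service's actual configuration code: `load_config`, the `USE_POOL` key, the `version` parameter, the boolean parsing of `INITIALIZE_IMMEDIATELY`, and treating an unset `WORKERS` as 1 are all assumptions made for the example.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Default DRS URL as documented above (chord_singularity-compatible).
DEFAULT_DRS_URL = "http+unix://%2Fchord%2Ftmp%2Fnginx_internal.sock/api/drs"
TABLE_MANAGERS = ("drs", "memory", "vcf")


def load_config(env: Optional[Mapping[str, str]] = None,
                version: str = "VERSION") -> Dict[str, Any]:
    """Hypothetical loader: read the documented variables, applying defaults."""
    env = os.environ if env is None else env

    table_manager = env.get("TABLE_MANAGER", "drs")
    if table_manager not in TABLE_MANAGERS:
        raise ValueError(f"unknown TABLE_MANAGER: {table_manager!r}")

    # Assumption: an unset or empty WORKERS behaves like a single worker.
    workers_raw = env.get("WORKERS", "").strip()
    workers = int(workers_raw) if workers_raw else 1

    return {
        "TABLE_MANAGER": table_manager,
        "DRS_URL": env.get("DRS_URL", DEFAULT_DRS_URL),
        "SERVICE_ID": env.get("SERVICE_ID", f"ca.c3g.bento:variant:{version}"),
        # Assumption: only the string "true" (case-insensitive) enables this.
        "INITIALIZE_IMMEDIATELY":
            env.get("INITIALIZE_IMMEDIATELY", "true").strip().lower() == "true",
        "DATA": env.get("DATA"),  # no real default; must point at a data directory
        "CHORD_URL": env.get("CHORD_URL", "http://localhost/"),
        "WORKERS": workers,
        # A multiprocessing pool is only used when more than one worker is set.
        "USE_POOL": workers > 1,
    }
```

Passing an explicit mapping instead of relying on `os.environ` keeps the function easy to test, for example `load_config({"WORKERS": "4"})`.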
Development dependencies are described in requirements.txt and can be installed using the following command:
pip install -r requirements.txt
The Flask development server can be run with the following command:
FLASK_APP=bento_variant_service.app FLASK_DEBUG=True flask run
To run all tests and calculate coverage, including branch coverage, run the following command:
python3 -m tox
In production, the service should be deployed using a WSGI server such as uWSGI or Gunicorn.
The Dockerfile configures the service with Gunicorn. It is thus strongly recommended that a reverse proxy such as NGINX is added in front of the container.
The data for the service is stored inside the container's /data directory.
This should be bound as a persistent volume on the container host.
The service runs inside the container on port 8080.
Running the container by itself will use the following default configuration:
- 1 worker process. Right now, the in-memory variant file cache means that using more than one worker can cause unexpected behaviour. This can be overridden by running the container with the option --workers n, where n is the number of workers.
- CHORD_URL=http://localhost/. This will NOT work properly in production, as it is meant to represent the public URL of the node. This should be overridden.
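Since CHORD_URL feeds the reverse domain-name identifier used by the Beacon endpoints, the default value yields an identifier that says nothing about the node. The sketch below shows one plausible way such an identifier could be derived from a URL. The helper `reverse_domain` and the example hostname are hypothetical, and the service's actual construction may differ.

```python
from urllib.parse import urlparse


def reverse_domain(url: str) -> str:
    """Hypothetical helper: reverse the dot-separated labels of a URL's hostname."""
    host = urlparse(url).hostname or ""
    return ".".join(reversed(host.split(".")))


# The container default carries no information about the public node:
print(reverse_domain("http://localhost/"))           # localhost
# A (made-up) public URL produces a meaningful reverse-DNS prefix:
print(reverse_domain("https://bento.example.org/"))  # org.example.bento
```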