OncodashKB is a data conversion tool that extracts data from Oncodash's tables and feeds it into a graph database.
Under the hood, it uses BioCypher as the main tool for ontology alignment and for supporting several graph database backends, and OntoWeaver for instantiating a BioCypher adapter.
The project uses Poetry. You can install OncodashKB using the commands below:
git clone https://github.com/oncodash/oncodashkb.git
cd oncodashkb
poetry install
Poetry will create a virtual environment according to your configuration (either
centrally or in the project folder). You can activate it by running poetry shell
inside the project directory.
If the poetry install command fails, it may be that 'poetry lock' has not been run after the dependencies were modified in '$ONCODASHKB_HOME/pyproject.toml'. Try running 'poetry lock' to fix the issue.
Theoretically, any graph database supported by BioCypher may be used.
As of now, OncodashKB targets Neo4j, which has some particularities.
So far, it has been extensively tested with Neo4j 5+, but it should also work with Neo4j 4+.
Neo4j Graph Database Self-Managed can be downloaded from the Neo4j website. When using this version, be sure to add its bin/ directory to PATH and PYTHONPATH, and to set JAVA_HOME to the correct version of Java.
Note that the community edition of Neo4j does not support multiple databases,
hence the need to configure the default database in $NEO4J_HOME/conf/neo4j.conf to be:
initial.dbms.default_database=oncodash
(this line is commented out by default, in which case the default database is called neo4j).
Note that the default database does not need to be named oncodash, but its name should match the name of the database in $ONCODASHKB_HOME/config/biocypher_config.yaml.
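The consistency requirement above can be checked with a few lines of Python. This is only a minimal sketch and not part of OncodashKB: the function names `default_database` and `names_match` are hypothetical, the parser assumes the plain `key=value` syntax with `#` comments, and the database name expected by BioCypher is passed in by hand rather than read from biocypher_config.yaml.

```python
def default_database(conf_text: str) -> str:
    """Return the default database name set in the content of a neo4j.conf.

    Falls back to "neo4j" when the option is absent or commented out.
    """
    name = "neo4j"
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (the option is commented out by default).
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "initial.dbms.default_database":
            name = value.strip()
    return name


def names_match(conf_text: str, biocypher_db_name: str) -> bool:
    """Tell whether Neo4j's default database is the one BioCypher will target."""
    return default_database(conf_text) == biocypher_db_name
```

A possible use is to read `$NEO4J_HOME/conf/neo4j.conf` into a string and call `names_match(text, "oncodash")` before running the import.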
When using oncodashkb, you should generally follow the steps below.
poetry shell
The following command transforms the tabular data into a graph database, using OntoWeaver.
./weave.py [-database] <CSV data file>
Where [-database] can be:
- --cgi
- --oncokb
- --open_targets
- --open_targets_drugs
- --open_targets_diseases
- --open_targets_evidences
Look below in the OncodashKB adapters section for more information on each of these options.
Once executed, BioCypher prepares a shell script named neo4j-admin-import-call.sh in a timestamped sub-directory of '$ONCODASHKB_HOME/biocypher-out'. The complete path of this file is printed at the end of the execution.
If you use the environment variable 'NEO4J_HOME', don't forget to delete the 'bin/' prefix in the import call.
#!/bin/bash
# 'bin/' prefix deleted here (the generated line calls 'bin/neo4j-admin')
version=$(neo4j-admin --version | cut -d '.' -f 1)
...
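If you regenerate the import script often, the manual edit above can be automated. The following is a minimal sketch under the assumption that the generated script calls the binary as `bin/neo4j-admin`, as in the excerpt above; `strip_bin_prefix` is a hypothetical helper, not something provided by BioCypher or OncodashKB.

```python
import re


def strip_bin_prefix(script_text: str) -> str:
    """Replace every standalone "bin/neo4j-admin" call with "neo4j-admin"."""
    # The negative lookbehind leaves longer paths such as
    # "/usr/bin/neo4j-admin" untouched.
    return re.sub(r"(?<![\w/.-])bin/neo4j-admin\b", "neo4j-admin", script_text)
```

The function works on the text of the script, so you would read the file, transform its content, and write it back before running it.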
Before importing the data, make sure that the Neo4j server is not running. Executing this script will feed the Neo4j database with the extracted graph.
sh [$PATH_TO/neo4j-admin-import-call.sh]
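If you did not keep the path printed at the end of the weaving step, the script can be located again. This is a minimal sketch, assuming that each run writes its script exactly one directory below biocypher-out and that the most recently modified script is the one you want; `latest_import_script` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def latest_import_script(output_dir: str) -> Optional[Path]:
    """Return the most recently modified neo4j-admin-import-call.sh, if any."""
    # One timestamped sub-directory per run, hence the single "*" level.
    scripts = Path(output_dir).glob("*/neo4j-admin-import-call.sh")
    return max(scripts, key=lambda p: p.stat().st_mtime, default=None)
```

For example, `latest_import_script("biocypher-out")` called from the project directory returns either a path to pass to `sh`, or `None` if nothing has been generated yet.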
You can start the server by using the command below.
Neo4j 5+:
neo4j-admin server start
Neo4j 4:
neo4j start
This will give you a link to the Neo4j Browser, where you can explore your graph. By default, the address of the Neo4j Browser is 'localhost:7474/'.
You can stop the server by using the command below.
Neo4j 5+:
neo4j-admin server stop
Neo4j 4:
neo4j stop
[poetry] exit
The steps should always be in the order above.
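Since the start and stop commands differ between Neo4j versions, a wrapper script may want to pick the right one. The following is a minimal sketch that only encodes the two variants listed above; `server_command` is a hypothetical helper and the major version number is assumed to be known by the caller.

```python
def server_command(action: str, major_version: int) -> list[str]:
    """Return the command line that starts or stops the Neo4j server."""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action!r}")
    if major_version >= 5:
        # Neo4j 5+ syntax.
        return ["neo4j-admin", "server", action]
    # Neo4j 4 syntax.
    return ["neo4j", action]
```

The returned list can be passed as is to `subprocess.run`, for instance `subprocess.run(server_command("start", 5), check=True)`.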
Cancer Genome Interpreter (CGI) is a cancer database that contains information about various genetic alterations that can be associated with the patient, gene details, samples, disease type, and transcript information.
To launch the CGI adapter, use the --cgi option followed by the path to the CSV file containing the data that you want to integrate.
Example of use:
./weave.py --cgi /path_to_file/test_genomics_cgimutation.csv
OncoKB is a cancer database that contains information about various genetic alterations that can be associated with the patient, gene details, samples, and disease type, as well as treatment options with FDA and OncoKB evidence levels, and related publications.
To launch the OncoKB adapter, use the --oncokb option followed by the path to the CSV file containing the data that you want to integrate.
Example of use:
./weave.py --oncokb /path_to_file/test_genomics_oncokbannotation.csv
Gene Ontology is one of the biggest biomedical databases. The described adapter helps to integrate the data about the molecular function of the gene product, as well as the biological process in which these genes are involved.
- Molecular function: GO annotations that have relation type enables or contributes_to.
- Biological process: GO annotations that have relation type involved_in.
To integrate the data, three files are necessary:
- --gene_ontology option for GO annotations in GAF format (available for download from the Gene Ontology website).
- --gene_ontology_owl option for the GO ontology in OWL format (available for download from the Gene Ontology website).
- --gene_ontology_genes option for the list of genes for which we want to integrate the GO annotations (see the example in the adapters/Hugo_Symbol_genes.conf file; by default, the list of genes from the OncoKB database).
Example of use:
./weave.py --gene_ontology /path_to_file/goa_human.gaf --gene_ontology_owl /path_to_file/go.owl --gene_ontology_genes /path_to_file/Hugo_Symbol_genes.conf
If you want to integrate annotations with another type of relation, you can modify the adapters/gene_ontology.py file by adding the following code to the class Gene_ontology (example for the involved_in edge type):
# create new columns that depend on the edge type
df['GO_involved_in'] = None
# cut df to include only the edge types that we have chosen and the annotations for genes from OncoKB
df = df[((df['Qualifier'].isin(['enables', 'involved_in', 'contributes_to'])) &
         (df['DB_Object_Symbol'].isin(included_genes)))]
Also, you need to add code to the separate_edges_types method:
# function to copy GO_term to the related column for future OntoWeaver mapping, based on the Qualifier column (relation type)
def separate_edges_types(row):
    if row['Qualifier'] == 'enables':
        row['GO_enables'] = row['GO_term']
    elif row['Qualifier'] == 'involved_in':
        row['GO_involved_in'] = row['GO_term']
Finally, you need to specify the node and edge types for the GO_involved_in column in gene_ontology.yaml.
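To see what the filtering step above does without running the whole adapter, here is a minimal standard-library sketch. It is not the adapter's code (which works on a pandas DataFrame): `filter_gaf` is a hypothetical function, and it assumes the standard GAF column order in which the third, fourth and fifth columns hold the gene symbol, the qualifier (relation type) and the GO identifier.

```python
import csv
import io


def filter_gaf(gaf_text, genes, qualifiers):
    """Yield (symbol, qualifier, go_id) for the selected genes and relation types."""
    # GAF header and comment lines start with "!".
    lines = (l for l in io.StringIO(gaf_text) if not l.startswith("!"))
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 5:
            continue  # skip malformed or empty lines
        symbol, qualifier, go_id = row[2], row[3], row[4]
        # Exact match on the qualifier: negated annotations such as
        # "NOT|enables" are therefore left out.
        if symbol in genes and qualifier in qualifiers:
            yield symbol, qualifier, go_id
```

Adding a relation type then amounts to adding its name to `qualifiers`, which mirrors extending the `isin([...])` list in the adapter.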
Open Targets is a public database that aims to systematically identify and prioritize drug targets for disease treatment. The described adapter helps to integrate the data about targets, diseases/phenotypes, drugs and evidence.
The current adapter works with data in Parquet format.
To download the data, you can visit the Open Targets download page and download the needed datasets separately, or execute the following bash script:
#!/bin/bash
mkdir OpenTargets
cd OpenTargets
rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/24.06/output/etl/parquet/targets .
rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/24.06/output/etl/parquet/diseases .
rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/24.06/output/etl/parquet/molecule .
rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/24.06/output/etl/parquet/evidence .
As the Open Targets database contains millions of rows, you need to specify the genes of interest (Hugo symbols and Ensembl IDs) in the configuration files, so that only the necessary information is integrated:
- Hugo symbols in the file oncodashkb/adapters/Hugo_Symbol_genes.conf
- Ensembl IDs in the file oncodashkb/adapters/Ensembl_genes.conf
Example of use for the integration of targets, diseases, drugs and evidence (only from ChEMBL):
./weave.py --open_targets path_to_OpenTargets/OpenTargets/targets --open_targets_drugs path_to_OpenTargets/OpenTargets/molecule --open_targets_diseases path_to_OpenTargets/OpenTargets/diseases --open_targets_evidences path_to_OpenTargets/OpenTargets/evidence/sourceId\=chembl
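The long command line above follows a regular pattern: one option per dataset, each pointing to a sub-directory of the download folder. As a minimal sketch, it can be assembled programmatically; `OPEN_TARGETS_LAYOUT` and `open_targets_command` are hypothetical names, and the option-to-directory mapping is simply copied from the example above.

```python
# Option -> sub-directory of the download folder, as in the example command.
OPEN_TARGETS_LAYOUT = {
    "--open_targets": "targets",
    "--open_targets_drugs": "molecule",
    "--open_targets_diseases": "diseases",
    "--open_targets_evidences": "evidence/sourceId=chembl",
}


def open_targets_command(base_dir: str) -> list[str]:
    """Build the weave.py argument list for a given Open Targets folder."""
    base = base_dir.rstrip("/")
    cmd = ["./weave.py"]
    for option, subdir in OPEN_TARGETS_LAYOUT.items():
        cmd += [option, f"{base}/{subdir}"]
    return cmd
```

When the list is passed directly to `subprocess.run`, no shell is involved, so the `=` in `sourceId=chembl` does not need the backslash used in the shell example.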
When modifying any dependencies in '$ONCODASHKB_HOME/pyproject.toml', be sure to run 'poetry lock'.
Hints and tips about designing the ontology alignments:
- Ontologies may be browsed with Protégé.
- The biolink model has (a lot of) classes attached to the root Thing. These are actually decommissioned; the actual classes are under entity.
To check whether there is some data in your graph database, you can use the command-line client of Neo4j:
cypher-shell -d oncodash -u neo4j "MATCH (n) RETURN n LIMIT 5;"
and you should see 5 nodes.
To visualize [a part of] the graph, you can use neo4j-browser with a similar Cypher query.
Notes:
- Neo4j-browser needs a specific Node version, which you can install with:
  pip install nodeenv
  nodeenv --node=16.10.0 env
  . env/bin/activate
  npm install yarn
  yarn install
  yarn start
- The Neo4j server disables connections across the network by default. To connect the browser to a server on another machine, be sure to edit the server's neo4j.conf with the 0.0.0.0 address:
  server.bolt.listen_address=0.0.0.0:7687
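Editing neo4j.conf by hand is usually enough, but on a remote machine it can be convenient to script it. The following is a minimal sketch, assuming plain `key=value` lines with `#` comments; `set_conf_option` is a hypothetical helper that replaces an active entry if there is one, otherwise reuses a commented-out entry, otherwise appends the setting.

```python
def set_conf_option(conf_text: str, key: str, value: str) -> str:
    """Return the configuration text with key set to value."""
    lines = conf_text.splitlines()

    def key_of(line: str) -> str:
        return line.partition("=")[0].strip()

    # Indices of active and of commented-out entries for this key.
    active = [i for i, l in enumerate(lines) if key_of(l) == key]
    commented = [
        i for i, l in enumerate(lines)
        if l.lstrip().startswith("#") and key_of(l.lstrip().lstrip("#")) == key
    ]

    new_line = f"{key}={value}"
    if active:
        lines[active[0]] = new_line
    elif commented:
        lines[commented[0]] = new_line
    else:
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

For the case above, the call would be `set_conf_option(text, "server.bolt.listen_address", "0.0.0.0:7687")`, with `text` being the content of the server's neo4j.conf. Restart the server afterwards so that the new setting is taken into account.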