This repository complements FedPyDESeq2, by providing scripts to download and preprocess the TCGA data used in the FedPyDESeq2 preprint.
More precisely, it contains the data, assets and scripts necessary to:
- download the raw data necessary to run the tests and experiments, when not
available in the repository, in the
download_data
directory; - open the data when performing a Substra experiment in the
assets
directory; - store the data in the
data
directory.
You can install the package from PyPI using the following command:
pip install fedpydeseq2-datasets
For a detailed description of the data download process, please refer to the README.
If you want to run the pipeline directly, you can use the script which is available in the distribution: fedpydeseq2-download-data
fedpydeseq2-download-data
By default, this script download the data in the data/raw
directory at the root of the github repo.
To change the location of the raw data download, add the following option:
fedpydeseq2-download-data --raw_data_output_path <path>
If you only want the LUAD dataset, add the --only_luad
flag.
You can pass the conda
activation path as an argument as well, for example:
fedpydeseq2-download-data --raw_data_output_path <path> --conda_activate_path /opt/miniconda/bin/activate
Origin of the data
- The
Counts_raw.parquet
andrecount3_metadata.csv
files are downloaded from the RECOUNT3 database. - The
tumor_purity_metadata.csv
file is downloaded from the Systematic pan-cancer analysis of tumour purity paper. - The
cleaned_clinical_metadata.csv
file is downloaded from the An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics paper.
For more detailed references, see the References section.
The assets
directory contains a TCGA opener necessary to open the data on each center
when performing a federated experiment with Substra.
In particular, the fedpydeseq2_datasets/assets/tcga
directory contains the following files:
assets/tcga
├── description.md
├── opener.py
The opener is a Python script that opens the data and makes it available to the
Substra platform. The description.md
file contains a description of the data.
For more details on how the opener works, please refer to the Substra documentation.
The data
directory contains the raw data.
The raw
directory contains the data downloaded from the original sources,
with the download_data
scripts.
It is organized as follows:
data
├── raw
│ └── tcga
│ ├── <COHORT NAME>
│ │ ├── Counts_raw.parquet
│ │ └── recount3_metadata.csv
│ ├── centers.csv
│ ├── cleaned_clinical_metadata.csv
│ └── tumor_purity_metadata.csv
This module not only provides the raw data on which to test fedpydeseq2
; it also provides the necessary
preprocessing functions, to organise the data according to their center of origin, and to aggregate the raw
data into metadata and counts acceptable to run pydeseq2 or fedpydeseq2.
These preprocessing function usually create the preprocessed data in a processed_data_path
directory, with the following
structure (the files shown below are created by different preprocessing functions).
└── <processed_data_path>
├── tcga
│ ├── <COHORT NAME>
│ │ ├── counts.parquet
│ │ └── clinical_data.csv
├── centers_data
│ └── tcga
│ ├── <dataset_and_dge_experiment_id>
│ │ ├── <center_i>
│ │ │ ├── counts_data.csv
│ │ │ ├── metadata.csv
│ │ │ └── ground_truth_dds.pkl
└── pooled_data
└── tcga
├── <dataset_and_dge_experiment_id>
│ ├── counts_data.csv
│ ├── metadata.csv
│ └── ground_truth_dds.pkl
These files are automatically generated, if they are not
already present from the raw files or if the force
option is on.
Note that the centers are always indexed by an integer, starting from 0. For example, one would
have center_0,...,center_3
if there are
4 centers in the experiment.
The <dataset_and_dge_experiment_id>
is an identifier of a differential gene expression task (and its specific hyperparameters)
and a TCGA dataset. fedpydeseq2
or deseq2
can then be run on the data corresponding to that DGE task.
In this repository, we study the following cofactors:
- the
gender
of the patients, which is obtained from thecleaned_clinical_metadata.csv
file; - the
CPE
of the samples, which is obtained from thetumor_purity_metadata.csv
file; - the
stage
of the patients, which is obtained from thecleaned_clinical_metadata.csv
file. this stage is originally a stage betweenI
andIV
, but we have grouped them intoI-II-III
(Non-advanced
) andIV
(Advanced
) stages, to have a binary covariate. - the
center_id
of the samples, which is obtained from thecenters.csv
file and used to create natural centers for the federated experiments.
The processing is done by functions in the fedpydeseq2_datasets
directory. There are three main functions.
- the
common_preprocessing_tcga
function in thefedpydeseq2_datasets/common_preprocessing.py
file; - the
setup_tcga_dataset
function in thefedpydeseq2_datasets/process_and_split_data.py
file; - the
setup_tcga_ground_truth_dds
function in thetcga_preprocessing/create_reference_dds.py
file.
The role of the common_preprocessing_tcga
function is to generate counts and processed
clinical data for a given cohort (e.g. LUAD
), from the raw data.
└── processed
├── tcga
│ ├── <COHORT NAME>
│ │ ├── counts.parquet
│ │ └── clinical_data.csv
The counts.parquet
file contains the counts data, indexed by TCGA sample barcode,
and with columns corresponding to the gene_id in ENSEMBL convention.
Note that we filter out the PAR_Y
genes, as they are not common to all patients.
The clinical_data.csv
file aggregates the different metadata from the different sources
described above in a per-cohort fashion. This csv
is indexed by the TCGA sample barcode.
It contains the following columns:
gender
: the gender of the patient;CPE
: CPE stands for "consensus measurement of purity estimations", and is an aggregate of different purity estimations for the sample;stage
: the stage of the patient, as an integer between 1 and 4center_id
: the center id of the sample, as an integeris_normal_tissue
: a boolean indicating if the sample is a normal tissue or not.T
: the tumor grade of the patient, as an integer between 1 and 4N
: the nodal status of the patient, as an integer between 0 and 3M
: the metastasis status of the patient, as an integer between 0 and 1
The role of the setup_tcga_dataset
function and the setup_tcga_ground_truth_dds
function
is to generate the data necessary for the federated AND corresponding pooled experiments, creating
this part of the arborescence:
└── processed
├── centers_data
│ └── tcga
│ ├── <dataset_and_dge_experiment_id>
│ │ ├── <center_i>
│ │ │ ├── counts_data.csv
│ │ │ ├── metadata.csv
│ │ │ └── ground_truth_dds.pkl
└── pooled_data
└── tcga
├── <dataset_and_dge_experiment_id>
│ ├── counts_data.csv
│ ├── metadata.csv
│ └── ground_truth_dds.pkl
The <dataset_and_dge_experiment_id>
identifies an experiment. It concatenates
not only the dataset name (TCGA cohort), but also the design factors, continuous factors
as well as other parameters used to filter the data.
The counts_data.csv
file contains the counts data, indexed by TCGA sample barcode,
and with columns corresponding to the gene_id in ENSEMBL convention.
The metadata.csv
file contains the clinical data, indexed by the TCGA sample barcode, and
containing only the columns corresponding to a design factor.
The ground_truth_dds.pkl
file contains the ground truth for the differential expression
analysis, as a dds
object from the DESeq2
package.
For more details on these functions, please refer to their respective documentations.
Note: the
setup_tcga_dataset
function will binarize thestage
into two categories:Advanced
andNon-advanced
.Advanced
corresponds to stageIV
, andNon-advanced
corresponds to stagesI
,II
andIII
. For the TCGA-PRAD cohort, we do not have the stage information, but we infer the stage from theT
,N
andM
columns. If theN
orM
columns are > 0, the stage is IV (see the following reference) and hence theAdvanced
stage. Otherwise, it isNon-advanced
.
The data downloaded here has mainly been obtained from TCGA and processed by the following works.
[1] Aran D, Sirota M, Butte AJ. Systematic pan-cancer analysis of tumour purity. Nat Commun. 2015 Dec 4;6:8971. doi: 10.1038/ncomms9971. Erratum in: Nat Commun. 2016 Feb 05;7:10707. doi: 10.1038/ncomms10707. PMID: 26634437; PMCID: PMC4671203. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4671203/
[2] Jianfang Liu, Tara Lichtenberg, Katherine A. Hoadley, Laila M. Poisson, Alexander J. Lazar, Andrew D. Cherniack, Albert J. Kovatich, Christopher C. Benz, Douglas A. Levine, Adrian V. Lee, Larsson Omberg, Denise M. Wolf, Craig D. Shriver, Vesteinn Thorsson et al. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics, Cell, Volume 173, Issue 2, 2018, Pages 400-416.e11, ISSN 0092-8674, https://doi.org/10.1016/j.cell.2018.02.05 https://www.sciencedirect.com/science/article/pii/S0092867418302290
[3] Wilks C, Zheng SC, Chen FY, Charles R, Solomon B, Ling JP, Imada EL, Zhang D, Joseph L, Leek JT, Jaffe AE, Nellore A, Collado-Torres L, Hansen KD, Langmead B (2021). "recount3: summaries and queries for large-scale RNA-seq expression and splicing." Genome Biol. doi:10.1186/s13059-021-02533-6 https://doi.org/10.1186/s13059-021-02533-6, https://doi.org/10.1186/s13059-021-02533-6.