Improving Personalized Prediction of Cancer Prognoses with Clonal Evolution Models

Introduction

This repository contains source code for the pipeline in the following paper: Yifeng Tao, Ashok Rajaraman, Xiaoyue Cui, Ziyi Cui, Jesse Eaton, Hannah Kim, Jian Ma, and Russell Schwartz. Improving Personalized Predictiion of Cancer Prognoses with Clonal Evolution Models. bioRxiv 761510. 2019.

Prerequisites

The code runs on Python 2.7.

Common Python packages are required: os, warnings, collections, copy, math, sklearn, pickle, numpy, pandas.
Additional Python packages need to be installed as well: vcf, lifelines.

Depending on the availability of data, external programs for variant calling and evolutionary modeling may be required. Please follow their instructions for installation and running.

Variant calling (WGS): You may need Weaver, novoBreak, or Sanger pipeline. Alternatively, the variants called by Sanger pipeline are readily available from ICGC Data Portal with proper license and permission.
Variant calling (WES): You may need to follow TCGA pipeline. Alternatively, the variants called by TCGA pipeline are readily available from GDC Data Portal with proper license and permission.
Evolutionary modeling (WGS): TUSV is used in the pipeline.
Evolutionary modeling (WES): Canopy is used in the pipeline.

Data

The data/ directory saves the the raw/intermediate/result files. You need to take care of these directories specifically:

Clinical data: You may download the TCGA clinical data of different cancer types, and save them as data/TCGA/[CANCER]/clin.merged.txt, where [CANCER] can be BRCA, LUAD, and LUSC etc.
Somatic variants (SNVs/CNAs/SVs): Should be saved in the compressed VCF format as data/vcf-data/[SAMPLE-BARCODE].[variant-caller].[variant].vcf.gz. For example, TCGA-33-4586.sngr.sv.vcf.gz is the structural variant (SV) file of the sample with the barcode of TCGA-33-4586 called by the Sanger variant caller pipeline.
Phylogenies: You may save the TUSV output of the WGS sample with barcode [SAMPLE-BARCODE] under data/tusv/[SAMPLE-BARCODE]/. Similarly, save the Canopy output of WES sample with barcode [SAMPLE-BARCODE] under data/canopy/[SAMPLE-BARCODE]/.

You do not need to organize the following directories, which saves auxiliary files, intermediate files and results.

List of potential drivers for BRCA (BRCA_driver.txt) and LUCA (lung_driver.txt), positions of potential drivers in the genome (driver_hg19.txt for GRCh37/hg19, driver_hg38.txt for GRCh38/hg38) have been provided under data/intogen/.
Extracted survival data, clinical/driver/two-node/multi-node features will be saved under the directories data/{survival,clinical,driver,twonode,multinode}/.
Results will be saved under data/result/{brca_os,brca_dfs,lung_os,lung_dfs}/ for BRCA/LUCA cancer types and OS/DFS tasks.

Experiment

Download the repository and create directories:

git clone https://github.com/CMUSchwartzLab/cancer-phylogenetics-prognostic-prediction.git
cd cancer-phylogenetics-prognostic-prediction
python create_directories.py

Follow instructions of variant caller to call variants, and evolutionary models to build up phylogenies. Sort, rename, and put them under proper directories.

The experiment code below implements the feature extraction and maching learning algorithm. To avoid too many if-else branches and redundancies, the code in this repository specifically focuses on the WGS data, with somatic variants called by Sanger pipeline, and phylogenies generated by TUSV model. However, one should note that the pipeline is flexible, and is easy to adapt to other variant callers and phylogenetic models by slightly revising the corresonding modules.

To extract the OS/DFS time and different features from clinical data, somatic variants and cancer phylogenies:

python prepare_feature.py

To run feature filtering, step-wise feature selection and k-fold cross-validation of Cox regression:

python experiment.py

Contributors

The code has been contributed by Yifeng Tao, Ashok Rajaraman, Xiaoyue Cui, Ziyi Cui, Jesse Eaton, and Hannah Kim.

Contact: Yifeng Tao ([email protected]), Russell Schwartz ([email protected])

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. You are free to share or adapt the material for non-commercial purposes.

If you find this work helpful, please cite:

@article{tao2019cancerphylo,
  title = {Improving Personalized Prediction of Cancer Prognoses with Clonal Evolution Models},
  author = {Tao, Yifeng  and
    Rajaraman, Ashok  and
    Cui, Xiaoyue  and
    Cui, Ziyi  and
    Eaton, Jesse  and
    Kim, Hannah  and
    Ma, Jian  and
    Schwartz, Russell},
  journal = {bioRxiv},
  elocation-id = {761510},
  doi = {10.1101/761510},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2019/09/18/761510},
  eprint = {https://www.biorxiv.org/content/early/2019/09/18/761510.full.pdf},
  year = {2019},
}

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data/intogen		data/intogen
README.md		README.md
create_directories.py		create_directories.py
experiment.py		experiment.py
pipeline.png		pipeline.png
prepare_feature.py		prepare_feature.py
utils_exp.py		utils_exp.py
utils_feature.py		utils_feature.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Improving Personalized Prediction of Cancer Prognoses with Clonal Evolution Models

Introduction

Prerequisites

Data

Experiment

Contributors

License

About

Releases

Packages

Languages

CMUSchwartzLab/cancer-phylogenetics-prognostic-prediction

Folders and files

Latest commit

History

Repository files navigation

Improving Personalized Prediction of Cancer Prognoses with Clonal Evolution Models

Introduction

Prerequisites

Data

Experiment

Contributors

License

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages