MMRD input file format
The repeat regions identified by the MSIProfiler and MSISensor algorithms are combined and can be found in the inst/extdata/repeat_bed_files directory. The following steps use these bed files to calculate the overlap of indels with the repeat regions. The scripts are in SigMA/examples.
(Suggested)
If you have access to a Linux machine, using bedtools is much faster than the alternative below. The calculation can be done from either vcf or maf files in the few steps below:
Step0: Download the repeat region bed files from git-lfs
The bed files are located in SigMA/inst/extdata/repeat_bed_files/. They are pulled with the script SigMA/examples/pull_bed_files_from_lfs.sh:
cd SigMA/examples/
Rscript create_list_bedfiles.R [bed_list]
git lfs install --local --skip-smudge
source pull_bed_files_from_lfs.sh [bed_list]
Step1: Create a bed file containing indels
Rscript convert_muts_to_bed.R [file_type] [input] [mut_bed_file]
Description of the input arguments:
file_type: "maf" or "vcf"
input: either the path to the maf file or the path to a directory containing only the vcf files
mut_bed_file: path for the output file where mutations will be saved in bed format
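For reference, BED intervals use 0-based, half-open coordinates, while MAF and VCF positions are 1-based. The sketch below illustrates only that coordinate shift on a toy table. The input layout (chrom, start, end, sample id) and the file names are hypothetical, and the columns that convert_muts_to_bed.R actually writes are not documented here.

```shell
# Hypothetical toy input: chrom, 1-based start, 1-based inclusive end, sample id
printf 'chr1\t101\t103\tS1\nchr2\t500\t500\tS2\n' > toy_indels.tsv

# Convert to BED coordinates: subtract 1 from the start, keep the end unchanged
awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 - 1, $3, $4 }' toy_indels.tsv > toy_indels.bed

cat toy_indels.bed
```

A single-base event at position 500 becomes the interval 499 to 500, so its BED length (end minus start) is 1.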
Step2: Intersect bed files defining repeat regions and mutation bed file.
./overlap_with_repeat_regions.sh [bed_list] [mut_bed_file] [output_file]
Description of the input arguments:
bed_list: a file containing the list of bed files that define the repeat regions. The create_list_bedfiles.R script in Step0 creates this list.
mut_bed_file: path to the bed file containing mutations
output_file: this path is used to save two files: a bed file with the mutations that overlap with repeat regions, and a file named counts_nmsi_<output_file>.csv that contains the per-sample counts of indels overlapping with repeat regions, produced with SigMA/examples/make_table_indel_repeat.R.
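To make the idea of Step2 concrete, here is a toy illustration of intersecting indels with repeat regions and counting the hits per sample. It is not what overlap_with_repeat_regions.sh does internally: the pipeline above relies on bedtools, whereas this sketch uses a naive awk loop so that it runs without extra tools. The file names, the sample id and ins/del columns, and the header of the counts table are all assumptions made for the example.

```shell
# Toy repeat regions (BED: chrom, 0-based start, half-open end)
printf 'chr1\t100\t110\nchr2\t50\t60\n' > toy_repeats.bed

# Toy indels; columns 4 and 5 (sample id, ins/del) are hypothetical
printf 'chr1\t104\t106\tS1\tdel\nchr1\t200\t201\tS1\tins\nchr2\t55\t56\tS2\tins\nchr2\t58\t59\tS2\tins\n' > toy_muts.bed

# Naive overlap: read the repeats first, then keep every indel that overlaps
# a repeat on the same chromosome.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { n++; chr[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; next }
     {
       # Half-open intervals overlap when each one starts before the other ends
       for (i = 1; i <= n; i++)
         if ($1 == chr[i] && ($2 + 0) < e[i] && s[i] < ($3 + 0)) { print; break }
     }' toy_repeats.bed toy_muts.bed > toy_overlap.bed

# Per-sample counts of overlapping insertions and deletions
{
  echo "tumor,nmsi_ins,nmsi_del"
  awk 'BEGIN { FS = "\t"; OFS = "," }
       { samples[$4] = 1; c[$4, $5]++ }
       END { for (t in samples) print t, c[t, "ins"] + 0, c[t, "del"] + 0 }' toy_overlap.bed | sort
} > toy_counts.csv

cat toy_counts.csv
```

The loop compares every indel with every repeat region, which is fine for a handful of lines but far too slow for genome-wide bed files.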
Step3: Combine the indel counts with the 96-dimensional matrix of single base substitutions. After loading SigMA:
devtools::load_all()
combine_SBS_indel(file_SBS, file_overlap_repeat, mut_bed_file)
Make sure that the tumor columns agree across the different input files. The SBS file is updated to contain the nins, ndel, nmsi_ins and nmsi_del columns.
Description of input arguments:
file_SBS: A file containing SBS spectra. This file can be generated from the output of the make_matrix() function; it has 96 columns defining the SBS spectra and a tumor column with sample ids.
file_overlap_repeat: A file containing the number of insertions and deletions overlapping with repeat regions. This file can be generated with the overlap_with_repeat_regions.sh script, which saves it as counts_nmsi_*.csv.
mut_bed_file: Bed file containing indels. This file can be generated with the convert_muts_to_bed.R script.
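Since the tumor columns have to agree between the inputs, it can help to compare the sample ids before calling combine_SBS_indel(). The following is a minimal sketch that assumes both files are plain comma-separated tables with an unquoted header containing a tumor column; the toy file names and contents are made up for the example.

```shell
# Toy stand-ins for file_SBS and the counts_nmsi_*.csv file
printf 'AAA,tumor\n3,S1\n5,S2\n' > toy_sbs.csv
printf 'tumor,nmsi_ins,nmsi_del\nS1,0,1\nS3,2,0\n' > toy_nmsi.csv

# Print the sorted unique values of the column named "tumor"
tumor_ids() {
  awk -F',' 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "tumor") col = i; next }
             col { print $col }' "$1" | sort -u
}

tumor_ids toy_sbs.csv  > ids_sbs.txt
tumor_ids toy_nmsi.csv > ids_nmsi.txt

# Sample ids that appear in only one of the two files
comm -3 ids_sbs.txt ids_nmsi.txt | tr -d '\t' > ids_mismatch.txt
cat ids_mismatch.txt
```

An empty ids_mismatch.txt means that both files list the same samples. Quoted fields or a different delimiter would need a proper CSV parser instead of this awk one-liner.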
Install the bed files from git-lfs as described in Step0 of the bedtools instructions above. A much slower but more straightforward option is to use the built-in run_overlap_repeats() function.
After loading the algorithm you can calculate indels that overlap with repeat regions by:
run_overlap_repeats(input = <input_maf_file> or <input vcf dir>, file_type = 'maf' or 'vcf')
The input file should contain:
- The SBS mutational spectra in the first 96 columns.
- A tumor column with sample ids.
- Counts of small insertions and deletions in the nins and ndel columns.
- Counts of small insertions and deletions overlapping with repeat regions in the nmsi_ins and nmsi_del columns.
- Optional: MSISensor score under the column msisensor. See: https://github.com/ding-lab/msisensor.
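The column requirements above can be checked quickly before running the analysis. This is only a sketch: it assumes a comma-separated file with an unquoted header line, and toy_input.csv is a made-up example that deliberately lacks the nmsi_del column.

```shell
# Toy input file with a header that is missing nmsi_del
printf 'tumor,nins,ndel,nmsi_ins\nS1,4,7,1\n' > toy_input.csv

missing=""
header=$(head -n 1 toy_input.csv)
for col in tumor nins ndel nmsi_ins nmsi_del; do
  # Wrap both sides in commas so that "nins" cannot match inside "nmsi_ins"
  case ",$header," in
    *",$col,"*) ;;
    *) missing="$missing $col" ;;
  esac
done
echo "missing columns:${missing:- none}"
```

The check looks only at column names; it does not verify that the first 96 columns hold the SBS spectra.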
git clone https://github.com/parklab/SigMA
cd SigMA
git checkout -b dev
git pull origin dev
cd examples/
Rscript create_list_bedfiles.R bed_list.txt
git lfs install --local --skip-smudge
source pull_bed_files_from_lfs.sh bed_list.txt
Rscript convert_muts_to_bed.R maf example_muts_mmrd.maf example_muts_mmrd.bed
./overlap_with_repeat_regions.sh bed_list.txt example_muts_mmrd.bed indel_repeat.bed
Rscript example_combine_files_mmrd.R