MMRD input file format

Doga C. Gulhan edited this page Feb 1, 2024 · 20 revisions

Calculating indels that overlap with repeat regions

The repeat regions identified by the MSIProfiler and MSISensor algorithms are combined and can be found in the inst/extdata/repeat_bed_files directory. The following steps use these bed files to calculate the overlap of indels with the repeat regions. The scripts are in SigMA/examples.

Using bedtools

(Suggested)

If you have access to a Linux machine, using bedtools is much faster than the alternative below. The calculation can be done from either vcf or maf files in the few steps below:

Step0: Downloading repeat region bed files from git-lfs

The bed files are located in SigMA/inst/extdata/repeat_bed_files/. They are pulled from git-lfs with the script SigMA/examples/pull_bed_files_from_lfs.sh.

cd SigMA/examples/
Rscript create_list_bedfiles.R [bed_list]
git lfs install --local --skip-smudge
source pull_bed_files_from_lfs.sh [bed_list]

Step1: Create a bed file containing indels

Rscript convert_muts_to_bed.R [file_type] [input] [mut_bed_file]

Description of the input arguments:

file_type: "maf" or "vcf"

input: either the path to the maf file or the path to a directory containing only vcf files

mut_bed_file: path for the output file where mutations will be saved in bed format
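As a rough illustration of what converting mutations to bed format involves, the sketch below extracts indels from a toy maf file with awk. It is not the implementation of convert_muts_to_bed.R, and the columns that script writes are not described on this page. The output layout (chromosome, start, end, sample id, variant type), the temporary file names, and the toy records are assumptions made for this example. The maf column names follow the standard MAF format, and the toy file has no comment lines before the header.

```shell
#!/bin/sh
# Toy maf file with only the columns this sketch needs (hypothetical records).
tmp=$(mktemp -d)
{
  printf 'Chromosome\tStart_Position\tEnd_Position\tVariant_Type\tTumor_Sample_Barcode\n'
  printf 'chr1\t101\t103\tDEL\tS1\n'
  printf 'chr1\t200\t200\tSNP\tS1\n'
  printf 'chr2\t50\t51\tINS\tS2\n'
} > "$tmp/toy.maf"

# Look up columns by header name, keep INS and DEL rows, and shift the
# 1-based inclusive maf start to the 0-based half-open bed convention.
# The five-column output layout is an assumption of this sketch.
awk -F'\t' -v OFS='\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  $(col["Variant_Type"]) == "INS" || $(col["Variant_Type"]) == "DEL" {
    print $(col["Chromosome"]), $(col["Start_Position"]) - 1, $(col["End_Position"]), \
          $(col["Tumor_Sample_Barcode"]), $(col["Variant_Type"])
  }
' "$tmp/toy.maf" > "$tmp/indels.bed"

cat "$tmp/indels.bed"
```

Only the two indel rows are written; the SNP row is dropped because the overlap calculation concerns indels.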

Step2: Intersect the bed files defining repeat regions with the mutation bed file

./overlap_with_repeat_regions.sh [bed_list] [mut_bed_file] [output_file]

Description of the input arguments:

bed_list: a file containing the list of bed files that contain the repeat regions. The create_list_bedfiles.R script in Step0 creates the list.

mut_bed_file: path for bed file containing mutations

output_file: this path is used to save two files. The first is a bed file containing the mutations that overlap with repeat regions. The second, named counts_nmsi_<output_file>.csv, contains the number of indels overlapping with repeat regions per sample and is produced with SigMA/examples/make_table_indel_repeat.R.
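To make the overlap criterion concrete, here is a minimal sketch on toy data. It is not the implementation of overlap_with_repeat_regions.sh. It only illustrates, with plain awk instead of bedtools, what it means for an indel interval to overlap a repeat interval in bed coordinates (0-based, half-open). The file names, the toy coordinates, and the choice of column 4 as the sample id are assumptions made for this example.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Toy repeat regions: chromosome, start, end.
printf 'chr1\t100\t120\nchr1\t500\t510\nchr2\t50\t60\n' > "$workdir/repeats.bed"

# Toy indels; using column 4 for the sample id is an assumption of this sketch.
printf 'chr1\t105\t106\tS1\nchr1\t300\t301\tS1\nchr2\t59\t62\tS2\nchr2\t60\t61\tS2\n' > "$workdir/muts.bed"

# Print each indel that overlaps at least one repeat region.
# Half-open intervals [s1,e1) and [s2,e2) overlap when s1 < e2 and s2 < e1.
# This naive loop is quadratic; bedtools is far faster on real data.
overlap_indels() {
  awk -F'\t' '
    NR == FNR { n++; chrom[n] = $1; rs[n] = $2 + 0; re[n] = $3 + 0; next }
    {
      for (i = 1; i <= n; i++) {
        if ($1 == chrom[i] && ($2 + 0) < re[i] && rs[i] < ($3 + 0)) { print; break }
      }
    }
  ' "$1" "$2"
}

overlap_indels "$workdir/repeats.bed" "$workdir/muts.bed" > "$workdir/overlap.bed"

# Count the overlapping indels per sample (column 4).
cut -f4 "$workdir/overlap.bed" | sort | uniq -c
```

On these toy files, `bedtools intersect -u -a muts.bed -b repeats.bed` would select the same two records. Whether the SigMA script calls bedtools with these options is not stated on this page.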

Step3: Combine the indel counts with the 96-dimensional matrix of single base substitutions. After loading SigMA:

devtools::load_all()
combine_SBS_indel(file_SBS, file_overlap_repeat, mut_bed_file)

Make sure that the tumor column contains the same sample ids in all input files. The SBS file is updated to contain the nins, ndel, nmsi_ins and nmsi_del columns.

Description of input arguments:

file_SBS: A file containing SBS spectra. This file can be generated from the output of the make_matrix() function. It has 96 columns defining the SBS spectra and a tumor column with sample ids.

file_overlap_repeat: A file containing the number of insertions and deletions overlapping with repeat regions. This file is generated by the overlap_with_repeat_regions.sh script and saved as counts_nmsi_*.csv.

mut_bed_file: Bed file containing indels. This file can be generated with the convert_muts_to_bed.R script.
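The requirement that the tumor column agrees across files can be checked before calling combine_SBS_indel(). The sketch below is an illustration under stated assumptions, not part of SigMA. It assumes that both files are comma-separated, have a header row with a column named tumor, and contain no quoted fields with embedded commas. The file names and toy contents are hypothetical.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Toy stand-ins for the SBS file and the counts_nmsi_*.csv file (hypothetical contents).
printf 'sbs_1,tumor\n5,S1\n7,S2\n2,S3\n' > "$tmp/sbs.csv"
printf 'tumor,nmsi_ins,nmsi_del\nS1,3,10\nS2,0,1\n' > "$tmp/counts.csv"

# Print the sorted, unique values of the column named "tumor" in a simple CSV.
tumor_ids() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "tumor") col = i; next }
    col     { print $col }
  ' "$1" | sort -u
}

tumor_ids "$tmp/sbs.csv"    > "$tmp/sbs_ids.txt"
tumor_ids "$tmp/counts.csv" > "$tmp/counts_ids.txt"

# comm -3 suppresses ids present in both lists, leaving only the mismatches.
mismatch=$(comm -3 "$tmp/sbs_ids.txt" "$tmp/counts_ids.txt")
if [ -z "$mismatch" ]; then
  echo "tumor ids agree"
else
  echo "tumor ids found in only one file:"
  echo "$mismatch"
fi
```

In the toy files, sample S3 appears only in the SBS file, so it is reported as a mismatch.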

Using the built-in function

A much slower but more straightforward option is to use the built-in run_overlap_repeats() function. The bed files must first be pulled from git-lfs as described in Step0 under "Using bedtools".

After loading SigMA, you can calculate the indels that overlap with repeat regions with:

run_overlap_repeats(input = <input_maf_file> or <input_vcf_dir>, file_type = 'maf' or 'vcf')

Input file format

The input file should contain:

  • The SBS mutational spectra in the first 96 columns
  • A tumor column with sample ids
  • Counts of small insertions and deletions in the nins and ndel columns
  • Counts of small insertions and deletions overlapping with repeat regions in the nmsi_ins and nmsi_del columns
  • Optional: MSISensor score under the column msisensor. See: https://github.com/ding-lab/msisensor.
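A quick header check can catch a missing column before the file is passed to SigMA. This is a sketch under stated assumptions, not a SigMA utility. It assumes a comma-separated file with an unquoted header row. The file name and toy header are hypothetical, and only the named columns listed above are checked, not the 96 SBS columns.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Toy header only; a real file would start with the 96 SBS columns.
printf 'sbs_1,sbs_2,tumor,nins,ndel,nmsi_ins\n' > "$tmp/input.csv"

# Print every required column name that is absent from the header row.
missing_columns() {
  file=$1
  shift
  header=$(head -n 1 "$file")
  for name in "$@"; do
    case ",$header," in
      *",$name,"*) ;;               # column present
      *) printf '%s\n' "$name" ;;   # column missing
    esac
  done
}

missing=$(missing_columns "$tmp/input.csv" tumor nins ndel nmsi_ins nmsi_del)
if [ -n "$missing" ]; then
  echo "missing columns: $missing"
fi
```

The toy header lacks nmsi_del, so that name is reported. The optional msisensor column is deliberately left out of the required list.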

Example

git clone https://github.com/parklab/SigMA
cd SigMA
git checkout -b dev
git pull origin dev

cd examples/
Rscript create_list_bedfiles.R bed_list.txt
git lfs install --local --skip-smudge
source pull_bed_files_from_lfs.sh bed_list.txt

Rscript convert_muts_to_bed.R maf example_muts_mmrd.maf example_muts_mmrd.bed
./overlap_with_repeat_regions.sh bed_list.txt example_muts_mmrd.bed indel_repeat.bed
Rscript example_combine_files_mmrd.R