Skip to content

Quick start

Doga C. Gulhan edited this page Mar 25, 2021 · 7 revisions

After following the Installation you can get started.

Input/Output file format

Input

MAF file example:

An example script that runs on an example maf file input is provided: Rscript examples/test_maf.R

As can be seen in test_maf.R, set file_type = 'maf' in make_matrix() function to create 96-dimensional matrix from your mutation calls in maf format and then SigMA is run.

You can start with this code skeleton and:

  • Replace the input to set to your mutation calls:

Replace:

data_file<- system.file("extdata/examples/test_mutations_50sample.maf", package="SigMA")

with

data_file <- <local_path_to_your_input>

VCF file example:

An example script that runs on example vcf files is provided: Rscript examples/test.R

As can be seen in test.R, set file_type = 'vcf' in make_matrix() function to create 96-dimensional matrix from your mutation calls in maf format and then SigMA is run.

You can start with this code skeleton and:

  • Replace the input to direct to your mutation calls, a directory that only contains the vcf files to be processed by SigMA. In test.R:

replace:

data_dir <- system.file("extdata/vcf_examples/", package="SigMA")

with

data_dir <- <local_path_to_your_input_file>

Specify reference genome through one of the below:

  • using the ref_genome_name parameter which takes 'hg19' or 'hg38' values
  • if you want to use a different reference genome you can install it and provide it by ref_genome parameter

Output

There are two different outputs formats a large format or a lite format where the information in the large format is summarized using lite_df() function. The conversion can be done after the output is produced using the output file of the run() function, let's say is called output_file_name:

df <- read.csv(output_file_name)
lite <- lite_df(df)

or by setting lite_format = T in the settings of run().

In the lite file:

  • total_snvs indicates the number of SNVs in that sample
  • columns ending with _ml indicate the total likelihood of clusters which are high in that signature, e.g. Signature_3_ml is the total likelihood of clusters with Signature 3
  • Signature_3_c is the cosine similarity to Signature 3
  • exp_sig3 is the exposure of Signature 3 calculated with NNLS
  • Signature_3_l_rat is the likelihood ratio of the NNLS decomposition with Signature 3 considering the possibility of alternative decompositions without Signature 3. A value of 0.5 indicate that an NNLS decomposition without Signature 3 is as likely.
  • Signature_3_mva is the SigMA score indicating the presence of Signature 3 estimated by the gradient boosting classifier implemented in SigMA
  • pass_mva and pass_mva_strict are booleans indicating the presence of Signature 3 with the looser and strict selection thresholds that corresponds to 10% FPR and 1-5% FPR respectively. Produced only if do_mva is True.
  • sigs_all and exps_all are the names of signatures included in NNLS calculation and the exposures.
  • categ is the general category the sample falls into according to the dominant signatures. The Signature_3_hc indicates that the sample passes the strict threshold and hc stands for high confidence, and Signature_3_lc indicates that the sample passes the looser threshold but not the strict threshold.

How to get the NNLS exposure of each signature

The NNLS results are contained in the exps_all (or exps_all_msi if check_msi = T in run function) and sigs_all (or sigs_all_msi). If samples are mismatch repair proficient using exps_all and sigs_all is suggested.

df <- run(<...>)
> head(df$sigs_all)
 [1] "Signature_1.Signature_2.Signature_13.Signature_17b"
 [2] "Signature_2.Signature_3.Signature_8.Signature_13"  
 [3] "Signature_1.Signature_5.Signature_18.Signature_28" 
 [4] "Signature_1.Signature_5"              
 [5] "Signature_1.Signature_2.Signature_13.Signature_18" 
> head(df$exps_all)
[1] "6.28753746867414_20.8619624348833_36.4565328140342_1.61705928069764"
[2] "1.02559997853178_1.83454771316774_2.27976476868483_9.25209214668376"
[3] "134.888014310562_188.867066648019_106.398387417935_68.6727894422497"
[4] "16.0273878191902_25.711979828044"                  
[5] "1.36820147677162_14.7735768920957_8.1727146267034_5.23987489059937"  

You can convert this compact info to a data frame using get_sig_exps():

df_exps <- get_sig_exps(df = df, col_exps = 'sigs_all', col_sigs = 'exps_all')
df <- cbind(df, df_exps)

Note that the NNLS output is not always reliable you can find the corresponding l_rat columns for each signature to quantify the uniqueness of the NNLS solution. As an example:

> df$exp_sig17b[[1]]
1.617059
> df$Signature_17b_l_rat
0.01476486

This means that Signature 17b NNLS exposure is not robust, instead in the same sample

> df$exp_sig2[[1]]
20.8619624348833
> df$Signature_2_l_rat[[1]]
0.9964119

Signature 2 is necessary for a good decomposition.

How to interpret the cluster matching

Each Signature_X_ml (or Signature_X_ml_msi if check_msi = T) column indicate the likelihood of a WGS cluster. The X was selected based on the dominant signatures, but interpreting the meaning of this matching from the name alone might be challenging. You can use the llh_max_characteristics function to get relative activity of each signature in the most likely clusters.

df <- run(<...>) # run SigMA
df <- llh_max_characteristics(df, tumor_type, cosmic_version)

shows

> head(df$cluster_sigs_all)
[1] "Signature_2.Signature_13"                                         
[2] "Signature_2.Signature_13"                                         
[3] "Signature_5.Signature_14.Signature_15.Signature_10a.Signature_10b"
[4] "Signature_1.Signature_15.Signature_20.Signature_26.Signature_6"   
[5] "Signature_2.Signature_5.Signature_13"                             
[6] "Signature_1.Signature_5.Signature_20.Signature_26.Signature_6"  
> head(df$cluster_exps_all)
[1] "0.379876383326604_0.479620279610456"                                                      
[2] "0.379876383326604_0.479620279610456"                                                      
[3] "0.104838013337862_0.117388801504446_0.102457262738129_0.247320034628237_0.153266503507717"
[4] "0.290296634361362_0.11172525703329_0.123746638882048_0.106566789720104_0.100395257268131" 
[5] "0.330216090479053_0.160128871589518_0.313269064300588"                                    
[6] "0.161309358864425_0.145096377344316_0.134592613533534_0.102922984915257_0.192013161212538"

You can then split this info in a data frame using get_sig_exps() function:

df_exps_clusters <- get_sig_exps(df = df, col_exps ='cluster_exps_all', col_sigs = 'cluster_sigs_all')
colnames(df_exps_clusters) <- paste0('clust_', colnames(df_exps_clusters)
df <- cbind(df, df_exps_clusters)

Your data frame now contains the exposures of the most likely cluster for each sample in separate columns.

> df$clust_exp_sig2
0.379876383326604