-
Notifications
You must be signed in to change notification settings - Fork 21
Quick start
After following the Installation you can get started.
MAF file example:
An example script that runs on an example maf file input is provided:
Rscript examples/test_maf.R
As can be seen in test_maf.R, set file_type = 'maf'
in make_matrix()
function to create 96-dimensional matrix from your mutation calls in maf format and then SigMA is run.
You can start with this code skeleton and:
- Replace the input to set to your mutation calls:
Replace:
data_file<- system.file("extdata/examples/test_mutations_50sample.maf", package="SigMA")
with
data_file <- <local_path_to_your_input>
- Update the
data
andtumor_type
options, see how to find the best parameter settings.
VCF file example:
An example script that runs on example vcf files is provided:
Rscript examples/test.R
As can be seen in test.R, set file_type = 'vcf'
in make_matrix()
function to create 96-dimensional matrix from your mutation calls in maf format and then SigMA is run.
You can start with this code skeleton and:
- Replace the input to direct to your mutation calls, a directory that only contains the vcf files to be processed by SigMA. In test.R:
replace:
data_dir <- system.file("extdata/vcf_examples/", package="SigMA")
with
data_dir <- <local_path_to_your_input_file>
- Update the
data
andtumor_type
options, see how to find the best parameter settings.
Specify reference genome through one of the below:
- using the
ref_genome_name
parameter which takes'hg19'
or'hg38'
values - if you want to use a different reference genome you can install it and provide it by
ref_genome
parameter
There are two different outputs formats a large format or a lite format where the information in the large format is summarized using lite_df()
function. The conversion can be done after the output is produced using the output file of the run()
function, let's say is called output_file_name:
df <- read.csv(output_file_name)
lite <- lite_df(df)
or by setting lite_format = T
in the settings of run()
.
In the lite file:
-
total_snvs
indicates the number of SNVs in that sample - columns ending with
_ml
indicate the total likelihood of clusters which are high in that signature, e.g.Signature_3_ml
is the total likelihood of clusters with Signature 3 -
Signature_3_c
is the cosine similarity to Signature 3 -
exp_sig3
is the exposure of Signature 3 calculated with NNLS -
Signature_3_l_rat
is the likelihood ratio of the NNLS decomposition with Signature 3 considering the possibility of alternative decompositions without Signature 3. A value of 0.5 indicate that an NNLS decomposition without Signature 3 is as likely. -
Signature_3_mva
is the SigMA score indicating the presence of Signature 3 estimated by the gradient boosting classifier implemented in SigMA -
pass_mva
andpass_mva_strict
are booleans indicating the presence of Signature 3 with the looser and strict selection thresholds that corresponds to 10% FPR and 1-5% FPR respectively. Produced only ifdo_mva
isTrue
. -
sigs_all
andexps_all
are the names of signatures included in NNLS calculation and the exposures. -
categ
is the general category the sample falls into according to the dominant signatures. TheSignature_3_hc
indicates that the sample passes the strict threshold and hc stands for high confidence, andSignature_3_lc
indicates that the sample passes the looser threshold but not the strict threshold.
The NNLS results are contained in the exps_all
(or exps_all_msi
if check_msi = T in run
function) and sigs_all
(or sigs_all_msi
). If samples are mismatch repair proficient using exps_all
and sigs_all
is suggested.
df <- run(<...>)
> head(df$sigs_all)
[1] "Signature_1.Signature_2.Signature_13.Signature_17b"
[2] "Signature_2.Signature_3.Signature_8.Signature_13"
[3] "Signature_1.Signature_5.Signature_18.Signature_28"
[4] "Signature_1.Signature_5"
[5] "Signature_1.Signature_2.Signature_13.Signature_18"
> head(df$exps_all)
[1] "6.28753746867414_20.8619624348833_36.4565328140342_1.61705928069764"
[2] "1.02559997853178_1.83454771316774_2.27976476868483_9.25209214668376"
[3] "134.888014310562_188.867066648019_106.398387417935_68.6727894422497"
[4] "16.0273878191902_25.711979828044"
[5] "1.36820147677162_14.7735768920957_8.1727146267034_5.23987489059937"
You can convert this compact info to a data frame using get_sig_exps()
:
df_exps <- get_sig_exps(df = df, col_exps = 'sigs_all', col_sigs = 'exps_all')
df <- cbind(df, df_exps)
Note that the NNLS output is not always reliable you can find the corresponding l_rat
columns for each signature to quantify the uniqueness of the NNLS solution. As an example:
> df$exp_sig17b[[1]]
1.617059
> df$Signature_17b_l_rat
0.01476486
This means that Signature 17b NNLS exposure is not robust, instead in the same sample
> df$exp_sig2[[1]]
20.8619624348833
> df$Signature_2_l_rat[[1]]
0.9964119
Signature 2 is necessary for a good decomposition.
Each Signature_X_ml (or Signature_X_ml_msi if check_msi = T
) column indicate the likelihood of a WGS cluster. The X was selected based on the dominant signatures, but interpreting the meaning of this matching from the name alone might be challenging. You can use the llh_max_characteristics
function to get relative activity of each signature in the most likely clusters.
df <- run(<...>) # run SigMA
df <- llh_max_characteristics(df, tumor_type, cosmic_version)
shows
> head(df$cluster_sigs_all)
[1] "Signature_2.Signature_13"
[2] "Signature_2.Signature_13"
[3] "Signature_5.Signature_14.Signature_15.Signature_10a.Signature_10b"
[4] "Signature_1.Signature_15.Signature_20.Signature_26.Signature_6"
[5] "Signature_2.Signature_5.Signature_13"
[6] "Signature_1.Signature_5.Signature_20.Signature_26.Signature_6"
> head(df$cluster_exps_all)
[1] "0.379876383326604_0.479620279610456"
[2] "0.379876383326604_0.479620279610456"
[3] "0.104838013337862_0.117388801504446_0.102457262738129_0.247320034628237_0.153266503507717"
[4] "0.290296634361362_0.11172525703329_0.123746638882048_0.106566789720104_0.100395257268131"
[5] "0.330216090479053_0.160128871589518_0.313269064300588"
[6] "0.161309358864425_0.145096377344316_0.134592613533534_0.102922984915257_0.192013161212538"
You can then split this info in a data frame using get_sig_exps()
function:
df_exps_clusters <- get_sig_exps(df = df, col_exps ='cluster_exps_all', col_sigs = 'cluster_sigs_all')
colnames(df_exps_clusters) <- paste0('clust_', colnames(df_exps_clusters)
df <- cbind(df, df_exps_clusters)
Your data frame now contains the exposures of the most likely cluster for each sample in separate columns.
> df$clust_exp_sig2
0.379876383326604