"For many outcomes, roughly 80% of consequences come from 20% of causes (the "vital few")." - The Pareto Principle by Vilfredo Pareto
Get 80% of all standard (biomedical) data science analyses done semi-automatically with 20% of the effort by leveraging Snakemake's module functionality to use and combine pre-existing workflows into arbitrarily complex analyses.
Important
If you use MrBiomics, please don't forget to give credit to the authors by citing this original repository and the respective Modules and Recipes.
"Programming is about trying to make the future less painful. It’s about making things easier for our teammates." from The Pragmatic Programmer by Andy Hunt & Dave Thomas
- Why: Time is the most precious resource. By taking care of efficiency (i.e., maximum output with limited resources) scientists can re-distribute their time to focus on effectiveness (i.e., the biggest impact possible).
- How: Use the latest developments in workflow management to (re-)use and combine independent computational modules into arbitrarily complex analyses to leverage modern innovation methods (e.g., fast prototyping, design thinking, and agile concepts).
- What: Independent and single-purpose computational Modules, implemented as Snakemake workflows, encode standard approaches that are used to scale, automate, and parallelize analyses. Recipes combine modules into end-to-end best practice workflows, thereby accelerating analyses to the point of the unknown. Snakemake's module functionality enables Projects to combine modules, recipes and custom code into arbitrarily complex multi-omics analyses at scale.
Illustration of MrBiomics Modules, Recipes and Projects
Note
Altogether, this enables complex, portable, transparent, reproducible, and documented analyses of multi-omics data at scale.
"Is it functional, multifunctional, durable, well-fitted, simple, easy to maintain, and thoroughly tested? Does it provide added value, and doesn't cause unnecessary harm? Can it be simpler? Is it an innovation?" - Patagonia Design Principles
Modules are Snakemake workflows, consisting of Rules for multi-step analyses, that are independent, single-purpose, and sufficiently abstracted to be compatible with most up- and downstream analyses. A {module} can be general-purpose (e.g., Unsupervised Analysis) or modality-specific (e.g., ATAC-seq processing). Currently, the following nine modules are available, ordered by their applicability from general to specific:
Module | Type (Data Modality) | DOI | Stars |
---|---|---|---|
Unsupervised Analysis | General Purpose (tabular data) | | |
Split, Filter, Normalize and Integrate Sequencing Data | Bioinformatics (NGS counts) | | |
Differential Analysis with limma | Bioinformatics (NGS data) | | |
Enrichment Analysis | Bioinformatics (genes/genomic regions) | | |
Genome Track Visualization | Bioinformatics (aligned BAM files) | | |
ATAC-seq Processing | Bioinformatics (ATAC-seq) | | |
scRNA-seq Processing using Seurat | Bioinformatics (scRNA-seq) | | |
Differential Analysis using Seurat | Bioinformatics (scRNA-seq) | | |
Perturbation Analysis using Mixscape from Seurat | Bioinformatics (scCRISPR-seq) | | |
Note
⭐️ Star and share modules you find valuable 📤 — help others discover them, and guide our future work!
Tip
For detailed instructions on the installation, configuration, and execution of modules, check out the wiki. Generic instructions are also shown in each module's respective Snakemake workflow catalog entry.
“Absorb what is useful. Discard what is not. Add what is uniquely your own.” - Bruce Lee
You can (re-)use and combine pre-existing workflows within your projects by loading them as Modules (supported since Snakemake 6). The combination of multiple modules into projects that analyze multiple datasets represents the overarching vision and power of MrBiomics.
Note
When applied to multiple datasets within a project, each dataset should have its own result directory within the project directory.
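For example, a project analyzing two datasets might be laid out as follows; all directory and dataset names are purely illustrative:

```
MyProject/
├── config/
│   ├── config.yaml          # central project configuration
│   ├── MyData/              # per-dataset module configurations
│   └── OtherData/
├── workflow/
│   ├── Snakefile
│   └── rule/                # per-dataset snakefiles loading the modules
└── results/
    ├── MyData/              # each dataset gets its own result directory
    └── OtherData/
```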
Three components are required to use a module within your Snakemake workflow (i.e., a project).
- Configuration: The `config/config.yaml` file has to point to the respective configuration files per dataset and workflow.

  ```yaml
  #### Datasets and Workflows to include ####
  workflows:
      MyData:
          other_workflow: "config/MyData/MyData_other_workflow_config.yaml"
  ```
- Snakefile: Within the main Snakefile (`workflow/Snakefile`) we have to:
  - load all configurations;
  - include the snakefiles that contain the dataset-specific loaded modules and rules (see next point);
  - and add all modules' outputs to the target rule's `input` (a minimal sketch follows this list).
- Modules: Load the required modules and their rules within separate snakefiles (`*.smk`) in the `rule/` folder. Recommendation: use one snakefile per dataset.

  ```python
  module MyData_other_workflow:
      # here, plain paths, URLs and the special markers for code hosting providers (e.g., github) are possible
      snakefile: "other_workflow/Snakefile"
      config: config["MyData_other_workflow"]

  use rule * from MyData_other_workflow as MyData_other_workflow_*
  ```
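Putting the three components together, a minimal `workflow/Snakefile` could look like the sketch below. The config-loading loop, the included snakefile `rule/MyData.smk`, and the use of the module's renamed `all` rule as target input are illustrative assumptions, not prescribed by Snakemake:

```python
##### main Snakefile: wire configuration, modules, and targets together #####

import yaml

# load the central project configuration
configfile: "config/config.yaml"

# sketch: load every dataset/workflow configuration referenced in config/config.yaml
# so that modules can access it as config["<dataset>_<workflow>"]
for dataset, workflows in config["workflows"].items():
    for workflow_name, config_path in workflows.items():
        with open(config_path) as f:
            config[f"{dataset}_{workflow_name}"] = yaml.safe_load(f)

# include the dataset-specific snakefile that loads the modules (see above)
include: "rule/MyData.smk"

# target rule: request all loaded modules' outputs
rule all:
    input:
        rules.MyData_other_workflow_all.input,
```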
"Civilization advances by extending the number of important operations which we can perform without thinking of them." - Alfred North Whitehead, author of Principia Mathematica
Recipes are combinations of existing modules into end-to-end best practice analyses. They can be used as templates for standard analyses by leveraging existing modules, thereby enabling fast iterations and progression into the unknown. This represents "functional knowledge management". Every recipe is described on its own wiki page and demonstrated by application to a publicly available dataset.
Recipe | Description | # Modules |
---|---|---|
ATAC-seq Analysis | From raw BAM files to enrichments of differentially accessible regions. | 6(-7) |
RNA-seq Analysis | From raw BAM files to enrichments of differentially expressed genes. | 6(-7) |
Integrative ATAC-seq & RNA-seq Analysis | From count matrices to epigenetic potential and relative transcriptional abundance. | 7(-8) |
scRNA-seq Analysis | From count matrix to enrichments of differentially expressed genes. | 5(-6) |
scCRISPR-seq Analysis | From count matrix to knockout phenotype enrichments. | 6(-7) |
Usage: Process each dataset module by module, checking the results of each module to inform the configuration of the next (see the sketch below). This iterative method allows for quick initial completion, followed by refinement in subsequent iterations based on feedback from yourself or collaborators. Adjustments in later iterations are straightforward, requiring only changes to individual configurations or annotations. Ultimately, you end up with a reproducible and readable end-to-end analysis for each dataset.
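As an illustration of this module-by-module approach, assuming modules were loaded with the `use rule * ... as <dataset>_<workflow>_*` pattern from above, each module's renamed `all` rule can be targeted individually; the dataset and module names here are hypothetical:

```bash
# run only the upstream processing module for dataset MyData (rule name is illustrative)
snakemake --use-conda --cores 8 MyData_atacseq_pipeline_all

# inspect the results, configure the next module accordingly, then continue
snakemake --use-conda --cores 8 MyData_spilterlize_integrate_all
```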
Important
For end-to-end analysis, all configuration and annotation files have to exist before (re)running the workflow. If a module requires an annotation file generated by a previous module and it does not exist (yet), DAG construction fails with a `MissingInputException`. This can happen in workflows where downstream module configuration relies on outputs from preceding modules. For example, a sample annotation file created by the `atacseq_pipeline` module is used downstream to configure the `spilterlize_integrate` module. The best practice is to run the workflow module by module and save the outputs required for configuration (e.g., sample annotations) externally; we recommend the `config/` folder of the respective dataset. This keeps the workflow re-runnable for future end-to-end execution, for iterations with changed parameters, and for reproducibility. Checkpoints are not a solution, as they only apply to rules, not to modules.
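For instance, the externalization step could look like this; both paths are purely illustrative:

```bash
# save the sample annotation generated by the upstream module (illustrative path)
# into the dataset's config/ folder, where the downstream module's configuration expects it
cp results/MyData/atacseq_pipeline/sample_annotation.csv config/MyData/
```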
Note
⭐️ Star this repository and share recipes you find valuable 📤 — help others find them, and guide our future work!
- MrBiomics Wiki for instructions & tutorials
- GitHub list of MrBiomics modules
- My Data Science Setup - A tutorial for developing Snakemake workflows and beyond
- GitHub Page of this repository
- Curated and published workflows that could be used as modules:
- Software