From 584ff28f8abed9d9c73e2e5a2067ad678067975c Mon Sep 17 00:00:00 2001
From: souzadevinicius <souzadevinicius@gmail.com>
Date: Tue, 27 Aug 2024 05:40:10 +0100
Subject: [PATCH 1/5] Update PhEval Pipeline Documentation Fixes #351

---
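Note for reviewers: the Runs section added in this patch requires each run's `corpus`/`corpusvariant` pair to match an entry in `corpora`, and its `configuration` to match a `configs` id. A minimal Python sketch of that cross-check follows. It is an illustration only, not code from Monarch PhEval: `check_runs` and `EXAMPLE` are hypothetical names, and the dictionary is assumed to be what a YAML parser would return for the `pheval-config.yaml` examples shown in this patch.

```python
# Hypothetical helper: cross-checks the `runs` section of a parsed
# pheval-config.yaml against its `corpora` and `configs` sections.
def check_runs(config):
    """Return a list of human-readable problems found in `runs`."""
    config_ids = {c["id"] for c in config.get("configs", [])}
    corpora = {(c["id"], c["variant"]) for c in config.get("corpora", [])}
    problems = []
    for i, run in enumerate(config.get("runs", [])):
        # `configuration` must match an id declared in the configs section.
        if run.get("configuration") not in config_ids:
            problems.append(
                f"runs[{i}]: unknown configuration {run.get('configuration')!r}"
            )
        # `corpus` and `corpusvariant` must match a corpora entry.
        pair = (run.get("corpus"), run.get("corpusvariant"))
        if pair not in corpora:
            problems.append(f"runs[{i}]: unknown corpus/variant {pair!r}")
    return problems


# Assumed parse result of the YAML examples in the documentation.
EXAMPLE = {
    "directories": {"data": "data", "tmp": "data/tmp"},
    "corpora": [{"id": "lirical", "variant": "small_version"}],
    "configs": [
        {
            "tool": "exomiser",
            "id": "exomiser-semsim-ingest-13.3.0",
            "version": "13.3.0",
        }
    ],
    "runs": [
        {
            "tool": "exomiser",
            "corpus": "lirical",
            "corpusvariant": "small_version",
            "version": "13.3.0",
            "configuration": "exomiser-semsim-ingest-13.3.0",
        }
    ],
}

if __name__ == "__main__":
    for problem in check_runs(EXAMPLE):
        print(problem)
```

An empty result means every run resolves; otherwise each string names the offending run.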
 docs/pipeline.md | 271 +++++++++++++++--------------------------------
 1 file changed, 87 insertions(+), 184 deletions(-)

diff --git a/docs/pipeline.md b/docs/pipeline.md
index 94e4eba3e..23e219a7c 100644
--- a/docs/pipeline.md
+++ b/docs/pipeline.md
@@ -1,11 +1,19 @@
 # PhEval Pipeline
 
-### 1. Clone [PhEval](https://github.com/monarch-initiative/pheval)
+
+## TLDR
+
+
+The Pipeline presented on PhEval preprint (https://www.biorxiv.org/content/10.1101/2024.06.13.598672v1) was moved to a new repository - Monarch PhEval - https://github.com/monarch-initiative/monarch_pheval.
+
+**NOTE: The default Monarch PhEval pipeline, as proposed in the paper preprint, requires approximately 1 TB of disk space. Learn how to modify the pipeline configuration in the later sections of this documentation to customize the experiment.**
+
+### 1. Clone [Monarch PhEval](https://github.com/monarch-initiative/monarch_pheval)
   ```bash
-  git clone https://github.com/monarch-initiative/pheval.git
+  git clone https://github.com/monarch-initiative/monarch_pheval.git
   ```
 
-### 2. Installing PhEval dependencies
+### 2. Installing PhEval Pipeline dependencies
    Enter in the cloned folder and enter the following commands:
 
 ```bash
@@ -13,235 +21,130 @@ poetry shell
 poetry install
 ```
 
-### 3. Generate custom Makefile
-You must have Jinja2 installed, if you don't follow the steps [here](#installing-jinja-template)
-
-In resources folder are the following files responsible for makefile generation:
+### 3. Executing Pipeline
 
-📦resources  
- ┣ 📜Makefile.j2  
- ┣ 📜custom.Makefile  
- ┣ 📜generatemakefile.sh  
- ┗ 📜pheval-config.yaml  
-
-You must edit the `pheval-config.yaml` file setting the directory where you extracted exomiser and phenotype data. An example could be found [here](#pheval-configuration-file).
-After setting the pheval-config.yaml file
-
----
-
-```mermaid
-flowchart TD
-    inputs["prepare-inputs"]
-    sr1["Setting up Runners"]
-    corpora["prepare-corpora"]
-    scrambling["Scrambing Process"]
-    r1["run"]
-    inputs ===  sr1
-    sr1 === corpora
-    corpora === scrambling
-    scrambling === r1
-```
-
----
-
-## Data Flow
-
-```mermaid
-flowchart LR
-    vcf[("Phenopackets Original Data")]
-    pheno[("Scrambled Phenopackets")]
-    result["Phenotype Result"]
-    vcf -- prepare-corpora -->  pheno
-    pheno -- scramble factor e.g 0.5 -->  result
+```bash
+make pheval
 ```
 
+## Pipeline Description
 
-## Jinja Template PhEval Makefile Generator Requirements
-
-To generate a PhEval Makefile we use the [Jinja](https://jinja.palletsprojects.com/en/3.1.x/) template engine.
-
-### Installing Jinja Template
-
-- Linux (Ubuntu): `sudo snap install j2`
+The pipeline is divided into three main steps:
 
-- Mac OS:
+### 1. Data Preparation Phase
 
----
-## PhEval Makefile Template (.j2 file)
+The data preparation phase checks the completeness of the disease, gene and variant input data and, if required, prepares simulated VCF files. It also lets the user randomise phenotypic profiles with the PhEval corpus scramble command utility, which makes it possible to assess how well VGPAs handle noise and less specific phenotypic profiles when making predictions.
 
-📦resources  
- ┣ 📜**Makefile.j2**  
+### 2. Runner Phase
 
+The runner phase is structured into three stages: prepare, run, and post-process.
+ - The prepare step plays a crucial role in adapting the input data to meet the specific requirements of the tool. 
+ - In the run step, the VGPA is executed, applying the selected algorithm to the prepared data and generating the tool-specific outputs. Within the run stage, an essential task is the generation of input command files for the algorithm. These files serve as collections of individual commands, each tailored to run the targeted VGPA on specific samples. These commands are configured with the appropriate inputs, outputs and specific configuration settings, allowing for the automated and efficient processing of large corpora. 
+ - Finally, the post-processing step takes care of harmonising the tool-specific outputs into standardised PhEval TSV format, ensuring uniformity and ease of analysis of results from all VGPAs. In this context, the tool-specific output is condensed to provide only two essential elements, the entity of interest, which can either be a variant, gene, or disease, and its corresponding score. PhEval then assumes the responsibility of subsequent standardisation processes. This involves the reranking of the results in a uniform manner, ensuring that fair and comprehensive comparisons can be made between tools.
 
-*custom.Makefile* is the template that will be generated on the fly based on the *pheval-config.yaml*. Each of these configurations is filled using a syntax like this: ```{{ config.tool }}```. The value between the curly brackets is replaced by the corresponding configuration in the configuration file.
+### 3. Analysis Phase
 
----
+In the analysis phase, PhEval generates comprehensive statistical reports based on
+standardised outputs from the runner phase.
 
-## PhEval custom.Makefile
+## Customising PhEval Pipeline Experiments 
 
-📦resources  
- ┣ 📜**custom.Makefile**  
+The phEval pipeline is orchestrated using a Makefile strategy. Therefore, to describe a new experiment in the pipeline, the user needs to generate a Makefile workflow based on a configuration file.
 
----
-## PhEval generatemakefile.sh
+In the resources folder are the following files responsible for Makefile generation:
 
 📦resources  
- ┣ **📜generatemakefile.sh**  
-
-
-*generatemakefile.sh* is only a shortcut for Makefile rendering using the configuration file e.g.
+┣ 📜Makefile.j2  
+┣ 📜custom.Makefile  
+┣ 📜generatemakefile.sh  
+┗ 📜pheval-config.yaml  
 
-    bash ./resources/generatemakefile.sh
+Let's start describing `pheval-config.yaml` structure.
 
-## PhEval Configuration File
+### PhEval Configuration File
 
-In resources folder, there is a file named *pheval-config.yaml*, this file is responsible for storing the PhEval Makefile generation.
+#### Directories Section
 
-📦resources  
- ┗ **📜pheval-config.yaml**  
+The `data` and `tmp` properties are mandatory and must be specified in this section.
 
----
+- `data` property refers to the folder location where the necessary phenotypic data for the pipeline will be downloaded and extracted.
+- `tmp` property points to the folder where all temporary intermediate files will be generated.
 
-### Directories Section
 ```yaml
 directories:
+  data: data
   tmp: data/tmp
-  h2jar: ./h2-1.4.199.jar
-  phen2gene: ./Phen2Gene
-  exomiser: /home/data/exomiser/exomiser-cli-13.2.0-distribution/exomiser-cli-13.2.0
-  phenotype: /home/data/phenotype
-  workspace: /tmp/pheval
 ```
 
----
+#### Corpora Section
 
-### Configs Section
-```yaml
-configs:
-  - tool: phen2gene
-    version: 1.2.3
-    configuration: default
-  - tool: exomiser
-    version: 13.2.0
-    configuration: default
-    exomiser_db: semsim1
-```
-
-This section is responsible for setting up the configuration folder.
-All software declared in the configs section will be linked in this folder.
-In the configuration above, for example, we have one configuration for phen2gene and one for exomiser. In the [Directories Section](#directories-section), these two configurations must have one corresponding property set up.
-PhEval pipeline invokes the *prepare-inputs* goal, and in the preceding example, a configuration folder structure will be built that looks like this:
 
-📦configurations  
- ┣ 📂exomiser-13.2.0-default  
- ┗ 📂phen2gene-1.2.3-default  
+The `corpora` section specifies which corpus will be used in the experiment. This example uses the [LIRICAL](https://pubmed.ncbi.nlm.nih.gov/32755546/) corpus, a small comparison corpus of 385 case reports created for benchmarking the LIRICAL system.
+
+The user needs to specify the corpus id, which must match the corpora folder structure, e.g.
 
-Each of these folders is a symbolic link that points to the corresponding software folder indicated in the [Directories Section](#directories-section)
-
----
+📦corpora  
+ ┃ ┣ 📂lirical  
+ ┃ ┣ ┣ 📂small_version  
+ ┃ ┣ ┣ ┣ 📂phenopackets  
+ ┃ ┣ ┣ ┣ ┣ 📜PATIENT1.json  
+ ┃ ┣ ┣ ┣ ┣ 📜PATIENT2.json  
+ ┃ ┣ ┣ ┣ 📂vcf  
+ ┃ ┣ ┣ ┣ ┣ 📜PATIENT1.vcf.gz  
+ ┃ ┣ ┣ ┣ ┣ 📜PATIENT2.vcf.gz  
+ ┃ ┣ ┣ ┣ 📜corpus.yml  
+ ┃ ┣ ┣ ┣ 📜template_exome_hg19.vcf.gz  
 
-### Corpora Section
 ```yaml
 corpora:
   - id: lirical
-    scrambled:
-      - factor: 0.5
-      - factor: 0.7
-    custom_variants:
-      - id: no_phenotype
-  - id: phen2gene
-    scrambled:
-      - factor: 0.2
-      - factor: 0.9
-    custom_variants:
-      - id: no_phenotype
+    variant: small_version
 ```
 
-In this corpora section we can set up different experiments for corpus scrambling. Currently, PhEval provides corpora data from lirical, phen2gene, small_test and structural_variants
+#### Configs Section
 
 
-📦corpora  
- ┣ 📂lirical  
- ┣ 📂phen2gene  
- ┣ 📂small_test  
- ┗ 📂structural_variants  
+The `configs` section holds all custom configurations for the different VGPAs.
+It must declare:
+- tool: VGPA tool name.
+- id: an arbitrary unique identifier that will be used in the `runs` section
+- version: VGPA tool version
 
-
-The scramble property defines the magnitude of the scrambling factor during Phenopackets and VCF variants spiking process. Using the configuration in the example above, a corpora structure will be created like this:
-
-📦corpora  
- ┣ 📂lirical  
- ┃ ┗ 📂default  
- ┃ ┗ 📂scrambled-0.5  
- ┃ ┗ 📂scrambled-0.7  
- ┣ 📂phen2gene  
- ┃ ┗ 📂default  
- ┃ ┗ 📂scrambled-0.2  
- ┃ ┗ 📂scrambled-0.9  
-
-
----
-
-### Runs Section
 ```yaml
-runs:
-  - tool: exomiser
-    configuration: default
-    corpus: lirical
-    corpusvariant: scrambled-0.5
-    version: 13.2.0
+configs:
   - tool: phen2gene
-    configuration: default
-    corpus: phen2gene
-    corpusvariant: scrambled-0.2
+    id: phen2gene-1.2.3
     version: 1.2.3
 ```
 
-## Phen2Gen Specific Configuration
-
-
-The input directory `config.yaml` should be formatted like the example below and must be placed in `phen2gene: /pathtoPhen2Gene/Phen2Gene` declared in `pheval-config.yaml` file.
+`configs` section can also deal with special VGPA data preparation steps, for example,  Semantic Similarity ingestions into Exomiser phenotypic database e.g.
 
 ```yaml
-tool: phen2gene
-tool_version: 1.2.3
-phenotype_only: True
-tool_specific_configuration_options:
-  environment: local
-  phen2gene_python_executable: phen2gene.py
-  post_process:
-    score_order: descending
-```
-
-## Makefile Goals
-
-### make pheval
-
-this runs the entire pipeline including corpus preparation and pheval run
-
-
-	$(MAKE) prepare-inputs
-	$(MAKE) prepare-corpora
-	$(MAKE) pheval-run
-
-
-### make semsim
-
-generate all configured similarity profiles
-
-### make semsim-shuffle
-
-generate new ontology terms to the semsim process
-
-### make semsim-scramble
+  configs:
+  - tool: exomiser
+    id: exomiser-semsim-ingest-13.3.0
+    version: 13.3.0
+    phenotype: 2309
+    preprocessing:
+      - phenio-monarch-hp-hp.0.4.semsimian.sql
+```    
+The `phenotype` property gives the Exomiser phenotype database version, and the `preprocessing` section lists the SQL scripts to be executed against that phenotype database.
 
-scramble semsim profile
 
-### make semsim-convert
+#### Runs Section
 
-convert all semsim profiles into exomiser SQL format
+The `runs` section ties together all previously described sections and passes them to PhEval for the actual VGPA execution.
 
-### make semsim-ingest
+- `tool` property specifies which runner will be called
+- `corpus` and `corpusvariant` must match properties declared on the [corpora section](#corpora-section).
+- `version` should correspond to the tool version
+- `configuration` must match the id described on the [configuration section](#configs-section).
 
-takes all the configured semsim profiles and loads them into the exomiser databases
+```yaml
+runs:
+  - tool: exomiser
+    corpus: lirical
+    corpusvariant: small_version
+    version: 13.3.0
+    configuration: exomiser-semsim-ingest-13.3.0
+```
\ No newline at end of file

From 691a6d05d2e214838de0af5fe707217b8356e46c Mon Sep 17 00:00:00 2001
From: souzadevinicius <souzadevinicius@gmail.com>
Date: Tue, 27 Aug 2024 05:48:50 +0100
Subject: [PATCH 2/5] Referencing Customising PhEval Pipeline Experiments section
 in the TLDR note.

---
 docs/pipeline.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/pipeline.md b/docs/pipeline.md
index 23e219a7c..dc457ed23 100644
--- a/docs/pipeline.md
+++ b/docs/pipeline.md
@@ -6,7 +6,7 @@
 
 The Pipeline presented on PhEval preprint (https://www.biorxiv.org/content/10.1101/2024.06.13.598672v1) was moved to a new repository - Monarch PhEval - https://github.com/monarch-initiative/monarch_pheval.
 
-**NOTE: The default Monarch PhEval pipeline, as proposed in the paper preprint, requires approximately 1 TB of disk space. Learn how to modify the pipeline configuration in the later sections of this documentation to customize the experiment.**
+**NOTE: The default Monarch PhEval pipeline, as proposed in the paper preprint, requires approximately 1 TB of disk space. Learn how to modify the pipeline configuration [here](#customising-pheval-pipeline-experiments) to customize the experiments.**
 
 ### 1. Clone [Monarch PhEval](https://github.com/monarch-initiative/monarch_pheval)
   ```bash

From 2123063586eb4c86b9f92ed2811f30e9b79ec4c3 Mon Sep 17 00:00:00 2001
From: souzadevinicius <souzadevinicius@gmail.com>
Date: Tue, 27 Aug 2024 06:07:59 +0100
Subject: [PATCH 3/5] Adding Makefile generation command to the documentation

---
 docs/pipeline.md | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/docs/pipeline.md b/docs/pipeline.md
index dc457ed23..b4b46151d 100644
--- a/docs/pipeline.md
+++ b/docs/pipeline.md
@@ -49,7 +49,7 @@ standardised outputs from the runner phase.
 
 ## Customising PhEval Pipeline Experiments 
 
-The phEval pipeline is orchestrated using a Makefile strategy. Therefore, to describe a new experiment in the pipeline, the user needs to generate a Makefile workflow based on a configuration file.
+The phEval pipeline is orchestrated using a Makefile Jinja template strategy. Therefore, to describe a new experiment in the pipeline, the user needs to generate a Makefile workflow based on a configuration file.
 
 In the resources folder are the following files responsible for Makefile generation:
 
@@ -59,10 +59,12 @@ In the resources folder are the following files responsible for Makefile generat
 ┣ 📜generatemakefile.sh  
 ┗ 📜pheval-config.yaml  
 
-Let's start describing `pheval-config.yaml` structure.
+Let's begin by describing the `pheval-config.yaml` file and its structure.
 
 ### PhEval Configuration File
 
+This file defines the experiment settings. A Jinja template consumes this YAML configuration to generate the Makefile.
+
 #### Directories Section
 
 The `data` and `tmp` properties are mandatory and must be specified in this section.
@@ -147,4 +149,16 @@ runs:
     corpusvariant: small_version
     version: 13.3.0
     configuration: exomiser-semsim-ingest-13.3.0
+```
+
+### Generating new Makefile based on PhEval configuration file
+
+📦resources  
+┣ 📜generatemakefile.sh  
+┗ 📜pheval-config.yaml  
+
+To generate a new Makefile, execute the `generatemakefile.sh` script, which renders the Makefile template and fills it in from the `pheval-config.yaml` configuration file.
+
+```bash
+./resources/generatemakefile.sh
 ```
\ No newline at end of file

From 89df6bb535df2c44496c381aa72454f8538d5221 Mon Sep 17 00:00:00 2001
From: souzadevinicius <souzadevinicius@gmail.com>
Date: Tue, 27 Aug 2024 11:09:52 +0100
Subject: [PATCH 4/5] Fixing some markdown syntax styles

---
 docs/pipeline.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/pipeline.md b/docs/pipeline.md
index b4b46151d..dd16a902d 100644
--- a/docs/pipeline.md
+++ b/docs/pipeline.md
@@ -4,7 +4,7 @@
 ## TLDR
 
 
-The Pipeline presented on PhEval preprint (https://www.biorxiv.org/content/10.1101/2024.06.13.598672v1) was moved to a new repository - Monarch PhEval - https://github.com/monarch-initiative/monarch_pheval.
+The pipeline presented in the [PhEval preprint](https://www.biorxiv.org/content/10.1101/2024.06.13.598672v1) has moved to a new repository: [Monarch PhEval](https://github.com/monarch-initiative/monarch_pheval).
 
 **NOTE: The default Monarch PhEval pipeline, as proposed in the paper preprint, requires approximately 1 TB of disk space. Learn how to modify the pipeline configuration [here](#customising-pheval-pipeline-experiments) to customize the experiments.**
 
@@ -38,6 +38,7 @@ The data preparation phase, checks the completeness of the disease, gene and var
 ### 2. Runner Phase
 
 The runner phase is structured into three stages: prepare, run, and post-process.
+
  - The prepare step plays a crucial role in adapting the input data to meet the specific requirements of the tool. 
  - In the run step, the VGPA is executed, applying the selected algorithm to the prepared data and generating the tool-specific outputs. Within the run stage, an essential task is the generation of input command files for the algorithm. These files serve as collections of individual commands, each tailored to run the targeted VGPA on specific samples. These commands are configured with the appropriate inputs, outputs and specific configuration settings, allowing for the automated and efficient processing of large corpora. 
  - Finally, the post-processing step takes care of harmonising the tool-specific outputs into standardised PhEval TSV format, ensuring uniformity and ease of analysis of results from all VGPAs. In this context, the tool-specific output is condensed to provide only two essential elements, the entity of interest, which can either be a variant, gene, or disease, and its corresponding score. PhEval then assumes the responsibility of subsequent standardisation processes. This involves the reranking of the results in a uniform manner, ensuring that fair and comprehensive comparisons can be made between tools.

From 5e727ae9230d651df8a11701e0db8ad103e127d8 Mon Sep 17 00:00:00 2001
From: souzadevinicius <souzadevinicius@gmail.com>
Date: Tue, 27 Aug 2024 11:51:56 +0100
Subject: [PATCH 5/5] Fixing configs section indentation

---
 docs/pipeline.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/pipeline.md b/docs/pipeline.md
index dd16a902d..b02da0721 100644
--- a/docs/pipeline.md
+++ b/docs/pipeline.md
@@ -50,7 +50,7 @@ standardised outputs from the runner phase.
 
 ## Customising PhEval Pipeline Experiments 
 
-The phEval pipeline is orchestrated using a Makefile Jinja template strategy. Therefore, to describe a new experiment in the pipeline, the user needs to generate a Makefile workflow based on a configuration file.
+The PhEval pipeline is orchestrated by a Makefile rendered from a Jinja template. To describe a new experiment in the pipeline, the user therefore needs to generate a Makefile workflow from a configuration file.
 
 In the resources folder are the following files responsible for Makefile generation:
 
@@ -123,7 +123,7 @@ configs:
 `configs` section can also deal with special VGPA data preparation steps, for example,  Semantic Similarity ingestions into Exomiser phenotypic database e.g.
 
 ```yaml
-  configs:
+configs:
   - tool: exomiser
     id: exomiser-semsim-ingest-13.3.0
     version: 13.3.0