All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Removed certain features that showed no effect in the model (SHAP). This is the final version that was submitted to the PEGS DREAM Challenge 2024.
- Added a challenge mode that trains the model on the full data (without holding out a test split)
- Removed the column "sampleID" from the telomere data frame (as it is not needed)
- Added counts of structural variants (SVs) for certain genes (e.g., APOB) as features to the model
- Removed features that showed no effect in the model (SHAP)
- The SHAP values for the features are now also written to a file
- Changed the model of HLA typing (combined alleles into one and added the genotype as a feature)
- Added methylation data as features to the model
- Added parameters for search space optimization
- Added environment.yml file to the repository (to ensure easy installation of the required packages using conda/mamba)
- Added ancestry and telomeric content as features to the model
- Added the Polygenic Score as a feature to the model
- Changed to native XGBoost implementation
- Merged family history of diseases from the survey data (e.g., Y/N)
- Added family history of cancer as a feature (e.g., Y/N)
- Changed model optimization to use the optuna package
- Changed hyperparameter tuning to Bayesian optimization
- Implemented a gradient boosting approach (XGBoost) that is currently based only on the survey data (Health and Exposure)
- Excluded fields that have more than 20% missing data and are probably not relevant for the prediction
- Changed the Dockerfile to a Python-based image (which allows packages to be installed seamlessly using pip)
- Changed the input format such that it now accepts the full data path instead of individual files
- Initial release (which only assigns random probabilities) for testing the submission
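
As an illustration of the missing-data exclusion mentioned in the entries above (fields with more than 20% missing values), here is a minimal sketch in plain Python. The function name `drop_sparse_fields`, the row-as-dictionary layout, and the choice to treat `None` and empty strings as missing are assumptions made for this example; it does not reproduce the project's actual preprocessing code.

```python
from typing import Dict, List, Optional

# Hypothetical row layout: one dict per survey participant.
Row = Dict[str, Optional[str]]


def drop_sparse_fields(rows: List[Row], max_missing: float = 0.20) -> List[Row]:
    """Return the rows without the fields whose missing share exceeds max_missing.

    A value counts as missing if it is None or an empty string (an assumption
    made for this sketch). Field names are taken from the first row.
    """
    if not rows:
        return []

    n_rows = len(rows)
    kept_fields = []
    for field in rows[0].keys():
        n_missing = sum(1 for row in rows if row.get(field) in (None, ""))
        # Keep the field only if its missing share is at most the threshold.
        if n_missing / n_rows <= max_missing:
            kept_fields.append(field)

    return [{field: row.get(field) for field in kept_fields} for row in rows]
```

With the default threshold, a field that is missing in exactly 20% of the rows is kept, and a field that is missing in more than 20% is dropped.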