naive-bayes-multi-level-basic

NBC-Based Novelty Detection in Multi-Level Taxonomy

This project investigates novelty detection with a Naive Bayes classifier (NBC) trained on k-mer counts at several taxonomic levels. The database used for this project contains 4634 unique species.

Methodology

Dataset creation

The first step is to create balanced training datasets. At the superkingdom level, each training set consists of 2 of the 4 available classes (archaea, eukaryota, bacteria, viruses), which gives 6 unique combinations.
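The six superkingdom combinations can be enumerated directly. The following is a minimal sketch, not the repository's own script; the names `SUPERKINGDOMS` and `superkingdom_training_pairs` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# The four superkingdom classes named in the text.
SUPERKINGDOMS = ["archaea", "eukaryota", "bacteria", "viruses"]


def superkingdom_training_pairs(classes=SUPERKINGDOMS):
    """Return every unordered pair of classes usable as a training set."""
    return list(combinations(classes, 2))


pairs = superkingdom_training_pairs()
for known in pairs:
    # Classes left out of the training set act as the novel ones.
    held_out = [c for c in SUPERKINGDOMS if c not in known]
    print(f"train on {known}, held out: {held_out}")
```

Choosing 2 of 4 classes gives C(4, 2) = 6 pairs, which matches the count stated above.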

At lower taxonomic levels (phylum, class, order, family), classes containing fewer than 30 instances were first excluded from the database. Then, 50% of the total representatives were randomly sampled five times, resulting in five trials at each level for each k-mer length used.
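The filter-then-sample procedure might look like the sketch below. It assumes that "representatives" means individual genomes and that sampling is done without replacement within each trial; the function name `filter_and_sample`, the `labels` mapping, and the fixed seed are all hypothetical and not taken from the repository.

```python
import random
from collections import Counter


def filter_and_sample(labels, min_size=30, fraction=0.5, n_trials=5, seed=0):
    """Drop small classes, then draw repeated random training subsets.

    labels: dict mapping a genome id to its class at one taxonomic level.
    Returns a list of n_trials lists of genome ids.
    """
    sizes = Counter(labels.values())
    # Keep only genomes whose class has at least min_size members.
    kept = sorted(g for g, cls in labels.items() if sizes[cls] >= min_size)
    n_train = int(len(kept) * fraction)
    rng = random.Random(seed)  # fixed seed so the trials are reproducible
    return [sorted(rng.sample(kept, n_train)) for _ in range(n_trials)]


# Toy example: class "A" has 40 genomes and is kept; class "B" has 10 and is dropped.
toy_labels = {f"A{i}": "A" for i in range(40)}
toy_labels.update({f"B{i}": "B" for i in range(10)})
trials = filter_and_sample(toy_labels)
```

Genomes left out of a given trial are the ones that trial later treats as novel.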

k-mer counting

Models are trained on k-mer frequencies. All k-mer count files were generated using Jellyfish. The k-mer lengths used in this project were 3, 6, 9, 12, and 15.
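To make the idea of a k-mer count concrete, here is a small pure-Python sketch. It only illustrates what is being counted and is not how Jellyfish works internally; the function name `count_kmers` is hypothetical, and skipping windows with non-ACGT characters is an assumption.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a DNA sequence."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: windows containing ambiguous bases (e.g. N) are skipped.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


counts = count_kmers("ACGTACGT", 3)
print(counts)
```

A sequence of length n contains n - k + 1 overlapping k-mers, so the number of possible distinct k-mers (4^k) grows quickly across the lengths used here.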

Testing data

The test set consisted of 100 random reads from each class (species) in the database. The same test set was used for all trials.
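Drawing a fixed number of random reads from a genome can be sketched as follows. The read length is not stated in this README, so `read_length=150` is purely a placeholder assumption, as are the function name `sample_reads` and the uniform choice of start positions.

```python
import random


def sample_reads(genome, n_reads=100, read_length=150, seed=0):
    """Draw n_reads substrings of read_length at random start positions."""
    if len(genome) < read_length:
        raise ValueError("genome is shorter than the requested read length")
    rng = random.Random(seed)  # fixed seed: the same reads on every call
    last_start = len(genome) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_length] for s in starts]


# Toy genome built from a repeating pattern, for illustration only.
toy_genome = "ACGTTGCA" * 100
reads = sample_reads(toy_genome)
```

Fixing the seed is one simple way to reuse an identical test set across all trials.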

Post-data analysis and ROC/AUC generation

Each classification run produces a CSV file containing the log-probability of each genome in the test set. For each trial, genome sequences present in the training data were labeled "known" and those absent from it were labeled "unknown", which reduces the multi-class problem to a binary classification task. ROC curves with their AUC values, as well as distribution plots, were then generated to assess novelty detection.
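The known/unknown labeling and the AUC computation can be sketched in plain Python. This is an illustration under stated assumptions, not the repository's analysis code: it assumes a higher log-probability indicates a "known" genome, and the names `label_known` and `roc_auc` are hypothetical. The AUC is computed through its rank interpretation (the fraction of known/unknown pairs in which the known item scores higher, with ties counted as half).

```python
def label_known(test_ids, training_ids):
    """Return 1 for genomes present in the training set, 0 otherwise."""
    training = set(training_ids)
    return [1 if g in training else 0 for g in test_ids]


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one known and one unknown item")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy log-probabilities for four test genomes; g1 and g2 were in training.
test_ids = ["g1", "g2", "g3", "g4"]
scores = [-10.0, -12.0, -11.0, -30.0]
labels = label_known(test_ids, training_ids=["g1", "g2"])
auc = roc_auc(labels, scores)
print(labels, auc)  # [1, 1, 0, 0] 0.75
```

An AUC of 0.5 corresponds to scores that do not separate known from unknown genomes at all, while 1.0 corresponds to perfect separation.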

Results

All scripts in this project were executed on Picotte, Drexel's main high-performance computing cluster.
