NBC-Based Novelty Detection in Multi-Level Taxonomy
This project investigates novelty detection with a Naive Bayes classifier (NBC) trained on k-mer counts across different taxonomic levels. The database used for this project consisted of 4634 unique species.
The first step of the project is to create balanced training datasets. At the superkingdom level, training sets were generated by selecting 2 of the 4 available classes (archaea, eukaryota, bacteria, viruses), resulting in 6 unique combinations.
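The six superkingdom pairings follow directly from choosing 2 of 4 classes. A minimal sketch of that enumeration is shown here; the names SUPERKINGDOMS and superkingdom_pairs are illustrative and not taken from the project scripts.

```python
from itertools import combinations

# The four superkingdom classes named in the text.
SUPERKINGDOMS = ["archaea", "eukaryota", "bacteria", "viruses"]

def superkingdom_pairs(classes=SUPERKINGDOMS):
    """Return every unordered pair of classes (hypothetical helper)."""
    return list(combinations(classes, 2))

pairs = superkingdom_pairs()
for first, second in pairs:
    print(first, second)
```

Each pair defines one training set; the two classes left out of a pair are the ones the model has never seen.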
At lower taxonomic levels (phylum, class, order, family), classes containing fewer than 30 instances were first excluded from the database. Then, 50% of the total representatives were randomly sampled five times, resulting in five trials at each level for each k-mer length used.
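The filtering and repeated sampling described above could be sketched as follows. This is an illustration under stated assumptions, not the project's actual script: it assumes "representatives" means the individual genomes left after filtering, that each trial is drawn independently, and that a fixed seed is used for reproducibility. The function names and the toy labels are hypothetical.

```python
import random
from collections import Counter

def filter_small_classes(labels, min_size=30):
    """Drop genomes whose class has fewer than min_size members.

    labels maps a genome ID to its class at the chosen taxonomic level.
    """
    counts = Counter(labels.values())
    return {g: c for g, c in labels.items() if counts[c] >= min_size}

def sample_trials(genome_ids, fraction=0.5, n_trials=5, seed=0):
    """Draw n_trials independent random subsets of the given fraction.

    Assumption: sampling is over genomes and without replacement
    within a trial.
    """
    rng = random.Random(seed)
    ids = sorted(genome_ids)
    k = int(len(ids) * fraction)
    return [sorted(rng.sample(ids, k)) for _ in range(n_trials)]

# Toy data: one class with 40 genomes (kept), one with 10 (excluded).
labels = {f"g{i}": "Proteobacteria" for i in range(40)}
labels.update({f"h{i}": "Firmicutes" for i in range(10)})

kept = filter_small_classes(labels)
trials = sample_trials(kept)
```

Genomes in a trial's subset form the training data for that trial; the remaining genomes are the ones the model has not seen.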
Models were trained on k-mer frequencies. All k-mer count files were generated using Jellyfish. The k-mer lengths used in this project were 3, 6, 9, 12, and 15.
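To make the notion of a k-mer count concrete, here is a small pure-Python illustration. It is not a substitute for Jellyfish and does not reproduce its options or output format; it counts forward-strand k-mers only and skips windows containing characters other than A, C, G, or T, both of which are assumptions made for this example.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count overlapping k-mers in a DNA sequence (illustrative only)."""
    seq = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows with ambiguous bases such as N.
        if set(kmer) <= valid:
            counts[kmer] += 1
    return counts

counts = count_kmers("ACGTACGTNACG", 3)
print(counts.most_common())
```

The number of possible k-mers grows as 4^k, so the longer lengths (12 and 15) produce far sparser count tables than the shorter ones.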
The testing data consisted of 100 random reads from each class (species) in the database. The same test set was used for all trials.
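Drawing a fixed number of random reads from a genome can be sketched as below. The read length is not stated in this document, so read_length=150 is a placeholder assumption, as are the function name, the uniform choice of start positions, and the fixed seed (which is one simple way to reuse identical reads across trials).

```python
import random

def sample_reads(genome, n_reads=100, read_length=150, seed=0):
    """Sample n_reads substrings of read_length at random start positions.

    read_length is a hypothetical value; reads may overlap.
    """
    if len(genome) < read_length:
        raise ValueError("genome is shorter than the requested read length")
    rng = random.Random(seed)
    last_start = len(genome) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_length] for s in starts]

# Toy genome of 400 bases.
genome = "ACGT" * 100
reads = sample_reads(genome)
```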
Each classification produces a CSV file containing the log probability for each genome in the test set. For each trial, genomes present in the training data were labeled as "known", while those not present were labeled as "unknown", reducing this multi-class problem to a binary classification task. ROC curves with AUC values, as well as distribution plots, were then generated to assess novelty detection.
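Once every test item has a score and a known/unknown label, the AUC can be computed directly. The sketch below uses the pairwise (Mann-Whitney) definition of AUC in pure Python. It assumes the score is the log probability from the CSV and that a higher value indicates a "known" genome; the scores and labels shown are made-up toy values, not project results.

```python
def roc_auc(scores, is_known):
    """AUC as the fraction of (known, unknown) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    known = [s for s, k in zip(scores, is_known) if k]
    unknown = [s for s, k in zip(scores, is_known) if not k]
    if not known or not unknown:
        raise ValueError("need at least one known and one unknown item")
    wins = 0.0
    for p in known:
        for n in unknown:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(known) * len(unknown))

# Toy log-probability scores and labels (illustrative only).
scores = [-120.5, -130.2, -250.9, -300.1, -125.0, -140.0]
is_known = [True, True, False, False, False, True]
auc = roc_auc(scores, is_known)
print(round(auc, 3))
```

An AUC of 0.5 means the scores do not separate known from unknown genomes at all, while 1.0 means every known genome scores higher than every unknown one.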
All scripts in this project were executed on Picotte, Drexel's main high-performance computing cluster.