-
Notifications
You must be signed in to change notification settings - Fork 204
Metagenomics (answers; IMPACTT 2022)
These are answers for the 2022 IMPACTT bioinformatics workshop running from May 30th - June 1st.
Authors: Robyn Wright and Morgan Langille
This tutorial has been adapted from the previous 2021 version written for the 2021 Canadian Bioinformatics Workshop by Jacob Nearing and Morgan Langille.
The answers shown here are for the Metagenomic Taxonomic Composition Tutorial and the Metagenomic Functional Composition Tutorial.
Question 1: Based on the above line counts can you tell how many reads should be in the reverse FASTQs? Remember that a standard convention is to mark the reverse FASTQs with the characters R2
The number of reads should be the same in the reverse FASTQs as it is in the forward FASTQs.
Question 2: Take a look at kneaddata_read_counts.txt
: how many reads in sample CSM7KOMH were dropped?
You can look at the number of reads in the "final pair1" or "final pair2" columns for this - they should show 24,007 reads, meaning that 993 reads were dropped.
Question 3: It's reasonable that the processed data was provided by IBDMDB since their goal is to provide standardized data that can be used for many projects. However, why is it important that normally all raw data be uploaded to public repositories?
There are a few different reasons that it's important that all raw data be uploaded to repositories, primarily that someone else should be able to verify your results by repeating exactly what you did, but also the best standards for processing sequencing data are constantly evolving and someone else may want to look at something different in those samples. For example, you may have looked at only the bacterial reads in your samples, but someone else may be interested in the archaeal, fungal or viral reads.
Question 4: What does the text {1/.}.kraken.txt
and {1/.}.kreport
do in the above command? (Hint look at our above discussion of parallel).
It removes the directory (cat_reads/
) and file suffix (.fastq
) from the file names, and also adds on .kraken.txt
or .kreport
to these file names.
Question 5: How many more reads in each sample are classified using the full RefSeq database (RefSeqCompleteV205) than with the minikraken database (with no confidence threshold)?
*Hint: you can use the less
command to look at both the .kreport
and the .kraken.txt
files.
For example, if we look at the CSM7KOMH_0.0_minikraken.kreport
and CSM7KOMH_0.0_RefSeqCompleteV205.kreport
files, we can see that with the minikraken database, 33,117 reads or 68.97% of reads were classified, while with the RefSeqCompleteV205 database, 49,463 reads or 98.98% of reads were classified.
Question 6: How many more reads in each sample are classified using a confidence threshold of 0.0 compared with a confidence threshold of 0.1 for the minikraken database?
For example, if we look at the same samples but with a confidence threshold of 0.1 (CSM7KOMH_0.1_minikraken.kreport
and CSM7KOMH_0.1_RefSeqCompleteV205.kreport
), then we find that 26,558 reads or 55.31% of reads were classified with the minikraken database, while 49,409 reads or 98.87% of reads were classified with the RefSeqCompleteV205 database. So we can see that there was a fairly large decrease in the number of reads classified with the minikraken database when we went from a confidence threshold of 0.0 to 0.1, while with the RefSeqCompleteV205 database, this difference was very small. As you can see, we need to take into account both the database and the confidence threshold when choosing which to use for your samples. If you want to read more about this, you can do so in our preprint.
Question 7: Take a look at the raw output for the same sample but run with a confidence threshold of 0.1. Was the second read classified? Why or why not?
In this example, you should see that the second read with a confidence threshold of 0.0 reads:
C CB1APANXX170426:1:1101:10001:37312/1 Bacteroides (taxid 816) 101 0:57 816:5 0:4 816:1
And with a confidence threshold of 0.1 reads:
U CB1APANXX170426:1:1101:10001:37312/1 unclassified (taxid 0) 101 0:57 816:5 0:4 816:1
If we calculate the confidence as we had before for the taxid 816, which the read is initially classified as, then this gives us (5+1)/(57+5+4+1) = 6/67 = 0.089, so this is below the confidence threshold of 0.1. Because the other k-mers within the read were unclassified, this read becomes unclassified when we set the confidence threshold to 0.1.
Question 8: How would you go about figuring out the total number of species we found within our ten different samples?
You can simply count the number of lines within the Bracken summary files, like so:
wc -l bracken_out_merged/merged_output_minikraken.species.bracken
wc -l bracken_out_merged/merged_output_RefSeqCompleteV205.species.bracken
Remember that these files also have a header row, so we'll need to subtract one from each, giving 415 species in the MiniKraken-classified samples and 2,381 species in the RefSeqCompleteV205-classified samples.
Question 9: How does the number of species classified by each of the confidence thresholds differ, and how does it differ between the minikraken and the RefSeqCompleteV205 databases?
You can use some commands like this to see the number of species classified within each file, in different groups at a time:
wc -l bracken_out/*minikraken*
wc -l bracken_out/*RefSeqCompleteV205*
wc -l bracken_out/*0.0_minikraken*
wc -l bracken_out/*0.1_minikraken*
wc -l bracken_out/*0.0_RefSeqCompleteV205*
wc -l bracken_out/*0.1_RefSeqCompleteV205*
Question 1: How many protein sequences did the sequence CB1APANXX170426:1:1101:2664:65700/2
align with in the sample CSM7KOMH? Which alignment/alignments have the lowest E-value/highest bitscore?
There were two alignments to that sequence in sample CSM7KOMH. The first one, matching UniRef90_A0A081U0T2
, has a lower E-value (2.973E-11) than the other match, UniRef90_A0A2M9V2V5
(E-value of 4.088E-11), but the bitscore was the same for both of them (64). If we want to see only the alignments for a single sequence, we can use grep
, like so:
grep "CB1APANXX170426:1:1101:2664:65700/2" mmseqs_m8_files/mmseqs-CSM7KOMH-s1.m8
grep
will print out all lines of the file where the string that we gave it are found.
Question 2: What is the RPKM contributed to the sample CSM79HR8 for the EC:6.3.5.2 contributed by Bacteroides intestinalis?
You should be looking at the second column of the row named "EC:6.3.5.2|Bacteroides (taxid 816)", so this is 425.6836791354688.
Question 3: What is the name of the enzyme with the EC number EC:6.3.5.2?
EC:6.3.5.2: GMP synthase (glutamine-hydrolyzing).
Question 4: How many total pathways were identified?
We can use zcat
like we did previously to help us count this, although there aren't too many lines in this file so you would be able to just count them:
zcat pathways_out/path_abun_unstrat.tsv.gz | wc -l
This gives us 23, but remember that the header is also included in this count, so the answer is 22.
Question 5: how many taxa contribute to the pathway PANTO-PWY
?
You can just count these manually, showing us that there are 9, but we can also use the grep
command again to check this:
grep -c "PANTO-PWY" pathways_out/path_abun_strat.tsv
or grep "PANTO-PWY" pathways_out/path_abun_strat.tsv
In the first command, the -c
flag will count the number of times that this string is found, while leaving out the -c
will just print the lines where the string is found.
Question 6: Which samples is the Streptococcus genus found in?
The Streptococcus genus is only found in the CD samples.
Question 7: Are there any pathways that are only found in one sample and not the other?
This is a bit difficult to tell with so many lines (and would be even more difficult in a sample that hadn't had the reads down-sampled). If we want to, we can go back to the files and double check anything that we think we see there, either using the less
command, or we can also use the grep
command again:
grep "PWY-6385" pathways_out/path_abun_strat.tsv
grep - "CSM79HR8" pathways-stratified-SankeyFormat.txt
- Please feel free to post a question on the Microbiome Helper google group if you have any issues.
- General comments or inquires about Microbiome Helper can be sent to [email protected].