-
Notifications
You must be signed in to change notification settings - Fork 204
CBW 2024 Advanced Module 1: Introduction to metagenomics and read‐based profiling
Bioinformatic Tool Citations
- Parallel
- FastQC
- Kneaddata
- Bowtie2
- Kraken2
- Bracken
- Kraken-biom
- MetaPhlAn 3.1
Copy the files to your working directory.
cp -r ~/CourseData/MIC_data/AMB_data/raw_data/ .
Sometimes in bioinformatics, the number of tasks you have to complete can get VERY large. Fortunately, there are several tools that can help us with this. One such tool is GNU Parallel. This tool can simplify the way in which we approach large tasks, and as the name suggests, it can iterate though many tasks in parallel, i.e. concurrently. We can use a simple command to demonstrate this.
parallel 'echo {}' ::: a b c
With the command above, the program contained within the quotation marks ' '
is echo
. This program is run 3 times, as there are 3 inputs listed after the :::
characters. What happens if there are multiple lists of inputs? Try the following:
parallel 'echo {}' ::: a b c ::: 1 2 3
Here, we have demonstrated how parallel
treats multiple inputs. It uses all combinations of one of each from a b c
and 1 2 3
. But, what if we wanted to use 2 inputs that were sorted in a specific order? This is where the --link
flag becomes particularly useful. Try the following:
parallel --link 'echo {}' ::: a b c ::: 1 2 3
In this case, the inputs are "linked", such that only one of each is used. If the lists are different lengths, parallel
will go back to the beginning of the shortest list and continue to use it until the longest list is completed. You do not have to run the following command, as the output is provided to demonstrate this.
$ parallel --link 'echo {}' ::: light dark ::: red blue green
> light red
> dark blue
> light green
Notice how light
appears a second time (on the third line of the output) to satisfy the length of the second list.
Another useful feature is specifying which inputs we give parallel are to go where. This can be done intuitively by using multiple brackets { }
containing numbers corresponding to the list we are interested in. Again, you do not have to run the following command, as the output is provided to demonstrate this.
$ parallel --link 'echo {1} {3}; echo {2} {3}' ::: one red ::: two blue ::: fish
> one fish
> two fish
> red fish
> blue fish
Finally, a handy feature is that parallel
accepts files as inputs. This is done slightly differently than before, as we need to use four colon characters ::::
instead of three. Parallel will then read each line of the file and treat its contents as a list. You can also mix this with the three-colon character lists :::
you are already familiar with. Using the following code, create a test file and use parallel
to run the echo
program:
echo -e "A\nB\nC" > test.txt
parallel --link 'echo {2} {1}' :::: test.txt ::: 1 2 3
And with that, you're ready to use parallel for all of your bioinformatic needs! We will continue to use it throughout this tutorial and show some additional features along the way. There is also a cheat-sheet here for quick reference.
First, make your desired output directory (if it doesn't already exist). Then, run FastQC as follows:
fastqc -t 4 raw_data/*fastq.gz -o fastqc_out
Go to http://##.uhn-hpc.ca/
(substituting ## for your student number) and navigate to your FastQC output directory. Click on the .html files to view the results for each sample. Let's look at the Per base sequence quality tab. This shows a boxplot representing the quality of the base call for each position of the read. In general, due to the inherent degradation of quality with increased sequence length, the quality scores will trend downward as the read gets longer. However, you may notice that for our samples this is not the case! This is because for the purpose of this tutorial, your raw data has already been trimmed.
More often, per base sequence quality will look like the following. The FastQC documentation provides examples of "good" and "bad" data. These examples are also shown below:
Which of the graphs does your data resemble more closely? What can we do if data fails the Per Base Sequence Quality module?
Now, you may have also noticed that most of the samples fail the "Per Base Sequence Content" module of FastQC. Let's look at our visualization:
This specific module plots out the proportion of each base position in a file, and raises a warning/error if there are large discrepancies in base proportions. In a given sequence, the lines for each base should run in parallel, indicating that the base calling represents proper nucleotide pairing. Additionally, the A and T lines may appear separate from the G and C lines, which is a consequence of the GC content of the sample. The relative amount of each base reflects the overall amount of the bases in the genome, and should not be imbalanced. When this is not the case, the sequence composition is biased. A common cause of this is the use of primers, which throws off our sequence content at the beginning of the read. Fortunately, although the module error stems from this bias, according to the FastQC documentation for this module it will not largely impact our downstream analysis.
The other modules are explained in depth in the FastQC Module Help Documentation
KneadData is a tool which "wraps" several programs to create a pipeline that can be executed with one command. Remember that these reads have already been trimmed for you - this is in fact one of KneadData's functionalities. For this tutorial though, we will use KneadData to filter our reads for contaminant sequences against a human database.
Kneaddata outputs many files for each sample we provide it. These include:
- paired sequences which match our database;
- singletons which match our database;
- paired sequences that do not match our database;
- singletons that do not match our database;
- and some others.
First, we want to activate our conda
environment where KneadData and our other tools for this tutorial are installed:
conda activate taxonomic
Then, run KneadData, using parallel, with the following command:
parallel -j 1 --eta --link 'kneaddata -i1 {1} -i2 {2} -o kneaddata_out -db ~/CourseData/MIC_data/tools/GRCh38_PhiX --bypass-trim --remove-intermediate-output' ::: raw_data/*R1_subsampled.fastq.gz ::: raw_data/*R2_subsampled.fastq.gz
While kneaddata
is running, consider the following:
The GRCh38_PhiX database is made up of the human genome. Of the four output files specified above, which should we choose for analyzing the microbiome?
You can check out all of the files that kneaddata
has produced by listing the contents of the output directory (there is a lot!). Take note of how the files are differentiated from one another, and try to identify some of the files we are interested in. Once kneaddata
is complete, we want to stitch our reads together into a single file. This is accomplished with a Perl script from our very own Microbiome Helper. For your convenience, it is already on your student instance.
with the Perl script, concatenate the reads into a single file:
perl ~/CourseData/MIC_data/AMB_data/scripts/concat_paired_end.pl -p 4 --no_R_match -o cat_reads kneaddata_out/*_paired_contam*.fastq
The script finds paired reads that match a given regex and outputs the combined files.
- We first specify that our program is to be run with Perl, and then provide the path to the program.
- The
-p
flag specifies how many processes to run in parallel. The default is to do one process at a time, so using-p 4
speeds things up. - The
--no_R_match
option tells the script that our read pairs are differentiated by*_1.fastq
instead of*_R1.fastq
. - The
-o
flag specifies the directory where we want the concatenated files to go. - Our regex matches the paired reads that do not align to the human database from the KneadData output. This is because the reads that aren't "contaminants" actually align to the human genome, so what we are left with could contain microbial reads.
- Consider that our files of interest are named something like
MSMB4LXW_R1_subsampled_kneaddata_GRCh38_PhiX_bowtie2_paired_contam_1.fastq
. If we want to match all of our paired contaminant files with a regex, we can specify the string unique to those filenames_paired_contam
, and use wildcards*
to fill the parts of the filename that will change between samples.
- Consider that our files of interest are named something like
If the above does not work, you may need to install Perl:
conda install conda-forge::perl
If it still does not work or you already have Perl installed, you may get an error saying you require Parallel::ForkManager. Fix by executing the following inside your conda environment:
conda install bioconda::perl-parallel-forkmanager
First, we should see how many reads are in our concatenated samples. Since .fastq
files have 4 lines per read, we can divide the number of lines in the file by 4 to count the reads. Use the following command to check number of lines in the output files:
wc -l cat_reads/*
Woah! There's almost nothing left in most of these files! One even has zero reads! From this, we can infer that KneadData found the majority of our reads aligned to the human genome, leaving us with very few sequences to look for microbial reads. This is not entirely uncommon, however the fact that our input reads are actually subsets of much larger samples exacerbated this effect.
For the purpose of this tutorial, we will instead continue with the raw data instead of our filtered data. We are only doing this to demonstrate the tools for the purposes of this tutorial. This is NOT standard practice. In practice, you SHOULD NOT use unfiltered metagenomic data for taxonomic annotation.
To continue, we will concatenate the raw data, then unzip it (the ;
lets you enter multiple command lines that will execute in series when you press enter).
perl ~/CourseData/MIC_data/AMB_data/scripts/concat_paired_end.pl -p 4 -o cat_reads_full raw_data/*.fastq.gz;
gunzip cat_reads_full/*.gz
Now let's check how many reads we are dealing with:
wc -l cat_reads_full/*
How many reads are in each sample?
Now that we have our reads of interest, we want to understand what these reads are. To accomplish this, we use tools which annotate the reads based on different methods and databases. There are many tools which are capable of this, with varying degrees of speed and precision. For this tutorial, we will be using Kraken2 for fast exact k-mer matching against a database.
Our lab has also investigated which parameters impact tool performance in this Microbial Genomics paper. One of the most important factors is the contents of the database, which should include as many taxa as possible to avoid the reads being assigned an incorrect taxonomic label. Generally, the bigger and more inclusive database, the better. However, due to the constraints of our cloud instance, we will be using a "Standard 8GB" index provided by the Kraken2 developers. For your convenience, the database is already available on your instance.
First, you must create the appropriate output directories, or Kraken2 will not write any files. Using parallel
, we will then run Kraken2 for our concatenated reads as shown below:
parallel -j 1 --eta 'kraken2 --db ~/CourseData/MIC_data/tools/k2_standard_08gb --output kraken2_outraw/{/.}.kraken --report kraken2_kreport/{/.}.kreport --confidence 0 {}' ::: cat_reads_full/*.fastq
This process can take some time. While this runs, let's learn about what our command is doing!
- We first specify our options for parallel, where:
- the
-j 1
option specifies that we want to run two jobs concurrently; - the
--eta
option will count down the jobs are they are completed; - after the program contained in quotation marks, we specify our input files with
:::
, and use a regex to match all of the concatenated, unzipped.fastq
files.
- the
- We then describe how we want
kraken
to run:- by first specifying the location of the database with the
--db
option; - then specifying the output directory for the raw kraken annotated reads;
- notice that we use a special form of the brackets here,
{/.}
, this is a special function ofparallel
that will remove both the file path and extension when substituting the input into ourkraken
command. This is useful when files are going into different directories, and when we want to change the extension.
- notice that we use a special form of the brackets here,
- similarly, we also specify the output of our "report" files with the
--report
option; - the
-confidence
option allows us to filter annotations below a certain threshold (a float between 0 and 1) into the unclassified node. We are using0
because our samples are already subset, however this should generally be higher. See our paper for more information. - and finally, we use the empty brackets
{}
forparallel
to tellkraken
what our desired input file is.
- by first specifying the location of the database with the
With Kraken2, we have annotated the reads in our sample with taxonomy information. If we want to use this to investigate diversity metrics, we need to find the abundances of taxa in our samples. This is done with Kraken2's companion tool, Bracken (Bayesian Reestimation of Abundance with KrakEN).
Let's run Bracken on our Kraken2 outputs!
parallel -j 2 --eta 'bracken -d ~/CourseData/MIC_data/tools/kraken2_standard_08gb -i {} -o bracken_out{/.}.species.bracken -r 100 -l S -t 1' ::: kraken2_kreport/*.kreport
Some notes about this command:
-
-d
specifies the database we want to use. It should be the same database we used when we ran Kraken2; -
-i
is our input file(s); -
-o
is where and what we want the output files to be; -
-r
is the read length to get all classifications for, the default is 100; -
-l
is the taxonomic level at which we want to estimate abundances; -
-t
is the number of reads required prior to abundance estimation to perform re-estimation
Another tool that is commonly used for taxonomic annotation of metagenomic sequences is MetaPhlAn. This tool is different from Kraken2 in that it uses a database of marker genes, instead of a collection of genomes. We will use MetaPhlAn 3 for this tutorial, but a newer version (4) utilizing a larger database is available. First, let's make an output folder. Then, we can run the following:
parallel -j 2 --eta 'metaphlan --input_type fastq --bowtie2db ~/CourseData/MIC_data/tools/CHOCOPhlAn_201901 --no_map -o metaphlan_out/{/.}.mpa {}' ::: cat_reads/*.fastq
Similar to Kraken2, MetaPhlAn will output individual files for each sample. We can use a utility script from MetaPhlAn to merge our outputs into one table.
python ~/CourseData/MIC_data/AMB_data/scripts/merge_metaphlan_tables.py metaphlan_out/*.mpa > metaphlan_out/mpa_merged_table.txt
If we view this new combined table, we will see three key things:
- First, the output format is different to that of Kraken2, where the full taxonomic lineages are expanded line by line.
- Second, MetaPhlAn only outputs relative abundances for each taxonomic node, whereas Kraken2 (before re-analysis with Bracken) will output absolute numbers of reads assigned to each taxonomic node.
- Third, the number of taxa that MetaPhlAn finds is much smaller than Kraken2. This is partially due to us using a low confidence threshold with Kraken2, but this discrepancy between the two tools tends to hold true at higher confidence thresholds as well. See our paper for more info about how these tools perform compared to each other.
After all of this, we are almost ready to create some profiles from our samples! The last step is to put everything into a format that R can easily handle. One standard format is the .biom
format, which is a recognized standard for the Earth Microbiome Project and is supported by the Genomics Standards Consortium.
To transform our data, we will use kraken-biom, a tool which takes the .kreport
files output by bracken
and creates a new, combined file in the .biom
format. This tool can also incorporate our sample metadata, and it is easier to merge this information now versus later.
First, we will have to copy the metadata file to our working directory:
cp ~/CourseData/MIC_data/AMB_data/amb_module1/mgs_metadata.tsv .
Let's have a look at this file, try reading it with cat
Now that we have our metadata, let's run kraken-biom
and merge our samples to one organized output file:
kraken-biom kraken2_kreport/*bracken_species.kreport -m mgs_metadata.tsv -o mgs.biom --fmt json
- The inputs are a positional argument, and
kraken-biom
will accept lists. We can match all of our desired bracken-corrected.kreport
files with the regexkraken2_kreport/*bracken_species.kreport
- the
-m
option is for specifying our metadata file. Note thatkraken-biom
is picky about the order of the metadata, so the entries in the list should be in the same order that your files are found. Typically this is in alphanumeric/dictionary order. - the
-o
option specifies our output file. - the
--fmt
option specifies that we want the output to be injson
format, which is not the default behavior of the program. JSON is a text-based version of the BIOM format, as opposed to HDF5 which is a binary file format.
Now, with all of that, we should have our final mgs.biom
file ready to go! You can check out what this file looks like (it's a single line so head
will not work here). It can be cumbersome to look at, but there are patterns. Fortunately, R is much better at reading these files than we are!
knitr::opts_chunk$set(echo = TRUE)
library(phyloseq)
library(vegan)
library(ggplot2)
library(taxonomizr)
#setwd("~/workspace/")
#data<-import_biom("~/CourseData/MIC_data/AMB_data/amb_module1/mgs.biom", header = TRUE)
data<-import_biom("ben/mgs_correction.biom", header = TRUE)
We can see what the phyloseq object looks like by using:
View(data)
With this, we can see that it is a special object that should contain otu_table
, tax_table
, and sam_data
objects within it. What do each of these objects within data
represent?
Next, we will do some housekeeping with the phyloseq object we have created with the import-biom()
function. If we look at the tax_table
with the following:
View(data@tax_table)
We get something like this:
Rank1 | Rank2 | Rank3 | Rank4 | Rank5 | Rank6 | Rank7 | |
---|---|---|---|---|---|---|---|
820 | k__Bacteria | p__Bacteroidota | c__Bacteroidia | o__Bacteroidales | f__Bacteroidaceae | g__Bacteroides | s__uniformis |
2755405 | k__Bacteria | p__Bacteroidota | c__Bacteroidia | o__Bacteroidales | f__Bacteroidaceae | g__Bacteroides | s__sp. CACC 737 |
2528203 | k__Bacteria | p__Bacteroidota | c__Bacteroidia | o__Bacteroidales | f__Bacteroidaceae | g__Bacteroides | s__sp. A1C1 |
... | ... | ... | ... | ... | ... | ... | ... |
We can see that the table contains all of our information, but the labels need some work. So, we can modify the contents of the tax_table
as follows:
#Rename the taxa names in the taxa table.
data@[email protected] <- substring(data@[email protected], 4)
#Rename columns of taxa table.
colnames(data@[email protected])<- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
bacteria<-subset_taxa(data, Kingdom == "Bacteria")
So, what's this doing to our data?
- The first command strips all the taxonomic names of their first 3 characters, which removes those pesky leading
k__
strings. We are directly modifying the vectors within the data frame with thesubstr()
command. The parameters of this command tell R we only want the strings starting at the 4th character, which skips the letter and two underscores (i.e. `k__). - The second command creates a character vector of the taxonomic levels to replace the "Ranks" in the table, and replaces those column names.
- The third command subsets all of the taxa in our data to only bacteria, by selecting only taxa with
"Bacteria"
as theirKingdom
column entry in thetax_table
.
Now if you run View(data@tax_table)
, the resulting output should look much cleaner!
And, a quick way to check if this worked is to execute the following command in your console. This looks for unique values in the "Kingdom" row of the taxonomy table, after coercing it to a data frame:
unique(as.data.frame(bacteria@[email protected])$Kingdom)
Which should only return 1 unique string:
[1] "Bacteria"
An important step in analyzing sequencing data is rarefaction. Rarefaction involves randomly subsetting samples so that all samples have even sequencing depth
A quick note on rarefying and rarefaction: (click to expand)
It is worth knowing that the practice of rarefaction has been called into question in the past. However, the current thinking is that "rarefaction is the most robust approach to control for uneven sequencing effort when considered across a variety of alpha and beta diversity metrics."
Generally, it is a good idea to start by manually removing rare taxa. It is common to remove taxa that have less than 20 reads across all samples in our dataset. To do this, we will use the prune_taxa()
command from phyloseq.
If we view the otu_table
of Bacteria
, we will see as we scroll through that there are many taxa that appear sparsely across the different samples.
View(bacteria@otu_table)
We are not particularly interested in these rare taxa, so a quick way to deal with them is to "prune" taxa from our samples that have less than "n" reads across all samples. We can first look at the taxa sums in the form of a histogram.
#View the histogram of taxa sums in our "Bacteria" dataset.
hist(taxa_sums(bacteria), breaks = 2000, xlim = c(0,1000), main = "Taxa Sums before Pruning")
With this, we see that most frequently, the sum of all reads for a given taxa is in the 0-20 bin. This means that there are lots of taxa with low abundances (less than 20 reads total) in our dataset. So, we can prune these rare taxa with a built-in phyloseq command called prune_taxa
:
#Prune rare taxa from the dataset. This removes taxa that have less than 20 occurances across all samples.
bacteria<-prune_taxa(taxa_sums(bacteria)>=20,bacteria)
Some notes about this command:
- The first argument we give to
prune_taxa()
is the condition we want met for the taxa to 'pass' this filtering. - We are looking for the sum of the taxa (found by
taxa_sums
) in our phyloseq objectbacteria
to be greater than 20. - The second argument is the phyloseq object we are pruning, which will still be
bacteria
. - The result of this command is overwrites our previous
bacteria
R object.
With that, we can re-visit our histogram and see what this pruning has done:
#View the histogram of taxa sums after we remove the rare taxa.
hist(taxa_sums(bacteria), breaks = 2000, xlim = c(0,1000), main = "Taxa Sums after Pruning")
From the scale of the y-axes, we can see that this has pruned many of the rare taxa. This can be verified by viewing the otu_table
with View(bacteria@otu_table)
.
To rarefy our dataset, we must first visualize the rarefaction curve of our samples using the vegan package. To do this, we need to create a dataframe that vegan can work with from our Phyloseq object bacteria
.
#Visualize the rarefaction curve of our data.
rarecurve(as.data.frame(t(otu_table(bacteria))), step=50,cex=0.5,label=TRUE, ylim = c(1,150))
Some notes about this command:
- We apply a number of transformations to
bacteria
beforerarecurve
works on it:- the otu_table of
bacteria
is returned by theotu_table(bacteria)
command; - the otu_table is then tranposed by
t()
such that the rows and columns are switched, because this is the formatrarecurve
expects; - this transposed otu_table is then coerced to a dataframe by
as.data.frame()
so thatrarecurve
can read it.
- the otu_table of
- The remainder of the
rarecurve
parameters control how the output is displayed.
Looking at the rarefaction curve, we can see that the number of species for all of the samples eventually begins to plateau, which is a good sign! This tells us that we have reached a sequencing depth where more reads does not improve the number of taxa we find in our sample. However, with the labels, it can be difficult to see exactly what these sample sizes are, so the following code will print it out for us:
print(c("Minimum sample size:",min(sample_sums(bacteria)), "Maximum sample size:", max(sample_sums(bacteria))))
So, we know that at a minimum we must rarefy to the smaller number. However, considering some samples plateau much before this number, we should choose a smaller cutoff. There is not a strict method to choosing a cutoff, but we should remember we want to include abundant taxa and exclude rare taxa. With this in mind, it is acceptable to rarefy our samples to a size of 10000
reads. Use the following lines of code to achieve this:
#Set seed for reproducibility. Rarefy to a sample size equal to or smaller than the minimum sample sum.
set.seed(711)
rarefied<-rarefy_even_depth(bacteria, rngseed = FALSE, sample.size = 10000, trimOTUs = TRUE)
#See that the samples are all the same size now.
rarecurve(as.data.frame(t(otu_table(rarefied))), step=50,cex=0.5,label=TRUE, ylim = c(1,150))
Some notes about these commands:
- We first set the seed to some number, in this case
711
. This is for reproducibility, since otherwise,rarefy_even_depth()
would use a random seed by default. This means that if you ran the same code twice without setting the seed, you would get two different rarefied subsets. - The
rarefy_even_depth()
command needs to know to not use a random seed (rngseed = FALSE
), that our cutoff is 10000 (sample.size = 10000
), and that we are rarefying thebacteria_pruned
phyloseq object. - The
trimOTUs = TRUE
parameter ofrarefy_even_depth()
means that if a taxa is subsampled to an abundance of 0 across all samples, that taxa is removed from the OTU table. Having taxa with 0 reads can mess things up later in the analysis.
Why do we prune rare taxa before rarefying?
Excellent! Now that our data is imported, formatted, and rarefied, we can finally look at some diversity metrics!
Alpha diversity is a metric that evaluates the different types of taxa in a given sample. Different alpha diversity methods typically use calculations based on different components of the sample, which are:
- Richness: the number of taxa in a sample.
- Evenness: the distribution of abundances of the taxa in a sample (i.e. similarities/differences in read quantity per taxa).
Some methods use only one of these components, some will use a combination of both, and others use neither. There are many indices to choose from, but some common ones are:
- Observed taxa: The number of different taxa (richness)
- Chao1 index: another measure of richness, more sensitive to rare taxa
- Shannon index: a combination of richness and evenness, with more weight to the richness component.
- Simpson index: a combination of richness and evenness, with more weight to the evenness component.
- Fisher's diversity index / Fisher's alpha was one of the first methods to calculate the relationship between the number of different taxa and the abundance of individual taxa
Try adding or changing the measures to see how they compare to one another. Also, try changing value of "x" to different (categorical) metadata variables. How can you use the View()
command to see what metadata you can choose from?
#Plot alpha diversity with several metrics.
plot_richness(physeq = rarefied, x="diagnosis", color = "diagnosis", measures = c("Observed", "Shannon", "Fisher")) + geom_boxplot() +theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Your plots will look something like the following:
We can see that the groups are not identical, and that the different indices yield different plots. As well, some indices are similar to each other (like observed taxa and Fisher's alpha). Since we see differences in both the Shannon and Simpson plots, we can say that there are differences in both richness and evenness between our CD and non-IDB sample groups.
Beta diversity is another useful metric we use to compare microbiome compositions. Specifically, beta diversity is used to evaluate the differences between entire samples in a dataset. You can imagine that each sample is compared to one another, and the "distance" between these samples is calculated, such that we get a symmetric matrix of pairwise comparisons.
Sample A | Sample B | Sample C | |
---|---|---|---|
Sample A | 1 | A vs B | A vs C |
Sample B | B vs A | 1 | B vs C |
Sample C | C vs A | C vs B | 1 |
To capture beta diversity, we use some metric to calculate these pairwise distances between samples. There are several to choose from, but the most common include:
- Bray-Curtis dissimilarity: takes into account presence/absence and abundance of taxa
- Jaccard distance: takes into account only presence/absence of taxa
- Weighted UniFrac: accounts for phylogenetic distances between present and absent taxa, weighted by taxa abundance
- Unweighted UniFrac: accounts for phylogenetic distances between present and absent taxa
Once we decide on the metric(s) to use, we then have to consider how to ordinate our data. Ordination is the method by which we take our data points in multidimensional space and project them them to lower dimensional, i.e. 2D space. This allows us to mo re intuitively
Common ordination methods include:
- Principal Coordinate Analysis (PCoA) uses eigenvalue decomposition of the distance matrix for normally distributed data
- Non-metric Multidimensional Scaling (NMDS) uses iterative rank-order calculations of the distance matrix for non-normally distributed data
Any combinations of the above metrics and methods can be used, depending on the type of data to be analyzed. For our dataset, we will be using the Bray-Curtis dissimilarity and PCoA method. Fortunately, all of these can easily be implemented in R.
First, we will create a relative abundance table from our OTU table. This can be done by transforming our sample counts to a percentage (float) between 0 and 1 using the transform_sample_counts()
function from phyloseq:
percentages <- transform_sample_counts(rarefied, function(x) x*100 / sum(x))
Now, we have a new phyloseq object percentages
in which the abundances are relative. With this object we can create an ordination by selecting our method
and distance
metric. Fortunately, the ordinate
function from phyloseq can do this transformation in one step:
ordination<-ordinate(physeq = percentages, method = "PCoA", distance = "bray")
We are ready to plot our ordination! In this step, we have to specify what data we are using and what our ordination object is. Additionally, we can select our metadata group of interest with the color
parameter.
plot_ordination(physeq = percentages, ordination = ordination, color="diagnosis") +
geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) + ggtitle("Beta Diversity", subtitle = "Bray-Curtis dissimilarity")
Your plot should look something like the following:
Try substituting the color
parameter with some different categories from our metadata. What does this do to the plot?
As well, you can change the parameters of the ordination, then run the plot_ordination()
function again. How does this change the plot?
#convert to percentages
#percentages <- transform_sample_counts(rarefied, function(x) x*100 / sum(x))
#agglomerate taxa up to Family level. Then, convert to df with psmelt.
percentages_glom <- tax_glom(percentages, taxrank = 'Family')
percentages_df <- psmelt(percentages_glom)
percentages_df$Family[percentages_df$Abundance < 1] <- "Family < 1.0 abundance"
percentages_df$Family <- as.factor(percentages_df$Family)
#select colours
phylum_colors_rel<- colorRampPalette(brewer.pal(11,"Spectral"))(length(levels(percentages_df$Family)))
#Plot relative abundance
relative_plot <- ggplot(data=percentages_df, aes(x=ID, y=Abundance, fill=Family))+
ggtitle("Relative Abundance", subtitle = "Family Level")+
xlab("Sample ID")+
ylab("Abundance (%)")+
geom_bar(aes(), stat="identity", position="stack")+
scale_fill_manual(values = phylum_colors_rel)+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
relative_plot
#prepareDatabase(sqlFile = 'accessionTaxa.sql', tmpDir = "ben/", getAccessions = FALSE)
prepareDatabase(sqlFile = 'accessionTaxa.sql', tmpDir = "/scratch/benfish/", getAccessions = FALSE)
#Agglomerate taxa of the same type to the genus level.
hm <- tax_glom(rarefied, taxrank = "Genus")
#Reduce our dataset to the top 20 taxa.
hm<-prune_taxa(names(sort(taxa_sums(hm),TRUE)[1:20]), hm)
#Get the genus name corresponding to the TaxID in the phyloseq object
genera<-getTaxonomy(ids = taxa_names(hm), sqlFile = "accessionTaxa.sql", desiredTaxa = c("genus"))
#To create the heatmap, get the otu table into dataframe format.
hm<-as.matrix(otu_table(hm))
#Change the taxonomy labels from TaxIDs to scientific names.
for(i in 1:length(rownames(hm))){
rownames(hm)[i]<-genera[i]
}
#change column names to sample IDs
colnames(hm)<-rarefied@sam_data$ID
#format hm to logarithmic scale.
hm<-log2(hm)
#remove "-Inf" values and replace with 0.
hm[hm=="-Inf"]<-0
#define colour palette for heatmap.
col <- colorRampPalette(brewer.pal(9, "RdPu"))(256)
#define colours for heatmap "annotation," which will indicate some metadata category. Each value must have a corresponding character and colour in the character vector.
colha <- list(Diagnosis = c("CD" = "#E41A1C", "nonIBD" = "#377EB8"))
#generate the annotation bar for the heatmap.
ha <- HeatmapAnnotation(Diagnosis = rarefied@sam_data$diagnosis, col = colha, show_legend = TRUE)
#ha<-HeatmapAnnotation(foo = anno_block(gp = gpar(fill = 1:10)),
#column_km = 2)
#sort the dendrogram that accompanies the heatmap.
col_dend = dendsort(hclust(dist(t(hm))))
Heatmap(hm, cluster_columns = col_dend, show_row_dend = FALSE, col = col,
row_names_gp = gpar(fontsize = 25),
top_annotation = ha,
row_dend_width = unit(30,"mm"),
heatmap_legend_param = list(title = "Log(2)\nAbundance", at = c(0,5,10,15), labels = c("0","5","10","15")))
- Please feel free to post a question on the Microbiome Helper google group if you have any issues.
- General comments or inquires about Microbiome Helper can be sent to [email protected].