CBW 2024 Advanced Module 1: Introduction to metagenomics and read‐based profiling

A Crash Course in GNU Parallel
Quality Control
- Visualization with FastQC
- Filtering with Kneaddata
Generating Taxonomic Profiles

Setup, Importing and Formatting Data
Remove Rare Taxa and Rarefy
- Pruning
- Rarefy
Alpha Diversity
Beta Diversity

Bioinformatic Tool Citations

Parallel
FastQC
Kneaddata
Bowtie2
Kraken2
Bracken
Kraken-biom
MetaPhlAn 3.1

Working on the Command Line

Copy the files to your working directory.

cp -r ~/CourseData/MIC_data/AMB_data/raw_data/ .

A Crash Course in GNU Parallel

Sometimes in bioinformatics, the number of tasks you have to complete can get VERY large. Fortunately, there are several tools that can help us with this. One such tool is GNU Parallel. This tool can simplify the way in which we approach large tasks, and as the name suggests, it can iterate though many tasks in parallel, i.e. concurrently. We can use a simple command to demonstrate this.

parallel 'echo {}' ::: a b c

With the command above, the program contained within the quotation marks ' ' is echo. This program is run 3 times, as there are 3 inputs listed after the ::: characters. What happens if there are multiple lists of inputs? Try the following:

parallel 'echo {}' ::: a b c ::: 1 2 3

Here, we have demonstrated how parallel treats multiple inputs. It uses all combinations of one of each from a b c and 1 2 3. But, what if we wanted to use 2 inputs that were sorted in a specific order? This is where the --link flag becomes particularly useful. Try the following:

parallel --link 'echo {}' ::: a b c ::: 1 2 3

In this case, the inputs are "linked", such that only one of each is used. If the lists are different lengths, parallel will go back to the beginning of the shortest list and continue to use it until the longest list is completed. You do not have to run the following command, as the output is provided to demonstrate this.

$ parallel --link 'echo {}' ::: light dark ::: red blue green
> light red
> dark blue
> light green

Notice how light appears a second time (on the third line of the output) to satisfy the length of the second list.

Another useful feature is specifying which inputs we give parallel are to go where. This can be done intuitively by using multiple brackets { } containing numbers corresponding to the list we are interested in. Again, you do not have to run the following command, as the output is provided to demonstrate this.

$ parallel --link 'echo {1} {3}; echo {2} {3}' ::: one red ::: two blue ::: fish
> one fish
> two fish
> red fish
> blue fish

Finally, a handy feature is that parallel accepts files as inputs. This is done slightly differently than before, as we need to use four colon characters :::: instead of three. Parallel will then read each line of the file and treat its contents as a list. You can also mix this with the three-colon character lists ::: you are already familiar with. Using the following code, create a test file and use parallel to run the echo program:

echo -e "A\nB\nC" > test.txt
parallel --link 'echo {2} {1}' :::: test.txt ::: 1 2 3

And with that, you're ready to use parallel for all of your bioinformatic needs! We will continue to use it throughout this tutorial and show some additional features along the way. There is also a cheat-sheet here for quick reference.

Quality Control

Visualization with FastQC

First, make your desired output directory (if it doesn't already exist). Then, run FastQC as follows:

fastqc -t 4 raw_data/*fastq.gz -o fastqc_out

Go to http://##.uhn-hpc.ca/ (substituting ## for your student number) and navigate to your FastQC output directory. Click on the .html files to view the results for each sample. Let's look at the Per base sequence quality tab. This shows a boxplot representing the quality of the base call for each position of the read. In general, due to the inherent degradation of quality with increased sequence length, the quality scores will trend downward as the read gets longer. However, you may notice that for our samples this is not the case! This is because for the purpose of this tutorial, your raw data has already been trimmed.

More often, per base sequence quality will look like the following. The FastQC documentation provides examples of "good" and "bad" data. These examples are also shown below:

Which of the graphs does your data resemble more closely? What can we do if data fails the Per Base Sequence Quality module?

Now, you may have also noticed that most of the samples fail the "Per Base Sequence Content" module of FastQC. Let's look at our visualization:

This specific module plots out the proportion of each base position in a file, and raises a warning/error if there are large discrepancies in base proportions. In a given sequence, the lines for each base should run in parallel, indicating that the base calling represents proper nucleotide pairing. Additionally, the A and T lines may appear separate from the G and C lines, which is a consequence of the GC content of the sample. The relative amount of each base reflects the overall amount of the bases in the genome, and should not be imbalanced. When this is not the case, the sequence composition is biased. A common cause of this is the use of primers, which throws off our sequence content at the beginning of the read. Fortunately, although the module error stems from this bias, according to the FastQC documentation for this module it will not largely impact our downstream analysis.

The other modules are explained in depth in the FastQC Module Help Documentation

Filtering with KneadData

KneadData is a tool which "wraps" several programs to create a pipeline that can be executed with one command. Remember that these reads have already been trimmed for you - this is in fact one of KneadData's functionalities. For this tutorial though, we will use KneadData to filter our reads for contaminant sequences against a human database.

Kneaddata outputs many files for each sample we provide it. These include:

paired sequences which match our database;
singletons which match our database;
paired sequences that do not match our database;
singletons that do not match our database;
and some others.

First, we want to activate our conda environment where KneadData and our other tools for this tutorial are installed:

conda activate taxonomic

Then, run KneadData, using parallel, with the following command:

parallel -j 1 --eta --link 'kneaddata -i1 {1} -i2 {2} -o kneaddata_out -db ~/CourseData/MIC_data/tools/GRCh38_PhiX --bypass-trim --remove-intermediate-output' ::: raw_data/*R1_subsampled.fastq.gz ::: raw_data/*R2_subsampled.fastq.gz

While kneaddata is running, consider the following: The GRCh38_PhiX database is made up of the human genome. Of the four output files specified above, which should we choose for analyzing the microbiome?

You can check out all of the files that kneaddata has produced by listing the contents of the output directory (there is a lot!). Take note of how the files are differentiated from one another, and try to identify some of the files we are interested in. Once kneaddata is complete, we want to stitch our reads together into a single file. This is accomplished with a Perl script from our very own Microbiome Helper. For your convenience, it is already on your student instance.

with the Perl script, concatenate the reads into a single file:

perl ~/CourseData/MIC_data/AMB_data/scripts/concat_paired_end.pl -p 4 --no_R_match -o cat_reads kneaddata_out/*_paired_contam*.fastq

The script finds paired reads that match a given regex and outputs the combined files.

We first specify that our program is to be run with Perl, and then provide the path to the program.
The -p flag specifies how many processes to run in parallel. The default is to do one process at a time, so using -p 4 speeds things up.
The --no_R_match option tells the script that our read pairs are differentiated by *_1.fastq instead of *_R1.fastq.
The -o flag specifies the directory where we want the concatenated files to go.
Our regex matches the paired reads that do not align to the human database from the KneadData output. This is because the reads that aren't "contaminants" actually align to the human genome, so what we are left with could contain microbial reads.
- Consider that our files of interest are named something like MSMB4LXW_R1_subsampled_kneaddata_GRCh38_PhiX_bowtie2_paired_contam_1.fastq. If we want to match all of our paired contaminant files with a regex, we can specify the string unique to those filenames _paired_contam, and use wildcards * to fill the parts of the filename that will change between samples.

If the above does not work, you may need to install Perl:

conda install conda-forge::perl

If it still does not work or you already have Perl installed, you may get an error saying you require Parallel::ForkManager. Fix by executing the following inside your conda environment:

conda install bioconda::perl-parallel-forkmanager

Generating Taxonomic Profiles

First, we should see how many reads are in our concatenated samples. Since .fastq files have 4 lines per read, we can divide the number of lines in the file by 4 to count the reads. Use the following command to check number of lines in the output files:

wc -l cat_reads/*

Woah! There's almost nothing left in most of these files! One even has zero reads! From this, we can infer that KneadData found the majority of our reads aligned to the human genome, leaving us with very few sequences to look for microbial reads. This is not entirely uncommon, however the fact that our input reads are actually subsets of much larger samples exacerbated this effect.

For the purpose of this tutorial, we will instead continue with the raw data instead of our filtered data. We are only doing this to demonstrate the tools for the purposes of this tutorial. This is NOT standard practice. In practice, you SHOULD NOT use unfiltered metagenomic data for taxonomic annotation.

To continue, we will concatenate the raw data, then unzip it (the ; lets you enter multiple command lines that will execute in series when you press enter).

perl ~/CourseData/MIC_data/AMB_data/scripts/concat_paired_end.pl -p 4 -o cat_reads_full raw_data/*.fastq.gz;
gunzip cat_reads_full/*.gz

Now let's check how many reads we are dealing with:

wc -l cat_reads_full/*

How many reads are in each sample?

Annotation with Kraken2/Bracken

Now that we have our reads of interest, we want to understand what these reads are. To accomplish this, we use tools which annotate the reads based on different methods and databases. There are many tools which are capable of this, with varying degrees of speed and precision. For this tutorial, we will be using Kraken2 for fast exact k-mer matching against a database.

Our lab has also investigated which parameters impact tool performance in this Microbial Genomics paper. One of the most important factors is the contents of the database, which should include as many taxa as possible to avoid the reads being assigned an incorrect taxonomic label. Generally, the bigger and more inclusive database, the better. However, due to the constraints of our cloud instance, we will be using a "Standard 8GB" index provided by the Kraken2 developers. For your convenience, the database is already available on your instance.

First, you must create the appropriate output directories, or Kraken2 will not write any files. Using parallel, we will then run Kraken2 for our concatenated reads as shown below:

parallel -j 1 --eta 'kraken2 --db ~/CourseData/MIC_data/tools/k2_standard_08gb --output kraken2_outraw/{/.}.kraken --report kraken2_kreport/{/.}.kreport --confidence 0 {}' ::: cat_reads_full/*.fastq

This process can take some time. While this runs, let's learn about what our command is doing!

We first specify our options for parallel, where:
- the -j 1 option specifies that we want to run two jobs concurrently;
- the --eta option will count down the jobs are they are completed;
- after the program contained in quotation marks, we specify our input files with :::, and use a regex to match all of the concatenated, unzipped .fastq files.
We then describe how we want kraken to run:
- by first specifying the location of the database with the --db option;
- then specifying the output directory for the raw kraken annotated reads;
  - notice that we use a special form of the brackets here, {/.}, this is a special function of parallel that will remove both the file path and extension when substituting the input into our kraken command. This is useful when files are going into different directories, and when we want to change the extension.
- similarly, we also specify the output of our "report" files with the --report option;
- the -confidence option allows us to filter annotations below a certain threshold (a float between 0 and 1) into the unclassified node. We are using 0 because our samples are already subset, however this should generally be higher. See our paper for more information.
- and finally, we use the empty brackets {} for parallel to tell kraken what our desired input file is.

With Kraken2, we have annotated the reads in our sample with taxonomy information. If we want to use this to investigate diversity metrics, we need to find the abundances of taxa in our samples. This is done with Kraken2's companion tool, Bracken (Bayesian Reestimation of Abundance with KrakEN).

Let's run Bracken on our Kraken2 outputs!

parallel -j 2 --eta 'bracken -d ~/CourseData/MIC_data/tools/kraken2_standard_08gb -i {} -o bracken_out{/.}.species.bracken -r 100 -l S -t 1' ::: kraken2_kreport/*.kreport

Some notes about this command:

-d specifies the database we want to use. It should be the same database we used when we ran Kraken2;
-i is our input file(s);
-o is where and what we want the output files to be;
-r is the read length to get all classifications for, the default is 100;
-l is the taxonomic level at which we want to estimate abundances;
-t is the number of reads required prior to abundance estimation to perform re-estimation

Annotation with MetaPhlAn

Another tool that is commonly used for taxonomic annotation of metagenomic sequences is MetaPhlAn. This tool is different from Kraken2 in that it uses a database of marker genes, instead of a collection of genomes. We will use MetaPhlAn 3 for this tutorial, but a newer version (4) utilizing a larger database is available. First, let's make an output folder. Then, we can run the following:

parallel -j 2 --eta 'metaphlan --input_type fastq --bowtie2db ~/CourseData/MIC_data/tools/CHOCOPhlAn_201901 --no_map -o metaphlan_out/{/.}.mpa {}' ::: cat_reads/*.fastq

Similar to Kraken2, MetaPhlAn will output individual files for each sample. We can use a utility script from MetaPhlAn to merge our outputs into one table.

python ~/CourseData/MIC_data/AMB_data/scripts/merge_metaphlan_tables.py metaphlan_out/*.mpa > metaphlan_out/mpa_merged_table.txt

If we view this new combined table, we will see three key things:

First, the output format is different to that of Kraken2, where the full taxonomic lineages are expanded line by line.
Second, MetaPhlAn only outputs relative abundances for each taxonomic node, whereas Kraken2 (before re-analysis with Bracken) will output absolute numbers of reads assigned to each taxonomic node.
Third, the number of taxa that MetaPhlAn finds is much smaller than Kraken2. This is partially due to us using a low confidence threshold with Kraken2, but this discrepancy between the two tools tends to hold true at higher confidence thresholds as well. See our paper for more info about how these tools perform compared to each other.

Preparing our files for Analysis

After all of this, we are almost ready to create some profiles from our samples! The last step is to put everything into a format that R can easily handle. One standard format is the .biom format, which is a recognized standard for the Earth Microbiome Project and is supported by the Genomics Standards Consortium.

To transform our data, we will use kraken-biom, a tool which takes the .kreport files output by bracken and creates a new, combined file in the .biom format. This tool can also incorporate our sample metadata, and it is easier to merge this information now versus later.

First, we will have to copy the metadata file to our working directory:

cp ~/CourseData/MIC_data/AMB_data/amb_module1/mgs_metadata.tsv .

Let's have a look at this file, try reading it with cat

Now that we have our metadata, let's run kraken-biom and merge our samples to one organized output file:

kraken-biom kraken2_kreport/*bracken_species.kreport -m mgs_metadata.tsv -o mgs.biom --fmt json

The inputs are a positional argument, and kraken-biom will accept lists. We can match all of our desired bracken-corrected .kreport files with the regex kraken2_kreport/*bracken_species.kreport
the -m option is for specifying our metadata file. Note that kraken-biom is picky about the order of the metadata, so the entries in the list should be in the same order that your files are found. Typically this is in alphanumeric/dictionary order.
the -o option specifies our output file.
the --fmt option specifies that we want the output to be in json format, which is not the default behavior of the program. JSON is a text-based version of the BIOM format, as opposed to HDF5 which is a binary file format.

Now, with all of that, we should have our final mgs.biom file ready to go! You can check out what this file looks like (it's a single line so head will not work here). It can be cumbersome to look at, but there are patterns. Fortunately, R is much better at reading these files than we are!

Working in RStudio

Setup, Importing and Formatting Data

knitr::opts_chunk$set(echo = TRUE)
library(phyloseq)
library(vegan)
library(ggplot2)
library(taxonomizr)

#setwd("~/workspace/")
#data<-import_biom("~/CourseData/MIC_data/AMB_data/amb_module1/mgs.biom", header = TRUE)
data<-import_biom("ben/mgs_correction.biom", header = TRUE)

We can see what the phyloseq object looks like by using:

View(data)

With this, we can see that it is a special object that should contain otu_table, tax_table, and sam_data objects within it. What do each of these objects within data represent?

Next, we will do some housekeeping with the phyloseq object we have created with the import-biom() function. If we look at the tax_table with the following:

View(data@tax_table)

We get something like this:

	Rank1	Rank2	Rank3	Rank4	Rank5	Rank6	Rank7
820	k__Bacteria	p__Bacteroidota	c__Bacteroidia	o__Bacteroidales	f__Bacteroidaceae	g__Bacteroides	s__uniformis
2755405	k__Bacteria	p__Bacteroidota	c__Bacteroidia	o__Bacteroidales	f__Bacteroidaceae	g__Bacteroides	s__sp. CACC 737
2528203	k__Bacteria	p__Bacteroidota	c__Bacteroidia	o__Bacteroidales	f__Bacteroidaceae	g__Bacteroides	s__sp. A1C1
...	...	...	...	...	...	...	...

We can see that the table contains all of our information, but the labels need some work. So, we can modify the contents of the tax_table as follows:

#Rename the taxa names in the taxa table.
data@[email protected] <- substring(data@[email protected], 4)
#Rename columns of taxa table.
colnames(data@[email protected])<- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
bacteria<-subset_taxa(data, Kingdom == "Bacteria")

So, what's this doing to our data?

The first command strips all the taxonomic names of their first 3 characters, which removes those pesky leading k__ strings. We are directly modifying the vectors within the data frame with the substr() command. The parameters of this command tell R we only want the strings starting at the 4th character, which skips the letter and two underscores (i.e. `k__).
The second command creates a character vector of the taxonomic levels to replace the "Ranks" in the table, and replaces those column names.
The third command subsets all of the taxa in our data to only bacteria, by selecting only taxa with "Bacteria" as their Kingdom column entry in the tax_table.

Now if you run View(data@tax_table), the resulting output should look much cleaner!

And, a quick way to check if this worked is to execute the following command in your console. This looks for unique values in the "Kingdom" row of the taxonomy table, after coercing it to a data frame:

unique(as.data.frame(bacteria@[email protected])$Kingdom)

Which should only return 1 unique string:

[1] "Bacteria"

Remove Rare Taxa and Rarefy

An important step in analyzing sequencing data is rarefaction. Rarefaction involves randomly subsetting samples so that all samples have even sequencing depth

A quick note on rarefying and rarefaction: (click to expand)

It is worth knowing that the practice of rarefaction has been called into question in the past. However, the current thinking is that "rarefaction is the most robust approach to control for uneven sequencing effort when considered across a variety of alpha and beta diversity metrics."

Generally, it is a good idea to start by manually removing rare taxa. It is common to remove taxa that have less than 20 reads across all samples in our dataset. To do this, we will use the prune_taxa() command from phyloseq.

Pruning

If we view the otu_table of Bacteria, we will see as we scroll through that there are many taxa that appear sparsely across the different samples.

View(bacteria@otu_table)

We are not particularly interested in these rare taxa, so a quick way to deal with them is to "prune" taxa from our samples that have less than "n" reads across all samples. We can first look at the taxa sums in the form of a histogram.

#View the histogram of taxa sums in our "Bacteria" dataset.
hist(taxa_sums(bacteria), breaks = 2000, xlim = c(0,1000), main = "Taxa Sums before Pruning")

With this, we see that most frequently, the sum of all reads for a given taxa is in the 0-20 bin. This means that there are lots of taxa with low abundances (less than 20 reads total) in our dataset. So, we can prune these rare taxa with a built-in phyloseq command called prune_taxa:

#Prune rare taxa from the dataset. This removes taxa that have less than 20 occurances across all samples.
bacteria<-prune_taxa(taxa_sums(bacteria)>=20,bacteria)

Some notes about this command:

The first argument we give to prune_taxa() is the condition we want met for the taxa to 'pass' this filtering.
We are looking for the sum of the taxa (found by taxa_sums) in our phyloseq object bacteria to be greater than 20.
The second argument is the phyloseq object we are pruning, which will still be bacteria.
The result of this command is overwrites our previous bacteria R object.

With that, we can re-visit our histogram and see what this pruning has done:

#View the histogram of taxa sums after we remove the rare taxa.
hist(taxa_sums(bacteria), breaks = 2000, xlim = c(0,1000), main = "Taxa Sums after Pruning")

From the scale of the y-axes, we can see that this has pruned many of the rare taxa. This can be verified by viewing the otu_table with View(bacteria@otu_table).

Rarefy

To rarefy our dataset, we must first visualize the rarefaction curve of our samples using the vegan package. To do this, we need to create a dataframe that vegan can work with from our Phyloseq object bacteria.

#Visualize the rarefaction curve of our data.
rarecurve(as.data.frame(t(otu_table(bacteria))), step=50,cex=0.5,label=TRUE, ylim = c(1,150))

Some notes about this command:

We apply a number of transformations to bacteria before rarecurve works on it:
- the otu_table of bacteria is returned by the otu_table(bacteria) command;
- the otu_table is then tranposed by t() such that the rows and columns are switched, because this is the format rarecurve expects;
- this transposed otu_table is then coerced to a dataframe by as.data.frame() so that rarecurve can read it.
The remainder of the rarecurve parameters control how the output is displayed.

Looking at the rarefaction curve, we can see that the number of species for all of the samples eventually begins to plateau, which is a good sign! This tells us that we have reached a sequencing depth where more reads does not improve the number of taxa we find in our sample. However, with the labels, it can be difficult to see exactly what these sample sizes are, so the following code will print it out for us:

print(c("Minimum sample size:",min(sample_sums(bacteria)), "Maximum sample size:", max(sample_sums(bacteria))))

So, we know that at a minimum we must rarefy to the smaller number. However, considering some samples plateau much before this number, we should choose a smaller cutoff. There is not a strict method to choosing a cutoff, but we should remember we want to include abundant taxa and exclude rare taxa. With this in mind, it is acceptable to rarefy our samples to a size of 10000 reads. Use the following lines of code to achieve this:

#Set seed for reproducibility. Rarefy to a sample size equal to or smaller than the minimum sample sum.
set.seed(711)
rarefied<-rarefy_even_depth(bacteria, rngseed = FALSE, sample.size = 10000, trimOTUs = TRUE)

#See that the samples are all the same size now.
rarecurve(as.data.frame(t(otu_table(rarefied))), step=50,cex=0.5,label=TRUE, ylim = c(1,150))

Some notes about these commands:

We first set the seed to some number, in this case 711. This is for reproducibility, since otherwise, rarefy_even_depth() would use a random seed by default. This means that if you ran the same code twice without setting the seed, you would get two different rarefied subsets.
The rarefy_even_depth() command needs to know to not use a random seed (rngseed = FALSE), that our cutoff is 10000 (sample.size = 10000), and that we are rarefying the bacteria_pruned phyloseq object.
The trimOTUs = TRUE parameter of rarefy_even_depth() means that if a taxa is subsampled to an abundance of 0 across all samples, that taxa is removed from the OTU table. Having taxa with 0 reads can mess things up later in the analysis.

Why do we prune rare taxa before rarefying?

Excellent! Now that our data is imported, formatted, and rarefied, we can finally look at some diversity metrics!

Alpha Diversity

Alpha diversity is a metric that evaluates the different types of taxa in a given sample. Different alpha diversity methods typically use calculations based on different components of the sample, which are:

Richness: the number of taxa in a sample.
Evenness: the distribution of abundances of the taxa in a sample (i.e. similarities/differences in read quantity per taxa).

Some methods use only one of these components, some will use a combination of both, and others use neither. There are many indices to choose from, but some common ones are:

Observed taxa: The number of different taxa (richness)
Chao1 index: another measure of richness, more sensitive to rare taxa
Shannon index: a combination of richness and evenness, with more weight to the richness component.
Simpson index: a combination of richness and evenness, with more weight to the evenness component.
Fisher's diversity index / Fisher's alpha was one of the first methods to calculate the relationship between the number of different taxa and the abundance of individual taxa

Try adding or changing the measures to see how they compare to one another. Also, try changing value of "x" to different (categorical) metadata variables. How can you use the View() command to see what metadata you can choose from?

#Plot alpha diversity with several metrics.
plot_richness(physeq = rarefied, x="diagnosis", color = "diagnosis", measures = c("Observed", "Shannon", "Fisher")) + geom_boxplot() +theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

Your plots will look something like the following:

We can see that the groups are not identical, and that the different indices yield different plots. As well, some indices are similar to each other (like observed taxa and Fisher's alpha). Since we see differences in both the Shannon and Simpson plots, we can say that there are differences in both richness and evenness between our CD and non-IDB sample groups.

Beta Diversity

Beta diversity is another useful metric we use to compare microbiome compositions. Specifically, beta diversity is used to evaluate the differences between entire samples in a dataset. You can imagine that each sample is compared to one another, and the "distance" between these samples is calculated, such that we get a symmetric matrix of pairwise comparisons.

	Sample A	Sample B	Sample C
Sample A	1	A vs B	A vs C
Sample B	B vs A	1	B vs C
Sample C	C vs A	C vs B	1

To capture beta diversity, we use some metric to calculate these pairwise distances between samples. There are several to choose from, but the most common include:

Bray-Curtis dissimilarity: takes into account presence/absence and abundance of taxa
Jaccard distance: takes into account only presence/absence of taxa
Weighted UniFrac: accounts for phylogenetic distances between present and absent taxa, weighted by taxa abundance
Unweighted UniFrac: accounts for phylogenetic distances between present and absent taxa

Once we decide on the metric(s) to use, we then have to consider how to ordinate our data. Ordination is the method by which we take our data points in multidimensional space and project them them to lower dimensional, i.e. 2D space. This allows us to mo re intuitively

Common ordination methods include:

Principal Coordinate Analysis (PCoA) uses eigenvalue decomposition of the distance matrix for normally distributed data
Non-metric Multidimensional Scaling (NMDS) uses iterative rank-order calculations of the distance matrix for non-normally distributed data

Any combinations of the above metrics and methods can be used, depending on the type of data to be analyzed. For our dataset, we will be using the Bray-Curtis dissimilarity and PCoA method. Fortunately, all of these can easily be implemented in R.

Preparing the Data

First, we will create a relative abundance table from our OTU table. This can be done by transforming our sample counts to a percentage (float) between 0 and 1 using the transform_sample_counts() function from phyloseq:

percentages <- transform_sample_counts(rarefied, function(x) x*100 / sum(x))

Now, we have a new phyloseq object percentages in which the abundances are relative. With this object we can create an ordination by selecting our method and distance metric. Fortunately, the ordinate function from phyloseq can do this transformation in one step:

ordination<-ordinate(physeq = percentages, method = "PCoA", distance = "bray")

Plot the Data

We are ready to plot our ordination! In this step, we have to specify what data we are using and what our ordination object is. Additionally, we can select our metadata group of interest with the color parameter.

plot_ordination(physeq = percentages, ordination = ordination, color="diagnosis") +
  geom_point(size=10, alpha=0.1) + geom_point(size=5) + stat_ellipse(type = "t", linetype = 2) + theme(text = element_text(size = 20)) +  ggtitle("Beta Diversity", subtitle = "Bray-Curtis dissimilarity")

Your plot should look something like the following:

Try substituting the color parameter with some different categories from our metadata. What does this do to the plot?

As well, you can change the parameters of the ordination, then run the plot_ordination() function again. How does this change the plot?


#convert to percentages
#percentages <- transform_sample_counts(rarefied, function(x) x*100 / sum(x))
#agglomerate taxa up to Family level. Then, convert to df with psmelt.
percentages_glom <- tax_glom(percentages, taxrank = 'Family')
percentages_df <- psmelt(percentages_glom)
percentages_df$Family[percentages_df$Abundance < 1] <- "Family < 1.0 abundance"
percentages_df$Family <- as.factor(percentages_df$Family)

#select colours
phylum_colors_rel<- colorRampPalette(brewer.pal(11,"Spectral"))(length(levels(percentages_df$Family)))

#Plot relative abundance 
relative_plot <- ggplot(data=percentages_df, aes(x=ID, y=Abundance, fill=Family))+
  ggtitle("Relative Abundance", subtitle = "Family Level")+
  xlab("Sample ID")+
  ylab("Abundance (%)")+
  geom_bar(aes(), stat="identity", position="stack")+
  scale_fill_manual(values = phylum_colors_rel)+
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

relative_plot


#prepareDatabase(sqlFile = 'accessionTaxa.sql', tmpDir = "ben/", getAccessions = FALSE)
prepareDatabase(sqlFile = 'accessionTaxa.sql', tmpDir = "/scratch/benfish/", getAccessions = FALSE)

#Agglomerate taxa of the same type to the genus level.
hm <- tax_glom(rarefied, taxrank = "Genus")

#Reduce our dataset to the top 20 taxa.
hm<-prune_taxa(names(sort(taxa_sums(hm),TRUE)[1:20]), hm)

#Get the genus name corresponding to the TaxID in the phyloseq object
genera<-getTaxonomy(ids = taxa_names(hm), sqlFile = "accessionTaxa.sql", desiredTaxa = c("genus"))

#To create the heatmap, get the otu table into dataframe format. 
hm<-as.matrix(otu_table(hm))

#Change the taxonomy labels from TaxIDs to scientific names.
for(i in 1:length(rownames(hm))){
  rownames(hm)[i]<-genera[i]
}

#change column names to sample IDs
colnames(hm)<-rarefied@sam_data$ID

#format hm to logarithmic scale.
hm<-log2(hm)
#remove "-Inf" values and replace with 0.
hm[hm=="-Inf"]<-0

#define colour palette for heatmap.
col <- colorRampPalette(brewer.pal(9, "RdPu"))(256)
#define colours for heatmap "annotation," which will indicate some metadata category. Each value must have a corresponding character and colour in the character vector.
colha <- list(Diagnosis = c("CD" = "#E41A1C", "nonIBD" = "#377EB8"))

#generate the annotation bar for the heatmap.
ha <- HeatmapAnnotation(Diagnosis = rarefied@sam_data$diagnosis, col = colha, show_legend = TRUE)

#ha<-HeatmapAnnotation(foo = anno_block(gp = gpar(fill = 1:10)),
    #column_km = 2)

#sort the dendrogram that accompanies the heatmap.
col_dend = dendsort(hclust(dist(t(hm))))

Heatmap(hm, cluster_columns = col_dend, show_row_dend = FALSE, col = col,
                 row_names_gp = gpar(fontsize = 25),
                 top_annotation = ha,
                 row_dend_width = unit(30,"mm"),
                 heatmap_legend_param = list(title = "Log(2)\nAbundance", at = c(0,5,10,15), labels = c("0","5","10","15")))

Contact

Please feel free to post a question on the Microbiome Helper google group if you have any issues.
General comments or inquires about Microbiome Helper can be sent to [email protected].