--- title: "Phyloseq tutorial" author: "Daniel Vaulot" date: '`r format(Sys.time(), "%d %m %Y")`' header-includes: - \usepackage{color, fancyvrb} output: rmdformats::readthedown: highlight: kate number_sections : yes pdf_document: toc: yes toc_depth: 2 number_sections : yes --- ```{r knitr_init, echo=FALSE, warning=FALSE, cache=TRUE} library(knitr) library(rmdformats) library("kableExtra") ## Global options # The following line is necessary to make sure that # everything is not reevaluated each time the file is knit # Note : in the case of this report it is necessary to leave cache= FALSE options(max.print="75") knitr::opts_chunk$set(fig.width=8, fig.height=6, eval=TRUE, cache=TRUE, echo=TRUE, prompt=FALSE, tidy=TRUE, comment=NA, message=FALSE, warning=FALSE) opts_knit$set(width=85) ``` # Aim This document explains the use of the phyloseq R library to analyze metabarcoding data. # Phyloseq R library * Phyloseq web site : https://joey711.github.io/phyloseq/index.html * See in particular tutorials for - importing data: https://joey711.github.io/phyloseq/import-data.html - heat maps: https://joey711.github.io/phyloseq/plot_heatmap-examples.html # Data ```{r, echo=FALSE, fig.cap="Carbom cruise", out.width = '100%', fig.align= "center"} knitr::include_graphics("../img/carbom_cruise.png") ``` This tutorial uses a reduced metabarcoding dataset obtained by C. Ribeiro and A. Lopes dos Santos. This dataset originates from the CARBOM cruise in 2013 off Brazil and corresponds to the 18S V4 region amplified on flow cytometry sorted samples (see pptx file for details) and sequenced on an Illumina run 2*250 bp analyzed with mothur. ## References for data * Gérikas Ribeiro, C., Lopes dos Santos, A., Marie, D., Helena Pellizari, V., Pereira Brandini, F., and Vaulot, D. (2016). Pico and nanoplankton abundance and carbon stocks along the Brazilian Bight. PeerJ 4, e2587. doi:10.7717/peerj.2587. * Gérikas Ribeiro, C., Marie, D., Lopes dos Santos, A., Pereira Brandini, F., and Vaulot, D. (2016). Estimating microbial populations by flow cytometry: Comparison between instruments. Limnol. Oceanogr. Methods 14, 750–758. doi:10.1002/lom3.10135. * Gérikas Ribeiro C, Lopes dos Santos A, Marie D, Brandini P, Vaulot D. (2018). Relationships between photosynthetic eukaryotes and nitrogen-fixing cyanobacteria off Brazil. ISME J in press. ```{r, echo=FALSE, fig.cap="Carbom paper in ISME", out.width = '100%', fig.align= "center"} knitr::include_graphics("../img/carbom_isme.png") ``` # To be added * Exercices # Prerequisites to be installed * R : https://pbil.univ-lyon1.fr/CRAN/ * R studio : https://www.rstudio.com/products/rstudio/download/#download * Download this tutorial from GitHub : https://github.com/vaulot/R_tutorials/archive/master.zip * Download and install the following libraries by running under R studio the following lines ```R install.packages("dplyr") # To manipulate dataframes install.packages("readxl") # To read Excel files into R install.packages("ggplot2") # for high quality graphics source("https://bioconductor.org/biocLite.R") biocLite("phyloseq") ``` # Gettin started * Transfer the files that you downloaded from https://github.com/vaulot/R_tutorials/archive/master.zip to a directory on the computer or server you are using "C:/R_tutorial" or "home/R_tutorial" * Navigate to the /R_tutorial/phyloseq directory. You should see the following files : ```console Phyloseq_tutorial.Rmd ``` * Double click on the file and this should open R Studio ```{r, echo=FALSE, fig.cap="R studio", out.width = '100%', fig.align= "center"} knitr::include_graphics("../img/r_studio.png") ``` * Run the "R chunks" # Step by step ## Load necessary libraries ```{r libraries, message=FALSE} library("phyloseq") library("ggplot2") # graphics library("readxl") # necessary to import the data from Excel file library("dplyr") # filter and reformat data frames library("tibble") # Needed for converting column to row names ``` ## Read the data and create phyloseq objects Change your working directory to where the files are located Three tables are needed * OTU * Taxonomy * Samples They are read from a single Excel file where each sheet contains one of the tables ```{r, echo=FALSE, fig.cap="Table OTU - OTU abundance", out.width = '100%', fig.align= "center"} knitr::include_graphics("../img/table_otu.png") ``` ```{r, echo=FALSE, fig.cap="Table Taxo - OTU taxonomy", out.width = '100%', fig.align= "center"} knitr::include_graphics("../img/table_taxo.png") ``` ```{r, echo=FALSE, fig.cap="Table Samples", out.width = '100%', fig.align= "center"} knitr::include_graphics("../img/table_sample.png") ``` ```{r} otu_mat<- read_excel("../data/CARBOM data.xlsx", sheet = "OTU matrix") tax_mat<- read_excel("../data/CARBOM data.xlsx", sheet = "Taxonomy table") samples_df <- read_excel("../data/CARBOM data.xlsx", sheet = "Samples") ``` Phyloseq objects need to have row.names * define the row names from the otu column ```{r} otu_mat <- otu_mat %>% tibble::column_to_rownames("otu") ``` * Idem for the two other matrixes ```{r, warning=FALSE} tax_mat <- tax_mat %>% tibble::column_to_rownames("otu") samples_df <- samples_df %>% tibble::column_to_rownames("sample") ``` Transform into matrixes otu and tax tables (sample table can be left as data frame) ```{r} otu_mat <- as.matrix(otu_mat) tax_mat <- as.matrix(tax_mat) ``` Transform to phyloseq objects ```{r} OTU = otu_table(otu_mat, taxa_are_rows = TRUE) TAX = tax_table(tax_mat) samples = sample_data(samples_df) carbom <- phyloseq(OTU, TAX, samples) carbom ``` Visualize data ```{r} sample_names(carbom) rank_names(carbom) sample_variables(carbom) ``` Keep only samples to be analyzed ```{r} carbom <- subset_samples(carbom, Select_18S_nifH =="Yes") carbom ``` Keep only photosynthetic taxa ```{r} carbom <- subset_taxa(carbom, Division %in% c("Chlorophyta", "Dinophyta", "Cryptophyta", "Haptophyta", "Ochrophyta", "Cercozoa")) carbom <- subset_taxa(carbom, !(Class %in% c("Syndiniales", "Sarcomonadea"))) carbom ``` Normalize number of reads in each sample using median sequencing depth. ```{r} total = median(sample_sums(carbom)) standf = function(x, t=total) round(t * (x / sum(x))) carbom = transform_sample_counts(carbom, standf) ``` The number of reads used for normalization is **`r sprintf("%.0f", total)`**. \pagebreak ## Bar graphs Basic bar graph based on Division ```{r} plot_bar(carbom, fill = "Division") ``` \pagebreak Make the bargraph nicer by removing OTUs boundaries. This is done by adding ggplot2 modifier. ```{r} plot_bar(carbom, fill = "Division") + geom_bar(aes(color=Division, fill=Division), stat="identity", position="stack") ``` \pagebreak Regroup together Pico vs Nano samples ```{r, warning=FALSE} carbom_fraction <- merge_samples(carbom, "fraction") plot_bar(carbom_fraction, fill = "Division") + geom_bar(aes(color=Division, fill=Division), stat="identity", position="stack") ``` \pagebreak Keep only Chlorophyta and use color according to genus. Do separate panels Pico vs Nano and Surface vs Deep samples. ```{r} carbom_chloro <- subset_taxa(carbom, Division %in% c("Chlorophyta")) plot_bar(carbom_chloro, x="Genus", fill = "Genus", facet_grid = level~fraction) + geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") ``` \pagebreak ## Heatmaps A basic heatmap using the default parameters. ```{r} plot_heatmap(carbom, method = "NMDS", distance = "bray") ``` \pagebreak It is very very cluttered. It is better to only consider the most abundant OTUs for heatmaps. For example one can only take OTUs that represent at least 20% of reads in at least one sample. Remember we normalized all the sampples to median number of reads (total). We are left with only 33 OTUS which makes the reading much more easy. ```{r} carbom_abund <- filter_taxa(carbom, function(x) sum(x > total*0.20) > 0, TRUE) carbom_abund otu_table(carbom_abund)[1:8, 1:5] plot_heatmap(carbom_abund, method = "NMDS", distance = "bray") ``` \pagebreak It is possible to use different distances and different multivaraite methods. For example Jaccard distance and MDS and label OTUs with Class, order by Class. We can also change the Palette (the default palette is a bit ugly...). ```{r} plot_heatmap(carbom_abund, method = "MDS", distance = "(A+B-2*J)/(A+B-J)", taxa.label = "Class", taxa.order = "Class", trans=NULL, low="beige", high="red", na.value="beige") ``` Many different built-in distances can be used ```{r} dist_methods <- unlist(distanceMethodList) print(dist_methods) ``` You can also buid your own distances. For vectors x and y the "quadratic" terms are J = sum(x*y), A = sum(x^2), B = sum(y^2) and "minimum" terms are J = sum(pmin(x,y)), A = sum(x) and B = sum(y), and "binary" terms are either of these after transforming data into binary form (shared number of species, and number of species for each row). Somes examples : * A+B-2*J "quadratic" squared Euclidean * A+B-2*J "minimum" Manhattan * (A+B-2*J)/(A+B) "minimum" Bray-Curtis * (A+B-2*J)/(A+B) "binary" Sørensen * (A+B-2*J)/(A+B-J) "binary" Jaccard \pagebreak Another strategy is to do a heatmap for a specific taxonomy group. For example we can taget the Chlorophyta and then label the OTUs using the Genus. ```{r} plot_heatmap(carbom_chloro, method = "NMDS", distance = "bray", taxa.label = "Genus", taxa.order = "Genus", low="beige", high="red", na.value="beige") ``` \pagebreak ## Alpha diversity Plot Chao1 richness estimator and Shannon diversity estimator. ```{r} plot_richness(carbom, measures=c("Chao1", "Shannon")) ``` \pagebreak Regroup together samples from the same fraction. ```{r, warning=FALSE} plot_richness(carbom, measures=c("Chao1", "Shannon"), x="level", color="fraction") ``` \pagebreak ## Ordination Do multivariate analysis based on Bray-Curtis distance and NMDS ordination. ```{r} carbom.ord <- ordinate(carbom, "NMDS", "bray") ``` \pagebreak Plot **OTUs** ```{r} plot_ordination(carbom, carbom.ord, type="taxa", color="Class", shape= "Division", title="OTUs") ``` \pagebreak A bit confusing, so make it more easy to visualize by breaking according to taxonomic division. ```{r} plot_ordination(carbom, carbom.ord, type="taxa", color="Class", title="OTUs", label="Class") + facet_wrap(~Division, 3) ``` \pagebreak Now display **samples** and enlarge the points to make it more easy to read. ```{r} plot_ordination(carbom, carbom.ord, type="samples", color="fraction", shape="level", title="Samples") + geom_point(size=3) ``` \pagebreak Display both samples and OTUs but in 2 different panels. ```{r} plot_ordination(carbom, carbom.ord, type="split", color="Class", shape="level", title="biplot", label = "station") + geom_point(size=3) ``` \pagebreak ## Network analysis Simple network analysis ```{r} plot_net(carbom, distance = "(A+B-2*J)/(A+B)", type = "taxa", maxdist = 0.7, color="Class", point_label="Genus") ``` \pagebreak This is quite confusing. Let us make it more simple by using only major OTUs ```{r} plot_net(carbom_abund, distance = "(A+B-2*J)/(A+B)", type = "taxa", maxdist = 0.8, color="Class", point_label="Genus") ```