--- title: "03.00-sRNAseq-gene-expression-DESeq2" author: "Sam White" date: "2025-01-31" output: bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true github_document: toc: true number_sections: true html_document: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bibliography: references.bib --- # Background This will run [DEseq2](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) [@love2014] to determine if any of the miRNAs identified by ShortStack in [02.00-ShortStack-31bp-fastp-merged.md](https://github.com/RobertsLab/project-clam-oa/blob/2616d45a5a0bbef23081c5824bfe5bad6b89d227/code/02.00-ShortStack-31bp-fastp-merged.md) analysis are differentially expressed between control/treatment. This was initially run with a log2 fold change threshold set to 1 (which is equivalent to a 2-fold change in expression), but that returned 0 differentially expressed miRNAs. As such, this was run again with the log2 fold change threshold set to 0. ::: callout-note ShortStack identified 37 miRNAs. ::: This notebook will also run [DEseq2](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) [@love2014] on _all_ sRNAs to identify any differential expression. ::: callout-important This includes the 37 miRNAs identified by ShortStack. ::: ## Inputs - [`Counts.txt`](https://github.com/RobertsLab/project-clam-oa/blob/d380ed9a88c874db5434c9ec74f6ab1de042bad5/output/02.00-ShortStack-31bp-fastp-merged/ShortStack_out/Counts.txt): ShortStack counts matrix. Includes all clusters, including those that were *not* categorized as miRNAs. - [`DESeq2-coldata.tab`](https://github.com/RobertsLab/project-clam-oa/blob/d380ed9a88c874db5434c9ec74f6ab1de042bad5/output/03.00-sRNAseq-gene-expression-DESeq2/DESeq2-coldata.tab): Two column table with sample ID and treatment. This file is also an output from this notebook. - [ManilaOA2023_shortRNASeq_Meta.csv](https://github.com/RobertsLab/project-clam-oa/blob/d380ed9a88c874db5434c9ec74f6ab1de042bad5/data/ManilaOA2023_shortRNASeq_Meta.csv): Metadata file for this sRNA-seq data. ## Outputs - [`DE-miRNAs.fdr-0.05.lfc-0.tab`](https://github.com/RobertsLab/project-clam-oa/blob/3a9131cc5af82e756990d735b1ad56aecc32008c/output/03.00-sRNAseq-gene-expression-DESeq2/DE-miRNAs.fdr-0.05.lfc-0.tab): Tab-delimited list of differentially expressed miRNA "clusters" from the ShortStack `Results.txt` file. - [`DE-sRNAs.fdr-0.05.lfc-0.tab`](https://github.com/RobertsLab/project-clam-oa/blob/3a9131cc5af82e756990d735b1ad56aecc32008c/output/03.00-sRNAseq-gene-expression-DESeq2/DE-sRNAs.fdr-0.05.lfc-0.tab): Tab-delimited list of differentially expressed sRNA "clusters" from the ShortStack `Results.txt` file. Since this is any sRNA, results may also include miRNAs identified by ShortStack. - [`DESeq2-coldata.tab`](https://github.com/RobertsLab/project-clam-oa/blob/3a9131cc5af82e756990d735b1ad56aecc32008c/output/03.00-sRNAseq-gene-expression-DESeq2/DESeq2-coldata.tab): Two column table with sample ID and treatment. Needed as input to DEseq2. - [`deseq2.miRNAs.fdr-0.05.lfc-0.table.csv`](https://github.com/RobertsLab/project-clam-oa/blob/3a9131cc5af82e756990d735b1ad56aecc32008c/output/03.00-sRNAseq-gene-expression-DESeq2/deseq2.miRNAs.fdr-0.05.lfc-0.table.csv): DEseq2 miRNA output table of results with adjusted p-value <= 0.05 and a log₂ fold change value = 0. - [`deseq2.miRNAs.table.csv`](https://github.com/RobertsLab/project-clam-oa/blob/3a9131cc5af82e756990d735b1ad56aecc32008c/output/03.00-sRNAseq-gene-expression-DESeq2/deseq2.miRNAs.table.csv): Unfiltered DEseq2 miRNA output table containing all results with mean expression, fold change in expression, and adjusted p-values for all input samples. - [`deseq2.sRNAs.fdr-0.05.lfc-0.table.csv`](https://github.com/RobertsLab/project-clam-oa/blob/3a9131cc5af82e756990d735b1ad56aecc32008c/output/03.00-sRNAseq-gene-expression-DESeq2/deseq2.sRNAs.fdr-0.05.lfc-0.table.csv): DEseq2 sRNA output table of results with adjusted p-value <= 0.05 and a log₂ fold change value = 0. - [`deseq2.sRNAs.table.csv`](https://github.com/RobertsLab/project-clam-oa/blob/3a9131cc5af82e756990d735b1ad56aecc32008c/output/03.00-sRNAseq-gene-expression-DESeq2/deseq2.sRNAs.table.csv): Unfiltered DEseq2 sRNA output table containing all results with mean expression, fold change in expression, and adjusted p-values for all input samples. ```{r setup, include=FALSE} library("RColorBrewer") library("DESeq2") library("ggplot2") library("knitr") library("pheatmap") library("tidyverse") knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages comment = "" # Prevents appending '##' to beginning of lines in code output ) ``` # Set R variables ```{r set-variables, eval=TRUE} # Define the output directory path output_dir <- "../output/03.00-sRNAseq-gene-expression-DESeq2/" # Set desired false discovery rate threshold (i.e. adjusted p-value, padj) fdr <- 0.05 # Set log2 fold change threshold (a value of '1' is equal to a fold change of '2') log2fc <- 0 ``` # Load count data Load in the sRNA count matrix generated using ShortStack 4.1.1. Keep in mind this data includes counts of all sRNAs, not just miRNAs. Counts generated in `02.00-ShortStack-31bp-fastp-merged`. ## Select only miRNAs IDd by ShortStack ```{r load-miRNA-counts, eval=TRUE} # Read in sRNA counts data miRNA_seq_counts_shortstack <- read_delim("../output/02.00-ShortStack-31bp-fastp-merged/ShortStack_out/Counts.txt", delim="\t") miRNA_seq_counts_shortstack <- miRNA_seq_counts_shortstack %>% filter(MIRNA == "Y") str(miRNA_seq_counts_shortstack) ``` ## All sRNA-seq counts ```{r load-sRNA-counts, eval=TRUE} # Read in sRNA counts data srna_seq_counts_all <- read_delim("../output/02.00-ShortStack-31bp-fastp-merged/ShortStack_out/Counts.txt", delim="\t") str(srna_seq_counts_all) ``` # Create DESeq2 Column Data ## Read in metadata CSV ```{r read-metadata, eval=TRUE} # Load metadata metadata <- read.csv("../data/ManilaOA2023_shortRNASeq_Meta.csv", header = TRUE) str(metadata) ``` ## Extract sample names ```{r extract-sample-names, eval=TRUE} sample_names <- colnames(miRNA_seq_counts_shortstack) %>% str_subset("^\\d+-") %>% str_extract("^\\d+") str(sample_names) ``` ## Select sample name and treatment ```{r filter-metadata, eval=TRUE} sample_treatment_df <- metadata %>% select(ID_simple, treatment) # Set sample names as rownames rownames(sample_treatment_df) <- sample_treatment_df$ID_simple sample_treatment_df$ID_simple <- NULL str(sample_treatment_df) ``` ## Write DEseq coldata to file ```{r write-deseq-coldata, eval=TRUE} write.table( sample_treatment_df, file = "../output/03.00-sRNAseq-gene-expression-DESeq2/DESeq2-coldata.tab", sep = "\t", quote = FALSE, col.names = NA ) ``` # SHORTSTACK miRNAS ## Count data munging ### Fix col names and convert to matrix ```{r miRNA-count-data-munging, eval=TRUE} coldata <- sample_treatment_df # Remove excess portions of sample column names to just "sample###" colnames(miRNA_seq_counts_shortstack) <- sub("-fastp-adapters-polyG-31bp-merged_condensed", "", colnames(miRNA_seq_counts_shortstack)) # Keep just the counts and cluster names as matrix miRNA_seq_counts_matrix <- as.matrix(miRNA_seq_counts_shortstack %>% select(-Coords, -MIRNA) %>% column_to_rownames(var = "Name")) str(miRNA_seq_counts_matrix) ``` ### Take only samples present in coldata ```{r miRNA-extract-samples, eval=TRUE} miRNA_common_cols <- intersect(colnames(miRNA_seq_counts_matrix), rownames(sample_treatment_df)) miRNA_seq_counts_matrix <- miRNA_seq_counts_matrix[, miRNA_common_cols] str(miRNA_seq_counts_matrix) ``` ### Reorder matrix cols to match coldata ```{r miRNA-reorder-matrix-cols, eval=TRUE} miRNA_ord <- match(rownames(sample_treatment_df), colnames(miRNA_seq_counts_matrix)) miRNA_seq_counts_matrix_sorted <- miRNA_seq_counts_matrix[, miRNA_ord] str(miRNA_seq_counts_matrix_sorted) ``` ### Verify rownames match ```{r miRNA-check-rownames, eval=TRUE} all(rownames(coldata) == colnames(miRNA_seq_counts_matrix_sorted)) ``` ## Create DESeq2 data set ### Initialize DEseq2 data set ```{r miRNA-create-deseq2-data-set, eval=TRUE} miRNA_dds <- DESeqDataSetFromMatrix(countData = miRNA_seq_counts_matrix_sorted, colData = coldata, design = ~ treatment) miRNA_dds ``` ### Add cluster column as "gene" feature ```{r miRNA-add-cluaster-as-gene-feature, eval=TRUE} miRNA_featureData <- data.frame(gene=rownames(miRNA_seq_counts_matrix_sorted)) mcols(miRNA_dds) <- DataFrame(mcols(miRNA_dds), miRNA_featureData) mcols(miRNA_dds) ``` ### Set factor levels ```{r miRNA-set-factor-levels, eval=TRUE} miRNA_dds$treatment <- factor(miRNA_dds$treatment, levels = c("control", "treatment")) ``` ## DESeq analysis ### DEseq ```{r miRNA-deseq, eval=TRUE} miRNA_dds <- DESeq(miRNA_dds) ``` ### DEseq Results ```{r miRNA-assign-results, eval=TRUE} miRNA_res <- results(miRNA_dds, alpha = fdr, lfcThreshold = log2fc) miRNA_res summary(miRNA_res) table(miRNA_res$padj < fdr) ``` ### Write DDS results tables to CSVs ```{r miRNA-write-dds-results-table, eval=TRUE} write.csv(miRNA_res, file = paste0(output_dir, "deseq2", ".miRNAs", ".table.csv"), row.names = TRUE, quote = FALSE) # Subset based on adjusted p-value miRNA_resSig <- subset(miRNA_res, padj < fdr) miRNA_resSig write.csv(miRNA_resSig, file = paste0(output_dir, "deseq2", ".miRNAs", ".fdr-", fdr, ".lfc-", log2fc, ".table.csv"), row.names = TRUE, quote = FALSE) ``` ### Identify Differentially Expressed miRNAs ```{r miRNA-differentially-expressed, engine='bash', eval=TRUE} # Extract the first column values (excluding the header) clusters=$(cut -d',' -f1 "../output/03.00-sRNAseq-gene-expression-DESeq2/deseq2.miRNAs.fdr-0.05.lfc-0.table.csv" | tail -n +2) # Loop through each cluster and search in the Results.txt file for cluster in $clusters; do grep "$cluster" "../output/02.00-ShortStack-31bp-fastp-merged/ShortStack_out/Results.txt" done \ | tee ../output/03.00-sRNAseq-gene-expression-DESeq2/DE-miRNAs.fdr-0.05.lfc-0.tab echo "" echo "--------------------------------------------------------------" echo "" for cluster in $clusters; do grep "$cluster" "../output/02.00-ShortStack-31bp-fastp-merged/ShortStack_out/Results.txt" \ | awk '{print $21}' done ``` ## Variance stabilizing transformations (VST) - Here we transform counts using a variance stabilizing transformation (VST), since we have many samples. ```{r miRNA-VST, eval=TRUE} miRNA_vsd <- varianceStabilizingTransformation(miRNA_dds, blind=FALSE) ``` ## Plotting ### Sample distances ```{r miRNA-plot-sample-distances, eval=TRUE} miRNA_sampleDists <- dist(t(assay(miRNA_vsd))) miRNA_sampleDistMatrix <- as.matrix( miRNA_sampleDists ) rownames(miRNA_sampleDistMatrix) <- paste( miRNA_vsd$colony.id, miRNA_vsd$time.point, sep = " - " ) colnames(miRNA_sampleDistMatrix) <- NULL colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255) pheatmap(miRNA_sampleDistMatrix, clustering_distance_rows = miRNA_sampleDists, clustering_distance_cols = miRNA_sampleDists, col = colors) ``` ### PCA Visualize sample clustering via PCA (after transformation) ```{r miRNA-pca, eval=TRUE} # PCA with points color coded by time point plotPCA(miRNA_vsd, intgroup = c("treatment")) ``` ### Heatmap of 37 ShortStack miRNAs ```{r miRNA-heatmap, eval=TRUE} miRNA_counts <- order(rowMeans(counts(miRNA_dds,normalized=TRUE)), decreasing=TRUE)[1:37] annotation = colData(miRNA_dds) %>% as.data.frame() %>% select(treatment) pheatmap(assay(miRNA_vsd)[miRNA_counts,], cluster_rows=FALSE, show_rownames=TRUE, cluster_cols=TRUE, annotation_col = annotation) ``` # ALL sRNA ## Count data munging ### Fix col names and convert to matrix ```{r sRNA-count-data-munging, eval=TRUE} coldata <- sample_treatment_df # Remove excess portions of sample column names to just "sample###" colnames(srna_seq_counts_all) <- sub("-fastp-adapters-polyG-31bp-merged_condensed", "", colnames(srna_seq_counts_all)) # Keep just the counts and cluster names as matrix srna_seq_counts_matrix <- as.matrix(srna_seq_counts_all %>% select(-Coords, -MIRNA) %>% column_to_rownames(var = "Name")) str(srna_seq_counts_matrix) ``` ### Take only samples present in coldata ```{r sRNA-extract-samples, eval=TRUE} sRNA_common_cols <- intersect(colnames(srna_seq_counts_matrix), rownames(sample_treatment_df)) srna_seq_counts_matrix <- srna_seq_counts_matrix[, sRNA_common_cols] str(srna_seq_counts_matrix) ``` ### Reorder matrix cols to match coldata ```{r sRNA-reorder-matrix-cols, eval=TRUE} sRNA_ord <- match(rownames(sample_treatment_df), colnames(srna_seq_counts_matrix)) srna_seq_counts_matrix_sorted <- srna_seq_counts_matrix[, sRNA_ord] str(srna_seq_counts_matrix_sorted) ``` ### Verify rownames match ```{r sRNA-check-rownames, eval=TRUE} all(rownames(coldata) == colnames(srna_seq_counts_matrix_sorted)) ``` ## Create DESeq2 data set ### Initialize DEseq2 data set ```{r sRNA-create-deseq2-data-set, eval=TRUE} sRNA_dds <- DESeqDataSetFromMatrix(countData = srna_seq_counts_matrix_sorted, colData = coldata, design = ~ treatment) sRNA_dds ``` ### Add cluster column as "gene" feature ```{r sRNA-add-cluaster-as-gene-feature, eval=TRUE} sRNA_featureData <- data.frame(gene=rownames(srna_seq_counts_matrix_sorted)) mcols(sRNA_dds) <- DataFrame(mcols(sRNA_dds), sRNA_featureData) mcols(sRNA_dds) ``` ### Set factor levels ```{r sRNA-set-factor-levels, eval=TRUE} sRNA_dds$treatment <- factor(sRNA_dds$treatment, levels = c("control", "treatment")) ``` ## DESeq analysis ### DEseq ```{r sRNA-deseq, eval=TRUE} sRNA_dds <- DESeq(sRNA_dds) ``` ### DEseq Results ```{r sRNA-assign-results, eval=TRUE} sRNA_res <- results(sRNA_dds, alpha = fdr, lfcThreshold = log2fc) sRNA_res summary(sRNA_res) table(sRNA_res$padj < fdr) ``` ### Write DDS results tables to CSVs ```{r sRNA-write-dds-results-table, eval=TRUE} write.csv(sRNA_res, file = paste0(output_dir, "deseq2", ".sRNAs", ".table.csv"), row.names = TRUE, quote = FALSE) # Subset based on adjusted p-value sRNA_resSig <- subset(sRNA_res, padj < fdr) sRNA_resSig write.csv(sRNA_resSig, file = paste0(output_dir, "deseq2", ".sRNAs", ".fdr-", fdr, ".lfc-", log2fc, ".table.csv"), row.names = TRUE, quote = FALSE) ``` ### Identify Differentially Expressed sRNAs ```{r sRNA-differentially-expressed, engine='bash', eval=TRUE} # Extract the first column values (excluding the header) clusters=$(cut -d',' -f1 "../output/03.00-sRNAseq-gene-expression-DESeq2/deseq2.sRNAs.fdr-0.05.lfc-0.table.csv" | tail -n +2) # Loop through each cluster and search in the Results.txt file for cluster in $clusters; do grep "$cluster" "../output/02.00-ShortStack-31bp-fastp-merged/ShortStack_out/Results.txt" done > ../output/03.00-sRNAseq-gene-expression-DESeq2/DE-sRNAs.fdr-0.05.lfc-0.tab ``` ## Variance stabilizing transformations (VST) - Here we transform counts using a variance stabilizing transformation (VST), since we have many samples. ```{r sRNA-VST, eval=TRUE} sRNA_vsd <- varianceStabilizingTransformation(sRNA_dds, blind=FALSE) ``` ## Plotting ### Sample distances ```{r sRNA-plot-sample-distances, eval=TRUE} sRNA_sampleDists <- dist(t(assay(sRNA_vsd))) sRNA_sampleDistMatrix <- as.matrix( sRNA_sampleDists ) rownames(sRNA_sampleDistMatrix) <- paste( sRNA_vsd$colony.id, sRNA_vsd$time.point, sep = " - " ) colnames(sRNA_sampleDistMatrix) <- NULL colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255) pheatmap(sRNA_sampleDistMatrix, clustering_distance_rows = sRNA_sampleDists, clustering_distance_cols = sRNA_sampleDists, col = colors) ``` ### PCA Visualize sample clustering via PCA (after transformation) ```{r sRNA-pca, eval=TRUE} # PCA with points color coded by time point plotPCA(sRNA_vsd, intgroup = c("treatment")) ``` ### Heatmap of Top 50 sRNAs ```{r sRNA-heatmap, eval=TRUE} sRNA_counts_top50 <- order(rowMeans(counts(sRNA_dds,normalized=TRUE)), decreasing=TRUE)[1:50] annotation = colData(sRNA_dds) %>% as.data.frame() %>% select(treatment) pheatmap(assay(sRNA_vsd)[sRNA_counts_top50,], cluster_rows=FALSE, show_rownames=FALSE, cluster_cols=TRUE, annotation_col = annotation) ``` # REFERENCES