---
title: "03.00-sRNAseq-gene-expression-DESeq2"
author: "Sam White"
date: "2025-01-31"
output:
bookdown::html_document2:
theme: cosmo
toc: true
toc_float: true
number_sections: true
code_folding: show
code_download: true
github_document:
toc: true
number_sections: true
html_document:
theme: cosmo
toc: true
toc_float: true
number_sections: true
code_folding: show
code_download: true
bibliography: references.bib
---
# Background
This will run [DEseq2](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) [@love2014] to determine if any of the miRNAs identified by ShortStack in [02.00-ShortStack-31bp-fastp-merged.md](https://github.com/RobertsLab/project-clam-oa/blob/2616d45a5a0bbef23081c5824bfe5bad6b89d227/code/02.00-ShortStack-31bp-fastp-merged.md) analysis are differentially expressed between control/treatment.
This was initially run with a log2 fold change threshold set to 1 (which is equivalent to a 2-fold change in expression), but that returned 0 differentially expressed miRNAs. As such, this was run again with the log2 fold change threshold set to 0.
::: callout-note
ShortStack identified 37 miRNAs.
:::
This notebook will also run [DEseq2](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) [@love2014] on _all_ sRNAs to identify any differential expression.
::: callout-important
This includes the 37 miRNAs identified by ShortStack.
:::
## Inputs
- [`Counts.txt`](https://github.com/RobertsLab/project-clam-oa/blob/d380ed9a88c874db5434c9ec74f6ab1de042bad5/output/02.00-ShortStack-31bp-fastp-merged/ShortStack_out/Counts.txt): ShortStack counts matrix. Includes all clusters, including those that were *not* categorized as miRNAs.
- [`DESeq2-coldata.tab`](https://github.com/RobertsLab/project-clam-oa/blob/d380ed9a88c874db5434c9ec74f6ab1de042bad5/output/03.00-sRNAseq-gene-expression-DESeq2/DESeq2-coldata.tab): Two column table with sample ID and treatment. This file is also an output from this notebook.
- [ManilaOA2023_shortRNASeq_Meta.csv](https://github.com/RobertsLab/project-clam-oa/blob/d380ed9a88c874db5434c9ec74f6ab1de042bad5/data/ManilaOA2023_shortRNASeq_Meta.csv): Metadata file for this sRNA-seq data.
## Outputs
- [`DE-miRNAs.fdr-0.05.lfc-0.tab`](https://github.com/RobertsLab/project-clam-oa/blob/3a9131cc5af82e756990d735b1ad56aecc32008c/output/03.00-sRNAseq-gene-expression-DESeq2/DE-miRNAs.fdr-0.05.lfc-0.tab): Tab-delimited list of differentially expressed miRNA "clusters" from the ShortStack `Results.txt` file.
- [`DE-sRNAs.fdr-0.05.lfc-0.tab`](https://github.com/RobertsLab/project-clam-oa/blob/3a9131cc5af82e756990d735b1ad56aecc32008c/output/03.00-sRNAseq-gene-expression-DESeq2/DE-sRNAs.fdr-0.05.lfc-0.tab): Tab-delimited list of differentially expressed sRNA "clusters" from the ShortStack `Results.txt` file. Since this is any sRNA, results may also include miRNAs identified by ShortStack.
- [`DESeq2-coldata.tab`](https://github.com/RobertsLab/project-clam-oa/blob/3a9131cc5af82e756990d735b1ad56aecc32008c/output/03.00-sRNAseq-gene-expression-DESeq2/DESeq2-coldata.tab): Two column table with sample ID and treatment. Needed as input to DEseq2.
- [`deseq2.miRNAs.fdr-0.05.lfc-0.table.csv`](https://github.com/RobertsLab/project-clam-oa/blob/3a9131cc5af82e756990d735b1ad56aecc32008c/output/03.00-sRNAseq-gene-expression-DESeq2/deseq2.miRNAs.fdr-0.05.lfc-0.table.csv): DEseq2 miRNA output table of results with adjusted p-value <= 0.05 and a log2 fold change value = 0.
- [`deseq2.miRNAs.table.csv`](https://github.com/RobertsLab/project-clam-oa/blob/3a9131cc5af82e756990d735b1ad56aecc32008c/output/03.00-sRNAseq-gene-expression-DESeq2/deseq2.miRNAs.table.csv): Unfiltered DEseq2 miRNA output table containing all results with mean expression, fold change in expression, and adjusted p-values for all input samples.
- [`deseq2.sRNAs.fdr-0.05.lfc-0.table.csv`](https://github.com/RobertsLab/project-clam-oa/blob/3a9131cc5af82e756990d735b1ad56aecc32008c/output/03.00-sRNAseq-gene-expression-DESeq2/deseq2.sRNAs.fdr-0.05.lfc-0.table.csv): DEseq2 sRNA output table of results with adjusted p-value <= 0.05 and a log2 fold change value = 0.
- [`deseq2.sRNAs.table.csv`](https://github.com/RobertsLab/project-clam-oa/blob/3a9131cc5af82e756990d735b1ad56aecc32008c/output/03.00-sRNAseq-gene-expression-DESeq2/deseq2.sRNAs.table.csv): Unfiltered DEseq2 sRNA output table containing all results with mean expression, fold change in expression, and adjusted p-values for all input samples.
```{r setup, include=FALSE}
library("RColorBrewer")
library("DESeq2")
library("ggplot2")
library("knitr")
library("pheatmap")
library("tidyverse")
knitr::opts_chunk$set(
echo = TRUE, # Display code chunks
eval = FALSE, # Evaluate code chunks
warning = FALSE, # Hide warnings
message = FALSE, # Hide messages
comment = "" # Prevents appending '##' to beginning of lines in code output
)
```
# Set R variables
```{r set-variables, eval=TRUE}
# Define the output directory path
output_dir <- "../output/03.00-sRNAseq-gene-expression-DESeq2/"
# Set desired false discovery rate threshold (i.e. adjusted p-value, padj)
fdr <- 0.05
# Set log2 fold change threshold (a value of '1' is equal to a fold change of '2')
log2fc <- 0
```
# Load count data
Load in the sRNA count matrix generated using ShortStack 4.1.1. Keep in mind this data includes counts of all sRNAs, not just miRNAs.
Counts generated in `02.00-ShortStack-31bp-fastp-merged`.
## Select only miRNAs IDd by ShortStack
```{r load-miRNA-counts, eval=TRUE}
# Read in sRNA counts data
miRNA_seq_counts_shortstack <- read_delim("../output/02.00-ShortStack-31bp-fastp-merged/ShortStack_out/Counts.txt", delim="\t")
miRNA_seq_counts_shortstack <- miRNA_seq_counts_shortstack %>% filter(MIRNA == "Y")
str(miRNA_seq_counts_shortstack)
```
## All sRNA-seq counts
```{r load-sRNA-counts, eval=TRUE}
# Read in sRNA counts data
srna_seq_counts_all <- read_delim("../output/02.00-ShortStack-31bp-fastp-merged/ShortStack_out/Counts.txt", delim="\t")
str(srna_seq_counts_all)
```
# Create DESeq2 Column Data
## Read in metadata CSV
```{r read-metadata, eval=TRUE}
# Load metadata
metadata <- read.csv("../data/ManilaOA2023_shortRNASeq_Meta.csv", header = TRUE)
str(metadata)
```
## Extract sample names
```{r extract-sample-names, eval=TRUE}
sample_names <- colnames(miRNA_seq_counts_shortstack) %>%
str_subset("^\\d+-") %>%
str_extract("^\\d+")
str(sample_names)
```
## Select sample name and treatment
```{r filter-metadata, eval=TRUE}
sample_treatment_df <- metadata %>%
select(ID_simple, treatment)
# Set sample names as rownames
rownames(sample_treatment_df) <- sample_treatment_df$ID_simple
sample_treatment_df$ID_simple <- NULL
str(sample_treatment_df)
```
## Write DEseq coldata to file
```{r write-deseq-coldata, eval=TRUE}
write.table(
sample_treatment_df,
file = "../output/03.00-sRNAseq-gene-expression-DESeq2/DESeq2-coldata.tab",
sep = "\t",
quote = FALSE,
col.names = NA
)
```
# SHORTSTACK miRNAS
## Count data munging
### Fix col names and convert to matrix
```{r miRNA-count-data-munging, eval=TRUE}
coldata <- sample_treatment_df
# Remove excess portions of sample column names to just "sample###"
colnames(miRNA_seq_counts_shortstack) <- sub("-fastp-adapters-polyG-31bp-merged_condensed", "", colnames(miRNA_seq_counts_shortstack))
# Keep just the counts and cluster names as matrix
miRNA_seq_counts_matrix <- as.matrix(miRNA_seq_counts_shortstack %>% select(-Coords, -MIRNA) %>% column_to_rownames(var = "Name"))
str(miRNA_seq_counts_matrix)
```
### Take only samples present in coldata
```{r miRNA-extract-samples, eval=TRUE}
miRNA_common_cols <- intersect(colnames(miRNA_seq_counts_matrix),
rownames(sample_treatment_df))
miRNA_seq_counts_matrix <- miRNA_seq_counts_matrix[, miRNA_common_cols]
str(miRNA_seq_counts_matrix)
```
### Reorder matrix cols to match coldata
```{r miRNA-reorder-matrix-cols, eval=TRUE}
miRNA_ord <- match(rownames(sample_treatment_df), colnames(miRNA_seq_counts_matrix))
miRNA_seq_counts_matrix_sorted <- miRNA_seq_counts_matrix[, miRNA_ord]
str(miRNA_seq_counts_matrix_sorted)
```
### Verify rownames match
```{r miRNA-check-rownames, eval=TRUE}
all(rownames(coldata) == colnames(miRNA_seq_counts_matrix_sorted))
```
## Create DESeq2 data set
### Initialize DEseq2 data set
```{r miRNA-create-deseq2-data-set, eval=TRUE}
miRNA_dds <- DESeqDataSetFromMatrix(countData = miRNA_seq_counts_matrix_sorted,
colData = coldata,
design = ~ treatment)
miRNA_dds
```
### Add cluster column as "gene" feature
```{r miRNA-add-cluaster-as-gene-feature, eval=TRUE}
miRNA_featureData <- data.frame(gene=rownames(miRNA_seq_counts_matrix_sorted))
mcols(miRNA_dds) <- DataFrame(mcols(miRNA_dds), miRNA_featureData)
mcols(miRNA_dds)
```
### Set factor levels
```{r miRNA-set-factor-levels, eval=TRUE}
miRNA_dds$treatment <- factor(miRNA_dds$treatment, levels = c("control", "treatment"))
```
## DESeq analysis
### DEseq
```{r miRNA-deseq, eval=TRUE}
miRNA_dds <- DESeq(miRNA_dds)
```
### DEseq Results
```{r miRNA-assign-results, eval=TRUE}
miRNA_res <- results(miRNA_dds, alpha = fdr, lfcThreshold = log2fc)
miRNA_res
summary(miRNA_res)
table(miRNA_res$padj < fdr)
```
### Write DDS results tables to CSVs
```{r miRNA-write-dds-results-table, eval=TRUE}
write.csv(miRNA_res, file = paste0(output_dir, "deseq2", ".miRNAs", ".table.csv"), row.names = TRUE, quote = FALSE)
# Subset based on adjusted p-value
miRNA_resSig <- subset(miRNA_res, padj < fdr)
miRNA_resSig
write.csv(miRNA_resSig, file = paste0(output_dir, "deseq2", ".miRNAs", ".fdr-", fdr, ".lfc-", log2fc, ".table.csv"), row.names = TRUE, quote = FALSE)
```
### Identify Differentially Expressed miRNAs
```{r miRNA-differentially-expressed, engine='bash', eval=TRUE}
# Extract the first column values (excluding the header)
clusters=$(cut -d',' -f1 "../output/03.00-sRNAseq-gene-expression-DESeq2/deseq2.miRNAs.fdr-0.05.lfc-0.table.csv" | tail -n +2)
# Loop through each cluster and search in the Results.txt file
for cluster in $clusters;
do
grep "$cluster" "../output/02.00-ShortStack-31bp-fastp-merged/ShortStack_out/Results.txt"
done \
| tee ../output/03.00-sRNAseq-gene-expression-DESeq2/DE-miRNAs.fdr-0.05.lfc-0.tab
echo ""
echo "--------------------------------------------------------------"
echo ""
for cluster in $clusters;
do
grep "$cluster" "../output/02.00-ShortStack-31bp-fastp-merged/ShortStack_out/Results.txt" \
| awk '{print $21}'
done
```
## Variance stabilizing transformations (VST)
- Here we transform counts using a variance stabilizing transformation (VST), since we have many samples.
```{r miRNA-VST, eval=TRUE}
miRNA_vsd <- varianceStabilizingTransformation(miRNA_dds, blind=FALSE)
```
## Plotting
### Sample distances
```{r miRNA-plot-sample-distances, eval=TRUE}
miRNA_sampleDists <- dist(t(assay(miRNA_vsd)))
miRNA_sampleDistMatrix <- as.matrix( miRNA_sampleDists )
rownames(miRNA_sampleDistMatrix) <- paste( miRNA_vsd$colony.id, miRNA_vsd$time.point, sep = " - " )
colnames(miRNA_sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(miRNA_sampleDistMatrix,
clustering_distance_rows = miRNA_sampleDists,
clustering_distance_cols = miRNA_sampleDists,
col = colors)
```
### PCA
Visualize sample clustering via PCA (after transformation)
```{r miRNA-pca, eval=TRUE}
# PCA with points color coded by time point
plotPCA(miRNA_vsd, intgroup = c("treatment"))
```
### Heatmap of 37 ShortStack miRNAs
```{r miRNA-heatmap, eval=TRUE}
miRNA_counts <- order(rowMeans(counts(miRNA_dds,normalized=TRUE)),
decreasing=TRUE)[1:37]
annotation = colData(miRNA_dds) %>% as.data.frame() %>% select(treatment)
pheatmap(assay(miRNA_vsd)[miRNA_counts,],
cluster_rows=FALSE,
show_rownames=TRUE,
cluster_cols=TRUE,
annotation_col = annotation)
```
# ALL sRNA
## Count data munging
### Fix col names and convert to matrix
```{r sRNA-count-data-munging, eval=TRUE}
coldata <- sample_treatment_df
# Remove excess portions of sample column names to just "sample###"
colnames(srna_seq_counts_all) <- sub("-fastp-adapters-polyG-31bp-merged_condensed", "", colnames(srna_seq_counts_all))
# Keep just the counts and cluster names as matrix
srna_seq_counts_matrix <- as.matrix(srna_seq_counts_all %>% select(-Coords, -MIRNA) %>% column_to_rownames(var = "Name"))
str(srna_seq_counts_matrix)
```
### Take only samples present in coldata
```{r sRNA-extract-samples, eval=TRUE}
sRNA_common_cols <- intersect(colnames(srna_seq_counts_matrix),
rownames(sample_treatment_df))
srna_seq_counts_matrix <- srna_seq_counts_matrix[, sRNA_common_cols]
str(srna_seq_counts_matrix)
```
### Reorder matrix cols to match coldata
```{r sRNA-reorder-matrix-cols, eval=TRUE}
sRNA_ord <- match(rownames(sample_treatment_df), colnames(srna_seq_counts_matrix))
srna_seq_counts_matrix_sorted <- srna_seq_counts_matrix[, sRNA_ord]
str(srna_seq_counts_matrix_sorted)
```
### Verify rownames match
```{r sRNA-check-rownames, eval=TRUE}
all(rownames(coldata) == colnames(srna_seq_counts_matrix_sorted))
```
## Create DESeq2 data set
### Initialize DEseq2 data set
```{r sRNA-create-deseq2-data-set, eval=TRUE}
sRNA_dds <- DESeqDataSetFromMatrix(countData = srna_seq_counts_matrix_sorted,
colData = coldata,
design = ~ treatment)
sRNA_dds
```
### Add cluster column as "gene" feature
```{r sRNA-add-cluaster-as-gene-feature, eval=TRUE}
sRNA_featureData <- data.frame(gene=rownames(srna_seq_counts_matrix_sorted))
mcols(sRNA_dds) <- DataFrame(mcols(sRNA_dds), sRNA_featureData)
mcols(sRNA_dds)
```
### Set factor levels
```{r sRNA-set-factor-levels, eval=TRUE}
sRNA_dds$treatment <- factor(sRNA_dds$treatment, levels = c("control", "treatment"))
```
## DESeq analysis
### DEseq
```{r sRNA-deseq, eval=TRUE}
sRNA_dds <- DESeq(sRNA_dds)
```
### DEseq Results
```{r sRNA-assign-results, eval=TRUE}
sRNA_res <- results(sRNA_dds, alpha = fdr, lfcThreshold = log2fc)
sRNA_res
summary(sRNA_res)
table(sRNA_res$padj < fdr)
```
### Write DDS results tables to CSVs
```{r sRNA-write-dds-results-table, eval=TRUE}
write.csv(sRNA_res, file = paste0(output_dir, "deseq2", ".sRNAs", ".table.csv"), row.names = TRUE, quote = FALSE)
# Subset based on adjusted p-value
sRNA_resSig <- subset(sRNA_res, padj < fdr)
sRNA_resSig
write.csv(sRNA_resSig, file = paste0(output_dir, "deseq2", ".sRNAs", ".fdr-", fdr, ".lfc-", log2fc, ".table.csv"), row.names = TRUE, quote = FALSE)
```
### Identify Differentially Expressed sRNAs
```{r sRNA-differentially-expressed, engine='bash', eval=TRUE}
# Extract the first column values (excluding the header)
clusters=$(cut -d',' -f1 "../output/03.00-sRNAseq-gene-expression-DESeq2/deseq2.sRNAs.fdr-0.05.lfc-0.table.csv" | tail -n +2)
# Loop through each cluster and search in the Results.txt file
for cluster in $clusters;
do
grep "$cluster" "../output/02.00-ShortStack-31bp-fastp-merged/ShortStack_out/Results.txt"
done > ../output/03.00-sRNAseq-gene-expression-DESeq2/DE-sRNAs.fdr-0.05.lfc-0.tab
```
## Variance stabilizing transformations (VST)
- Here we transform counts using a variance stabilizing transformation (VST), since we have many samples.
```{r sRNA-VST, eval=TRUE}
sRNA_vsd <- varianceStabilizingTransformation(sRNA_dds, blind=FALSE)
```
## Plotting
### Sample distances
```{r sRNA-plot-sample-distances, eval=TRUE}
sRNA_sampleDists <- dist(t(assay(sRNA_vsd)))
sRNA_sampleDistMatrix <- as.matrix( sRNA_sampleDists )
rownames(sRNA_sampleDistMatrix) <- paste( sRNA_vsd$colony.id, sRNA_vsd$time.point, sep = " - " )
colnames(sRNA_sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sRNA_sampleDistMatrix,
clustering_distance_rows = sRNA_sampleDists,
clustering_distance_cols = sRNA_sampleDists,
col = colors)
```
### PCA
Visualize sample clustering via PCA (after transformation)
```{r sRNA-pca, eval=TRUE}
# PCA with points color coded by time point
plotPCA(sRNA_vsd, intgroup = c("treatment"))
```
### Heatmap of Top 50 sRNAs
```{r sRNA-heatmap, eval=TRUE}
sRNA_counts_top50 <- order(rowMeans(counts(sRNA_dds,normalized=TRUE)),
decreasing=TRUE)[1:50]
annotation = colData(sRNA_dds) %>% as.data.frame() %>% select(treatment)
pheatmap(assay(sRNA_vsd)[sRNA_counts_top50,],
cluster_rows=FALSE,
show_rownames=FALSE,
cluster_cols=TRUE,
annotation_col = annotation)
```
# REFERENCES