Gene expression summary for Pocillopora tuahiniensis sRNA-seq data.

  • trimmed reads generated in deep-dive project

  • Reads aligned to Pocillopora meandrina transcriptome, details here

0.0.1 Install and load packages

library(tidyverse)
library(ggplot2)
library(reshape2)
library(pheatmap)
library(RColorBrewer)

1 sRNA

1.1 Load count data

Load in the sRNA count matrix generated using ShortStack. Keep in mind this data includes counts of all sRNAs, not just miRNAs. Also note, while we have 5 samples from which RNA was sequenced, only 3 of those were sequenced for sRNA in P. evermanni.

# Read in sRNA counts data
Ptuh_counts_sRNA_data_OG <- read_delim("../../../deep-dive/F-Pmea/output/13.2.1-Pmea-sRNAseq-ShortStack-31bp-fastp-merged-cnidarian_miRBase/ShortStack_out/Counts.txt", delim="\t") 
head(Ptuh_counts_sRNA_data_OG)
# A tibble: 6 × 8
  Coords               Name  MIRNA sRNA-POC-47-S1-TP2-f…¹ sRNA-POC-48-S1-TP2-f…²
  <chr>                <chr> <chr>                  <dbl>                  <dbl>
1 Pocillopora_meandri… Clus… N                       1568                   1337
2 Pocillopora_meandri… Clus… N                         24                     51
3 Pocillopora_meandri… Clus… N                        244                    526
4 Pocillopora_meandri… Clus… N                        166                    167
5 Pocillopora_meandri… Clus… N                          4                     16
6 Pocillopora_meandri… Clus… N                         35                     37
# ℹ abbreviated names: ¹​`sRNA-POC-47-S1-TP2-fastp-adapters-polyG-31bp-merged`,
#   ²​`sRNA-POC-48-S1-TP2-fastp-adapters-polyG-31bp-merged`
# ℹ 3 more variables:
#   `sRNA-POC-50-S1-TP2-fastp-adapters-polyG-31bp-merged` <dbl>,
#   `sRNA-POC-53-S1-TP2-fastp-adapters-polyG-31bp-merged` <dbl>,
#   `sRNA-POC-57-S1-TP2-fastp-adapters-polyG-31bp-merged` <dbl>

1.2 Count data munging

Ptuh_counts_sRNA <- Ptuh_counts_sRNA_data_OG

# Remove excess portions of sample column names to just "sample###"
colnames(Ptuh_counts_sRNA) <- sub("-S1-TP2-fastp-adapters-polyG-31bp-merged", "", colnames(Ptuh_counts_sRNA))
colnames(Ptuh_counts_sRNA) <- sub("sRNA-POC-", "sample", colnames(Ptuh_counts_sRNA))

# Keep just the counts and cluster names
Ptuh_counts_sRNA <- Ptuh_counts_sRNA %>% select("sample47", "sample48", "sample50", "sample53", "sample57","Name")

# I'm not going to be doing any removal of low-count sRNAs for now

# Make the cluster names our new row names
Ptuh_counts_sRNA <- Ptuh_counts_sRNA %>% column_to_rownames(var = "Name")

head(Ptuh_counts_sRNA)
          sample47 sample48 sample50 sample53 sample57
Cluster_1     1568     1337     1563     3006     3339
Cluster_2       24       51       67       54       91
Cluster_3      244      526      433      397      949
Cluster_4      166      167      273      318      333
Cluster_5        4       16       26       66       73
Cluster_6       35       37       33       58      123

1.3 Expression levels

Plot histograms of the expression levels in each sample

# Melt the count matrix into long format
Ptuh_counts_sRNA_melted <- melt(Ptuh_counts_sRNA, variable.name = "sample", value.name = "counts")

# Plot the expression level histograms for each sample
ggplot(Ptuh_counts_sRNA_melted, aes(x = counts)) +
  geom_histogram(binwidth = 1, fill = "#7A2048", color = "black") +
  scale_x_log10() +  # Optional: Log-transform the x-axis for better visualization
  facet_wrap(~sample, scales = "free_y") +
  labs(title = "Gene Expression Level Histogram for Each Sample",
       x = "Expression Level (Counts)",
       y = "Frequency") +
  theme_minimal()

1.4 Transcript counts

First let’s check the total number of transcripts in each sample – keep in mind this expression data has not been normalized yet, so there may be different totals for each sample

# Calculate the total number of transcripts for each sample
total_transcripts <- colSums(Ptuh_counts_sRNA)

# Create a data frame for plotting
total_transcripts_df <- data.frame(sample = names(total_transcripts),
                                   totals = total_transcripts)

# Plot the total number of transcripts for each sample
ggplot(total_transcripts_df, aes(x = sample, y = totals)) +
  geom_bar(stat = "identity", fill = "#7A2048", color = "black") +
  labs(title = "Total Number of Transcripts per Sample",
       x = "Sample",
       y = "Total Transcripts") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability

Now let’s check the number of unique transcripts in each sample – that is, how many unique sRNAs are expressed in each sample? This should be pretty much the same across samples, even without normalization.

# Calculate the number of unique transcripts (non-zero counts) for each sample
unique_transcripts <- colSums(Ptuh_counts_sRNA > 0)

# Create a data frame for plotting
unique_transcripts_df <- data.frame(sample = names(unique_transcripts),
                                    uniques = unique_transcripts)

# Plot the total number of unique transcripts for each sample
ggplot(unique_transcripts_df, aes(x = sample, y = uniques)) +
  geom_bar(stat = "identity", fill = "#7A2048", color = "black") +
  labs(title = "Total Number of Unique Expressed Transcripts per Sample",
       x = "Sample",
       y = "Unique Transcripts") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability

2 miRNA

2.1 Load miRNA metadata

The ShortStack output Results.txt includes all clusters of sRNA reads, including those not annotated as valid miRNAs. Now that we’ve looked at all the sRNAs a bit, let’s focus in on those classified as miRNAs.

# Join with full metadata sheet, which only contains valid miRNAs
Ptuh_metadata_miRNA <- read_csv("../../../deep-dive/DEF-cross-species/output/10-shortRNA-ShortStack-comparison/Ptuh_results_mature_named.csv") 

Ptuh_counts_sRNA <- rownames_to_column(Ptuh_counts_sRNA, var = "Name")

Ptuh_counts_miRNA <- left_join(Ptuh_metadata_miRNA, Ptuh_counts_sRNA, by = c("Name" = "Name"))

# Keep just the counts and given miRNA names (e.g., based on match to previously described miRNA)
Ptuh_counts_miRNA <- Ptuh_counts_miRNA %>% select("sample47", "sample48", "sample50", "sample53", "sample57","given_miRNA_name")

# Make the miRNA names our new row names
Ptuh_counts_miRNA <- Ptuh_counts_miRNA %>% column_to_rownames(var = "given_miRNA_name")

head(Ptuh_counts_miRNA)
                  sample47 sample48 sample50 sample53 sample57
ptuh-mir-novel-32   142599   101636   141199    84760    52034
ptuh-mir-novel-1      2201      960     1596      944      621
ptuh-mir-novel-2       824      527      640      641      337
ptuh-mir-novel-7        74       26       58       94       15
ptuh-mir-novel-3      3949     1963     2559     2213      678
ptuh-mir-novel-4      6841     6398     8987     7176     1555

2.2 Expression levels

Plot histograms of the miRNA expression levels in each sample

# Melt the count matrix into long format
Ptuh_counts_miRNA_melted <- melt(Ptuh_counts_miRNA, variable.name = "sample", value.name = "counts")

# Plot the expression level histograms for each sample
ggplot(Ptuh_counts_miRNA_melted, aes(x = counts)) +
  geom_histogram(binwidth = 1, fill = "#7A2048", color = "black") +
  scale_x_log10() +  # Optional: Log-transform the x-axis for better visualization
  facet_wrap(~sample, scales = "free_y") +
  labs(title = "miRNA Expression Level Histogram for Each Sample",
       x = "Expression Level (Counts)",
       y = "Frequency") +
  theme_minimal()

2.3 miRNA counts

First let’s check the total number of miRNAs in each sample – keep in mind this expression data has not been normalized yet, so there may be different totals for each sample

# Calculate the total number of transcripts for each sample
total_miRNA <- colSums(Ptuh_counts_miRNA)

# Create a data frame for plotting
total_miRNA_df <- data.frame(sample = names(total_miRNA),
                                   totals = total_miRNA)

# Plot the total number of transcripts for each sample
ggplot(total_miRNA_df, aes(x = sample, y = totals)) +
  geom_bar(stat = "identity", fill = "#7A2048", color = "black") +
  labs(title = "Total Number of miRNAs per Sample",
       x = "Sample",
       y = "Total miRNAs") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability

Huh, it’s interesting that samples 47, 48, and 50 had lower total sRNA transcript counts but higher total miRNA counts.

Now let’s check the number of unique miRNAs in each sample – This should be pretty much the same across samples, even without normalization.

# Calculate the number of unique transcripts (non-zero counts) for each sample
unique_miRNA <- colSums(Ptuh_counts_miRNA > 0)

# Create a data frame for plotting
unique_miRNA_df <- data.frame(sample = names(unique_miRNA),
                                    uniques = unique_miRNA)

# Plot the total number of unique transcripts for each sample
ggplot(unique_miRNA_df, aes(x = sample, y = uniques)) +
  geom_bar(stat = "identity", fill = "#7A2048", color = "black") +
  labs(title = "Total Number of Unique Expressed miRNAs per Sample",
       x = "Sample",
       y = "Unique miRNA") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability

2.4 Heatmap

heat_colors <- rev(brewer.pal(12, "RdYlBu"))

pheatmap(Ptuh_counts_miRNA,
         cluster_rows = TRUE,
         cluster_cols = TRUE,
         show_rownames = TRUE,
         show_colnames = TRUE,
         color = heat_colors,
         fontsize_row = 8,
         fontsize_col = 8)

Well… there’s like 2 miRNAs with much higher expression than the others, which is making visualizing relative differences difficult. Let’s redo the heatmap, normalizing each row to view relative difference in expression between samples (scale='row')

pheatmap(Ptuh_counts_miRNA,
         cluster_rows = TRUE,
         cluster_cols = TRUE,
         show_rownames = TRUE,
         show_colnames = TRUE,
         scale = 'row',
         color = heat_colors,
         fontsize_row = 8,
         fontsize_col = 8)

---
title: "03.1-Ptuh-sRNA-summary"
author: "Kathleen Durkin"
date: "2024-09-06"
always_allow_html: true
output: 
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
  github_document:
    toc: true
    toc_depth: 3
    number_sections: true
    html_preview: true 
---

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = TRUE,         # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  comment = ""         # Prevents appending '##' to beginning of lines in code output
)
```

Gene expression summary for *Pocillopora tuahiniensis* sRNA-seq data.

-   trimmed reads generated in `deep-dive` project

-   Reads aligned to *Pocillopora meandrina* transcriptome, details [here](https://github.com/urol-e5/deep-dive/blob/main/F-Ptuh/code/12-Ptuh-RNAseq-kallisto.md)


### Install and load packages

```{r load_libraries, inlcude = TRUE}
library(tidyverse)
library(ggplot2)
library(reshape2)
library(pheatmap)
library(RColorBrewer)
```


# sRNA

## Load count data

Load in the sRNA count matrix generated using ShortStack. Keep in mind this data includes counts of all sRNAs, not just miRNAs. Also note, while we have 5 samples from which RNA was sequenced, only 3 of those were sequenced for sRNA in P. evermanni.

```{r load-sRNA-counts}
# Read in sRNA counts data
Ptuh_counts_sRNA_data_OG <- read_delim("../../../deep-dive/F-Pmea/output/13.2.1-Pmea-sRNAseq-ShortStack-31bp-fastp-merged-cnidarian_miRBase/ShortStack_out/Counts.txt", delim="\t") 
head(Ptuh_counts_sRNA_data_OG)
```

## Count data munging

```{r sRNA-count-data-munging}
Ptuh_counts_sRNA <- Ptuh_counts_sRNA_data_OG

# Remove excess portions of sample column names to just "sample###"
colnames(Ptuh_counts_sRNA) <- sub("-S1-TP2-fastp-adapters-polyG-31bp-merged", "", colnames(Ptuh_counts_sRNA))
colnames(Ptuh_counts_sRNA) <- sub("sRNA-POC-", "sample", colnames(Ptuh_counts_sRNA))

# Keep just the counts and cluster names
Ptuh_counts_sRNA <- Ptuh_counts_sRNA %>% select("sample47", "sample48", "sample50", "sample53", "sample57","Name")

# I'm not going to be doing any removal of low-count sRNAs for now

# Make the cluster names our new row names
Ptuh_counts_sRNA <- Ptuh_counts_sRNA %>% column_to_rownames(var = "Name")

head(Ptuh_counts_sRNA)
```


## Expression levels

Plot histograms of the expression levels in each sample

```{r expression-level-histograms}
# Melt the count matrix into long format
Ptuh_counts_sRNA_melted <- melt(Ptuh_counts_sRNA, variable.name = "sample", value.name = "counts")

# Plot the expression level histograms for each sample
ggplot(Ptuh_counts_sRNA_melted, aes(x = counts)) +
  geom_histogram(binwidth = 1, fill = "#7A2048", color = "black") +
  scale_x_log10() +  # Optional: Log-transform the x-axis for better visualization
  facet_wrap(~sample, scales = "free_y") +
  labs(title = "Gene Expression Level Histogram for Each Sample",
       x = "Expression Level (Counts)",
       y = "Frequency") +
  theme_minimal()
```

## Transcript counts

First let's check the total number of transcripts in each sample -- keep in mind this expression data has *not* been normalized yet, so there may be different totals for each sample
```{r transcript-counts-plot}
# Calculate the total number of transcripts for each sample
total_transcripts <- colSums(Ptuh_counts_sRNA)

# Create a data frame for plotting
total_transcripts_df <- data.frame(sample = names(total_transcripts),
                                   totals = total_transcripts)

# Plot the total number of transcripts for each sample
ggplot(total_transcripts_df, aes(x = sample, y = totals)) +
  geom_bar(stat = "identity", fill = "#7A2048", color = "black") +
  labs(title = "Total Number of Transcripts per Sample",
       x = "Sample",
       y = "Total Transcripts") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability
```

Now let's check the number of unique transcripts in each sample -- that is, how many unique sRNAs are expressed in each sample? This should be pretty much the same across samples, even without normalization.

```{r total-unique-transcripts-plot}
# Calculate the number of unique transcripts (non-zero counts) for each sample
unique_transcripts <- colSums(Ptuh_counts_sRNA > 0)

# Create a data frame for plotting
unique_transcripts_df <- data.frame(sample = names(unique_transcripts),
                                    uniques = unique_transcripts)

# Plot the total number of unique transcripts for each sample
ggplot(unique_transcripts_df, aes(x = sample, y = uniques)) +
  geom_bar(stat = "identity", fill = "#7A2048", color = "black") +
  labs(title = "Total Number of Unique Expressed Transcripts per Sample",
       x = "Sample",
       y = "Unique Transcripts") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability
```
   

# miRNA

## Load miRNA metadata

The ShortStack output Results.txt includes all clusters of sRNA reads, including those not annotated as valid miRNAs. Now that we've looked at all the sRNAs a bit, let's focus in on those classified as miRNAs.

```{r miRNA-count-data-munging}

# Join with full metadata sheet, which only contains valid miRNAs
Ptuh_metadata_miRNA <- read_csv("../../../deep-dive/DEF-cross-species/output/10-shortRNA-ShortStack-comparison/Ptuh_results_mature_named.csv") 

Ptuh_counts_sRNA <- rownames_to_column(Ptuh_counts_sRNA, var = "Name")

Ptuh_counts_miRNA <- left_join(Ptuh_metadata_miRNA, Ptuh_counts_sRNA, by = c("Name" = "Name"))

# Keep just the counts and given miRNA names (e.g., based on match to previously described miRNA)
Ptuh_counts_miRNA <- Ptuh_counts_miRNA %>% select("sample47", "sample48", "sample50", "sample53", "sample57","given_miRNA_name")

# Make the miRNA names our new row names
Ptuh_counts_miRNA <- Ptuh_counts_miRNA %>% column_to_rownames(var = "given_miRNA_name")

head(Ptuh_counts_miRNA)
```

## Expression levels

Plot histograms of the miRNA expression levels in each sample

```{r miRNA-expression-level-histograms}
# Melt the count matrix into long format
Ptuh_counts_miRNA_melted <- melt(Ptuh_counts_miRNA, variable.name = "sample", value.name = "counts")

# Plot the expression level histograms for each sample
ggplot(Ptuh_counts_miRNA_melted, aes(x = counts)) +
  geom_histogram(binwidth = 1, fill = "#7A2048", color = "black") +
  scale_x_log10() +  # Optional: Log-transform the x-axis for better visualization
  facet_wrap(~sample, scales = "free_y") +
  labs(title = "miRNA Expression Level Histogram for Each Sample",
       x = "Expression Level (Counts)",
       y = "Frequency") +
  theme_minimal()
```

## miRNA counts

First let's check the total number of miRNAs in each sample -- keep in mind this expression data has *not* been normalized yet, so there may be different totals for each sample
```{r miRNA-counts-plot}
# Calculate the total number of transcripts for each sample
total_miRNA <- colSums(Ptuh_counts_miRNA)

# Create a data frame for plotting
total_miRNA_df <- data.frame(sample = names(total_miRNA),
                                   totals = total_miRNA)

# Plot the total number of transcripts for each sample
ggplot(total_miRNA_df, aes(x = sample, y = totals)) +
  geom_bar(stat = "identity", fill = "#7A2048", color = "black") +
  labs(title = "Total Number of miRNAs per Sample",
       x = "Sample",
       y = "Total miRNAs") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability
```
Huh, it's interesting that samples 47, 48, and 50 had lower total sRNA transcript counts but higher total miRNA counts.


Now let's check the number of unique miRNAs in each sample -- This should be pretty much the same across samples, even without normalization.

```{r total-unique-miRNA-plot}
# Calculate the number of unique transcripts (non-zero counts) for each sample
unique_miRNA <- colSums(Ptuh_counts_miRNA > 0)

# Create a data frame for plotting
unique_miRNA_df <- data.frame(sample = names(unique_miRNA),
                                    uniques = unique_miRNA)

# Plot the total number of unique transcripts for each sample
ggplot(unique_miRNA_df, aes(x = sample, y = uniques)) +
  geom_bar(stat = "identity", fill = "#7A2048", color = "black") +
  labs(title = "Total Number of Unique Expressed miRNAs per Sample",
       x = "Sample",
       y = "Unique miRNA") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability
```

## Heatmap

```{r miRNA-heatmap}
heat_colors <- rev(brewer.pal(12, "RdYlBu"))

pheatmap(Ptuh_counts_miRNA,
         cluster_rows = TRUE,
         cluster_cols = TRUE,
         show_rownames = TRUE,
         show_colnames = TRUE,
         color = heat_colors,
         fontsize_row = 8,
         fontsize_col = 8)
```
Well... there's like 2 miRNAs with much higher expression than the others, which is making visualizing relative differences difficult. Let's redo the heatmap, normalizing each row to view relative difference in expression between samples (`scale='row'`)

```{r miRNA-heatmap-rowscale}
pheatmap(Ptuh_counts_miRNA,
         cluster_rows = TRUE,
         cluster_cols = TRUE,
         show_rownames = TRUE,
         show_colnames = TRUE,
         scale = 'row',
         color = heat_colors,
         fontsize_row = 8,
         fontsize_col = 8)
```