# Pairwise Differential Gene Expression
Sarah Tanja
2025-05-08

- [Goals](#goals)
- [Load libraries](#load-libraries)
- [Load count matrix & metadata](#load-count-matrix--metadata)
- [Select groups](#select-groups)
- [Setup metadata](#setup-metadata)
- [Make DESeq Objects](#make-deseq-objects)
  - [Estimate size factors](#estimate-size-factors)
- [Visualizations](#visualizations)
  - [Count transformations](#count-transformations)
  - [PCA](#pca)
    - [4 hpf](#4-hpf)
    - [9 hpf](#9-hpf)
    - [14 hpf](#14-hpf)
  - [Heatmap](#heatmap)
    - [4 hpf](#4-hpf-1)
    - [9 hpf](#9-hpf-1)
    - [14 hpf](#14-hpf-1)
- [Run DESeq2](#run-deseq2)
  - [4 hpf](#4-hpf-2)
  - [9 hpf](#9-hpf-2)
  - [14 hpf](#14-hpf-2)

# Goals

This quick analysis pairs down the complexity of multiple embryonic
stages and multiple levels of leachate to a direct comparison of our
control group to our highest exposure group. We did not find any
significantly differently expressed genes across all embryonic stages,
we believe, due to the gene expression in our dataset is clustered by
stage, and the effect of leachate was ‘drowned out’ by the much larger
patterns of embryonic stage. This code breaks it down to look at control
vs high within each stage of embryonic development. This reduces our n
to 5 for each group and we have less power with this analysis, but it
will help us get a fuller picture of our data and what it’s telling us.

# Load libraries

``` r
library(tidyverse)
library(DESeq2)
library(vsn)
library(pheatmap)
library(RColorBrewer)
```

# Load count matrix & metadata

We want to look at the `control` group (embryos developed in 0.22 um
filtered seawater) vs. the `high` group in each embryonic stage. To do
this let’s start by filtering the count matrix to only the samples of
interest.

``` r
gcm_filt_4 <- read_csv("../output/06_tidyup/gcm_filt_4.csv")
gcm_filt_9 <- read_csv("../output/06_tidyup/gcm_filt_9.csv")
gcm_filt_14 <- read_csv("../output/06_tidyup/gcm_filt_14.csv")
```

The genes are the rows. The samples are the columns.

# Select groups

Remove low & mid so we just compare control and high

``` r
gcm_filt_4_cvh <- gcm_filt_4 %>% 
  dplyr::select(-matches("L|M"))

gcm_filt_9_cvh <- gcm_filt_9 %>% 
  dplyr::select(-matches("L|M"))

gcm_filt_14_cvh <- gcm_filt_14 %>% 
  dplyr::select(-matches("L|M"))
```

# Setup metadata

> It is absolutely critical that the columns of the count matrix and the
> rows of the column data (information about samples) are in the same
> order. DESeq2 will not make guesses as to which column of the count
> matrix belongs to which row of the column data, these must be provided
> to DESeq2 already in consistent order.

``` r
metadata_4 <- read_csv("../metadata/metadata_4.csv")
metadata_9 <- read_csv("../metadata/metadata_9.csv")
metadata_14 <- read_csv("../metadata/metadata_14.csv")
```

The samples are rows

remove low & mid

``` r
metadata_4_cvh <- metadata_4 %>% 
  filter(grepl("C|H", sample_name))

metadata_9_cvh <- metadata_9 %>% 
  filter(grepl("C|H", sample_name))

metadata_14_cvh <- metadata_14 %>% 
  filter(grepl("C|H", sample_name))
```

confirm metadata sample_name matches gene count matrix columns. The
following code checks that all the sample names in the metadata
dataframe match the column names in the gene count matrix. It should
return ‘TRUE’

``` r
all(metadata_4_cvh$sample_name == colnames(gcm_filt_4_cvh))
```

    [1] TRUE

``` r
all(metadata_9_cvh$sample_name == colnames(gcm_filt_9_cvh))
```

    [1] TRUE

``` r
all(metadata_14_cvh$sample_name == colnames(gcm_filt_14_cvh))
```

    [1] TRUE

# Make DESeq Objects

Make [DESeq Dataset Object from Count
Matrix](https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#count-matrix-input)
Create a DESeqDataSet design from gene count matrix and treatment
conditions.

``` r
#Set DESeq2 design for 1 factor with 2 levels
dds_4_cvh <- DESeqDataSetFromMatrix(countData = gcm_filt_4_cvh,
                              colData = metadata_4_cvh, 
                              design = ~ 1 + leachate)

dds_9_cvh <- DESeqDataSetFromMatrix(countData = gcm_filt_9_cvh,
                              colData = metadata_9_cvh, 
                              design = ~ 1 + leachate)

dds_14_cvh <- DESeqDataSetFromMatrix(countData = gcm_filt_14_cvh,
                              colData = metadata_14_cvh, 
                              design = ~ 1 + leachate)
```

## Estimate size factors

``` r
# Estimate size factors
dds_4_cvh <- estimateSizeFactors(dds_4_cvh)
dds_9_cvh <- estimateSizeFactors(dds_9_cvh)
dds_14_cvh <- estimateSizeFactors(dds_14_cvh)
```

# Visualizations

> In order to test for differential expression, we operate on raw counts
> and use discrete distributions… However for visualization or
> clustering – we work with transformed versions of the count data.

## Count transformations

``` r
# This is now the variance stabilized transformed DESeq Data
vsd_4_cvh <- vst(dds_4_cvh, blind=FALSE)
vsd_9_cvh <- vst(dds_9_cvh, blind=FALSE)
vsd_14_cvh <- vst(dds_14_cvh, blind=FALSE)

# This is the r-log transformed DESeq Data
#rld_cvh <- rlog(dds_cvh, blind=FALSE)
```

``` r
meanSdPlot(assay(vsd_4_cvh))
```

![](08.3_deg_pairwise_files/figure-commonmark/unnamed-chunk-10-1.png)

``` r
meanSdPlot(assay(vsd_9_cvh))
```

![](08.3_deg_pairwise_files/figure-commonmark/unnamed-chunk-10-2.png)

``` r
meanSdPlot(assay(vsd_14_cvh))
```

![](08.3_deg_pairwise_files/figure-commonmark/unnamed-chunk-10-3.png)

``` r
#meanSdPlot(assay(rld_cvh))
```

## PCA

### 4 hpf

``` r
# Perform PCA on vst transformed count data
pca_4_cvh <- prcomp(t(assay(vsd_4_cvh)), scale = TRUE)

pca_4_cvh_plot <- plotPCA(vsd_4_cvh, intgroup=c("leachate"), returnData=TRUE)

percentVar <- round(100 * attr(pca_4_cvh_plot, "percentVar"))

# Make the plot
PCA_4_CvH_plot <- ggplot(pca_4_cvh_plot, aes(PC1, PC2, 
                              color=leachate)) +
                              #geom_text(aes(label = name), size = 3) +  # Add labels from the 'name' column
                              geom_point(size=3) +
                              xlab(paste0("PC1: ",percentVar[1],"% variance")) +
                              ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
                              coord_fixed() + 
                              geom_text(aes(label = name), size = 3) +  # Add labels from the 'name' column
                              theme_minimal()

# Display the plot
PCA_4_CvH_plot
```

![](08.3_deg_pairwise_files/figure-commonmark/unnamed-chunk-12-1.png)

``` r
# Save the plot as a png
ggsave("../output/08_deg/PCA_4_control_vs_high.png", plot = PCA_4_CvH_plot, width = 8, height = 6, dpi = 600)
```

### 9 hpf

``` r
# Perform PCA on vst transformed count data
pca_9_cvh <- prcomp(t(assay(vsd_9_cvh)), scale = TRUE)

pca_9_cvh_plot <- plotPCA(vsd_9_cvh, intgroup=c("leachate"), returnData=TRUE)

percentVar <- round(100 * attr(pca_9_cvh_plot, "percentVar"))

# Make the plot
PCA_9_CvH_plot <- ggplot(pca_9_cvh_plot, aes(PC1, PC2, 
                              color=leachate)) +
                              #geom_text(aes(label = name), size = 3) +  # Add labels from the 'name' column
                              geom_point(size=3) +
                              xlab(paste0("PC1: ",percentVar[1],"% variance")) +
                              ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
                              coord_fixed() + 
                              geom_text(aes(label = name), size = 3) +  # Add labels from the 'name' column
                              theme_minimal()

# Display the plot
PCA_9_CvH_plot
```

![](08.3_deg_pairwise_files/figure-commonmark/unnamed-chunk-14-1.png)

### 14 hpf

``` r
# Perform PCA on vst transformed count data
pca_14_cvh <- prcomp(t(assay(vsd_14_cvh)), scale = TRUE)

pca_14_cvh_plot <- plotPCA(vsd_14_cvh, intgroup=c("leachate"), returnData=TRUE)

percentVar <- round(100 * attr(pca_14_cvh_plot, "percentVar"))

# Make the plot
PCA_14_CvH_plot <- ggplot(pca_14_cvh_plot, aes(PC1, PC2, 
                              color=leachate)) +
                              #geom_text(aes(label = name), size = 3) +  # Add labels from the 'name' column
                              geom_point(size=3) +
                              xlab(paste0("PC1: ",percentVar[1],"% variance")) +
                              ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
                              coord_fixed() + 
                              geom_text(aes(label = name), size = 3) +  # Add labels from the 'name' column
                              theme_minimal()

# Display the plot
PCA_14_CvH_plot
```

![](08.3_deg_pairwise_files/figure-commonmark/unnamed-chunk-15-1.png)

## Heatmap

### 4 hpf

``` r
# 1. Select the top 50 genes by mean expression
top_genes <- order(rowMeans(counts(dds_4_cvh, normalized = TRUE)), decreasing = TRUE)[1:50]

# 2. Extract vs-transformed matrix
vst_matrix <- assay(vsd_4_cvh)[top_genes, ]

# 3. Extract only leachate annotation
annotation_df <- as.data.frame(colData(dds_4_cvh)[, "leachate", drop = FALSE])

# Ensure it's a factor in correct order: control, high
annotation_df$leachate <- factor(annotation_df$leachate, levels = c("control", "high"))

# 4. Order samples by pvc_leachate_level
ordered_cols <- order(annotation_df$leachate)

# Reorder matrix and annotation
vst_matrix_ordered <- vst_matrix[, ordered_cols]
annotation_ordered <- annotation_df[ordered_cols, , drop = FALSE]

# 5. Plot heatmap
heatmap_colors <- colorRampPalette(rev(brewer.pal(n = 9, "RdBu")))(100)

pheatmap(
  vst_matrix_ordered,
  cluster_rows = FALSE,
  cluster_cols = FALSE,
  show_rownames = FALSE,
  annotation_col = annotation_ordered,
  color = heatmap_colors,
  main = "Top 50 Most Expressed Genes (Control vs High) at 4 hpf"
)
```

![](08.3_deg_pairwise_files/figure-commonmark/unnamed-chunk-16-1.png)

### 9 hpf

``` r
# 1. Select the top 50 genes by mean expression
top_genes <- order(rowMeans(counts(dds_9_cvh, normalized = TRUE)), decreasing = TRUE)[1:50]

# 2. Extract vs-transformed matrix
vst_matrix <- assay(vsd_9_cvh)[top_genes, ]

# 3. Extract only leachate annotation
annotation_df <- as.data.frame(colData(dds_9_cvh)[, "leachate", drop = FALSE])

# Ensure it's a factor in correct order: control, high
annotation_df$leachate <- factor(annotation_df$leachate, levels = c("control", "high"))

# 4. Order samples by pvc_leachate_level
ordered_cols <- order(annotation_df$leachate)

# Reorder matrix and annotation
vst_matrix_ordered <- vst_matrix[, ordered_cols]
annotation_ordered <- annotation_df[ordered_cols, , drop = FALSE]

# 5. Plot heatmap
heatmap_colors <- colorRampPalette(rev(brewer.pal(n = 9, "RdBu")))(100)

pheatmap(
  vst_matrix_ordered,
  cluster_rows = FALSE,
  cluster_cols = FALSE,
  show_rownames = FALSE,
  annotation_col = annotation_ordered,
  color = heatmap_colors,
  main = "Top 50 Most Expressed Genes (Control vs High) at 9 hpf"
)
```

![](08.3_deg_pairwise_files/figure-commonmark/unnamed-chunk-17-1.png)

### 14 hpf

``` r
# 1. Select the top 50 genes by mean expression
top_genes <- order(rowMeans(counts(dds_14_cvh, normalized = TRUE)), decreasing = TRUE)[1:50]

# 2. Extract vs-transformed matrix
vst_matrix <- assay(vsd_14_cvh)[top_genes, ]

# 3. Extract only leachate annotation
annotation_df <- as.data.frame(colData(dds_14_cvh)[, "leachate", drop = FALSE])

# Ensure it's a factor in correct order: control, high
annotation_df$leachate <- factor(annotation_df$leachate, levels = c("control", "high"))

# 4. Order samples by pvc_leachate_level
ordered_cols <- order(annotation_df$leachate)

# Reorder matrix and annotation
vst_matrix_ordered <- vst_matrix[, ordered_cols]
annotation_ordered <- annotation_df[ordered_cols, , drop = FALSE]

# 5. Plot heatmap
heatmap_colors <- colorRampPalette(rev(brewer.pal(n = 9, "RdBu")))(100)

pheatmap(
  vst_matrix_ordered,
  cluster_rows = FALSE,
  cluster_cols = FALSE,
  show_rownames = FALSE,
  annotation_col = annotation_ordered,
  color = heatmap_colors,
  main = "Top 50 Most Expressed Genes (Control vs High) at 14 hpf"
)
```

![](08.3_deg_pairwise_files/figure-commonmark/unnamed-chunk-18-1.png)

# Run DESeq2

## 4 hpf

``` r
# Wald test for leachate
dds_4_cvh_deseq <- DESeq(dds_4_cvh, minReplicatesForReplace = Inf)

# Count DEGs due to interaction
DESeq2_4_results <- results(dds_4_cvh_deseq, lfcThreshold = 0, alpha = 0.05)

summary(DESeq2_4_results)
```


    out of 14183 with nonzero total read count
    adjusted p-value < 0.05
    LFC > 0 (up)       : 0, 0%
    LFC < 0 (down)     : 0, 0%
    outliers [1]       : 31, 0.22%
    low counts [2]     : 0, 0%
    (mean count < 1)
    [1] see 'cooksCutoff' argument of ?results
    [2] see 'independentFiltering' argument of ?results

## 9 hpf

``` r
# Wald test for leachate
dds_9_cvh_deseq <- DESeq(dds_9_cvh, minReplicatesForReplace = Inf)

# Count DEGs due to interaction
DESeq2_9_results <- results(dds_9_cvh_deseq, lfcThreshold = 0, alpha = 0.05)

summary(DESeq2_9_results)
```


    out of 17368 with nonzero total read count
    adjusted p-value < 0.05
    LFC > 0 (up)       : 0, 0%
    LFC < 0 (down)     : 0, 0%
    outliers [1]       : 26, 0.15%
    low counts [2]     : 0, 0%
    (mean count < 1)
    [1] see 'cooksCutoff' argument of ?results
    [2] see 'independentFiltering' argument of ?results

## 14 hpf

``` r
# Wald test for leachate
dds_14_cvh_deseq <- DESeq(dds_14_cvh, minReplicatesForReplace = Inf)

# Count DEGs due to interaction
DESeq2_14_results <- results(dds_14_cvh_deseq, lfcThreshold = 0, alpha = 0.05)

summary(DESeq2_14_results)
```


    out of 21447 with nonzero total read count
    adjusted p-value < 0.05
    LFC > 0 (up)       : 0, 0%
    LFC < 0 (down)     : 0, 0%
    outliers [1]       : 303, 1.4%
    low counts [2]     : 0, 0%
    (mean count < 2)
    [1] see 'cooksCutoff' argument of ?results
    [2] see 'independentFiltering' argument of ?results

> [!IMPORTANT]
>
> There are zero differentially expressed genes between the control and
> high treatment groups within all developmental stages.