---
title: "Step 9: Multivariate Differential Gene Expression"
subtitle: "Following MarineOmics Functional Genomics Tutorial step-by-step`"
author: "Sarah Tanja"
date: 2024-12-02
date-modified: today
format:
  gfm: default  
toc: true
toc-depth: 2
link-external-icon: true
link-external-newwindow: true
reference-location: margin
citation-location: margin
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = TRUE,        # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  comment = ""         # Prevents appending '##' to beginning of lines in code output
)
```

# Goals

To follow the [MarineOmics Fitting Multifactorial Models of Differential Expression](https://marineomics.github.io/DGE_comparison_v2.html) walkthrough step-by-step with the gene count matrix from the [coral-embryo-RNAseq](https://github.com/sarahtanja/coral-embryo-RNAseq) project. I've copied and pasted many of the tutorial instructions and explanations in order for me to understand my data in context. Anything copied from [MarineOmics](https://marineomics.github.io/index.html) is in blockquotes. I have simply replaced their `pC02` environmental condition variable with `leachate` and their `day` time variable with hours post-fertilization, `hpf` .

# Install packages

```{r}
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("DESeq2", "vsn", "edgeR", "arrayQualityMetrics"))
```

```{r}
if ("tidyverse" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('tidyverse')
if ("DESeq2" %in% rownames(installed.packages()) == 'FALSE')
  install.packages('DESeq2')
if ("edgeR" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('edgeR')
if ("ape" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('ape')
if ("vegan" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('vegan')
if ("GGally" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('GGally')
if ("arrayQualityMetrics" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('arrayQualityMetrics')
if ("rgl" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('rgl')
if ("MASS" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('MASS')
if ("data.table" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('data.table')
if ("plyr" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('plyr')
if ("lmtest" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('lmtest')
if ("reshape2" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('reshape2')
if ("Rmisc" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('Rmisc')
if ("statmod" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('statmod')
if ("stringr" %in% rownames(installed.packages()) == 'FALSE') 
  install.packages('stringr')
```

# Load packages

```{r}
# Load packages
 library(DESeq2)
library(edgeR)
library(tidyverse)
library(ape)
library(vegan)
library(GGally)
library(arrayQualityMetrics)
library(rgl)
#library(adegenet)
library(MASS)
library(data.table)
library(plyr)
library(lmtest)
library(reshape2)
library(Rmisc)
#library(lmerTest)
library(statmod)
library(stringr)
```

# Import gene counts

In this count matrix, each row represents a gene, each column a sequenced RNA library, and the values give the estimated counts of fragments that were assigned to the respective gene in each library by `HISAT2`

-   `check.names = FALSE` removed the 'X' from in front of sample id's in the column headers

```{r}
gcm <- read.csv("../output/05_count/gene_count_matrix.csv", row.names="gene_id", check.names = FALSE)
```

::: callout-note
`gcm` is the unfiltered gene count matrix of all 63 samples
:::

-   The raw unfiltered, untransformed, count matrix shows counts of whole numbers (integers)
-   Each column is a sample
-   Each row is a gene
-   A sample id describes parental crosses (the first string of numbers), the pollution exposure (C, L, M, H), and the embryonic stage (4, 9, 14)

```{r}
gcm_filt <- read_csv("../output/06_tidyup/gcm_filt.csv")
```

::: callout-note
`gcm_filt` is the gene count matrix that has been filtered to remove low gene counts, and 2 outlier samples ( `131415L4` and `789C4` ) that had poor HISAT alignment and did not group in the exploratory PCA
:::

# Import metadata

```{r}
metadata <- read_csv("../metadata/metadata.csv")
```

::: callout-note
`metadata` is all 63 samples
:::

```{r}
metadata_filt <- read_csv("../metadata/metadata_filt.csv")

# Reorder factors
metadata_filt$embryonic_stage <- as.factor(metadata_filt$embryonic_stage) %>% fct_relevel("cleavage", "prawnchip", "earlygastrula")
 
metadata_filt$pvc_leachate_level <- as.factor(metadata_filt$pvc_leachate_level) %>% fct_relevel("control", "low", "mid", "high")

metadata_filt$hpf <- as.factor(metadata_filt$hpf) %>%  fct_relevel("4", "9", "14")
```

::: callout-note
`metadata_filt` is prefiltered to remove low gene counts and has 2 outlier samples removed, containing info for 61 samples
:::

# Filter and visualize read counts

#### Plot distribution of unfiltered read counts across all samples

```{r distr_unfiltered_reads}

ggplot(data = data.frame(rowMeans(gcm)),
  aes(x = rowMeans.gcm.)) +
  geom_histogram(fill = "grey", binwidth = 5) +
  xlim(0, 500) +
  ylim(0, 5000) +
  theme_classic() +
  labs(title = "Distribution of unfiltered reads") +
  labs(y = "Density", x = "Raw read counts",
  title = "Read count distribution: untransformed, unnormalized, unfiltered")
  
```

::: callout-note
As you can see in the above plot, the raw distribution of all read counts takes on a left-skewed negative binomial distribution.
:::

> Most DE packages assume that read counts possess a negative binomial distribution. The negative binomial distribution is an extension of distributions for binary variables such as the Poisson distribution, allowing for estimations of "equidispersion" and "overdispersion", equal and greater-than-expected variation in expression attributed to biological variability. - [MarineOmics](https://marineomics.github.io/DGE_comparison_v2.html#:~:text=library(lmerTest)-,Filter%20and%20visualize%20read%20counts,-Before%20model%20fitting)

> While looking at the distribution of your raw reads is useful, these are not in fact the data you will be inputting to tests of differential expression. Let's plot the distribution of filtered reads normalized by library size, expressed as log2 counts per million reads (logCPM). These are the reads we will use in our test, and after we plot their distribution, we will conduct one more, slightly more robust, test of our data's fit to the negative binomial distribution using residuals from fitted models. - [MarineOmics](https://marineomics.github.io/DGE_comparison_v2.html#:~:text=While%20looking%20at,from%20fitted%20models.)

#### Distribution of normalized, filtered read counts

```{r}
# Make a DGEList object for edgeR
y <- DGEList(counts = gcm_filt, remove.zeros = TRUE)

# Let's remove samples with less than 0.5 cpm (this is ~10 counts in the count file) in fewer then 3/60 samples
keep <- rowSums(cpm(y) > .5) >= 3

table(keep)
```

::: callout-important
22,632 genes were retained after removing outlier samples and filtering for low gene counts
:::

```{r}
# Set keep.lib.sizes = F and recalculate library sizes after filtering
y <- y[keep, keep.lib.sizes = FALSE]

y <- calcNormFactors(y)

# Calculate logCPM
df_log <- cpm(y, log = TRUE, prior.count = 2)

# Plot distribution of filtered logCPM values
ggplot(data = data.frame(rowMeans(df_log)), 
       aes(x = rowMeans.df_log.) ) +
  geom_histogram(fill = "grey") +
  theme_classic() +
  labs(y = "Density", x = "Filtered read counts (logCPM)",
       title = "Distribution of normalized, filtered read counts")

```

::: callout-note
| Region of Plot             | Interpretation                                        |
|-------------------------|-----------------------------------------------|
| **Left tail** (steep drop) | Near-zero or low-expression genes, NB distribution    |
| **Middle bulk**            | Actively expressed genes --- roughly normal in logCPM |
| **Right tail**             | Highly expressed genes (often a small fraction)       |
:::

[![Example of filtered read counts from MarineOmics tutorial](marine-omics-filt-counts-dist.jpg)](https://marineomics.github.io/FUN_01_DGE_comparison_v2_files/figure-html/unnamed-chunk-5-1.png)

# MDS plot visualizing experimental factors

> Before analyzing our data, it is essential that we look at the multivariate relationships between our samples based on transcriptome-wide expression levels. Below is example code and output for a principal coordinates analysis (PCOA) plot that visualizes multifactorial RNA-seq replicates according to two predictor variables across major and minor latent variables or PCOA axes. - [MarineOmics](https://marineomics.github.io/DGE_comparison_v2.html#Filter_and_visualize_read_counts:~:text=negative%20binomial%20GLMs.-,MDS%20plot%20visualizing%20experimental%20factors,-Before%20analyzing%20our)

```{r}
# Export pcoa loadings
dds.pcoa = pcoa(vegdist(t(df_log <- cpm(y, log = TRUE, prior.count = 2)),
                          method = "euclidean") / 1000)

# Create df of MDS vector loading
scores <- dds.pcoa$vectors

## Plot pcoa loadings of each sample, grouped by hpf and pvc_leachate_level

# Calculate % variation explained by each eigenvector
percent <- dds.pcoa$values$Eigenvalues
cumulative_percent_variance <- (percent / sum( percent)) * 100

# Prepare information for pcoa plot, then plot
color <- c("steelblue1", "tomato1", "goldenrod1")
par(mfrow = c(1, 1))
plot(
  scores[, 1],
  scores[, 2],
  cex = .5,
  cex.axis = 1,
  cex.lab = 1.25,
  xlab = paste("PC1, ", round(cumulative_percent_variance[1], 2), "%"),
  ylab = paste("PC2, ", round(cumulative_percent_variance[2], 2), "%")
  )

# Add visual groupings to pcoa plot
ordihull(
  scores,
  as.factor(metadata_filt$hpf),
  border = NULL,
  lty = 2,
  lwd = .5,
  label = F,
  col = color,
  draw = "polygon",
  alpha = 100,
  cex = .5
  )

ordispider(scores, as.factor(metadata_filt$group), label = F) # Vectors connecting samples in same hpf x leachate group

ordilabel(scores, cex = 0.5) # Label sample IDs
```

::: callout-note
Our data groups largely by hours post fertilization (hpf), the blue group is 4 hpf, red is 9 hpf, and yellow is 14 hpf.
:::

```{r}
logCPM.pca <- prcomp(t (df_log))
logCPM.pca.proportionvariances <-
((logCPM.pca$sdev ^ 2) / (sum(logCPM.pca$sdev ^ 2))) * 100

## Do treatment groups fully segregate? Wrap samples by hpf x leachate, not just hpf
# Replot using logCPM.pca
plot(
  logCPM.pca$x,
  type = "n",
  main = NA,
  xlab = paste("PC1, ", round(logCPM.pca.proportionvariances[1], 2), "%"),
  ylab = paste("PC2, ", round(logCPM.pca.proportionvariances[2], 2), "%")
  )

points(logCPM.pca$x,
       col = "black",
       pch = 16,
       cex = 1)
       colors2 <-
       c("steelblue1",
         "dodgerblue2",
         "tomato1",
         "coral",
         "goldenrod1",
         "goldenrod3")
       
       ordihull(
       logCPM.pca$x,
       metadata_filt$group,
       border = NULL,
       lty = 2,
       lwd = .5,
       col = colors2,
       draw = "polygon",
       alpha = 75,
       cex = .5,
       label = T
       )
```

::: callout-note
Difficult to ascertain any leachate patterns within hours post fertilization (embryonic stage) groupings
:::

# Non-linear effects

> One of the simplest non-linear relationships that can be fitted to the expression of a transcript across an continuous variable is a **second-order polynomial, otherwise known as a quadratic function, which can be expressed as yi=μ+β1x2+β2xyi=μ+β1x2+β2x where yy is the abundance of a given transcript (ii), μμ is the intercept, and yy is the continuous variable.** For the parabola generated by fitting a second-order polynomial, β1β1 \> 0 opens the parabola upwards while β1β1 \< 0 opens the parabola downwards. The vertex of the parabola is controlled by β2β2 such that when β1β1 is negative, greater β2β2 values result in the vertex falling at higher values of xx.
>
> Quadratic polynomials applied to phenotypic performance curves commonly possess negative β1β1 values with positive β2β2 values: a downard-opening parabola with a positive vertex. However, **quadtratic curves fitted to gene expression data can take on a variety of postiive or negative forms similar to exponential curves, saturating curves, and parabolas.** For instance, the expression of a gene may peak an intermediate level of an environmental level before crashing or it may exponentially decline across that variable. Such trends may better model changes in the transcription of a gene compared to a linear model. To get started, we will fit non-linear second order polynomials before testing for whether model predictions for a given gene are significantly improved by a non-linear linear model. - [MarineOmics](https://marineomics.github.io/DGE_comparison_v2.html#Filter_and_visualize_read_counts:~:text=for%20differential%20expression.-,Non%2Dlinear%20effects,-Gene%20expression%20traits)

![From Rivera et al. 2021 \"Transcriptomic resilience and timing. (a) Gene expression reaction norms of four strategies during recovery after a stressor. We use triangles again for patterns that may confer tolerance and circles for patterns associated with stress sensitivity. While all triangle paths show a return to baseline (resilience) the pink (frontloading) and yellow (dampening) are also depicting differences in baseline and plasticity and are therefore labelled differently. (b) Adapted from the rolling ball analogy commonly used for ecological resilience and depicted in Hodgson et al. (2015). Each ball represents a gene showing a color-matched expression pattern in (a). Landscapes represent expression possibilities during a stress event. In the absence of stress, the ball will settle in a trough, representing baseline expression levels. Elasticity (rate of return to the baseline) is represented by the size of the arrow (i.e., larger arrows have faster rates of return). Pink dotted line is the expression landscape for the frontloaded ball. (c) Using Torres et al. (2016) loops through disease space as an alternative framework of an organism\'s path through stress response and recovery. The color gradient represents the resulting phenotype for a given path through stress and recovery space, though x-and y-axis can denote any two parameters that are correlated but with a time lag.\"](rivera2021-nonlinear-concept.jpg)

## Non-linear effects: example in edgeR

> Let's fit a second-order polynomial for the effect of leachate/leachate using edgeR. Using differential expression tests, we will then determine whether leachate/leachate affected a gene's rate of change in expression and expression maximum by applying differential expression tests to β1 and β2 parameters. By testing for differential expression attributed to intereactions between time and β1 or β2 , we will then test for whether these parameters were significantly different across exposure times such that 0.5 hpfs and 7 hpfs of acclimation to leachate/leachate altered the rate of change in expression across leachate/leachate (β1 ) or the maximum of expression (β2 ).
>
> Since this is the first time we're fitting models to our data in this walkthrough, we will also output some important plots to help us explore and QC model predictions as we work toward testing for differential expression attributed to non-linear effects. - [MarineOmics](https://marineomics.github.io/DGE_comparison_v2.html#Non-linear_effects:~:text=linear%20linear%20model.-,Non%2Dlinear%20effects%3A%20example%20in%20edgeR,-Let%E2%80%99s%20fit%20a)

```{r}
# Estimate dispersion coefficients
y1 <- estimateDisp(y, robust = TRUE) # Estimate mean dispersal

# Plot tagwise dispersal and impose w/ mean dispersal and trendline
plotBCV(y1) 
```

> The above figure is an output of the plotBCV() function in edgeR, which visualises the biological coefficient of variation (otherwise known as dispersion) parameter fitted to individual genes (black circles), fitted across gene expression level (blue trendline), and averaged across the entire transcriptome (red horizontal line).
>
> The BCV/dispersion parameter is a necessary parameter, symbolized as θ in model specification, that must be estimated in negative binomial GLMs. θ represents the 'overdispersion' or the shape of a negative binomial distribution and is thus important for defining the distribution of read count data for a given gene during model fitting. Dispersion or θ can be input during model fitting in edgeR using the three estimates visualized in the above regression.

::: callout-note
Low expression (logCPM \< \~2):

\- Large variability in dispersion estimates (spread-out dots)

\- This is expected: low counts are noisy and unstable

\- Many of these genes may get downweighted or filtered

Medium-to-high expression:

\- Dots cluster more tightly around the trend.

\- The blue line slopes down --- dispersion tends to decrease with increasing expression, then flattens out.

Red vs Blue lines:

\- A common dispersion (red) that is higher or lower than the trend (blue), could indicate a poor fit or biological overdispersion.
:::

![MarineOmics biological coefficient of variation plot for comparison](marine-omics-plotBCV.jpg)

> Now that we've estimated dispersion, we will use these estimates to fit negative binomial GLMs to our read count data using the glmQLFit() function in edgeR, one of the more robust model fitting functions offered by edgeR

Using numeric values instead of factors so...

-   `hpf` (... our time component!)

    -   hours post fertilization (4, 9, & 14) associated with specific embryonic development stages (cleavage, prawn chip, early gastrulation)

    -   **NOT** a 1-to-1 equivalent to the tutorial...

-   `leachate` (... our environmental condition!)

    -   equivalent to pCO2 in the tutorial

```{r}
# Square environmental condition variable
metadata_filt <- metadata_filt %>% 
  mutate(leachate_2 = leachate^2)

# pull out variables values for model
hpf <- metadata_or$hpf
leachate <- metadata_or$pvc_leachate_concentration_mg.L
leachate_2 <- metadata_or$leachate_2

# Fit multifactorial design matrix
design_nl <- model.matrix(~ 1 + leachate_2 + leachate + leachate_2:hpf + leachate:hpf) 
# Generate multivariate edgeR glm

# Fit quasi-likelihood, neg binom linear regression
nl_fit <-
    glmQLFit(y1, design_nl) # Fit multivariate model to counts

## Test for effect of leachate/leachate and leachate/leachate^2
nl_leachate_2 <-
    glmQLFTest(nl_fit,
    coef = 2,
    contrast = NULL,
    poisson.bound = FALSE) # Estimate significant DEGs
nl_leachate <-
    glmQLFTest(nl_fit,
    coef = 3,
    contrast = NULL,
    poisson.bound = FALSE) # Estimate significant DEGs

# Make contrasts
is.de_nl_leachate <-
  decideTestsDGE(nl_leachate, adjust.method = "fdr", p.value = 0.05)
is.de_nl_leachate_2 <-
  decideTestsDGE(nl_leachate_2, adjust.method = "fdr", p.value = 0.05)
```

```{r}
# Summarize differential expression attributed to leachate
summary(is.de_nl_leachate)
```

```{r}
# Summarize differential expression attributed to leachate_2
summary(is.de_nl_leachate_2)
```

We have just fit our first GLM to our read count data and have tested for differential expression across `leachate/leachate`/`leachate` . At this stage, it is important to output a few diagnostic plots. For example, edgeR has the function 'plotMD()' which, when input with a differential expression test object such as a glmQLFTest result, will produce a plot of differential expression log2 fold change values across gene expression level (log2 counts per million or logCPM). logCPM is a major component of gene function and statistical power, and is a useful variable to plot in order to make initial assessments of differential expression results.

Visualizations of differential expression (logFC) across baseline logCPM can be produced by the plotMD() (short for "mean-difference") function in edgeR. To do so, plotMD() is input with the results of glmQLFTest, which we used above to test for differential expression attributable to the parameters leachate/leachate and leachate/leachate\^2. Let's produce two such plots for both parameters where "leachate/leachate_2" represents the leachate/leachate\^2 parameter:

```{r}
# Plot differential expression due to leachate and leachate^2
plotMD(nl_leachate)
```

```{r}
plotMD(nl_leachate_2)
```

In the two mean-difference plots above, red and blue colored points represent genes exhibiting significant differential expression attributed to `leachate` or `leachate_2` . As you can see, there is a lot of differential expression in this experiment resulting from both of these parameters. **Part of the reason for this large number of significant differentially expressed genes (DEGs) is that we have not applied a cutoff value for logFC.** For example, a logFC cutoff of 2.0 value would not designate any genes with an absolute logFC value less than 0.05 as a significant DEG. However, deciding on a logFC cutoff is very tricky! Since we used a continuous predictor for `leachate` during model fitting, the values in the y-axis of this plot are slopes representing the rate of change in expression per unit `leachate` (mg/L). Determining what slope represents a significant change in expression requires informed reasoning and a strong body of prior data. For that reason, we are not applying a logFC cutoff in this walkthrough.

Next, we will use glmQLFTest() to test for differential expression attributed to interactions between `hpf` and `leachate` or `hpf` and `leachate_2` before producing mean-difference plots visualizing DEGs derived from these two interactions.

`glmQLFTest()` is an `edgeR` function for a Generalized Linear Model Quasi-Likelihood F-test, which tests for **differential expression** between conditions using **quasi-likelihood F-tests** in a **negative binomial generalized linear model (GLM)** framework.

```{r}
## Test for interactions between time and leachate or leachate^2
nl_leachate_int <-
  glmQLFTest(nl_fit,
  coef = 4,
  contrast = NULL,
  poisson.bound = FALSE) # Estimate significant DEGs
nl_leachate_2_int <-
  glmQLFTest(nl_fit,
  coef = 5,
  contrast = NULL,
  poisson.bound = FALSE) # Estimate significant DEGs
  
# Make contrasts
is.de_nl_leachate_int <-
  decideTestsDGE(nl_leachate_int, adjust.method = "fdr", p.value = 0.05) # Make contrasts
is.de_nl_leachate_2_int <-
  decideTestsDGE(nl_leachate_2_int, adjust.method = "fdr", p.value = 0.05)
```

```{r}
# Summarize differential expression attributed to leachate 
summary(is.de_nl_leachate_int)
```

```{r}
# Summarize differential expression attributed to leachate_2 
summary(is.de_nl_leachate_2_int)
```

```{r}
plotMD(nl_leachate_int)
```

```{r}
plotMD(nl_leachate_2_int)
```

> Taking a detour from tests for DEGs attributed to non-linear interactions, we can now output the residuals of our GLMs in order to better test for whether our data fit the assumptions of negative binomial distribution families. If the distribution of residuals from our GLMs are normal, this indicates that our data meet the assumption of the negative binomial distribution. We will use the equation for estimating Pearson residuals:
>
> $$
> residual=\frac{observed−fitted}{\sqrt{fitted(dispersion∗fitted)}}
> $$

```{r}
# Output observed
y_nl <- nl_fit$counts

# Output fitted
mu_nl <- nl_fit$fitted.values

# Output dispersion or coefficient of variation
phi_nl <- nl_fit$dispersion

# Calculate denominator
v_nl <- mu_nl*(1+phi_nl*mu_nl)

# Calculate Pearson residual
resid.pearson <- (y_nl-mu_nl) / sqrt(v_nl)

# Plot distribution of Pearson residuals
ggplot(data = melt(as.data.frame(resid.pearson)), aes(x = value)) +
  geom_histogram(fill = "grey") +
  xlim(-2.5, 5.0) +
  theme_classic() +
  labs(title = "Distribution of negative binomial GLM residuals",
       x = "Pearson residuals",
       y = "Density")

```

::: callout-note
Our residuals appear to be normally distributed, indicating that our data fit the negative binomial distribution assumed by the GLM.
:::

Now let's get back to analyzing non-linear effects on gene expression!

## Plotting non-linear effects

```{r}
## Bin transcripts based on (i) whether they have a significant positive or negative vertex and then (ii) whether they showed significant interactions between beta1 (vertex value) and time.

# Export diff expression data for transcripts with significant DE associated with leachate^2 parameter
nl_leachate_2_sig <-
  topTags(
  nl_leachate_2,
  n = (2489 + 1922), # the number of up and down degs from is.de_nl_leachate_2
  adjust.method = "BH",
  p.value = 0.05
  )

nl_leachate_2_sig_geneids <- row.names(nl_leachate_2_sig) #Output a list of geneids associated with sig leachate^2 effect

nl_leachate_sig <-
  topTags(
  nl_leachate,
  n = (3352 + 2934), # the number of up and down degs from is.de_nl_leachate
  adjust.method = "BH",
  p.value = 0.05
  )

nl_leachate_sig_geneids <- row.names(nl_leachate_sig) #Output a list of geneids associated with sig leachate effect

# Create tabulated dataframe of mean expression across each leachate level with metadata for transcript ID and timepoint
logCPM_df <- as.data.frame(df_log)

# Create tabularized df containing all replicates using 'melt' function in reshape2
logCPM_df$geneid <- row.names(logCPM_df)

tab_exp_df <- melt(logCPM_df,
                   id = c("geneid"))

# Add covariate information for hpf and leachate
tab_exp_df <- tab_exp_df %>%
  mutate(
    leachate = str_extract(variable, "[CLMHN]"),      # extract treatment letter
    hpf = as.numeric(str_extract(variable, "\\d+$"))  # extract trailing number (e.g. 4, 9, 14)
  )

# Correct leachates to exact values
tab_exp_df <- tab_exp_df %>%
  mutate(
    leachate_code = leachate,  # preserve original letter
    leachate = recode(leachate_code,
      C = 0,
      L = 0.01,
      M = 0.1,
      H = 1
    )
  )

# Create binary variable in df_all_log for significant non-linear expression
tab_exp_df$leachate_2_sig <-
  ifelse(tab_exp_df$geneid %in% nl_leachate_2_sig_geneids, "Yes", "No")
tab_exp_df$leachate_sig <-
  ifelse(tab_exp_df$geneid %in% nl_leachate_sig_geneids, "Yes", "No")

# Create a binary variable related to up or down-regulation
up_genes <- filter(nl_leachate_sig$table, logFC > 0)

tab_exp_df$logFC_dir <-
  ifelse(tab_exp_df$geneid %in% row.names(up_genes), "Up", "Down")

# Add geneid to nl_leachate_int$coefficients
nl_leachate_int$coefficients$geneid <- row.names(nl_leachate_int$coefficients)

# Estimate average logCPM per gene per timepoint
tab_exp_avg <- summarySE(
  measurevar = "value",
  groupvars = c("leachate",     "hpf",       "geneid", 
                "leachate_sig", "leachate_2_sig", "logFC_dir"),
  data = tab_exp_df
  )

# First exploratory plot of non-linear expression grouping by exposure time and direction of differential expression
ggplot(data = filter(tab_exp_avg, leachate_2_sig == "Yes"), 
       aes(x = leachate, y = value)) +
  geom_path(
    alpha = 0.01,
    size = 0.25,
    stat = "identity",
    aes(group = as.factor(geneid))) +
  facet_grid(logFC_dir ~ hpf) +
  theme_classic() +
  theme(strip.background = element_blank()) +
  labs(y = "Avg logCPM", title = "Non-linear changes in GE output by edgeR")
```

> Our plots of non-linear changes in gene expression appear to include many trends that appear... well, linear. This is a pervasive issue in modeling non-linear regressions, and one potential pitfall of using outputs from packages such as edgeR or DESeq2 alone in testing for non-linear effects. The FDR-adjusted p -values we have used determine significance of non-linear effects essentially tell us the probability that a parameter value equal to or greater to what we have fitted could be generated given a random distribution of read counts. The p -value is not a representation of the strength of a non-linear effect relative to a linear effect. Numerous genes that nominally show significant non-linear effects of leachate may be only weakly affected, and a linear effect may in fact be more probable than a non-linear one despite what our p-values tell us. Instead of asking "for what genes may there be significant, non-linear effects of leachate ?", we should ask "for what genes should we test for significant, non-linear effects?".

> One of the best ways to determine whether a non-linear model is appropriate for a transcript is to determine whether it is more probable that its expression is linear or non-linear relative to a continuous predictor. We can calculate this relative probability using a likelihood ratio test (LRT). In the code chunk below, we will fit linear and 2nd order non-linear models to the expression of each gene before applying LRTs to each transcript. We will then further filter our edgeR dataset based on (i) significant differential expression attributed to non-linear effects and (ii) a significant LRT ratio supporting non-linear effects. Then, we will replot the expression levels of this re-filtered set. The code below fits gaussian linear models to log2-transformed CPM values, but can be adjusted to fit negative binomial GLMs to untransformed CPM similar to edgeR and DESeq2 by setting using the MASS package to set 'family = negative_binomial(theta = θ)' where θ = the dispersion estimate or biological coefficient of variation for a given transcript.

```{r}
## Using dlply, fit linear and non-linear models to each gene
# Create leachate^2 variable in df_all_log
tab_exp_df$leachate_2 <- tab_exp_df$leachate^2

# Fit linear models - should take about 4 minutes
lms <- dlply(tab_exp_df, c("geneid"), function(df) 
lm(value ~ leachate + hpf + leachate:hpf, data = df))

# Fit non-linear models - should take about 2 minutes
nlms <- dlply(tab_exp_df, c("geneid"), function(df) 
lm(value ~ leachate + leachate_2 + hpf + leachate:hpf + leachate_2:hpf, data = df))

# Output nlm coefficients into dataframe
nlms_coeff <- ldply(nlms, coef)
head(nlms_coeff)
```

```{r}
## Apply LRTs to lm's and nlm's for each transcript - should take about 2 minutes
lrts <- list() # Create list to add LRT results to

for (i in 1:length(lms)) {
  lrts[[i]] <- lrtest(lms[[i]], nlms[[i]]) # Apply LRTs with for loop
}

## Filter lrt results for transcripts with significantly higher likelihoods of nl model
lrt_dfs <- list()

# Turn list of LRT outputs into list of dataframes containing output info
for (i in 1:length(lrts)) {
  lrt_dfs[[i]] <- data.frame(lrts[i])
}

# Create singular dataframe with geneids and model outputs for chi-squared and LRT p-value
lrt_coeff_df  <- na.omit(bind_rows(lrt_dfs, .id = "column_label")) # na.omit removes first row of each df, which lacks these data

# Add geneid based on element number from original list of LRT outputs
lrt_coeff_df <- merge(lrt_coeff_df,
                      data.frame(geneid = names(nlms),
                      column_label = as.character(seq(length(
                      nlms
                      )))),
                      by = "column_label")
                      
# Apply FDR adjustment to LRT p-values before filtering for sig non-linear effects
lrt_coeff_df$FDR <- p.adjust(lrt_coeff_df$Pr..Chisq., method = "fdr")

# Filter LRT results for sig FDR coeff... produces only 3 genes for pval 0.1!
lrt_filt <- filter(lrt_coeff_df, FDR < 0.1)

## Plot sig nl genes according to LRT, grouped by timepoint and direction of beta 1 coefficient
# Add beta coefficients to logCPM df
leachate_pos <- filter(nlms_coeff, leachate > 0)
leachate_2_pos <- filter(nlms_coeff, leachate_2 > 0)

# Bin genes based on positive or negative leachate and leachate^2 betas
tab_exp_avg$leachate_binom <- ifelse(tab_exp_avg$geneid %in% leachate_pos$geneid, "Positive", "Negative")
tab_exp_avg$leachate_2_binom <- ifelse(tab_exp_avg$geneid %in% leachate_2_pos$geneid, "Concave", "Convex")

# Filter for how many gene id's with significant likelihood of nl effect in LRT
LRT_filt_df <- filter(tab_exp_avg, geneid %in% lrt_filt$geneid)

# Plot
ggplot(data = LRT_filt_df,
       aes(x = leachate, y = value)) +
  geom_path(
    alpha = .25,
    size = 0.25,
    stat = "identity",
    aes(group = as.factor(geneid))
    ) +
  facet_grid(leachate_2_binom ~ hpf) +
  geom_smooth(method = "loess", se = TRUE, span = 1) +
  theme_classic() +
  theme(strip.background = element_blank()) +
  labs(y = "Avg logCPM", title = "Non-linear changes in GE output by LRTs")
```

::: callout-important
3 genes had likelihood of non-linear effect (p \< 0.1) as identified by LRT and DEG
:::

> ... the most robust estimate we have for non-linear effects of environmental conditions on gene expression over time comes from filtering for significant DEGs in edgeR *and* significant LRTs.

## Save non-linear gene list

`LRT_filt_df` is a dataframe gene id's with significant likelihood of nl effect in LRT

```{r}
write_csv(lrt_filt, "../output/09_multifactorial/lrt_filt.csv")
LRT_filt_df
```

# Interaction

> How does gene expression vary between exposure times for genes exhibiting significant interactions between *leachate*2\^2 and time? Let's produce an exploratory plot of non-linear expression across 0.5 and 7 hpfs of exposure for such genes identified using edgeR alone.

```{r}
# Export diff expression data for transcripts with significant DE associated with interaction between leachate^2 and time
nl_leachate_2_int_sig <- topTags(nl_leachate_2_int, n = (3289 + 4317), adjust.method = "BH",p.value = 0.05)
nl_leachate_2_int_sig_geneids <- row.names(nl_leachate_2_int_sig) #Output a list of geneids associated with sig leachate^2 x time interaction

# Filter down df for gene id's exhibit leachate significant effect in edgeR and significant likelihood of nl effect in LRT 
edgeR_interaction_df <- filter(tab_exp_avg, geneid %in% nl_leachate_2_int_sig_geneids )
edgeR_interaction_df$gene_id_hpf <- paste(edgeR_interaction_df$geneid,
                                           edgeR_interaction_df$hpf,
                                           sep = "_")

# Average logCPM across different groups according to leachate^2 estimate and time
edgeR_interaction_avg <- summarySE(measurevar = "value",
                                   groupvars = c("hpf", "leachate", "leachate_2_binom"),
                                                 data = edgeR_interaction_df)

# Plot 
ggplot(data = edgeR_interaction_avg, 
       aes(
         x = leachate,
         y = value,
         color = as.factor(hpf),
         group = as.factor(hpf)
         )) +
         geom_path(stat = "identity") +
  geom_errorbar(aes(ymin = value - se, ymax = value + se), width = 0) +
  geom_point() +
  facet_wrap(~leachate_2_binom) +
  theme_classic() +
  theme(strip.background = element_blank()) +
  labs(y = "logCPM", color = "Time (hours)", title = "Interactions between leachate^2 and exposure time")

```

This plot shows changes in gene expression curves across time. logCPM is the log Counts Per Million and it's a commonly used metric in **RNA-seq data analysis** to represent gene expression levels.

**CPM (Counts Per Million)** = normalization of raw read counts by sequencing depth:

✅ Why logCPM is used:

Normalizes for sequencing depth across samples

Stabilizes variance, especially for high-count genes

Compresses the large dynamic range of gene expression

Works well for linear modeling (e.g., in limma-voom, edgeR, etc.)

> Genes with both concave and convex expression curves across *leachate*2 at 4, 9, and 14 hours of exposure exhibited increases in expression as the experiment progressed

# Interactive effects

> Interactive effects shaping gene expression are common in nature and are becoming increasingly prevalent in models of gene expression derived from experimental studies. Below, we go a little bit deeper on interactive effects by outlining methods for fitting them using categorical and continuous variables in models of expression. We provide examples in edgeR, DESeq2, and Voom and point out how specification for interactive effects differs between these packages. Lastly, we compare correlations between these programs' fold change (logFC) predictions and test statistics. This final comparison of model predictions for interaction parameters is as much a tutorial on interactive effects as it is a tutorial on how to visualize differences in DE packages' results. Please use section of the code for any comparisons between DE packages you may want to make for other effect types, not just interactive effects!

## Interactive effects: edgeR

> edgeR and Voom both take the same syntax for interactive effects, which we define below using the model.matrix() function as 'design_multi \<- model.matrix( \~1 + leachate + leachate:hpf )'. Then, we will use this multifactorial model design in both edgeR and Voom as specified below:

```{r}
str(y1)
```

```{r}
# Fit multifactorial design matrix that includes an interaction term for leachate x hpf
design_multi <-
  model.matrix(~ 1 + leachate + leachate:hpf) #Generate multivariate edgeR glm
  
# Fit quasi-likelihood, neg binom linear regression
multi_fit <-
  glmQLFit(y1, design_multi) # Fit multivariate model to counts
  
# Test for effect of leachate
tr_leachate <-
  glmQLFTest(
  multi_fit,
  coef = 2,
  contrast = NULL,
  poisson.bound = FALSE
  ) # Estimate significant DEGs
  
is.de_tr_leachate <-
  decideTestsDGE(tr_leachate, adjust.method = "fdr", p.value = 0.05) # Make contrasts
  
summary(is.de_tr_leachate)
```

::: callout-note
In edgeR 8458 genes down + 3304 genes up = 11,762 genes total were found to be differentially expressed due to the effect of leachate (p = 0.05)
:::

```{r}
# Test for interaction between leachate and hpf
tr_int <-
  glmQLFTest(multi_fit, coef = 3, poisson.bound = FALSE) # Estimate significant DEGs
  
is.de_int <-
  decideTestsDGE(tr_int, adjust.method = "fdr", p.value = 0.05) # Make contrasts
  
summary(is.de_int)
```

::: callout-note
3294 genes down + 8341 genes up = 11,635 genes total were differentially expressed with an interaction between leachate and hpf
:::

## Interactive effects: limma-Voom

> Below we will fit the same 'design_multi' model (y \~ 1 + *leachate*2 + *leachate*2:hpf) to our read counts using Voom rather than edgeR. As stated earlier, Voom fits interactions between fixed effects using the same syntax as edgeR, so the 'design_multi' object can be used by both programs. \> \> What *is* different between edgeR and Voom is the manner by which Voom accounts for variation between replicates. Rather than using gene-wise or averaged estimations of biological coefficients of variation and inputting these values as θθ in a negative binomial GLM, Voom models what is referred to as the mean-variance relationship and incorporates this relationship within a linear model rather than a GLM by "weighting" the accuracy of a given observation based on its level of expression and modelled variance. This mean-variance relationship can be plotted by Voom after model fitting as we show below:

```{r}
# Perform Voom transformation
Voom <- voom(y, design_multi, plot = T)
```

> When deciding between packages for testing DGE, it can be helpful to compare model assumptions regarding variance by plotting graphs like Voom's mean-variance relationship above and the output of edgeR's 'plotBCV()' function, which we provided an example of earlier and will replot below for the sake of making this visual comparison. Take note that the y-axes of these variance plots in Voom (square root of standard deviation) and edgeR (biological coefficient of variation) are different:

```{r}
# Estimate dispersion coefficients
y1 <- estimateDisp(y, robust = TRUE) # Estimate mean dispersal

# Plot tagwise dispersal and impose w/ mean dispersal and trendline
plotBCV(y1) 

```

> The mean-variance relationships modelled in edgeR and Voom are generally similar, but it appears that Voom assumes even lesser variance in expression among genes at the highest expression level. One result of this may be that greater statistical power is assumed in tests of differential expression by Voom for genes with high expression relative to low expression genes. Keep this in mind as we continue to move forward with fitting parameters for interactions between *`leachate`* and `hpf` in Voom.

```{r}
# Fit using Voom
lm_Voom_fit <- lmFit(Voom, design_multi)

# Create a contrast across continuous leachate variable
cont_leachate <- contrasts.fit(lm_Voom_fit, coef = "leachate")

# Create a contrast across interaction etween continuous leachate and time variables
cont_leachate_hpf <- contrasts.fit(lm_Voom_fit, coef = "leachate:hpf")

# Perform empirical Bayes smoothing of standard errors
cont_leachate <- eBayes(cont_leachate)
cont_leachate_hpf <- eBayes(cont_leachate_hpf)

# Output test statistics
leachate_results <-
  topTable(cont_leachate,
  coef = "leachate",
  adjust.method = "fdr",
  n = Inf)

leachate_hpf_results <-
  topTable(cont_leachate_hpf,
  coef = "leachate:hpf",
  adjust.method = "fdr",
  n = Inf)
```

```{r}
# How many DEGs are associated with leachate
length(which(leachate_results$adj.P.Val < 0.05))
```

```{r}
# How many DEGs are associated with leachate:hpf
length(which(leachate_hpf_results$adj.P.Val < 0.05))
```

::: callout-note
Voom has identified 10,157 genes whose variation is affected by an interaction between leachate and hours post fertilization, compared to 11,635 genes identified by edgeR.
:::

## Interactive effects: DESeq2

DESeq2 requires a slightly different syntax for specifying interactive effects compared to edgeR and Voom.

```{r}
gcounts <- gcm_filt_or

totalCounts <- colSums(gcounts)
# Make a DGEList object for edgeR
y <- DGEList(counts = gcounts, remove.zeros = TRUE)

# Let's remove samples with less then 0.5 cpm (this is ~10 counts in the count file) in fewer then 3/61 samples
keep_g <- rowSums(cpm(gcounts) > .5) >= 3

table(keep_g)
```

```{r}
head(metadata_or)
```

```{r}
### WALD TEST - FULL MODEL ###

dds <- DESeqDataSetFromMatrix(gcounts,
                              colData = metadata_or,
                              design = formula( ~ 1 + leachate + hpf : leachate))

rld <- rlog(dds)
rld.df <- assay(rld)

# Wald test for leachate:hpf
dds_int <- DESeq(dds, minReplicatesForReplace = Inf)

design <- design(dds_int)

DESeq2_int_result_names <- resultsNames(dds_int)

# Count DEGs due to interaction
DESeq2_int_results <- results(dds_int, lfcThreshold = 0, alpha = 0.05)

summary(DESeq2_int_results)
```

> DESeq2 has identified 8287 up and 3340 down = 11627 genes whose variation is affected by an interaction between *`leachate`* and `hpf`, compared to 11635 identified by edgeR and 10157 by Voom.

```{r}
DESeq2_int_result_names
```

```{r}
library(ggplot2)
library(dplyr)

deg_df <- as.data.frame(DESeq2_int_results)


# Filter to remove NA padj values
deg_df <- deg_df %>%
  filter(!is.na(padj)) %>%
  mutate(
    neg_log10_padj = -log10(padj),
    sig = case_when(
      padj < 0.05 & log2FoldChange > 1 ~ "Up",
      padj < 0.05 & log2FoldChange < -1 ~ "Down",
      TRUE ~ "NS"
    )
  )

# Volcano plot
ggplot(deg_df, aes(x = log2FoldChange, y = neg_log10_padj, color = sig)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("Up" = "red", "Down" = "blue", "NS" = "gray")) +
  theme_minimal() +
  labs(
    title = "Volcano Plot of DEGs",
    x = "Log2 Fold Change",
    y = "-Log10 Adjusted P-value",
    color = "Significance"
  ) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "black") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "black")

# Save the last plot that was displayed
ggsave("../output/09_multifactorial/volcano_plot.png", width = 8, height = 6, dpi = 300)

```

```{r}
# Assuming DEG_results is your DESeq2 results table
deg_gene_list <- DESeq2_int_results %>%
  filter(!is.na(padj), padj < 0.05) %>%          # keep only significant DEGs
  select(gene, log2FoldChange, pvalue, padj)     # select desired columns

# Save to CSV
write_csv(deg_gene_list, "../output/09_multifactorial/deg_gene_list.csv")
```

## Comparing test statistics: interactive effects

Differential expression packages can make dramatically different predictions for both the fold-change and probability of differential expression. If you find yourself deciding between different packages or wanting to compare how conservative different approaches are, it helps to run regressions of model predictions by different packages. Below we apply the ggpairs() function from GGAlly (Schloerke et al. 2018) to plot a correlation matrix of logFC values and negative log-transformed p-values attributed to interactive effects between time and *p*CO22 predicted by edgeR, DESeq2, and Voom:

```{r}
# Merge logFC and pval data from each program
Voom_edgeR_deseq_int_comp <- merge( 
  merge(
    data.frame(geneid = row.names(leachate_hpf_results),
               Voom_logFC = leachate_hpf_results$logFC,
               Voom_pval = leachate_hpf_results$P.Value),
    data.frame(geneid = row.names(tr_int$table),
               edgeR_logFC = tr_int$table$logFC,
               edgeR_pval = tr_int$table$PValue), 
    by = "geneid" ),
  data.frame(geneid = row.names(DESeq2_int_results),
             DESeq2_logFC = DESeq2_int_results$log2FoldChange,
             DESeq2_pval = DESeq2_int_results$pvalue),
  by = "geneid")

# Create neg log pvalues
Voom_edgeR_deseq_int_comp$Voom_neglogp <- -log(Voom_edgeR_deseq_int_comp$Voom_pval)
Voom_edgeR_deseq_int_comp$edgeR_neglogp <- -log(Voom_edgeR_deseq_int_comp$edgeR_pval)
Voom_edgeR_deseq_int_comp$DESeq2_neglogp <- -log(Voom_edgeR_deseq_int_comp$DESeq2_pval)

# Create function for adding 1:1 trendline on ggpairs plot and plotting geom_hex instead of geom_point
my_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) + 
    geom_hex( bins = 100,
            aes(fill = stat(log(count))), alpha = 1 ) +
    scale_fill_viridis_c() +
    geom_abline(
    slope = 1,
    intercept = 0,
    color = "red",
    lty = 2,
    size = 1,
    alpha = 0.5
    )
  p
}

# Correlation matrix of pvalues
pval_pairs <- ggpairs(data = Voom_edgeR_deseq_int_comp,
                      columns = c(8, 9, 10),
                      mapping = aes(alpha = 0.001),
                      lower = list(continuous = my_fn)) +
  labs(title = "Correlation matrix: interaction p-values")

pval_pairs +
  theme_classic(base_rect_size = 0)
```

> The above plot is the output of ggpairs() showing correlations between negative log-transformed p-values attributed to interactive effects of `leachate` and `hpf` estimated between each of the three packages, as well as the distributions of this parameter for each package, and correlation statistics. Red dashed lines represent a slope equal to 1. Yellow colors depict regions of each linear regression with more observations.
>
> DESeq2 and edgeR share the largest correlation between p-values estimated for an interactive effect of *`leachate:hpf`* on differential expression (R22 = 0.827).
>
> Next, let's create a ggpairs() plot contrasting logFC estimates of DE attributed to *`leachate:hpf`* in edgeR, Voom, and DESeq2. One might think that logFCs should demonstrate stronger correlations between programs compared to p-values, but due to the manners by which read counts are normalized and transformed in different packages this is not always the case!

```{r}
# Correlation matrix of logFC's
logFC_pairs <- ggpairs(data = Voom_edgeR_deseq_int_comp,
                       columns = c( 2, 4, 6 ),
                       mapping = aes( alpha = 0.001 ),
                       lower = list(continuous = my_fn)) +
  labs(title = "Correlation matrix: interaction logFC's")

logFC_pairs +
  theme_classic(base_rect_size = 0)
```

> **Once again, the strongest correlation exists between edgeR and Voom.** However, the slope of this correlation is less than 1, indicating that absolute edgeR logFC's are lower on average than absolute Voom logFCs. The correlation pair that shows a slope closest to 1.0 is edgeR vs. DESeq2. edgeR and DESeq2 normalize read counts in a similar manner, and thus a value derived from differences in read counts across continuous predictors such as logFC is likely to be more similar between these packages.

# Save gene lists for functional enrichment

## All genes detected

List of all genes in the dataset (gcm) = 54384 genes which will serve as a reference for functional enrichment.

```{r}
dim(gcm)
```

# Summary & next steps

> The statistical test (like LRT) assesses whether the expression of a gene changes significantly, taking variability and replicates into account.
>
> Fold change (log2FoldChange) is just the magnitude of the change in expression --- it tells you how much the expression changes, but not whether it's statistically reliable.
>
> A gene might have a big fold change, but if the variability is high or replicates are few, the p-value might be not significant.
>
> Conversely, a small fold change could be statistically significant if the change is consistent and variability is low.