---
title: "Functional Enrichment Analysis"
output: md_document
date: "2023-05-04"
---

# Functional Enrichment Overview
Currently, this portion of the workflow is a work in progress as of 05/09/2023.

Functional enrichment analysis is a valuable tool to interpret the results of differential gene expression analysis by identifying the biological processes, molecular functions, and pathways that are overrepresented among a set of differentially expressed genes. The general steps to do a functional enrichment analysis:

1. Select a functional annotation database: There are several databases available that provide functional annotations for genes, such as Gene Ontology (GO), KEGG, Reactome, and others. Choose the database that best fits your research question and the species you are studying.

2. Determine the background gene set: This set of genes should include all genes expressed in your experiment. You can use a reference genome or the genes detected in your RNA-Seq experiment as the background set.

3. Select the differentially expressed genes. We have already made a DEGlist.tab in the psuedo alignment script. so we can directly read this in.

4. Perform functional enrichment analysis: Using a tool such as clusterProfiler, input the list of differentially expressed genes and the background gene set, and perform enrichment analysis against the chosen functional annotation database. The output will provide a list of significantly enriched GO terms, KEGG pathways, and other functional categories, along with the associated p-values and false discovery rate (FDR) values.

5. Interpret the results: Interpret the enriched functional categories in the context of your research question and the available literature. This will help you gain insight into the underlying biological processes and pathways involved in the differential expression of genes.


```{r setup, include=FALSE}
library(knitr)
library(biomartr)
library(clusterProfiler)
library(tidyverse)
library(enrichplot)
library(biomaRt)  # only use to remove cache bug
library(tidyverse)
library(dplyr)
library(goseq)
library(DESeq2)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = TRUE,         # Setting this to false will allow us to knit the code without evaluating code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  fig.width = 6,       # Set plot width in inches
  fig.height = 4,      # Set plot height in inches
  fig.align = "center" # Align plots to the center
)
```


=======
# Do DESeq2
We will use the R package DESeq2 to help us digest the gene abundance count matrix we generated. 

Below, I am reading in and modifying the count matrix we generated from Kallisto.

We need to create objects in our global environment that can be used towards completing a functional enrichment analysis.

Below we read in the counts matrix, rename the row names from 1-n to the value of the first column, remove the first column, and round the values in the count matrix.
```{r read in and wrangle count matrix, eval = TRUE}
countmatrix <- read.delim("../output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t') %>%
  {rownames(.) <- .$X; .} %>%
  {.[,-1]} %>%
  {round(., 0)}
```

We will then make a new data frame that will contain information about the treatment groups and the type of RNAseq data produced. This will an object called "deseq2.colData". We then use the DESeqDataSetFromMatrix function from the DESeq2 package to make a dataset that we can use in our DEseq function to do the analysis.
```{r deseq2 conditions, eval=TRUE}
deseq2.colData <- data.frame(condition=factor(c(rep("b5-b6", 2), rep("TO", 2))), 
                             type=factor(rep("paired-end", 4)))
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
                                     colData = deseq2.colData, 
                                     design = ~ condition)
```

Next we will conduct a DESeq analysis using the DESeq function:
```{r deseq analysis, eval=TRUE}
deseq2.dds <- DESeq(deseq2.dds) 
deseq2.res <- results(deseq2.dds) #this places the results in an object that we can open
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ] #reorders the rows
head(deseq2.res) #take a peak
```

### Reading in whole annotated transcriptome
```{r, eval=TRUE, cache=TRUE}
blast_go <- read.delim(file = "../output/blast_annot_go.tab")
colnames(blast_go)[1] <- 'access_num' #renaming the first column 

# Merge together all results from DESeq analysis with blast_go data frame
res_df <- as.data.frame(deseq2.res) %>%
  mutate(access_num = rownames(.)) %>%
  distinct()

#clean up the blast annotated table to keep only unique rows
blast_go_small <- distinct(blast_go, access_num, .keep_all = TRUE)

transcriptome_anno <- merge(blast_go_small, res_df, by = "access_num", all = TRUE)
```

### Reading in the list of differentially expressed genes:
We will read in the annotated DEG list that was created in the 04-blast code script.
```{r}
diff_genes <- read.delim(file = "../output/annote_DEGlist.tab", sep =" ")

#peak at annotated DEG list
head(diff_genes)
```


## define gene list and background
```{r}
gene_list <- diff_genes$access_num
background <-transcriptome_anno$access_num
go_ids <- diff_genes$Gene.Ontology.IDs
go_bp_annotations <- diff_genes$Gene.Ontology..biological.process
deg_data <- data.frame(gene = gene_list, GO_ID = go_ids, GO_BP = go_bp_annotations)

#peak at deg_data
head(deg_data)

```
### Brainstorming, code may or may not work below, beware... 


```{r}
enrich_result <- enricher(gene = gene_list,
                          pAdjustMethod = "BH",
                          universe = background,
                          TERM2GENE = deg_data,
                          pvalueCutoff = 0.05,
                          qvalueCutoff = 0.05)

```