---
title: "Functional Enrichment Analysis"
output: html_document
date: "2023-05-04"
---
# Functional Enrichment Overview

Functional enrichment analysis is a valuable tool to interpret the results of differential gene expression analysis by identifying the biological processes, molecular functions, and pathways that are overrepresented among a set of differentially expressed genes. The general steps to do a functional enrichment analysis:

1. Select a functional annotation database: There are several databases available that provide functional annotations for genes, such as Gene Ontology (GO), KEGG, Reactome, and others. Choose the database that best fits your research question and the species you are studying.

2. Determine the background gene set: This set of genes should include all genes expressed in your experiment. You can use a reference genome or the genes detected in your RNA-Seq experiment as the background set.

3. Select the differentially expressed genes. We have already made a DEGlist.tab in the psuedo alignment script. so we can directly read this in.

4. Perform functional enrichment analysis: Using a tool such as clusterProfiler, input the list of differentially expressed genes and the background gene set, and perform enrichment analysis against the chosen functional annotation database. The output will provide a list of significantly enriched GO terms, KEGG pathways, and other functional categories, along with the associated p-values and false discovery rate (FDR) values.

5. Interpret the results: Interpret the enriched functional categories in the context of your research question and the available literature. This will help you gain insight into the underlying biological processes and pathways involved in the differential expression of genes.


```{r setup, include=FALSE}
library(knitr)
library(biomartr)
library(clusterProfiler)
library(tidyverse)
library(enrichplot)
library(biomaRt)  # only use to remove cache bug
library(tidyverse)
library(dplyr)
library(GenomicFeatures)
library(rtracklayer)
library(goseq)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,         # Setting this to false will allow us to knit the code without evaluating code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  fig.width = 6,       # Set plot width in inches
  fig.height = 4,      # Set plot height in inches
  fig.align = "center" # Align plots to the center
)
```


=======

### curling annotation file
To perform functional enrichment analysis using the R packages biomartr and clusterProfiler, you can start with the following code chunks:

This will curl the gene annotation txt file.
```{r, engine='bash'}
cd ../data
curl --insecure -O http://botryllus.stanford.edu/botryllusgenome/download/botznik-gene-ann-2.txt
```

### Reading in the list of differentially expressed genes:
Below, we are going to read in the DEGlist.tab we made in the DEG analysis we previously conducted. Below we will clean up the data frame to make a new column with the gene identifiers. Since we used the "complete mRNA records" of Botryllus schlosseri from NCBI to do our psuedo-alignment, the gene identifier column has accession numbers for mRNA reads from NCBI. We will use this column later to annotate what was differentially expressed.
```{r}

diff_genes <- read.delim(file = "../output/DEGlist.tab", sep =" ")
diff_genes$rna_access_num <- row.names(diff_genes) #adds the gene name into data frame as new column called gene
rownames(diff_genes) <- diff_genes$X #renames all the rows to 1-n instead of the gene name
diff_genes <- diff_genes[,c(7, 1:6)] #puts the gene column in front
```

reading in the gene annotation file, however, this may not be that useful to us since the information we have regarding the ID of the gene is the accession number of the mRNA entry on NCBI in the DEGlist.tab.
```{r}
# Set the path to the gene annotation file
anno_file <- "../data/botznik-gene-ann-2.txt"

# Read in the gene annotation file
anno <- tryCatch(
  read.table(anno_file, header = TRUE, sep = "\t", fill = TRUE, stringsAsFactors = FALSE), #fille = TRUE helps handle the missing values in the annotation file
  error = function(e) {
    message("Error reading the file: ", e$message)
    data.frame()
  }
)

```
`
Code that will potentially take the accession numbers of our diff_genes df and get gene ontology terms for them.
```{r}
# Get the gene names
gene_list <- diff_genes$rna_access_num

# Get the GO terms for the differentially expressed genes
go <- getGO(organism = "Botryllus schlosseri", genes = gene_list)

```

```{r}

# Calculate the enrichment of GO terms
go_enrichment <- enrichGO(gene = names(go), 
                          universe = gene_names, 
                          ont = "BP", 
                          pvalueCutoff = 0.01, 
                          qvalueCutoff = 0.1, 
                          keyType = "ENSEMBL")

# Get the KEGG pathways for the differentially expressed genes
kegg <- getKEGG(gene.names = gene_names,
                org = "bschlosseri",
                output.format = "list")

# Calculate the enrichment of KEGG pathways
kegg_enrichment <- enrichKEGG(gene = names(kegg), 
                              organism = "bschlosseri",
                              pvalueCutoff = 0.01,
                              qvalueCutoff = 0.1)

# Visualize the results
dotplot(go_enrichment)
dotplot(kegg_enrichment)

```


### next steps after merging two tables
3. Create a gene ID conversion table that maps the gene IDs in the differential gene expression table to the gene IDs in the annotation file. This step is necessary because the gene IDs in the differential expression table may be different from those in the annotation file:

```{r}
# Create gene ID conversion table
id_conversion_table <- anno_file[, c("Gene.ID", "Gene.Name")]
colnames(id_conversion_table) <- c("ensembl_gene_id", "gene_symbol")
id_conversion_table <- id_conversion_table[!duplicated(id_conversion_table$ensembl_gene_id), ]
id_conversion_table$ensembl_gene_id <- gsub("\\..*", "", id_conversion_table$ensembl_gene_id)

# Map differential gene expression data to annotation information
diff_genes$ensembl_gene_id <- diff_genes$gene
diff_genes$gene_symbol <- mapIds(id_conversion_table, 
                                  keys = diff_genes$ensembl_gene_id, 
                                  column = "ensembl_gene_id", 
                                  keytype = "ENSEMBL_ID", 
                                  multiVals = "first")
```

4. Filter out genes that do not have a corresponding gene symbol in the annotation file:

```{r}
# Filter out genes without gene symbols
diff_genes <- diff_genes[!is.na(diff_genes$gene_symbol), ]
```

5. Perform functional enrichment analysis using the clusterProfiler package. You can choose a functional annotation database such as Gene Ontology (GO) or KEGG pathways, and specify the significance level threshold for selecting enriched terms:

```{r}
# Perform functional enrichment analysis using GO database
ego <- enrichGO(gene         = diff_genes$gene_symbol,
                universe    = unique(anno_file$Gene.Name),
                ont         = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.05,
                qvalueCutoff  = 0.1)
                
# Print the top enriched terms
head(ego)
```

This code should help you get started with performing functional enrichment analysis using R packages biomartr and clusterProfiler. You can modify the code as necessary based on your specific research question and data.