---
title: "Gene Enrichment with GO-MWU"
author: "Yaamini Venkataraman"
date: "7/11/2019"
output: github_document
---

In this R Markdown file, we format files with a gene enrichment with [GO-MWU](https://github.com/z0on/GO_MWU). This is a rank-based gene enrichment method using a Mann-Whitney U test that works well with non-model organisms. There are separate annotations for *Zostera marina* and *Labyrinthula zosterae*, so these files will be formatted separately.

# Set up R Markdown document

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
sessionInfo() #Obtain session information.
```

#Format GO-MWU input

## GO annotations table

The GO annotation tables need to be tab-delimited with 2 columns: gene ID and GOterms separated by semicolons.

### *Z. marina*

```{r}
allZosteraAnnotations <- read.delim("../../data/Zostera-blast-annot-withGeneID-noIsoforms-geneCounts.tab", header = FALSE) #Import file with gene counts for annotated genes
colnames(allZosteraAnnotations) <- c("GeneID", "Accession", "Isoform", "E-value", "ProteinN", "GO_BP", "GO_CC", "GO_MF", "GO", "Status", "Organism", "S_10B", "S_9A", "S_13A", "S_42A", "S_46B", "S_47B", "S_48B", "S_2A", "S_2B", "S_7B", "S_8B", "S_33A", "S_36B", "S_38A", "S_40A") #Rename columns
head(allZosteraAnnotations) #Confirm changes
```

```{r}
zosteraGOAnnotationsTable <- data.frame(allZosteraAnnotations$GeneID,
                                        allZosteraAnnotations$GO) #Save gene ID and all gene ontology terms in a new dataframe
head(zosteraGOAnnotationsTable) #Confirm format is good for GO-MWU. It is.
```

```{r}
write.table(zosteraGOAnnotationsTable, "2019-07-11-Zostera-GO-Annotations-Table.tab", sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE) #Save file
```

```{bash}
head 2019-07-11-Zostera-GO-Annotations-Table.tab #Confirm formatting is good.
```

### *L. zosterae*

```{r}
allnonZosteraAnnotations <- read.delim("../../data/nonZostera-blast-annot-withGeneID-noIsoforms-geneCounts.tab", header = FALSE) #Import file with gene counts for annotated genes
colnames(allnonZosteraAnnotations) <- c("GeneID", "Accession", "Isoform", "E-value", "ProteinN", "GO_BP", "GO_CC", "GO_MF", "GO", "Status", "Organism", "S_10B", "S_9A", "S_13A", "S_42A", "S_46B", "S_47B", "S_48B", "S_2A", "S_2B", "S_7B", "S_8B", "S_33A", "S_36B", "S_38A", "S_40A") #Rename columns
head(allnonZosteraAnnotations) #Confirm changes
```

```{r}
nonZosteraGOAnnotationsTable <- data.frame(allnonZosteraAnnotations$GeneID,
                                           allnonZosteraAnnotations$GO) #Save gene ID and all gene ontology terms in a new dataframe
head(nonZosteraGOAnnotationsTable) #Confirm format is good for GO-MWU. It is.
```

```{r}
write.table(nonZosteraGOAnnotationsTable, "2019-07-11-nonZostera-GO-Annotations-Table.tab", sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE) #Save file
```

```{bash}
head 2019-07-11-nonZostera-GO-Annotations-Table.tab
```

## Table of significance measures

The table of significance measures needs to be a CSV file with two columns: gene ID and measure of continuous significance. All genes need to be included, not just those deemed significantly different by `edgeR`. Similar to the GO-MWU README, I used signed negative log p-values modified from `edgeR` output.

### *Z. marina*

#### Subset table

```{r}
zosteraDEG <- read.delim("../EdgeR/EXP.CON.Zostera.ALL.txt", header = TRUE) #Import all genes used in edgeR output
head(zosteraDEG) #Confirm import
```

```{r}
zosteraSigMeasures <- data.frame("gene" = row.names(zosteraDEG),
                                 "logFC" = zosteraDEG$logFC,
                                 "logP" = log(zosteraDEG$PValue)) #Subset row names as gene ID and log fold change. Also include log(p-value)
head(zosteraSigMeasures) #Confirm import
```

#### Correct p-values for upregulated genes

```{r}
zosteraSigMeasuresUpregulated <- subset(x = zosteraSigMeasures, subset = zosteraSigMeasures$logFC > 0) #Subset differentially expressed genes that are upregulated by pulling out genes with a positive fold change
tail(zosteraSigMeasuresUpregulated) #Confirm subset
```

```{r}
range(zosteraSigMeasuresUpregulated$logP) #Look at the range of p-values for upregulated genes. All values are negative. For signed negative log p-values, the sign of the log p-value should correspond with upregulation for downregulation. All upregulated genes should have positive p-values.
zosteraSigMeasuresUpregulated$signLogP <- -1 * zosteraSigMeasuresUpregulated$logP #Multiply all p-values by -1 so they become positive
head(zosteraSigMeasuresUpregulated) #Confirm changes
```

#### Correct p-values for downregulated genes

```{r}
zosteraSigMeasuresDownregulated <- subset(x = zosteraSigMeasures, subset = zosteraSigMeasures$logFC < 0) #Subset differentially expressed genes that are downregulated by pulling out genes with a negative fold change
head(zosteraSigMeasuresDownregulated) #Confirm subset
```

```{r}
range(zosteraSigMeasuresDownregulated$logP) #All downregulated genes should have negative p-values. These p-values are already negative, so no more modification needs to happen.
zosteraSigMeasuresDownregulated$signLogP <- zosteraSigMeasuresDownregulated$logP #Copy entries to a new column to be consistent with formatting for table with upregulated genes
```

#### Combine tables with corrected p-values

```{r}
zosteraSigMeasuresCorrected <- rbind(zosteraSigMeasuresUpregulated, zosteraSigMeasuresDownregulated) #Combine tables by columns
zosteraSigMeasuresCorrected <- zosteraSigMeasuresCorrected[,-c(2:3)] #Remove unnecessary columns
head(zosteraSigMeasuresCorrected) #Confirm changes
```

```{r}
write.csv(zosteraSigMeasuresCorrected, "2019-07-11-Zostera-Table-of-Significance-Measures.csv", quote = FALSE, row.names = FALSE) #Save file
```

```{bash}
head 2019-07-11-Zostera-Table-of-Significance-Measures.csv #Check that row name column was not included in output
```

### *L. zosterae*

#### Subset table

```{r}
nonzosteraDEG <- read.delim("../EdgeR/EXP.CON.NonZostera.all.txt", header = TRUE) #Import all genes used in edgeR output
head(nonzosteraDEG) #Confirm import
```

```{r}
nonzosteraSigMeasures <- data.frame("gene" = row.names(nonzosteraDEG),
                                    "logFC" = nonzosteraDEG$logFC,
                                    "logP" = log(nonzosteraDEG$PValue)) #Subset row names as gene ID and log fold change. Also include log(p-value)
head(nonzosteraSigMeasures) #Confirm changes
```

#### Correct p-values for upregulated genes

```{r}
nonzosteraSigMeasuresUpregulated <- subset(x = nonzosteraSigMeasures, subset = nonzosteraSigMeasures$logFC > 0) #Subset differentially expressed genes that are upregulated by pulling out genes with a positive fold change
tail(nonzosteraSigMeasuresUpregulated) #Confirm subset
```

```{r}
range(nonzosteraSigMeasuresUpregulated$logP) #Look at the range of p-values for upregulated genes. All values are negative. For signed negative log p-values, the sign of the log p-value should correspond with upregulation for downregulation. All upregulated genes should have positive p-values.
nonzosteraSigMeasuresUpregulated$signLogP <- -1 * nonzosteraSigMeasuresUpregulated$logP #Multiply all p-values by -1 so they become positive
head(nonzosteraSigMeasuresUpregulated) #Confirm changes
```

#### Correct p-values for downregulated genes

```{r}
nonzosteraSigMeasuresDownregulated <- subset(x = nonzosteraSigMeasures, subset = nonzosteraSigMeasures$logFC < 0) #Subset differentially expressed genes that are downregulated by pulling out genes with a negative fold change
head(nonzosteraSigMeasuresDownregulated) #Confirm subset
```

```{r}
range(nonzosteraSigMeasuresDownregulated$logP) #All downregulated genes should have negative p-values. These p-values are already negative, so no more modification needs to happen.
nonzosteraSigMeasuresDownregulated$signLogP <- nonzosteraSigMeasuresDownregulated$logP #Copy entries to a new column to be consistent with formatting for table with upregulated genes
```

#### Combine tables with corrected p-values

```{r}
nonzosteraSigMeasuresCorrected <- rbind(nonzosteraSigMeasuresUpregulated, nonzosteraSigMeasuresDownregulated) #Combine tables by columns
nonzosteraSigMeasuresCorrected <- nonzosteraSigMeasuresCorrected[,-c(2:3)] #Remove unnecessary columns
head(nonzosteraSigMeasuresCorrected) #Confirm changes
```

```{r}
write.csv(nonzosteraSigMeasuresCorrected, "2019-07-11-nonZostera-Table-of-Significance-Measures.csv", quote = FALSE, row.names = FALSE) #Save file
```

```{bash}
head 2019-07-11-nonZostera-Table-of-Significance-Measures.csv #Confirm row names are not included in the file
```