---
title: "Gene Enrichment with GO-MWU"
author: "Yaamini Venkataraman"
date: "7/30/2019"
output: html_document
---

In this R Markdown document, I will format files for gene enrichment with [GO-MWU](https://github.com/z0on/GO_MWU). GO-MWU needs two input files. The first is a gene list, with all genes used in differential methylation analysis with all associated GOterms. The second is a list of genes with a continuous measure of significance.

# Set up R Markdown file

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Install packages

```{r}
#install.packages("dichromat")
require(dichromat)
```

# Obtain session information

```{r}
sessionInfo() #Obtain session information
```

# Create master list

## Import gene background

```{r}
geneBackground <- read.csv("2019-06-20-mRNA-Gene-Background-Uniprot.csv") #Import gene background with Uniprot annotations
geneBackground <- geneBackground[,-1] #Remove extra column
head(geneBackground) #Confirm import
```

## Import GOterms

```{r}
uniprotGOTerms <- read.delim("uniprot-reviewed_yes.tab", header = TRUE) #Import tab-delimited file with header
uniprotGOTerms <- uniprotGOTerms[,c(1,8)] #Retain Entry and GOterm columns
colnames(uniprotGOTerms) <- c("Uniprot", "GO") #Rename columns
head(uniprotGOTerms) #Confirm import
```

## Merge gene background with GOterms

```{r}
geneBackgroundGOterms <- merge(x = geneBackground, y = uniprotGOTerms, by = "Uniprot") #Merge files by Uniprot ID
head(geneBackgroundGOterms) #Confirm merge
```

```{r}
geneBackgroundGOterms$DMLB.start <- geneBackgroundGOterms$DMLB.start + 1 #Add 1 to each start position. Normally I would subtract 1 from each end position (dinucleotide), but the geneBackground file doesn't seem to include that.
geneBackgroundGOterms <- geneBackgroundGOterms[,c(1:4, 9, 20, 22:23)] #Remove extra columns
head(geneBackgroundGOterms) #Confirm changes
```

## Merge gene background with all loci tested

```{r}
allTested <- read.csv("../2018-10-25-MethylKit/2018-10-25-Loci-Analysis/2019-06-30-All-Loci-Methylation-Statistic-Filtered-Destrand-50-Cov5.csv", header = TRUE) #Import all loci tested
allTested <- allTested[,-1] #Remove extra column
colnames(allTested) <- c("chr", "DMLB.start", "DMLB.end", "strand", "pvalue", "qvalue", "meth.diff") #Rename columns to match previous naming conventions
head(allTested) #Confirm import
```

```{r}
allTestedGOterms <- merge(x = allTested, y = geneBackgroundGOterms, by = "DMLB.start") #Merge all loci tested with gene background information
head(allTestedGOterms) #Confirm merge
```

```{r}
allTestedGOterms <- allTestedGOterms[,c(2, 1, 3:8, 11:14)] #Reorganize columns
colnames(allTestedGOterms) <- c("chr", "DMLB.start", "DMLB.end", "strand", "pvalue", "qvalue", "meth.diff", "Uniprot", "Genbank", "evalue", "Product", "GO") #Rename columns
head(allTestedGOterms) #Confirm changes
```

## Remove redundant lines

```{r}
attach(allTestedGOterms) #Attach dataframe
allTestedGOterms <- allTestedGOterms[order(Genbank, pvalue),] #Sort by gene ID, then p-value
detach(allTestedGOterms) #Detatch dataframe
head(allTestedGOterms) #Confirm changes
```

```{r}
write.csv(allTestedGOterms, "2019-07-30-All-Tested-Loci-Uniprot-GOTerms.csv", quote = FALSE, row.names = FALSE) #Save ordered file
```

```{bash}
#Sort a file by the Genbank column (--key = 9,9) and only retain unique lines (--unique). Save output to a new file
sort --unique --key=9,9 2019-07-30-All-Tested-Loci-Uniprot-GOTerms.csv \
> 2019-07-30-All-Tested-Loci-Uniprot-GOTerms-Unique.csv
```

```{bash}
head -n4 2019-07-30-All-Tested-Loci-Uniprot-GOTerms-Unique.csv
```

# Create GO-MWU inputs

```{r}
allTestedGOtermsCondensed <- read.csv("2019-07-30-All-Tested-Loci-Uniprot-GOTerms-Unique.csv", header = TRUE) #Import file with one line per gene
head(allTestedGOtermsCondensed, 2) #Confirm import
```

## GO annotations table

The first table needs to be a tab-delimited file with one column for genes and one column for GOterms.

```{r}
goAnnotations <- data.frame(allTestedGOtermsCondensed$Genbank,
                            allTestedGOtermsCondensed$GO) #Create a new dataframe with Genbank and GO columns
head(goAnnotations) #Confirm format is good. It is.
```

```{r}
write.table(goAnnotations, "2019-07-30-allTested-GO-Annotations-Table.tab", sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE) #Save file
```

```{bash}
head 2019-07-30-allTested-GO-Annotations-Table.tab #Confirm formatting was retained
```

## Table of significance measures

This table is a .csv file with one column for gene IDs and another for a continuous significance measure. All genes need to be included, not just those with significant DML. Similar to the GO-MWU example, I will use signed negative log p-values modified from `methylKit` output. Hypomethylated loci will have negative p-values, and hypermethylated loci will have positive p-values.

### Subset table

```{r}
sigMeasures <- data.frame("gene" = allTestedGOtermsCondensed$Genbank,
                          "logP" = log(allTestedGOtermsCondensed$pvalue),
                          "meth.diff" = allTestedGOtermsCondensed$meth.diff) #Subset Genbank ID, log(p-value), and meth.diff
head(sigMeasures) #Confirm subset
```

```{r}
sigMeasuresUncorrected <- sigMeasures[,c(1:2)] #Keep first two columns to see if GO-MWU provides different results with uncorrected p-values
head(sigMeasuresUncorrected) #Confirm subset
```

```{r}
write.csv(sigMeasuresUncorrected, "2019-07-30-allTested-Table-of-Significance-Measures-Uncorrected.csv", quote = FALSE, row.names = FALSE) #Save file
```

### Correct p-values for hypermethylated loci

```{r}
sigMeasuresHyper <- subset(x = sigMeasures, subset = sigMeasures$meth.diff >= 0) #Subset hypermethylated loci. Differences of 0 will need positive p-values as well.
range(sigMeasuresHyper$meth.diff) #Confirm subset
```

```{r}
range(sigMeasuresHyper$logP) #Look at range of values for hypermethylated loci. All values are negative, so they need to be made positive.
```

```{r}
sigMeasuresHyper$signLogP <- -1 * sigMeasuresHyper$logP #Multiply values by -1
head(sigMeasuresHyper) #Confirm changes
```

### Correct p-values for hypomethylated loci

```{r}
sigMeasuresHypo <- subset(x = sigMeasures, subset = sigMeasures$meth.diff < 0) #Subset hypomethylated loci
range(sigMeasuresHypo$meth.diff) #Confirm subset
```

```{r}
range(sigMeasuresHypo$logP) #Look at range of p-values. All of these are already negative, so there's no need to modify it
```

```{r}
sigMeasuresHypo$signLogP <- sigMeasuresHypo$logP #Copy logP to a new columns for consistency
head(sigMeasuresHypo) #Confirm changes
```

### Combine tables with corrected p-values

```{r}
sigMeasuresCorrected <- rbind(sigMeasuresHyper, sigMeasuresHypo) #Combine dataframes
sigMeasuresCorrected <- sigMeasuresCorrected[,-c(2:3)] #Remove unnecessary columns
head(sigMeasuresCorrected) #Confirm changes
```

```{r}
write.csv(sigMeasuresCorrected, "2019-07-30-allTested-Table-of-Significance-Measures.csv", quote = FALSE, row.names = FALSE) #Save file
```

```{bash}
head 2019-07-30-allTested-Table-of-Significance-Measures.csv #Confirm formatting is good. It is.
```

# Gene enrichment with GO-MWU

These commands were copied from the GO-MWU.R file in the GO-MWU Github repository.

## Signed log p-values

### Biological processes

#### Edit input variables

```{r}
input="2019-07-30-allTested-Table-of-Significance-Measures.csv" # two columns of comma-separated values: gene id, continuous measure of significance. To perform standard GO enrichment analysis based on Fisher's exact test, use binary measure (0 or 1, i.e., either sgnificant or not).
goAnnotations="2019-07-30-allTested-GO-Annotations-Table.tab" # two-column, tab-delimited, one line per gene, multiple GO terms separated by semicolon.
goDatabase="go.obo" # download from http://www.geneontology.org/GO.downloads.ontology.shtml
goDivision="BP" # BP for biological processes
source("gomwu.functions.R")
```

#### Run GO-MWU

```{r}
# Calculating stats. It might take ~3 min for MF and BP. Do not rerun it if you just want to replot the data with different cutoffs, go straight to gomwuPlot. If you change any of the numeric values below, delete the files that were generated in previos runs first.
gomwuStats(input, goDatabase, goAnnotations, goDivision,
	perlPath="perl", # replace with full path to perl executable if it is not in your system's PATH already
	largest=0.1,  # a GO category will not be considered if it contains more than this fraction of the total number of genes
	smallest=5,   # a GO category should contain at least this many genes to be considered
	clusterCutHeight=0.25) # threshold for merging similar (gene-sharing) terms. See README for details.
# There are no GO terms pass 10% FDR.
```

#### Describe parent GOslim terms

Even though there are no significantly enriched terms, I want to describe the parent GOslim categories GO-MWU used to sort my data.

##### All tested loci

```{r}
parentGOBP <- read.delim("BP_2019-07-30-allTested-Table-of-Significance-Measures.csv", header = TRUE) #Import GO-MWU output file with individual genes and associated GOslim term. Even though it is a .csv extension, it is a tab-delimited file
colnames(parentGOBP) <- c("name", "term", "lev", "Genbank", "value") #Rename gene ID column to match previous naming conventions
head(parentGOBP) #Confirm import
```

```{r}
parentGOBPCounts <- as.data.frame(table(parentGOBP$name)) #Count the frequency of each GOslim term
colnames(parentGOBPCounts) <- c("parentGOslim", "frequency") #Rename columns
head(parentGOBPCounts) #Confirm counts
```

```{r}
write.csv(parentGOBPCounts, "2019-07-30-allTested-Parent-GOSlim-Frequency-BP.csv", row.names = FALSE, quote = FALSE) #Save file
```

###### Map to GOslim terms

```{r}
CVGOslim <- read.delim("../2019-08-14-Differentially-Methylated-Genes/Blastquery-GOslim-BP.sorted.unique.noOther", sep = "\t", header = FALSE) #Import GOslim information
colnames(CVGOslim) <- c("Genbank", "GOslim") #Add column names
head(CVGOslim) #Confirm import
```

```{r}
parentGOBPGOslim <- unique(merge(x = parentGOBP, y = CVGOslim, by = "Genbank", all.x = TRUE)) #Merge GO-MWU with GOslim information, retaining all GO-MWU entries. Remove any merging artifacts
head(parentGOBPGOslim) #Confirm merge
length(parentGOBPGOslim$GOslim) #Count number of entries
```

```{r}
sum(is.na(parentGOBPGOslim$GOslim)) #Count how many entries do not have GOslim terms. 27/2281 GO-MWU entries do not have GOslim entries.
```

```{r}
parentGOBPGOslimCounts <- as.data.frame(table(parentGOBPGOslim$GOslim)) #Count the frequency of each GOslim term
colnames(parentGOBPGOslimCounts) <- c("CVGOslim", "frequency") #Rename columns
head(parentGOBPGOslimCounts) #Confirm counts
```

```{r}
write.csv(parentGOBPGOslimCounts, "2019-07-30-allTested-CVGOSlim-Frequency-BP.csv", row.names = FALSE, quote = FALSE) #Save file
```

```{r}
parentGOBPGOslimCounts$percentage <- (parentGOBPGOslimCounts$frequency/sum(parentGOBPGOslimCounts$frequency)*100) #Create a new column with percentages of GOSlim terms instead of raw counts
head(parentGOBPGOslimCounts) #Confirm column creation
max(parentGOBPGOslimCounts$percentage)
```

```{r}
parentGOBPGOslimCounts <- parentGOBPGOslimCounts[c(4, 1, 5, 11, 13,
                                                         2, 6, 3,
                                                         12,
                                                         7, 10, 9, 8),] #Reorganize rows based on broader functions. Rows are general activity, development, stress response, and metabolic processes
head(parentGOBPGOslimCounts) #Confirm organization
```

```{r}
parentGOBPGOslimCounts$GOSlim <- c("Cell-Cell Signaling", "Cell Adhesion", "Death", "Signal Transduction", "Transport",
                                   "Cell Cycle and Proliferation", "Developmental Processes", "Cell Organization and Biogenesis",
                                   "Stress Response",
                                   "DNA Metabolism", "RNA Metabolism", "Protein Metabolism", "Other Metabolic Processes") #Capitalize labels
head(parentGOBPGOslimCounts) #Confirm capitalization
```

####### Create figure

```{r}
RColorBrewer::display.brewer.all() #Show all RColorBrewer palettes. I will choose greens.
plotColors <- rev(RColorBrewer::brewer.pal(5, "GnBu")) #Create a color palette for the barplots. Use 5 green-blue shades from RColorBrewer. Reverse theorder so the darkest shade is used first.
barplot(parentGOBPGOslimCounts$percentage,
        col = plotColors) #See what plot looks like with new scheme
barplot(parentGOBPGOslimCounts$percentage,
        col = dichromat(plotColors)) #Check that the plot colors will be easy to interpret for those with color blindess
```

```{r}
#pdf("2019-11-19-BP-GOSlim-allTested-Gene-Overlaps.pdf", width = 11, height = 8.5) #Save as a pdf

par(mar = c(5, 20, 1, 1)) #Change figure boundaries

barsGOSlim <- barplot((parentGOBPGOslimCounts$percentage),
                      horiz = TRUE,
                      axes = FALSE,
                      xlim = c(0,18),
                      col = c(rep(plotColors[2], times = 5),
                              rep(plotColors[4], times = 3),
                              rep(plotColors[5], times = 1),
                              rep(plotColors[1], times = 4))) #Create a barplot that is horizontal (horiz = TRUE). Use axes = FALSE to remove all axes. Set xlim based on maximum percentage value. Colors correspond to broader GOSlim groupings. Save plot as a new object.

axis(side = 2, at = barsGOSlim, labels = parentGOBPGOslimCounts$GOSlim, tick = FALSE, las = 2, col = "grey80", cex.axis = 1.5) #Add y-axis with GOSlim terms from BPGOSlimCounts. Place labels at specific barplot values from barsGOSlim. Remove tick marks (tick) and change orientation of labels (las)

axis(side = 1, at = seq(from = 0, to = 18, by = 2), cex = 1.2, col = "grey80") #Add x-axis
mtext(side = 1, "Percent of Genes", line = 3, cex = 1.5) #Add x-axis label

par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE) #Create new plot
plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n") #Add new plot on top of current plot
legend("bottomleft",
       xpd = TRUE,
       legend = c("General Activity", "Development", "Stress Response", "Metabolic Processes"),
       pch = 22, 
        col = "black", 
        pt.bg = c(plotColors[2], plotColors[4], plotColors[5], plotColors[1]),
       bty = "n") #Place a legend in the bottom left of the figure with no box

#dev.off()
```

##### Differentially methylated loci

GO-MWU takes all loci into account, but there may be some differences in parental GOslim terms if I only look at differentially methylated loci. I will import the DML annotations I made previously and work from there.

```{r}
DMLGeneAnnotation <- read.csv("2019-06-20-DML-Gene-Annotation.csv", header = TRUE, row.names = 1) #Import file. The first column are row names.
head(DMLGeneAnnotation) #Confirm import
```

```{r}
DMLGenbank <- data.frame("Genbank" = DMLGeneAnnotation$Genbank,
                         "pvalue" = DMLGeneAnnotation$pvalue,
                         "meth.diff" = DMLGeneAnnotation$meth.diff) #Create a new dataframe with Genbank IDs and pvalues for each DML. Include meth.diff as well
head(DMLGenbank) #Confirm dataframe creation
```

```{r}
parentGOBPDML <- unique(merge(x = DMLGenbank, y = parentGOBP, by = "Genbank", all.x = TRUE)) #Merge parent GOslim terms and DML information by Genbank ID. Keep only unique entries. This will allow me to retain multiple DML per gene, but remove any merging artifacts. Keep rows that don't have matching GOterms from GO-MWU just in case.
head(parentGOBPDML) #Confirm merge. There are several entries without any matching information from GO-MWU.
```

```{r}
sum(is.na(parentGOBPDML$name)) #Count how many genes with DML do not have GO-MWU entries.
```

```{r}
parentGOBPDMLCounts <- as.data.frame(table(parentGOBPDML$name)) #Count the frequency of each GOslim term
colnames(parentGOBPDMLCounts) <- c("parentGOslim", "frequency") #Rename columns
parentGOBPDMLCounts <- subset(parentGOBPDMLCounts, subset = frequency > 0) #Remove residual factors
head(parentGOBPDMLCounts) #Confirm counts
```

```{r}
write.csv(parentGOBPDMLCounts, "2019-07-30-condensedDML-Parent-GOSlim-Frequency-BP.csv", row.names = FALSE, quote = FALSE) #Save file
```

###### Map to GOslim terms

```{r}
parentGOBPDMLGOslim <- unique(merge(x = parentGOBPDML, y = CVGOslim, by = "Genbank", all.x = TRUE)) #Merge GO-MWU with GOslim information, retaining all GO-MWU entries. Remove any merging artifacts
head(parentGOBPDMLGOslim) #Confirm merge
length(parentGOBPDMLGOslim$GOslim) #Count number of entries
```

```{r}
sum(is.na(parentGOBPDMLGOslim$GOslim)) #Count how many entries do not have GOslim terms. 425/5283 GO-MWU entries do not have GOslim entries.
```

```{r}
parentGOBPDMLGOslimCounts <- as.data.frame(table(parentGOBPDMLGOslim$GOslim)) #Count the frequency of each GOslim term
colnames(parentGOBPDMLGOslimCounts) <- c("CVGOslim", "frequency") #Rename columns
head(parentGOBPDMLGOslimCounts) #Confirm counts
```

```{r}
write.csv(parentGOBPDMLGOslimCounts, "2019-07-30-condensedDML-CVGOSlim-Frequency-BP.csv", row.names = FALSE, quote = FALSE) #Save file
```

```{r}
parentGOBPDMLGOslimCounts$percentage <- (parentGOBPDMLGOslimCounts$frequency/sum(parentGOBPDMLGOslimCounts$frequency)*100) #Create a new column with percentages of GOSlim terms instead of raw counts
head(parentGOBPDMLGOslimCounts) #Confirm column creation
max(parentGOBPDMLGOslimCounts$percentage)
```

```{r}
parentGOBPDMLGOslimCounts <- parentGOBPDMLGOslimCounts[c(4, 1, 5, 11, 13,
                                                         2, 6, 3,
                                                         12,
                                                         7, 10, 9, 8),] #Reorganize rows based on broader functions. Rows are general activity, development, stress response, and metabolic processes
head(parentGOBPDMLGOslimCounts) #Confirm organization
```

```{r}
parentGOBPDMLGOslimCounts$GOSlim <- c("Cell-Cell Signaling", "Cell Adhesion", "Death", "Signal Transduction", "Transport",
                                      "Cell Cycle and Proliferation", "Developmental Processes", "Cell Organization and Biogenesis",
                                      "Stress Response",
                                      "DNA Metabolism", "RNA Metabolism", "Protein Metabolism", "Other Metabolic Processes") #Capitalize labels
head(parentGOBPDMLGOslimCounts) #Confirm capitalization
```

####### Create figure

```{r}
#pdf("2019-11-19-BP-GOSlim-DML-Gene-Overlaps.pdf", width = 11, height = 8.5) #Save as a pdf

par(mar = c(5, 20, 1, 1)) #Change figure boundaries

barsGOSlimDML <- barplot(parentGOBPDMLGOslimCounts$percentage,
                      horiz = TRUE,
                      axes = FALSE,
                      xlim = c(0,18),
                      col = c(rep(plotColors[2], times = 5),
                              rep(plotColors[4], times = 3),
                              rep(plotColors[5], times = 1),
                              rep(plotColors[1], times = 4))) #Create a barplot that is horizontal (horiz = TRUE). Use axes = FALSE to remove all axes. Set xlim based on maximum percentage value. Colors correspond to broader GOSlim groupings. Save plot as a new object.

axis(side = 2, at = barsGOSlimDML, labels = parentGOBPDMLGOslimCounts$GOSlim, tick = FALSE, las = 2, col = "grey80", cex.axis = 1.5) #Add y-axis with GOSlim terms from BPGOSlimCounts. Place labels at specific barplot values from barsGOSlim. Remove tick marks (tick) and change orientation of labels (las)

axis(side = 1, at = seq(from = 0, to = 18, by = 2), cex = 1.2, col = "grey80") #Add x-axis
mtext(side = 1, "Percent of Genes with DML", line = 3, cex = 1.5) #Add x-axis label

par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE) #Create new plot
plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n") #Add new plot on top of current plot
legend("bottomleft",
       xpd = TRUE,
       legend = c("General Activity", "Development", "Stress Response", "Metabolic Processes"),
       pch = 22, 
        col = "black", 
        pt.bg = c(plotColors[2], plotColors[4], plotColors[5], plotColors[1]),
       bty = "n") #Place a legend in the bottom left of the figure with no box

#dev.off()
```

###### Hypermethylated loci

The division between hypermethylated and hypomethylated biological processes may be interesting too.

```{r}
condensedDMLHyper <- subset(DMLGenbank, subset = meth.diff > 0) #Subset hypermethylated loci
range(condensedDMLHyper$meth.diff) #Confirm subset
```

```{r}
parentGOBPCondensedHyper <- unique(merge(x = condensedDMLHyper, y = parentGOBP, by = "Genbank", all.x = TRUE)) #Merge condensed DML with parentGOBP
head(parentGOBPCondensedHyper) #Confirm merge
```

```{r}
sum(is.na(parentGOBPCondensedHyper$name))
```

```{r}
parentGOBPCondensedHyperCounts <- as.data.frame(table(parentGOBPCondensedHyper$name)) #Count the frequency of each GOslim term
colnames(parentGOBPCondensedHyperCounts) <- c("parentGOslim", "frequency") #Rename columns
parentGOBPCondensedHyperCounts <- subset(parentGOBPCondensedHyperCounts, subset = frequency > 0) #Remove residual factors
head(parentGOBPCondensedHyperCounts) #Confirm counts
```

```{r}
write.csv(parentGOBPCondensedHyperCounts, "2019-07-30-condensedDMLHyper-Parent-GOSlim-Frequency-BP.csv", row.names = FALSE, quote = FALSE) #Save file
```

```{r}
parentGOBPCondensedHyperGOslim <- subset(x = parentGOBPDMLGOslim, subset = parentGOBPDMLGOslim$meth.diff > 0) #Subset hypermethylated loci
head(parentGOBPCondensedHyperGOslim) #Confirm subset
```

```{r}
sum(is.na(parentGOBPCondensedHyperGOslim$GOslim)) #Count how many entries do not have GOslim terms. 164/1016 do not have GO-MWU entries.
```

```{r}
parentGOBPCondensedHyperGOslimCounts <- as.data.frame(table(parentGOBPCondensedHyperGOslim$GOslim)) #Count the frequency of each GOslim term
colnames(parentGOBPCondensedHyperGOslimCounts) <- c("CVGOslim", "frequency") #Rename columns
parentGOBPCondensedHyperGOslimCounts <- subset(parentGOBPCondensedHyperGOslimCounts, subset = frequency > 0) #Remove residual factors
head(parentGOBPCondensedHyperGOslimCounts) #Confirm counts
```

```{r}
write.csv(parentGOBPCondensedHyperGOslimCounts, "2019-07-30-condensedDMLHyper-CVGOSlim-Frequency-BP.csv", row.names = FALSE, quote = FALSE) #Save file
```

###### Hypomethylated loci

```{r}
condensedDMLHypo <- subset(DMLGenbank, subset = meth.diff < 0) #Subset hypermethylated loci
range(condensedDMLHypo$meth.diff) #Confirm subset
```

```{r}
parentGOBPCondensedHypo <- unique(merge(x = condensedDMLHypo, y = parentGOBP, by = "Genbank", all.x = TRUE)) #Merge condensed DML with parentGOBP
head(parentGOBPCondensedHypo) #Confirm merge
```

```{r}
sum(is.na(parentGOBPCondensedHypo$name)) #Count genes with DML that do not have GO-MWU terms associated with them
```

```{r}
parentGOBPCondensedHypoCounts <- as.data.frame(table(parentGOBPCondensedHypo$name)) #Count the frequency of each GOslim term
colnames(parentGOBPCondensedHypoCounts) <- c("parentGOslim", "frequency") #Rename columns
parentGOBPCondensedHypoCounts <- subset(parentGOBPCondensedHypoCounts, subset = frequency > 0) #Remove residual factors
head(parentGOBPCondensedHypoCounts) #Confirm counts
```

```{r}
write.csv(parentGOBPCondensedHypoCounts, "2019-07-30-condensedDMLHypo-Parent-GOSlim-Frequency-BP.csv", row.names = FALSE, quote = FALSE) #Save file
```

```{r}
parentGOBPCondensedHypoGOslim <- subset(x = parentGOBPDMLGOslim, subset = parentGOBPDMLGOslim$meth.diff < 0) #Subset hypermethylated loci
head(parentGOBPCondensedHypoGOslim) #Confirm subset
```

```{r}
sum(is.na(parentGOBPCondensedHypoGOslim$GOslim)) #Count how many entries do not have GOslim terms. 192/3169 do not have GO-MWU entries.
```

```{r}
parentGOBPCondensedHypoGOslimCounts <- as.data.frame(table(parentGOBPCondensedHypoGOslim$GOslim)) #Count the frequency of each GOslim term
colnames(parentGOBPCondensedHypoGOslimCounts) <- c("CVGOslim", "frequency") #Rename columns
parentGOBPCondensedHypoGOslimCounts <- subset(parentGOBPCondensedHypoGOslimCounts, subset = frequency > 0) #Remove residual factors
head(parentGOBPCondensedHypoGOslimCounts) #Confirm counts
```

```{r}
write.csv(parentGOBPCondensedHypoGOslimCounts, "2019-07-30-condensedDMLHypo-CVGOSlim-Frequency-BP.csv", row.names = FALSE, quote = FALSE) #Save file
```

##### Figure to compare all tested and DML distributions

```{r}
#pdf("2019-11-19-BP-GOSlim-allTested-Versus-DML.pdf", width = 11, height = 8.5) #Save as a pdf

par(mfrow = c(1, 2), oma = c(5, 20, 0, 3), mar = c(0, 0, 0, 0)) #Set up parameters for multipanel plot

barsGOSlim <- barplot(-1*(parentGOBPGOslimCounts$percentage),
                      horiz = TRUE,
                      axes = FALSE,
                      xlim = c(-18,0),
                      col = c(rep(plotColors[2], times = 5),
                              rep(plotColors[4], times = 3),
                              rep(plotColors[5], times = 1),
                              rep(plotColors[1], times = 4))) #Create a barplot that is horizontal (horiz = TRUE). Use axes = FALSE to remove all axes. Set xlim based on maximum percentage value. Colors correspond to broader GOSlim groupings. Save plot as a new object.

axis(side = 2, at = barsGOSlim, labels = parentGOBPGOslimCounts$GOSlim, tick = FALSE, las = 2, col = "grey80", cex.axis = 1.5) #Add y-axis with GOSlim terms from BPGOSlimCounts. Place labels at specific barplot values from barsGOSlim. Remove tick marks (tick) and change orientation of labels (las)

axis(side = 1, at = seq(from = -18, to = 0, by = 2), labels = c(seq(from = 18, to = 0, by = -2)), cex = 1.2, col = "grey80") #Add x-axis
mtext(side = 1, "% Genes", line = 3, cex = 1.5) #Add x-axis label

barsGOSlimDML <- barplot(parentGOBPDMLGOslimCounts$percentage,
                         horiz = TRUE,
                         axes = FALSE,
                         xlim = c(0,18),
                         col = c(rep(plotColors[2], times = 5),
                                 rep(plotColors[4], times = 3),
                                 rep(plotColors[5], times = 1),
                                 rep(plotColors[1], times = 4))) #Create a barplot that is horizontal (horiz = TRUE). Use axes = FALSE to remove all axes. Set xlim based on maximum percentage value. Colors correspond to broader GOSlim groupings. Save plot as a new object.

axis(side = 1, at = seq(from = 0, to = 18, by = 2), cex = 1.2, col = "grey80") #Add x-axis
mtext(side = 1, "% Genes with DML", line = 3, cex = 1.5) #Add x-axis label

#dev.off()
```


### Cellular components

#### Edit input variables

```{r}
goDivision="CC" # CC for cellular components. All other inputs are the same.
```

#### Run GO-MWU

```{r}
# Calculating stats. It might take ~3 min for MF and BP. Do not rerun it if you just want to replot the data with different cutoffs, go straight to gomwuPlot. If you change any of the numeric values below, delete the files that were generated in previos runs first.
gomwuStats(input, goDatabase, goAnnotations, goDivision,
	perlPath="perl", # replace with full path to perl executable if it is not in your system's PATH already
	largest=0.1,  # a GO category will not be considered if it contains more than this fraction of the total number of genes
	smallest=5,   # a GO category should contain at least this many genes to be considered
	clusterCutHeight=0.25) # threshold for merging similar (gene-sharing) terms. See README for details.
# There are no GO terms pass 10% FDR.
```

#### Describe parent GOslim terms

##### All tested loci

```{r}
parentGOCC <- read.delim("CC_2019-07-30-allTested-Table-of-Significance-Measures.csv", header = TRUE) #Import GO-MWU output file with individual genes and associated GOslim term. Even though it is a .csv extension, it is a tab-delimited file
colnames(parentGOCC) <- c("name", "term", "lev", "Genbank", "value") #Rename gene ID column to match previous naming conventions
head(parentGOCC) #Confirm import
```

```{r}
parentGOCCCounts <- as.data.frame(table(parentGOCC$name)) #Count the frequency of each GOslim term
colnames(parentGOCCCounts) <- c("parentGOslim", "frequency") #Rename columns
head(parentGOCCCounts) #Confirm counts
```

```{r}
write.csv(parentGOCCCounts, "2019-07-30-allTested-Parent-GOSlim-Frequency-CC.csv", row.names = FALSE, quote = FALSE) #Save file
```

##### Differentially methylated loci

```{r}
parentGOCCCondensed <- unique(merge(x = DMLGenbank, y = parentGOCC, by = "Genbank", all.x = TRUE)) #Merge condensed DML with parentGOCC
head(parentGOCCCondensed) #Confirm merge
```

```{r}
sum(is.na(parentGOCCCondensed$name)) #Count how many genes with DML do not have GO-MWU characterizations
```

```{r}
parentGOCCCondensedCounts <- as.data.frame(table(parentGOCCCondensed$name)) #Count the frequency of each GOslim term
colnames(parentGOCCCondensedCounts) <- c("parentGOslim", "frequency") #Rename columns
parentGOCCCondensedCounts <- subset(parentGOCCCondensedCounts, subset = frequency > 0) #Remove residual factors
head(parentGOCCCondensedCounts) #Confirm counts
```

```{r}
write.csv(parentGOCCCondensedCounts, "2019-07-30-condensedDML-Parent-GOSlim-Frequency-CC.csv", row.names = FALSE, quote = FALSE) #Save file
```

###### Hypermethylated loci

```{r}
parentGOCCCondensedHyper <- unique(merge(x = condensedDMLHyper, y = parentGOCC, by = "Genbank")) #Merge condensed DML with parentGOCC
head(parentGOCCCondensedHyper) #Confirm merge
```

```{r}
parentGOCCCondensedHyperCounts <- as.data.frame(table(parentGOCCCondensedHyper$name)) #Count the frequency of each GOslim term
colnames(parentGOCCCondensedHyperCounts) <- c("parentGOslim", "frequency") #Rename columns
parentGOCCCondensedHyperCounts <- subset(parentGOCCCondensedHyperCounts, subset = frequency > 0) #Remove residual factors
head(parentGOCCCondensedHyperCounts) #Confirm counts
```

```{r}
write.csv(parentGOCCCondensedHyperCounts, "2019-07-30-condensedDMLHyper-Parent-GOSlim-Frequency-CC.csv", row.names = FALSE, quote = FALSE) #Save file
```

###### Hypomethylated loci

```{r}
parentGOCCCondensedHypo <- unique(merge(x = condensedDMLHypo, y = parentGOCC, by = "Genbank")) #Merge condensed DML with parentGOCC
head(parentGOCCCondensedHypo) #Confirm merge
```

```{r}
parentGOCCCondensedHypoCounts <- as.data.frame(table(parentGOCCCondensedHypo$name)) #Count the frequency of each GOslim term
colnames(parentGOCCCondensedHypoCounts) <- c("parentGOslim", "frequency") #Rename columns
parentGOCCCondensedHypoCounts <- subset(parentGOCCCondensedHypoCounts, subset = frequency > 0) #Remove residual factors
head(parentGOCCCondensedHypoCounts) #Confirm counts
```

```{r}
write.csv(parentGOCCCondensedHypoCounts, "2019-07-30-condensedDMLHypo-Parent-GOSlim-Frequency-CC.csv", row.names = FALSE, quote = FALSE) #Save file
```

### Molecular function

#### Edit input variables

```{r}
goDivision="MF" # MF for molecular function. All other inputs are the same.
```

#### Run GO-MWU

```{r}
# Calculating stats. It might take ~3 min for MF and BP. Do not rerun it if you just want to replot the data with different cutoffs, go straight to gomwuPlot. If you change any of the numeric values below, delete the files that were generated in previos runs first.
gomwuStats(input, goDatabase, goAnnotations, goDivision,
	perlPath="perl", # replace with full path to perl executable if it is not in your system's PATH already
	largest=0.1,  # a GO category will not be considered if it contains more than this fraction of the total number of genes
	smallest=5,   # a GO category should contain at least this many genes to be considered
	clusterCutHeight=0.25) # threshold for merging similar (gene-sharing) terms. See README for details.
# There are no GO terms pass 10% FDR.
```

#### Describe parent GOslim terms

##### All tested loci

```{r}
parentGOMF <- read.delim("MF_2019-07-30-allTested-Table-of-Significance-Measures.csv", header = TRUE) #Import GO-MWU output file with individual genes and associated GOslim term. Even though it is a .csv extension, it is a tab-delimited file
colnames(parentGOMF) <- c("name", "term", "lev", "Genbank", "value") #Rename gene ID column to match previous naming conventions
head(parentGOMF) #Confirm import
```

```{r}
parentGOMFCounts <- as.data.frame(table(parentGOMF$name)) #Count the frequency of each GOslim term
colnames(parentGOMFCounts) <- c("parentGOslim", "frequency") #Rename columns
head(parentGOMFCounts) #Confirm counts
```

```{r}
write.csv(parentGOMFCounts, "2019-07-30-allTested-Parent-GOSlim-Frequency-MF.csv", row.names = FALSE, quote = FALSE) #Save file
```

##### Differentially methylated loci

```{r}
parentGOMFCondensed <- unique(merge(x = DMLGenbank, y = parentGOMF, by = "Genbank", all.x = TRUE)) #Merge condensed DML with parentGOMF
head(parentGOMFCondensed) #Confirm merge
```

```{r}
sum(is.na(parentGOMFCondensed$name)) #Count how many genes with DML do not have GO-MWU characterizations
```

```{r}
parentGOMFCondensedCounts <- as.data.frame(table(parentGOMFCondensed$name)) #Count the frequency of each GOslim term
colnames(parentGOMFCondensedCounts) <- c("parentGOslim", "frequency") #Rename columns
parentGOMFCondensedCounts <- subset(parentGOMFCondensedCounts, subset = frequency > 0) #Remove residual factors
head(parentGOMFCondensedCounts) #Confirm counts
```

```{r}
write.csv(parentGOMFCondensedCounts, "2019-07-30-condensedDML-Parent-GOSlim-Frequency-MF.csv", row.names = FALSE, quote = FALSE) #Save file
```

###### Map to GOslim terms

```{r}
CVGOslimMF <- read.delim("../2019-08-14-Differentially-Methylated-Genes/Blastquery-GOslim-MF.sorted", sep = "\t", header = FALSE) #Import GOslim information
colnames(CVGOslimMF) <- c("Genbank", "GOslim") #Add column names
head(CVGOslimMF) #Confirm import
```

```{r}
parentGOMFDMLGOslim <- unique(merge(x = parentGOMFCondensed, y = CVGOslimMF, by = "Genbank", all.x = TRUE)) #Merge GO-MWU with GOslim information, retaining all GO-MWU entries. Remove any merging artifacts
head(parentGOMFDMLGOslim) #Confirm merge
```

```{r}
sum(is.na(parentGOMFDMLGOslim$GOslim)) #Count how many entries do not have GOslim terms. 412/3190 do not have GO-MWU entries.
```

```{r}
parentGOMFDMLGOslimCounts <- as.data.frame(table(parentGOMFDMLGOslim$GOslim)) #Count the frequency of each GOslim term
colnames(parentGOMFDMLGOslimCounts) <- c("CVGOslim", "frequency") #Rename columns
parentGOMFDMLGOslimCounts <- subset(parentGOMFDMLGOslimCounts, subset = frequency > 0) #Remove residual factors
head(parentGOMFDMLGOslimCounts) #Confirm counts
```

```{r}
write.csv(parentGOMFDMLGOslimCounts, "2019-07-30-condensedDML-CVGOSlim-Frequency-MF.csv", row.names = FALSE, quote = FALSE) #Save file
```

###### Hypermethylated loci

```{r}
parentGOMFCondensedHyper <- unique(merge(x = condensedDMLHyper, y = parentGOMF, by = "Genbank", all.x = TRUE)) #Merge condensed DML with parentGOMF
head(parentGOMFCondensedHyper) #Confirm merge
```

```{r}
sum(is.na(parentGOMFCondensedHyper$name)) #Count DML without GO-MWU parent GOterms for corresponding genes.
```

```{r}
parentGOMFCondensedHyperCounts <- as.data.frame(table(parentGOMFCondensedHyper$name)) #Count the frequency of each GOslim term
colnames(parentGOMFCondensedHyperCounts) <- c("parentGOslim", "frequency") #Rename columns
parentGOMFCondensedHyperCounts <- subset(parentGOMFCondensedHyperCounts, subset = frequency > 0) #Remove residual factors
head(parentGOMFCondensedHyperCounts) #Confirm counts
```

```{r}
write.csv(parentGOMFCondensedHyperCounts, "2019-07-30-condensedDMLHyper-Parent-GOSlim-Frequency-MF.csv", row.names = FALSE, quote = FALSE) #Save file
```

```{r}
parentGOMFCondensedHyperGOslim <- subset(x = parentGOMFDMLGOslim, subset = parentGOMFDMLGOslim$meth.diff > 0) #Subset hypermethylated loci
head(parentGOMFCondensedHyperGOslim) #Confirm subset
```

```{r}
sum(is.na(parentGOMFCondensedHyperGOslim$GOslim)) #Count how many entries do not have GOslim terms. 193/1495 do not have GO-MWU entries.
```

```{r}
parentGOMFCondensedHyperGOslimCounts <- as.data.frame(table(parentGOMFCondensedHyperGOslim$GOslim)) #Count the frequency of each GOslim term
colnames(parentGOMFCondensedHyperGOslimCounts) <- c("CVGOslim", "frequency") #Rename columns
parentGOMFCondensedHyperGOslimCounts <- subset(parentGOMFCondensedHyperGOslimCounts, subset = frequency > 0) #Remove residual factors
head(parentGOMFCondensedHyperGOslimCounts) #Confirm counts
```

```{r}
write.csv(parentGOMFCondensedHyperGOslimCounts, "2019-07-30-condensedDMLHyper-CVGOSlim-Frequency-MF.csv", row.names = FALSE, quote = FALSE) #Save file
```

###### Hypomethylated loci

```{r}
parentGOMFCondensedHypo <- unique(merge(x = condensedDMLHypo, y = parentGOMF, by = "Genbank", all.x = TRUE)) #Merge condensed DML with parentGOMF
head(parentGOMFCondensedHypo) #Confirm merge
```

```{r}
sum(is.na(parentGOMFCondensedHypo$name)) #Count DML without parent GOterms from GO-MWU for corresponding genes. 973/
```

```{r}
parentGOMFCondensedHypoCounts <- as.data.frame(table(parentGOMFCondensedHypo $name)) #Count the frequency of each GOslim term
colnames(parentGOMFCondensedHypoCounts) <- c("parentGOslim", "frequency") #Rename columns
parentGOMFCondensedHypoCounts <- subset(parentGOMFCondensedHypoCounts, subset = frequency > 0) #Remove residual factors
head(parentGOMFCondensedHypoCounts) #Confirm counts
```

```{r}
write.csv(parentGOMFCondensedHypoCounts, "2019-07-30-condensedDMLHypo-Parent-GOSlim-Frequency-MF.csv", row.names = FALSE, quote = FALSE) #Save file
```

```{r}
parentGOMFCondensedHypoGOslim <- subset(x = parentGOMFDMLGOslim, subset = parentGOMFDMLGOslim$meth.diff < 0) #Subset hypomethylated loci
head(parentGOMFCondensedHypoGOslim) #Confirm subset
```

```{r}
sum(is.na(parentGOMFCondensedHypoGOslim$GOslim)) #Count how many entries do not have GOslim terms. 219/1695 do not have GO-MWU entries.
```

```{r}
parentGOMFCondensedHypoGOslimCounts <- as.data.frame(table(parentGOMFCondensedHypoGOslim$GOslim)) #Count the frequency of each GOslim term
colnames(parentGOMFCondensedHypoGOslimCounts) <- c("CVGOslim", "frequency") #Rename columns
parentGOMFCondensedHypoGOslimCounts <- subset(parentGOMFCondensedHypoGOslimCounts, subset = frequency > 0) #Remove residual factors
head(parentGOMFCondensedHypoGOslimCounts) #Confirm counts
```

```{r}
write.csv(parentGOMFCondensedHypoGOslimCounts, "2019-07-30-condensedDMLHypo-CVGOSlim-Frequency-MF.csv", row.names = FALSE, quote = FALSE) #Save file
```

## Uncorrected log p-values

I used signed log p-values since that was the GO-MWU example, but I also want to see what happens if I use uncorrected log p-values. This would not take into account hyper- and hypomethylation, but just focus on significant differences in methylation.

### Biological processes

#### Edit input variables

```{r}
input="2019-07-30-allTested-Table-of-Significance-Measures-Uncorrected.csv" # two columns of comma-separated values: gene id, continuous measure of significance. To perform standard GO enrichment analysis based on Fisher's exact test, use binary measure (0 or 1, i.e., either sgnificant or not).
goAnnotations="2019-07-30-allTested-GO-Annotations-Table.tab" # two-column, tab-delimited, one line per gene, multiple GO terms separated by semicolon.
goDatabase="go.obo" # download from http://www.geneontology.org/GO.downloads.ontology.shtml
goDivision="BP" # BP for biological processes
source("gomwu.functions.R")
```

#### Run GO-MWU

```{r}
# Calculating stats. It might take ~3 min for MF and BP. Do not rerun it if you just want to replot the data with different cutoffs, go straight to gomwuPlot. If you change any of the numeric values below, delete the files that were generated in previos runs first.
gomwuStats(input, goDatabase, goAnnotations, goDivision,
	perlPath="perl", # replace with full path to perl executable if it is not in your system's PATH already
	largest=0.1,  # a GO category will not be considered if it contains more than this fraction of the total number of genes
	smallest=5,   # a GO category should contain at least this many genes to be considered
	clusterCutHeight=0.25) # threshold for merging similar (gene-sharing) terms. See README for details.
# There are no GO terms pass 10% FDR, which is the same as the signed log p-value method. The outputs may be different, so I have to investigate that.
```