--- title: "Gene Enrichment with GO-MWU" author: "Yaamini Venkataraman" date: "7/30/2019" output: html_document --- In this R Markdown document, I will format files for gene enrichment with [GO-MWU](https://github.com/z0on/GO_MWU). GO-MWU needs two input files. The first is a gene list, with all genes used in differential methylation analysis with all associated GOterms. The second is a list of genes with a continuous measure of significance. **NOTE: MOVE THIS R MARKDOWN FILE INTO THE `2018-12-02-Gene-Enrichment-Analysis` FOLDER BEFORE RUNNING.** # Set up R Markdown file ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Install packages ```{r} #install.packages("dichromat") require(dichromat) ``` # Obtain session information ```{r} sessionInfo() #Obtain session information ``` # Create master list ## Import gene background ```{r} geneBackground <- read.csv("2019-06-20-mRNA-Gene-Background-Uniprot.csv") #Import gene background with Uniprot annotations geneBackground <- geneBackground[,-1] #Remove extra column head(geneBackground) #Confirm import ``` ## Import GOterms ```{r} uniprotGOTerms <- read.delim("uniprot-reviewed_yes.tab", header = TRUE) #Import tab-delimited file with header uniprotGOTerms <- uniprotGOTerms[,c(1,8)] #Retain Entry and GOterm columns colnames(uniprotGOTerms) <- c("Uniprot", "GO") #Rename columns head(uniprotGOTerms) #Confirm import ``` ## Merge gene background with GOterms ```{r} geneBackgroundGOterms <- merge(x = geneBackground, y = uniprotGOTerms, by = "Uniprot") #Merge files by Uniprot ID head(geneBackgroundGOterms) #Confirm merge ``` ```{r} geneBackgroundGOterms$DMLB.start <- geneBackgroundGOterms$DMLB.start + 1 #Add 1 to each start position. Normally I would subtract 1 from each end position (dinucleotide), but the geneBackground file doesn't seem to include that. geneBackgroundGOterms <- geneBackgroundGOterms[,c(1:4, 9, 20, 22:23)] #Remove extra columns head(geneBackgroundGOterms) #Confirm changes ``` ## Merge gene background with all loci tested ```{r} allTested <- read.csv("../2018-10-25-MethylKit/2018-10-25-Loci-Analysis/2019-06-30-All-Loci-Methylation-Statistic-Filtered-Destrand-50-Cov5.csv", header = TRUE) #Import all loci tested allTested <- allTested[,-1] #Remove extra column colnames(allTested) <- c("chr", "DMLB.start", "DMLB.end", "strand", "pvalue", "qvalue", "meth.diff") #Rename columns to match previous naming conventions head(allTested) #Confirm import ``` ```{r} allTestedGOterms <- merge(x = allTested, y = geneBackgroundGOterms, by = "DMLB.start") #Merge all loci tested with gene background information head(allTestedGOterms) #Confirm merge ``` ```{r} allTestedGOterms <- allTestedGOterms[,c(2, 1, 3:8, 11:14)] #Reorganize columns colnames(allTestedGOterms) <- c("chr", "DMLB.start", "DMLB.end", "strand", "pvalue", "qvalue", "meth.diff", "Uniprot", "Genbank", "evalue", "Product", "GO") #Rename columns head(allTestedGOterms) #Confirm changes ``` ## Remove redundant lines ```{r} attach(allTestedGOterms) #Attach dataframe allTestedGOterms <- allTestedGOterms[order(Genbank, pvalue),] #Sort by gene ID, then p-value detach(allTestedGOterms) #Detatch dataframe head(allTestedGOterms) #Confirm changes ``` ```{r} write.csv(allTestedGOterms, "2019-07-30-All-Tested-Loci-Uniprot-GOTerms.csv", quote = FALSE, row.names = FALSE) #Save ordered file ``` ```{bash} #Sort a file by the Genbank column (--key = 9,9) and only retain unique lines (--unique). Save output to a new file sort --unique --key=9,9 2019-07-30-All-Tested-Loci-Uniprot-GOTerms.csv \ > 2019-07-30-All-Tested-Loci-Uniprot-GOTerms-Unique.csv ``` ```{bash} head -n4 2019-07-30-All-Tested-Loci-Uniprot-GOTerms-Unique.csv ``` # Create GO-MWU inputs ```{r} allTestedGOtermsCondensed <- read.csv("2019-07-30-All-Tested-Loci-Uniprot-GOTerms-Unique.csv", header = TRUE) #Import file with one line per gene head(allTestedGOtermsCondensed, 2) #Confirm import ``` ## GO annotations table The first table needs to be a tab-delimited file with one column for genes and one column for GOterms. ```{r} goAnnotations <- data.frame(allTestedGOtermsCondensed$Genbank, allTestedGOtermsCondensed$GO) #Create a new dataframe with Genbank and GO columns head(goAnnotations) #Confirm format is good. It is. ``` ```{r} write.table(goAnnotations, "2019-07-30-allTested-GO-Annotations-Table.tab", sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE) #Save file ``` ```{bash} head 2019-07-30-allTested-GO-Annotations-Table.tab #Confirm formatting was retained ``` ## Table of significance measures This table is a .csv file with one column for gene IDs and another for a continuous significance measure. All genes need to be included, not just those with significant DML. Similar to the GO-MWU example, I will use signed negative log p-values modified from `methylKit` output. Hypomethylated loci will have negative p-values, and hypermethylated loci will have positive p-values. ```{r} sigMeasures <- data.frame("gene" = allTestedGOtermsCondensed$Genbank, "logP" = log(allTestedGOtermsCondensed$pvalue), "meth.diff" = allTestedGOtermsCondensed$meth.diff) #Subset Genbank ID, log(p-value), and meth.diff head(sigMeasures) #Confirm subset ``` ```{r} sigMeasuresUncorrected <- sigMeasures[,c(1:2)] #Keep first two columns to see if GO-MWU provides different results with uncorrected p-values head(sigMeasuresUncorrected) #Confirm subset ``` ```{r} write.csv(sigMeasuresUncorrected, "2019-07-30-allTested-Table-of-Significance-Measures-Uncorrected.csv", quote = FALSE, row.names = FALSE) #Save file ``` # Gene enrichment with GO-MWU These commands were copied from the GO-MWU.R file in the GO-MWU Github repository. ### Biological processes #### Edit input variables ```{r} input="2019-07-30-allTested-Table-of-Significance-Measures-Uncorrected.csv" # two columns of comma-separated values: gene id, continuous measure of significance. To perform standard GO enrichment analysis based on Fisher's exact test, use binary measure (0 or 1, i.e., either sgnificant or not). goAnnotations="2019-07-30-allTested-GO-Annotations-Table.tab" # two-column, tab-delimited, one line per gene, multiple GO terms separated by semicolon. goDatabase="go.obo" # download from http://www.geneontology.org/GO.downloads.ontology.shtml goDivision="BP" # BP for biological processes source("gomwu.functions.R") ``` #### Run GO-MWU ```{r} # Calculating stats. It might take ~3 min for MF and BP. Do not rerun it if you just want to replot the data with different cutoffs, go straight to gomwuPlot. If you change any of the numeric values below, delete the files that were generated in previos runs first. gomwuStats(input, goDatabase, goAnnotations, goDivision, perlPath="perl", # replace with full path to perl executable if it is not in your system's PATH already largest=0.1, # a GO category will not be considered if it contains more than this fraction of the total number of genes smallest=5, # a GO category should contain at least this many genes to be considered clusterCutHeight=0.25) # threshold for merging similar (gene-sharing) terms. See README for details. # There are no GO terms pass 10% FDR. ``` #### Describe parent GOslim terms Even though there are no significantly enriched terms, I want to describe the parent GOslim categories GO-MWU used to sort my data. ##### All tested loci ```{r} parentGOBP <- read.delim("BP_2019-07-30-allTested-Table-of-Significance-Measures.csv", header = TRUE) #Import GO-MWU output file with individual genes and associated GOslim term. Even though it is a .csv extension, it is a tab-delimited file colnames(parentGOBP) <- c("name", "term", "lev", "Genbank", "value") #Rename gene ID column to match previous naming conventions head(parentGOBP) #Confirm import ``` ###### Map to GOslim terms ```{r} CVGOslim <- read.delim("../2018-12-02-Gene-Enrichment-Analysis/Blastquery-GOslim-BP.sorted.unique.noOther", sep = "\t", header = FALSE) #Import GOslim information colnames(CVGOslim) <- c("Genbank", "GOslim") #Add column names head(CVGOslim) #Confirm import ``` ```{r} parentGOBPGOslim <- unique(merge(x = parentGOBP, y = CVGOslim, by = "Genbank", all.x = TRUE)) #Merge GO-MWU with GOslim information, retaining all GO-MWU entries. Remove any merging artifacts head(parentGOBPGOslim) #Confirm merge length(parentGOBPGOslim$GOslim) #Count number of entries ``` ```{r} sum(is.na(parentGOBPGOslim$GOslim)) #Count how many entries do not have GOslim terms. 27/2281 GO-MWU entries do not have GOslim entries. ``` ```{r} parentGOBPGOslimCounts <- as.data.frame(table(parentGOBPGOslim$GOslim)) #Count the frequency of each GOslim term colnames(parentGOBPGOslimCounts) <- c("CVGOslim", "frequency") #Rename columns head(parentGOBPGOslimCounts) #Confirm counts ``` ```{r} write.csv(parentGOBPGOslimCounts, "2019-07-30-allTested-CVGOSlim-Frequency-BP.csv", row.names = FALSE, quote = FALSE) #Save file ``` ```{r} parentGOBPGOslimCounts$percentage <- (parentGOBPGOslimCounts$frequency/sum(parentGOBPGOslimCounts$frequency)*100) #Create a new column with percentages of GOSlim terms instead of raw counts head(parentGOBPGOslimCounts) #Confirm column creation max(parentGOBPGOslimCounts$percentage) ``` ```{r} parentGOBPGOslimCounts <- parentGOBPGOslimCounts[c(4, 1, 5, 11, 13, 2, 6, 3, 12, 7, 10, 9, 8),] #Reorganize rows based on broader functions. Rows are general activity, development, stress response, and metabolic processes head(parentGOBPGOslimCounts) #Confirm organization ``` ```{r} parentGOBPGOslimCounts$GOSlim <- c("Cell-Cell Signaling", "Cell Adhesion", "Death", "Signal Transduction", "Transport", "Cell Cycle and Proliferation", "Developmental Processes", "Cell Organization and Biogenesis", "Stress Response", "DNA Metabolism", "RNA Metabolism", "Protein Metabolism", "Other Metabolic Processes") #Capitalize labels head(parentGOBPGOslimCounts) #Confirm capitalization ``` ####### Create figure ```{r} RColorBrewer::display.brewer.all() #Show all RColorBrewer palettes. I will choose greens. plotColors <- rev(RColorBrewer::brewer.pal(5, "GnBu")) #Create a color palette for the barplots. Use 5 green-blue shades from RColorBrewer. Reverse theorder so the darkest shade is used first. barplot(parentGOBPGOslimCounts$percentage, col = plotColors) #See what plot looks like with new scheme barplot(parentGOBPGOslimCounts$percentage, col = dichromat(plotColors)) #Check that the plot colors will be easy to interpret for those with color blindess ``` ```{r} #pdf("2019-11-19-BP-GOSlim-allTested-Gene-Overlaps.pdf", width = 11, height = 8.5) #Save as a pdf par(mar = c(5, 20, 1, 1)) #Change figure boundaries barsGOSlim <- barplot((parentGOBPGOslimCounts$percentage), horiz = TRUE, axes = FALSE, xlim = c(0,18), col = c(rep(plotColors[2], times = 5), rep(plotColors[4], times = 3), rep(plotColors[5], times = 1), rep(plotColors[1], times = 4))) #Create a barplot that is horizontal (horiz = TRUE). Use axes = FALSE to remove all axes. Set xlim based on maximum percentage value. Colors correspond to broader GOSlim groupings. Save plot as a new object. axis(side = 2, at = barsGOSlim, labels = parentGOBPGOslimCounts$GOSlim, tick = FALSE, las = 2, col = "grey80", cex.axis = 1.5) #Add y-axis with GOSlim terms from BPGOSlimCounts. Place labels at specific barplot values from barsGOSlim. Remove tick marks (tick) and change orientation of labels (las) axis(side = 1, at = seq(from = 0, to = 18, by = 2), cex = 1.2, col = "grey80") #Add x-axis mtext(side = 1, "Percent of Genes", line = 3, cex = 1.5) #Add x-axis label par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE) #Create new plot plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n") #Add new plot on top of current plot legend("bottomleft", xpd = TRUE, legend = c("General Activity", "Development", "Stress Response", "Metabolic Processes"), pch = 22, col = "black", pt.bg = c(plotColors[2], plotColors[4], plotColors[5], plotColors[1]), bty = "n") #Place a legend in the bottom left of the figure with no box #dev.off() ``` ##### Differentially methylated loci GO-MWU takes all loci into account, but there may be some differences in parental GOslim terms if I only look at differentially methylated loci. I will import the DML annotations I made previously and work from there. ```{r} DMLGeneAnnotation <- read.csv("2019-06-20-DML-Gene-Annotation.csv", header = TRUE, row.names = 1) #Import file. The first column are row names. head(DMLGeneAnnotation) #Confirm import ``` ```{r} DMLGenbank <- data.frame("Genbank" = DMLGeneAnnotation$Genbank, "pvalue" = DMLGeneAnnotation$pvalue, "meth.diff" = DMLGeneAnnotation$meth.diff) #Create a new dataframe with Genbank IDs and pvalues for each DML. Include meth.diff as well head(DMLGenbank) #Confirm dataframe creation ``` ```{r} parentGOBPDML <- unique(merge(x = DMLGenbank, y = parentGOBP, by = "Genbank", all.x = TRUE)) #Merge parent GOslim terms and DML information by Genbank ID. Keep only unique entries. This will allow me to retain multiple DML per gene, but remove any merging artifacts. Keep rows that don't have matching GOterms from GO-MWU just in case. head(parentGOBPDML) #Confirm merge. There are several entries without any matching information from GO-MWU. ``` ```{r} sum(is.na(parentGOBPDML$name)) #Count how many genes with DML do not have GO-MWU entries. ``` ###### Map to GOslim terms ```{r} parentGOBPDMLGOslim <- unique(merge(x = parentGOBPDML, y = CVGOslim, by = "Genbank", all.x = TRUE)) #Merge GO-MWU with GOslim information, retaining all GO-MWU entries. Remove any merging artifacts head(parentGOBPDMLGOslim) #Confirm merge length(parentGOBPDMLGOslim$GOslim) #Count number of entries ``` ```{r} sum(is.na(parentGOBPDMLGOslim$GOslim)) #Count how many entries do not have GOslim terms. 425/5283 GO-MWU entries do not have GOslim entries. ``` ```{r} parentGOBPDMLGOslimCounts <- as.data.frame(table(parentGOBPDMLGOslim$GOslim)) #Count the frequency of each GOslim term colnames(parentGOBPDMLGOslimCounts) <- c("CVGOslim", "frequency") #Rename columns head(parentGOBPDMLGOslimCounts) #Confirm counts ``` ```{r} write.csv(parentGOBPDMLGOslimCounts, "2019-07-30-condensedDML-CVGOSlim-Frequency-BP.csv", row.names = FALSE, quote = FALSE) #Save file ``` ```{r} parentGOBPDMLGOslimCounts$percentage <- (parentGOBPDMLGOslimCounts$frequency/sum(parentGOBPDMLGOslimCounts$frequency)*100) #Create a new column with percentages of GOSlim terms instead of raw counts head(parentGOBPDMLGOslimCounts) #Confirm column creation max(parentGOBPDMLGOslimCounts$percentage) ``` ```{r} parentGOBPDMLGOslimCounts <- parentGOBPDMLGOslimCounts[c(4, 1, 5, 11, 13, 2, 6, 3, 12, 7, 10, 9, 8),] #Reorganize rows based on broader functions. Rows are general activity, development, stress response, and metabolic processes head(parentGOBPDMLGOslimCounts) #Confirm organization ``` ```{r} parentGOBPDMLGOslimCounts$GOSlim <- c("Cell-Cell Signaling", "Cell Adhesion", "Death", "Signal Transduction", "Transport", "Cell Cycle and Proliferation", "Developmental Processes", "Cell Organization and Biogenesis", "Stress Response", "DNA Metabolism", "RNA Metabolism", "Protein Metabolism", "Other Metabolic Processes") #Capitalize labels head(parentGOBPDMLGOslimCounts) #Confirm capitalization ``` ####### Create figure ```{r} #pdf("2019-11-19-BP-GOSlim-DML-Gene-Overlaps.pdf", width = 11, height = 8.5) #Save as a pdf par(mar = c(5, 20, 1, 1)) #Change figure boundaries barsGOSlimDML <- barplot(parentGOBPDMLGOslimCounts$percentage, horiz = TRUE, axes = FALSE, xlim = c(0,18), col = c(rep(plotColors[2], times = 5), rep(plotColors[4], times = 3), rep(plotColors[5], times = 1), rep(plotColors[1], times = 4))) #Create a barplot that is horizontal (horiz = TRUE). Use axes = FALSE to remove all axes. Set xlim based on maximum percentage value. Colors correspond to broader GOSlim groupings. Save plot as a new object. axis(side = 2, at = barsGOSlimDML, labels = parentGOBPDMLGOslimCounts$GOSlim, tick = FALSE, las = 2, col = "grey80", cex.axis = 1.5) #Add y-axis with GOSlim terms from BPGOSlimCounts. Place labels at specific barplot values from barsGOSlim. Remove tick marks (tick) and change orientation of labels (las) axis(side = 1, at = seq(from = 0, to = 18, by = 2), cex = 1.2, col = "grey80") #Add x-axis mtext(side = 1, "Percent of Genes with DML", line = 3, cex = 1.5) #Add x-axis label par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE) #Create new plot plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n") #Add new plot on top of current plot legend("bottomleft", xpd = TRUE, legend = c("General Activity", "Development", "Stress Response", "Metabolic Processes"), pch = 22, col = "black", pt.bg = c(plotColors[2], plotColors[4], plotColors[5], plotColors[1]), bty = "n") #Place a legend in the bottom left of the figure with no box #dev.off() ``` ##### Figure to compare all tested and DML distributions ```{r} #pdf("2019-11-19-BP-GOSlim-allTested-Versus-DML.pdf", width = 11, height = 8.5) #Save as a pdf par(mfrow = c(1, 2), oma = c(5, 20, 0, 3), mar = c(0, 0, 0, 0)) #Set up parameters for multipanel plot barsGOSlim <- barplot(-1*(parentGOBPGOslimCounts$percentage), horiz = TRUE, axes = FALSE, xlim = c(-18,0), col = plotColors[5]) #Create a barplot that is horizontal (horiz = TRUE). Use axes = FALSE to remove all axes. Set xlim based on maximum percentage value. Colors correspond to broader GOSlim groupings. Save plot as a new object. axis(side = 2, at = barsGOSlim, labels = parentGOBPGOslimCounts$GOSlim, tick = FALSE, las = 2, col = "grey80", cex.axis = 1.5) #Add y-axis with GOSlim terms from BPGOSlimCounts. Place labels at specific barplot values from barsGOSlim. Remove tick marks (tick) and change orientation of labels (las) axis(side = 1, at = seq(from = -18, to = 0, by = 2), labels = c(seq(from = 18, to = 0, by = -2)), cex = 1.2, col = "grey80") #Add x-axis mtext(side = 1, "% Genes", line = 3, cex = 1.5) #Add x-axis label barsGOSlimDML <- barplot(parentGOBPDMLGOslimCounts$percentage, horiz = TRUE, axes = FALSE, xlim = c(0,18), col = plotColors[2]) #Create a barplot that is horizontal (horiz = TRUE). Use axes = FALSE to remove all axes. Set xlim based on maximum percentage value. Colors correspond to broader GOSlim groupings. Save plot as a new object. axis(side = 1, at = seq(from = 0, to = 18, by = 2), cex = 1.2, col = "grey80") #Add x-axis mtext(side = 1, "% Genes with DML", line = 3, cex = 1.5) #Add x-axis label #dev.off() ```