--- title: "Gene Enrichment with GO-MWU" author: "Yaamini Venkataraman" date: "7/11/2019" output: github_document --- In this R Markdown file, we format files with a gene enrichment with [GO-MWU](https://github.com/z0on/GO_MWU). This is a rank-based gene enrichment method using a Mann-Whitney U test that works well with non-model organisms. There are separate annotations for *Zostera marina* and *Labyrinthula zosterae*, so these files will be formatted separately. # Set up R Markdown document ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ```{r} sessionInfo() #Obtain session information. ``` #Format GO-MWU input ## GO annotations table The GO annotation tables need to be tab-delimited with 2 columns: gene ID and GOterms separated by semicolons. ### *Z. marina* ```{r} allZosteraAnnotations <- read.delim("../../data/Zostera-blast-annot-withGeneID-noIsoforms-geneCounts.tab", header = FALSE) #Import file with gene counts for annotated genes colnames(allZosteraAnnotations) <- c("GeneID", "Accession", "Isoform", "E-value", "ProteinN", "GO_BP", "GO_CC", "GO_MF", "GO", "Status", "Organism", "S_10B", "S_9A", "S_13A", "S_42A", "S_46B", "S_47B", "S_48B", "S_2A", "S_2B", "S_7B", "S_8B", "S_33A", "S_36B", "S_38A", "S_40A") #Rename columns head(allZosteraAnnotations) #Confirm changes ``` ```{r} zosteraGOAnnotationsTable <- data.frame(allZosteraAnnotations$GeneID, allZosteraAnnotations$GO) #Save gene ID and all gene ontology terms in a new dataframe head(zosteraGOAnnotationsTable) #Confirm format is good for GO-MWU. It is. ``` ```{r} write.table(zosteraGOAnnotationsTable, "2019-07-11-Zostera-GO-Annotations-Table.tab", sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE) #Save file ``` ```{bash} head 2019-07-11-Zostera-GO-Annotations-Table.tab #Confirm formatting is good. ``` ### *L. zosterae* ```{r} allnonZosteraAnnotations <- read.delim("../../data/nonZostera-blast-annot-withGeneID-noIsoforms-geneCounts.tab", header = FALSE) #Import file with gene counts for annotated genes colnames(allnonZosteraAnnotations) <- c("GeneID", "Accession", "Isoform", "E-value", "ProteinN", "GO_BP", "GO_CC", "GO_MF", "GO", "Status", "Organism", "S_10B", "S_9A", "S_13A", "S_42A", "S_46B", "S_47B", "S_48B", "S_2A", "S_2B", "S_7B", "S_8B", "S_33A", "S_36B", "S_38A", "S_40A") #Rename columns head(allnonZosteraAnnotations) #Confirm changes ``` ```{r} nonZosteraGOAnnotationsTable <- data.frame(allnonZosteraAnnotations$GeneID, allnonZosteraAnnotations$GO) #Save gene ID and all gene ontology terms in a new dataframe head(nonZosteraGOAnnotationsTable) #Confirm format is good for GO-MWU. It is. ``` ```{r} write.table(nonZosteraGOAnnotationsTable, "2019-07-11-nonZostera-GO-Annotations-Table.tab", sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE) #Save file ``` ```{bash} head 2019-07-11-nonZostera-GO-Annotations-Table.tab ``` ## Table of significance measures The table of significance measures needs to be a CSV file with two columns: gene ID and measure of continuous significance. All genes need to be included, not just those deemed significantly different by `edgeR`. Similar to the GO-MWU README, I used signed negative log p-values modified from `edgeR` output. ### *Z. marina* #### Subset table ```{r} zosteraDEG <- read.delim("../EdgeR/EXP.CON.Zostera.ALL.txt", header = TRUE) #Import all genes used in edgeR output head(zosteraDEG) #Confirm import ``` ```{r} zosteraSigMeasures <- data.frame("gene" = row.names(zosteraDEG), "logFC" = zosteraDEG$logFC, "logP" = log(zosteraDEG$PValue)) #Subset row names as gene ID and log fold change. Also include log(p-value) head(zosteraSigMeasures) #Confirm import ``` #### Correct p-values for upregulated genes ```{r} zosteraSigMeasuresUpregulated <- subset(x = zosteraSigMeasures, subset = zosteraSigMeasures$logFC > 0) #Subset differentially expressed genes that are upregulated by pulling out genes with a positive fold change tail(zosteraSigMeasuresUpregulated) #Confirm subset ``` ```{r} range(zosteraSigMeasuresUpregulated$logP) #Look at the range of p-values for upregulated genes. All values are negative. For signed negative log p-values, the sign of the log p-value should correspond with upregulation for downregulation. All upregulated genes should have positive p-values. zosteraSigMeasuresUpregulated$signLogP <- -1 * zosteraSigMeasuresUpregulated$logP #Multiply all p-values by -1 so they become positive head(zosteraSigMeasuresUpregulated) #Confirm changes ``` #### Correct p-values for downregulated genes ```{r} zosteraSigMeasuresDownregulated <- subset(x = zosteraSigMeasures, subset = zosteraSigMeasures$logFC < 0) #Subset differentially expressed genes that are downregulated by pulling out genes with a negative fold change head(zosteraSigMeasuresDownregulated) #Confirm subset ``` ```{r} range(zosteraSigMeasuresDownregulated$logP) #All downregulated genes should have negative p-values. These p-values are already negative, so no more modification needs to happen. zosteraSigMeasuresDownregulated$signLogP <- zosteraSigMeasuresDownregulated$logP #Copy entries to a new column to be consistent with formatting for table with upregulated genes ``` #### Combine tables with corrected p-values ```{r} zosteraSigMeasuresCorrected <- rbind(zosteraSigMeasuresUpregulated, zosteraSigMeasuresDownregulated) #Combine tables by columns zosteraSigMeasuresCorrected <- zosteraSigMeasuresCorrected[,-c(2:3)] #Remove unnecessary columns head(zosteraSigMeasuresCorrected) #Confirm changes ``` ```{r} write.csv(zosteraSigMeasuresCorrected, "2019-07-11-Zostera-Table-of-Significance-Measures.csv", quote = FALSE, row.names = FALSE) #Save file ``` ```{bash} head 2019-07-11-Zostera-Table-of-Significance-Measures.csv #Check that row name column was not included in output ``` ### *L. zosterae* #### Subset table ```{r} nonzosteraDEG <- read.delim("../EdgeR/EXP.CON.NonZostera.all.txt", header = TRUE) #Import all genes used in edgeR output head(nonzosteraDEG) #Confirm import ``` ```{r} nonzosteraSigMeasures <- data.frame("gene" = row.names(nonzosteraDEG), "logFC" = nonzosteraDEG$logFC, "logP" = log(nonzosteraDEG$PValue)) #Subset row names as gene ID and log fold change. Also include log(p-value) head(nonzosteraSigMeasures) #Confirm changes ``` #### Correct p-values for upregulated genes ```{r} nonzosteraSigMeasuresUpregulated <- subset(x = nonzosteraSigMeasures, subset = nonzosteraSigMeasures$logFC > 0) #Subset differentially expressed genes that are upregulated by pulling out genes with a positive fold change tail(nonzosteraSigMeasuresUpregulated) #Confirm subset ``` ```{r} range(nonzosteraSigMeasuresUpregulated$logP) #Look at the range of p-values for upregulated genes. All values are negative. For signed negative log p-values, the sign of the log p-value should correspond with upregulation for downregulation. All upregulated genes should have positive p-values. nonzosteraSigMeasuresUpregulated$signLogP <- -1 * nonzosteraSigMeasuresUpregulated$logP #Multiply all p-values by -1 so they become positive head(nonzosteraSigMeasuresUpregulated) #Confirm changes ``` #### Correct p-values for downregulated genes ```{r} nonzosteraSigMeasuresDownregulated <- subset(x = nonzosteraSigMeasures, subset = nonzosteraSigMeasures$logFC < 0) #Subset differentially expressed genes that are downregulated by pulling out genes with a negative fold change head(nonzosteraSigMeasuresDownregulated) #Confirm subset ``` ```{r} range(nonzosteraSigMeasuresDownregulated$logP) #All downregulated genes should have negative p-values. These p-values are already negative, so no more modification needs to happen. nonzosteraSigMeasuresDownregulated$signLogP <- nonzosteraSigMeasuresDownregulated$logP #Copy entries to a new column to be consistent with formatting for table with upregulated genes ``` #### Combine tables with corrected p-values ```{r} nonzosteraSigMeasuresCorrected <- rbind(nonzosteraSigMeasuresUpregulated, nonzosteraSigMeasuresDownregulated) #Combine tables by columns nonzosteraSigMeasuresCorrected <- nonzosteraSigMeasuresCorrected[,-c(2:3)] #Remove unnecessary columns head(nonzosteraSigMeasuresCorrected) #Confirm changes ``` ```{r} write.csv(nonzosteraSigMeasuresCorrected, "2019-07-11-nonZostera-Table-of-Significance-Measures.csv", quote = FALSE, row.names = FALSE) #Save file ``` ```{bash} head 2019-07-11-nonZostera-Table-of-Significance-Measures.csv #Confirm row names are not included in the file ```