--- title: "Gene Enrichment Analysis" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` In this script, I'll isolate Uniprot codes needed to perform a gene enrichment. My ultimate goal is to identify enriched genes for mRNA regions that overlap with DMRs and DMLs. I will isolate Uniprot codes for the DMR-mRNA overlap, and for the gene background. The gene background will serve as my background for enrichment analyses, instead of the full transcriptome. ## Session information ```{r} sessionInfo() ``` ## Obtain GOterms Before I do any sort of gene enrichment analysis, I need to get Uniprot codes for all genes. To do this, I will use `blastx`. ### Make a Uniprot Database The Uniprot database was downloaded from this link as a .fasta file: http://www.uniprot.org/uniprot/?query=&fil=reviewed%3Ayes&columns=id%2Centry%20name%2Creviewed%2Cprotein%20names%2Cgenes%2Corganism%2Clength%2Cgo(biological%20process)%2Cgo-id%2Ccomment(PATHWAY)%2Cdatabase(UniPathway)%2Cdatabase(CDD)%2Cdatabase(Pfam. ```{bash} head uniprot-filtered-reviewed.fasta #Check database creation ``` ```{bash} /Users/Shared/Apps/ncbi-blast-2.2.29+/bin/makeblastdb -help #Get the help menu ``` The first step involves taking the uniprot fasta file and making a new database. To do this, I will use the following code: 1. Path to `makeblastdb` to create the database needed 2. `in` to select the input file for the database 3. `dbtype` specifies the type of file which in our case, is a protein file ```{bash} /Users/Shared/Apps/ncbi-blast-2.2.29+/bin/makeblastdb \ -in uniprot-filtered-reviewed.fasta \ -dbtype 'prot' ``` ```{bash} /Users/Shared/Apps/ncbi-blast-2.2.29+/bin/blastx -help ``` Now I can perform a `blastx` to compare my query, the *C. virginica* transcripts, to our protein database. 1. Path to `blastx` 2. `-query` provides the file we want to blast (*C. virginica* transcripts) 3.`-db` specifies database created in the previous step 4. `-outfmt` specifies the type of output file. I will use 6, a tabular file 5. `-out ` will allow me to save the output as a new file 6. `-num_threads 4` uses 4 CPUs in the BLAST search 7. `-max_target_seqs 1` keeps only one aligned sequence (i.e. the best match) ```{bash} pwd #Obtain present working directory ``` ```{bash} /Users/Shared/Apps/ncbi-blast-2.2.29+/bin/blastx \ -query 2018-09-06-Virginica-transcripts.fna \ -db uniprot-filtered-reviewed.fasta \ -outfmt 6 \ -out 2018-09-06-Transcript-Uniprot-blastx.txt \ -num_threads 4 \ -max_target_seqs 1 ``` ```{bash} head 2018-09-11-Transcript-Uniprot-blastx-codeIsolated.txt #Check file format of Uniprot codes. I'm using a file with the Uniprot code already isolated so I don't have to unfold any columns after merging files. ``` ## Gene background First, I will merge the gene background with *C. virginica* Uniprot information. ### Check file format ```{bash} head 2018-12-18-Gene-Background-mRNA.txt #Check file format ``` The IDs in the blast result match up with the Genbank IDs in the overlap file. I need to isolate the Genbank IDs before I can merge the files. I'm not sure how to do this file manipulation in bash or R, so I'll do this in Excel. ```{bash} head 2018-12-18-Gene-Background-mRNA-Unfolded.txt #Confirm unfolding ``` ### Import files ```{r} blastResults <- read.delim("2018-09-11-Transcript-Uniprot-blastx-codeIsolated.txt", header = FALSE, sep = "\t", dec = ".") #Import blast results head(blastResults) #Confirm import colnames(blastResults) <- c("Genbank", "V2", "Uniprot", "V4", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore") head(blastResults) #Confirm column headers ``` ```{r} backgroundmRNA <- read.delim("2018-12-18-Gene-Background-mRNA-Unfolded.txt", header = FALSE) #Import background-mRNA overlaps head(backgroundmRNA) #Confirm import colnames(backgroundmRNA) <- c("backgroundchromosome", "backgroundstart", "backgroundend", "backgroundstrand", "V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","V29","V30","V31","V32","V33","V34", "mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V40", "mRNAstrand", "V42", "ID", "Parent", "Dbxref=GeneID", "V46", "V47", "Genbank", "Name", "V50", "V51", "V52", "V53", "V54", "V55", "V56", "V57", "V58", "V59", "V60", "V61") #Rename columns head(backgroundmRNA) #Confirm column naming ``` ### Merge with Uniprot codes ```{r} backgroundmRNAblast <- merge(x = blastResults, y = backgroundmRNA, by = "Genbank") #Merge the gene background and Uniprot codes by the Genbank ID column head(backgroundmRNAblast) #Confirm merge ``` ```{r} write.csv(backgroundmRNAblast, "2019-02-22-mRNA-Gene-Background-Uniprot.csv") #Write out file ``` ### Separate unique entries I want a list of unique genes and matching Uniprot codes represented in the gene background-mRNA overlaps. ```{r} uniqueBackgroundmRNAblast <- subset(backgroundmRNAblast, !duplicated(backgroundmRNAblast$Genbank)) #Subset the unique Genbank IDs from backgroundmRNAUniprot and save as a new dataframe. head(uniqueBackgroundmRNAblast) #Confirm changes ``` ```{r} nrow(uniqueBackgroundmRNAblast) #Number of unique genes = 14943 write.csv(uniqueBackgroundmRNAblast, "2019-02-28-mRNA-Gene-Background-Uniprot-Unique.csv") #Save file ``` ### Isolate Uniprot codes To obtain gene enrichment information from DAVID, I need to isolate the Uniprot codes. ```{r} backgroundmRNAUniprot <- backgroundmRNAblast$Uniprot #Isolate Uniprot codes from merged file write.csv(backgroundmRNAUniprot, "2019-02-22-mRNA-Gene-Background-UniprotOnly.csv") #Save Uniprot codes as a new file ``` ## Overlaps ### DMR-mRNA overlaps I will repeat the steps above with the DMR-mRNA overlap file. #### Check file format ```{bash} head ../2018-11-01-DML-and-DMR-Analysis/2018-11-07-DMR-mRNA.txt #Check file format ``` Again, I need to unfold the last column. ```{bash} head 2018-11-07-DMR-mRNA-unfolded.txt ``` #### Import files ```{r} DMRmRNA <- read.delim("2018-11-07-DMR-mRNA-unfolded.txt", header = FALSE) #Import DMR-mRNA overlaps head(DMRmRNA) #Confirm import colnames(DMRmRNA) <- c("DMRchromosome", "DMRstart", "DMRend", "DMRstrand", "V5", "mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V11", "mRNAstrand", "V13", "ID", "Parent", "Dbxref=GeneID", "V17", "V18", "Genbank", "Name", "V21", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29") #Rename columns head(DMRmRNA) #Confirm column naming ``` #### Merge with Uniprot codes Now I'll merge the two documents. ```{r} mRNADMRblast <- merge(x = blastResults, y = DMRmRNA, by = "Genbank") #Merge the two files by the Genbank ID column head(mRNADMRblast) #Confirm merge ``` ```{r} write.csv(mRNADMRblast, "2019-02-22-mRNA-DMR-Uniprot.csv") #Write out file ``` #### Isolate Uniprot codes ```{r} DMRmRNAUniprot <- mRNADMRblast$Uniprot #Isolate Uniprot codes from merged file write.csv(DMRmRNAUniprot, "2019-02-22-mRNA-DMR-UniprotOnly.csv") #Save Uniprot codes as a new file ``` ### DML-mRNA overlaps #### Check file format ```{bash} head ../2018-11-01-DML-and-DMR-Analysis/2018-11-07-DML-mRNA-Unfolded.txt #I unfolded the columns in Excel so the Genbank ID would be in its own column. ``` #### Import files ```{r} DMLmRNA <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-07-DML-mRNA-unfolded.txt", header = FALSE) #Import DML-mRNA overlaps head(DMLmRNA) #Confirm import colnames(DMLmRNA) <- c("DMLchromosome", "DMLstart", "DMLend", "DMLstrand", "V5", "mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V11", "mRNAstrand", "V13", "ID", "Parent", "Dbxref=GeneID", "V17", "V18", "Genbank", "Name", "V21", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29") #Rename columns head(DMLmRNA) #Confirm column naming ``` #### Merge with Uniprot codes ```{r} mRNADMLblast <- merge(x = blastResults, y = DMLmRNA, by = "Genbank") #Merge the two files by the Genbank ID column head(mRNADMLblast) #Confirm merge ``` ```{r} write.csv(mRNADMLblast, "2019-02-22-mRNA-DML-Uniprot.csv") #Write out file ``` #### Isolate Uniprot codes ```{r} DMLmRNAUniprot <- mRNADMLblast$Uniprot #Isolate Uniprot codes from merged file write.csv(DMLmRNAUniprot, "2019-02-22-mRNA-DML-UniprotOnly.csv") #Save Uniprot codes as a new file ``` ## Upstream Flanks ### DMR-mRNA #### Check file format ```{bash} head ../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DMR-Unfolded.txt #Check file format ``` #### Import files ```{r} DMRmRNAUpstream <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DMR-Unfolded.txt", header = FALSE) #Import DMR-mRNA overlaps with upstream flanks head(DMRmRNAUpstream) #Confirm import colnames(DMRmRNAUpstream) <- c("mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V6", "mRNAstrand", "V8", "V9", "DMRchromosome", "DMRstart", "DMRend", "DMR", "Percentdiff", "ID", "Parent", "Dbxref=GeneID", "V18", "V19", "Genbank", "Name", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29", "V30") #Rename columns head(DMRmRNAUpstream) #Confirm column naming ``` #### Merge with Uniprot codes ```{r} mRNADMRUpstreamBlast <- merge(x = blastResults, y = DMRmRNAUpstream, by = "Genbank") #Merge the two files by the Genbank ID column head(mRNADMRUpstreamBlast) #Confirm merge ``` ```{r} write.csv(mRNADMRUpstreamBlast, "2019-02-22-mRNA-DMR-UpstreamFlanks-Uniprot.csv") #Write out file ``` #### Isolate Uniprot codes ```{r} DMRmRNAUpstreamUniprot <- mRNADMRUpstreamBlast$Uniprot #Isolate Uniprot codes from merged file write.csv(DMRmRNAUpstreamUniprot, "2019-02-22-mRNA-DMR-UpstreamFlanks-UniprotOnly.csv") ``` ### DML-mRNA #### Check file format ```{bash} head ../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DML-Unfolded.txt #Check file format ``` #### Import files ```{r} DMLmRNAUpstream <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DML-Unfolded.txt", header = FALSE) #Import DML-mRNA overlaps with upstream flanks head(DMRmRNAUpstream) #Confirm import colnames(DMLmRNAUpstream) <- c("mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V6", "mRNAstrand", "V8", "V9", "DMLchromosome", "DMLstart", "DMLend", "DMR", "Percentdiff", "ID", "Parent", "Dbxref=GeneID", "V18", "V19", "Genbank", "Name", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29", "V30") #Rename columns head(DMLmRNAUpstream) #Confirm column naming ``` #### Merge with Uniprot codes ```{r} mRNADMLUpstreamBlast <- merge(x = blastResults, y = DMLmRNAUpstream, by = "Genbank") #Merge the two files by the Genbank ID column head(mRNADMLUpstreamBlast) #Confirm merge ``` ```{r} write.csv(mRNADMLUpstreamBlast, "2019-02-22-mRNA-DML-UpstreamFlanks-Uniprot.csv") #Write out file ``` #### Isolate Uniprot codes ```{r} DMLmRNAUpstreamUniprot <- mRNADMLUpstreamBlast$Uniprot #Isolate Uniprot codes from merged file write.csv(DMLmRNAUpstreamUniprot, "2019-02-22-mRNA-DML-UpstreamFlanks-UniprotOnly.csv") ``` ## Downstream Flanks ### DMR-mRNA #### Check file format ```{bash} head ../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DMR-Unfolded.txt #Check file format ``` #### Import files ```{r} DMRmRNADownstream <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DMR-Unfolded.txt", header = FALSE) #Import DMR-mRNA overlaps with downstream flanks head(DMRmRNADownstream) #Confirm import colnames(DMRmRNADownstream) <- c("mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V6", "mRNAstrand", "V8", "V9", "DMRchromosome", "DMRstart", "DMRend", "DMR", "Percentdiff", "ID", "Parent", "Dbxref=GeneID", "V18", "V19", "Genbank", "Name", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29", "V30") #Rename columns head(DMRmRNADownstream) #Confirm column naming ``` #### Merge with Uniprot codes ```{r} mRNADMRDownstreamBlast <- merge(x = blastResults, y = DMRmRNADownstream, by = "Genbank") #Merge the two files by the Genbank ID column head(mRNADMRDownstreamBlast) #Confirm merge ``` ```{r} write.csv(mRNADMRDownstreamBlast, "2019-02-22-mRNA-DMR-DownstreamFlanks-Uniprot.csv") #Write out file ``` #### Isolate Uniprot codes ```{r} DMRmRNADownstreamUniprot <- mRNADMRDownstreamBlast$Uniprot #Isolate Uniprot codes from merged file write.csv(DMRmRNADownstreamUniprot, "2019-02-22-mRNA-DMR-DownstreamFlanks-UniprotOnly.csv") ``` ### DML-mRNA #### Check file format ```{bash} head ../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DML-Unfolded.txt #Check file format ``` #### Import files ```{r} DMLmRNADownstream <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DML-Unfolded.txt", header = FALSE) #Import DML-mRNA overlaps with downstream flanks head(DMLmRNADownstream) #Confirm import colnames(DMLmRNADownstream) <- c("mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V6", "mRNAstrand", "V8", "V9", "DMLchromosome", "DMLstart", "DMLend", "DMR", "Percentdiff", "ID", "Parent", "Dbxref=GeneID", "V18", "V19", "Genbank", "Name", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29", "V30") #Rename columns head(DMLmRNADownstream) #Confirm column naming ``` #### Merge with Uniprot codes ```{r} mRNADMLDownstreamBlast <- merge(x = blastResults, y = DMLmRNADownstream, by = "Genbank") #Merge the two files by the Genbank ID column head(mRNADMLDownstreamBlast) #Confirm merge ``` ```{r} write.csv(mRNADMLDownstreamBlast, "2019-02-22-mRNA-DML-DownstreamFlanks-Uniprot.csv") #Write out file ``` #### Isolate Uniprot codes ```{r} DMLmRNADownstreamUniprot <- mRNADMLDownstreamBlast$Uniprot #Isolate Uniprot codes from merged file write.csv(DMLmRNADownstreamUniprot, "2019-02-22-mRNA-DML-DownstreamFlanks-UniprotOnly.csv") ``` ## Non-overlapping Features ### DMR-mRNA #### Check file format ```{bash} head ../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-DMRs-Unfolded.txt #Check file format ``` #### Import files ```{r} DMRmRNANoOverlaps <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-DMRs-Unfolded.txt", header = FALSE) #Import DMR-mRNA closest non-overlapping regions head(DMRmRNANoOverlaps) #Confirm import colnames(DMRmRNANoOverlaps) <- c("mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V6", "mRNAstrand", "V8", "V9", "DMRchromosome", "DMRstart", "DMRend", "DMR", "Percentdiff", "ID", "Parent", "Dbxref=GeneID", "V18", "V19", "V20", "Genbank", "Name", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29") #Rename columns head(DMRmRNANoOverlaps) #Confirm column naming ``` #### Merge with Uniprot codes ```{r} mRNADMRNoOverlapsBlast <- merge(x = blastResults, y = DMRmRNANoOverlaps, by = "Genbank") #Merge the two files by the Genbank ID column head(mRNADMRNoOverlapsBlast) #Confirm merge ``` ```{r} write.csv(mRNADMRNoOverlapsBlast, "2019-02-22-mRNA-DMR-NoOverlaps-Uniprot.csv") #Write out file ``` #### Isolate Uniprot codes ```{r} DMRmRNANoOverlapsUniprot <- mRNADMRNoOverlapsBlast$Uniprot #Isolate Uniprot codes from merged file write.csv(DMRmRNANoOverlapsUniprot, "2019-02-22-mRNA-DMR-NoOverlaps-UniprotOnly.csv") ``` ### DML-mRNA #### Check file format ```{bash} head ../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-DMLs-Unfolded.txt #Check file format ``` #### Import files ```{r} DMLmRNANoOverlaps <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-DMLs-Unfolded.txt", header = FALSE) #Import DML-mRNA closest non-overlapping regions head(DMLmRNANoOverlaps) #Confirm import colnames(DMLmRNANoOverlaps) <- c("mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V6", "mRNAstrand", "V8", "V9", "DMLchromosome", "DMLstart", "DMLend", "DML", "Percentdiff", "ID", "Parent", "Dbxref=GeneID", "V18", "V19", "V20", "Genbank", "Name", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29") #Rename columns head(DMLmRNANoOverlaps) #Confirm column naming ``` #### Merge with Uniprot codes ```{r} mRNADMLNoOverlapsBlast <- merge(x = blastResults, y = DMLmRNANoOverlaps, by = "Genbank") #Merge the two files by the Genbank ID column head(mRNADMLNoOverlapsBlast) #Confirm merge ``` ```{r} write.csv(mRNADMLNoOverlapsBlast, "2019-02-22-mRNA-DML-NoOverlaps-Uniprot.csv") #Write out file ``` #### Isolate Uniprot codes ```{r} DMLmRNANoOverlapsUniprot <- mRNADMLNoOverlapsBlast$Uniprot #Isolate Uniprot codes from merged file write.csv(DMLmRNANoOverlapsUniprot, "2019-02-22-mRNA-DML-NoOverlaps-UniprotOnly.csv") ```