---
title: "Gene Enrichment Analysis"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
In this script, I'll isolate Uniprot codes needed to perform a gene enrichment. My ultimate goal is to identify enriched genes for mRNA regions that overlap with DMRs and DMLs. I will isolate Uniprot codes for the DMR-mRNA overlap, and for the gene background. The gene background will serve as my background for enrichment analyses, instead of the full transcriptome.
## Session information
```{r}
sessionInfo()
```
## Obtain GOterms
Before I do any sort of gene enrichment analysis, I need to get Uniprot codes for all genes. To do this, I will use `blastx`.
### Make a Uniprot Database
The Uniprot database was downloaded from this link as a .fasta file: http://www.uniprot.org/uniprot/?query=&fil=reviewed%3Ayes&columns=id%2Centry%20name%2Creviewed%2Cprotein%20names%2Cgenes%2Corganism%2Clength%2Cgo(biological%20process)%2Cgo-id%2Ccomment(PATHWAY)%2Cdatabase(UniPathway)%2Cdatabase(CDD)%2Cdatabase(Pfam.
```{bash}
head uniprot-filtered-reviewed.fasta #Check database creation
```
```{bash}
/Users/Shared/Apps/ncbi-blast-2.2.29+/bin/makeblastdb -help #Get the help menu
```
The first step involves taking the uniprot fasta file and making a new database. To do this, I will use the following code:
1. Path to `makeblastdb` to create the database needed
2. `in` to select the input file for the database
3. `dbtype` specifies the type of file which in our case, is a protein file
```{bash}
/Users/Shared/Apps/ncbi-blast-2.2.29+/bin/makeblastdb \
-in uniprot-filtered-reviewed.fasta \
-dbtype 'prot'
```
```{bash}
/Users/Shared/Apps/ncbi-blast-2.2.29+/bin/blastx -help
```
Now I can perform a `blastx` to compare my query, the *C. virginica* transcripts, to our protein database.
1. Path to `blastx`
2. `-query` provides the file we want to blast (*C. virginica* transcripts)
3.`-db` specifies database created in the previous step
4. `-outfmt` specifies the type of output file. I will use 6, a tabular file
5. `-out ` will allow me to save the output as a new file
6. `-num_threads 4` uses 4 CPUs in the BLAST search
7. `-max_target_seqs 1` keeps only one aligned sequence (i.e. the best match)
```{bash}
pwd #Obtain present working directory
```
```{bash}
/Users/Shared/Apps/ncbi-blast-2.2.29+/bin/blastx \
-query 2018-09-06-Virginica-transcripts.fna \
-db uniprot-filtered-reviewed.fasta \
-outfmt 6 \
-out 2018-09-06-Transcript-Uniprot-blastx.txt \
-num_threads 4 \
-max_target_seqs 1
```
```{bash}
head 2018-09-11-Transcript-Uniprot-blastx-codeIsolated.txt #Check file format of Uniprot codes. I'm using a file with the Uniprot code already isolated so I don't have to unfold any columns after merging files.
```
## Gene background
First, I will merge the gene background with *C. virginica* Uniprot information.
### Check file format
```{bash}
head 2018-12-18-Gene-Background-mRNA.txt #Check file format
```
The IDs in the blast result match up with the Genbank IDs in the overlap file. I need to isolate the Genbank IDs before I can merge the files. I'm not sure how to do this file manipulation in bash or R, so I'll do this in Excel.
```{bash}
head 2018-12-18-Gene-Background-mRNA-Unfolded.txt #Confirm unfolding
```
### Import files
```{r}
blastResults <- read.delim("2018-09-11-Transcript-Uniprot-blastx-codeIsolated.txt", header = FALSE, sep = "\t", dec = ".") #Import blast results
head(blastResults) #Confirm import
colnames(blastResults) <- c("Genbank", "V2", "Uniprot", "V4", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")
head(blastResults) #Confirm column headers
```
```{r}
backgroundmRNA <- read.delim("2018-12-18-Gene-Background-mRNA-Unfolded.txt", header = FALSE) #Import background-mRNA overlaps
head(backgroundmRNA) #Confirm import
colnames(backgroundmRNA) <- c("backgroundchromosome", "backgroundstart", "backgroundend", "backgroundstrand", "V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","V29","V30","V31","V32","V33","V34", "mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V40", "mRNAstrand", "V42", "ID", "Parent", "Dbxref=GeneID", "V46", "V47", "Genbank", "Name", "V50", "V51", "V52", "V53", "V54", "V55", "V56", "V57", "V58", "V59", "V60", "V61") #Rename columns
head(backgroundmRNA) #Confirm column naming
```
### Merge with Uniprot codes
```{r}
backgroundmRNAblast <- merge(x = blastResults, y = backgroundmRNA, by = "Genbank") #Merge the gene background and Uniprot codes by the Genbank ID column
head(backgroundmRNAblast) #Confirm merge
```
```{r}
write.csv(backgroundmRNAblast, "2019-02-22-mRNA-Gene-Background-Uniprot.csv") #Write out file
```
### Separate unique entries
I want a list of unique genes and matching Uniprot codes represented in the gene background-mRNA overlaps.
```{r}
uniqueBackgroundmRNAblast <- subset(backgroundmRNAblast, !duplicated(backgroundmRNAblast$Genbank)) #Subset the unique Genbank IDs from backgroundmRNAUniprot and save as a new dataframe.
head(uniqueBackgroundmRNAblast) #Confirm changes
```
```{r}
nrow(uniqueBackgroundmRNAblast) #Number of unique genes = 14943
write.csv(uniqueBackgroundmRNAblast, "2019-02-28-mRNA-Gene-Background-Uniprot-Unique.csv") #Save file
```
### Isolate Uniprot codes
To obtain gene enrichment information from DAVID, I need to isolate the Uniprot codes.
```{r}
backgroundmRNAUniprot <- backgroundmRNAblast$Uniprot #Isolate Uniprot codes from merged file
write.csv(backgroundmRNAUniprot, "2019-02-22-mRNA-Gene-Background-UniprotOnly.csv") #Save Uniprot codes as a new file
```
## Overlaps
### DMR-mRNA overlaps
I will repeat the steps above with the DMR-mRNA overlap file.
#### Check file format
```{bash}
head ../2018-11-01-DML-and-DMR-Analysis/2018-11-07-DMR-mRNA.txt #Check file format
```
Again, I need to unfold the last column.
```{bash}
head 2018-11-07-DMR-mRNA-unfolded.txt
```
#### Import files
```{r}
DMRmRNA <- read.delim("2018-11-07-DMR-mRNA-unfolded.txt", header = FALSE) #Import DMR-mRNA overlaps
head(DMRmRNA) #Confirm import
colnames(DMRmRNA) <- c("DMRchromosome", "DMRstart", "DMRend", "DMRstrand", "V5", "mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V11", "mRNAstrand", "V13", "ID", "Parent", "Dbxref=GeneID", "V17", "V18", "Genbank", "Name", "V21", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29") #Rename columns
head(DMRmRNA) #Confirm column naming
```
#### Merge with Uniprot codes
Now I'll merge the two documents.
```{r}
mRNADMRblast <- merge(x = blastResults, y = DMRmRNA, by = "Genbank") #Merge the two files by the Genbank ID column
head(mRNADMRblast) #Confirm merge
```
```{r}
write.csv(mRNADMRblast, "2019-02-22-mRNA-DMR-Uniprot.csv") #Write out file
```
#### Isolate Uniprot codes
```{r}
DMRmRNAUniprot <- mRNADMRblast$Uniprot #Isolate Uniprot codes from merged file
write.csv(DMRmRNAUniprot, "2019-02-22-mRNA-DMR-UniprotOnly.csv") #Save Uniprot codes as a new file
```
### DML-mRNA overlaps
#### Check file format
```{bash}
head ../2018-11-01-DML-and-DMR-Analysis/2018-11-07-DML-mRNA-Unfolded.txt #I unfolded the columns in Excel so the Genbank ID would be in its own column.
```
#### Import files
```{r}
DMLmRNA <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-07-DML-mRNA-unfolded.txt", header = FALSE) #Import DML-mRNA overlaps
head(DMLmRNA) #Confirm import
colnames(DMLmRNA) <- c("DMLchromosome", "DMLstart", "DMLend", "DMLstrand", "V5", "mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V11", "mRNAstrand", "V13", "ID", "Parent", "Dbxref=GeneID", "V17", "V18", "Genbank", "Name", "V21", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29") #Rename columns
head(DMLmRNA) #Confirm column naming
```
#### Merge with Uniprot codes
```{r}
mRNADMLblast <- merge(x = blastResults, y = DMLmRNA, by = "Genbank") #Merge the two files by the Genbank ID column
head(mRNADMLblast) #Confirm merge
```
```{r}
write.csv(mRNADMLblast, "2019-02-22-mRNA-DML-Uniprot.csv") #Write out file
```
#### Isolate Uniprot codes
```{r}
DMLmRNAUniprot <- mRNADMLblast$Uniprot #Isolate Uniprot codes from merged file
write.csv(DMLmRNAUniprot, "2019-02-22-mRNA-DML-UniprotOnly.csv") #Save Uniprot codes as a new file
```
## Upstream Flanks
### DMR-mRNA
#### Check file format
```{bash}
head ../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DMR-Unfolded.txt #Check file format
```
#### Import files
```{r}
DMRmRNAUpstream <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DMR-Unfolded.txt", header = FALSE) #Import DMR-mRNA overlaps with upstream flanks
head(DMRmRNAUpstream) #Confirm import
colnames(DMRmRNAUpstream) <- c("mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V6", "mRNAstrand", "V8", "V9", "DMRchromosome", "DMRstart", "DMRend", "DMR", "Percentdiff", "ID", "Parent", "Dbxref=GeneID", "V18", "V19", "Genbank", "Name", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29", "V30") #Rename columns
head(DMRmRNAUpstream) #Confirm column naming
```
#### Merge with Uniprot codes
```{r}
mRNADMRUpstreamBlast <- merge(x = blastResults, y = DMRmRNAUpstream, by = "Genbank") #Merge the two files by the Genbank ID column
head(mRNADMRUpstreamBlast) #Confirm merge
```
```{r}
write.csv(mRNADMRUpstreamBlast, "2019-02-22-mRNA-DMR-UpstreamFlanks-Uniprot.csv") #Write out file
```
#### Isolate Uniprot codes
```{r}
DMRmRNAUpstreamUniprot <- mRNADMRUpstreamBlast$Uniprot #Isolate Uniprot codes from merged file
write.csv(DMRmRNAUpstreamUniprot, "2019-02-22-mRNA-DMR-UpstreamFlanks-UniprotOnly.csv")
```
### DML-mRNA
#### Check file format
```{bash}
head ../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DML-Unfolded.txt #Check file format
```
#### Import files
```{r}
DMLmRNAUpstream <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DML-Unfolded.txt", header = FALSE) #Import DML-mRNA overlaps with upstream flanks
head(DMRmRNAUpstream) #Confirm import
colnames(DMLmRNAUpstream) <- c("mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V6", "mRNAstrand", "V8", "V9", "DMLchromosome", "DMLstart", "DMLend", "DMR", "Percentdiff", "ID", "Parent", "Dbxref=GeneID", "V18", "V19", "Genbank", "Name", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29", "V30") #Rename columns
head(DMLmRNAUpstream) #Confirm column naming
```
#### Merge with Uniprot codes
```{r}
mRNADMLUpstreamBlast <- merge(x = blastResults, y = DMLmRNAUpstream, by = "Genbank") #Merge the two files by the Genbank ID column
head(mRNADMLUpstreamBlast) #Confirm merge
```
```{r}
write.csv(mRNADMLUpstreamBlast, "2019-02-22-mRNA-DML-UpstreamFlanks-Uniprot.csv") #Write out file
```
#### Isolate Uniprot codes
```{r}
DMLmRNAUpstreamUniprot <- mRNADMLUpstreamBlast$Uniprot #Isolate Uniprot codes from merged file
write.csv(DMLmRNAUpstreamUniprot, "2019-02-22-mRNA-DML-UpstreamFlanks-UniprotOnly.csv")
```
## Downstream Flanks
### DMR-mRNA
#### Check file format
```{bash}
head ../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DMR-Unfolded.txt #Check file format
```
#### Import files
```{r}
DMRmRNADownstream <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DMR-Unfolded.txt", header = FALSE) #Import DMR-mRNA overlaps with downstream flanks
head(DMRmRNADownstream) #Confirm import
colnames(DMRmRNADownstream) <- c("mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V6", "mRNAstrand", "V8", "V9", "DMRchromosome", "DMRstart", "DMRend", "DMR", "Percentdiff", "ID", "Parent", "Dbxref=GeneID", "V18", "V19", "Genbank", "Name", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29", "V30") #Rename columns
head(DMRmRNADownstream) #Confirm column naming
```
#### Merge with Uniprot codes
```{r}
mRNADMRDownstreamBlast <- merge(x = blastResults, y = DMRmRNADownstream, by = "Genbank") #Merge the two files by the Genbank ID column
head(mRNADMRDownstreamBlast) #Confirm merge
```
```{r}
write.csv(mRNADMRDownstreamBlast, "2019-02-22-mRNA-DMR-DownstreamFlanks-Uniprot.csv") #Write out file
```
#### Isolate Uniprot codes
```{r}
DMRmRNADownstreamUniprot <- mRNADMRDownstreamBlast$Uniprot #Isolate Uniprot codes from merged file
write.csv(DMRmRNADownstreamUniprot, "2019-02-22-mRNA-DMR-DownstreamFlanks-UniprotOnly.csv")
```
### DML-mRNA
#### Check file format
```{bash}
head ../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DML-Unfolded.txt #Check file format
```
#### Import files
```{r}
DMLmRNADownstream <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DML-Unfolded.txt", header = FALSE) #Import DML-mRNA overlaps with downstream flanks
head(DMLmRNADownstream) #Confirm import
colnames(DMLmRNADownstream) <- c("mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V6", "mRNAstrand", "V8", "V9", "DMLchromosome", "DMLstart", "DMLend", "DMR", "Percentdiff", "ID", "Parent", "Dbxref=GeneID", "V18", "V19", "Genbank", "Name", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29", "V30") #Rename columns
head(DMLmRNADownstream) #Confirm column naming
```
#### Merge with Uniprot codes
```{r}
mRNADMLDownstreamBlast <- merge(x = blastResults, y = DMLmRNADownstream, by = "Genbank") #Merge the two files by the Genbank ID column
head(mRNADMLDownstreamBlast) #Confirm merge
```
```{r}
write.csv(mRNADMLDownstreamBlast, "2019-02-22-mRNA-DML-DownstreamFlanks-Uniprot.csv") #Write out file
```
#### Isolate Uniprot codes
```{r}
DMLmRNADownstreamUniprot <- mRNADMLDownstreamBlast$Uniprot #Isolate Uniprot codes from merged file
write.csv(DMLmRNADownstreamUniprot, "2019-02-22-mRNA-DML-DownstreamFlanks-UniprotOnly.csv")
```
## Non-overlapping Features
### DMR-mRNA
#### Check file format
```{bash}
head ../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-DMRs-Unfolded.txt #Check file format
```
#### Import files
```{r}
DMRmRNANoOverlaps <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-DMRs-Unfolded.txt", header = FALSE) #Import DMR-mRNA closest non-overlapping regions
head(DMRmRNANoOverlaps) #Confirm import
colnames(DMRmRNANoOverlaps) <- c("mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V6", "mRNAstrand", "V8", "V9", "DMRchromosome", "DMRstart", "DMRend", "DMR", "Percentdiff", "ID", "Parent", "Dbxref=GeneID", "V18", "V19", "V20", "Genbank", "Name", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29") #Rename columns
head(DMRmRNANoOverlaps) #Confirm column naming
```
#### Merge with Uniprot codes
```{r}
mRNADMRNoOverlapsBlast <- merge(x = blastResults, y = DMRmRNANoOverlaps, by = "Genbank") #Merge the two files by the Genbank ID column
head(mRNADMRNoOverlapsBlast) #Confirm merge
```
```{r}
write.csv(mRNADMRNoOverlapsBlast, "2019-02-22-mRNA-DMR-NoOverlaps-Uniprot.csv") #Write out file
```
#### Isolate Uniprot codes
```{r}
DMRmRNANoOverlapsUniprot <- mRNADMRNoOverlapsBlast$Uniprot #Isolate Uniprot codes from merged file
write.csv(DMRmRNANoOverlapsUniprot, "2019-02-22-mRNA-DMR-NoOverlaps-UniprotOnly.csv")
```
### DML-mRNA
#### Check file format
```{bash}
head ../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-DMLs-Unfolded.txt #Check file format
```
#### Import files
```{r}
DMLmRNANoOverlaps <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-DMLs-Unfolded.txt", header = FALSE) #Import DML-mRNA closest non-overlapping regions
head(DMLmRNANoOverlaps) #Confirm import
colnames(DMLmRNANoOverlaps) <- c("mRNAchromosome", "Gnomon", "mRNA", "mRNAstart", "mRNAend", "V6", "mRNAstrand", "V8", "V9", "DMLchromosome", "DMLstart", "DMLend", "DML", "Percentdiff", "ID", "Parent", "Dbxref=GeneID", "V18", "V19", "V20", "Genbank", "Name", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29") #Rename columns
head(DMLmRNANoOverlaps) #Confirm column naming
```
#### Merge with Uniprot codes
```{r}
mRNADMLNoOverlapsBlast <- merge(x = blastResults, y = DMLmRNANoOverlaps, by = "Genbank") #Merge the two files by the Genbank ID column
head(mRNADMLNoOverlapsBlast) #Confirm merge
```
```{r}
write.csv(mRNADMLNoOverlapsBlast, "2019-02-22-mRNA-DML-NoOverlaps-Uniprot.csv") #Write out file
```
#### Isolate Uniprot codes
```{r}
DMLmRNANoOverlapsUniprot <- mRNADMLNoOverlapsBlast$Uniprot #Isolate Uniprot codes from merged file
write.csv(DMLmRNANoOverlapsUniprot, "2019-02-22-mRNA-DML-NoOverlaps-UniprotOnly.csv")
```