---
title: "Gene Enrichment Analysis"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
In this script, I'll create several annotation tables for DML, perform a functional gene description, and prepare Uniprot codes for enrichment in DAVID.
# Install packages
```{r}
require(RColorBrewer)
require(dichromat)
require(dplyr)
```
# Session information
```{r}
sessionInfo()
```
# Obtain Uniprot codes
Before I do any sort of gene enrichment analysis, I need to get Uniprot codes for all genes. To do this, I will use `blastx`.
## Make a Uniprot Database
The Uniprot database was downloaded from this link as a .fasta file: http://www.uniprot.org/uniprot/?query=&fil=reviewed%3Ayes&columns=id%2Centry%20name%2Creviewed%2Cprotein%20names%2Cgenes%2Corganism%2Clength%2Cgo(biological%20process)%2Cgo-id%2Ccomment(PATHWAY)%2Cdatabase(UniPathway)%2Cdatabase(CDD)%2Cdatabase(Pfam.
```{bash}
head uniprot-filtered-reviewed.fasta #Check database creation
```
```{bash}
/Users/Shared/Apps/ncbi-blast-2.2.29+/bin/makeblastdb -help #Get the help menu
```
The first step involves taking the uniprot fasta file and making a new database. To do this, I will use the following code:
1. Path to `makeblastdb` to create the database needed
2. `in` to select the input file for the database
3. `dbtype` specifies the type of file which in our case, is a protein file
```{bash}
/Users/Shared/Apps/ncbi-blast-2.2.29+/bin/makeblastdb \
-in uniprot-filtered-reviewed.fasta \
-dbtype 'prot'
```
```{bash}
/Users/Shared/Apps/ncbi-blast-2.2.29+/bin/blastx -help
```
Now I can perform a `blastx` to compare my query, the *C. virginica* transcripts, to our protein database.
## Perform `blastx`
1. Path to `blastx`
2. `-query` provides the file we want to blast (*C. virginica* transcripts)
3.`-db` specifies database created in the previous step
4. `-outfmt` specifies the type of output file. I will use 6, a tabular file
5. `-out ` will allow me to save the output as a new file
6. `-num_threads 4` uses 4 CPUs in the BLAST search
7. `-max_target_seqs 1` keeps only one aligned sequence (i.e. the best match)
```{bash}
pwd #Obtain present working directory
```
```{bash}
/Users/Shared/Apps/ncbi-blast-2.2.29+/bin/blastx \
-query 2018-09-06-Virginica-transcripts.fna \
-db uniprot-filtered-reviewed.fasta \
-outfmt 6 \
-out 2018-09-06-Transcript-Uniprot-blastx.txt \
-num_threads 4 \
-max_target_seqs 1
```
```{bash}
head 2018-09-11-Transcript-Uniprot-blastx-codeIsolated.txt #Check file format of Uniprot codes. I'm using a file with the Uniprot code already isolated so I don't have to unfold any columns after merging files.
```
```{r}
blastResults <- read.delim("2018-09-11-Transcript-Uniprot-blastx-codeIsolated.txt", header = FALSE, sep = "\t", dec = ".") #Import blast results
head(blastResults) #Confirm import
colnames(blastResults) <- c("Genbank", "V2", "Uniprot", "V4", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")
head(blastResults) #Confirm column headers
```
# Master gene list
I want to create a master list with the following columns:
- chr
- start
- end
- p-value
- q-value
- meth.diff
- dist-to-closest-gene
- gene-chr
- gene-start
- gene-end
- gene-ID
- mRNA-start
- mRNA-end
- genbank
- uniprot-ID
- e-value
- product
I can start with my DML csv file, then merge that with the information from `closestBed`. Then I'll merge that document with my gene track to get gene IDs. I'll merge the resultant file with the mRNA track by gene ID to get Genbank IDs. Finally, I'll merge that document with the Uniprot codes by matching Genbank IDs.
## Merge DML list with `closestBed` list
```{r}
allDML <- read.csv("../2018-10-25-MethylKit/2018-10-25-Loci-Analysis/2019-04-05-Differentially-Methylated-Loci-Filtered-Destrand-50-Cov5.csv", header = TRUE) #Import DML list
allDML <- allDML[,-1] #Remove first column
head(allDML) #Confirm import
```
```{r}
closestDML <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2019-05-29-Flanking-Analysis/2019-05-29-Genes-Closest-NoOverlap-DMLs.txt", header = FALSE) #Import DML with closestBed information
colnames(closestDML) <- c("chr", "start", "end", "meth.diff", "gene-chr", "gene-start", "gene-end", "dist-to-closest-gene") #Add column names
head(closestDML) #Confirm column names. DML downstream of genes have positive distances. There may be more lines than DML becuase there are some DML that overlap multiple gene features, or have ties for closest gene features.
```
```{r}
#Add 1 to the start position and subtract 1 from the end position to get DML positions compatible with the DML list
closestDML$start <- closestDML$start + 1
closestDML$end <- closestDML$end -1
head(closestDML) #Confirm changes
```
```{r}
allDMLclosestBed <- merge(x = closestDML, y = allDML, by = "start", sort = FALSE) #Merge closest information and DML list. Do not resort the output.
head(allDMLclosestBed) #Confirm merge
```
```{r}
allDMLclosestBed <- allDMLclosestBed[, c(2, 1, 3, 4, 11, 12, 13, 14, 8, 5, 6, 7)] #Reorganize columns
head(allDMLclosestBed) #Confirm reorganization
```
## Merge with gene track
### Import full gene list
```{r}
geneList <- read.delim("../2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3", header = FALSE) #Import gene gff3
colnames(geneList) <- c("gene-chr", "Gnomon", "gene", "gene-start", "gene-end", "V6", "V7", "V8", "V9") #Rename columns
head(geneList) #Confirm import
```
```{r}
geneList <- geneList[,-c(6:8)] #Remove columns 6-8
head(geneList) #Confirm changes
```
### Isolate gene description from full list
```{bash}
cut -f9 ../2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3 \
| awk '{gsub(";","\t",$0); print;}' \
> 2019-06-20-Gene-Descriptions.txt #Isolate gene description column and convert ";" to "\t". Save the output as a new file
```
```{bash}
head 2019-06-20-Gene-Descriptions.txt #Confirm file creation.
```
Before I import this new data set into R, I need to count the number of columns. I can do this with `awk`.
```{bash}
awk '{print NF}' 2019-06-20-Gene-Descriptions.txt > 2019-06-20-Gene-Description-Column-Numbers.txt
```
### Import isolated gene descriptions
```{r}
geneDescriptionColumns <- read.delim("2019-06-20-Gene-Description-Column-Numbers.txt", header = FALSE) #Import data with number of columns for each row
max(geneDescriptionColumns) #Max number of columns. I'll need to import my gene description dataset with 9 columns per row.
```
```{r}
geneDescriptions <- read.delim2("2019-06-20-Gene-Descriptions.txt", header = FALSE, col.names = c("gene-number", "gene-ID", "V3", "V4", "V5", "V6", "V7", "V8", "V9")) #Import document with gene descrptions. Name 9 columns so all rows are imported with 9 columns
head(geneDescriptions) #Confirm import
```
```{r}
geneID <- geneDescriptions$gene.ID #Isolate gene ID column
geneID <- gsub("Dbxref=GeneID:", "", geneID) #Remove extraneous characters
head(geneID) #Confirm changes
```
```{r}
geneList <- data.frame(geneList, geneID) #Replace geneList dataframe with geneList and geneID
geneList <- geneList[,-6] #Remove extra column
colnames(geneList) <- c("gene-chr", "Gnomon", "gene", "gene-start", "gene-end", "gene-ID") #Rename columns to match previous naming conventions
head(geneList) #Confirm merge
```
### Merge gene track with `allDMLclosestBed`
```{r}
allDMLclosestBedgeneList <- merge(x = allDMLclosestBed, y = geneList, by = c("gene-start", "gene-end"), sort = FALSE) #Merge dataframes by gene start and end positions
head(allDMLclosestBedgeneList) #Confirm merge
```
```{r}
allDMLclosestBedgeneList <- allDMLclosestBedgeneList[,c(3:12, 14:15, 1:2, 16)] #Reorganize columns
head(allDMLclosestBedgeneList) #Confirm reorganization
```
## Merge with mRNA track
### Import full mRNA track
```{r}
mRNAList <- read.delim("../2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_mRNA_yrv.gff3", header = FALSE) #Import mRNA track
mRNAList <- mRNAList[,-c(6,8)] #Remove extra columns
colnames(mRNAList) <- c("mRNA-chr", "Gnomon", "mRNA", "mRNA-start", "mRNA-end", "mRNA-strand", "V9") #Rename columns
head(mRNAList) #Confirm import
```
### Isolate mRNA description from full list
```{bash}
cut -f9 ../2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_mRNA_yrv.gff3 \
| awk '{gsub(";","\t",$0); print;}' \
| awk '{gsub(",","\t",$0); print;}' \
> 2019-06-20-mRNA-Descriptions.txt #Isolate mRNA description column and convert ";" and "," to "\t". Save the output as a new file
```
```{bash}
head 2019-06-20-mRNA-Descriptions.txt #Confirm file creation.
```
Before I import this new data set into R, I need to count the number of columns. I can do this with `awk`.
```{bash}
awk '{print NF}' 2019-06-20-mRNA-Descriptions.txt > 2019-06-20-mRNA-Description-Column-Numbers.txt
```
### Import isolated mRNA descriptions
```{r}
mRNADescriptionColumns <- read.delim("2019-06-20-mRNA-Description-Column-Numbers.txt", header = FALSE) #Import data with number of columns for each row
max(mRNADescriptionColumns) #Max number of columns. I'll need to import my gene description dataset with 78 columns per row.
```
```{r}
mRNADescriptions <- read.delim2("2019-06-20-mRNA-Descriptions.txt", header = FALSE, col.names = paste("V", 1:78, sep = "")) #Import document with gene descrptions. Name 78 columns so all rows are imported with 78 columns
head(mRNADescriptions) #Confirm import
```
```{r}
geneIDmRNAList <- mRNADescriptions$V3 #Isolate gene ID column
geneIDmRNAList <- gsub("Dbxref=GeneID:", "", geneIDmRNAList) #Remove extraneous characters
head(geneIDmRNAList) #Confirm changes
```
```{r}
genbankIDmRNAList <- mRNADescriptions$V4 #Isolate Genbank ID
genbankIDmRNAList <- gsub("Genbank:", "", genbankIDmRNAList) #Remove extraneous characters
head(genbankIDmRNAList) #Confirm changes
```
Buried in the `mRNAList` is mRNA product information. I'm going to extract that information and add it to the master list.
### Isolate product descriptions
```{bash}
cut -f9 ../2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_mRNA_yrv.gff3 \
| awk '{gsub("product=","\t",$0); print;}' \
| cut -f2 \
| awk '{gsub(";","\t",$0); print;}' \
| awk '{gsub("%","\t",$0); print;}' \
> 2019-06-20-mRNA-Products.txt #Isolate mRNA description column and convert "product=" to "\t". This separates the descriptions into 2 columns. The beginning of second column has the product description. Take the second column of the output and convert ";" and "%" to "\t" to better isolate mRNA products. Save the output as a new file.
```
```{bash}
head 2019-06-20-mRNA-Products.txt #Confirm product description isolation
```
```{bash}
awk '{print NF}' 2019-06-20-mRNA-Products.txt > 2019-06-20-mRNA-Products-Column-Numbers.txt
```
### Import isolated mRNA products
```{r}
mRNAProductColumns <- read.delim("2019-06-20-mRNA-Products-Column-Numbers.txt", header = FALSE) #Import data with number of columns for each row
max(mRNAProductColumns) #Max number of columns. I'll need to import my dataset with 16 columns per row.
```
```{r}
mRNAProducts <- read.delim2("2019-06-20-mRNA-Products.txt", header = FALSE, col.names = paste("V", 1:16, sep = "")) #Import document with gene descrptions. Name 16 columns so all rows are imported with 16 columns.
head(mRNAProducts) #Confirm import
```
```{r}
mRNAProducts <- mRNAProducts$V1 #Extract the first column with product information
```
### Merge mRNA track with `allDMLclosestBedgeneList`
```{r}
mRNAList <- data.frame(mRNAList, geneIDmRNAList, genbankIDmRNAList, mRNAProducts) #Replace mRNAList dataframe with mRNAList, geneIDmRNAList, genbankIDmRNAList, and mRNAProducts
mRNAList <- mRNAList[,-c(6:7)] #Remove extra columns
colnames(mRNAList) <- c("mRNA-chr", "Gnomon", "mRNA", "mRNA-start", "mRNA-end", "gene-ID", "Genbank", "Product") #Rename columns to match previous naming conventions
head(mRNAList) #Confirm merge
```
```{r}
allDMLclosestBedgeneListmRNAList <- merge(x = allDMLclosestBedgeneList, y = mRNAList, by = "gene-ID", sort = FALSE, all.x = TRUE) #Merge dataframes by gene ID
head(allDMLclosestBedgeneListmRNAList) #Confirm merge
```
```{r}
allDMLclosestBedgeneListmRNAList <- allDMLclosestBedgeneListmRNAList[,c(2:15, 1, 19:22)] #Reorganize columns
head(allDMLclosestBedgeneListmRNAList) #Confirm reorganization
```
```{r}
allDMLclosestBedgeneListmRNAList <- distinct(allDMLclosestBedgeneListmRNAList) #Keep only unique rows
```
## Merge with Uniprot codes
```{r}
allDMLclosestBedgeneListmRNAListUniprot <- merge(x = allDMLclosestBedgeneListmRNAList, y = blastResults, by = "Genbank", sort = FALSE, all.x = TRUE) #Merge DML list with Uniprot codes by Genbank IDs
head(allDMLclosestBedgeneListmRNAListUniprot) #Confirm merge
```
```{r}
allDMLclosestBedgeneListmRNAListUniprot <- allDMLclosestBedgeneListmRNAListUniprot[,c(2:4, 9, 6:8, 10:11, 14:18, 1, 21:22, 31, 19)] #Reorganize columns and save as a new dataframe
colnames(allDMLclosestBedgeneListmRNAListUniprot) <- c("chr", "start", "end", "meth.diff", "strand", "pvalue", "qvalue", "dist-to-closest-gene", "gene-chr", "gene-start", "gene-end", "geneID", "mRNA-start", "mRNA-end", "Genbank", "Uniprot-Accession", "Uniprot-ID", "evalue", "Product") #Rename columns
allDMLclosestBedgeneListmRNAListUniprot <- distinct(allDMLclosestBedgeneListmRNAListUniprot) #Keep only unique rows
head(allDMLclosestBedgeneListmRNAListUniprot) #Confirm reorganization
```
## Merge with GOterms
I will use a Uniprot-SwissProt database I downloaded to match Uniprot-Accession with Gene Ontology information, which includes GOterms and IDs. I downloaded a tab-delimited SwissProt database with five additional columns: Gene Ontology IDs, Gene ontology (GO), Gene ontology (biological processes), Gene ontology (cellular component), and Gene ontology (molecular function).
### Import Uniprot-Gene Ontology databse
```{bash}
gunzip uniprot-reviewed_yes.tab.gz #Unzip file
```
```{bash}
head -n2 uniprot-reviewed_yes.tab #View file header
```
```{r}
uniprotGOTerms <- read.delim("uniprot-reviewed_yes.tab", header = TRUE) #Import tab-delimited file with header
head(uniprotGOTerms) #Confirm import
```
```{r}
uniprotGOTerms <- uniprotGOTerms[,c(1:2, 8, 10:12)] #Remove unnecessary columns
colnames(uniprotGOTerms) <- c("Uniprot-Accession", "Uniprot-ID", "GO-ID", "GO-BP", "GO-CC", "GO-MF") #Rename columns to match previous naming conventions
head(uniprotGOTerms) #Confirm changes
```
### Merge with `allDMLclosestBedgeneListmRNAListUniprot`
```{r}
allDMLclosestBedgeneListmRNAListUniprotGOTerms <- merge(x = allDMLclosestBedgeneListmRNAListUniprot, y = uniprotGOTerms, by = c("Uniprot-Accession", "Uniprot-ID"), sort = FALSE, all.x = TRUE) #Merge files by Uniprot Accesssion code and Uniprot ID
head(allDMLclosestBedgeneListmRNAListUniprotGOTerms) #Confirm merge
```
```{r}
allDMLclosestBedgeneListmRNAListUniprotGOTerms <- allDMLclosestBedgeneListmRNAListUniprotGOTerms[,c(3:17, 1:2, 18:23)] #Reorganize columns
colnames(allDMLclosestBedgeneListmRNAListUniprotGOTerms) <- c("chr", "start", "end", "meth.diff", "strand", "pvalue", "qvalue", "dist-to-closest-gene", "gene-chr", "gene-start", "gene-end", "geneID", "mRNA-start", "mRNA-end", "Genbank", "Uniprot-Accession", "Uniprot-ID", "evalue", "Product", "GO-ID", "GO-BP", "GO-CC", "GO-MF") #Rename columns to match naming conventions
head(allDMLclosestBedgeneListmRNAListUniprotGOTerms) #Confirm changes
```
## Export master gene list
```{r}
masterDMLAnnotation <- distinct(allDMLclosestBedgeneListmRNAListUniprotGOTerms) #Remove duplicated rows (merging artifacts)
head(masterDMLAnnotation) #Confirm changes
```
```{r}
write.csv(masterDMLAnnotation, "2019-06-20-Master-DML-Annotation.csv") #Write out masterDMLAnnotation as a .csv file
```
Now that I have a master gene list, I can create relevant subset documents.
## DML-Gene overlaps
```{r}
DMLGeneAnnotation <- subset(x = masterDMLAnnotation, subset = masterDMLAnnotation$`dist-to-closest-gene` == 0) #Subset dataset where the distance to the closest gene = 0 (i.e. DML and gene are overlapping)
head(DMLGeneAnnotation) #Confirm subset
```
```{r}
write.csv(DMLGeneAnnotation, "2019-06-20-DML-Gene-Annotation.csv") #Save file
```
### Characterize DML distribution in genes
##### Count the number of DML per gene
```{r}
DMLGeneAnnotationNomRNA <- unique(DMLGeneAnnotation[,-c(8:9, 13:18, 20:23)]) #Remove mRNA information from annotation to quantify #DML/gene (and not by mRNA). Also eliminate Uniprot columns. Use unique to remove any lines that are merging artifacts
head(DMLGeneAnnotationNomRNA) #Confirm column removal
```
```{r}
DMLbyGeneCounts <- as.data.frame(table(DMLGeneAnnotationNomRNA$geneID)) #Count the number of DML/gene
colnames(DMLbyGeneCounts) <- c("geneID", "DMLCount") #Change column names
head(DMLbyGeneCounts) #Confirm formatting
```
```{r}
mean(DMLbyGeneCounts$DMLCount) #Calculate mean number of DML in genes with DML
median(DMLbyGeneCounts$DMLCount) #Calculate median number of DML in genes with DML
max(DMLbyGeneCounts$DMLCount) #Calculate max number of DML in a single gene
```
```{r}
sum(DMLbyGeneCounts$DMLCount == 1) #Count number of genes with 1 DML
sum(DMLbyGeneCounts$DMLCount == 2) #Count number of genes with 2 DML
sum(DMLbyGeneCounts$DMLCount == 3) #Count number of genes with 3 DML
sum(DMLbyGeneCounts$DMLCount == 4) #Count number of genes with 4 DML
sum(DMLbyGeneCounts$DMLCount == 5) #Count number of genes with 5 DML
```
```{r}
numberDMLinGenes <- data.frame("DMLCount" = c("1", "2", "3", "4", "5"),
"geneCount" = c(sum(DMLbyGeneCounts$DMLCount == 1),
sum(DMLbyGeneCounts$DMLCount == 2),
sum(DMLbyGeneCounts$DMLCount == 3),
sum(DMLbyGeneCounts$DMLCount == 4),
sum(DMLbyGeneCounts$DMLCount == 5))) #Create dataframe for plotting
head(numberDMLinGenes) #Confirm creation
```
```{r}
genes1DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 1,] #Subset genes with 1 DML
genes1DMLAnnotations <- unique(merge(x = genes1DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 2 DML with annotation information. Use unique to remove merging artifacts
head(genes1DMLAnnotations) #Confirm merge
```
```{r}
write.csv(genes1DMLAnnotations, "2019-10-01-Genes-with-1-DML.csv", quote = FALSE, row.names = FALSE) #Save file
```
```{r}
genes2DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 2,] #Subset genes with 2 DML
genes2DMLAnnotations <- unique(merge(x = genes2DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 2 DML with annotation information. Use unique to remove merging artifacts
head(genes2DMLAnnotations) #Confirm merge
```
```{r}
write.csv(genes2DMLAnnotations, "2019-10-01-Genes-with-2-DML.csv", quote = FALSE, row.names = FALSE) #Save file
```
```{r}
genes3DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 3,] #Subset genes with 3 DML
genes3DMLAnnotations <- unique(merge(x = genes3DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 3 DML with annotation information. Use unique to remove merging artifacts
head(genes3DMLAnnotations) #Confirm merge
```
```{r}
write.csv(genes3DMLAnnotations, "2019-10-01-Genes-with-3-DML.csv", quote = FALSE, row.names = FALSE) #Save file
```
```{r}
genes4DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 4,] #Subset genes with 4 DML
genes4DMLAnnotations <- unique(merge(x = genes4DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 4 DML with annotation information. Use unique to remove merging artifacts
head(genes4DMLAnnotations) #Confirm merge
```
```{r}
write.csv(genes4DMLAnnotations, "2019-10-01-Genes-with-4-DML.csv", quote = FALSE, row.names = FALSE) #Save file
```
```{r}
genes5DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 5,] #Subset genes with 2 DML
genes5DMLAnnotations <- unique(merge(x = genes5DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 5 DML with annotation information. Use unique to remove merging artifacts
head(genes5DMLAnnotations) #Confirm merge
```
```{r}
write.csv(genes5DMLAnnotations, "2019-10-01-Genes-with-5-DML.csv", quote = FALSE, row.names = FALSE) #Save file
```
#### Characterize types of DML in genes
```{r}
sum(genes1DMLAnnotations$meth.diff > 0) #Count number of genes with 1 DML hypermethylated
sum(genes1DMLAnnotations$meth.diff < 0) #Count number of genes with 1 DML hypomethylated
```
```{r}
genes1DMLBreakdown <- data.frame("type" = c("1hyper", "0hyper"),
"DMLcount" = c(sum(genes1DMLAnnotations$meth.diff > 0),
sum(genes1DMLAnnotations$meth.diff < 0))) #Create dataframe with hyper/hypo breakdown to plot data
head(genes1DMLBreakdown)
```
I did some Excel manipulations to count the number of hypermethylated DML in genes with multiple DML.
```{r}
genes2DMLCounts <- read.csv("2019-10-01-Genes-with-2-DML-withCounts.csv", header = TRUE) #Import file
genes2DMLCounts$no.hypo <- genes2DMLCounts$DMLCount - genes2DMLCounts$no.hyper #Calculate the number of hypomethylated DML in genes
head(genes2DMLCounts) #Confirm changes
```
```{r}
sum(genes2DMLCounts$no.hyper == 2) #Count # genes with both DML hypermethylated
sum(genes2DMLCounts$no.hyper == 1) #Count # genes with 1 DML hypermethylated and 1 hypomethylated
sum(genes2DMLCounts$no.hyper == 0) #Count # genes with both DML hypomethylated
```
```{r}
genes2DMLBreakdown <- data.frame("type" = c("2hyper", "1hyper", "0hyper"),
"DMLcount" = c(sum(genes2DMLCounts$no.hyper == 2),
sum(genes2DMLCounts$no.hyper == 1),
sum(genes2DMLCounts$no.hyper == 0))) #Create dataframe with hyper/hypo breakdown to plot data
head(genes2DMLBreakdown)
```
```{r}
genes3DMLCounts <- read.csv("2019-10-01-Genes-with-3-DML-withCounts.csv", header = TRUE) #Import file
genes3DMLCounts$no.hypo <- genes3DMLCounts$DMLCount - genes3DMLCounts$no.hyper #Calculate the number of hypomethylated DML in genes
head(genes3DMLCounts) #Confirm changes
```
```{r}
sum(genes3DMLCounts$no.hyper == 3) #Count # genes with all DML hypermethylated
sum(genes3DMLCounts$no.hyper == 2) #Count # genes with 2 DML hypermethylated and 1 hypomethylated
sum(genes3DMLCounts$no.hyper == 1) #Count # genes with 1 DML hypermethylated and 1 hypomethylated
sum(genes3DMLCounts$no.hyper == 0) #Count # genes with all DML hypomethylated
```
```{r}
genes3DMLBreakdown <- data.frame("type" = c("3hyper", "2hyper", "1hyper", "0hyper"),
"DMLcount" = c(sum(genes3DMLCounts$no.hyper == 3),
sum(genes3DMLCounts$no.hyper == 2),
sum(genes3DMLCounts$no.hyper == 1),
sum(genes3DMLCounts$no.hyper == 0))) #Create dataframe with hyper/hypo breakdown to plot data
head(genes3DMLBreakdown)
```
```{r}
genes4DMLCounts <- read.csv("2019-10-01-Genes-with-4-DML-withCounts.csv", header = TRUE) #Import file
genes4DMLCounts$no.hypo <- genes4DMLCounts$DMLCount - genes4DMLCounts$no.hyper #Calculate the number of hypomethylated DML in genes
head(genes4DMLCounts) #Confirm changes
```
```{r}
sum(genes4DMLCounts$no.hyper == 4) #Count # genes with all DML hypermethylated
sum(genes4DMLCounts$no.hyper == 3) #Count # genes with 3 DML hypermethylated and 1 hypomethylated
sum(genes4DMLCounts$no.hyper == 2) #Count # genes with 2 DML hypermethylated and 2 hypomethylated
sum(genes4DMLCounts$no.hyper == 1) #Count # genes with 1 DML hypermethylated and 3 hypomethylated
sum(genes4DMLCounts$no.hyper == 0) #Count # genes with all DML hypomethylated
```
```{r}
genes4DMLBreakdown <- data.frame("type" = c("4hyper", "3hyper", "2hyper", "1hyper", "0hyper"),
"DMLcount" = c(sum(genes4DMLCounts$no.hyper == 4),
sum(genes4DMLCounts$no.hyper == 3),
sum(genes4DMLCounts$no.hyper == 2),
sum(genes4DMLCounts$no.hyper == 1),
sum(genes4DMLCounts$no.hyper == 0))) #Create dataframe with hyper/hypo breakdown to plot data
head(genes4DMLBreakdown)
```
```{r}
genes5DMLCounts <- read.csv("2019-10-01-Genes-with-5-DML-withCounts.csv", header = TRUE) #Import file
genes5DMLCounts$no.hypo <- genes5DMLCounts$DMLCount - genes5DMLCounts$no.hyper #Calculate the number of hypomethylated DML in genes
head(genes5DMLCounts) #Confirm changes
```
```{r}
sum(genes5DMLCounts$no.hyper == 5) #Count # genes with all DML hypermethylated
sum(genes5DMLCounts$no.hyper == 4) #Count # genes with 4 DML hypermethylated and 1 hypomethylated
sum(genes5DMLCounts$no.hyper == 3) #Count # genes with 3 DML hypermethylated and 2 hypomethylated
sum(genes5DMLCounts$no.hyper == 2) #Count # genes with 2 DML hypermethylated and 3 hypomethylated
sum(genes5DMLCounts$no.hyper == 1) #Count # genes with 1 DML hypermethylated and 4 hypomethylated
sum(genes5DMLCounts$no.hyper == 0) #Count # genes with all DML hypomethylated
```
##### Create individual figures
I'm going to create a multipanel plot with a breakdown of the number hyper- and hypomethylated in genes. I will start by creating each figure separately.
```{r}
#DML/gene
barplot(numberDMLinGenes$geneCount,
axes = FALSE,
names.arg = numberDMLinGenes$DMLCount,
ylim = c(0, 450)) #Create base plot
axis(side = 2, las = 1, at = c(0, 450), col = "grey80") #Put the axis labels only at 0 and 450.
mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label
mtext(side = 1, "Number of DML", line = 3) #Add x-axis label
```
```{r}
#Hyper/hypo breakdown for genes with 1 DML
barplot(genes1DMLBreakdown$DMLcount,
axes = FALSE,
names.arg = c("1,0", "0,1"),
ylim = c(0, 250),
col = c("grey20", "grey100")) #Create base plot
axis(side = 2, las = 1, at = c(0, 250), col = "grey80") #Put the axis labels only at 0 and 250.
mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label
mtext(side = 1, "Number of DML (Hypermethylated, Hypomethylated)", line = 3) #Add x-axis label
```
```{r}
#Hyper/hypo breakdown for genes with 2 DML
barplot(genes2DMLBreakdown$DMLcount,
axes = FALSE,
names.arg = c("2,0", "1,1", "0,2"),
ylim = c(0, 20),
col = c("grey20", "grey60", "grey100")) #Create base plot
axis(side = 2, las = 1, at = c(0, 20), col = "grey80") #Put the axis labels only at 0 and 20.
mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label
mtext(side = 1, "Number of DML (Hypermethylated, Hypomethylated)", line = 3) #Add x-axis label
```
```{r}
#Hyper/hypo breakdown for genes with 3 DML
barplot(genes3DMLBreakdown$DMLcount,
axes = FALSE,
names.arg = c("3,0", "2,1", "1,2", "0,3"),
ylim = c(0, 5),
col = c("grey20", "grey40", "grey60", "grey100")) #Create base plot
axis(side = 2, las = 1, at = c(0, 5), col = "grey80") #Put the axis labels only at 0 and the maximum value specified by yPrettyTicks. Tick marks are at 0 and 5.
mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label
mtext(side = 1, "Number of DML (Hypermethylated, Hypomethylated)", line = 3) #Add x-axis label
```
```{r}
#Hyper/hypo breakdown for genes with 4 DML
barplot(genes4DMLBreakdown$DMLcount,
axes = FALSE,
names.arg = c("4,0", "3,1", "2,2", "1,3", "0,4"),
ylim = c(0, 3),
col = c("grey20", "grey40", "grey60", "grey80", "grey100")) #Create base plot
axis(side = 2, las = 1, at = c(0, 3), col = "grey80") #Put the axis labels only at 0 and the maximum value specified by yPrettyTicks. Tick marks are at 0 and 3.
mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label
mtext(side = 1, "Number of DML (Hypermethylated, Hypomethylated)", line = 3) #Add x-axis label
```
#### Create multipanel figure
##### Create color scheme
```{r}
RColorBrewer::display.brewer.all() #Show all RColorBrewer palettes. I will choose greens.
plotColors <- rev(RColorBrewer::brewer.pal(5, "GnBu")) #Create a color palette for the barplots. Use 5 green-blue shades from RColorBrewer. Reverse theorder so the darkest shade is used first.
barplot(t(t(proportionData)),
col = plotColors) #See what plot looks like with new scheme
barplot(t(t(proportionData)),
col = dichromat(plotColors)) #Check that the plot colors will be easy to interpret for those with color blindess
```
##### Plot!
I'm going to combine the breakdowns with the number of DML in each chromosome.
```{r}
#pdf("2019-10-03-DML-Distribution-in-Chr-Genes.pdf", width = 11, height = 8.5)
plotMatrix <- matrix(c(1,1,0,3,4,
2,2,0,5,6),
nrow = 2, ncol = 5, byrow = TRUE) #Create a matrix with plot location information. Fill matrix by row.
par(mar = c(0, 0, 6, 2.5), oma = c(6, 10, 0, 0)) #Specify inner and outer margins
layout(mat = plotMatrix, width = c(2, 2, 0.7, 1, 1)) #Create a layout based on plotMatrix. The first and second plots (first column) should be wider.
#layout.show(n = 6) #Confirm layout looks good
#Chromosome distribution
DMLbarplot <- barplot(as.matrix(t(DMLchrCounts$DMLCount)),
axes = FALSE,
names.arg = DMLchrCounts$chr,
cex.names = 1.5,
xlim = c(0.7,11.5),
ylim = c(0,150),
col = plotColors[5]) #Create a barplot and save as a new object. Use axes = FALSE to remove the y-axis and names.arg to set labels on the x-axis. The object contains x coordinates for bars, so xlim is set at 12 to compensate for maximum value of 11.5
mtext(side = 1, "Chromosome", line = 3, cex = 1.3) #Add x-axis label
axis(side = 2, line = 1.5, at = seq(0, 150, by = 50), las = 2, col = "grey80", cex.axis = 1.2) #Add y-axis for DML counts
mtext(side = 2, "Number of DML", line = 5, cex = 1.3) #Add y-axis label for DML counts
mtext(side = 3, line = -1, at = c(0.5, 130), text = "a") #Add figure label
#Add gene number infomration
par(new = TRUE) #Create a new plot
plot(x = DMLbarplot,
y = DMLchrCounts$geneCount,
type = "b",
axes = FALSE, xlab = "", ylab = "", xaxs = "i", yaxs = "i",
pch = 16, col = plotColors[2],
xlim = c(0,12), ylim = c(0,6500)) #Plot points and lines (type = "b") for gene count by chromosome. Use the coordinates from DMLbarplot (x = DMLbarplot) and set xlim = (0,12) so plots are lined up. Use axes = FALSE to remove both axes. Remove x and y lables (xlab = ""; ylab = ""). Set ylim = (0,6500) to account for max y-values. Use xaxs and yxaxs to remove space between axes.
axis(side = 4, line = 1.5, at = seq(0, 6500, by = 500), las = 2, col = "grey80", cex.axis = 1.2) #Add y-axis for gene sequence counts
mtext(side = 4, "Number of Genes", line = 7, cex = 1.3) #Add y-axis label for gene sequence counts
#DML/gene distribution
barplot(numberDMLinGenes$geneCount,
axes = FALSE,
names.arg = numberDMLinGenes$DMLCount,
col = plotColors[5],
ylim = c(0, 450)) #Create base plot
axis(side = 2, las = 1, at = c(0, 450), col = "grey80") #Put the axis labels only at 0 and 450.
mtext(side = 2, "Number of Genes with DML", line = 3, cex = 1.3) #Add y-axis label
mtext(side = 1, "Number of DML", line = 3, cex = 1.3) #Add x-axis label
mtext(side = 3, line = -1, at = c(0.15, 440), text = "b") #Add figure label
#Hyper/hypo breakdown for genes with 1 DML
barplot(genes1DMLBreakdown$DMLcount,
axes = FALSE,
names.arg = c("1,0", "0,1"),
ylim = c(0, 250),
col = c(plotColors[1], plotColors[5])) #Create base plot
axis(side = 2, las = 1, at = c(0, 250), col = "grey80") #Put the axis labels only at 0 and 250.
mtext(side = 3, line = -1, at = c(0.4, 240), text = "c") #Add figure label
#Hyper/hypo breakdown for genes with 2 DML
barplot(genes2DMLBreakdown$DMLcount,
axes = FALSE,
names.arg = c("2,0", "", "0,2"),
ylim = c(0, 20),
col = c(plotColors[1], plotColors[3], plotColors[5])) #Create base plot
axis(side = 2, las = 1, at = c(0, 20), col = "grey80") #Put the axis labels only at 0 and 20.
mtext(side = 3, line = -1, at = c(0.5, 24.5), text = "d") #Add figure label
#Hyper/hypo breakdown for genes with 3 DML
barplot(genes3DMLBreakdown$DMLcount,
axes = FALSE,
names.arg = c("3,0", "", "", "0,3"),
ylim = c(0, 5),
col = c(plotColors[1], plotColors[2], plotColors[4], plotColors[5])) #Create base plot
axis(side = 2, las = 1, at = c(0, 5), col = "grey80") #Put the axis labels only at 0 and the maximum value specified by yPrettyTicks. Tick marks are at 0 and 5.
mtext(side = 2, "Number of Genes", line = 2, cex = 1.3) #Add y-axis label
mtext(side = 1, " Hyper, Hypo DML", line = 3, cex = 1.3, outer = TRUE) #Add x-axis label
mtext(side = 3, line = -1, at = c(0.7, 240), text = "e") #Add figure label
#Hyper/hypo breakdown for genes with 4 DML
barplot(genes4DMLBreakdown$DMLcount,
axes = FALSE,
names.arg = c("4,0", "", "", "", "0,4"),
ylim = c(0, 3),
col = plotColors) #Create base plot
axis(side = 2, las = 1, at = c(0, 3), col = "grey80") #Put the axis labels only at 0 and the maximum value specified by yPrettyTicks. Tick marks are at 0 and 3.
mtext(side = 3, line = -1, at = c(0.75, 240), text = "f") #Add figure label
#dev.off()
```
### Scaled DML distribution
Scaling each gene from 0% to 100%, I want to see where DML occurs. This is useful to see if methylation is occuring in any consistent location for each gene.
```{r}
head(DMLGeneAnnotationNomRNA)
```
```{r}
DMLGeneAnnotationNomRNA$geneLength <- DMLGeneAnnotationNomRNA$gene.end - DMLGeneAnnotationNomRNA$gene.start #Calculate gene length
DMLGeneAnnotationNomRNA$absPosition <- DMLGeneAnnotationNomRNA$start - DMLGeneAnnotationNomRNA$gene.start #Calculate the absolute position of the DML in the gene
DMLGeneAnnotationNomRNA$scaledPosition <- DMLGeneAnnotationNomRNA$absPosition / DMLGeneAnnotationNomRNA$geneLength #Calculate the scaled position of the DML in the gene
head(DMLGeneAnnotationNomRNA) #Confirm calculations
```
```{r}
DMLGenePositionsHyper <- subset(DMLGeneAnnotationNomRNA, subset = DMLGeneAnnotationNomRNA$meth.diff > 0) #Subset hypermethylated DML
DMLGenePositionsHypo <- subset(DMLGeneAnnotationNomRNA, subset = DMLGeneAnnotationNomRNA$meth.diff < 0) #Subset hypomethylated DML
head(DMLGenePositionsHyper) #Confirm subset
head(DMLGenePositionsHypo) #Confirm subset
```
#### Create individual figures
I'm going to make back-to-back histograms to easily compare distributions of hyper- and hypomethtylated DML along a gene. First I'll create individual plots, then I'll add them to the same coordinate plane.
```{r}
#Hypermethylated DML
hist(DMLGenePositionsHyper$scaledPosition,
breaks = 100,
axes = FALSE,
xlab = "", ylab = "", main = "",
col = "grey20",
xaxs = "i", yaxs = "i") #Create base plot with no axes or labels. Include breaks at each percent.
axis(side = 1, at = seq(from = 0, to = 1, by = 0.1), col = "grey80") #Add x-axis
mtext(side = 1, "Scaled Position on Gene", line = 3) #Add x-axis label
axis(side = 2, at = seq(from = 0, to = 8, by = 1), col = "grey80", las = 2) #Add y-axis
mtext(side = 2, "Number of Hypermethylated DML", line = 2) #Add y-axis label
```
```{r}
#Hypomethylated DML
hist(DMLGenePositionsHypo$scaledPosition,
breaks = 100,
axes = FALSE,
xlab = "", ylab = "", main = "",
col = "grey100",
xaxs = "i", yaxs = "i") #Create base plot with no axes or labels. Include breaks at each percent.
axis(side = 1, at = seq(from = 0, to = 1, by = 0.1), col = "grey80") #Add x-axis
mtext(side = 1, "Scaled Position on Gene", line = 3) #Add x-axis label
axis(side = 2, at = seq(from = 0, to = 7, by = 1), col = "grey80", las = 2) #Add y-axis
mtext(side = 2, "Number of Hypomethylated DML", line = 2) #Add y-axis label
```
#### Create back-to-back plot
```{r}
hyperPlotDML <- hist(DMLGenePositionsHyper$scaledPosition, breaks = 100, plot = FALSE) #Save base hypermethylated plot
hypoPlotDML <- hist(DMLGenePositionsHypo$scaledPosition, breaks = 100, plot = FALSE) #Save base hypomethylated plot
```
```{r}
#Calculate plotting parameters
hypoPlotDML$counts <- -(hypoPlotDML$counts) #Make counts negative for hypoPlot
hmaxDML <- max(hyperPlotDML$counts) #Set maximum value as max value of hyperPlot
hminDML <- min(hypoPlotDML$counts) #Set minimum value as minimum value of hypoPlot
breaksDMLPlot <- c(hyperPlotDML$breaks, hypoPlotDML$breaks) #Set breaks for plots
xmaxDML <- max(breaksDMLPlot) #Set xmax based on breaks
xminDML <- min(breaksDMLPlot) #Set xmin based on breaks
```
```{r}
#pdf("2019-10-09-Scaled-Gene-DML-Distribution.pdf", height = 8.5, width = 11)
plot(hyperPlotDML,
xlim = c(xminDML, xmaxDML),
ylim = c(hminDML, hmaxDML),
axes = FALSE,
xlab = "", ylab = "", main = "",
col = plotColors[2]) #Create the base plot with just hyperPlot. Set xlim and ylim on parameters calculated above. Remove all axes and axis labels
lines(hypoPlotDML, col= plotColors[5]) #Add hypoPlot
axis(side = 1, at = seq(from = 0, to = 1, by = 0.1), col = "grey80", cex = 1.2) #Add x-axis
mtext(side = 1, "Scaled Position on Gene", line = 3, cex = 1.5) #Add x-axis label
axis(side = 2, at = seq(from = hminDML, to = hmaxDML, by = 2), labels = c(seq(from = hminDML, to = 0, by = 2)*-1, seq(from = 2, to = hmaxDML, by = 2)), col = "grey80", las = 2, cex = 1.2) #Add y-axis. Add labels from minimum to maximum valules, but
mtext(side = 2, "Number of DML", line = 2, cex = 1.5) #Add y-axis label
#dev.off()
```
### Hypermethylated DML-gene overlaps
```{r}
DMLGeneAnnotationHypermethylated <- subset(x = DMLGeneAnnotation, subset = DMLGeneAnnotation$meth.diff > 0) #Subset hypermethylated DML by selecting meth.diff > 0
head(DMLGeneAnnotationHypermethylated) #Confirm subset
```
```{r}
write.csv(DMLGeneAnnotationHypermethylated, "2019-06-20-DML-Gene-Annotation-Hypermethylated.csv") #Save file
```
### Hypomethylated DML-gene overlaps
```{r}
DMLGeneAnnotationHypomethylated <- subset(x = DMLGeneAnnotation, subset = DMLGeneAnnotation$meth.diff < 0) #Subset hypomethylated DML by selecting meth.diff < 0
head(DMLGeneAnnotationHypomethylated) #Confirm subset
```
```{r}
write.csv(DMLGeneAnnotationHypomethylated, "2019-06-20-DML-Gene-Annotation-Hypomethylated.csv") #Save file
```
# Master exon overlap list
## Import DML-exon overlaps
```{r}
DMLExonOverlaps <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2019-05-29-DML-Exon.txt", header = FALSE) #Import file with DML and matching exons
DMLExonOverlaps <- DMLExonOverlaps[,-4] #Remove extra meth.diff column
colnames(DMLExonOverlaps) <- c("chr", "start", "end", "exon-chr", "exon-start", "exon-end")
DMLExonOverlaps$start <- DMLExonOverlaps$start + 1 #Add 1 to the start position since the file is BEDtools output
DMLExonOverlaps$end <- DMLExonOverlaps$end - 1 #Add 1 to the start position since the file is BEDtools output
head(DMLExonOverlaps) #Confirm changes
```
## Merge with DML-gene annotation table
```{r}
DMLExonAnnotation <- merge(x = DMLExonOverlaps, y = DMLGeneAnnotation, by = "start", sort = FALSE, all.x = TRUE) #Merge files by DML position. Don't sort the output and retain all rows from DMLExonOverlaps even if there are no matches in DMLGeneAnnotation
head(DMLExonAnnotation) #Confirm merge
```
```{r}
DMLExonAnnotation <- DMLExonAnnotation[,c(2, 1, 3, 9:12, 5:6, 15:28)] #Reorganize columns remove dist-to-closest-gene column since these is all overlaps.
colnames(DMLExonAnnotation) <- c("chr", "start", "end", "meth.diff", "strand", "pvalue", "qvalue", "exon-start", "exon-end", "gene-start", "gene-end", "geneID", "mRNA-start", "mRNA-end", "Genbank", "Uniprot-ID", "Uniprot-Accession", "evalue", "Product", "GO-ID", "GO-BP", "GO-CC", "GO-MF") #Add column names
DMLExonAnnotation <- distinct(DMLExonAnnotation) #Remove duplicate rows (merging artifacts)
head(DMLExonAnnotation) #Confirm changes
```
```{r}
write.csv(DMLExonAnnotation, "2019-06-20-DML-Exon-Annotation.csv") #Save file
```
## Hypermethylated DML-exon overlaps
```{r}
DMLExonAnnotationHypermethylated <- DMLExonAnnotation[DMLExonAnnotation$meth.diff > 0,] #Obtain hypermethylated DML by subsetting meth.dif > 0
head(DMLExonAnnotationHypermethylated) #Confirm subset
```
```{r}
write.csv(DMLExonAnnotationHypermethylated, "2019-06-20-DML-Exon-Annotation-Hypermethylated.csv") #Save file
```
## Hypomethylated DML-exon overlaps
```{r}
DMLExonAnnotationHypomethylated <- DMLExonAnnotation[DMLExonAnnotation$meth.diff < 0,] #Obtain hypomethylated DML by subsetting meth.dif < 0
head(DMLExonAnnotationHypomethylated) #Confirm subset
```
```{r}
write.csv(DMLExonAnnotationHypomethylated, "2019-06-20-DML-Exon-Annotation-Hypomethylated.csv") #Save file
```
# Master intron overlap list
## Import DML-Intron overlaps
```{r}
DMLIntonOverlaps <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2019-05-29-DML-Intron.txt", header = FALSE) #Import file with DML and matching introns
DMLIntonOverlaps <- DMLIntonOverlaps[,-4] #Remove extra meth.diff column
colnames(DMLIntonOverlaps) <- c("chr", "start", "end", "intron-chr", "intron-start", "intron-end")
DMLIntonOverlaps$start <- DMLIntonOverlaps$start + 1 #Add 1 to the start position since the file is BEDtools output
DMLIntonOverlaps$end <- DMLIntonOverlaps$end - 1 #Add 1 to the start position since the file is BEDtools output
head(DMLIntonOverlaps) #Confirm changes
```
## Merge with DML-gene annotation table
```{r}
DMLIntronAnnotation <- merge(x = DMLIntonOverlaps, y = DMLGeneAnnotation, by = "start", sort = FALSE, all.x = TRUE) #Merge files by DML position. Don't sort the output and retain all rows from DMLIntronOverlaps even if there are no matches in DMLGeneAnnotation
head(DMLIntronAnnotation) #Confirm merge
```
```{r}
DMLIntronAnnotation <- DMLIntronAnnotation[,c(2, 1, 3, 9:12, 5:6, 15:28)] #Reorganize columns remove dist-to-closest-gene column since these is all overlaps.
colnames(DMLIntronAnnotation) <- c("chr", "start", "end", "meth.diff", "strand", "pvalue", "qvalue", "intron-start", "intron-end", "gene-start", "gene-end", "geneID", "mRNA-start", "mRNA-end", "Genbank", "Uniprot-ID", "Uniprot-Accession", "evalue", "Product", "GO-ID", "GO-BP", "GO-CC", "GO-MF") #Add column names
DMLIntronAnnotation <- distinct(DMLIntronAnnotation) #Remove duplicate rows (merging artifacts)
head(DMLIntronAnnotation) #Confirm changes
```
```{r}
write.csv(DMLIntronAnnotation, "2019-06-20-DML-Intron-Annotation.csv") #Save file
```
## Hypermethylated DML-intron overlaps
```{r}
DMLIntronAnnotationHypermethylated <- subset(x = DMLIntronAnnotation, subset = DMLIntronAnnotation$meth.diff > 0) #Obtain hypermethylated DML by subsetting meth.diff > 0
head(DMLIntronAnnotationHypermethylated) #Confirm subset
```
```{r}
write.csv(DMLIntronAnnotationHypermethylated, "2019-06-20-DML-Intron-Annotation-Hypermethylated.csv") #Save file
```
## Hypomethylated DML-intron overlaps
```{r}
DMLIntronAnnotationHypomethylated <- subset(x = DMLIntronAnnotation, subset = DMLIntronAnnotation$meth.diff < 0) #Obtain hypomethylated DML by subsetting meth.diff < 0
head(DMLIntronAnnotationHypomethylated) #Confirm subset
```
```{r}
write.csv(DMLIntronAnnotationHypomethylated, "2019-06-20-DML-Intron-Annotation-Hypomethylated.csv") #Save file
```
# Master promoter overlap list
The `dist-to-closet-gene` has important information for DML distances to genes, but not the closest transcription start sites (i.e. mRNA). I will import the DML-promoter overlaps, then modify this dataframe to obtain a master DML-promoter overlap annotation.
## Import DML-promoter overlaps
```{r}
promoterList <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2019-05-29-Flanking-Analysis/2019-05-29-mRNA-1000bp-UpstreamFlanks-DML.txt", header = FALSE) #Import DML-promoter overlaps
promoterList <- promoterList[,-c(7:8)] #Remove extra columns
colnames(promoterList) <- c("flank-chr", "flank-start", "flank-end", "chr", "start", "end") #Rename columns
head(promoterList) #Confirm changes
```
```{r}
promoterList$start <- promoterList$start + 1 #Add 1 to the start position since the file is BEDtools output
promoterList$end <- promoterList$end -1 #Subtract 1 to the end position since the file is BEDtools output
promoterList$`mRNA-start` <- promoterList$`flank-end` #Duplicate The end of the flank is the same as the start of the mRNA. Create a duplicate column for easy merging.
head(promoterList) #Confrm changes
```
## Merge with `masterDMLAnnotation`
```{r}
DMLPromoterAnnotation <- merge(x = masterDMLAnnotation, y = promoterList, by = "mRNA-start", sort = FALSE, all.y = TRUE) #Merge the master list with promoter information by mRNA-start. Retain all rows in promoterList that do not have matching entries in masterDMLAnnotation
head(DMLPromoterAnnotation) #Confirm merge
```
```{r}
DMLPromoterAnnotation <- DMLPromoterAnnotation[,c(2:13, 25:26, 1, 14:23)] #Reorganize columns
colnames(DMLPromoterAnnotation) <- c("chr", "start", "end", "meth.diff", "strand", "pvalue", "qvalue", "dist-to-closest-gene", "gene-chr", "gene-start", "gene-end", "geneID", "flank-start", "flank-end", "mRNA-start", "mRNA-end", "Genbank", "Uniprot-Accession", "Uniprot-ID", "evalue", "Product", "GO-ID", "GO-BP", "GO-CC", "GO-MF") #Rename columns to match naming conventions
DMLPromoterAnnotation <- distinct(DMLPromoterAnnotation) #Remove duplicate rows (merge artifacts)
head(DMLPromoterAnnotation) #Confirm reorganization
```
```{r}
write.csv(DMLPromoterAnnotation, "2019-06-20-DML-Promoter-Annotation.csv") #Save file
```
# Master transposable element list
The last annotation list I need is one with DML-transposable element overlaps.
## Import DML-TE overlap file
```{r}
TEOverlaps <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2019-05-29-DML-TE-all.txt", header = FALSE) #Import DML-TE overlap file
TEOverlaps <- TEOverlaps[,-c(4, 6:7, 12)] #Remove extraneous columns
colnames(TEOverlaps) <- c("chr", "start", "end", "TE-chr", "TE-start", "TE-end", "V10", "TE-strand", "Target-Motif") #Rename columns
head(TEOverlaps) #Confirm import
```
## Merge with TE-gene overlap file
```{r}
TEGeneOverlaps <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-07-Genes-TE-all.txt", header = FALSE)
TEGeneOverlaps <- TEGeneOverlaps[,-c(4:6, 11, 13)] #Remove extraneous columns
colnames(TEGeneOverlaps) <- c("gene-chr", "gene-start", "gene-end", "TE-start", "TE-end", "V10", "TE-strand", "Target-Motif") #Rename columns. Changed V9 to V10 to match column names in TEOverlaps
head(TEGeneOverlaps) #Confirm import
```
```{r}
TEDMLOverlaps <- merge(x = TEOverlaps, y = TEGeneOverlaps, by = c("TE-start", "TE-end"), sort = FALSE, all.x = TRUE) #Merge TEOverlaps with TEGeneOverlaps. Do not sort the output. Retain any information in the DML-TE overlaps that does not have matching gene annotations
head(TEDMLOverlaps) #Confirm merge
```
```{r}
TEDMLOverlaps <- TEDMLOverlaps[,c(3:5, 1:2, 7:9, 11:12)] #Reorganize columns
head(TEDMLOverlaps) #Confirm changes
```
## Merge with `masterDMLAnnotation`
This will allow me to match TE-DML-gene overlaps with genes.
```{r}
DMLTEAnnotation <- merge(x = TEDMLOverlaps, y = masterDMLAnnotation, by = c("gene-start", "gene-end"), sort = FALSE, all.x = TRUE) #Merge TEDMLOverlaps with masterDMLAnnotation by gene start and end positions. Do not sort the output and retain any TEDMLOverlap entries that do not have matches in masterDMLAnnotation
head(DMLTEAnnotation) #Confirm merge
```
```{r}
DMLTEAnnotation <- DMLTEAnnotation[,c(11:17, 6:10, 18, 1:2, 20:31)] #Reorganize columns
colnames(DMLTEAnnotation) <- c("chr", "start", "end", "meth.diff", "strand", "pvalue", "qvalue", "TE-start", "TE-end", "V10", "TE-strand", "Target-Motif", "dist-to-closest-gene", "gene-start", "gene-end", "geneID", "mRNA-start", "mRNA-end", "Genbank", "Uniprot-Accession", "Uniprot-ID", "evalue", "Product", "GO-ID", "GO-BP", "GO-CC", "GO-MF") #Rename columns
DMLTEAnnotation <- distinct(DMLTEAnnotation) #Retain unique rows
head(DMLTEAnnotation) #Confirm changes
```
```{r}
write.csv(DMLTEAnnotation, "2019-06-20-DML-TE-Annotation.csv") #Save file
```
# Gene background
## Import DMLBackground-mRNA overlaps
```{r}
DMLBackgroundmRNAOverlaps <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2019-06-20-DMLBackground-mRNA.txt", header = FALSE)
DMLBackgroundmRNAOverlaps <- DMLBackgroundmRNAOverlaps[,-c(4:5)] #Remove extra columns
colnames(DMLBackgroundmRNAOverlaps) <- c("chr", "DMLB-start", "DMLB-end", "mRNA-start", "mRNA-end") #Rename columns
head(DMLBackgroundmRNAOverlaps) #Confirm import
```
## Merge with `mRNAList`
```{r}
DMLBackgroundmRNAOverlaps <- merge(x = DMLBackgroundmRNAOverlaps, y = mRNAList, by = c("mRNA-start", "mRNA-end"), sort = FALSE) #Merge dataframes by mRNA start and end positions
head(DMLBackgroundmRNAOverlaps) #Confirm merge
```
```{r}
DMLBackgroundmRNAOverlaps <- DMLBackgroundmRNAOverlaps[,c(3:6, 1:2, 9:11)] #Reorganize columns
head(DMLBackgroundmRNAOverlaps) #Confirm reorganization
```
```{r}
head(blastResults)
```
## Merge with Uniprot codes
```{r}
backgroundmRNAblast <- merge(x = blastResults, y = DMLBackgroundmRNAOverlaps, by = "Genbank", sort = FALSE) #Merge the gene background and Uniprot codes by the Genbank ID column
backgroundmRNAblast <- distinct(backgroundmRNAblast) #Remove duplicate rows from merging artifacts
head(backgroundmRNAblast) #Confirm merge
```
```{r}
backgroundmRNAblast <- backgroundmRNAblast[,c(15:21, 1:14, 22)] #Reorganize columns
backgroundmRNAblasteValueCutoff <- subset(backgroundmRNAblast, subset = backgroundmRNAblast$evalue < 1e-10) #Subset by evalue cutoff
head(backgroundmRNAblasteValueCutoff) #Confirm reorganization
```
```{r}
write.csv(backgroundmRNAblasteValueCutoff, "2019-06-20-mRNA-Gene-Background-Uniprot.csv") #Write out file
```
# Functional gene description
I will use `GO-BP` information to characterize Gene Ontology represented in DML-exon and DML-intron overlaps. I will also separate this out by hyper- and hypomethylated DML.
## Exons
### All exons
#### Subset by e-value cutoffs
```{r}
DMLExonAnnotationEvalueCutoff <- subset(x = DMLExonAnnotation, subset = DMLExonAnnotation$evalue < 1e-10) #Subset rows with acceptably low e-value cutoffs (no larger than 1e-10)
write.csv(DMLExonAnnotationEvalueCutoff, "2019-06-20-DML-Exon-Annotation-Evalue-Cutoff.csv")
nrow(DMLExonAnnotationEvalueCutoff) #Approx. 8000 rows dropped
```
#### Separate GOterms into new columns
```{bash}
awk -F "\"*,\"*" '{print $22}' 2019-06-20-DML-Exon-Annotation-Evalue-Cutoff.csv \
| awk '{gsub(";","\t",$0); print;}' \
> 2019-06-20-DML-Exon-Annotation-GO-BP-Expanded.txt #Isolate GO-BP column and convert ";" to ",". Save the output as a new file.
```
```{bash}
head -n5 2019-06-20-DML-Exon-Annotation-GO-BP-Expanded.txt #Confirm file creation
```
```{bash}
awk '{print NF}' 2019-06-20-DML-Exon-Annotation-GO-BP-Expanded.txt > 2019-06-20-DML-Exon-Annotation-GO-BP-Expanded-Column-Numbers.txt #Count the number of columns in each row
```
#### Import and format isolated GO-BP information
```{r}
DMLExonAnnotationGOBPColumns <- read.delim("2019-06-20-DML-Exon-Annotation-GO-BP-Expanded-Column-Numbers.txt")
max(DMLExonAnnotationGOBPColumns$X1) #I should import the file with 353 columns
```
```{r}
DMLExonAnnotationGOBPExpanded <- read.delim("2019-06-20-DML-Exon-Annotation-GO-BP-Expanded.txt", col.names = paste("V", 1:353, sep = ""), header = FALSE) #Import expanded GO-BP information
DMLExonAnnotationGOBPExpanded <- DMLExonAnnotationGOBPExpanded[-1,] #Remove the first row, which is the original haeder from the file
head(DMLExonAnnotationGOBPExpanded) #Confirm import
```
```{r}
DMLExonAnnotationGOBPExpanded <- DMLExonAnnotationGOBPExpanded[,c(1:3)] #Not every Uniprot Accession code has the same number of GOterms. I will use the first 3 GOterms assigned in my analyses
DMLExonAnnotationGOBPExpandedNoNA <- DMLExonAnnotationGOBPExpanded[is.na(DMLExonAnnotationGOBPExpanded$V1) == FALSE,] #Keeps rows with without NA in the first column (columns with at least one associated GOterm)
nrow(DMLExonAnnotationGOBPExpandedNoNA) #Count the number of rows
```
```{r}
DMLExonAnnotationGOBPExpandedNoNA$V2 <- gsub('{1}$', '', DMLExonAnnotationGOBPExpandedNoNA$V2) #Remove space before GOterms
DMLExonAnnotationGOBPExpandedNoNA$V3 <- gsub('{1}$', '', DMLExonAnnotationGOBPExpandedNoNA$V3) #Remove space before GOterms
head(DMLExonAnnotationGOBPExpandedNoNA) #Confim change
```
```{r}
DMLExonAnnotationGOBPExpandedNoNAList <- c(as.vector(DMLExonAnnotationGOBPExpandedNoNA$V1),
as.vector(DMLExonAnnotationGOBPExpandedNoNA$V2),
as.vector(DMLExonAnnotationGOBPExpandedNoNA$V3)) #Convert to a vector
DMLExonAnnotationGOBPExpandedNoNAList <- data.frame("rowNumber" = seq(from = 1, to = length(DMLExonAnnotationGOBPExpandedNoNAList), by = 1),
"GO-BP" = DMLExonAnnotationGOBPExpandedNoNAList) #Add row numbers to use in count function
head(DMLExonAnnotationGOBPExpandedNoNAList) #Check format
```
#### Count GOterms
```{r}
DMLExonAnnotationGOBPCounts <- count(DMLExonAnnotationGOBPExpandedNoNAList, GO.BP) #Count instances of each GOterm
head(DMLExonAnnotationGOBPCounts) #Confirm counting
```
```{r}
nrow(DMLExonAnnotationGOBPCounts) #Count the number of rows
```
```{r}
write.csv(DMLExonAnnotationGOBPCounts, "2019-06-20-DML-Exon-Annotation-GO-BP-Counts.csv") #Save file
```
### Hypermethylated exons
#### Subset by e-value cutoffs
```{r}
DMLExonAnnotationHypermethylatedEvalueCutoff <- subset(x = DMLExonAnnotationHypermethylated, subset = DMLExonAnnotationHypermethylated$evalue < 1e-10) #Subset rows with acceptably low e-value cutoffs (no larger than 1e-10)
write.csv(DMLExonAnnotationHypermethylatedEvalueCutoff, "2019-06-20-DML-Exon-Annotation-Hypermethylated-Evalue-Cutoff.csv") #Save file
nrow(DMLExonAnnotationHypermethylatedEvalueCutoff) #Count the number of rows
```
#### Separate GOterms into new columns
```{bash}
awk -F "\"*,\"*" '{print $22}' 2019-06-20-DML-Exon-Annotation-Hypermethylated-Evalue-Cutoff.csv \
| awk '{gsub(";","\t",$0); print;}' \
> 2019-06-20-DML-Exon-Annotation-Hypermethylated-GO-BP-Expanded.txt #Isolate GO-BP column and convert ";" to ",". Save the output as a new file.
```
```{bash}
head -n5 2019-06-20-DML-Exon-Annotation-Hypermethylated-GO-BP-Expanded.txt #Confirm file creation
```
```{bash}
awk '{print NF}' 2019-06-20-DML-Exon-Annotation-Hypermethylated-GO-BP-Expanded.txt > 2019-06-20-DML-Exon-Annotation-Hypermethylated-GO-BP-Expanded-Column-Numbers.txt #Count the number of columns in each row
```
#### Import and format isolated GO-BP information
```{r}
DMLExonAnnotationHypermethylatedGOBPColumns <- read.delim("2019-06-20-DML-Exon-Annotation-Hypermethylated-GO-BP-Expanded-Column-Numbers.txt")
max(DMLExonAnnotationHypermethylatedGOBPColumns$X1) #I should import the file with 147 columns
```
```{r}
DMLExonAnnotationHypermethylatedGOBPExpanded <- read.delim("2019-06-20-DML-Exon-Annotation-Hypermethylated-GO-BP-Expanded.txt", col.names = paste("V", 1:147, sep = ""), header = FALSE) #Import expanded GO-BP information
DMLExonAnnotationHypermethylatedGOBPExpanded <- DMLExonAnnotationHypermethylatedGOBPExpanded[-1,] #Remove the first row, which is the original haeder from the file
head(DMLExonAnnotationHypermethylatedGOBPExpanded) #Confirm import
```
```{r}
DMLExonAnnotationHypermethylatedGOBPExpanded <- DMLExonAnnotationHypermethylatedGOBPExpanded[,c(1:3)] #Not every Uniprot Accession code has the same number of GOterms. I will use the first 3 GOterms assigned in my analyses
DMLExonAnnotationHypermethylatedGOBPExpandedNoNA <- DMLExonAnnotationHypermethylatedGOBPExpanded[is.na(DMLExonAnnotationHypermethylatedGOBPExpanded$V1) == FALSE,] #Keeps rows with without NA in the first column (columns with at least one associated GOterm)
nrow(DMLExonAnnotationHypermethylatedGOBPExpandedNoNA) #Count the number of rows
```
```{r}
DMLExonAnnotationHypermethylatedGOBPExpandedNoNA$V2 <- gsub('{1}$', '', DMLExonAnnotationHypermethylatedGOBPExpandedNoNA$V2) #Remove space before GOterms
DMLExonAnnotationHypermethylatedGOBPExpandedNoNA$V3 <- gsub('{1}$', '', DMLExonAnnotationHypermethylatedGOBPExpandedNoNA$V3) #Remove space before GOterms
head(DMLExonAnnotationHypermethylatedGOBPExpandedNoNA) #Confim change
```
```{r}
DMLExonAnnotationHypermethylatedGOBPExpandedNoNAList <- c(as.vector(DMLExonAnnotationHypermethylatedGOBPExpandedNoNA$V1),
as.vector(DMLExonAnnotationHypermethylatedGOBPExpandedNoNA$V2),
as.vector(DMLExonAnnotationHypermethylatedGOBPExpandedNoNA$V3)) #Convert to a vector
DMLExonAnnotationHypermethylatedGOBPExpandedNoNAList <- data.frame("rowNumber" = seq(from = 1, to = length(DMLExonAnnotationHypermethylatedGOBPExpandedNoNAList), by = 1),
"GO-BP" = DMLExonAnnotationHypermethylatedGOBPExpandedNoNAList) #Add row numbers to use in count function
head(DMLExonAnnotationHypermethylatedGOBPExpandedNoNAList) #Check format
```
#### Count GOterms
```{r}
DMLExonAnnotationHypermethylatedGOBPCounts <- count(DMLExonAnnotationHypermethylatedGOBPExpandedNoNAList, GO.BP) #Count instances of each GOterm
head(DMLExonAnnotationHypermethylatedGOBPCounts) #Confirm counting
```
```{r}
nrow(DMLExonAnnotationHypermethylatedGOBPCounts) #Count the number of rows
```
```{r}
write.csv(DMLExonAnnotationHypermethylatedGOBPCounts, "2019-06-20-DML-Exon-Annotation-Hypermethylated-GO-BP-Counts.csv") #Save file
```
### Hypomethylated exons
#### Subset by e-value cutoffs
```{r}
DMLExonAnnotationHypomethylatedEvalueCutoff <- subset(x = DMLExonAnnotationHypomethylated, subset = DMLExonAnnotationHypomethylated$evalue < 1e-10) #Subset rows with acceptably low e-value cutoffs (no larger than 1e-10)
write.csv(DMLExonAnnotationHypomethylatedEvalueCutoff, "2019-06-20-DML-Exon-Annotation-Hypomethylated-Evalue-Cutoff.csv") #Save file
nrow(DMLExonAnnotationHypomethylatedEvalueCutoff) #Count the number of rows
```
#### Separate GOterms into new columns
```{bash}
awk -F "\"*,\"*" '{print $22}' 2019-06-20-DML-Exon-Annotation-Hypomethylated-Evalue-Cutoff.csv \
| awk '{gsub(";","\t",$0); print;}' \
> 2019-06-20-DML-Exon-Annotation-Hypomethylated-GO-BP-Expanded.txt #Isolate GO-BP column and convert ";" to ",". Save the output as a new file.
```
```{bash}
head -n5 2019-06-20-DML-Exon-Annotation-Hypomethylated-GO-BP-Expanded.txt #Confirm file creation
```
```{bash}
awk '{print NF}' 2019-06-20-DML-Exon-Annotation-Hypomethylated-GO-BP-Expanded.txt > 2019-06-20-DML-Exon-Annotation-Hypomethylated-GO-BP-Expanded-Column-Numbers.txt #Count the number of columns in each row
```
#### Import and format isolated GO-BP information
```{r}
DMLExonAnnotationHypomethylatedGOBPColumns <- read.delim("2019-06-20-DML-Exon-Annotation-Hypomethylated-GO-BP-Expanded-Column-Numbers.txt")
max(DMLExonAnnotationHypomethylatedGOBPColumns$X1) #I should import the file with 353 columns
```
```{r}
DMLExonAnnotationHypomethylatedGOBPExpanded <- read.delim("2019-06-20-DML-Exon-Annotation-Hypomethylated-GO-BP-Expanded.txt", col.names = paste("V", 1:353, sep = ""), header = FALSE) #Import expanded GO-BP information
DMLExonAnnotationHypomethylatedGOBPExpanded <- DMLExonAnnotationHypomethylatedGOBPExpanded[-1,] #Remove the first row, which is the original haeder from the file
head(DMLExonAnnotationHypomethylatedGOBPExpanded) #Confirm import
```
```{r}
DMLExonAnnotationHypomethylatedGOBPExpanded <- DMLExonAnnotationHypomethylatedGOBPExpanded[,c(1:3)] #Not every Uniprot Accession code has the same number of GOterms. I will use the first 3 GOterms assigned in my analyses
DMLExonAnnotationHypomethylatedGOBPExpandedNoNA <- DMLExonAnnotationHypomethylatedGOBPExpanded[is.na(DMLExonAnnotationHypomethylatedGOBPExpanded $V1) == FALSE,] #Keeps rows with without NA in the first column (columns with at least one associated GOterm)
nrow(DMLExonAnnotationHypomethylatedGOBPExpandedNoNA) #Count the number of rows
```
```{r}
DMLExonAnnotationHypomethylatedGOBPExpandedNoNA $V2 <- gsub('{1}$', '', DMLExonAnnotationHypomethylatedGOBPExpandedNoNA $V2) #Remove space before GOterms
DMLExonAnnotationHypomethylatedGOBPExpandedNoNA $V3 <- gsub('{1}$', '', DMLExonAnnotationHypomethylatedGOBPExpandedNoNA $V3) #Remove space before GOterms
head(DMLExonAnnotationHypomethylatedGOBPExpandedNoNA) #Confim change
```
```{r}
DMLExonAnnotationHypomethylatedGOBPExpandedNoNAList <- c(as.vector(DMLExonAnnotationHypomethylatedGOBPExpandedNoNA $V1),
as.vector(DMLExonAnnotationHypomethylatedGOBPExpandedNoNA $V2),
as.vector(DMLExonAnnotationHypomethylatedGOBPExpandedNoNA $V3)) #Convert to a vector
DMLExonAnnotationHypomethylatedGOBPExpandedNoNAList <- data.frame("rowNumber" = seq(from = 1, to = length(DMLExonAnnotationHypomethylatedGOBPExpandedNoNAList), by = 1),
"GO-BP" = DMLExonAnnotationHypomethylatedGOBPExpandedNoNAList) #Add row numbers to use in count function
head(DMLExonAnnotationHypomethylatedGOBPExpandedNoNAList) #Check format
```
#### Count GOterms
```{r}
DMLExonAnnotationHypomethylatedGOBPCounts <- count(DMLExonAnnotationHypomethylatedGOBPExpandedNoNAList, GO.BP) #Count instances of each GOterm
head(DMLExonAnnotationHypomethylatedGOBPCounts) #Confirm counting
```
```{r}
nrow(DMLExonAnnotationHypomethylatedGOBPCounts) #Count rows
```
```{r}
write.csv(DMLExonAnnotationHypomethylatedGOBPCounts, "2019-06-20-DML-Exon-Annotation-Hypomethylated-GO-BP-Counts.csv") #Save file
```
## Introns
### All introns
#### Subset by e-value cutoffs
```{r}
DMLIntronAnnotationEvalueCutoff <- subset(x = DMLIntronAnnotation, subset = DMLIntronAnnotation$evalue < 1e-10) #Subset rows with acceptably low e-value cutoffs (no larger than 1e-10)
write.csv(DMLIntronAnnotationEvalueCutoff, "2019-06-20-DML-Intron-Annotation-Evalue-Cutoff.csv")
nrow(DMLIntronAnnotationEvalueCutoff)
```
#### Separate GOterms into new columns
```{bash}
awk -F "\"*,\"*" '{print $22}' 2019-06-20-DML-Intron-Annotation-Evalue-Cutoff.csv \
| awk '{gsub(";","\t",$0); print;}' \
> 2019-06-20-DML-Intron-Annotation-GO-BP-Expanded.txt #Isolate GO-BP column and convert ";" to ",". Save the output as a new file.
```
```{bash}
head -n5 2019-06-20-DML-Intron-Annotation-GO-BP-Expanded.txt #Confirm file creation
```
```{bash}
awk '{print NF}' 2019-06-20-DML-Intron-Annotation-GO-BP-Expanded.txt > 2019-06-20-DML-Intron-Annotation-GO-BP-Expanded-Column-Numbers.txt #Count the number of columns in each row
```
#### Import and format isolated GO-BP information
```{r}
DMLIntronAnnotationGOBPColumns <- read.delim("2019-06-20-DML-Intron-Annotation-GO-BP-Expanded-Column-Numbers.txt")
max(DMLIntronAnnotationGOBPColumns $X1) #I should import the file with 147 columns
```
```{r}
DMLIntronAnnotationGOBPExpanded <- read.delim("2019-06-20-DML-Intron-Annotation-GO-BP-Expanded.txt", col.names = paste("V", 1:147, sep = ""), header = FALSE) #Import expanded GO-BP information
DMLIntronAnnotationGOBPExpanded <- DMLIntronAnnotationGOBPExpanded[-1,] #Remove the first row, which is the original header from the file
head(DMLIntronAnnotationGOBPExpanded) #Confirm import
```
```{r}
DMLIntronAnnotationGOBPExpanded <- DMLIntronAnnotationGOBPExpanded[,c(1:3)] #Not every Uniprot Accession code has the same number of GOterms. I will use the first 3 GOterms assigned in my analyses
DMLIntronAnnotationGOBPExpandedNoNA <- DMLIntronAnnotationGOBPExpanded[is.na(DMLIntronAnnotationGOBPExpanded $V1) == FALSE,] #Keeps rows with without NA in the first column (columns with at least one associated GOterm)
nrow(DMLIntronAnnotationGOBPExpandedNoNA) #Count the number of rows
```
```{r}
DMLIntronAnnotationGOBPExpandedNoNA $V2 <- gsub('{1}$', '', DMLIntronAnnotationGOBPExpandedNoNA $V2) #Remove space before GOterms
DMLIntronAnnotationGOBPExpandedNoNA $V3 <- gsub('{1}$', '', DMLIntronAnnotationGOBPExpandedNoNA $V3) #Remove space before GOterms
head(DMLIntronAnnotationGOBPExpandedNoNA) #Confim change
```
```{r}
DMLIntronAnnotationGOBPExpandedNoNAList <- c(as.vector(DMLIntronAnnotationGOBPExpandedNoNA$V1),
as.vector(DMLIntronAnnotationGOBPExpandedNoNA$V2),
as.vector(DMLIntronAnnotationGOBPExpandedNoNA$V3)) #Convert to a vector
DMLIntronAnnotationGOBPExpandedNoNAList <- data.frame("rowNumber" = seq(from = 1, to = length(DMLIntronAnnotationGOBPExpandedNoNAList), by = 1),
"GO-BP" = DMLIntronAnnotationGOBPExpandedNoNAList) #Add row numbers to use in count function
head(DMLIntronAnnotationGOBPExpandedNoNAList) #Check format
```
#### Count GOterms
```{r}
DMLIntronAnnotationGOBPCounts <- count(DMLIntronAnnotationGOBPExpandedNoNAList, GO.BP) #Count instances of each GOterm
head(DMLIntronAnnotationGOBPCounts) #Confirm counting
```
```{r}
nrow(DMLIntronAnnotationGOBPCounts) #Count the number of GOterm categories
```
```{r}
write.csv(DMLIntronAnnotationGOBPCounts, "2019-06-20-DML-Intron-Annotation-GO-BP-Counts.csv") #Save file
```
### Hypermethylated introns
#### Subset by e-value cutoffs
```{r}
DMLIntronAnnotationHypermethylatedEvalueCutoff <- subset(x = DMLIntronAnnotationHypermethylated, subset = DMLIntronAnnotationHypermethylated$evalue < 1e-10) #Subset rows with acceptably low e-value cutoffs (no larger than 1e-10)
write.csv(DMLIntronAnnotationHypermethylatedEvalueCutoff, "2019-06-20-DML-Intron-Annotation-Hypermethylated-Evalue-Cutoff.csv") #Save file
nrow(DMLIntronAnnotationHypermethylatedEvalueCutoff) #Count the number of rows
```
#### Separate GOterms into new columns
```{bash}
awk -F "\"*,\"*" '{print $22}' 2019-06-20-DML-Intron-Annotation-Hypermethylated-Evalue-Cutoff.csv \
| awk '{gsub(";","\t",$0); print;}' \
> 2019-06-20-DML-Intron-Annotation-Hypermethylated-GO-BP-Expanded.txt #Isolate GO-BP column and convert ";" to ",". Save the output as a new file.
```
```{bash}
head -n5 2019-06-20-DML-Intron-Annotation-Hypermethylated-GO-BP-Expanded.txt #Confirm file creation
```
```{bash}
awk '{print NF}' 2019-06-20-DML-Intron-Annotation-Hypermethylated-GO-BP-Expanded.txt > 2019-06-20-DML-Intron-Annotation-Hypermethylated-GO-BP-Expanded-Column-Numbers.txt #Count the number of columns in each row
```
#### Import and format isolated GO-BP information
```{r}
DMLIntronAnnotationHypermethylatedGOBPColumns <- read.delim("2019-06-20-DML-Intron-Annotation-Hypermethylated-GO-BP-Expanded-Column-Numbers.txt")
max(DMLIntronAnnotationHypermethylatedGOBPColumns$X1) #I should import the file with 147 columns
```
```{r}
DMLIntronAnnotationHypermethylatedGOBPExpanded <- read.delim("2019-06-20-DML-Intron-Annotation-Hypermethylated-GO-BP-Expanded.txt", col.names = paste("V", 1:147, sep = ""), header = FALSE) #Import expanded GO-BP information
DMLIntronAnnotationHypermethylatedGOBPExpanded <- DMLIntronAnnotationHypermethylatedGOBPExpanded[-1,] #Remove the first row, which is the original haeder from the file
head(DMLIntronAnnotationHypermethylatedGOBPExpanded) #Confirm import
```
```{r}
DMLIntronAnnotationHypermethylatedGOBPExpanded <- DMLIntronAnnotationHypermethylatedGOBPExpanded[,c(1:3)] #Not every Uniprot Accession code has the same number of GOterms. I will use the first 3 GOterms assigned in my analyses
DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA <- DMLIntronAnnotationHypermethylatedGOBPExpanded[is.na(DMLIntronAnnotationHypermethylatedGOBPExpanded$V1) == FALSE,] #Keeps rows with without NA in the first column (columns with at least one associated GOterm)
nrow(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA) #Count the number of rows
```
```{r}
DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA$V2 <- gsub('{1}$', '', DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA$V2) #Remove space before GOterms
DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA$V3 <- gsub('{1}$', '', DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA$V3) #Remove space before GOterms
head(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA) #Confim change
```
```{r}
DMLIntronAnnotationHypermethylatedGOBPExpandedNoNAList <- c(as.vector(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA$V1),
as.vector(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA$V2),
as.vector(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA$V3)) #Convert to a vector
DMLIntronAnnotationHypermethylatedGOBPExpandedNoNAList <- data.frame("rowNumber" = seq(from = 1, to = length(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNAList), by = 1),
"GO-BP" = DMLIntronAnnotationHypermethylatedGOBPExpandedNoNAList) #Add row numbers to use in count function
head(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNAList) #Check format
```
#### Count GOterms
```{r}
DMLIntronAnnotationHypermethylatedGOBPCounts <- count(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNAList, GO.BP) #Count instances of each GOterm
head(DMLIntronAnnotationHypermethylatedGOBPCounts) #Confirm counting
```
```{r}
nrow(DMLIntronAnnotationHypermethylatedGOBPCounts) #Count rows
```
```{r}
write.csv(DMLIntronAnnotationHypermethylatedGOBPCounts, "2019-06-20-DML-Intron-Annotation-Hypermethylated-GO-BP-Counts.csv") #Save file
```
### Hypomethylated introns
#### Subset by e-value cutoffs
```{r}
DMLIntronAnnotationHypomethylatedEvalueCutoff <- subset(x = DMLIntronAnnotationHypomethylated, subset = DMLIntronAnnotationHypomethylated$evalue < 1e-10) #Subset rows with acceptably low e-value cutoffs (no larger than 1e-10)
write.csv(DMLIntronAnnotationHypomethylatedEvalueCutoff, "2019-06-20-DML-Intron-Annotation-Hypomethylated-Evalue-Cutoff.csv") #Save file
nrow(DMLIntronAnnotationHypomethylatedEvalueCutoff) #Count the number of rows
```
#### Separate GOterms into new columns
```{bash}
awk -F "\"*,\"*" '{print $22}' 2019-06-20-DML-Intron-Annotation-Hypomethylated-Evalue-Cutoff.csv \
| awk '{gsub(";","\t",$0); print;}' \
> 2019-06-20-DML-Intron-Annotation-Hypomethylated-GO-BP-Expanded.txt #Isolate GO-BP column and convert ";" to ",". Save the output as a new file.
```
```{bash}
head -n5 2019-06-20-DML-Intron-Annotation-Hypomethylated-GO-BP-Expanded.txt #Confirm file creation
```
```{bash}
awk '{print NF}' 2019-06-20-DML-Intron-Annotation-Hypomethylated-GO-BP-Expanded.txt > 2019-06-20-DML-Intron-Annotation-Hypomethylated-GO-BP-Expanded-Column-Numbers.txt #Count the number of columns in each row
```
#### Import and format isolated GO-BP information
```{r}
DMLIntronAnnotationHypomethylatedGOBPColumns <- read.delim("2019-06-20-DML-Intron-Annotation-Hypomethylated-GO-BP-Expanded-Column-Numbers.txt")
max(DMLIntronAnnotationHypomethylatedGOBPColumns$X1) #I should import the file with 73 columns
```
```{r}
DMLIntronAnnotationHypomethylatedGOBPExpanded <- read.delim("2019-06-20-DML-Intron-Annotation-Hypomethylated-GO-BP-Expanded.txt", col.names = paste("V", 1:73, sep = ""), header = FALSE) #Import expanded GO-BP information
DMLIntronAnnotationHypomethylatedGOBPExpanded <- DMLIntronAnnotationHypomethylatedGOBPExpanded[-1,] #Remove the first row, which is the original haeder from the file
head(DMLIntronAnnotationHypomethylatedGOBPExpanded) #Confirm import
```
```{r}
DMLIntronAnnotationHypomethylatedGOBPExpanded <- DMLIntronAnnotationHypomethylatedGOBPExpanded[,c(1:3)] #Not every Uniprot Accession code has the same number of GOterms. I will use the first 3 GOterms assigned in my analyses
DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA <- DMLIntronAnnotationHypomethylatedGOBPExpanded[is.na(DMLIntronAnnotationHypomethylatedGOBPExpanded $V1) == FALSE,] #Keeps rows with without NA in the first column (columns with at least one associated GOterm)
nrow(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA) #Count the number of rows
```
```{r}
DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA $V2 <- gsub('{1}$', '', DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA $V2) #Remove space before GOterms
DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA $V3 <- gsub('{1}$', '', DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA $V3) #Remove space before GOterms
head(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA) #Confim change
```
```{r}
DMLIntronAnnotationHypomethylatedGOBPExpandedNoNAList <- c(as.vector(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA $V1),
as.vector(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA $V2),
as.vector(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA $V3)) #Convert to a vector
DMLIntronAnnotationHypomethylatedGOBPExpandedNoNAList <- data.frame("rowNumber" = seq(from = 1, to = length(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNAList), by = 1),
"GO-BP" = DMLIntronAnnotationHypomethylatedGOBPExpandedNoNAList) #Add row numbers to use in count function
head(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNAList) #Check format
```
#### Count GOterms
```{r}
DMLIntronAnnotationHypomethylatedGOBPCounts <- count(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNAList, GO.BP) #Count instances of each GOterm
head(DMLIntronAnnotationHypomethylatedGOBPCounts) #Confirm counting
```
```{r}
nrow(DMLIntronAnnotationHypomethylatedGOBPCounts)
```
```{r}
write.csv(DMLIntronAnnotationHypomethylatedGOBPCounts, "2019-06-20-DML-Intron-Annotation-Hypomethylated-GO-BP-Counts.csv") #Save file
```