--- title: "Gene Enrichment Analysis" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` In this script, I'll create several annotation tables for DML, perform a functional gene description, and prepare Uniprot codes for enrichment in DAVID. # Install packages ```{r} require(RColorBrewer) require(dichromat) require(dplyr) ``` # Session information ```{r} sessionInfo() ``` # Obtain Uniprot codes Before I do any sort of gene enrichment analysis, I need to get Uniprot codes for all genes. To do this, I will use `blastx`. ## Make a Uniprot Database The Uniprot database was downloaded from this link as a .fasta file: http://www.uniprot.org/uniprot/?query=&fil=reviewed%3Ayes&columns=id%2Centry%20name%2Creviewed%2Cprotein%20names%2Cgenes%2Corganism%2Clength%2Cgo(biological%20process)%2Cgo-id%2Ccomment(PATHWAY)%2Cdatabase(UniPathway)%2Cdatabase(CDD)%2Cdatabase(Pfam. ```{bash} head uniprot-filtered-reviewed.fasta #Check database creation ``` ```{bash} /Users/Shared/Apps/ncbi-blast-2.2.29+/bin/makeblastdb -help #Get the help menu ``` The first step involves taking the uniprot fasta file and making a new database. To do this, I will use the following code: 1. Path to `makeblastdb` to create the database needed 2. `in` to select the input file for the database 3. `dbtype` specifies the type of file which in our case, is a protein file ```{bash} /Users/Shared/Apps/ncbi-blast-2.2.29+/bin/makeblastdb \ -in uniprot-filtered-reviewed.fasta \ -dbtype 'prot' ``` ```{bash} /Users/Shared/Apps/ncbi-blast-2.2.29+/bin/blastx -help ``` Now I can perform a `blastx` to compare my query, the *C. virginica* transcripts, to our protein database. ## Perform `blastx` 1. Path to `blastx` 2. `-query` provides the file we want to blast (*C. virginica* transcripts) 3.`-db` specifies database created in the previous step 4. `-outfmt` specifies the type of output file. I will use 6, a tabular file 5. `-out ` will allow me to save the output as a new file 6. `-num_threads 4` uses 4 CPUs in the BLAST search 7. `-max_target_seqs 1` keeps only one aligned sequence (i.e. the best match) ```{bash} pwd #Obtain present working directory ``` ```{bash} /Users/Shared/Apps/ncbi-blast-2.2.29+/bin/blastx \ -query 2018-09-06-Virginica-transcripts.fna \ -db uniprot-filtered-reviewed.fasta \ -outfmt 6 \ -out 2018-09-06-Transcript-Uniprot-blastx.txt \ -num_threads 4 \ -max_target_seqs 1 ``` ```{bash} head 2018-09-11-Transcript-Uniprot-blastx-codeIsolated.txt #Check file format of Uniprot codes. I'm using a file with the Uniprot code already isolated so I don't have to unfold any columns after merging files. ``` ```{r} blastResults <- read.delim("2018-09-11-Transcript-Uniprot-blastx-codeIsolated.txt", header = FALSE, sep = "\t", dec = ".") #Import blast results head(blastResults) #Confirm import colnames(blastResults) <- c("Genbank", "V2", "Uniprot", "V4", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore") head(blastResults) #Confirm column headers ``` # Master gene list I want to create a master list with the following columns: - chr - start - end - p-value - q-value - meth.diff - dist-to-closest-gene - gene-chr - gene-start - gene-end - gene-ID - mRNA-start - mRNA-end - genbank - uniprot-ID - e-value - product I can start with my DML csv file, then merge that with the information from `closestBed`. Then I'll merge that document with my gene track to get gene IDs. I'll merge the resultant file with the mRNA track by gene ID to get Genbank IDs. Finally, I'll merge that document with the Uniprot codes by matching Genbank IDs. ## Merge DML list with `closestBed` list ```{r} allDML <- read.csv("../2018-10-25-MethylKit/2018-10-25-Loci-Analysis/2019-04-05-Differentially-Methylated-Loci-Filtered-Destrand-50-Cov5.csv", header = TRUE) #Import DML list allDML <- allDML[,-1] #Remove first column head(allDML) #Confirm import ``` ```{r} closestDML <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2019-05-29-Flanking-Analysis/2019-05-29-Genes-Closest-NoOverlap-DMLs.txt", header = FALSE) #Import DML with closestBed information colnames(closestDML) <- c("chr", "start", "end", "meth.diff", "gene-chr", "gene-start", "gene-end", "dist-to-closest-gene") #Add column names head(closestDML) #Confirm column names. DML downstream of genes have positive distances. There may be more lines than DML becuase there are some DML that overlap multiple gene features, or have ties for closest gene features. ``` ```{r} #Add 1 to the start position and subtract 1 from the end position to get DML positions compatible with the DML list closestDML$start <- closestDML$start + 1 closestDML$end <- closestDML$end -1 head(closestDML) #Confirm changes ``` ```{r} allDMLclosestBed <- merge(x = closestDML, y = allDML, by = "start", sort = FALSE) #Merge closest information and DML list. Do not resort the output. head(allDMLclosestBed) #Confirm merge ``` ```{r} allDMLclosestBed <- allDMLclosestBed[, c(2, 1, 3, 4, 11, 12, 13, 14, 8, 5, 6, 7)] #Reorganize columns head(allDMLclosestBed) #Confirm reorganization ``` ## Merge with gene track ### Import full gene list ```{r} geneList <- read.delim("../2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3", header = FALSE) #Import gene gff3 colnames(geneList) <- c("gene-chr", "Gnomon", "gene", "gene-start", "gene-end", "V6", "V7", "V8", "V9") #Rename columns head(geneList) #Confirm import ``` ```{r} geneList <- geneList[,-c(6:8)] #Remove columns 6-8 head(geneList) #Confirm changes ``` ### Isolate gene description from full list ```{bash} cut -f9 ../2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3 \ | awk '{gsub(";","\t",$0); print;}' \ > 2019-06-20-Gene-Descriptions.txt #Isolate gene description column and convert ";" to "\t". Save the output as a new file ``` ```{bash} head 2019-06-20-Gene-Descriptions.txt #Confirm file creation. ``` Before I import this new data set into R, I need to count the number of columns. I can do this with `awk`. ```{bash} awk '{print NF}' 2019-06-20-Gene-Descriptions.txt > 2019-06-20-Gene-Description-Column-Numbers.txt ``` ### Import isolated gene descriptions ```{r} geneDescriptionColumns <- read.delim("2019-06-20-Gene-Description-Column-Numbers.txt", header = FALSE) #Import data with number of columns for each row max(geneDescriptionColumns) #Max number of columns. I'll need to import my gene description dataset with 9 columns per row. ``` ```{r} geneDescriptions <- read.delim2("2019-06-20-Gene-Descriptions.txt", header = FALSE, col.names = c("gene-number", "gene-ID", "V3", "V4", "V5", "V6", "V7", "V8", "V9")) #Import document with gene descrptions. Name 9 columns so all rows are imported with 9 columns head(geneDescriptions) #Confirm import ``` ```{r} geneID <- geneDescriptions$gene.ID #Isolate gene ID column geneID <- gsub("Dbxref=GeneID:", "", geneID) #Remove extraneous characters head(geneID) #Confirm changes ``` ```{r} geneList <- data.frame(geneList, geneID) #Replace geneList dataframe with geneList and geneID geneList <- geneList[,-6] #Remove extra column colnames(geneList) <- c("gene-chr", "Gnomon", "gene", "gene-start", "gene-end", "gene-ID") #Rename columns to match previous naming conventions head(geneList) #Confirm merge ``` ### Merge gene track with `allDMLclosestBed` ```{r} allDMLclosestBedgeneList <- merge(x = allDMLclosestBed, y = geneList, by = c("gene-start", "gene-end"), sort = FALSE) #Merge dataframes by gene start and end positions head(allDMLclosestBedgeneList) #Confirm merge ``` ```{r} allDMLclosestBedgeneList <- allDMLclosestBedgeneList[,c(3:12, 14:15, 1:2, 16)] #Reorganize columns head(allDMLclosestBedgeneList) #Confirm reorganization ``` ## Merge with mRNA track ### Import full mRNA track ```{r} mRNAList <- read.delim("../2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_mRNA_yrv.gff3", header = FALSE) #Import mRNA track mRNAList <- mRNAList[,-c(6,8)] #Remove extra columns colnames(mRNAList) <- c("mRNA-chr", "Gnomon", "mRNA", "mRNA-start", "mRNA-end", "mRNA-strand", "V9") #Rename columns head(mRNAList) #Confirm import ``` ### Isolate mRNA description from full list ```{bash} cut -f9 ../2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_mRNA_yrv.gff3 \ | awk '{gsub(";","\t",$0); print;}' \ | awk '{gsub(",","\t",$0); print;}' \ > 2019-06-20-mRNA-Descriptions.txt #Isolate mRNA description column and convert ";" and "," to "\t". Save the output as a new file ``` ```{bash} head 2019-06-20-mRNA-Descriptions.txt #Confirm file creation. ``` Before I import this new data set into R, I need to count the number of columns. I can do this with `awk`. ```{bash} awk '{print NF}' 2019-06-20-mRNA-Descriptions.txt > 2019-06-20-mRNA-Description-Column-Numbers.txt ``` ### Import isolated mRNA descriptions ```{r} mRNADescriptionColumns <- read.delim("2019-06-20-mRNA-Description-Column-Numbers.txt", header = FALSE) #Import data with number of columns for each row max(mRNADescriptionColumns) #Max number of columns. I'll need to import my gene description dataset with 78 columns per row. ``` ```{r} mRNADescriptions <- read.delim2("2019-06-20-mRNA-Descriptions.txt", header = FALSE, col.names = paste("V", 1:78, sep = "")) #Import document with gene descrptions. Name 78 columns so all rows are imported with 78 columns head(mRNADescriptions) #Confirm import ``` ```{r} geneIDmRNAList <- mRNADescriptions$V3 #Isolate gene ID column geneIDmRNAList <- gsub("Dbxref=GeneID:", "", geneIDmRNAList) #Remove extraneous characters head(geneIDmRNAList) #Confirm changes ``` ```{r} genbankIDmRNAList <- mRNADescriptions$V4 #Isolate Genbank ID genbankIDmRNAList <- gsub("Genbank:", "", genbankIDmRNAList) #Remove extraneous characters head(genbankIDmRNAList) #Confirm changes ``` Buried in the `mRNAList` is mRNA product information. I'm going to extract that information and add it to the master list. ### Isolate product descriptions ```{bash} cut -f9 ../2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_mRNA_yrv.gff3 \ | awk '{gsub("product=","\t",$0); print;}' \ | cut -f2 \ | awk '{gsub(";","\t",$0); print;}' \ | awk '{gsub("%","\t",$0); print;}' \ > 2019-06-20-mRNA-Products.txt #Isolate mRNA description column and convert "product=" to "\t". This separates the descriptions into 2 columns. The beginning of second column has the product description. Take the second column of the output and convert ";" and "%" to "\t" to better isolate mRNA products. Save the output as a new file. ``` ```{bash} head 2019-06-20-mRNA-Products.txt #Confirm product description isolation ``` ```{bash} awk '{print NF}' 2019-06-20-mRNA-Products.txt > 2019-06-20-mRNA-Products-Column-Numbers.txt ``` ### Import isolated mRNA products ```{r} mRNAProductColumns <- read.delim("2019-06-20-mRNA-Products-Column-Numbers.txt", header = FALSE) #Import data with number of columns for each row max(mRNAProductColumns) #Max number of columns. I'll need to import my dataset with 16 columns per row. ``` ```{r} mRNAProducts <- read.delim2("2019-06-20-mRNA-Products.txt", header = FALSE, col.names = paste("V", 1:16, sep = "")) #Import document with gene descrptions. Name 16 columns so all rows are imported with 16 columns. head(mRNAProducts) #Confirm import ``` ```{r} mRNAProducts <- mRNAProducts$V1 #Extract the first column with product information ``` ### Merge mRNA track with `allDMLclosestBedgeneList` ```{r} mRNAList <- data.frame(mRNAList, geneIDmRNAList, genbankIDmRNAList, mRNAProducts) #Replace mRNAList dataframe with mRNAList, geneIDmRNAList, genbankIDmRNAList, and mRNAProducts mRNAList <- mRNAList[,-c(6:7)] #Remove extra columns colnames(mRNAList) <- c("mRNA-chr", "Gnomon", "mRNA", "mRNA-start", "mRNA-end", "gene-ID", "Genbank", "Product") #Rename columns to match previous naming conventions head(mRNAList) #Confirm merge ``` ```{r} allDMLclosestBedgeneListmRNAList <- merge(x = allDMLclosestBedgeneList, y = mRNAList, by = "gene-ID", sort = FALSE, all.x = TRUE) #Merge dataframes by gene ID head(allDMLclosestBedgeneListmRNAList) #Confirm merge ``` ```{r} allDMLclosestBedgeneListmRNAList <- allDMLclosestBedgeneListmRNAList[,c(2:15, 1, 19:22)] #Reorganize columns head(allDMLclosestBedgeneListmRNAList) #Confirm reorganization ``` ```{r} allDMLclosestBedgeneListmRNAList <- distinct(allDMLclosestBedgeneListmRNAList) #Keep only unique rows ``` ## Merge with Uniprot codes ```{r} allDMLclosestBedgeneListmRNAListUniprot <- merge(x = allDMLclosestBedgeneListmRNAList, y = blastResults, by = "Genbank", sort = FALSE, all.x = TRUE) #Merge DML list with Uniprot codes by Genbank IDs head(allDMLclosestBedgeneListmRNAListUniprot) #Confirm merge ``` ```{r} allDMLclosestBedgeneListmRNAListUniprot <- allDMLclosestBedgeneListmRNAListUniprot[,c(2:4, 9, 6:8, 10:11, 14:18, 1, 21:22, 31, 19)] #Reorganize columns and save as a new dataframe colnames(allDMLclosestBedgeneListmRNAListUniprot) <- c("chr", "start", "end", "meth.diff", "strand", "pvalue", "qvalue", "dist-to-closest-gene", "gene-chr", "gene-start", "gene-end", "geneID", "mRNA-start", "mRNA-end", "Genbank", "Uniprot-Accession", "Uniprot-ID", "evalue", "Product") #Rename columns allDMLclosestBedgeneListmRNAListUniprot <- distinct(allDMLclosestBedgeneListmRNAListUniprot) #Keep only unique rows head(allDMLclosestBedgeneListmRNAListUniprot) #Confirm reorganization ``` ## Merge with GOterms I will use a Uniprot-SwissProt database I downloaded to match Uniprot-Accession with Gene Ontology information, which includes GOterms and IDs. I downloaded a tab-delimited SwissProt database with five additional columns: Gene Ontology IDs, Gene ontology (GO), Gene ontology (biological processes), Gene ontology (cellular component), and Gene ontology (molecular function). ### Import Uniprot-Gene Ontology databse ```{bash} gunzip uniprot-reviewed_yes.tab.gz #Unzip file ``` ```{bash} head -n2 uniprot-reviewed_yes.tab #View file header ``` ```{r} uniprotGOTerms <- read.delim("uniprot-reviewed_yes.tab", header = TRUE) #Import tab-delimited file with header head(uniprotGOTerms) #Confirm import ``` ```{r} uniprotGOTerms <- uniprotGOTerms[,c(1:2, 8, 10:12)] #Remove unnecessary columns colnames(uniprotGOTerms) <- c("Uniprot-Accession", "Uniprot-ID", "GO-ID", "GO-BP", "GO-CC", "GO-MF") #Rename columns to match previous naming conventions head(uniprotGOTerms) #Confirm changes ``` ### Merge with `allDMLclosestBedgeneListmRNAListUniprot` ```{r} allDMLclosestBedgeneListmRNAListUniprotGOTerms <- merge(x = allDMLclosestBedgeneListmRNAListUniprot, y = uniprotGOTerms, by = c("Uniprot-Accession", "Uniprot-ID"), sort = FALSE, all.x = TRUE) #Merge files by Uniprot Accesssion code and Uniprot ID head(allDMLclosestBedgeneListmRNAListUniprotGOTerms) #Confirm merge ``` ```{r} allDMLclosestBedgeneListmRNAListUniprotGOTerms <- allDMLclosestBedgeneListmRNAListUniprotGOTerms[,c(3:17, 1:2, 18:23)] #Reorganize columns colnames(allDMLclosestBedgeneListmRNAListUniprotGOTerms) <- c("chr", "start", "end", "meth.diff", "strand", "pvalue", "qvalue", "dist-to-closest-gene", "gene-chr", "gene-start", "gene-end", "geneID", "mRNA-start", "mRNA-end", "Genbank", "Uniprot-Accession", "Uniprot-ID", "evalue", "Product", "GO-ID", "GO-BP", "GO-CC", "GO-MF") #Rename columns to match naming conventions head(allDMLclosestBedgeneListmRNAListUniprotGOTerms) #Confirm changes ``` ## Export master gene list ```{r} masterDMLAnnotation <- distinct(allDMLclosestBedgeneListmRNAListUniprotGOTerms) #Remove duplicated rows (merging artifacts) head(masterDMLAnnotation) #Confirm changes ``` ```{r} write.csv(masterDMLAnnotation, "2019-06-20-Master-DML-Annotation.csv") #Write out masterDMLAnnotation as a .csv file ``` Now that I have a master gene list, I can create relevant subset documents. ## DML-Gene overlaps ```{r} DMLGeneAnnotation <- subset(x = masterDMLAnnotation, subset = masterDMLAnnotation$`dist-to-closest-gene` == 0) #Subset dataset where the distance to the closest gene = 0 (i.e. DML and gene are overlapping) head(DMLGeneAnnotation) #Confirm subset ``` ```{r} write.csv(DMLGeneAnnotation, "2019-06-20-DML-Gene-Annotation.csv") #Save file ``` ### Characterize DML distribution in genes ##### Count the number of DML per gene ```{r} DMLGeneAnnotationNomRNA <- unique(DMLGeneAnnotation[,-c(8:9, 13:18, 20:23)]) #Remove mRNA information from annotation to quantify #DML/gene (and not by mRNA). Also eliminate Uniprot columns. Use unique to remove any lines that are merging artifacts head(DMLGeneAnnotationNomRNA) #Confirm column removal ``` ```{r} DMLbyGeneCounts <- as.data.frame(table(DMLGeneAnnotationNomRNA$geneID)) #Count the number of DML/gene colnames(DMLbyGeneCounts) <- c("geneID", "DMLCount") #Change column names head(DMLbyGeneCounts) #Confirm formatting ``` ```{r} mean(DMLbyGeneCounts$DMLCount) #Calculate mean number of DML in genes with DML median(DMLbyGeneCounts$DMLCount) #Calculate median number of DML in genes with DML max(DMLbyGeneCounts$DMLCount) #Calculate max number of DML in a single gene ``` ```{r} sum(DMLbyGeneCounts$DMLCount == 1) #Count number of genes with 1 DML sum(DMLbyGeneCounts$DMLCount == 2) #Count number of genes with 2 DML sum(DMLbyGeneCounts$DMLCount == 3) #Count number of genes with 3 DML sum(DMLbyGeneCounts$DMLCount == 4) #Count number of genes with 4 DML sum(DMLbyGeneCounts$DMLCount == 5) #Count number of genes with 5 DML ``` ```{r} numberDMLinGenes <- data.frame("DMLCount" = c("1", "2", "3", "4", "5"), "geneCount" = c(sum(DMLbyGeneCounts$DMLCount == 1), sum(DMLbyGeneCounts$DMLCount == 2), sum(DMLbyGeneCounts$DMLCount == 3), sum(DMLbyGeneCounts$DMLCount == 4), sum(DMLbyGeneCounts$DMLCount == 5))) #Create dataframe for plotting head(numberDMLinGenes) #Confirm creation ``` ```{r} genes1DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 1,] #Subset genes with 1 DML genes1DMLAnnotations <- unique(merge(x = genes1DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 2 DML with annotation information. Use unique to remove merging artifacts head(genes1DMLAnnotations) #Confirm merge ``` ```{r} write.csv(genes1DMLAnnotations, "2019-10-01-Genes-with-1-DML.csv", quote = FALSE, row.names = FALSE) #Save file ``` ```{r} genes2DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 2,] #Subset genes with 2 DML genes2DMLAnnotations <- unique(merge(x = genes2DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 2 DML with annotation information. Use unique to remove merging artifacts head(genes2DMLAnnotations) #Confirm merge ``` ```{r} write.csv(genes2DMLAnnotations, "2019-10-01-Genes-with-2-DML.csv", quote = FALSE, row.names = FALSE) #Save file ``` ```{r} genes3DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 3,] #Subset genes with 3 DML genes3DMLAnnotations <- unique(merge(x = genes3DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 3 DML with annotation information. Use unique to remove merging artifacts head(genes3DMLAnnotations) #Confirm merge ``` ```{r} write.csv(genes3DMLAnnotations, "2019-10-01-Genes-with-3-DML.csv", quote = FALSE, row.names = FALSE) #Save file ``` ```{r} genes4DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 4,] #Subset genes with 4 DML genes4DMLAnnotations <- unique(merge(x = genes4DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 4 DML with annotation information. Use unique to remove merging artifacts head(genes4DMLAnnotations) #Confirm merge ``` ```{r} write.csv(genes4DMLAnnotations, "2019-10-01-Genes-with-4-DML.csv", quote = FALSE, row.names = FALSE) #Save file ``` ```{r} genes5DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 5,] #Subset genes with 2 DML genes5DMLAnnotations <- unique(merge(x = genes5DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 5 DML with annotation information. Use unique to remove merging artifacts head(genes5DMLAnnotations) #Confirm merge ``` ```{r} write.csv(genes5DMLAnnotations, "2019-10-01-Genes-with-5-DML.csv", quote = FALSE, row.names = FALSE) #Save file ``` #### Characterize types of DML in genes ```{r} sum(genes1DMLAnnotations$meth.diff > 0) #Count number of genes with 1 DML hypermethylated sum(genes1DMLAnnotations$meth.diff < 0) #Count number of genes with 1 DML hypomethylated ``` ```{r} genes1DMLBreakdown <- data.frame("type" = c("1hyper", "0hyper"), "DMLcount" = c(sum(genes1DMLAnnotations$meth.diff > 0), sum(genes1DMLAnnotations$meth.diff < 0))) #Create dataframe with hyper/hypo breakdown to plot data head(genes1DMLBreakdown) ``` I did some Excel manipulations to count the number of hypermethylated DML in genes with multiple DML. ```{r} genes2DMLCounts <- read.csv("2019-10-01-Genes-with-2-DML-withCounts.csv", header = TRUE) #Import file genes2DMLCounts$no.hypo <- genes2DMLCounts$DMLCount - genes2DMLCounts$no.hyper #Calculate the number of hypomethylated DML in genes head(genes2DMLCounts) #Confirm changes ``` ```{r} sum(genes2DMLCounts$no.hyper == 2) #Count # genes with both DML hypermethylated sum(genes2DMLCounts$no.hyper == 1) #Count # genes with 1 DML hypermethylated and 1 hypomethylated sum(genes2DMLCounts$no.hyper == 0) #Count # genes with both DML hypomethylated ``` ```{r} genes2DMLBreakdown <- data.frame("type" = c("2hyper", "1hyper", "0hyper"), "DMLcount" = c(sum(genes2DMLCounts$no.hyper == 2), sum(genes2DMLCounts$no.hyper == 1), sum(genes2DMLCounts$no.hyper == 0))) #Create dataframe with hyper/hypo breakdown to plot data head(genes2DMLBreakdown) ``` ```{r} genes3DMLCounts <- read.csv("2019-10-01-Genes-with-3-DML-withCounts.csv", header = TRUE) #Import file genes3DMLCounts$no.hypo <- genes3DMLCounts$DMLCount - genes3DMLCounts$no.hyper #Calculate the number of hypomethylated DML in genes head(genes3DMLCounts) #Confirm changes ``` ```{r} sum(genes3DMLCounts$no.hyper == 3) #Count # genes with all DML hypermethylated sum(genes3DMLCounts$no.hyper == 2) #Count # genes with 2 DML hypermethylated and 1 hypomethylated sum(genes3DMLCounts$no.hyper == 1) #Count # genes with 1 DML hypermethylated and 1 hypomethylated sum(genes3DMLCounts$no.hyper == 0) #Count # genes with all DML hypomethylated ``` ```{r} genes3DMLBreakdown <- data.frame("type" = c("3hyper", "2hyper", "1hyper", "0hyper"), "DMLcount" = c(sum(genes3DMLCounts$no.hyper == 3), sum(genes3DMLCounts$no.hyper == 2), sum(genes3DMLCounts$no.hyper == 1), sum(genes3DMLCounts$no.hyper == 0))) #Create dataframe with hyper/hypo breakdown to plot data head(genes3DMLBreakdown) ``` ```{r} genes4DMLCounts <- read.csv("2019-10-01-Genes-with-4-DML-withCounts.csv", header = TRUE) #Import file genes4DMLCounts$no.hypo <- genes4DMLCounts$DMLCount - genes4DMLCounts$no.hyper #Calculate the number of hypomethylated DML in genes head(genes4DMLCounts) #Confirm changes ``` ```{r} sum(genes4DMLCounts$no.hyper == 4) #Count # genes with all DML hypermethylated sum(genes4DMLCounts$no.hyper == 3) #Count # genes with 3 DML hypermethylated and 1 hypomethylated sum(genes4DMLCounts$no.hyper == 2) #Count # genes with 2 DML hypermethylated and 2 hypomethylated sum(genes4DMLCounts$no.hyper == 1) #Count # genes with 1 DML hypermethylated and 3 hypomethylated sum(genes4DMLCounts$no.hyper == 0) #Count # genes with all DML hypomethylated ``` ```{r} genes4DMLBreakdown <- data.frame("type" = c("4hyper", "3hyper", "2hyper", "1hyper", "0hyper"), "DMLcount" = c(sum(genes4DMLCounts$no.hyper == 4), sum(genes4DMLCounts$no.hyper == 3), sum(genes4DMLCounts$no.hyper == 2), sum(genes4DMLCounts$no.hyper == 1), sum(genes4DMLCounts$no.hyper == 0))) #Create dataframe with hyper/hypo breakdown to plot data head(genes4DMLBreakdown) ``` ```{r} genes5DMLCounts <- read.csv("2019-10-01-Genes-with-5-DML-withCounts.csv", header = TRUE) #Import file genes5DMLCounts$no.hypo <- genes5DMLCounts$DMLCount - genes5DMLCounts$no.hyper #Calculate the number of hypomethylated DML in genes head(genes5DMLCounts) #Confirm changes ``` ```{r} sum(genes5DMLCounts$no.hyper == 5) #Count # genes with all DML hypermethylated sum(genes5DMLCounts$no.hyper == 4) #Count # genes with 4 DML hypermethylated and 1 hypomethylated sum(genes5DMLCounts$no.hyper == 3) #Count # genes with 3 DML hypermethylated and 2 hypomethylated sum(genes5DMLCounts$no.hyper == 2) #Count # genes with 2 DML hypermethylated and 3 hypomethylated sum(genes5DMLCounts$no.hyper == 1) #Count # genes with 1 DML hypermethylated and 4 hypomethylated sum(genes5DMLCounts$no.hyper == 0) #Count # genes with all DML hypomethylated ``` ##### Create individual figures I'm going to create a multipanel plot with a breakdown of the number hyper- and hypomethylated in genes. I will start by creating each figure separately. ```{r} #DML/gene barplot(numberDMLinGenes$geneCount, axes = FALSE, names.arg = numberDMLinGenes$DMLCount, ylim = c(0, 450)) #Create base plot axis(side = 2, las = 1, at = c(0, 450), col = "grey80") #Put the axis labels only at 0 and 450. mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label mtext(side = 1, "Number of DML", line = 3) #Add x-axis label ``` ```{r} #Hyper/hypo breakdown for genes with 1 DML barplot(genes1DMLBreakdown$DMLcount, axes = FALSE, names.arg = c("1,0", "0,1"), ylim = c(0, 250), col = c("grey20", "grey100")) #Create base plot axis(side = 2, las = 1, at = c(0, 250), col = "grey80") #Put the axis labels only at 0 and 250. mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label mtext(side = 1, "Number of DML (Hypermethylated, Hypomethylated)", line = 3) #Add x-axis label ``` ```{r} #Hyper/hypo breakdown for genes with 2 DML barplot(genes2DMLBreakdown$DMLcount, axes = FALSE, names.arg = c("2,0", "1,1", "0,2"), ylim = c(0, 20), col = c("grey20", "grey60", "grey100")) #Create base plot axis(side = 2, las = 1, at = c(0, 20), col = "grey80") #Put the axis labels only at 0 and 20. mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label mtext(side = 1, "Number of DML (Hypermethylated, Hypomethylated)", line = 3) #Add x-axis label ``` ```{r} #Hyper/hypo breakdown for genes with 3 DML barplot(genes3DMLBreakdown$DMLcount, axes = FALSE, names.arg = c("3,0", "2,1", "1,2", "0,3"), ylim = c(0, 5), col = c("grey20", "grey40", "grey60", "grey100")) #Create base plot axis(side = 2, las = 1, at = c(0, 5), col = "grey80") #Put the axis labels only at 0 and the maximum value specified by yPrettyTicks. Tick marks are at 0 and 5. mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label mtext(side = 1, "Number of DML (Hypermethylated, Hypomethylated)", line = 3) #Add x-axis label ``` ```{r} #Hyper/hypo breakdown for genes with 4 DML barplot(genes4DMLBreakdown$DMLcount, axes = FALSE, names.arg = c("4,0", "3,1", "2,2", "1,3", "0,4"), ylim = c(0, 3), col = c("grey20", "grey40", "grey60", "grey80", "grey100")) #Create base plot axis(side = 2, las = 1, at = c(0, 3), col = "grey80") #Put the axis labels only at 0 and the maximum value specified by yPrettyTicks. Tick marks are at 0 and 3. mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label mtext(side = 1, "Number of DML (Hypermethylated, Hypomethylated)", line = 3) #Add x-axis label ``` #### Create multipanel figure ##### Create color scheme ```{r} RColorBrewer::display.brewer.all() #Show all RColorBrewer palettes. I will choose greens. plotColors <- rev(RColorBrewer::brewer.pal(5, "GnBu")) #Create a color palette for the barplots. Use 5 green-blue shades from RColorBrewer. Reverse theorder so the darkest shade is used first. barplot(t(t(proportionData)), col = plotColors) #See what plot looks like with new scheme barplot(t(t(proportionData)), col = dichromat(plotColors)) #Check that the plot colors will be easy to interpret for those with color blindess ``` ##### Plot! I'm going to combine the breakdowns with the number of DML in each chromosome. ```{r} #pdf("2019-10-03-DML-Distribution-in-Chr-Genes.pdf", width = 11, height = 8.5) plotMatrix <- matrix(c(1,1,0,3,4, 2,2,0,5,6), nrow = 2, ncol = 5, byrow = TRUE) #Create a matrix with plot location information. Fill matrix by row. par(mar = c(0, 0, 6, 2.5), oma = c(6, 10, 0, 0)) #Specify inner and outer margins layout(mat = plotMatrix, width = c(2, 2, 0.7, 1, 1)) #Create a layout based on plotMatrix. The first and second plots (first column) should be wider. #layout.show(n = 6) #Confirm layout looks good #Chromosome distribution DMLbarplot <- barplot(as.matrix(t(DMLchrCounts$DMLCount)), axes = FALSE, names.arg = DMLchrCounts$chr, cex.names = 1.5, xlim = c(0.7,11.5), ylim = c(0,150), col = plotColors[5]) #Create a barplot and save as a new object. Use axes = FALSE to remove the y-axis and names.arg to set labels on the x-axis. The object contains x coordinates for bars, so xlim is set at 12 to compensate for maximum value of 11.5 mtext(side = 1, "Chromosome", line = 3, cex = 1.3) #Add x-axis label axis(side = 2, line = 1.5, at = seq(0, 150, by = 50), las = 2, col = "grey80", cex.axis = 1.2) #Add y-axis for DML counts mtext(side = 2, "Number of DML", line = 5, cex = 1.3) #Add y-axis label for DML counts mtext(side = 3, line = -1, at = c(0.5, 130), text = "a") #Add figure label #Add gene number infomration par(new = TRUE) #Create a new plot plot(x = DMLbarplot, y = DMLchrCounts$geneCount, type = "b", axes = FALSE, xlab = "", ylab = "", xaxs = "i", yaxs = "i", pch = 16, col = plotColors[2], xlim = c(0,12), ylim = c(0,6500)) #Plot points and lines (type = "b") for gene count by chromosome. Use the coordinates from DMLbarplot (x = DMLbarplot) and set xlim = (0,12) so plots are lined up. Use axes = FALSE to remove both axes. Remove x and y lables (xlab = ""; ylab = ""). Set ylim = (0,6500) to account for max y-values. Use xaxs and yxaxs to remove space between axes. axis(side = 4, line = 1.5, at = seq(0, 6500, by = 500), las = 2, col = "grey80", cex.axis = 1.2) #Add y-axis for gene sequence counts mtext(side = 4, "Number of Genes", line = 7, cex = 1.3) #Add y-axis label for gene sequence counts #DML/gene distribution barplot(numberDMLinGenes$geneCount, axes = FALSE, names.arg = numberDMLinGenes$DMLCount, col = plotColors[5], ylim = c(0, 450)) #Create base plot axis(side = 2, las = 1, at = c(0, 450), col = "grey80") #Put the axis labels only at 0 and 450. mtext(side = 2, "Number of Genes with DML", line = 3, cex = 1.3) #Add y-axis label mtext(side = 1, "Number of DML", line = 3, cex = 1.3) #Add x-axis label mtext(side = 3, line = -1, at = c(0.15, 440), text = "b") #Add figure label #Hyper/hypo breakdown for genes with 1 DML barplot(genes1DMLBreakdown$DMLcount, axes = FALSE, names.arg = c("1,0", "0,1"), ylim = c(0, 250), col = c(plotColors[1], plotColors[5])) #Create base plot axis(side = 2, las = 1, at = c(0, 250), col = "grey80") #Put the axis labels only at 0 and 250. mtext(side = 3, line = -1, at = c(0.4, 240), text = "c") #Add figure label #Hyper/hypo breakdown for genes with 2 DML barplot(genes2DMLBreakdown$DMLcount, axes = FALSE, names.arg = c("2,0", "", "0,2"), ylim = c(0, 20), col = c(plotColors[1], plotColors[3], plotColors[5])) #Create base plot axis(side = 2, las = 1, at = c(0, 20), col = "grey80") #Put the axis labels only at 0 and 20. mtext(side = 3, line = -1, at = c(0.5, 24.5), text = "d") #Add figure label #Hyper/hypo breakdown for genes with 3 DML barplot(genes3DMLBreakdown$DMLcount, axes = FALSE, names.arg = c("3,0", "", "", "0,3"), ylim = c(0, 5), col = c(plotColors[1], plotColors[2], plotColors[4], plotColors[5])) #Create base plot axis(side = 2, las = 1, at = c(0, 5), col = "grey80") #Put the axis labels only at 0 and the maximum value specified by yPrettyTicks. Tick marks are at 0 and 5. mtext(side = 2, "Number of Genes", line = 2, cex = 1.3) #Add y-axis label mtext(side = 1, " Hyper, Hypo DML", line = 3, cex = 1.3, outer = TRUE) #Add x-axis label mtext(side = 3, line = -1, at = c(0.7, 240), text = "e") #Add figure label #Hyper/hypo breakdown for genes with 4 DML barplot(genes4DMLBreakdown$DMLcount, axes = FALSE, names.arg = c("4,0", "", "", "", "0,4"), ylim = c(0, 3), col = plotColors) #Create base plot axis(side = 2, las = 1, at = c(0, 3), col = "grey80") #Put the axis labels only at 0 and the maximum value specified by yPrettyTicks. Tick marks are at 0 and 3. mtext(side = 3, line = -1, at = c(0.75, 240), text = "f") #Add figure label #dev.off() ``` ### Scaled DML distribution Scaling each gene from 0% to 100%, I want to see where DML occurs. This is useful to see if methylation is occuring in any consistent location for each gene. ```{r} head(DMLGeneAnnotationNomRNA) ``` ```{r} DMLGeneAnnotationNomRNA$geneLength <- DMLGeneAnnotationNomRNA$gene.end - DMLGeneAnnotationNomRNA$gene.start #Calculate gene length DMLGeneAnnotationNomRNA$absPosition <- DMLGeneAnnotationNomRNA$start - DMLGeneAnnotationNomRNA$gene.start #Calculate the absolute position of the DML in the gene DMLGeneAnnotationNomRNA$scaledPosition <- DMLGeneAnnotationNomRNA$absPosition / DMLGeneAnnotationNomRNA$geneLength #Calculate the scaled position of the DML in the gene head(DMLGeneAnnotationNomRNA) #Confirm calculations ``` ```{r} DMLGenePositionsHyper <- subset(DMLGeneAnnotationNomRNA, subset = DMLGeneAnnotationNomRNA$meth.diff > 0) #Subset hypermethylated DML DMLGenePositionsHypo <- subset(DMLGeneAnnotationNomRNA, subset = DMLGeneAnnotationNomRNA$meth.diff < 0) #Subset hypomethylated DML head(DMLGenePositionsHyper) #Confirm subset head(DMLGenePositionsHypo) #Confirm subset ``` #### Create individual figures I'm going to make back-to-back histograms to easily compare distributions of hyper- and hypomethtylated DML along a gene. First I'll create individual plots, then I'll add them to the same coordinate plane. ```{r} #Hypermethylated DML hist(DMLGenePositionsHyper$scaledPosition, breaks = 100, axes = FALSE, xlab = "", ylab = "", main = "", col = "grey20", xaxs = "i", yaxs = "i") #Create base plot with no axes or labels. Include breaks at each percent. axis(side = 1, at = seq(from = 0, to = 1, by = 0.1), col = "grey80") #Add x-axis mtext(side = 1, "Scaled Position on Gene", line = 3) #Add x-axis label axis(side = 2, at = seq(from = 0, to = 8, by = 1), col = "grey80", las = 2) #Add y-axis mtext(side = 2, "Number of Hypermethylated DML", line = 2) #Add y-axis label ``` ```{r} #Hypomethylated DML hist(DMLGenePositionsHypo$scaledPosition, breaks = 100, axes = FALSE, xlab = "", ylab = "", main = "", col = "grey100", xaxs = "i", yaxs = "i") #Create base plot with no axes or labels. Include breaks at each percent. axis(side = 1, at = seq(from = 0, to = 1, by = 0.1), col = "grey80") #Add x-axis mtext(side = 1, "Scaled Position on Gene", line = 3) #Add x-axis label axis(side = 2, at = seq(from = 0, to = 7, by = 1), col = "grey80", las = 2) #Add y-axis mtext(side = 2, "Number of Hypomethylated DML", line = 2) #Add y-axis label ``` #### Create back-to-back plot ```{r} hyperPlotDML <- hist(DMLGenePositionsHyper$scaledPosition, breaks = 100, plot = FALSE) #Save base hypermethylated plot hypoPlotDML <- hist(DMLGenePositionsHypo$scaledPosition, breaks = 100, plot = FALSE) #Save base hypomethylated plot ``` ```{r} #Calculate plotting parameters hypoPlotDML$counts <- -(hypoPlotDML$counts) #Make counts negative for hypoPlot hmaxDML <- max(hyperPlotDML$counts) #Set maximum value as max value of hyperPlot hminDML <- min(hypoPlotDML$counts) #Set minimum value as minimum value of hypoPlot breaksDMLPlot <- c(hyperPlotDML$breaks, hypoPlotDML$breaks) #Set breaks for plots xmaxDML <- max(breaksDMLPlot) #Set xmax based on breaks xminDML <- min(breaksDMLPlot) #Set xmin based on breaks ``` ```{r} #pdf("2019-10-09-Scaled-Gene-DML-Distribution.pdf", height = 8.5, width = 11) plot(hyperPlotDML, xlim = c(xminDML, xmaxDML), ylim = c(hminDML, hmaxDML), axes = FALSE, xlab = "", ylab = "", main = "", col = plotColors[2]) #Create the base plot with just hyperPlot. Set xlim and ylim on parameters calculated above. Remove all axes and axis labels lines(hypoPlotDML, col= plotColors[5]) #Add hypoPlot axis(side = 1, at = seq(from = 0, to = 1, by = 0.1), col = "grey80", cex = 1.2) #Add x-axis mtext(side = 1, "Scaled Position on Gene", line = 3, cex = 1.5) #Add x-axis label axis(side = 2, at = seq(from = hminDML, to = hmaxDML, by = 2), labels = c(seq(from = hminDML, to = 0, by = 2)*-1, seq(from = 2, to = hmaxDML, by = 2)), col = "grey80", las = 2, cex = 1.2) #Add y-axis. Add labels from minimum to maximum valules, but mtext(side = 2, "Number of DML", line = 2, cex = 1.5) #Add y-axis label #dev.off() ``` ### Hypermethylated DML-gene overlaps ```{r} DMLGeneAnnotationHypermethylated <- subset(x = DMLGeneAnnotation, subset = DMLGeneAnnotation$meth.diff > 0) #Subset hypermethylated DML by selecting meth.diff > 0 head(DMLGeneAnnotationHypermethylated) #Confirm subset ``` ```{r} write.csv(DMLGeneAnnotationHypermethylated, "2019-06-20-DML-Gene-Annotation-Hypermethylated.csv") #Save file ``` ### Hypomethylated DML-gene overlaps ```{r} DMLGeneAnnotationHypomethylated <- subset(x = DMLGeneAnnotation, subset = DMLGeneAnnotation$meth.diff < 0) #Subset hypomethylated DML by selecting meth.diff < 0 head(DMLGeneAnnotationHypomethylated) #Confirm subset ``` ```{r} write.csv(DMLGeneAnnotationHypomethylated, "2019-06-20-DML-Gene-Annotation-Hypomethylated.csv") #Save file ``` # Master exon overlap list ## Import DML-exon overlaps ```{r} DMLExonOverlaps <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2019-05-29-DML-Exon.txt", header = FALSE) #Import file with DML and matching exons DMLExonOverlaps <- DMLExonOverlaps[,-4] #Remove extra meth.diff column colnames(DMLExonOverlaps) <- c("chr", "start", "end", "exon-chr", "exon-start", "exon-end") DMLExonOverlaps$start <- DMLExonOverlaps$start + 1 #Add 1 to the start position since the file is BEDtools output DMLExonOverlaps$end <- DMLExonOverlaps$end - 1 #Add 1 to the start position since the file is BEDtools output head(DMLExonOverlaps) #Confirm changes ``` ## Merge with DML-gene annotation table ```{r} DMLExonAnnotation <- merge(x = DMLExonOverlaps, y = DMLGeneAnnotation, by = "start", sort = FALSE, all.x = TRUE) #Merge files by DML position. Don't sort the output and retain all rows from DMLExonOverlaps even if there are no matches in DMLGeneAnnotation head(DMLExonAnnotation) #Confirm merge ``` ```{r} DMLExonAnnotation <- DMLExonAnnotation[,c(2, 1, 3, 9:12, 5:6, 15:28)] #Reorganize columns remove dist-to-closest-gene column since these is all overlaps. colnames(DMLExonAnnotation) <- c("chr", "start", "end", "meth.diff", "strand", "pvalue", "qvalue", "exon-start", "exon-end", "gene-start", "gene-end", "geneID", "mRNA-start", "mRNA-end", "Genbank", "Uniprot-ID", "Uniprot-Accession", "evalue", "Product", "GO-ID", "GO-BP", "GO-CC", "GO-MF") #Add column names DMLExonAnnotation <- distinct(DMLExonAnnotation) #Remove duplicate rows (merging artifacts) head(DMLExonAnnotation) #Confirm changes ``` ```{r} write.csv(DMLExonAnnotation, "2019-06-20-DML-Exon-Annotation.csv") #Save file ``` ## Hypermethylated DML-exon overlaps ```{r} DMLExonAnnotationHypermethylated <- DMLExonAnnotation[DMLExonAnnotation$meth.diff > 0,] #Obtain hypermethylated DML by subsetting meth.dif > 0 head(DMLExonAnnotationHypermethylated) #Confirm subset ``` ```{r} write.csv(DMLExonAnnotationHypermethylated, "2019-06-20-DML-Exon-Annotation-Hypermethylated.csv") #Save file ``` ## Hypomethylated DML-exon overlaps ```{r} DMLExonAnnotationHypomethylated <- DMLExonAnnotation[DMLExonAnnotation$meth.diff < 0,] #Obtain hypomethylated DML by subsetting meth.dif < 0 head(DMLExonAnnotationHypomethylated) #Confirm subset ``` ```{r} write.csv(DMLExonAnnotationHypomethylated, "2019-06-20-DML-Exon-Annotation-Hypomethylated.csv") #Save file ``` # Master intron overlap list ## Import DML-Intron overlaps ```{r} DMLIntonOverlaps <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2019-05-29-DML-Intron.txt", header = FALSE) #Import file with DML and matching introns DMLIntonOverlaps <- DMLIntonOverlaps[,-4] #Remove extra meth.diff column colnames(DMLIntonOverlaps) <- c("chr", "start", "end", "intron-chr", "intron-start", "intron-end") DMLIntonOverlaps$start <- DMLIntonOverlaps$start + 1 #Add 1 to the start position since the file is BEDtools output DMLIntonOverlaps$end <- DMLIntonOverlaps$end - 1 #Add 1 to the start position since the file is BEDtools output head(DMLIntonOverlaps) #Confirm changes ``` ## Merge with DML-gene annotation table ```{r} DMLIntronAnnotation <- merge(x = DMLIntonOverlaps, y = DMLGeneAnnotation, by = "start", sort = FALSE, all.x = TRUE) #Merge files by DML position. Don't sort the output and retain all rows from DMLIntronOverlaps even if there are no matches in DMLGeneAnnotation head(DMLIntronAnnotation) #Confirm merge ``` ```{r} DMLIntronAnnotation <- DMLIntronAnnotation[,c(2, 1, 3, 9:12, 5:6, 15:28)] #Reorganize columns remove dist-to-closest-gene column since these is all overlaps. colnames(DMLIntronAnnotation) <- c("chr", "start", "end", "meth.diff", "strand", "pvalue", "qvalue", "intron-start", "intron-end", "gene-start", "gene-end", "geneID", "mRNA-start", "mRNA-end", "Genbank", "Uniprot-ID", "Uniprot-Accession", "evalue", "Product", "GO-ID", "GO-BP", "GO-CC", "GO-MF") #Add column names DMLIntronAnnotation <- distinct(DMLIntronAnnotation) #Remove duplicate rows (merging artifacts) head(DMLIntronAnnotation) #Confirm changes ``` ```{r} write.csv(DMLIntronAnnotation, "2019-06-20-DML-Intron-Annotation.csv") #Save file ``` ## Hypermethylated DML-intron overlaps ```{r} DMLIntronAnnotationHypermethylated <- subset(x = DMLIntronAnnotation, subset = DMLIntronAnnotation$meth.diff > 0) #Obtain hypermethylated DML by subsetting meth.diff > 0 head(DMLIntronAnnotationHypermethylated) #Confirm subset ``` ```{r} write.csv(DMLIntronAnnotationHypermethylated, "2019-06-20-DML-Intron-Annotation-Hypermethylated.csv") #Save file ``` ## Hypomethylated DML-intron overlaps ```{r} DMLIntronAnnotationHypomethylated <- subset(x = DMLIntronAnnotation, subset = DMLIntronAnnotation$meth.diff < 0) #Obtain hypomethylated DML by subsetting meth.diff < 0 head(DMLIntronAnnotationHypomethylated) #Confirm subset ``` ```{r} write.csv(DMLIntronAnnotationHypomethylated, "2019-06-20-DML-Intron-Annotation-Hypomethylated.csv") #Save file ``` # Master promoter overlap list The `dist-to-closet-gene` has important information for DML distances to genes, but not the closest transcription start sites (i.e. mRNA). I will import the DML-promoter overlaps, then modify this dataframe to obtain a master DML-promoter overlap annotation. ## Import DML-promoter overlaps ```{r} promoterList <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2019-05-29-Flanking-Analysis/2019-05-29-mRNA-1000bp-UpstreamFlanks-DML.txt", header = FALSE) #Import DML-promoter overlaps promoterList <- promoterList[,-c(7:8)] #Remove extra columns colnames(promoterList) <- c("flank-chr", "flank-start", "flank-end", "chr", "start", "end") #Rename columns head(promoterList) #Confirm changes ``` ```{r} promoterList$start <- promoterList$start + 1 #Add 1 to the start position since the file is BEDtools output promoterList$end <- promoterList$end -1 #Subtract 1 to the end position since the file is BEDtools output promoterList$`mRNA-start` <- promoterList$`flank-end` #Duplicate The end of the flank is the same as the start of the mRNA. Create a duplicate column for easy merging. head(promoterList) #Confrm changes ``` ## Merge with `masterDMLAnnotation` ```{r} DMLPromoterAnnotation <- merge(x = masterDMLAnnotation, y = promoterList, by = "mRNA-start", sort = FALSE, all.y = TRUE) #Merge the master list with promoter information by mRNA-start. Retain all rows in promoterList that do not have matching entries in masterDMLAnnotation head(DMLPromoterAnnotation) #Confirm merge ``` ```{r} DMLPromoterAnnotation <- DMLPromoterAnnotation[,c(2:13, 25:26, 1, 14:23)] #Reorganize columns colnames(DMLPromoterAnnotation) <- c("chr", "start", "end", "meth.diff", "strand", "pvalue", "qvalue", "dist-to-closest-gene", "gene-chr", "gene-start", "gene-end", "geneID", "flank-start", "flank-end", "mRNA-start", "mRNA-end", "Genbank", "Uniprot-Accession", "Uniprot-ID", "evalue", "Product", "GO-ID", "GO-BP", "GO-CC", "GO-MF") #Rename columns to match naming conventions DMLPromoterAnnotation <- distinct(DMLPromoterAnnotation) #Remove duplicate rows (merge artifacts) head(DMLPromoterAnnotation) #Confirm reorganization ``` ```{r} write.csv(DMLPromoterAnnotation, "2019-06-20-DML-Promoter-Annotation.csv") #Save file ``` # Master transposable element list The last annotation list I need is one with DML-transposable element overlaps. ## Import DML-TE overlap file ```{r} TEOverlaps <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2019-05-29-DML-TE-all.txt", header = FALSE) #Import DML-TE overlap file TEOverlaps <- TEOverlaps[,-c(4, 6:7, 12)] #Remove extraneous columns colnames(TEOverlaps) <- c("chr", "start", "end", "TE-chr", "TE-start", "TE-end", "V10", "TE-strand", "Target-Motif") #Rename columns head(TEOverlaps) #Confirm import ``` ## Merge with TE-gene overlap file ```{r} TEGeneOverlaps <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2018-11-07-Genes-TE-all.txt", header = FALSE) TEGeneOverlaps <- TEGeneOverlaps[,-c(4:6, 11, 13)] #Remove extraneous columns colnames(TEGeneOverlaps) <- c("gene-chr", "gene-start", "gene-end", "TE-start", "TE-end", "V10", "TE-strand", "Target-Motif") #Rename columns. Changed V9 to V10 to match column names in TEOverlaps head(TEGeneOverlaps) #Confirm import ``` ```{r} TEDMLOverlaps <- merge(x = TEOverlaps, y = TEGeneOverlaps, by = c("TE-start", "TE-end"), sort = FALSE, all.x = TRUE) #Merge TEOverlaps with TEGeneOverlaps. Do not sort the output. Retain any information in the DML-TE overlaps that does not have matching gene annotations head(TEDMLOverlaps) #Confirm merge ``` ```{r} TEDMLOverlaps <- TEDMLOverlaps[,c(3:5, 1:2, 7:9, 11:12)] #Reorganize columns head(TEDMLOverlaps) #Confirm changes ``` ## Merge with `masterDMLAnnotation` This will allow me to match TE-DML-gene overlaps with genes. ```{r} DMLTEAnnotation <- merge(x = TEDMLOverlaps, y = masterDMLAnnotation, by = c("gene-start", "gene-end"), sort = FALSE, all.x = TRUE) #Merge TEDMLOverlaps with masterDMLAnnotation by gene start and end positions. Do not sort the output and retain any TEDMLOverlap entries that do not have matches in masterDMLAnnotation head(DMLTEAnnotation) #Confirm merge ``` ```{r} DMLTEAnnotation <- DMLTEAnnotation[,c(11:17, 6:10, 18, 1:2, 20:31)] #Reorganize columns colnames(DMLTEAnnotation) <- c("chr", "start", "end", "meth.diff", "strand", "pvalue", "qvalue", "TE-start", "TE-end", "V10", "TE-strand", "Target-Motif", "dist-to-closest-gene", "gene-start", "gene-end", "geneID", "mRNA-start", "mRNA-end", "Genbank", "Uniprot-Accession", "Uniprot-ID", "evalue", "Product", "GO-ID", "GO-BP", "GO-CC", "GO-MF") #Rename columns DMLTEAnnotation <- distinct(DMLTEAnnotation) #Retain unique rows head(DMLTEAnnotation) #Confirm changes ``` ```{r} write.csv(DMLTEAnnotation, "2019-06-20-DML-TE-Annotation.csv") #Save file ``` # Gene background ## Import DMLBackground-mRNA overlaps ```{r} DMLBackgroundmRNAOverlaps <- read.delim("../2018-11-01-DML-and-DMR-Analysis/2019-06-20-DMLBackground-mRNA.txt", header = FALSE) DMLBackgroundmRNAOverlaps <- DMLBackgroundmRNAOverlaps[,-c(4:5)] #Remove extra columns colnames(DMLBackgroundmRNAOverlaps) <- c("chr", "DMLB-start", "DMLB-end", "mRNA-start", "mRNA-end") #Rename columns head(DMLBackgroundmRNAOverlaps) #Confirm import ``` ## Merge with `mRNAList` ```{r} DMLBackgroundmRNAOverlaps <- merge(x = DMLBackgroundmRNAOverlaps, y = mRNAList, by = c("mRNA-start", "mRNA-end"), sort = FALSE) #Merge dataframes by mRNA start and end positions head(DMLBackgroundmRNAOverlaps) #Confirm merge ``` ```{r} DMLBackgroundmRNAOverlaps <- DMLBackgroundmRNAOverlaps[,c(3:6, 1:2, 9:11)] #Reorganize columns head(DMLBackgroundmRNAOverlaps) #Confirm reorganization ``` ```{r} head(blastResults) ``` ## Merge with Uniprot codes ```{r} backgroundmRNAblast <- merge(x = blastResults, y = DMLBackgroundmRNAOverlaps, by = "Genbank", sort = FALSE) #Merge the gene background and Uniprot codes by the Genbank ID column backgroundmRNAblast <- distinct(backgroundmRNAblast) #Remove duplicate rows from merging artifacts head(backgroundmRNAblast) #Confirm merge ``` ```{r} backgroundmRNAblast <- backgroundmRNAblast[,c(15:21, 1:14, 22)] #Reorganize columns backgroundmRNAblasteValueCutoff <- subset(backgroundmRNAblast, subset = backgroundmRNAblast$evalue < 1e-10) #Subset by evalue cutoff head(backgroundmRNAblasteValueCutoff) #Confirm reorganization ``` ```{r} write.csv(backgroundmRNAblasteValueCutoff, "2019-06-20-mRNA-Gene-Background-Uniprot.csv") #Write out file ``` # Functional gene description I will use `GO-BP` information to characterize Gene Ontology represented in DML-exon and DML-intron overlaps. I will also separate this out by hyper- and hypomethylated DML. ## Exons ### All exons #### Subset by e-value cutoffs ```{r} DMLExonAnnotationEvalueCutoff <- subset(x = DMLExonAnnotation, subset = DMLExonAnnotation$evalue < 1e-10) #Subset rows with acceptably low e-value cutoffs (no larger than 1e-10) write.csv(DMLExonAnnotationEvalueCutoff, "2019-06-20-DML-Exon-Annotation-Evalue-Cutoff.csv") nrow(DMLExonAnnotationEvalueCutoff) #Approx. 8000 rows dropped ``` #### Separate GOterms into new columns ```{bash} awk -F "\"*,\"*" '{print $22}' 2019-06-20-DML-Exon-Annotation-Evalue-Cutoff.csv \ | awk '{gsub(";","\t",$0); print;}' \ > 2019-06-20-DML-Exon-Annotation-GO-BP-Expanded.txt #Isolate GO-BP column and convert ";" to ",". Save the output as a new file. ``` ```{bash} head -n5 2019-06-20-DML-Exon-Annotation-GO-BP-Expanded.txt #Confirm file creation ``` ```{bash} awk '{print NF}' 2019-06-20-DML-Exon-Annotation-GO-BP-Expanded.txt > 2019-06-20-DML-Exon-Annotation-GO-BP-Expanded-Column-Numbers.txt #Count the number of columns in each row ``` #### Import and format isolated GO-BP information ```{r} DMLExonAnnotationGOBPColumns <- read.delim("2019-06-20-DML-Exon-Annotation-GO-BP-Expanded-Column-Numbers.txt") max(DMLExonAnnotationGOBPColumns$X1) #I should import the file with 353 columns ``` ```{r} DMLExonAnnotationGOBPExpanded <- read.delim("2019-06-20-DML-Exon-Annotation-GO-BP-Expanded.txt", col.names = paste("V", 1:353, sep = ""), header = FALSE) #Import expanded GO-BP information DMLExonAnnotationGOBPExpanded <- DMLExonAnnotationGOBPExpanded[-1,] #Remove the first row, which is the original haeder from the file head(DMLExonAnnotationGOBPExpanded) #Confirm import ``` ```{r} DMLExonAnnotationGOBPExpanded <- DMLExonAnnotationGOBPExpanded[,c(1:3)] #Not every Uniprot Accession code has the same number of GOterms. I will use the first 3 GOterms assigned in my analyses DMLExonAnnotationGOBPExpandedNoNA <- DMLExonAnnotationGOBPExpanded[is.na(DMLExonAnnotationGOBPExpanded$V1) == FALSE,] #Keeps rows with without NA in the first column (columns with at least one associated GOterm) nrow(DMLExonAnnotationGOBPExpandedNoNA) #Count the number of rows ``` ```{r} DMLExonAnnotationGOBPExpandedNoNA$V2 <- gsub('{1}$', '', DMLExonAnnotationGOBPExpandedNoNA$V2) #Remove space before GOterms DMLExonAnnotationGOBPExpandedNoNA$V3 <- gsub('{1}$', '', DMLExonAnnotationGOBPExpandedNoNA$V3) #Remove space before GOterms head(DMLExonAnnotationGOBPExpandedNoNA) #Confim change ``` ```{r} DMLExonAnnotationGOBPExpandedNoNAList <- c(as.vector(DMLExonAnnotationGOBPExpandedNoNA$V1), as.vector(DMLExonAnnotationGOBPExpandedNoNA$V2), as.vector(DMLExonAnnotationGOBPExpandedNoNA$V3)) #Convert to a vector DMLExonAnnotationGOBPExpandedNoNAList <- data.frame("rowNumber" = seq(from = 1, to = length(DMLExonAnnotationGOBPExpandedNoNAList), by = 1), "GO-BP" = DMLExonAnnotationGOBPExpandedNoNAList) #Add row numbers to use in count function head(DMLExonAnnotationGOBPExpandedNoNAList) #Check format ``` #### Count GOterms ```{r} DMLExonAnnotationGOBPCounts <- count(DMLExonAnnotationGOBPExpandedNoNAList, GO.BP) #Count instances of each GOterm head(DMLExonAnnotationGOBPCounts) #Confirm counting ``` ```{r} nrow(DMLExonAnnotationGOBPCounts) #Count the number of rows ``` ```{r} write.csv(DMLExonAnnotationGOBPCounts, "2019-06-20-DML-Exon-Annotation-GO-BP-Counts.csv") #Save file ``` ### Hypermethylated exons #### Subset by e-value cutoffs ```{r} DMLExonAnnotationHypermethylatedEvalueCutoff <- subset(x = DMLExonAnnotationHypermethylated, subset = DMLExonAnnotationHypermethylated$evalue < 1e-10) #Subset rows with acceptably low e-value cutoffs (no larger than 1e-10) write.csv(DMLExonAnnotationHypermethylatedEvalueCutoff, "2019-06-20-DML-Exon-Annotation-Hypermethylated-Evalue-Cutoff.csv") #Save file nrow(DMLExonAnnotationHypermethylatedEvalueCutoff) #Count the number of rows ``` #### Separate GOterms into new columns ```{bash} awk -F "\"*,\"*" '{print $22}' 2019-06-20-DML-Exon-Annotation-Hypermethylated-Evalue-Cutoff.csv \ | awk '{gsub(";","\t",$0); print;}' \ > 2019-06-20-DML-Exon-Annotation-Hypermethylated-GO-BP-Expanded.txt #Isolate GO-BP column and convert ";" to ",". Save the output as a new file. ``` ```{bash} head -n5 2019-06-20-DML-Exon-Annotation-Hypermethylated-GO-BP-Expanded.txt #Confirm file creation ``` ```{bash} awk '{print NF}' 2019-06-20-DML-Exon-Annotation-Hypermethylated-GO-BP-Expanded.txt > 2019-06-20-DML-Exon-Annotation-Hypermethylated-GO-BP-Expanded-Column-Numbers.txt #Count the number of columns in each row ``` #### Import and format isolated GO-BP information ```{r} DMLExonAnnotationHypermethylatedGOBPColumns <- read.delim("2019-06-20-DML-Exon-Annotation-Hypermethylated-GO-BP-Expanded-Column-Numbers.txt") max(DMLExonAnnotationHypermethylatedGOBPColumns$X1) #I should import the file with 147 columns ``` ```{r} DMLExonAnnotationHypermethylatedGOBPExpanded <- read.delim("2019-06-20-DML-Exon-Annotation-Hypermethylated-GO-BP-Expanded.txt", col.names = paste("V", 1:147, sep = ""), header = FALSE) #Import expanded GO-BP information DMLExonAnnotationHypermethylatedGOBPExpanded <- DMLExonAnnotationHypermethylatedGOBPExpanded[-1,] #Remove the first row, which is the original haeder from the file head(DMLExonAnnotationHypermethylatedGOBPExpanded) #Confirm import ``` ```{r} DMLExonAnnotationHypermethylatedGOBPExpanded <- DMLExonAnnotationHypermethylatedGOBPExpanded[,c(1:3)] #Not every Uniprot Accession code has the same number of GOterms. I will use the first 3 GOterms assigned in my analyses DMLExonAnnotationHypermethylatedGOBPExpandedNoNA <- DMLExonAnnotationHypermethylatedGOBPExpanded[is.na(DMLExonAnnotationHypermethylatedGOBPExpanded$V1) == FALSE,] #Keeps rows with without NA in the first column (columns with at least one associated GOterm) nrow(DMLExonAnnotationHypermethylatedGOBPExpandedNoNA) #Count the number of rows ``` ```{r} DMLExonAnnotationHypermethylatedGOBPExpandedNoNA$V2 <- gsub('{1}$', '', DMLExonAnnotationHypermethylatedGOBPExpandedNoNA$V2) #Remove space before GOterms DMLExonAnnotationHypermethylatedGOBPExpandedNoNA$V3 <- gsub('{1}$', '', DMLExonAnnotationHypermethylatedGOBPExpandedNoNA$V3) #Remove space before GOterms head(DMLExonAnnotationHypermethylatedGOBPExpandedNoNA) #Confim change ``` ```{r} DMLExonAnnotationHypermethylatedGOBPExpandedNoNAList <- c(as.vector(DMLExonAnnotationHypermethylatedGOBPExpandedNoNA$V1), as.vector(DMLExonAnnotationHypermethylatedGOBPExpandedNoNA$V2), as.vector(DMLExonAnnotationHypermethylatedGOBPExpandedNoNA$V3)) #Convert to a vector DMLExonAnnotationHypermethylatedGOBPExpandedNoNAList <- data.frame("rowNumber" = seq(from = 1, to = length(DMLExonAnnotationHypermethylatedGOBPExpandedNoNAList), by = 1), "GO-BP" = DMLExonAnnotationHypermethylatedGOBPExpandedNoNAList) #Add row numbers to use in count function head(DMLExonAnnotationHypermethylatedGOBPExpandedNoNAList) #Check format ``` #### Count GOterms ```{r} DMLExonAnnotationHypermethylatedGOBPCounts <- count(DMLExonAnnotationHypermethylatedGOBPExpandedNoNAList, GO.BP) #Count instances of each GOterm head(DMLExonAnnotationHypermethylatedGOBPCounts) #Confirm counting ``` ```{r} nrow(DMLExonAnnotationHypermethylatedGOBPCounts) #Count the number of rows ``` ```{r} write.csv(DMLExonAnnotationHypermethylatedGOBPCounts, "2019-06-20-DML-Exon-Annotation-Hypermethylated-GO-BP-Counts.csv") #Save file ``` ### Hypomethylated exons #### Subset by e-value cutoffs ```{r} DMLExonAnnotationHypomethylatedEvalueCutoff <- subset(x = DMLExonAnnotationHypomethylated, subset = DMLExonAnnotationHypomethylated$evalue < 1e-10) #Subset rows with acceptably low e-value cutoffs (no larger than 1e-10) write.csv(DMLExonAnnotationHypomethylatedEvalueCutoff, "2019-06-20-DML-Exon-Annotation-Hypomethylated-Evalue-Cutoff.csv") #Save file nrow(DMLExonAnnotationHypomethylatedEvalueCutoff) #Count the number of rows ``` #### Separate GOterms into new columns ```{bash} awk -F "\"*,\"*" '{print $22}' 2019-06-20-DML-Exon-Annotation-Hypomethylated-Evalue-Cutoff.csv \ | awk '{gsub(";","\t",$0); print;}' \ > 2019-06-20-DML-Exon-Annotation-Hypomethylated-GO-BP-Expanded.txt #Isolate GO-BP column and convert ";" to ",". Save the output as a new file. ``` ```{bash} head -n5 2019-06-20-DML-Exon-Annotation-Hypomethylated-GO-BP-Expanded.txt #Confirm file creation ``` ```{bash} awk '{print NF}' 2019-06-20-DML-Exon-Annotation-Hypomethylated-GO-BP-Expanded.txt > 2019-06-20-DML-Exon-Annotation-Hypomethylated-GO-BP-Expanded-Column-Numbers.txt #Count the number of columns in each row ``` #### Import and format isolated GO-BP information ```{r} DMLExonAnnotationHypomethylatedGOBPColumns <- read.delim("2019-06-20-DML-Exon-Annotation-Hypomethylated-GO-BP-Expanded-Column-Numbers.txt") max(DMLExonAnnotationHypomethylatedGOBPColumns$X1) #I should import the file with 353 columns ``` ```{r} DMLExonAnnotationHypomethylatedGOBPExpanded <- read.delim("2019-06-20-DML-Exon-Annotation-Hypomethylated-GO-BP-Expanded.txt", col.names = paste("V", 1:353, sep = ""), header = FALSE) #Import expanded GO-BP information DMLExonAnnotationHypomethylatedGOBPExpanded <- DMLExonAnnotationHypomethylatedGOBPExpanded[-1,] #Remove the first row, which is the original haeder from the file head(DMLExonAnnotationHypomethylatedGOBPExpanded) #Confirm import ``` ```{r} DMLExonAnnotationHypomethylatedGOBPExpanded <- DMLExonAnnotationHypomethylatedGOBPExpanded[,c(1:3)] #Not every Uniprot Accession code has the same number of GOterms. I will use the first 3 GOterms assigned in my analyses DMLExonAnnotationHypomethylatedGOBPExpandedNoNA <- DMLExonAnnotationHypomethylatedGOBPExpanded[is.na(DMLExonAnnotationHypomethylatedGOBPExpanded $V1) == FALSE,] #Keeps rows with without NA in the first column (columns with at least one associated GOterm) nrow(DMLExonAnnotationHypomethylatedGOBPExpandedNoNA) #Count the number of rows ``` ```{r} DMLExonAnnotationHypomethylatedGOBPExpandedNoNA $V2 <- gsub('{1}$', '', DMLExonAnnotationHypomethylatedGOBPExpandedNoNA $V2) #Remove space before GOterms DMLExonAnnotationHypomethylatedGOBPExpandedNoNA $V3 <- gsub('{1}$', '', DMLExonAnnotationHypomethylatedGOBPExpandedNoNA $V3) #Remove space before GOterms head(DMLExonAnnotationHypomethylatedGOBPExpandedNoNA) #Confim change ``` ```{r} DMLExonAnnotationHypomethylatedGOBPExpandedNoNAList <- c(as.vector(DMLExonAnnotationHypomethylatedGOBPExpandedNoNA $V1), as.vector(DMLExonAnnotationHypomethylatedGOBPExpandedNoNA $V2), as.vector(DMLExonAnnotationHypomethylatedGOBPExpandedNoNA $V3)) #Convert to a vector DMLExonAnnotationHypomethylatedGOBPExpandedNoNAList <- data.frame("rowNumber" = seq(from = 1, to = length(DMLExonAnnotationHypomethylatedGOBPExpandedNoNAList), by = 1), "GO-BP" = DMLExonAnnotationHypomethylatedGOBPExpandedNoNAList) #Add row numbers to use in count function head(DMLExonAnnotationHypomethylatedGOBPExpandedNoNAList) #Check format ``` #### Count GOterms ```{r} DMLExonAnnotationHypomethylatedGOBPCounts <- count(DMLExonAnnotationHypomethylatedGOBPExpandedNoNAList, GO.BP) #Count instances of each GOterm head(DMLExonAnnotationHypomethylatedGOBPCounts) #Confirm counting ``` ```{r} nrow(DMLExonAnnotationHypomethylatedGOBPCounts) #Count rows ``` ```{r} write.csv(DMLExonAnnotationHypomethylatedGOBPCounts, "2019-06-20-DML-Exon-Annotation-Hypomethylated-GO-BP-Counts.csv") #Save file ``` ## Introns ### All introns #### Subset by e-value cutoffs ```{r} DMLIntronAnnotationEvalueCutoff <- subset(x = DMLIntronAnnotation, subset = DMLIntronAnnotation$evalue < 1e-10) #Subset rows with acceptably low e-value cutoffs (no larger than 1e-10) write.csv(DMLIntronAnnotationEvalueCutoff, "2019-06-20-DML-Intron-Annotation-Evalue-Cutoff.csv") nrow(DMLIntronAnnotationEvalueCutoff) ``` #### Separate GOterms into new columns ```{bash} awk -F "\"*,\"*" '{print $22}' 2019-06-20-DML-Intron-Annotation-Evalue-Cutoff.csv \ | awk '{gsub(";","\t",$0); print;}' \ > 2019-06-20-DML-Intron-Annotation-GO-BP-Expanded.txt #Isolate GO-BP column and convert ";" to ",". Save the output as a new file. ``` ```{bash} head -n5 2019-06-20-DML-Intron-Annotation-GO-BP-Expanded.txt #Confirm file creation ``` ```{bash} awk '{print NF}' 2019-06-20-DML-Intron-Annotation-GO-BP-Expanded.txt > 2019-06-20-DML-Intron-Annotation-GO-BP-Expanded-Column-Numbers.txt #Count the number of columns in each row ``` #### Import and format isolated GO-BP information ```{r} DMLIntronAnnotationGOBPColumns <- read.delim("2019-06-20-DML-Intron-Annotation-GO-BP-Expanded-Column-Numbers.txt") max(DMLIntronAnnotationGOBPColumns $X1) #I should import the file with 147 columns ``` ```{r} DMLIntronAnnotationGOBPExpanded <- read.delim("2019-06-20-DML-Intron-Annotation-GO-BP-Expanded.txt", col.names = paste("V", 1:147, sep = ""), header = FALSE) #Import expanded GO-BP information DMLIntronAnnotationGOBPExpanded <- DMLIntronAnnotationGOBPExpanded[-1,] #Remove the first row, which is the original header from the file head(DMLIntronAnnotationGOBPExpanded) #Confirm import ``` ```{r} DMLIntronAnnotationGOBPExpanded <- DMLIntronAnnotationGOBPExpanded[,c(1:3)] #Not every Uniprot Accession code has the same number of GOterms. I will use the first 3 GOterms assigned in my analyses DMLIntronAnnotationGOBPExpandedNoNA <- DMLIntronAnnotationGOBPExpanded[is.na(DMLIntronAnnotationGOBPExpanded $V1) == FALSE,] #Keeps rows with without NA in the first column (columns with at least one associated GOterm) nrow(DMLIntronAnnotationGOBPExpandedNoNA) #Count the number of rows ``` ```{r} DMLIntronAnnotationGOBPExpandedNoNA $V2 <- gsub('{1}$', '', DMLIntronAnnotationGOBPExpandedNoNA $V2) #Remove space before GOterms DMLIntronAnnotationGOBPExpandedNoNA $V3 <- gsub('{1}$', '', DMLIntronAnnotationGOBPExpandedNoNA $V3) #Remove space before GOterms head(DMLIntronAnnotationGOBPExpandedNoNA) #Confim change ``` ```{r} DMLIntronAnnotationGOBPExpandedNoNAList <- c(as.vector(DMLIntronAnnotationGOBPExpandedNoNA$V1), as.vector(DMLIntronAnnotationGOBPExpandedNoNA$V2), as.vector(DMLIntronAnnotationGOBPExpandedNoNA$V3)) #Convert to a vector DMLIntronAnnotationGOBPExpandedNoNAList <- data.frame("rowNumber" = seq(from = 1, to = length(DMLIntronAnnotationGOBPExpandedNoNAList), by = 1), "GO-BP" = DMLIntronAnnotationGOBPExpandedNoNAList) #Add row numbers to use in count function head(DMLIntronAnnotationGOBPExpandedNoNAList) #Check format ``` #### Count GOterms ```{r} DMLIntronAnnotationGOBPCounts <- count(DMLIntronAnnotationGOBPExpandedNoNAList, GO.BP) #Count instances of each GOterm head(DMLIntronAnnotationGOBPCounts) #Confirm counting ``` ```{r} nrow(DMLIntronAnnotationGOBPCounts) #Count the number of GOterm categories ``` ```{r} write.csv(DMLIntronAnnotationGOBPCounts, "2019-06-20-DML-Intron-Annotation-GO-BP-Counts.csv") #Save file ``` ### Hypermethylated introns #### Subset by e-value cutoffs ```{r} DMLIntronAnnotationHypermethylatedEvalueCutoff <- subset(x = DMLIntronAnnotationHypermethylated, subset = DMLIntronAnnotationHypermethylated$evalue < 1e-10) #Subset rows with acceptably low e-value cutoffs (no larger than 1e-10) write.csv(DMLIntronAnnotationHypermethylatedEvalueCutoff, "2019-06-20-DML-Intron-Annotation-Hypermethylated-Evalue-Cutoff.csv") #Save file nrow(DMLIntronAnnotationHypermethylatedEvalueCutoff) #Count the number of rows ``` #### Separate GOterms into new columns ```{bash} awk -F "\"*,\"*" '{print $22}' 2019-06-20-DML-Intron-Annotation-Hypermethylated-Evalue-Cutoff.csv \ | awk '{gsub(";","\t",$0); print;}' \ > 2019-06-20-DML-Intron-Annotation-Hypermethylated-GO-BP-Expanded.txt #Isolate GO-BP column and convert ";" to ",". Save the output as a new file. ``` ```{bash} head -n5 2019-06-20-DML-Intron-Annotation-Hypermethylated-GO-BP-Expanded.txt #Confirm file creation ``` ```{bash} awk '{print NF}' 2019-06-20-DML-Intron-Annotation-Hypermethylated-GO-BP-Expanded.txt > 2019-06-20-DML-Intron-Annotation-Hypermethylated-GO-BP-Expanded-Column-Numbers.txt #Count the number of columns in each row ``` #### Import and format isolated GO-BP information ```{r} DMLIntronAnnotationHypermethylatedGOBPColumns <- read.delim("2019-06-20-DML-Intron-Annotation-Hypermethylated-GO-BP-Expanded-Column-Numbers.txt") max(DMLIntronAnnotationHypermethylatedGOBPColumns$X1) #I should import the file with 147 columns ``` ```{r} DMLIntronAnnotationHypermethylatedGOBPExpanded <- read.delim("2019-06-20-DML-Intron-Annotation-Hypermethylated-GO-BP-Expanded.txt", col.names = paste("V", 1:147, sep = ""), header = FALSE) #Import expanded GO-BP information DMLIntronAnnotationHypermethylatedGOBPExpanded <- DMLIntronAnnotationHypermethylatedGOBPExpanded[-1,] #Remove the first row, which is the original haeder from the file head(DMLIntronAnnotationHypermethylatedGOBPExpanded) #Confirm import ``` ```{r} DMLIntronAnnotationHypermethylatedGOBPExpanded <- DMLIntronAnnotationHypermethylatedGOBPExpanded[,c(1:3)] #Not every Uniprot Accession code has the same number of GOterms. I will use the first 3 GOterms assigned in my analyses DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA <- DMLIntronAnnotationHypermethylatedGOBPExpanded[is.na(DMLIntronAnnotationHypermethylatedGOBPExpanded$V1) == FALSE,] #Keeps rows with without NA in the first column (columns with at least one associated GOterm) nrow(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA) #Count the number of rows ``` ```{r} DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA$V2 <- gsub('{1}$', '', DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA$V2) #Remove space before GOterms DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA$V3 <- gsub('{1}$', '', DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA$V3) #Remove space before GOterms head(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA) #Confim change ``` ```{r} DMLIntronAnnotationHypermethylatedGOBPExpandedNoNAList <- c(as.vector(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA$V1), as.vector(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA$V2), as.vector(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNA$V3)) #Convert to a vector DMLIntronAnnotationHypermethylatedGOBPExpandedNoNAList <- data.frame("rowNumber" = seq(from = 1, to = length(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNAList), by = 1), "GO-BP" = DMLIntronAnnotationHypermethylatedGOBPExpandedNoNAList) #Add row numbers to use in count function head(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNAList) #Check format ``` #### Count GOterms ```{r} DMLIntronAnnotationHypermethylatedGOBPCounts <- count(DMLIntronAnnotationHypermethylatedGOBPExpandedNoNAList, GO.BP) #Count instances of each GOterm head(DMLIntronAnnotationHypermethylatedGOBPCounts) #Confirm counting ``` ```{r} nrow(DMLIntronAnnotationHypermethylatedGOBPCounts) #Count rows ``` ```{r} write.csv(DMLIntronAnnotationHypermethylatedGOBPCounts, "2019-06-20-DML-Intron-Annotation-Hypermethylated-GO-BP-Counts.csv") #Save file ``` ### Hypomethylated introns #### Subset by e-value cutoffs ```{r} DMLIntronAnnotationHypomethylatedEvalueCutoff <- subset(x = DMLIntronAnnotationHypomethylated, subset = DMLIntronAnnotationHypomethylated$evalue < 1e-10) #Subset rows with acceptably low e-value cutoffs (no larger than 1e-10) write.csv(DMLIntronAnnotationHypomethylatedEvalueCutoff, "2019-06-20-DML-Intron-Annotation-Hypomethylated-Evalue-Cutoff.csv") #Save file nrow(DMLIntronAnnotationHypomethylatedEvalueCutoff) #Count the number of rows ``` #### Separate GOterms into new columns ```{bash} awk -F "\"*,\"*" '{print $22}' 2019-06-20-DML-Intron-Annotation-Hypomethylated-Evalue-Cutoff.csv \ | awk '{gsub(";","\t",$0); print;}' \ > 2019-06-20-DML-Intron-Annotation-Hypomethylated-GO-BP-Expanded.txt #Isolate GO-BP column and convert ";" to ",". Save the output as a new file. ``` ```{bash} head -n5 2019-06-20-DML-Intron-Annotation-Hypomethylated-GO-BP-Expanded.txt #Confirm file creation ``` ```{bash} awk '{print NF}' 2019-06-20-DML-Intron-Annotation-Hypomethylated-GO-BP-Expanded.txt > 2019-06-20-DML-Intron-Annotation-Hypomethylated-GO-BP-Expanded-Column-Numbers.txt #Count the number of columns in each row ``` #### Import and format isolated GO-BP information ```{r} DMLIntronAnnotationHypomethylatedGOBPColumns <- read.delim("2019-06-20-DML-Intron-Annotation-Hypomethylated-GO-BP-Expanded-Column-Numbers.txt") max(DMLIntronAnnotationHypomethylatedGOBPColumns$X1) #I should import the file with 73 columns ``` ```{r} DMLIntronAnnotationHypomethylatedGOBPExpanded <- read.delim("2019-06-20-DML-Intron-Annotation-Hypomethylated-GO-BP-Expanded.txt", col.names = paste("V", 1:73, sep = ""), header = FALSE) #Import expanded GO-BP information DMLIntronAnnotationHypomethylatedGOBPExpanded <- DMLIntronAnnotationHypomethylatedGOBPExpanded[-1,] #Remove the first row, which is the original haeder from the file head(DMLIntronAnnotationHypomethylatedGOBPExpanded) #Confirm import ``` ```{r} DMLIntronAnnotationHypomethylatedGOBPExpanded <- DMLIntronAnnotationHypomethylatedGOBPExpanded[,c(1:3)] #Not every Uniprot Accession code has the same number of GOterms. I will use the first 3 GOterms assigned in my analyses DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA <- DMLIntronAnnotationHypomethylatedGOBPExpanded[is.na(DMLIntronAnnotationHypomethylatedGOBPExpanded $V1) == FALSE,] #Keeps rows with without NA in the first column (columns with at least one associated GOterm) nrow(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA) #Count the number of rows ``` ```{r} DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA $V2 <- gsub('{1}$', '', DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA $V2) #Remove space before GOterms DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA $V3 <- gsub('{1}$', '', DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA $V3) #Remove space before GOterms head(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA) #Confim change ``` ```{r} DMLIntronAnnotationHypomethylatedGOBPExpandedNoNAList <- c(as.vector(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA $V1), as.vector(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA $V2), as.vector(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNA $V3)) #Convert to a vector DMLIntronAnnotationHypomethylatedGOBPExpandedNoNAList <- data.frame("rowNumber" = seq(from = 1, to = length(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNAList), by = 1), "GO-BP" = DMLIntronAnnotationHypomethylatedGOBPExpandedNoNAList) #Add row numbers to use in count function head(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNAList) #Check format ``` #### Count GOterms ```{r} DMLIntronAnnotationHypomethylatedGOBPCounts <- count(DMLIntronAnnotationHypomethylatedGOBPExpandedNoNAList, GO.BP) #Count instances of each GOterm head(DMLIntronAnnotationHypomethylatedGOBPCounts) #Confirm counting ``` ```{r} nrow(DMLIntronAnnotationHypomethylatedGOBPCounts) ``` ```{r} write.csv(DMLIntronAnnotationHypomethylatedGOBPCounts, "2019-06-20-DML-Intron-Annotation-Hypomethylated-GO-BP-Counts.csv") #Save file ```