--- title: "DML Characterization" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` In this script, I'll create annotation tables for DML and examine patterns in DML locations across chromosomes and genes. # Install packages ```{r} require(RColorBrewer) require(dichromat) require(dplyr) ``` # Session information ```{r} sessionInfo() ``` # Obtain Uniprot codes Before I do any sort of gene enrichment analysis, I need to get Uniprot codes for all genes. To do this, I will use `blastx`. ## Make a Uniprot Database The Uniprot database was downloaded from this link as a .fasta file: http://www.uniprot.org/uniprot/?query=&fil=reviewed%3Ayes&columns=id%2Centry%20name%2Creviewed%2Cprotein%20names%2Cgenes%2Corganism%2Clength%2Cgo(biological%20process)%2Cgo-id%2Ccomment(PATHWAY)%2Cdatabase(UniPathway)%2Cdatabase(CDD)%2Cdatabase(Pfam. ```{bash} head uniprot-filtered-reviewed.fasta #Check database creation ``` ```{bash} /Users/Shared/Apps/ncbi-blast-2.2.29+/bin/makeblastdb -help #Get the help menu ``` The first step involves taking the uniprot fasta file and making a new database. To do this, I will use the following code: 1. Path to `makeblastdb` to create the database needed 2. `in` to select the input file for the database 3. `dbtype` specifies the type of file which in our case, is a protein file ```{bash} /Users/Shared/Apps/ncbi-blast-2.2.29+/bin/makeblastdb \ -in uniprot-filtered-reviewed.fasta \ -dbtype 'prot' ``` ```{bash} /Users/Shared/Apps/ncbi-blast-2.2.29+/bin/blastx -help ``` Now I can perform a `blastx` to compare my query, the *C. virginica* transcripts, to our protein database. ## Perform `blastx` 1. Path to `blastx` 2. `-query` provides the file we want to blast (*C. virginica* transcripts) 3.`-db` specifies database created in the previous step 4. `-outfmt` specifies the type of output file. I will use 6, a tabular file 5. `-out ` will allow me to save the output as a new file 6. `-num_threads 4` uses 4 CPUs in the BLAST search 7. `-max_target_seqs 1` keeps only one aligned sequence (i.e. the best match) ```{bash} /Users/Shared/Apps/ncbi-blast-2.2.29+/bin/blastx \ -query ../analyses/2018-12-02-Gene-Enrichment-Analysis/2018-09-06-Virginica-transcripts.fna \ -db uniprot-filtered-reviewed.fasta \ -outfmt 6 \ -out ../analyses/2018-12-02-Gene-Enrichment-Analysis/2018-09-06-Transcript-Uniprot-blastx.txt \ -num_threads 4 \ -max_target_seqs 1 ``` ```{bash} head ../analyses/2018-12-02-Gene-Enrichment-Analysis/2018-09-11-Transcript-Uniprot-blastx-codeIsolated.txt #Check file format of Uniprot codes. I'm using a file with the Uniprot code already isolated so I don't have to unfold any columns after merging files. ``` ```{r} blastResults <- read.delim("../analyses/2018-12-02-Gene-Enrichment-Analysis/2018-09-11-Transcript-Uniprot-blastx-codeIsolated.txt", header = FALSE, sep = "\t", dec = ".") #Import blast results head(blastResults) #Confirm import colnames(blastResults) <- c("Genbank", "V2", "Uniprot", "V4", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore") head(blastResults) #Confirm column headers ``` # Master gene list I want to create a master list with the following columns: - chr - start - end - p-value - q-value - meth.diff - dist-to-closest-gene - gene-chr - gene-start - gene-end - gene-ID - mRNA-start - mRNA-end - genbank - uniprot-ID - e-value - product I can start with my DML csv file, then merge that with the information from `closestBed`. Then I'll merge that document with my gene track to get gene IDs. I'll merge the resultant file with the mRNA track by gene ID to get Genbank IDs. Finally, I'll merge that document with the Uniprot codes by matching Genbank IDs. ## Merge DML list with `closestBed` list ```{r} allDML <- read.csv("../analyses/2018-10-25-MethylKit/2018-10-25-Loci-Analysis/2019-04-05-Differentially-Methylated-Loci-Filtered-Destrand-50-Cov5.csv", header = TRUE) #Import DML list allDML <- allDML[,-1] #Remove first column head(allDML) #Confirm import ``` ```{r} closestDML <- read.delim("../analyses/2018-11-01-DML-Analysis/2019-05-29-Flanking-Analysis/2019-05-29-Genes-Closest-NoOverlap-DMLs.txt", header = FALSE) #Import DML with closestBed information colnames(closestDML) <- c("chr", "start", "end", "meth.diff", "gene-chr", "gene-start", "gene-end", "dist-to-closest-gene") #Add column names head(closestDML) #Confirm column names. DML downstream of genes have positive distances. There may be more lines than DML becuase there are some DML that overlap multiple gene features, or have ties for closest gene features. ``` ```{r} #Add 1 to the start position and subtract 1 from the end position to get DML positions compatible with the DML list closestDML$start <- closestDML$start + 1 closestDML$end <- closestDML$end -1 head(closestDML) #Confirm changes ``` ```{r} allDMLclosestBed <- merge(x = closestDML, y = allDML, by = "start", sort = FALSE) #Merge closest information and DML list. Do not resort the output. head(allDMLclosestBed) #Confirm merge ``` ```{r} allDMLclosestBed <- allDMLclosestBed[, c(2, 1, 3, 4, 11, 12, 13, 14, 8, 5, 6, 7)] #Reorganize columns head(allDMLclosestBed) #Confirm reorganization ``` ## Merge with gene track ### Import full gene list ```{r} geneList <- read.delim("../genome-feature-tracks/C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3", header = FALSE) #Import gene gff3 colnames(geneList) <- c("gene-chr", "Gnomon", "gene", "gene-start", "gene-end", "V6", "V7", "V8", "V9") #Rename columns head(geneList) #Confirm import ``` ```{r} geneList <- geneList[,-c(6:8)] #Remove columns 6-8 head(geneList) #Confirm changes ``` ### Isolate gene description from full list ```{bash} cut -f9 ../analyses/2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3 \ | awk '{gsub(";","\t",$0); print;}' \ > ../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-Gene-Descriptions.txt #Isolate gene description column and convert ";" to "\t". Save the output as a new file ``` ```{bash} head ../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-Gene-Descriptions.txt #Confirm file creation. ``` Before I import this new data set into R, I need to count the number of columns. I can do this with `awk`. ```{bash} awk '{print NF}' 2019-06-20-Gene-Descriptions.txt > 2019-06-20-Gene-Description-Column-Numbers.txt ``` ### Import isolated gene descriptions ```{r} geneDescriptionColumns <- read.delim("../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-Gene-Description-Column-Numbers.txt", header = FALSE) #Import data with number of columns for each row max(geneDescriptionColumns) #Max number of columns. I'll need to import my gene description dataset with 9 columns per row. ``` ```{r} geneDescriptions <- read.delim2("../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-Gene-Descriptions.txt", header = FALSE, col.names = c("gene-number", "gene-ID", "V3", "V4", "V5", "V6", "V7", "V8", "V9")) #Import document with gene descrptions. Name 9 columns so all rows are imported with 9 columns head(geneDescriptions) #Confirm import ``` ```{r} geneID <- geneDescriptions$gene.ID #Isolate gene ID column geneID <- gsub("Dbxref=GeneID:", "", geneID) #Remove extraneous characters head(geneID) #Confirm changes ``` ```{r} geneList <- data.frame(geneList, geneID) #Replace geneList dataframe with geneList and geneID geneList <- geneList[,-6] #Remove extra column colnames(geneList) <- c("gene-chr", "Gnomon", "gene", "gene-start", "gene-end", "gene-ID") #Rename columns to match previous naming conventions head(geneList) #Confirm merge ``` ### Merge gene track with `allDMLclosestBed` ```{r} allDMLclosestBedgeneList <- merge(x = allDMLclosestBed, y = geneList, by = c("gene-start", "gene-end"), sort = FALSE) #Merge dataframes by gene start and end positions head(allDMLclosestBedgeneList) #Confirm merge ``` ```{r} allDMLclosestBedgeneList <- allDMLclosestBedgeneList[,c(3:12, 14:15, 1:2, 16)] #Reorganize columns head(allDMLclosestBedgeneList) #Confirm reorganization ``` ## Merge with mRNA track ### Import full mRNA track ```{r} mRNAList <- read.delim("../genome-feature-tracks/C_virginica-3.0_Gnomon_mRNA_yrv.gff3", header = FALSE) #Import mRNA track mRNAList <- mRNAList[,-c(6,8)] #Remove extra columns colnames(mRNAList) <- c("mRNA-chr", "Gnomon", "mRNA", "mRNA-start", "mRNA-end", "mRNA-strand", "V9") #Rename columns head(mRNAList) #Confirm import ``` ### Isolate mRNA description from full list ```{bash} cut -f9 ../analyses/2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_mRNA_yrv.gff3 \ | awk '{gsub(";","\t",$0); print;}' \ | awk '{gsub(",","\t",$0); print;}' \ > ../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-mRNA-Descriptions.txt #Isolate mRNA description column and convert ";" and "," to "\t". Save the output as a new file ``` ```{bash} head ../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-mRNA-Descriptions.txt #Confirm file creation. ``` Before I import this new data set into R, I need to count the number of columns. I can do this with `awk`. ```{bash} awk '{print NF}' ../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-mRNA-Descriptions.txt > ../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-mRNA-Description-Column-Numbers.txt ``` ### Import isolated mRNA descriptions ```{r} mRNADescriptionColumns <- read.delim("../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-mRNA-Description-Column-Numbers.txt", header = FALSE) #Import data with number of columns for each row max(mRNADescriptionColumns) #Max number of columns. I'll need to import my gene description dataset with 78 columns per row. ``` ```{r} mRNADescriptions <- read.delim2("../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-mRNA-Descriptions.txt", header = FALSE, col.names = paste("V", 1:78, sep = "")) #Import document with gene descrptions. Name 78 columns so all rows are imported with 78 columns head(mRNADescriptions) #Confirm import ``` ```{r} geneIDmRNAList <- mRNADescriptions$V3 #Isolate gene ID column geneIDmRNAList <- gsub("Dbxref=GeneID:", "", geneIDmRNAList) #Remove extraneous characters head(geneIDmRNAList) #Confirm changes ``` ```{r} genbankIDmRNAList <- mRNADescriptions$V4 #Isolate Genbank ID genbankIDmRNAList <- gsub("Genbank:", "", genbankIDmRNAList) #Remove extraneous characters head(genbankIDmRNAList) #Confirm changes ``` Buried in the `mRNAList` is mRNA product information. I'm going to extract that information and add it to the master list. ### Isolate product descriptions ```{bash} cut -f9 ../analyses/2019-05-13-Generating-Genome-Feature-Tracks/C_virginica-3.0_Gnomon_mRNA_yrv.gff3 \ | awk '{gsub("product=","\t",$0); print;}' \ | cut -f2 \ | awk '{gsub(";","\t",$0); print;}' \ | awk '{gsub("%","\t",$0); print;}' \ > ../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-mRNA-Products.txt #Isolate mRNA description column and convert "product=" to "\t". This separates the descriptions into 2 columns. The beginning of second column has the product description. Take the second column of the output and convert ";" and "%" to "\t" to better isolate mRNA products. Save the output as a new file. ``` ```{bash} head ../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-mRNA-Products.txt #Confirm product description isolation ``` ```{bash} awk '{print NF}' ../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-mRNA-Products.txt > ../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-mRNA-Products-Column-Numbers.txt ``` ### Import isolated mRNA products ```{r} mRNAProductColumns <- read.delim("../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-mRNA-Products-Column-Numbers.txt", header = FALSE) #Import data with number of columns for each row max(mRNAProductColumns) #Max number of columns. I'll need to import my dataset with 16 columns per row. ``` ```{r} mRNAProducts <- read.delim2("../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-mRNA-Products.txt", header = FALSE, col.names = paste("V", 1:16, sep = "")) #Import document with gene descrptions. Name 16 columns so all rows are imported with 16 columns. head(mRNAProducts) #Confirm import ``` ```{r} mRNAProducts <- mRNAProducts$V1 #Extract the first column with product information ``` ### Merge mRNA track with `allDMLclosestBedgeneList` ```{r} mRNAList <- data.frame(mRNAList, geneIDmRNAList, genbankIDmRNAList, mRNAProducts) #Replace mRNAList dataframe with mRNAList, geneIDmRNAList, genbankIDmRNAList, and mRNAProducts mRNAList <- mRNAList[,-c(6:7)] #Remove extra columns colnames(mRNAList) <- c("mRNA-chr", "Gnomon", "mRNA", "mRNA-start", "mRNA-end", "gene-ID", "Genbank", "Product") #Rename columns to match previous naming conventions head(mRNAList) #Confirm merge ``` ```{r} allDMLclosestBedgeneListmRNAList <- merge(x = allDMLclosestBedgeneList, y = mRNAList, by = "gene-ID", sort = FALSE, all.x = TRUE) #Merge dataframes by gene ID head(allDMLclosestBedgeneListmRNAList) #Confirm merge ``` ```{r} allDMLclosestBedgeneListmRNAList <- allDMLclosestBedgeneListmRNAList[,c(2:15, 1, 19:22)] #Reorganize columns head(allDMLclosestBedgeneListmRNAList) #Confirm reorganization ``` ```{r} allDMLclosestBedgeneListmRNAList <- distinct(allDMLclosestBedgeneListmRNAList) #Keep only unique rows ``` ## Merge with Uniprot codes ```{r} allDMLclosestBedgeneListmRNAListUniprot <- merge(x = allDMLclosestBedgeneListmRNAList, y = blastResults, by = "Genbank", sort = FALSE, all.x = TRUE) #Merge DML list with Uniprot codes by Genbank IDs head(allDMLclosestBedgeneListmRNAListUniprot) #Confirm merge ``` ```{r} allDMLclosestBedgeneListmRNAListUniprot <- allDMLclosestBedgeneListmRNAListUniprot[,c(2:4, 9, 6:8, 10:11, 14:18, 1, 21:22, 31, 19)] #Reorganize columns and save as a new dataframe colnames(allDMLclosestBedgeneListmRNAListUniprot) <- c("chr", "start", "end", "meth.diff", "strand", "pvalue", "qvalue", "dist-to-closest-gene", "gene-chr", "gene-start", "gene-end", "geneID", "mRNA-start", "mRNA-end", "Genbank", "Uniprot-Accession", "Uniprot-ID", "evalue", "Product") #Rename columns allDMLclosestBedgeneListmRNAListUniprot <- distinct(allDMLclosestBedgeneListmRNAListUniprot) #Keep only unique rows head(allDMLclosestBedgeneListmRNAListUniprot) #Confirm reorganization ``` ## Merge with GOterms I will use a Uniprot-SwissProt database I downloaded to match Uniprot-Accession with Gene Ontology information, which includes GOterms and IDs. I downloaded a tab-delimited SwissProt database with five additional columns: Gene Ontology IDs, Gene ontology (GO), Gene ontology (biological processes), Gene ontology (cellular component), and Gene ontology (molecular function). ### Import Uniprot-Gene Ontology databse ```{bash} gunzip ../analyses/2018-12-02-Gene-Enrichment-Analysis/uniprot-reviewed_yes.tab.gz #Unzip file ``` ```{bash} head -n2 ../analyses/2018-12-02-Gene-Enrichment-Analysis/uniprot-reviewed_yes.tab #View file header ``` ```{r} uniprotGOTerms <- read.delim("../analyses/2018-12-02-Gene-Enrichment-Analysis/uniprot-reviewed_yes.tab", header = TRUE) #Import tab-delimited file with header head(uniprotGOTerms) #Confirm import ``` ```{r} uniprotGOTerms <- uniprotGOTerms[,c(1:2, 8, 10:12)] #Remove unnecessary columns colnames(uniprotGOTerms) <- c("Uniprot-Accession", "Uniprot-ID", "GO-ID", "GO-BP", "GO-CC", "GO-MF") #Rename columns to match previous naming conventions head(uniprotGOTerms) #Confirm changes ``` ### Merge with `allDMLclosestBedgeneListmRNAListUniprot` ```{r} allDMLclosestBedgeneListmRNAListUniprotGOTerms <- merge(x = allDMLclosestBedgeneListmRNAListUniprot, y = uniprotGOTerms, by = c("Uniprot-Accession", "Uniprot-ID"), sort = FALSE, all.x = TRUE) #Merge files by Uniprot Accesssion code and Uniprot ID head(allDMLclosestBedgeneListmRNAListUniprotGOTerms) #Confirm merge ``` ```{r} allDMLclosestBedgeneListmRNAListUniprotGOTerms <- allDMLclosestBedgeneListmRNAListUniprotGOTerms[,c(3:17, 1:2, 18:23)] #Reorganize columns colnames(allDMLclosestBedgeneListmRNAListUniprotGOTerms) <- c("chr", "start", "end", "meth.diff", "strand", "pvalue", "qvalue", "dist-to-closest-gene", "gene-chr", "gene-start", "gene-end", "geneID", "mRNA-start", "mRNA-end", "Genbank", "Uniprot-Accession", "Uniprot-ID", "evalue", "Product", "GO-ID", "GO-BP", "GO-CC", "GO-MF") #Rename columns to match naming conventions head(allDMLclosestBedgeneListmRNAListUniprotGOTerms) #Confirm changes ``` ## Export master gene list ```{r} masterDMLAnnotation <- distinct(allDMLclosestBedgeneListmRNAListUniprotGOTerms) #Remove duplicated rows (merging artifacts) head(masterDMLAnnotation) #Confirm changes ``` ```{r} write.csv(masterDMLAnnotation, "../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-Master-DML-Annotation.csv") #Write out masterDMLAnnotation as a .csv file ``` Now that I have a master gene list, I can create relevant subset documents. ## DML-Gene overlaps ```{r} DMLGeneAnnotation <- subset(x = masterDMLAnnotation, subset = masterDMLAnnotation$`dist-to-closest-gene` == 0) #Subset dataset where the distance to the closest gene = 0 (i.e. DML and gene are overlapping) head(DMLGeneAnnotation) #Confirm subset ``` ```{r} write.csv(DMLGeneAnnotation, "../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-DML-Gene-Annotation.csv") #Save file ``` ### Characterize DML distribution in genes ##### Count the number of DML per gene ```{r} DMLGeneAnnotationNomRNA <- unique(DMLGeneAnnotation[,-c(8:9, 13:18, 20:23)]) #Remove mRNA information from annotation to quantify #DML/gene (and not by mRNA). Also eliminate Uniprot columns. Use unique to remove any lines that are merging artifacts head(DMLGeneAnnotationNomRNA) #Confirm column removal ``` ```{r} DMLbyGeneCounts <- as.data.frame(table(DMLGeneAnnotationNomRNA$geneID)) #Count the number of DML/gene colnames(DMLbyGeneCounts) <- c("geneID", "DMLCount") #Change column names head(DMLbyGeneCounts) #Confirm formatting ``` ```{r} mean(DMLbyGeneCounts$DMLCount) #Calculate mean number of DML in genes with DML median(DMLbyGeneCounts$DMLCount) #Calculate median number of DML in genes with DML max(DMLbyGeneCounts$DMLCount) #Calculate max number of DML in a single gene ``` ```{r} sum(DMLbyGeneCounts$DMLCount == 1) #Count number of genes with 1 DML sum(DMLbyGeneCounts$DMLCount == 2) #Count number of genes with 2 DML sum(DMLbyGeneCounts$DMLCount == 3) #Count number of genes with 3 DML sum(DMLbyGeneCounts$DMLCount == 4) #Count number of genes with 4 DML sum(DMLbyGeneCounts$DMLCount == 5) #Count number of genes with 5 DML ``` ```{r} numberDMLinGenes <- data.frame("DMLCount" = c("1", "2", "3", "4", "5"), "geneCount" = c(sum(DMLbyGeneCounts$DMLCount == 1), sum(DMLbyGeneCounts$DMLCount == 2), sum(DMLbyGeneCounts$DMLCount == 3), sum(DMLbyGeneCounts$DMLCount == 4), sum(DMLbyGeneCounts$DMLCount == 5))) #Create dataframe for plotting head(numberDMLinGenes) #Confirm creation ``` ```{r} genes1DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 1,] #Subset genes with 1 DML genes1DMLAnnotations <- unique(merge(x = genes1DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 2 DML with annotation information. Use unique to remove merging artifacts head(genes1DMLAnnotations) #Confirm merge ``` ```{r} write.csv(genes1DMLAnnotations, "../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-10-01-Genes-with-1-DML.csv", quote = FALSE, row.names = FALSE) #Save file ``` ```{r} genes2DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 2,] #Subset genes with 2 DML genes2DMLAnnotations <- unique(merge(x = genes2DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 2 DML with annotation information. Use unique to remove merging artifacts head(genes2DMLAnnotations) #Confirm merge ``` ```{r} write.csv(genes2DMLAnnotations, "../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-10-01-Genes-with-2-DML.csv", quote = FALSE, row.names = FALSE) #Save file ``` ```{r} genes3DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 3,] #Subset genes with 3 DML genes3DMLAnnotations <- unique(merge(x = genes3DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 3 DML with annotation information. Use unique to remove merging artifacts head(genes3DMLAnnotations) #Confirm merge ``` ```{r} write.csv(genes3DMLAnnotations, "../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-10-01-Genes-with-3-DML.csv", quote = FALSE, row.names = FALSE) #Save file ``` ```{r} genes4DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 4,] #Subset genes with 4 DML genes4DMLAnnotations <- unique(merge(x = genes4DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 4 DML with annotation information. Use unique to remove merging artifacts head(genes4DMLAnnotations) #Confirm merge ``` ```{r} write.csv(genes4DMLAnnotations, "../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-10-01-Genes-with-4-DML.csv", quote = FALSE, row.names = FALSE) #Save file ``` ```{r} genes5DML <- DMLbyGeneCounts[DMLbyGeneCounts$DMLCount == 5,] #Subset genes with 2 DML genes5DMLAnnotations <- unique(merge(x = genes5DML, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge genes with 5 DML with annotation information. Use unique to remove merging artifacts head(genes5DMLAnnotations) #Confirm merge ``` ```{r} write.csv(genes5DMLAnnotations, "../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-10-01-Genes-with-5-DML.csv", quote = FALSE, row.names = FALSE) #Save file ``` #### Characterize types of DML in genes ```{r} sum(genes1DMLAnnotations$meth.diff > 0) #Count number of genes with 1 DML hypermethylated sum(genes1DMLAnnotations$meth.diff < 0) #Count number of genes with 1 DML hypomethylated ``` ```{r} genes1DMLBreakdown <- data.frame("type" = c("1hyper", "0hyper"), "DMLcount" = c(sum(genes1DMLAnnotations$meth.diff > 0), sum(genes1DMLAnnotations$meth.diff < 0))) #Create dataframe with hyper/hypo breakdown to plot data head(genes1DMLBreakdown) ``` I did some Excel manipulations to count the number of hypermethylated DML in genes with multiple DML. ```{r} genes2DMLCounts <- read.csv("../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-10-01-Genes-with-2-DML-withCounts.csv", header = TRUE) #Import file genes2DMLCounts$no.hypo <- genes2DMLCounts$DMLCount - genes2DMLCounts$no.hyper #Calculate the number of hypomethylated DML in genes head(genes2DMLCounts) #Confirm changes ``` ```{r} sum(genes2DMLCounts$no.hyper == 2) #Count # genes with both DML hypermethylated sum(genes2DMLCounts$no.hyper == 1) #Count # genes with 1 DML hypermethylated and 1 hypomethylated sum(genes2DMLCounts$no.hyper == 0) #Count # genes with both DML hypomethylated ``` ```{r} genes2DMLBreakdown <- data.frame("type" = c("2hyper", "1hyper", "0hyper"), "DMLcount" = c(sum(genes2DMLCounts$no.hyper == 2), sum(genes2DMLCounts$no.hyper == 1), sum(genes2DMLCounts$no.hyper == 0))) #Create dataframe with hyper/hypo breakdown to plot data head(genes2DMLBreakdown) ``` ```{r} genes3DMLCounts <- read.csv("../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-10-01-Genes-with-3-DML-withCounts.csv", header = TRUE) #Import file genes3DMLCounts$no.hypo <- genes3DMLCounts$DMLCount - genes3DMLCounts$no.hyper #Calculate the number of hypomethylated DML in genes head(genes3DMLCounts) #Confirm changes ``` ```{r} sum(genes3DMLCounts$no.hyper == 3) #Count # genes with all DML hypermethylated sum(genes3DMLCounts$no.hyper == 2) #Count # genes with 2 DML hypermethylated and 1 hypomethylated sum(genes3DMLCounts$no.hyper == 1) #Count # genes with 1 DML hypermethylated and 1 hypomethylated sum(genes3DMLCounts$no.hyper == 0) #Count # genes with all DML hypomethylated ``` ```{r} genes3DMLBreakdown <- data.frame("type" = c("3hyper", "2hyper", "1hyper", "0hyper"), "DMLcount" = c(sum(genes3DMLCounts$no.hyper == 3), sum(genes3DMLCounts$no.hyper == 2), sum(genes3DMLCounts$no.hyper == 1), sum(genes3DMLCounts$no.hyper == 0))) #Create dataframe with hyper/hypo breakdown to plot data head(genes3DMLBreakdown) ``` ```{r} genes4DMLCounts <- read.csv("../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-10-01-Genes-with-4-DML-withCounts.csv", header = TRUE) #Import file genes4DMLCounts$no.hypo <- genes4DMLCounts$DMLCount - genes4DMLCounts$no.hyper #Calculate the number of hypomethylated DML in genes head(genes4DMLCounts) #Confirm changes ``` ```{r} sum(genes4DMLCounts$no.hyper == 4) #Count # genes with all DML hypermethylated sum(genes4DMLCounts$no.hyper == 3) #Count # genes with 3 DML hypermethylated and 1 hypomethylated sum(genes4DMLCounts$no.hyper == 2) #Count # genes with 2 DML hypermethylated and 2 hypomethylated sum(genes4DMLCounts$no.hyper == 1) #Count # genes with 1 DML hypermethylated and 3 hypomethylated sum(genes4DMLCounts$no.hyper == 0) #Count # genes with all DML hypomethylated ``` ```{r} genes4DMLBreakdown <- data.frame("type" = c("4hyper", "3hyper", "2hyper", "1hyper", "0hyper"), "DMLcount" = c(sum(genes4DMLCounts$no.hyper == 4), sum(genes4DMLCounts$no.hyper == 3), sum(genes4DMLCounts$no.hyper == 2), sum(genes4DMLCounts$no.hyper == 1), sum(genes4DMLCounts$no.hyper == 0))) #Create dataframe with hyper/hypo breakdown to plot data head(genes4DMLBreakdown) ``` ```{r} genes5DMLCounts <- read.csv("../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-10-01-Genes-with-5-DML-withCounts.csv", header = TRUE) #Import file genes5DMLCounts$no.hypo <- genes5DMLCounts$DMLCount - genes5DMLCounts$no.hyper #Calculate the number of hypomethylated DML in genes head(genes5DMLCounts) #Confirm changes ``` ```{r} sum(genes5DMLCounts$no.hyper == 5) #Count # genes with all DML hypermethylated sum(genes5DMLCounts$no.hyper == 4) #Count # genes with 4 DML hypermethylated and 1 hypomethylated sum(genes5DMLCounts$no.hyper == 3) #Count # genes with 3 DML hypermethylated and 2 hypomethylated sum(genes5DMLCounts$no.hyper == 2) #Count # genes with 2 DML hypermethylated and 3 hypomethylated sum(genes5DMLCounts$no.hyper == 1) #Count # genes with 1 DML hypermethylated and 4 hypomethylated sum(genes5DMLCounts$no.hyper == 0) #Count # genes with all DML hypomethylated ``` ```{r} genes1DMLBreakdown$totalDML <- rep(1, times = length(genes1DMLBreakdown$type)) #Add a column with the total number of DML in a gene head(genes1DMLBreakdown) #Confirm formatting genes2DMLBreakdown$totalDML <- rep(2, times = length(genes2DMLBreakdown$type)) #Add a column with the total number of DML in a gene head(genes2DMLBreakdown) #Confirm formatting genes3DMLBreakdown$totalDML <- rep(3, times = length(genes3DMLBreakdown$type)) #Add a column with the total number of DML in a gene head(genes3DMLBreakdown) #Confirm formatting genes4DMLBreakdown$totalDML <- rep(4, times = length(genes4DMLBreakdown$type)) #Add a column with the total number of DML in a gene head(genes4DMLBreakdown) #Confirm formatting ``` ```{r} combDMLBreakdown <- rbind(genes1DMLBreakdown, genes2DMLBreakdown, genes3DMLBreakdown, genes4DMLBreakdown) head(combDMLBreakdown) #Confirm changes ``` ```{r} combDMLCounts <- dcast(combDMLBreakdown, type ~ totalDML, value.var = "DMLcount") #Convert long dataframe to wide combDMLCounts[,6] <- c(0, 0, 0, 0, 1) #Add data for one gene with 5 DML (4 hyper, 1 hypo) row.names(combDMLCounts) <- combDMLCounts$type #Save number of hypermethylated DML as rownames combDMLCounts <- combDMLCounts[,-1] #Remove first column colnames(combDMLCounts) <- c(1, 2, 3, 4, 5) #Add total number of DML per gene as column names combDMLCounts[is.na(combDMLCounts)] <- 0 #Replace NA with 0 head(combDMLCounts) #Confirm formatting ``` ```{r} combDMLCounts[6,] <- 0 #Create an empty row with zeros nLength <- length(combDMLCounts$`2`) #Count the number of rows for (i in 1:nLength) { combDMLCounts[6,i] <- sum(combDMLCounts[,i]) #Sum the number of DML in each category } for (i in 1:nLength) { combDMLCounts[i,] <- (combDMLCounts[i,] / combDMLCounts[6,]) * 100 #Divide all entries by the total number of DML in each category and multiply by 100 } combDMLCounts <- combDMLCounts[-6,] #Remove row with totals head(combDMLCounts) #Confirm changes ``` ##### Create individual figures I'm going to create a multipanel plot with a breakdown of the number hyper- and hypomethylated in genes. I will start by creating each figure separately. ```{r} #DML/gene barplot(numberDMLinGenes$geneCount, axes = FALSE, names.arg = numberDMLinGenes$DMLCount, ylim = c(0, 450)) #Create base plot axis(side = 2, las = 1, at = c(0, 450), col = "grey80") #Put the axis labels only at 0 and 450. mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label mtext(side = 1, "Number of DML", line = 3) #Add x-axis label ``` ```{r} #Hyper/hypo breakdown for genes with 1 DML barplot(genes1DMLBreakdown$DMLcount, axes = FALSE, names.arg = c("1,0", "0,1"), ylim = c(0, 250), col = c("grey20", "grey100")) #Create base plot axis(side = 2, las = 1, at = c(0, 250), col = "grey80") #Put the axis labels only at 0 and 250. mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label mtext(side = 1, "Number of DML (Hypermethylated, Hypomethylated)", line = 3) #Add x-axis label ``` ```{r} #Hyper/hypo breakdown for genes with 2 DML barplot(genes2DMLBreakdown$DMLcount, axes = FALSE, names.arg = c("2,0", "1,1", "0,2"), ylim = c(0, 20), col = c("grey20", "grey60", "grey100")) #Create base plot axis(side = 2, las = 1, at = c(0, 20), col = "grey80") #Put the axis labels only at 0 and 20. mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label mtext(side = 1, "Number of DML (Hypermethylated, Hypomethylated)", line = 3) #Add x-axis label ``` ```{r} #Hyper/hypo breakdown for genes with 3 DML barplot(genes3DMLBreakdown$DMLcount, axes = FALSE, names.arg = c("3,0", "2,1", "1,2", "0,3"), ylim = c(0, 5), col = c("grey20", "grey40", "grey60", "grey100")) #Create base plot axis(side = 2, las = 1, at = c(0, 5), col = "grey80") #Put the axis labels only at 0 and the maximum value specified by yPrettyTicks. Tick marks are at 0 and 5. mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label mtext(side = 1, "Number of DML (Hypermethylated, Hypomethylated)", line = 3) #Add x-axis label ``` ```{r} #Hyper/hypo breakdown for genes with 4 DML barplot(genes4DMLBreakdown$DMLcount, axes = FALSE, names.arg = c("4,0", "3,1", "2,2", "1,3", "0,4"), ylim = c(0, 3), col = c("grey20", "grey40", "grey60", "grey80", "grey100")) #Create base plot axis(side = 2, las = 1, at = c(0, 3), col = "grey80") #Put the axis labels only at 0 and the maximum value specified by yPrettyTicks. Tick marks are at 0 and 3. mtext(side = 2, "Number of Genes with DML", line = 3) #Add y-axis label mtext(side = 1, "Number of DML (Hypermethylated, Hypomethylated)", line = 3) #Add x-axis label ``` ```{r} barplot(t(t(combDMLCounts)), axes = FALSE, names.arg = c("1", "2", "3", "4", "5"), cex.names = 1.5, ylim = c(0,100), col = plotColors[1:5]) #Create a stacked barplot. Use axes = FALSE to remove the y-axis and names.arg to set labels on the x-axis. mtext(side = 1, "Total DML in a Gene", line = 3, cex = 1.5) #Add x-axis label axis(side = 2, at = seq(0, 100, by = 10), las = 2, col = "grey80", cex.axis = 1.2) #Add y-axis mtext(side = 2, "Hypermethylated DML (%)", line = 3, cex = 1.5) #Add y-axis label par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE) #Create new plot plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n") #Add new plot on top of current plot legend("top", legend = c("0", "1", "2", "3", "4"), xpd = TRUE, horiz = TRUE, pch = 22, col = "black", pt.bg = plotColors[1:5], bty = "n", cex = 1.5) #Place a legend in the top right of the figure with no box ``` #### Create multipanel figure ```{r} RColorBrewer::display.brewer.all() #Show all RColorBrewer palettes. I will choose greens. plotColors <- rev(RColorBrewer::brewer.pal(5, "GnBu")) #Create a color palette for the barplots. Use 5 green-blue shades from RColorBrewer. Reverse theorder so the darkest shade is used first. barplot(t(t(proportionData)), col = plotColors) #See what plot looks like with new scheme barplot(t(t(proportionData)), col = dichromat(plotColors)) #Check that the plot colors will be easy to interpret for those with color blindess ``` I'm going to combine the breakdowns with the number of DML in each chromosome. ```{r} #pdf("../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-10-03-DML-Distribution-in-Chr-Genes.pdf", width = 11, height = 8.5) plotMatrix <- matrix(c(1, 2, 3), nrow = 3, ncol = 1, byrow = TRUE) #Create a matrix with plot location information. Fill matrix by row. par(mar = c(0, 0, 6, 2.5), oma = c(6, 8, 0, 7)) #Specify inner and outer margins layout(mat = plotMatrix) #Create a layout based on plotMatrix. The first and second plots (first column) should be wider. #layout.show(n = 3) #Confirm layout looks good #Chromosome distribution DMLbarplot <- barplot(as.matrix(t(DMLchrCounts$DMLbyChrCpG)), axes = FALSE, names.arg = DMLchrCounts$chr, cex.names = 1.5, xlim = c(0.7,11.5), ylim = c(0,7e-05), col = plotColors[6]) #Create a barplot and save as a new object. Use axes = FALSE to remove the y-axis and names.arg to set labels on the x-axis. The object contains x coordinates for bars, so xlim is set at 12 to compensate for maximum value of 11.5 mtext(side = 1, "Chromosome", line = 3, cex = 1.3) #Add x-axis label axis(side = 2, line = 1.5, at = seq(0, 7e-05, by = 1e-05), labels = seq(0, 7, by = 1), las = 2, col = "grey80", cex.axis = 1.2) #Add y-axis for DML counts mtext(side = 2, "Number DML per 100,000 CpGs", line = 5, cex = 1.3) #Add y-axis label for DML counts mtext(side = 3, line = -1, adj = 0, text = "a.") #Add figure label #Add gene number infomration par(new = TRUE) #Create a new plot plot(x = DMLbarplot, y = DMLchrCounts$geneCount, type = "b", axes = FALSE, xlab = "", ylab = "", xaxs = "i", yaxs = "i", pch = 16, col = plotColors[2], xlim = c(0,12), ylim = c(0,6500)) #Plot points and lines (type = "b") for gene count by chromosome. Use the coordinates from DMLbarplot (x = DMLbarplot) and set xlim = (0,12) so plots are lined up. Use axes = FALSE to remove both axes. Remove x and y lables (xlab = ""; ylab = ""). Set ylim = (0,6500) to account for max y-values. Use xaxs and yxaxs to remove space between axes. axis(side = 4, line = 1.5, at = seq(0, 6500, by = 500), las = 2, col = "grey80", cex.axis = 1.2) #Add y-axis for gene sequence counts mtext(side = 4, "Number of Genes", line = 7, cex = 1.3) #Add y-axis label for gene sequence counts #DML/gene distribution barplot(numberDMLinGenes$geneCount, axes = FALSE, names.arg = numberDMLinGenes$DMLCount, cex.names = 1.5, col = plotColors[6], ylim = c(0, 450)) #Create base plot axis(side = 2, las = 1, at = c(0, 450), col = "grey80", cex.axis = 1.2) #Put the axis labels only at 0 and 450. mtext(side = 2, "Number of Genes with DML", line = 4, cex = 1.3) #Add y-axis label mtext(side = 3, line = -1, at = c(0.15, 440), text = "b.") #Add figure label #Hyper/hypo breakdown #Combined hyper/hypo plot barplot(t(t(combDMLCounts)), axes = FALSE, names.arg = c("1", "2", "3", "4", "5"), cex.names = 1.5, ylim = c(0,100), col = plotColors[1:5]) #Create a stacked barplot. Use axes = FALSE to remove the y-axis and names.arg to set labels on the x-axis. mtext(side = 1, "Total DML in a Gene", line = 3, cex = 1.3) #Add x-axis label axis(side = 2, at = seq(0, 100, by = 10), las = 2, col = "grey80", cex.axis = 1.2) #Add y-axis mtext(side = 2, "Genes with DML (%)", line = 4, cex = 1.3) #Add y-axis label mtext(side = 3, line = 1, adj = 0, text = " c.") #Add figure label par(fig = c(0, 1, 0, 1), oma = c(0, 0, 0, 0), mar = c(0, 0, 0, 0), new = TRUE) #Create new plot plot(0, 0, type = "n", bty = "n", xaxt = "n", yaxt = "n") #Add new plot on top of current plot legend(x = 0.8, y = -0.4, legend = c("0", "1", "2", "3", "4"), pch = 22, col = "black", pt.bg = plotColors[1:5], bty = "n", cex = 2) #Place a legend in the top right of the figure with no box #dev.off() ``` ### Scaled DML distribution Scaling each gene from 0% to 100%, I want to see where DML occurs. This is useful to see if methylation is occuring in any consistent location for each gene. ```{r} head(DMLGeneAnnotationNomRNA) ``` ```{r} DMLGeneAnnotationNomRNA$geneLength <- DMLGeneAnnotationNomRNA$gene.end - DMLGeneAnnotationNomRNA$gene.start #Calculate gene length DMLGeneAnnotationNomRNA$absPosition <- DMLGeneAnnotationNomRNA$start - DMLGeneAnnotationNomRNA$gene.start #Calculate the absolute position of the DML in the gene DMLGeneAnnotationNomRNA$scaledPosition <- DMLGeneAnnotationNomRNA$absPosition / DMLGeneAnnotationNomRNA$geneLength #Calculate the scaled position of the DML in the gene head(DMLGeneAnnotationNomRNA) #Confirm calculations ``` ```{r} DMLGenePositionsHyper <- subset(DMLGeneAnnotationNomRNA, subset = DMLGeneAnnotationNomRNA$meth.diff > 0) #Subset hypermethylated DML DMLGenePositionsHypo <- subset(DMLGeneAnnotationNomRNA, subset = DMLGeneAnnotationNomRNA$meth.diff < 0) #Subset hypomethylated DML head(DMLGenePositionsHyper) #Confirm subset head(DMLGenePositionsHypo) #Confirm subset ``` #### Create individual figures I'm going to make back-to-back histograms to easily compare distributions of hyper- and hypomethtylated DML along a gene. First I'll create individual plots, then I'll add them to the same coordinate plane. ```{r} #Hypermethylated DML hist(DMLGenePositionsHyper$scaledPosition, breaks = 100, axes = FALSE, xlab = "", ylab = "", main = "", col = "grey20", xaxs = "i", yaxs = "i") #Create base plot with no axes or labels. Include breaks at each percent. axis(side = 1, at = seq(from = 0, to = 1, by = 0.1), col = "grey80") #Add x-axis mtext(side = 1, "Scaled Position on Gene", line = 3) #Add x-axis label axis(side = 2, at = seq(from = 0, to = 8, by = 1), col = "grey80", las = 2) #Add y-axis mtext(side = 2, "Number of Hypermethylated DML", line = 2) #Add y-axis label ``` ```{r} #Hypomethylated DML hist(DMLGenePositionsHypo$scaledPosition, breaks = 100, axes = FALSE, xlab = "", ylab = "", main = "", col = "grey100", xaxs = "i", yaxs = "i") #Create base plot with no axes or labels. Include breaks at each percent. axis(side = 1, at = seq(from = 0, to = 1, by = 0.1), col = "grey80") #Add x-axis mtext(side = 1, "Scaled Position on Gene", line = 3) #Add x-axis label axis(side = 2, at = seq(from = 0, to = 7, by = 1), col = "grey80", las = 2) #Add y-axis mtext(side = 2, "Number of Hypomethylated DML", line = 2) #Add y-axis label ``` #### Create back-to-back plot ```{r} hyperPlotDML <- hist(DMLGenePositionsHyper$scaledPosition, breaks = 100, plot = FALSE) #Save base hypermethylated plot hypoPlotDML <- hist(DMLGenePositionsHypo$scaledPosition, breaks = 100, plot = FALSE) #Save base hypomethylated plot ``` ```{r} #Calculate plotting parameters hypoPlotDML$counts <- -(hypoPlotDML$counts) #Make counts negative for hypoPlot hmaxDML <- max(hyperPlotDML$counts) #Set maximum value as max value of hyperPlot hminDML <- min(hypoPlotDML$counts) #Set minimum value as minimum value of hypoPlot breaksDMLPlot <- c(hyperPlotDML$breaks, hypoPlotDML$breaks) #Set breaks for plots xmaxDML <- max(breaksDMLPlot) #Set xmax based on breaks xminDML <- min(breaksDMLPlot) #Set xmin based on breaks ``` ```{r} #pdf("../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-10-09-Scaled-Gene-DML-Distribution.pdf", height = 8.5, width = 11) plot(hyperPlotDML, xlim = c(xminDML, xmaxDML), ylim = c(hminDML, hmaxDML), axes = FALSE, xlab = "", ylab = "", main = "", col = plotColors[2]) #Create the base plot with just hyperPlot. Set xlim and ylim on parameters calculated above. Remove all axes and axis labels lines(hypoPlotDML, col= plotColors[5]) #Add hypoPlot axis(side = 1, at = seq(from = 0, to = 1, by = 0.1), col = "grey80", cex = 1.2) #Add x-axis mtext(side = 1, "Scaled Position on Gene", line = 3, cex = 1.5) #Add x-axis label axis(side = 2, at = seq(from = hminDML, to = hmaxDML, by = 2), labels = c(seq(from = hminDML, to = 0, by = 2)*-1, seq(from = 2, to = hmaxDML, by = 2)), col = "grey80", las = 2, cex = 1.2) #Add y-axis. Add labels from minimum to maximum valules, but mtext(side = 2, "Number of DML", line = 2, cex = 1.5) #Add y-axis label #dev.off() ``` ### Biomineralization Genes If methylation changes in adults can impact biomineralization processes in larvae, then biomineralization genes should be differentially methylated in adult gonad tissue. I want to see how many (if any) biomineralization genes contain DML. ```{r} biomineralizationGenes <- read.csv("../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-12-09-Biomineralization-Genes.csv", header = TRUE) #Import biomineralization genes Alan compiled biomineralizationGenes$geneID <- gsub("LOC", "", biomineralizationGenes$geneID) #Remove extraneous characters. The biomineralization geneID have "LOC," while my master list geneID don't have those characters. I removed these characters length(biomineralizationGenes$geneID) #There are 82 biomineralization genes head(biomineralizationGenes) #Confirm changes ``` ```{r} DMLbiomineralizationGenes <- unique(merge(x = biomineralizationGenes, y = DMLGeneAnnotationNomRNA, by = "geneID")) #Merge DML-gene overlaps where the gene ID matches with biomineralization gene IDs DMLbiomineralizationGenes <- DMLbiomineralizationGenes[,-c(7, 12:15)] #Remove extraneous columns colnames(DMLbiomineralizationGenes) <- c("geneID", "Product", "chr", "start", "end", "meth.diff", "pvalue", "qvalue", "gene.start", "gene.end") length(unique(DMLbiomineralizationGenes$geneID)) #There are four unique biomineralization genes with DML head(DMLbiomineralizationGenes) #Confirm merge ``` ```{r} write.csv(DMLbiomineralizationGenes, "../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-12-09-Biomineralization-Genes-with-DML.csv", row.names = FALSE, quote = FALSE) #Save file ``` ### Hypermethylated DML-gene overlaps ```{r} DMLGeneAnnotationHypermethylated <- subset(x = DMLGeneAnnotation, subset = DMLGeneAnnotation$meth.diff > 0) #Subset hypermethylated DML by selecting meth.diff > 0 head(DMLGeneAnnotationHypermethylated) #Confirm subset ``` ```{r} write.csv(DMLGeneAnnotationHypermethylated, "../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-DML-Gene-Annotation-Hypermethylated.csv") #Save file ``` ### Hypomethylated DML-gene overlaps ```{r} DMLGeneAnnotationHypomethylated <- subset(x = DMLGeneAnnotation, subset = DMLGeneAnnotation$meth.diff < 0) #Subset hypomethylated DML by selecting meth.diff < 0 head(DMLGeneAnnotationHypomethylated) #Confirm subset ``` ```{r} write.csv(DMLGeneAnnotationHypomethylated, "../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-DML-Gene-Annotation-Hypomethylated.csv") #Save file ``` # Gene background ## Import DMLBackground-mRNA overlaps ```{r} DMLBackgroundmRNAOverlaps <- read.delim("../analyses/2018-11-01-DML-Analysis/2019-06-20-DMLBackground-mRNA.txt", header = FALSE) DMLBackgroundmRNAOverlaps <- DMLBackgroundmRNAOverlaps[,-c(4:5)] #Remove extra columns colnames(DMLBackgroundmRNAOverlaps) <- c("chr", "DMLB-start", "DMLB-end", "mRNA-start", "mRNA-end") #Rename columns head(DMLBackgroundmRNAOverlaps) #Confirm import ``` ## Merge with `mRNAList` ```{r} DMLBackgroundmRNAOverlaps <- merge(x = DMLBackgroundmRNAOverlaps, y = mRNAList, by = c("mRNA-start", "mRNA-end"), sort = FALSE) #Merge dataframes by mRNA start and end positions head(DMLBackgroundmRNAOverlaps) #Confirm merge ``` ```{r} DMLBackgroundmRNAOverlaps <- DMLBackgroundmRNAOverlaps[,c(3:6, 1:2, 9:11)] #Reorganize columns head(DMLBackgroundmRNAOverlaps) #Confirm reorganization ``` ```{r} head(blastResults) ``` ## Merge with Uniprot codes ```{r} backgroundmRNAblast <- merge(x = blastResults, y = DMLBackgroundmRNAOverlaps, by = "Genbank", sort = FALSE) #Merge the gene background and Uniprot codes by the Genbank ID column backgroundmRNAblast <- distinct(backgroundmRNAblast) #Remove duplicate rows from merging artifacts head(backgroundmRNAblast) #Confirm merge ``` ```{r} backgroundmRNAblast <- backgroundmRNAblast[,c(15:21, 1:14, 22)] #Reorganize columns backgroundmRNAblasteValueCutoff <- subset(backgroundmRNAblast, subset = backgroundmRNAblast$evalue < 1e-10) #Subset by evalue cutoff head(backgroundmRNAblasteValueCutoff) #Confirm reorganization ``` ```{r} write.csv(backgroundmRNAblasteValueCutoff, "../analyses/2018-12-02-Gene-Enrichment-Analysis/2019-06-20-mRNA-Gene-Background-Uniprot.csv") #Write out file ```