{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DML and DMR Analysis\n",
"\n",
"In this notebook, I will examine the location of differentially methylated loci (DML) and regions (DMR) in the *C. virginica* genome. The DML and DMR were identified using methylKit in [this R script](https://github.com/fish546-2018/yaamini-virginica/tree/master/analyses/2018-10-25-MethylKit).\n",
"\n",
"Methods:\n",
"\n",
"0. Prepare for Analyses\n",
"1. Locate Files and Set Variable Paths\n",
"2. Identify Overlaps with Genomic Feature Tracks\n",
"3. Identify Overlaps between CG Motifs and Other Feature Tracks\n",
"4. Identify Overlaps between Transposable Elements and Other Feature Tracks\n",
"5. Calculate Overlap Proportions\n",
"6. Gene Flanking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 0. Prepare for Analyses"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 0a. Set Working Directory"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'/Users/yaamini/Documents/yaamini-virginica/notebooks'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pwd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/Users/yaamini/Documents/yaamini-virginica/analyses\n"
]
}
],
"source": [
"cd ../analyses/"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'/Users/yaamini/Documents/yaamini-virginica/analyses'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pwd"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!mkdir 2018-11-01-DML-and-DMR-Analysis"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[34m2018-10-25-MethylKit\u001b[m\u001b[m/ \u001b[34m2019-01-15-Sample-Clustering\u001b[m\u001b[m/\r\n",
"\u001b[34m2018-11-01-DML-and-DMR-Analysis\u001b[m\u001b[m/ \u001b[34m2019-03-07-IGV-Verification\u001b[m\u001b[m/\r\n",
"\u001b[34m2018-12-02-Gene-Enrichment-Analysis\u001b[m\u001b[m/ README.md\r\n"
]
}
],
"source": [
"ls -F"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/Users/yaamini/Documents/yaamini-virginica/analyses/2018-11-01-DML-and-DMR-Analysis\n"
]
}
],
"source": [
"cd 2018-11-01-DML-and-DMR-Analysis/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 0b. Download Genome Feature Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I will be using the following genome feature tracks from the [Roberts Lab Genomic Resources wiki page](https://github.com/RobertsLab/resources/wiki/Genomic-Resources):\n",
"\n",
"1. Exon: Coding regions\n",
"2. Intron: Regions that are removed during splicing\n",
"3. mRNA: Code for proteins! The mRNA track includes both introns and exons\n",
"4. Transposable elements (all): Transposable elements located using information from all species in the RepeatMasker database (see [Sam's notes](http://onsnetwork.org/kubu4/2018/08/28/transposable-element-mapping-crassostrea-virginica-genome-cvirginica_v300-using-repeatmasker-4-07/) for more information)\n",
"5. Transposable elements (_C. gigas_): Transposable elements located using information from _C. gigas_ only (see [Sam's notes](http://onsnetwork.org/kubu4/2018/08/28/transposable-element-mapping-crassostrea-virginica-genome-cvirginica_v300-using-repeatmasker-4-07/) for more information)\n",
"6. CG motifs: Regions with CGs where methylation can occur"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 20.7M 100 20.7M 0 0 45.7M 0 --:--:-- --:--:-- --:--:-- 47.4M\n"
]
}
],
"source": [
"!curl http://eagle.fish.washington.edu/Cvirg_tracks/C_virginica-3.0_Gnomon_exon.bed > C_virginica-3.0_Gnomon_exon.bed"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 9260k 100 9260k 0 0 44.1M 0 --:--:-- --:--:-- --:--:-- 44.9M\n"
]
}
],
"source": [
"!curl http://eagle.fish.washington.edu/Cvirg_tracks/C_virginica-3.0_intron.bed > C_virginica-3.0_intron.bed"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 26.4M 100 26.4M 0 0 53.0M 0 --:--:-- --:--:-- --:--:-- 53.2M\n"
]
}
],
"source": [
"!curl http://eagle.fish.washington.edu/Cvirg_tracks/C_virginica-3.0_Gnomon_mRNA.gff3 > C_virginica-3.0_Gnomon_mRNA.gff3"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 63.0M 100 63.0M 0 0 45.2M 0 0:00:01 0:00:01 --:--:-- 45.3M\n"
]
}
],
"source": [
"!curl http://owl.fish.washington.edu/halfshell/genomic-databank/C_virginica-3.0_TE-all.gff > C_virginica-3.0_TE-all.gff"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 57.4M 100 57.4M 0 0 47.4M 0 0:00:01 0:00:01 --:--:-- 47.5M\n"
]
}
],
"source": [
"!curl http://owl.fish.washington.edu/halfshell/genomic-databank/C_virginica-3.0_TE-Cg.gff > C_virginica-3.0_TE-Cg.gff"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2018-11-07-DML-Exon.txt\r\n",
"2018-11-07-DML-Intron.txt\r\n",
"2018-11-07-DML-TE-Cg.txt\r\n",
"2018-11-07-DML-TE-all.txt\r\n",
"2018-11-07-DML-mRNA.txt\r\n",
"2018-11-07-DMR-Exon.txt\r\n",
"2018-11-07-DMR-Intron.txt\r\n",
"2018-11-07-DMR-TE-Cg.txt\r\n",
"2018-11-07-DMR-TE-all.txt\r\n",
"2018-11-07-DMR-mRNA.txt\r\n",
"2018-11-07-Exon-CGmotif.txt\r\n",
"2018-11-07-Exon-TE-Cv.txt\r\n",
"2018-11-07-Exon-TE-all.txt\r\n",
"2018-11-07-Intron-CGmotif.txt\r\n",
"2018-11-07-Intron-TE-Cv.txt\r\n",
"2018-11-07-Intron-TE-all.txt\r\n",
"2018-11-07-TE-Cv-CGmotif.txt\r\n",
"2018-11-07-TE-all-CGmotif.txt\r\n",
"2018-11-07-Unique-Genes-in-DML-mRNA-Overlap.txt\r\n",
"2018-11-07-Unique-Genes-in-DMR-mRNA-Overlap.txt\r\n",
"2018-11-07-mRNA-CGmotif.txt\r\n",
"2018-11-07-mRNA-TE-Cv.txt\r\n",
"2018-11-07-mRNA-TE-all.txt\r\n",
"C_virginica-3.0_CG-motif.bed\r\n",
"C_virginica-3.0_Gnomon_exon.bed\r\n",
"C_virginica-3.0_Gnomon_mRNA.gff3\r\n",
"C_virginica-3.0_TE-Cg.gff\r\n",
"C_virginica-3.0_TE-all.gff\r\n",
"C_virginica-3.0_intron.bed\r\n"
]
}
],
"source": [
"!ls -F"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Locate Relevant Files and Set Variable Path Names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1a. Set Variable Path Names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Setting the variable path names allows me to reuse this script with different input files or different paths to programs without manually changing the file names each time."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"bedtoolsDirectory = \"/Users/Shared/bioinformatics/bedtools2/bin/\""
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"DMLlist = \"../../analyses/2018-10-25-MethylKit/2019-04-05-DML-Destrand-5x-Locations.bed\""
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"hyperDML = \"../../analyses/2018-10-25-MethylKit/2019-04-05-DML-Destrand-5x-Locations-Hypermethylated.bed\""
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"hypoDML = \"../../analyses/2018-10-25-MethylKit/2019-04-05-DML-Destrand-5x-Locations-Hypomethylated.bed\""
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"DMRlist = \"../../analyses/2018-10-25-MethylKit/2018-11-07-DMR-Locations.bed\""
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"exonList = \"C_virginica-3.0_Gnomon_exon.bed\""
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"intronList = \"C_virginica-3.0_intron.bed\""
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"mRNAList = \"C_virginica-3.0_Gnomon_mRNA.gff3\""
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"transposableElementsAll = \"C_virginica-3.0_TE-all.gff\""
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"transposableElementsCg = \"C_virginica-3.0_TE-Cg.gff\""
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"CGMotifList = \"C_virginica-3.0_CG-motif.bed\""
]
},
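{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because every later step depends on the path variables defined above, it can help to confirm that each one resolves to a file before calling bedtools. The block below is a minimal sketch and not part of the original analysis: `missing_paths` is a hypothetical helper, and the commented usage line assumes the variable names defined in this section.\n",
"\n",
"```python\n",
"from pathlib import Path\n",
"\n",
"def missing_paths(paths):\n",
"    # Return the entries of `paths` that do not point to an existing file\n",
"    return [p for p in paths if not Path(p).is_file()]\n",
"\n",
"# Hypothetical usage with the variables defined above:\n",
"# missing_paths([DMLlist, hyperDML, hypoDML, DMRlist, exonList, intronList,\n",
"#                mRNAList, transposableElementsAll, transposableElementsCg, CGMotifList])\n",
"```\n",
"\n",
"An empty list means every file was found. Any path that is returned should be corrected before running the overlaps."
]
},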
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1b. Confirm Variable Paths Work and Characterize Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
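"The BED files with DML and DMR can be viewed below. In the DML files, the columns are the chromosome, start position, end position, and percent methylation difference with direction (positive values in the hypermethylated file, negative values in the hypomethylated file). The DMR file has an additional label column before the methylation difference. The files only include DML and DMR that were at least 50% different between the two treatments (control and elevated pCO2)."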
]
},
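{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sign of the final column is what separates hypermethylated from hypomethylated loci. The block below is a minimal sketch of that split, assuming the difference is always the last field. `split_by_direction` is a hypothetical helper rather than the method used to create the files, and the example records are copied from the DML preview in this notebook, with whitespace in place of tabs for readability.\n",
"\n",
"```python\n",
"def split_by_direction(bed_lines):\n",
"    # Sort BED records into hyper- and hypomethylated lists using the sign of the last column\n",
"    hyper, hypo = [], []\n",
"    for line in bed_lines:\n",
"        fields = line.split()\n",
"        if not fields:\n",
"            continue  # skip blank lines\n",
"        difference = float(fields[-1])\n",
"        # Positive differences are treated as hypermethylated, everything else as hypomethylated\n",
"        (hyper if difference > 0 else hypo).append(fields)\n",
"    return hyper, hypo\n",
"\n",
"example = [\n",
"    'NC_035780.1 571138 571140 58',\n",
"    'NC_035780.1 1958998 1959000 50',\n",
"    'NC_035780.1 2538924 2538926 -50',\n",
"    'NC_035780.1 2541726 2541728 -54',\n",
"]\n",
"hyper, hypo = split_by_direction(example)\n",
"print(len(hyper), len(hypo))  # prints: 2 2\n",
"```\n",
"\n",
"Applied to the full DML file, the two list lengths should match the line counts of the hypermethylated and hypomethylated files."
]
},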
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t571138\t571140\t58\r\n",
"NC_035780.1\t1882691\t1882693\t64\r\n",
"NC_035780.1\t1885022\t1885024\t61\r\n",
"NC_035780.1\t1933499\t1933501\t51\r\n",
"NC_035780.1\t1958998\t1959000\t50\r\n",
"NC_035780.1\t2538924\t2538926\t-50\r\n",
"NC_035780.1\t2541726\t2541728\t-54\r\n",
"NC_035780.1\t2584492\t2584494\t56\r\n",
"NC_035780.1\t2586508\t2586510\t-53\r\n",
"NC_035780.1\t2588794\t2588796\t-53\r\n"
]
}
],
"source": [
"#Previewing the files\n",
"!head {DMLlist}"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 598 ../../analyses/2018-10-25-MethylKit/2019-04-05-DML-Destrand-5x-Locations.bed\r\n"
]
}
],
"source": [
"#Counting the number of lines to count DML\n",
"!wc -l {DMLlist}"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t571100\t571200\tDMR\t58\r\n",
"NC_035780.1\t573700\t573800\tDMR\t52\r\n",
"NC_035780.1\t1885000\t1885100\tDMR\t51\r\n",
"NC_035780.1\t1933500\t1933600\tDMR\t53\r\n",
"NC_035780.1\t4285700\t4285800\tDMR\t-51\r\n",
"NC_035780.1\t24159600\t24159700\tDMR\t51\r\n",
"NC_035780.1\t26629500\t26629600\tDMR\t65\r\n",
"NC_035780.1\t28563400\t28563500\tDMR\t59\r\n",
"NC_035780.1\t29883000\t29883100\tDMR\t-55\r\n",
"NC_035780.1\t31302900\t31303000\tDMR\t-61\r\n"
]
}
],
"source": [
"!head {DMRlist}"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 310 ../../analyses/2018-10-25-MethylKit/2019-04-05-DML-Destrand-5x-Locations-Hypermethylated.bed\r\n"
]
}
],
"source": [
"!wc -l {hyperDML}"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t401630\t401632\t53\r\n",
"NC_035780.1\t571138\t571140\t58\r\n",
"NC_035780.1\t1882691\t1882693\t64\r\n",
"NC_035780.1\t1885022\t1885024\t61\r\n",
"NC_035780.1\t1933499\t1933501\t51\r\n",
"NC_035780.1\t2584492\t2584494\t56\r\n",
"NC_035780.1\t2589720\t2589722\t57\r\n",
"NC_035780.1\t4286286\t4286288\t67\r\n",
"NC_035780.1\t8833124\t8833126\t60\r\n",
"NC_035780.1\t12631453\t12631455\t60\r\n"
]
}
],
"source": [
"!head {hyperDML}"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 288 ../../analyses/2018-10-25-MethylKit/2019-04-05-DML-Destrand-5x-Locations-Hypomethylated.bed\r\n"
]
}
],
"source": [
"!wc -l {hypoDML}"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t2538924\t2538926\t-50\r\n",
"NC_035780.1\t2541726\t2541728\t-54\r\n",
"NC_035780.1\t2586508\t2586510\t-53\r\n",
"NC_035780.1\t4286802\t4286804\t-62\r\n",
"NC_035780.1\t4288213\t4288215\t-58\r\n",
"NC_035780.1\t4289628\t4289630\t-52\r\n",
"NC_035780.1\t8693287\t8693289\t-52\r\n",
"NC_035780.1\t9110274\t9110276\t-63\r\n",
"NC_035780.1\t17093218\t17093220\t-52\r\n",
"NC_035780.1\t17488958\t17488960\t-57\r\n"
]
}
],
"source": [
"!head {hypoDML}"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 162 ../../analyses/2018-10-25-MethylKit/2018-11-07-DMR-Locations.bed\r\n"
]
}
],
"source": [
"#Counting the number of DMR\n",
"!wc -l {DMRlist}"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t13578\t13603\r\n",
"NC_035780.1\t14237\t14290\r\n",
"NC_035780.1\t14557\t14594\r\n",
"NC_035780.1\t28961\t29073\r\n",
"NC_035780.1\t30524\t31557\r\n",
"NC_035780.1\t31736\t31887\r\n",
"NC_035780.1\t31977\t32565\r\n",
"NC_035780.1\t32959\t33324\r\n",
"NC_035780.1\t66869\t66897\r\n",
"NC_035780.1\t64123\t64334\r\n"
]
}
],
"source": [
"!head {exonList}"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 731279 C_virginica-3.0_Gnomon_exon.bed\r\n"
]
}
],
"source": [
"!wc -l {exonList}"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t28961\t28961\r\n",
"NC_035780.1\t29074\t30524\r\n",
"NC_035780.1\t31558\t31736\r\n",
"NC_035780.1\t31888\t31977\r\n",
"NC_035780.1\t32566\t32959\r\n",
"NC_035780.1\t43110\t43112\r\n",
"NC_035780.1\t44359\t45913\r\n",
"NC_035780.1\t46507\t64123\r\n",
"NC_035780.1\t64335\t66869\r\n",
"NC_035780.1\t85606\t85606\r\n"
]
}
],
"source": [
"!head {intronList}"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 319262 C_virginica-3.0_intron.bed\r\n"
]
}
],
"source": [
"!wc -l {intronList}"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n"
]
}
],
"source": [
"!head -n 1 {mRNAList}"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 60201 C_virginica-3.0_Gnomon_mRNA.gff3\r\n"
]
}
],
"source": [
"!wc -l {mRNAList}"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"##gff-version 2\r\n",
"##date 2018-08-23\r\n",
"##sequence-region Cvirginica_v300.fa\r\n",
"NC_007175.2\tRepeatMasker\tsimilarity\t262\t1389\t31.1\t+\t.\tTarget \"Motif:REP-6_LMi\" 2920 4055\r\n",
"NC_007175.2\tRepeatMasker\tsimilarity\t1728\t1947\t26.1\t-\t.\tTarget \"Motif:REP-6_LMi\" 14320 14534\r\n",
"NC_007175.2\tRepeatMasker\tsimilarity\t1866\t2013\t33.6\t+\t.\tTarget \"Motif:LSU-rRNA_Cel\" 2372 2520\r\n",
"NC_007175.2\tRepeatMasker\tsimilarity\t2129\t2367\t20.5\t-\t.\tTarget \"Motif:REP-6_LMi\" 13886 14118\r\n",
"NC_007175.2\tRepeatMasker\tsimilarity\t2836\t2980\t31.5\t+\t.\tTarget \"Motif:REP-6_LMi\" 6216 6359\r\n",
"NC_007175.2\tRepeatMasker\tsimilarity\t3196\t3277\t30.5\t+\t.\tTarget \"Motif:REP-6_LMi\" 6572 6653\r\n",
"NC_007175.2\tRepeatMasker\tsimilarity\t5168\t5532\t32.9\t+\t.\tTarget \"Motif:REP-6_LMi\" 4620 4983\r\n"
]
}
],
"source": [
"!head {transposableElementsAll}"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"  692371 C_virginica-3.0_TE-all.gff\r\n"
]
}
],
"source": [
"!wc -l {transposableElementsAll}"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"##gff-version 2\r\n",
"##date 2018-08-27\r\n",
"##sequence-region Cvirginica_v300.fa\r\n",
"NC_007175.2\tRepeatMasker\tsimilarity\t1866\t2013\t33.6\t+\t.\tTarget \"Motif:LSU-rRNA_Cel\" 2372 2520\r\n",
"NC_007175.2\tRepeatMasker\tsimilarity\t6529\t6628\t19.0\t+\t.\tTarget \"Motif:(TA)n\" 2 102\r\n",
"NC_035780.1\tRepeatMasker\tsimilarity\t1473\t1535\t 0.0\t+\t.\tTarget \"Motif:(TAACCC)n\" 1 63\r\n",
"NC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\tRepeatMasker\tsimilarity\t7423\t7489\t25.4\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2097 2163\r\n",
"NC_035780.1\tRepeatMasker\tsimilarity\t7623\t8079\t34.1\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 1516 1975\r\n",
"NC_035780.1\tRepeatMasker\tsimilarity\t8261\t8295\t14.1\t+\t.\tTarget \"Motif:(CTCCT)n\" 1 33\r\n"
]
}
],
"source": [
"!head {transposableElementsCg}"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 626665 C_virginica-3.0_TE-Cg.gff\r\n"
]
}
],
"source": [
"!wc -l {transposableElementsCg}"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t28\t30\tCG_motif\r\n",
"NC_035780.1\t54\t56\tCG_motif\r\n",
"NC_035780.1\t75\t77\tCG_motif\r\n",
"NC_035780.1\t93\t95\tCG_motif\r\n",
"NC_035780.1\t103\t105\tCG_motif\r\n",
"NC_035780.1\t116\t118\tCG_motif\r\n",
"NC_035780.1\t134\t136\tCG_motif\r\n",
"NC_035780.1\t159\t161\tCG_motif\r\n",
"NC_035780.1\t209\t211\tCG_motif\r\n",
"NC_035780.1\t224\t226\tCG_motif\r\n"
]
}
],
"source": [
"!head {CGMotifList}"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 14458703 C_virginica-3.0_CG-motif.bed\r\n"
]
}
],
"source": [
"!wc -l {CGMotifList}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Identify DML and DMR Overlaps with Genomic Feature Tracks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
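"To identify the location of DML and DMR in the *C. virginica* genome, I will use `intersect` from `bedtools`. [The BEDtools suite](http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html) allows me to easily find overlapping regions between different BED files. For each feature track, I count the DML or DMR that overlap at least one feature with `-u`, then write each overlap and the matching feature entry to a file with `-wb`."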
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\r\n",
"Tool: bedtools intersect (aka intersectBed)\r\n",
"Version: v2.26.0\r\n",
"Summary: Report overlaps between two feature files.\r\n",
"\r\n",
"Usage: bedtools intersect [OPTIONS] -a -b \r\n",
"\r\n",
"\tNote: -b may be followed with multiple databases and/or \r\n",
"\twildcard (*) character(s). \r\n",
"Options: \r\n",
"\t-wa\tWrite the original entry in A for each overlap.\r\n",
"\r\n",
"\t-wb\tWrite the original entry in B for each overlap.\r\n",
"\t\t- Useful for knowing _what_ A overlaps. Restricted by -f and -r.\r\n",
"\r\n",
"\t-loj\tPerform a \"left outer join\". That is, for each feature in A\r\n",
"\t\treport each overlap with B. If no overlaps are found, \r\n",
"\t\treport a NULL feature for B.\r\n",
"\r\n",
"\t-wo\tWrite the original A and B entries plus the number of base\r\n",
"\t\tpairs of overlap between the two features.\r\n",
"\t\t- Overlaps restricted by -f and -r.\r\n",
"\t\t Only A features with overlap are reported.\r\n",
"\r\n",
"\t-wao\tWrite the original A and B entries plus the number of base\r\n",
"\t\tpairs of overlap between the two features.\r\n",
"\t\t- Overlapping features restricted by -f and -r.\r\n",
"\t\t However, A features w/o overlap are also reported\r\n",
"\t\t with a NULL B feature and overlap = 0.\r\n",
"\r\n",
"\t-u\tWrite the original A entry _once_ if _any_ overlaps found in B.\r\n",
"\t\t- In other words, just report the fact >=1 hit was found.\r\n",
"\t\t- Overlaps restricted by -f and -r.\r\n",
"\r\n",
"\t-c\tFor each entry in A, report the number of overlaps with B.\r\n",
"\t\t- Reports 0 for A entries that have no overlap with B.\r\n",
"\t\t- Overlaps restricted by -f and -r.\r\n",
"\r\n",
"\t-v\tOnly report those entries in A that have _no overlaps_ with B.\r\n",
"\t\t- Similar to \"grep -v\" (an homage).\r\n",
"\r\n",
"\t-ubam\tWrite uncompressed BAM output. Default writes compressed BAM.\r\n",
"\r\n",
"\t-s\tRequire same strandedness. That is, only report hits in B\r\n",
"\t\tthat overlap A on the _same_ strand.\r\n",
"\t\t- By default, overlaps are reported without respect to strand.\r\n",
"\r\n",
"\t-S\tRequire different strandedness. That is, only report hits in B\r\n",
"\t\tthat overlap A on the _opposite_ strand.\r\n",
"\t\t- By default, overlaps are reported without respect to strand.\r\n",
"\r\n",
"\t-f\tMinimum overlap required as a fraction of A.\r\n",
"\t\t- Default is 1E-9 (i.e., 1bp).\r\n",
"\t\t- FLOAT (e.g. 0.50)\r\n",
"\r\n",
"\t-F\tMinimum overlap required as a fraction of B.\r\n",
"\t\t- Default is 1E-9 (i.e., 1bp).\r\n",
"\t\t- FLOAT (e.g. 0.50)\r\n",
"\r\n",
"\t-r\tRequire that the fraction overlap be reciprocal for A AND B.\r\n",
"\t\t- In other words, if -f is 0.90 and -r is used, this requires\r\n",
"\t\t that B overlap 90% of A and A _also_ overlaps 90% of B.\r\n",
"\r\n",
"\t-e\tRequire that the minimum fraction be satisfied for A OR B.\r\n",
"\t\t- In other words, if -e is used with -f 0.90 and -F 0.10 this requires\r\n",
"\t\t that either 90% of A is covered OR 10% of B is covered.\r\n",
"\t\t Without -e, both fractions would have to be satisfied.\r\n",
"\r\n",
"\t-split\tTreat \"split\" BAM or BED12 entries as distinct BED intervals.\r\n",
"\r\n",
"\t-g\tProvide a genome file to enforce consistent chromosome sort order\r\n",
"\t\tacross input files. Only applies when used with -sorted option.\r\n",
"\r\n",
"\t-nonamecheck\tFor sorted data, don't throw an error if the file has different naming conventions\r\n",
"\t\t\tfor the same chromosome. ex. \"chr1\" vs \"chr01\".\r\n",
"\r\n",
"\t-sorted\tUse the \"chromsweep\" algorithm for sorted (-k1,1 -k2,2n) input.\r\n",
"\r\n",
"\t-names\tWhen using multiple databases, provide an alias for each that\r\n",
"\t\twill appear instead of a fileId when also printing the DB record.\r\n",
"\r\n",
"\t-filenames\tWhen using multiple databases, show each complete filename\r\n",
"\t\t\tinstead of a fileId when also printing the DB record.\r\n",
"\r\n",
"\t-sortout\tWhen using multiple databases, sort the output DB hits\r\n",
"\t\t\tfor each record.\r\n",
"\r\n",
"\t-bed\tIf using BAM input, write output as BED.\r\n",
"\r\n",
"\t-header\tPrint the header from the A file prior to results.\r\n",
"\r\n",
"\t-nobuf\tDisable buffered output. Using this option will cause each line\r\n",
"\t\tof output to be printed as it is generated, rather than saved\r\n",
"\t\tin a buffer. This will make printing large output files \r\n",
"\t\tnoticeably slower, but can be useful in conjunction with\r\n",
"\t\tother software tools and scripts that need to process one\r\n",
"\t\tline of bedtools output at a time.\r\n",
"\r\n",
"\t-iobuf\tSpecify amount of memory to use for input buffer.\r\n",
"\t\tTakes an integer argument. Optional suffixes K/M/G supported.\r\n",
"\t\tNote: currently has no effect with compressed files.\r\n",
"\r\n",
"Notes: \r\n",
"\t(1) When a BAM file is used for the A file, the alignment is retained if overlaps exist,\r\n",
"\tand exlcuded if an overlap cannot be found. If multiple overlaps exist, they are not\r\n",
"\treported, as we are only testing for one or more overlaps.\r\n",
"\r\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed -h"
]
},
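{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the `-u` counts in the following sections easier to interpret, the block below sketches the overlap test in plain Python. It is an illustration under stated assumptions, not how bedtools is implemented: it assumes 0-based, half-open BED coordinates, ignores zero-length features, and compares every pair of features, which would be far too slow for the full tracks. The function names are hypothetical, and the toy coordinates are copied from the DML and exon files in this notebook.\n",
"\n",
"```python\n",
"def overlaps(a_start, a_end, b_start, b_end):\n",
"    # Two half-open intervals overlap when each one starts before the other ends\n",
"    return a_start < b_end and b_start < a_end\n",
"\n",
"def count_unique_overlaps(a_features, b_features):\n",
"    # Count A features with at least one overlapping B feature on the same chromosome,\n",
"    # which mirrors piping the output of -u to wc -l\n",
"    count = 0\n",
"    for a_chrom, a_start, a_end in a_features:\n",
"        if any(a_chrom == b_chrom and overlaps(a_start, a_end, b_start, b_end)\n",
"               for b_chrom, b_start, b_end in b_features):\n",
"            count += 1\n",
"    return count\n",
"\n",
"dml = [('NC_035780.1', 571138, 571140), ('NC_035780.1', 1882691, 1882693)]\n",
"exons = [('NC_035780.1', 570942, 571194), ('NC_035780.1', 13578, 13603)]\n",
"print(count_unique_overlaps(dml, exons))  # prints: 1\n",
"```\n",
"\n",
"Each A feature is counted once no matter how many B features it overlaps, whereas `-wb` writes one line per overlapping pair. This is why the saved overlap files can contain repeated DML or DMR entries."
]
},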
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2a. Exons"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### All DML"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 368\n",
"DML overlaps with exons\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {DMLlist} \\\n",
"-b {exonList} \\\n",
"| wc -l\n",
"!echo \"DML overlaps with exons\""
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {DMLlist} \\\n",
"-b {exonList} \\\n",
"> 2018-11-07-DML-Exon.txt"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t571138\t571140\t58\tNC_035780.1\t570942\t571194\r\n",
"NC_035780.1\t2538924\t2538926\t-50\tNC_035780.1\t2538624\t2538955\r\n",
"NC_035780.1\t2586508\t2586510\t-53\tNC_035780.1\t2586438\t2586557\r\n",
"NC_035780.1\t2589720\t2589722\t57\tNC_035780.1\t2589716\t2589955\r\n",
"NC_035780.1\t4286286\t4286288\t67\tNC_035780.1\t4286174\t4286407\r\n",
"NC_035780.1\t4286802\t4286804\t-62\tNC_035780.1\t4286783\t4286927\r\n",
"NC_035780.1\t4289628\t4289630\t-52\tNC_035780.1\t4288592\t4290756\r\n",
"NC_035780.1\t8693287\t8693289\t-52\tNC_035780.1\t8692509\t8693320\r\n",
"NC_035780.1\t9110274\t9110276\t-63\tNC_035780.1\t9109982\t9111843\r\n",
"NC_035780.1\t12631453\t12631455\t60\tNC_035780.1\t12630576\t12631487\r\n"
]
}
],
"source": [
"!head 2018-11-07-DML-Exon.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Hypermethylated DML"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 190\n",
"hypermethylated DML overlaps with exons\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {hyperDML} \\\n",
"-b {exonList} \\\n",
"| wc -l\n",
"!echo \"hypermethylated DML overlaps with exons\""
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {hyperDML} \\\n",
"-b {exonList} \\\n",
"> 2019-04-30-Hypermethylated-DML-Exon.txt"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t571138\t571140\t58\tNC_035780.1\t570942\t571194\r\n",
"NC_035780.1\t2589720\t2589722\t57\tNC_035780.1\t2589716\t2589955\r\n",
"NC_035780.1\t4286286\t4286288\t67\tNC_035780.1\t4286174\t4286407\r\n",
"NC_035780.1\t12631453\t12631455\t60\tNC_035780.1\t12630576\t12631487\r\n",
"NC_035780.1\t12631453\t12631455\t60\tNC_035780.1\t12630576\t12631487\r\n",
"NC_035780.1\t12631453\t12631455\t60\tNC_035780.1\t12630577\t12631487\r\n",
"NC_035780.1\t12631453\t12631455\t60\tNC_035780.1\t12630577\t12631487\r\n",
"NC_035780.1\t15412264\t15412266\t50\tNC_035780.1\t15412219\t15412410\r\n",
"NC_035780.1\t15412264\t15412266\t50\tNC_035780.1\t15412219\t15412410\r\n",
"NC_035780.1\t15414935\t15414936\t51\tNC_035780.1\t15414935\t15415225\r\n"
]
}
],
"source": [
"!head 2019-04-30-Hypermethylated-DML-Exon.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Hypomethylated DML"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 178\n",
"hypomethylated DML overlaps with exons\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {hypoDML} \\\n",
"-b {exonList} \\\n",
"| wc -l\n",
"!echo \"hypomethylated DML overlaps with exons\""
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {hypoDML} \\\n",
"-b {exonList} \\\n",
"> 2019-04-30-Hypomethylated-DML-Exon.txt"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t2538924\t2538926\t-50\tNC_035780.1\t2538624\t2538955\r\n",
"NC_035780.1\t2586508\t2586510\t-53\tNC_035780.1\t2586438\t2586557\r\n",
"NC_035780.1\t4286802\t4286804\t-62\tNC_035780.1\t4286783\t4286927\r\n",
"NC_035780.1\t4289628\t4289630\t-52\tNC_035780.1\t4288592\t4290756\r\n",
"NC_035780.1\t8693287\t8693289\t-52\tNC_035780.1\t8692509\t8693320\r\n",
"NC_035780.1\t9110274\t9110276\t-63\tNC_035780.1\t9109982\t9111843\r\n",
"NC_035780.1\t17093218\t17093220\t-52\tNC_035780.1\t17092983\t17093548\r\n",
"NC_035780.1\t19149580\t19149582\t-61\tNC_035780.1\t19149513\t19149749\r\n",
"NC_035780.1\t19149580\t19149582\t-61\tNC_035780.1\t19149513\t19149749\r\n",
"NC_035780.1\t19149580\t19149582\t-61\tNC_035780.1\t19149513\t19150486\r\n"
]
}
],
"source": [
"!head 2019-04-30-Hypomethylated-DML-Exon.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DMR"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 64\n",
"DMR overlaps with exons\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {DMRlist} \\\n",
"-b {exonList} \\\n",
"| wc -l\n",
"!echo \"DMR overlaps with exons\""
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {DMRlist} \\\n",
"-b {exonList} \\\n",
"> 2018-11-07-DMR-Exon.txt"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t571100\t571194\tDMR\t58\tNC_035780.1\t570942\t571194\r\n",
"NC_035780.1\t573700\t573800\tDMR\t52\tNC_035780.1\t573630\t573906\r\n",
"NC_035780.1\t573700\t573800\tDMR\t52\tNC_035780.1\t573630\t573906\r\n",
"NC_035780.1\t1933574\t1933600\tDMR\t53\tNC_035780.1\t1933574\t1933615\r\n",
"NC_035780.1\t47335000\t47335100\tDMR\t-66\tNC_035780.1\t47334080\t47336192\r\n",
"NC_035780.1\t48394800\t48394900\tDMR\t-63\tNC_035780.1\t48394159\t48395287\r\n",
"NC_035780.1\t61138200\t61138300\tDMR\t-79\tNC_035780.1\t61138000\t61140417\r\n",
"NC_035781.1\t6831300\t6831302\tDMR\t-59\tNC_035781.1\t6831093\t6831302\r\n",
"NC_035781.1\t6831300\t6831302\tDMR\t-59\tNC_035781.1\t6831093\t6831302\r\n",
"NC_035781.1\t6831300\t6831302\tDMR\t-59\tNC_035781.1\t6831093\t6831302\r\n"
]
}
],
"source": [
"!head 2018-11-07-DMR-Exon.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2b. Introns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DML"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 191\n",
"DML overlaps with introns\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {DMLlist} \\\n",
"-b {intronList} \\\n",
"| wc -l\n",
"!echo \"DML overlaps with introns\""
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {DMLlist} \\\n",
"-b {intronList} \\\n",
"> 2018-11-07-DML-Intron.txt"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t401630\t401632\t53\tNC_035780.1\t401605\t401801\r\n",
"NC_035780.1\t1882691\t1882693\t64\tNC_035780.1\t1882356\t1882972\r\n",
"NC_035780.1\t1885022\t1885024\t61\tNC_035780.1\t1884755\t1886043\r\n",
"NC_035780.1\t1933499\t1933501\t51\tNC_035780.1\t1932877\t1933574\r\n",
"NC_035780.1\t2541726\t2541728\t-54\tNC_035780.1\t2538956\t2541769\r\n",
"NC_035780.1\t2584492\t2584494\t56\tNC_035780.1\t2584154\t2584505\r\n",
"NC_035780.1\t4288213\t4288215\t-58\tNC_035780.1\t4288129\t4288231\r\n",
"NC_035780.1\t8833124\t8833126\t60\tNC_035780.1\t8832172\t8833700\r\n",
"NC_035780.1\t15414934\t15414935\t51\tNC_035780.1\t15414677\t15414935\r\n",
"NC_035780.1\t17488958\t17488960\t-57\tNC_035780.1\t17488943\t17489179\r\n"
]
}
],
"source": [
"!head 2018-11-07-DML-Intron.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Hypermethylated DML"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 99\n",
"hypermethylated DML overlaps with introns\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {hyperDML} \\\n",
"-b {intronList} \\\n",
"| wc -l\n",
"!echo \"hypermethylated DML overlaps with introns\""
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {hyperDML} \\\n",
"-b {intronList} \\\n",
"> 2019-04-30-Hypermethylated-DML-Intron.txt"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t401630\t401632\t53\tNC_035780.1\t401605\t401801\r\n",
"NC_035780.1\t1882691\t1882693\t64\tNC_035780.1\t1882356\t1882972\r\n",
"NC_035780.1\t1885022\t1885024\t61\tNC_035780.1\t1884755\t1886043\r\n",
"NC_035780.1\t1933499\t1933501\t51\tNC_035780.1\t1932877\t1933574\r\n",
"NC_035780.1\t2584492\t2584494\t56\tNC_035780.1\t2584154\t2584505\r\n",
"NC_035780.1\t8833124\t8833126\t60\tNC_035780.1\t8832172\t8833700\r\n",
"NC_035780.1\t15414934\t15414935\t51\tNC_035780.1\t15414677\t15414935\r\n",
"NC_035780.1\t27396182\t27396184\t52\tNC_035780.1\t27396141\t27396707\r\n",
"NC_035780.1\t32766797\t32766799\t58\tNC_035780.1\t32766347\t32769865\r\n",
"NC_035780.1\t38236493\t38236495\t50\tNC_035780.1\t38236122\t38236507\r\n"
]
}
],
"source": [
"!head 2019-04-30-Hypermethylated-DML-Intron.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Hypomethylated DML"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 92\n",
"hypomethylated DML overlaps with introns\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {hypoDML} \\\n",
"-b {intronList} \\\n",
"| wc -l\n",
"!echo \"hypomethylated DML overlaps with introns\""
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {hypoDML} \\\n",
"-b {intronList} \\\n",
"> 2019-04-30-Hypomethylated-DML-Intron.txt"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t2541726\t2541728\t-54\tNC_035780.1\t2538956\t2541769\r\n",
"NC_035780.1\t4288213\t4288215\t-58\tNC_035780.1\t4288129\t4288231\r\n",
"NC_035780.1\t17488958\t17488960\t-57\tNC_035780.1\t17488943\t17489179\r\n",
"NC_035780.1\t22177828\t22177830\t-51\tNC_035780.1\t22154687\t22178241\r\n",
"NC_035780.1\t25858297\t25858299\t-51\tNC_035780.1\t25858282\t25863049\r\n",
"NC_035780.1\t31302904\t31302906\t-60\tNC_035780.1\t31302842\t31303152\r\n",
"NC_035780.1\t31302934\t31302936\t-58\tNC_035780.1\t31302842\t31303152\r\n",
"NC_035780.1\t32717030\t32717032\t-52\tNC_035780.1\t32716796\t32717180\r\n",
"NC_035780.1\t35969128\t35969130\t-53\tNC_035780.1\t35969071\t35986499\r\n",
"NC_035780.1\t57337100\t57337102\t-54\tNC_035780.1\t57336211\t57338140\r\n"
]
}
],
"source": [
"!head 2019-04-30-Hypomethylated-DML-Intron.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DMR"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 112\n",
"DMR overlaps with introns\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {DMRlist} \\\n",
"-b {intronList} \\\n",
"| wc -l\n",
"!echo \"DMR overlaps with introns\""
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {DMRlist} \\\n",
"-b {intronList} \\\n",
"> 2018-11-07-DMR-Intron.txt"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t571195\t571200\tDMR\t58\tNC_035780.1\t571195\t572677\r\n",
"NC_035780.1\t1885000\t1885100\tDMR\t51\tNC_035780.1\t1884755\t1886043\r\n",
"NC_035780.1\t1933500\t1933574\tDMR\t53\tNC_035780.1\t1932877\t1933574\r\n",
"NC_035780.1\t4285700\t4285800\tDMR\t-51\tNC_035780.1\t4285382\t4285831\r\n",
"NC_035780.1\t26629500\t26629600\tDMR\t65\tNC_035780.1\t26621919\t26632637\r\n",
"NC_035780.1\t28563400\t28563500\tDMR\t59\tNC_035780.1\t28563400\t28564616\r\n",
"NC_035780.1\t29883000\t29883100\tDMR\t-55\tNC_035780.1\t29882984\t29883643\r\n",
"NC_035780.1\t31302900\t31303000\tDMR\t-61\tNC_035780.1\t31302842\t31303152\r\n",
"NC_035780.1\t31303300\t31303400\tDMR\t-53\tNC_035780.1\t31303293\t31303603\r\n",
"NC_035780.1\t33209900\t33210000\tDMR\t-54\tNC_035780.1\t33209785\t33210978\r\n"
]
}
],
"source": [
"!head 2018-11-07-DMR-Intron.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2c. mRNA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DML"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 549\n",
"DML overlaps with mRNA\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {DMLlist} \\\n",
"-b {mRNAList} \\\n",
"| wc -l\n",
"!echo \"DML overlaps with mRNA\""
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {DMLlist} \\\n",
"-b {mRNAList} \\\n",
"> 2018-11-07-DML-mRNA.txt"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t401630\t401632\t53\tNC_035780.1\tGnomon\tmRNA\t394983\t409280\t.\t-\t.\tID=rna37;Parent=gene27;Dbxref=GeneID:111117672,Genbank:XM_022456873.1;Name=XM_022456873.1;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclassified transcription discrepancy;gbkey=mRNA;gene=LOC111117672;model_evidence=Supporting evidence includes similarity to: 6 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 24 samples with support for all annotated introns;product=Niemann-Pick C1 protein-like;transcript_id=XM_022456873.1\r\n"
]
}
],
"source": [
"!head -n 1 2018-11-07-DML-mRNA.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `-u` count above tells me how many DML fall within mRNA, but I also want to know how many unique mRNA transcripts contain DML. For this, I will use the following code:\n",
"\n",
"`cut -f13 2018-11-07-DML-mRNA.txt | sort | uniq -c`\n",
"\n",
"`cut -f13` isolates column 13, the annotation attribute string of each overlapping mRNA. That column is piped into `sort`, which places identical lines next to each other so that `uniq -c` can collapse them and prefix each unique line with the number of times it occurred. I will save the output from this command as a new file."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! cut -f13 2018-11-07-DML-mRNA.txt | sort | uniq -c > 2018-11-07-Unique-Genes-in-DML-mRNA-Overlap.txt"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1 ID=rna10013;Parent=gene5874;Dbxref=GeneID:111118920,Genbank:XM_022458635.1;Name=XM_022458635.1;gbkey=mRNA;gene=LOC111118920;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 9 samples with support for all annotated introns;product=structural maintenance of chromosomes protein 6-like%2C transcript variant X1;transcript_id=XM_022458635.1\r\n",
" 1 ID=rna10014;Parent=gene5874;Dbxref=GeneID:111118920,Genbank:XM_022458636.1;Name=XM_022458636.1;gbkey=mRNA;gene=LOC111118920;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 9 samples with support for all annotated introns;product=structural maintenance of chromosomes protein 6-like%2C transcript variant X2;transcript_id=XM_022458636.1\r\n",
" 1 ID=rna10055;Parent=gene5904;Dbxref=GeneID:111121117,Genbank:XM_022462246.1;Name=XM_022462246.1;gbkey=mRNA;gene=LOC111121117;model_evidence=Supporting evidence includes similarity to: 4 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 18 samples with support for all annotated introns;product=serine/threonine-protein kinase Nek8-like;transcript_id=XM_022462246.1\r\n",
" 1 ID=rna10111;Parent=gene5940;Dbxref=GeneID:111120228,Genbank:XM_022460956.1;Name=XM_022460956.1;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 6 bases in 6 codons;exception=unclassified transcription discrepancy;gbkey=mRNA;gene=LOC111120228;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 87%25 coverage of the annotated genomic feature by RNAseq alignments;product=uncharacterized LOC111120228;transcript_id=XM_022460956.1\r\n",
" 1 ID=rna10138;Parent=gene5965;Dbxref=GeneID:111119372,Genbank:XM_022459466.1;Name=XM_022459466.1;gbkey=mRNA;gene=LOC111119372;model_evidence=Supporting evidence includes similarity to: 1 EST%2C 11 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 26 samples with support for all annotated introns;product=tyrosine-protein kinase SRK2-like;transcript_id=XM_022459466.1\r\n",
" 1 ID=rna10182;Parent=gene5989;Dbxref=GeneID:111121400,Genbank:XM_022462674.1;Name=XM_022462674.1;gbkey=mRNA;gene=LOC111121400;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 16 samples with support for all annotated introns;product=DNA excision repair protein ERCC-6-like;transcript_id=XM_022462674.1\r\n",
" 1 ID=rna10185;Parent=gene5992;Dbxref=GeneID:111119689,Genbank:XM_022460110.1;Name=XM_022460110.1;gbkey=mRNA;gene=LOC111119689;model_evidence=Supporting evidence includes similarity to: 4 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=rapamycin-insensitive companion of mTOR-like;transcript_id=XM_022460110.1\r\n",
" 1 ID=rna10187;Parent=gene5994;Dbxref=GeneID:111119688,Genbank:XM_022460109.1;Name=XM_022460109.1;gbkey=mRNA;gene=LOC111119688;model_evidence=Supporting evidence includes similarity to: 1 EST%2C 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 11 samples with support for all annotated introns;product=uncharacterized LOC111119688;transcript_id=XM_022460109.1\r\n",
" 1 ID=rna10346;Parent=gene6079;Dbxref=GeneID:111119469,Genbank:XM_022459668.1;Name=XM_022459668.1;gbkey=mRNA;gene=LOC111119469;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 6 samples with support for all annotated introns;product=uncharacterized LOC111119469%2C transcript variant X2;transcript_id=XM_022459668.1\r\n",
" 1 ID=rna10347;Parent=gene6079;Dbxref=GeneID:111119469,Genbank:XM_022459667.1;Name=XM_022459667.1;gbkey=mRNA;gene=LOC111119469;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 19 samples with support for all annotated introns;product=uncharacterized LOC111119469%2C transcript variant X1;transcript_id=XM_022459667.1\r\n"
]
}
],
"source": [
"!head 2018-11-07-Unique-Genes-in-DML-mRNA-Overlap.txt"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1430 2018-11-07-Unique-Genes-in-DML-mRNA-Overlap.txt\r\n"
]
}
],
"source": [
"!wc -l 2018-11-07-Unique-Genes-in-DML-mRNA-Overlap.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
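"The DMLs overlap with 1430 unique mRNA transcripts. Transcript variants of the same gene (for example, rna10013 and rna10014, both from gene5874) are counted separately, so the number of unique genes is lower."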
]
},
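{
"cell_type": "markdown",
"metadata": {},
"source": [
"Column 13 holds one attribute string per mRNA record, so `uniq` treats transcript variants of the same gene as different lines. A minimal sketch for collapsing the overlaps to gene level is below. It assumes every attribute string contains exactly one `gene=` key, as in the output shown earlier; I have not verified that assumption against the full annotation.\n",
"\n",
"```shell\n",
"# Pull the gene=... key out of each attribute string, deduplicate, and count.\n",
"# Assumption: each mRNA attribute string carries exactly one gene= key.\n",
"cut -f13 2018-11-07-DML-mRNA.txt | grep -o 'gene=[^;]*' | sort -u | wc -l\n",
"```\n",
"\n",
"For the DMR overlaps the attribute string sits in column 14, because the DMR file has an extra label column, so the same sketch would start with `cut -f14 2018-11-07-DMR-mRNA.txt`."
]
},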
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DMR"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 139\n",
"DMR overlaps with mRNA\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {DMRlist} \\\n",
"-b {mRNAList} \\\n",
"| wc -l\n",
"!echo \"DMR overlaps with mRNA\""
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {DMRlist} \\\n",
"-b {mRNAList} \\\n",
"> 2018-11-07-DMR-mRNA.txt"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t571100\t571200\tDMR\t58\tNC_035780.1\tGnomon\tmRNA\t544088\t573497\t.\t+\t.\tID=rna48;Parent=gene35;Dbxref=GeneID:111114201,Genbank:XM_022452489.1;Name=XM_022452489.1;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 2 bases in 2 codons;exception=unclassified transcription discrepancy;gbkey=mRNA;gene=LOC111114201;model_evidence=Supporting evidence includes similarity to: 4 Proteins%2C and 99%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 9 samples with support for all annotated introns;product=vacuolar protein sorting-associated protein 13B-like;transcript_id=XM_022452489.1\r\n",
"NC_035780.1\t573700\t573800\tDMR\t52\tNC_035780.1\tGnomon\tmRNA\t573630\t585444\t.\t-\t.\tID=rna49;Parent=gene36;Dbxref=GeneID:111114212,Genbank:XM_022452506.1;Name=XM_022452506.1;gbkey=mRNA;gene=LOC111114212;model_evidence=Supporting evidence includes similarity to: 4 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 22 samples with support for all annotated introns;product=coatomer subunit alpha-like%2C transcript variant X2;transcript_id=XM_022452506.1\r\n",
"NC_035780.1\t573700\t573800\tDMR\t52\tNC_035780.1\tGnomon\tmRNA\t573630\t585444\t.\t-\t.\tID=rna50;Parent=gene36;Dbxref=GeneID:111114212,Genbank:XM_022452500.1;Name=XM_022452500.1;gbkey=mRNA;gene=LOC111114212;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 22 samples with support for all annotated introns;product=coatomer subunit alpha-like%2C transcript variant X1;transcript_id=XM_022452500.1\r\n",
"NC_035780.1\t1885000\t1885100\tDMR\t51\tNC_035780.1\tGnomon\tmRNA\t1882143\t1890106\t.\t-\t.\tID=rna155;Parent=gene95;Dbxref=GeneID:111102448,Genbank:XM_022435197.1;Name=XM_022435197.1;gbkey=mRNA;gene=LOC111102448;model_evidence=Supporting evidence includes similarity to: 25 Proteins%2C and 93%25 coverage of the annotated genomic feature by RNAseq alignments;product=NADH dehydrogenase [ubiquinone] flavoprotein 1%2C mitochondrial-like;transcript_id=XM_022435197.1\r\n",
"NC_035780.1\t1933500\t1933600\tDMR\t53\tNC_035780.1\tGnomon\tmRNA\t1928718\t1940217\t.\t+\t.\tID=rna160;Parent=gene98;Dbxref=GeneID:111100652,Genbank:XM_022432742.1;Name=XM_022432742.1;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 3 bases in 2 codons;exception=unclassified transcription discrepancy;gbkey=mRNA;gene=LOC111100652;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 96%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 24 samples with support for all annotated introns;product=oxysterol-binding protein-related protein 11-like;transcript_id=XM_022432742.1\r\n",
"NC_035780.1\t4285700\t4285800\tDMR\t-51\tNC_035780.1\tGnomon\tmRNA\t4282771\t4298209\t.\t+\t.\tID=rna425;Parent=gene254;Dbxref=GeneID:111127966,Genbank:XM_022473296.1;Name=XM_022473296.1;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclassified transcription discrepancy;gbkey=mRNA;gene=LOC111127966;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 15 samples with support for all annotated introns;product=calmodulin-regulated spectrin-associated protein 1-like;transcript_id=XM_022473296.1\r\n",
"NC_035780.1\t26629500\t26629600\tDMR\t65\tNC_035780.1\tGnomon\tmRNA\t26607780\t26639094\t.\t+\t.\tID=rna2717;Parent=gene1618;Dbxref=GeneID:111137675,Genbank:XM_022489271.1;Name=XM_022489271.1;gbkey=mRNA;gene=LOC111137675;model_evidence=Supporting evidence includes similarity to: 16 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 6 samples with support for all annotated introns;product=zinc finger protein 26-like%2C transcript variant X3;transcript_id=XM_022489271.1\r\n",
"NC_035780.1\t26629500\t26629600\tDMR\t65\tNC_035780.1\tGnomon\tmRNA\t26609285\t26639094\t.\t+\t.\tID=rna2718;Parent=gene1618;Dbxref=GeneID:111137675,Genbank:XM_022489296.1;Name=XM_022489296.1;gbkey=mRNA;gene=LOC111137675;model_evidence=Supporting evidence includes similarity to: 16 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=zinc finger protein 26-like%2C transcript variant X6;transcript_id=XM_022489296.1\r\n",
"NC_035780.1\t26629500\t26629600\tDMR\t65\tNC_035780.1\tGnomon\tmRNA\t26609318\t26639094\t.\t+\t.\tID=rna2719;Parent=gene1618;Dbxref=GeneID:111137675,Genbank:XM_022489288.1;Name=XM_022489288.1;gbkey=mRNA;gene=LOC111137675;model_evidence=Supporting evidence includes similarity to: 16 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 6 samples with support for all annotated introns;product=zinc finger protein 26-like%2C transcript variant X5;transcript_id=XM_022489288.1\r\n",
"NC_035780.1\t26629500\t26629600\tDMR\t65\tNC_035780.1\tGnomon\tmRNA\t26612294\t26639094\t.\t+\t.\tID=rna2720;Parent=gene1618;Dbxref=GeneID:111137675,Genbank:XM_022489327.1;Name=XM_022489327.1;gbkey=mRNA;gene=LOC111137675;model_evidence=Supporting evidence includes similarity to: 16 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=zinc finger protein 26-like%2C transcript variant X10;transcript_id=XM_022489327.1\r\n"
]
}
],
"source": [
"!head 2018-11-07-DMR-mRNA.txt"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! cut -f14 2018-11-07-DMR-mRNA.txt | sort | uniq -c > 2018-11-07-Unique-Genes-in-DMR-mRNA-Overlap.txt"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1 ID=rna10182;Parent=gene5989;Dbxref=GeneID:111121400,Genbank:XM_022462674.1;Name=XM_022462674.1;gbkey=mRNA;gene=LOC111121400;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 16 samples with support for all annotated introns;product=DNA excision repair protein ERCC-6-like;transcript_id=XM_022462674.1\r\n",
" 1 ID=rna10216;Parent=gene6005;Dbxref=GeneID:111120829,Genbank:XM_022461817.1;Name=XM_022461817.1;gbkey=mRNA;gene=LOC111120829;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 23 samples with support for all annotated introns;product=serine/arginine-rich splicing factor 7-like%2C transcript variant X1;transcript_id=XM_022461817.1\r\n",
" 1 ID=rna10452;Parent=gene6155;Dbxref=GeneID:111118143,Genbank:XM_022457464.1;Name=XM_022457464.1;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclassified transcription discrepancy;gbkey=mRNA;gene=LOC111118143;model_evidence=Supporting evidence includes similarity to: 1 EST%2C 1 Protein%2C and 99%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 7 samples with support for all annotated introns;product=fas-binding factor 1-like;transcript_id=XM_022457464.1\r\n",
" 1 ID=rna11226;Parent=gene6614;Dbxref=GeneID:111122141,Genbank:XM_022463726.1;Name=XM_022463726.1;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclassified transcription discrepancy;gbkey=mRNA;gene=LOC111122141;model_evidence=Supporting evidence includes similarity to: 2 ESTs%2C 4 Proteins%2C and 95%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 27 samples with support for all annotated introns;product=cystathionine gamma-lyase-like;transcript_id=XM_022463726.1\r\n",
" 1 ID=rna11483;Parent=gene6784;Dbxref=GeneID:111120321,Genbank:XM_022461045.1;Name=XM_022461045.1;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 2 bases in 2 codons%3B deleted 2 bases in 1 codon;exception=unclassified transcription discrepancy;gbkey=mRNA;gene=LOC111120321;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 95%25 coverage of the annotated genomic feature by RNAseq alignments;product=centrosomal protein of 131 kDa-like;transcript_id=XM_022461045.1\r\n",
" 2 ID=rna11701;Parent=gene6918;Dbxref=GeneID:111122403,Genbank:XM_022464186.1;Name=XM_022464186.1;gbkey=mRNA;gene=LOC111122403;model_evidence=Supporting evidence includes similarity to: 4 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 23 samples with support for all annotated introns;product=dynein assembly factor 5%2C axonemal-like;transcript_id=XM_022464186.1\r\n",
" 1 ID=rna11978;Parent=gene7119;Dbxref=GeneID:111120719,Genbank:XM_022461644.1;Name=XM_022461644.1;gbkey=mRNA;gene=LOC111120719;model_evidence=Supporting evidence includes similarity to: 8 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 16 samples with support for all annotated introns;product=protein transport protein Sec31A-like%2C transcript variant X2;transcript_id=XM_022461644.1\r\n",
" 1 ID=rna11979;Parent=gene7119;Dbxref=GeneID:111120719,Genbank:XM_022461643.1;Name=XM_022461643.1;gbkey=mRNA;gene=LOC111120719;model_evidence=Supporting evidence includes similarity to: 10 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 15 samples with support for all annotated introns;product=protein transport protein Sec31A-like%2C transcript variant X1;transcript_id=XM_022461643.1\r\n",
" 1 ID=rna12339;Parent=gene7313;Dbxref=GeneID:111121848,Genbank:XM_022463302.1;Name=XM_022463302.1;gbkey=mRNA;gene=LOC111121848;model_evidence=Supporting evidence includes similarity to: 1 EST%2C 12 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 27 samples with support for all annotated introns;product=40S ribosomal protein S7-like;transcript_id=XM_022463302.1\r\n",
" 1 ID=rna12474;Parent=gene7412;Dbxref=GeneID:111120384,Genbank:XM_022461101.1;Name=XM_022461101.1;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclassified transcription discrepancy;gbkey=mRNA;gene=LOC111120384;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 67%25 coverage of the annotated genomic feature by RNAseq alignments;product=uncharacterized LOC111120384;transcript_id=XM_022461101.1\r\n"
]
}
],
"source": [
"!head 2018-11-07-Unique-Genes-in-DMR-mRNA-Overlap.txt"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 305 2018-11-07-Unique-Genes-in-DMR-mRNA-Overlap.txt\r\n"
]
}
],
"source": [
"!wc -l 2018-11-07-Unique-Genes-in-DMR-mRNA-Overlap.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
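"The DMRs overlap with 305 unique mRNA transcripts. Transcript variants of the same gene (for example, rna11978 and rna11979, both from gene7119) are counted separately, so the number of unique genes is lower."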
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2c. Transposable Elements (All)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DML"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 57\n",
"DML overlaps with transposable elements (all)\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {DMLlist} \\\n",
"-b {transposableElementsAll} \\\n",
"| wc -l\n",
"!echo \"DML overlaps with transposable elements (all)\""
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {DMLlist} \\\n",
"-b {transposableElementsAll} \\\n",
"> 2018-11-07-DML-TE-all.txt"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t8833124\t8833126\t60\tNC_035780.1\tRepeatMasker\tsimilarity\t8833042\t8833288\t18.2\t-\t.\tTarget \"Motif:CVA\" 1 272\r\n",
"NC_035780.1\t22177828\t22177830\t-51\tNC_035780.1\tRepeatMasker\tsimilarity\t22177766\t22177877\t22.3\t-\t.\tTarget \"Motif:DNA9-6_CGi\" 1 115\r\n",
"NC_035780.1\t57337100\t57337102\t-54\tNC_035780.1\tRepeatMasker\tsimilarity\t57337042\t57337128\t18.6\t-\t.\tTarget \"Motif:DNA2-2_CGi\" 413 498\r\n",
"NC_035780.1\t58135767\t58135769\t74\tNC_035780.1\tRepeatMasker\tsimilarity\t58135699\t58135837\t22.4\t+\t.\tTarget \"Motif:BivaMD-SINE1_CrVi\" 169 314\r\n",
"NC_035781.1\t22439769\t22439771\t53\tNC_035781.1\tRepeatMasker\tsimilarity\t22439740\t22439796\t28.1\t+\t.\tTarget \"Motif:Mariner-6_AMi\" 698 754\r\n",
"NC_035781.1\t29178318\t29178320\t-55\tNC_035781.1\tRepeatMasker\tsimilarity\t29177336\t29178341\t16.0\t-\t.\tTarget \"Motif:CVA\" 2 863\r\n",
"NC_035781.1\t54151548\t54151550\t54\tNC_035781.1\tRepeatMasker\tsimilarity\t54150482\t54151750\t14.3\t+\t.\tTarget \"Motif:CVA\" 1 1018\r\n",
"NC_035781.1\t59742649\t59742651\t-65\tNC_035781.1\tRepeatMasker\tsimilarity\t59742603\t59742651\t 4.2\t+\t.\tTarget \"Motif:(ACTAACG)n\" 1 49\r\n",
"NC_035782.1\t6685343\t6685345\t-68\tNC_035782.1\tRepeatMasker\tsimilarity\t6685308\t6685646\t15.0\t+\t.\tTarget \"Motif:BivaMD-SINE1_CrVi\" 1 335\r\n",
"NC_035782.1\t6685349\t6685351\t-50\tNC_035782.1\tRepeatMasker\tsimilarity\t6685308\t6685646\t15.0\t+\t.\tTarget \"Motif:BivaMD-SINE1_CrVi\" 1 335\r\n"
]
}
],
"source": [
"!head 2018-11-07-DML-TE-all.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Hypermethylated DML"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 26\n",
"hypermethylated DML overlaps with TE (all)\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {hyperDML} \\\n",
"-b {transposableElementsAll} \\\n",
"| wc -l\n",
"!echo \"hypermethylated DML overlaps with TE (all)\""
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {hyperDML} \\\n",
"-b {transposableElementsAll} \\\n",
"> 2019-04-30-Hypermethylated-DML-TEall.txt"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t8833124\t8833126\t60\tNC_035780.1\tRepeatMasker\tsimilarity\t8833042\t8833288\t18.2\t-\t.\tTarget \"Motif:CVA\" 1 272\r\n",
"NC_035780.1\t58135767\t58135769\t74\tNC_035780.1\tRepeatMasker\tsimilarity\t58135699\t58135837\t22.4\t+\t.\tTarget \"Motif:BivaMD-SINE1_CrVi\" 169 314\r\n",
"NC_035781.1\t22439769\t22439771\t53\tNC_035781.1\tRepeatMasker\tsimilarity\t22439740\t22439796\t28.1\t+\t.\tTarget \"Motif:Mariner-6_AMi\" 698 754\r\n",
"NC_035781.1\t54151548\t54151550\t54\tNC_035781.1\tRepeatMasker\tsimilarity\t54150482\t54151750\t14.3\t+\t.\tTarget \"Motif:CVA\" 1 1018\r\n",
"NC_035782.1\t45857195\t45857197\t52\tNC_035782.1\tRepeatMasker\tsimilarity\t45857026\t45858123\t32.7\t+\t.\tTarget \"Motif:Mariner-21_LCh\" 450 1999\r\n",
"NC_035782.1\t53693367\t53693369\t61\tNC_035782.1\tRepeatMasker\tsimilarity\t53693299\t53693466\t19.6\t+\t.\tTarget \"Motif:Crypton-N6B_CGi\" 566 735\r\n",
"NC_035782.1\t58675269\t58675271\t50\tNC_035782.1\tRepeatMasker\tsimilarity\t58675249\t58675337\t19.1\t-\t.\tTarget \"Motif:DNA3-12_CGi\" 290 378\r\n",
"NC_035782.1\t61203970\t61203972\t51\tNC_035782.1\tRepeatMasker\tsimilarity\t61203541\t61204351\t12.3\t-\t.\tTarget \"Motif:CVA\" 1 686\r\n",
"NC_035783.1\t4336100\t4336102\t63\tNC_035783.1\tRepeatMasker\tsimilarity\t4335884\t4336135\t21.8\t+\t.\tTarget \"Motif:DNA8-4_CGi\" 42 268\r\n",
"NC_035783.1\t23130125\t23130127\t53\tNC_035783.1\tRepeatMasker\tsimilarity\t23130086\t23130209\t18.6\t+\t.\tTarget \"Motif:Crypton-8N1_CGi\" 516 639\r\n"
]
}
],
"source": [
"!head 2019-04-30-Hypermethylated-DML-TEall.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Hypomethylated DML"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 31\n",
"hypomethylated DML overlaps with TE (all)\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {hypoDML} \\\n",
"-b {transposableElementsAll} \\\n",
"| wc -l\n",
"!echo \"hypomethylated DML overlaps with TE (all)\""
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {hypoDML} \\\n",
"-b {transposableElementsAll} \\\n",
"> 2019-04-30-Hypomethylated-DML-TEall.txt"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t22177828\t22177830\t-51\tNC_035780.1\tRepeatMasker\tsimilarity\t22177766\t22177877\t22.3\t-\t.\tTarget \"Motif:DNA9-6_CGi\" 1 115\r\n",
"NC_035780.1\t57337100\t57337102\t-54\tNC_035780.1\tRepeatMasker\tsimilarity\t57337042\t57337128\t18.6\t-\t.\tTarget \"Motif:DNA2-2_CGi\" 413 498\r\n",
"NC_035781.1\t29178318\t29178320\t-55\tNC_035781.1\tRepeatMasker\tsimilarity\t29177336\t29178341\t16.0\t-\t.\tTarget \"Motif:CVA\" 2 863\r\n",
"NC_035781.1\t59742649\t59742651\t-65\tNC_035781.1\tRepeatMasker\tsimilarity\t59742603\t59742651\t 4.2\t+\t.\tTarget \"Motif:(ACTAACG)n\" 1 49\r\n",
"NC_035782.1\t6685343\t6685345\t-68\tNC_035782.1\tRepeatMasker\tsimilarity\t6685308\t6685646\t15.0\t+\t.\tTarget \"Motif:BivaMD-SINE1_CrVi\" 1 335\r\n",
"NC_035782.1\t6685349\t6685351\t-50\tNC_035782.1\tRepeatMasker\tsimilarity\t6685308\t6685646\t15.0\t+\t.\tTarget \"Motif:BivaMD-SINE1_CrVi\" 1 335\r\n",
"NC_035782.1\t34498893\t34498895\t-55\tNC_035782.1\tRepeatMasker\tsimilarity\t34498501\t34500091\t24.8\t+\t.\tTarget \"Motif:Helitron-N40_CGi\" 1 1569\r\n",
"NC_035782.1\t34498895\t34498897\t-71\tNC_035782.1\tRepeatMasker\tsimilarity\t34498501\t34500091\t24.8\t+\t.\tTarget \"Motif:Helitron-N40_CGi\" 1 1569\r\n",
"NC_035783.1\t32417238\t32417240\t-60\tNC_035783.1\tRepeatMasker\tsimilarity\t32417040\t32417268\t14.7\t-\t.\tTarget \"Motif:BivaMD-SINE1_CrVi\" 2 319\r\n",
"NC_035783.1\t41484688\t41484690\t-59\tNC_035783.1\tRepeatMasker\tsimilarity\t41484586\t41484823\t11.8\t+\t.\tTarget \"Motif:BivaMD-SINE1_CrVi\" 61 337\r\n"
]
}
],
"source": [
"!head 2019-04-30-Hypomethylated-DML-TEall.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DMR"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 39\n",
"DMR overlaps with transposable elements (all)\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {DMRlist} \\\n",
"-b {transposableElementsAll} \\\n",
"| wc -l\n",
"!echo \"DMR overlaps with transposable elements (all)\""
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {DMRlist} \\\n",
"-b {transposableElementsAll} \\\n",
"> 2018-11-07-DMR-TE-all.txt"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t24159600\t24159700\tDMR\t51\r\n",
"NC_035780.1\t29883000\t29883100\tDMR\t-55\r\n",
"NC_035780.1\t33945900\t33946000\tDMR\t-51\r\n",
"NC_035780.1\t46044700\t46044800\tDMR\t-53\r\n",
"NC_035780.1\t47335000\t47335100\tDMR\t-66\r\n",
"NC_035781.1\t30962200\t30962300\tDMR\t63\r\n",
"NC_035781.1\t51566900\t51567000\tDMR\t-61\r\n",
"NC_035781.1\t54151500\t54151600\tDMR\t55\r\n",
"NC_035782.1\t2787300\t2787400\tDMR\t-53\r\n",
"NC_035782.1\t7518400\t7518500\tDMR\t64\r\n"
]
}
],
"source": [
"!head 2018-11-07-DMR-TE-all.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2e. Transposable Elements (_C. gigas_ only)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DML"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 39\n",
"DML overlaps with transposable elements (Cg)\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {DMLlist} \\\n",
"-b {transposableElementsCg} \\\n",
"| wc -l\n",
"!echo \"DML overlaps with transposable elements (Cg)\""
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {DMLlist} \\\n",
"-b {transposableElementsCg} \\\n",
"> 2018-11-07-DML-TE-Cg.txt"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t8833124\t8833126\t60\tNC_035780.1\tRepeatMasker\tsimilarity\t8833045\t8833287\t22.6\t-\t.\tTarget \"Motif:Helitron-N2f_CGi\" 1 276\r\n",
"NC_035780.1\t22177828\t22177830\t-51\tNC_035780.1\tRepeatMasker\tsimilarity\t22177766\t22177877\t22.3\t-\t.\tTarget \"Motif:DNA9-6_CGi\" 1 115\r\n",
"NC_035780.1\t57337100\t57337102\t-54\tNC_035780.1\tRepeatMasker\tsimilarity\t57337042\t57337128\t18.6\t-\t.\tTarget \"Motif:DNA2-2_CGi\" 413 498\r\n",
"NC_035781.1\t29178318\t29178320\t-55\tNC_035781.1\tRepeatMasker\tsimilarity\t29177333\t29178341\t24.4\t-\t.\tTarget \"Motif:Helitron-N2d_CGi\" 2 863\r\n",
"NC_035781.1\t54151548\t54151550\t54\tNC_035781.1\tRepeatMasker\tsimilarity\t54150483\t54151741\t23.3\t+\t.\tTarget \"Motif:Helitron-N2f_CGi\" 1 1018\r\n",
"NC_035781.1\t59742649\t59742651\t-65\tNC_035781.1\tRepeatMasker\tsimilarity\t59742603\t59742651\t 4.2\t+\t.\tTarget \"Motif:(ACTAACG)n\" 1 49\r\n",
"NC_035782.1\t34498893\t34498895\t-55\tNC_035782.1\tRepeatMasker\tsimilarity\t34498501\t34500091\t24.8\t+\t.\tTarget \"Motif:Helitron-N40_CGi\" 1 1569\r\n",
"NC_035782.1\t34498895\t34498897\t-71\tNC_035782.1\tRepeatMasker\tsimilarity\t34498501\t34500091\t24.8\t+\t.\tTarget \"Motif:Helitron-N40_CGi\" 1 1569\r\n",
"NC_035782.1\t53693367\t53693369\t61\tNC_035782.1\tRepeatMasker\tsimilarity\t53693299\t53693466\t19.6\t+\t.\tTarget \"Motif:Crypton-N6B_CGi\" 566 735\r\n",
"NC_035782.1\t58675269\t58675271\t50\tNC_035782.1\tRepeatMasker\tsimilarity\t58675249\t58675337\t19.1\t-\t.\tTarget \"Motif:DNA3-12_CGi\" 290 378\r\n"
]
}
],
"source": [
"!head 2018-11-07-DML-TE-Cg.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DMR"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 23\n",
"DMR overlaps with transposable elements (Cg)\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {DMRlist} \\\n",
"-b {transposableElementsCg} \\\n",
"| wc -l\n",
"!echo \"DMR overlaps with transposable elements (Cg)\""
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {DMRlist} \\\n",
"-b {transposableElementsCg} \\\n",
"> 2018-11-07-DMR-TE-Cg.txt"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t29883075\t29883100\tDMR\t-55\tNC_035780.1\tRepeatMasker\tsimilarity\t29883076\t29883262\t25.4\t+\t.\tTarget \"Motif:DNA9-4_CGi\" 776 964\r\n",
"NC_035780.1\t33945935\t33945973\tDMR\t-51\tNC_035780.1\tRepeatMasker\tsimilarity\t33945936\t33945973\t 5.3\t+\t.\tTarget \"Motif:DNA9-5_CGi\" 1 38\r\n",
"NC_035780.1\t33945972\t33946000\tDMR\t-51\tNC_035780.1\tRepeatMasker\tsimilarity\t33945973\t33946026\t 3.7\t+\t.\tTarget \"Motif:DNA8-12_CGi\" 645 699\r\n",
"NC_035781.1\t30962200\t30962300\tDMR\t63\tNC_035781.1\tRepeatMasker\tsimilarity\t30962179\t30962531\t18.4\t+\t.\tTarget \"Motif:Helitron-N2d_CGi\" 3 346\r\n",
"NC_035781.1\t51566900\t51566921\tDMR\t-61\tNC_035781.1\tRepeatMasker\tsimilarity\t51566285\t51566921\t25.2\t-\t.\tTarget \"Motif:Sola3-1_CGi\" 3273 3922\r\n",
"NC_035781.1\t51566900\t51567000\tDMR\t-61\tNC_035781.1\tRepeatMasker\tsimilarity\t51566310\t51567155\t27.8\t-\t.\tTarget \"Motif:Helitron-N43_CGi\" 1 1221\r\n",
"NC_035781.1\t54151500\t54151600\tDMR\t55\tNC_035781.1\tRepeatMasker\tsimilarity\t54150483\t54151741\t23.3\t+\t.\tTarget \"Motif:Helitron-N2f_CGi\" 1 1018\r\n",
"NC_035782.1\t2787300\t2787318\tDMR\t-53\tNC_035782.1\tRepeatMasker\tsimilarity\t2787296\t2787318\t 0.0\t+\t.\tTarget \"Motif:(C)n\" 1 23\r\n",
"NC_035782.1\t7518400\t7518500\tDMR\t64\tNC_035782.1\tRepeatMasker\tsimilarity\t7518060\t7519217\t25.6\t+\t.\tTarget \"Motif:Helitron-N2d_CGi\" 1 1510\r\n",
"NC_035782.1\t30950900\t30950955\tDMR\t-51\tNC_035782.1\tRepeatMasker\tsimilarity\t30950209\t30950955\t23.9\t-\t.\tTarget \"Motif:Helitron-N2d_CGi\" 95 695\r\n"
]
}
],
"source": [
"!head 2018-11-07-DMR-TE-Cg.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Identify Overlaps between CG Motifs and Other Genome Feature Tracks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's also useful to understand where the CG motifs are located in relation to exons, introns, mRNA, and transposable elements."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3a. Exons"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 2323389\n",
"CG motif overlaps with exons\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {CGMotifList} \\\n",
"-b {exonList} \\\n",
"| wc -l\n",
"!echo \"CG motif overlaps with exons\""
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {CGMotifList} \\\n",
"-b {exonList} \\\n",
"> 2018-11-07-Exon-CGmotif.txt"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t13597\t13599\tCG_motif\tNC_035780.1\t13578\t13603\r\n",
"NC_035780.1\t28992\t28994\tCG_motif\tNC_035780.1\t28961\t29073\r\n",
"NC_035780.1\t29001\t29003\tCG_motif\tNC_035780.1\t28961\t29073\r\n",
"NC_035780.1\t29028\t29030\tCG_motif\tNC_035780.1\t28961\t29073\r\n",
"NC_035780.1\t30539\t30541\tCG_motif\tNC_035780.1\t30524\t31557\r\n",
"NC_035780.1\t30574\t30576\tCG_motif\tNC_035780.1\t30524\t31557\r\n",
"NC_035780.1\t30602\t30604\tCG_motif\tNC_035780.1\t30524\t31557\r\n",
"NC_035780.1\t30676\t30678\tCG_motif\tNC_035780.1\t30524\t31557\r\n",
"NC_035780.1\t30695\t30697\tCG_motif\tNC_035780.1\t30524\t31557\r\n",
"NC_035780.1\t30723\t30725\tCG_motif\tNC_035780.1\t30524\t31557\r\n"
]
}
],
"source": [
"!head 2018-11-07-Exon-CGmotif.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3b. Introns"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 5297975\n",
"CG motif overlaps with introns\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {CGMotifList} \\\n",
"-b {intronList} \\\n",
"| wc -l\n",
"!echo \"CG motif overlaps with introns\""
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {CGMotifList} \\\n",
"-b {intronList} \\\n",
"> 2018-11-07-Intron-CGmotif.txt"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t29180\t29182\tCG_motif\tNC_035780.1\t29074\t30524\r\n",
"NC_035780.1\t29203\t29205\tCG_motif\tNC_035780.1\t29074\t30524\r\n",
"NC_035780.1\t29221\t29223\tCG_motif\tNC_035780.1\t29074\t30524\r\n",
"NC_035780.1\t29295\t29297\tCG_motif\tNC_035780.1\t29074\t30524\r\n",
"NC_035780.1\t29323\t29325\tCG_motif\tNC_035780.1\t29074\t30524\r\n",
"NC_035780.1\t29326\t29328\tCG_motif\tNC_035780.1\t29074\t30524\r\n",
"NC_035780.1\t29412\t29414\tCG_motif\tNC_035780.1\t29074\t30524\r\n",
"NC_035780.1\t29452\t29454\tCG_motif\tNC_035780.1\t29074\t30524\r\n",
"NC_035780.1\t29672\t29674\tCG_motif\tNC_035780.1\t29074\t30524\r\n",
"NC_035780.1\t29758\t29760\tCG_motif\tNC_035780.1\t29074\t30524\r\n"
]
}
],
"source": [
"!head 2018-11-07-Intron-CGmotif.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3c. mRNA"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 7507167\n",
"CG motif overlaps with mRNA\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {CGMotifList} \\\n",
"-b {mRNAList} \\\n",
"| wc -l\n",
"!echo \"CG motif overlaps with mRNA\""
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {CGMotifList} \\\n",
"-b {mRNAList} \\\n",
"> 2018-11-07-mRNA-CGmotif.txt"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t28992\t28994\tCG_motif\tNC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n",
"NC_035780.1\t29001\t29003\tCG_motif\tNC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n",
"NC_035780.1\t29028\t29030\tCG_motif\tNC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n",
"NC_035780.1\t29180\t29182\tCG_motif\tNC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n",
"NC_035780.1\t29203\t29205\tCG_motif\tNC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n",
"NC_035780.1\t29221\t29223\tCG_motif\tNC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n",
"NC_035780.1\t29295\t29297\tCG_motif\tNC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n",
"NC_035780.1\t29323\t29325\tCG_motif\tNC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n",
"NC_035780.1\t29326\t29328\tCG_motif\tNC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n",
"NC_035780.1\t29412\t29414\tCG_motif\tNC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n"
]
}
],
"source": [
"!head 2018-11-07-mRNA-CGmotif.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3d. Transposable Elements (All)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 2828372\n",
"CG motif overlaps with transposable elements (all)\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {CGMotifList} \\\n",
"-b {transposableElementsAll} \\\n",
"| wc -l\n",
"!echo \"CG motif overlaps with transposable elements (all)\""
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {CGMotifList} \\\n",
"-b {transposableElementsAll} \\\n",
"> 2018-11-07-TE-all-CGmotif.txt"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t5079\t5080\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5159\t5161\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5162\t5164\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5174\t5176\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5191\t5193\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5220\t5222\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5317\t5319\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5357\t5359\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5381\t5383\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5398\t5400\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n"
]
}
],
"source": [
"!head 2018-11-07-TE-all-CGmotif.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3e. Transposable Elements (_C. gigas_ only)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 2142774\n",
"CG motif overlaps with transposable elements (Cg)\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {CGMotifList} \\\n",
"-b {transposableElementsCg} \\\n",
"| wc -l\n",
"!echo \"CG motif overlaps with transposable elements (Cg)\""
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {CGMotifList} \\\n",
"-b {transposableElementsCg} \\\n",
"> 2018-11-07-TE-Cg-CGmotif.txt"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t5079\t5080\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5159\t5161\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5162\t5164\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5174\t5176\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5191\t5193\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5220\t5222\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5317\t5319\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5357\t5359\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5381\t5383\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n",
"NC_035780.1\t5398\t5400\tCG_motif\tNC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n"
]
}
],
"source": [
"!head 2018-11-07-TE-Cg-CGmotif.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Identify Overlaps between Transposable Elements and Other Genome Feature Tracks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To fully understand my results, I also need to know where TEs are located with respect to exons, introns, and mRNA."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4a. Transposable Elements (All)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exons"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 50331\n",
"Exon overlaps with transposable elements (all)\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {exonList} \\\n",
"-b {transposableElementsAll} \\\n",
"| wc -l\n",
"!echo \"Exon overlaps with transposable elements (all)\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Proportion exon overlap with TE (all):"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.07269368589961163"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"50331/692371"
]
},
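{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the same arithmetic as a reusable function, so that the overlap count and the denominator are named rather than typed inline. `overlap_proportion` is a hypothetical helper and not part of the original analysis. Both numbers are copied from the cells above, and what the denominator 692371 counts is taken as given."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def overlap_proportion(n_overlaps, n_total):\n",
"    # Hypothetical helper: fraction of n_total accounted for by n_overlaps\n",
"    if n_total <= 0:\n",
"        raise ValueError('n_total must be positive')\n",
"    return n_overlaps / n_total\n",
"\n",
"# 50331 exon overlaps and the 692371 denominator are copied from the cells above\n",
"overlap_proportion(50331, 692371)"
]
},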
{
"cell_type": "code",
"execution_count": 102,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {exonList} \\\n",
"-b {transposableElementsAll} \\\n",
"> 2018-11-07-Exon-TE-all.txt"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t109967\t109996\tNC_035780.1\tRepeatMasker\tsimilarity\t109968\t109996\t 0.0\t+\t.\tTarget \"Motif:(CCT)n\" 1 29\r\n",
"NC_035780.1\t164885\t164914\tNC_035780.1\tRepeatMasker\tsimilarity\t164886\t164914\t 7.3\t+\t.\tTarget \"Motif:(GAG)n\" 1 29\r\n",
"NC_035780.1\t166074\t166280\tNC_035780.1\tRepeatMasker\tsimilarity\t166075\t166280\t32.8\t+\t.\tTarget \"Motif:Harbinger1_DR\" 1472 1676\r\n",
"NC_035780.1\t166500\t166566\tNC_035780.1\tRepeatMasker\tsimilarity\t166501\t166566\t30.3\t+\t.\tTarget \"Motif:Harbinger-6_DR\" 1152 1217\r\n",
"NC_035780.1\t166597\t166642\tNC_035780.1\tRepeatMasker\tsimilarity\t166598\t166642\t17.8\t+\t.\tTarget \"Motif:hATw-1_HM\" 2778 2822\r\n",
"NC_035780.1\t220121\t220199\tNC_035780.1\tRepeatMasker\tsimilarity\t220122\t220199\t24.7\t-\t.\tTarget \"Motif:Gypsy-75_CQ-I\" 1012 1091\r\n",
"NC_035780.1\t228341\t228392\tNC_035780.1\tRepeatMasker\tsimilarity\t228342\t228392\t20.0\t+\t.\tTarget \"Motif:RTE-3_Hmel\" 1405 1455\r\n",
"NC_035780.1\t227767\t227819\tNC_035780.1\tRepeatMasker\tsimilarity\t227768\t227819\t25.0\t+\t.\tTarget \"Motif:A-rich\" 1 54\r\n",
"NC_035780.1\t228341\t228392\tNC_035780.1\tRepeatMasker\tsimilarity\t228342\t228392\t20.0\t+\t.\tTarget \"Motif:RTE-3_Hmel\" 1405 1455\r\n",
"NC_035780.1\t227767\t227819\tNC_035780.1\tRepeatMasker\tsimilarity\t227768\t227819\t25.0\t+\t.\tTarget \"Motif:A-rich\" 1 54\r\n"
]
}
],
"source": [
"!head 2018-11-07-Exon-TE-all.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Introns"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 105643\n",
"Intron overlaps with transposable elements (all)\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {intronList} \\\n",
"-b {transposableElementsAll} \\\n",
"| wc -l\n",
"!echo \"Intron overlaps with transposable elements (all)\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Proportion intron overlap with TE (all):"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.1525814917147021"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"105643/692371"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {intronList} \\\n",
"-b {transposableElementsAll} \\\n",
"> 2018-11-07-Intron-TE-all.txt"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t32719\t32819\tNC_035780.1\tRepeatMasker\tsimilarity\t32720\t32819\t18.2\t+\t.\tTarget \"Motif:Crypton-9N1_CGi\" 239 337\r\n",
"NC_035780.1\t48462\t48520\tNC_035780.1\tRepeatMasker\tsimilarity\t48463\t48520\t 8.8\t+\t.\tTarget \"Motif:BivaMD-SINE1_CrVi\" 280 337\r\n",
"NC_035780.1\t48665\t49000\tNC_035780.1\tRepeatMasker\tsimilarity\t48666\t49000\t10.9\t-\t.\tTarget \"Motif:BivaMD-SINE1_CrVi\" 1 337\r\n",
"NC_035780.1\t50250\t50279\tNC_035780.1\tRepeatMasker\tsimilarity\t50251\t50279\t 0.0\t+\t.\tTarget \"Motif:(GGTTAG)n\" 1 29\r\n",
"NC_035780.1\t50605\t50760\tNC_035780.1\tRepeatMasker\tsimilarity\t50606\t50760\t21.3\t+\t.\tTarget \"Motif:Harbinger-2N1_CGi\" 1 166\r\n",
"NC_035780.1\t50976\t51034\tNC_035780.1\tRepeatMasker\tsimilarity\t50977\t51034\t 0.0\t+\t.\tTarget \"Motif:(TA)n\" 1 58\r\n",
"NC_035780.1\t51455\t51498\tNC_035780.1\tRepeatMasker\tsimilarity\t51456\t51498\t 0.0\t+\t.\tTarget \"Motif:(AG)n\" 1 43\r\n",
"NC_035780.1\t51720\t51922\tNC_035780.1\tRepeatMasker\tsimilarity\t51721\t51922\t21.8\t+\t.\tTarget \"Motif:Harbinger-2N1_CGi\" 2568 2776\r\n",
"NC_035780.1\t53155\t53294\tNC_035780.1\tRepeatMasker\tsimilarity\t53156\t53294\t20.9\t+\t.\tTarget \"Motif:BivaMD-SINE1_CrVi\" 127 306\r\n",
"NC_035780.1\t86824\t86942\tNC_035780.1\tRepeatMasker\tsimilarity\t86825\t86942\t26.5\t-\t.\tTarget \"Motif:CVA\" 81 203\r\n"
]
}
],
"source": [
"!head 2018-11-07-Intron-TE-all.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### mRNA"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 55069\n",
"mRNA overlaps with transposable elements (all)\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {mRNAList} \\\n",
"-b {transposableElementsAll} \\\n",
"| wc -l\n",
"!echo \"mRNA overlaps with transposable elements (all)\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Proportion mRNA overlap with TE (all):"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.07953683790915564"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"55069/692371"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {mRNAList} \\\n",
"-b {transposableElementsAll} \\\n",
"> 2018-11-07-mRNA-TE-all.txt"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t32720\t32819\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\tRepeatMasker\tsimilarity\t32720\t32819\t18.2\t+\t.\tTarget \"Motif:Crypton-9N1_CGi\" 239 337\r\n",
"NC_035780.1\tGnomon\tmRNA\t48463\t48520\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\tNC_035780.1\tRepeatMasker\tsimilarity\t48463\t48520\t 8.8\t+\t.\tTarget \"Motif:BivaMD-SINE1_CrVi\" 280 337\r\n",
"NC_035780.1\tGnomon\tmRNA\t48666\t49000\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\tNC_035780.1\tRepeatMasker\tsimilarity\t48666\t49000\t10.9\t-\t.\tTarget \"Motif:BivaMD-SINE1_CrVi\" 1 337\r\n",
"NC_035780.1\tGnomon\tmRNA\t50251\t50279\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\tNC_035780.1\tRepeatMasker\tsimilarity\t50251\t50279\t 0.0\t+\t.\tTarget \"Motif:(GGTTAG)n\" 1 29\r\n",
"NC_035780.1\tGnomon\tmRNA\t50606\t50760\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\tNC_035780.1\tRepeatMasker\tsimilarity\t50606\t50760\t21.3\t+\t.\tTarget \"Motif:Harbinger-2N1_CGi\" 1 166\r\n",
"NC_035780.1\tGnomon\tmRNA\t50977\t51034\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\tNC_035780.1\tRepeatMasker\tsimilarity\t50977\t51034\t 0.0\t+\t.\tTarget \"Motif:(TA)n\" 1 58\r\n",
"NC_035780.1\tGnomon\tmRNA\t51456\t51498\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\tNC_035780.1\tRepeatMasker\tsimilarity\t51456\t51498\t 0.0\t+\t.\tTarget \"Motif:(AG)n\" 1 43\r\n",
"NC_035780.1\tGnomon\tmRNA\t51721\t51922\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\tNC_035780.1\tRepeatMasker\tsimilarity\t51721\t51922\t21.8\t+\t.\tTarget \"Motif:Harbinger-2N1_CGi\" 2568 2776\r\n",
"NC_035780.1\tGnomon\tmRNA\t53156\t53294\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\tNC_035780.1\tRepeatMasker\tsimilarity\t53156\t53294\t20.9\t+\t.\tTarget \"Motif:BivaMD-SINE1_CrVi\" 127 306\r\n",
"NC_035780.1\tGnomon\tmRNA\t86825\t86942\t.\t-\t.\tID=rna4;Parent=gene3;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;Name=XM_022449924.1;gbkey=mRNA;gene=LOC111112434;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1\tNC_035780.1\tRepeatMasker\tsimilarity\t86825\t86942\t26.5\t-\t.\tTarget \"Motif:CVA\" 81 203\r\n"
]
}
],
"source": [
"!head 2018-11-07-mRNA-TE-all.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4b. Transposable Elements (_C. gigas_ only)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exons"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 41511\n",
"Exon overlaps with transposable elements (Cg)\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {exonList} \\\n",
"-b {transposableElementsCg} \\\n",
"| wc -l\n",
"!echo \"Exon overlaps with transposable elements (Cg)\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Proportion exon overlap with TE (Cg):"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.059954850795310606"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"41511/692371"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {exonList} \\\n",
"-b {transposableElementsCg} \\\n",
"> 2018-11-07-Exon-TE-Cg.txt"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t109967\t109996\tNC_035780.1\tRepeatMasker\tsimilarity\t109968\t109996\t 0.0\t+\t.\tTarget \"Motif:(CCT)n\" 1 29\r\n",
"NC_035780.1\t164885\t164914\tNC_035780.1\tRepeatMasker\tsimilarity\t164886\t164914\t 7.3\t+\t.\tTarget \"Motif:(GAG)n\" 1 29\r\n",
"NC_035780.1\t227767\t227819\tNC_035780.1\tRepeatMasker\tsimilarity\t227768\t227819\t25.0\t+\t.\tTarget \"Motif:A-rich\" 1 54\r\n",
"NC_035780.1\t227767\t227819\tNC_035780.1\tRepeatMasker\tsimilarity\t227768\t227819\t25.0\t+\t.\tTarget \"Motif:A-rich\" 1 54\r\n",
"NC_035780.1\t227767\t227819\tNC_035780.1\tRepeatMasker\tsimilarity\t227768\t227819\t25.0\t+\t.\tTarget \"Motif:A-rich\" 1 54\r\n",
"NC_035780.1\t233475\t233478\tNC_035780.1\tRepeatMasker\tsimilarity\t233445\t233478\t10.1\t+\t.\tTarget \"Motif:(CCTTT)n\" 1 35\r\n",
"NC_035780.1\t232863\t233028\tNC_035780.1\tRepeatMasker\tsimilarity\t232798\t233028\t29.7\t-\t.\tTarget \"Motif:ISL2EU-N8_CGi\" 15 237\r\n",
"NC_035780.1\t269562\t269603\tNC_035780.1\tRepeatMasker\tsimilarity\t269563\t269603\t17.1\t+\t.\tTarget \"Motif:(ATG)n\" 1 42\r\n",
"NC_035780.1\t258539\t258574\tNC_035780.1\tRepeatMasker\tsimilarity\t258540\t258574\t16.3\t+\t.\tTarget \"Motif:(ATACAAT)n\" 1 36\r\n",
"NC_035780.1\t269562\t269603\tNC_035780.1\tRepeatMasker\tsimilarity\t269563\t269603\t17.1\t+\t.\tTarget \"Motif:(ATG)n\" 1 42\r\n"
]
}
],
"source": [
"!head 2018-11-07-Exon-TE-Cg.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Introns"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 98494\n",
"Intron overlaps with transposable elements (Cg)\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {intronList} \\\n",
"-b {transposableElementsCg} \\\n",
"| wc -l\n",
"!echo \"Intron overlaps with transposable elements (Cg)\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Proportion intron overlap with TE (Cg):"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.1422561025808418"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"98494/692371"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {intronList} \\\n",
"-b {transposableElementsCg} \\\n",
"> 2018-11-07-Intron-TE-Cg.txt"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t32719\t32819\tNC_035780.1\tRepeatMasker\tsimilarity\t32720\t32819\t18.2\t+\t.\tTarget \"Motif:Crypton-9N1_CGi\" 239 337\r\n",
"NC_035780.1\t46753\t46805\tNC_035780.1\tRepeatMasker\tsimilarity\t46754\t46805\t 6.8\t+\t.\tTarget \"Motif:DNA-22_CGi\" 631 722\r\n",
"NC_035780.1\t50250\t50279\tNC_035780.1\tRepeatMasker\tsimilarity\t50251\t50279\t 0.0\t+\t.\tTarget \"Motif:(GGTTAG)n\" 1 29\r\n",
"NC_035780.1\t50605\t50760\tNC_035780.1\tRepeatMasker\tsimilarity\t50606\t50760\t21.3\t+\t.\tTarget \"Motif:Harbinger-2N1_CGi\" 1 166\r\n",
"NC_035780.1\t50976\t51034\tNC_035780.1\tRepeatMasker\tsimilarity\t50977\t51034\t 0.0\t+\t.\tTarget \"Motif:(TA)n\" 1 58\r\n",
"NC_035780.1\t51455\t51498\tNC_035780.1\tRepeatMasker\tsimilarity\t51456\t51498\t 0.0\t+\t.\tTarget \"Motif:(AG)n\" 1 43\r\n",
"NC_035780.1\t51720\t51922\tNC_035780.1\tRepeatMasker\tsimilarity\t51721\t51922\t21.8\t+\t.\tTarget \"Motif:Harbinger-2N1_CGi\" 2568 2776\r\n",
"NC_035780.1\t86839\t86942\tNC_035780.1\tRepeatMasker\tsimilarity\t86840\t86942\t27.4\t-\t.\tTarget \"Motif:Helitron-N14_CGi\" 83 189\r\n",
"NC_035780.1\t87408\t87513\tNC_035780.1\tRepeatMasker\tsimilarity\t87409\t87513\t19.8\t-\t.\tTarget \"Motif:Helitron-7N1_CGi\" 748 850\r\n",
"NC_035780.1\t87525\t87837\tNC_035780.1\tRepeatMasker\tsimilarity\t87526\t87837\t24.3\t-\t.\tTarget \"Motif:DNA3-12_CGi\" 60 378\r\n"
]
}
],
"source": [
"!head 2018-11-07-Intron-TE-Cg.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### mRNA"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 53914\n",
"mRNA overlaps with transposable elements (Cg)\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-u \\\n",
"-a {mRNAList} \\\n",
"-b {transposableElementsCg} \\\n",
"| wc -l\n",
"!echo \"mRNA overlaps with transposable elements (Cg)\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Proportion mRNA overlap with TE (Cg):"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.07786865712168765"
]
},
"execution_count": 123,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"53914/692371"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a {mRNAList} \\\n",
"-b {transposableElementsCg} \\\n",
"> 2018-11-07-mRNA-TE-Cg.txt"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t32719\t32819\tNC_035780.1\tRepeatMasker\tsimilarity\t32720\t32819\t18.2\t+\t.\tTarget \"Motif:Crypton-9N1_CGi\" 239 337\r\n",
"NC_035780.1\t46753\t46805\tNC_035780.1\tRepeatMasker\tsimilarity\t46754\t46805\t 6.8\t+\t.\tTarget \"Motif:DNA-22_CGi\" 631 722\r\n",
"NC_035780.1\t50250\t50279\tNC_035780.1\tRepeatMasker\tsimilarity\t50251\t50279\t 0.0\t+\t.\tTarget \"Motif:(GGTTAG)n\" 1 29\r\n",
"NC_035780.1\t50605\t50760\tNC_035780.1\tRepeatMasker\tsimilarity\t50606\t50760\t21.3\t+\t.\tTarget \"Motif:Harbinger-2N1_CGi\" 1 166\r\n",
"NC_035780.1\t50976\t51034\tNC_035780.1\tRepeatMasker\tsimilarity\t50977\t51034\t 0.0\t+\t.\tTarget \"Motif:(TA)n\" 1 58\r\n",
"NC_035780.1\t51455\t51498\tNC_035780.1\tRepeatMasker\tsimilarity\t51456\t51498\t 0.0\t+\t.\tTarget \"Motif:(AG)n\" 1 43\r\n",
"NC_035780.1\t51720\t51922\tNC_035780.1\tRepeatMasker\tsimilarity\t51721\t51922\t21.8\t+\t.\tTarget \"Motif:Harbinger-2N1_CGi\" 2568 2776\r\n",
"NC_035780.1\t86839\t86942\tNC_035780.1\tRepeatMasker\tsimilarity\t86840\t86942\t27.4\t-\t.\tTarget \"Motif:Helitron-N14_CGi\" 83 189\r\n",
"NC_035780.1\t87408\t87513\tNC_035780.1\tRepeatMasker\tsimilarity\t87409\t87513\t19.8\t-\t.\tTarget \"Motif:Helitron-7N1_CGi\" 748 850\r\n",
"NC_035780.1\t87525\t87837\tNC_035780.1\tRepeatMasker\tsimilarity\t87526\t87837\t24.3\t-\t.\tTarget \"Motif:DNA3-12_CGi\" 60 378\r\n"
]
}
],
"source": [
"!head 2018-11-07-mRNA-TE-Cg.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Calculate Overlap Proportions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's important to understand how many overlaps are present between various feature tracks and CG motifs. CG motifs are where we expect methylation to happen. If there are more overlaps present between a certain feature and the CG motifs, we would expect to see most of our DMLs in that region. I also want to understand overlap proportions with DMLs and DMRs.\n",
"\n",
"Here are the questions I will answer:\n",
"\n",
"1. Out of the total number of CG motifs, how many overlapped with a feature track?\n",
"2. Out of the total number of transposable elements, how many overlapped with a feature track?\n",
"3. What proportion of total overlaps does a certain feature track represent?\n",
"4. Out of the total number of DML, how many overlapped with a feature track?\n",
"5. Out of the total number of DMR, how many overlapped with a feature track?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5a. CG motif Overlaps with Feature Tracks"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"I already calculated the numbers associated with the first question in the first section. I'll remind you of those numbers:\n",
"\n",
"- Proportion exon overlap with CG motifs: 4.40% (0.04400602184027157)\n",
"- Proportion intron overlap with CG motifs: 1.70% (0.016979392964915317)\n",
"- Proportion mRNA overlap with CG motifs: 0.42% (0.004163236495002352)\n",
"- Proportion transposable element (all) overlap with CG motifs: 2.58% (0.025761577646349055) \n",
"- Proportion transposable element (_C. gigas_) overlap with CG motifs: 3.03% (0.03031841791065215)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5b. Transposable Element (all) Overlaps with Feature Tracks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See 4a for more details.\n",
"\n",
"- Proportion exon overlap with TE (all): 7.27% (0.07269368589961163)\n",
"- Proportion intron overlap with TE (all): 15.3% (0.1525814917147021)\n",
"- Proportion mRNA overlap with TE (all): 7.95% (0.07953683790915564)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5c. Transposable Element (_C. gigas_ only) Overlaps with Feature Tracks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See 4b for more details.\n",
"\n",
"- Proportion exon overlap with TE (Cg): 6.00% (0.059954850795310606)\n",
"- Proportion intron overlap with TE (Cg): 14.2% (0.1422561025808418)\n",
"- Proportion mRNA overlap with TE (Cg): 7.79% (0.07786865712168765)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### 5d. Proportion Total Overlaps by Feature Track"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Since I have two different transposable element tracks, I'll repeat these calculations for each track."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Transposable Elements (all)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, I need to calculate the total number of overlaps we had:\n",
"\n",
"(exon overlap with CG motifs) + (intron overlap with CG motifs) + (mRNA overlap with CG motifs) + (TE overlap with CG motifs)"
]
},
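{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same total and the per-track proportions can be reproduced in one step as a cross-check. This is only a minimal sketch: `overlap_proportions` and `cg_overlaps_all` are hypothetical names introduced for illustration, and the counts are the CG motif overlap counts used throughout this section."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Hypothetical helper: share of the summed overlaps contributed by each feature track\n",
"def overlap_proportions(counts):\n",
"    total = sum(counts.values())\n",
"    return {track: n / total for track, n in counts.items()}\n",
"\n",
"# CG motif overlap counts for each feature track (TE track = all species)\n",
"cg_overlaps_all = {'exon': 636270, 'intron': 245500, 'mRNA': 60195, 'TE (all)': 438365}\n",
"\n",
"print('Total overlaps:', sum(cg_overlaps_all.values()))\n",
"for track, proportion in overlap_proportions(cg_overlaps_all).items():\n",
"    print(f'{track}: {proportion:.2%}')"
]
},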
{
"cell_type": "code",
"execution_count": 126,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1380330"
]
},
"execution_count": 126,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"636270 + 245500 + 60195 + 438365"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Now, I calculate the proportions:"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Exons:"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.4609549890243637"
]
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"636270/1380330"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Introns:"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.17785601993726138"
]
},
"execution_count": 128,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"245500/1380330"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"mRNA:"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.04360913694551303"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"60195/1380330"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transposable Elements (all):"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.3175798540928619"
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"438365/1380330"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"- Proportion exon overlap out of total overlaps: 46.10% (0.4609549890243637)\n",
"- Proportion intron overlap out of total overlaps: 17.79% (0.17785601993726138)\n",
"- Proportion mRNA overlap out of total overlaps: 4.36% (0.04360913694551303)\n",
"- Proportion transposable elements (all) overlap out of total overlaps: 31.76% (0.3175798540928619)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Transposable Elements (_C. gigas_ only)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Total number of overlaps:"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1314444"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"636270 + 245500 + 60195 + 372479"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exons:"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.4840601805782521"
]
},
"execution_count": 132,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"636270/1314444"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Introns:"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.18677098453794913"
]
},
"execution_count": 133,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"245500/1314444"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"mRNA:"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.04579502816399938"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"60195/1314444"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transposable Elements (_C. gigas_ only):"
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.2833738067197994"
]
},
"execution_count": 135,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"372479/1314444"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Proportion exon overlap out of total overlaps: 48.41% (0.4840601805782521)\n",
"- Proportion intron overlap out of total overlaps: 18.68% (0.18677098453794913)\n",
"- Proportion mRNA overlap out of total overlaps: 4.58% (0.04579502816399938)\n",
"- Proportion transposable elements (Cg) overlap out of total overlaps: 28.34% (0.2833738067197994)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5e. DML Overlaps with Feature Tracks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exons:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.6158730158730159"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"388/630"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Introns:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.31746031746031744"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"200/630"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"mRNA:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9142857142857143"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"576/630"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transposable Elements (all):"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.09682539682539683"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"61/630"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transposable elements (_C. gigas_ only):"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.06349206349206349"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"40/630"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Proportion exon overlap with DMLs: 61.59% (0.6158730158730159)\n",
"- Proportion intron overlap with DMLs: 31.75% (0.31746031746031744)\n",
"- Proportion mRNA overlap with DMLs: 91.43% (0.9142857142857143)\n",
"- Proportion transposable element (all) overlap with DMLs: 9.68% (0.09682539682539683)\n",
"- Proportion transposable element (_C. gigas_) overlap with DMLs: 6.35% (0.06349206349206349)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### 5f. DMR Overlaps with Feature Tracks"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Exons:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.3950617283950617"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"64/162"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Introns:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.691358024691358"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"112/162"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"mRNA:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.8580246913580247"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"139/162"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transposable Elements (all):"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.24074074074074073"
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"39/162"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transposable Elements (_C. gigas_ only):"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.1419753086419753"
]
},
"execution_count": 143,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"23/162"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"- Proportion exon overlap with DMRs: 39.51% (0.3950617283950617)\n",
"- Proportion intron overlap with DMRs: 69.14% (0.691358024691358)\n",
"- Proportion mRNA overlap with DMRs: 85.80% (0.8580246913580247)\n",
"- Proportion transposable element (all) overlap with DMRs: 24.07% (0.24074074074074073)\n",
"- Proportion transposable element (_C. gigas_ only) overlap with DMRs: 14.20% (0.1419753086419753)"
]
},
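{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a side-by-side view of sections 5e and 5f, the percentages can be recomputed from the raw counts. This is a rough sketch: `percent_of_total`, `dml_counts`, and `dmr_counts` are hypothetical names introduced here, while the counts and totals (630 DML, 162 DMR) are the ones used in the cells above. The tracks are not mutually exclusive (mRNA contains both exons and introns), so the percentages are not expected to sum to 100."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Hypothetical helper: percentage of all DML (or DMR) that overlap each feature track\n",
"def percent_of_total(counts, total):\n",
"    return {track: round(100 * n / total, 2) for track, n in counts.items()}\n",
"\n",
"# Overlap counts copied from sections 5e (DML) and 5f (DMR)\n",
"dml_counts = {'exon': 388, 'intron': 200, 'mRNA': 576, 'TE (all)': 61, 'TE (Cg)': 40}\n",
"dmr_counts = {'exon': 64, 'intron': 112, 'mRNA': 139, 'TE (all)': 39, 'TE (Cg)': 23}\n",
"\n",
"dml_percent = percent_of_total(dml_counts, 630)\n",
"dmr_percent = percent_of_total(dmr_counts, 162)\n",
"for track in dml_counts:\n",
"    print(f'{track}: DML {dml_percent[track]}%, DMR {dmr_percent[track]}%')"
]
},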
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## 6. Gene Flanking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I will perform a flanking analysis in two ways. First, I will use `bedtools flank` to add 1000 bp regions to each mRNA coding region. I can then isolate these flanks and intersect them with various genomic feature files. Second, I will use `bedtools closest` to find the closest non-overlapping DML or DMR to each mRNA coding region."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"mkdir 2018-11-14-Flanking-Analysis #Create a new directory for flanking analysis output"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2018-11-07-DML-Exon.txt\r\n",
"2018-11-07-DML-Intron.txt\r\n",
"2018-11-07-DML-TE-Cg.txt\r\n",
"2018-11-07-DML-TE-all.txt\r\n",
"2018-11-07-DML-mRNA.txt\r\n",
"2018-11-07-DMR-Exon.txt\r\n",
"2018-11-07-DMR-Intron.txt\r\n",
"2018-11-07-DMR-TE-Cg.txt\r\n",
"2018-11-07-DMR-TE-all.txt\r\n",
"2018-11-07-DMR-mRNA.txt\r\n",
"2018-11-07-Exon-CGmotif.txt\r\n",
"2018-11-07-Exon-TE-Cg.txt\r\n",
"2018-11-07-Exon-TE-all.txt\r\n",
"2018-11-07-Intron-CGmotif.txt\r\n",
"2018-11-07-Intron-TE-Cg.txt\r\n",
"2018-11-07-Intron-TE-all.txt\r\n",
"2018-11-07-TE-Cg-CGmotif.txt\r\n",
"2018-11-07-TE-all-CGmotif.txt\r\n",
"2018-11-07-Unique-Genes-in-DML-mRNA-Overlap.txt\r\n",
"2018-11-07-Unique-Genes-in-DMR-mRNA-Overlap.txt\r\n",
"2018-11-07-mRNA-CGmotif.txt\r\n",
"2018-11-07-mRNA-TE-Cg.txt\r\n",
"2018-11-07-mRNA-TE-all.txt\r\n",
"\u001b[34m2018-11-14-Flanking-Analysis\u001b[m\u001b[m/\r\n",
"C_virginica-3.0_CG-motif.bed\r\n",
"C_virginica-3.0_Gnomon_exon.bed\r\n",
"C_virginica-3.0_Gnomon_mRNA.gff3\r\n",
"C_virginica-3.0_TE-Cg.gff\r\n",
"C_virginica-3.0_TE-all.gff\r\n",
"C_virginica-3.0_intron.bed\r\n"
]
}
],
"source": [
"ls -F"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6a. `flank`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I also need to know if DMLs and CG motifs overlap with regions that flank mRNA. These flanking regions could contain promoters or transcription factor binding sites that regulate transcription of these genes. To do this, I will use `bedtools flank`:\n",
"\n",
"1. Path to `flankBed`\n",
"2. -i: Path to mRNA GFF file\n",
"3. -g: Path to _C. virginica_ \"genome\" file. `flankBed` requires the length of each chromosome so that flanks do not run past chromosome ends (see this issue). I created this file in TextWrangler using chromosome lengths from NCBI.\n",
"4. -b 1000: Add 1000 bp flanks to each end of the coding region"
]
},
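{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the `-b 1000` option concrete, here is a rough pure-Python sketch of the flank coordinates for a single 1-based, inclusive GFF feature. The function `flank_both_sides` and the `chrom_length` value are hypothetical and only for illustration. The clipping at chromosome ends reflects my understanding of why the genome file is needed; it is not a reimplementation of bedtools."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Hypothetical illustration of flank coordinates on both sides of one feature\n",
"def flank_both_sides(start, end, size, chrom_length):\n",
"    flanks = []\n",
"    # Flank before the feature, clipped so it never starts before position 1\n",
"    left_start = max(1, start - size)\n",
"    if left_start < start:\n",
"        flanks.append((left_start, start - 1))\n",
"    # Flank after the feature, clipped so it never runs past the chromosome end\n",
"    right_end = min(chrom_length, end + size)\n",
"    if right_end > end:\n",
"        flanks.append((end + 1, right_end))\n",
"    return flanks\n",
"\n",
"# rna1 spans 28961-33324 on NC_035780.1; chrom_length is a placeholder, not the real length\n",
"rna1_flanks = flank_both_sides(28961, 33324, 1000, chrom_length=1000000)\n",
"print(rna1_flanks)"
]
},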
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}flankBed \\\n",
"-i {mRNAList} \\\n",
"-g 2018-11-14-Flanking-Analysis/2018-11-14-bedtools-Chromosome-Length.txt \\\n",
"-b 1000 \\\n",
"> 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-1000bp-Flanks.bed"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t43111\t66897\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t43111\t46506\t.\t-\t.\tID=rna3;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;Name=XM_022447333.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 14 samples with support for all annotated introns;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t85606\t95254\t.\t-\t.\tID=rna4;Parent=gene3;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;Name=XM_022449924.1;gbkey=mRNA;gene=LOC111112434;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t99840\t106460\t.\t+\t.\tID=rna5;Parent=gene4;Dbxref=GeneID:111120752,Genbank:XM_022461698.1;Name=XM_022461698.1;gbkey=mRNA;gene=LOC111120752;model_evidence=Supporting evidence includes similarity to: 10 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 27 samples with support for all annotated introns;product=ribulose-phosphate 3-epimerase-like;transcript_id=XM_022461698.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t108305\t110077\t.\t-\t.\tID=rna6;Parent=gene5;Dbxref=GeneID:111128944,Genbank:XM_022474921.1;Name=XM_022474921.1;gbkey=mRNA;gene=LOC111128944;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 93%25 coverage of the annotated genomic feature by RNAseq alignments;partial=true;product=mucin-19-like;start_range=.,108305;transcript_id=XM_022474921.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t151859\t157536\t.\t+\t.\tID=rna7;Parent=gene6;Dbxref=GeneID:111128953,Genbank:XM_022474931.1;Name=XM_022474931.1;gbkey=mRNA;gene=LOC111128953;model_evidence=Supporting evidence includes similarity to: 1 Protein;product=GATA zinc finger domain-containing protein 14-like;transcript_id=XM_022474931.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t163809\t183798\t.\t-\t.\tID=rna8;Parent=gene7;Dbxref=GeneID:111105691,Genbank:XM_022440054.1;Name=XM_022440054.1;gbkey=mRNA;gene=LOC111105691;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 9 samples with support for all annotated introns;product=uncharacterized LOC111105691;transcript_id=XM_022440054.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t164820\t166793\t.\t+\t.\tID=rna9;Parent=gene8;Dbxref=GeneID:111105685,Genbank:XM_022440042.1;Name=XM_022440042.1;gbkey=mRNA;gene=LOC111105685;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 4 samples with support for all annotated introns;product=protein ANTAGONIST OF LIKE HETEROCHROMATIN PROTEIN 1-like;transcript_id=XM_022440042.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t190449\t193594\t.\t-\t.\tID=rna11;Parent=gene10;Dbxref=GeneID:111133554,Genbank:XM_022482070.1;Name=XM_022482070.1;gbkey=mRNA;gene=LOC111133554;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=putative uncharacterized protein DDB_G0277407;transcript_id=XM_022482070.1\r\n"
]
}
],
"source": [
"!head {mRNAList} #The original file, just for comparison"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t27961\t28960\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t33325\t34324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t42111\t43110\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t66898\t67897\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t42111\t43110\t.\t-\t.\tID=rna3;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;Name=XM_022447333.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 14 samples with support for all annotated introns;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t46507\t47506\t.\t-\t.\tID=rna3;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;Name=XM_022447333.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 14 samples with support for all annotated introns;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t84606\t85605\t.\t-\t.\tID=rna4;Parent=gene3;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;Name=XM_022449924.1;gbkey=mRNA;gene=LOC111112434;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t95255\t96254\t.\t-\t.\tID=rna4;Parent=gene3;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;Name=XM_022449924.1;gbkey=mRNA;gene=LOC111112434;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t98840\t99839\t.\t+\t.\tID=rna5;Parent=gene4;Dbxref=GeneID:111120752,Genbank:XM_022461698.1;Name=XM_022461698.1;gbkey=mRNA;gene=LOC111120752;model_evidence=Supporting evidence includes similarity to: 10 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 27 samples with support for all annotated introns;product=ribulose-phosphate 3-epimerase-like;transcript_id=XM_022461698.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t106461\t107460\t.\t+\t.\tID=rna5;Parent=gene4;Dbxref=GeneID:111120752,Genbank:XM_022461698.1;Name=XM_022461698.1;gbkey=mRNA;gene=LOC111120752;model_evidence=Supporting evidence includes similarity to: 10 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 27 samples with support for all annotated introns;product=ribulose-phosphate 3-epimerase-like;transcript_id=XM_022461698.1\r\n"
]
}
],
"source": [
"!head 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-1000bp-Flanks.bed #Isolated flanks. The first entry is the upstream flank for the first mRNA coding region, second is the downstream flank for the mRNA coding region, etc."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 120400 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-1000bp-Flanks.bed\r\n"
]
}
],
"source": [
"!wc -l 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-1000bp-Flanks.bed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
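"Now that I have these flanks, I want to separate the upstream flanks from the downstream flanks. I will do this using `awk`. If the row number is odd, the row goes into the upstream flank file. If the row number is even, it goes into the downstream flank file. Note that the rows are ordered by genomic coordinate, so \"upstream\" here means the lower-coordinate flank; for mRNA on the `-` strand, that flank lies at the 3' end of the transcript."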
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!awk '{ if (NR%2) print > \"2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Upstream-Flanks.bed\"; \\\n",
"else print > \"2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Downstream-Flanks.bed\" }' \\\n",
"2018-11-14-Flanking-Analysis/2018-11-14-mRNA-1000bp-Flanks.bed"
]
},
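{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a cross-check on the `awk` step, here is a minimal Python sketch of the same odd/even split. It is an illustration only, not part of the original pipeline: the function name `split_alternating` and the toy row labels are hypothetical, and it assumes every mRNA contributes exactly two consecutive rows to the combined flank file.\n",
"\n",
"```python\n",
"def split_alternating(rows):\n",
"    # awk's NR is 1-based, so odd rows (NR % 2 == 1) sit at even list indices.\n",
"    odd_rows = rows[0::2]   # rows 1, 3, 5, ...\n",
"    even_rows = rows[1::2]  # rows 2, 4, 6, ...\n",
"    return odd_rows, even_rows\n",
"\n",
"\n",
"# Hypothetical stand-ins for lines of the combined flank file.\n",
"rows = ['rna1_first', 'rna1_second', 'rna2_first', 'rna2_second']\n",
"first_flanks, second_flanks = split_alternating(rows)\n",
"print(first_flanks)\n",
"print(second_flanks)\n",
"```\n",
"\n",
"The split only pairs rows correctly if no mRNA is missing a flank. The line counts in this section (120400 rows in the combined file, 60200 in each split file) are consistent with that assumption."
]
},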
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Upstream flanks"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t27961\t28960\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t42111\t43110\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t42111\t43110\t.\t-\t.\tID=rna3;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;Name=XM_022447333.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 14 samples with support for all annotated introns;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t84606\t85605\t.\t-\t.\tID=rna4;Parent=gene3;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;Name=XM_022449924.1;gbkey=mRNA;gene=LOC111112434;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t98840\t99839\t.\t+\t.\tID=rna5;Parent=gene4;Dbxref=GeneID:111120752,Genbank:XM_022461698.1;Name=XM_022461698.1;gbkey=mRNA;gene=LOC111120752;model_evidence=Supporting evidence includes similarity to: 10 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 27 samples with support for all annotated introns;product=ribulose-phosphate 3-epimerase-like;transcript_id=XM_022461698.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t107305\t108304\t.\t-\t.\tID=rna6;Parent=gene5;Dbxref=GeneID:111128944,Genbank:XM_022474921.1;Name=XM_022474921.1;gbkey=mRNA;gene=LOC111128944;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 93%25 coverage of the annotated genomic feature by RNAseq alignments;partial=true;product=mucin-19-like;start_range=.,108305;transcript_id=XM_022474921.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t150859\t151858\t.\t+\t.\tID=rna7;Parent=gene6;Dbxref=GeneID:111128953,Genbank:XM_022474931.1;Name=XM_022474931.1;gbkey=mRNA;gene=LOC111128953;model_evidence=Supporting evidence includes similarity to: 1 Protein;product=GATA zinc finger domain-containing protein 14-like;transcript_id=XM_022474931.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t162809\t163808\t.\t-\t.\tID=rna8;Parent=gene7;Dbxref=GeneID:111105691,Genbank:XM_022440054.1;Name=XM_022440054.1;gbkey=mRNA;gene=LOC111105691;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 9 samples with support for all annotated introns;product=uncharacterized LOC111105691;transcript_id=XM_022440054.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t163820\t164819\t.\t+\t.\tID=rna9;Parent=gene8;Dbxref=GeneID:111105685,Genbank:XM_022440042.1;Name=XM_022440042.1;gbkey=mRNA;gene=LOC111105685;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 4 samples with support for all annotated introns;product=protein ANTAGONIST OF LIKE HETEROCHROMATIN PROTEIN 1-like;transcript_id=XM_022440042.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t189449\t190448\t.\t-\t.\tID=rna11;Parent=gene10;Dbxref=GeneID:111133554,Genbank:XM_022482070.1;Name=XM_022482070.1;gbkey=mRNA;gene=LOC111133554;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=putative uncharacterized protein DDB_G0277407;transcript_id=XM_022482070.1\r\n"
]
}
],
"source": [
"!head 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Upstream-Flanks.bed"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 60200 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Upstream-Flanks.bed\r\n"
]
}
],
"source": [
"!wc -l 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Upstream-Flanks.bed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Downstream flanks"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t33325\t34324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t66898\t67897\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t46507\t47506\t.\t-\t.\tID=rna3;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;Name=XM_022447333.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 14 samples with support for all annotated introns;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t95255\t96254\t.\t-\t.\tID=rna4;Parent=gene3;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;Name=XM_022449924.1;gbkey=mRNA;gene=LOC111112434;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t106461\t107460\t.\t+\t.\tID=rna5;Parent=gene4;Dbxref=GeneID:111120752,Genbank:XM_022461698.1;Name=XM_022461698.1;gbkey=mRNA;gene=LOC111120752;model_evidence=Supporting evidence includes similarity to: 10 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 27 samples with support for all annotated introns;product=ribulose-phosphate 3-epimerase-like;transcript_id=XM_022461698.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t110078\t111077\t.\t-\t.\tID=rna6;Parent=gene5;Dbxref=GeneID:111128944,Genbank:XM_022474921.1;Name=XM_022474921.1;gbkey=mRNA;gene=LOC111128944;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 93%25 coverage of the annotated genomic feature by RNAseq alignments;partial=true;product=mucin-19-like;start_range=.,108305;transcript_id=XM_022474921.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t157537\t158536\t.\t+\t.\tID=rna7;Parent=gene6;Dbxref=GeneID:111128953,Genbank:XM_022474931.1;Name=XM_022474931.1;gbkey=mRNA;gene=LOC111128953;model_evidence=Supporting evidence includes similarity to: 1 Protein;product=GATA zinc finger domain-containing protein 14-like;transcript_id=XM_022474931.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t183799\t184798\t.\t-\t.\tID=rna8;Parent=gene7;Dbxref=GeneID:111105691,Genbank:XM_022440054.1;Name=XM_022440054.1;gbkey=mRNA;gene=LOC111105691;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 9 samples with support for all annotated introns;product=uncharacterized LOC111105691;transcript_id=XM_022440054.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t166794\t167793\t.\t+\t.\tID=rna9;Parent=gene8;Dbxref=GeneID:111105685,Genbank:XM_022440042.1;Name=XM_022440042.1;gbkey=mRNA;gene=LOC111105685;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 4 samples with support for all annotated introns;product=protein ANTAGONIST OF LIKE HETEROCHROMATIN PROTEIN 1-like;transcript_id=XM_022440042.1\r\n",
"NC_035780.1\tGnomon\tmRNA\t193595\t194594\t.\t-\t.\tID=rna11;Parent=gene10;Dbxref=GeneID:111133554,Genbank:XM_022482070.1;Name=XM_022482070.1;gbkey=mRNA;gene=LOC111133554;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=putative uncharacterized protein DDB_G0277407;transcript_id=XM_022482070.1\r\n"
]
}
],
"source": [
"!head 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Downstream-Flanks.bed"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 60200 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Downstream-Flanks.bed\r\n"
]
}
],
"source": [
"!wc -l 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Downstream-Flanks.bed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now I'll take the upstream and downstream flank BEDfiles I made and use them with intersectBed to find overlaps with DML, DMR, and CG motifs!\n",
"\n",
"1. Path to intersectBed\n",
"2. -wb: Write the original entry from the second (-b) file for each overlap\n",
"3. -a: Path to BEDfile created with flanks\n",
"4. -b: Specify either DML, DMR, or CG motif file. Overlaps between the flanks and CG motifs can be used as a background when comparing DML-flank and DMR-flank results\n",
"5. \">\" filename: Redirect output to a .txt file"
]
},
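{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the shape of the `-wb` output concrete, here is a minimal Python sketch of the overlap logic. It is not how intersectBed is implemented and is not part of the original analysis: the function name `intersect_wb` and the flank coordinates are hypothetical, and intervals are assumed to be 0-based and half-open. For each overlapping pair it reports the overlapping portion of the `-a` feature followed by the original `-b` entry.\n",
"\n",
"```python\n",
"def intersect_wb(a_features, b_features):\n",
"    # Each feature is a (chrom, start, end) tuple, 0-based and half-open.\n",
"    hits = []\n",
"    for a_chrom, a_start, a_end in a_features:\n",
"        for b_chrom, b_start, b_end in b_features:\n",
"            if a_chrom != b_chrom:\n",
"                continue\n",
"            start = max(a_start, b_start)\n",
"            end = min(a_end, b_end)\n",
"            if start < end:  # non-empty overlap\n",
"                # Overlapping part of A, then the untouched B entry.\n",
"                hits.append(((a_chrom, start, end), (b_chrom, b_start, b_end)))\n",
"    return hits\n",
"\n",
"\n",
"# Hypothetical flank interval; the second locus is made up and does not overlap.\n",
"flanks = [('NC_035780.1', 8832500, 8833500)]\n",
"loci = [('NC_035780.1', 8833124, 8833126), ('NC_035780.1', 9000000, 9000002)]\n",
"for overlap, original_b in intersect_wb(flanks, loci):\n",
"    print(overlap, original_b)\n",
"```\n",
"\n",
"This nested loop is quadratic and only suitable for toy inputs. One flank can appear on several output rows, once per overlapping `-b` entry, so the row counts from `wc -l` count flank-feature pairs rather than unique flanks."
]
},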
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"#### DML"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Upstream-Flanks.bed \\\n",
"-b {DMLlist} \\\n",
"> 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DML.txt"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t8833125\t8833126\t.\t+\t.\tID=rna875;Parent=gene522;Dbxref=GeneID:111138488,Genbank:XM_022490485.1;Name=XM_022490485.1;gbkey=mRNA;gene=LOC111138488;model_evidence=Supporting evidence includes similarity to: 5 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 26 samples with support for all annotated introns;product=hsp70-binding protein 1-like;transcript_id=XM_022490485.1\tNC_035780.1\t8833124\t8833126\t60\r\n",
"NC_035780.1\tGnomon\tmRNA\t55484706\t55484707\t.\t-\t.\tID=rna5598;Parent=gene3286;Dbxref=GeneID:111102644,Genbank:XM_022435478.1;Name=XM_022435478.1;gbkey=mRNA;gene=LOC111102644;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 14 samples with support for all annotated introns;product=histone-lysine N-methyltransferase KMT5B-like%2C transcript variant X2;transcript_id=XM_022435478.1\tNC_035780.1\t55484705\t55484707\t-52\r\n",
"NC_035780.1\tGnomon\tmRNA\t55484706\t55484707\t.\t-\t.\tID=rna5599;Parent=gene3286;Dbxref=GeneID:111102644,Genbank:XM_022435471.1;Name=XM_022435471.1;gbkey=mRNA;gene=LOC111102644;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 19 samples with support for all annotated introns;product=histone-lysine N-methyltransferase KMT5B-like%2C transcript variant X1;transcript_id=XM_022435471.1\tNC_035780.1\t55484705\t55484707\t-52\r\n",
"NC_035780.1\tGnomon\tmRNA\t58135768\t58135769\t.\t-\t.\tID=rna5867;Parent=gene3472;Dbxref=GeneID:111135499,Genbank:XM_022485627.1;Name=XM_022485627.1;gbkey=mRNA;gene=LOC111135499;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 23 samples with support for all annotated introns;product=pre-mRNA-processing factor 6-like;transcript_id=XM_022485627.1\tNC_035780.1\t58135767\t58135769\t74\r\n",
"NC_035781.1\tGnomon\tmRNA\t7626511\t7626512\t.\t+\t.\tID=rna7537;Parent=gene4444;Dbxref=GeneID:111120066,Genbank:XM_022460716.1;Name=XM_022460716.1;gbkey=mRNA;gene=LOC111120066;model_evidence=Supporting evidence includes similarity to: 4 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 15 samples with support for all annotated introns;product=short-chain dehydrogenase/reductase family 42E member 1-like%2C transcript variant X2;transcript_id=XM_022460716.1\tNC_035781.1\t7626510\t7626512\t-56\r\n",
"NC_035781.1\tGnomon\tmRNA\t7626511\t7626512\t.\t+\t.\tID=rna7538;Parent=gene4444;Dbxref=GeneID:111120066,Genbank:XM_022460717.1;Name=XM_022460717.1;gbkey=mRNA;gene=LOC111120066;model_evidence=Supporting evidence includes similarity to: 4 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 16 samples with support for all annotated introns;product=short-chain dehydrogenase/reductase family 42E member 1-like%2C transcript variant X3;transcript_id=XM_022460717.1\tNC_035781.1\t7626510\t7626512\t-56\r\n",
"NC_035781.1\tGnomon\tmRNA\t30789624\t30789625\t.\t-\t.\tID=rna10183;Parent=gene5990;Dbxref=GeneID:111117669,Genbank:XM_022456835.1;Name=XM_022456835.1;gbkey=mRNA;gene=LOC111117669;model_evidence=Supporting evidence includes similarity to: 2 ESTs%2C 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 24 samples with support for all annotated introns;product=beta-lactamase domain-containing protein 2-like;transcript_id=XM_022456835.1\tNC_035781.1\t30789623\t30789625\t-57\r\n",
"NC_035781.1\tGnomon\tmRNA\t31150011\t31150012\t.\t-\t.\tID=rna10243;Parent=gene6019;Dbxref=GeneID:111119728,Genbank:XM_022460170.1;Name=XM_022460170.1;gbkey=mRNA;gene=LOC111119728;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 7 samples with support for all annotated introns;product=uncharacterized LOC111119728%2C transcript variant X1;transcript_id=XM_022460170.1\tNC_035781.1\t31150010\t31150012\t53\r\n",
"NC_035782.1\tGnomon\tmRNA\t4729349\t4729350\t.\t+\t.\tID=rna13975;Parent=gene8312;Dbxref=GeneID:111123849,Genbank:XM_022466485.1;Name=XM_022466485.1;gbkey=mRNA;gene=LOC111123849;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;product=tripartite motif-containing protein 2-like%2C transcript variant X9;transcript_id=XM_022466485.1\tNC_035782.1\t4729348\t4729350\t55\r\n",
"NC_035782.1\tGnomon\tmRNA\t4729349\t4729350\t.\t+\t.\tID=rna13976;Parent=gene8312;Dbxref=GeneID:111123849,Genbank:XM_022466484.1;Name=XM_022466484.1;gbkey=mRNA;gene=LOC111123849;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 14 samples with support for all annotated introns;product=tripartite motif-containing protein 2-like%2C transcript variant X8;transcript_id=XM_022466484.1\tNC_035782.1\t4729348\t4729350\t55\r\n"
]
}
],
"source": [
"!head 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DML.txt"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 67 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DML.txt\r\n"
]
}
],
"source": [
"!wc -l 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DML.txt"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Downstream-Flanks.bed \\\n",
"-b {DMLlist} \\\n",
"> 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DML.txt"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t1882692\t1882693\t.\t+\t.\tID=rna154;Parent=gene94;Dbxref=GeneID:111102439,Genbank:XM_022435187.1;Name=XM_022435187.1;gbkey=mRNA;gene=LOC111102439;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=uncharacterized LOC111102439;transcript_id=XM_022435187.1\tNC_035780.1\t1882691\t1882693\t64\r\n",
"NC_035781.1\tGnomon\tmRNA\t20126030\t20126031\t.\t-\t.\tID=rna9080;Parent=gene5340;Dbxref=GeneID:111122222,Genbank:XM_022463862.1;Name=XM_022463862.1;gbkey=mRNA;gene=LOC111122222;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 18 samples with support for all annotated introns;product=phosphatidylinositol N-acetylglucosaminyltransferase subunit Q-like;transcript_id=XM_022463862.1\tNC_035781.1\t20126029\t20126031\t-52\r\n",
"NC_035781.1\tGnomon\tmRNA\t28992819\t28992820\t.\t-\t.\tID=rna9957;Parent=gene5845;Dbxref=GeneID:111119810,Genbank:XM_022460310.1;Name=XM_022460310.1;gbkey=mRNA;gene=LOC111119810;model_evidence=Supporting evidence includes similarity to: 20 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 12 samples with support for all annotated introns;product=ATP-dependent Clp protease proteolytic subunit-like;transcript_id=XM_022460310.1\tNC_035781.1\t28992818\t28992820\t52\r\n",
"NC_035781.1\tGnomon\tmRNA\t30062223\t30062224\t.\t+\t.\tID=rna10095;Parent=gene5931;Dbxref=GeneID:111122164,Genbank:XM_022463767.1;Name=XM_022463767.1;gbkey=mRNA;gene=LOC111122164;model_evidence=Supporting evidence includes similarity to: 1 EST%2C 5 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 27 samples with support for all annotated introns;product=serine/threonine-protein kinase STK11-like;transcript_id=XM_022463767.1\tNC_035781.1\t30062222\t30062224\t60\r\n",
"NC_035781.1\tGnomon\tmRNA\t31150011\t31150012\t.\t+\t.\tID=rna10236;Parent=gene6018;Dbxref=GeneID:111119727,Genbank:XM_022460166.1;Name=XM_022460166.1;gbkey=mRNA;gene=LOC111119727;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 14 samples with support for all annotated introns;product=TATA element modulatory factor-like%2C transcript variant X3;transcript_id=XM_022460166.1\tNC_035781.1\t31150010\t31150012\t53\r\n",
"NC_035781.1\tGnomon\tmRNA\t31150011\t31150012\t.\t+\t.\tID=rna10237;Parent=gene6018;Dbxref=GeneID:111119727,Genbank:XM_022460164.1;Name=XM_022460164.1;gbkey=mRNA;gene=LOC111119727;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 20 samples with support for all annotated introns;product=TATA element modulatory factor-like%2C transcript variant X1;transcript_id=XM_022460164.1\tNC_035781.1\t31150010\t31150012\t53\r\n",
"NC_035781.1\tGnomon\tmRNA\t31150011\t31150012\t.\t+\t.\tID=rna10238;Parent=gene6018;Dbxref=GeneID:111119727,Genbank:XM_022460169.1;Name=XM_022460169.1;gbkey=mRNA;gene=LOC111119727;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 20 samples with support for all annotated introns;product=TATA element modulatory factor-like%2C transcript variant X6;transcript_id=XM_022460169.1\tNC_035781.1\t31150010\t31150012\t53\r\n",
"NC_035781.1\tGnomon\tmRNA\t31150011\t31150012\t.\t+\t.\tID=rna10239;Parent=gene6018;Dbxref=GeneID:111119727,Genbank:XM_022460165.1;Name=XM_022460165.1;gbkey=mRNA;gene=LOC111119727;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 8 samples with support for all annotated introns;product=TATA element modulatory factor-like%2C transcript variant X2;transcript_id=XM_022460165.1\tNC_035781.1\t31150010\t31150012\t53\r\n",
"NC_035781.1\tGnomon\tmRNA\t31150011\t31150012\t.\t+\t.\tID=rna10240;Parent=gene6018;Dbxref=GeneID:111119727,Genbank:XM_022460168.1;Name=XM_022460168.1;gbkey=mRNA;gene=LOC111119727;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 15 samples with support for all annotated introns;product=TATA element modulatory factor-like%2C transcript variant X5;transcript_id=XM_022460168.1\tNC_035781.1\t31150010\t31150012\t53\r\n",
"NC_035781.1\tGnomon\tmRNA\t31150011\t31150012\t.\t+\t.\tID=rna10241;Parent=gene6018;Dbxref=GeneID:111119727,Genbank:XM_022460167.1;Name=XM_022460167.1;gbkey=mRNA;gene=LOC111119727;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 6 samples with support for all annotated introns;product=TATA element modulatory factor-like%2C transcript variant X4;transcript_id=XM_022460167.1\tNC_035781.1\t31150010\t31150012\t53\r\n"
]
}
],
"source": [
"!head 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DML.txt"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 49 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DML.txt\r\n"
]
}
],
"source": [
"!wc -l 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DML.txt"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"#### DMR"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Upstream-Flanks.bed \\\n",
"-b {DMRlist} \\\n",
"> 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DMR.txt"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t46044701\t46044800\t.\t-\t.\tID=rna4633;Parent=gene2726;Dbxref=GeneID:111110504,Genbank:XM_022447031.1;Name=XM_022447031.1;gbkey=mRNA;gene=LOC111110504;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111110504%2C transcript variant X1;transcript_id=XM_022447031.1\tNC_035780.1\t46044700\t46044800\tDMR\t-53\r\n",
"NC_035780.1\tGnomon\tmRNA\t46044701\t46044800\t.\t-\t.\tID=rna4634;Parent=gene2726;Dbxref=GeneID:111110504,Genbank:XM_022447040.1;Name=XM_022447040.1;gbkey=mRNA;gene=LOC111110504;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111110504%2C transcript variant X2;transcript_id=XM_022447040.1\tNC_035780.1\t46044700\t46044800\tDMR\t-53\r\n",
"NC_035781.1\tGnomon\tmRNA\t2305401\t2305500\t.\t-\t.\tID=rna6973;Parent=gene4134;Dbxref=GeneID:111119708,Genbank:XM_022460139.1;Name=XM_022460139.1;gbkey=mRNA;gene=LOC111119708;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 11 samples with support for all annotated introns;product=tRNA methyltransferase 10 homolog A-like%2C transcript variant X3;transcript_id=XM_022460139.1\tNC_035781.1\t2305400\t2305500\tDMR\t-52\r\n",
"NC_035781.1\tGnomon\tmRNA\t2305401\t2305500\t.\t-\t.\tID=rna6974;Parent=gene4134;Dbxref=GeneID:111119708,Genbank:XM_022460137.1;Name=XM_022460137.1;gbkey=mRNA;gene=LOC111119708;model_evidence=Supporting evidence includes similarity to: 5 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=tRNA methyltransferase 10 homolog A-like%2C transcript variant X1;transcript_id=XM_022460137.1\tNC_035781.1\t2305400\t2305500\tDMR\t-52\r\n",
"NC_035781.1\tGnomon\tmRNA\t2305401\t2305500\t.\t-\t.\tID=rna6975;Parent=gene4134;Dbxref=GeneID:111119708,Genbank:XM_022460138.1;Name=XM_022460138.1;gbkey=mRNA;gene=LOC111119708;model_evidence=Supporting evidence includes similarity to: 5 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=tRNA methyltransferase 10 homolog A-like%2C transcript variant X2;transcript_id=XM_022460138.1\tNC_035781.1\t2305400\t2305500\tDMR\t-52\r\n",
"NC_035781.1\tGnomon\tmRNA\t7626501\t7626600\t.\t+\t.\tID=rna7537;Parent=gene4444;Dbxref=GeneID:111120066,Genbank:XM_022460716.1;Name=XM_022460716.1;gbkey=mRNA;gene=LOC111120066;model_evidence=Supporting evidence includes similarity to: 4 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 15 samples with support for all annotated introns;product=short-chain dehydrogenase/reductase family 42E member 1-like%2C transcript variant X2;transcript_id=XM_022460716.1\tNC_035781.1\t7626500\t7626600\tDMR\t-58\r\n",
"NC_035781.1\tGnomon\tmRNA\t7626501\t7626600\t.\t+\t.\tID=rna7538;Parent=gene4444;Dbxref=GeneID:111120066,Genbank:XM_022460717.1;Name=XM_022460717.1;gbkey=mRNA;gene=LOC111120066;model_evidence=Supporting evidence includes similarity to: 4 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 16 samples with support for all annotated introns;product=short-chain dehydrogenase/reductase family 42E member 1-like%2C transcript variant X3;transcript_id=XM_022460717.1\tNC_035781.1\t7626500\t7626600\tDMR\t-58\r\n",
"NC_035781.1\tGnomon\tmRNA\t30789601\t30789700\t.\t-\t.\tID=rna10183;Parent=gene5990;Dbxref=GeneID:111117669,Genbank:XM_022456835.1;Name=XM_022456835.1;gbkey=mRNA;gene=LOC111117669;model_evidence=Supporting evidence includes similarity to: 2 ESTs%2C 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 24 samples with support for all annotated introns;product=beta-lactamase domain-containing protein 2-like;transcript_id=XM_022456835.1\tNC_035781.1\t30789600\t30789700\tDMR\t-55\r\n",
"NC_035782.1\tGnomon\tmRNA\t60004801\t60004900\t.\t+\t.\tID=rna20171;Parent=gene11656;Dbxref=GeneID:111123863,Genbank:XM_022466501.1;Name=XM_022466501.1;gbkey=mRNA;gene=LOC111123863;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=multidrug resistance-associated protein 5-like;transcript_id=XM_022466501.1\tNC_035782.1\t60004800\t60004900\tDMR\t-55\r\n",
"NC_035783.1\tGnomon\tmRNA\t50620701\t50620723\t.\t+\t.\tID=rna27698;Parent=gene16060;Dbxref=GeneID:111130505,Genbank:XM_022477648.1;Name=XM_022477648.1;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon%3B deleted 2 bases in 1 codon;exception=unclassified transcription discrepancy;gbkey=mRNA;gene=LOC111130505;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 12 samples with support for all annotated introns;product=kelch-like protein 40;transcript_id=XM_022477648.1\tNC_035783.1\t50620700\t50620800\tDMR\t51\r\n"
]
}
],
"source": [
"!head 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DMR.txt"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 12 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DMR.txt\r\n"
]
}
],
"source": [
"!wc -l 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-UpstreamFlanks-DMR.txt"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Downstream-Flanks.bed \\\n",
"-b {DMRlist} \\\n",
"> 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DMR.txt"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t573701\t573800\t.\t+\t.\tID=rna48;Parent=gene35;Dbxref=GeneID:111114201,Genbank:XM_022452489.1;Name=XM_022452489.1;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 2 bases in 2 codons;exception=unclassified transcription discrepancy;gbkey=mRNA;gene=LOC111114201;model_evidence=Supporting evidence includes similarity to: 4 Proteins%2C and 99%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 9 samples with support for all annotated introns;product=vacuolar protein sorting-associated protein 13B-like;transcript_id=XM_022452489.1\tNC_035780.1\t573700\t573800\tDMR\t52\r\n",
"NC_035780.1\tGnomon\tmRNA\t33945901\t33946000\t.\t-\t.\tID=rna3547;Parent=gene2078;Dbxref=GeneID:111131167,Genbank:XM_022478582.1;Name=XM_022478582.1;gbkey=mRNA;gene=LOC111131167;model_evidence=Supporting evidence includes similarity to: 5 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=steroid 17-alpha-hydroxylase/17%2C20 lyase-like%2C transcript variant X2;transcript_id=XM_022478582.1\tNC_035780.1\t33945900\t33946000\tDMR\t-51\r\n",
"NC_035780.1\tGnomon\tmRNA\t33945901\t33946000\t.\t-\t.\tID=rna3548;Parent=gene2078;Dbxref=GeneID:111131167,Genbank:XM_022478575.1;Name=XM_022478575.1;gbkey=mRNA;gene=LOC111131167;model_evidence=Supporting evidence includes similarity to: 5 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 15 samples with support for all annotated introns;product=steroid 17-alpha-hydroxylase/17%2C20 lyase-like%2C transcript variant X1;transcript_id=XM_022478575.1\tNC_035780.1\t33945900\t33946000\tDMR\t-51\r\n",
"NC_035781.1\tGnomon\tmRNA\t54133675\t54133700\t.\t-\t.\tID=rna12752;Parent=gene7569;Dbxref=GeneID:111118541,Genbank:XM_022458056.1;Name=XM_022458056.1;gbkey=mRNA;gene=LOC111118541;model_evidence=Supporting evidence includes similarity to: 5 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 16 samples with support for all annotated introns;product=protein N-terminal asparagine amidohydrolase-like%2C transcript variant X2;transcript_id=XM_022458056.1\tNC_035781.1\t54133600\t54133700\tDMR\t51\r\n",
"NC_035782.1\tGnomon\tmRNA\t16097901\t16098000\t.\t-\t.\tID=rna15041;Parent=gene8894;Dbxref=GeneID:111124055,Genbank:XM_022466916.1;Name=XM_022466916.1;gbkey=mRNA;gene=LOC111124055;model_evidence=Supporting evidence includes similarity to: 2 Proteins;product=uncharacterized LOC111124055;transcript_id=XM_022466916.1\tNC_035782.1\t16097900\t16098000\tDMR\t51\r\n",
"NC_035783.1\tGnomon\tmRNA\t21629901\t21630000\t.\t+\t.\tID=rna24637;Parent=gene14213;Dbxref=GeneID:111128659,Genbank:XM_022474408.1;Name=XM_022474408.1;gbkey=mRNA;gene=LOC111128659;model_evidence=Supporting evidence includes similarity to: 1 EST%2C 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 25 samples with support for all annotated introns;product=carnitine O-palmitoyltransferase 1%2C liver isoform-like;transcript_id=XM_022474408.1\tNC_035783.1\t21629900\t21630000\tDMR\t-51\r\n",
"NC_035783.1\tGnomon\tmRNA\t50620701\t50620800\t.\t-\t.\tID=rna27697;Parent=gene16059;Dbxref=GeneID:111129730,Genbank:XM_022476200.1;Name=XM_022476200.1;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclassified transcription discrepancy;gbkey=mRNA;gene=LOC111129730;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 89%25 coverage of the annotated genomic feature by RNAseq alignments;product=very-long-chain (3R)-3-hydroxyacyl-CoA dehydratase-like;transcript_id=XM_022476200.1\tNC_035783.1\t50620700\t50620800\tDMR\t51\r\n",
"NC_035784.1\tGnomon\tmRNA\t5237901\t5238000\t.\t+\t.\tID=rna29389;Parent=gene17054;Dbxref=GeneID:111136491,Genbank:XM_022487374.1;Name=XM_022487374.1;gbkey=mRNA;gene=LOC111136491;model_evidence=Supporting evidence includes similarity to: 3 Proteins;product=chaperone protein DnaJ-like;transcript_id=XM_022487374.1\tNC_035784.1\t5237900\t5238000\tDMR\t-51\r\n",
"NC_035784.1\tGnomon\tmRNA\t16957124\t16957200\t.\t+\t.\tID=rna30706;Parent=gene17832;Dbxref=GeneID:111133121,Genbank:XM_022481215.1;Name=XM_022481215.1;gbkey=mRNA;gene=LOC111133121;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=uncharacterized LOC111133121%2C transcript variant X3;transcript_id=XM_022481215.1\tNC_035784.1\t16957100\t16957200\tDMR\t73\r\n",
"NC_035784.1\tGnomon\tmRNA\t80199201\t80199300\t.\t+\t.\tID=rna38137;Parent=gene21988;Dbxref=GeneID:111138341,Genbank:XM_022490259.1;Name=XM_022490259.1;gbkey=mRNA;gene=LOC111138341;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 24 samples with support for all annotated introns;product=drebrin-like protein B%2C transcript variant X1;transcript_id=XM_022490259.1\tNC_035784.1\t80199200\t80199300\tDMR\t-60\r\n"
]
}
],
"source": [
"!head 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DMR.txt"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 18 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DMR.txt\r\n"
]
}
],
"source": [
"!wc -l 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-100bp-DownstreamFlanks-DMR.txt"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"#### CG motifs"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Upstream-Flanks.bed \\\n",
"-b {CGMotifList} \\\n",
"> 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-100bp-UpstreamFlanks-CGmotif.txt"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t27970\t27971\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t27969\t27971\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t27980\t27981\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t27979\t27981\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t28082\t28083\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t28081\t28083\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t28131\t28132\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t28130\t28132\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t28148\t28149\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t28147\t28149\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t28170\t28171\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t28169\t28171\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t28210\t28211\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t28209\t28211\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t28212\t28213\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t28211\t28213\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t28229\t28230\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t28228\t28230\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t28309\t28310\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t28308\t28310\tCG_motif\r\n"
]
}
],
"source": [
"!head 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-100bp-UpstreamFlanks-CGmotif.txt"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1286944 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-100bp-UpstreamFlanks-CGmotif.txt\r\n"
]
}
],
"source": [
"!wc -l 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-100bp-UpstreamFlanks-CGmotif.txt"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-wb \\\n",
"-a 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Downstream-Flanks.bed \\\n",
"-b {CGMotifList} \\\n",
"> 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-100bp-DownstreamFlanks-CGmotif.txt"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t33408\t33409\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t33407\t33409\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t33452\t33453\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t33451\t33453\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t33481\t33482\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t33480\t33482\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t33638\t33639\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t33637\t33639\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t33647\t33648\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t33646\t33648\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t33784\t33785\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t33783\t33785\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t33797\t33798\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t33796\t33798\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t34284\t34285\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t34283\t34285\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t34311\t34312\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t34310\t34312\tCG_motif\r\n",
"NC_035780.1\tGnomon\tmRNA\t34322\t34323\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t34321\t34323\tCG_motif\r\n"
]
}
],
"source": [
"!head 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-100bp-DownstreamFlanks-CGmotif.txt"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1285210 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-100bp-DownstreamFlanks-CGmotif.txt\r\n"
]
}
],
"source": [
"!wc -l 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-100bp-DownstreamFlanks-CGmotif.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6b. No overlaps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I also want to count the DML and DMR that do not overlap with any annotated feature (i.e. DML and DMR in unannotated intergenic regions). To do this, I'll use the `-v` argument of `intersectBed`, which reports \"those entries in A that have no overlap in B.\" Multiple files can be passed to `-b`, so I'll use exons, introns, transposable elements identified using all species, and putative promoter regions (upstream flanks)."
]
},
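{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `-v` logic with several `-b` files can be sketched in plain Python. This is only an illustration of the set logic, not how `bedtools` is implemented: the function names and toy coordinates are hypothetical, and every interval is assumed to be BED-style (0-based start, end not included).\n",
"\n",
"```python\n",
"# Hypothetical helpers illustrating intersectBed -v with multiple -b tracks.\n",
"# Each feature is (chrom, start, end), assumed BED-style half-open.\n",
"\n",
"def overlaps(a, b):\n",
"    # True when both features are on the same chromosome and share at least 1 bp\n",
"    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]\n",
"\n",
"def no_overlap(a_features, *b_tracks):\n",
"    # Keep the A entries that overlap nothing in any of the B tracks\n",
"    return [a for a in a_features\n",
"            if not any(overlaps(a, b) for track in b_tracks for b in track)]\n",
"\n",
"# Toy coordinates, not taken from the C. virginica tracks\n",
"dml = [('chr1', 100, 102), ('chr1', 500, 502)]\n",
"exons = [('chr1', 90, 150)]\n",
"introns = [('chr1', 150, 400)]\n",
"\n",
"print(no_overlap(dml, exons, introns))\n",
"```\n",
"\n",
"Only the second toy DML is kept, because it falls outside both the exon and the intron. An A entry is dropped as soon as it overlaps a feature in any one of the B tracks."
]
},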
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DML"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 20\n",
"DML do not overlap with exons, introns, transposable elements (all), or putative promoters\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-v \\\n",
"-a {DMLlist} \\\n",
"-b {exonList} {intronList} {transposableElementsAll} 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Upstream-Flanks.bed \\\n",
"| wc -l\n",
"!echo \"DML do not overlap with exons, introns, transposable elements (all), or putative promoters\""
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-v \\\n",
"-a {DMLlist} \\\n",
"-b {exonList} {intronList} {transposableElementsAll} 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Upstream-Flanks.bed \\\n",
"> 2019-03-17-No-Overlap-DML.txt"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t55852212\t55852214\t58\r\n",
"NC_035780.1\t60587461\t60587463\t-74\r\n",
"NC_035781.1\t20620123\t20620125\t57\r\n",
"NC_035781.1\t24474567\t24474569\t53\r\n",
"NC_035781.1\t30062222\t30062224\t60\r\n",
"NC_035781.1\t39583208\t39583210\t-50\r\n",
"NC_035781.1\t50711254\t50711256\t-71\r\n",
"NC_035782.1\t58675230\t58675232\t52\r\n",
"NC_035782.1\t65377028\t65377030\t51\r\n",
"NC_035784.1\t2011997\t2011999\t-60\r\n"
]
}
],
"source": [
"!head 2019-03-17-No-Overlap-DML.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### CG motifs"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 4760788\n",
"CG motifs do not overlap with exons, introns, transposable elements (all), or putative promoters\n"
]
}
],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-v \\\n",
"-a {CGMotifList} \\\n",
"-b {exonList} {intronList} {transposableElementsAll} 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Upstream-Flanks.bed \\\n",
"| wc -l\n",
"!echo \"CG motifs do not overlap with exons, introns, transposable elements (all), or putative promoters\""
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"! {bedtoolsDirectory}intersectBed \\\n",
"-v \\\n",
"-a {CGMotifList} \\\n",
"-b {exonList} {intronList} {transposableElementsAll} 2018-11-14-Flanking-Analysis/2018-11-15-mRNA-Upstream-Flanks.bed \\\n",
"> 2019-03-17-No-Overlap-CGmotifs.txt"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\t28\t30\tCG_motif\r\n",
"NC_035780.1\t54\t56\tCG_motif\r\n",
"NC_035780.1\t75\t77\tCG_motif\r\n",
"NC_035780.1\t93\t95\tCG_motif\r\n",
"NC_035780.1\t103\t105\tCG_motif\r\n",
"NC_035780.1\t116\t118\tCG_motif\r\n",
"NC_035780.1\t134\t136\tCG_motif\r\n",
"NC_035780.1\t159\t161\tCG_motif\r\n",
"NC_035780.1\t209\t211\tCG_motif\r\n",
"NC_035780.1\t224\t226\tCG_motif\r\n"
]
}
],
"source": [
"!head 2019-03-17-No-Overlap-CGmotifs.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6c. `closest`\n",
"\n",
"[`bedtools closest`](https://bedtools.readthedocs.io/en/latest/content/tools/closest.html) finds the DML or DMR nearest to each mRNA coding region. By default, the nearest feature can be one that overlaps the mRNA, so I will use `-io` to restrict the search to non-overlapping features. I will use the following code:\n",
"\n",
"1. Path to `closestBed`\n",
"2. `-io`: Ignore features in B that overlap with A\n",
"3. `-a`: Path to mRNA gff\n",
"4. `-b`: Specify either the DML, DMR, or CG motif file. The CG motif file will be used as a background for the DML and DMR\n",
"5. `-t all`: In case of a tie, report all matches\n",
"6. `-D ref`: Report the distance to A in an extra column. Negative distances indicate upstream features with respect to the reference genome: B features with a lower (start, stop) are upstream.\n",
"7. \">\" filename: Redirect output to a .txt file"
]
},
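{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the roles of `-io`, `-t all`, and `-D ref` concrete, here is a minimal Python sketch of the same idea for features on a single chromosome. The function names are hypothetical and this is not the `bedtools` implementation. Coordinates are assumed to be BED-style (0-based start), so a GFF start is lowered by 1, and the `+ 1` in the distance is an assumption chosen because it reproduces the distances `closestBed` reports for this dataset.\n",
"\n",
"```python\n",
"# Hypothetical sketch of closestBed -io -t all -D ref on one chromosome.\n",
"# Features are (start, end), assumed BED-style half-open.\n",
"\n",
"def signed_distance(a, b):\n",
"    a_start, a_end = a\n",
"    b_start, b_end = b\n",
"    if a_start < b_end and b_start < a_end:\n",
"        return None                    # overlapping pair, skipped as with -io\n",
"    if b_start >= a_end:\n",
"        return b_start - a_end + 1     # B is downstream on the reference\n",
"    return -(a_start - b_end + 1)      # B is upstream on the reference\n",
"\n",
"def closest(a, b_features):\n",
"    scored = [(signed_distance(a, b), b) for b in b_features]\n",
"    scored = [(d, b) for d, b in scored if d is not None]\n",
"    if not scored:\n",
"        return []\n",
"    best = min(abs(d) for d, _ in scored)\n",
"    # Every B feature at the minimum distance is kept, as with -t all\n",
"    return [(b, d) for d, b in scored if abs(d) == best]\n",
"\n",
"mrna = (28960, 33324)                         # GFF 28961-33324, start lowered by 1\n",
"dml = [(401630, 401632), (30000, 30002)]      # second entry is a made-up overlapping locus\n",
"print(closest(mrna, dml))\n",
"```\n",
"\n",
"The overlapping locus is ignored, so the sketch returns the locus at 401630 with a distance of 368307. A locus on the lower-coordinate side of the mRNA would get a negative distance instead."
]
},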
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Error: Sorted input specified, but the file C_virginica-3.0_Gnomon_mRNA.gff3 has the following out of order record\r\n",
"NC_035780.1\tGnomon\tmRNA\t2413594\t2416601\t.\t-\t.\tID=rna199;Parent=gene122;Dbxref=GeneID:111129373,Genbank:XM_022475729.1;Name=XM_022475729.1;gbkey=mRNA;gene=LOC111129373;model_evidence=Supporting evidence includes similarity to: 2 Proteins;product=mucin-2-like;transcript_id=XM_022475729.1\r\n"
]
}
],
"source": [
"! {bedtoolsDirectory}closestBed \\\n",
"-io \\\n",
"-a {mRNAList} \\\n",
"-b {DMLlist} \\\n",
"-t all \\\n",
"-D ref \\\n",
"> 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-DMLs.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output file was created, but `closestBed` reported that the mRNA file itself is not sorted. I need to check whether this affected the output."
]
},
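{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to see where a feature file stops being sorted is to scan it for the first record that breaks chromosome-then-start order. The Python sketch below assumes that this ordering is what the tool expects. The function name and the toy records are hypothetical, and the exact check `bedtools` performs is not reproduced here.\n",
"\n",
"```python\n",
"# Hypothetical sortedness check: records are (chrom, start) pairs in file order.\n",
"\n",
"def first_out_of_order(records):\n",
"    seen = set()\n",
"    prev_chrom, prev_start = None, None\n",
"    for index, (chrom, start) in enumerate(records):\n",
"        if chrom != prev_chrom:\n",
"            if chrom in seen:\n",
"                return index          # a chromosome block reappears later in the file\n",
"            seen.add(chrom)\n",
"        elif start < prev_start:\n",
"            return index              # start decreases within a chromosome\n",
"        prev_chrom, prev_start = chrom, start\n",
"    return None                       # no violation found\n",
"\n",
"# Toy records, not taken from the mRNA gff\n",
"records = [('chrA', 10), ('chrA', 50), ('chrA', 30), ('chrB', 5)]\n",
"print(first_out_of_order(records))\n",
"```\n",
"\n",
"For the toy records the function returns index 2, the first record whose start is lower than the record before it. Records with equal starts are treated as in order."
]
},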
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t401630\t401632\t53\t368307\r\n",
"NC_035780.1\tGnomon\tmRNA\t43111\t66897\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\tNC_035780.1\t401630\t401632\t53\t334734\r\n",
"NC_035780.1\tGnomon\tmRNA\t43111\t46506\t.\t-\t.\tID=rna3;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;Name=XM_022447333.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 14 samples with support for all annotated introns;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1\tNC_035780.1\t401630\t401632\t53\t355125\r\n",
"NC_035780.1\tGnomon\tmRNA\t85606\t95254\t.\t-\t.\tID=rna4;Parent=gene3;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;Name=XM_022449924.1;gbkey=mRNA;gene=LOC111112434;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1\tNC_035780.1\t401630\t401632\t53\t306377\r\n",
"NC_035780.1\tGnomon\tmRNA\t99840\t106460\t.\t+\t.\tID=rna5;Parent=gene4;Dbxref=GeneID:111120752,Genbank:XM_022461698.1;Name=XM_022461698.1;gbkey=mRNA;gene=LOC111120752;model_evidence=Supporting evidence includes similarity to: 10 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 27 samples with support for all annotated introns;product=ribulose-phosphate 3-epimerase-like;transcript_id=XM_022461698.1\tNC_035780.1\t401630\t401632\t53\t295171\r\n",
"NC_035780.1\tGnomon\tmRNA\t108305\t110077\t.\t-\t.\tID=rna6;Parent=gene5;Dbxref=GeneID:111128944,Genbank:XM_022474921.1;Name=XM_022474921.1;gbkey=mRNA;gene=LOC111128944;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 93%25 coverage of the annotated genomic feature by RNAseq alignments;partial=true;product=mucin-19-like;start_range=.,108305;transcript_id=XM_022474921.1\tNC_035780.1\t401630\t401632\t53\t291554\r\n",
"NC_035780.1\tGnomon\tmRNA\t151859\t157536\t.\t+\t.\tID=rna7;Parent=gene6;Dbxref=GeneID:111128953,Genbank:XM_022474931.1;Name=XM_022474931.1;gbkey=mRNA;gene=LOC111128953;model_evidence=Supporting evidence includes similarity to: 1 Protein;product=GATA zinc finger domain-containing protein 14-like;transcript_id=XM_022474931.1\tNC_035780.1\t401630\t401632\t53\t244095\r\n",
"NC_035780.1\tGnomon\tmRNA\t163809\t183798\t.\t-\t.\tID=rna8;Parent=gene7;Dbxref=GeneID:111105691,Genbank:XM_022440054.1;Name=XM_022440054.1;gbkey=mRNA;gene=LOC111105691;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 9 samples with support for all annotated introns;product=uncharacterized LOC111105691;transcript_id=XM_022440054.1\tNC_035780.1\t401630\t401632\t53\t217833\r\n",
"NC_035780.1\tGnomon\tmRNA\t164820\t166793\t.\t+\t.\tID=rna9;Parent=gene8;Dbxref=GeneID:111105685,Genbank:XM_022440042.1;Name=XM_022440042.1;gbkey=mRNA;gene=LOC111105685;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 4 samples with support for all annotated introns;product=protein ANTAGONIST OF LIKE HETEROCHROMATIN PROTEIN 1-like;transcript_id=XM_022440042.1\tNC_035780.1\t401630\t401632\t53\t234838\r\n",
"NC_035780.1\tGnomon\tmRNA\t190449\t193594\t.\t-\t.\tID=rna11;Parent=gene10;Dbxref=GeneID:111133554,Genbank:XM_022482070.1;Name=XM_022482070.1;gbkey=mRNA;gene=LOC111133554;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=putative uncharacterized protein DDB_G0277407;transcript_id=XM_022482070.1\tNC_035780.1\t401630\t401632\t53\t208037\r\n"
]
}
],
"source": [
"!head 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-DMLs.txt"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t2028071\t2046722\t.\t-\t.\tID=rna177;Parent=gene107;Dbxref=GeneID:111121733,Genbank:XM_022463175.1;Name=XM_022463175.1;gbkey=mRNA;gene=LOC111121733;model_evidence=Supporting evidence includes similarity to: 4 ESTs%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=LIM and SH3 domain protein 1-like%2C transcript variant X6;transcript_id=XM_022463175.1\tNC_035780.1\t1933499\t1933501\t51\t-94570\r\n",
"NC_035780.1\tGnomon\tmRNA\t2028071\t2046721\t.\t-\t.\tID=rna178;Parent=gene107;Dbxref=GeneID:111121733,Genbank:XM_022463184.1;Name=XM_022463184.1;gbkey=mRNA;gene=LOC111121733;model_evidence=Supporting evidence includes similarity to: 4 ESTs%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=LIM and SH3 domain protein 1-like%2C transcript variant X7;transcript_id=XM_022463184.1\tNC_035780.1\t1933499\t1933501\t51\t-94570\r\n",
"NC_035780.1\tGnomon\tmRNA\t2028071\t2046721\t.\t-\t.\tID=rna179;Parent=gene107;Dbxref=GeneID:111121733,Genbank:XM_022463137.1;Name=XM_022463137.1;gbkey=mRNA;gene=LOC111121733;model_evidence=Supporting evidence includes similarity to: 4 ESTs%2C 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=LIM and SH3 domain protein 1-like%2C transcript variant X2;transcript_id=XM_022463137.1\tNC_035780.1\t1933499\t1933501\t51\t-94570\r\n",
"NC_035780.1\tGnomon\tmRNA\t2028071\t2046721\t.\t-\t.\tID=rna180;Parent=gene107;Dbxref=GeneID:111121733,Genbank:XM_022463128.1;Name=XM_022463128.1;gbkey=mRNA;gene=LOC111121733;model_evidence=Supporting evidence includes similarity to: 4 ESTs%2C 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=LIM and SH3 domain protein 1-like%2C transcript variant X1;transcript_id=XM_022463128.1\tNC_035780.1\t1933499\t1933501\t51\t-94570\r\n",
"NC_035780.1\tGnomon\tmRNA\t2028071\t2046720\t.\t-\t.\tID=rna181;Parent=gene107;Dbxref=GeneID:111121733,Genbank:XM_022463166.1;Name=XM_022463166.1;gbkey=mRNA;gene=LOC111121733;model_evidence=Supporting evidence includes similarity to: 4 ESTs%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=LIM and SH3 domain protein 1-like%2C transcript variant X5;transcript_id=XM_022463166.1\tNC_035780.1\t1933499\t1933501\t51\t-94570\r\n",
"NC_035780.1\tGnomon\tmRNA\t2028071\t2046720\t.\t-\t.\tID=rna182;Parent=gene107;Dbxref=GeneID:111121733,Genbank:XM_022463156.1;Name=XM_022463156.1;gbkey=mRNA;gene=LOC111121733;model_evidence=Supporting evidence includes similarity to: 4 ESTs%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=LIM and SH3 domain protein 1-like%2C transcript variant X4;transcript_id=XM_022463156.1\tNC_035780.1\t1933499\t1933501\t51\t-94570\r\n",
"NC_035780.1\tGnomon\tmRNA\t2028071\t2046720\t.\t-\t.\tID=rna183;Parent=gene107;Dbxref=GeneID:111121733,Genbank:XM_022463147.1;Name=XM_022463147.1;gbkey=mRNA;gene=LOC111121733;model_evidence=Supporting evidence includes similarity to: 4 ESTs%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=LIM and SH3 domain protein 1-like%2C transcript variant X3;transcript_id=XM_022463147.1\tNC_035780.1\t1933499\t1933501\t51\t-94570\r\n",
"NC_035780.1\tGnomon\tmRNA\t2028071\t2046719\t.\t-\t.\tID=rna184;Parent=gene107;Dbxref=GeneID:111121733,Genbank:XM_022463190.1;Name=XM_022463190.1;gbkey=mRNA;gene=LOC111121733;model_evidence=Supporting evidence includes similarity to: 4 ESTs%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 20 samples with support for all annotated introns;product=LIM and SH3 domain protein 1-like%2C transcript variant X8;transcript_id=XM_022463190.1\tNC_035780.1\t1933499\t1933501\t51\t-94570\r\n",
"NC_035780.1\tGnomon\tmRNA\t2060215\t2063945\t.\t+\t.\tID=rna185;Parent=gene108;Dbxref=GeneID:111129336,Genbank:XM_022475652.1;Name=XM_022475652.1;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclassified transcription discrepancy;gbkey=mRNA;gene=LOC111129336;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 83%25 coverage of the annotated genomic feature by RNAseq alignments;product=unconventional myosin-Ic-like;transcript_id=XM_022475652.1\tNC_035780.1\t1933499\t1933501\t51\t-126714\r\n",
"NC_035780.1\tGnomon\tmRNA\t2063986\t2067024\t.\t+\t.\tID=rna186;Parent=gene109;Dbxref=GeneID:111109736,Genbank:XM_022445962.1;Name=XM_022445962.1;Note=The sequence of the model RefSeq transcript was modified relative to this genomic sequence to represent the inferred CDS: inserted 1 base in 1 codon;exception=unclassified transcription discrepancy;gbkey=mRNA;gene=LOC111109736;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 88%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 6 samples with support for all annotated introns;product=myosin ID heavy chain-like;transcript_id=XM_022445962.1\tNC_035780.1\t1933499\t1933501\t51\t-130485\r\n"
]
}
],
"source": [
"!tail 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-DMLs.txt"
]
},
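{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output looks reasonable: each mRNA is paired with its closest non-overlapping DML, and the final column gives the signed distance between them. I'll keep going."
]
},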
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/bin/sh: {bedtoolsDirectory}closestBed: command not found\r\n"
]
}
],
"source": [
"! {bedtoolsDirectory}closestBed \\\n",
"-io \\\n",
"-a {mRNAList} \\\n",
"-b {DMRlist} \\\n",
"-t all \\\n",
"-D ref \\\n",
"> 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-DMRs.txt"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"!head 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-DMRs.txt"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Error: Sorted input specified, but the file C_virginica-3.0_Gnomon_mRNA.gff3 has the following out of order record\r\n",
"NC_035780.1\tGnomon\tmRNA\t2413594\t2416601\t.\t-\t.\tID=rna199;Parent=gene122;Dbxref=GeneID:111129373,Genbank:XM_022475729.1;Name=XM_022475729.1;gbkey=mRNA;gene=LOC111129373;model_evidence=Supporting evidence includes similarity to: 2 Proteins;product=mucin-2-like;transcript_id=XM_022475729.1\r\n"
]
}
],
"source": [
"! {bedtoolsDirectory}closestBed \\\n",
"-io \\\n",
"-a {mRNAList} \\\n",
"-b {CGMotifList} \\\n",
"-t all \\\n",
"-D ref \\\n",
"> 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-CGmotifs.txt"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\tNC_035780.1\t28924\t28926\tCG_motif\t-35\r\n",
"NC_035780.1\tGnomon\tmRNA\t43111\t66897\t.\t-\t.\tID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\tNC_035780.1\t66897\t66899\tCG_motif\t1\r\n",
"NC_035780.1\tGnomon\tmRNA\t43111\t46506\t.\t-\t.\tID=rna3;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;Name=XM_022447333.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 14 samples with support for all annotated introns;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1\tNC_035780.1\t46512\t46514\tCG_motif\t7\r\n",
"NC_035780.1\tGnomon\tmRNA\t85606\t95254\t.\t-\t.\tID=rna4;Parent=gene3;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;Name=XM_022449924.1;gbkey=mRNA;gene=LOC111112434;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1\tNC_035780.1\t85601\t85603\tCG_motif\t-3\r\n",
"NC_035780.1\tGnomon\tmRNA\t99840\t106460\t.\t+\t.\tID=rna5;Parent=gene4;Dbxref=GeneID:111120752,Genbank:XM_022461698.1;Name=XM_022461698.1;gbkey=mRNA;gene=LOC111120752;model_evidence=Supporting evidence includes similarity to: 10 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 27 samples with support for all annotated introns;product=ribulose-phosphate 3-epimerase-like;transcript_id=XM_022461698.1\tNC_035780.1\t99819\t99821\tCG_motif\t-19\r\n",
"NC_035780.1\tGnomon\tmRNA\t108305\t110077\t.\t-\t.\tID=rna6;Parent=gene5;Dbxref=GeneID:111128944,Genbank:XM_022474921.1;Name=XM_022474921.1;gbkey=mRNA;gene=LOC111128944;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 93%25 coverage of the annotated genomic feature by RNAseq alignments;partial=true;product=mucin-19-like;start_range=.,108305;transcript_id=XM_022474921.1\tNC_035780.1\t110095\t110097\tCG_motif\t19\r\n",
"NC_035780.1\tGnomon\tmRNA\t151859\t157536\t.\t+\t.\tID=rna7;Parent=gene6;Dbxref=GeneID:111128953,Genbank:XM_022474931.1;Name=XM_022474931.1;gbkey=mRNA;gene=LOC111128953;model_evidence=Supporting evidence includes similarity to: 1 Protein;product=GATA zinc finger domain-containing protein 14-like;transcript_id=XM_022474931.1\tNC_035780.1\t157537\t157539\tCG_motif\t2\r\n",
"NC_035780.1\tGnomon\tmRNA\t163809\t183798\t.\t-\t.\tID=rna8;Parent=gene7;Dbxref=GeneID:111105691,Genbank:XM_022440054.1;Name=XM_022440054.1;gbkey=mRNA;gene=LOC111105691;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 9 samples with support for all annotated introns;product=uncharacterized LOC111105691;transcript_id=XM_022440054.1\tNC_035780.1\t183810\t183812\tCG_motif\t13\r\n",
"NC_035780.1\tGnomon\tmRNA\t164820\t166793\t.\t+\t.\tID=rna9;Parent=gene8;Dbxref=GeneID:111105685,Genbank:XM_022440042.1;Name=XM_022440042.1;gbkey=mRNA;gene=LOC111105685;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 4 samples with support for all annotated introns;product=protein ANTAGONIST OF LIKE HETEROCHROMATIN PROTEIN 1-like;transcript_id=XM_022440042.1\tNC_035780.1\t164802\t164804\tCG_motif\t-16\r\n",
"NC_035780.1\tGnomon\tmRNA\t190449\t193594\t.\t-\t.\tID=rna11;Parent=gene10;Dbxref=GeneID:111133554,Genbank:XM_022482070.1;Name=XM_022482070.1;gbkey=mRNA;gene=LOC111133554;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=putative uncharacterized protein DDB_G0277407;transcript_id=XM_022482070.1\tNC_035780.1\t193603\t193605\tCG_motif\t10\r\n"
]
}
],
"source": [
"!head 2018-11-14-Flanking-Analysis/2018-11-14-mRNA-Closest-NoOverlap-CGmotifs.txt"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}