{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Gene Background Analysis\n", "\n", "In this notebook, I will characterize the location of the `unite` gene background from `methylKit` within the *C. virginica* genome.\n", "\n", "Methods:\n", "\n", "0. Prepare for Analyses\n", "1. Locate Files and Set Variable Paths\n", "2. Identify Overlaps with Background" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Prepare for Analyses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 0a. Set Working Directory" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'/Users/yaamini/Documents/yaamini-virginica/analyses/2018-12-02-Gene-Enrichment-Analysis'" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pwd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/yaamini/Documents/yaamini-virginica/analyses\n" ] } ], "source": [ "cd ../analyses/" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'/Users/yaamini/Documents/yaamini-virginica/analyses'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pwd" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!mkdir 2018-12-02-Gene-Enrichment-Analysis" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34m2018-10-25-MethylKit\u001b[m\u001b[m/ \u001b[34m2018-12-02-Gene-Enrichment-Analysis\u001b[m\u001b[m/\r\n", "\u001b[34m2018-11-01-DML-and-DMR-Analysis\u001b[m\u001b[m/ README.md\r\n" ] } ], "source": [ "ls -F" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/yaamini/Documents/yaamini-virginica/analyses/2018-12-02-Gene-Enrichment-Analysis\n" ] } ], "source": [ "cd 2018-12-02-Gene-Enrichment-Analysis/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Locate Relevant Files and Set Variable Path Names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1a. Set Variable Path Names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setting the variable path names allows me to reuse this script with different input files or different paths to programs without manually changing the file names each time." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "bedtoolsDirectory = \"/Users/Shared/bioinformatics/bedtools2/bin/\"" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "geneBackground = \"../../analyses/2018-10-25-MethylKit/2018-11-29-Methylation-Information-Cov3.bed\"" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "DMLlist = \"../../analyses/2018-10-25-MethylKit/2018-11-07-DML-Locations.bed\"" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "DMRlist = \"../../analyses/2018-10-25-MethylKit/2018-11-07-DMR-Locations.bed\"" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "exonList = \"../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_Gnomon_exon.bed\"" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "intronList = \"../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_intron.bed\"" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mRNAList = \"../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_Gnomon_mRNA.gff3\"" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "transposableElementsAll = \"../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_TE-all.gff\"" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "transposableElementsCg = \"../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_TE-Cg.gff\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1b. Confirm Variable Path Works and Characterize Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The BEDfiles with DML and DMR can be viewed below. Columns are are the chromosome, start position, end position, strand, and fold difference with direction. The files only have DML and DMR that were at least 50% different between the two treatments (control and elevated pCO2)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_007175.2\t147\t147\t+\t7\t0\t7\t134\t4\t130\t39\t0\t39\t99\t1\t98\t394\t5\t389\t9\t0\t9\t241\t4\t237\t623\t10\t613\t178\t8\t170\t15\t0\t15\r\n", "NC_007175.2\t246\t246\t+\t3\t0\t3\t109\t1\t108\t26\t0\t26\t91\t0\t91\t306\t11\t295\t4\t0\t4\t208\t3\t205\t558\t11\t547\t183\t4\t179\t13\t0\t13\r\n", "NC_007175.2\t257\t257\t+\t3\t0\t3\t134\t4\t130\t29\t0\t29\t110\t2\t108\t379\t0\t379\t5\t0\t5\t239\t0\t239\t669\t10\t659\t199\t7\t192\t20\t1\t19\r\n", "NC_007175.2\t266\t266\t+\t3\t0\t3\t168\t2\t166\t39\t2\t37\t126\t4\t122\t458\t5\t453\t7\t0\t7\t293\t1\t292\t782\t6\t776\t244\t8\t236\t23\t0\t23\r\n", "NC_007175.2\t473\t473\t+\t3\t0\t3\t112\t3\t109\t40\t0\t40\t114\t8\t106\t322\t10\t312\t5\t0\t5\t214\t3\t211\t614\t8\t606\t205\t7\t198\t17\t0\t17\r\n", "NC_007175.2\t665\t665\t+\t7\t0\t7\t107\t2\t105\t44\t0\t44\t56\t0\t56\t358\t9\t349\t7\t0\t7\t212\t9\t203\t561\t11\t550\t244\t8\t236\t21\t0\t21\r\n", "NC_007175.2\t685\t685\t-\t4\t0\t4\t108\t1\t107\t28\t0\t28\t129\t0\t129\t369\t2\t367\t4\t0\t4\t188\t8\t180\t538\t6\t532\t82\t5\t77\t6\t0\t6\r\n", "NC_007175.2\t709\t709\t+\t5\t0\t5\t114\t3\t111\t46\t1\t45\t73\t6\t67\t377\t15\t362\t4\t0\t4\t259\t9\t250\t623\t18\t605\t242\t10\t232\t21\t0\t21\r\n", "NC_007175.2\t710\t710\t-\t6\t0\t6\t122\t4\t118\t36\t1\t35\t134\t3\t131\t415\t2\t413\t4\t0\t4\t198\t1\t197\t636\t11\t625\t119\t2\t117\t10\t0\t10\r\n", "NC_007175.2\t725\t725\t+\t4\t0\t4\t121\t1\t120\t45\t1\t44\t70\t0\t70\t397\t4\t393\t4\t0\t4\t280\t8\t272\t622\t8\t614\t242\t7\t235\t17\t0\t17\r\n" ] } ], "source": [ "#Previewing the files\n", "!head {geneBackground}" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 670301 ../../analyses/2018-10-25-MethylKit/2018-11-29-Methylation-Information-Cov3.bed\r\n" ] } ], "source": [ "#Counting the number of lines\n", "!wc -l {geneBackground}" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\t346071\t346073\t-\t50\r\n", "NC_035780.1\t990995\t990997\t-\t-51\r\n", "NC_035780.1\t1882691\t1882693\t-\t52\r\n", "NC_035780.1\t1885022\t1885024\t-\t61\r\n", "NC_035780.1\t1933499\t1933501\t-\t53\r\n", "NC_035780.1\t1945182\t1945184\t+\t55\r\n", "NC_035780.1\t1958998\t1959000\t-\t53\r\n", "NC_035780.1\t1983256\t1983258\t-\t-69\r\n", "NC_035780.1\t2538924\t2538926\t-\t-50\r\n", "NC_035780.1\t2541652\t2541654\t-\t-55\r\n" ] } ], "source": [ "!head {DMLlist}" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1398 ../../analyses/2018-10-25-MethylKit/2018-11-07-DML-Locations.bed\r\n" ] } ], "source": [ "#Counting the number of DML\n", "!wc -l {DMLlist}" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\t571100\t571200\tDMR\t58\r\n", "NC_035780.1\t573700\t573800\tDMR\t52\r\n", "NC_035780.1\t1885000\t1885100\tDMR\t51\r\n", "NC_035780.1\t1933500\t1933600\tDMR\t53\r\n", "NC_035780.1\t4285700\t4285800\tDMR\t-51\r\n", "NC_035780.1\t24159600\t24159700\tDMR\t51\r\n", "NC_035780.1\t26629500\t26629600\tDMR\t65\r\n", "NC_035780.1\t28563400\t28563500\tDMR\t59\r\n", "NC_035780.1\t29883000\t29883100\tDMR\t-55\r\n", "NC_035780.1\t31302900\t31303000\tDMR\t-61\r\n" ] } ], "source": [ "!head {DMRlist}" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 162 ../../analyses/2018-10-25-MethylKit/2018-11-07-DMR-Locations.bed\r\n" ] } ], "source": [ "#Counting the number of DMR\n", "!wc -l {DMRlist}" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\t13578\t13603\r\n", "NC_035780.1\t14237\t14290\r\n", "NC_035780.1\t14557\t14594\r\n", "NC_035780.1\t28961\t29073\r\n", "NC_035780.1\t30524\t31557\r\n", "NC_035780.1\t31736\t31887\r\n", "NC_035780.1\t31977\t32565\r\n", "NC_035780.1\t32959\t33324\r\n", "NC_035780.1\t66869\t66897\r\n", "NC_035780.1\t64123\t64334\r\n" ] } ], "source": [ "!head {exonList}" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 731279 ../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_Gnomon_exon.bed\r\n" ] } ], "source": [ "!wc -l {exonList}" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\t28961\t28961\r\n", "NC_035780.1\t29074\t30524\r\n", "NC_035780.1\t31558\t31736\r\n", "NC_035780.1\t31888\t31977\r\n", "NC_035780.1\t32566\t32959\r\n", "NC_035780.1\t43110\t43112\r\n", "NC_035780.1\t44359\t45913\r\n", "NC_035780.1\t46507\t64123\r\n", "NC_035780.1\t64335\t66869\r\n", "NC_035780.1\t85606\t85606\r\n" ] } ], "source": [ "!head {intronList}" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 319262 ../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_intron.bed\r\n" ] } ], "source": [ "!wc -l {intronList}" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n" ] } ], "source": [ "!head -n 1 {mRNAList}" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 60201 ../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_Gnomon_mRNA.gff3\r\n" ] } ], "source": [ "!wc -l {mRNAList}" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "##gff-version 2\r\n", "##date 2018-08-23\r\n", "##sequence-region Cvirginica_v300.fa\r\n", "NC_007175.2\tRepeatMasker\tsimilarity\t262\t1389\t31.1\t+\t.\tTarget \"Motif:REP-6_LMi\" 2920 4055\r\n", "NC_007175.2\tRepeatMasker\tsimilarity\t1728\t1947\t26.1\t-\t.\tTarget \"Motif:REP-6_LMi\" 14320 14534\r\n", "NC_007175.2\tRepeatMasker\tsimilarity\t1866\t2013\t33.6\t+\t.\tTarget \"Motif:LSU-rRNA_Cel\" 2372 2520\r\n", "NC_007175.2\tRepeatMasker\tsimilarity\t2129\t2367\t20.5\t-\t.\tTarget \"Motif:REP-6_LMi\" 13886 14118\r\n", "NC_007175.2\tRepeatMasker\tsimilarity\t2836\t2980\t31.5\t+\t.\tTarget \"Motif:REP-6_LMi\" 6216 6359\r\n", "NC_007175.2\tRepeatMasker\tsimilarity\t3196\t3277\t30.5\t+\t.\tTarget \"Motif:REP-6_LMi\" 6572 6653\r\n", "NC_007175.2\tRepeatMasker\tsimilarity\t5168\t5532\t32.9\t+\t.\tTarget \"Motif:REP-6_LMi\" 4620 4983\r\n" ] } ], "source": [ "!head {transposableElementsAll}" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 692371 ../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_TE-all.gff\r\n" ] } ], "source": [ "!wc -l {transposableElementsAll}" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "##gff-version 2\r\n", "##date 2018-08-27\r\n", "##sequence-region Cvirginica_v300.fa\r\n", "NC_007175.2\tRepeatMasker\tsimilarity\t1866\t2013\t33.6\t+\t.\tTarget \"Motif:LSU-rRNA_Cel\" 2372 2520\r\n", "NC_007175.2\tRepeatMasker\tsimilarity\t6529\t6628\t19.0\t+\t.\tTarget \"Motif:(TA)n\" 2 102\r\n", "NC_035780.1\tRepeatMasker\tsimilarity\t1473\t1535\t 0.0\t+\t.\tTarget \"Motif:(TAACCC)n\" 1 63\r\n", "NC_035780.1\tRepeatMasker\tsimilarity\t5080\t7289\t32.5\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2102 4631\r\n", "NC_035780.1\tRepeatMasker\tsimilarity\t7423\t7489\t25.4\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 2097 2163\r\n", "NC_035780.1\tRepeatMasker\tsimilarity\t7623\t8079\t34.1\t-\t.\tTarget \"Motif:Gypsy-62_CGi-I\" 1516 1975\r\n", "NC_035780.1\tRepeatMasker\tsimilarity\t8261\t8295\t14.1\t+\t.\tTarget \"Motif:(CTCCT)n\" 1 33\r\n" ] } ], "source": [ "!head {transposableElementsCg}" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 626665 ../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_TE-Cg.gff\r\n" ] } ], "source": [ "!wc -l {transposableElementsCg}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Identify Gene Background overlaps with Genome Feature Tracks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I will use `intersectBed` to find the overlaps between the gene background and mRNA coding regions and transposable elements (all species and *C. gigas* only)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2a. DML" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1808\n", "Gene background overlaps with DML\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {geneBackground} \\\n", "-b {DMLlist} \\\n", "| wc -l\n", "!echo \"Gene background overlaps with DML\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2b. DMR" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 128\n", "Gene background overlaps with DMR\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {geneBackground} \\\n", "-b {DMRlist} \\\n", "| wc -l\n", "!echo \"Gene background overlaps with DMR\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2c. Exons " ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 420979\n", "Gene background overlaps with exons\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {geneBackground} \\\n", "-b {exonList} \\\n", "| wc -l\n", "!echo \"Gene background overlaps with exons\"" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wb \\\n", "-a {geneBackground} \\\n", "-b {exonList} \\\n", "> 2019-01-04-Gene-Background-Exons.txt" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\t100560\t100560\t-\t13\t5\t8\t31\t11\t20\t16\t6\t10\t23\t17\t6\t25\t8\t17\t8\t0\t8\t21\t8\t13\t19\t6\t13\t41\t35\t6\t15\t0\t15\tNC_035780.1\t100554\t100661\r\n" ] } ], "source": [ "!head -n 1 2019-01-04-Gene-Background-Exons.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2d. Introns " ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 202086\n", "Gene background overlaps with introns\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {geneBackground} \\\n", "-b {intronList} \\\n", "| wc -l\n", "!echo \"Gene background overlaps with introns\"" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wb \\\n", "-a {geneBackground} \\\n", "-b {intronList} \\\n", "> 2019-01-04-Gene-Background-Introns.txt" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\t87542\t87542\t+\t4\t4\t0\t11\t8\t3\t20\t16\t4\t15\t9\t6\t12\t10\t2\t11\t10\t1\t8\t6\t2\t10\t4\t6\t9\t7\t2\t4\t4\t0\tNC_035780.1\t85778\t88423\r\n" ] } ], "source": [ "!head -n 1 2019-01-04-Gene-Background-Introns.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2e. mRNA " ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 614204\n", "Gene background overlaps with mRNA\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {geneBackground} \\\n", "-b {mRNAList} \\\n", "| wc -l\n", "!echo \"Gene background overlaps with mRNA\"" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wb \\\n", "-a {geneBackground} \\\n", "-b {mRNAList} \\\n", "> 2018-12-18-Gene-Background-mRNA.txt" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\t87542\t87542\t+\t4\t4\t0\t11\t8\t3\t20\t16\t4\t15\t9\t6\t12\t10\t2\t11\t10\t1\t8\t6\t2\t10\t4\t6\t9\t7\t2\t4\t4\t0\tNC_035780.1\tGnomon\tmRNA\t85606\t95254\t.\t-\t.\tID=rna4;Parent=gene3;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;Name=XM_022449924.1;gbkey=mRNA;gene=LOC111112434;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1\r\n" ] } ], "source": [ "!head -n 1 2018-12-18-Gene-Background-mRNA.txt" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Isolate unique genes in gene background-mRNA overlap\n", "! cut -f43 2018-12-18-Gene-Background-mRNA.txt | sort | uniq -c > 2019-03-11-Unique-Genes-in-Gene-Background-mRNA-Overlap.txt" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 41 ID=rna10000;Parent=gene5866;Dbxref=GeneID:111121983,Genbank:XM_022463489.1;Name=XM_022463489.1;gbkey=mRNA;gene=LOC111121983;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=sodium-coupled neutral amino acid transporter 9-like%2C transcript variant X4;transcript_id=XM_022463489.1\r\n" ] } ], "source": [ "!head -n 1 2019-03-11-Unique-Genes-in-Gene-Background-mRNA-Overlap.txt" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 26627 2019-03-11-Unique-Genes-in-Gene-Background-mRNA-Overlap.txt\r\n" ] } ], "source": [ "#Count number of unique genes\n", "!wc -l 2019-03-11-Unique-Genes-in-Gene-Background-mRNA-Overlap.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2f. Transposable elements (all)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 76830\n", "Gene background overlaps with TE (all)\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {geneBackground} \\\n", "-b {transposableElementsAll} \\\n", "| wc -l\n", "!echo \"Gene background overlaps with TE (all)\"" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wb \\\n", "-a {geneBackground} \\\n", "-b {transposableElementsAll} \\\n", "> 2018-12-18-Gene-Background-TEall.txt" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_007175.2\t266\t266\t+\t3\t0\t3\t168\t2\t166\t39\t2\t37\t126\t4\t122\t458\t5\t453\t7\t0\t7\t293\t1\t292\t782\t6\t776\t244\t8\t236\t23\t0\t23\tNC_007175.2\tRepeatMasker\tsimilarity\t262\t1389\t31.1\t+\t.\tTarget \"Motif:REP-6_LMi\" 2920 4055\r\n" ] } ], "source": [ "!head -n 1 2018-12-18-Gene-Background-TEall.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2g. Transposable elements (Cg)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 58340\n", "Gene background overlaps with TE (Cg)\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {geneBackground} \\\n", "-b {transposableElementsCg} \\\n", "| wc -l\n", "!echo \"Gene background overlaps with TE (Cg)\"" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wb \\\n", "-a {geneBackground} \\\n", "-b {transposableElementsCg} \\\n", "> 2018-12-18-Gene-Background-TECg.txt" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_007175.2\t2004\t2004\t+\t4\t0\t4\t38\t0\t38\t12\t1\t11\t62\t2\t60\t194\t1\t193\t3\t0\t3\t113\t4\t109\t405\t15\t390\t47\t1\t46\t8\t0\t8\tNC_007175.2\tRepeatMasker\tsimilarity\t1866\t2013\t33.6\t+\t.\tTarget \"Motif:LSU-rRNA_Cel\" 2372 2520\r\n" ] } ], "source": [ "!head -n 1 2018-12-18-Gene-Background-TECg.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Calculate Percent Overlap with Gene Background" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I want to know how much of the original background is represented by DML and DMR. To do this, I'll just do some simple math." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3a. DML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(lines in DML) / (lines in gene background)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0.20856301870353766" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(1398 / 670301) * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "0.2086% overlap between DML and the original background." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3b. DMR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(lines in DMR) / (lines in gene background)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0.024168246802555866" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(162 / 670301) * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "0.0242% overlap between DMR and the original background." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I also want proportions for the overlap between the original background and other genome feature files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3c. Exons" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(lines overlapped) / (lines in gene background)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "62.804471424031895" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(420979 / 670301) * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overlap between exons and gene background accounts for 62.80% of the gene background." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3d. Introns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(lines overlapped) / (lines in gene background)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "30.148545205810525" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(202086 / 670301) * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overlap between introns and gene background accounts for 30.15% of the gene background." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3e. mRNA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(lines overlapped) / (lines in gene background)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "91.63107320442607" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(614204 / 670301) * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overlap between mRNA and gene background accounts for 91.63% of the gene background." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3f. Transposable elements (all)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(lines overlapped) / (lines in gene background)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "11.462014826175107" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(76830 / 670301) * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overlap between transposable elements (all) and gene background accounts for 11.46% of the gene background." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3g. Transposable elements (*C. gigas* only)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(lines overlapped) / (lines in gene background)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "8.703552583093268" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(58340 / 670301) * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overlap between transposable elements (all) and gene background accounts for 8.70% of the gene background." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }