{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# DML Analysis\n", "\n", "In this notebook, I will examine the location of differentially methylated loci (DML) in the *C. gigas* genome. The DML were identified using `methylKit` in [this R script](https://github.com/RobertsLab/project-gigas-oa-meth/blob/master/analyses/2019-09-12-MethylKit/2019-09-12-MethylKit.Rmd).\n", "\n", "Methods:\n", "\n", "1. Prepare for Analyses\n", "2. Locate Files and Set Variable Paths\n", "3. Identify Overlaps between Genomic Feature Tracks\n", "4. Annotate DML-Gene overlaps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Prepare for Analyses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 0a. Set Working Directory" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'/Users/yaamini/Documents/project-gigas-oa-meth/notebooks'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pwd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/yaamini/Documents/project-gigas-oa-meth/analyses\n" ] } ], "source": [ "cd ../analyses/" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!mkdir 2019-09-15-DML-Analysis" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34m2019-08-30-Bismark-Parameter-Testing\u001b[m\u001b[m/ \u001b[34m2019-09-15-DML-Analysis\u001b[m\u001b[m/\r\n", "\u001b[34m2019-09-12-MethylKit\u001b[m\u001b[m/ README.md\r\n", "\u001b[34m2019-09-13-IGV-Verification\u001b[m\u001b[m/\r\n" ] } ], "source": [ "ls -F" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/yaamini/Documents/project-gigas-oa-meth/analyses/2019-09-15-DML-Analysis\n" ] } ], "source": [ "cd 2019-09-15-DML-Analysis/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 0b. Download Genome Feature Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I will be using the following tracks from [this `eagle` directory](https://eagle.fish.washington.edu/trilobite/index.php?dir=Crassostrea_gigas_v9_tracks%2F):\n", "\n", "1. Exon: Coding regions\n", "2. Intron: Regions that are removed\n", "3. Genes: This includes exons and introns, as well as constituent mRNA.\n", "4. Promoters: Regions upstream of genes that could be important for regulation\n", "5. Tranpsosable elements (_C. gigas_): Transposable elements located using information from _C. gigas_ only (see [Sam's notes](http://onsnetwork.org/kubu4/2018/08/28/transposable-element-mapping-crassostrea-virginica-genome-cvirginica_v300-using-repeatmasker-4-07/) for more information)\n", "4. CG motifs: Regions with CGs where methylation can occur" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 11.7M 100 11.7M 0 0 6888k 0 0:00:01 0:00:01 --:--:-- 6964k\n" ] } ], "source": [ "!curl https://eagle.fish.washington.edu/trilobite/Crassostrea_gigas_v9_tracks/Cgigas_v9_exon.gff \\\n", "> Cgigas_v9_exon.gff" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 12.0M 100 12.0M 0 0 7933k 0 0:00:01 0:00:01 --:--:-- 7947k\n" ] } ], "source": [ "!curl https://eagle.fish.washington.edu/trilobite/Crassostrea_gigas_v9_tracks/Cgigas_v9_intron.gff \\\n", "> Cgigas_v9_intron.gff" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 1777k 100 1777k 0 0 4896k 0 --:--:-- --:--:-- --:--:-- 5288k\n" ] } ], "source": [ "!curl https://eagle.fish.washington.edu/trilobite/Crassostrea_gigas_v9_tracks/Cgigas_v9_gene.gff \\\n", "> Cgigas_v9_gene.gff" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 1848k 100 1848k 0 0 5306k 0 --:--:-- --:--:-- --:--:-- 5373k\n" ] } ], "source": [ "!curl https://eagle.fish.washington.edu/trilobite/Crassostrea_gigas_v9_tracks/Cgigas_v9_1k5p_gene_promoter.gff \\\n", "> Cgigas_v9_1k5p_gene_promoter.gff" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 6325k 100 6325k 0 0 7365k 0 --:--:-- --:--:-- --:--:-- 7695k\n" ] } ], "source": [ "!curl https://eagle.fish.washington.edu/trilobite/Crassostrea_gigas_v9_tracks/Cgigas_v9_TE.gff \\\n", "> Cgigas_v9_TE.gff" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 932M 100 932M 0 0 9789k 0 0:01:37 0:01:37 --:--:-- 9541k\n" ] } ], "source": [ "!curl https://eagle.fish.washington.edu/trilobite/Crassostrea_gigas_v9_tracks/Cgigas_v9_CG.gff \\\n", "> Cgigas_v9_CG.gff" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cgigas_v9_1k5p_gene_promoter.gff Cgigas_v9_gene.gff\r\n", "Cgigas_v9_CG.gff Cgigas_v9_intron.gff\r\n", "Cgigas_v9_exon.gff\r\n" ] } ], "source": [ "!ls Cgigas*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Locate Relevant Files and Set Variable Path Names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1a. Set Variable Path Names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setting the variable path names allows me to reuse this script with different input files or different paths to programs without manually changing the file names each time." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "bedtoolsDirectory = \"/Users/Shared/bioinformatics/bedtools2/bin/\"" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": true }, "outputs": [], "source": [ "DMLlist = \"../2019-09-12-MethylKit/2019-09-15-DML-Destrand-10x-Locations-100diff-NoExtras.bed\"" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "exonList = \"Cgigas_v9_exon.gff\"" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "intronList = \"Cgigas_v9_intron.gff\"" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "geneList = \"Cgigas_v9_gene.gff\"" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "promoterList = \"Cgigas_v9_1k5p_gene_promoter.gff\"" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "transposableElements = \"Cgigas_v9_TE.gff\"" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "CGMotifList = \"Cgigas_v9_CG.gff\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1b. Confirm Variable Path Works and Characterize Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The BEDfiles with DML and DMR can be viewed below. Columns are are the chromosome, start position, end position, strand, and fold difference with direction. The files only have DML and DMR that were at least 50% different between the two treatments (control and elevated pCO2)." ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C22384\t1328\t1330\t-100\r\n", "C22628\t1621\t1623\t100\r\n", "C28982\t4929\t4931\t100\r\n", "C29914\t4052\t4054\t-100\r\n", "C29914\t4052\t4054\t-100\r\n", "C29976\t649\t651\t-100\r\n", "C30322\t3482\t3484\t-100\r\n", "C30322\t3599\t3601\t-100\r\n", "C32984\t5070\t5072\t-100\r\n", "C33708\t8307\t8309\t100\r\n" ] } ], "source": [ "#Previewing the files\n", "!head {DMLlist}" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 628 ../2019-09-12-MethylKit/2019-09-15-DML-Destrand-10x-Locations-100diff-NoExtras.bed\r\n" ] } ], "source": [ "#Counting the number of lines to count DML\n", "!wc -l {DMLlist}" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C16582\tGLEAN\tCDS\t35\t385\t.\t-\t0\tParent=CGI_10000001;\r\n", "C17212\tGLEAN\tCDS\t31\t363\t.\t+\t0\tParent=CGI_10000002;\r\n", "C17316\tGLEAN\tCDS\t30\t257\t.\t+\t0\tParent=CGI_10000003;\r\n", "C17476\tGLEAN\tCDS\t104\t257\t.\t-\t0\tParent=CGI_10000004;\r\n", "C17476\tGLEAN\tCDS\t34\t74\t.\t-\t2\tParent=CGI_10000004;\r\n", "C17998\tGLEAN\tCDS\t196\t387\t.\t-\t0\tParent=CGI_10000005;\r\n", "C18346\tGLEAN\tCDS\t174\t551\t.\t+\t0\tParent=CGI_10000009;\r\n", "C18428\tGLEAN\tCDS\t286\t546\t.\t-\t0\tParent=CGI_10000010;\r\n", "C18964\tGLEAN\tCDS\t203\t658\t.\t-\t0\tParent=CGI_10000011;\r\n", "C18980\tGLEAN\tCDS\t30\t674\t.\t+\t0\tParent=CGI_10000012;\r\n" ] } ], "source": [ "!head {exonList}" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 196691 Cgigas_v9_exon.gff\r\n" ] } ], "source": [ "!wc -l {exonList}" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C17476\tsubtractBed\tintrn\t75\t103\t.\t-\t.\tParent=CGI_10000004;\r", "\r\n", "C19392\tsubtractBed\tintrn\t184\t451\t.\t+\t.\tParent=CGI_10000015;\r", "\r\n", "C20262\tsubtractBed\tintrn\t539\t641\t.\t-\t.\tParent=CGI_10000025;\r", "\r\n", "C20262\tsubtractBed\tintrn\t650\t871\t.\t-\t.\tParent=CGI_10000025;\r", "\r\n", "C20334\tsubtractBed\tintrn\t524\t867\t.\t-\t.\tParent=CGI_10000028;\r", "\r\n", "C20412\tsubtractBed\tintrn\t215\t409\t.\t-\t.\tParent=CGI_10000029;\r", "\r\n", "C20412\tsubtractBed\tintrn\t464\t705\t.\t-\t.\tParent=CGI_10000029;\r", "\r\n", "C20462\tsubtractBed\tintrn\t50\t271\t.\t+\t.\tParent=CGI_10000030;\r", "\r\n", "C20462\tsubtractBed\tintrn\t360\t481\t.\t+\t.\tParent=CGI_10000030;\r", "\r\n", "C20462\tsubtractBed\tintrn\t577\t822\t.\t+\t.\tParent=CGI_10000030;\r", "\r\n" ] } ], "source": [ "!head {intronList}" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 176049 Cgigas_v9_intron.gff\r\n" ] } ], "source": [ "!wc -l {intronList}" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C16582\tGLEAN\tmRNA\t35\t385\t0.555898\t-\t.\tID=CGI_10000001;\r\n", "C17212\tGLEAN\tmRNA\t31\t363\t0.999572\t+\t.\tID=CGI_10000002;\r\n", "C17316\tGLEAN\tmRNA\t30\t257\t0.555898\t+\t.\tID=CGI_10000003;\r\n", "C17476\tGLEAN\tmRNA\t34\t257\t0.998947\t-\t.\tID=CGI_10000004;\r\n", "C17998\tGLEAN\tmRNA\t196\t387\t1\t-\t.\tID=CGI_10000005;\r\n", "C18346\tGLEAN\tmRNA\t174\t551\t1\t+\t.\tID=CGI_10000009;\r\n", "C18428\tGLEAN\tmRNA\t286\t546\t0.555898\t-\t.\tID=CGI_10000010;\r\n", "C18964\tGLEAN\tmRNA\t203\t658\t0.999572\t-\t.\tID=CGI_10000011;\r\n", "C18980\tGLEAN\tmRNA\t30\t674\t0.555898\t+\t.\tID=CGI_10000012;\r\n", "C19100\tGLEAN\tmRNA\t160\t681\t0.999955\t-\t.\tID=CGI_10000013;\r\n" ] } ], "source": [ "!head {geneList}" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 28027 Cgigas_v9_gene.gff\r\n" ] } ], "source": [ "!wc -l {geneList}" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C16582\tflankbed\tpromoter\t386\t395\t.\t-\t.\tID=CGI_10000001;\r", "\r\n", "C17212\tflankbed\tpromoter\t1\t30\t.\t+\t.\tID=CGI_10000002;\r", "\r\n", "C17316\tflankbed\tpromoter\t1\t29\t.\t+\t.\tID=CGI_10000003;\r", "\r\n", "C17476\tflankbed\tpromoter\t258\t491\t.\t-\t.\tID=CGI_10000004;\r", "\r\n", "C17998\tflankbed\tpromoter\t388\t559\t.\t-\t.\tID=CGI_10000005;\r", "\r\n", "C18346\tflankbed\tpromoter\t1\t173\t.\t+\t.\tID=CGI_10000009;\r", "\r\n", "C18428\tflankbed\tpromoter\t547\t611\t.\t-\t.\tID=CGI_10000010;\r", "\r\n", "C18964\tflankbed\tpromoter\t659\t714\t.\t-\t.\tID=CGI_10000011;\r", "\r\n", "C18980\tflankbed\tpromoter\t1\t29\t.\t+\t.\tID=CGI_10000012;\r", "\r\n", "C19100\tflankbed\tpromoter\t682\t743\t.\t-\t.\tID=CGI_10000013;\r", "\r\n" ] } ], "source": [ "!head {promoterList}" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 28023 Cgigas_v9_1k5p_gene_promoter.gff\r\n" ] } ], "source": [ "!wc -l {promoterList}" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C21242\tTRF\tTandem_Repeat\t38\t100\t72\t+\t.\t.\r\n", "C21306\tTRF\tTandem_Repeat\t35\t143\t112\t+\t.\t.\r\n", "C21306\tTRF\tTandem_Repeat\t574\t947\t208\t+\t.\t.\r\n", "C21306\tTRF\tTandem_Repeat\t574\t901\t313\t+\t.\t.\r\n", "C21372\tTRF\tTandem_Repeat\t643\t671\t58\t+\t.\t.\r\n", "C22542\tTRF\tTandem_Repeat\t1727\t1774\t96\t+\t.\t.\r\n", "C22728\tTRF\tTandem_Repeat\t426\t491\t105\t+\t.\t.\r\n", "C23428\tTRF\tTandem_Repeat\t130\t415\t202\t+\t.\t.\r\n", "C23796\tTRF\tTandem_Repeat\t547\t608\t97\t+\t.\t.\r\n", "C24440\tTRF\tTandem_Repeat\t1059\t1089\t62\t+\t.\t.\r\n" ] } ], "source": [ "!head {transposableElements}" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 119786 Cgigas_v9_TE.gff\r\n" ] } ], "source": [ "!wc -l {transposableElements}" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "##gff-version 3\r\n", "##sequence-region scaffold360 1 280\r\n", "#!Date 2013-04-23\r\n", "#!Type DNA\r\n", "#!Source-version EMBOSS 6.5.7.0\r\n", "scaffold360\tfuzznuc\tnucleotide_motif\t60\t61\t2\t+\t.\tID=scaffold360.1;note=*pat pattern:CG\r\n", "scaffold360\tfuzznuc\tnucleotide_motif\t96\t97\t2\t+\t.\tID=scaffold360.2;note=*pat pattern:CG\r\n", "scaffold360\tfuzznuc\tnucleotide_motif\t120\t121\t2\t+\t.\tID=scaffold360.3;note=*pat pattern:CG\r\n", "scaffold360\tfuzznuc\tnucleotide_motif\t187\t188\t2\t+\t.\tID=scaffold360.4;note=*pat pattern:CG\r\n", "##gff-version 3\r\n" ] } ], "source": [ "!head {CGMotifList}" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 10035701 Cgigas_v9_CG.gff\r\n" ] } ], "source": [ "!wc -l {CGMotifList}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Identify DML Overlaps with Genomic Feature Tracks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To identify the location of DML in the *C. gigas* genome, I will use `intersect` from `bedtools`. [The BEDtools suite](http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html) allows me to easily find overlapping regions of different BEDfiles." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Tool: bedtools intersect (aka intersectBed)\r\n", "Version: v2.26.0\r\n", "Summary: Report overlaps between two feature files.\r\n", "\r\n", "Usage: bedtools intersect [OPTIONS] -a -b \r\n", "\r\n", "\tNote: -b may be followed with multiple databases and/or \r\n", "\twildcard (*) character(s). \r\n", "Options: \r\n", "\t-wa\tWrite the original entry in A for each overlap.\r\n", "\r\n", "\t-wb\tWrite the original entry in B for each overlap.\r\n", "\t\t- Useful for knowing _what_ A overlaps. Restricted by -f and -r.\r\n", "\r\n", "\t-loj\tPerform a \"left outer join\". That is, for each feature in A\r\n", "\t\treport each overlap with B. If no overlaps are found, \r\n", "\t\treport a NULL feature for B.\r\n", "\r\n", "\t-wo\tWrite the original A and B entries plus the number of base\r\n", "\t\tpairs of overlap between the two features.\r\n", "\t\t- Overlaps restricted by -f and -r.\r\n", "\t\t Only A features with overlap are reported.\r\n", "\r\n", "\t-wao\tWrite the original A and B entries plus the number of base\r\n", "\t\tpairs of overlap between the two features.\r\n", "\t\t- Overlapping features restricted by -f and -r.\r\n", "\t\t However, A features w/o overlap are also reported\r\n", "\t\t with a NULL B feature and overlap = 0.\r\n", "\r\n", "\t-u\tWrite the original A entry _once_ if _any_ overlaps found in B.\r\n", "\t\t- In other words, just report the fact >=1 hit was found.\r\n", "\t\t- Overlaps restricted by -f and -r.\r\n", "\r\n", "\t-c\tFor each entry in A, report the number of overlaps with B.\r\n", "\t\t- Reports 0 for A entries that have no overlap with B.\r\n", "\t\t- Overlaps restricted by -f and -r.\r\n", "\r\n", "\t-v\tOnly report those entries in A that have _no overlaps_ with B.\r\n", "\t\t- Similar to \"grep -v\" (an homage).\r\n", "\r\n", "\t-ubam\tWrite uncompressed BAM output. Default writes compressed BAM.\r\n", "\r\n", "\t-s\tRequire same strandedness. That is, only report hits in B\r\n", "\t\tthat overlap A on the _same_ strand.\r\n", "\t\t- By default, overlaps are reported without respect to strand.\r\n", "\r\n", "\t-S\tRequire different strandedness. That is, only report hits in B\r\n", "\t\tthat overlap A on the _opposite_ strand.\r\n", "\t\t- By default, overlaps are reported without respect to strand.\r\n", "\r\n", "\t-f\tMinimum overlap required as a fraction of A.\r\n", "\t\t- Default is 1E-9 (i.e., 1bp).\r\n", "\t\t- FLOAT (e.g. 0.50)\r\n", "\r\n", "\t-F\tMinimum overlap required as a fraction of B.\r\n", "\t\t- Default is 1E-9 (i.e., 1bp).\r\n", "\t\t- FLOAT (e.g. 0.50)\r\n", "\r\n", "\t-r\tRequire that the fraction overlap be reciprocal for A AND B.\r\n", "\t\t- In other words, if -f is 0.90 and -r is used, this requires\r\n", "\t\t that B overlap 90% of A and A _also_ overlaps 90% of B.\r\n", "\r\n", "\t-e\tRequire that the minimum fraction be satisfied for A OR B.\r\n", "\t\t- In other words, if -e is used with -f 0.90 and -F 0.10 this requires\r\n", "\t\t that either 90% of A is covered OR 10% of B is covered.\r\n", "\t\t Without -e, both fractions would have to be satisfied.\r\n", "\r\n", "\t-split\tTreat \"split\" BAM or BED12 entries as distinct BED intervals.\r\n", "\r\n", "\t-g\tProvide a genome file to enforce consistent chromosome sort order\r\n", "\t\tacross input files. Only applies when used with -sorted option.\r\n", "\r\n", "\t-nonamecheck\tFor sorted data, don't throw an error if the file has different naming conventions\r\n", "\t\t\tfor the same chromosome. ex. \"chr1\" vs \"chr01\".\r\n", "\r\n", "\t-sorted\tUse the \"chromsweep\" algorithm for sorted (-k1,1 -k2,2n) input.\r\n", "\r\n", "\t-names\tWhen using multiple databases, provide an alias for each that\r\n", "\t\twill appear instead of a fileId when also printing the DB record.\r\n", "\r\n", "\t-filenames\tWhen using multiple databases, show each complete filename\r\n", "\t\t\tinstead of a fileId when also printing the DB record.\r\n", "\r\n", "\t-sortout\tWhen using multiple databases, sort the output DB hits\r\n", "\t\t\tfor each record.\r\n", "\r\n", "\t-bed\tIf using BAM input, write output as BED.\r\n", "\r\n", "\t-header\tPrint the header from the A file prior to results.\r\n", "\r\n", "\t-nobuf\tDisable buffered output. Using this option will cause each line\r\n", "\t\tof output to be printed as it is generated, rather than saved\r\n", "\t\tin a buffer. This will make printing large output files \r\n", "\t\tnoticeably slower, but can be useful in conjunction with\r\n", "\t\tother software tools and scripts that need to process one\r\n", "\t\tline of bedtools output at a time.\r\n", "\r\n", "\t-iobuf\tSpecify amount of memory to use for input buffer.\r\n", "\t\tTakes an integer argument. Optional suffixes K/M/G supported.\r\n", "\t\tNote: currently has no effect with compressed files.\r\n", "\r\n", "Notes: \r\n", "\t(1) When a BAM file is used for the A file, the alignment is retained if overlaps exist,\r\n", "\tand exlcuded if an overlap cannot be found. If multiple overlaps exist, they are not\r\n", "\treported, as we are only testing for one or more overlaps.\r\n", "\r\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed -h" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2a. CG motifs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is more of a sanity check than anything else. If 100% of the DML do not overlap with CG motifs, we have a problem." ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 624\n", "DML overlaps with CG motifs\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {DMLlist} \\\n", "-b {CGMotifList} \\\n", "| wc -l\n", "!echo \"DML overlaps with CG motifs\"" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "scrolled": false }, "source": [ "They all don't overlap, so I need to fix that after PCSGA." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2b. Exons" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 157\n", "DML overlaps with exons\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {DMLlist} \\\n", "-b {exonList} \\\n", "| wc -l\n", "!echo \"DML overlaps with exons\"" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wb \\\n", "-a {DMLlist} \\\n", "-b {exonList} \\\n", "> 2019-09-15-DML-Exon.txt" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C28982\t4929\t4931\t100\tC28982\tGLEAN\tCDS\t4851\t4993\t.\t-\t0\tParent=CGI_10000485;\r\n", "C33708\t8307\t8309\t100\tC33708\tGLEAN\tCDS\t8220\t8464\t.\t-\t2\tParent=CGI_10001120;\r\n", "scaffold100\t676764\t676766\t100\tscaffold100\tGLEAN\tCDS\t676685\t676795\t.\t-\t0\tParent=CGI_10026448;\r\n", "scaffold102\t108199\t108201\t-100\tscaffold102\tGLEAN\tCDS\t108080\t108283\t.\t+\t0\tParent=CGI_10028421;\r\n", "scaffold102\t108944\t108946\t-100\tscaffold102\tGLEAN\tCDS\t108904\t109002\t.\t+\t0\tParent=CGI_10028421;\r\n", "scaffold102\t110675\t110677\t-100\tscaffold102\tGLEAN\tCDS\t110633\t110736\t.\t+\t2\tParent=CGI_10028421;\r\n", "scaffold1024\t215501\t215503\t100\tscaffold1024\tGLEAN\tCDS\t215426\t215584\t.\t+\t2\tParent=CGI_10028528;\r\n", "scaffold1031\t425965\t425967\t100\tscaffold1031\tGLEAN\tCDS\t425901\t426024\t.\t+\t1\tParent=CGI_10019710;\r\n", "scaffold1037\t361706\t361708\t-100\tscaffold1037\tGLEAN\tCDS\t361695\t361813\t.\t-\t1\tParent=CGI_10017467;\r\n", "scaffold107\t1026788\t1026790\t-100\tscaffold107\tGLEAN\tCDS\t1026676\t1026843\t.\t-\t1\tParent=CGI_10025081;\r\n" ] } ], "source": [ "!head 2019-09-15-DML-Exon.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2c. Introns" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 285\n", "DML overlaps with introns\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {DMLlist} \\\n", "-b {intronList} \\\n", "| wc -l\n", "!echo \"DML overlaps with introns\"" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wb \\\n", "-a {DMLlist} \\\n", "-b {intronList} \\\n", "> 2019-09-15-DML-Intron.txt" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C34158\t8559\t8561\t100\tC34158\tsubtractBed\tintrn\t8437\t9010\t.\t-\t.\tParent=CGI_10001223;\r", "\r\n", "scaffold100\t645217\t645219\t-100\tscaffold100\tsubtractBed\tintrn\t641875\t648094\t.\t+\t.\tParent=CGI_10026446;\r", "\r\n", "scaffold102\t107843\t107845\t-100\tscaffold102\tsubtractBed\tintrn\t107045\t107927\t.\t+\t.\tParent=CGI_10028421;\r", "\r\n", "scaffold102\t108296\t108298\t-100\tscaffold102\tsubtractBed\tintrn\t108284\t108592\t.\t+\t.\tParent=CGI_10028421;\r", "\r\n", "scaffold102\t108460\t108462\t-100\tscaffold102\tsubtractBed\tintrn\t108284\t108592\t.\t+\t.\tParent=CGI_10028421;\r", "\r\n", "scaffold102\t108466\t108468\t-100\tscaffold102\tsubtractBed\tintrn\t108284\t108592\t.\t+\t.\tParent=CGI_10028421;\r", "\r\n", "scaffold102\t109070\t109072\t-100\tscaffold102\tsubtractBed\tintrn\t109003\t109589\t.\t+\t.\tParent=CGI_10028421;\r", "\r\n", "scaffold102\t1287048\t1287050\t100\tscaffold102\tsubtractBed\tintrn\t1285756\t1288440\t.\t+\t.\tParent=CGI_10028489;\r", "\r\n", "scaffold1021\t65344\t65346\t100\tscaffold1021\tsubtractBed\tintrn\t64011\t65845\t.\t-\t.\tParent=CGI_10022483;\r", "\r\n", "scaffold1021\t484587\t484589\t100\tscaffold1021\tsubtractBed\tintrn\t476972\t486430\t.\t-\t.\tParent=CGI_10022501;\r", "\r\n" ] } ], "source": [ "!head 2019-09-15-DML-Intron.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2d. Genes" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 442\n", "DML overlaps with genes\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {DMLlist} \\\n", "-b {geneList} \\\n", "| wc -l\n", "!echo \"DML overlaps with genes\"" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wb \\\n", "-a {DMLlist} \\\n", "-b {geneList} \\\n", "> 2019-09-15-DML-Genes.txt" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C28982\t4929\t4931\t100\tC28982\tGLEAN\tmRNA\t756\t6077\t0.999998\t-\t.\tID=CGI_10000485;\r\n", "C33708\t8307\t8309\t100\tC33708\tGLEAN\tmRNA\t8220\t14973\t0.727981\t-\t.\tID=CGI_10001120;\r\n", "C34158\t8559\t8561\t100\tC34158\tGLEAN\tmRNA\t8394\t13250\t1\t-\t.\tID=CGI_10001223;\r\n", "scaffold100\t645217\t645219\t-100\tscaffold100\tGLEAN\tmRNA\t633050\t652924\t0.873323\t+\t.\tID=CGI_10026446;\r\n", "scaffold100\t676764\t676766\t100\tscaffold100\tGLEAN\tmRNA\t671285\t725122\t0.993892\t-\t.\tID=CGI_10026448;\r\n", "scaffold102\t107843\t107845\t-100\tscaffold102\tGLEAN\tmRNA\t106910\t127626\t0.997949\t+\t.\tID=CGI_10028421;\r\n", "scaffold102\t108199\t108201\t-100\tscaffold102\tGLEAN\tmRNA\t106910\t127626\t0.997949\t+\t.\tID=CGI_10028421;\r\n", "scaffold102\t108296\t108298\t-100\tscaffold102\tGLEAN\tmRNA\t106910\t127626\t0.997949\t+\t.\tID=CGI_10028421;\r\n", "scaffold102\t108460\t108462\t-100\tscaffold102\tGLEAN\tmRNA\t106910\t127626\t0.997949\t+\t.\tID=CGI_10028421;\r\n", "scaffold102\t108466\t108468\t-100\tscaffold102\tGLEAN\tmRNA\t106910\t127626\t0.997949\t+\t.\tID=CGI_10028421;\r\n" ] } ], "source": [ "!head 2019-09-15-DML-Genes.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I know how many overlaps there are, but I also want to know how many unique genes have DMLs in them. For this, I will use the following code:\n", "\n", "`cut -f9 2019-09-15-DML-Genes.txt | sort | uniq -c`\n", "\n", "`cut` is the command that isolates the column information. Each gene has a unique end position, so I'll look at unique entries in the ninth column (`-f9`). The column is piped into `sort`, then that output is counted for unique lines by `uniq`. Finally, I'll pipe this into `wc -l` to count the number of unique genes." ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 389\n", "unique genes overlapping with DML\n" ] } ], "source": [ "! cut -f9 2019-09-15-DML-Genes.txt | sort | uniq -c | wc -l\n", "!echo \"unique genes overlapping with DML\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2e. Promoters" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 24\n", "DML overlaps with promoters\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {DMLlist} \\\n", "-b {promoterList} \\\n", "| wc -l\n", "!echo \"DML overlaps with promoters\"" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wb \\\n", "-a {DMLlist} \\\n", "-b {promoterList} \\\n", "> 2019-09-15-DML-Promoters.txt" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C34158\t8559\t8561\t100\tC34158\tflankbed\tpromoter\t8251\t9250\t.\t-\t.\tID=CGI_10001222;\r", "\r\n", "scaffold1004\t438043\t438045\t-100\tscaffold1004\tflankbed\tpromoter\t437292\t438291\t.\t+\t.\tID=CGI_10020779;\r", "\r\n", "scaffold1009\t963653\t963655\t100\tscaffold1009\tflankbed\tpromoter\t963342\t964341\t.\t+\t.\tID=CGI_10028771;\r", "\r\n", "scaffold1050\t80914\t80916\t100\tscaffold1050\tflankbed\tpromoter\t80116\t81115\t.\t+\t.\tID=CGI_10009169;\r", "\r\n", "scaffold1199\t128931\t128933\t100\tscaffold1199\tflankbed\tpromoter\t128693\t129692\t.\t-\t.\tID=CGI_10005928;\r", "\r\n", "scaffold1249\t210763\t210765\t100\tscaffold1249\tflankbed\tpromoter\t209971\t210970\t.\t-\t.\tID=CGI_10019056;\r", "\r\n", "scaffold1301\t565191\t565193\t100\tscaffold1301\tflankbed\tpromoter\t564701\t565700\t.\t-\t.\tID=CGI_10027731;\r", "\r\n", "scaffold1409\t351501\t351503\t-100\tscaffold1409\tflankbed\tpromoter\t351306\t352305\t.\t-\t.\tID=CGI_10013791;\r", "\r\n", "scaffold1459\t117866\t117868\t-100\tscaffold1459\tflankbed\tpromoter\t117436\t118435\t.\t-\t.\tID=CGI_10010107;\r", "\r\n", "scaffold22\t1678759\t1678761\t100\tscaffold22\tflankbed\tpromoter\t1678054\t1679053\t.\t-\t.\tID=CGI_10028927;\r", "\r\n" ] } ], "source": [ "!head 2019-09-15-DML-Promoters.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2f. Transposable Elements" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 8\n", "DML overlaps with transposable elements\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {DMLlist} \\\n", "-b {transposableElements} \\\n", "| wc -l\n", "!echo \"DML overlaps with transposable elements\"" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wb \\\n", "-a {DMLlist} \\\n", "-b {transposableElements} \\\n", "> 2019-09-15-DML-TE.txt" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "scaffold11\t354193\t354195\t100\tscaffold11\tWUBlastX\tLINE_L2\t354101\t354322\t89\t-\t.\t.\r\n", "scaffold11\t354209\t354211\t100\tscaffold11\tWUBlastX\tLINE_L2\t354101\t354322\t89\t-\t.\t.\r\n", "scaffold126\t579856\t579858\t-100\tscaffold126\tWUBlastX\tDNA_MuDR\t579734\t580051\t64\t+\t.\t.\r\n", "scaffold126\t579856\t579858\t-100\tscaffold126\tWUBlastX\tDNA_MuDR\t579743\t579997\t79\t+\t.\t.\r\n", "scaffold1301\t958700\t958702\t-100\tscaffold1301\tWUBlastX\tDNA_TcMar-Pogo\t958126\t959520\t528\t+\t.\t.\r\n", "scaffold1599\t123482\t123484\t-100\tscaffold1599\tTRF\tTandem_Repeat\t123195\t124139\t1487\t+\t.\t.\r\n", "scaffold1631\t15201\t15203\t-100\tscaffold1631\tTRF\tTandem_Repeat\t15150\t15205\t67\t+\t.\t.\r\n", "scaffold1715\t426046\t426048\t100\tscaffold1715\tWUBlastX\tDNA_IS4EU\t425909\t426592\t413\t-\t.\t.\r\n", "scaffold1715\t426046\t426048\t100\tscaffold1715\tWUBlastX\tDNA_IS4EU\t425912\t426751\t348\t-\t.\t.\r\n", "scaffold1843\t28003\t28005\t100\tscaffold1843\tWUBlastX\tDNA_hAT-hATw\t27958\t28119\t23\t-\t.\t.\r\n" ] } ], "source": [ "!head 2019-09-15-DML-TE.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2g. No overlaps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I also want to count the number of DML that do not overlap with any features (i.e. unannotated intergenic regions). To do this, I'll use the `-v` argument in `bedtools`, which reports \"those entries in A that have no overlap in B.\" I can specify multiple files with `-b`. I'll use exons, introns, transposable elements, and putative promoter regions." ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 165\n", "DML do not overlap with exons, introns, transposable elements, or promoters\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-v \\\n", "-a {DMLlist} \\\n", "-b {exonList} {intronList} {transposableElements} {promoterList} \\\n", "| wc -l\n", "!echo \"DML do not overlap with exons, introns, transposable elements, or promoters\"" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-v \\\n", "-a {DMLlist} \\\n", "-b {exonList} {intronList} {transposableElements} {promoterList} \\\n", "> 2019-09-15-No-Overlap-DML.txt" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C22384\t1328\t1330\t-100\r\n", "C22628\t1621\t1623\t100\r\n", "C29914\t4052\t4054\t-100\r\n", "C29914\t4052\t4054\t-100\r\n", "C29976\t649\t651\t-100\r\n", "C30322\t3482\t3484\t-100\r\n", "C30322\t3599\t3601\t-100\r\n", "C32984\t5070\t5072\t-100\r\n", "scaffold1017\t395465\t395467\t100\r\n", "scaffold1024\t1343499\t1343501\t100\r\n" ] } ], "source": [ "!head 2019-09-15-No-Overlap-DML.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Identify Overlaps between Other Genome Feature Tracks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3a. CG motifs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To fully understand my results, I also need to know where CG motifs are located with respect to the other features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exons" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 172056\n", "Exon overlaps with CG motifs\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {exonList} \\\n", "-b {CGMotifList} \\\n", "| wc -l\n", "!echo \"Exon overlaps with CG motifs\"" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wo \\\n", "-a {exonList} \\\n", "-b {CGMotifList} \\\n", "> 2019-09-15-Exon-CGmotifs.txt" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C16582\tGLEAN\tCDS\t35\t385\t.\t-\t0\tParent=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t41\t42\t2\t+\t.\tID=C16582.1;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tCDS\t35\t385\t.\t-\t0\tParent=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t47\t48\t2\t+\t.\tID=C16582.2;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tCDS\t35\t385\t.\t-\t0\tParent=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t67\t68\t2\t+\t.\tID=C16582.3;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tCDS\t35\t385\t.\t-\t0\tParent=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t84\t85\t2\t+\t.\tID=C16582.4;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tCDS\t35\t385\t.\t-\t0\tParent=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t93\t94\t2\t+\t.\tID=C16582.5;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tCDS\t35\t385\t.\t-\t0\tParent=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t125\t126\t2\t+\t.\tID=C16582.6;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tCDS\t35\t385\t.\t-\t0\tParent=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t177\t178\t2\t+\t.\tID=C16582.7;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tCDS\t35\t385\t.\t-\t0\tParent=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t216\t217\t2\t+\t.\tID=C16582.8;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tCDS\t35\t385\t.\t-\t0\tParent=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t257\t258\t2\t+\t.\tID=C16582.9;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tCDS\t35\t385\t.\t-\t0\tParent=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t283\t284\t2\t+\t.\tID=C16582.10;note=*pat pattern:CG\t1\r\n" ] } ], "source": [ "!head 2019-09-15-Exon-CGmotifs.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Introns" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 150884\n", "Intron overlaps with CG motifs\n" ] } ], "source": [ "!{bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {intronList} \\\n", "-b {CGMotifList} \\\n", "| wc -l\n", "!echo \"Intron overlaps with CG motifs\"" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wo \\\n", "-a {intronList} \\\n", "-b {CGMotifList} \\\n", "> 2019-09-15-Intron-CGmotifs.txt" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C19392\tsubtractBed\tintrn\t184\t451\t.\t+\t.\tParent=CGI_10000015;\r", "\tC19392\tfuzznuc\tnucleotide_motif\t244\t245\t2\t+\t.\tID=C19392.6;note=*pat pattern:CG\t1\r\n", "C20262\tsubtractBed\tintrn\t650\t871\t.\t-\t.\tParent=CGI_10000025;\r", "\tC20262\tfuzznuc\tnucleotide_motif\t690\t691\t2\t+\t.\tID=C20262.40;note=*pat pattern:CG\t1\r\n", "C20262\tsubtractBed\tintrn\t650\t871\t.\t-\t.\tParent=CGI_10000025;\r", "\tC20262\tfuzznuc\tnucleotide_motif\t697\t698\t2\t+\t.\tID=C20262.41;note=*pat pattern:CG\t1\r\n", "C20262\tsubtractBed\tintrn\t650\t871\t.\t-\t.\tParent=CGI_10000025;\r", "\tC20262\tfuzznuc\tnucleotide_motif\t739\t740\t2\t+\t.\tID=C20262.42;note=*pat pattern:CG\t1\r\n", "C20262\tsubtractBed\tintrn\t650\t871\t.\t-\t.\tParent=CGI_10000025;\r", "\tC20262\tfuzznuc\tnucleotide_motif\t846\t847\t2\t+\t.\tID=C20262.43;note=*pat pattern:CG\t1\r\n", "C20262\tsubtractBed\tintrn\t650\t871\t.\t-\t.\tParent=CGI_10000025;\r", "\tC20262\tfuzznuc\tnucleotide_motif\t867\t868\t2\t+\t.\tID=C20262.44;note=*pat pattern:CG\t1\r\n", "C20334\tsubtractBed\tintrn\t524\t867\t.\t-\t.\tParent=CGI_10000028;\r", "\tC20334\tfuzznuc\tnucleotide_motif\t786\t787\t2\t+\t.\tID=C20334.7;note=*pat pattern:CG\t1\r\n", "C20334\tsubtractBed\tintrn\t524\t867\t.\t-\t.\tParent=CGI_10000028;\r", "\tC20334\tfuzznuc\tnucleotide_motif\t815\t816\t2\t+\t.\tID=C20334.8;note=*pat pattern:CG\t1\r\n", "C20412\tsubtractBed\tintrn\t215\t409\t.\t-\t.\tParent=CGI_10000029;\r", "\tC20412\tfuzznuc\tnucleotide_motif\t244\t245\t2\t+\t.\tID=C20412.5;note=*pat pattern:CG\t1\r\n", "C20412\tsubtractBed\tintrn\t215\t409\t.\t-\t.\tParent=CGI_10000029;\r", "\tC20412\tfuzznuc\tnucleotide_motif\t286\t287\t2\t+\t.\tID=C20412.6;note=*pat pattern:CG\t1\r\n" ] } ], "source": [ "!head 2019-09-15-Intron-CGmotifs.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Genes" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 28015\n", "gene overlaps with CG motifs\n" ] } ], "source": [ "!{bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {geneList} \\\n", "-b {CGMotifList} \\\n", "| wc -l\n", "!echo \"gene overlaps with CG motifs\"" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wo \\\n", "-a {geneList} \\\n", "-b {CGMotifList} \\\n", "> 2019-09-15-Genes-CGmotifs.txt" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C16582\tGLEAN\tmRNA\t35\t385\t0.555898\t-\t.\tID=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t41\t42\t2\t+\t.\tID=C16582.1;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tmRNA\t35\t385\t0.555898\t-\t.\tID=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t47\t48\t2\t+\t.\tID=C16582.2;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tmRNA\t35\t385\t0.555898\t-\t.\tID=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t67\t68\t2\t+\t.\tID=C16582.3;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tmRNA\t35\t385\t0.555898\t-\t.\tID=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t84\t85\t2\t+\t.\tID=C16582.4;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tmRNA\t35\t385\t0.555898\t-\t.\tID=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t93\t94\t2\t+\t.\tID=C16582.5;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tmRNA\t35\t385\t0.555898\t-\t.\tID=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t125\t126\t2\t+\t.\tID=C16582.6;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tmRNA\t35\t385\t0.555898\t-\t.\tID=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t177\t178\t2\t+\t.\tID=C16582.7;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tmRNA\t35\t385\t0.555898\t-\t.\tID=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t216\t217\t2\t+\t.\tID=C16582.8;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tmRNA\t35\t385\t0.555898\t-\t.\tID=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t257\t258\t2\t+\t.\tID=C16582.9;note=*pat pattern:CG\t1\r\n", "C16582\tGLEAN\tmRNA\t35\t385\t0.555898\t-\t.\tID=CGI_10000001;\tC16582\tfuzznuc\tnucleotide_motif\t283\t284\t2\t+\t.\tID=C16582.10;note=*pat pattern:CG\t1\r\n" ] } ], "source": [ "!head 2019-09-15-Genes-CGmotifs.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Promoters" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 5696\n", "promoter overlaps with CG motifs\n" ] } ], "source": [ "!{bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {geneList} \\\n", "-b {promoterList} \\\n", "| wc -l\n", "!echo \"promoter overlaps with CG motifs\"" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wo \\\n", "-a {promoterList} \\\n", "-b {CGMotifList} \\\n", "> 2019-09-15-Promoter-CGmotifs.txt" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C17476\tflankbed\tpromoter\t258\t491\t.\t-\t.\tID=CGI_10000004;\r", "\tC17476\tfuzznuc\tnucleotide_motif\t291\t292\t2\t+\t.\tID=C17476.9;note=*pat pattern:CG\t1\r\n", "C17476\tflankbed\tpromoter\t258\t491\t.\t-\t.\tID=CGI_10000004;\r", "\tC17476\tfuzznuc\tnucleotide_motif\t389\t390\t2\t+\t.\tID=C17476.10;note=*pat pattern:CG\t1\r\n", "C17476\tflankbed\tpromoter\t258\t491\t.\t-\t.\tID=CGI_10000004;\r", "\tC17476\tfuzznuc\tnucleotide_motif\t391\t392\t2\t+\t.\tID=C17476.11;note=*pat pattern:CG\t1\r\n", "C17476\tflankbed\tpromoter\t258\t491\t.\t-\t.\tID=CGI_10000004;\r", "\tC17476\tfuzznuc\tnucleotide_motif\t398\t399\t2\t+\t.\tID=C17476.12;note=*pat pattern:CG\t1\r\n", "C17476\tflankbed\tpromoter\t258\t491\t.\t-\t.\tID=CGI_10000004;\r", "\tC17476\tfuzznuc\tnucleotide_motif\t414\t415\t2\t+\t.\tID=C17476.13;note=*pat pattern:CG\t1\r\n", "C17476\tflankbed\tpromoter\t258\t491\t.\t-\t.\tID=CGI_10000004;\r", "\tC17476\tfuzznuc\tnucleotide_motif\t439\t440\t2\t+\t.\tID=C17476.14;note=*pat pattern:CG\t1\r\n", "C17476\tflankbed\tpromoter\t258\t491\t.\t-\t.\tID=CGI_10000004;\r", "\tC17476\tfuzznuc\tnucleotide_motif\t455\t456\t2\t+\t.\tID=C17476.15;note=*pat pattern:CG\t1\r\n", "C17476\tflankbed\tpromoter\t258\t491\t.\t-\t.\tID=CGI_10000004;\r", "\tC17476\tfuzznuc\tnucleotide_motif\t460\t461\t2\t+\t.\tID=C17476.16;note=*pat pattern:CG\t1\r\n", "C17476\tflankbed\tpromoter\t258\t491\t.\t-\t.\tID=CGI_10000004;\r", "\tC17476\tfuzznuc\tnucleotide_motif\t469\t470\t2\t+\t.\tID=C17476.17;note=*pat pattern:CG\t1\r\n", "C17476\tflankbed\tpromoter\t258\t491\t.\t-\t.\tID=CGI_10000004;\r", "\tC17476\tfuzznuc\tnucleotide_motif\t488\t489\t2\t+\t.\tID=C17476.18;note=*pat pattern:CG\t1\r\n" ] } ], "source": [ "!head 2019-09-15-Promoter-CGmotifs.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Transposable Elements" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 82544\n", "Transposable element overlap with CG motifs\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {transposableElements} \\\n", "-b {CGMotifList} \\\n", "| wc -l\n", "!echo \"Transposable element overlap with CG motifs\"" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wo \\\n", "-a {transposableElements} \\\n", "-b {CGMotifList} \\\n", "> 2019-09-15-TE-CGmotifs.txt" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C21242\tTRF\tTandem_Repeat\t38\t100\t72\t+\t.\t.\tC21242\tfuzznuc\tnucleotide_motif\t48\t49\t2\t+\t.\tID=C21242.2;note=*pat pattern:CG\t1\r\n", "C21242\tTRF\tTandem_Repeat\t38\t100\t72\t+\t.\t.\tC21242\tfuzznuc\tnucleotide_motif\t57\t58\t2\t+\t.\tID=C21242.3;note=*pat pattern:CG\t1\r\n", "C21242\tTRF\tTandem_Repeat\t38\t100\t72\t+\t.\t.\tC21242\tfuzznuc\tnucleotide_motif\t70\t71\t2\t+\t.\tID=C21242.4;note=*pat pattern:CG\t1\r\n", "C21242\tTRF\tTandem_Repeat\t38\t100\t72\t+\t.\t.\tC21242\tfuzznuc\tnucleotide_motif\t79\t80\t2\t+\t.\tID=C21242.5;note=*pat pattern:CG\t1\r\n", "C21242\tTRF\tTandem_Repeat\t38\t100\t72\t+\t.\t.\tC21242\tfuzznuc\tnucleotide_motif\t92\t93\t2\t+\t.\tID=C21242.6;note=*pat pattern:CG\t1\r\n", "C21306\tTRF\tTandem_Repeat\t574\t947\t208\t+\t.\t.\tC21306\tfuzznuc\tnucleotide_motif\t664\t665\t2\t+\t.\tID=C21306.6;note=*pat pattern:CG\t1\r\n", "C21306\tTRF\tTandem_Repeat\t574\t947\t208\t+\t.\t.\tC21306\tfuzznuc\tnucleotide_motif\t799\t800\t2\t+\t.\tID=C21306.7;note=*pat pattern:CG\t1\r\n", "C21306\tTRF\tTandem_Repeat\t574\t947\t208\t+\t.\t.\tC21306\tfuzznuc\tnucleotide_motif\t904\t905\t2\t+\t.\tID=C21306.8;note=*pat pattern:CG\t1\r\n", "C21306\tTRF\tTandem_Repeat\t574\t901\t313\t+\t.\t.\tC21306\tfuzznuc\tnucleotide_motif\t664\t665\t2\t+\t.\tID=C21306.6;note=*pat pattern:CG\t1\r\n", "C21306\tTRF\tTandem_Repeat\t574\t901\t313\t+\t.\t.\tC21306\tfuzznuc\tnucleotide_motif\t799\t800\t2\t+\t.\tID=C21306.7;note=*pat pattern:CG\t1\r\n" ] } ], "source": [ "!head 2019-09-15-TE-CGmotifs.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### No overlaps" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 5118363\n", "CG motifs do not overlap with exons, introns, transposable elements, or putative promoters\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-v \\\n", "-a {CGMotifList} \\\n", "-b {exonList} {intronList} {transposableElements} {promoterList} \\\n", "| wc -l\n", "!echo \"CG motifs do not overlap with exons, introns, transposable elements, or putative promoters\"" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/bin/sh: {bedtoolsDirectory}intersectBed: command not found\r\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-v \\\n", "-a {CGMotifList} \\\n", "-b {exonList} {intronList} {transposableElementsAll} {promoterList} \\\n", "> 2019-09-15-No-Overlap-CGmotifs.txt" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "!head 2019-09-15-No-Overlap-CGmotifs.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3b. Transposable Elements" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's also good to know where transposable elements overlap with the other feature tracks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exons" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2597\n", "Exon overlaps with transposable elements\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {exonList} \\\n", "-b {transposableElements} \\\n", "| wc -l\n", "!echo \"Exon overlaps with transposable elements\"" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wb \\\n", "-a {exonList} \\\n", "-b {transposableElements} \\\n", "> 2019-09-15-Exon-TE.txt" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C22430\tGLEAN\tCDS\t263\t493\t.\t+\t0\tParent=CGI_10000081;\tC22430\tWUBlastX\tDNA_TcMar-Tc2\t230\t493\t60\t+\t.\t.\r\n", "C23316\tGLEAN\tCDS\t929\t1111\t.\t-\t0\tParent=CGI_10000116;\tC23316\tWUBlastX\tLTR_Ngaro\t929\t1675\t117\t-\t.\t.\r\n", "C23316\tGLEAN\tCDS\t953\t1111\t.\t-\t0\tParent=CGI_10000116;\tC23316\tWUBlastX\tLTR_Ngaro\t953\t1489\t157\t-\t.\t.\r\n", "C23316\tGLEAN\tCDS\t644\t660\t.\t-\t0\tParent=CGI_10000116;\tC23316\tWUBlastX\tLTR_Ngaro\t547\t660\t32\t-\t.\t.\r\n", "C23734\tGLEAN\tCDS\t512\t712\t.\t-\t0\tParent=CGI_10000149;\tC23734\tWUBlastX\tLTR_Gypsy\t23\t724\t246\t-\t.\t.\r\n", "C23734\tGLEAN\tCDS\t512\t712\t.\t-\t0\tParent=CGI_10000149;\tC23734\tWUBlastX\tLTR_Gypsy\t26\t964\t217\t-\t.\t.\r\n", "C24320\tGLEAN\tCDS\t1545\t1904\t.\t-\t0\tParent=CGI_10000190;\tC24320\tWUBlastX\tLTR_ERV1\t1545\t1904\t27\t-\t.\t.\r\n", "C24604\tGLEAN\tCDS\t792\t989\t.\t-\t2\tParent=CGI_10000200;\tC24604\tWUBlastX\tDNA_MuDR\t792\t989\t42\t-\t.\t.\r\n", "C25496\tGLEAN\tCDS\t2006\t2074\t.\t+\t0\tParent=CGI_10000243;\tC25496\tWUBlastX\tLINE_CR1\t2006\t2536\t27\t+\t.\t.\r\n", "C26320\tGLEAN\tCDS\t2740\t2964\t.\t+\t0\tParent=CGI_10000297;\tC26320\tWUBlastX\tLINE_L2\t2710\t2964\t75\t+\t.\t.\r\n" ] } ], "source": [ "!head 2019-09-15-Exon-TE.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Introns" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 18989\n", "Intron overlaps with transposable elements\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {intronList} \\\n", "-b {transposableElements} \\\n", "| wc -l\n", "!echo \"Intron overlaps with transposable elements\"" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wb \\\n", "-a {intronList} \\\n", "-b {transposableElements} \\\n", "> 2019-09-15-Intron-TE.txt" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C26674\tsubtractBed\tintrn\t2430\t2483\t.\t+\t.\tParent=CGI_10000320;\r", "\tC26674\tTRF\tTandem_Repeat\t2430\t2483\t108\t+\t.\t.\r\n", "C26674\tsubtractBed\tintrn\t2860\t2913\t.\t+\t.\tParent=CGI_10000320;\r", "\tC26674\tTRF\tTandem_Repeat\t2860\t2913\t74\t+\t.\t.\r\n", "C26936\tsubtractBed\tintrn\t1323\t1346\t.\t-\t.\tParent=CGI_10000336;\r", "\tC26936\tWUBlastX\tLTR_Gypsy\t1323\t1346\t20\t-\t.\t.\r\n", "C26936\tsubtractBed\tintrn\t2202\t3359\t.\t-\t.\tParent=CGI_10000336;\r", "\tC26936\tWUBlastX\tLTR_Gypsy\t2202\t3359\t178\t-\t.\t.\r\n", "C26936\tsubtractBed\tintrn\t2205\t2915\t.\t-\t.\tParent=CGI_10000336;\r", "\tC26936\tWUBlastX\tLTR_Gypsy\t2205\t2915\t412\t-\t.\t.\r\n", "C26936\tsubtractBed\tintrn\t1613\t2275\t.\t-\t.\tParent=CGI_10000336;\r", "\tC26936\tWUBlastX\tLTR_Gypsy\t1613\t2275\t350\t-\t.\t.\r\n", "C26936\tsubtractBed\tintrn\t2271\t3059\t.\t-\t.\tParent=CGI_10000336;\r", "\tC26936\tWUBlastX\tLTR_Gypsy\t2271\t3059\t408\t-\t.\t.\r\n", "C26936\tsubtractBed\tintrn\t1484\t2260\t.\t-\t.\tParent=CGI_10000336;\r", "\tC26936\tWUBlastX\tLTR_Gypsy\t1484\t2260\t314\t-\t.\t.\r\n", "C26936\tsubtractBed\tintrn\t2271\t2831\t.\t-\t.\tParent=CGI_10000336;\r", "\tC26936\tWUBlastX\tLTR_Gypsy-Gmr1\t2271\t2831\t413\t-\t.\t.\r\n", "C28348\tsubtractBed\tintrn\t3411\t3479\t.\t-\t.\tParent=CGI_10000423;\r", "\tC28348\tWUBlastX\tLTR_Gypsy\t3411\t3479\t23\t+\t.\t.\r\n" ] } ], "source": [ "!head 2019-09-15-Intron-TE.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Genes" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 11748\n", "gene overlaps with transposable elements\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {geneList} \\\n", "-b {transposableElements} \\\n", "| wc -l\n", "!echo \"gene overlaps with transposable elements\"" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wb \\\n", "-a {geneList} \\\n", "-b {transposableElements} \\\n", "> 2019-09-15-Gene-TE.txt" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C22430\tGLEAN\tmRNA\t263\t493\t0.555898\t+\t.\tID=CGI_10000081;\tC22430\tWUBlastX\tDNA_TcMar-Tc2\t230\t493\t60\t+\t.\t.\r\n", "C23316\tGLEAN\tmRNA\t929\t1111\t0.999999\t-\t.\tID=CGI_10000116;\tC23316\tWUBlastX\tLTR_Ngaro\t929\t1675\t117\t-\t.\t.\r\n", "C23316\tGLEAN\tmRNA\t953\t1111\t0.999999\t-\t.\tID=CGI_10000116;\tC23316\tWUBlastX\tLTR_Ngaro\t953\t1489\t157\t-\t.\t.\r\n", "C23316\tGLEAN\tmRNA\t644\t660\t0.999999\t-\t.\tID=CGI_10000116;\tC23316\tWUBlastX\tLTR_Ngaro\t547\t660\t32\t-\t.\t.\r\n", "C23734\tGLEAN\tmRNA\t512\t712\t1\t-\t.\tID=CGI_10000149;\tC23734\tWUBlastX\tLTR_Gypsy\t23\t724\t246\t-\t.\t.\r\n", "C23734\tGLEAN\tmRNA\t512\t712\t1\t-\t.\tID=CGI_10000149;\tC23734\tWUBlastX\tLTR_Gypsy\t26\t964\t217\t-\t.\t.\r\n", "C24320\tGLEAN\tmRNA\t1545\t1904\t0.999372\t-\t.\tID=CGI_10000190;\tC24320\tWUBlastX\tLTR_ERV1\t1545\t1904\t27\t-\t.\t.\r\n", "C24604\tGLEAN\tmRNA\t792\t989\t0.997918\t-\t.\tID=CGI_10000200;\tC24604\tWUBlastX\tDNA_MuDR\t792\t989\t42\t-\t.\t.\r\n", "C25496\tGLEAN\tmRNA\t2006\t2074\t0.998748\t+\t.\tID=CGI_10000243;\tC25496\tWUBlastX\tLINE_CR1\t2006\t2536\t27\t+\t.\t.\r\n", "C26320\tGLEAN\tmRNA\t2740\t2964\t0.481188\t+\t.\tID=CGI_10000297;\tC26320\tWUBlastX\tLINE_L2\t2710\t2964\t75\t+\t.\t.\r\n" ] } ], "source": [ "!head 2019-09-15-Gene-TE.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Promoters" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 3966\n", "promoter overlaps with transposable elements\n" ] } ], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {promoterList} \\\n", "-b {transposableElements} \\\n", "| wc -l\n", "!echo \"promoter overlaps with transposable elements\"" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [], "source": [ "! {bedtoolsDirectory}intersectBed \\\n", "-wb \\\n", "-a {promoterList} \\\n", "-b {transposableElements} \\\n", "> 2019-09-15-Promoter-TE.txt" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C22430\tflankbed\tpromoter\t230\t262\t.\t+\t.\tID=CGI_10000081;\r", "\tC22430\tWUBlastX\tDNA_TcMar-Tc2\t230\t493\t60\t+\t.\t.\r\n", "C23316\tflankbed\tpromoter\t1112\t1675\t.\t-\t.\tID=CGI_10000116;\r", "\tC23316\tWUBlastX\tLTR_Ngaro\t929\t1675\t117\t-\t.\t.\r\n", "C23316\tflankbed\tpromoter\t1112\t1489\t.\t-\t.\tID=CGI_10000116;\r", "\tC23316\tWUBlastX\tLTR_Ngaro\t953\t1489\t157\t-\t.\t.\r\n", "C23734\tflankbed\tpromoter\t713\t724\t.\t-\t.\tID=CGI_10000149;\r", "\tC23734\tWUBlastX\tLTR_Gypsy\t23\t724\t246\t-\t.\t.\r\n", "C23734\tflankbed\tpromoter\t713\t964\t.\t-\t.\tID=CGI_10000149;\r", "\tC23734\tWUBlastX\tLTR_Gypsy\t26\t964\t217\t-\t.\t.\r\n", "C24608\tflankbed\tpromoter\t1276\t1998\t.\t-\t.\tID=CGI_10000201;\r", "\tC24608\tWUBlastX\tDNA_En-Spm\t1276\t2823\t1086\t-\t.\t.\r\n", "C25496\tflankbed\tpromoter\t1457\t1753\t.\t+\t.\tID=CGI_10000243;\r", "\tC25496\tWUBlastX\tLINE_L1\t1457\t1753\t124\t+\t.\t.\r\n", "C25496\tflankbed\tpromoter\t853\t1033\t.\t+\t.\tID=CGI_10000243;\r", "\tC25496\tWUBlastX\tLINE_CR1\t353\t1033\t105\t+\t.\t.\r\n", "C26320\tflankbed\tpromoter\t2710\t2739\t.\t+\t.\tID=CGI_10000297;\r", "\tC26320\tWUBlastX\tLINE_L2\t2710\t2964\t75\t+\t.\t.\r\n", "C26472\tflankbed\tpromoter\t536\t730\t.\t-\t.\tID=CGI_10000305;\r", "\tC26472\tWUBlastX\tLTR_Pao\t173\t730\t39\t-\t.\t.\r\n" ] } ], "source": [ "!head 2019-09-15-Promoter-TE.txt" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## 4. Annotate DML-Gene Overlaps" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 1602k 100 1602k 0 0 5957k 0 --:--:-- --:--:-- --:--:-- 6071k\n" ] } ], "source": [ "#Download blast output from Steven\n", "!curl https://eagle.fish.washington.edu/cnidarian/oyster_v9_Gene_node2_blastout \\\n", "> Cg_v9_blastx_uniprot.05.tab" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 18158 Cg_v9_blastx_uniprot.05.tab\r\n" ] } ], "source": [ "!wc -l Cg_v9_blastx_uniprot.05.tab" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CGI_10000456\tgi|226707547|sp|A4FUY7.2|MOC2B_BOVIN\t58.45\t142\t59\t0\t4\t429\t45\t186\t9e-49\t 161\r\n", "CGI_10000457\tgi|60393871|sp|Q9BUI4.1|RPC3_HUMAN\t52.53\t198\t90\t1\t7\t588\t337\t534\t1e-62\t 209\r\n", "CGI_10000774\tgi|140225|sp|P05450.1|YAT7_RHOBL\t36.23\t69\t44\t0\t163\t369\t36\t104\t2e-09\t55.1\r\n", "CGI_10000861\tgi|81876587|sp|Q8C525.1|M21D2_MOUSE\t28.38\t148\t98\t4\t517\t957\t181\t321\t5e-11\t68.9\r\n", "CGI_10000994\tgi|2492837|sp|P95896.1|AMID_SULSO\t47.17\t494\t252\t4\t157\t1614\t9\t501\t3e-136\t 412\r\n", "CGI_10000643\tgi|13124114|sp|P57756.1|FCN2_RAT\t50.28\t181\t88\t2\t1\t543\t141\t319\t8e-58\t 190\r\n", "CGI_10000610\tgi|189029808|sp|A2A690.1|TANC2_MOUSE\t51.85\t378\t176\t2\t4\t1119\t879\t1256\t1e-124\t 401\r\n", "CGI_10001568\tgi|74896865|sp|Q54G05.1|LRRX1_DICDI\t21.21\t825\t564\t23\t409\t2709\t193\t989\t9e-21\t 102\r\n", "CGI_10001333\tgi|81885333|sp|Q6P799.3|SYSC_RAT\t70.54\t482\t140\t2\t1\t1443\t1\t481\t0.0\t 696\r\n", "CGI_10002404\tgi|51704215|sp|Q01177.2|PLMN_RAT\t38.93\t131\t64\t3\t82\t435\t265\t392\t3e-27\t99.0\r\n" ] } ], "source": [ "!head Cg_v9_blastx_uniprot.05.tab" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#convert pipes to tab\n", "!tr '|' '\\t' < Cg_v9_blastx_uniprot.05.tab \\\n", "> Cg_v9_blastx_uniprot.05.codeIsolated.tab" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CGI_10000456\tgi\t226707547\tsp\tA4FUY7.2\tMOC2B_BOVIN\t58.45\t142\t59\t0\t4\t429\t45\t186\t9e-49\t 161\r\n", "CGI_10000457\tgi\t60393871\tsp\tQ9BUI4.1\tRPC3_HUMAN\t52.53\t198\t90\t1\t7\t588\t337\t534\t1e-62\t 209\r\n", "CGI_10000774\tgi\t140225\tsp\tP05450.1\tYAT7_RHOBL\t36.23\t69\t44\t0\t163\t369\t36\t104\t2e-09\t55.1\r\n", "CGI_10000861\tgi\t81876587\tsp\tQ8C525.1\tM21D2_MOUSE\t28.38\t148\t98\t4\t517\t957\t181\t321\t5e-11\t68.9\r\n", "CGI_10000994\tgi\t2492837\tsp\tP95896.1\tAMID_SULSO\t47.17\t494\t252\t4\t157\t1614\t9\t501\t3e-136\t 412\r\n", "CGI_10000643\tgi\t13124114\tsp\tP57756.1\tFCN2_RAT\t50.28\t181\t88\t2\t1\t543\t141\t319\t8e-58\t 190\r\n", "CGI_10000610\tgi\t189029808\tsp\tA2A690.1\tTANC2_MOUSE\t51.85\t378\t176\t2\t4\t1119\t879\t1256\t1e-124\t 401\r\n", "CGI_10001568\tgi\t74896865\tsp\tQ54G05.1\tLRRX1_DICDI\t21.21\t825\t564\t23\t409\t2709\t193\t989\t9e-21\t 102\r\n", "CGI_10001333\tgi\t81885333\tsp\tQ6P799.3\tSYSC_RAT\t70.54\t482\t140\t2\t1\t1443\t1\t481\t0.0\t 696\r\n", "CGI_10002404\tgi\t51704215\tsp\tQ01177.2\tPLMN_RAT\t38.93\t131\t64\t3\t82\t435\t265\t392\t3e-27\t99.0\r\n" ] } ], "source": [ "!head Cg_v9_blastx_uniprot.05.codeIsolated.tab" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Reduce the number of columns using awk. Sort, and save as a new file.\n", "!awk -v OFS='\\t' '{print $5, $1, $15}' < Cg_v9_blastx_uniprot.05.codeIsolated.tab | sort \\\n", "> gigas-blast-sort.tab" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A0A8J8.1\tCGI_10000672\t6e-22\r\n", "A0AVF1.1\tCGI_10012146\t1e-49\r\n", "A0AVK6.1\tCGI_10023548\t2e-63\r\n", "A0AVT1.1\tCGI_10003125\t2e-25\r\n", "A0AVT1.1\tCGI_10012444\t2e-113\r\n", "A0FGR8.1\tCGI_10025868\t9e-24\r\n", "A0JC51.1\tCGI_10016118\t8e-34\r\n", "A0JM12.1\tCGI_10001556\t6e-17\r\n", "A0JM12.1\tCGI_10001637\t4e-09\r\n", "A0JM12.1\tCGI_10001773\t3e-11\r\n" ] } ], "source": [ "!head gigas-blast-sort.tab" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Uniprot codes have \".1\" appended, so those need to be removed.\n", "#Isolate the contig column name with cut\n", "#Flip order of characters with rev\n", "#Delete last three characters with cut -c\n", "#Flip order of characters with rev\n", "#Add information as a new column to annotated table with paste\n", "\n", "!cut -f1 gigas-blast-sort.tab \\\n", "| rev \\\n", "| cut -c 3- \\\n", "| rev \\\n", "> gigas-blast-sort-correctUniprot.tab" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A0A8J8\r\n", "A0AVF1\r\n", "A0AVK6\r\n", "A0AVT1\r\n", "A0AVT1\r\n", "A0FGR8\r\n", "A0JC51\r\n", "A0JM12\r\n", "A0JM12\r\n", "A0JM12\r\n" ] } ], "source": [ "#Confirm formatting\n", "!head gigas-blast-sort-correctUniprot.tab" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!paste gigas-blast-sort-correctUniprot.tab gigas-blast-sort.tab \\\n", "> gigas-blast-sort-withCorrectUniprot.tab" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!awk '{print $1\"\\t\"$3\"\\t\"$4}' gigas-blast-sort-withCorrectUniprot.tab > gigas-blast-sort-onlyCorrectUniprot.tab" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A0A8J8\tCGI_10000672\t6e-22\r\n", "A0AVF1\tCGI_10012146\t1e-49\r\n", "A0AVK6\tCGI_10023548\t2e-63\r\n", "A0AVT1\tCGI_10003125\t2e-25\r\n", "A0AVT1\tCGI_10012444\t2e-113\r\n", "A0FGR8\tCGI_10025868\t9e-24\r\n", "A0JC51\tCGI_10016118\t8e-34\r\n", "A0JM12\tCGI_10001556\t6e-17\r\n", "A0JM12\tCGI_10001637\t4e-09\r\n", "A0JM12\tCGI_10001773\t3e-11\r\n" ] } ], "source": [ "!head gigas-blast-sort-onlyCorrectUniprot.tab" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 340M 100 340M 0 0 74.4M 0 0:00:04 0:00:04 --:--:-- 76.5M\n" ] } ], "source": [ "!curl -O http://owl.fish.washington.edu/halfshell/bu-alanine-wd/17-07-20/uniprot-SP-GO.sorted" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A0A023GPI8\tLECA_CANBL\treviewed\tLectin alpha chain (CboL) [Cleaved into: Lectin beta chain; Lectin gamma chain]\t\tCanavalia boliviana\t237\t\t\tmannose binding [GO:0005537]; metal ion binding [GO:0046872]\tmannose binding [GO:0005537]; metal ion binding [GO:0046872]\tGO:0005537; GO:0046872\r\n", "A0A023GPJ0\tCDII_ENTCC\treviewed\tImmunity protein CdiI\tcdiI ECL_04450.1\tEnterobacter cloacae subsp. cloacae (strain ATCC 13047 / DSM 30054 / NBRC 13535 / NCDC 279-56)\t145\t\t\t\t\t\r\n", "A0A023PXA5\tYA19A_YEAST\treviewed\tPutative uncharacterized protein YAL019W-A\tYAL019W-A\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t189\t\t\t\t\t\r\n", "A0A023PXB0\tYA019_YEAST\treviewed\tPutative uncharacterized protein YAR019W-A\tYAR019W-A\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t110\t\t\t\t\t\r\n", "A0A023PXB5\tIRC2_YEAST\treviewed\tPutative uncharacterized membrane protein IRC2 (Increased recombination centers protein 2)\tIRC2 YDR112W\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t102\t\tintegral component of membrane [GO:0016021]\t\tintegral component of membrane [GO:0016021]\tGO:0016021\r\n", "A0A023PXB9\tYD99W_YEAST\treviewed\tPutative uncharacterized membrane protein YDR199W\tYDR199W\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t121\t\tintegral component of membrane [GO:0016021]\t\tintegral component of membrane [GO:0016021]\tGO:0016021\r\n", "A0A023PXC2\tYE53A_YEAST\treviewed\tPutative uncharacterized membrane protein YEL053W-A\tYEL053W-A\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t115\t\tintegral component of membrane [GO:0016021]\t\tintegral component of membrane [GO:0016021]\tGO:0016021\r\n", "A0A023PXC7\tYE068_YEAST\treviewed\tPutative uncharacterized membrane protein YER068C-A\tYER068C-A\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t143\t\tintegral component of membrane [GO:0016021]\t\tintegral component of membrane [GO:0016021]\tGO:0016021\r\n", "A0A023PXD3\tYE88A_YEAST\treviewed\tPutative uncharacterized protein YER088C-A\tYER088C-A\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t107\t\t\t\t\t\r\n", "A0A023PXD5\tYE147_YEAST\treviewed\tPutative uncharacterized membrane protein YER147C-A\tYER147C-A\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t136\t\tintegral component of membrane [GO:0016021]\t\tintegral component of membrane [GO:0016021]\tGO:0016021\r\n" ] } ], "source": [ "!head uniprot-SP-GO.sorted" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Join the first column in the first file with the first column in the second file\n", "#The files are tab delimited, and the output should also be tab delimited (-t $'\\t')\n", "!join -1 1 -2 1 -t $'\\t' \\\n", "gigas-blast-sort-onlyCorrectUniprot.tab \\\n", "uniprot-SP-GO.sorted \\\n", "> gigas-blast-annot.tab" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A0A8J8\tCGI_10000672\t6e-22\tANGP2_CANLF\treviewed\tAngiopoietin-2 (ANG-2)\tANGPT2\tCanis lupus familiaris (Dog) (Canis familiaris)\t495\tangiogenesis [GO:0001525]; cell differentiation [GO:0030154]; negative regulation of angiogenesis [GO:0016525]; negative regulation of blood vessel endothelial cell migration [GO:0043537]; negative regulation of cell-substrate adhesion [GO:0010812]; negative regulation of positive chemotaxis [GO:0050928]; Tie signaling pathway [GO:0048014]\textracellular space [GO:0005615]\tmetal ion binding [GO:0046872]; receptor tyrosine kinase binding [GO:0030971]\textracellular space [GO:0005615]; metal ion binding [GO:0046872]; receptor tyrosine kinase binding [GO:0030971]; angiogenesis [GO:0001525]; cell differentiation [GO:0030154]; negative regulation of angiogenesis [GO:0016525]; negative regulation of blood vessel endothelial cell migration [GO:0043537]; negative regulation of cell-substrate adhesion [GO:0010812]; negative regulation of positive chemotaxis [GO:0050928]; Tie signaling pathway [GO:0048014]\tGO:0001525; GO:0005615; GO:0010812; GO:0016525; GO:0030154; GO:0030971; GO:0043537; GO:0046872; GO:0048014; GO:0050928\r\n", "A0AVF1\tCGI_10012146\t1e-49\tIFT56_HUMAN\treviewed\tIntraflagellar transport protein 56 (Tetratricopeptide repeat protein 26) (TPR repeat protein 26)\tTTC26 IFT56\tHomo sapiens (Human)\t554\tcilium assembly [GO:0060271]; intraciliary transport [GO:0042073]; intraciliary transport involved in cilium assembly [GO:0035735]; smoothened signaling pathway [GO:0007224]; spermatid development [GO:0007286]\tcentrosome [GO:0005813]; ciliary basal body [GO:0036064]; ciliary tip [GO:0097542]; cilium [GO:0005929]; intraciliary transport particle B [GO:0030992]\t\tcentrosome [GO:0005813]; ciliary basal body [GO:0036064]; ciliary tip [GO:0097542]; cilium [GO:0005929]; intraciliary transport particle B [GO:0030992]; cilium assembly [GO:0060271]; intraciliary transport [GO:0042073]; intraciliary transport involved in cilium assembly [GO:0035735]; smoothened signaling pathway [GO:0007224]; spermatid development [GO:0007286]\tGO:0005813; GO:0005929; GO:0007224; GO:0007286; GO:0030992; GO:0035735; GO:0036064; GO:0042073; GO:0060271; GO:0097542\r\n", "A0AVK6\tCGI_10023548\t2e-63\tE2F8_HUMAN\treviewed\tTranscription factor E2F8 (E2F-8)\tE2F8\tHomo sapiens (Human)\t867\tcell cycle comprising mitosis without cytokinesis [GO:0033301]; cell proliferation [GO:0008283]; chorionic trophoblast cell differentiation [GO:0060718]; DNA damage response, signal transduction by p53 class mediator resulting in cell cycle arrest [GO:0006977]; hepatocyte differentiation [GO:0070365]; negative regulation of cytokinesis [GO:0032466]; negative regulation of transcription from RNA polymerase II promoter [GO:0000122]; placenta development [GO:0001890]; positive regulation of DNA endoreduplication [GO:0032877]; positive regulation of transcription from RNA polymerase II promoter [GO:0045944]; sprouting angiogenesis [GO:0002040]; transcription, DNA-templated [GO:0006351]; trophoblast giant cell differentiation [GO:0060707]\tcytosol [GO:0005829]; nucleolus [GO:0005730]; nucleoplasm [GO:0005654]; nucleus [GO:0005634]; transcription factor complex [GO:0005667]\tcore promoter binding [GO:0001047]; protein homodimerization activity [GO:0042803]; RNA polymerase II core promoter proximal region sequence-specific DNA binding [GO:0000978]; transcriptional repressor activity, RNA polymerase II core promoter proximal region sequence-specific binding [GO:0001078]; transcription corepressor activity [GO:0003714]; transcription factor activity, sequence-specific DNA binding [GO:0003700]\tcytosol [GO:0005829]; nucleolus [GO:0005730]; nucleoplasm [GO:0005654]; nucleus [GO:0005634]; transcription factor complex [GO:0005667]; core promoter binding [GO:0001047]; protein homodimerization activity [GO:0042803]; RNA polymerase II core promoter proximal region sequence-specific DNA binding [GO:0000978]; transcription corepressor activity [GO:0003714]; transcription factor activity, sequence-specific DNA binding [GO:0003700]; transcriptional repressor activity, RNA polymerase II core promoter proximal region sequence-specific binding [GO:0001078]; cell cycle comprising mitosis without cytokinesis [GO:0033301]; cell proliferation [GO:0008283]; chorionic trophoblast cell differentiation [GO:0060718]; DNA damage response, signal transduction by p53 class mediator resulting in cell cycle arrest [GO:0006977]; hepatocyte differentiation [GO:0070365]; negative regulation of cytokinesis [GO:0032466]; negative regulation of transcription from RNA polymerase II promoter [GO:0000122]; placenta development [GO:0001890]; positive regulation of DNA endoreduplication [GO:0032877]; positive regulation of transcription from RNA polymerase II promoter [GO:0045944]; sprouting angiogenesis [GO:0002040]; transcription, DNA-templated [GO:0006351]; trophoblast giant cell differentiation [GO:0060707]\tGO:0000122; GO:0000978; GO:0001047; GO:0001078; GO:0001890; GO:0002040; GO:0003700; GO:0003714; GO:0005634; GO:0005654; GO:0005667; GO:0005730; GO:0005829; GO:0006351; GO:0006977; GO:0008283; GO:0032466; GO:0032877; GO:0033301; GO:0042803; GO:0045944; GO:0060707; GO:0060718; GO:0070365\r\n", "A0AVT1\tCGI_10003125\t2e-25\tUBA6_HUMAN\treviewed\tUbiquitin-like modifier-activating enzyme 6 (Ubiquitin-activating enzyme 6) (EC 6.2.1.45) (Monocyte protein 4) (MOP-4) (Ubiquitin-activating enzyme E1-like protein 2) (E1-L2)\tUBA6 MOP4 UBE1L2\tHomo sapiens (Human)\t1052\tamygdala development [GO:0021764]; cellular response to DNA damage stimulus [GO:0006974]; dendritic spine development [GO:0060996]; hippocampus development [GO:0021766]; learning [GO:0007612]; locomotory behavior [GO:0007626]; protein ubiquitination [GO:0016567]; protein ubiquitination involved in ubiquitin-dependent protein catabolic process [GO:0042787]; ubiquitin-dependent protein catabolic process [GO:0006511]\tcytoplasm [GO:0005737]; cytosol [GO:0005829]\tATP binding [GO:0005524]; FAT10 activating enzyme activity [GO:0019780]; ubiquitin activating enzyme activity [GO:0004839]\tcytoplasm [GO:0005737]; cytosol [GO:0005829]; ATP binding [GO:0005524]; FAT10 activating enzyme activity [GO:0019780]; ubiquitin activating enzyme activity [GO:0004839]; amygdala development [GO:0021764]; cellular response to DNA damage stimulus [GO:0006974]; dendritic spine development [GO:0060996]; hippocampus development [GO:0021766]; learning [GO:0007612]; locomotory behavior [GO:0007626]; protein ubiquitination [GO:0016567]; protein ubiquitination involved in ubiquitin-dependent protein catabolic process [GO:0042787]; ubiquitin-dependent protein catabolic process [GO:0006511]\tGO:0004839; GO:0005524; GO:0005737; GO:0005829; GO:0006511; GO:0006974; GO:0007612; GO:0007626; GO:0016567; GO:0019780; GO:0021764; GO:0021766; GO:0042787; GO:0060996\r\n", "A0AVT1\tCGI_10012444\t2e-113\tUBA6_HUMAN\treviewed\tUbiquitin-like modifier-activating enzyme 6 (Ubiquitin-activating enzyme 6) (EC 6.2.1.45) (Monocyte protein 4) (MOP-4) (Ubiquitin-activating enzyme E1-like protein 2) (E1-L2)\tUBA6 MOP4 UBE1L2\tHomo sapiens (Human)\t1052\tamygdala development [GO:0021764]; cellular response to DNA damage stimulus [GO:0006974]; dendritic spine development [GO:0060996]; hippocampus development [GO:0021766]; learning [GO:0007612]; locomotory behavior [GO:0007626]; protein ubiquitination [GO:0016567]; protein ubiquitination involved in ubiquitin-dependent protein catabolic process [GO:0042787]; ubiquitin-dependent protein catabolic process [GO:0006511]\tcytoplasm [GO:0005737]; cytosol [GO:0005829]\tATP binding [GO:0005524]; FAT10 activating enzyme activity [GO:0019780]; ubiquitin activating enzyme activity [GO:0004839]\tcytoplasm [GO:0005737]; cytosol [GO:0005829]; ATP binding [GO:0005524]; FAT10 activating enzyme activity [GO:0019780]; ubiquitin activating enzyme activity [GO:0004839]; amygdala development [GO:0021764]; cellular response to DNA damage stimulus [GO:0006974]; dendritic spine development [GO:0060996]; hippocampus development [GO:0021766]; learning [GO:0007612]; locomotory behavior [GO:0007626]; protein ubiquitination [GO:0016567]; protein ubiquitination involved in ubiquitin-dependent protein catabolic process [GO:0042787]; ubiquitin-dependent protein catabolic process [GO:0006511]\tGO:0004839; GO:0005524; GO:0005737; GO:0005829; GO:0006511; GO:0006974; GO:0007612; GO:0007626; GO:0016567; GO:0019780; GO:0021764; GO:0021766; GO:0042787; GO:0060996\r\n", "A0FGR8\tCGI_10025868\t9e-24\tESYT2_HUMAN\treviewed\tExtended synaptotagmin-2 (E-Syt2) (Chr2Syt)\tESYT2 FAM62B KIAA1228\tHomo sapiens (Human)\t921\tendocytosis [GO:0006897]; glycosphingolipid metabolic process [GO:0006687]; lipid transport [GO:0006869]\tendoplasmic reticulum membrane [GO:0005789]; extrinsic component of cytoplasmic side of plasma membrane [GO:0031234]; integral component of plasma membrane [GO:0005887]; intrinsic component of endoplasmic reticulum membrane [GO:0031227]; membrane [GO:0016020]; organelle membrane contact site [GO:0044232]\tcadherin binding [GO:0045296]; calcium-dependent phospholipid binding [GO:0005544]; calcium ion binding [GO:0005509]; identical protein binding [GO:0042802]; phosphatidylcholine binding [GO:0031210]; phosphatidylethanolamine binding [GO:0008429]; phosphatidylinositol binding [GO:0035091]\tendoplasmic reticulum membrane [GO:0005789]; extrinsic component of cytoplasmic side of plasma membrane [GO:0031234]; integral component of plasma membrane [GO:0005887]; intrinsic component of endoplasmic reticulum membrane [GO:0031227]; membrane [GO:0016020]; organelle membrane contact site [GO:0044232]; cadherin binding [GO:0045296]; calcium ion binding [GO:0005509]; calcium-dependent phospholipid binding [GO:0005544]; identical protein binding [GO:0042802]; phosphatidylcholine binding [GO:0031210]; phosphatidylethanolamine binding [GO:0008429]; phosphatidylinositol binding [GO:0035091]; endocytosis [GO:0006897]; glycosphingolipid metabolic process [GO:0006687]; lipid transport [GO:0006869]\tGO:0005509; GO:0005544; GO:0005789; GO:0005887; GO:0006687; GO:0006869; GO:0006897; GO:0008429; GO:0016020; GO:0031210; GO:0031227; GO:0031234; GO:0035091; GO:0042802; GO:0044232; GO:0045296\r\n", "A0JC51\tCGI_10016118\t8e-34\tZIC4_XENLA\treviewed\tZinc finger protein ZIC 4 (XlZic4) (Zinc finger protein of the cerebellum 4)\tzic4\tXenopus laevis (African clawed frog)\t530\tnervous system development [GO:0007399]; neural crest formation [GO:0014029]\tnucleus [GO:0005634]\tDNA binding [GO:0003677]; metal ion binding [GO:0046872]\tnucleus [GO:0005634]; DNA binding [GO:0003677]; metal ion binding [GO:0046872]; nervous system development [GO:0007399]; neural crest formation [GO:0014029]\tGO:0003677; GO:0005634; GO:0007399; GO:0014029; GO:0046872\r\n", "A0JM12\tCGI_10001556\t6e-17\tMEG10_XENTR\treviewed\tMultiple epidermal growth factor-like domains protein 10 (Multiple EGF-like domains protein 10)\tmegf10\tXenopus tropicalis (Western clawed frog) (Silurana tropicalis)\t1114\tcell adhesion [GO:0007155]; myofibril assembly [GO:0030239]; phagocytosis [GO:0006909]; skeletal muscle satellite cell activation [GO:0014719]; skeletal muscle satellite cell differentiation [GO:0014816]; skeletal muscle satellite cell proliferation [GO:0014841]\tintegral component of membrane [GO:0016021]; plasma membrane [GO:0005886]\t\tintegral component of membrane [GO:0016021]; plasma membrane [GO:0005886]; cell adhesion [GO:0007155]; myofibril assembly [GO:0030239]; phagocytosis [GO:0006909]; skeletal muscle satellite cell activation [GO:0014719]; skeletal muscle satellite cell differentiation [GO:0014816]; skeletal muscle satellite cell proliferation [GO:0014841]\tGO:0005886; GO:0006909; GO:0007155; GO:0014719; GO:0014816; GO:0014841; GO:0016021; GO:0030239\r\n", "A0JM12\tCGI_10001637\t4e-09\tMEG10_XENTR\treviewed\tMultiple epidermal growth factor-like domains protein 10 (Multiple EGF-like domains protein 10)\tmegf10\tXenopus tropicalis (Western clawed frog) (Silurana tropicalis)\t1114\tcell adhesion [GO:0007155]; myofibril assembly [GO:0030239]; phagocytosis [GO:0006909]; skeletal muscle satellite cell activation [GO:0014719]; skeletal muscle satellite cell differentiation [GO:0014816]; skeletal muscle satellite cell proliferation [GO:0014841]\tintegral component of membrane [GO:0016021]; plasma membrane [GO:0005886]\t\tintegral component of membrane [GO:0016021]; plasma membrane [GO:0005886]; cell adhesion [GO:0007155]; myofibril assembly [GO:0030239]; phagocytosis [GO:0006909]; skeletal muscle satellite cell activation [GO:0014719]; skeletal muscle satellite cell differentiation [GO:0014816]; skeletal muscle satellite cell proliferation [GO:0014841]\tGO:0005886; GO:0006909; GO:0007155; GO:0014719; GO:0014816; GO:0014841; GO:0016021; GO:0030239\r\n", "A0JM12\tCGI_10001773\t3e-11\tMEG10_XENTR\treviewed\tMultiple epidermal growth factor-like domains protein 10 (Multiple EGF-like domains protein 10)\tmegf10\tXenopus tropicalis (Western clawed frog) (Silurana tropicalis)\t1114\tcell adhesion [GO:0007155]; myofibril assembly [GO:0030239]; phagocytosis [GO:0006909]; skeletal muscle satellite cell activation [GO:0014719]; skeletal muscle satellite cell differentiation [GO:0014816]; skeletal muscle satellite cell proliferation [GO:0014841]\tintegral component of membrane [GO:0016021]; plasma membrane [GO:0005886]\t\tintegral component of membrane [GO:0016021]; plasma membrane [GO:0005886]; cell adhesion [GO:0007155]; myofibril assembly [GO:0030239]; phagocytosis [GO:0006909]; skeletal muscle satellite cell activation [GO:0014719]; skeletal muscle satellite cell differentiation [GO:0014816]; skeletal muscle satellite cell proliferation [GO:0014841]\tGO:0005886; GO:0006909; GO:0007155; GO:0014719; GO:0014816; GO:0014841; GO:0016021; GO:0030239\r\n" ] } ], "source": [ "!head gigas-blast-annot.tab" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#convert = and ; to tab\n", "!tr '=' '\\t' < 2019-09-15-DML-Genes.txt | \\\n", "tr ';' '\\t' \\\n", "> 2019-09-15-DML-Genes-CGI-Unfolded.txt" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C28982\t4929\t4931\t100\tC28982\tGLEAN\tmRNA\t756\t6077\t0.999998\t-\t.\tID\tCGI_10000485\t\r\n", "C33708\t8307\t8309\t100\tC33708\tGLEAN\tmRNA\t8220\t14973\t0.727981\t-\t.\tID\tCGI_10001120\t\r\n", "C34158\t8559\t8561\t100\tC34158\tGLEAN\tmRNA\t8394\t13250\t1\t-\t.\tID\tCGI_10001223\t\r\n", "scaffold100\t645217\t645219\t-100\tscaffold100\tGLEAN\tmRNA\t633050\t652924\t0.873323\t+\t.\tID\tCGI_10026446\t\r\n", "scaffold100\t676764\t676766\t100\tscaffold100\tGLEAN\tmRNA\t671285\t725122\t0.993892\t-\t.\tID\tCGI_10026448\t\r\n", "scaffold102\t107843\t107845\t-100\tscaffold102\tGLEAN\tmRNA\t106910\t127626\t0.997949\t+\t.\tID\tCGI_10028421\t\r\n", "scaffold102\t108199\t108201\t-100\tscaffold102\tGLEAN\tmRNA\t106910\t127626\t0.997949\t+\t.\tID\tCGI_10028421\t\r\n", "scaffold102\t108296\t108298\t-100\tscaffold102\tGLEAN\tmRNA\t106910\t127626\t0.997949\t+\t.\tID\tCGI_10028421\t\r\n", "scaffold102\t108460\t108462\t-100\tscaffold102\tGLEAN\tmRNA\t106910\t127626\t0.997949\t+\t.\tID\tCGI_10028421\t\r\n", "scaffold102\t108466\t108468\t-100\tscaffold102\tGLEAN\tmRNA\t106910\t127626\t0.997949\t+\t.\tID\tCGI_10028421\t\r\n" ] } ], "source": [ "!head 2019-09-15-DML-Genes-CGI-Unfolded.txt" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Reduce the number of columns using awk. Sort, and save as a new file.\n", "!awk -v OFS='\\t' '{print $14, $1, $2, $3, $4, $7, $8, $9}' < 2019-09-15-DML-Genes-CGI-Unfolded.txt | sort \\\n", "> DML-Gene-sort.tab" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CGI_10000428\tscaffold1242\t4421\t4423\t100\tmRNA\t3925\t4806\r\n", "CGI_10000485\tC28982\t4929\t4931\t100\tmRNA\t756\t6077\r\n", "CGI_10000789\tscaffold475\t8452\t8454\t100\tmRNA\t4744\t9923\r\n", "CGI_10001120\tC33708\t8307\t8309\t100\tmRNA\t8220\t14973\r\n", "CGI_10001200\tscaffold34088\t7645\t7647\t-100\tmRNA\t6736\t14285\r\n", "CGI_10001223\tC34158\t8559\t8561\t100\tmRNA\t8394\t13250\r\n", "CGI_10001227\tscaffold34202\t5513\t5515\t100\tmRNA\t527\t10844\r\n", "CGI_10001358\tscaffold34756\t3725\t3727\t-100\tmRNA\t21\t10691\r\n", "CGI_10001451\tscaffold35286\t17331\t17333\t100\tmRNA\t5633\t23296\r\n", "CGI_10001721\tscaffold36006\t22505\t22507\t100\tmRNA\t1062\t28962\r\n" ] } ], "source": [ "!head DML-Gene-sort.tab" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Join the first column in the first file with the third column in the second file\n", "#The files are tab delimited, and the output should also be tab delimited (-t $'\\t')\n", "!join -1 1 -2 2 -t $'\\t' \\\n", "DML-Gene-sort.tab \\\n", "gigas-blast-annot.tab \\\n", "> DML-Gene-annot.tab" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Q52KJ8\tQ52KJ8.2\tCGI_10000428\t6e-41\tMSRB1_RAT\treviewed\tMethionine-R-sulfoxide reductase B1 (MsrB1) (EC 1.8.4.-) (Selenoprotein X) (SelX)\tMsrb1 Sepx1\tRattus norvegicus (Rat)\t116\tactin filament polymerization [GO:0030041]; innate immune response [GO:0045087]; protein repair [GO:0030091]; response to oxidative stress [GO:0006979]\tcytoplasm [GO:0005737]; cytoskeleton [GO:0005856]; nucleus [GO:0005634]\tactin binding [GO:0003779]; metal ion binding [GO:0046872]; peptide-methionine (R)-S-oxide reductase activity [GO:0033743]\tcytoplasm [GO:0005737]; cytoskeleton [GO:0005856]; nucleus [GO:0005634]; actin binding [GO:0003779]; metal ion binding [GO:0046872]; peptide-methionine (R)-S-oxide reductase activity [GO:0033743]; actin filament polymerization [GO:0030041]; innate immune response [GO:0045087]; protein repair [GO:0030091]; response to oxidative stress [GO:0006979]\tGO:0003779; GO:0005634; GO:0005737; GO:0005856; GO:0006979; GO:0030041; GO:0030091; GO:0033743; GO:0045087; GO:0046872\r\n" ] } ], "source": [ "!grep \"CGI_10000428\" gigas-blast-annot.tab" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!awk -F\"\\t\" 'NR==FNR{a[$1]=$0;next}$2 in a{print $0\"\\t\"a[$2]}' DML-Gene-sort.tab gigas-blast-annot.tab > DML-Gene-annot.tab" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A0NGI1\tCGI_10010648\t1e-31\tTM234_ANOGA\treviewed\tTransmembrane protein 234 homolog\tAGAP012180\tAnopheles gambiae (African malaria mosquito)\t140\t\tintegral component of membrane [GO:0016021]\t\tintegral component of membrane [GO:0016021]\tGO:0016021\tCGI_10010648\tscaffold1127\t45024\t45026\t100\tmRNA\t43951\t45586\r\n", "A1L2Y1\tCGI_10019566\t7e-07\tFSIP1_XENLA\treviewed\tFibrous sheath-interacting protein 1\tfsip1\tXenopus laevis (African clawed frog)\t459\t\t\t\t\t\tCGI_10019566\tscaffold1794\t46921\t46923\t-100\tmRNA\t42711\t62714\r\n", "A2AHJ4\tCGI_10020687\t0.0\tBRWD3_MOUSE\treviewed\tBromodomain and WD repeat-containing protein 3\tBrwd3 Gm596\tMus musculus (Mouse)\t1799\tcytoskeleton organization [GO:0007010]; regulation of cell shape [GO:0008360]; regulation of transcription from RNA polymerase II promoter [GO:0006357]\tnucleus [GO:0005634]\t\tnucleus [GO:0005634]; cytoskeleton organization [GO:0007010]; regulation of cell shape [GO:0008360]; regulation of transcription from RNA polymerase II promoter [GO:0006357]\tGO:0005634; GO:0006357; GO:0007010; GO:0008360\tCGI_10020687\tscaffold288\t633290\t633292\t100\tmRNA\t617164\t636803\r\n", "A2AWP0\tCGI_10026569\t7e-19\tBIRC7_MOUSE\treviewed\tBaculoviral IAP repeat-containing protein 7 (EC 2.3.2.27) (Livin) (RING-type E3 ubiquitin transferase BIRC7) [Cleaved into: Baculoviral IAP repeat-containing protein 7 30 kDa subunit (Truncated livin) (p30-Livin) (tLivin)]\tBirc7 Livin\tMus musculus (Mouse)\t285\tinhibition of cysteine-type endopeptidase activity involved in apoptotic process [GO:1990001]; negative regulation of apoptotic process [GO:0043066]; protein ubiquitination [GO:0016567]; regulation of cell proliferation [GO:0042127]; regulation of natural killer cell apoptotic process [GO:0070247]; regulation of signal transduction [GO:0009966]\tcytoplasm [GO:0005737]; cytosol [GO:0005829]; Golgi apparatus [GO:0005794]; microtubule organizing center [GO:0005815]; nucleus [GO:0005634]\tcysteine-type endopeptidase inhibitor activity involved in apoptotic process [GO:0043027]; metal ion binding [GO:0046872]; ubiquitin-protein transferase activity [GO:0004842]\tcytoplasm [GO:0005737]; cytosol [GO:0005829]; Golgi apparatus [GO:0005794]; microtubule organizing center [GO:0005815]; nucleus [GO:0005634]; cysteine-type endopeptidase inhibitor activity involved in apoptotic process [GO:0043027]; metal ion binding [GO:0046872]; ubiquitin-protein transferase activity [GO:0004842]; inhibition of cysteine-type endopeptidase activity involved in apoptotic process [GO:1990001]; negative regulation of apoptotic process [GO:0043066]; protein ubiquitination [GO:0016567]; regulation of cell proliferation [GO:0042127]; regulation of natural killer cell apoptotic process [GO:0070247]; regulation of signal transduction [GO:0009966]\tGO:0004842; GO:0005634; GO:0005737; GO:0005794; GO:0005815; GO:0005829; GO:0009966; GO:0016567; GO:0042127; GO:0043027; GO:0043066; GO:0046872; GO:0070247; GO:1990001\tCGI_10026569\tscaffold145\t960693\t960695\t100\tmRNA\t959637\t969232\r\n", "A2BGR3\tCGI_10025557\t0.0\tERC6L_DANRE\treviewed\tDNA excision repair protein ERCC-6-like (EC 3.6.4.12) (ATP-dependent helicase ERCC6-like)\tercc6l si:ch211-278b8.3\tDanio rerio (Zebrafish) (Brachydanio rerio)\t1451\tcell cycle [GO:0007049]; cell division [GO:0051301]; intein-mediated protein splicing [GO:0016539]\tcondensed chromosome kinetochore [GO:0000777]\tATP binding [GO:0005524]; DNA binding [GO:0003677]; helicase activity [GO:0004386]\tcondensed chromosome kinetochore [GO:0000777]; ATP binding [GO:0005524]; DNA binding [GO:0003677]; helicase activity [GO:0004386]; cell cycle [GO:0007049]; cell division [GO:0051301]; intein-mediated protein splicing [GO:0016539]\tGO:0000777; GO:0003677; GO:0004386; GO:0005524; GO:0007049; GO:0016539; GO:0051301\tCGI_10025557\tscaffold370\t652930\t652932\t100\tmRNA\t630608\t656609\r\n", "A4D0V7\tCGI_10020074\t2e-71\tCPED1_HUMAN\treviewed\tCadherin-like and PC-esterase domain-containing protein 1\tCPED1 C7orf58 UNQ9432/PRO34713\tHomo sapiens (Human)\t1026\t\tendoplasmic reticulum [GO:0005783]\t\tendoplasmic reticulum [GO:0005783]\tGO:0005783\tCGI_10020074\tscaffold258\t549090\t549092\t100\tmRNA\t540142\t564206\r\n", "A4IF87\tCGI_10011812\t2e-148\tGNPAT_BOVIN\treviewed\tDihydroxyacetone phosphate acyltransferase (DAP-AT) (DHAP-AT) (EC 2.3.1.42) (Acyl-CoA:dihydroxyacetonephosphateacyltransferase) (Glycerone-phosphate O-acyltransferase)\tGNPAT\tBos taurus (Bovine)\t680\tether lipid biosynthetic process [GO:0008611]; glycerophospholipid metabolic process [GO:0006650]\tperoxisomal membrane [GO:0005778]\tglycerone-phosphate O-acyltransferase activity [GO:0016287]\tperoxisomal membrane [GO:0005778]; glycerone-phosphate O-acyltransferase activity [GO:0016287]; ether lipid biosynthetic process [GO:0008611]; glycerophospholipid metabolic process [GO:0006650]\tGO:0005778; GO:0006650; GO:0008611; GO:0016287\tCGI_10011812\tscaffold43692\t75103\t75105\t100\tmRNA\t63821\t78304\r\n", "A6NHR9\tCGI_10013732\t0.0\tSMHD1_HUMAN\treviewed\tStructural maintenance of chromosomes flexible hinge domain-containing protein 1 (SMC hinge domain-containing protein 1)\tSMCHD1 KIAA0650\tHomo sapiens (Human)\t2005\tchromosome organization [GO:0051276]; inactivation of X chromosome by DNA methylation [GO:0060821]; nose development [GO:0043584]\tBarr body [GO:0001740]\tATPase activity [GO:0016887]; ATP binding [GO:0005524]\tBarr body [GO:0001740]; ATP binding [GO:0005524]; ATPase activity [GO:0016887]; chromosome organization [GO:0051276]; inactivation of X chromosome by DNA methylation [GO:0060821]; nose development [GO:0043584]\tGO:0001740; GO:0005524; GO:0016887; GO:0043584; GO:0051276; GO:0060821\tCGI_10013732\tscaffold43878\t343347\t343349\t100\tmRNA\t331528\t360286\r\n", "A6QQ68\tCGI_10009117\t2e-24\tF228B_BOVIN\treviewed\tProtein FAM228B\tFAM228B\tBos taurus (Bovine)\t264\t\t\t\t\t\tCGI_10009117\tscaffold43184\t175852\t175854\t-100\tmRNA\t172157\t193737\r\n", "A6QQP7\tCGI_10024760\t0.0\tDYSF_BOVIN\treviewed\tDysferlin (Dystrophy-associated fer-1-like protein) (Fer-1-like protein 1)\tDYSF FER1L1\tBos taurus (Bovine)\t2107\tvesicle fusion [GO:0006906]\tcytoplasmic vesicle membrane [GO:0030659]; integral component of membrane [GO:0016021]; plasma membrane [GO:0005886]; sarcolemma [GO:0042383]\tcalcium ion binding [GO:0005509]; phospholipid binding [GO:0005543]\tcytoplasmic vesicle membrane [GO:0030659]; integral component of membrane [GO:0016021]; plasma membrane [GO:0005886]; sarcolemma [GO:0042383]; calcium ion binding [GO:0005509]; phospholipid binding [GO:0005543]; vesicle fusion [GO:0006906]\tGO:0005509; GO:0005543; GO:0005886; GO:0006906; GO:0016021; GO:0030659; GO:0042383\tCGI_10024760\tscaffold70\t199391\t199393\t100\tmRNA\t171032\t212395\r\n" ] } ], "source": [ "!head DML-Gene-annot.tab" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 2314k 100 2314k 0 0 18.6M 0 --:--:-- --:--:-- --:--:-- 19.4M\n" ] } ], "source": [ "!curl -O http://owl.fish.washington.edu/halfshell/bu-alanine-wd/17-07-20/GO-GOslim.sorted" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#reducing number of columns and sorting to get spid evlaue - \n", "!paste gigas-blast-sort-onlyCorrectUniprot.tab > _blast-sort.tab" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A0A8J8\tCGI_10000672\t6e-22\r\n", "A0AVF1\tCGI_10012146\t1e-49\r\n", "A0AVK6\tCGI_10023548\t2e-63\r\n", "A0AVT1\tCGI_10003125\t2e-25\r\n", "A0AVT1\tCGI_10012444\t2e-113\r\n", "A0FGR8\tCGI_10025868\t9e-24\r\n", "A0JC51\tCGI_10016118\t8e-34\r\n", "A0JM12\tCGI_10001556\t6e-17\r\n", "A0JM12\tCGI_10001637\t4e-09\r\n", "A0JM12\tCGI_10001773\t3e-11\r\n" ] } ], "source": [ "!head _blast-sort.tab" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#joining blast with uniprot annoation file and reducing to three columns UniprotID, Query, All GO terms\n", "!join -t $'\\t' \\\n", "_blast-sort.tab \\\n", "uniprot-SP-GO.sorted \\\n", "| cut -f1,2,14 \\\n", "> _blast-annot.tab" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A0A8J8\tCGI_10000672\tGO:0001525; GO:0005615; GO:0010812; GO:0016525; GO:0030154; GO:0030971; GO:0043537; GO:0046872; GO:0048014; GO:0050928\r\n", "A0AVF1\tCGI_10012146\tGO:0005813; GO:0005929; GO:0007224; GO:0007286; GO:0030992; GO:0035735; GO:0036064; GO:0042073; GO:0060271; GO:0097542\r\n" ] } ], "source": [ "!head -2 _blast-annot.tab\n" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash \n", "\n", "# This script was originally written to address a specific problem that Rhonda was having\n", "\n", "\n", "\n", "# input_file is the initial, \"problem\" file\n", "# file is an intermediate file that most of the program works upon\n", "# output_file is the final file produced by the script\n", "input_file=\"_blast-annot.tab\"\n", "file=\"_intermediate.file\"\n", "output_file=\"_blast-GO-unfolded.tab\"\n", "\n", "# sed command substitutes the \"; \" sequence to a tab and writes the new format to a new file.\n", "# This character sequence is how the GO terms are delimited in their field.\n", "sed $'s/; /\\t/g' \"$input_file\" > \"$file\"\n", "\n", "# Identify first field containing a GO term.\n", "# Search file with grep for \"GO:\" and pipe to awk.\n", "# Awk sets tab as field delimiter (-F'\\t'), runs a for loop that looks for \"GO:\" (~/GO:/), and then prints the field number).\n", "# Awk results are piped to sort, which sorts unique by number (-ug).\n", "# Sort results are piped to head to retrieve the lowest value (i.e. the top of the list; \"-n1\").\n", "begin_goterms=$(grep \"GO:\" \"$file\" | awk -F'\\t' '{for (i=1;i<=NF;i++) if($i ~/GO:/) print i}' | sort -ug | head -n1)\n", "\n", "# While loop to process each line of the input file.\n", "while read -r line\n", "\tdo\n", "\t\n", "\t# Send contents of the current line to awk.\n", "\t# Set the field separator as a tab (-F'\\t') and print the number of fields in that line.\n", "\t# Save the results of the echo/awk pipe (i.e. number of fields) to the variable \"max_field\".\n", "\tmax_field=$(echo \"$line\" | awk -F'\\t' '{print NF}')\n", "\n", "\t# Send contents of current line to cut.\n", "\t# Cut fields (i.e. retain those fields) 1-12.\n", "\t# Save the results of the echo/cut pipe (i.e. fields 1-12) to the variable \"fixed_fields\"\n", "\tfixed_fields=$(echo \"$line\" | cut -f1-2)\n", "\n", "\t# Since not all the lines contain the same number of fields (e.g. may not have GO terms),\n", "\t# evaluate the number of fields in each line to determine how to handle current line.\n", "\n", "\t# If the value in max_field is less than the field number where the GO terms begin,\n", "\t# then just print the current line (%s) followed by a newline (\\n).\n", "\tif (( \"$max_field\" < \"$begin_goterms\" ))\n", "\t\tthen printf \"%s\\n\" \"$line\"\n", "\t\t\telse\n", "\n", "\t\t\t# Send contents of current line (which contains GO terms) to cut.\n", "\t\t\t# Cut fields (i.e. retain those fields) 13 to whatever the last field is in the curent line.\n", "\t\t\t# Save the results of the echo/cut pipe (i.e. all the GO terms fields) to the variable \"goterms\".\n", "\t\t\tgoterms=$(echo \"$line\" | cut -f\"$begin_goterms\"-\"$max_field\")\n", "\t\t\t\n", "\t\t\t# Assign values in the variable \"goterms\" to a new indexed array (called \"array\"), \n", "\t\t\t# with tab delimiter (IFS=$'\\t')\n", "\t\t\tIFS=$'\\t' read -r -a array <<<\"$goterms\"\n", "\t\t\t\n", "\t\t\t# Iterate through each element of the array.\n", "\t\t\t# Print the first 12 fields (i.e. the fields stored in \"fixed_fields\") followed by a tab (%s\\t).\n", "\t\t\t# Print the current element in the array (i.e. the current GO term) followed by a new line (%s\\n).\n", "\t\t\tfor element in \"${!array[@]}\"\t\n", "\t\t\t\tdo printf \"%s\\t%s\\n\" \"$fixed_fields\" \"${array[$element]}\"\n", "\t\t\tdone\n", "\tfi\n", "\n", "# Send the input file into the while loop and send the output to a file named \"rhonda_fixed.txt\".\n", "done < \"$file\" > \"$output_file\"" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A0A8J8\tCGI_10000672\tGO:0001525\r\n", "A0A8J8\tCGI_10000672\tGO:0005615\r\n", "A0A8J8\tCGI_10000672\tGO:0010812\r\n", "A0A8J8\tCGI_10000672\tGO:0016525\r\n", "A0A8J8\tCGI_10000672\tGO:0030154\r\n", "A0A8J8\tCGI_10000672\tGO:0030971\r\n", "A0A8J8\tCGI_10000672\tGO:0043537\r\n", "A0A8J8\tCGI_10000672\tGO:0046872\r\n", "A0A8J8\tCGI_10000672\tGO:0048014\r\n", "A0A8J8\tCGI_10000672\tGO:0050928\r\n" ] } ], "source": [ "!head _blast-GO-unfolded.tab" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sort: unrecognized option `--version-sort'\r\n", "Try `sort --help' for more information.\r\n", "awk: write error on /dev/stdout\r\n", " input record number 691, file _blast-GO-unfolded.tab\r\n", " source line number 1\r\n" ] } ], "source": [ "!awk '{print $3\"\\t\"$2}' _blast-GO-unfolded.tab | sort --version-sort > _blast-GO-unfolded.sorted" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#extra space was removed tw\n", "!head _blast-GO-unfolded.sorted" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: sort [OPTION]... [FILE]...\r\n", "Write sorted concatenation of all FILE(s) to standard output.\r\n", "\r\n", "Mandatory arguments to long options are mandatory for short options too.\r\n", "Ordering options:\r\n", "\r\n", " -b, --ignore-leading-blanks ignore leading blanks\r\n", " -d, --dictionary-order consider only blanks and alphanumeric characters\r\n", " -f, --ignore-case fold lower case to upper case characters\r\n", " -g, --general-numeric-sort compare according to general numerical value\r\n", " -i, --ignore-nonprinting consider only printable characters\r\n", " -M, --month-sort compare (unknown) < `JAN' < ... < `DEC'\r\n", " -n, --numeric-sort compare according to string numerical value\r\n", " -r, --reverse reverse the result of comparisons\r\n", "\r\n", "Other options:\r\n", "\r\n", " -c, --check check whether input is sorted; do not sort\r\n", " -k, --key=POS1[,POS2] start a key at POS1, end it at POS2 (origin 1)\r\n", " -m, --merge merge already sorted files; do not sort\r\n", " -o, --output=FILE write result to FILE instead of standard output\r\n", " -s, --stable stabilize sort by disabling last-resort comparison\r\n", " -S, --buffer-size=SIZE use SIZE for main memory buffer\r\n", " -t, --field-separator=SEP use SEP instead of non-blank to blank transition\r\n", " -T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;\r\n", " multiple options specify multiple directories\r\n", " -u, --unique with -c, check for strict ordering;\r\n", " without -c, output only the first of an equal run\r\n", " -z, --zero-terminated end lines with 0 byte, not newline\r\n", " --help display this help and exit\r\n", " --version output version information and exit\r\n", "\r\n", "POS is F[.C][OPTS], where F is the field number and C the character position\r\n", "in the field. OPTS is one or more single-letter ordering options, which\r\n", "override global ordering options for that key. If no key is given, use the\r\n", "entire line as the key.\r\n", "\r\n", "SIZE may be followed by the following multiplicative suffixes:\r\n", "% 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.\r\n", "\r\n", "With no FILE, or when FILE is -, read standard input.\r\n", "\r\n", "*** WARNING ***\r\n", "The locale specified by the environment affects sort order.\r\n", "Set LC_ALL=C to get the traditional sort order that uses\r\n", "native byte values.\r\n", "\r\n", "Report bugs to .\r\n" ] } ], "source": [ "!sort --help" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }