{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Use bedtools to see where DMLs and MACAU loci are located.\n", "\n", "DMLs between the Olympia oyster populations, Hood Canal and South Sound, were identified using MethylKit. File is: [analyses/dml25.bed](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/dml25.bed) \n", "\n", "MACAU was used to identify loci at which methylation is associated with a phenotype, in our case shell length, while controlling for relatedness. \n", "- All loci with 10x coverage across 75% of samples: [analyses/macau-all-loci.10x75.bed](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/analyses/macau/macau-all-loci.10x75.bed)\n", "- Loci, any samples with 10x coverage:[analyses/macau.sign.perc.meth.10x75.bed](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/analyses/macau/macau.sign.perc.meth.10x75.bed)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/Users/laura/Documents/roberts-lab/paper-oly-mbdbs-gen/code'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pwd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Make directory for BED output" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir: ../analyses/BEDtools/: File exists\r\n" ] } ], "source": [ "mkdir ../analyses/BEDtools/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set file paths for feature files \n", "\n", "[Olurida_v081-20190709.gene.gff](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/genome-features/Olurida_v081-20190709.gene.gff) - genes \n", "[Olurida_v081-20190709.CDS.gff](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/genome-features/Olurida_v081-20190709.CDS.gff) - Coding regions of genes \n", "[Olurida_v081-20190709.exon.gff](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/genome-features/Olurida_v081-20190709.exon.gff) - Exons \n", "[Olurida_v081-20190709.mRNA.gff](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/genome-features/Olurida_v081-20190709.mRNA.gff) - mRNA \n", "[Olurida_v081_TE-Cg.gff](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/genome-features/Olurida_v081_TE-Cg.gff) - Transposable elements \n", "[20190709-Olurida_v081.stringtie.gtf](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/genome-features/20190709-Olurida_v081.stringtie.gtf) - alternative splice variants \n", "\n", "Note: may also have an intron track, Steven was working on that. Could also try to get new 3' and 5' UTR tracks. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "gene = \"../genome-features/Olurida_v081-20190709.gene.gff\"\n", "CDS = \"../genome-features/Olurida_v081-20190709.CDS.gff\"\n", "exon = \"../genome-features/Olurida_v081-20190709.exon.gff\"\n", "mRNA = \"../genome-features/Olurida_v081-20190709.mRNA.gff\"\n", "TE = \"../genome-features/Olurida_v081_TE-Cg.gff\"\n", "ASV = \"../genome-features/20190709-Olurida_v081.stringtie.gtf\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Background files used for MACAU and DMLs " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "AllLociMACAU = \"../analyses/macau/macau-all-loci.bed\"\n", "AllLociDMLs = \"../analyses/mydiff-all.bed\"" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Contig0\t39226\t39226\n", "Contig0\t39234\t39234\n", "Contig0\t64179\t64179\n", "Contig0\t71523\t71523\n", "Contig0\t71533\t71533\n", "Contig0\t71542\t71542\n", "Contig0\t71546\t71546\n", "Contig0\t71558\t71558\n", "Contig0\t71563\t71563\n", "Contig0\t71617\t71617\n", " 33284 ../analyses/macau/macau-all-loci.bed\n" ] } ], "source": [ "! head {AllLociMACAU}\n", "! wc -l {AllLociMACAU}" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Contig0\t39225\t39227\t-6\n", "Contig0\t39233\t39235\t6\n", "Contig0\t64178\t64180\t0\n", "Contig0\t71522\t71524\t-2\n", "Contig0\t71532\t71534\t0\n", "Contig0\t71541\t71543\t-1\n", "Contig0\t71545\t71547\t-2\n", "Contig0\t71557\t71559\t0\n", "Contig0\t71562\t71564\t-4\n", "Contig0\t71616\t71618\t2\n", " 33738 ../analyses/mydiff-all.bed\n" ] } ], "source": [ "! head {AllLociDMLs}\n", "! wc -l {AllLociDMLs}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preview DML and MACAU loci bed files " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Contig33119\t17449\t17449\n", "Contig55340\t2651\t2651\n", "Contig29959\t3332\t3332\n", "Contig1438\t20471\t20471\n", " 4 ../analyses/macau/macau.sign.perc.meth.bed\n" ] } ], "source": [ "macau = \"../analyses/macau/macau.sign.perc.meth.bed\"\n", "!head {macau}\n", "!wc -l {macau}" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Contig100994\t3427\t3429\t-31\n", "Contig101661\t3024\t3026\t26\n", "Contig102755\t406\t408\t28\n", "Contig102998\t2220\t2222\t26\n", "Contig103186\t2986\t2988\t33\n", "Contig103405\t169\t171\t27\n", "Contig103503\t7053\t7055\t-36\n", "Contig104531\t8145\t8147\t-37\n", "Contig105226\t1696\t1698\t-29\n", "Contig109515\t3377\t3379\t54\n", " 359 ../analyses/dml25.bed\n" ] } ], "source": [ "DML = \"../analyses/dml25.bed\"\n", "!head {DML}\n", "!wc -l {DML}" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Contig0\t39226\t39226\r\n", "Contig0\t39234\t39234\r\n", "Contig0\t64179\t64179\r\n", "Contig0\t71523\t71523\r\n", "Contig0\t71533\t71533\r\n", "Contig0\t71542\t71542\r\n", "Contig0\t71546\t71546\r\n", "Contig0\t71558\t71558\r\n", "Contig0\t71563\t71563\r\n", "Contig0\t71617\t71617\r\n" ] } ], "source": [ "! head {AllLociMACAU}" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Contig0\t39225\t39227\t-6\r\n", "Contig0\t39233\t39235\t6\r\n", "Contig0\t64178\t64180\t0\r\n", "Contig0\t71522\t71524\t-2\r\n", "Contig0\t71532\t71534\t0\r\n", "Contig0\t71541\t71543\t-1\r\n", "Contig0\t71545\t71547\t-2\r\n", "Contig0\t71557\t71559\t0\r\n", "Contig0\t71562\t71564\t-4\r\n", "Contig0\t71616\t71618\t2\r\n" ] } ], "source": [ "! head {AllLociDMLs}" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Tool: bedtools intersect (aka intersectBed)\r\n", "Version: v2.29.0\r\n", "Summary: Report overlaps between two feature files.\r\n", "\r\n", "Usage: bedtools intersect [OPTIONS] -a -b \r\n", "\r\n", "\tNote: -b may be followed with multiple databases and/or \r\n", "\twildcard (*) character(s). \r\n", "Options: \r\n", "\t-wa\tWrite the original entry in A for each overlap.\r\n", "\r\n", "\t-wb\tWrite the original entry in B for each overlap.\r\n", "\t\t- Useful for knowing _what_ A overlaps. Restricted by -f and -r.\r\n", "\r\n", "\t-loj\tPerform a \"left outer join\". That is, for each feature in A\r\n", "\t\treport each overlap with B. If no overlaps are found, \r\n", "\t\treport a NULL feature for B.\r\n", "\r\n", "\t-wo\tWrite the original A and B entries plus the number of base\r\n", "\t\tpairs of overlap between the two features.\r\n", "\t\t- Overlaps restricted by -f and -r.\r\n", "\t\t Only A features with overlap are reported.\r\n", "\r\n", "\t-wao\tWrite the original A and B entries plus the number of base\r\n", "\t\tpairs of overlap between the two features.\r\n", "\t\t- Overlapping features restricted by -f and -r.\r\n", "\t\t However, A features w/o overlap are also reported\r\n", "\t\t with a NULL B feature and overlap = 0.\r\n", "\r\n", "\t-u\tWrite the original A entry _once_ if _any_ overlaps found in B.\r\n", "\t\t- In other words, just report the fact >=1 hit was found.\r\n", "\t\t- Overlaps restricted by -f and -r.\r\n", "\r\n", "\t-c\tFor each entry in A, report the number of overlaps with B.\r\n", "\t\t- Reports 0 for A entries that have no overlap with B.\r\n", "\t\t- Overlaps restricted by -f, -F, -r, and -s.\r\n", "\r\n", "\t-C\tFor each entry in A, separately report the number of\r\n", "\t\t- overlaps with each B file on a distinct line.\r\n", "\t\t- Reports 0 for A entries that have no overlap with B.\r\n", "\t\t- Overlaps restricted by -f, -F, -r, and -s.\r\n", "\r\n", "\t-v\tOnly report those entries in A that have _no overlaps_ with B.\r\n", "\t\t- Similar to \"grep -v\" (an homage).\r\n", "\r\n", "\t-ubam\tWrite uncompressed BAM output. Default writes compressed BAM.\r\n", "\r\n", "\t-s\tRequire same strandedness. That is, only report hits in B\r\n", "\t\tthat overlap A on the _same_ strand.\r\n", "\t\t- By default, overlaps are reported without respect to strand.\r\n", "\r\n", "\t-S\tRequire different strandedness. That is, only report hits in B\r\n", "\t\tthat overlap A on the _opposite_ strand.\r\n", "\t\t- By default, overlaps are reported without respect to strand.\r\n", "\r\n", "\t-f\tMinimum overlap required as a fraction of A.\r\n", "\t\t- Default is 1E-9 (i.e., 1bp).\r\n", "\t\t- FLOAT (e.g. 0.50)\r\n", "\r\n", "\t-F\tMinimum overlap required as a fraction of B.\r\n", "\t\t- Default is 1E-9 (i.e., 1bp).\r\n", "\t\t- FLOAT (e.g. 0.50)\r\n", "\r\n", "\t-r\tRequire that the fraction overlap be reciprocal for A AND B.\r\n", "\t\t- In other words, if -f is 0.90 and -r is used, this requires\r\n", "\t\t that B overlap 90% of A and A _also_ overlaps 90% of B.\r\n", "\r\n", "\t-e\tRequire that the minimum fraction be satisfied for A OR B.\r\n", "\t\t- In other words, if -e is used with -f 0.90 and -F 0.10 this requires\r\n", "\t\t that either 90% of A is covered OR 10% of B is covered.\r\n", "\t\t Without -e, both fractions would have to be satisfied.\r\n", "\r\n", "\t-split\tTreat \"split\" BAM or BED12 entries as distinct BED intervals.\r\n", "\r\n", "\t-g\tProvide a genome file to enforce consistent chromosome sort order\r\n", "\t\tacross input files. Only applies when used with -sorted option.\r\n", "\r\n", "\t-nonamecheck\tFor sorted data, don't throw an error if the file has different naming conventions\r\n", "\t\t\tfor the same chromosome. ex. \"chr1\" vs \"chr01\".\r\n", "\r\n", "\t-sorted\tUse the \"chromsweep\" algorithm for sorted (-k1,1 -k2,2n) input.\r\n", "\r\n", "\t-names\tWhen using multiple databases, provide an alias for each that\r\n", "\t\twill appear instead of a fileId when also printing the DB record.\r\n", "\r\n", "\t-filenames\tWhen using multiple databases, show each complete filename\r\n", "\t\t\tinstead of a fileId when also printing the DB record.\r\n", "\r\n", "\t-sortout\tWhen using multiple databases, sort the output DB hits\r\n", "\t\t\tfor each record.\r\n", "\r\n", "\t-bed\tIf using BAM input, write output as BED.\r\n", "\r\n", "\t-header\tPrint the header from the A file prior to results.\r\n", "\r\n", "\t-nobuf\tDisable buffered output. Using this option will cause each line\r\n", "\t\tof output to be printed as it is generated, rather than saved\r\n", "\t\tin a buffer. This will make printing large output files \r\n", "\t\tnoticeably slower, but can be useful in conjunction with\r\n", "\t\tother software tools and scripts that need to process one\r\n", "\t\tline of bedtools output at a time.\r\n", "\r\n", "\t-iobuf\tSpecify amount of memory to use for input buffer.\r\n", "\t\tTakes an integer argument. Optional suffixes K/M/G supported.\r\n", "\t\tNote: currently has no effect with compressed files.\r\n", "\r\n", "Notes: \r\n", "\t(1) When a BAM file is used for the A file, the alignment is retained if overlaps exist,\r\n", "\tand excluded if an overlap cannot be found. If multiple overlaps exist, they are not\r\n", "\treported, as we are only testing for one or more overlaps.\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "***** ERROR: No input file given. Exiting. *****\r\n" ] } ], "source": [ "! bedtools intersect" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bedtool options to use: \n", "`-u` - Write the original A entry _once_ if _any_ overlaps found in B, _i.e._ just report the fact >=1 hit was found \n", "`-a` - File A \n", "`-b` - File B " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. DMLs " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loci differentially methylated between SS and HC populations:\n", " 359\n", "Loci that overlap with genes:\n", " 214\n", "Loci that overlap with exons:\n", " 189\n", "Loci that overlap with coding sequences:\n", " 184\n", "Loci that overlap with mRNA:\n", " 214\n", "Loci that overlap with transposable elements:\n", " 9\n", "Loci that overlap with alternative splice variants:\n", " 329\n", "Loci that do not overlap with known features:\n", " 28\n" ] } ], "source": [ "! echo \"Loci differentially methylated between SS and HC populations:\"\n", "! cat {DML} | wc -l \n", "\n", "!echo \"Loci that overlap with genes:\"\n", "! bedtools intersect \\\n", "-u \\\n", "-a {DML} \\\n", "-b {gene} | wc -l\n", "\n", "!echo \"Loci that overlap with exons:\"\n", "! bedtools intersect \\\n", "-u \\\n", "-a {DML} \\\n", "-b {exon} | wc -l\n", "\n", "!echo \"Loci that overlap with coding sequences:\"\n", "! bedtools intersect \\\n", "-u \\\n", "-a {DML} \\\n", "-b {CDS} | wc -l\n", "\n", "!echo \"Loci that overlap with mRNA:\"\n", "! bedtools intersect \\\n", "-u \\\n", "-a {DML} \\\n", "-b {mRNA} | wc -l\n", "\n", "!echo \"Loci that overlap with transposable elements:\"\n", "! bedtools intersect \\\n", "-u \\\n", "-a {DML} \\\n", "-b {TE} | wc -l\n", "\n", "!echo \"Loci that overlap with alternative splice variants:\"\n", "! bedtools intersect \\\n", "-u \\\n", "-a {DML} \\\n", "-b {ASV} | wc -l\n", "\n", "!echo \"Loci that do not overlap with known features:\"\n", "! bedtools intersect \\\n", "-v \\\n", "-a {DML} \\\n", "-b {gene} {exon} {CDS} {mRNA} {TE} {ASV} | wc -l" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save DML lists to file " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "! bedtools intersect -wb -a {DML} -b {gene} > ../analyses/BEDtools/DML-gene.bed\n", "! bedtools intersect -wb -a {DML} -b {exon} > ../analyses/BEDtools/DML-exon.bed\n", "! bedtools intersect -wb -a {DML} -b {CDS} > ../analyses/BEDtools/DML-CDS.bed\n", "! bedtools intersect -wb -a {DML} -b {mRNA} > ../analyses/BEDtools/DML-mRNA.bed\n", "! bedtools intersect -wb -a {DML} -b {TE} > ../analyses/BEDtools/DML-TE.bed\n", "! bedtools intersect -wb -a {DML} -b {ASV} > ../analyses/BEDtools/DML-ASV.bed\n", "! bedtools intersect -v -a {DML} -b {gene} {exon} {CDS} {mRNA} {TE} {ASV} > ../analyses/BEDtools/DML-intragenic.bed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save background loci feature lists to files " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "genes\n", " 18097\n", "exon\n", " 15728\n", "CDS\n", " 15155\n", "mRNA\n", " 18097\n", "TE\n", " 1579\n", "ASV\n", " 28361\n", "intragenic\n", " 4674\n" ] } ], "source": [ "! echo \"genes\" \n", "! bedtools intersect -u -a {AllLociDMLs} -b {gene} | wc -l\n", "! echo \"exon\" \n", "! bedtools intersect -u -a {AllLociDMLs} -b {exon} | wc -l\n", "! echo \"CDS\" \n", "! bedtools intersect -u -a {AllLociDMLs} -b {CDS} | wc -l\n", "! echo \"mRNA\"\n", "! bedtools intersect -u -a {AllLociDMLs} -b {mRNA} | wc -l\n", "! echo \"TE\" \n", "! bedtools intersect -u -a {AllLociDMLs} -b {TE} | wc -l\n", "! echo \"ASV\" \n", "! bedtools intersect -u -a {AllLociDMLs} -b {ASV} | wc -l\n", "! echo \"intragenic\" \n", "! bedtools intersect -v -a {AllLociDMLs} -b {gene} {exon} {CDS} {mRNA} {TE} {ASV} | wc -l" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "! bedtools intersect -wb -a {AllLociDMLs} -b {gene} > ../analyses/BEDtools/AllLociDMLs-gene.bed\n", "! bedtools intersect -wb -a {AllLociDMLs} -b {exon} > ../analyses/BEDtools/AllLociDMLs-exon.bed\n", "! bedtools intersect -wb -a {AllLociDMLs} -b {CDS} > ../analyses/BEDtools/AllLociDMLs-CDS.bed\n", "! bedtools intersect -wb -a {AllLociDMLs} -b {mRNA} > ../analyses/BEDtools/AllLociDMLs-mRNA.bed\n", "! bedtools intersect -wb -a {AllLociDMLs} -b {TE} > ../analyses/BEDtools/AllLociDMLs-TE.bed\n", "! bedtools intersect -wb -a {AllLociDMLs} -b {ASV} > ../analyses/BEDtools/AllLociDMLs-ASV.bed\n", "! bedtools intersect -v -a {AllLociDMLs} -b {gene} {exon} {CDS} {mRNA} {TE} {ASV} > ../analyses/BEDtools/AllLociDMLs-intragenic.bed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. MACAU Loci " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total methylated loci (10x across 75% of samples):\n", " 108490\n", "Loci associated with shell length (MACAU):\n", " 4\n", "Loci that overlap with genes:\n", " 1\n", "Loci that overlap with exons:\n", " 1\n", "Loci that overlap with coding sequences:\n", " 1\n", "Loci that overlap with mRNA:\n", " 1\n", "Loci that overlap with transposable elements:\n", " 0\n", "Loci that overlap with alternative splice variants:\n", " 3\n", "Loci that do not overlap with known features:\n", " 1\n" ] } ], "source": [ "! echo \"Total methylated loci (10x across 75% of samples):\" \n", "! cat ../analyses/macau/macau-all-loci.10x75.bed | wc -l\n", "\n", "! echo \"Loci associated with shell length (MACAU):\"\n", "! cat {macau} | wc -l \n", "\n", "!echo \"Loci that overlap with genes:\"\n", "! bedtools intersect \\\n", "-u \\\n", "-a {macau} \\\n", "-b {gene} | wc -l\n", "\n", "!echo \"Loci that overlap with exons:\"\n", "! bedtools intersect \\\n", "-u \\\n", "-a {macau} \\\n", "-b {exon} | wc -l\n", "\n", "!echo \"Loci that overlap with coding sequences:\"\n", "! bedtools intersect \\\n", "-u \\\n", "-a {macau} \\\n", "-b {CDS} | wc -l\n", "\n", "!echo \"Loci that overlap with mRNA:\"\n", "! bedtools intersect \\\n", "-u \\\n", "-a {macau} \\\n", "-b {mRNA} | wc -l\n", "\n", "!echo \"Loci that overlap with transposable elements:\"\n", "! bedtools intersect \\\n", "-u \\\n", "-a {macau} \\\n", "-b {TE} | wc -l\n", "\n", "!echo \"Loci that overlap with alternative splice variants:\"\n", "! bedtools intersect \\\n", "-u \\\n", "-a {macau} \\\n", "-b {ASV} | wc -l\n", "\n", "!echo \"Loci that do not overlap with known features:\"\n", "! bedtools intersect \\\n", "-v \\\n", "-a {macau} \\\n", "-b {gene} {exon} {CDS} {mRNA} {TE} {ASV} | wc -l" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save macau lists to file " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "! bedtools intersect -wb -a {AllLociMACAU} -b {gene} > ../analyses/BEDtools/AllLociMACAU-gene.bed\n", "! bedtools intersect -wb -a {macau} -b {gene} > ../analyses/BEDtools/macau-gene.bed\n", "! bedtools intersect -wb -a {macau} -b {exon} > ../analyses/BEDtools/macau-exon.bed\n", "! bedtools intersect -wb -a {macau} -b {CDS} > ../analyses/BEDtools/macau-CDS.bed\n", "! bedtools intersect -wb -a {macau} -b {mRNA} > ../analyses/BEDtools/macau-mRNA.bed\n", "! bedtools intersect -wb -a {macau} -b {TE} > ../analyses/BEDtools/macau-TE.bed\n", "! bedtools intersect -wb -a {macau} -b {ASV} > ../analyses/BEDtools/macau-ASV.bed\n", "! bedtools intersect -v -a {macau} -b {gene} {exon} {CDS} {mRNA} {TE} {ASV} > ../analyses/BEDtools/macau-intragenic.bed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save MACAU background lists to file, too" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "! bedtools intersect -wb -a {AllLociMACAU} -b {gene} > ../analyses/BEDtools/AllLociMACAU-gene.bed\n", "! bedtools intersect -wb -a {AllLociMACAU} -b {exon} > ../analyses/BEDtools/AllLociMACAU-exon.bed\n", "! bedtools intersect -wb -a {AllLociMACAU} -b {CDS} > ../analyses/BEDtools/AllLociMACAU-CDS.bed\n", "! bedtools intersect -wb -a {AllLociMACAU} -b {mRNA} > ../analyses/BEDtools/AllLociMACAU-mRNA.bed\n", "! bedtools intersect -wb -a {AllLociMACAU} -b {TE} > ../analyses/BEDtools/AllLociMACAU-TE.bed\n", "! bedtools intersect -wb -a {AllLociMACAU} -b {ASV} > ../analyses/BEDtools/AllLociMACAU-ASV.bed\n", "! bedtools intersect -v -a {AllLociMACAU} -b {gene} {exon} {CDS} {mRNA} {TE} {ASV} > ../analyses/BEDtools/AllLociMACAU-intragenic.bed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare blastx annotation files to merge with DML and MACAU results \n", "\n", "The actual merging will occur in a later RStudio notebook" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 1037k 100 1037k 0 0 1383k 0 --:--:-- --:--:-- --:--:-- 1381k\n" ] } ], "source": [ "! curl https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/analyses/Olgene_blastx_uniprot.05.tab \\\n", " > ../data/Olgene_blastx_uniprot.05.tab" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 11803 ../data/Olgene_blastx_uniprot.05-20191122.tab\r\n" ] } ], "source": [ "#convert pipes to tab\n", "!tr '|' '\\t' < ../data/Olgene_blastx_uniprot.05.tab \\\n", "> ../data/Olgene_blastx_uniprot.05-20191122.tab\n", "! wc -l ../data/Olgene_blastx_uniprot.05-20191122.tab" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 11803 ../data/Olgene_blastx_uniprot.05-20191122-sort.tab\r\n" ] } ], "source": [ "#Reduce the number of columns using awk. Sort, and save as a new file.\n", "!awk -v OFS='\\t' '{print $1, $3, $13}' \\\n", "< ../data/Olgene_blastx_uniprot.05-20191122.tab | sort \\\n", "> ../data/Olgene_blastx_uniprot.05-20191122-sort.tab\n", "! wc -l ../data/Olgene_blastx_uniprot.05-20191122-sort.tab" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Contig100018:1232-2375\tP31695\t2.23e-06\r\n", "Contig100073:8284-10076\tH2A0L8\t6.98e-24\r\n", "Contig100101:2158-2821\tO35796\t3.67e-28\r\n", "Contig100107:1089-2009\tQ2KMM2\t8.78e-15\r\n", "Contig100114:437-2094\tQ9V4M2\t1.41e-09\r\n", "Contig100163:2402-6678\tP23708\t2.55e-18\r\n", "Contig100166:542-2054\tG5EBR3\t2.08e-11\r\n", "Contig100170:472-1350\tQ5F3T9\t9.14e-42\r\n", "Contig100188:460-2761\tQ8TD26\t1.35e-18\r\n", "Contig100206:5719-12338\tQ2HJH1\t1.51e-14\r\n" ] } ], "source": [ "! head ../data/Olgene_blastx_uniprot.05-20191122-sort.tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Boneyard" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/bin/sh: temporary/olurida-blast-sort2.tab: No such file or directory\r\n", "cut: temporary/olurida-blast-sort.tab: No such file or directory\r\n" ] } ], "source": [ "#Uniprot codes have \".1\" appended, so those need to be removed.\n", "#Isolate the contig column name with cut\n", "#Flip order of characters with rev\n", "#Delete last three characters with cut -c\n", "#Flip order of characters with rev\n", "#Add information as a new column to annotated table with paste\n", "\n", "!cut -f1 temporary/olurida-blast-sort.tab \\\n", "| rev \\\n", "| cut -c 3- \\\n", "| rev \\\n", "> temporary/olurida-blast-sort2.tab" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 340M 100 340M 0 0 2083k 0 0:02:47 0:02:47 --:--:-- 2187k 0 2154k 0 0:02:41 0:00:16 0:02:25 2175k0 1988k 0 0:02:55 0:00:36 0:02:19 1911k0 0:02:59 0:00:57 0:02:02 2190k79M 0 0 2067k 0 0:02:48 0:02:18 0:00:30 2206k\n" ] } ], "source": [ "!curl http://owl.fish.washington.edu/halfshell/bu-alanine-wd/17-07-20/uniprot-SP-GO.sorted \\\n", " > ../data/uniprot-SP-GO-sorted.tab" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A0A023GPI8\tLECA_CANBL\treviewed\tLectin alpha chain (CboL) [Cleaved into: Lectin beta chain; Lectin gamma chain]\t\tCanavalia boliviana\t237\t\t\tmannose binding [GO:0005537]; metal ion binding [GO:0046872]\tmannose binding [GO:0005537]; metal ion binding [GO:0046872]\tGO:0005537; GO:0046872\r\n", "A0A023GPJ0\tCDII_ENTCC\treviewed\tImmunity protein CdiI\tcdiI ECL_04450.1\tEnterobacter cloacae subsp. cloacae (strain ATCC 13047 / DSM 30054 / NBRC 13535 / NCDC 279-56)\t145\t\t\t\t\t\r\n", "A0A023PXA5\tYA19A_YEAST\treviewed\tPutative uncharacterized protein YAL019W-A\tYAL019W-A\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t189\t\t\t\t\t\r\n", "A0A023PXB0\tYA019_YEAST\treviewed\tPutative uncharacterized protein YAR019W-A\tYAR019W-A\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t110\t\t\t\t\t\r\n", "A0A023PXB5\tIRC2_YEAST\treviewed\tPutative uncharacterized membrane protein IRC2 (Increased recombination centers protein 2)\tIRC2 YDR112W\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t102\t\tintegral component of membrane [GO:0016021]\t\tintegral component of membrane [GO:0016021]\tGO:0016021\r\n", "A0A023PXB9\tYD99W_YEAST\treviewed\tPutative uncharacterized membrane protein YDR199W\tYDR199W\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t121\t\tintegral component of membrane [GO:0016021]\t\tintegral component of membrane [GO:0016021]\tGO:0016021\r\n", "A0A023PXC2\tYE53A_YEAST\treviewed\tPutative uncharacterized membrane protein YEL053W-A\tYEL053W-A\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t115\t\tintegral component of membrane [GO:0016021]\t\tintegral component of membrane [GO:0016021]\tGO:0016021\r\n", "A0A023PXC7\tYE068_YEAST\treviewed\tPutative uncharacterized membrane protein YER068C-A\tYER068C-A\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t143\t\tintegral component of membrane [GO:0016021]\t\tintegral component of membrane [GO:0016021]\tGO:0016021\r\n", "A0A023PXD3\tYE88A_YEAST\treviewed\tPutative uncharacterized protein YER088C-A\tYER088C-A\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t107\t\t\t\t\t\r\n", "A0A023PXD5\tYE147_YEAST\treviewed\tPutative uncharacterized membrane protein YER147C-A\tYER147C-A\tSaccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)\t136\t\tintegral component of membrane [GO:0016021]\t\tintegral component of membrane [GO:0016021]\tGO:0016021\r\n" ] } ], "source": [ "! head ../data/uniprot-SP-GO-sorted.tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Join the first column in the first file with the first column in the second file\n", "\n", "The files are tab delimited, and the output should also be tab delimited (-t $'\\t')" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "! join -1 2 -2 1 -t $'\\t' \\\n", "../data/Olgene_blastx_uniprot.05-20191122-sort.tab \\\n", "../data/uniprot-SP-GO-sorted.tab \\\n", "> ../data/Oly_blastx_uniprot.tab" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 9 ../data/Oly_blastx_uniprot.tab\r\n" ] } ], "source": [ "! wc -l ../data/Oly_blastx_uniprot.tab" ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 1 }