{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Generating Genome Feature Tracks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I will create genome feature tracks to use in downstream analyses. While pre-made genome feature tracks exist, it's beneficial to generate these tracks to understand what elements they contain. I adapted my code from the code provided in the [FROGER GitHub repository](https://github.com/hputnam/FROGER/wiki/Genome-Sequence-Files-and-Feature-Tracks).\n", "\n", "1. Download *C. virginica* genome file\n", "2. Separate various tracks\n", "3. Visualize tracks in IGV\n", "4. Characterize track overlap with CG motifs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Set working directory" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'/Users/yaamini/Documents/paper-gonad-meth/code'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pwd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/yaamini/Documents/paper-gonad-meth/genome-feature-tracks\n" ] } ], "source": [ "cd ../genome-feature-tracks/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Download files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1a. 
Pre-generated CG motif track" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 533M 100 533M 0 0 59.6M 0 0:00:08 0:00:08 --:--:-- 58.1M\n" ] } ], "source": [ "!curl http://eagle.fish.washington.edu/Cvirg_tracks/C_virginica-3.0_CG-motif.bed > C_virginica-3.0_CG-motif.bed" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\t28\t30\tCG_motif\r\n", "NC_035780.1\t54\t56\tCG_motif\r\n", "NC_035780.1\t75\t77\tCG_motif\r\n", "NC_035780.1\t93\t95\tCG_motif\r\n", "NC_035780.1\t103\t105\tCG_motif\r\n", "NC_035780.1\t116\t118\tCG_motif\r\n", "NC_035780.1\t134\t136\tCG_motif\r\n", "NC_035780.1\t159\t161\tCG_motif\r\n", "NC_035780.1\t209\t211\tCG_motif\r\n", "NC_035780.1\t224\t226\tCG_motif\r\n" ] } ], "source": [ "!head C_virginica-3.0_CG-motif.bed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1b. *C. 
virginica* genome from NCBI" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 16.2M 100 16.2M 0 0 6234k 0 0:00:02 0:00:02 --:--:-- 6411k\n" ] } ], "source": [ "!curl ftp://ftp.ncbi.nlm.nih.gov/genomes/Crassostrea_virginica/GFF/ref_C_virginica-3.0_top_level.gff3.gz > ref_C_virginica-3.0_top_level.gff3.gz" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!gunzip ref_C_virginica-3.0_top_level.gff3.gz" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "##gff-version 3\r\n", "#!gff-spec-version 1.21\r\n", "#!processor NCBI annotwriter\r\n", "#!genome-build C_virginica-3.0\r\n", "#!genome-build-accession NCBI_Assembly:GCF_002022765.2\r\n", "#!annotation-date 14 September 2017\r\n", "#!annotation-source NCBI Crassostrea virginica Annotation Release 100\r\n", "##sequence-region NC_035780.1 1 65668440\r\n", "##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6565\r\n", "NC_035780.1\tRefSeq\tregion\t1\t65668440\t.\t+\t.\tID=id0;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample\r\n" ] } ], "source": [ "!head ref_C_virginica-3.0_top_level.gff3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Set variable paths" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "bedtoolsDirectory = \"/Users/Shared/bioinformatics/bedtools2/bin/\"" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "fullGenome = \"../data/C_virginica-3.0_genomic.fa\"" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "CGMotifList = \"C_virginica-3.0_CG-motif.bed\"" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "chromLengths = \"2018-06-15-bedtools-Chromosome-Lengths.txt\"" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cut: head: No such file or directory\r\n" ] } ], "source": [ "!cut -f1 {chromLengths} > 2019-05-28-bedtools-Chromosome-Names.txt" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\r\n", "NC_035781.1\r\n", "NC_035782.1\r\n", "NC_035783.1\r\n", "NC_035784.1\r\n", "NC_035785.1\r\n", "NC_035786.1\r\n", "NC_035787.1\r\n", "NC_035788.1\r\n", "NC_035789.1\r\n", "NC_007175.2\r\n" ] } ], "source": [ "!cat 2019-05-28-bedtools-Chromosome-Names.txt" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "chromNames = \"2019-05-28-bedtools-Chromosome-Names.txt\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Generate feature tracks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3a. Genes" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Isolate gene entries. 
Tab must be included between \"Gnomon\" and \"gene\"\n", "!grep \"Gnomon\tgene\" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_gene_yrv.gff3" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Sort file for downstream use\n", "!{bedtoolsDirectory}sortBed \\\n", "-faidx {chromNames} \\\n", "-i C_virginica-3.0_Gnomon_gene_yrv.gff3 \\\n", "> C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Set variable path\n", "geneList = \"C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3\"" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\tGnomon\tgene\t13578\t14594\t.\t+\t.\tID=gene0;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA\r\n", "NC_035780.1\tGnomon\tgene\t28961\t33324\t.\t+\t.\tID=gene1;Dbxref=GeneID:111126949;Name=LOC111126949;gbkey=Gene;gene=LOC111126949;gene_biotype=protein_coding\r\n", "NC_035780.1\tGnomon\tgene\t43111\t66897\t.\t-\t.\tID=gene2;Dbxref=GeneID:111110729;Name=LOC111110729;gbkey=Gene;gene=LOC111110729;gene_biotype=protein_coding\r\n", "NC_035780.1\tGnomon\tgene\t85606\t95254\t.\t-\t.\tID=gene3;Dbxref=GeneID:111112434;Name=LOC111112434;gbkey=Gene;gene=LOC111112434;gene_biotype=protein_coding\r\n", "NC_035780.1\tGnomon\tgene\t99840\t106460\t.\t+\t.\tID=gene4;Dbxref=GeneID:111120752;Name=LOC111120752;gbkey=Gene;gene=LOC111120752;gene_biotype=protein_coding\r\n", "NC_035780.1\tGnomon\tgene\t108305\t110077\t.\t-\t.\tID=gene5;Dbxref=GeneID:111128944;Name=LOC111128944;gbkey=Gene;gene=LOC111128944;gene_biotype=protein_coding;partial=true;start_range=.,108305\r\n", 
"NC_035780.1\tGnomon\tgene\t151859\t157536\t.\t+\t.\tID=gene6;Dbxref=GeneID:111128953;Name=LOC111128953;gbkey=Gene;gene=LOC111128953;gene_biotype=protein_coding\r\n", "NC_035780.1\tGnomon\tgene\t163809\t183798\t.\t-\t.\tID=gene7;Dbxref=GeneID:111105691;Name=LOC111105691;gbkey=Gene;gene=LOC111105691;gene_biotype=protein_coding\r\n", "NC_035780.1\tGnomon\tgene\t164820\t166793\t.\t+\t.\tID=gene8;Dbxref=GeneID:111105685;Name=LOC111105685;gbkey=Gene;gene=LOC111105685;gene_biotype=protein_coding\r\n", "NC_035780.1\tGnomon\tgene\t169468\t170178\t.\t-\t.\tID=gene9;Dbxref=GeneID:111105702;Name=LOC111105702;gbkey=Gene;gene=LOC111105702;gene_biotype=lncRNA\r\n" ] } ], "source": [ "#View file\n", "!head {geneList}" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 38929 C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3\n", "genes in the C. virginica genome\n" ] } ], "source": [ "#Count the number of genes\n", "!wc -l {geneList}\n", "!echo \"genes in the C. virginica genome\"" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%bash\n", "#GFF files can be a lot to visualize, so I'll create a BED file version too\n", "#GFF start coordinates are 1-based and BED starts are 0-based, so subtract 1 from the start\n", "awk '{print $1\"\\t\"($4-1)\"\\t\"$5}' C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3 \\\n", "> C_virginica-3.0_Gnomon_gene_sorted_yrv.bed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3b. 
mRNA" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!grep \"Gnomon\tmRNA\" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_mRNA_yrv.gff3" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mRNAList = \"C_virginica-3.0_Gnomon_mRNA_yrv.gff3\"" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n" ] } ], "source": [ "!head -n1 {mRNAList}" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 60201 C_virginica-3.0_Gnomon_mRNA_yrv.gff3\n", "mRNAs in the C. virginica genome\n" ] } ], "source": [ "!wc -l {mRNAList}\n", "!echo \"mRNAs in the C. virginica genome\"" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "#Create a BED version; subtract 1 from the 1-based GFF start to get a 0-based BED start\n", "awk '{print $1\"\\t\"($4-1)\"\\t\"$5}' C_virginica-3.0_Gnomon_mRNA_yrv.gff3 \\\n", "> C_virginica-3.0_Gnomon_mRNA_yrv.bed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3c. 
Exons" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!grep \"Gnomon\texon\" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_exon_yrv.gff3" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Sort the exon file for downstream use\n", "!{bedtoolsDirectory}sortBed \\\n", "-faidx {chromNames} \\\n", "-i C_virginica-3.0_Gnomon_exon_yrv.gff3 \\\n", "> C_virginica-3.0_Gnomon_exon_sorted_yrv.gff3" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "exonList = \"C_virginica-3.0_Gnomon_exon_sorted_yrv.gff3\"" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\tGnomon\texon\t13578\t13603\t.\t+\t.\tID=id1;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\r\n", "NC_035780.1\tGnomon\texon\t14237\t14290\t.\t+\t.\tID=id2;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\r\n", "NC_035780.1\tGnomon\texon\t14557\t14594\t.\t+\t.\tID=id3;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\r\n", "NC_035780.1\tGnomon\texon\t28961\t29073\t.\t+\t.\tID=id4;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n", "NC_035780.1\tGnomon\texon\t30524\t31557\t.\t+\t.\tID=id5;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n", 
"NC_035780.1\tGnomon\texon\t31736\t31887\t.\t+\t.\tID=id6;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n", "NC_035780.1\tGnomon\texon\t31977\t32565\t.\t+\t.\tID=id7;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n", "NC_035780.1\tGnomon\texon\t32959\t33324\t.\t+\t.\tID=id8;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n", "NC_035780.1\tGnomon\texon\t43111\t44358\t.\t-\t.\tID=id11;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\r\n", "NC_035780.1\tGnomon\texon\t43111\t44358\t.\t-\t.\tID=id13;Parent=rna3;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1\r\n" ] } ], "source": [ "!head {exonList}" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 731279 C_virginica-3.0_Gnomon_exon_sorted_yrv.gff3\n", "exons in the C. virginica genome\n" ] } ], "source": [ "!wc -l {exonList}\n", "!echo \"exons in the C. virginica genome\"" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "#Create a BED version; subtract 1 from the 1-based GFF start to get a 0-based BED start\n", "awk '{print $1\"\\t\"($4-1)\"\\t\"$5}' C_virginica-3.0_Gnomon_exon_sorted_yrv.gff3 \\\n", "> C_virginica-3.0_Gnomon_exon_sorted_yrv.bed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3d. Intergenic regions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By definition, these are regions that aren't genes. 
I can use `complementBed` to find all regions that aren't genes, and `subtractBed` to remove exons and create this track." ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!{bedtoolsDirectory}complementBed \\\n", "-i {geneList} -sorted \\\n", "-g {chromLengths} \\\n", "| {bedtoolsDirectory}subtractBed \\\n", "-a - \\\n", "-b {exonList} \\\n", "> C_virginica-3.0_Gnomon_intergenic_yrv.gff3" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": true }, "outputs": [], "source": [ "intergenicList = \"C_virginica-3.0_Gnomon_intergenic_yrv.gff3\"" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\t0\t13577\r\n", "NC_035780.1\t14594\t28960\r\n", "NC_035780.1\t33324\t43110\r\n", "NC_035780.1\t66897\t85605\r\n", "NC_035780.1\t95254\t99839\r\n", "NC_035780.1\t106460\t108304\r\n", "NC_035780.1\t110077\t151858\r\n", "NC_035780.1\t157536\t163808\r\n", "NC_035780.1\t183798\t190448\r\n", "NC_035780.1\t193594\t204242\r\n" ] } ], "source": [ "!head {intergenicList}" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 34557 C_virginica-3.0_Gnomon_intergenic_yrv.gff3\n", "intergenic regions in the C. virginica genome\n" ] } ], "source": [ "!wc -l {intergenicList}\n", "!echo \"intergenic regions in the C. virginica genome\"" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "awk '{print $1\"\\t\"$2\"\\t\"$3}' C_virginica-3.0_Gnomon_intergenic_yrv.gff3 \\\n", "> C_virginica-3.0_Gnomon_intergenic_yrv.bed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3e. 
Coding sequences" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!grep \"Gnomon\tCDS\" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_CDS_yrv.gff3" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Sort the CDS file for downstream use\n", "!{bedtoolsDirectory}sortBed \\\n", "-faidx {chromNames} \\\n", "-i C_virginica-3.0_Gnomon_CDS_yrv.gff3 \\\n", "> C_virginica-3.0_Gnomon_CDS_sorted_yrv.gff3" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": true }, "outputs": [], "source": [ "CDSList = \"C_virginica-3.0_Gnomon_CDS_sorted_yrv.gff3\"" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\tGnomon\tCDS\t30535\t31557\t.\t+\t0\tID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1\r\n", "NC_035780.1\tGnomon\tCDS\t31736\t31887\t.\t+\t0\tID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1\r\n", "NC_035780.1\tGnomon\tCDS\t31977\t32565\t.\t+\t1\tID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1\r\n", "NC_035780.1\tGnomon\tCDS\t32959\t33204\t.\t+\t0\tID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1\r\n", "NC_035780.1\tGnomon\tCDS\t43262\t44358\t.\t-\t2\tID=cds1;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XP_022303032.1;Name=XP_022303032.1;gbkey=CDS;gene=LOC111110729;product=FMRFamide receptor-like isoform 
X1;protein_id=XP_022303032.1\r\n", "NC_035780.1\tGnomon\tCDS\t43262\t44358\t.\t-\t2\tID=cds2;Parent=rna3;Dbxref=GeneID:111110729,Genbank:XP_022303041.1;Name=XP_022303041.1;gbkey=CDS;gene=LOC111110729;product=FMRFamide receptor-like isoform X2;protein_id=XP_022303041.1\r\n", "NC_035780.1\tGnomon\tCDS\t45913\t45997\t.\t-\t0\tID=cds2;Parent=rna3;Dbxref=GeneID:111110729,Genbank:XP_022303041.1;Name=XP_022303041.1;gbkey=CDS;gene=LOC111110729;product=FMRFamide receptor-like isoform X2;protein_id=XP_022303041.1\r\n", "NC_035780.1\tGnomon\tCDS\t64123\t64219\t.\t-\t0\tID=cds1;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XP_022303032.1;Name=XP_022303032.1;gbkey=CDS;gene=LOC111110729;product=FMRFamide receptor-like isoform X1;protein_id=XP_022303032.1\r\n", "NC_035780.1\tGnomon\tCDS\t85616\t85777\t.\t-\t0\tID=cds3;Parent=rna4;Dbxref=GeneID:111112434,Genbank:XP_022305632.1;Name=XP_022305632.1;gbkey=CDS;gene=LOC111112434;product=homeobox protein Hox-B7-like;protein_id=XP_022305632.1\r\n", "NC_035780.1\tGnomon\tCDS\t88423\t88589\t.\t-\t2\tID=cds3;Parent=rna4;Dbxref=GeneID:111112434,Genbank:XP_022305632.1;Name=XP_022305632.1;gbkey=CDS;gene=LOC111112434;product=homeobox protein Hox-B7-like;protein_id=XP_022305632.1\r\n" ] } ], "source": [ "!head {CDSList}" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 645355 C_virginica-3.0_Gnomon_CDS_sorted_yrv.gff3\n", "coding sequences in the C. virginica genome\n" ] } ], "source": [ "!wc -l {CDSList}\n", "!echo \"coding sequences in the C. virginica genome\"" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "#Create a BED version; subtract 1 from the 1-based GFF start to get a 0-based BED start\n", "awk '{print $1\"\\t\"($4-1)\"\\t\"$5}' C_virginica-3.0_Gnomon_CDS_sorted_yrv.gff3 \\\n", "> C_virginica-3.0_Gnomon_CDS_sorted_yrv.bed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3f. 
Non-coding sequences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I can use `complementBed` on the exon track to create a non-coding sequence track, which contains every region that is not an exon. This track can then be used to create an intron track." ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!{bedtoolsDirectory}complementBed \\\n", "-i {exonList} \\\n", "-g {chromLengths} \\\n", "> C_virginica-3.0_Gnomon_noncoding_yrv.gff3" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [], "source": [ "nonCDS = \"C_virginica-3.0_Gnomon_noncoding_yrv.gff3\"" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\t0\t13577\r\n", "NC_035780.1\t13603\t14236\r\n", "NC_035780.1\t14290\t14556\r\n", "NC_035780.1\t14594\t28960\r\n", "NC_035780.1\t29073\t30523\r\n", "NC_035780.1\t31557\t31735\r\n", "NC_035780.1\t31887\t31976\r\n", "NC_035780.1\t32565\t32958\r\n", "NC_035780.1\t33324\t43110\r\n", "NC_035780.1\t44358\t45912\r\n" ] } ], "source": [ "!head {nonCDS}" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 336677 C_virginica-3.0_Gnomon_noncoding_yrv.gff3\n", "non-coding sequences in the C. virginica genome\n" ] } ], "source": [ "!wc -l {nonCDS}\n", "!echo \"non-coding sequences in the C. virginica genome\"" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "awk '{print $1\"\\t\"$2\"\\t\"$3}' C_virginica-3.0_Gnomon_noncoding_yrv.gff3 \\\n", "> C_virginica-3.0_Gnomon_noncoding_yrv.bed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3g. 
Introns" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#The intersections between the non-coding sequences and genes are by definition introns\n", "!{bedtoolsDirectory}intersectBed \\\n", "-a {nonCDS} \\\n", "-b {geneList} -sorted \\\n", "> C_virginica-3.0_Gnomon_intron_yrv.gff3" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": true }, "outputs": [], "source": [ "intronList = \"C_virginica-3.0_Gnomon_intron_yrv.gff3\"" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\t13603\t14236\r\n", "NC_035780.1\t14290\t14556\r\n", "NC_035780.1\t29073\t30523\r\n", "NC_035780.1\t31557\t31735\r\n", "NC_035780.1\t31887\t31976\r\n", "NC_035780.1\t32565\t32958\r\n", "NC_035780.1\t44358\t45912\r\n", "NC_035780.1\t46506\t64122\r\n", "NC_035780.1\t64334\t66868\r\n", "NC_035780.1\t85777\t88422\r\n" ] } ], "source": [ "!head {intronList}" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 316614 C_virginica-3.0_Gnomon_intron_yrv.gff3\n", "introns in the C. virginica genome\n" ] } ], "source": [ "!wc -l {intronList}\n", "!echo \"introns in the C. virginica genome\"" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "awk '{print $1\"\\t\"$2\"\\t\"$3}' C_virginica-3.0_Gnomon_intron_yrv.gff3 \\\n", "> C_virginica-3.0_Gnomon_intron_yrv.bed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3h. lncRNA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: this track contains only the lncRNA transcript features, not their associated exons."
] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!grep \"Gnomon\tlnc_RNA\" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_lncRNA_yrv.gff3" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": true }, "outputs": [], "source": [ "lncRNAList = \"C_virginica-3.0_Gnomon_lncRNA_yrv.gff3\"" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\tGnomon\tlnc_RNA\t13578\t14594\t.\t+\t.\tID=rna0;Parent=gene0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\r\n", "NC_035780.1\tGnomon\tlnc_RNA\t169468\t170178\t.\t-\t.\tID=rna10;Parent=gene9;Dbxref=GeneID:111105702,Genbank:XR_002635081.1;Name=XR_002635081.1;gbkey=ncRNA;gene=LOC111105702;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=uncharacterized LOC111105702;transcript_id=XR_002635081.1\r\n", "NC_035780.1\tGnomon\tlnc_RNA\t900326\t903430\t.\t+\t.\tID=rna105;Parent=gene57;Dbxref=GeneID:111111519,Genbank:XR_002636046.1;Name=XR_002636046.1;gbkey=ncRNA;gene=LOC111111519;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 20 samples with support for all annotated introns;product=uncharacterized LOC111111519;transcript_id=XR_002636046.1\r\n", 
"NC_035780.1\tGnomon\tlnc_RNA\t1280831\t1282416\t.\t-\t.\tID=rna130;Parent=gene71;Dbxref=GeneID:111124195,Genbank:XR_002638148.1;Name=XR_002638148.1;gbkey=ncRNA;gene=LOC111124195;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111124195;transcript_id=XR_002638148.1\r\n", "NC_035780.1\tGnomon\tlnc_RNA\t1432944\t1458091\t.\t+\t.\tID=rna135;Parent=gene76;Dbxref=GeneID:111135942,Genbank:XR_002639675.1;Name=XR_002639675.1;gbkey=ncRNA;gene=LOC111135942;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 4 samples with support for all annotated introns;product=uncharacterized LOC111135942;transcript_id=XR_002639675.1\r\n", "NC_035780.1\tGnomon\tlnc_RNA\t1503802\t1513830\t.\t-\t.\tID=rna137;Parent=gene78;Dbxref=GeneID:111114441,Genbank:XR_002636574.1;Name=XR_002636574.1;gbkey=ncRNA;gene=LOC111114441;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;product=uncharacterized LOC111114441;transcript_id=XR_002636574.1\r\n", "NC_035780.1\tGnomon\tlnc_RNA\t1856841\t1863697\t.\t-\t.\tID=rna151;Parent=gene92;Dbxref=GeneID:111115591,Genbank:XR_002636863.1;Name=XR_002636863.1;gbkey=ncRNA;gene=LOC111115591;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111115591%2C transcript variant X1;transcript_id=XR_002636863.1\r\n", "NC_035780.1\tGnomon\tlnc_RNA\t1856841\t1863683\t.\t-\t.\tID=rna152;Parent=gene92;Dbxref=GeneID:111115591,Genbank:XR_002636864.1;Name=XR_002636864.1;gbkey=ncRNA;gene=LOC111115591;model_evidence=Supporting 
evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=uncharacterized LOC111115591%2C transcript variant X2;transcript_id=XR_002636864.1\r\n", "NC_035780.1\tGnomon\tlnc_RNA\t2161223\t2166803\t.\t+\t.\tID=rna188;Parent=gene111;Dbxref=GeneID:111109763,Genbank:XR_002635698.1;Name=XR_002635698.1;gbkey=ncRNA;gene=LOC111109763;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 23 samples with support for all annotated introns;product=uncharacterized LOC111109763;transcript_id=XR_002635698.1\r\n", "NC_035780.1\tGnomon\tlnc_RNA\t2928484\t2930094\t.\t-\t.\tID=rna249;Parent=gene150;Dbxref=GeneID:111122009,Genbank:XR_002637875.1;Name=XR_002637875.1;gbkey=ncRNA;gene=LOC111122009;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=uncharacterized LOC111122009;transcript_id=XR_002637875.1\r\n" ] } ], "source": [ "!head {lncRNAList}" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 4750 C_virginica-3.0_Gnomon_lncRNA_yrv.gff3\n", "lncRNA in the C. virginica genome\n" ] } ], "source": [ "!wc -l {lncRNAList}\n", "!echo \"lncRNA in the C. virginica genome\"" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "#Create a BED version; subtract 1 from the 1-based GFF start to get a 0-based BED start\n", "awk '{print $1\"\\t\"($4-1)\"\\t\"$5}' C_virginica-3.0_Gnomon_lncRNA_yrv.gff3 \\\n", "> C_virginica-3.0_Gnomon_lncRNA_yrv.bed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3i. Untranslated regions of exons" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These can be derived by subtracting coding sequences from exons. Exons of non-coding transcripts have no coding sequence to subtract, so they remain in this track in full."
] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!{bedtoolsDirectory}subtractBed \\\n", "-a {exonList} \\\n", "-b {CDSList} \\\n", "-sorted \\\n", "-g {chromLengths} \\\n", "> C_virginica-3.0_Gnomon_exonUTR_yrv.gff3" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": true }, "outputs": [], "source": [ "exonUTR = \"C_virginica-3.0_Gnomon_exonUTR_yrv.gff3\"" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\tGnomon\texon\t13578\t13603\t.\t+\t.\tID=id1;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\r\n", "NC_035780.1\tGnomon\texon\t14237\t14290\t.\t+\t.\tID=id2;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\r\n", "NC_035780.1\tGnomon\texon\t14557\t14594\t.\t+\t.\tID=id3;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\r\n", "NC_035780.1\tGnomon\texon\t28961\t29073\t.\t+\t.\tID=id4;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n", "NC_035780.1\tGnomon\texon\t30524\t30534\t.\t+\t.\tID=id5;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n", "NC_035780.1\tGnomon\texon\t33205\t33324\t.\t+\t.\tID=id8;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\r\n", 
"NC_035780.1\tGnomon\texon\t43111\t43261\t.\t-\t.\tID=id11;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\r\n", "NC_035780.1\tGnomon\texon\t43111\t43261\t.\t-\t.\tID=id13;Parent=rna3;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1\r\n", "NC_035780.1\tGnomon\texon\t45998\t46506\t.\t-\t.\tID=id12;Parent=rna3;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1\r\n", "NC_035780.1\tGnomon\texon\t64220\t64334\t.\t-\t.\tID=id10;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1\r\n" ] } ], "source": [ "!head {exonUTR}" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 182752 C_virginica-3.0_Gnomon_exonUTR_yrv.gff3\n", "untranslated regions of exons in the C. virginica genome\n" ] } ], "source": [ "!wc -l {exonUTR}\n", "!echo \"untranslated regions of exons in the C. virginica genome\"" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "awk '{print $1\"\\t\"$4\"\\t\"$5}' C_virginica-3.0_Gnomon_exonUTR_yrv.gff3 \\\n", "> C_virginica-3.0_Gnomon_exonUTR_yrv.bed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3j. mtDNA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The smallest chromosome, NC_007175.2, is mitochondrial DNA. I can create a separate track for mtDNA by using `grep` to pull every annotation line that carries this accession."
] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!grep \"NC_007175.2\" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_mtDNA_yrv.gff3" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mtDNA = \"C_virginica-3.0_Gnomon_mtDNA_yrv.gff3\"" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "##sequence-region NC_007175.2 1 17244\r\n", "NC_007175.2\tRefSeq\tregion\t1\t17244\t.\t+\t.\tID=id731900;Dbxref=taxon:6565;Is_circular=true;Name=MT;gbkey=Src;genome=mitochondrion;mol_type=genomic DNA\r\n", "NC_007175.2\tRefSeq\tgene\t1\t1623\t.\t+\t.\tID=gene39493;Dbxref=GeneID:3453225;Name=COX1;gbkey=Gene;gene=COX1;gene_biotype=protein_coding\r\n", "NC_007175.2\tRefSeq\tCDS\t1\t1623\t.\t+\t0\tID=cds60201;Parent=gene39493;Dbxref=Genbank:YP_254649.1,GeneID:3453225;Name=YP_254649.1;gbkey=CDS;gene=COX1;product=cytochrome c oxidase subunit I;protein_id=YP_254649.1;transl_table=5\r\n", "NC_007175.2\tRefSeq\tgene\t2558\t3429\t.\t+\t.\tID=gene39494;Dbxref=GeneID:3453226;Name=COX3;gbkey=Gene;gene=COX3;gene_biotype=protein_coding\r\n", "NC_007175.2\tRefSeq\tCDS\t2645\t3429\t.\t+\t0\tID=cds60202;Parent=gene39494;Dbxref=Genbank:YP_254650.2,GeneID:3453226;Name=YP_254650.2;Note=start codon not determined%3B TAA stop codon is completed by the addition of 3' A residues to the mRNA;gbkey=CDS;gene=COX3;partial=true;product=cytochrome c oxidase subunit III;protein_id=YP_254650.2;start_range=.,2645;transl_except=(pos:3428..3429%2Caa:TERM);transl_table=5\r\n", "NC_007175.2\tRefSeq\ttRNA\t3430\t3495\t.\t+\t.\tID=rna67189;anticodon=(pos:3461..3463);gbkey=tRNA;product=tRNA-Ile\r\n", "NC_007175.2\tRefSeq\texon\t3430\t3495\t.\t+\t.\tID=id731901;Parent=rna67189;anticodon=(pos:3461..3463);gbkey=tRNA;product=tRNA-Ile\r\n", 
"NC_007175.2\tRefSeq\ttRNA\t3499\t3567\t.\t+\t.\tID=rna67190;anticodon=(pos:3531..3533);gbkey=tRNA;product=tRNA-Thr\r\n", "NC_007175.2\tRefSeq\texon\t3499\t3567\t.\t+\t.\tID=id731902;Parent=rna67190;anticodon=(pos:3531..3533);gbkey=tRNA;product=tRNA-Thr\r\n" ] } ], "source": [ "!head {mtDNA}" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 78 C_virginica-3.0_Gnomon_mtDNA_yrv.gff3\n", "mitochondrial DNA sequences in the C. virginica genome\n" ] } ], "source": [ "!wc -l {mtDNA}\n", "!echo \"mitochondrial DNA sequences in the C. virginica genome\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Visualize in IGV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best way to confirm that I created my tracks correctly is to look at them in the Integrative Genomics Viewer (IGV). The IGV session I used for this check is saved at [this link](https://github.com/fish546-2018/yaamini-virginica/blob/master/analyses/2019-05-13-Generating-Genome-Feature-Tracks/2019-05-13-Genome-Track-Verification.xml)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 
Characterize CG motif locations" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Roughly count the number of Cs in the full genome\n", "!fgrep -o -i C {fullGenome} | wc -l" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 14458703 C_virginica-3.0_CG-motif.bed\r\n" ] } ], "source": [ "#Count the number of CG motifs in the premade file\n", "!wc -l {CGMotifList}" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 7914842\n", "CG motifs overlap with genes\n" ] } ], "source": [ "!{bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {CGMotifList} \\\n", "-b {geneList} \\\n", "| wc -l\n", "!echo \"CG motifs overlap with genes\"" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 7507167\n", "CG motifs overlap with mRNA\n" ] } ], "source": [ "!{bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {CGMotifList} \\\n", "-b {mRNAList} \\\n", "| wc -l\n", "!echo \"CG motifs overlap with mRNA\"" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2330546\n", "CG motifs overlap with exons\n" ] } ], "source": [ "!{bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {CGMotifList} \\\n", "-b {exonList} \\\n", "| wc -l\n", "!echo \"CG motifs overlap with exons\"" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 6545363\n", "CG motifs overlap with intergenic regions\n" ] } ], "source": [ "!{bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {CGMotifList} \\\n", "-b 
{intergenicList} \\\n", "| wc -l\n", "!echo \"CG motifs overlap with intergenic regions\"" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1728032\n", "CG motifs overlap with coding sequences\n" ] } ], "source": [ "!{bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {CGMotifList} \\\n", "-b {CDSList} \\\n", "| wc -l\n", "!echo \"CG motifs overlap with coding sequences\"" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 12142171\n", "CG motifs overlap with non-coding sequences\n" ] } ], "source": [ "!{bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {CGMotifList} \\\n", "-b {nonCDS} \\\n", "| wc -l\n", "!echo \"CG motifs overlap with non-coding sequences\"" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 5596808\n", "CG motifs overlap with introns\n" ] } ], "source": [ "!{bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {CGMotifList} \\\n", "-b {intronList} \\\n", "| wc -l\n", "!echo \"CG motifs overlap with introns\"" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 281715\n", "CG motifs overlap with lncRNA\n" ] } ], "source": [ "!{bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {CGMotifList} \\\n", "-b {lncRNAList} \\\n", "| wc -l\n", "!echo \"CG motifs overlap with lncRNA\"" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 602551\n", "CG motifs overlap with untranslated regions of exons\n" ] } ], "source": [ "!{bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {CGMotifList} \\\n", "-b {exonUTR} \\\n", "| wc 
-l\n", "!echo \"CG motifs overlap with untranslated regions of exons\"" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 431\n", "CG motifs overlap with mitochondrial DNA\n" ] } ], "source": [ "!{bedtoolsDirectory}intersectBed \\\n", "-u \\\n", "-a {CGMotifList} \\\n", "-b {mtDNA} \\\n", "| wc -l\n", "!echo \"CG motifs overlap with mitochondrial DNA\"" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }