# Generating Genome Feature Tracks

I will create genome feature tracks to use in downstream analyses. While pre-made genome feature tracks exist, generating them myself makes it clear which elements each track contains. My code is adapted from the code provided in the [FROGER GitHub repository](https://github.com/hputnam/FROGER/wiki/Genome-Sequence-Files-and-Feature-Tracks).

1. Download *C. virginica* genome file
2. Separate various tracks
3. Visualize tracks in IGV
4. Characterize track overlap with CG motifs

## 0. Set working directory

In [1]:
pwd

'/Users/yaamini/Documents/paper-gonad-meth/code'

In [2]:
cd ../genome-feature-tracks/

/Users/yaamini/Documents/paper-gonad-meth/genome-feature-tracks


## 1. Download files

### 1a. Pre-generated CG motif track

In [4]:
!curl http://eagle.fish.washington.edu/Cvirg_tracks/C_virginica-3.0_CG-motif.bed > C_virginica-3.0_CG-motif.bed

 % Total % Received % Xferd Average Speed Time Time Time Current
 Dload Upload Total Spent Left Speed
100 533M 100 533M 0 0 59.6M 0 0:00:08 0:00:08 --:--:-- 58.1M


In [5]:
!head C_virginica-3.0_CG-motif.bed

NC_035780.1	28	30	CG_motif
NC_035780.1	54	56	CG_motif
NC_035780.1	75	77	CG_motif
NC_035780.1	93	95	CG_motif
NC_035780.1	103	105	CG_motif
NC_035780.1	116	118	CG_motif
NC_035780.1	134	136	CG_motif
NC_035780.1	159	161	CG_motif
NC_035780.1	209	211	CG_motif
NC_035780.1	224	226	CG_motif


### 1b. *C. virginica* genome annotation from NCBI

In [6]:
!curl ftp://ftp.ncbi.nlm.nih.gov/genomes/Crassostrea_virginica/GFF/ref_C_virginica-3.0_top_level.gff3.gz > ref_C_virginica-3.0_top_level.gff3.gz

 % Total % Received % Xferd Average Speed Time Time Time Current
 Dload Upload Total Spent Left Speed
100 16.2M 100 16.2M 0 0 6234k 0 0:00:02 0:00:02 --:--:-- 6411k


In [7]:
!gunzip ref_C_virginica-3.0_top_level.gff3.gz

In [8]:
!head ref_C_virginica-3.0_top_level.gff3

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build C_virginica-3.0
#!genome-build-accession NCBI_Assembly:GCF_002022765.2
#!annotation-date 14 September 2017
#!annotation-source NCBI Crassostrea virginica Annotation Release 100
##sequence-region NC_035780.1 1 65668440
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6565
NC_035780.1	RefSeq	region	1	65668440	.	+	.	ID=id0;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample


## 2. Set variable paths

In [10]:
bedtoolsDirectory = "/Users/Shared/bioinformatics/bedtools2/bin/"

In [6]:
fullGenome = "../data/C_virginica-3.0_genomic.fa"

In [7]:
CGMotifList = "C_virginica-3.0_CG-motif.bed"

In [30]:
chromLengths = "2018-06-15-bedtools-Chromosome-Lengths.txt"

In [34]:
!cut -f1 {chromLengths} > 2019-05-28-bedtools-Chromosome-Names.txt

cut: head: No such file or directory


In [3]:
!cat 2019-05-28-bedtools-Chromosome-Names.txt

NC_035780.1
NC_035781.1
NC_035782.1
NC_035783.1
NC_035784.1
NC_035785.1
NC_035786.1
NC_035787.1
NC_035788.1
NC_035789.1
NC_007175.2


In [40]:
chromNames = "2019-05-28-bedtools-Chromosome-Names.txt"

## 3. Generate feature tracks

### 3a. Genes

In [15]:
#Isolate gene entries. Tab must be included between "Gnomon" and "gene"
!grep "Gnomon	gene" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_gene_yrv.gff3
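
The `grep` pattern above depends on a literal tab between the source and type columns. As a rough alternative, the same filter can be written by comparing the columns directly. This is a minimal sketch, not the method used in this notebook; `filter_gff` is a hypothetical helper, and the example records are shortened to the `ID` attribute only.

```python
def filter_gff(lines, source, feature_type):
    """Keep GFF3 records whose source (column 2) and type (column 3) match exactly."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and directive lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[1] == source and fields[2] == feature_type:
            kept.append(line)
    return kept


# Shortened example records (attributes trimmed to the ID only)
example = [
    "##gff-version 3\n",
    "NC_035780.1\tGnomon\tgene\t13578\t14594\t.\t+\t.\tID=gene0\n",
    "NC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna1\n",
]
genes = filter_gff(example, "Gnomon", "gene")
print(len(genes))
```

Comparing whole columns also avoids accidental matches on text that merely contains the pattern.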

In [51]:
#Sort file for downstream use
!{bedtoolsDirectory}sortBed \
-faidx {chromNames} \
-i C_virginica-3.0_Gnomon_gene_yrv.gff3 \
> C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3
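
For intuition, sorting with `-faidx` orders records by the sequence order listed in the names file and then by start position, rather than alphabetically by sequence name. The sketch below mimics that idea on toy tuples. `sort_by_chrom_order` is a hypothetical helper, and the mitochondrial coordinates are taken from the annotation shown later in this notebook.

```python
def sort_by_chrom_order(records, chrom_order):
    """Sort (chrom, start, end) records by a given chromosome order, then by start."""
    rank = {name: index for index, name in enumerate(chrom_order)}
    return sorted(records, key=lambda record: (rank[record[0]], record[1]))


# Order taken from the first and last entries of the chromosome names file
chrom_order = ["NC_035780.1", "NC_035781.1", "NC_007175.2"]
records = [
    ("NC_007175.2", 1, 1623),
    ("NC_035780.1", 28961, 33324),
    ("NC_035780.1", 13578, 14594),
]
sorted_records = sort_by_chrom_order(records, chrom_order)
print(sorted_records)
```

A plain alphabetical sort would place `NC_007175.2` first, which would no longer match the order of the chromosome lengths file.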

In [52]:
#Set variable path
geneList = "C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3"

In [53]:
#View file
!head {geneList}

NC_035780.1	Gnomon	gene	13578	14594	.	+	.	ID=gene0;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA
NC_035780.1	Gnomon	gene	28961	33324	.	+	.	ID=gene1;Dbxref=GeneID:111126949;Name=LOC111126949;gbkey=Gene;gene=LOC111126949;gene_biotype=protein_coding
NC_035780.1	Gnomon	gene	43111	66897	.	-	.	ID=gene2;Dbxref=GeneID:111110729;Name=LOC111110729;gbkey=Gene;gene=LOC111110729;gene_biotype=protein_coding
NC_035780.1	Gnomon	gene	85606	95254	.	-	.	ID=gene3;Dbxref=GeneID:111112434;Name=LOC111112434;gbkey=Gene;gene=LOC111112434;gene_biotype=protein_coding
NC_035780.1	Gnomon	gene	99840	106460	.	+	.	ID=gene4;Dbxref=GeneID:111120752;Name=LOC111120752;gbkey=Gene;gene=LOC111120752;gene_biotype=protein_coding
NC_035780.1	Gnomon	gene	108305	110077	.	-	.	ID=gene5;Dbxref=GeneID:111128944;Name=LOC111128944;gbkey=Gene;gene=LOC111128944;gene_biotype=protein_coding;partial=true;start_range=.,108305
NC_035780.1	Gnomon	gene	151859	157536	.	+	.	ID=gene6;Dbxref=GeneI

In [54]:
#Count the number of genes
!wc -l {geneList}
!echo "genes in the C. virginica genome"

 38929 C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3
genes in the C. virginica genome


In [107]:
%%bash
#GFF files can be a lot to visualize, so I'll create a BED file version too
awk '{print $1"\t"$4"\t"$5}' C_virginica-3.0_Gnomon_gene_sorted_yrv.gff3 \
> C_virginica-3.0_Gnomon_gene_sorted_yrv.bed
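
One caveat on coordinates: GFF3 positions are 1-based and inclusive, while BED positions are 0-based and half-open. The `awk` command copies columns 4 and 5 unchanged, so the BED start keeps GFF numbering. The sketch below shows a conversion that subtracts one from the start. `gff_to_bed` is a hypothetical helper, offered as an illustration rather than a change to the track files used here.

```python
def gff_to_bed(gff_line):
    """Convert one GFF3 record to a three-column BED line (0-based, half-open)."""
    fields = gff_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[3]), int(fields[4])
    # GFF start is 1-based, so shift it down by one; the inclusive GFF end
    # already equals the exclusive BED end.
    return f"{chrom}\t{start - 1}\t{end}"


bed = gff_to_bed("NC_035780.1\tGnomon\tgene\t13578\t14594\t.\t+\t.\tID=gene0")
print(bed)
```

The shifted start (13577) agrees with the boundary `complementBed` reports for the first intergenic region below.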

### 3b. mRNA

In [19]:
!grep "Gnomon	mRNA" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_mRNA_yrv.gff3

In [84]:
mRNAList = "C_virginica-3.0_Gnomon_mRNA_yrv.gff3"

In [21]:
!head -n1 {mRNAList}

NC_035780.1	Gnomon	mRNA	28961	33324	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1


In [22]:
!wc -l {mRNAList}
!echo "mRNAs in the C. virginica genome"

 60201 C_virginica-3.0_Gnomon_mRNA_yrv.gff3
mRNAs in the C. virginica genome


In [108]:
%%bash
awk '{print $1"\t"$4"\t"$5}' C_virginica-3.0_Gnomon_mRNA_yrv.gff3 \
> C_virginica-3.0_Gnomon_mRNA_yrv.bed

### 3c. Exons

In [23]:
!grep "Gnomon	exon" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_exon_yrv.gff3

In [42]:
#Sort the exon file for downstream use
!{bedtoolsDirectory}sortBed \
-faidx {chromNames} \
-i C_virginica-3.0_Gnomon_exon_yrv.gff3 \
> C_virginica-3.0_Gnomon_exon_sorted_yrv.gff3

In [43]:
exonList = "C_virginica-3.0_Gnomon_exon_sorted_yrv.gff3"

In [44]:
!head {exonList}

NC_035780.1	Gnomon	exon	13578	13603	.	+	.	ID=id1;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1	Gnomon	exon	14237	14290	.	+	.	ID=id2;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1	Gnomon	exon	14557	14594	.	+	.	ID=id3;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1	Gnomon	exon	28961	29073	.	+	.	ID=id4;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	Gnomon	exon	30524	31557	.	+	.	ID=id5;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.

In [45]:
!wc -l {exonList}
!echo "exons in the C. virginica genome"

 731279 C_virginica-3.0_Gnomon_exon_sorted_yrv.gff3
exons in the C. virginica genome


In [109]:
%%bash
awk '{print $1"\t"$4"\t"$5}' C_virginica-3.0_Gnomon_exon_sorted_yrv.gff3 \
> C_virginica-3.0_Gnomon_exon_sorted_yrv.bed

### 3d. Intergenic regions

By definition, these are regions that aren't genes. I can use `complementBed` to find all regions that aren't genes, and `subtractBed` to remove exons and create this track.
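
To illustrate the complement step, here is a minimal sketch on the first three genes of chromosome NC_035780.1, written as 0-based half-open intervals. `complement` is a hypothetical helper that only approximates what `complementBed` does for a single chromosome, and the subsequent exon subtraction is not shown.

```python
def complement(intervals, chrom_length):
    """Return the gaps not covered by 0-based half-open (start, end) intervals."""
    gaps = []
    cursor = 0
    for start, end in sorted(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)  # handles overlapping or nested intervals
    if cursor < chrom_length:
        gaps.append((cursor, chrom_length))
    return gaps


# First three genes on NC_035780.1, with GFF starts shifted to 0-based
genes = [(13577, 14594), (28960, 33324), (43110, 66897)]
gaps = complement(genes, 65668440)  # chromosome length from the GFF header
print(gaps[:3])
```

The first three gaps correspond to the first three intergenic regions in the output below. The final gap runs to the chromosome end only because this toy input stops after three genes.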

In [56]:
!{bedtoolsDirectory}complementBed \
-i {geneList} -sorted \
-g {chromLengths} \
| {bedtoolsDirectory}subtractBed \
-a - \
-b {exonList} \
> C_virginica-3.0_Gnomon_intergenic_yrv.gff3

In [57]:
intergenicList = "C_virginica-3.0_Gnomon_intergenic_yrv.gff3"

In [58]:
!head {intergenicList}

NC_035780.1	0	13577
NC_035780.1	14594	28960
NC_035780.1	33324	43110
NC_035780.1	66897	85605
NC_035780.1	95254	99839
NC_035780.1	106460	108304
NC_035780.1	110077	151858
NC_035780.1	157536	163808
NC_035780.1	183798	190448
NC_035780.1	193594	204242


In [59]:
!wc -l {intergenicList}
!echo "intergenic regions in the C. virginica genome"

 34557 C_virginica-3.0_Gnomon_intergenic_yrv.gff3
intergenic regions in the C. virginica genome


In [110]:
%%bash
awk '{print $1"\t"$2"\t"$3}' C_virginica-3.0_Gnomon_intergenic_yrv.gff3 \
> C_virginica-3.0_Gnomon_intergenic_yrv.bed

### 3e. Coding sequences

In [27]:
!grep "Gnomon	CDS" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_CDS_yrv.gff3

In [70]:
#Sort the CDS file for downstream use
!{bedtoolsDirectory}sortBed \
-faidx {chromNames} \
-i C_virginica-3.0_Gnomon_CDS_yrv.gff3 \
> C_virginica-3.0_Gnomon_CDS_sorted_yrv.gff3

In [71]:
CDSList = "C_virginica-3.0_Gnomon_CDS_sorted_yrv.gff3"

In [72]:
!head {CDSList}

NC_035780.1	Gnomon	CDS	30535	31557	.	+	0	ID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1
NC_035780.1	Gnomon	CDS	31736	31887	.	+	0	ID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1
NC_035780.1	Gnomon	CDS	31977	32565	.	+	1	ID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1
NC_035780.1	Gnomon	CDS	32959	33204	.	+	0	ID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1
NC_035780.1	Gnomon	CDS	43262	44358	.	-	2	ID=cds1;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XP_022303032.1;Name=XP_022303032.1;gbkey=CDS;gene=LOC111110729;prod

In [73]:
!wc -l {CDSList}
!echo "CDS in the C. virginica genome"

 645355 C_virginica-3.0_Gnomon_CDS_sorted_yrv.gff3
CDS in the C. virginica genome


In [111]:
%%bash
awk '{print $1"\t"$4"\t"$5}' C_virginica-3.0_Gnomon_CDS_sorted_yrv.gff3 \
> C_virginica-3.0_Gnomon_CDS_sorted_yrv.bed

### 3f. Non-coding sequences

I can use `complementBed` on the exon track to create a non-coding sequence track, which contains every region not covered by an exon. This track can then be used to create an intron track.

In [46]:
!{bedtoolsDirectory}complementBed \
-i {exonList} \
-g {chromLengths} \
> C_virginica-3.0_Gnomon_noncoding_yrv.gff3

In [47]:
nonCDS = "C_virginica-3.0_Gnomon_noncoding_yrv.gff3"

In [48]:
!head {nonCDS}

NC_035780.1	0	13577
NC_035780.1	13603	14236
NC_035780.1	14290	14556
NC_035780.1	14594	28960
NC_035780.1	29073	30523
NC_035780.1	31557	31735
NC_035780.1	31887	31976
NC_035780.1	32565	32958
NC_035780.1	33324	43110
NC_035780.1	44358	45912


In [49]:
!wc -l {nonCDS}
!echo "non-coding sequences in the C. virginica genome"

 336677 C_virginica-3.0_Gnomon_noncoding_yrv.gff3
non-coding sequences in the C. virginica genome


In [112]:
%%bash
awk '{print $1"\t"$2"\t"$3}' C_virginica-3.0_Gnomon_noncoding_yrv.gff3 \
> C_virginica-3.0_Gnomon_noncoding_yrv.bed

### 3g. Introns

In [96]:
#The intersections between the non-coding sequences and genes are by definition introns
!{bedtoolsDirectory}intersectBed \
-a {nonCDS} \
-b {geneList} -sorted \
> C_virginica-3.0_Gnomon_intron_yrv.gff3
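
The logic of this step can be sketched on a few intervals from the start of NC_035780.1. `intersect` is a hypothetical helper that returns only the overlapping portions, which is roughly what `intersectBed` reports by default. Coordinates are 0-based and half-open.

```python
def intersect(a_intervals, b_intervals):
    """Return the overlapping portions between two lists of half-open intervals."""
    overlaps = []
    for a_start, a_end in a_intervals:
        for b_start, b_end in b_intervals:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:  # intervals that only touch do not overlap
                overlaps.append((start, end))
    return sorted(overlaps)


# First four non-exonic regions and first two genes on NC_035780.1
non_exonic = [(0, 13577), (13603, 14236), (14290, 14556), (14594, 28960)]
genes = [(13577, 14594), (28960, 33324)]
introns = intersect(non_exonic, genes)
print(introns)
```

Only the two regions that fall inside the first gene survive, and the regions flanking the genes drop out.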

In [97]:
intronList = "C_virginica-3.0_Gnomon_intron_yrv.gff3"

In [98]:
!head {intronList}

NC_035780.1	13603	14236
NC_035780.1	14290	14556
NC_035780.1	29073	30523
NC_035780.1	31557	31735
NC_035780.1	31887	31976
NC_035780.1	32565	32958
NC_035780.1	44358	45912
NC_035780.1	46506	64122
NC_035780.1	64334	66868
NC_035780.1	85777	88422


In [99]:
!wc -l {intronList}
!echo "introns in the C. virginica genome"

 316614 C_virginica-3.0_Gnomon_intron_yrv.gff3
introns in the C. virginica genome


In [100]:
%%bash
awk '{print $1"\t"$2"\t"$3}' C_virginica-3.0_Gnomon_intron_yrv.gff3 \
> C_virginica-3.0_Gnomon_intron_yrv.bed

### 3h. lncRNA

Note: this track contains only the lncRNA entries, not their associated exons.

In [31]:
!grep "Gnomon	lnc_RNA" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_lncRNA_yrv.gff3

In [85]:
lncRNAList = "C_virginica-3.0_Gnomon_lncRNA_yrv.gff3"

In [33]:
!head {lncRNAList}

NC_035780.1	Gnomon	lnc_RNA	13578	14594	.	+	.	ID=rna0;Parent=gene0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1	Gnomon	lnc_RNA	169468	170178	.	-	.	ID=rna10;Parent=gene9;Dbxref=GeneID:111105702,Genbank:XR_002635081.1;Name=XR_002635081.1;gbkey=ncRNA;gene=LOC111105702;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=uncharacterized LOC111105702;transcript_id=XR_002635081.1
NC_035780.1	Gnomon	lnc_RNA	900326	903430	.	+	.	ID=rna105;Parent=gene57;Dbxref=GeneID:111111519,Genbank:XR_002636046.1;Name=XR_002636046.1;gbkey=ncRNA;gene=L

In [34]:
!wc -l {lncRNAList}
!echo "lncRNA in the C. virginica genome"

 4750 C_virginica-3.0_Gnomon_lncRNA_yrv.gff3
lncRNA in the C. virginica genome


In [113]:
%%bash
awk '{print $1"\t"$4"\t"$5}' C_virginica-3.0_Gnomon_lncRNA_yrv.gff3 \
> C_virginica-3.0_Gnomon_lncRNA_yrv.bed

### 3i. Untranslated regions of exons

These can be derived by subtracting coding sequences from exons.
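
As a worked example of this subtraction, take the second exon of `rna1` (30524 to 31557) and the coding sequence that overlaps it (30535 to 31557). `subtract` is a hypothetical helper working in 1-based inclusive GFF coordinates. It is a sketch of the idea, not of how `subtractBed` is implemented.

```python
def subtract(feature, blockers):
    """Remove blocker intervals from one feature (1-based inclusive coordinates)."""
    start, end = feature
    pieces = []
    cursor = start
    for b_start, b_end in sorted(blockers):
        if b_end < cursor or b_start > end:
            continue  # blocker does not touch the remaining part of the feature
        if b_start > cursor:
            pieces.append((cursor, b_start - 1))
        cursor = max(cursor, b_end + 1)
    if cursor <= end:
        pieces.append((cursor, end))
    return pieces


exon = (30524, 31557)
cds = [(30535, 31557)]
utr = subtract(exon, cds)
print(utr)
```

The remaining piece spans 30524 to 30534, the same interval that appears for this exon in the untranslated region track.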

In [74]:
!{bedtoolsDirectory}subtractBed \
-a {exonList} \
-b {CDSList} \
-sorted \
-g {chromLengths} \
> C_virginica-3.0_Gnomon_exonUTR_yrv.gff3

In [75]:
exonUTR = "C_virginica-3.0_Gnomon_exonUTR_yrv.gff3"

In [76]:
!head {exonUTR}

NC_035780.1	Gnomon	exon	13578	13603	.	+	.	ID=id1;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1	Gnomon	exon	14237	14290	.	+	.	ID=id2;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1	Gnomon	exon	14557	14594	.	+	.	ID=id3;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1	Gnomon	exon	28961	29073	.	+	.	ID=id4;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	Gnomon	exon	30524	30534	.	+	.	ID=id5;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.

In [77]:
!wc -l {exonUTR}
!echo "untranslated regions of exons in the C. virginica genome"

 182752 C_virginica-3.0_Gnomon_exonUTR_yrv.gff3
untranslated regions of exons in the C. virginica genome


In [114]:
%%bash
awk '{print $1"\t"$4"\t"$5}' C_virginica-3.0_Gnomon_exonUTR_yrv.gff3 \
> C_virginica-3.0_Gnomon_exonUTR_yrv.bed

### 3j. mtDNA

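The smallest chromosome, NC_007175.2, is mitochondrial DNA. I can create a separate track for mtDNA by simply using `grep`. Because the pattern is not anchored to the first column, the `##sequence-region` header line for this sequence is captured as well.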

In [78]:
!grep "NC_007175.2" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_mtDNA_yrv.gff3

In [79]:
mtDNA = "C_virginica-3.0_Gnomon_mtDNA_yrv.gff3"

In [80]:
!head {mtDNA}

##sequence-region NC_007175.2 1 17244
NC_007175.2	RefSeq	region	1	17244	.	+	.	ID=id731900;Dbxref=taxon:6565;Is_circular=true;Name=MT;gbkey=Src;genome=mitochondrion;mol_type=genomic DNA
NC_007175.2	RefSeq	gene	1	1623	.	+	.	ID=gene39493;Dbxref=GeneID:3453225;Name=COX1;gbkey=Gene;gene=COX1;gene_biotype=protein_coding
NC_007175.2	RefSeq	CDS	1	1623	.	+	0	ID=cds60201;Parent=gene39493;Dbxref=Genbank:YP_254649.1,GeneID:3453225;Name=YP_254649.1;gbkey=CDS;gene=COX1;product=cytochrome c oxidase subunit I;protein_id=YP_254649.1;transl_table=5
NC_007175.2	RefSeq	gene	2558	3429	.	+	.	ID=gene39494;Dbxref=GeneID:3453226;Name=COX3;gbkey=Gene;gene=COX3;gene_biotype=protein_coding
NC_007175.2	RefSeq	CDS	2645	3429	.	+	0	ID=cds60202;Parent=gene39494;Dbxref=Genbank:YP_254650.2,GeneID:3453226;Name=YP_254650.2;Note=start codon not determined%3B TAA stop codon is completed by the addition of 3' A residues to the mRNA;gbkey=CDS;gene=COX3;partial=true;product=cytochrome c oxidase subunit III;protein_id=YP_2

In [115]:
!wc -l {mtDNA}
!echo "mitochondrial DNA sequences in the C. virginica genome"

 78 C_virginica-3.0_Gnomon_mtDNA_yrv.gff3
mitochondrial DNA sequences in the C. virginica genome


## 4. Visualize in IGV

The best way to confirm that I created my tracks correctly is to look at them in the Integrative Genomics Viewer (IGV). The tracks can be visualized using the file at [this link](https://github.com/fish546-2018/yaamini-virginica/blob/master/analyses/2019-05-13-Generating-Genome-Feature-Tracks/2019-05-13-Genome-Track-Verification.xml).

## 5. Characterize CG motif locations

In [None]:
#Roughly count the number of Cs in the full genome
!fgrep -o -i C {fullGenome} | wc -l

In [41]:
#Count the number of CG motifs in the premade file
!wc -l {CGMotifList}

 14458703 C_virginica-3.0_CG-motif.bed
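
For reference, each line of the CG motif file appears to mark one CG dinucleotide as a two-base BED interval. The sketch below produces intervals in that layout from a toy sequence. How the pre-made track was actually generated is not stated here, so `find_cg_motifs` and the toy sequence are assumptions for illustration only.

```python
def find_cg_motifs(chrom, sequence):
    """Return BED-style (chrom, start, end, name) tuples for every CG dinucleotide."""
    upper = sequence.upper()  # treat soft-masked (lowercase) bases the same way
    motifs = []
    position = upper.find("CG")
    while position != -1:
        motifs.append((chrom, position, position + 2, "CG_motif"))
        position = upper.find("CG", position + 1)
    return motifs


motifs = find_cg_motifs("toy_seq", "ACGTTcgCGA")
for motif in motifs:
    print("\t".join(str(value) for value in motif))
```

Counting the returned tuples gives the number of CG motifs, which is the quantity `wc -l` reports for the pre-made file.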


In [82]:
!{bedtoolsDirectory}intersectBed \
-u \
-a {CGMotifList} \
-b {geneList} \
| wc -l
!echo "CG motifs overlap with genes"

 7914842
CG motifs overlap with genes


In [86]:
!{bedtoolsDirectory}intersectBed \
-u \
-a {CGMotifList} \
-b {mRNAList} \
| wc -l
!echo "CG motifs overlap with mRNA"

 7507167
CG motifs overlap with mRNA


In [87]:
!{bedtoolsDirectory}intersectBed \
-u \
-a {CGMotifList} \
-b {exonList} \
| wc -l
!echo "CG motifs overlap with exons"

 2330546
CG motifs overlap with exons


In [88]:
!{bedtoolsDirectory}intersectBed \
-u \
-a {CGMotifList} \
-b {intergenicList} \
| wc -l
!echo "CG motifs overlap with intergenic regions"

 6545363
CG motifs overlap with intergenic regions


In [89]:
!{bedtoolsDirectory}intersectBed \
-u \
-a {CGMotifList} \
-b {CDSList} \
| wc -l
!echo "CG motifs overlap with coding sequences"

 1728032
CG motifs overlap with coding sequences


In [90]:
!{bedtoolsDirectory}intersectBed \
-u \
-a {CGMotifList} \
-b {nonCDS} \
| wc -l
!echo "CG motifs overlap with non-coding sequences"

 12142171
CG motifs overlap with non-coding sequences


In [91]:
!{bedtoolsDirectory}intersectBed \
-u \
-a {CGMotifList} \
-b {intronList} \
| wc -l
!echo "CG motifs overlap with introns"

 5596808
CG motifs overlap with introns


In [92]:
!{bedtoolsDirectory}intersectBed \
-u \
-a {CGMotifList} \
-b {lncRNAList} \
| wc -l
!echo "CG motifs overlap with lncRNA"

 281715
CG motifs overlap with lncRNA


In [93]:
!{bedtoolsDirectory}intersectBed \
-u \
-a {CGMotifList} \
-b {exonUTR} \
| wc -l
!echo "CG motifs overlap with untranslated regions of exons"

 602551
CG motifs overlap with untranslated regions of exons


In [94]:
!{bedtoolsDirectory}intersectBed \
-u \
-a {CGMotifList} \
-b {mtDNA} \
| wc -l
!echo "CG motifs overlap with mitochondrial DNA"

 431
CG motifs overlap with mitochondrial DNA
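
The counts above are easier to compare as percentages of all CG motifs. The sketch below does that arithmetic with the totals reported in this section. The dictionary layout and variable names are my own.

```python
# Total CG motifs in the pre-made track and overlap counts reported above
total_cg = 14458703
overlap_counts = {
    "genes": 7914842,
    "mRNA": 7507167,
    "exons": 2330546,
    "intergenic regions": 6545363,
    "coding sequences": 1728032,
    "non-coding sequences": 12142171,
    "introns": 5596808,
    "lncRNA": 281715,
    "untranslated regions of exons": 602551,
    "mitochondrial DNA": 431,
}

percentages = {name: 100 * count / total_cg for name, count in overlap_counts.items()}
for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}% of CG motifs")
```

The tracks are not mutually exclusive (for example, exons and introns both sit inside genes), so these percentages are not expected to sum to 100.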
