# Gene Background Analysis

In this notebook, I will characterize the location of the `unite` gene background from `methylKit` within the *C. virginica* genome.

Methods:

0. Prepare for Analyses
1. Locate Relevant Files and Set Variable Path Names
2. Identify Gene Background Overlaps with Genome Feature Tracks
3. Calculate Percent Overlap with Gene Background

## 0. Prepare for Analyses

### 0a. Set Working Directory

In [34]:
pwd

'/Users/yaamini/Documents/yaamini-virginica/analyses/2018-12-02-Gene-Enrichment-Analysis'

In [2]:
cd ../analyses/

/Users/yaamini/Documents/yaamini-virginica/analyses


In [3]:
pwd

'/Users/yaamini/Documents/yaamini-virginica/analyses'

In [4]:
!mkdir 2018-12-02-Gene-Enrichment-Analysis

In [4]:
ls -F

2018-10-25-MethylKit/              2018-12-02-Gene-Enrichment-Analysis/
2018-11-01-DML-and-DMR-Analysis/   README.md


In [5]:
cd 2018-12-02-Gene-Enrichment-Analysis/

/Users/yaamini/Documents/yaamini-virginica/analyses/2018-12-02-Gene-Enrichment-Analysis


## 1. Locate Relevant Files and Set Variable Path Names

### 1a. Set Variable Path Names

Setting the variable path names allows me to reuse this script with different input files or different paths to programs without manually changing the file names each time.

In [6]:
bedtoolsDirectory = "/Users/Shared/bioinformatics/bedtools2/bin/"

In [7]:
geneBackground = "../../analyses/2018-10-25-MethylKit/2018-11-29-Methylation-Information-Cov3.bed"

In [8]:
DMLlist = "../../analyses/2018-10-25-MethylKit/2018-11-07-DML-Locations.bed"

In [9]:
DMRlist = "../../analyses/2018-10-25-MethylKit/2018-11-07-DMR-Locations.bed"

In [10]:
exonList = "../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_Gnomon_exon.bed"

In [11]:
intronList = "../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_intron.bed"

In [12]:
mRNAList = "../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_Gnomon_mRNA.gff3"

In [13]:
transposableElementsAll = "../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_TE-all.gff"

In [14]:
transposableElementsCg = "../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_TE-Cg.gff"
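
Because every later cell depends on these paths, it can help to confirm they resolve before running anything. The following is a minimal sketch, assuming the variables are collected into a dictionary; the helper name `find_missing` and the dictionary `inputs` are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_missing(paths):
    """Return the names whose path does not exist on disk."""
    return [name for name, path in paths.items() if not Path(path).exists()]


# Hypothetical collection of two of the path variables set above
inputs = {
    "geneBackground": "../../analyses/2018-10-25-MethylKit/2018-11-29-Methylation-Information-Cov3.bed",
    "exonList": "../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_Gnomon_exon.bed",
}

# An empty list means every path resolved from the current working directory
print(find_missing(inputs))
```

Relative paths are resolved from the current working directory, so this check is only meaningful after the `cd` in step 0a.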

### 1b. Confirm Variable Path Works and Characterize Files

The BED files with DML and DMR can be viewed below, after the gene background file. Columns are the chromosome, start position, end position, strand (or the label `DMR` in the DMR file), and methylation difference with direction. The files only include DML and DMR that were at least 50% different between the two treatments (control and elevated pCO2).

In [15]:
#Previewing the files
!head {geneBackground}

NC_007175.2	147	147	+	7	0	7	134	4	130	39	0	39	99	1	98	394	5	389	9	0	9	241	4	237	623	10	613	178	8	170	15	0	15
NC_007175.2	246	246	+	3	0	3	109	1	108	26	0	26	91	0	91	306	11	295	4	0	4	208	3	205	558	11	547	183	4	179	13	0	13
NC_007175.2	257	257	+	3	0	3	134	4	130	29	0	29	110	2	108	379	0	379	5	0	5	239	0	239	669	10	659	199	7	192	20	1	19
NC_007175.2	266	266	+	3	0	3	168	2	166	39	2	37	126	4	122	458	5	453	7	0	7	293	1	292	782	6	776	244	8	236	23	0	23
NC_007175.2	473	473	+	3	0	3	112	3	109	40	0	40	114	8	106	322	10	312	5	0	5	214	3	211	614	8	606	205	7	198	17	0	17
NC_007175.2	665	665	+	7	0	7	107	2	105	44	0	44	56	0	56	358	9	349	7	0	7	212	9	203	561	11	550	244	8	236	21	0	21
NC_007175.2	685	685	-	4	0	4	108	1	107	28	0	28	129	0	129	369	2	367	4	0	4	188	8	180	538	6	532	82	5	77	6	0	6
NC_007175.2	709	709	+	5	0	5	114	3	111	46	1	45	73	6	67	377	15	362	4	0	4	259	9	250	623	18	605	242	10	232	21	0	21
NC_007175.2	710	710	-	6	0	6	122	4	118	36	1	35	134	3	131	415	2	413	4	0	4	198	1	197	636	11	625	119	2	117	10	0	10
NC_

In [17]:
#Counting the number of lines
!wc -l {geneBackground}

 670301 ../../analyses/2018-10-25-MethylKit/2018-11-29-Methylation-Information-Cov3.bed


In [18]:
!head {DMLlist}

NC_035780.1	346071	346073	-	50
NC_035780.1	990995	990997	-	-51
NC_035780.1	1882691	1882693	-	52
NC_035780.1	1885022	1885024	-	61
NC_035780.1	1933499	1933501	-	53
NC_035780.1	1945182	1945184	+	55
NC_035780.1	1958998	1959000	-	53
NC_035780.1	1983256	1983258	-	-69
NC_035780.1	2538924	2538926	-	-50
NC_035780.1	2541652	2541654	-	-55


In [19]:
#Counting the number of DML
!wc -l {DMLlist}

 1398 ../../analyses/2018-10-25-MethylKit/2018-11-07-DML-Locations.bed


In [20]:
!head {DMRlist}

NC_035780.1	571100	571200	DMR	58
NC_035780.1	573700	573800	DMR	52
NC_035780.1	1885000	1885100	DMR	51
NC_035780.1	1933500	1933600	DMR	53
NC_035780.1	4285700	4285800	DMR	-51
NC_035780.1	24159600	24159700	DMR	51
NC_035780.1	26629500	26629600	DMR	65
NC_035780.1	28563400	28563500	DMR	59
NC_035780.1	29883000	29883100	DMR	-55
NC_035780.1	31302900	31303000	DMR	-61


In [21]:
#Counting the number of DMR
!wc -l {DMRlist}

 162 ../../analyses/2018-10-25-MethylKit/2018-11-07-DMR-Locations.bed
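
To make the five-column layout concrete, here is a minimal sketch that parses the ten DML lines previewed above and tallies the direction of the difference. The function name `parse_dml_line` and the dictionary keys are my own labels, assumed for illustration.

```python
def parse_dml_line(line):
    """Split one tab-delimited DML line into named fields."""
    chrom, start, end, strand, difference = line.rstrip("\n").split("\t")
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "difference": float(difference),
    }


# The ten lines shown by `head` for the DML file
preview = [
    "NC_035780.1\t346071\t346073\t-\t50",
    "NC_035780.1\t990995\t990997\t-\t-51",
    "NC_035780.1\t1882691\t1882693\t-\t52",
    "NC_035780.1\t1885022\t1885024\t-\t61",
    "NC_035780.1\t1933499\t1933501\t-\t53",
    "NC_035780.1\t1945182\t1945184\t+\t55",
    "NC_035780.1\t1958998\t1959000\t-\t53",
    "NC_035780.1\t1983256\t1983258\t-\t-69",
    "NC_035780.1\t2538924\t2538926\t-\t-50",
    "NC_035780.1\t2541652\t2541654\t-\t-55",
]

records = [parse_dml_line(line) for line in preview]
positive = sum(1 for record in records if record["difference"] > 0)
negative = sum(1 for record in records if record["difference"] < 0)
print(positive, negative)  # prints "6 4"
```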


In [22]:
!head {exonList}

NC_035780.1	13578	13603
NC_035780.1	14237	14290
NC_035780.1	14557	14594
NC_035780.1	28961	29073
NC_035780.1	30524	31557
NC_035780.1	31736	31887
NC_035780.1	31977	32565
NC_035780.1	32959	33324
NC_035780.1	66869	66897
NC_035780.1	64123	64334


In [23]:
!wc -l {exonList}

 731279 ../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_Gnomon_exon.bed


In [24]:
!head {intronList}

NC_035780.1	28961	28961
NC_035780.1	29074	30524
NC_035780.1	31558	31736
NC_035780.1	31888	31977
NC_035780.1	32566	32959
NC_035780.1	43110	43112
NC_035780.1	44359	45913
NC_035780.1	46507	64123
NC_035780.1	64335	66869
NC_035780.1	85606	85606


In [25]:
!wc -l {intronList}

 319262 ../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_intron.bed


In [26]:
!head -n 1 {mRNAList}

NC_035780.1	Gnomon	mRNA	28961	33324	.	+	.	ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1


In [27]:
!wc -l {mRNAList}

 60201 ../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_Gnomon_mRNA.gff3


In [30]:
!head {transposableElementsAll}

##gff-version 2
##date 2018-08-23
##sequence-region Cvirginica_v300.fa
NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055
NC_007175.2	RepeatMasker	similarity	1728	1947	26.1	-	.	Target "Motif:REP-6_LMi" 14320 14534
NC_007175.2	RepeatMasker	similarity	1866	2013	33.6	+	.	Target "Motif:LSU-rRNA_Cel" 2372 2520
NC_007175.2	RepeatMasker	similarity	2129	2367	20.5	-	.	Target "Motif:REP-6_LMi" 13886 14118
NC_007175.2	RepeatMasker	similarity	2836	2980	31.5	+	.	Target "Motif:REP-6_LMi" 6216 6359
NC_007175.2	RepeatMasker	similarity	3196	3277	30.5	+	.	Target "Motif:REP-6_LMi" 6572 6653
NC_007175.2	RepeatMasker	similarity	5168	5532	32.9	+	.	Target "Motif:REP-6_LMi" 4620 4983


In [31]:
!wc -l {transposableElementsAll}

 692371 ../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_TE-all.gff


In [32]:
!head {transposableElementsCg}

##gff-version 2
##date 2018-08-27
##sequence-region Cvirginica_v300.fa
NC_007175.2	RepeatMasker	similarity	1866	2013	33.6	+	.	Target "Motif:LSU-rRNA_Cel" 2372 2520
NC_007175.2	RepeatMasker	similarity	6529	6628	19.0	+	.	Target "Motif:(TA)n" 2 102
NC_035780.1	RepeatMasker	similarity	1473	1535	 0.0	+	.	Target "Motif:(TAACCC)n" 1 63
NC_035780.1	RepeatMasker	similarity	5080	7289	32.5	-	.	Target "Motif:Gypsy-62_CGi-I" 2102 4631
NC_035780.1	RepeatMasker	similarity	7423	7489	25.4	-	.	Target "Motif:Gypsy-62_CGi-I" 2097 2163
NC_035780.1	RepeatMasker	similarity	7623	8079	34.1	-	.	Target "Motif:Gypsy-62_CGi-I" 1516 1975
NC_035780.1	RepeatMasker	similarity	8261	8295	14.1	+	.	Target "Motif:(CTCCT)n" 1 33


In [33]:
!wc -l {transposableElementsCg}

 626665 ../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_TE-Cg.gff


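## 2. Identify Gene Background Overlaps with Genome Feature Tracks

I will use `intersectBed` to find overlaps between the gene background and each feature track: DML, DMR, exons, introns, mRNA, and transposable elements (all species and *C. gigas* only). The `-u` option reports each gene background entry once if it overlaps anything in the feature file, and `-wb` appends the original feature entry to each reported overlap.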
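
The counting cells below use the `-u` flag, which writes each entry of the `-a` file once if it overlaps anything in the `-b` file. The sketch below illustrates that logic in plain Python. It is not how bedtools is implemented, and it assumes closed `[start, end]` intervals for simplicity, whereas bedtools interprets BED files as 0-based half-open and GFF files as 1-based. The second exon is hypothetical and is only there to show that two hits still yield one reported entry.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one position.

    Assumption: both intervals are treated as closed [start, end].
    """
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def intersect_unique(a_entries, b_entries):
    """Return each entry of A once if it overlaps any entry of B (like -u)."""
    return [a for a in a_entries if any(overlaps(a, b) for b in b_entries)]


background = [
    ("NC_035780.1", 87542, 87542),
    ("NC_035780.1", 100560, 100560),
]
features = [
    ("NC_035780.1", 100554, 100661),  # exon taken from the notebook output
    ("NC_035780.1", 100500, 100600),  # hypothetical second overlapping exon
]

hits = intersect_unique(background, features)
print(len(hits))  # prints "1"
```

Piping the real `-u` output to `wc -l` therefore counts background positions with at least one overlap, not the number of overlapping features.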

### 2a. DML

In [16]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {geneBackground} \
-b {DMLlist} \
| wc -l
!echo "Gene background overlaps with DML"

 1808
Gene background overlaps with DML


### 2b. DMR

In [49]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {geneBackground} \
-b {DMRlist} \
| wc -l
!echo "Gene background overlaps with DMR"

 128
Gene background overlaps with DMR


### 2c. Exons 

In [18]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {geneBackground} \
-b {exonList} \
| wc -l
!echo "Gene background overlaps with exons"

 420979
Gene background overlaps with exons


In [16]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {geneBackground} \
-b {exonList} \
> 2019-01-04-Gene-Background-Exons.txt

In [17]:
!head -n 1 2019-01-04-Gene-Background-Exons.txt

NC_035780.1	100560	100560	-	13	5	8	31	11	20	16	6	10	23	17	6	25	8	17	8	0	8	21	8	13	19	6	13	41	35	6	15	0	15	NC_035780.1	100554	100661
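
Each line of the `-wb` output is a gene background entry followed by the feature entry it overlapped. A minimal sketch of separating the two halves is below, assuming the gene background contributes the first 34 tab-delimited columns (4 position columns plus 30 count columns, as in the preview in 1b). The constant and function names are hypothetical.

```python
N_BACKGROUND_COLUMNS = 34  # assumption: 4 position columns + 30 count columns


def split_wb_line(line, n_background=N_BACKGROUND_COLUMNS):
    """Split one -wb output line into (background columns, feature columns)."""
    columns = line.rstrip("\n").split("\t")
    return columns[:n_background], columns[n_background:]


# The first line of 2019-01-04-Gene-Background-Exons.txt, rebuilt with tabs
preview = "\t".join(
    ["NC_035780.1", "100560", "100560", "-"]
    + "13 5 8 31 11 20 16 6 10 23 17 6 25 8 17 8 0 8 21 8 13 19 6 13 41 35 6 15 0 15".split()
    + ["NC_035780.1", "100554", "100661"]
)

background_columns, feature_columns = split_wb_line(preview)
print(feature_columns)  # prints "['NC_035780.1', '100554', '100661']"
```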


### 2d. Introns 

In [19]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {geneBackground} \
-b {intronList} \
| wc -l
!echo "Gene background overlaps with introns"

 202086
Gene background overlaps with introns


In [20]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {geneBackground} \
-b {intronList} \
> 2019-01-04-Gene-Background-Introns.txt

In [21]:
!head -n 1 2019-01-04-Gene-Background-Introns.txt

NC_035780.1	87542	87542	+	4	4	0	11	8	3	20	16	4	15	9	6	12	10	2	11	10	1	8	6	2	10	4	6	9	7	2	4	4	0	NC_035780.1	85778	88423


### 2e. mRNA 

In [34]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {geneBackground} \
-b {mRNAList} \
| wc -l
!echo "Gene background overlaps with mRNA"

 614204
Gene background overlaps with mRNA


In [39]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {geneBackground} \
-b {mRNAList} \
> 2018-12-18-Gene-Background-mRNA.txt

In [35]:
!head -n 1 2018-12-18-Gene-Background-mRNA.txt

NC_035780.1	87542	87542	+	4	4	0	11	8	3	20	16	4	15	9	6	12	10	2	11	10	1	8	6	2	10	4	6	9	7	2	4	4	0	NC_035780.1	Gnomon	mRNA	85606	95254	.	-	.	ID=rna4;Parent=gene3;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;Name=XM_022449924.1;gbkey=mRNA;gene=LOC111112434;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1


In [37]:
#Isolate unique genes in gene background-mRNA overlap
! cut -f43 2018-12-18-Gene-Background-mRNA.txt | sort | uniq -c > 2019-03-11-Unique-Genes-in-Gene-Background-mRNA-Overlap.txt

In [39]:
!head -n 1 2019-03-11-Unique-Genes-in-Gene-Background-mRNA-Overlap.txt

 41 ID=rna10000;Parent=gene5866;Dbxref=GeneID:111121983,Genbank:XM_022463489.1;Name=XM_022463489.1;gbkey=mRNA;gene=LOC111121983;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=sodium-coupled neutral amino acid transporter 9-like%2C transcript variant X4;transcript_id=XM_022463489.1


In [40]:
#Count number of unique genes
!wc -l 2019-03-11-Unique-Genes-in-Gene-Background-mRNA-Overlap.txt

 26627 2019-03-11-Unique-Genes-in-Gene-Background-mRNA-Overlap.txt
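
Column 43 is the GFF3 attribute column, and each attribute string starts with an mRNA-specific `ID=rna...` key. The `cut | sort | uniq -c` pipeline therefore appears to count distinct mRNA records, so transcript variants of one gene would be tallied separately. Below is a minimal sketch of counting by the `gene=` key instead. The helper names are hypothetical, the attribute strings are abbreviated, and the `rna5` record is invented purely to show two transcripts collapsing into one gene.

```python
from collections import Counter


def gene_name(attributes):
    """Extract the value of the `gene=` key from a GFF3 attribute string."""
    for field in attributes.split(";"):
        key, _, value = field.partition("=")
        if key == "gene":
            return value
    return None


def count_genes(lines, attribute_column=43):
    """Count overlap lines per gene, using the 1-based attribute column."""
    counts = Counter()
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        name = gene_name(columns[attribute_column - 1])
        if name is not None:
            counts[name] += 1
    return counts


def make_line(attributes):
    """Build a placeholder 43-column line ending in the attribute string."""
    return "\t".join(["."] * 42 + [attributes])


lines = [
    make_line("ID=rna4;Parent=gene3;gene=LOC111112434"),
    make_line("ID=rna5;Parent=gene3;gene=LOC111112434"),  # hypothetical variant
    make_line("ID=rna10000;Parent=gene5866;gene=LOC111121983"),
]

counts = count_genes(lines)
print(len(counts))  # prints "2"
```

On the real overlap file, the lines would come from `open("2018-12-18-Gene-Background-mRNA.txt")` rather than from `make_line`.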


### 2f. Transposable elements (all)

In [35]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {geneBackground} \
-b {transposableElementsAll} \
| wc -l
!echo "Gene background overlaps with TE (all)"

 76830
Gene background overlaps with TE (all)


In [43]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {geneBackground} \
-b {transposableElementsAll} \
> 2018-12-18-Gene-Background-TEall.txt

In [44]:
!head -n 1 2018-12-18-Gene-Background-TEall.txt

NC_007175.2	266	266	+	3	0	3	168	2	166	39	2	37	126	4	122	458	5	453	7	0	7	293	1	292	782	6	776	244	8	236	23	0	23	NC_007175.2	RepeatMasker	similarity	262	1389	31.1	+	.	Target "Motif:REP-6_LMi" 2920 4055


### 2g. Transposable elements (Cg)

In [36]:
! {bedtoolsDirectory}intersectBed \
-u \
-a {geneBackground} \
-b {transposableElementsCg} \
| wc -l
!echo "Gene background overlaps with TE (Cg)"

 58340
Gene background overlaps with TE (Cg)


In [45]:
! {bedtoolsDirectory}intersectBed \
-wb \
-a {geneBackground} \
-b {transposableElementsCg} \
> 2018-12-18-Gene-Background-TECg.txt

In [46]:
!head -n 1 2018-12-18-Gene-Background-TECg.txt

NC_007175.2	2004	2004	+	4	0	4	38	0	38	12	1	11	62	2	60	194	1	193	3	0	3	113	4	109	405	15	390	47	1	46	8	0	8	NC_007175.2	RepeatMasker	similarity	1866	2013	33.6	+	.	Target "Motif:LSU-rRNA_Cel" 2372 2520


## 3. Calculate Percent Overlap with Gene Background

I want to know how much of the original background is represented by DML and DMR. To do this, I'll divide the number of lines in each file by the number of lines in the gene background and multiply by 100.

### 3a. DML

(lines in DML) / (lines in gene background)

In [24]:
(1398 / 670301) * 100

0.20856301870353766

0.2086% overlap between DML and the original background.

### 3b. DMR

(lines in DMR) / (lines in gene background)

In [25]:
(162 / 670301) * 100

0.024168246802555866

0.0242% overlap between DMR and the original background.

I also want proportions for the overlap between the original background and other genome feature files.

### 3c. Exons

(lines overlapped) / (lines in gene background)

In [23]:
(420979 / 670301) * 100

62.804471424031895

Overlap between exons and gene background accounts for 62.80% of the gene background.

### 3d. Introns

(lines overlapped) / (lines in gene background)

In [22]:
(202086 / 670301) * 100

30.148545205810525

Overlap between introns and gene background accounts for 30.15% of the gene background.

### 3e. mRNA

(lines overlapped) / (lines in gene background)

In [17]:
(614204 / 670301) * 100

91.63107320442607

Overlap between mRNA and gene background accounts for 91.63% of the gene background.

### 3f. Transposable elements (all)

(lines overlapped) / (lines in gene background)

In [18]:
(76830 / 670301) * 100

11.462014826175107

Overlap between transposable elements (all) and gene background accounts for 11.46% of the gene background.

### 3g. Transposable elements (*C. gigas* only)

(lines overlapped) / (lines in gene background)

In [19]:
(58340 / 670301) * 100

8.703552583093268

Overlap between transposable elements (*C. gigas* only) and gene background accounts for 8.70% of the gene background.
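
The seven calculations above share one formula, so they can be collected in a single cell. This is a minimal sketch using the counts reported in this notebook; the names `BACKGROUND_LINES`, `counts`, and `percent_of_background` are hypothetical. The DML and DMR entries are file line counts, as in 3a and 3b, while the remaining entries are `intersectBed -u` overlap counts.

```python
BACKGROUND_LINES = 670301  # lines in the gene background file

counts = {
    "DML": 1398,         # lines in the DML file
    "DMR": 162,          # lines in the DMR file
    "Exons": 420979,     # background positions overlapping exons
    "Introns": 202086,   # background positions overlapping introns
    "mRNA": 614204,      # background positions overlapping mRNA
    "TE (all)": 76830,   # background positions overlapping all TE
    "TE (Cg)": 58340,    # background positions overlapping C. gigas TE
}


def percent_of_background(count, total=BACKGROUND_LINES):
    """Express a count as a percentage of the gene background."""
    return count / total * 100


for feature, count in counts.items():
    print(f"{feature}: {percent_of_background(count):.4f}%")
```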