65 - Exon Coverage

Author

Steven Roberts

Published

December 10, 2023

Following Li et al. Will take bams and get some data..

Trimmed reads were mapped to the Aiptasia genome using HISAT2 v2.1.0, and mapping coverage per position was extracted using BEDTools v2.17.0. To assess spurious transcription levels, we deter- mined the coverage per exon normalized across all six replicates (assuming every replicate had a coverage of 1 million sequences in total) and then calculated the average coverage ratios of exons 2 to 6 versus exon 1 for every gene

Individual BAMS and their corresponding index files are in each individual sample subirectory:

https://gannet.fish.washington.edu/Atumefaciens/20230821-cvir-stringtie-GCF_002022765.2-isoforms/

A merged BAM of all samples is here (79GB):

https://gannet.fish.washington.edu/Atumefaciens/20230821-cvir-stringtie-GCF_002022765.2-isoforms/20230821_cvir_stringtie_GCF_002022765-sorted-bams-merged.bam

Merged BAM index file (23MB):

https://gannet.fish.washington.edu/Atumefaciens/20230821-cvir-stringtie-GCF_002022765.2-isoforms/20230821_cvir_stringtie_GCF_002022765-sorted-bams-merged.bam.bai

Download BAMs

wget -r \
--no-directories --no-parent \
-P ../data/big/ \
-A "*sorted.bam" https://gannet.fish.washington.edu/Atumefaciens/20230821-cvir-stringtie-GCF_002022765.2-isoforms/
wget -r \
--no-directories --no-parent \
-P ../data/big/ \
-A "*sorted.bam.bai" https://gannet.fish.washington.edu/Atumefaciens/20230821-cvir-stringtie-GCF_002022765.2-isoforms/

bedtools coverage

/home/shared/bedtools2/bin/bedtools coverage \
-a ../genome-features/GCF_002022765.2_C_virginica-3.0_genomic.gff \
-b ../data/big/S9M.sorted.bam \
> ../output/65-exon-coverage/S9gff_exon_coverage.txt
head ../genome-features/GCF_002022765.2_C_virginica-3.0_genomic.gff
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build C_virginica-3.0
#!genome-build-accession NCBI_Assembly:GCF_002022765.2
#!annotation-source NCBI Crassostrea virginica Annotation Release 100
##sequence-region NC_035780.1 1 65668440
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6565
NC_035780.1 RefSeq  region  1   65668440    .   +   .   ID=NC_035780.1:1..65668440;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample
NC_035780.1 Gnomon  gene    13578   14594   .   +   .   ID=gene-LOC111116054;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA
cd ../data

/home/shared/datasets download genome accession GCF_002022765.2 --include gff3,gtf
head /home/shared/8TB_HDD_01/sr320/github/ceabigr/data/ncbi_dataset/data/GCF_002022765.2/genomic.gtf
#gtf-version 2.2
#!genome-build C_virginica-3.0
#!genome-build-accession NCBI_Assembly:GCF_002022765.2
#!annotation-source NCBI Crassostrea virginica Annotation Release 100
NC_035780.1 Gnomon  gene    13578   14594   .   +   .   gene_id "LOC111116054"; db_xref "GeneID:111116054"; gbkey "Gene"; gene "LOC111116054"; gene_biotype "lncRNA"; 
NC_035780.1 Gnomon  exon    13578   13603   .   +   .   gene_id "LOC111116054"; transcript_id "XR_002636969.1"; db_xref "GeneID:111116054"; gbkey "ncRNA"; gene "LOC111116054"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "uncharacterized LOC111116054"; exon_number "1"; 
NC_035780.1 Gnomon  exon    14237   14290   .   +   .   gene_id "LOC111116054"; transcript_id "XR_002636969.1"; db_xref "GeneID:111116054"; gbkey "ncRNA"; gene "LOC111116054"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "uncharacterized LOC111116054"; exon_number "2"; 
NC_035780.1 Gnomon  exon    14557   14594   .   +   .   gene_id "LOC111116054"; transcript_id "XR_002636969.1"; db_xref "GeneID:111116054"; gbkey "ncRNA"; gene "LOC111116054"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "uncharacterized LOC111116054"; exon_number "3"; 
NC_035780.1 Gnomon  gene    28961   33324   .   +   .   gene_id "LOC111126949"; db_xref "GeneID:111126949"; gbkey "Gene"; gene "LOC111126949"; gene_biotype "protein_coding"; 
NC_035780.1 Gnomon  exon    28961   29073   .   +   .   gene_id "LOC111126949"; transcript_id "XM_022471938.1"; db_xref "GeneID:111126949"; gbkey "mRNA"; gene "LOC111126949"; model_evidence "Supporting evidence includes similarity to: 3 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 21 samples with support for all annotated introns"; product "UNC5C-like protein"; exon_number "1"; 
cd    ../genome-features
curl -O https://eagle.fish.washington.edu/Cvirg_tracks/C_virginica-3.0_Gnomon_exon.bed
head ../genome-features/C_virginica-3.0_Gnomon_exon.bed
NC_035780.1 13578   13603
NC_035780.1 14237   14290
NC_035780.1 14557   14594
NC_035780.1 28961   29073
NC_035780.1 30524   31557
NC_035780.1 31736   31887
NC_035780.1 31977   32565
NC_035780.1 32959   33324
NC_035780.1 66869   66897
NC_035780.1 64123   64334

Lets get bam that overlaps with exons

bedtools intersect -a sorted.bam -b exons.gtf > exon_reads.bam

bedtools multicov -bams exon_reads.bam -bed exons.gtf > exon_expression_counts.txt

/home/shared/bedtools2/bin/bedtools multicov \
-bams ../data/big/S77F.sorted.bam \
-bed ../data/ncbi_dataset/data/GCF_002022765.2/genomic.gtf \
> ../output/65-exon-coverage/S77_exon_exp_counts.txt
tail -20 ../output/65-exon-coverage/S77_exon_exp_counts.txt
/home/shared/bedtools2/bin/bedtools coverage \
-a ../genome-features/GCF_902806645.1_cgigas_uk_roslin_v1_genomic-mito.gtf \
-b ../data/big/S9M.sorted.bam \
> ../output/65-exon-coverage/S9gtf_exon_coverage.txt
head -1 ../output/65-exon-coverage/S9gtf_exon_coverage.txt
NC_047559.1 Gnomon  gene    9839    11386   .   +   .   gene_id "LOC117693020"; db_xref "GeneID:117693020"; gbkey "Gene"; gene "LOC117693020"; gene_biotype "lncRNA";   0   0   1548    0.0000000

loops

for bamfile in ../data/big/*sorted.bam; do
    # Calculate coverage using BEDTools
    /home/shared/bedtools2/bin/bedtools coverage -a ../genome-features/GCF_02022765.2_C_virginica- -b $bamfile > ../65-exon-coverage/${bamfile%.bam}_exon_coverage.txt
done