--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: Genome Feature Counts - Panopea-generosa-vv0.74.a4 date: '2019-10-29 14:45' tags: - jupyter notebook - Panopea generosa - geoduck - genome categories: - 2019 - Geoduck Genome Sequencing --- #### UPDATE 20191203 Due to a bug with GenSAS, the final GFF generated by their system to did not properly merge all of the repeats data properly. The _repeats data below should be ignored_! It appears that the other features data (e.g. CDS, exons, etc) was unaffected by the bug. Here's the updated analysis: https://robertslab.github.io/sams-notebook/posts/2019/2019-12-03-Data-Wrangling---Renaming,-Splitting,-and-Feature-Counts-of-Updated-Pgenerosa_v074-GenSAS-Merged-GFF/ I'll leave original notebook entry below for posterity. --- In preparation for a paper we're writing, we needed some summary stats for [Panopea-generosa-vv0.74.a4](https://robertslab.github.io/sams-notebook/posts/2019/2019-09-28-Genome-Annotation---Pgenerosa_v074-a4-Using-GenSAS/). This info will be compiled in to a table for the manuscript. See our Genomic Resources wiki for more info on GFFs: - [Genomic Resources Wiki](https://github.com/RobertsLab/resources/wiki/Genomic-Resources) (GitHub) Calculations were performed using Python in Jupyter Notebooks. Genome Features Jupyter Notebook (GitHub): - [20191029_swoose_pgen_v074.a4_genome_feature_counts.ipynb](https://github.com/RobertsLab/code/blob/master/notebooks/sam/20191029_swoose_pgen_v074.a4_genome_feature_counts.ipynb) Repeat Features Jupyter Notbooke (GitHub): - [20191029_swoose_pgen_v074.a4_genome_repeats_counts.ipyn](https://github.com/RobertsLab/code/blob/master/notebooks/sam/20191029_swoose_pgen_v074.a4_genome_repeats_counts.ipynb) --- # RESULTS I've copied/pasted the summary data for each of the GFFs that were analyzed, for quick reference. Will get this compiled in to a table of some sort for people to use for the manuscript. Also, the repeats were split into individual GFFs by repeat type. Those GFFs can be found here: - [20191029_pgen_v074.a4_repeats_counts](https://gannet.fish.washington.edu/Atumefaciens/20191029_pgen_v074.a4_repeats_counts/) #### Genome Features ``` Panopea-generosa-vv0.74.a4.mRNA.gff3 ------------------------- mean 12903.649559 min 166.000000 median 5453.000000 max 283066.000000 Name: seqlength, dtype: float64 Panopea-generosa-vv0.74.a4.gene.gff3 ------------------------- mean 10811.04461 min 166.00000 median 4464.00000 max 283066.00000 Name: seqlength, dtype: float64 Panopea-generosa-vv0.74.a4.CDS.gff3 ------------------------- mean 201.476988 min 3.000000 median 133.000000 max 13221.000000 Name: seqlength, dtype: float64 Panopea-generosa-vv0.74.a4.tRNA.gff3 ------------------------- mean 74.805659 min 53.000000 median 75.000000 max 314.000000 Name: seqlength, dtype: float64 Panopea-generosa-vv0.74.a4.rRNA.gff3 ------------------------- mean 118.428571 min 113.000000 median 115.000000 max 138.000000 Name: seqlength, dtype: float64 Panopea-generosa-vv0.74.a4.exon.gff3 ------------------------- mean 201.476988 min 3.000000 median 133.000000 max 13221.000000 Name: seqlength, dtype: float64 ``` --- #### Repeats Features ``` Panopea-generosa-vv0.74.a4.repeats.LINE.gff3 ------------------------- percent 2.91 sum 27388849.00 mean 394.85 min 11.00 median 226.00 max 6604.00 Name: seqlength, dtype: float64 Panopea-generosa-vv0.74.a4.repeats.Simple_repeat.gff3 ------------------------- percent 0.5 sum 4733271.0 mean 261.2 min 6.0 median 125.0 max 5981.0 Name: seqlength, dtype: float64 Panopea-generosa-vv0.74.a4.repeats.Unknown.gff3 ------------------------- percent 29.09 sum 2.740281e+08 mean 1.991900e+02 min 1.100000e+01 median 1.440000e+02 max 6.574000e+03 Name: seqlength, dtype: float64 Panopea-generosa-vv0.74.a4.repeats.LTR.gff3 ------------------------- percent 0.22 sum 2060084.00 mean 712.83 min 11.00 median 316.00 max 6541.00 Name: seqlength, dtype: float64 Panopea-generosa-vv0.74.a4.repeats.RC.gff3 ------------------------- percent 0.02 sum 232303.00 mean 425.46 min 13.00 median 464.00 max 674.00 Name: seqlength, dtype: float64 Panopea-generosa-vv0.74.a4.repeats.SINE.gff3 ------------------------- percent 0.65 sum 6133778.00 mean 155.69 min 11.00 median 164.00 max 934.00 Name: seqlength, dtype: float64 Panopea-generosa-vv0.74.a4.repeats.DNA.gff3 ------------------------- percent 0.91 sum 8602532.00 mean 407.82 min 11.00 median 247.00 max 7012.00 Name: seqlength, dtype: float64 ------------------------- Repeats composition of genome (percent): 34.3 ```