--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: Data Wrangling - Panopea generosa Genome Feature Sequence Lengths date: '2019-08-26 14:37' tags: - Panopea generosa - geoduck - Jupyter - GFF - swoose - v070 - v074 - Pgenerosa_v074 - Pgenerosa_v070 categories: - 2019 - Geoduck Genome Sequencing --- In preparation for a paper we're writing, we needed some summary stats on the various genome assembly feature sequences. I determined the max/min, mean, and median sequence lengths for all the GFF feature files we currently have for Pgenerosa_v070, Pgenerosa_v070_top18_scaffolds, and Pgenerosa_v074. This info will be compiled in to a table for the manuscript. See our Genomic Resources wiki for more info on GFFs: - [Genomic Resources Wiki](https://github.com/RobertsLab/resources/wiki/Genomic-Resources) (GitHub) Calculations were performed using Python in a Jupyter Notebook. Jupyter Notebook (GitHub): - [20190826_swoose_pgen_genome_feature_counts.ipynb](https://github.com/RobertsLab/code/blob/master/notebooks/sam/20190826_swoose_pgen_genome_feature_counts.ipynb) --- # RESULTS I've copied/pasted (and, reordered) the summary data for each of the GFFs that were analyzed, for quick reference. Will get this compiled in to a table of some sort for people to use for the manuscript. ``` Pgenerosa_v070_top18_scaffolds.CDS.gff ------------------------- mean 177.728604 min 1.000000 median 123.000000 max 9851.000000 Pgenerosa_v070_top18_scaffolds.exon.gff ------------------------- mean 219.024085 min 3.000000 median 131.000000 max 16912.000000 Pgenerosa_v070_top18_scaffolds.five_prime_UTR.gff ------------------------- mean 148.558072 min 1.000000 median 89.000000 max 16912.000000 Pgenerosa_v070_top18_scaffolds.gene.gff ------------------------- mean 8696.297953 min 180.000000 median 4621.000000 max 139826.000000 Pgenerosa_v070_top18_scaffolds.mRNA.gff ------------------------- mean 8696.297953 min 180.000000 median 4621.000000 max 139826.000000 Pgenerosa_v070_top18_scaffolds.three_prime_UTR.gff ------------------------- mean 481.02338 min 1.00000 median 221.00000 max 8355.00000 Pgenerosa_v074.CDS.gff ------------------------- mean 193.248049 min 1.000000 median 126.000000 max 11772.000000 Pgenerosa_v074.exon.gff ------------------------- mean 210.612431 min 3.000000 median 130.000000 max 11772.000000 Pgenerosa_v074.five_prime_UTR.gff ------------------------- mean 62.226496 min 1.000000 median 41.000000 max 1259.000000 Pgenerosa_v074.gene.gff ------------------------- mean 12170.897196 min 192.000000 median 5552.000000 max 175905.000000 Pgenerosa_v074.mRNA.gff ------------------------- mean 12170.897196 min 192.000000 median 5552.000000 max 175905.000000 Pgenerosa_v074.three_prime_UTR.gff ------------------------- mean 516.604585 min 1.000000 median 251.000000 max 4080.000000 Pgenerosa_v070.CDS.gff ------------------------- mean 194.539909 min 1.000000 median 132.000000 max 11048.000000 Pgenerosa_v070.exon.gff ------------------------- mean 217.971132 min 3.000000 median 139.000000 max 16912.000000 Pgenerosa_v070.gene.gff ------------------------- mean 3768.232394 min 177.000000 median 1474.000000 max 139826.000000 Pgenerosa_v070.mRNA.gff ------------------------- mean 3768.232394 min 177.000000 median 1474.000000 max 139826.000000 ```