--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: Data Wrangling - FastA Subsetting of Pgenerosa_v070.fa Using samtools faidx date: '2019-06-25 13:04' tags: - samtools - faidx - Panopea generosa - geoduck - Pgenerosa_v070.fa - v070 categories: - 2019 - Miscellaneous --- Steven asked to subset the [Pgenerosa_v070.fa](http://owl.fish.washington.edu/halfshell/genomic-databank/Pgenerosa_v070.fa) (2.1GB) in this [GitHub Issue #705](https://github.com/RobertsLab/resources/issues/705). In that issue, it was determined that a significant portion of the sequencing data that was assembled by Phase Genomics clustered in "scaffolds" 1 - 18. As such, Steven asked to subset just those 18 scaffolds. This was done by using the `samtools faidx` program. Process is documented in the following Jupyter Notebook (GitHub): - [20190625_swoose_pgen_v070_scaffolds.ipynb](https://github.com/RobertsLab/code/blob/master/notebooks/sam/20190625_swoose_pgen_v070_scaffolds.ipynb) --- # RESULTS Output folder: - [20190625_pgen_v070_scaffold_subsetting/](https://gannet.fish.washington.edu/Atumefaciens/20190625_pgen_v070_scaffold_subsetting/) FastA (914MB): - [20190625_pgen_v070_scaffold_subsetting/Pgenerosa_v070.18.fa](https://gannet.fish.washington.edu/Atumefaciens/20190625_pgen_v070_scaffold_subsetting/Pgenerosa_v070.18.fa) FastA Index: - [20190625_pgen_v070_scaffold_subsetting/Pgenerosa_v070.18.fa.fai](https://gannet.fish.washington.edu/Atumefaciens/20190625_pgen_v070_scaffold_subsetting/Pgenerosa_v070.18.fa.fai)