---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Genome Annotation - Pgenerosa_v071 Using GenSAS
date: '2019-07-10 09:28'
tags:
  - GenSAS
  - geoduck
  - Panopea generosa
  - annotation
  - Pgenerosa_v071
  - v071.a2
categories:
  - 2019
  - Geoduck Genome Sequencing
---

In our various attempts to get the _Panopea generosa_ genome annotated in a manner that we're comfortable with (the previous annotation attempts were lacking any annotations in almost all of the largest scaffolds, which didn't seem right), Steven stumbled across [GenSAS](https://www.gensas.org/gensas), a web/GUI-based genome annotation program, so we gave it a shot.

This version of the genome annotation will be referred to as:

- Panopea-generosa-vv0.71.a2

I uploaded the following to the GenSAS website to potentially use as "evidence files":

#### Transcriptome FastA files (links to notebook entries):

- [singular transcriptome](https://robertslab.github.io/sams-notebook/posts/2018/2018-09-04-transcriptome-assembly-geoduck-rnaseq-data/)

- [ctenidia](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Ctenidia-with-HiSeq-and-NovaSeq-Data-on-Mox/)

- [gonad](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Gonad-HiSeq-and-NovaSeq-Data-on-Mox/)

- [heart](https://robertslab.github.io/sams-notebook/posts/2019/2019-02-15-Transcriptome-Assembly---Geoduck-Tissue-Specific-Assembly-Heart/)

- [EPI99 (larvae)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Larvae-Day5-EPI99-with-HiSeq-and-NovaSeq-Data-on-Mox/)

- [EPI115 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Super-Low-OA-EPI115-with-HiSeq-Data-on-Mox/)

- [EPI116 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Super-Low-OA-EPI116-with-HiSeq-Data-on-Mox/)

- [EPI123 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Ambient-OA-EPI123-with-HiSeq-Data-on-Mox/)

- [EPI124 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Ambient-OA-EPI124-with-HiSeq-and-NovaSeq-Data-on-Mox/)

#### TransDecoder protein FastA files (links to notebook entries)

- [singular proteome](https://robertslab.github.io/sams-notebook/posts/2018/2018-11-21-Annotation---Geoduck-Transcritpome-with-TransDecoder/)

- [ctenidia](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Ctenidia-with-Transdecoder-on-Mox/)

- [EPI99 (larvae)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Larvae-Day5-EPI99-with-Transdecoder-on-Mox/)

- [EPI115 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Juvenile-Super-Low-OA-EPI115-with-Transdecoder-on-Mox/)

- [EPI116 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Juvenile-Super-Low-OA-EPI116-with-Transdecoder-on-Mox/)

- [EPI123 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Juvenile-Ambient-OA-EPI123-with-Transdecoder-on-Mox/)

- [EPI124 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Juvenile-Ambient-OA-EPI124-with-Transdecoder-on-Mox/)

- [gonad](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Gonad-with-Transdecoder-on-Mox/)

- [heart](https://robertslab.github.io/sams-notebook/posts/2019/2019-03-18-Transcriptome-Annotation---Geoduck-Heart-with-Transdecoder-on-Mox/)

#### Repeats Files

- [RepeatModeler library](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-26-RepeatModeler---Pgenerosa_v074-for-MAKER-Annotation-on-Emu/)

---

# RESULTS

This took way longer than I was expecting: nearly an entire month (the majority of that time was spent running Augustus _ab initio_ gene prediction, which took ~3 weeks):

![v071 GenSAS project summary processes and runtimes screencap](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190710_gensas_pgen-071_runtimes.png?raw=true)

Output folder:

- [20190710_Pgenerosa_v071_gensas_annotation/](https://gannet.fish.washington.edu/Atumefaciens/20190710_Pgenerosa_v071_gensas_annotation/)

Feature counts:

```shell
awk 'NR>3 { print $3 }' Panopea-generosa-v1.0.a2-merged-2019-08-29-15-28-54.gff3 | sort | uniq -c
264153 CDS
264153 exon
 56167 gene
 56167 mRNA
```

BUSCO assessment:

- 80.7% complete BUSCOs present in predicted genes

Individual feature GFFs were made with the following shell commands:

```shell
features_array=(CDS exon gene mRNA)
input="Panopea-generosa-v1.0.a2-merged-2019-08-29-15-28-54.gff3"

for feature in "${features_array[@]}"
do
  output="Panopea-generosa-v1.0.a2.${feature}.gff3"
  # Copy the three GFF3 header lines (overwrites any earlier output file)
  head -n 3 "${input}" > "${output}"
  # Append only the records whose third column matches the feature type
  awk -v feature="$feature" '$3 == feature {print}' "${input}" \
  >> "${output}"
done
```

- [Panopea-generosa-vv0.71.a2.CDS.gff3](https://gannet.fish.washington.edu/Atumefaciens/20190710_Pgenerosa_v071_gensas_annotation/Panopea-generosa-vv0.71.a2.CDS.gff3)

- [Panopea-generosa-vv0.71.a2.exon.gff3](https://gannet.fish.washington.edu/Atumefaciens/20190710_Pgenerosa_v071_gensas_annotation/Panopea-generosa-vv0.71.a2.exon.gff3)

- [Panopea-generosa-vv0.71.a2.gene.gff3](https://gannet.fish.washington.edu/Atumefaciens/20190710_Pgenerosa_v071_gensas_annotation/Panopea-generosa-vv0.71.a2.gene.gff3)

- [Panopea-generosa-vv0.71.a2.mRNA.gff3](https://gannet.fish.washington.edu/Atumefaciens/20190710_Pgenerosa_v071_gensas_annotation/Panopea-generosa-vv0.71.a2.mRNA.gff3)

SwissProt functional annotations (tab-delimited text):

- BLASTp

  - [Panopea-generosa-v1.0.a2.5d66e5b736200-blast_functional.tab](https://gannet.fish.washington.edu/Atumefaciens/20190710_Pgenerosa_v071_gensas_annotation/Panopea-generosa-v1.0.a2.5d66e5b736200-blast_functional.tab)

- DIAMOND

  - [Panopea-generosa-v1.0.a2.5d66e5cca9bdd-diamond_functional.tab](https://gannet.fish.washington.edu/Atumefaciens/20190710_Pgenerosa_v071_gensas_annotation/Panopea-generosa-v1.0.a2.5d66e5cca9bdd-diamond_functional.tab)

Pfam annotations (tab-delimited text):

- [Panopea-generosa-v1.0.a2.5d65aa832a1bf-pfam.tab](https://gannet.fish.washington.edu/Atumefaciens/20190710_Pgenerosa_v071_gensas_annotation/Panopea-generosa-v1.0.a2.5d65aa832a1bf-pfam.tab)

Grabbed the top 10 most abundant Pfam Accessions to see how things looked:

| Feature Count | Pfam Accession | Pfam                     |
|---------------|----------------|--------------------------|
| 364           | PF00643.19     | B-box zinc finger        |
| 293           | PF07690.11     | Major facilitator family |
| 228           | PF00001.16     | Rhodopsin-like receptors |
| 220           | PF12796.2      | Ankyrin repeat           |
| 209           | PF00651.26     | BTB/POZ domain           |
| 206           | PF00069.20     | Protein kinase domain    |
| 180           | PF00067.17     | Cytochrome P450          |
| 175           | PF02931.18     | Ligand-gated ion channel |
| 174           | PF00400.27     | WD40 repeat              |
| 174           | PF00059.16     | C-type lectin            |

A rhodopsin-like protein family appears in the Top 10 most abundant Pfams?! Rhodopsins themselves are involved in light detection, but PF00001 is the broad rhodopsin-like (class A) G protein-coupled receptor family, so many of these hits are likely receptors for other signals.

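
The top 10 tally above can be reproduced with a short pipeline along these lines. This is only a sketch, not the command that was actually run: the helper name `top_pfam` is made up for illustration, and the default column number (2) is an assumption about where the Pfam accession sits in the GenSAS Pfam table, so check the file's layout (and whether it has a header row) before trusting the counts.

```shell
# Hypothetical helper: tally the most frequent values in one column of a
# tab-delimited file.
#   $1 = input file
#   $2 = column holding the Pfam accession (default 2 is an ASSUMPTION)
#   $3 = number of rows to report (default 10)
top_pfam() {
  cut -f "${2:-2}" "$1" \
    | sort \
    | uniq -c \
    | sort -k1,1nr \
    | head -n "${3:-10}"
}

# Example call (file name taken from the Pfam link above):
# top_pfam Panopea-generosa-v1.0.a2.5d65aa832a1bf-pfam.tab 2 10
```

Each output row is a count followed by the accession, the same shape as the `uniq -c` feature counts earlier in this post.
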
InterProScan annotations (tab-delimited text):

- [Panopea-generosa-v1.0.a2.5d65aa8055fad-interproscan.tab](https://gannet.fish.washington.edu/Atumefaciens/20190710_Pgenerosa_v071_gensas_annotation/Panopea-generosa-v1.0.a2.5d65aa8055fad-interproscan.tab)

  - Contains gene ontology (GO) terms

Project Summary file (TEXT):

- [t_5d263cd070fe7-5d67fbd85ec41-publish-project-summary.txt](https://gannet.fish.washington.edu/Atumefaciens/20190710_Pgenerosa_v071_gensas_annotation/t_5d263cd070fe7-5d67fbd85ec41-publish-project-summary.txt)

```
=================================
 Project Summary
---------------------------------
# Project Information
Project Name : Pgenerosa_v071
Create Date : 2019-07-10 12:30:24

# Project Properties
Genus : Panopea
Species : generosa
Project Type : invertebrate
Prefix : PGEN_
Common Name : Pacific geoduck
Genetic Code : Standard Code

# Input FASTA
Filename : Pgenerosa_v071.fasta
Filesize : 1.32 GB
Number of Sequence : 14014

=================================
 Job Information
---------------------------------
# Official Gene Set
>PASA Refinement
- version : 2.3.3
- Transcripts FASTA file : Trinity.fasta

# The source Job of the refinement job
>Augustus-01
- version : 3.3.1
- Species : fly
- Report genes on : both
- Allowed gene structure : partial
- cDNA (transcripts) sequences : Trinity.fasta
- Protein sequences : 20180827_trinity_geoduck.fasta.transdecoder.fa

# The consensus mask Job
>Masked Repeat Consensus

# The source jobs for consensus mask job
>RepeatMasker
>RepeatModeler

# Family copy number summary
Family Copy Numbers
DNA 85
DNA/Academ 264
DNA/Crypton 200
DNA/Kolobok-T2 188
DNA/MuLE-MuDR 94
DNA/PIF-Harbinger 482
DNA/Sola 122
DNA/TcMar-Mariner 599
DNA/TcMar-Tc1 1266
DNA/hAT-Tip100 808
DNA/hAT-Tip100? 255
Type:DNA 4363
LINE 2153
LINE/CR1 4122
LINE/CR1-Zenon 1717
LINE/I-Nimb 72
LINE/Jockey 510
LINE/L1-Tx1 967
LINE/L2 1896
LINE/Penelope 735
LINE/Proto2 155
LINE/R2-Hero 211
LINE/RTE-X 2275
LINE/Tad1 97
Type:LINE 14910
Type:SINE 0
LTR/DIRS 192
LTR/Gypsy 1420
LTR/Ngaro 533
LTR/Pao 146
Type:LTR 2291
Type:EVERYTHING_TE 21564
Type:Simple_repeat 107
Type:Unknown 115322

# The functional Jobs on the OGS
>BLAST protein vs protein (blastp)_SP01
- version : 2.7.1
- Protein Data Set : SwissProt
- Maximum HSP Distinace : 30000
- Output type : tab
- Matrix : BLOSUM62
- Expect : 1e-8
- Word Size : 3
- Gap Open : 11
- Gap Extend : 1
>DIAMOND Functional SP01
- version : 0.9.22
- Protein Data Set : SwissProt
>BLAST protein vs protein (blastp)
- version : 2.7.1
- Protein Data Set : 20180827_trinity_geoduck.fasta.transdecoder.fa
- Maximum HSP Distinace : 30000
- Output type : tab
- Matrix : BLOSUM62
- Expect : 1e-8
- Word Size : 3
- Gap Open : 11
- Gap Extend : 1
>DIAMOND Functional
- version : 0.9.22
- Protein Data Set : 20180827_trinity_geoduck.fasta.transdecoder.fa
>InterProScan
- version : 5.29-68.0
>Pfam
- version : 1.6
- E-value Sequence : 1
- E-value Domain : 10
>SignalP
- version : 4.1
- Organism group : euk
- Method : best
- D-cutoff for SignalP-noTM networks : 0.45
- D-cutoff for SignalP-TM networks : 0.50
- Minimal predicted signal peptide length : 10
- Truncate to sequence length : 70
```

Overall, this annotation is much more believable than the previous MAKER annotations, because GenSAS actually predicts genes on all the scaffolds (unlike MAKER)! It will be interesting to compare this to the GenSAS [Panopea-generosa-vv0.74.a3 annotation](https://robertslab.github.io/sams-notebook/posts/2019/2019-07-10-Genome-Annotation---Pgenerosa_v074-Using-GenSAS/).
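
One way to sanity-check the "genes on all the scaffolds" observation is to tally gene features per scaffold in the gene-only GFF3. The snippet below is a rough sketch rather than something that was run for this post. The helper name `genes_per_scaffold` is hypothetical, and it relies only on the standard GFF3 layout (column 1 is the sequence ID, column 3 is the feature type).

```shell
# Hypothetical helper: print "<gene count><TAB><scaffold>" for every scaffold
# that carries at least one gene feature, largest counts first.
genes_per_scaffold() {
  awk -F '\t' '
    /^#/ { next }                 # skip GFF3 header and comment lines
    $3 == "gene" { count[$1]++ }  # column 1 = seqid, column 3 = feature type
    END { for (s in count) printf "%d\t%s\n", count[s], s }
  ' "$1" | sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Example calls (file name taken from the feature GFF links above):
# genes_per_scaffold Panopea-generosa-vv0.71.a2.gene.gff3 | head
# genes_per_scaffold Panopea-generosa-vv0.71.a2.gene.gff3 | wc -l
```

The `wc -l` result is the number of scaffolds with at least one gene, which can be compared against the 14014 sequences listed for the input FASTA in the project summary.
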