--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: Genome Annotation - Pgenerosa_v074 MAKER on Mox with Stringtie Transcripts GFF date: '2019-08-19 08:32' tags: - Panopea generosa - geoduck - v074 - MAKER - mox categories: - 2019 - Geoduck Genome Sequencing --- For reference, this annotation will be referred to as Pgenerosa_v074.1. [I previously created a subset of the Pgenerosa_v070 genome assembly](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-25-Data-Wrangling---FastA-Subsetting-of-Pgenerosa_v070.fa-Using-samtools-faidx/) that contains just the largest 18 scaffolds (these scaffolds were produced by Phase Genomics, utilizing some Hi-C sequencing). The new subsetted genome is labeled as [Pgenerosa_v074.fa](http://owl.fish.washington.edu/halfshell/genomic-databank/Pgenerosa_v074.fa) (914MB). As part of that, [I previously annotated this using MAKER, on 20190709](https://robertslab.github.io/sams-notebook/posts/2019/2019-07-01-Genome-Annotation---Pgenerosa_v074-MAKER-on-Mox/). This time, I've add transcripts and their isoforms, as [predicted by Stringtie on (from 20190723)](https://robertslab.github.io/sams-notebook/posts/2019/2019-07-23-Genome-Annotation---Pgenerosa_v074-Transcript-Isoform-ID-with-Stringtie-on-Mox/) to the MAKER process to see if we can improve annotations, particularly in Scaffold1. This scaffold has been void of any gene-feature annotations in any previous MAKER runs (Pgenerosa_v070, Pgenerosa_v071, and Pgenerosa_v074). This will perform the following: - one round of MAKER gene model predictions - two rounds of SNAP gene model training/predictions - renaming of gene models to NCBI-standardized convention - functional characterization of protein models (via BLASTp) - functional characterization of protein domains (via InterProScan5) Here are a list of the input files used for the various components of the MAKER annotation: #### Stringtie GFF Pgenerosa_v074.stringtie.gff: Made from Pgenerosa_v074.stringtie.gtf using `gffread Pgenerosa_v074.stringtie.gtf -o Pgenerosa_v074.stringtie.gff` on 20190819. Run using Mox build node. #### Transcriptome FastA files (links to notebook entries): - [ctenidia](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Ctenidia-with-HiSeq-and-NovaSeq-Data-on-Mox/) - [gonad](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Gonad-HiSeq-and-NovaSeq-Data-on-Mox/) - [heart](https://robertslab.github.io/sams-notebook/posts/2019/2019-02-15-Transcriptome-Assembly---Geoduck-Tissue-Specific-Assembly-Heart/) - [EPI99 (larvae)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Larvae-Day5-EPI99-with-HiSeq-and-NovaSeq-Data-on-Mox/) - [EPI115 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Super-Low-OA-EPI115-with-HiSeq-Data-on-Mox/) - [EPI116 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Super-Low-OA-EPI116-with-HiSeq-Data-on-Mox/) - [EPI123 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Ambient-OA-EPI123-with-HiSeq-Data-on-Mox/) - [EPI124 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Ambient-OA-EPI124-with-HiSeq-and-NovaSeq-Data-on-Mox/) #### NCBI Protein FastA files - NCBI _Crassostrea gigas_ proteome (downloaded 20181119): `GCA_000297895.1_oyster_v9_protein.faa` - NCBI _Crassostrea virginica_ proteome (downloaded 20181119): `GCF_002022765.2_C_virginica-3.0_protein.faa` - SwissProt BLASTp database(downloaded 20190109): uniprot_sprot.fasta #### TransDecoder protein FastA files (links to notebook entries) - [singular proteome](https://robertslab.github.io/sams-notebook/posts/2018/2018-11-21-Annotation---Geoduck-Transcritpome-with-TransDecoder/) - [ctenidia](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Ctenidia-with-Transdecoder-on-Mox/) - [EPI99 (larvae)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Larvae-Day5-EPI99-with-Transdecoder-on-Mox/) - [EPI115 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Juvenile-Super-Low-OA-EPI115-with-Transdecoder-on-Mox/) - [EPI116 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Juvenile-Super-Low-OA-EPI116-with-Transdecoder-on-Mox/) - [EPI123 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Juvenile-Ambient-OA-EPI123-with-Transdecoder-on-Mox/) - [EPI124 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Juvenile-Ambient-OA-EPI124-with-Transdecoder-on-Mox/) - [gonad](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Gonad-with-Transdecoder-on-Mox/) - [heart](https://robertslab.github.io/sams-notebook/posts/2019/2019-03-18-Transcriptome-Annotation---Geoduck-Heart-with-Transdecoder-on-Mox/) #### Repeats Files - [RepeatModeler library](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-26-RepeatModeler---Pgenerosa_v074-for-MAKER-Annotation-on-Emu/) - [RepeatMasker GFF](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-26-RepeatMasker---Pgenerosa_v074-for-Transposable-Element-ID-on-Roadrunner/) SBATCH script (GitHub): - [20190819_pgen_maker_v074_with_stringtie_annotation.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20190819_pgen_maker_v074_with_stringtie_annotation.sh) ```shell #!/bin/bash ## Job Name #SBATCH --job-name=maker_stingtie_pgen074 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=2 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=21-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20190819_pgen_maker_v074_with_stringtie_annotation # Exit if any command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 # Load Open MPI module for parallel, multi-node processing module load icc_19-ompi_3.1.2 # SegFault fix? export THREADS_DAEMON_MODEL=1 # Document programs in PATH (primarily for program version ID) date >> system_path.log echo "" >> system_path.log echo "System PATH for $SLURM_JOB_ID" >> system_path.log echo "" >> system_path.log printf "%0.s-" {1..10} >> system_path.log echo "${PATH}" | tr : \\n >> system_path.log # Add BLAST to system PATH export PATH=$PATH:/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin export BLASTDB=/gscratch/srlab/blastdbs/UniProtKB_20181008 ## Establish variables for more readable code wd=$(pwd) maker_dir=/gscratch/srlab/programs/maker-2.31.10/bin snap_dir=/gscratch/srlab/programs/maker-2.31.10/exe/snap ### Paths to Maker binaries maker=${maker_dir}/maker gff3_merge=${maker_dir}/gff3_merge maker2zff=${maker_dir}/maker2zff fathom=${snap_dir}/fathom forge=${snap_dir}/forge hmmassembler=${snap_dir}/hmm-assembler.pl fasta_merge=${maker_dir}/fasta_merge map_ids=${maker_dir}/maker_map_ids map_gff_ids=${maker_dir}/map_gff_ids map_fasta_ids=${maker_dir}/map_fasta_ids functional_fasta=${maker_dir}/maker_functional_fasta functional_gff=${maker_dir}/maker_functional_gff ipr_update_gff=${maker_dir}/ipr_update_gff iprscan2gff3=${maker_dir}/iprscan2gff3 blastp_dir=${wd}/blastp_annotation maker_blastp=${wd}/blastp_annotation/blastp.outfmt6 maker_prot_fasta=${wd}/snap02/Pgenerosa_v074_snap02.all.maker.proteins.fasta maker_prot_fasta_renamed=${wd}/snap02/Pgenerosa_v074_snap02.all.maker.proteins.renamed.fasta maker_transcripts_fasta=${wd}/snap02/Pgenerosa_v074_snap02.all.maker.transcripts.fasta maker_transcripts_fasta_renamed=${wd}/snap02/Pgenerosa_v074_snap02.all.maker.transcripts.renamed.fasta snap02_gff=${wd}/snap02/Pgenerosa_v074_snap02.all.gff snap02_gff_renamed=${wd}/snap02/Pgenerosa_v074_snap02.all.renamed.gff put_func_gff=Pgenerosa_v074_genome_snap02.all.renamed.putative_function.gff put_func_prot=Pgenerosa_v074_genome_snap02.all.maker.proteins.renamed.putative_function.fasta put_func_trans=Pgenerosa_v074_genome_snap02.all.maker.transcripts.renamed.putative_function.fasta put_domain_gff=Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff ips_dir=${wd}/interproscan_annotation ips_base=Pgenerosa_v074_maker_proteins_ips ips_name=Pgenerosa_v074_maker_proteins_ips.tsv id_map=${wd}/snap02/Pgenerosa_v074_genome.map ips_domains=Pgenerosa_v074_genome_snap02.all.renamed.visible_ips_domains.gff ## Path to blastp blastp=/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin/blastp ## Path to InterProScan5 interproscan=/gscratch/srlab/programs/interproscan-5.31-70.0/interproscan.sh ## Store path to options control file maker_opts_file=./maker_opts.ctl ### Path to genome FastA file genome=/gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v074.fa ### Paths to transcriptome FastA files ctendia_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/ctenidia/Trinity.fasta gonad_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/gonad/Trinity.fasta heart_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/heart/Trinity.fasta EPI99_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/larvae/EPI99/Trinity.fasta EPI115_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI115/Trinity.fasta EPI116_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI116/Trinity.fasta EPI123_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI123/Trinity.fasta EPI124_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI124/Trinity.fasta ### Path to Stringtie transcripts GFF stringtie_gff=/gscratch/srlab/sam/data/P_generosa/transcriptomes/Pgenerosa_v074.stringtie.gff ### Path to Crassotrea gigas NCBI protein FastA gigas_proteome=/gscratch/srlab/sam/data/C_gigas/gigas_ncbi_protein/GCA_000297895.1_oyster_v9_protein.faa ### Path to Crassostrea virginica NCBI protein FastA virginica_proteome=/gscratch/srlab/sam/data/C_virginica/virginica_ncbi_protein/GCF_002022765.2_C_virginica-3.0_protein.faa ### Path to Panopea generosa TransDecoder protein FastAs panopea_td_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/20180827_trinity_geoduck.fasta.transdecoder.pep pgen_td_ctenidia_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/ctenidia/Trinity.fasta.transdecoder.pep pgen_td_larvae_EPI99_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/larvae/EPI99/Trinity.fasta.transdecoder.pep pgen_td_juv_EPI115_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/juvenile/EPI115/Trinity.fasta.transdecoder.pep pgen_td_juv_EPI116_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/juvenile/EPI116/Trinity.fasta.transdecoder.pep pgen_td_juv_EPI123_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/juvenile/EPI123/Trinity.fasta.transdecoder.pep pgen_td_juv_EPI124_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/juvenile/EPI124/Trinity.fasta.transdecoder.pep pgen_td_gonad_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/gonad/Trinity.fasta.transdecoder.pep pgen_td_heart_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/heart/Trinity.fasta.transdecoder.pep ### Path to P.generosa-specific RepeatModeler library repeat_library=/gscratch/srlab/sam/data/P_generosa/repeats/Pgenerosa_v074-families.fa ### Path to P.generosa-specific RepeatMasker GFF rm_gff=/gscratch/srlab/sam/data/P_generosa/repeats/Pgenerosa_v074.fa.out.gff ### Path to SwissProt database for BLASTp sp_db_blastp=/gscratch/srlab/blastdbs/UniProtKB_20190109/uniprot_sprot.fasta ## Make directories mkdir blastp_annotation mkdir interproscan_annotation mkdir snap01 mkdir snap02 ## Create Maker control files needed for running Maker, only if it doesn't already exist and then edit it. ### Edit options file ### Set paths to P.generosa genome and transcriptome. ### Set path to combined C. gigas, C.virginica, P.generosa proteomes. ### The use of the % symbol sets the delimiter sed uses for arguments. ### Normally, the delimiter that most examples use is a slash "/". ### But, we need to expand the variables into a full path with slashes, which screws up sed. ### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%"). if [ ! -e maker_opts.ctl ]; then $maker -CTL sed -i "/^genome=/ s% %$genome %" "$maker_opts_file" # Set transcriptomes to use sed -i "/^est=/ s% %\ ${ctendia_transcriptome},\ ${gonad_transcriptome},\ ${heart_transcriptome},\ ${EPI99_transcriptome},\ ${EPI115_transcriptome},\ ${EPI116_transcriptome},\ ${EPI123_transcriptome},\ ${EPI124_transcriptome} \ %" \ "$maker_opts_file" # Set transcipt GFFs to use sed -i "est_gff=/ s% %\ ${stringtie_gff} \ %" \ "$maker_opts_file" # Set proteomes to use sed -i "/^protein=/ s% %\ ${gigas_proteome},\ ${panopea_td_proteome},\ ${pgen_td_ctenidia_proteome},\ ${pgen_td_gonad_proteome},\ ${pgen_td_heart_proteome},\ ${pgen_td_juv_EPI115_proteome},\ ${pgen_td_juv_EPI116_proteome},\ ${pgen_td_juv_EPI123_proteome},\ ${pgen_td_juv_EPI124_proteome},\ ${pgen_td_larvae_EPI99_proteome},\ ${virginica_proteome} \ %" \ "$maker_opts_file" # Set RepeatModeler library to use sed -i "/^rmlib=/ s% %$repeat_library %" "$maker_opts_file" # Set RepeatMasker GFF to use sed -i "/^rm_gff=/ s% %${rm_gff} %" "$maker_opts_file" # Set est2ggenome to 1 - tells MAKER to use transcriptome FastAs sed -i "/^est2genome=0/ s/est2genome=0/est2genome=1/" "$maker_opts_file" # Set protein2genome to 1 - tells MAKER to use protein FastAs sed -i "/^protein2genome=0/ s/protein2genome=0/protein2genome=1/" "$maker_opts_file" fi ## Run Maker ### Specify number of nodes to use. mpiexec -n 56 $maker ## Merge gffs ${gff3_merge} -d Pgenerosa_v074.maker.output/Pgenerosa_v074_master_datastore_index.log ## GFF with no FastA in footer ${gff3_merge} -n -s -d Pgenerosa_v074.maker.output/Pgenerosa_v074_master_datastore_index.log > Pgenerosa_v074.maker.all.noseqs.gff ## Merge all FastAs ${fasta_merge} -d Pgenerosa_v074.maker.output/Pgenerosa_v074_master_datastore_index.log ## Extract GFF alignments for use in subsequent MAKER rounds ### Transcript alignments awk '{ if ($2 == "est2genome") print $0 }' Pgenerosa_v074.maker.all.noseqs.gff > Pgenerosa_v074.maker.all.noseqs.est2genome.gff ### Protein alignments awk '{ if ($2 == "protein2genome") print $0 }' Pgenerosa_v074.maker.all.noseqs.gff > Pgenerosa_v074.maker.all.noseqs.protein2genome.gff ### Repeat alignments awk '{ if ($2 ~ "repeat") print $0 }' Pgenerosa_v074.maker.all.noseqs.gff > Pgenerosa_v074.maker.all.noseqs.repeats.gff ## Run SNAP training, round 1 cd "${wd}" cd snap01 ${maker2zff} ../Pgenerosa_v074.all.gff ${fathom} -categorize 1000 genome.ann genome.dna ${fathom} -export 1000 -plus uni.ann uni.dna ${forge} export.ann export.dna ${hmmassembler} genome . > Pgenerosa_v074_snap01.hmm ## Initiate second Maker run. ### Copy initial maker control files and ### Default gene prediction settings are 0 (i.e. don't generate Maker gene predictions) ### - use GFF subsets generated in first round of MAKER ### - set location of snaphmm file to use for gene prediction ### Percent symbols used below are the sed delimiters, instead of the default "/", ### due to the need to use file paths. if [ ! -e maker_opts.ctl ]; then $maker -CTL sed -i "/^genome=/ s% %$genome %" maker_opts.ctl # Set transcriptomes to use sed -i "/^est=/ s% %\ ${ctendia_transcriptome},\ ${gonad_transcriptome},\ ${heart_transcriptome},\ ${EPI99_transcriptome},\ ${EPI115_transcriptome},\ ${EPI116_transcriptome},\ ${EPI123_transcriptome},\ ${EPI124_transcriptome} %" \ "$maker_opts_file" # Set proteomes to use sed -i "/^protein=/ s% %\ ${gigas_proteome},\ ${panopea_td_proteome},\ ${pgen_td_ctenidia_proteome},\ ${pgen_td_gonad_proteome},\ ${pgen_td_heart_proteome},\ ${pgen_td_juv_EPI115_proteome},\ ${pgen_td_juv_EPI116_proteome},\ ${pgen_td_juv_EPI123_proteome},\ ${pgen_td_juv_EPI124_proteome},\ ${pgen_td_larvae_EPI99_proteome},\ ${virginica_proteome} \ %" \ "$maker_opts_file" # Set RepeatModeler library to use sed -i "/^rmlib=/ s% %$repeat_library %" "$maker_opts_file" sed -i "/^est_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.est2genome.gff %" maker_opts.ctl sed -i "/^protein_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.protein2genome.gff %" maker_opts.ctl sed -i "/^rm_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.repeats.gff %" maker_opts.ctl sed -i "/^snaphmm=/ s% %Pgenerosa_v074_snap01.hmm %" maker_opts.ctl fi ## Run Maker ### Set basename of files and specify number of CPUs to use mpiexec -n 56 $maker \ -base Pgenerosa_v074_snap01 ## Merge gffs ${gff3_merge} -d Pgenerosa_v074_snap01.maker.output/Pgenerosa_v074_snap01_master_datastore_index.log ## GFF with no FastA in footer ${gff3_merge} -n -s -d Pgenerosa_v074_snap01.maker.output/Pgenerosa_v074_snap01_master_datastore_index.log > Pgenerosa_v074_snap01.maker.all.noseqs.gff ## Run SNAP training, round 2 cd "${wd}" cd snap02 ${maker2zff} ../snap01/Pgenerosa_v074_snap01.all.gff ${fathom} -categorize 1000 genome.ann genome.dna ${fathom} -export 1000 -plus uni.ann uni.dna ${forge} export.ann export.dna ${hmmassembler} genome . > Pgenerosa_v074_snap02.hmm ## Initiate third and final Maker run. if [ ! -e maker_opts.ctl ]; then $maker -CTL sed -i "/^genome=/ s% %$genome %" maker_opts.ctl # Set transcriptomes to use sed -i "/^est=/ s% %\ ${ctendia_transcriptome},\ ${gonad_transcriptome},\ ${heart_transcriptome},\ ${EPI99_transcriptome},\ ${EPI115_transcriptome},\ ${EPI116_transcriptome},\ ${EPI123_transcriptome},\ ${EPI124_transcriptome} %" \ "$maker_opts_file" # Set proteomes to use sed -i "/^protein=/ s% %\ ${gigas_proteome},\ ${panopea_td_proteome},\ ${pgen_td_ctenidia_proteome},\ ${pgen_td_gonad_proteome},\ ${pgen_td_heart_proteome},\ ${pgen_td_juv_EPI115_proteome},\ ${pgen_td_juv_EPI116_proteome},\ ${pgen_td_juv_EPI123_proteome},\ ${pgen_td_juv_EPI124_proteome},\ ${pgen_td_larvae_EPI99_proteome},\ ${virginica_proteome} \ %" \ "$maker_opts_file" # Set RepeatModeler library to use sed -i "/^rmlib=/ s% %$repeat_library %" "$maker_opts_file" sed -i "/^est_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.est2genome.gff %" maker_opts.ctl sed -i "/^protein_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.protein2genome.gff %" maker_opts.ctl sed -i "/^rm_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.repeats.gff %" maker_opts.ctl sed -i "/^snaphmm=/ s% %Pgenerosa_v074_snap02.hmm %" maker_opts.ctl fi ## Run Maker ### Set basename of files and specify number of CPUs to use mpiexec -n 56 $maker \ -base Pgenerosa_v074_snap02 ## Merge gffs ${gff3_merge} \ -d Pgenerosa_v074_snap02.maker.output/Pgenerosa_v074_snap02_master_datastore_index.log ## GFF with no FastA in footer ${gff3_merge} -n -s -d Pgenerosa_v074_snap02.maker.output/Pgenerosa_v074_snap02_master_datastore_index.log > Pgenerosa_v074_snap02.maker.all.noseqs.gff ## Merge FastAs ${fasta_merge} \ -d Pgenerosa_v074_snap02.maker.output/Pgenerosa_v074_snap02_master_datastore_index.log # Create copies of files for mapping cp "${maker_prot_fasta}" "${maker_prot_fasta_renamed}" cp "${maker_transcripts_fasta}" "${maker_transcripts_fasta_renamed}" cp "${snap02_gff}" "${snap02_gff_renamed}" # Map IDs ## Change gene names ${map_ids} \ --prefix PGEN_ \ --justify 8 \ "${snap02_gff}" \ > "${id_map}" ## Map GFF IDs ${map_gff_ids} \ "${id_map}" \ "${snap02_gff_renamed}" ## Map FastAs ### Proteins ${map_fasta_ids} \ "${id_map}" \ "${maker_prot_fasta_renamed}" ### Transcripts ${map_fasta_ids} \ "${id_map}" \ "${maker_transcripts_fasta_renamed}" # Run InterProScan 5 ## disable-precalc since this requires external database access (which Mox does not allow) cd "${ips_dir}" ${interproscan} \ --input "${maker_prot_fasta_renamed}" \ --goterms \ --output-file-base ${ips_base} \ --disable-precalc # Run BLASTp cd "${blastp_dir}" ${blastp} \ -query "${maker_prot_fasta_renamed}" \ -db ${sp_db_blastp} \ -out "${maker_blastp}" \ -max_target_seqs 1 \ -evalue 1e-6 \ -outfmt 6 \ -num_threads 28 # Functional annotations cd "${wd}" ## Add putative gene functions ### GFF ${functional_gff} \ ${sp_db_blastp} \ "${maker_blastp}" \ "${snap02_gff_renamed}" \ > ${put_func_gff} ### Proteins ${functional_fasta} \ ${sp_db_blastp} \ "${maker_blastp}" \ "${maker_prot_fasta_renamed}" \ > ${put_func_prot} ### Transcripts ${functional_fasta} \ ${sp_db_blastp} \ "${maker_blastp}" \ "${maker_transcripts_fasta_renamed}" \ > ${put_func_trans} ## Add InterProScan domain info ### Add searchable tags ${ipr_update_gff} \ ${put_func_gff} \ "${ips_dir}"/${ips_name} \ > ${put_domain_gff} ### Add viewable features for genome browsers (JBrowse, Gbrowse, Web Apollo) ${iprscan2gff3} \ "${ips_dir}"/${ips_name} \ "${snap02_gff_renamed}" \ > ${ips_domains} ``` --- # RESULTS Well, this ran relatively quickly: A little over seven days. ![Pgenerosa_v074 MAKER with Stringtie runtime screencap](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190819_pgen_maker_v074_with_stringtie_annotation_runtime.png?raw=true) Output folder: - [20190819_pgen_maker_v074_with_stringtie_annotation/](https://gannet.fish.washington.edu/Atumefaciens/20190819_pgen_maker_v074_with_stringtie_annotation/) The important files: - [Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff](https://gannet.fish.washington.edu/Atumefaciens/20190819_pgen_maker_v074_with_stringtie_annotation/Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff) (373MB) - GFF file with all contigs annotated with putative functions and functional domains. - _INCLUDES SEQUENCE FASTAS AT END OF FILE!_ - Generated with one round of MAKER gene prediction, followed by two rounds of SNAP _ab-initio_ gene prediction. - MD5: 8e8391912e481695addac8fed320a8a1 - [Pgenerosa_v074_genome_snap02.all.maker.proteins.renamed.putative_function.fasta](https://gannet.fish.washington.edu/Atumefaciens/20190819_pgen_maker_v074_with_stringtie_annotation/Pgenerosa_v074_genome_snap02.all.maker.proteins.renamed.putative_function.fasta) (432KB) - Annotated proteins FastA file. - Generated with one round of MAKER gene prediction, followed by two rounds of SNAP _ab-initio_ gene prediction. - [Pgenerosa_v074_genome_snap02.all.maker.transcripts.renamed.putative_function.fasta](https://gannet.fish.washington.edu/Atumefaciens/20190819_pgen_maker_v074_with_stringtie_annotation/Pgenerosa_v074_genome_snap02.all.maker.transcripts.renamed.putative_function.fasta) (1.2MB) - Annotated transcripts FastA file. - Generated with one round of MAKER gene prediction, followed by two rounds of SNAP _ab-initio_ gene prediction. I should've just split the GFF as part of the Mox SBATCH script, but I didn't so I did it locally on my computer. Here are the commands for splitting the GFF. All the GFF files have been addeed to our [Genomic Resources wiki](https://github.com/RobertsLab/resources/wiki/Genomic-Resources) (GitHub). - CDS GFF: - [Pgenerosa_v074.1.CDS.gff ](http://owl.fish.washington.edu/halfshell/genomic-databank/Pgenerosa_v074.1.CDS.gff ) (605KB) - MD5: `78165439a8fd3e399e90e7eb1a0ea221` ```shell awk 'BEGIN { print "##gff-version 3" ; } $3 == "CDS" {print}' \ Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \ > Pgenerosa_v074.1.CDS.gff ``` - exon GFF: - [Pgenerosa_v074.1.exon.gff](http://owl.fish.washington.edu/halfshell/genomic-databank/Pgenerosa_v074.1.exon.gff) (655KB) - MD5: `a9f88dfc6fa3a2a6b43535341cd421f2` ```shell awk 'BEGIN { print "##gff-version 3" ; } $3 == "exon" {print}' \ Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \ > Pgenerosa_v074.1.exon.gff ``` - five prime UTR GFF: - [Pgenerosa_v074.1.five_prime_UTR.gff](http://owl.fish.washington.edu/halfshell/genomic-databank/Pgenerosa_v074.1.five_prime_UTR.gff) (54KB) - MD5: `f651827d5e6c9a235b78670af08f7b2c` ```shell awk 'BEGIN { print "##gff-version 3" ; } $3 == "five_prime_UTR" {print}' \ Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \ > Pgenerosa_v074.1.five_prime_UTR.gff ``` - gene GFF: - [Pgenerosa_v074.1.gene.gff](http://owl.fish.washington.edu/halfshell/genomic-databank/Pgenerosa_v074.1.gene.gff) (269KB) - MD5: `5b3af590a9b34c0aa87abef2442d9bde` ```shell awk 'BEGIN { print "##gff-version 3" ; } $3 == "gene" {print}' \ Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \ > Pgenerosa_v074.1.gene.gff ``` - mRNA GFF: - [Pgenerosa_v074.1.mRNA.gff](http://owl.fish.washington.edu/halfshell/genomic-databank/Pgenerosa_v074.1.mRNA.gff) (337KB) - MD5: `fd4889cfc5b1184f2c39054a4ead3f62` ```shell awk 'BEGIN { print "##gff-version 3" ; } $3 == "mRNA" {print}' \ Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \ > Pgenerosa_v074.1.mRNA.gff ``` - three prime UTR GFF: - [Pgenerosa_v074.1.three_prime_UTR.gff](http://owl.fish.washington.edu/halfshell/genomic-databank/Pgenerosa_v074.1.three_prime_UTR.gff) (37KB) - MD5: `3e9547af413d5e4230c4d556cac13d68` ```shell awk 'BEGIN { print "##gff-version 3" ; } $3 == "three_prime_UTR" {print}' \ Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \ > Pgenerosa_v074.1.three_prime_UTR.gff ``` With all of that out of the way, a cursory glance at the results are, honestly, quite shocking. A quick `grep -c ">"` on the FastA files reveals: - 825 proteins/transcripts Here's table with feature counts from each of the GFF files listed above: | Feature Count | Feature File | |---------------|--------------------------------------| | 4905 | Pgenerosa_v074.1.CDS.gff | | 5037 | Pgenerosa_v074.1.exon.gff | | 371 | Pgenerosa_v074.1.five_prime_UTR.gff | | 826 | Pgenerosa_v074.1.gene.gff | | 826 | Pgenerosa_v074.1.mRNA.gff | | 250 | Pgenerosa_v074.1.three_prime_UTR.gff | --- So, how does this compare to the [previous annotation of Pgenerosa_v074 from 20190709](https://robertslab.github.io/sams-notebook/posts/2019/2019-07-01-Genome-Annotation---Pgenerosa_v074-MAKER-on-Mox/)? 1. Pgenerosa_v074.1 has ~50% _fewer_ genes annotated than the Pgenerosa_v074 annotation. 2. Pgenerosa_v074.1 only exhibits annotations on PGA_scaffold_18. The Pgenerosa_v074 annotation had annotation present on PGA_scaffold_17 & PGA_scaffold_17 (no others). The Stringtie information clearly had an influence on gene predictions and the resulting annotations. Admittedly, I was expecting _more_ gene predictions, since the Stringtie info shows transcripts present on all the other scaffolds. Instead, we ended up with a more _stringent_ set of predicted genes! Will peruse all of this data a bit more and see what I can make of it...