---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Genome Annotation - Pgenerosa_v074 MAKER on Mox
date: '2019-07-01 10:29'
tags:
  - Panopea generosa
  - geoduck
  - MAKER
  - mox
  - annotation
  - genome
  - v074
categories:
  - 2019
  - Geoduck Genome Sequencing
---
[I previously created a subset of the Pgenerosa_v070 genome assembly](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-25-Data-Wrangling---FastA-Subsetting-of-Pgenerosa_v070.fa-Using-samtools-faidx/) that contains just the largest 18 scaffolds (these scaffolds were produced by Phase Genomics, utilizing some Hi-C sequencing). The new subsetted genome is labeled as [Pgenerosa_v074.fa](http://owl.fish.washington.edu/halfshell/genomic-databank/Pgenerosa_v074.fa) (914MB).

As part of that, [Steven wanted this version annotated using MAKER](https://github.com/RobertsLab/resources/issues/707).

This will perform the following:

- one round of MAKER gene model predictions
- two rounds of SNAP gene model training/predictions
- renaming of gene models to NCBI-standardized convention
- functional characterization of protein models (via BLASTp)
- functional characterization of protein domains (via InterProScan5)

Here are a list of the input files used for the various components of the MAKER annotation:


#### Transcriptome FastA files (links to notebook entries):
- [ctenidia](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Ctenidia-with-HiSeq-and-NovaSeq-Data-on-Mox/)
- [gonad](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Gonad-HiSeq-and-NovaSeq-Data-on-Mox/)
- [heart](https://robertslab.github.io/sams-notebook/posts/2019/2019-02-15-Transcriptome-Assembly---Geoduck-Tissue-Specific-Assembly-Heart/)
- [EPI99 (larvae)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Larvae-Day5-EPI99-with-HiSeq-and-NovaSeq-Data-on-Mox/)
- [EPI115 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Super-Low-OA-EPI115-with-HiSeq-Data-on-Mox/)
- [EPI116 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Super-Low-OA-EPI116-with-HiSeq-Data-on-Mox/)
- [EPI123 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Ambient-OA-EPI123-with-HiSeq-Data-on-Mox/)
- [EPI124 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-04-09-Transcriptome-Assembly---Geoduck-Tissue-specific-Assembly-Juvenile-Ambient-OA-EPI124-with-HiSeq-and-NovaSeq-Data-on-Mox/)

#### NCBI Protein FastA files
- NCBI _Crassostrea gigas_ proteome (downloaded 20181119): `GCA_000297895.1_oyster_v9_protein.faa`

- NCBI _Crassostrea virginica_ proteome (downloaded 20181119): `GCF_002022765.2_C_virginica-3.0_protein.faa`

- SwissProt BLASTp database(downloaded 20190109): uniprot_sprot.fasta

#### TransDecoder protein FastA files (links to notebook entries)
- [singular proteome](https://robertslab.github.io/sams-notebook/posts/2018/2018-11-21-Annotation---Geoduck-Transcritpome-with-TransDecoder/)
- [ctenidia](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Ctenidia-with-Transdecoder-on-Mox/)
- [EPI99 (larvae)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Larvae-Day5-EPI99-with-Transdecoder-on-Mox/)
- [EPI115 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Juvenile-Super-Low-OA-EPI115-with-Transdecoder-on-Mox/)
- [EPI116 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Juvenile-Super-Low-OA-EPI116-with-Transdecoder-on-Mox/)
- [EPI123 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Juvenile-Ambient-OA-EPI123-with-Transdecoder-on-Mox/)
- [EPI124 (juvenile)](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Juvenile-Ambient-OA-EPI124-with-Transdecoder-on-Mox/)
- [gonad](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-27-Transcriptome-Annotation---Geoduck-Gonad-with-Transdecoder-on-Mox/)
- [heart](https://robertslab.github.io/sams-notebook/posts/2019/2019-03-18-Transcriptome-Annotation---Geoduck-Heart-with-Transdecoder-on-Mox/)

#### Repeats Files
- [RepeatModeler library](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-26-RepeatModeler---Pgenerosa_v074-for-MAKER-Annotation-on-Emu/)

- [RepeatMasker GFF](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-26-RepeatMasker---Pgenerosa_v074-for-Transposable-Element-ID-on-Roadrunner/)


SBATCH script (GitHub):

- [20190701_pgen_maker_v074_annotation.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20190701_pgen_maker_v074_annotation.sh)

```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=maker_pgen074
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=2
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=40-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190701_pgen_maker_v074_annotation

# Exit if any command fails
set -e

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Load Open MPI module for parallel, multi-node processing

module load icc_19-ompi_3.1.2

# SegFault fix?
export THREADS_DAEMON_MODEL=1

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log

# Add BLAST to system PATH
export PATH=$PATH:/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin
export BLASTDB=/gscratch/srlab/blastdbs/UniProtKB_20181008


## Establish variables for more readable code

wd=$(pwd)
maker_dir=/gscratch/srlab/programs/maker-2.31.10/bin
snap_dir=/gscratch/srlab/programs/maker-2.31.10/exe/snap

### Paths to Maker binaries

maker=${maker_dir}/maker
gff3_merge=${maker_dir}/gff3_merge
maker2zff=${maker_dir}/maker2zff
fathom=${snap_dir}/fathom
forge=${snap_dir}/forge
hmmassembler=${snap_dir}/hmm-assembler.pl
fasta_merge=${maker_dir}/fasta_merge
map_ids=${maker_dir}/maker_map_ids
map_gff_ids=${maker_dir}/map_gff_ids
map_fasta_ids=${maker_dir}/map_fasta_ids
functional_fasta=${maker_dir}/maker_functional_fasta
functional_gff=${maker_dir}/maker_functional_gff
ipr_update_gff=${maker_dir}/ipr_update_gff
iprscan2gff3=${maker_dir}/iprscan2gff3

blastp_dir=${wd}/blastp_annotation
maker_blastp=${wd}/blastp_annotation/blastp.outfmt6
maker_prot_fasta=${wd}/snap02/Pgenerosa_v074_snap02.all.maker.proteins.fasta
maker_prot_fasta_renamed=${wd}/snap02/Pgenerosa_v074_snap02.all.maker.proteins.renamed.fasta
maker_transcripts_fasta=${wd}/snap02/Pgenerosa_v074_snap02.all.maker.transcripts.fasta
maker_transcripts_fasta_renamed=${wd}/snap02/Pgenerosa_v074_snap02.all.maker.transcripts.renamed.fasta
snap02_gff=${wd}/snap02/Pgenerosa_v074_snap02.all.gff
snap02_gff_renamed=${wd}/snap02/Pgenerosa_v074_snap02.all.renamed.gff
put_func_gff=Pgenerosa_v074_genome_snap02.all.renamed.putative_function.gff
put_func_prot=Pgenerosa_v074_genome_snap02.all.maker.proteins.renamed.putative_function.fasta
put_func_trans=Pgenerosa_v074_genome_snap02.all.maker.transcripts.renamed.putative_function.fasta
put_domain_gff=Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff
ips_dir=${wd}/interproscan_annotation
ips_base=Pgenerosa_v074_maker_proteins_ips
ips_name=Pgenerosa_v074_maker_proteins_ips.tsv
id_map=${wd}/snap02/Pgenerosa_v074_genome.map
ips_domains=Pgenerosa_v074_genome_snap02.all.renamed.visible_ips_domains.gff

## Path to blastp
blastp=/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin/blastp

## Path to InterProScan5
interproscan=/gscratch/srlab/programs/interproscan-5.31-70.0/interproscan.sh

## Store path to options control file
maker_opts_file=./maker_opts.ctl

### Path to genome FastA file
genome=/gscratch/srlab/sam/data/P_generosa/genomes/Pgenerosa_v074.fa

### Paths to transcriptome FastA files
ctendia_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/ctenidia/Trinity.fasta
gonad_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/gonad/Trinity.fasta
heart_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/heart/Trinity.fasta
EPI99_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/larvae/EPI99/Trinity.fasta
EPI115_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI115/Trinity.fasta
EPI116_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI116/Trinity.fasta
EPI123_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI123/Trinity.fasta
EPI124_transcriptome=/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI124/Trinity.fasta

### Path to Crassotrea gigas NCBI protein FastA
gigas_proteome=/gscratch/srlab/sam/data/C_gigas/gigas_ncbi_protein/GCA_000297895.1_oyster_v9_protein.faa

### Path to Crassostrea virginica NCBI protein FastA
virginica_proteome=/gscratch/srlab/sam/data/C_virginica/virginica_ncbi_protein/GCF_002022765.2_C_virginica-3.0_protein.faa

### Path to Panopea generosa TransDecoder protein FastAs
panopea_td_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/20180827_trinity_geoduck.fasta.transdecoder.pep
pgen_td_ctenidia_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/ctenidia/Trinity.fasta.transdecoder.pep
pgen_td_larvae_EPI99_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/larvae/EPI99/Trinity.fasta.transdecoder.pep
pgen_td_juv_EPI115_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/juvenile/EPI115/Trinity.fasta.transdecoder.pep
pgen_td_juv_EPI116_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/juvenile/EPI116/Trinity.fasta.transdecoder.pep
pgen_td_juv_EPI123_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/juvenile/EPI123/Trinity.fasta.transdecoder.pep
pgen_td_juv_EPI124_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/juvenile/EPI124/Trinity.fasta.transdecoder.pep
pgen_td_gonad_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/gonad/Trinity.fasta.transdecoder.pep
pgen_td_heart_proteome=/gscratch/srlab/sam/data/P_generosa/proteomes/heart/Trinity.fasta.transdecoder.pep


### Path to P.generosa-specific RepeatModeler library
repeat_library=/gscratch/srlab/sam/data/P_generosa/repeats/Pgenerosa_v074-families.fa

### Path to P.generosa-specific RepeatMasker GFF
rm_gff=/gscratch/srlab/sam/data/P_generosa/repeats/Pgenerosa_v074.fa.out.gff

### Path to SwissProt database for BLASTp
sp_db_blastp=/gscratch/srlab/blastdbs/UniProtKB_20190109/uniprot_sprot.fasta


## Make directories
mkdir blastp_annotation
mkdir interproscan_annotation
mkdir snap01
mkdir snap02


## Create Maker control files needed for running Maker, only if it doesn't already exist and then edit it.
### Edit options file
### Set paths to P.generosa genome and transcriptome.
### Set path to combined C. gigas, C.virginica, P.generosa proteomes.
### The use of the % symbol sets the delimiter sed uses for arguments.
### Normally, the delimiter that most examples use is a slash "/".
### But, we need to expand the variables into a full path with slashes, which screws up sed.
### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%").
if [ ! -e maker_opts.ctl ]; then
  $maker -CTL
  sed -i "/^genome=/ s% %$genome %" "$maker_opts_file"

  # Set transcriptomes to use
  sed -i "/^est=/ s% %\
  ${ctendia_transcriptome},\
  ${gonad_transcriptome},\
  ${heart_transcriptome},\
  ${EPI99_transcriptome},\
  ${EPI115_transcriptome},\
  ${EPI116_transcriptome},\
  ${EPI123_transcriptome},\
  ${EPI124_transcriptome} %" \
  "$maker_opts_file"

  # Set proteomes to use
  sed -i "/^protein=/ s% %\
  ${gigas_proteome},\
  ${panopea_td_proteome},\
  ${pgen_td_ctenidia_proteome},\
  ${pgen_td_gonad_proteome},\
  ${pgen_td_heart_proteome},\
  ${pgen_td_juv_EPI115_proteome},\
  ${pgen_td_juv_EPI116_proteome},\
  ${pgen_td_juv_EPI123_proteome},\
  ${pgen_td_juv_EPI124_proteome},\
  ${pgen_td_larvae_EPI99_proteome},\
  ${virginica_proteome} \
  %" \
  "$maker_opts_file"

  # Set RepeatModeler library to use
  sed -i "/^rmlib=/ s% %$repeat_library %" "$maker_opts_file"

  # Set RepeatMasker GFF to use
  sed -i "/^rm_gff=/ s% %${rm_gff} %" "$maker_opts_file"

  # Set est2ggenome to 1 - tells MAKER to use transcriptome FastAs
  sed -i "/^est2genome=0/ s/est2genome=0/est2genome=1/" "$maker_opts_file"

  # Set protein2genome to 1 - tells MAKER to use protein FastAs
  sed -i "/^protein2genome=0/ s/protein2genome=0/protein2genome=1/" "$maker_opts_file"
fi


## Run Maker
### Specify number of nodes to use.
mpiexec -n 56 $maker

## Merge gffs
${gff3_merge} -d Pgenerosa_v074.maker.output/Pgenerosa_v074_master_datastore_index.log

## GFF with no FastA in footer
${gff3_merge} -n -s -d Pgenerosa_v074.maker.output/Pgenerosa_v074_master_datastore_index.log > Pgenerosa_v074.maker.all.noseqs.gff

## Merge all FastAs
${fasta_merge} -d Pgenerosa_v074.maker.output/Pgenerosa_v074_master_datastore_index.log

## Extract GFF alignments for use in subsequent MAKER rounds
### Transcript alignments
awk '{ if ($2 == "est2genome") print $0 }' Pgenerosa_v074.maker.all.noseqs.gff > Pgenerosa_v074.maker.all.noseqs.est2genome.gff
### Protein alignments
awk '{ if ($2 == "protein2genome") print $0 }' Pgenerosa_v074.maker.all.noseqs.gff > Pgenerosa_v074.maker.all.noseqs.protein2genome.gff
### Repeat alignments
awk '{ if ($2 ~ "repeat") print $0 }' Pgenerosa_v074.maker.all.noseqs.gff > Pgenerosa_v074.maker.all.noseqs.repeats.gff

## Run SNAP training, round 1
cd "${wd}"
cd snap01
${maker2zff} ../Pgenerosa_v074.all.gff
${fathom} -categorize 1000 genome.ann genome.dna
${fathom} -export 1000 -plus uni.ann uni.dna
${forge} export.ann export.dna
${hmmassembler} genome . > Pgenerosa_v074_snap01.hmm

## Initiate second Maker run.
### Copy initial maker control files and
### Default gene prediction settings are 0 (i.e. don't generate Maker gene predictions)
### - use GFF subsets generated in first round of MAKER
### - set location of snaphmm file to use for gene prediction
### Percent symbols used below are the sed delimiters, instead of the default "/",
### due to the need to use file paths.
if [ ! -e maker_opts.ctl ]; then
  $maker -CTL
  sed -i "/^genome=/ s% %$genome %" maker_opts.ctl

  # Set transcriptomes to use
  sed -i "/^est=/ s% %\
  ${ctendia_transcriptome},\
  ${gonad_transcriptome},\
  ${heart_transcriptome},\
  ${EPI99_transcriptome},\
  ${EPI115_transcriptome},\
  ${EPI116_transcriptome},\
  ${EPI123_transcriptome},\
  ${EPI124_transcriptome} %" \
  "$maker_opts_file"

  # Set proteomes to use
  sed -i "/^protein=/ s% %\
  ${gigas_proteome},\
  ${panopea_td_proteome},\
  ${pgen_td_ctenidia_proteome},\
  ${pgen_td_gonad_proteome},\
  ${pgen_td_heart_proteome},\
  ${pgen_td_juv_EPI115_proteome},\
  ${pgen_td_juv_EPI116_proteome},\
  ${pgen_td_juv_EPI123_proteome},\
  ${pgen_td_juv_EPI124_proteome},\
  ${pgen_td_larvae_EPI99_proteome},\
  ${virginica_proteome} \
  %" \
  "$maker_opts_file"

  # Set RepeatModeler library to use
  sed -i "/^rmlib=/ s% %$repeat_library %" "$maker_opts_file"

  sed -i "/^est_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.est2genome.gff %" maker_opts.ctl
  sed -i "/^protein_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.protein2genome.gff %" maker_opts.ctl
  sed -i "/^rm_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.repeats.gff %" maker_opts.ctl
  sed -i "/^snaphmm=/ s% %Pgenerosa_v074_snap01.hmm %" maker_opts.ctl
fi

## Run Maker
### Set basename of files and specify number of CPUs to use
mpiexec -n 56 $maker \
-base Pgenerosa_v074_snap01

## Merge gffs
${gff3_merge} -d Pgenerosa_v074_snap01.maker.output/Pgenerosa_v074_snap01_master_datastore_index.log

## GFF with no FastA in footer
${gff3_merge} -n -s -d Pgenerosa_v074_snap01.maker.output/Pgenerosa_v074_snap01_master_datastore_index.log > Pgenerosa_v074_snap01.maker.all.noseqs.gff

## Run SNAP training, round 2
cd "${wd}"
cd snap02
${maker2zff} ../snap01/Pgenerosa_v074_snap01.all.gff
${fathom} -categorize 1000 genome.ann genome.dna
${fathom} -export 1000 -plus uni.ann uni.dna
${forge} export.ann export.dna
${hmmassembler} genome . > Pgenerosa_v074_snap02.hmm

## Initiate third and final Maker run.

if [ ! -e maker_opts.ctl ]; then
  $maker -CTL
  sed -i "/^genome=/ s% %$genome %" maker_opts.ctl

  # Set transcriptomes to use
  sed -i "/^est=/ s% %\
  ${ctendia_transcriptome},\
  ${gonad_transcriptome},\
  ${heart_transcriptome},\
  ${EPI99_transcriptome},\
  ${EPI115_transcriptome},\
  ${EPI116_transcriptome},\
  ${EPI123_transcriptome},\
  ${EPI124_transcriptome} %" \
  "$maker_opts_file"

  # Set proteomes to use
  sed -i "/^protein=/ s% %\
  ${gigas_proteome},\
  ${panopea_td_proteome},\
  ${pgen_td_ctenidia_proteome},\
  ${pgen_td_gonad_proteome},\
  ${pgen_td_heart_proteome},\
  ${pgen_td_juv_EPI115_proteome},\
  ${pgen_td_juv_EPI116_proteome},\
  ${pgen_td_juv_EPI123_proteome},\
  ${pgen_td_juv_EPI124_proteome},\
  ${pgen_td_larvae_EPI99_proteome},\
  ${virginica_proteome} \
  %" \
  "$maker_opts_file"

  # Set RepeatModeler library to use
  sed -i "/^rmlib=/ s% %$repeat_library %" "$maker_opts_file"

  sed -i "/^est_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.est2genome.gff %" maker_opts.ctl
  sed -i "/^protein_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.protein2genome.gff %" maker_opts.ctl
  sed -i "/^rm_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.repeats.gff %" maker_opts.ctl
  sed -i "/^snaphmm=/ s% %Pgenerosa_v074_snap02.hmm %" maker_opts.ctl
fi

## Run Maker
### Set basename of files and specify number of CPUs to use
mpiexec -n 56 $maker \
-base Pgenerosa_v074_snap02

## Merge gffs
${gff3_merge} \
-d Pgenerosa_v074_snap02.maker.output/Pgenerosa_v074_snap02_master_datastore_index.log

## GFF with no FastA in footer
${gff3_merge} -n -s -d Pgenerosa_v074_snap02.maker.output/Pgenerosa_v074_snap02_master_datastore_index.log > Pgenerosa_v074_snap02.maker.all.noseqs.gff

## Merge FastAs
${fasta_merge} \
-d Pgenerosa_v074_snap02.maker.output/Pgenerosa_v074_snap02_master_datastore_index.log

# Create copies of files for mapping
cp "${maker_prot_fasta}" "${maker_prot_fasta_renamed}"
cp "${maker_transcripts_fasta}" "${maker_transcripts_fasta_renamed}"
cp "${snap02_gff}" "${snap02_gff_renamed}"

# Map IDs
## Change gene names
${map_ids} \
--prefix PGEN_ \
--justify 8 \
"${snap02_gff}" \
> "${id_map}"

## Map GFF IDs
${map_gff_ids} \
"${id_map}" \
"${snap02_gff_renamed}"

## Map FastAs
### Proteins
${map_fasta_ids} \
"${id_map}" \
"${maker_prot_fasta_renamed}"

### Transcripts
${map_fasta_ids} \
"${id_map}" \
"${maker_transcripts_fasta_renamed}"

# Run InterProScan 5
## disable-precalc since this requires external database access (which Mox does not allow)
cd "${ips_dir}"

${interproscan} \
--input "${maker_prot_fasta_renamed}" \
--goterms \
--output-file-base ${ips_base} \
--disable-precalc

# Run BLASTp
cd "${blastp_dir}"

${blastp} \
-query "${maker_prot_fasta_renamed}" \
-db ${sp_db_blastp} \
-out "${maker_blastp}" \
-max_target_seqs 1 \
-evalue 1e-6 \
-outfmt 6 \
-num_threads 28


# Functional annotations

cd "${wd}"

## Add putative gene functions
### GFF
${functional_gff} \
${sp_db_blastp} \
"${maker_blastp}" \
"${snap02_gff_renamed}" \
> ${put_func_gff}

### Proteins
${functional_fasta} \
${sp_db_blastp} \
"${maker_blastp}" \
"${maker_prot_fasta_renamed}" \
> ${put_func_prot}

### Transcripts
${functional_fasta} \
${sp_db_blastp} \
"${maker_blastp}" \
"${maker_transcripts_fasta_renamed}" \
> ${put_func_trans}

## Add InterProScan domain info
### Add searchable tags
${ipr_update_gff} \
${put_func_gff} \
"${ips_dir}"/${ips_name} \
> ${put_domain_gff}

### Add viewable features for genome browsers (JBrowse, Gbrowse, Web Apollo)
${iprscan2gff3} \
"${ips_dir}"/${ips_name} \
"${snap02_gff_renamed}" \
> ${ips_domains}
```

---

# RESULTS

Output folder:

Well, this ran relatively quickly: A little over seven days.

![Pgenerosa_v074 MAKER runtime screencap](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190702_maker_pgen074_runtime.png?raw=true)

- [20190701_pgen_maker_v074_annotation/](https://gannet.fish.washington.edu/Atumefaciens/20190701_pgen_maker_v074_annotation/) (519MB)

The important files:

- [Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff](https://gannet.fish.washington.edu/Atumefaciens/20190701_pgen_maker_v074_annotation/Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff)
  - GFF file with all contigs annotated with putative functions and functional domains.
  - _INCLUDES SEQUENCE FASTAS AT END OF FILE!_
  - Generated with one round of MAKER gene prediction, followed by two rounds of SNAP _ab-initio_ gene prediction.
  - MD5: 5a17f8dd40d534a544bec166c1b0c8be


- [Pgenerosa_v074_genome_snap02.all.maker.proteins.renamed.putative_function.fasta](https://gannet.fish.washington.edu/Atumefaciens/20190701_pgen_maker_v074_annotation/Pgenerosa_v074_genome_snap02.all.maker.proteins.renamed.putative_function.fasta) (902KB)
  - Annotated proteins FastA file.
  - Generated with one round of MAKER gene prediction, followed by two rounds of SNAP _ab-initio_ gene prediction.


- [Pgenerosa_v074_genome_snap02.all.maker.transcripts.renamed.putative_function.fasta](https://gannet.fish.washington.edu/Atumefaciens/20190701_pgen_maker_v074_annotation/Pgenerosa_v074_genome_snap02.all.maker.transcripts.renamed.putative_function.fasta) (2.5MB)
  - Annotated transcripts FastA file.
  - Generated with one round of MAKER gene prediction, followed by two rounds of SNAP _ab-initio_ gene prediction.

I should've just split the GFF as part of the Mox SBATCH script, but I didn't so I did it locally on my computer. Here are the commands for splitting the GFF. All the GFF files have been addeed to our [Genomic Resources wiki](https://github.com/RobertsLab/resources/wiki/Genomic-Resources) (GitHub).

- CDS GFF:

  - [Pgenerosa_v074.CDS.gff](http://owl.fish.washington.edu/halfshell/genomic-databank/Pgenerosa_v074.CDS.gff) (1.3MB)
  - MD5: `0585919b6d819efee8c86fb3bdc104b6`

```shell
awk 'BEGIN { print "##gff-version 3" ; } $3 == "CDS" {print}' \
Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \
> Pgenerosa_v074.CDS.gff
```


- exon GFF:

  - [Pgenerosa_v074.exon.gff](http://owl.fish.washington.edu/halfshell/genomic-databank/Pgenerosa_v074.exon.gff) (1.4MB)
  - MD5: `92c1419894fb252d7c913313c725f3d5`

```shell
awk 'BEGIN { print "##gff-version 3" ; } $3 == "exon" {print}' \
Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \
> Pgenerosa_v074.exon.gff
```

- gene GFF:

  - [Pgenerosa_v074.gene.gff](http://owl.fish.washington.edu/halfshell/genomic-databank/Pgenerosa_v074.gene.gff)
  - MD5: `8bf8bb60ce5fc42f2dfad2b5ae236abb`

```shell
awk 'BEGIN { print "##gff-version 3" ; } $3 == "gene" {print}' \
Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \
> Pgenerosa_v074.gene.gff
```

- mRNA GFF:

  - [Pgenerosa_v074.mRNA.gff](http://owl.fish.washington.edu/halfshell/genomic-databank/Pgenerosa_v074.mRNA.gff)
  - MD5: `d1a6aba21b60f109510ee2399e03fb2c`

```shell
awk 'BEGIN { print "##gff-version 3" ; } $3 == "mRNA" {print}' \
Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \
> Pgenerosa_v074.mRNA.gff
```
With all of that out of the way, a cursory glance at the results are, honestly, quite shocking. A quick `grep -c ">"` on the FastA files reveals:

- 1712 proteins/transcripts

This is compared to 53,035 proteins/transcripts identified/annotated in [the Pgenerosa_v070 annotation](https://robertslab.github.io/sams-notebook/posts/2019/2019-02-28-Genome-Annotation---Pgenerosa_v070-MAKER-on-Mox/transcripts in the [Pgenerosa_v071 annotation](https://robertslab.github.io/sams-notebook/2019/02/13/Genome-Annotation-Pgenerosa_v71-with-MAKER-on-Mox/) (only contigs >10kbp).

Additionally, annotations only occur on 2 out of 18 scaffolds: 
- PGA_scaffold_17
- PGA_scaffold_18.

Surprisingly, these two scaffolds comprise two of the three shortest (35Mbp and 28Mbp) of the 18 scaffolds subjected to annotation.

So, what does this mean? A few thoughts/conjectures in no particular order or prioritization:

1. Limiting to the longest 18 scaffolds [chosen to create this Pgenerosa_v074 subset](https://robertslab.github.io/sams-notebook/posts/2019/2019-06-25-Data-Wrangling---FastA-Subsetting-of-Pgenerosa_v070.fa-Using-samtools-faidx/) was the wrong approach. Looking back at the [GitHub issue where scaffold selection was discussed](https://github.com/RobertsLab/resources/issues/705) reveals the apparent confusion with which assembly provided by Phase Genomics is the proper assembly to use. However, that discussion seems to agree that the "manually curated" version of their genome assembly (which is what was used here) is the proper one to use.

2. Something went awry in the annotation. This will be difficult/impossible to assess, particularly since two of the scaffolds were successfully annotated. If two worked, why wouldn't the other 16? All 18 scaffolds were presented to MAKER as a single multi-FastA file. Presumably, each individual FastA would be processed by MAKER in the same way.

3. The data is real and the majority of genes in the _P.generosa_ genome lie outside of these large scaffolds. If this is the case, this means that there are _lengthy_ stretches of DNA that do not have readily-identifiable genomic features. It also suggests that, overall, the assembly we have (Pgenerosa_v070) is quite fragmented.


So, to assess some of this (particularly points 2 and 3 above), I decided to look at the three annotations we have for _P.generosa_ using IGV on some of the 18 scaffolds that are present in each of the three previous assembly variations we've annotated with MAKER.

#### PGA_scaffold_1

No annotations found by any of the three different annotations. Surprising, as this is the single longest scaffold in each of the three assemblies (89 Mbp)

![IGV comparison of PGA_scaffold_1](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190701_igv_compare_scaffold1.png?raw=true)

---

#### PGA_scaffold_3

Annotations only appear in the Pgenerosa_v070 assembly.

![IGV comparison of PGA_scaffold_3](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190701_igv_compare_scaffold3.png?raw=true)

---

#### PGA_scaffold_18

Annotations are found in each of the three assemblies.

![IGV comparison of PGA_scaffold_18](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20190701_igv_compare_scaffold18.png?raw=true)

---

Yeesh, so what do those IGV alignments tell us?

1. The single longest scaffold appears to have no identifiable genomic features, as no annotations are present in any of the various assemblies. That's a pretty lengthy stretch of DNA to not have anything there. Caveat: haven't looked at transposable elements, nor other repeats. Is it possible that an 89Mbp stretch of DNA could be comprised solely of low-complexity sequence?

2. The annotation process appears to be inconsistent. This is troublesome. Although this comes with a catch: I did not use the exact same data for each of the three annotations. For example, the Pgenerosa_v070 and Pgenerosa_v071 annotations used the singular _P.generosa_ transcriptome assembly ([from 20180827](https://robertslab.github.io/sams-notebook/posts/2018/2018-09-04-transcriptome-assembly-geoduck-rnaseq-data/))for EST evidence, whereas the Pgenerosa_v074 annotation utilized tissue-specific transcriptome assemblies (from 20190409) and Transdecoder peptide data from each of these assemblies. Could these changes account for the differences we see? Oddly, and this thought is solely based on the limited IGV comparisons posted above, it seems like the fewer the overall number of sequences in an assembly, the fewer annotations MAKER is able to generate within any given scaffold.

Now what?

Well, I have the Pgenerosa_v074 assembly currently running on [the Genome Sequence Annotation Server (GenSAS)](https://www.gensas.org) using most of the same files that I used for MAKER on Mox (exceptions are repeat masking/modeling files - those steps will be run on GenSAS). It will be interesting to see how the MAKER and GenSAS annotations compare.

I could also re-run the Pgenerosa_v071 annotation using the exact same files as Pgenerosa_v074 on GenSAS, as the files I need are already on their servers and it will be trivial to initiate the annotation process...