---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Transcriptome Annotation - Geoduck Ctenidia with Trinotate on Mox
date: '2019-03-18 15:40'
tags:
  - geoduck
  - Panopea generosa
  - trinotate
  - trinity
  - transcriptome
  - mox
categories:
  - 2019
  - Miscellaneous
---
I ran [Trinotate](https://github.com/Trinotate/Trinotate.github.io/wiki) (v3.1.1) on Mox to annotate the _P.generosa_ ctenidia transcriptome I previously assembled with the HiSeq data from Illumina.

Trinity:

- [20190215 Ctenidia Trinity](https://robertslab.github.io/sams-notebook/posts/2019/2019-02-15-Transcriptome-Assembly---Geoduck-Tissue-Specific-Assembly-Ctenidia/)

Transdecoder:

- [20190318 Ctenidia Transdecoder](https://robertslab.github.io/sams-notebook/posts/2019/2019-03-18-Transcriptome-Annotation---Geoduck-Ctenidia-with-Transdecoder-on-Mox/)

BLASTx:

- [20190318 Ctenidia BLASTx](https://robertslab.github.io/sams-notebook/posts/2019/2019-03-18-Transcriptome-Annotation---Geoduck-Ctenidia-with-BLASTx-on-Mox/)


SBATCH script (GitHub):

- [20190318_trinotate_geoduck_ctenidia_RNAseq.sh](https://raw.githubusercontent.com/RobertsLab/sams-notebook/master/sbatch_scripts/20190318_trinotate_geoduck_ctenidia_RNAseq.sh)

<pre><code>
#!/bin/bash
## Job Name
#SBATCH --job-name=tate_ctenidia
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=25-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190318_trinotate_geoduck_ctenidia_RNAseq

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log


wd="$(pwd)"


# Paths to input/output files
## Non-working directory locations
blastp_out_dir="/gscratch/scrubbed/samwhite/outputs/20190318_transdecoder_geoduck_ctenidia_RNAseq/blastp_out"
blastx_out_dir="/gscratch/scrubbed/samwhite/outputs/20190318_blastx_geoduck_ctenidia_RNAseq"
pfam_out_dir="/gscratch/scrubbed/samwhite/outputs/20190318_transdecoder_geoduck_ctenidia_RNAseq/pfam_out"
trinity_out_dir="/gscratch/scrubbed/samwhite/outputs/20190215_trinity_geoduck_ctenidia_RNAseq/trinity_out_dir"
transdecoder_out_dir="/gscratch/scrubbed/samwhite/outputs/20190318_transdecoder_geoduck_ctenidia_RNAseq/Trinity.fasta.transdecoder_dir"


## New folders for working directory
rnammer_out_dir="${wd}/RNAmmer_out"
signalp_out_dir="${wd}/signalp_out"
tmhmm_out_dir="${wd}/tmhmm_out"


blastp_out="${blastp_out_dir}/blastp.outfmt6"
blastx_out="${blastx_out_dir}/blastx.outfmt6"
pfam_out="${pfam_out_dir}/pfam.domtblout"
lORFs_pep="${transdecoder_out_dir}/longest_orfs.pep"
rnammer_out="${rnammer_out_dir}/Trinity.fasta.rnammer.gff"
signalp_out="${signalp_out_dir}/signalp.out"
tmhmm_out="${tmhmm_out_dir}/tmhmm.out"
trinity_fasta="${trinity_out_dir}/Trinity.fasta"
trinity_gene_map="${trinity_out_dir}/Trinity.fasta.gene_trans_map"
trinotate_report="${wd}/trinotate_annotation_report.txt"


# Paths to programs
rnammer_dir="/gscratch/srlab/programs/RNAMMER-1.2"
rnammer="${rnammer_dir}/rnammer"
signalp_dir="/gscratch/srlab/programs/signalp-4.1"
signalp="${signalp_dir}/signalp"
tmhmm_dir="/gscratch/srlab/programs/tmhmm-2.0c/bin"
tmhmm="${tmhmm_dir}/tmhmm"
trinotate_dir="/gscratch/srlab/programs/Trinotate-v3.1.1"
trinotate="${trinotate_dir}/Trinotate"
trinotate_rnammer="${trinotate_dir}/util/rnammer_support/RnammerTranscriptome.pl"
pfam_db="${trinotate_dir}/admin/Pfam-A.hmm"
sp_db="${trinotate_dir}/admin/uniprot_sprot.pep"
trinotate_sqlite_db="Trinotate.sqlite"

# Make output directories
mkdir ${rnammer_out_dir} ${signalp_out_dir} ${tmhmm_out_dir}

# Copy sqlite database template

cp ${trinotate_dir}/admin/Trinotate.sqlite .

# Run signalp
${signalp} \
-f short \
-n ${signalp_out} \
${lORFs_pep}

# Run tmHMM
${tmhmm} \
--short \
< ${lORFs_pep} \
> ${tmhmm_out}

# Run RNAmmer
cd ${rnammer_out_dir}
${trinotate_rnammer} \
--transcriptome ${trinity_fasta} \
--path_to_rnammer ${rnammer}
cd ${wd}

# Run Trinotate
## Load transcripts and coding regions into database
${trinotate} \
${trinotate_sqlite_db} \
init \
--gene_trans_map ${trinity_gene_map} \
--transcript_fasta ${trinity_fasta} \
--transdecoder_pep ${lORFs_pep}

## Load BLAST homologies
${trinotate} \
${trinotate_sqlite_db} \
LOAD_swissprot_blastp \
${blastp_out}

${trinotate} \
${trinotate_sqlite_db} \
LOAD_swissprot_blastx \
${blastx_out}

## Load Pfam
${trinotate} \
${trinotate_sqlite_db} \
LOAD_pfam \
${pfam_out}

## Load transmembrane domains
${trinotate} \
${trinotate_sqlite_db} \
LOAD_tmhmm \
${tmhmm_out}

## Load signal peptides
${trinotate} \
${trinotate_sqlite_db} \
LOAD_signalp \
${signalp_out}

## Load RNAmmer
${trinotate} \
${trinotate_sqlite_db} \
LOAD_rnammer \
${rnammer_out}

## Creat annotation report
${trinotate} \
${trinotate_sqlite_db} \
report \
> ${trinotate_report}
</code></pre>

---

# RESULTS

Output folder:

- [20190318_trinotate_geoduck_ctenidia_RNAseq/](http://gannet.fish.washington.edu/Atumefaciens/20190318_trinotate_geoduck_ctenidia_RNAseq/)

Trinotate Annotation Table (text):

- [20190318_trinotate_geoduck_ctenidia_RNAseq/trinotate_annotation_report.txt](http://gannet.fish.washington.edu/Atumefaciens/20190318_trinotate_geoduck_ctenidia_RNAseq/trinotate_annotation_report.txt)

Trinotate SQLlite Database:

- [20190318_trinotate_geoduck_ctenidia_RNAseq/Trinotate.sqlite](http://gannet.fish.washington.edu/Atumefaciens/20190318_trinotate_geoduck_ctenidia_RNAseq/Trinotate.sqlite)

Info on report output, from the [Trinotate wiki](https://github.com/Trinotate/Trinotate.github.io/wiki/Loading-generated-results-into-a-Trinotate-SQLite-Database-and-Looking-the-Output-Annotation-Report):

<pre><code>
The output has the following column headers:
0       #gene_id
1       transcript_id
2       sprot_Top_BLASTX_hit
3       RNAMMER
4       prot_id
5       prot_coords
6       sprot_Top_BLASTP_hit
7       custom_pombe_pep_BLASTX
8       custom_pombe_pep_BLASTP
9       Pfam
10      SignalP
11      TmHMM
12      eggnog
13      Kegg
14      gene_ontology_blast
15      gene_ontology_pfam
16      transcript
17      peptide

and the data are formatted like so:

0       TRINITY_DN179_c0_g1
1       TRINITY_DN179_c0_g1_i1
2       GCS1_SCHPO^GCS1_SCHPO^Q:53-2476,H:1-808^100%ID^E:0^RecName: Full=Probable mannosyl-oligosaccharide glucosidase;^Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina; Schizosaccharomycetes; Schizosaccharomycetales; Schizosaccharomycetaceae; Schizosaccharomyces
3       .
4       TRINITY_DN179_c0_g1_i1|m.1
5       2-2479[+]
6       GCS1_SCHPO^GCS1_SCHPO^Q:18-825,H:1-808^100%ID^E:0^RecName: Full=Probable mannosyl-oligosaccharide glucosidase;^Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina; Schizosaccharomycetes; Schizosaccharomycetales; Schizosaccharomycetaceae; Schizosaccharomyces
7       SPAC6G10_09_SPAC6G10_09_I_alpha_glucosidase_I_Gls1_predicte^SPAC6G10_09_SPAC6G10_09_I_alpha_glucosidase_I_Gls1_predicte^Q:53-2476,H:1-808^100%ID^E:0^.^.
8       SPAC6G10_09_SPAC6G10_09_I_alpha_glucosidase_I_Gls1_predicte^SPAC6G10_09_SPAC6G10_09_I_alpha_glucosidase_I_Gls1_predicte^Q:18-825,H:1-808^100%ID^E:0^.^.
9       PF16923.2^Glyco_hydro_63N^Glycosyl hydrolase family 63 N-terminal domain^58-275^E:6.9e-60`PF03200.13^Glyco_hydro_63^Glycosyl hydrolase family 63 C-terminal domain^315-823^E:5.1e-187
10      .
11      .
12      .
13      KEGG:spo:SPAC6G10.09`KO:K01228
14      GO:0005783^cellular_component^endoplasmic reticulum`GO:0005789^cellular_component^endoplasmic reticulum membrane`GO:0016021^cellular_component^integral component of membrane`GO:0004573^molecular_function^mannosyl-oligosaccharide glucosidase activity`GO:0009272^biological_process^fungal-type cell wall biogenesis`GO:0009311^biological_process^oligosaccharide metabolic process`GO:0006487^biological_process^protein N-linked glycosylation
15      .
16      .
17      .

Backticks  and carets (^) are used as delimiters for data packed within an individual field, such as separating E-values, percent identity, and taxonomic info for best matches. When there are multiple assignments in a given field, the assignments are separated by (`) and the fields within an assignment are separated by (^).
</code></pre>