---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Annotation - Geoduck Transcritpome with TransDecoder
date: '2018-11-21'
tags:
  - TransDecoder
  - geoduck
  - Panopea generosa
  - annotation
  - transcriptome
  - mox
categories:
  - 2018
  - Miscellaneous
---
[I was tasked with generating some qPCR primers to analyze vitellogenin expression in geoduck](https://github.com/RobertsLab/resources/issues/491). In order to do so, I needed to annotate a geoduck transcriptome in order to identify potential vitellogenin genes. I had [previously assembled a geoduck transcriptome](https://robertslab.github.io//sams-notebook/2018/09/04/transcriptome-assembly-geoduck-rnaseq-data.html). For annotation, I used [TransDecoder (v5.5.0)](https://github.com/TransDecoder/TransDecoder/wiki). The annotation was run on our Mox HPC node.


Mox SBATCH script:

- [20181121_geo_transdecoder.sh](https://gannet.fish.washington.edu/Atumefaciens/20181121_geo_transdecoder/20181121_geo_transdecoder.sh)

<pre><code>
#!/bin/bash
## Job Name
#SBATCH --job-name=transcoder
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20181121_geo_transdecoder

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log


# Make blast database available to blast
export BLASTDB=/gscratch/srlab/blastdbs/UniProtKB_20181008

## Establish variables for more readable code
### Transdecoder programs
td_longorfs="/gscratch/srlab/programs/TransDecoder-v5.5.0/TransDecoder.LongOrfs"

td_predict="/gscratch/srlab/programs/TransDecoder-v5.5.0/TransDecoder.Predict"

### BLASTp
blastp="/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin/blastp"

### HMMscan
hmmscan="/gscratch/srlab/programs/hmmer-3.2.1/src/hmmscan"
### Transcriptome
geo_trinity_loc="/gscratch/srlab/sam/data/P_generosa/generosa_transcriptomes/20180827_trinity_geoduck.fasta"

geo_trinity="20180827_trinity_geoduck.fasta"

### UniProt database
uniprot="/gscratch/srlab/blastdbs/UniProtKB_20181008/20181008_uniprot_sprot.fasta"

### Pfam databases
pfam="/gscratch/srlab/sam/data/databases/pfam_db/Pfam31.0/Pfam-A.hmm"

####################
# Run transdecoder longorfs
${td_longorfs} -t ${geo_trinity_loc}

# Run blastp on UniProt database
${blastp} \
-query ${geo_trinity}.transdecoder_dir/longest_orfs.pep \
-db ${uniprot} \
-max_target_seqs 1 \
-outfmt 6 \
-evalue 1e-5 \
-num_threads 28 \
> blastp.outfmt6 \
2> blastp.err

# Run HMMscan on Pfam databases
${hmmscan} \
--cpu 28 \
--domtblout pfam.domtblout \
${pfam} \
${geo_trinity}.transdecoder_dir/longest_orfs.pep \
2> hmmscan.err

# Run transdecoder predict, using info from blatp and hmmscan
${td_predict} \
-t ${geo_trinity_loc} \
--retain_pfam_hits pfam.domtblout \
--retain_blastp_hits blastp.outfmt6
</code></pre>

List of input files and databases used:

- [20180827_trinity_geoduck.fasta (FastA 972MB)](https://owl.fish.washington.edu/Athaliana/20180827_trinity_geoduck_RNAseq/Trinity.fasta)

- 20181008_uniprot_sprot.fasta: Downloaded 20181008.

- Pfam31.0/Pfam-A.hmm

Here's a quick rundown of the TransDecoder process:

1. Identify longest open reading frames (ORFs)

2. BLASTp ORFs against UniProt database to identify protein matches.

3. HMMscan (Hidden Markov Model) against Pfam database to identify protein families.

4. TransDecoder predicts final coding sequences (CDS) using BLASTp and HMMscan info to provide additional support functional CDS identification.

---
# RESULTS
Data was copied to Gannet via ```rsync```

Output directory:

- [20181121_geo_transdecoder/](https://gannet.fish.washington.edu/Atumefaciens/20181121_geo_transdecoder/)

CDS FastA (271MB):

- [20181121_geo_transdecoder/20180827_trinity_geoduck.fasta.transdecoder.cds](https://gannet.fish.washington.edu/Atumefaciens/20181121_geo_transdecoder/20180827_trinity_geoduck.fasta.transdecoder.cds)

BED file (81MB):
- [20181121_geo_transdecoder/20180827_trinity_geoduck.fasta.transdecoder.bed](https://gannet.fish.washington.edu/Atumefaciens/20181121_geo_transdecoder/20180827_trinity_geoduck.fasta.transdecoder.bed)

GFF3 file (277MB):

- [20181121_geo_transdecoder/20180827_trinity_geoduck.fasta.transdecoder.gff3](https://gannet.fish.washington.edu/Atumefaciens/20181121_geo_transdecoder/20180827_trinity_geoduck.fasta.transdecoder.gff3)

Alrighty, now we have an annotated transcriptome that I can use for finding vitellogenin transcripts and designing some primers!