--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: TransDecoder - C.bairdi Transcriptomes v2.1 and v3.1 on Mox date: '2020-08-26 06:44' tags: - Chionoecetes bairdi - Tanner crab - transcriptome - transdecoder - mox categories: - 2020 - Tanner Crab RNAseq --- To continue annotation of our _C.bairdi_ v2.1 & v3.1 transcriptome assemblies, I needed to run [TransDecoder](https://github.com/TransDecoder/TransDecoder/wiki) before performing the more thorough annotation with [Trinotate](https://github.com/Trinotate/Trinotate.github.io/wiki). Info for each transcriptome version (library composition, assembly dates, BUSCO, etc) can be found in this table: - [cbai_transcriptome_comp](https://docs.google.com/spreadsheets/d/1XAgU_xQKJjWk4ThJHn1wLDtPuW6X7s6Jjh_373bMc0U/edit?usp=sharing) This was run on Mox. SBATCH script (GitHub): - [20200826_cbai_transdecoder_transcriptomes_v2.1_v.3.1.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20200826_cbai_transdecoder_transcriptomes_v2.1_v.3.1.sh) ```shell #!/bin/bash ## Job Name #SBATCH --job-name=cbai_transdecoder_transcriptomes_v2.1_v.3.1 ## Allocation Definition #SBATCH --account=coenv #SBATCH --partition=coenv ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=7-00:00:00 ## Memory per node #SBATCH --mem=120G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200826_cbai_transdecoder_transcriptomes_v2.1_v.3.1 # Script to run TransDecoder on C.bairdi transcriptomes: # v2.1, v3.1 ################################################################################### # These variables need to be set by user # Capture date. E.g. format is: 20190820 timestamp=$(date +"%Y%m%d") transcriptomes_dir=/gscratch/srlab/sam/data/C_bairdi/transcriptomes # Paths to program directories blast_dir="/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin" hmmer_dir="/gscratch/srlab/programs/hmmer-3.2.1/src" transdecoder_dir="/gscratch/srlab/programs/TransDecoder-v5.5.0" # Paths to Trinotate databases pfam_db="/gscratch/srlab/programs/Trinotate-v3.1.1/admin/Pfam-A.hmm" sp_db="/gscratch/srlab/programs/Trinotate-v3.1.1/admin/uniprot_sprot.pep" # Array of the various comparisons to evaluate # Each condition in each comparison should be separated by a "-" declare -A transcriptomes_gene_maps_array transcriptomes_gene_maps_array=( ["${transcriptomes_dir}/cbai_transcriptome_v2.1.fasta"]="${transcriptomes_dir}/cbai_transcriptome_v2.1.fasta.gene_trans_map" \ ["${transcriptomes_dir}/cbai_transcriptome_v3.1.fasta"]="${transcriptomes_dir}/cbai_transcriptome_v3.1.fasta.gene_trans_map" ) declare -A programs_array programs_array=( [blastp]="${blast_dir}/blastp" \ [hmmscan]="${hmmer_dir}/hmmscan" \ [transdecoder_lORFs]="${transdecoder_dir}/TransDecoder.LongOrfs" \ [transdecoder_predict]="${transdecoder_dir}/TransDecoder.Predict" ) threads=28 ################################################################################### # Exit script if a command fails set -e # Load Python Mox module for Python module availability module load intel-python3_2017 for transcriptome in "${!transcriptomes_gene_maps_array[@]}" do # Remove path from transcriptome using parameter substitution transcriptome_name="${transcriptome##*/}" # Set a prefix that utilizes timestamp and name of transcriptome prefix="${timestamp}_${transcriptome_name}" # Make output directory and change to directory mkdir --parents "${prefix}.transdecoder" && cd "$_" # Paths to input/output files blastp_out_dir="${prefix}.blastp_out" pfam_out_dir="${prefix}.pfam_out" # Make output directories mkdir "${blastp_out_dir}" mkdir "${pfam_out_dir}" blastp_out="${blastp_out_dir}/${prefix}.blastp.outfmt6" pfam_out="${pfam_out_dir}/${prefix}.pfam.domtblout" lORFs_pep="${transcriptome_name}.transdecoder_dir/longest_orfs.pep" # Extract long open reading frames "${programs_array[transdecoder_lORFs]}" \ --gene_trans_map "${transcriptomes_gene_maps_array[$transcriptome]}" \ -t "${transcriptome}" # Run blastp on long ORFs "${programs_array[blastp]}" \ -query "${lORFs_pep}" \ -db "${sp_db}" \ -max_target_seqs 1 \ -outfmt 6 \ -evalue 1e-5 \ -num_threads ${threads} \ > "${blastp_out}" # Run pfam search "${programs_array[hmmscan]}" \ --cpu ${threads} \ --domtblout "${pfam_out}" \ "${pfam_db}" \ "${lORFs_pep}" # Run Transdecoder with blastp and Pfam results "${programs_array[transdecoder_predict]}" \ -t "${transcriptome}" \ --retain_pfam_hits "${pfam_out}" \ --retain_blastp_hits "${blastp_out}" # Capture FastA MD5 checksum for future reference md5sum "${transcriptome}" > "${prefix}".checksum.md5 # Move back up to main directory cd .. done # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log # Capture program options for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" ${programs_array[$program]} -h echo "" echo "" echo "----------------------------------------------" echo "" echo "" } &>> program_options.log || true done ``` --- # RESULTS Took a bit over 3 days to process both transcriptomes: ![combined transdecoder runtimes for v2.1 and v3.1 transcriptomes](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200826_cbai_transdecoder_transcriptomes_v2.1_v.3.1_runtime.png?raw=true) Output folder: - [20200826_cbai_transdecoder_transcriptomes_v2.1_v.3.1/](https://gannet.fish.washington.edu/Atumefaciens/20200826_cbai_transdecoder_transcriptomes_v2.1_v.3.1/) ##### `cbai_transcriptome_v2.1` Coding Sequences (FastA): - [20200826_cbai_transcriptome_v2.1.fasta.transdecoder.cds](https://gannet.fish.washington.edu/Atumefaciens/20200826_cbai_transdecoder_transcriptomes_v2.1_v.3.1/20200826_cbai_transcriptome_v2.1.fasta.transdecoder/20200826_cbai_transcriptome_v2.1.fasta.transdecoder.cds) Peptide Sequences (FastA): - [20200826_cbai_transcriptome_v2.1.fasta.transdecoder.pep](https://gannet.fish.washington.edu/Atumefaciens/20200826_cbai_transdecoder_transcriptomes_v2.1_v.3.1/20200826_cbai_transcriptome_v2.1.fasta.transdecoder/20200826_cbai_transcriptome_v2.1.fasta.transdecoder.pep) BLASTp output (tab): - [20200826_cbai_transcriptome_v2.1.fasta.blastp.outfmt6](https://gannet.fish.washington.edu/Atumefaciens/20200826_cbai_transdecoder_transcriptomes_v2.1_v.3.1/20200826_cbai_transcriptome_v2.1.fasta.transdecoder/20200826_cbai_transcriptome_v2.1.fasta.blastp_out/20200826_cbai_transcriptome_v2.1.fasta.blastp.outfmt6) Pfam output: - [20200826_cbai_transcriptome_v2.1.fasta.pfam.domtblout](https://gannet.fish.washington.edu/Atumefaciens/20200826_cbai_transdecoder_transcriptomes_v2.1_v.3.1/20200826_cbai_transcriptome_v2.1.fasta.transdecoder/20200826_cbai_transcriptome_v2.1.fasta.pfam_out/20200826_cbai_transcriptome_v2.1.fasta.pfam.domtblout) ##### `cbai_transcriptome_v3.1` Coding Sequences (FastA): - [20200826_cbai_transcriptome_v3.1.fasta.transdecoder.cds](https://gannet.fish.washington.edu/Atumefaciens/20200826_cbai_transdecoder_transcriptomes_v2.1_v.3.1/20200826_cbai_transcriptome_v3.1.fasta.transdecoder/20200826_cbai_transcriptome_v3.1.fasta.transdecoder.cds) Peptide Sequences (FastA): - [20200826_cbai_transcriptome_v3.1.fasta.transdecoder.pep](https://gannet.fish.washington.edu/Atumefaciens/20200826_cbai_transdecoder_transcriptomes_v2.1_v.3.1/20200826_cbai_transcriptome_v3.1.fasta.transdecoder/20200826_cbai_transcriptome_v3.1.fasta.transdecoder.pep) BLASTp output (tab): - [20200826_cbai_transcriptome_v3.1.fasta.blastp.outfmt6](https://gannet.fish.washington.edu/Atumefaciens/20200826_cbai_transdecoder_transcriptomes_v2.1_v.3.1/20200826_cbai_transcriptome_v3.1.fasta.transdecoder/20200826_cbai_transcriptome_v3.1.fasta.blastp_out/20200826_cbai_transcriptome_v3.1.fasta.blastp.outfmt6) Pfam output: - [20200826_cbai_transcriptome_v3.1.fasta.pfam.domtblout](https://gannet.fish.washington.edu/Atumefaciens/20200826_cbai_transdecoder_transcriptomes_v2.1_v.3.1/20200826_cbai_transcriptome_v3.1.fasta.transdecoder/20200826_cbai_transcriptome_v3.1.fasta.pfam_out/20200826_cbai_transcriptome_v3.1.fasta.pfam.domtblout)