--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: ORF Identification - L.staminea De Novo Transcriptome Assembly v1.0 Using Transdecoder on Mox date: '2023-06-17 15:27' tags: - Leukoma staminea - little neck clam - transdecoder - mox categories: - 2023 - Miscellaneous --- After [performing a _de novo_ transcriptome assembly with _L.staminea_ RNA-seq data](https://robertslab.github.io/sams-notebook/posts/2023/2023-06-16-Transcriptome-Assembly---De-Novo-L.staminea-Trimmed-RNAseq-Using-Trinity-on-Mox/), the [`Trinity`](https://github.com/trinityrnaseq/trinityrnaseq/wiki) assembly stats were quite a bit more "exaggerated" than normally expected. In an attempt to get a better sense of which contigs might be more useful candidates for downstream analysis, I decided to run the assembly through [Transdecoder](https://github.com/TransDecoder/TransDecoder/wiki) to identify open reading frames (ORFs). This was run on Mox. SLURM script (GitHub): - [20230617-lsta-transdecoder-transcriptome_v1.0.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20230617-lsta-transdecoder-transcriptome_v1.0.sh) ```bash #!/bin/bash ## Job Name #SBATCH --job-name=20230617-lsta-transdecoder-transcriptome_v1.0 ## Allocation Definition #SBATCH --account=srlab #SBATCH --partition=srlab ## Resources ## Nodes #SBATCH --nodes=1 ## Walltime (days-hours:minutes:seconds format) #SBATCH --time=05-00:00:00 ## Memory per node #SBATCH --mem=200G ##turn on e-mail notification #SBATCH --mail-type=ALL #SBATCH --mail-user=samwhite@uw.edu ## Specify the working directory for this job #SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20230617-lsta-transdecoder-transcriptome_v1.0 # Transdecoder to identify ORFs and "functional" contigs from 20230616 L.staminea # de novo transcriptome assembly v1.0. ################################################################################### # These variables need to be set by user ## Assign Variables threads=28 # Paths to input/output files trinity_fasta="/gscratch/scrubbed/samwhite/outputs/20230616-lsta-trinity-RNAseq/lsta-de_novo-transcriptome_v1.0.fasta" trinity_gene_trans_map="/gscratch/scrubbed/samwhite/outputs/20230616-lsta-trinity-RNAseq/lsta-de_novo-transcriptome_v1.0.fasta.gene_trans_map" blastp_out_dir="blastp_out" transdecoder_out_dir="${trinity_fasta##*/}.transdecoder_dir" pfam_out_dir="pfam_out" blastp_out="${blastp_out_dir}/blastp.outfmt6" pfam_out="${pfam_out_dir}/pfam.domtblout" lORFs_pep="${transdecoder_out_dir}/longest_orfs.pep" pfam_db="/gscratch/srlab/programs/Trinotate-v3.1.1/admin/Pfam-A.hmm" sp_db="/gscratch/srlab/programs/Trinotate-v3.1.1/admin/uniprot_sprot.pep" # Paths to programs blast_dir="/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin" blastp="${blast_dir}/blastp" hmmer_dir="/gscratch/srlab/programs/hmmer-3.2.1/src" hmmscan="${hmmer_dir}/hmmscan" transdecoder_dir="/gscratch/srlab/programs/TransDecoder-v5.5.0" transdecoder_lORFs="${transdecoder_dir}/TransDecoder.LongOrfs" transdecoder_predict="${transdecoder_dir}/TransDecoder.Predict" ################################################################################### # Load Python Mox module for Python module availability module load intel-python3_2017 # Make output directories mkdir --parents "${blastp_out_dir}" mkdir --parents "${pfam_out_dir}" # Extract long open reading frames ${transdecoder_lORFs} \ -t ${trinity_fasta} \ --gene_trans_map ${trinity_gene_trans_map} # Run blastp on long ORFs ${blastp} \ -query ${lORFs_pep} \ -db ${sp_db} \ -max_target_seqs 1 \ -outfmt 6 \ -evalue 1e-5 \ -num_threads ${threads} \ > ${blastp_out} # Run pfam search ${hmmscan} \ --cpu ${threads} \ --domtblout ${pfam_out} \ ${pfam_db} \ ${lORFs_pep} # Run Transdecoder with blastp and Pfam results ${transdecoder_predict} \ -t ${trinity_fasta} \ --retain_pfam_hits ${pfam_out} \ --retain_blastp_hits ${blastp_out} #################################################################### # Capture program options if [[ "${#programs_array[@]}" -gt 0 ]]; then echo "Logging program options..." for program in "${!programs_array[@]}" do { echo "Program options for ${program}: " echo "" # Handle samtools help menus if [[ "${program}" == "samtools_index" ]] \ || [[ "${program}" == "samtools_sort" ]] \ || [[ "${program}" == "samtools_view" ]] then ${programs_array[$program]} # Handle DIAMOND BLAST menu elif [[ "${program}" == "diamond" ]]; then ${programs_array[$program]} help # Handle NCBI BLASTx menu elif [[ "${program}" == "blastx" ]]; then ${programs_array[$program]} -help # Handle fastp menu elif [[ "${program}" == "fastp" ]]; then ${programs_array[$program]} --help fi ${programs_array[$program]} -h echo "" echo "" echo "----------------------------------------------" echo "" echo "" } &>> program_options.log || true # If MultiQC is in programs_array, copy the config file to this directory. if [[ "${program}" == "multiqc" ]]; then cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml fi done fi # Document programs in PATH (primarily for program version ID) { date echo "" echo "System PATH for $SLURM_JOB_ID" echo "" printf "%0.s-" {1..10} echo "${PATH}" | tr : \\n } >> system_path.log ``` --- # RESULTS Run time was nearly 20hrs. ![Screencap of L.staminea Transdecoder run time on Mox showing a run time of 19hrs, 12mins, 51secs](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20230617-lsta-transdecoder-transcriptome_v1.0-runtime.png?raw=true) Output folder: - [20230617-lsta-transdecoder-transcriptome_v1.0/](https://gannet.fish.washington.edu/Atumefaciens/20230617-lsta-transdecoder-transcriptome_v1.0/) #### BED (text) - [lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.bed](https://gannet.fish.washington.edu/Atumefaciens/20230617-lsta-transdecoder-transcriptome_v1.0/lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.bed) (30M) - MD5: `02497568c87e65279e83551ba8fb43ae` #### Coding Sequences (FastA) - [lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.cds](https://gannet.fish.washington.edu/Atumefaciens/20230617-lsta-transdecoder-transcriptome_v1.0/lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.cds) (98M) - MD5: `1517d724b18d0bd8759cbccc0487e460` #### GFF (text) - [lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.gff3](https://gannet.fish.washington.edu/Atumefaciens/20230617-lsta-transdecoder-transcriptome_v1.0/lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.gff3) (99M) - MD5: `901ab1c9ca1c0998714ff9cd602b2063` #### Peptide Sequences (FastA) - [lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.pep](https://gannet.fish.washington.edu/Atumefaciens/20230617-lsta-transdecoder-transcriptome_v1.0/lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.pep) (52M) - MD5: `53752bdb28bf2f89a04620bcee109a4f` #### Pfam output (text) - [pfam.domtblout](https://gannet.fish.washington.edu/Atumefaciens/20230617-lsta-transdecoder-transcriptome_v1.0/pfam_out/pfam.domtblout) (132M) - MD5: `18cd0fdff4993ed7f1d8f7d36f83e1d6` #### BLASTp output (format 6; text) - [blastp.outfmt6](https://gannet.fish.washington.edu/Atumefaciens/20230617-lsta-transdecoder-transcriptome_v1.0/blastp_out/blastp.outfmt6) (6.2M) - MD5: `c416fbe543555672083f1a63a4935ed0` When counting complete ORFs (`awk -F"\t" '$3=="gene"' lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.gff3 | grep -c "complete"`), the result is 28,451. This is a _significant_ reduction in potential genes compared to what [Trinity identified in the _do novo_ assembly from yesterday](https://robertslab.github.io/sams-notebook/posts/2023/2023-06-16-Transcriptome-Assembly---De-Novo-L.staminea-Trimmed-RNAseq-Using-Trinity-on-Mox/) (502,826 "genes"). Additionally, this is a much more realistic number of genes. Overall, ORF identification broke out like so: ``` 74533 CDS 74533 exon 35983 five_prime_UTR 74533 gene 74533 mRNA 43403 three_prime_UTR ``` These numbers include partial ORFs. Even including the partial OFRs, these counts are much more realistic compared to the Trinity _de novo_ assembly stats.