---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Transcriptome Comparison - C.bairdi Transcriptomes Evaluations with DETONATE on Mox
date: '2020-06-01 11:45'
tags:
  - Tanner crab
  - Chionoecetes bairdi
  - mox
  - DETONATE
categories:
  - 2020
  - Miscellaneous
---
Attempting to get some other metrics regarding our various _C.bairdi_ transcriptome assemblies other than BUSCO scores, I decided to try running [DETONATE](http://deweylab.biostat.wisc.edu/detonate/), as this is a recommended tool by [`Trinity`](https://github.com/trinityrnaseq/trinityrnaseq/wiki).

I recently ran DETONATE's `ref-eval` ([see 20200529](https://robertslab.github.io/sams-notebook/posts/2020/2020-05-29-Transcriptome-Comparison---C.bairdi-Transcriptomes-Compared-with-DETONATE-on-Mox/)) because it was relatively easy to do and thought it might be useful, as it's included in the [DETONATE](http://deweylab.biostat.wisc.edu/detonate/) package. However, the results are complicated to interpret and I'm not sure they actually tell us anything.

Continuing with the [DETONATE](http://deweylab.biostat.wisc.edu/detonate/) package, I ran the other component (which is probably what we really want and will provide us with a simple "score" which will be easier to understand), `rsem-eval`.  Plus, it's a very low effort process, so might as well give it a whirl.

The job was run on Mox.

SBATCH script (GitHub):

- [20200616_cbai_detonate_transcriptome_evaluations.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20200616_cbai_detonate_transcriptome_evaluations.sh)

```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_detonate_transcriptome_evaluations
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=15-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200616_cbai_detonate_transcriptome_evaluations


###################################################################################
# These variables need to be set by user

# Assign Variables
## frag_size is guesstimate of library fragment sizes
frag_size=500
reads_dir=/gscratch/srlab/sam/data/C_bairdi/RNAseq
transcriptomes_dir=/gscratch/srlab/sam/data/C_bairdi/transcriptomes
threads=28

# Array of the various comparisons to evaluate
# Each condition in each comparison should be separated by a "-"
transcriptomes_array=(
"${transcriptomes_dir}"/cbai_transcriptome_v1.0.fasta \
"${transcriptomes_dir}"/cbai_transcriptome_v1.5.fasta \
"${transcriptomes_dir}"/cbai_transcriptome_v1.6.fasta \
"${transcriptomes_dir}"/cbai_transcriptome_v1.7.fasta \
"${transcriptomes_dir}"/cbai_transcriptome_v2.0.fasta \
"${transcriptomes_dir}"/cbai_transcriptome_v2.1.fasta \
"${transcriptomes_dir}"/cbai_transcriptome_v3.0.fasta \
"${transcriptomes_dir}"/cbai_transcriptome_v3.1.fasta
)


###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability

module load intel-python3_2017


# Programs array
declare -A programs_array
programs_array=(
[bowtie2]="/gscratch/srlab/programs/bowtie2-2.3.5.1-linux-x86_64" \
[detonate_trans_length]="/gscratch/srlab/programs/detonate-1.11/rsem-eval/rsem-eval-estimate-transcript-length-distribution" \
[detonate]="/gscratch/srlab/programs/detonate-1.11/rsem-eval/rsem-eval-calculate-score"
)


# Loop through each comparison
for transcriptome in "${!transcriptomes_array[@]}"
do

  ## Inititalize arrays
  R1_array=()
  R2_array=()
  reads_array=()

  # Variables
  R1_list=""
  R2_list=""

  transcriptome_name="${transcriptomes_array[$transcriptome]##*/}"


  rsem_eval_dist_mean_sd="${transcriptome_name}_true_length_dis_mean_sd.txt"

  # Capture FastA checksums for verification
  echo "Generating checksum for ${transcriptome_name}"
  md5sum "${transcriptomes_array[transcriptome]}" >> fasta.checksums.md5
  echo "Finished generating checksum for ${transcriptome_name}"
  echo ""

	if [[ "${transcriptome_name}" == "cbai_transcriptome_v1.0.fasta" ]]; then

    reads_array=("${reads_dir}"/20200[15][13][138]*megan*.fq)

    # Create array of fastq R1 files
    R1_array=("${reads_dir}"/20200[15][13][138]*megan*R1.fq)

    # Create array of fastq R2 files
    R2_array=("${reads_dir}"/20200[15][13][138]*megan*R2.fq)


  elif [[ "${transcriptome_name}" == "cbai_transcriptome_v1.5.fasta" ]]; then

    reads_array=("${reads_dir}"/20200[145][13][138]*megan*.fq)

    # Create array of fastq R1 files
    R1_array=("${reads_dir}"/20200[145][13][138]*megan*R1.fq)

    # Create array of fastq R2 files
    R2_array=("${reads_dir}"/20200[145][13][138]*megan*R2.fq)

  elif [[ "${transcriptome_name}" == "cbai_transcriptome_v1.6.fasta" ]]; then

    reads_array=("${reads_dir}"/*megan*.fq)

    # Create array of fastq R1 files
    R1_array=("${reads_dir}"/*megan*R1.fq)

    # Create array of fastq R2 files
    R2_array=("${reads_dir}"/*megan*R2.fq)

  elif [[ "${transcriptome_name}" == "cbai_transcriptome_v1.7.fasta" ]]; then

    reads_array=("${reads_dir}"/20200[145][13][189]*megan*.fq)

    # Create array of fastq R1 files
    R1_array=("${reads_dir}"/20200[145][13][189]*megan*R1.fq)

    # Create array of fastq R2 files
    R2_array=("${reads_dir}"/20200[145][13][189]*megan*R2.fq)

  elif [[ "${transcriptome_name}" == "cbai_transcriptome_v2.0.fasta" ]] \
  || [[ "${transcriptome_name}" == "cbai_transcriptome_v2.1.fasta" ]]; then

    reads_array=("${reads_dir}"/*fastp-trim*.fq)

    # Create array of fastq R1 files
    R1_array=("${reads_dir}"/*R1*fastp-trim*.fq)

    # Create array of fastq R2 files
    R2_array=("${reads_dir}"/*R2*fastp-trim*.fq)

  elif [[ "${transcriptome_name}" == "cbai_transcriptome_v3.0.fasta" ]] \
  || [[ "${transcriptome_name}" == "cbai_transcriptome_v3.1.fasta" ]]; then

    reads_array=("${reads_dir}"/*fastp-trim*20[12][09][01][24]1[48]*.fq)

    # Create array of fastq R1 files
    R1_array=("${reads_dir}"/*R1*fastp-trim*20[12][09][01][24]1[48]*.fq)

    # Create array of fastq R2 files
    R2_array=("${reads_dir}"/*R2*fastp-trim*20[12][09][01][24]1[48]*.fq)


  fi

  # Create list of fastq files used in analysis
  ## Uses parameter substitution to strip leading path from filename
  printf "%s\n" "${reads_array[@]##*/}" >> "${transcriptome_name}".fastq.list.txt

  # Create comma-separated lists of FastQ reads
  R1_list=$(echo "${R1_array[@]}" | tr " " ",")
  R2_list=$(echo "${R2_array[@]}" | tr " " ",")

  # Determine transcript length
  ${programs_array[detonate_trans_length]} \
  "${transcriptomes_array[$transcriptome]}" \
  "${rsem_eval_dist_mean_sd}"


  # Run rsem-eval
  # Use bowtie2 and paired-end options
  ${programs_array[detonate]} \
  --bowtie2 \
  --bowtie2-path "${programs_array[bowtie2]}" \
  --num-threads ${threads} \
  --transcript-length-parameters "${rsem_eval_dist_mean_sd}" \
  --paired-end \
  "${R1_list}" \
  "${R2_list}" \
  "${transcriptomes_array[$transcriptome]}" \
  "${transcriptome_name}" \
  ${frag_size}


done

# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

# Capture program options
for program in "${!programs_array[@]}"
do
	{
  echo "Program options for ${program}: "
	echo ""
	${programs_array[$program]} --help
	echo ""
	echo ""
	echo "----------------------------------------------"
	echo ""
	echo ""
} &>> program_options.log || true
done
```


---

# RESULTS

This process was PAINFUL. Here's the "runtime" for the failed job (due to it timing out; after 65 DAYS!). Also, this was _just_ for `cbai_transcriptome_v2.0`!!!

![ref-eval runtime of 65 days](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200616_cbai_detonate_transcriptome_evaluations_runtime.png?raw=true)

Overall, even though this required very little effort on my part, it was kind of a pain to manage. For some reason (I guess it's due to the number of sequences it has to align) `cbai_transcriptome_v2.0` alignments took too long (i.e. longer than the 30 days between Mox node maintenance). I restarted this job a couple of times and I finally lucked out for a bit when the October, November, and December 2020 Mox maintenance dates were canceled. Despite that, I neglected to extend the runtime further in December and the job timed out.

But, with that said, part of me thinks something weird was going on anyway. I mean, look at the size of the BAM file (remember, a BAM file is a _compressed_ version of a SAM file, and is usually close to _10x smaller_ than the originating SAM file!) that was still being made when the job died:

![cbai_transcriptome_v2.0 BAM file size screencap](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200616_cbai_detonate_transcriptome_evaluations_file_size.png?raw=true)

789GB!!!!

That's absurd! Not to mention the fact that this was generated over the course of _two months_!

I'm going to try one more thing to see if I can get `rsem-eval` to work. Again, it's low effort, so won't take too much of my time. I'm going to run [`bowtie2`](https://github.com/BenLangmead/bowtie2) independently of `rsem-eval` ([`bowtie2`](https://github.com/BenLangmead/bowtie2) alignment is built-in to that, if the user wants to use it) and see if that is somehow faster. If it is, then I can provide the resulting BAM files as input to `rsem-eval`.