---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: Transcript Abundance - C.bairdi Alignment-free with Salmon on Mox for Grace
date: '2020-04-15 11:10'
tags:
  - Trinity
  - salmon
  - transcript abundance
  - Tanner crab
  - Chionoecetes bairdi
  - mox
categories:
  - 2020
  - Miscellaneous
---
[Per this GitHub Issue](https://github.com/RobertsLab/resources/issues/902), Grace and Steven asked if I could generate a transcript abundance file for Grace to use with edgeR. I used [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html) for alignment-free transcript abundance estimates because it is fast and is integrated into [Trinity](https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Transcript-Quantification#salmon-output). The run used the following files:

- [Trimmed FastQs from 20191025](https://robertslab.github.io/sams-notebook/posts/2019/2019-12-18-TrimmingFastQCMultiQC---C.bairdi-RNAseq-FastQ-with-fastp-on-Mox/)

- [_C.bairdi_ transcriptome from 20200409](https://robertslab.github.io/sams-notebook/posts/2020/2020-03-30-Transcriptome-Assembly---C.bairdi-with-MEGAN6-Taxonomy-specific-Reads-with-Trinity-on-Mox/) (NOTE: Due to delays in running the initial assembly, the FastA file is dated 20200408, even though the notebook entry is dated 20200330.)

- [Trinotate annotations from 20200409](https://robertslab.github.io/sams-notebook/posts/2020/2020-04-09-Transcriptome-Annotation---Trinotate-C.bairdi-MEGAN6-Taxonomic-specific-Trinity-Assembly-on-Mox/)


SBATCH script (GitHub):

- [20200415_cbai_salmon_abundance.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20200415_cbai_salmon_abundance.sh)

```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_salmon_abundance_estimates
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=04-00:00:00
## Memory per node
#SBATCH --mem=120G
## Turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200415_cbai_salmon_abundance

## Script to get alignment-free gene abundance estimates via salmon
## Specifically for Grace, per this GitHub issue: https://github.com/RobertsLab/resources/issues/902


# Exit script if any command fails
set -e

# Load the Mox Python module so Python (and its modules) are available

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
# Print a 10-dash separator, then end the line so PATH entries start on their own line
printf "%0.s-" {1..10}
echo ""
echo "${PATH}" | tr : \\n
} >> system_path.log

wd="$(pwd)"
threads=28

samples=samples.txt

fasta_prefix="20200408.C_bairdi.megan.Trinity"


## Set input file locations
trimmed_reads_dir="/gscratch/srlab/sam/data/C_bairdi/RNAseq"
transcriptome_dir="/gscratch/srlab/sam/data/C_bairdi/transcriptomes"
transcriptome="${transcriptome_dir}/${fasta_prefix}.fasta"

trinotate_feature_map="${transcriptome_dir}/20200409.cbai.trinotate.annotation_feature_map.txt"
gene_map="${transcriptome_dir}/${fasta_prefix}.fasta.gene_trans_map"

# Standard output/error files
matrix_stdout="matrix_stdout.txt"
matrix_stderr="matrix_stderr.txt"
salmon_stdout="salmon_stdout.txt"
salmon_stderr="salmon_stderr.txt"


# Programs
trinity_home=/gscratch/srlab/programs/trinityrnaseq-v2.9.0
trinity_annotate_matrix="${trinity_home}/Analysis/DifferentialExpression/rename_matrix_feature_identifiers.pl"
trinity_abundance=${trinity_home}/util/align_and_estimate_abundance.pl
trinity_matrix=${trinity_home}/util/abundance_estimates_to_matrix.pl

# Create salmon index of Trinity FastA
# Useful for saving time if needed in future for
# additional runs.
${trinity_abundance} \
--transcripts ${transcriptome} \
--est_method salmon \
--prep_reference \
--thread_count "${threads}" \
--output_dir "${wd}"


# Rsync trimmed reads
rsync \
--archive \
--verbose \
${trimmed_reads_dir}/3297*trim*.gz .

# Populate array with unique sample names
## NOTE: Requires Bash >=v4.0
mapfile -t samples_array < <( for fastq in 3297*.gz; do echo "${fastq}" | awk -F"_" '{print $1}'; done | sort -u )

# Loop to concatenate same sample R1 and R2 reads
# Also create sample list file
for sample in "${!samples_array[@]}"
do
  # Concatenate R1 reads for each sample
  for fastq in *R1*.gz
  do
    fastq_sample=$(echo "${fastq}" | awk -F"_" '{print $1}')
    if [ "${samples_array[sample]}" == "${fastq_sample}" ]; then
      echo "${fastq}" >> fastq.list.txt
      reads_1=${samples_array[sample]}_reads_1.fq
      gunzip --to-stdout "${fastq}" >> "${reads_1}"
    fi
  done

  # Concatenate R2 reads for each sample
  for fastq in *R2*.gz
  do
    fastq_sample=$(echo "${fastq}" | awk -F"_" '{print $1}')
    if [ "${samples_array[sample]}" == "${fastq_sample}" ]; then
      echo "${fastq}" >> fastq.list.txt
      reads_2=${samples_array[sample]}_reads_2.fq
      gunzip --to-stdout "${fastq}" >> "${reads_2}"
    fi
  done

  # Create tab-delimited samples file.
  printf "%s\t%s\t%s\t%s\n" "${samples_array[sample]}" "${samples_array[sample]}_01" "${reads_1}" "${reads_2}" \
  >> ${samples}
done

# Create directory/sample list for ${trinity_matrix} command
trin_matrix_list=$(awk '{printf "./%s%s", $2, "/quant.sf " }' "${samples}")


# Run salmon with the stranded library option (RF)
${trinity_abundance} \
--transcripts ${transcriptome} \
--seqType fq \
--left reads_1.fq \
--right reads_2.fq \
--SS_lib_type RF \
--est_method salmon \
--samples_file "${samples}" \
--gene_trans_map "${gene_map}" \
--thread_count "${threads}" \
--output_dir "${wd}" \
1> ${salmon_stdout} \
2> ${salmon_stderr}


# Convert abundance estimates to matrix
${trinity_matrix} \
--est_method salmon \
--gene_trans_map ${gene_map} \
--out_prefix salmon \
--name_sample_by_basedir \
${trin_matrix_list} \
1> ${matrix_stdout} \
2> ${matrix_stderr}

# Integrate Trinotate functional annotations into the gene counts matrix
"${trinity_annotate_matrix}" \
"${trinotate_feature_map}" \
salmon.gene.counts.matrix \
> salmon.gene.counts.annotated.matrix

# Clean up
rm ./*trim*.gz
rm ./*.fq
```
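
The sample-name parsing and samples-file steps in the script can be tried in isolation before submitting a job. The following is a minimal sketch, not part of the job that was run: the FastQ filenames are hypothetical placeholders (empty files in a temporary directory), and it assumes, as the script does, that the sample name is the first underscore-delimited field of each filename.

```shell
#!/bin/bash
# Toy demonstration only: filenames below are hypothetical, not the real FastQ names.
set -e

demo_dir="$(mktemp -d)"
cd "${demo_dir}"

# Empty placeholder files: two lanes for one sample, one lane for another
touch 329774_S1_L001_R1_001.trim.fastq.gz 329774_S1_L001_R2_001.trim.fastq.gz \
      329774_S1_L002_R1_001.trim.fastq.gz 329774_S1_L002_R2_001.trim.fastq.gz \
      329775_S2_L001_R1_001.trim.fastq.gz 329775_S2_L001_R2_001.trim.fastq.gz

# Same approach as the script: take the first "_"-delimited field, then de-duplicate
# NOTE: mapfile requires Bash >=v4.0
mapfile -t samples_array < <( for fastq in 3297*.gz; do echo "${fastq}" | awk -F"_" '{print $1}'; done | sort -u )

# Four-column, tab-delimited samples file: condition, replicate, left reads, right reads
for sample in "${samples_array[@]}"
do
  printf "%s\t%s\t%s\t%s\n" "${sample}" "${sample}_01" "${sample}_reads_1.fq" "${sample}_reads_2.fq"
done > samples.txt

# One quant.sf path per replicate directory, as passed to the matrix step
trin_matrix_list=$(awk '{printf "./%s%s", $2, "/quant.sf " }' samples.txt)

cat samples.txt
echo "${trin_matrix_list}"
```

Looking at `samples.txt` and the `quant.sf` list this way is a cheap check that multi-lane samples collapse to a single row before any reads are decompressed.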

---

# RESULTS

Pretty quick, ~46 minutes:

![runtime salmon abundance estimates](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200415_cbai_salmon_abundance_runtime.png?raw=true)

Output folder:

- [20200415_cbai_salmon_abundance/](https://gannet.fish.washington.edu/Atumefaciens/20200415_cbai_salmon_abundance/)


Transcript counts matrix (text):

- [20200415_cbai_salmon_abundance/salmon.isoform.counts.matrix](https://gannet.fish.washington.edu/Atumefaciens/20200415_cbai_salmon_abundance/salmon.isoform.counts.matrix)


Gene counts matrix (text):

- [20200415_cbai_salmon_abundance/salmon.gene.counts.matrix](https://gannet.fish.washington.edu/Atumefaciens/20200415_cbai_salmon_abundance/salmon.gene.counts.matrix)

Annotated (Trinotate) gene counts matrix (text):

- [20200415_cbai_salmon_abundance/salmon.gene.counts.annotated.matrix](https://gannet.fish.washington.edu/Atumefaciens/20200415_cbai_salmon_abundance/salmon.gene.counts.annotated.matrix)
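
Before handing a matrix to edgeR, a quick shape check can catch a truncated or malformed file. The sketch below is a hedged example and not something that was run as part of this job: it builds a tiny, made-up matrix (the gene IDs and counts are invented for illustration) and assumes the layout is tab-delimited with a header row of sample names whose first field is empty. Point `matrix` at a real file such as `salmon.gene.counts.matrix` to use it.

```shell
#!/bin/bash
set -e

# Hypothetical toy matrix standing in for salmon.gene.counts.matrix
matrix="$(mktemp)"
printf "\t329774_01\t329775_01\n" > "${matrix}"
printf "TRINITY_DN1_c0_g1\t10\t0\n" >> "${matrix}"
printf "TRINITY_DN2_c0_g1\t3.5\t7\n" >> "${matrix}"

# Sample columns = header fields minus the (empty) feature ID column
n_samples=$(awk -F'\t' 'NR==1 {print NF - 1}' "${matrix}")

# Features = all rows after the header
n_features=$(awk 'END {print NR - 1}' "${matrix}")

# Rows whose field count differs from the header suggest a malformed matrix
n_bad_rows=$(awk -F'\t' 'NR==1 {n = NF} NF != n {bad++} END {print bad + 0}' "${matrix}")

echo "samples: ${n_samples}, features: ${n_features}, malformed rows: ${n_bad_rows}"
```

The sample count should match the number of rows in `samples.txt`, and the malformed-row count should be zero.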