---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: RNAseq Alignments - Trimmed S.salar RNAseq to GCF_000233375.1_ICSASG_v2_genomic.fa Using Hisat2 on Mox
date: '2020-11-03 11:27'
tags:
  - RNAseq
  - Salmo salar
  - Atlantic salmon
  - hisat2
  - mox
categories:
  - 2020
  - Miscellaneous
---
This is a continuation of addressing [Shelly Trigg's (regarding some _Salmo salar_ RNAseq data) request (GitHub Issue)](https://github.com/RobertsLab/resources/issues/1016) to trim ([completed 20201029](https://robertslab.github.io/sams-notebook/posts/2020/2020-10-29-Trimming---Shelly-S.salar-RNAseq-Using-fastp-and-MultiQC-on-Mox/)), perform genome alignment, and transcriptome alignment.

Ran [`HISAT2`](https://daehwankimlab.github.io/hisat2/) with the [trimmed FastQ files from 20201029](https://robertslab.github.io/sams-notebook/posts/2020/2020-10-29-Trimming---Shelly-S.salar-RNAseq-Using-fastp-and-MultiQC-on-Mox/) with the following options:

- `--dta`: This stands for "downstream transcriptome alignment". Since we'll be performing a subsequent alignment using the transcriptome using [`StringTie`](https://ccb.jhu.edu/software/stringtie/), I decided to add this option.

- `--new-summary`: This creates a summary file that can be easily read by [`MultiQC`](https://multiqc.info/).

This was run on Mox.


SBATCH script (GitHub):

- [20201103_ssal_RNAseq_hisat2_alignment.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20201103_ssal_RNAseq_hisat2_alignment.sh)


```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=20201103_ssal_RNAseq_hisat2_alignment
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=200G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201103_ssal_RNAseq_hisat2_alignment


### S.salar RNAseq Hisat2 alignment.

### Uses fastp-trimmed FastQ files from 20201029.

### Uses GCF_000233375.1_ICSASG_v2_genomic.fa as reference,
### created by Shelly Trigg.
### This is a subset of the NCBI RefSeq GCF_000233375.1_ICSASG_v2_genomic.fna.
### Includes only "chromosome" sequence entries.


###################################################################################
# These variables need to be set by user

## Assign Variables

# Set number of CPUs to use
threads=27

# Input/output files
fastq_checksums=fastq_checksums.md5
fastq_dir="/gscratch/srlab/sam/data/S_salar/RNAseq/"
genome_fasta="/gscratch/srlab/sam/data/S_salar/genomes/GCF_000233375.1_ICSASG_v2_genomic.fa"

genome_index_name="GCF_000233375.1_ICSASG_v2"

# Paths to programs
hisat2_dir="/gscratch/srlab/programs/hisat2-2.1.0"
hisat2="${hisat2_dir}/hisat2"
hisat2_build="${hisat2_dir}/hisat2-build"
samtools="/gscratch/srlab/programs/samtools-1.10/samtools"

## Inititalize arrays
fastq_array_R1=()
fastq_array_R2=()
names_array=()


# Programs associative array
declare -A programs_array
programs_array=(
[hisat2]="${hisat2}" \
[hisat2-build]="${hisat2_build}"
[samtools_index]="${samtools} index" \
[samtools_sort]="${samtools} sort" \
[samtools_view]="${samtools} view"
)


###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Capture date
timestamp=$(date +%Y%m%d)

# Create array of fastq R1 files
for fastq in "${fastq_dir}"*_1.fastp-trim.20201029.fq.gz
do
    fastq_array_R1+=("${fastq}")
  # Create array of sample names
  ## Uses parameter substitution to strip leading path from filename
  ## Uses awk to parse out sample name from filename
  names_array+=($(echo "${fastq#${fastq_dir}}" | awk -F"[_]" '{print $1 "_" $2}'))
done

# Create array of fastq R2 files
for fastq in "${fastq_dir}"*_2.fastp-trim.20201029.fq.gz
do
  fastq_array_R2+=("${fastq}")
done


# Build Hisat2 reference index
"${programs_array[hisat2-build]}" \
"${genome_fasta}" \
"${genome_index_name}" \
-p "${threads}" \
2> hisat2_build.err


# Hisat2 alignments
for index in "${!fastq_array_R1[@]}"
do
  # Get current sample name
  sample_name=$(echo "${names_array[index]}")

  # Run Hisat2
  # Sets --dta which tailors output for downstream transcriptome assemblers (e.g. Stringtie)
  # Sets --new-summary option for use with MultiQC
  "${programs_array[hisat2]}" \
  -x "${genome_index_name}" \
  --dta \
  --new-summary \
  -1 "${fastq_array_R1[index]}" \
  -2 "${fastq_array_R2[index]}" \
  -S "${sample_name}".sam \
  2> "${sample_name}"_hisat2.err
# Sort SAM files, convert to BAM
  ${programs_array[samtools_view]} \
  -@ "${threads}" \
  -Su "${sample_name}".sam \
  | ${programs_array[samtools_sort]} - \
  -@ "${threads}" \
  -o "${sample_name}".sorted.bam
  # Index sorted BAM file
  ${programs_array[samtools_index]} "${sample_name}".sorted.bam
done

# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in "${fastq_dir}"*fastp-trim.20201029.fq.gz
do
  echo "${fastq#${fastq_dir}}" >> fastq.list.txt
  md5sum "${fastq}" >> ${fastq_checksums}
done

# Capture program options
for program in "${!programs_array[@]}"
do
	{
  echo "Program options for ${program}: "
	echo ""
  # Handle samtools help menus
  if [[ "${program}" == "samtools_index" ]] \
  || [[ "${program}" == "samtools_sort" ]] \
  || [[ "${program}" == "samtools_view" ]]
  then
    ${programs_array[$program]}
  fi
	${programs_array[$program]} -h
	echo ""
	echo ""
	echo "----------------------------------------------"
	echo ""
	echo ""
} &>> program_options.log || true

  # If MultiQC is in programs_array, copy the config file to this directory.
  if [[ "${program}" == "multiqc" ]]; then
  	cp --preserve ~/.multiqc_config.yaml "${timestamp}_multiqc_config.yaml"
  fi
done


# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
```

NOTE: I manually removed the SAM files that were generated during this process. This step was not included in the SBATCH script above because I didn't realize the SAM files would remain after creating the (desired) BAM files. The SAM files were very large and not necessary for downstream analysis.

---

# RESULTS

Runtime was close to 3hrs:

![Cumulative HISAT2 runtime on Mox](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20201103_ssal_RNAseq_hisat2_alignment_runtime.png?raw=true)


Output folder:

- [20201103_ssal_RNAseq_hisat2_alignment/](https://gannet.fish.washington.edu/Atumefaciens/20201103_ssal_RNAseq_hisat2_alignment/)

BAM files for each alignment are linked below. The BAM files will be used for subsequent usage in [`StringTie`](https://ccb.jhu.edu/software/stringtie/). Additionally, each BAM file has an associated index file and a [`HISAT2`](https://daehwankimlab.github.io/hisat2/) "error" file. This "error" file is actually the summary report of the alignment and will be used by [`MultiQC`](https://multiqc.info/) later on. In the future, I'll try to remember to change the labelling for this...

  - [Pool26_16.sorted.bam](https://gannet.fish.washington.edu/Atumefaciens/20201103_ssal_RNAseq_hisat2_alignment/Pool26_16.sorted.bam) (2.4G)

    - [Pool26_16.sorted.bam.bai](https://gannet.fish.washington.edu/Atumefaciens/20201103_ssal_RNAseq_hisat2_alignment/Pool26_16.sorted.bam.bai) (2.3M)

    - [Pool26_16_hisat2.err](https://gannet.fish.washington.edu/Atumefaciens/20201103_ssal_RNAseq_hisat2_alignment/Pool26_16_hisat2.err) (4.0K)

  - [Pool26_8.sorted.bam](https://gannet.fish.washington.edu/Atumefaciens/20201103_ssal_RNAseq_hisat2_alignment/Pool26_8.sorted.bam) (2.5G)

    - [Pool26_8.sorted.bam.bai](https://gannet.fish.washington.edu/Atumefaciens/20201103_ssal_RNAseq_hisat2_alignment/Pool26_8.sorted.bam.bai) (2.3M)

    - [Pool26_8_hisat2.err](https://gannet.fish.washington.edu/Atumefaciens/20201103_ssal_RNAseq_hisat2_alignment/Pool26_8_hisat2.err) (4.0K)

  - [Pool32_16.sorted.bam](https://gannet.fish.washington.edu/Atumefaciens/20201103_ssal_RNAseq_hisat2_alignment/Pool32_16.sorted.bam) (1.9G)

    - [Pool32_16.sorted.bam.bai](https://gannet.fish.washington.edu/Atumefaciens/20201103_ssal_RNAseq_hisat2_alignment/Pool32_16.sorted.bam.bai) (2.2M)

    - [Pool32_16_hisat2.err](https://gannet.fish.washington.edu/Atumefaciens/20201103_ssal_RNAseq_hisat2_alignment/Pool32_16_hisat2.err) (4.0K)

  - [Pool32_8.sorted.bam](https://gannet.fish.washington.edu/Atumefaciens/20201103_ssal_RNAseq_hisat2_alignment/Pool32_8.sorted.bam) (2.4G)

    - [Pool32_8.sorted.bam.bai](https://gannet.fish.washington.edu/Atumefaciens/20201103_ssal_RNAseq_hisat2_alignment/Pool32_8.sorted.bam.bai) (2.3M)

    - [Pool32_8_hisat2.err](https://gannet.fish.washington.edu/Atumefaciens/20201103_ssal_RNAseq_hisat2_alignment/Pool32_8_hisat2.err) (4.0K)