---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: FastQ Read Alignment and Quantification - P.generosa Water Metagenomic Libraries to MetaGeneMark Assembly with Hisat2 on Mox
date: '2020-07-31 13:41'
tags:
  - metagenemark
  - alignment
  - hisat2
  - Panopea generosa
  - geoduck
  - metagenomics
  - mox
categories:
  - 2020
  - Miscellaneous
---
Continuing working on the manuscript for this data, Emma wanted the number of reads aligned to each gene. [I previously created and assembly with genes/proteins using MetaGeneMark on 20190103](https://robertslab.github.io/sams-notebook/posts/2019/2019-01-02-Metagenome-Assembly---P.generosa-Water-Sample-HiSeqX-Data-Using-Megahit/), but the assemby process didn't output any sort of stastics on read counts.

So, to get this data, I used [Hisat2](https://ccb.jhu.edu/software/hisat2/manual.shtml#running-hisat2) to align reads (creating a BAM file) and then used [`samtools idxstats`](http://www.htslib.org/doc/samtools-idxstats.html) to generate a file with read counts aligned to each gene.

This was all done on Mox.

Here's how the sample names breakdown:

| Sample | Develomental Stage (days post-fertilization) | pH Treatment |
|--------|-------------------------|--------------|
| MG1    | 13                      | 8.2          |
| MG2    | 17                      | 8.2          |
| MG3    | 6                       | 7.1          |
| MG5    | 10                      | 8.2          |
| MG6    | 13                      | 7.1          |
| MG7    | 17                      | 7.1          |


SBATCH script (GitHub):

- [20200731_metagenome_hisat2_alignments.sh](https://github.com/RobertsLab/sams-notebook/blob/master/sbatch_scripts/20200731_metagenome_hisat2_alignments.sh)


```shell
#!/bin/bash
## Job Name
#SBATCH --job-name=metagenome_hisat2_alignments
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=7-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200731_metagenome_hisat2_alignments


###################################################################################
# These variables need to be set by user

# Assign Variables
reads_dir=/gscratch/srlab/sam/data/metagenomics/P_generosa/sequencing
assembly=/gscratch/srlab/sam/data/metagenomics/P_generosa/assemblies/20190103-mgm-nucleotides.fa
threads=28
# Set hisat2 basename
hisat2_basename=20190103-mgm

# Array of the various comparisons to evaluate
libraries_array=(
MG_1 \
MG_2 \
MG_3 \
MG_5 \
MG_6 \
MG_7
)


###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
## Hisat2 requires Python2. Fails with syntax error if using Python3
#module load intel-python3_2017
module load intel-python2_2017

# Program directories
hisat2_dir="/gscratch/srlab/programs/hisat2-2.2.0/"
samtools_dir="/gscratch/srlab/programs/samtools-1.10/samtools"

# Programs array
declare -A programs_array
programs_array=(
[hisat2]="${hisat2_dir}hisat2" \
[hisat2_build]="${hisat2_dir}hisat2-build" \
[samtools_view]="${samtools_dir} view" \
[samtools_sort]="${samtools_dir} sort" \
[samtools_index]="${samtools_dir} index"
[samtools_idxstats]="${samtools_dir} idxstats"
)

# Capture FastA checksums for verification
echo "Generating checksum for ${assembly}"
md5sum "${assembly}" >> fasta.checksums.md5
echo "Finished generating checksum for ${assembly}"
echo ""

# Build hisat2 index
${programs_array[hisat2_build]} \
-f "${assembly}" \
"${hisat2_basename}" \
-p ${threads}

# Loop through each library
for library in "${libraries_array[@]}"
do

  ## Inititalize arrays
  R1_array=()
  R2_array=()
  reads_array=()

  # Variables
  R1_list=""
  R2_list=""


  if [[ "${library}" == "MG_1" ]]; then

    reads_array=("${reads_dir}"/*MG_1*fastq.gz)

    # Create array of fastq R1 files
    R1_array=("${reads_dir}"/*MG_1*R1*fastq.gz)

    # Create array of fastq R2 files
    R2_array=("${reads_dir}"/*MG_1*R2*fastq.gz)


  elif [[ "${library}" == "MG_2" ]]; then

    reads_array=("${reads_dir}"/*MG_2*fastq.gz)

    # Create array of fastq R1 files
    R1_array=("${reads_dir}"/*MG_2*R1*fastq.gz)

    # Create array of fastq R2 files
    R2_array=("${reads_dir}"/*MG_2*R2*fastq.gz)

  elif [[ "${library}" == "MG_3" ]]; then

    reads_array=("${reads_dir}"/*MG_3*fastq.gz)

    # Create array of fastq R1 files
    R1_array=("${reads_dir}"/*MG_3*R1*fastq.gz)

    # Create array of fastq R2 files
    R2_array=("${reads_dir}"/*MG_3*R2*fastq.gz)

  elif [[ "${library}" == "MG_5" ]]; then

    reads_array=("${reads_dir}"/*MG_5*fastq.gz)

    # Create array of fastq R1 files
    R1_array=("${reads_dir}"/*MG_5*R1*fastq.gz)

    # Create array of fastq R2 files
    R2_array=("${reads_dir}"/*MG_5*R2*fastq.gz)

  elif [[ "${library}" == "MG_6" ]]; then

    reads_array=("${reads_dir}"/*MG_6*fastq.gz)

    # Create array of fastq R1 files
    R1_array=("${reads_dir}"/*MG_6*R1*fastq.gz)

    # Create array of fastq R2 files
    R2_array=("${reads_dir}"/*MG_6*R2*fastq.gz)

  elif [[ "${library}" == "MG_7" ]]; then

    reads_array=("${reads_dir}"/*MG_7*fastq.gz)

    # Create array of fastq R1 files
    R1_array=("${reads_dir}"/*MG_7*R1*fastq.gz)

    # Create array of fastq R2 files
    R2_array=("${reads_dir}"/*MG_7*R2*fastq.gz)


  fi

  # Create list of fastq files used in analysis
  ## Uses parameter substitution to strip leading path from filename
  printf "%s\n" "${reads_array[@]##*/}" >> "${library}".fastq.list.txt

  # Create comma-separated lists of FastQ reads
  R1_list=$(echo "${R1_array[@]}" | tr " " ",")
  R2_list=$(echo "${R2_array[@]}" | tr " " ",")

  # Align reads to metagenome assembly
  ${programs_array[hisat2]} \
  --threads ${threads} \
  -x "${hisat2_basename}" \
  -q \
  -1 "${R1_list}" \
  -2 "${R2_list}" \
  -S "${library}".sam \
  2>&1 | tee "${library}".alignment_stats.txt

  # Convert SAM file to BAM
  ${programs_array[samtools_view]} \
  --threads ${threads} \
  -b "${library}".sam \
  > "${library}".bam

  # Sort BAM
  ${programs_array[samtools_sort]} \
  --threads ${threads} \
  "${library}".bam \
  -o "${library}".sorted.bam

  # Index for use in IGV
  ##-@ specifies thread count; --thread option not available in samtools index
  ${programs_array[samtools_index]} \
  -@ ${threads} \
  "${library}".sorted.bam

  # Get index stats from sorted bam
  # Third column is number of reads
  ${programs_array[samtools_idxstats]} \
  --threads ${threads} \
  "${library}".sorted.bam \
  > "${library}".sorted.bam.stats.txt

  # Remove original SAM and unsorted BAM
  rm "${library}".bam "${library}".sam


done

# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log


# Capture program options
for program in "${!programs_array[@]}"
do
	{
  echo "Program options for ${program}: "
	echo ""
	${programs_array[$program]} --help
	echo ""
	echo ""
	echo "----------------------------------------------"
	echo ""
	echo ""
  } &>> program_options.log || true
done

```

---

# RESULTS

Runtime: Took about 5hrs:

![Runtime](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20200731_metagenome_hisat2_alignments_runtime.png?raw=true)

Output folder:

- [20200731_metagenome_hisat2_alignments/](https://gannet.fish.washington.edu/Atumefaciens/20200731_metagenome_hisat2_alignments/)

Samtools idxstats output files. They are tab-delimited.

Format:

`<sequence name> <sequence length> <# mapped read-segments> <# unmapped read-segments>`

NOTE: The last line of each file begins with an asterisk and seems to have a total read count? It's not clear what this line is, as it is not described [in the `samtools idxstats` documentation](http://www.htslib.org/doc/samtools-idxstats.html).

- [MG_1.sorted.bam.stats.txt](https://gannet.fish.washington.edu/Atumefaciens/20200731_metagenome_hisat2_alignments/MG_1.sorted.bam.stats.txt)

- [MG_2.sorted.bam.stats.txt](https://gannet.fish.washington.edu/Atumefaciens/20200731_metagenome_hisat2_alignments/MG_2.sorted.bam.stats.txt)

- [MG_3.sorted.bam.stats.txt](https://gannet.fish.washington.edu/Atumefaciens/20200731_metagenome_hisat2_alignments/MG_3.sorted.bam.stats.txt)

- [MG_5.sorted.bam.stats.txt](https://gannet.fish.washington.edu/Atumefaciens/20200731_metagenome_hisat2_alignments/MG_5.sorted.bam.stats.txt)

- [MG_6.sorted.bam.stats.txt](https://gannet.fish.washington.edu/Atumefaciens/20200731_metagenome_hisat2_alignments/MG_6.sorted.bam.stats.txt)

- [MG_7.sorted.bam.stats.txt](https://gannet.fish.washington.edu/Atumefaciens/20200731_metagenome_hisat2_alignments/MG_7.sorted.bam.stats.txt)