---
title: "Step 4: Align sequences to annotated *M. capitata* genome"
subtitle: "Using `HISAT2`"
author: "Sarah Tanja"
date: 10/21/2024
format:
gfm: default # or html if you want to render in HTML
toc: true
toc-depth: 3
link-external-icon: true
link-external-newwindow: true
reference-location: margin
citation-location: margin
---
# 1 \| INTRODUCTION
This notebook aligns the trimmed *M. capitata* RNA-seq data to the *M. capitata* genome using HISAT2 (hierarchical indexing for spliced alignment of transcripts) [@zhang_rapid_2021], followed by StringTie (Pertea et al. 2016, 2015) for transcript assembly/identification and generation of count matrices for downstream expression analysis with DESeq2.
Input(s)
- Trimmed FastQ files, with format: `*fastp-trim.fq.gz` are located in `output/03_qaqc/trim_reads_fastp`
- HISAT2 genome index:
- Genome GTF:
- Genome [Version 3](http://cyanophora.rutgers.edu/montipora/) [@stephens2022]
- [GFF](http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.gff3.gz) from Rutgers (or [GFF fixed](https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/raw/master/Mcap2020/Data/TagSeq/Montipora_capitata_HIv3.genes_fixed.gff3.gz) from AHuffmyer?)
- [genomes, indexes, & feature tracks](https://robertslab.github.io/resources/Genomic-Resources/#montipora-capitata) from Roberts Lab Handbook
- Sample metadata: `~metadata/rna_metadata.csv`
# 2 \| Genome downloads
## Annotated Reference Genome for *Montipora capitata*
Deep Dive project with genomes of interest: https://github.com/urol-e5/deep-dive
*Montipora capitata* Genome version V3, Rutgers University:
Genome publication:
Nucleotide Coding Sequence (CDS):
This code downloads the *Montipora capitata* genome assembly fasta file (`Montipora_capitata_HIv3.assembly.fasta.gz`) if it is not already present.
```{r, genome-download, engine = 'bash'}
# change from code directory to work in data directory
cd ../data
# make mcapgenome directory a subdirectory of data if not already present
mkdir -p mcapgenome
# change directory to ~data/mcapgenome
cd mcapgenome
# download the annotated genomes from the rutgers site if not already present
[ -f Montipora_capitata_HIv3.assembly.fasta.gz ] || wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.assembly.fasta.gz
```
- [`Montipora_capitata_HIv3.assembly.fasta`](https://owl.fish.washington.edu/halfshell/genomic-databank/Montipora_capitata_HIv3.assembly.fasta) (745MB)
- MD5 checksum: `99819eadba1b13ed569bb902eef8da08`
- Downloaded 2023017:
```{r, genome-checksums, engine = 'bash'}
# change to work in data genome folder
cd ../data/mcapgenome
# generate checksum for the genome assembly file
md5sum *assembly*
```
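To confirm the download is intact, the computed checksum can be compared to the value listed above with `md5sum --check`. A minimal sketch; note that the listed checksum is for the uncompressed `Montipora_capitata_HIv3.assembly.fasta`, so a decompressed copy is made first (keeping the `.gz`). This also produces the uncompressed assembly referenced later by `genome_fasta` in `.bashvars`.

```{r, assembly-checksum-verify, engine = 'bash'}
# change to work in data genome folder
cd ../data/mcapgenome
# decompress a copy of the assembly if not already present (keeps the .gz)
[ -f Montipora_capitata_HIv3.assembly.fasta ] || gunzip -k Montipora_capitata_HIv3.assembly.fasta.gz
# compare against the expected checksum listed above
echo "99819eadba1b13ed569bb902eef8da08  Montipora_capitata_HIv3.assembly.fasta" | md5sum --check -
```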
## HISAT2 Index
- [`Montipora_capitata_HIv3-hisat2-indices.tar.gz`](https://gannet.fish.washington.edu/Atumefaciens/20230131-mcap-HIv3-hisat2-build-index/Montipora_capitata_HIv3-hisat2-indices.tar.gz) (tarball gzip; 1.2GB)
- MD5 checksum: `c8accb6c54e843198c776f0d6f0c603d`
- Needs to be unpacked before use!
```{r, hisat-download, engine = 'bash'}
# change to work in data genome directory
cd ../data/mcapgenome
# download the hisat index from Robert's Lab gannet server
[ -f Montipora_capitata_HIv3-hisat2-indices.tar.gz ] || wget https://gannet.fish.washington.edu/Atumefaciens/20230131-mcap-HIv3-hisat2-build-index/Montipora_capitata_HIv3-hisat2-indices.tar.gz
```
```{r, hisat-checksum, engine = 'bash'}
# change to work in data genome directory
cd ../data/mcapgenome
# generate checksum for the hisat index zip file
md5sum *hisat2*
```
Unpack the tar.gz HISAT2 index file using `tar -xvzf`:
- `-x`: Extracts the contents of the archive.
- `-v`: Verbose, shows the files being extracted.
- `-z`: Tells `tar` that the archive is compressed with `gzip` (for `.tar.gz` files).
- `-f`: Specifies the file name of the archive to extract.
This command will extract the contents of the `.tar.gz` file into the current directory:
```{r, hisat-unzip, engine = 'bash'}
cd ../data/mcapgenome
tar -xvzf Montipora_capitata_HIv3-hisat2-indices.tar.gz
```
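To confirm the extraction, the index files (`*.ht2`) can be located; their basename should match the `genome_index_name` (`Montipora_capitata_HIv3`) set in `.bashvars` below. Depending on how the tarball is packaged, they may land in the current directory or a subdirectory:

```{r, hisat-index-check, engine = 'bash'}
cd ../data/mcapgenome
# locate the extracted HISAT2 index files, wherever the tarball unpacked them
find . -name "*.ht2"
```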
## Genome Feature Tracks
[Generic Feature Format (GFF3)](https://gmod.org/wiki/GFF3)
- [`Montipora_capitata_HIv3.genes.gff3`](https://owl.fish.washington.edu/halfshell/genomic-databank/Montipora_capitata_HIv3.genes.gff3) (67MB)
- MD5 checksum: `5f6b80ba2885471c8c1534932ccb7e84`
- Downloaded 2023017:
- [`Montipora_capitata_HIv3.genes.gtf`](https://gannet.fish.washington.edu/Atumefaciens/20230127-mcap-gff_to_gtf/Montipora_capitata_HIv3.genes.gtf) (101MB)
- MD5 checksum: `ceef8eca945199415b23d2f1f0dd2066`
- Created 2023017:
```{r, feature-tracks-download, engine = 'bash'}
# change to work in data genome directory
cd ../data/mcapgenome
# download feature tracks if not already present (incl. the "fixed" GFF3 from AHuffmyer linked in the Inputs above, used as genome_gff later)
[ -f Montipora_capitata_HIv3.genes.gff3.gz ] || wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.gff3.gz
[ -f Montipora_capitata_HIv3.genes.gtf ] || wget https://gannet.fish.washington.edu/Atumefaciens/20230127-mcap-gff_to_gtf/Montipora_capitata_HIv3.genes.gtf
[ -f Montipora_capitata_HIv3.genes_fixed.gff3.gz ] || wget https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/raw/master/Mcap2020/Data/TagSeq/Montipora_capitata_HIv3.genes_fixed.gff3.gz
# decompress the GFF3 files (keeping the .gz) so downstream chunks can read them
gunzip -kf Montipora_capitata_HIv3.genes.gff3.gz Montipora_capitata_HIv3.genes_fixed.gff3.gz
```
```{r, feature-tracks-checksum, engine = 'bash'}
cd ../data/mcapgenome
md5sum *.gff3.gz *.gtf *gff3
```
```{r, engine = 'bash'}
cd ../data/mcapgenome
head -10 Montipora_capitata_HIv3.genes.gff3
```
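Beyond the first few lines, a quick tally of the feature types (column 3 of the GFF3) gives a sense of what the annotation contains:

```{r, gff-feature-summary, engine = 'bash'}
cd ../data/mcapgenome
# count features by type (column 3), skipping comment/header lines
awk -F'\t' '!/^#/ {print $3}' Montipora_capitata_HIv3.genes.gff3 | sort | uniq -c
```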
Load libraries and data.
```{r}
# install packages if not already present
if (!("tidyverse" %in% rownames(installed.packages()))) install.packages("tidyverse")
if (!("R.utils" %in% rownames(installed.packages()))) install.packages("R.utils")
library(tidyverse)
library(R.utils)
```
```{r}
gff <- read.csv(file="../data/mcapgenome/Montipora_capitata_HIv3.genes.gff3", header=FALSE, sep="\t")
gff_fixed <- read.csv(file="../data/mcapgenome/Montipora_capitata_HIv3.genes_fixed.gff3", header=FALSE, sep="\t")
gtf <- read.csv(file="../data/mcapgenome/Montipora_capitata_HIv3.genes.gtf", header=FALSE, sep="\t")
```
# 3 \| Create a `bash` variables file
This file will overwrite any existing `.bashvars` file that is in the code directory.
```{r, engine= 'bash'}
{
echo "#### Assign Variables ####"
echo ""
echo "# Data directories"
echo 'export project_dir=/home/shared/8TB_HDD_02/stanja/sarahtanja/coral-embryo-leachate'
echo 'export genome_dir="${project_dir}/data/mcapgenome"'
echo 'export genome_index_dir="${genome_dir}/hisat-index"'
echo 'export output_dir_top=${project_dir}/output'
echo 'export output_dir_align=${output_dir_top}/04_align'
echo 'export trimmed_fastqs_dir=${output_dir_top}/03_qaqc/trim_reads_fastp'
echo 'export raw_reads_dir=${project_dir}/rawfastq/00_fastq'
echo ""
echo "# Location of Hisat2 index files"
echo "# Must keep variable name formatting, as it's used by HiSat2"
echo 'export HISAT2_INDEXES="${genome_index_dir}"'
echo "# Input files"
#echo 'export exons="${output_dir_top}/Apulchra-genome_hisat2_exons.tab"'
echo 'export genome_index_name="Montipora_capitata_HIv3"'
echo 'export genome_gff="${genome_dir}/Montipora_capitata_HIv3.genes_fixed.gff3"'
echo 'export genome_fasta="${genome_dir}/Montipora_capitata_HIv3.assembly.fasta"'
#echo 'export splice_sites="${output_dir_top}/Apulchra-genome_hisat2_splice_sites.tab"'
echo 'export transcripts_gtf="${genome_dir}/Montipora_capitata_HIv3.genes.gtf"'
echo "# Output files"
echo 'export gtf_list="${output_dir_align}/gtf_list.txt"'
echo 'export merged_bam="${output_dir_align}/sorted-bams-merged.bam"'
echo ""
echo "# Paths to programs"
echo 'export programs_dir="/home/shared"'
echo 'export hisat2_dir="${programs_dir}/hisat2-2.2.1"'
echo 'export hisat2="${hisat2_dir}/hisat2"'
echo 'export multiqc=/home/sam/programs/mambaforge/bin/multiqc'
echo 'export samtools="${programs_dir}/samtools-1.12/samtools"'
echo 'export prepDE="${programs_dir}/stringtie-2.2.1.Linux_x86_64/prepDE.py3"'
echo 'export stringtie="${programs_dir}/stringtie-2.2.1.Linux_x86_64/stringtie"'
echo ""
echo "# Set FastQ filename patterns"
echo "export R1_fastq_pattern='*_R1_*.fq.gz'"
echo "export R2_fastq_pattern='*_R2_*.fq.gz'"
echo "export trimmed_fastq_pattern='*fastp-trim.fq.gz'"
echo ""
echo "# Set number of CPUs to use"
echo 'export threads=40'
echo ""
echo "# Set average read length - for StringTie prepDE.py"
echo 'export read_length=150'
echo ""
echo "## Initialize arrays"
echo 'export fastq_array_R1=()'
echo 'export fastq_array_R2=()'
echo 'export R1_names_array=()'
echo 'export R2_names_array=()'
echo "declare -A sample_info_map"
echo ""
echo "# Programs associative array"
echo "declare -A programs_array"
echo "programs_array=("
echo '[hisat2]="${hisat2}" \'
echo '[multiqc]="${multiqc}" \'
echo '[prepDE]="${prepDE}" \'
echo '[samtools]="${samtools}" \'
echo '[stringtie]="${stringtie}" \'
echo ")"
echo ""
echo "# Print formatting"
echo 'export line="--------------------------------------------------------"'
echo ""
} > .bashvars
cat .bashvars
```
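Before running the alignment loop, the variables file can be sourced and a few of the exported paths echoed as a quick sanity check:

```{r, check-bashvars, engine = 'bash'}
# Load bash variables into memory
source .bashvars
# Spot-check a few of the exported paths
echo "${genome_index_dir}"
echo "${trimmed_fastqs_dir}"
echo "${output_dir_align}"
```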
# 4 \| Align reads using HISAT2
## Sam's Loop
This step requires the `mcap_RNAseq_simplified_metadata.csv` metadata file.
This step has a lengthy, semi-complex workflow:
1. Parse `mcap_RNAseq_simplified_metadata.csv` for *M. capitata* sample names, leachate treatments, and embryonic phase. This info will be used for downstream file naming and for passing the treatment variables to the read group (`SM:`) in the alignment file.
2. Loop through all samples and perform individual alignments using HISAT2.
3. HISAT2 output is piped through multiple `samtools` tools: flagstat (stats aggregation), sort (creates/sorts BAM), and index (creates BAM index). Piping saves time and disk space by avoiding the generation of large SAM files.
4. The loop continues and runs StringTie on the sorted BAM file to produce an individual GTF file.
5. The loop continues and adds the GTF path/filename to a list file, which will be used downstream.
```{r}
metadata <- read.csv("../metadata/mcap_RNAseq_simplified_metadata.csv")
head(metadata)
```
```{r, engine = 'bash'}
# Load bash variables into memory
source .bashvars
# Make output directories, if they don't exist
mkdir --parents "${output_dir_align}"
# Change to output directory
cd "${output_dir_align}"
# Create associative array keyed by sample name, holding each sample's treatment info
metadata="../../metadata/mcap_RNAseq_simplified_metadata.csv"
# Declare the associative array to store all the sample information
declare -A sample_info_map
# Read the metadata file line by line
while IFS=',' read -r sample_no sample_name organism collection_date pvc_leachate_level hours_post_fertilization plate well sample_type sample_buffer; do
# Add the sample name as the key and the combined info as the value
sample_info_map["${sample_name}"]="${collection_date},${pvc_leachate_level},${hours_post_fertilization}"
done < <(tail -n +2 "${metadata}") # Skip the header
## Populate trimmed reads arrays
fastq_array_R1=("${trimmed_fastqs_dir}"/${R1_fastq_pattern})
fastq_array_R2=("${trimmed_fastqs_dir}"/${R2_fastq_pattern})
############## BEGIN HISAT2 ALIGNMENTS ##############
for sample in "${!sample_info_map[@]}"
do
# Create and switch to dedicated sample directory
mkdir --parents "${sample}" && cd "$_"
# Create HISAT2 list of fastq R1 files
# and generate MD5 checksums file.
for fastq in "${fastq_array_R1[@]}"
do
  # Parse sample name from FastQ filename
  fastq_sample=$(echo "${fastq##*/}" | awk -F"[_]" '{print $1}')
  # Only process FastQ files whose name prefix matches the current sample
  if [ "${fastq_sample}" == "${sample}" ]; then
    # Generate checksum/list of input files used
    md5sum "${fastq}" >> input_fastqs_checksums.md5
    # Create comma-separated lists of FastQs for HISAT2
    printf -v joined_R1 '%s,' "${fastq}"
    fastq_list_R1=$(echo "${joined_R1%,}")
  fi
done
# Create HISAT2 list of fastq R2 files
# and generate MD5 checksums file.
for fastq in "${fastq_array_R2[@]}"
do
  # Parse sample name from FastQ filename
  fastq_sample=$(echo "${fastq##*/}" | awk -F"[_]" '{print $1}')
  # Only process FastQ files whose name prefix matches the current sample
  if [ "${fastq_sample}" == "${sample}" ]; then
    # Generate checksum/list of input files used
    md5sum "${fastq}" >> input_fastqs_checksums.md5
    # Create comma-separated lists of FastQs for HISAT2
    printf -v joined_R2 '%s,' "${fastq}"
    fastq_list_R2=$(echo "${joined_R2%,}")
  fi
done
# HISAT2 alignments
# Sets read group info (RG) using samples array
"${programs_array[hisat2]}" \
-x "${genome_index_name}" \
-1 "${fastq_list_R1}" \
-2 "${fastq_list_R2}" \
--threads "${threads}" \
--rg-id "${sample}" \
--rg "SM:""${sample_info_map[$sample]}" \
2> "${sample}"_hisat2.stats \
| tee >(${programs_array[samtools]} flagstat - > "${sample}"-hisat2_output.flagstat) \
| ${programs_array[samtools]} sort - -@ "${threads}" -O BAM \
| tee "${sample}".sorted.bam \
| ${programs_array[samtools]} index - "${sample}".sorted.bam.bai
# Run stringtie on alignments
# Uses "-B" option to output tables intended for use in Ballgown
# Uses "-e" option; recommended when using "-B" option.
# Limits analysis to only read alignments matching the reference.
"${programs_array[stringtie]}" "${sample}".sorted.bam \
-p "${threads}" \
-o "${sample}".gtf \
-G "${genome_gff}" \
-C "${sample}.cov_refs.gtf" \
-B \
-e
# Add GTFs to list file, only if non-empty
# Identifies GTF files that only have header
gtf_lines=$(wc -l < "${sample}".gtf )
if [ "${gtf_lines}" -gt 2 ]; then
echo "$(pwd)/${sample}.gtf" >> "${gtf_list}"
fi
# Generate checksums
find ./ -type f -not -name "*.md5" -exec md5sum {} \; > ${sample}_checksums.md5
# Move up to orig. working directory
cd ..
done
```
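Once the loop finishes, a quick look at the per-sample HISAT2 stats and sorted BAMs confirms each sample ran. A minimal sketch, assuming the per-sample output naming used in the loop above:

```{r, alignment-summary, engine = 'bash'}
# Load bash variables into memory
source .bashvars
cd "${output_dir_align}"
# Overall alignment rate reported by HISAT2 for each sample
grep "overall alignment rate" */*_hisat2.stats
# Confirm each sample produced a sorted, indexed BAM
ls -lh */*.sorted.bam */*.sorted.bam.bai
```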
## Broken into chunks

The chunks below repeat Sam's loop piece by piece so each step can be read on its own. Note that each bash chunk runs in a separate shell, so variables like `sample_info_map`, `sample`, and `fastq_list_R1` do not persist between chunks; to actually run the alignment, use the complete loop above.
```{r, engine = 'bash'}
# Load bash variables into memory
source .bashvars
# Make output directories, if they don't exist
mkdir --parents "${output_dir_align}"
# Change to output directory
cd "${output_dir_align}"
# Create associative array keyed by sample name, holding each sample's treatment info
metadata="../../metadata/mcap_RNAseq_simplified_metadata.csv"
# Declare the associative array to store all the sample information
declare -A sample_info_map
# Read the metadata file line by line
while IFS=',' read -r sample_no sample_name organism collection_date development_stage pvc_leachate_level hours_post_fertilization pvc_leachate_concentration_mg_L plate well sample_type sample_buffer; do
# Add the sample name as the key and the combined info as the value
sample_info_map["${sample_name}"]="${collection_date},${development_stage},${pvc_leachate_level},${hours_post_fertilization},${pvc_leachate_concentration_mg_L}"
done < <(tail -n +2 "${metadata}") # Skip the header
## Populate trimmed reads arrays
fastq_array_R1=("${trimmed_fastqs_dir}"/${R1_fastq_pattern})
fastq_array_R2=("${trimmed_fastqs_dir}"/${R2_fastq_pattern})
```
```{r, engine = 'bash'}
# Load bash variables into memory
source .bashvars
# Change to output directory
cd "${output_dir_align}"
############## BEGIN HISAT2 ALIGNMENTS ##############
for sample in "${!sample_info_map[@]}"
do
# Create and switch to dedicated sample directory
mkdir --parents "${sample}" && cd "$_"
# Create HISAT2 list of fastq R1 files
# and generate MD5 checksums file.
for fastq in "${fastq_array_R1[@]}"
do
  # Parse sample name from FastQ filename
  fastq_sample=$(echo "${fastq##*/}" | awk -F"[_]" '{print $1}')
  # Only process FastQ files whose name prefix matches the current sample
  if [ "${fastq_sample}" == "${sample}" ]; then
    # Generate checksum/list of input files used
    md5sum "${fastq}" >> input_fastqs_checksums.md5
    # Create comma-separated lists of FastQs for HISAT2
    printf -v joined_R1 '%s,' "${fastq}"
    fastq_list_R1=$(echo "${joined_R1%,}")
  fi
done
# Create HISAT2 list of fastq R2 files
# and generate MD5 checksums file.
for fastq in "${fastq_array_R2[@]}"
do
  # Parse sample name from FastQ filename
  fastq_sample=$(echo "${fastq##*/}" | awk -F"[_]" '{print $1}')
  # Only process FastQ files whose name prefix matches the current sample
  if [ "${fastq_sample}" == "${sample}" ]; then
    # Generate checksum/list of input files used
    md5sum "${fastq}" >> input_fastqs_checksums.md5
    # Create comma-separated lists of FastQs for HISAT2
    printf -v joined_R2 '%s,' "${fastq}"
    fastq_list_R2=$(echo "${joined_R2%,}")
  fi
done
```
```{r, engine = 'bash'}
# Load bash variables into memory
source .bashvars
# Change to output directory
cd "${output_dir_align}"
# HISAT2 alignments
# Sets read group info (RG) using samples array
"${programs_array[hisat2]}" \
-x "${genome_index_name}" \
-1 "${fastq_list_R1}" \
-2 "${fastq_list_R2}" \
--threads "${threads}" \
--rg-id "${sample}" \
--rg "SM:""${sample_info_map[$sample]}" \
2> "${sample}"_hisat2.stats \
| tee >(${programs_array[samtools]} flagstat - > "${sample}"-hisat2_output.flagstat) \
| ${programs_array[samtools]} sort - -@ "${threads}" -O BAM \
| tee "${sample}".sorted.bam \
| ${programs_array[samtools]} index - "${sample}".sorted.bam.bai
```
```{r, engine = 'bash'}
# Load bash variables into memory
source .bashvars
# Change to output directory
cd "${output_dir_align}"
# Run stringtie on alignments
# Uses "-B" option to output tables intended for use in Ballgown
# Uses "-e" option; recommended when using "-B" option.
# Limits analysis to only read alignments matching the reference.
"${programs_array[stringtie]}" "${sample}".sorted.bam \
-p "${threads}" \
-o "${sample}".gtf \
-G "${genome_gff}" \
-C "${sample}.cov_refs.gtf" \
-B \
-e
```
```{r, engine = 'bash'}
# Load bash variables into memory
source .bashvars
# Change to output directory
cd "${output_dir_align}"
# Add GTFs to list file, only if non-empty
# Identifies GTF files that only have header
gtf_lines=$(wc -l < "${sample}".gtf )
if [ "${gtf_lines}" -gt 2 ]; then
echo "$(pwd)/${sample}.gtf" >> "${gtf_list}"
fi
# Generate checksums
find ./ -type f -not -name "*.md5" -exec md5sum {} \; > ${sample}_checksums.md5
# Move up to orig. working directory
cd ..
done
```
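With the per-sample GTFs collected in `gtf_list.txt`, the next step referenced in the introduction is building count matrices with StringTie's `prepDE.py3` for DESeq2. A minimal sketch using the `prepDE`, `gtf_list`, and `read_length` variables already defined in `.bashvars` (flag names as documented for prepDE; adjust to the installed version):

```{r, prepDE-sketch, engine = 'bash'}
# Load bash variables into memory
source .bashvars
cd "${output_dir_align}"
# Build gene and transcript count matrices from the per-sample GTFs
python3 "${prepDE}" \
  -i "${gtf_list}" \
  -g gene_count_matrix.csv \
  -t transcript_count_matrix.csv \
  -l "${read_length}"
```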