---
title: "Step 4: Align sequence reads to annotated *M. capitata* genome"
subtitle: "Using `HISAT2`"
author: "Sarah Tanja"
date: 10/21/2024
format:
  gfm: default  # or html if you want to render in HTML
toc: true
toc-depth: 3
link-external-icon: true
link-external-newwindow: true
reference-location: margin
citation-location: margin
---

This notebook will align trimmed M. capitata RNA-seq data to the M. capitata genome using hierarchical indexing for spliced alignment of transcripts HISAT2 [@zhang_rapid_2021]. Followed by StringTie (Pertea et al. 2016, 2015) for transcript assembly/identification and count matrices for downstream expression analysis with DESeq2.

#### Inputs

-   Trimmed FastQ files, with format: `*fastp-trim.fq.gz` are located in `output/03_qaqc/trimmed_reads`

-   [`Montipora_capitata_HIv3.assembly.fasta`](https://owl.fish.washington.edu/halfshell/genomic-databank/Montipora_capitata_HIv3.assembly.fasta) (745MB)

    -   MD5 checksum: `99819eadba1b13ed569bb902eef8da08`

    -   Downloaded 2023017: <http://cyanophora.rutgers.edu/montipora/>

    -   [Version 3](http://cyanophora.rutgers.edu/montipora/) [@stephens2022]

-   [`Montipora_capitata_HIv3.genes.gff3`](https://owl.fish.washington.edu/halfshell/genomic-databank/Montipora_capitata_HIv3.genes.gff3) (67MB)

    -   MD5 checksum: `5f6b80ba2885471c8c1534932ccb7e84`

    -   Downloaded 2023017: <http://cyanophora.rutgers.edu/montipora/>

    -   [GFF](http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.gff3.gz) from Rutgers (or [GFF fixed](https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/raw/master/Mcap2020/Data/TagSeq/Montipora_capitata_HIv3.genes_fixed.gff3.gz) from AHuffmyer script [here](https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/blob/master/Mcap2020/Scripts/TagSeq/Genome_V3/fix_gff_format.Rmd) )

-   [`Montipora_capitata_HIv3.genes.gtf`](https://gannet.fish.washington.edu/Atumefaciens/20230127-mcap-gff_to_gtf/Montipora_capitata_HIv3.genes.gtf) (101MB)

    -   MD5 checksum: `ceef8eca945199415b23d2f1f0dd2066`

    -   Created 2023017: <https://robertslab.github.io/sams-notebook/2023/01/27/Data-Wrangling-M.capitata-Genome-GFF-to-GTF-Using-gffread.html>

-   Genome Indexes ([`HISAT2`](https://daehwankimlab.github.io/hisat2/))

    -   [`Montipora_capitata_HIv3-hisat2-indices.tar.gz`](https://gannet.fish.washington.edu/Atumefaciens/20230131-mcap-HIv3-hisat2-build-index/Montipora_capitata_HIv3-hisat2-indices.tar.gz) (tarball gzip; 1.2GB)

        -   MD5 checksum: `c8accb6c54e843198c776f0d6f0c603d`

        -   Needs to be unpacked before use!

::: callout-note
You can located [genomes, indexes, & feature tracks](https://robertslab.github.io/resources/Genomic-Resources/#montipora-capitata) for *M. capitata* in the Roberts Lab Handbook
:::

-   Sample metadata: `metadata/mcap_RNAseq_simplified_metadata.csv`

#### Outputs

# Download GTF

-   [`Montipora_capitata_HIv3.genes.gtf`](https://gannet.fish.washington.edu/Atumefaciens/20230127-mcap-gff_to_gtf/Montipora_capitata_HIv3.genes.gtf) (101MB)

    -   MD5 checksum: `ceef8eca945199415b23d2f1f0dd2066`

    -   Created 2023017: <https://robertslab.github.io/sams-notebook/2023/01/27/Data-Wrangling-M.capitata-Genome-GFF-to-GTF-Using-gffread.html>

```{r, gtf-download, engine='bash'}
cd ../input/genome

wget https://gannet.fish.washington.edu/Atumefaciens/20230127-mcap-gff_to_gtf/Montipora_capitata_HIv3.genes.gtf

```

```{r, gtf-checksum, engine='bash'}
# change to work in data genome directory
cd ../input/genome

# generate checksum for the hisat index zip file
md5sum *gtf*
```

# Download HISAT Index

-   [`Montipora_capitata_HIv3-hisat2-indices.tar.gz`](https://gannet.fish.washington.edu/Atumefaciens/20230131-mcap-HIv3-hisat2-build-index/Montipora_capitata_HIv3-hisat2-indices.tar.gz) (tarball gzip; 1.2GB)

    -   MD5 checksum: `c8accb6c54e843198c776f0d6f0c603d`

    -   Needs to be unpacked before use!

```{r, hisat-download, engine = 'bash'}
# change to work in data genome directory
cd ../input/hisat

# download the hisat index from Robert's Lab gannet server
[-f Montipora_capitata_HIv3-hisat2-indices.tar.gz] || wget https://gannet.fish.washington.edu/Atumefaciens/20230131-mcap-HIv3-hisat2-build-index/Montipora_capitata_HIv3-hisat2-indices.tar.gz
```

```{r, hisat-checksum, engine = 'bash'}
# change to work in data genome directory
cd ../input/hisat

# generate checksum for the hisat index zip file
md5sum *hisat2*
```

Unpack the tar.gz hisat index file using `tar -xvzf`

-   `-x`: Extracts the contents of the archive.

-   `-v`: Verbose, shows the files being extracted.

-   `-z`: Tells `tar` that the archive is compressed with `gzip` (for `.tar.gz` files).

-   `-f`: Specifies the file name of the archive to extract.

This command will extract the contents of the `.tar.gz` file into the current directory:

```{r, hisat-unzip, engine = 'bash'}
cd ../input/hisat
tar -xvzf Montipora_capitata_HIv3-hisat2-indices.tar.gz
```

# Load Libraries

```{r}

if ("tidyverse" %in% rownames(installed.packages()) == 'FALSE') install.packages('tidyverse')
if ("tidyverse" %in% rownames(installed.packages()) == 'FALSE') install.packages('tidyverse')

library(tidyverse)
library(R.utils)

```

# Load Metadata

```{r}

gff <- read.csv(file="../data/mcapgenome/Montipora_capitata_HIv3.genes.gff3", header=FALSE, sep="\t")

gff_fixed <- read.csv(file="../data/mcapgenome/Montipora_capitata_HIv3.genes_fixed.gff3", header=FALSE, sep="\t")
```

# Trimmed RNA-seq reads

> For now we're going to do a test run on just 10 samples from the early gastrula phase

7 samples from control samples at 14 hours post fertilization, gastrula (C14):

```{r, engine = 'bash'}
cd ../output/03_qaqc/trimmed_reads
ls *C14*fq.gz
```

5 from highest pollution concentration at 14 hours post fertilization, gastrula (H14):

```{r, engine = 'bash'}
cd ../output/03_qaqc/trimmed_reads
ls *H14*fq.gz
```

# Create a Bash variables file

This allows usage of Bash variables across R Markdown chunks.

## Name directory paths

```{r, engine = bash}
{
echo "#### Assign Variables ####"
echo ""
echo "# Data directories"
echo 'export project_dir=/media/4TB_JPG_ext/stanja/gitprojects/coral-embryo-ecotox'
echo 'export output_dir=${project_dir}/output'
echo 'export output_dir_align=${project_dir}/output/04_align'
echo 'export output_dir_trim=${project_dir}/output/03_qaqc/trimmed_reads'
echo ""
} > .bashvars

cat .bashvars
```

# Explore genome files

## Fasta

Here is the fasta file

```{r, locate-fasta, engine='bash'}

ls  ../input/genome/Montipora_capitata_HIv3.assembly.fasta

```

## GFF

Here is the gff file provided by rutgers

```{r, explore-gff, engine='bash'}

head ../input/genome/Montipora_capitata_HIv3.genes.gff3
```

Here is the modified gff rutgers file generated by A. Huffmyer. Note that the gff-fixed file generated by A. Huffmyer [here](https://github.com/AHuffmyer/EarlyLifeHistory_Energetics/blob/master/Mcap2020/Scripts/TagSeq/Genome_V3/fix_gff_format.Rmd) *adds transcript and gene id into GFF file for alignment... adding transcript_id= and gene_id= to 'gene' column because we need that label to map our data"*

```{r, explore-gff-fixed, engine='bash'}

head -n 5 ../input/genome/Montipora_capitata_HIv3.genes_fixed.gff3
```

## GTF

```{r, gtf-head, engine='bash'}
head -n 5 ../input/genome/Montipora_capitata_HIv3.genes.gtf
```

```{r, gtf-gene-count, engine='bash'}
grep -c "CDS" ../input/genome/Montipora_capitata_HIv3.genes.gtf
```

```{r, gtf-exon-count, engine='bash'}
grep -c "exon" ../input/genome/Montipora_capitata_HIv3.genes.gtf
```

```{r, gtf-transcript-count, engine='bash'}
grep -c "transcript" ../input/genome/Montipora_capitata_HIv3.genes.gtf
```

```{r, gtf-gene-count, engine='bash'}
grep -c "gene" ../input/genome/Montipora_capitata_HIv3.genes.gtf
```

# Extract exons

```{r, exon, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \
../input/genome/Montipora_capitata_HIv3.genes.gtf \
> ../output/04_align/exon.tab
```

```{r, exon-head, engine='bash'}
head ../output/04_align/exon.tab
wc -l ../output/04_align/exon.tab #count the number of files 
```

# Extract splice sites

```{r, extract-splices, engine='bash'}

/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \
../input/genome/Montipora_capitata_HIv3.genes.gtf \
> ../output/04_align/splice_sites.tab
```

```{r, exon-head, engine='bash'}
head ../output/04_align/splice_sites.tab
wc -l ../output/04_align/splice_sites.tab #count the number of files 
```

# Test HISAT Alignment & make sams

Align only the early gastrula control group (this is all files in `../output/03_qaqc/trimmed_reads/` that have a `*C14*R2_001.fastp-trim.fq.gz` filename pattern)

The following code lines explained:

1.  `find ../output/03_qaqc/trimmed_reads/*C14*R2_001.fastp-trim.fq.gz` : This finds all reverse read files (\_R2_001.fastp-trim.fq.gz) that match the filename pattern in the `../output/03_qaqc/trimmed_reads` directory.

2.  `xargs -n 1 basename -s _R2_001.fastp-trim.fq.gz`: basename -s removes the suffix \_R2_001.fastp-trim.fq.gz from each file, leaving only the sample identifier (e.g., 101112C14, 123C14, etc.).

3.  `xargs -I{}`: The xargs -I{} loop replaces {} with each sample identifier to construct the hisat2 command for each sample.

4.  `/home/shared/hisat2-2.2.1/hisat2`: This is the HISAT2 command for aligning reads to a reference genome.

5.  `-x ../input/hisat/Montipora_capitata_HIv3`: Specifies the path to the HISAT2 index for the reference genome.[^1]

6.  `--dta`: This option enables HISAT2's "Downstream Transcriptome Analysis" mode, which optimizes output for transcriptome assembly and gene-level analyses.

7.  `-p 20`: Specifies that the command should use 20 CPU threads to speed up the alignment process.

8.  `-1 ../output/03_qaqc/trimmed_reads/{}_R1_001.fastp-trim.fq.gz`: Uses {} to insert the sample identifier and specify the path to the corresponding forward read file (\_R1_001.fastp-trim.fq.gz).

9.  `-2 ../output/03_qaqc/trimmed_reads{}_R2_001.fastp-trim.fq.gz`: Uses {} to specify the reverse read file (\_R2_001.fastp-trim.fq.gz).

10. `-S ../output/15-Apul-hisat/{}.sam`: Specifies the output path for the alignment results in SAM format, using `{}` to name the file after each sample identifier.

11. `2> ../output/15-Apul-hisat/hisat.out`: Redirects any error messages or log information from `hisat2` to a file named `hisat.out`, where you can check for alignment statistics or troubleshooting information.

[^1]: When a HISAT2 index is split into multiple files with extensions like `.1.ht2`, `.2.ht2`, up to `.8.ht2`, the `-x` option in the code for HISAT2 automatically detects and loads all parts of a multi-file index as long as you specify the **base name**

This modification should correctly align each sample's forward and reverse reads using the HISAT2 index. It will result in one sam file per pair of forward and reverse reads.

::: callout-caution
SAM files are very large (these are each \~20GB!) and take a long time to generate... look up ways to speed this up for running larger batches of files
:::

```{r, align-test, engine='bash'}
find ../output/03_qaqc/trimmed_reads/*C14*R2_001.fastp-trim.fq.gz \
| xargs -n 1 basename -s _R2_001.fastp-trim.fq.gz | xargs -I{} \
/home/shared/hisat2-2.2.1/hisat2 \
-x ../input/hisat/Montipora_capitata_HIv3 \
--dta \
-p 20 \
-1 ../output/03_qaqc/trimmed_reads/{}_R1_001.fastp-trim.fq.gz \
-2 ../output/03_qaqc/trimmed_reads/{}_R2_001.fastp-trim.fq.gz \
-S ../output/04_align/sam/{}.sam \
2> ../output/04_align/hisat.out
```

```{r, align-testH14, engine='bash'}

find ../output/03_qaqc/trimmed_reads/*H14*R2_001.fastp-trim.fq.gz \
| xargs -n 1 basename -s _R2_001.fastp-trim.fq.gz | xargs -I{} \
/home/shared/hisat2-2.2.1/hisat2 \
-x ../input/hisat/Montipora_capitata_HIv3 \
--dta \
-p 20 \
-1 ../output/03_qaqc/trimmed_reads/{}_R1_001.fastp-trim.fq.gz \
-2 ../output/03_qaqc/trimmed_reads/{}_R2_001.fastp-trim.fq.gz \
-S ../output/04_align/sam/{}.sam \
2> ../output/04_align/hisat.out

```

# Look at alignment stats

```{r, peep-align, engine='bash'}

cat ../output/04_align/hisat.out
```

Use MultiQc to make a html report of alignment stats in the `hisat.out` file

```{r, engine='bash'}
cd ../output/04_align

/home/sam/programs/mambaforge/bin/multiqc hisat.out
```

# Convert to bams

```{r, sam-to-bam, engine='bash'}

for samfile in ../output/04_align/sam/*.sam; do 

# Define the base filename without path and extension 
base_name=$(basename "${samfile%.sam}")

# Set the paths for the BAM and sorted BAM files in the new output directory 
bamfile="../output/04_align/bam/${base_name}.bam"  sorted_bamfile="../output/04_align/bam/${base_name}.sorted.bam"

# Convert SAM to BAM 
/home/shared/samtools-1.12/samtools view -bS -@ 20 "$samfile" > "$bamfile"

# Sort the BAM file and save it in the desired directory 
/home/shared/samtools-1.12/samtools sort -@ 20 "$bamfile" -o "$sorted_bamfile"

# Index the sorted BAM file 
/home/shared/samtools-1.12/samtools index -@ 20 "$sorted_bamfile"

# Optionally, remove the unsorted BAM file to save space 
rm "$bamfile" done
```

# Summary & Next Steps

We generated alignment files in this step that we will assemble using `StringTie` in step 5, `05_assemble`

::: callout-important
###### Don't forget to always rsync backup!

```         
rsync -avz /media/4TB_JPG_ext/stanja/gitprojects\
stanja\@gannet.fish.washington.edu:/volume2/web/stanja/ravenbackup
```
:::

#