---
title: "02-align-hisat"
author: "Sarah Tanja"
date: "`r format(Sys.time(), '%d %B, %Y')`" 
editor: visual
format:
  gfm: default
  html:
    df-print: paged
    toc: true
    smooth-scroll: true
    link-external-icon: true
    link-external-newwindow: true
    code-fold: show
    code-tools: true
    code-copy: true
    highlight-style: arrow
    code-overflow: wrap
    theme:
      light: sandstone
      dark: vapor
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE, warning = FALSE, message = FALSE)
```
## Overview


- sample sequence files located at absolute path `/home/shared/8TB_HDD_01/mcap/` 
- reference genome V3 files located at `../data/`


Look at the GFF (generic feature format) genomic annotation files and the FASTA genome file
```{r, engine='bash'}
cd ../data
tail -n 10 Montipora_capitata_HIv3.genes.gff3
tail -n 10 Montipora_capitata_HIv3.genes_fixed.gff3
tail -n 10 Montipora_capitata_HIv3.assembly.fasta
```

## Build an index file from the reference genome with HISAT2

::: callout-info title="ChatGPT Info on 'Indexing with HISAT2'"
An index file built from a genome is a data structure that enables fast searching and retrieval of information from the genome. Here we use `HISAT2` to break down the genome into smaller fragments and build an index file that stores the location and sequence of these fragments. This index file is created using a reference genome as a template.

HISAT2 (Hierarchical Indexing for Spliced Alignment of Transcripts 2) is a popular RNA-seq aligner that uses a type of index file called a Hierarchical Graph FM Index (HGFM).

The HGFM index is an extension of the FM-index data structure that is used in many other aligners such as Bowtie, BWA, and HISAT. The HGFM index builds on the FM-index by adding a hierarchical structure that allows for more efficient searching of splice junctions in RNA-seq data.

In HISAT2, the HGFM index is constructed from a reference genome and a gene annotation file, which together define the locations of splice junctions. The HGFM index stores a hierarchical graph of the reference genome that includes both the exonic and intronic regions, along with the splice junctions between them. This allows HISAT2 to align RNA-seq reads across splice junctions with high sensitivity and accuracy.

Overall, the HGFM index used by HISAT2 is optimized for aligning RNA-seq reads, which have unique challenges compared to genomic DNA sequencing. By incorporating splice junction information into the index, HISAT2 can accurately align RNA-seq reads to the genome and help identify gene expression and alternative splicing events.
:::

When using HISAT2's hisat2-build command to build an index from a genome, the input file should be a FASTA file containing the reference genome sequence. in this case, `Montipora_capitata_HIv3.assembly.fasta`. The `-f` option soecifies the input is a fasta file (which can come in the fasta, fna, or fa file extension format. We then title and create an index `Mcapv3_assembly.index`  which is stored into the output folder.
```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2-build \
-f ../data/Montipora_capitata_HIv3.assembly.fasta \
../output/Mcapv3_assembly.index
```

Or should I use the cds.fna file to build the index?
CDS coding sequences .fna denotes FASTA Nucleotides file...
Build index with cds.fna as well... 
```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2-build \
-f ../data/Montipora_capitata_HIv3.genes.cds.fna \
../output/Mcapv3_cds.index
```

# Align to Index (making SAM files)

```
```{r, engine='bash'}
hisat2 -x path/to/hisat2_index/genome\
-U path/to/read1.fastq\
-S output.sam\
--threads num_threads
```
```

- specify the path to the single-end fastq file with the -U option

```{r, engine='bash'}
# Set input and output directories
input_dir=/home/shared/8TB_HDD_01/mcap
output_dir=../output

# Set Hisat2 options
hisat2_options="-x /home/shared/8TB_HDD_02/stanja/sarahtanja/sarahtanja-coralRNA/output/Mcapv3_assembly.index --threads 7"

# Loop through fastq files in input directory
for f in ${input_dir}/*; do
    # Get base filename
    base=$(basename ${f} .fastq)
    
    # Align with Hisat2
    /home/shared/hisat2-2.2.1/hisat2 \
    ${hisat2_options} \
    -U ${f} \
    -S ${output_dir}/${base}.sam \
    
done

```

## Convert SAM file to sorted BAM file

The following code chunk will iteratively take all sam files in the output directory and convert them to bam files.
```{r, engine='bash'}
#!/bin/bash

# Directory containing SAM files
sam_dir="../output/"

# Iterate over all SAM files in the directory
for sam_file in ${sam_dir}*.sam
do
    # Generate output file names
    base_name=$(basename ${sam_file} .sam)
    bam_file="${sam_dir}${base_name}.sorted.bam"
    index_file="${bam_file}.bai"

    # Convert SAM to BAM and sort
    /home/shared/samtools-1.12/samtools view \
    -@ 7 \
    -Su ${sam_file} \
    | /home/shared/samtools-1.12/samtools sort - \
    -@ 7 \
    -o ${bam_file}

    # Index BAM file
    /home/shared/samtools-1.12/samtools index ${bam_file}
done

```


## Generate a reference transcriptome annotation to input into trinity?
I can't make abundance estimates until I have gtf it seems. Here is an example of the code chunk that ChatGPT spit out for me. 

```{r, engine='bash'}
 #!/bin/bash

# Set input and output directories
input_dir=../output
output_dir=../output/stringtie

# Loop through the sorted BAM files in input directory
for f in ${input_dir}/*.sorted.bam; do
    # Get base filename
    base=$(basename ${f} .sorted.bam)

    # Run StringTie to estimate transcript abundances, botznik gff not accepted as a valid reference transcript... :(
    /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
    ${input_dir}/${base}.sorted.bam \
    -G ../data/genome-alignment/botznik-chr-all.gff \
    -o ${output_dir}/${base}.gtf \
    -p 7

done

# Merge GTF files with StringTie
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
--merge ${output_dir}/mergelist.txt \
-G ../data/botznik-chr-ctg.gff \
-o ${output_dir}/merged.gtf \
-p 7

```