---
title: "20-hisat2-genecount-matrices"
output: html_document
date: "2024-09-27"
---
Rmd to use `hisat2` to get gene count matrices for use in `DESeq2` and to annotate with gene IDs retrieved in [paper-pycno-sswd-2021-2022/code/19-get-gene-annotations.Rmd](https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/code/19-get-gene-annotations.Rmd)

Modelling after [paper-pycno-sswd-2021-2022/code/03-hisat2-summer_2021_2022.Rmd](https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/code/03-hisat2-summer_2021_2022.Rmd)

All trimmed reads from summer 2021 and 2022 live on Raven:       
Summer 2021     
`/home/shared/8TB_HDD_02/graceac9/data/pycno2021`

Summer 2022    
`/home/shared/8TB_HDD_02/graceac9/data/pycno2022` 

Get genome from NCBI into data directory on Raven:        
https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_032158295.1/       

```{r, engine='bash'}
cd ../data

```

```{r, engine='bash', eval=TRUE}
head ../data/augustus.hints.gtf

```


For `stringtie` later, the gtf needs to be in gff format. There isn't currently one in existence that works. Steven figured out how to fix this issue (https://sr320.github.io/tumbling-oysters/posts/sr320-20-seastar/)  

Won't run code because Steven did this already and the new file lives in:
` ../analyses/12-fix-gff/mod_augustus.gtf`

```{r}
library(knitr)
library(tidyverse)
library(Biostrings)
library(pheatmap)
library(DESeq2)
library(tidyverse)
```


# 1. BUILD AN INDEX

Follow this code: https://htmlpreview.github.io/?https://github.com/urol-e5/deep-dive/blob/8fd4ad4546d1d95464952f0509406efd9e42ffa0/D-Apul/code/04-Apulcra-hisat.html 

```{bash}
pwd
```

From the gtf, get exon list
```{bash}
/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \
../analyses/12-fix-gff/mod_augustus.gtf \
> ../analyses/20-hisat2/m_exon.tab
```

```{bash}
head ../analyses/20-hisat2/m_exon.tab
```


from the gtf, get the splice sites
```{bash}
#!/bin/bash

# This script will extract splice sites from the gtf file

# This is the command to extract splice sites from the gtf file
/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \
../analyses/12-fix-gff/mod_augustus.gtf \
> ../analyses/20-hisat2/m_splice_sites.tab

```

use the genome fasta to make an index for alignment
```{bash}
# build an index 
/home/shared/hisat2-2.2.1/hisat2-build \
../data/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna \
../data/Phel_genome_index \
--exon ../analyses/20-hisat2/m_exon.tab \
--ss ../analyses/20-hisat2/m_splice_sites.tab \
-p 40 \
../analyses/12-fix-gff/mod_augustus.gtf \
2> ../analyses/20-hisat2/hisat2-build_stats.txt
```

```{bash}
tail ../analyses/20-hisat2/hisat2-build_stats.txt
```

# 2. ALIGNMENT 2021 RNAseq Data


```{bash}
for file in ../data/2021_trimmed/*_R1*fq.gz; do
    # Remove the  part to get the base name
    base=$(basename "$file" _R1_001.fastq.gz.fastp-trim.20220810.fq.gz)
    # Construct the names of the pair of files
    file1=${base}_R1_001.fastq.gz.fastp-trim.20220810.fq.gz
    file2=${base}_R2_001.fastq.gz.fastp-trim.20220810.fq.gz
    # Run the hisat2 command
    /home/shared/hisat2-2.2.1/hisat2 \
    -x ../data/Phel_genome_index \
    --dta \
    -p 20 \
    -1 ../data/2021_trimmed/$file1 \
    -2 ../data/2021_trimmed/$file2 \
    -S ../analyses/20-hisat2/${base}.sam \
    2> ../analyses/20-hisat2/${base}-hisat.out
done
```


```{bash}
cat ../analyses/20-hisat2/*-hisat.out
```

```{bash}
pwd
```


Convert sam to bam
```{bash}
for samfile in ../analyses/20-hisat2/*.sam; do
  bamfile="${samfile%.sam}.bam"
  sorted_bamfile="${samfile%.sam}.sorted.bam"
  /home/shared/samtools-1.12/samtools view -bS -@ 20 "$samfile" > "$bamfile"
  /home/shared/samtools-1.12/samtools sort -@ 20 "$bamfile" -o "$sorted_bamfile"
  /home/shared/samtools-1.12/samtools index -@ 20 "$sorted_bamfile"
done
```

```{bash}
ls ../analyses/20-hisat2/*sorted.bam | wc -l
```
Woo! 32 .bam files converted from the 32 .sam files. 


```{bash}
ls ../analyses/20-hisat2/*sam | wc -l
```

```{r, engine='bash', eval=TRUE}
cat ../analyses/20-hisat2/*hisat.out \
| grep "overall alignment rate"
``` 


## `stringtie`
```{bash}
find ../analyses/20-hisat2/*sorted.bam \
| xargs basename -s .sorted.bam | xargs -I{} \
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 36 \
-eB \
-G ../data/augustus.hints.gtf \
-o ../analyses/20-hisat2/{}.gtf \
../analyses/20-hisat2/{}.sorted.bam
```

steven did this to get the gtf into gff for `stringtie` use
```{bash}
#/home/shared/gffread-0.12.7.Linux_x86_64/gffread \
#../analyses/12-fix-gff/mod_augustus.gtf \
#-T \
#-o ../analyses/13-hisat-deseq2/mod_augustus.gff
```


```{bash}
find ../analyses/20-hisat2/*sorted.bam \
| xargs basename -s .sorted.bam | xargs -I{} \
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 36 \
-eB \
-G ../analyses/13-hisat-deseq2/mod_augustus.gff \
-o ../analyses/20-hisat2/{}.gtf \
../analyses/20-hisat2/{}.sorted.bam
```

```{bash}
eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
conda activate
which multiqc


multiqc ../analyses/20-hisat2/ \
-o ../analyses/20-hisat2/multiqc2021/
```

## get a count matrix
```{bash}
ls ../analyses/20-hisat2/*gtf
```

copy the above and put into a txt file, add the sample ID before each path
called 2021.txt


```{bash}
cat ../analyses/20-hisat2/2021.txt
```

```{r, engine='bash'}
python /home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py \
-i ../analyses/20-hisat2/2021.txt \
-g ../data/gene_count_matrix_2021.csv \
-t ../data/transcript_count_matrix_2021.csv
```


```{bash}
head ../data/*matrix_2021.csv
```

# 3. ALIGNMENT 2022 RNAseq Data
pycno2022 trimmed data is in: `/home/shared/8TB_HDD_02/graceac9/data/pycno2022` on Raven
moving it into the data/2022_trimmed directory in this repo: 
```
rsync --archive --progress --verbose PSC*fq.gz graceac9@raven.fish.washington.edu:/home/shared/8TB_HDD_02/graceac9/GitHub/paper-pycno-sswd-2021-2022/data/2022_trimmed
```
run `rsync` while in the `/home/shared/8TB_HDD_02/graceac9/data/pycno2022` directory. 


```{bash}
for file in ../data/2022_trimmed/*_R1*fq.gz; do
    # Remove the  part to get the base name
    base=$(basename "$file" _R1_001.fastq.gz.fastp-trim.20231101.fq.gz)
    # Construct the names of the pair of files
    file1=${base}_R1_001.fastq.gz.fastp-trim.20231101.fq.gz
    file2=${base}_R2_001.fastq.gz.fastp-trim.20231101.fq.gz
    # Run the hisat2 command
    /home/shared/hisat2-2.2.1/hisat2 \
    -x ../data/Phel_genome_index \
    --dta \
    -p 20 \
    -1 ../data/2022_trimmed/$file1 \
    -2 ../data/2022_trimmed/$file2 \
    -S ../analyses/20-hisat2/2022/${base}.sam \
    2> ../analyses/20-hisat2/2022/${base}-hisat.out
done
```

```{bash}
cat ../analyses/20-hisat2/2022/*-hisat.out
```

```{bash}
ls ../analyses/20-hisat2/2022/*sam | wc -l
```

Convert sam to bam
```{bash}
for samfile in ../analyses/20-hisat2/2022/*.sam; do
  bamfile="${samfile%.sam}.bam"
  sorted_bamfile="${samfile%.sam}.sorted.bam"
  /home/shared/samtools-1.12/samtools view -bS -@ 20 "$samfile" > "$bamfile"
  /home/shared/samtools-1.12/samtools sort -@ 20 "$bamfile" -o "$sorted_bamfile"
  /home/shared/samtools-1.12/samtools index -@ 20 "$sorted_bamfile"
done
```

```{bash}
ls ../analyses/20-hisat2/2022/*sorted.bam | wc -l
```

woooo

```{r, engine='bash', eval=TRUE}
cat ../analyses/20-hisat2/2022/*hisat.out \
| grep "overall alignment rate"
``` 

## `stringtie`
```{bash}
find ../analyses/20-hisat2/2022/*sorted.bam \
| xargs basename -s .sorted.bam | xargs -I{} \
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 36 \
-eB \
-G ../data/augustus.hints.gtf \
-o ../analyses/20-hisat2/2022/{}.gtf \
../analyses/20-hisat2/2022/{}.sorted.bam
```

```{bash}
ls ../analyses/20-hisat2/2022/*sorted.bam | wc -l
```

```{bash}
find ../analyses/20-hisat2/2022/*sorted.bam \
| xargs basename -s .sorted.bam | xargs -I{} \
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 36 \
-eB \
-G ../analyses/13-hisat-deseq2/mod_augustus.gff \
-o ../analyses/20-hisat2/2022/{}.gtf \
../analyses/20-hisat2/2022/{}.sorted.bam
```


```{bash}
eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
conda activate
which multiqc


multiqc ../analyses/20-hisat2/2022/ \
-o ../analyses/20-hisat2/multiqc2022/
```

## get a count matrix
```{bash}
ls ../analyses/20-hisat2/2022/*gtf
```

copy the above and put into a txt file, add the sample ID before each path
called 2022.txt

```{bash}
cat ../analyses/20-hisat2/2022/2022.txt
```

```{r, engine='bash'}
python /home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py \
-i ../analyses/20-hisat2/2022/2022.txt \
-g ../data/gene_count_matrix_2022.csv \
-t ../data/transcript_count_matrix_2022.csv
```


```{bash}
head ../data/*matrix_2022.csv
```