---
title: "12-hisat2"
output: html_document
date: "2025-02-18"
---
Rmd to get gene and transcript count matrices for _Pycnopodia helianthiodes_ from day 12 from the multispecies experiment.
Modelling after [paper-pycno-sswd-2021-2022/code/03-hisat2-summer_2021_2022.Rmd](https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/code/03-hisat2-summer_2021_2022.Rmd)
Untrimmed reads:
`/home/shared/8TB_HDD_02/graceac9/multispecies2023`
Trimmed reads:
`/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/output/11-multi-fastp/phel`
Only want RNAseq files for the _P. helianthoides_:
| Library_ID | Bin_Number |
|------------|------------|
| 519 | 1 |
| 525 | 2 |
| 531 | 3 |
| 537 | 4 |
| 549 | 6 |
| 561 | 8 |
# Key Files
Key files and genomes can be found linked in this wikipage: [project-pycno-multispecies-2023/wiki/Key-Files](https://github.com/grace-ac/project-pycno-multispecies-2023/wiki/Key-Files)
notes from [paper-pycno-sswd-2021-2022/code/20-hisat2-genecount-matrices.Rmd](https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/code/20-hisat2-genecount-matrices.Rmd)
For `stringtie` later, the gtf needs to be in gff format. There isn't currently one in existence that works. Steven figured out how to fix this issue (https://sr320.github.io/tumbling-oysters/posts/sr320-20-seastar/)
Won't run code because Steven did this already and new file lives in:
`/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/data/mod_augustus.gtf`
Need to get the genome fasta into `/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/data/`
```{r, engine='bash'}
cd ../data
/home/shared/datasets download genome accession GCA_032158295.1 --include gff3,rna,cds,protein,genome,seq-report
```
```{r, engine='bash'}
cd ../data
unzip ncbi_dataset.zip
```
```{r, engine='bash', eval=TRUE}
ls ../data/ncbi_dataset/data/GCA_032158295.1
```
```{r, engine='bash', eval=TRUE}
head ../data/mod_augustus.gtf
head ../data/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna
```
# LOAD PACKAGES
```{r}
library(knitr)
library(tidyverse)
library(Biostrings)
library(pheatmap)
library(DESeq2)
library(tidyverse)
```
# 1. BUILD AN INDEX
Follow this code: https://htmlpreview.github.io/?https://github.com/urol-e5/deep-dive/blob/8fd4ad4546d1d95464952f0509406efd9e42ffa0/D-Apul/code/04-Apulcra-hisat.html
```{bash}
pwd
```
From the gtf, get exon list
```{bash}
/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \
../data/mod_augustus.gtf \
> ../output/12-hisat2_pycno/m_exon.tab
```
```{bash}
head ../output/12-hisat2_pycno/m_exon.tab
```
from the gtf, get the splice sites
```{bash}
#!/bin/bash
# This script will extract splice sites from the gtf file
# This is the command to extract splice sites from the gtf file
/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \
../data/mod_augustus.gtf \
> ../output/12-hisat2_pycno/m_splice_sites.tab
```
use the genome fasta to make an index for alignment
```{bash}
# build an index
/home/shared/hisat2-2.2.1/hisat2-build \
../data/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna \
../data/Phel_genome_index \
--exon ../output/12-hisat2_pycno/m_exon.tab \
--ss ../output/12-hisat2_pycno/m_splice_sites.tab \
-p 40 \
../data/mod_augustus.gtf \
2> ../output/12-hisat2_pycno/hisat2-build_stats.txt
```
re-run without `../data/mod_augustus.gtf` and get new output files to then run an md5 to compare:
```{bash}
# build an index
/home/shared/hisat2-2.2.1/hisat2-build \
../data/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna \
../data/Phel_genome_index_nogtf \
--exon ../output/12-hisat2_pycno/m_exon.tab \
--ss ../output/12-hisat2_pycno/m_splice_sites.tab \
-p 40 \
2> ../output/12-hisat2_pycno/hisat2-build_stats_nogtf.txt
```
Cehcking to see if files are the same:
```{bash}
md5sum ../data/Phel_genome_index_nogtf.1.ht2
md5sum ../data/Phel_genome_index.1.ht2
```
THEY ARE!
```{bash}
tail ../output/12-hisat2_pycno/hisat2-build_stats.txt
```
Need to move the RNAseq files for _P. helianthoides_ into own folder:
now live in: `/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/output/11-multi-fastp/phel`
```{bash}
for file in ../output/11-multi-fastp/phel/*_R1*fq.gz; do
# Remove the part to get the base name
base=$(basename "$file" _R1.fastp-trim.fq.gz)
# Construct the names of the pair of files
file1=${base}_R1.fastp-trim.fq.gz
file2=${base}_R2.fastp-trim.fq.gz
# Run the hisat2 command
/home/shared/hisat2-2.2.1/hisat2 \
-x ../data/Phel_genome_index \
--dta \
-p 20 \
-1 ../output/11-multi-fastp/phel/$file1 \
-2 ../output/11-multi-fastp/phel/$file2 \
-S ../output/12-hisat2_pycno/${base}.sam \
2> ../output/12-hisat2_pycno/${base}-hisat.out
done
```
```{bash}
cat ../output/12-hisat2_pycno/*-hisat.out
```
```{bash}
pwd
```
Convert sam to bam
```{bash}
for samfile in ../output/12-hisat2_pycno/*.sam; do
bamfile="${samfile%.sam}.bam"
sorted_bamfile="${samfile%.sam}.sorted.bam"
/home/shared/samtools-1.12/samtools view -bS -@ 20 "$samfile" > "$bamfile"
/home/shared/samtools-1.12/samtools sort -@ 20 "$bamfile" -o "$sorted_bamfile"
/home/shared/samtools-1.12/samtools index -@ 20 "$sorted_bamfile"
done
```
```{bash}
ls ../output/12-hisat2_pycno/*sorted.bam | wc -l
```
```{r, engine='bash', eval=TRUE}
cat ../output/12-hisat2_pycno/*hisat.out \
| grep "overall alignment rate"
```
alignment rates look good!!
do this from Steven to get gtf into gff:
steven did this to get the gtf into gff for `stringtie` use
```{bash}
/home/shared/gffread-0.12.7.Linux_x86_64/gffread \
../data/mod_augustus.gtf \
-T \
-o ../data/mod_augustus.gff
```
## `stringtie`
```{bash}
find ../output/12-hisat2_pycno/*sorted.bam \
| xargs basename -s .sorted.bam | xargs -I{} \
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 36 \
-eB \
-G ../data/mod_augustus.gtf \
-o ../output/12-hisat2_pycno/{}.gtf \
../output/12-hisat2_pycno/{}.sorted.bam
```
get a multiqc file on the alignment:
```{bash}
eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
conda activate
which multiqc
multiqc ../output/12-hisat2_pycno/ \
-o ../output/12-hisat2_pycno/multiqc_pycno/
```
navigate into folder: `~/GitHub/project-pycno-multispecies-2023/output/12-hisat2_pycno/multiqc_pycno`
copy alignment multiqc report to OWL:
`rrsync --archive --progress --verbose multiqc_report_1.html grace@owl.fish.washington.edu:/volume1/web/gcrandall/multispeciesSSWD/QCreports`
on Owl-- rename report to pycno_2023_alignment_multiqc_report.html
# GET A COUNT MATRIX
```{bash}
ls ../output/12-hisat2_pycno/*gtf
```
copy the above and put into a txt file, add the sample ID before each path
called pycno2023.txt
```{bash}
cat ../output/12-hisat2_pycno/pycno2023.txt
```
```{bash}
head -5 ../output/12-hisat2_pycno/PSC-0519.gtf
```
```{r, engine='bash'}
python /home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py \
-i ../output/12-hisat2_pycno/pycno2023.txt \
-g ../data/pycno_gene_count_matrix_2023.csv \
-t ../data/pycno_transcript_count_matrix_2023.csv
```
```{bash}
head ../data/pycno*matrix_2023.csv
```