---
title: "13-hisat2_pisaster"
output: html_document
date: "2025-02-20"
---
Rmd to get gene and transcript count matrices for _Pisaster ochraceus_ from day 12 from the multispecies experiment. 

Modelling after [paper-pycno-sswd-2021-2022/code/03-hisat2-summer_2021_2022.Rmd](https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/code/03-hisat2-summer_2021_2022.Rmd)

Untrimmed reads:    
`/home/shared/8TB_HDD_02/graceac9/multispecies2023`

Trimmed reads:     
`/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/output/11-multi-fastp/pisaster`    

Only want the libraries from samples from _P. ochraceus_ at Day 12:  
| Library_ID | Bin_Number |
|------------|------------|
| 518        | 1          |
| 524        | 2          |
| 530        | 3          |
| 536        | 4          |
| 548        | 6          |
| 560        | 8          |

# KEY FILES
Key files and genomes can be found linked in this wikipage: [project-pycno-multispecies-2023/wiki/Key-Files](https://github.com/grace-ac/project-pycno-multispecies-2023/wiki/Key-Files)

Need to get genome fasta into `/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/data/`

Downloaded from https://ucmerced.app.box.com/s/uyq6kx52awjuiy0mfepcrwowt44yjqpf onto my laptop. 

Then `rsync` from my personal laptop onto raven into the `/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/data/`:

navigate in terminal on laptop to my Downloads folder, then run:     
`rsync --archive --progress --verbose jordan-uni3279-mb-hirise-slc3v__06-07-2022__final_assembly.fasta graceac9@raven.fish.washington.edu:/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/data`

File is now in data, and it is called: `jordan-uni3279-mb-hirise-slc3v__06-07-2022__final_assembly.fasta`

```{bash}
head ../data/jordan-uni3279-mb-hirise-slc3v__06-07-2022__final_assembly.fasta
```


Get the .gtf file into `../data/`

Download from [https://ucmerced.app.box.com/s/vsqqkvbl0ie39nu1457efi5emln7qggu](https://ucmerced.app.box.com/s/vsqqkvbl0ie39nu1457efi5emln7qggu)   
file called augustus.hints.gtf.gz 

rename to add pisaster: pisaster_augustus.hints.gtf.gz 

`rsync` from laptop to raven `../data/` folder: 
`rsync --archive --progress --verbose pisaster_augustus.hints.gtf.gz graceac9@raven.fish.washington.edu:/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-multispecies-2023/data`

unzip the file: 
```{r, engine='bash'}
#cd ../data 
#gunzip pisaster_augustus.hints.gtf.gz
```

```{bash}
head ../data/pisaster_augustus.hints.gtf
```

# LOAD PACKAGES

```{r}
library(knitr)
library(tidyverse)
library(Biostrings)
library(pheatmap)
library(DESeq2)
library(tidyverse)
```

# 1. BUILD AN INDEX

Follow this code: https://htmlpreview.github.io/?https://github.com/urol-e5/deep-dive/blob/8fd4ad4546d1d95464952f0509406efd9e42ffa0/D-Apul/code/04-Apulcra-hisat.html 

```{bash}
pwd
```

From the gtf, get exon list
```{bash}
#/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \
#../data/pisaster_augustus.hints.gtf \
#> ../output/13-hisat2_pisaster/m_exon.tab
```

```{bash}
head ../output/13-hisat2_pisaster/m_exon.tab
```

from the gtf, get the splice sites
```{bash}
#!/bin/bash

# This script will extract splice sites from the gtf file

# This is the command to extract splice sites from the gtf file
/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \
../data/pisaster_augustus.hints.gtf \
> ../output/13-hisat2_pisaster/m_splice_sites.tab

```

use the genome fasta to make an index for alignment
```{bash}
# build an index 
/home/shared/hisat2-2.2.1/hisat2-build \
../data/jordan-uni3279-mb-hirise-slc3v__06-07-2022__final_assembly.fasta \
../data/Pisaster_genome_index \
--exon ../output/13-hisat2_pisaster/m_exon.tab \
--ss ../output/13-hisat2_pisaster/m_splice_sites.tab \
-p 40 \
../data/pisaster_augustus.hints.gtf \
2> ../output/13-hisat2_pisaster/hisat2-build_stats.txt
```

```{bash}
tail ../output/13-hisat2_pisaster/hisat2-build_stats.txt
```

```{bash}
for file in ../output/11-multi-fastp/pisaster/*_R1*fq.gz; do
    # Remove the  part to get the base name
    base=$(basename "$file" _R1.fastp-trim.fq.gz)
    # Construct the names of the pair of files
    file1=${base}_R1.fastp-trim.fq.gz
    file2=${base}_R2.fastp-trim.fq.gz
    # Run the hisat2 command
    /home/shared/hisat2-2.2.1/hisat2 \
    -x ../data/Pisaster_genome_index \
    --dta \
    -p 20 \
    -1 ../output/11-multi-fastp/pisaster/$file1 \
    -2 ../output/11-multi-fastp/pisaster/$file2 \
    -S ../output/13-hisat2_pisaster/${base}.sam \
    2> ../output/13-hisat2_pisaster/${base}-hisat.out
done
```

```{bash}
cat ../output/13-hisat2_pisaster/*-hisat.out
```

```{bash}
pwd
```

Convert sam to bam
```{bash}
for samfile in ../output/13-hisat2_pisaster/*.sam; do
  bamfile="${samfile%.sam}.bam"
  sorted_bamfile="${samfile%.sam}.sorted.bam"
  /home/shared/samtools-1.12/samtools view -bS -@ 20 "$samfile" > "$bamfile"
  /home/shared/samtools-1.12/samtools sort -@ 20 "$bamfile" -o "$sorted_bamfile"
  /home/shared/samtools-1.12/samtools index -@ 20 "$sorted_bamfile"
done
```

```{bash}
ls ../output/13-hisat2_pisaster/*sorted.bam | wc -l
```

```{r, engine='bash', eval=TRUE}
cat ../output/13-hisat2_pisaster/*hisat.out \
| grep "overall alignment rate"
```


alignment rates look good!!

do this from Steven to get gtf into gff: 
steven did this to get the gtf into gff for `stringtie` use
```{bash}
/home/shared/gffread-0.12.7.Linux_x86_64/gffread \
../data/pisaster_augustus.hints.gtf \
-T \
-o ../data/pisaster_augustus.hints.gff
```

## `stringtie`
```{bash}
find ../output/13-hisat2_pisaster/*sorted.bam \
| xargs basename -s .sorted.bam | xargs -I{} \
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 36 \
-eB \
-G ../data/pisaster_augustus.hints.gff \
-o ../output/13-hisat2_pisaster/{}.gff \
../output/13-hisat2_pisaster/{}.sorted.bam
```

get a multiqc file on the alignment:
```{bash}
eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
conda activate
which multiqc


multiqc ../output/13-hisat2_pisaster/ \
-o ../output/13-hisat2_pisaster/multiqc_pisaster/
```

navigate into folder: `~/GitHub/project-pycno-multispecies-2023/output/13-hisat2_pisaster/multiqc_pisaster`    
copy alignment multiqc report to OWL: 
`rsync --archive --progress --verbose multiqc_report.html grace@owl.fish.washington.edu:/volume1/web/gcrandall/multispeciesSSWD/QCreports`

on Owl-- rename report to pisaster_2023_alignment_multiqc_report.html

# GET A COUNT MATRIX
```{bash}
ls ../output/13-hisat2_pisaster/*gff
```

copy the above and put into a txt file, add the sample ID before each path
called pisaster2023.txt

```{bash}
cat ../output/13-hisat2_pisaster/pisaster2023.txt
```

```{bash}
head -5 ../output/13-hisat2_pisaster/PSC-0518.gff
tail -5 ../output/13-hisat2_pisaster/PSC-0518.gff
```


```{r, engine='bash'}
python /home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py \
-i ../output/13-hisat2_pisaster/pisaster2023.txt \
-g ../data/pisaster_gene_count_matrix_2023.csv \
-t ../data/pisaster_transcript_count_matrix_2023.csv
```

```{bash}
head ../data/pisaster*matrix_2023.csv
```