---
title: "A. pulchra HISAT2"
author: Steven Roberts
date: "`r format(Sys.time(), '%d %B, %Y')`"  
output: 
  html_document:
    theme: readable
    highlight: zenburn
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
---

```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,        # Do not evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  fig.width = 6,       # Set plot width in inches
  fig.height = 4,      # Set plot height in inches
  fig.align = "center" # Align plots to the center
)
```

# A. millepora genome prep

```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \
../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gtf \
> ../output/04-Apulcra-hisat/m_exon.tab
```

```{r, engine='bash'}
head ../output/04-Apulcra-hisat/m_exon.tab

```
```{r, engine='bash'}
# Extract splice sites from the GTF file.
# Input: the A. millepora GTF annotation.
# Output: a splice-site table for hisat2-build (--ss).
/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \
../data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gtf \
> ../output/04-Apulcra-hisat/m_splice_sites.tab
```

```
"${programs_array[hisat2_build]}" \
"${genome_fasta}" \
"${genome_index_name}" \
--exon "${exons}" \
--ss "${splice_sites}" \
-p "${threads}" \
2> hisat2-build_stats.txt
```
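
The template above leaves its variables undefined. The following is a minimal sketch of how they might be filled in, assuming the paths used elsewhere in this notebook. The variable values and the dry-run `printf` are illustrative assumptions, not settings taken from the original pipeline. It needs Bash 4 or later for the associative array.

```bash
# Hypothetical values for the template variables; paths are the ones used in this notebook.
declare -A programs_array=(
  [hisat2_build]="/home/shared/hisat2-2.2.1/hisat2-build"
)
genome_fasta="../data/Amil/ncbi_dataset/data/GCF_013753865.1/GCF_013753865.1_Amil_v2.1_genomic.fna"
genome_index_name="../output/04-Apulcra-hisat/GCF_013753865.1_Amil_v2.1"
exons="../output/04-Apulcra-hisat/m_exon.tab"
splice_sites="../output/04-Apulcra-hisat/m_splice_sites.tab"
threads=40

# Assemble the command as an array so each argument stays intact.
build_cmd=(
  "${programs_array[hisat2_build]}"
  "${genome_fasta}"
  "${genome_index_name}"
  --exon "${exons}"
  --ss "${splice_sites}"
  -p "${threads}"
)

# Dry run: print the command instead of executing it.
printf '%s ' "${build_cmd[@]}"; echo
```

To run the build for real, the `printf` line could be swapped for `"${build_cmd[@]}" 2> hisat2-build_stats.txt`.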


```{r, engine='bash'}
# hisat2-build builds a HISAT2 index for the reference genome.
# Positional arguments: the reference FASTA, then the index basename.
# The basename includes the output directory so the alignment step (-x) can find the index.
# --exon ../output/04-Apulcra-hisat/m_exon.tab is the exon file
# --ss ../output/04-Apulcra-hisat/m_splice_sites.tab is the splice site file
# -p 40 is the number of threads
# 2> ../output/04-Apulcra-hisat/hisat2-build_stats.txt captures stderr (the build log)
# The GTF file is not an argument here; it is only used to make the exon and splice site files.

/home/shared/hisat2-2.2.1/hisat2-build \
../data/Amil/ncbi_dataset/data/GCF_013753865.1/GCF_013753865.1_Amil_v2.1_genomic.fna \
../output/04-Apulcra-hisat/GCF_013753865.1_Amil_v2.1 \
--exon ../output/04-Apulcra-hisat/m_exon.tab \
--ss ../output/04-Apulcra-hisat/m_splice_sites.tab \
-p 40 \
2> ../output/04-Apulcra-hisat/hisat2-build_stats.txt
```
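
Before aligning, it can help to confirm that the build finished. The sketch below assumes a small index, which HISAT2 writes as eight files named `<basename>.1.ht2` through `<basename>.8.ht2` (large indexes use the `.ht2l` suffix instead). The helper name `index_complete` and the temporary demo files are hypothetical, used only so the example runs without a real index.

```bash
# Return success when all eight .N.ht2 files exist for an index basename.
# Assumes a small index; large indexes use the .ht2l suffix instead.
index_complete() {
  for i in 1 2 3 4 5 6 7 8; do
    [ -e "$1.$i.ht2" ] || return 1
  done
}

# Demonstration on placeholder files in a temporary directory (hypothetical basename).
demo_dir="$(mktemp -d)"
for i in 1 2 3 4 5 6 7 8; do
  touch "${demo_dir}/demo_index.$i.ht2"
done

if index_complete "${demo_dir}/demo_index"; then
  echo "index complete"
else
  echo "index incomplete"
fi
```

In this notebook the check would be called as `index_complete ../output/04-Apulcra-hisat/GCF_013753865.1_Amil_v2.1`.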


# Alignment

```
# Hisat2 alignments
"${programs_array[hisat2]}" \
-x "${genome_index_name}" \
-1 "${fastq_list_R1}" \
-2 "${fastq_list_R2}" \
-S "${sample_name}".sam \
2> "${sample_name}"-hisat2_stats.txt
```

```{r, engine='bash'}
# Comments generated with GPT Studio.
# Run hisat2 to align paired-end reads to the reference genome.
# -x gives the basename of the index built above.
/home/shared/hisat2-2.2.1/hisat2 \
-x ../output/04-Apulcra-hisat/GCF_013753865.1_Amil_v2.1 \
-p 48 \
-1 ../data/SRR8601366_1.fastq \
-2 ../data/SRR8601366_2.fastq \
-S ../output/04-Apulcra-hisat/SRR8601366_mil.sam \
2>&1 | tee ../output/04-Apulcra-hisat/hisat2_stats.txt
```


### ChatGPT comment test - same chunk as above
This test was run with ChatGPT directly, not inline.


This chunk is a Bash command rather than R code; it can be run in a terminal or from R using `system()` or `system2()`. It runs HISAT2 on paired-end RNA-seq reads (`SRR8601366`) and aligns them to the *Acropora millepora* reference genome index (`GCF_013753865.1_Amil_v2.1`).


---

ChatGPT output:

This code is running the HISAT2 alignment tool, which is used to align RNA sequencing reads to a reference genome. It is specifying the following options:

`-x` specifies the reference genome index to use for the alignment.

`-p` specifies the number of CPU cores to use in the alignment. 

`-1` and `-2` specify the input paired-end fastq read files to align. 

`-S` specifies the output SAM file to save the alignment results.

`2>&1 | tee` merges stderr into stdout and pipes both to `tee`, which writes them to `../output/04-Apulcra-hisat/hisat2_stats.txt` while also printing them to the console for the user to see.

The code assumes that the input fastq files are located at `../data/SRR8601366_1.fastq` and `../data/SRR8601366_2.fastq`, and that the reference genome index has the basename `../output/04-Apulcra-hisat/GCF_013753865.1_Amil_v2.1`. The path `/home/shared/hisat2-2.2.1/hisat2` is the HISAT2 executable, not the index.

The aligned output will be saved to a SAM file located at `../output/04-Apulcra-hisat/SRR8601366_mil.sam`.
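
The `2>&1 | tee` redirection described above can be tried in isolation. This is a minimal sketch: `fake_aligner` and its message are hypothetical stand-ins for a tool that, like HISAT2, reports its summary on stderr.

```bash
# Stand-in for an aligner that writes its summary to stderr.
# fake_aligner and its message are hypothetical, for illustration only.
fake_aligner() {
  echo "example summary line on stderr" >&2
}

log_file="$(mktemp)"

# 2>&1 sends stderr into the pipe; tee prints it and saves a copy.
fake_aligner 2>&1 | tee "${log_file}"

# The log holds the same line that appeared on the console.
cat "${log_file}"
```

Without `2>&1`, the pipe would carry only stdout, so the summary would reach the console but not the log file.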

```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2 \
-x ../output/04-Apulcra-hisat/GCF_013753865.1_Amil_v2.1 \
-p 48 \
-1 ../data/SRR8601366_1.fastq \
-2 ../data/SRR8601366_2.fastq \
-S ../output/04-Apulcra-hisat/SRR8601366_mil.sam \
2>&1 | tee ../output/04-Apulcra-hisat/hisat2_stats.txt
```

# Align with no splice sites


```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2-build \
../data/Amil/ncbi_dataset/data/GCF_013753865.1/GCF_013753865.1_Amil_v2.1_genomic.fna \
../output/04-Apulcra-hisat/GCF_013753865.1_Amil_v2.1_nosplice \
-p 48 \
2>&1 | tee ../output/04-Apulcra-hisat/hisat2-build_stats_nosplice.txt
```


```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2 \
-x ../output/04-Apulcra-hisat/GCF_013753865.1_Amil_v2.1_nosplice \
-p 48 \
-1 ../data/SRR8601366_1.fastq \
-2 ../data/SRR8601366_2.fastq \
-S ../output/04-Apulcra-hisat/SRR8601366_mil_nosplice.sam \
2>&1 | tee ../output/04-Apulcra-hisat/hisat2_stats_nosplice.txt
```
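
To compare the run with splice sites against the run without, the overall alignment rate can be pulled from each stats file. The sketch below relies on HISAT2 ending its summary with a line of the form `NN.NN% overall alignment rate`. The helper name `alignment_rate` and the demo file with a made-up rate are hypothetical, so the example runs without real output.

```bash
# Print the overall alignment rate recorded in a HISAT2 stats file.
alignment_rate() {
  grep "overall alignment rate" "$1" | awk '{print $1}'
}

# Hypothetical stats file with a made-up rate, for demonstration only.
demo_stats="$(mktemp)"
echo "12.34% overall alignment rate" > "${demo_stats}"

alignment_rate "${demo_stats}"
```

On the real outputs this would be called once per file, for example `alignment_rate ../output/04-Apulcra-hisat/hisat2_stats.txt` and `alignment_rate ../output/04-Apulcra-hisat/hisat2_stats_nosplice.txt`.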

# A. hyacinthus


## no splice

```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2-build \
../data/Ahya/ncbi_dataset/data/GCA_020536085.1/GCA_020536085.1_Ahyacinthus.chrsV1_genomic.fna \
../output/04-Apulcra-hisat/GCA_020536085.1_Ahya_nosplice \
-p 48 \
2> ../output/04-Apulcra-hisat/hisat2-build_stats_Ahya_nosplice.txt
```


```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2 \
-x ../output/04-Apulcra-hisat/GCA_020536085.1_Ahya_nosplice \
-p 48 \
-1 ../data/SRR8601366_1.fastq \
-2 ../data/SRR8601366_2.fastq \
-S ../output/04-Apulcra-hisat/SRR8601366_hya_nosplice.sam \
2>&1 | tee ../output/04-Apulcra-hisat/hisat2_stats_Ahya_nosplice.txt
```


# A. digitifera


```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \
../data/Adig/ncbi_dataset/data/GCF_000222465.1/genomic.gtf \
> ../output/04-Apulcra-hisat/d_exon.tab
```


```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \
../data/Adig/ncbi_dataset/data/GCF_000222465.1/genomic.gtf \
> ../output/04-Apulcra-hisat/d_splice_sites.tab
```


```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2-build \
../data/Adig/ncbi_dataset/data/GCF_000222465.1/GCF_000222465.1_Adig_1.1_genomic.fna \
../output/04-Apulcra-hisat/GCF_000222465.1_Adig \
--exon ../output/04-Apulcra-hisat/d_exon.tab \
--ss ../output/04-Apulcra-hisat/d_splice_sites.tab \
-p 48 \
2> ../output/04-Apulcra-hisat/hisat2-build_stats_Adig.txt
```


```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2 \
-x ../output/04-Apulcra-hisat/GCF_000222465.1_Adig \
-p 48 \
-1 ../data/SRR8601366_1.fastq \
-2 ../data/SRR8601366_2.fastq \
-S ../output/04-Apulcra-hisat/SRR8601366_dig.sam \
2>&1 | tee ../output/04-Apulcra-hisat/hisat2_stats_dig.txt
```
## dig no splice

```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2-build \
../data/Adig/ncbi_dataset/data/GCF_000222465.1/GCF_000222465.1_Adig_1.1_genomic.fna \
../output/04-Apulcra-hisat/GCF_000222465.1_Adig_nosplice \
-p 48 \
2> ../output/04-Apulcra-hisat/hisat2-build_stats_Adig_nosplice.txt
```


```{r, engine='bash'}
/home/shared/hisat2-2.2.1/hisat2 \
-x ../output/04-Apulcra-hisat/GCF_000222465.1_Adig_nosplice \
-p 48 \
-1 ../data/SRR8601366_1.fastq \
-2 ../data/SRR8601366_2.fastq \
-S ../output/04-Apulcra-hisat/SRR8601366_dig_nosplice.sam \
2>&1 | tee ../output/04-Apulcra-hisat/hisat2_stats_dig_nosplice.txt
```
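
The alignment chunks in this notebook differ only in the index basename and the output label. A minimal sketch of a loop over those pairs follows, written as a dry run that prints each command instead of executing it. The `runs` list and the variable names are assumptions for illustration. The stats redirection is left out because the stats file names in the notebook do not follow a single pattern.

```bash
# Index basename and output label for each alignment run in this notebook.
runs="GCF_013753865.1_Amil_v2.1 mil
GCF_013753865.1_Amil_v2.1_nosplice mil_nosplice
GCA_020536085.1_Ahya_nosplice hya_nosplice
GCF_000222465.1_Adig dig
GCF_000222465.1_Adig_nosplice dig_nosplice"

out_dir="../output/04-Apulcra-hisat"

# Dry run: build one hisat2 command line per index/label pair.
commands="$(echo "${runs}" | while read -r index label; do
  echo "/home/shared/hisat2-2.2.1/hisat2 -x ${out_dir}/${index} -p 48 -1 ../data/SRR8601366_1.fastq -2 ../data/SRR8601366_2.fastq -S ${out_dir}/SRR8601366_${label}.sam"
done)"

echo "${commands}"
```

Printing first makes it easy to check every path before committing 48 threads to each alignment.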