---
title: "Differential Gene Expression of Developmental stages in *Botrylus schlosseri*"
author: "Celeste Valdivia"
output: md_document
date: "2023-04-13"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE, warning = FALSE, message = FALSE)
```
This is file 3 of 3.

## 1. Accessing genome
For this workflow, we will utilize the assembled genome of [*Botryllus schlosseri*](https://elifesciences.org/articles/00569) as a reference when quantifying the transcripts of the samples.

Curling the *Botryllus schlosseri* genome assembly. 
```{r, engine= 'bash'}
cd ../data
curl --insecure -O https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/444/245/GCA_000444245.1_356a-chromosome-assembly/GCA_000444245.1_356a-chromosome-assembly_genomic.fna.gz

# unzipping file
gzip -d GCA_000444245.1_356a-chromosome-assembly_genomic.fna.gz
```

## 2. Accessing annotated genome
```{r, engine = 'bash'}
cd ../data
curl --insecure -O http://botryllus.stanford.edu/botryllusgenome/download/botznik-chr-ctg.gff
```
If you check through the gff file we just downloaded, we will see there is a lot of meta-information in it that we don't need. We will need to filter through it in order to create exon and splice site tabs to use in our HISat2 workflow. 
```{bash}
cd ../output
head -n 70 botznik-chr-all.gff
tail -n 70 botznik-chr-all.gff
```


```{r, engine = 'bash'}
cd ../data
# grep for CDS or intron and extract relevant columns using awk
grep -E '(CDS|intron)' botznik-chr-all.gff | awk '{print $1, $4, $5, $7, $9}' > exons_splice_sites.txt
cat exons_splice_sites.txt
```

```{bash}
cd output
tail -n 70 botznik-chr-all.gff
```


## 2. Build an index file
We will be aligning our sequence reads to the reference genome

The below command takes the path to the HISAT2 executable.
Then with -f option we specify that the input is a fasta file (which can come in the fasta, fna, or fa file extension format).

We then title and create an index which is stored into the output folder. If you don't have an output folder in your working directory already, add one before running the below code.

GTF files typically include information about gene structures, such as the locations of exons, introns, and other features, and are commonly used for genome annotation and RNAseq analysis. 

We can proceed with making an index using HISAT2 without information regarding exons and splice sites with the following code chunk.


```{bash}
# Create Hisat2 exons tab file
/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \
../data/botznik-chr-all.gff \
> ../output/botznik-chr-all-exons.tab

# Create Hisat2 splice sites tab file
/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \
../data/botznik-chr-all.gff \
> ../output/botznik-chr-all-splice-sites.tab

# Build Hisat2 reference index using splice sites and exons
/home/shared/hisat2-2.2.1/hisat2-build \
-f ../data/GCA_000444245.1_356a-chromosome-assembly_genomic.fna \
../output/GCA_000444245.1_356a-annote.index \
--exon ../output/botznik-chr-all-exons.tab \
--ss ../output/botznik-chr-all-splice-sites.tab \
-p 7 \
2> hisat2-build_stats.txt
```

## iterative alignment 

```{r, engine='bash'}
# Set input and output directories
input_dir=/home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw
output_dir=../output

# Set Hisat2 options
hisat2_options="-x /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/GCA_000444245.1_356a-annote.index --threads 7"

# Loop through fastq files in input directory
for f in ${input_dir}/*.fastq; do
    # Get base filename
    base=$(basename ${f} .fastq)
    
    # Align with Hisat2
    /home/shared/hisat2-2.2.1/hisat2 \
    ${hisat2_options} \
    -1 ${f} \
    -2 ${input_dir}/${base}_2.fastq \
    -S ${output_dir}/${base}.sam \
    
done

```


## 3. Perform alignment
Here we will align paired-end RNAseq files against the index we created using the HISAT2 software.

It seems like at least in the code chunk I need to specify the absolute path in order for hisat2 to find my input files. 

We will align each RNAseq file one at a time just replacing the input fastq file and output name.

The output will be sorted.bam files that we can then use downstream to create transcript abundance estimates.


### Query SRR10352205
```{r, engine= 'bash'}
/home/shared/hisat2-2.2.1/hisat2 \
-x /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/GCA_000444245.1_356a-annote.index \
-1 /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/SRR10352205_1.fastq   \
-2 /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/SRR10352205_2.fastq \
-S ../output/SRR10352205.sam \
--threads 7 

```

```{r, engine = 'bash'}
/home/shared/samtools-1.12/samtools view \
-@ 7 \
-Su ../output/SRR10352205.sam \
| /home/shared/samtools-1.12/samtools sort - \
-@ 7 \
-o ../output/SRR10352205.sorted.bam 
/home/shared/samtools-1.12/samtools index ../output/SRR10352205.sorted.bam
```
### Query SRR10352206
```{r, engine= 'bash'}
/home/shared/hisat2-2.2.1/hisat2 \
-x /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/GCA_000444245.1_356a-valid.index \
-1 /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/SRR10352206_1.fastq   \
-2 /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/SRR10352206_2.fastq \
-S ../output/SRR10352206.sam \
--threads 7 

```

```{r, engine = 'bash'}
/home/shared/samtools-1.12/samtools view \
-@ 7 \
-Su ../output/SRR10352206.sam \
| /home/shared/samtools-1.12/samtools sort - \
-@ 7 \
-o ../output/SRR10352206.sorted.bam 
/home/shared/samtools-1.12/samtools index ../output/SRR10352206.sorted.bam
```


### Query SRR10352224
```{r, engine= 'bash'}
/home/shared/hisat2-2.2.1/hisat2 \
-x /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/GCA_000444245.1_356a-valid.index \
-1 /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/SRR10352224_1.fastq   \
-2 /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/SRR10352224_2.fastq \
-S ../output/SRR10352224.sam \
--threads 7 

```

```{r, engine = 'bash'}
/home/shared/samtools-1.12/samtools view \
-@ 7 \
-Su ../output/SRR10352224.sam \
| /home/shared/samtools-1.12/samtools sort - \
-@ 7 \
-o ../output/SRR10352224.sorted.bam 
/home/shared/samtools-1.12/samtools index ../output/SRR10352224.sorted.bam
```


### Query SRR10352254
```{r, engine= 'bash'}
/home/shared/hisat2-2.2.1/hisat2 \
-x /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/GCA_000444245.1_356a-valid.index \
-1 /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/SRR10352254_1.fastq   \
-2 /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/SRR10352254_2.fastq \
-S ../output/SRR10352254.sam \
--threads 7 

```

```{r, engine = 'bash'}
/home/shared/samtools-1.12/samtools view \
-@ 7 \
-Su ../output/SRR10352254.sam \
| /home/shared/samtools-1.12/samtools sort - \
-@ 7 \
-o ../output/SRR10352254.sorted.bam 
/home/shared/samtools-1.12/samtools index ../output/SRR10352254.sorted.bam
```

### Deleting unecessary files
Once you have completed the alignment for each of your queries, delete unneeded index and SAM files.

```{r, engine = 'bash'}
rm ./ GCA_000444245.1_356a-valid*.ht2
rm ./*.sam
```

## 4. Generating a reference transcriptome annotation to input into trinity?
I can't make abundance estimates until I have gtf it seems. Here is an example of the code chunk that ChatGPT spit out for me. 

"To generate a reference transcriptome annotation in GTF format using StringTie, you can follow these steps:

1. Align RNA-seq reads to the genome assembly: Use a program such as HISAT2 to align RNA-seq reads to the genome assembly. This will produce a sorted BAM file containing the alignment information.

2. Generate a GTF file for genome annotation: Use a program such as MAKER or BRAKER to generate a GTF file for genome annotation. This file should contain the predicted gene models, along with their exon and intron boundaries.

3. Run StringTie to assemble transcripts: Use the sorted BAM file from step 1 and the GTF file from step 2 as input to StringTie. This will assemble the transcripts and produce a GTF file containing the transcript annotations.

Here's an example command for running StringTie:

```{r, engine='bash'}
stringtie [sorted_bam_file] -G [gtf_file] -o [output_gtf_file] -p [num_threads] -l [label_prefix]
```

Replace [sorted_bam_file] with the name of your sorted BAM file, [gtf_file] with the name of your genome annotation GTF file, [output_gtf_file] with the name you want to give to your output transcript annotation file, [num_threads] with the number of threads you want to use for parallel processing, and [label_prefix] with a prefix to use for the transcript IDs.

After running StringTie, you should have a GTF file containing the transcript annotations for your genome assembly."

## 5. Quantifying gene expression levels

We will now create transcript abundance estimates from the sorted BAM files we just generated.

For this we will use the "align_and_estimate_abundance.pl" script included with Trinity.


```{bash}
perl /home/shared/trinityrnaseq-v2.12.0/util/align_and_estimate_abundance.pl \
--transcripts reference.gtf \
--seqType sam \
--left reads_1.fq \
--right reads_2.fq \
--est_method RSEM \
--aln_method bowtie2 \
--prep_reference \
--thread_count 4 \
--output_dir output_dir

```