---
title: "Differential Gene Expression of Developmental stages in *Botrylus schlosseri*"
author: "Celeste Valdivia"
output: md_document
date: "2023-04-13"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE, warning = FALSE, message = FALSE)
```

## 1. Accessing genome
For this workflow, we will utilize the assembled genome of [*Botryllus schlosseri*](https://elifesciences.org/articles/00569) as a reference when quantifying the transcripts of the samples.

Curling the *Botryllus schlosseri* genome assembly. 
```{r, engine= 'bash'}
cd ../data
curl --insecure -O https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/444/245/GCA_000444245.1_356a-chromosome-assembly/GCA_000444245.1_356a-chromosome-assembly_genomic.fna.gz

# unzipping file
gzip -d GCA_000444245.1_356a-chromosome-assembly_genomic.fna.gz
```

## 2. Accessing annotated genome
Downloading the **complete** annotated genome from the [*Botryllus schlosseri* Genome Project](http://botryllus.stanford.edu/botryllusgenome/download/):
```{r, engine = 'bash'}
cd ../data
curl --insecure -O http://botryllus.stanford.edu/botryllusgenome/download/botznik-chr-ctg.gff
```

Taking a peak at the annotated genome:
```{bash}
cd ../data/genome-alignment
tail -n 70 botznik-chr-all.gff
```
When running through workflow, particularly at the development of gtf files, it says that this is not a valid reference gff. Below I will curl a different gff from the same server. I tried all the gff's from this server and none of them produce exon and splice site tabs or are convertable into gtf files with string tie.

```{r, engine = 'bash'}
cd ../data/genome-alignment
curl --insecure -O http://botryllus.stanford.edu/botryllusgenome/download/botznik-chr-ctg.gff
```


## 3. Build an index file from the reference genome with HISat2
We will be aligning our sequence reads to the reference genome.

The below command takes the path to the HISAT2 executable.
Then with -f option we specify that the input is a fasta file (which can come in the fasta, fna, or fa file extension format).

We then title and create an index which is stored into the output folder. If you don't have an output folder in your working directory already, add one before running the below code.

GTF files typically include information about gene structures, such as the locations of exons, introns, and other features, and are commonly used for genome annotation and RNAseq analysis. 


```{bash}
# Create Hisat2 exons tab file
/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \
../data/genome-alignment/botznik-chr-all.gff \
> ../output/botznik-chr-all-exons.tab

# Create Hisat2 splice sites tab file
/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \
../data/genome-alignment/botznik-chr-all.gff \
> ../output/botznik-chr-all-splice-sites.tab
```

```{r, engine='bash'}
# Build Hisat2 reference index using splice sites and exons
/home/shared/hisat2-2.2.1/hisat2-build \
-f ../data/genome-alignment/GCA_000444245.1_356a-chromosome-assembly_genomic.fna \
../output/GCA_000444245.1_356a-annote.index \
--exon ../output/botznik-chr-all-exons.tab \
--ss ../output/botznik-chr-all-splice-sites.tab \
-p 7 \
2> hisat2-build_stats.txt
```

## 4. Iterative alignment to index (making SAM files)

The below code chunk will align each pair of my RNAseq files to the reference index we created in the last step using hisat2. 

```{r, engine='bash'}
# Set input and output directories
input_dir=/home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw
output_dir=../output

# Set Hisat2 options
hisat2_options="-x /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/GCA_000444245.1_356a-annote.index --threads 7"

# Loop through fastq files in input directory
for f in ${input_dir}/*_1.fastq; do
    # Get base filename
    base=$(basename ${f} _1.fastq)
    
    # Align with Hisat2
    /home/shared/hisat2-2.2.1/hisat2 \
    ${hisat2_options} \
    -1 ${f} \
    -2 ${input_dir}/${base}_2.fastq \
    -S ${output_dir}/${base}.sam \
    
done

```


## 5. Iteratively make sorted.BAM files from SAM files 

The following code chunk will iterateively take all sam files in the output directory and convert them to bam files.
```{r, engine='bash'}
#!/bin/bash

# Directory containing SAM files
sam_dir="../output/"

# Iterate over all SAM files in the directory
for sam_file in ${sam_dir}*.sam
do
    # Generate output file names
    base_name=$(basename ${sam_file} .sam)
    bam_file="${sam_dir}${base_name}.sorted.bam"
    index_file="${bam_file}.bai"

    # Convert SAM to BAM and sort
    /home/shared/samtools-1.12/samtools view \
    -@ 7 \
    -Su ${sam_file} \
    | /home/shared/samtools-1.12/samtools sort - \
    -@ 7 \
    -o ${bam_file}

    # Index BAM file
    /home/shared/samtools-1.12/samtools index ${bam_file}
done

```

Cleaning up the output directory since those sam files are pretty large and taking up space.
```{r, engine = 'bash'}
# Delete unnecessary index files
rm GCA_000444245.1_356a-valid*.ht2

# Delete unneeded SAM files
rm ./*.sam
```


```{r, engine = 'bash'}
rm ./ *.ht2
rm ./*.sam
```

## 6. Generating a reference transcriptome annotation to input into trinity?
I can't make abundance estimates until I have gtf it seems. Here is an example of the code chunk that ChatGPT spit out for me. 

```{r, engine='bash'}
 #!/bin/bash

# Set input and output directories
input_dir=../output
output_dir=../output/stringtie

# Loop through the sorted BAM files in input directory
for f in ${input_dir}/*.sorted.bam; do
    # Get base filename
    base=$(basename ${f} .sorted.bam)

    # Run StringTie to estimate transcript abundances, botznik gff not accepted as a valid reference transcript... :(
    /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
    ${input_dir}/${base}.sorted.bam \
    -G ../data/genome-alignment/botznik-chr-all.gff \
    -o ${output_dir}/${base}.gtf \
    -p 7

done

# Merge GTF files with StringTie
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
--merge ${output_dir}/mergelist.txt \
-G ../data/botznik-chr-ctg.gff \
-o ${output_dir}/merged.gtf \
-p 7

```

## 7. Quantifying gene expression levels

We will now create transcript abundance estimates from the sorted BAM files we just generated.

For this we will use the "align_and_estimate_abundance.pl" script included with Trinity.


```{bash}
perl /home/shared/trinityrnaseq-v2.12.0/util/align_and_estimate_abundance.pl \
--transcripts ../data/genome-alignmentbotznik-chr-all.gff \
--seqType sam \
--left reads_1.fq \
--right reads_2.fq \
--est_method RSEM \
--aln_method bowtie2 \
--prep_reference \
--thread_count 4 \
--output_dir output_dir

```