---
title: "Differential Gene Expression of Developmental stages in *Botrylus schlosseri*"
author: "Celeste Valdivia"
output: md_document
date: "2023-04-13"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE, warning = FALSE, message = FALSE)
```

## 1. Accessing genome
For this workflow, we will utilize the assembled genome of [*Botryllus schlosseri*](https://elifesciences.org/articles/00569) as a reference when quantifying the transcripts of the samples.

Curling the *Botryllus schlosseri* genome assembly. 
```{r, engine= 'bash'}
cd ../data
curl --insecure -O https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/444/245/GCA_000444245.1_356a-chromosome-assembly/GCA_000444245.1_356a-chromosome-assembly_genomic.fna.gz

# unzipping file
gzip -d GCA_000444245.1_356a-chromosome-assembly_genomic.fna.gz
```

## 2. Accessing annotated genome
Downloading the start and stop transcripts annotated genome from the [*Botryllus schlosseri* Genome Project](http://botryllus.stanford.edu/botryllusgenome/download/). This version of the gff file from this site does not have extra meta information which may make it easier to work with:
```{r, engine = 'bash'}
cd ../data
curl --insecure -O http://botryllus.stanford.edu/botryllusgenome/download/start_stop_transcripts30.gff
```

Taking a peak at the gff file we downloaded from that server:
```{r, engine ='bash'}
cd ../data/genome-alignment
tail -n 70 start_stop_transcripts30.gff
```
As you can see, it follows a pretty standard gff format in that it has 9 columns and each row entry specifies a different seq. The gene_id and transcript_id are in the 9th column. From reading online, this is pretty standard. However, you will also notice that in this gff file each row can often refer to the same gene but each row will specify a different portion of the sequence of that gene including the start and stop codon and the protein coding sequences of that gene. Each one of these sections is separated by a line that specifies the gene ID for that whole section.


## 3. Build an index file from the reference genome with HISat2

This code chunk attempts to extract exon and splice site info from the gff, but I believe the gff file is already in it's entirety protein coding regions already. So "exons" are not specified as portions of the transcripts. Just the position of the whole transcript itself and whether or not it has any introns in that sequence. That is why the below code chunk produces tabs of file size 0 B.
```{bash}
# Create Hisat2 exons tab file
/home/shared/hisat2-2.2.1/hisat2_extract_exons.py \
../data/genome-alignment/start_stop_transcripts30.gff \
> ../output/botznik-chr-exons.tab

# Create Hisat2 splice sites tab file
/home/shared/hisat2-2.2.1/hisat2_extract_splice_sites.py \
../data/genome-alignment/start_stop_transcripts30.gff \
> ../output/botznik-chr-splice-sites.tab
```

The below command takes the path to the HISAT2 executable.
Then with -f option we specify that the input is a fasta file (which can come in the fasta, fna, or fa file extension format).

We then title and create an index which is stored into the output folder. If you don't have an output folder in your working directory already, add one before running the below code. It's okay that the exon and splice site tabs are empty, this chunk will still work without them to build an index.
```{r, engine='bash'}
# Build Hisat2 reference index using splice sites and exons
/home/shared/hisat2-2.2.1/hisat2-build \
-f ../data/genome-alignment/GCA_000444245.1_356a-chromosome-assembly_genomic.fna \
../output/GCA_000444245.1_356a-annote.index \
--exon ../output/botznik-chr-all-exons.tab \
--ss ../output/botznik-chr-all-splice-sites.tab \
-p 7 \
2> hisat2-build_stats.txt
```

## 4. Iterative alignment to index (making SAM files)

The below code chunk will align each pair of my RNAseq files to the reference index we created in the last step using hisat2. 

```{r, engine='bash'}
# Set input and output directories
input_dir=/home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw
output_dir=../output

# Set Hisat2 options
hisat2_options="-x /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/GCA_000444245.1_356a-annote.index --threads 7"

# Loop through fastq files in input directory
for f in ${input_dir}/*_1.fastq; do
    # Get base filename
    base=$(basename ${f} _1.fastq)
    
    # Align with Hisat2
    /home/shared/hisat2-2.2.1/hisat2 \
    ${hisat2_options} \
    -1 ${f} \
    -2 ${input_dir}/${base}_2.fastq \
    -S ${output_dir}/${base}.sam \
    
done

```


## 5. Iteratively make sorted.BAM files from SAM files 

The following code chunk will iterateively take all sam files in the output directory and convert them to bam files.
```{r, engine='bash'}
#!/bin/bash

# Directory containing SAM files
sam_dir="../output/"

# Iterate over all SAM files in the directory
for sam_file in ${sam_dir}*.sam
do
    # Generate output file names
    base_name=$(basename ${sam_file} .sam)
    bam_file="${sam_dir}${base_name}.sorted.bam"
    index_file="${bam_file}.bai"

    # Convert SAM to BAM and sort
    /home/shared/samtools-1.12/samtools view \
    -@ 7 \
    -Su ${sam_file} \
    | /home/shared/samtools-1.12/samtools sort - \
    -@ 7 \
    -o ${bam_file}

    # Index BAM file
    /home/shared/samtools-1.12/samtools index ${bam_file}
done

```

Cleaning up the output directory since those sam files are pretty large and taking up space.
```{r, engine = 'bash'}
cd ../output
# Delete unnecessary index files
rm GCA_000444245.1_356a-annote*.ht2

# Delete unneeded SAM files
rm ./*.sam
```

## trying to make a gtf file
```{bash}
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 8 -o output.gtf ../data/gemome-alignment/start_stop_transcripts30.gff

```


## 6. Generating a reference transcriptome annotation to input into trinity?
I can't make abundance estimates until I have gtf it seems. Here is an example of the code chunk that ChatGPT spit out for me. 

```{r, engine='bash'}
 #!/bin/bash

# Set input and output directories
input_dir=../output
output_dir=../output/stringtie

# Loop through the sorted BAM files in input directory
for f in ${input_dir}/*.sorted.bam; do
    # Get base filename
    base=$(basename ${f} .sorted.bam)

    # Run StringTie to estimate transcript abundances, botznik gff not accepted as a valid reference transcript... :(
    /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
    ${input_dir}/${base}.sorted.bam \
    -G ../data/genome-alignment/botznik-chr-all.gff \
    -o ${output_dir}/${base}.gtf \
    -p 7

done

# Merge GTF files with StringTie
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
--merge ${output_dir}/mergelist.txt \
-G ../data/botznik-chr-ctg.gff \
-o ${output_dir}/merged.gtf \
-p 7

```

## 7. Quantifying gene expression levels

We will now create transcript abundance estimates from the sorted BAM files we just generated.

For this we will use the "align_and_estimate_abundance.pl" script included with Trinity.


```{bash}
perl /home/shared/trinityrnaseq-v2.12.0/util/align_and_estimate_abundance.pl \
--transcripts ../data/start_stop_transcripts30.gff \
--seqType sam \
--left reads_1.fq \
--right reads_2.fq \
--est_method RSEM \
--aln_method bowtie2 \
--prep_reference \
--thread_count 4 \
--output_dir output_dir

```