---
title: "Differential Gene Expression Analysis with Kallisto & DESeq2"
subtitle: "Selecting expression from a pile of stuff"
author: "Chris Mantegna"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: 
  html_document:
  theme: cerulean
  highlight: 
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
    callout-appearance: default
---

```{r, echo=FALSE}
#install.packages("quarto")
```
#Needed
# this is the css directive to add a banner behind my title, if it doesn't render that is ok we can remove it - it is not necessary, try to replace the path with the png before completely ditching it

```{css, echo=FALSE}

.title-banner {
  background-image: url('https://github.com/course-fish546-2023/chris-musselcon/blob/main/photos/dwpAndy.jpg?raw=true');
  background-size: cover;
  background-position: left;
  text-align: center;
  padding: 50px;
  color: white;
}

```
# Needed
# this chunk is the global default options chunk. unless otherwise specified each of the chunks will run according to these parameters

```{r setup, include=FALSE}

knitr::opts_chunk$set(
  highlight = TRUE,    # Highlight code
  echo = TRUE,         # Display code chunks
  eval = FALSE,        # Evaluate code chunks
  warning = TRUE,      # Hide warnings
  message = TRUE,      # Hide messages
  fig.width = 6,       # Set plot width in inches
  fig.height = 4,      # Set plot height in inches
  fig.align = "center" # Align plots to the center
  )

```
# Needed
# this sets the document parameters for rendering the document without over- or under- sized things
```{css, echo=FALSE}
pre {
  max-height: 300px;
  overflow-y: auto;
}
pre[class] {
  max-height: 100px;
}
```
# Needed
# this controls the navigation ability of the page 
```{css, echo=FALSE}

.scroll-100 {
  max-height: 100px;
  overflow-y: auto;
  background-color: inherit;
}

```
# Needed
# this will be removed because you cannot set a callout chunk like this no matter what Chat GPT tells you to fix your callout syntax 
```{css, echo=FALSE}

file.copy(from = "https://github.com/course-fish546-2023/chris-coursework1/blob/main/photos/miranda.png?raw=true", to = "photos/miranda.png", overwrite = TRUE)

```
#Start of the doc behind the header, everything else to this point is setup
# Purpose

Using RNASeq data to compare gene expression is a standard workflow to help scientists clarify which genes are differently expressed than their 'normal' state when exposed to perturbations. We're starting from the place where you've already QC'd your RNASeq files and are ready to jump into quantification and statistical analysis. If you're interested in learning more about the the QC process and application, check out [The Galaxy Project](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html). If you already know but want to check out your options, look into [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) or [MultiQC](https://multiqc.info/).

#Remove
```{r}
library(quarto)

quarto_callout(
  type = "note",
  icon = "photos/miranda.png",
  content = "The truth is, there is no one that can do what I do. If you don't pay attention to your file structure and comment your code- you're right. Take the time now to intentionally set up your future self and colleagues for success."
)
```
#Needed
# Set yourself up for success

## Know where you are & where you're placing your files

```{r}

# find out where you are before you start

getwd()

```

```{bash}

# navigate to your working directory and move to your data folder to begin

cd ../data

```

# Grab what you need to start

## This file is to create an index. The index will be used to quantify the transcripts abundances. The index file maps the RNASeq reads to the transcriptome (or genome) to quantify expression level when we use Kallisto.

### We will be assessing the genome of the Pacific oyster, *Crassostra gigas*.

```{bash}

cd ../data
curl --insecure -O https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/rna.fna

```

# Packages & Libraries

## Load the packages you'll need by 'calling' them from your package library

```{r, include=TRUE, message=FALSE, warning=FALSE}

BiocManager::install("DESeq2")
library(knitr)
library(tidyverse)
library(kableExtra)
library(DESeq2)
library(pheatmap)
library(RColorBrewer)
library(data.table)
library(DT)
library(Biostrings)

```
#Needed
#Fix this callout to match the correct Quarto syntax

```{r callout, type='note'}

knitr::include_graphics("https://github.com/course-fish546-2023/chris-musselcon/blob/main/photos/dwpMiranda.jpg?raw=true")

"I had hope. Anyway, you ended up disappointing me more than any of the other silly packages. If your library call doesn't yield the package you requested use the install.packages() code."

```

## We're going to access Kallisto, ask it to create our index, tell it where to create that index and what to name it.

```{bash}

/home/shared/kallisto/kallisto \
index -i \
../data/cgigas_roslin_rna.index \
../data/rna.fna

```

## We'll navigate to our data folder, ask it to grab our comparison files (only the necessary fastq.gz files, nothing more) from the URL listed below, and add them 'as is' to our data directory.

```{bash}

cd ../data 
wget --recursive --no-parent --no-directories \
--no-check-certificate \
--accept '*.fastq.gz' \
https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/nopp/

```

## You will have to use the 'mkdir' bash command to create the output folder in Kalliso. The code is commented out so that it diens't run, but it is a necessary step if this is your first time doing it.

```{bash}

#mkdir ../output/kallisto_01

```

## Here we are quantifying our RNASeq results with Kallisto. We do this by first searching for our fastq.gz files, organizes what we find, cleans up the names of what we find, and then using the paramters set at the bottom of the code chunk we are creating individual outputs for each found file.

```{bash}

find ../data/*fastq.gz \
| xargs basename -s _L001_R1_001.fastq.gz | xargs -I{} /home/shared/kallisto/kallisto \
quant -i ../data/cgigas_roslin_rna.index \
-o ../output/kallisto_01/{} \
-t 40 \
--single -l 100 -s 10 ../data/{}_L001_R1_001.fastq.gz

```

# fix the callout syntax
```{r callout, type='note'}

knitr::include_graphics("https://github.com/course-fish546-2023/chris-musselcon/blob/main/photos/dwpMiranda.jpg?raw=true")

"Don't be ridiculous, everybody wants this. Everybody wants to be us. Every file wants to be in a repo, but not every file can be. At this point if you haven't saved and committed your work to your repo, you need to. Keep in mind that GitHub will not allow you to commit files greater than 100MB. If any of that sounded new, check out [Happy Git] (https://happygitwithr.com/index.html) to get yourself in order!"

```

## This Bash code runs the 'abundance_estimates' script from the Trinity RNASeq package and creates a gene expression matrix from the output of Kallisto quantification results.

```{bash}

perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
    --gene_trans_map none \
    --out_prefix ../output/kallisto_01 \
    --name_sample_by_basedir \
    ../output/kallisto_01/D54_S145/abundance.tsv \
    ../output/kallisto_01/D56_S136/abundance.tsv \
    ../output/kallisto_01/D58_S144/abundance.tsv \
    ../output/kallisto_01/M45_S140/abundance.tsv \
    ../output/kallisto_01/M48_S137/abundance.tsv \
    ../output/kallisto_01/M89_S138/abundance.tsv \
    ../output/kallisto_01/D55_S146/abundance.tsv \
    ../output/kallisto_01/D57_S143/abundance.tsv \
    ../output/kallisto_01/D59_S142/abundance.tsv \
    ../output/kallisto_01/M46_S141/abundance.tsv \
    ../output/kallisto_01/M49_S139/abundance.tsv \
    ../output/kallisto_01/M90_S147/abundance.tsv \
    ../output/kallisto_01/N48_S194/abundance.tsv \
    ../output/kallisto_01/N50_S187/abundance.tsv \
    ../output/kallisto_01/N52_S184/abundance.tsv \
    ../output/kallisto_01/N54_S193/abundance.tsv \
    ../output/kallisto_01/N56_S192/abundance.tsv \
    ../output/kallisto_01/N58_S195/abundance.tsv \
    ../output/kallisto_01/N49_S185/abundance.tsv \
    ../output/kallisto_01/N51_S186/abundance.tsv \
    ../output/kallisto_01/N53_S188/abundance.tsv \
    ../output/kallisto_01/N55_S190/abundance.tsv \
    ../output/kallisto_01/N57_S191/abundance.tsv \
    ../output/kallisto_01/N59_S189/abundance.tsv
    
```

## We are creating a count matrix dataframe, a.k.a. a table of how many times a feature (i.e., gene) is observed in our sequence. Each row is a feature and each column is a sample. After we create the table we are going to take a look at it to make sure it looks like we want it. We're doing this in standard R programming commands.
```{r}

countmatrix <- read.delim("../output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)

```
## Since all of the values in our table are very large and burdensome to read, we are going to round the values to make it easier to get at what we really want. Once we do that, we will use str() to take a fast look at the results.
```{r}

countmatrix <- round(countmatrix, 0)
str(countmatrix)

```
## We have what we want and we are moving into using the DESeq package. First, we're going to add the metadata to the count matrix we made a few steps back. Second, we're going to put them together and create a new dataset that can be processed in the DESeq program.
```{r}

deseq2.colData <- data.frame(condition=factor(c(rep("control", 12), rep("desicated", 12))), 
                             type=factor(rep("single-read", 24)))
rownames(deseq2.colData) <- colnames(data)
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
                                     colData = deseq2.colData, 
                                     design = ~ condition)

```
## We are using DESeq to find these outcomes: 
- estimating size factors
- estimating dispersions
- gene-wise dispersion estimates
- mean-dispersion relationship
- final dispersion estimates
- fitting model and testing
-- replacing outliers and refitting for 5677 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
- estimating dispersions
- fitting model and testing
```{r}

deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds)
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]

```

```{r}

head(deseq2.res)

```

```{r}

# Count number of hits with adjusted p-value less then 0.05
dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])

```

```{r}

tmp <- deseq2.res
# The main plot
plot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
     main="DEG Dessication  (pval <= 0.05)",
     xlab="mean of normalized counts",
     ylab="Log2 Fold Change")

# Getting the significant points and plotting them again so they're a different color
tmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]
points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red")

# 2 FC lines
abline(h=c(-1,1), col="blue")

```

```{r}

write.table(tmp.sig, "../output/DEGlist.tab", row.names = T)

```