Differential Gene Expression Analysis with Kallisto & DESeq2

a pretty report for FISH 546

Author

Sarah Tanja

Published

April 14, 2023

Advanced Prep

For this assignment you will be taking RNA-seq data, and running a common differential gene expression analysis, without the use of a reference genome. The end product will be a number of plots and table of differentially expressed genes.

Important

Pre-reqs: This code contains relative paths that only work within the raven Roberts’ Lab Server, where the data for this assignment is housed

Install Packages

Code

if ("knitr" %in% rownames(installed.packages()) == 'FALSE') BiocManager::install("knitr") 
if ("tidyverse" %in% rownames(installed.packages()) == 'FALSE') install.packages('tidyverse')
if ("kableExtra" %in% rownames(installed.packages()) == 'FALSE') install.packages('kableExtra')
if ("pheatmap" %in% rownames(installed.packages()) == 'FALSE') install.packages('pheatmap') 
if ("RColorBrewer" %in% rownames(installed.packages()) == 'FALSE') install.packages('RColorBrewer') 
if ("data.table" %in% rownames(installed.packages()) == 'FALSE') install.packages('data.table') 
if ("DT" %in% rownames(installed.packages()) == 'FALSE') install.packages('DT')
if ("Biostrings" %in% rownames(installed.packages()) == 'FALSE') install.packages('Biostrings') 
if ("BiocManager" %in% rownames(installed.packages()) == 'FALSE') install.packages('BiocManager') 
if ("DESeq2" %in% rownames(installed.packages()) == 'FALSE') BiocManager::install('DESeq2')

Load Libraries

Code

library(knitr)
library(tidyverse)
library(kableExtra)
library(DESeq2)
library(pheatmap)
library(RColorBrewer)
library(data.table)
library(DT)
library(Biostrings)

Setup git hooks to automatically gitignore large files

In this assignment we’re dealing with large files that will end up in the ../data and ../output directories. To prevent those large files from clogging up our ability to ‘git push’, we can use built-in hooks to automatically ignore files larger than 100 MB (no matter the directory or file name!). Here are the steps to follow:

Create a new text file in the .git/hooks/ directory of your repository called pre-commit

Select the More tab with the gear icon under the RStudio Files navigator bar and select ‘show hidden files’ to see the .git folder.
Add the following text to the .git/hooks/pre-commit file:

#!/bin/sh

# Maximum file size (in bytes)
max_file_size=104857600

# Find all files larger than max_file_size and add them to the .gitignore file
find . -type f -size +$max_file_size -exec echo "{}" >> .gitignore \;

This code sets the max_file_size variable to 100 MB and then uses the find command to locate all files in the repository that are larger than the specified max_file_size. The exec option of the find command appends the name of each file that matches the criteria to the .gitignore file.

Save the pre-commit file and make it executable by running the following command in Terminal:

Code

cd ../../
chmod +x .git/hooks/pre-commit

With these changes, whenever you run a git commit command, Git will first execute the pre-commit hook, which will automatically add any files larger than 100 MB to the .gitignore file. This will prevent Git from tracking these files in the repository going forward.

In the event that you accidentally committed a big file (>100MB), you can reset to the last successful git master branch push

Warning

⚠️warning this will overwrite any changes you made after your last successful push⚠️

If you still want to continue, you can un-comment the code and follow this instruction:

First, update all origin/<branch> refs to latest:
Code
#git fetch --all
Backup your current branch (e.g. master):
Code
#git branch backup-master
Jump to the latest commit on origin/master :
Code
#git reset --hard origin/master

Running kallisto

kallisto is a software that can be downloaded and unzipped into a programs directory outside of the repo.

Kallisto is a pseudo-aligner, it does not need a genome to align to! It uses the sequences in the dataset to create it’s own ‘reference’

In our case, it is already installed on raven (/home/shared/kallisto/kallisto), and can be checked by running the version command as below:

Code

/home/shared/kallisto/kallisto version

If you need to download and install kallisto:

navigate to a programs directory outside of the repo

grab the applicable program from the website using wget

uncompress the file by navigating to the programs/kallisto directory and running gunzip

check functionality with index command as above

Downloading Reference

This code grabs the Pacific oyster fasta file (rna.fna) of genes and does so ignoring the fact that gannet does not have a security certificate to authenticate (--insecure). This is usually not recommended however we know the server (i.e. we trust we’re not going to get a virus from this server).

Code

# change to work in data directory
cd ../data
# download the rna.fna file to data directory from the gannet server
curl --insecure -O https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/rna.fna

In the next code chunk we create the index file which insert what the index file is for here. Creating the index file can take some time (it is 1.6GB!) This code is indexing the file rna.fna while also renaming it as cgigas_roslin_rna.index. /home/shared/kallisto/kallisto is the absolute path to the kallisto program from within the raven server, while the lines after the index command indicate where to get the data from (the rna.fna file) and where to write the file to.

Code

/home/shared/kallisto/kallisto \
index -i \
../data/cgigas_roslin_rna.index \
../data/rna.fna

Download sequence reads

The samples that we are looking at in this assignment are Pacific oyster Crassostrea gigas, 12 of which are ‘control’ oysters under normal conditions, and 12 of which are ‘desiccated’, shown in this table:

Sample	SampleID
D-control	D54
D-control	D55
D-control	D56
D-control	D57
D-control	D58
D-control	D59
D-control	M45
D-control	M46
D-control	M48
D-control	M49
D-control	M89
D-control	M90
D-desiccation	N48
D-desiccation	N49
D-desiccation	N50
D-desiccation	N51
D-desiccation	N52
D-desiccation	N53
D-desiccation	N54
D-desiccation	N55
D-desiccation	N56
D-desiccation	N57
D-desiccation	N58
D-desiccation	N59

Sequence reads for these samples are on a public server at https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/nopp/ or located at absolute path /home/shared/8TB_HDD_01/sr320/github/nb-2023/Cgigas/data.

This code uses the --recursive feature of wget to get all 24 files. Additionally as with curl above we are ignoring the fact there is not security certificate with --no-check-certificate

Code

# move to data directory
cd ../data 
# download fastq files to data directory 
wget --recursive --no-parent --no-directories \
--no-check-certificate \
--accept '*.fastq.gz' \
https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/nopp/

Check to make sure all 24 files were downloaded successfully

Code

# move to data directory
cd ../data
# list all files that end with `.gz` extension, count the files in the list
ls *.gz | wc -l

The next chunk performs the following steps:

creates a subdirectory kallisto_01 in the output folder using mkdir
Uses the find utility to search for all files in the ../data/ directory that match the pattern *fastq.gz.
Uses the basename command to extract the base filename of each file (i.e., the filename without the directory path), and removes the suffix _L001_R1_001.fastq.gz.
Runs the kallisto quant command on each input file, with the following options:
-i ../data/cgigas_roslin_rna.index: Use the kallisto index file located at ../data/cgigas_roslin_rna.index.
-o ../output/kallisto_01/{}: Write the output files to a directory called ../output/kallisto_01/ with a subdirectory named after the base filename of the input file (the {} is a placeholder for the base filename).
-t 40: Use 40 threads for the computation.
--single -l 100 -s 10: Specify that the input file contains single-end reads (–single), with an average read length of 100 (-l 100) and a standard deviation of 10 (-s 10).
The input file to process is specified using the {} placeholder, which is replaced by the base filename from the previous step.

Code

# uncomment the line of code below 
# mkdir ../output/kallisto_01

find ../data/*fastq.gz \
| xargs basename -s _L001_R1_001.fastq.gz | xargs -I{} /home/shared/kallisto/kallisto \
quant -i ../data/cgigas_roslin_rna.index \
-o ../output/kallisto_01/{} \
-t 40 \
--single -l 100 -s 10 ../data/{}_L001_R1_001.fastq.gz

This next command runs the abundance_estimates_to_matrix.pl script from the Trinity RNA-seq assembly software package to create a gene expression matrix from kallisto output files.

The specific options and arguments used in the command are as follows:

perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl: Run the abundance_estimates_to_matrix.pl script from Trinity.
--est_method kallisto: Specify that the abundance estimates were generated using kallisto.
--gene_trans_map none: Do not use a gene-to-transcript mapping file.
--out_prefix ../output/kallisto_01: Use ../output/kallisto_01 as the output directory and prefix for the gene expression matrix file.
--name_sample_by_basedir: Use the sample directory name (i.e., the final directory in the input file paths) as the sample name in the output matrix.
And then there are the kallisto abundance files to use as input for creating the gene expression matrix.

Code


perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
    --gene_trans_map none \
    --out_prefix ../output/kallisto_01 \
    --name_sample_by_basedir \
    ../output/kallisto_01/D54_S145/abundance.tsv \
    ../output/kallisto_01/D56_S136/abundance.tsv \
    ../output/kallisto_01/D58_S144/abundance.tsv \
    ../output/kallisto_01/M45_S140/abundance.tsv \
    ../output/kallisto_01/M48_S137/abundance.tsv \
    ../output/kallisto_01/M89_S138/abundance.tsv \
    ../output/kallisto_01/D55_S146/abundance.tsv \
    ../output/kallisto_01/D57_S143/abundance.tsv \
    ../output/kallisto_01/D59_S142/abundance.tsv \
    ../output/kallisto_01/M46_S141/abundance.tsv \
    ../output/kallisto_01/M49_S139/abundance.tsv \
    ../output/kallisto_01/M90_S147/abundance.tsv \
    ../output/kallisto_01/N48_S194/abundance.tsv \
    ../output/kallisto_01/N50_S187/abundance.tsv \
    ../output/kallisto_01/N52_S184/abundance.tsv \
    ../output/kallisto_01/N54_S193/abundance.tsv \
    ../output/kallisto_01/N56_S192/abundance.tsv \
    ../output/kallisto_01/N58_S195/abundance.tsv \
    ../output/kallisto_01/N49_S185/abundance.tsv \
    ../output/kallisto_01/N51_S186/abundance.tsv \
    ../output/kallisto_01/N53_S188/abundance.tsv \
    ../output/kallisto_01/N55_S190/abundance.tsv \
    ../output/kallisto_01/N57_S191/abundance.tsv \
    ../output/kallisto_01/N59_S189/abundance.tsv

Running DESeq2

This code performs differential expression analysis to identify deferentially expressed genes (DEGs) between a control condition and a desiccated condition in Pacific oysters.

First, it reads in a count matrix of isoform counts generated by kallisto, with row names set to the gene/transcript IDs and the first column removed. It then rounds the counts to whole numbers.

The results() function is used to extract the results table, which is ordered by gene/transcript ID.

The code then prints the top few rows of the results table and calculates the number of DEGs with an adjusted p-value less than or equal to 0.05. It plots the log2 fold changes versus the mean normalized counts for all genes, highlighting significant DEGs in red and adding horizontal lines at 2-fold upregulation and downregulation. Finally, it writes the list of significant DEGs to a file called “DEGlist.tab”.

Read in count matrix

Code

countmatrix <- read.delim("../output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)

Round integers up to whole numbers for analysis

Code

countmatrix <- round(countmatrix, 0)
head(countmatrix)

Here the code creates a data.frame named deseq2.colData containing information about the experimental conditions (control & desiccated). It uses the column data dataframe named deseq2.colData to create a DESeqDataSet object using the DESeqDataSetFromMatrix function from the DESeq2 package.

Code

deseq2.colData <- data.frame(condition=factor(c(rep("control", 12), rep("desicated", 12))), 
                             type=factor(rep("single-read", 24)))
# set row names to match the column names in the count matrix
rownames(deseq2.colData) <- colnames(data)

# DESeqDataSet object created using the `DESeqDataSetFromMatrix` function
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
                                     colData = deseq2.colData, 
                                     design = ~ condition)

converting counts to integer mode

The DESeqDataSet object, named deseq.dds, is then passed to the DESeq() function to fit a negative binomial model and estimate dispersions.

Code

deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds)
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]

Code

str(deseq2.res)

Count number of hits with adjusted p-value less then 0.05

Code

dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])

[1] 607   6

Plotting DESeq2 matrix counts

Principle Component Analysis plot

What is PCA? Very simply, PCA dimensionally reduces the number of variables of a data set while preserving as much information as possible.

Tip

Checkout this YouTube video that visually explains PCA!

Code

vsd <- vst(deseq2.dds, blind = FALSE)
plotPCA(vsd, intgroup = "condition")

Heatmap plot

Code

# Select top 50 differentially expressed genes
res <- results(deseq2.dds)
res_ordered <- res[order(res$padj), ]
top_genes <- row.names(res_ordered)[1:50]
# Extract counts and normalize
counts <- counts(deseq2.dds, normalized = TRUE)
counts_top <- counts[top_genes, ]
# Log-transform counts
log_counts_top <- log2(counts_top + 1)
# Generate heatmap
pheatmap(log_counts_top, scale = "row")

Understanding Log2 Fold Change

Tip

First Checkout this great article Comparing experimental conditions: differential expression analysis

In the article linked above, Fold Change is explained: > For a given comparison, a positive fold change value indicates an increase of expression, while a negative fold change indicates a decrease in expression. This value is typically reported in logarithmic scale (base 2). For example, log2 fold change of 1.5 for a specific gene in the “WT vs KO comparison” means that the expression of that gene is increased in WT relative to KO by a multiplicative factor of 2^1.5 ≈ 2.82.

Code

tmp <- deseq2.res
# The main plot
plot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
     main="DEG Dessication  (pval <= 0.05)",
     xlab="mean of normalized counts",
     ylab="Log2 Fold Change")

Warning in xy.coords(x, y, xlabel, ylabel, log): 16484 x values <= 0 omitted
from logarithmic plot

Code

# Getting the significant points and plotting them again so they're a different color
tmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]
points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red")
# 2 FC lines
abline(h=c(-1,1), col="blue")

Volcano plot

Code

# Prepare the data for plotting
res_df <- as.data.frame(deseq2.res)
res_df$gene <- row.names(res_df)
# Create volcano plot
volcano_plot <- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = padj < 0.05)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("grey", "red")) +
  labs(title = "Volcano Plot",
       x = "Log2 Fold Change",
       y = "-Log10 Adjusted P-value",
       color = "Significantly\nDifferentially Expressed") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "top")
print(volcano_plot)

Warning: Removed 41435 rows containing missing values (`geom_point()`).

Table of DEGs

Save the list of Differentially Expressed Genes (DEGs)! Write the output to a table

Code

write.table(tmp.sig, "../output/DEGlist.tab", sep = '\t', row.names = T)

Let’s take a look at this list…

Code

deglist <- read.csv("../output/DEGlist.tab", sep = '\t', header = TRUE)
deglist$RowName <- rownames(deglist)
deglist2 <- deglist[, c("RowName", "pvalue")] # Optionally, reorder the columns
head(deglist2)

Now that we have the list and a few visualizations, we can move to the part where we identify what these genes actually do…

Continue this analysis using Gene Ontology (GO) term annotations and NCBI BLAST

--- title: "Differential Gene Expression Analysis with Kallisto & DESeq2" subtitle: "a pretty report for [FISH 546](https://sr320.github.io/course-fish546-2023/assignments/03-knit.html)" author: "Sarah Tanja" date: "`r format(Sys.time(), '%d %B, %Y')`" format: html: df-print: paged toc: true smooth-scroll: true link-external-icon: true link-external-newwindow: true code-fold: show code-tools: true code-copy: true highlight-style: arrow code-overflow: wrap theme: light: sandstone dark: vapor gfm: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set( highlight = TRUE, # Highlight code echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = TRUE, # Hide warnings message = TRUE, # Hide messages fig.width = 6, # Set plot width in inches fig.height = 4, # Set plot height in inches fig.align = "center" # Align plots to the center ) ``` # Advanced Prep For this assignment you will be taking RNA-seq data, and running a common differential gene expression analysis, without the use of a reference genome. The end product will be a number of plots and table of differentially expressed genes. ::: callout-important **Pre-reqs:** This code contains relative paths that only work within the *raven* Roberts' Lab Server, where the data for this assignment is housed ::: ------------------------------------------------------------------------ Install Packages ```{r install-packages, eval=TRUE, cache=TRUE} if ("knitr" %in% rownames(installed.packages()) == 'FALSE') BiocManager::install("knitr") if ("tidyverse" %in% rownames(installed.packages()) == 'FALSE') install.packages('tidyverse') if ("kableExtra" %in% rownames(installed.packages()) == 'FALSE') install.packages('kableExtra') if ("pheatmap" %in% rownames(installed.packages()) == 'FALSE') install.packages('pheatmap') if ("RColorBrewer" %in% rownames(installed.packages()) == 'FALSE') install.packages('RColorBrewer') if ("data.table" %in% rownames(installed.packages()) == 'FALSE') install.packages('data.table') if ("DT" %in% rownames(installed.packages()) == 'FALSE') install.packages('DT') if ("Biostrings" %in% rownames(installed.packages()) == 'FALSE') install.packages('Biostrings') if ("BiocManager" %in% rownames(installed.packages()) == 'FALSE') install.packages('BiocManager') if ("DESeq2" %in% rownames(installed.packages()) == 'FALSE') BiocManager::install('DESeq2') ``` Load Libraries ```{r load-packages, eval=TRUE, cache=TRUE, output=FALSE} library(knitr) library(tidyverse) library(kableExtra) library(DESeq2) library(pheatmap) library(RColorBrewer) library(data.table) library(DT) library(Biostrings) ``` ------------------------------------------------------------------------ ## Setup git hooks to automatically gitignore large files In this assignment we're dealing with large files that will end up in the ../data and ../output directories. To prevent those large files from clogging up our ability to 'git push', we can use built-in hooks to automatically ignore files larger than 100 MB (no matter the directory or file name!). Here are the steps to follow: - Create a new text file in the `.git/hooks/` directory of your repository called `pre-commit` > *Select the `More` tab with the gear icon under the RStudio Files navigator bar and select 'show hidden files' to see the .git folder*. - Add the following text to the `.git/hooks/pre-commit` file: ``` #!/bin/sh # Maximum file size (in bytes) max_file_size=104857600 # Find all files larger than max_file_size and add them to the .gitignore file find . -type f -size +$max_file_size -exec echo "{}" >> .gitignore \; ``` This code sets the max_file_size variable to 100 MB and then uses the find command to locate all files in the repository that are larger than the specified max_file_size. The exec option of the find command appends the name of each file that matches the criteria to the .gitignore file. Save the pre-commit file and make it executable by running the following command in Terminal: ```{r, engine='bash'} cd ../../ chmod +x .git/hooks/pre-commit ``` With these changes, whenever you run a git commit command, Git will first execute the pre-commit hook, which will automatically add any files larger than 100 MB to the .gitignore file. This will prevent Git from tracking these files in the repository going forward. > In the event that you accidentally committed a big file (\>100MB), you can reset to the last successful git master branch push ::: callout-warning ⚠️**warning this will overwrite any changes you made after your last successful push**⚠️ ::: > If you still want to continue, you can un-comment the code and follow this instruction: > > First, update all `origin/<branch>` refs to latest: > > ```{r, engine='bash'} > #git fetch --all > ``` > > Backup your current branch (e.g. `master`): > > ```{r, engine='bash'} > #git branch backup-master > ``` > > Jump to the latest commit on `origin/master` : > > ```{r, engine='bash'} > #git reset --hard origin/master > ``` ------------------------------------------------------------------------ # Running kallisto [kallisto](https://pachterlab.github.io/kallisto/) is a software that can be downloaded and unzipped into a `programs` directory outside of the repo. > Kallisto is a pseudo-aligner, it does not need a genome to align to! It uses the sequences in the dataset to create it's own 'reference' In our case, it is already installed on *raven* (`/home/shared/kallisto/kallisto`), and can be checked by running the `version` command as below: ```{r, engine='bash'} /home/shared/kallisto/kallisto version ``` > If you need to download and install kallisto: > > 1. navigate to a `programs` directory outside of the repo > 2. grab the applicable program from the [website](https://pachterlab.github.io/kallisto/) using `wget` > 3. uncompress the file by navigating to the `programs/kallisto` directory and running `gunzip` > 4. check functionality with `index` command as above ------------------------------------------------------------------------ # Downloading Reference This code grabs the Pacific oyster fasta file (rna.fna) of genes and does so ignoring the fact that *gannet* does not have a security certificate to authenticate (`--insecure`). This is usually not recommended however we know the server (i.e. we trust we're not going to get a virus from this server). ```{r, engine='bash'} # change to work in data directory cd ../data # download the rna.fna file to data directory from the gannet server curl --insecure -O https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/rna.fna ``` In the next code chunk we create the index file which **insert what the index file is for here**. Creating the index file can take some time (it is 1.6GB!) This code is indexing the file `rna.fna` while also renaming it as `cgigas_roslin_rna.index`. `/home/shared/kallisto/kallisto` is the absolute path to the kallisto program from within the *raven* server, while the lines after the `index` command indicate where to get the data from (the `rna.fna` file) and where to write the file to. ```{r, engine='bash'} /home/shared/kallisto/kallisto \ index -i \ ../data/cgigas_roslin_rna.index \ ../data/rna.fna ``` ------------------------------------------------------------------------ # Download sequence reads The samples that we are looking at in this assignment are Pacific oyster *Crassostrea gigas*, 12 of which are 'control' oysters under normal conditions, and 12 of which are 'desiccated', shown in this table: | | | |---------------|----------| | Sample | SampleID | | D-control | D54 | | D-control | D55 | | D-control | D56 | | D-control | D57 | | D-control | D58 | | D-control | D59 | | D-control | M45 | | D-control | M46 | | D-control | M48 | | D-control | M49 | | D-control | M89 | | D-control | M90 | | D-desiccation | N48 | | D-desiccation | N49 | | D-desiccation | N50 | | D-desiccation | N51 | | D-desiccation | N52 | | D-desiccation | N53 | | D-desiccation | N54 | | D-desiccation | N55 | | D-desiccation | N56 | | D-desiccation | N57 | | D-desiccation | N58 | | D-desiccation | N59 | Sequence reads for these samples are on a public server at <https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/nopp/> or located at absolute path `/home/shared/8TB_HDD_01/sr320/github/nb-2023/Cgigas/data`. This code uses the `--recursive` feature of `wget` to get all 24 files. Additionally as with `curl` above we are ignoring the fact there is not security certificate with `--no-check-certificate` ```{r, engine='bash'} # move to data directory cd ../data # download fastq files to data directory wget --recursive --no-parent --no-directories \ --no-check-certificate \ --accept '*.fastq.gz' \ https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/nopp/ ``` Check to make sure all 24 files were downloaded successfully ```{r, engine='bash', eval=TRUE} # move to data directory cd ../data # list all files that end with `.gz` extension, count the files in the list ls *.gz | wc -l ``` The next chunk performs the following steps: - creates a subdirectory `kallisto_01` in the `output` folder using `mkdir` - Uses the `find` utility to search for all files in the `../data/` directory that match the pattern `*fastq.gz`. - Uses the `basename` command to extract the base filename of each file (i.e., the filename without the directory path), and removes the suffix `_L001_R1_001.fastq.gz`. - Runs the kallisto `quant` command on each input file, with the following options: - `-i ../data/cgigas_roslin_rna.index`: Use the kallisto index file located at `../data/cgigas_roslin_rna.index`. - `-o ../output/kallisto_01/{}`: Write the output files to a directory called `../output/kallisto_01/` with a subdirectory named after the base filename of the input file (the {} is a placeholder for the base filename). - `-t 40`: Use 40 threads for the computation. - `--single -l 100 -s 10`: Specify that the input file contains single-end reads (--single), with an average read length of 100 (-l 100) and a standard deviation of 10 (-s 10). - The input file to process is specified using the {} placeholder, which is replaced by the base filename from the previous step. ```{r, engine='bash'} # uncomment the line of code below # mkdir ../output/kallisto_01 find ../data/*fastq.gz \ | xargs basename -s _L001_R1_001.fastq.gz | xargs -I{} /home/shared/kallisto/kallisto \ quant -i ../data/cgigas_roslin_rna.index \ -o ../output/kallisto_01/{} \ -t 40 \ --single -l 100 -s 10 ../data/{}_L001_R1_001.fastq.gz ``` This next command runs the `abundance_estimates_to_matrix.pl` script from the Trinity RNA-seq assembly software package to create a gene expression matrix from kallisto output files. The specific options and arguments used in the command are as follows: - `perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl`: Run the abundance_estimates_to_matrix.pl script from Trinity. - `--est_method kallisto`: Specify that the abundance estimates were generated using kallisto. - `--gene_trans_map none`: Do not use a gene-to-transcript mapping file. - `--out_prefix ../output/kallisto_01`: Use ../output/kallisto_01 as the output directory and prefix for the gene expression matrix file. - `--name_sample_by_basedir`: Use the sample directory name (i.e., the final directory in the input file paths) as the sample name in the output matrix. - And then there are the kallisto abundance files to use as input for creating the gene expression matrix. ```{r, engine='bash'} perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \ --est_method kallisto \ --gene_trans_map none \ --out_prefix ../output/kallisto_01 \ --name_sample_by_basedir \ ../output/kallisto_01/D54_S145/abundance.tsv \ ../output/kallisto_01/D56_S136/abundance.tsv \ ../output/kallisto_01/D58_S144/abundance.tsv \ ../output/kallisto_01/M45_S140/abundance.tsv \ ../output/kallisto_01/M48_S137/abundance.tsv \ ../output/kallisto_01/M89_S138/abundance.tsv \ ../output/kallisto_01/D55_S146/abundance.tsv \ ../output/kallisto_01/D57_S143/abundance.tsv \ ../output/kallisto_01/D59_S142/abundance.tsv \ ../output/kallisto_01/M46_S141/abundance.tsv \ ../output/kallisto_01/M49_S139/abundance.tsv \ ../output/kallisto_01/M90_S147/abundance.tsv \ ../output/kallisto_01/N48_S194/abundance.tsv \ ../output/kallisto_01/N50_S187/abundance.tsv \ ../output/kallisto_01/N52_S184/abundance.tsv \ ../output/kallisto_01/N54_S193/abundance.tsv \ ../output/kallisto_01/N56_S192/abundance.tsv \ ../output/kallisto_01/N58_S195/abundance.tsv \ ../output/kallisto_01/N49_S185/abundance.tsv \ ../output/kallisto_01/N51_S186/abundance.tsv \ ../output/kallisto_01/N53_S188/abundance.tsv \ ../output/kallisto_01/N55_S190/abundance.tsv \ ../output/kallisto_01/N57_S191/abundance.tsv \ ../output/kallisto_01/N59_S189/abundance.tsv ``` ------------------------------------------------------------------------ # Running DESeq2 This code performs differential expression analysis to identify deferentially expressed genes (DEGs) between a control condition and a desiccated condition in Pacific oysters. First, it reads in a count matrix of isoform counts generated by `kallisto`, with row names set to the gene/transcript IDs and the first column removed. It then rounds the counts to whole numbers. The `results()` function is used to extract the results table, which is ordered by gene/transcript ID. The code then prints the top few rows of the results table and calculates the number of DEGs with an adjusted p-value less than or equal to 0.05. It plots the log2 fold changes versus the mean normalized counts for all genes, highlighting significant DEGs in red and adding horizontal lines at 2-fold upregulation and downregulation. Finally, it writes the list of significant DEGs to a file called "DEGlist.tab". Read in count matrix ```{r, eval=TRUE} countmatrix <- read.delim("../output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t') rownames(countmatrix) <- countmatrix$X countmatrix <- countmatrix[,-1] head(countmatrix) ``` Round integers up to whole numbers for analysis ```{r, eval=TRUE} countmatrix <- round(countmatrix, 0) head(countmatrix) ``` Here the code creates a `data.frame` named `deseq2.colData` containing information about the experimental conditions (control & desiccated). It uses the column data dataframe named `deseq2.colData` to create a `DESeqDataSet` object using the `DESeqDataSetFromMatrix` function from the DESeq2 package. ```{r, eval=TRUE, cache=TRUE} deseq2.colData <- data.frame(condition=factor(c(rep("control", 12), rep("desicated", 12))), type=factor(rep("single-read", 24))) # set row names to match the column names in the count matrix rownames(deseq2.colData) <- colnames(data) # DESeqDataSet object created using the `DESeqDataSetFromMatrix` function deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix, colData = deseq2.colData, design = ~ condition) ``` The `DESeqDataSet` object, named `deseq.dds`, is then passed to the `DESeq()` function to fit a negative binomial model and estimate dispersions. ```{r, eval=TRUE, cache=TRUE, output=FALSE} deseq2.dds <- DESeq(deseq2.dds) deseq2.res <- results(deseq2.dds) deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ] ``` ```{r, eval=TRUE, output=FALSE} str(deseq2.res) ``` Count number of hits with adjusted p-value less then 0.05 ```{r, eval=TRUE} dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]) ``` ### Plotting DESeq2 matrix counts #### Principle Component Analysis plot What is PCA? Very simply, PCA dimensionally reduces the number of variables of a data set while preserving as much information as possible. ::: callout-tip Checkout [this YouTube video](https://youtu.be/FD4DeN81ODY) that visually explains PCA! ::: ```{r PCA, eval=TRUE} vsd <- vst(deseq2.dds, blind = FALSE) plotPCA(vsd, intgroup = "condition") ``` #### Heatmap plot ```{r heatmap, eval=TRUE, cache=TRUE} # Select top 50 differentially expressed genes res <- results(deseq2.dds) res_ordered <- res[order(res$padj), ] top_genes <- row.names(res_ordered)[1:50] # Extract counts and normalize counts <- counts(deseq2.dds, normalized = TRUE) counts_top <- counts[top_genes, ] # Log-transform counts log_counts_top <- log2(counts_top + 1) # Generate heatmap pheatmap(log_counts_top, scale = "row") ``` #### Understanding Log2 Fold Change ::: callout-tip First Checkout this great article [Comparing experimental conditions: differential expression analysis](https://biocorecrg.github.io/CRG_Bioinformatics_for_Biologists/differential_gene_expression.html#:~:text=Fold%20change%3A&text=This%20value%20is%20typically%20reported,of%202%5E1.5%20%E2%89%88%202.82.) ::: In the article linked above, *Fold Change* is explained: > For a given comparison, a positive fold change value indicates an increase of expression, while a negative fold change indicates a decrease in expression. This value is typically reported in logarithmic scale (base 2). For example, log2 fold change of 1.5 for a specific gene in the "WT vs KO comparison" means that the expression of that gene is increased in WT relative to KO by a multiplicative factor of 2\^1.5 ≈ 2.82. ```{r results='markup', eval=TRUE, cache=TRUE} tmp <- deseq2.res # The main plot plot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray", main="DEG Dessication (pval <= 0.05)", xlab="mean of normalized counts", ylab="Log2 Fold Change") # Getting the significant points and plotting them again so they're a different color tmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ] points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red") # 2 FC lines abline(h=c(-1,1), col="blue") ``` #### Volcano plot ```{r newplot, eval=TRUE, cache=TRUE} # Prepare the data for plotting res_df <- as.data.frame(deseq2.res) res_df$gene <- row.names(res_df) # Create volcano plot volcano_plot <- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = padj < 0.05)) + geom_point(alpha = 0.6, size = 1.5) + scale_color_manual(values = c("grey", "red")) + labs(title = "Volcano Plot", x = "Log2 Fold Change", y = "-Log10 Adjusted P-value", color = "Significantly\nDifferentially Expressed") + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = "top") print(volcano_plot) ``` ## Table of DEGs Save the list of Differentially Expressed Genes (DEGs)! Write the output to a table ```{r, eval=TRUE} write.table(tmp.sig, "../output/DEGlist.tab", sep = '\t', row.names = T) ``` Let's take a look at this list... ```{r, eval=TRUE, } deglist <- read.csv("../output/DEGlist.tab", sep = '\t', header = TRUE) deglist$RowName <- rownames(deglist) deglist2 <- deglist[, c("RowName", "pvalue")] # Optionally, reorder the columns head(deglist2) ``` Now that we have the list and a few visualizations, we can move to the part where we identify what these genes actually do... [Continue this analysis using Gene Ontology (GO) term annotations and NCBI BLAST](https://sr320.github.io/course-fish546-2023/assignments/01-blast.html)