For this exercise, we will be taking RNA-seq reads off the sequencer in the form of FASTQ files and determining what genes are expressed higher in treatment group A compared to treatment group B. For this we will use Kallisto, which allows us to quantify transcription levels. The goal is to generate a volcano plot, a table of differentially expressed genes, and other helpful visualizations to assess
/home/shared/kallisto/kallisto
This code confirms installation of the Kallisto software.
mkdir data #creates a data folder
cd data
curl --insecure -O https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/rna.fna #downloads the reference genome
This code is downloading the file rna.fna into the data directory.
/home/shared/kallisto/kallisto \
index -i \
data/cgigas_roslin_rna.index \
data/rna.fna
This added the cgigas_roslin_rna.index file to the data directory.
cd data
wget --recursive --no-parent --no-directories \ #another command to get a file from a URL
--no-check-certificate \
--accept '*.fastq.gz' \
https://gannet.fish.washington.edu/seashell/bu-github/nb-2023/Cgigas/data/nopp/
This code downloads several fastq files into the data folder.
mkdir output/kallisto_01
find data/*fastq.gz \
| xargs basename -s _L001_R1_001.fastq.gz | xargs -I{} /home/shared/kallisto/kallisto \
quant -i data/cgigas_roslin_rna.index \
-o output/kallisto_01/{} \
-t 24 \
--single -l 100 -s 10 data/{}_L001_R1_001.fastq.gz
This creates an output directory and populates it with sequence reads.
Now we quantify the transcript expression levels using Kallisto. From the lecture notes: “Kallisto uses a pseudoalignment approach, which is much faster than traditional alignment-based methods and does not require a reference genome.”
perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
--gene_trans_map none \
--out_prefix output/kallisto_01 \
--name_sample_by_basedir \
output/kallisto_01/D54_S145/abundance.tsv \
output/kallisto_01/D56_S136/abundance.tsv \
output/kallisto_01/D58_S144/abundance.tsv \
output/kallisto_01/M45_S140/abundance.tsv \
output/kallisto_01/M48_S137/abundance.tsv \
output/kallisto_01/M89_S138/abundance.tsv \
output/kallisto_01/D55_S146/abundance.tsv \
output/kallisto_01/D57_S143/abundance.tsv \
output/kallisto_01/D59_S142/abundance.tsv \
output/kallisto_01/M46_S141/abundance.tsv \
output/kallisto_01/M49_S139/abundance.tsv \
output/kallisto_01/M90_S147/abundance.tsv \
output/kallisto_01/N48_S194/abundance.tsv \
output/kallisto_01/N50_S187/abundance.tsv \
output/kallisto_01/N52_S184/abundance.tsv \
output/kallisto_01/N54_S193/abundance.tsv \
output/kallisto_01/N56_S192/abundance.tsv \
output/kallisto_01/N58_S195/abundance.tsv \
output/kallisto_01/N49_S185/abundance.tsv \
output/kallisto_01/N51_S186/abundance.tsv \
output/kallisto_01/N53_S188/abundance.tsv \
output/kallisto_01/N55_S190/abundance.tsv \
output/kallisto_01/N57_S191/abundance.tsv \
output/kallisto_01/N59_S189/abundance.tsv
This code runs the abundance_estimates_to_matrix.pl script from Trinity RNA-seq to create a gene expression matrix from Kallisto output files.
Now we will perform differential expression analysis to identify differentially expressed genes (DEGs) between a control and a desiccated condition.
countmatrix <- read.delim("output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)
## D54_S145 D56_S136 D58_S144 M45_S140 M48_S137 M89_S138 D55_S146
## XM_034466747.1 0 4.46531 0 2 0.00000 2.6438 0
## XM_034468017.1 0 0.00000 0 0 0.00000 0.0000 0
## XM_011447437.3 0 0.00000 0 0 0.00000 0.0000 1
## XM_011451485.3 0 0.00000 0 0 0.00000 0.0000 0
## XM_034453812.1 0 0.00000 0 0 3.21794 0.0000 0
## XM_034446851.1 0 3.00000 12 0 0.00000 0.0000 0
## D57_S143 D59_S142 M46_S141 M49_S139 M90_S147 N48_S194 N50_S187
## XM_034466747.1 1.59852 0 0 0 0 0 0
## XM_034468017.1 0.00000 0 0 0 0 0 0
## XM_011447437.3 0.00000 0 0 0 0 0 0
## XM_011451485.3 0.00000 0 0 0 0 0 0
## XM_034453812.1 0.00000 0 0 0 0 0 0
## XM_034446851.1 0.00000 0 0 0 0 3 0
## N52_S184 N54_S193 N56_S192 N58_S195 N49_S185 N51_S186 N53_S188
## XM_034466747.1 0 0.000000 0 0 0 0 0
## XM_034468017.1 0 0.000000 0 0 0 0 0
## XM_011447437.3 0 0.000000 0 0 0 0 0
## XM_011451485.3 0 0.000000 0 0 0 0 0
## XM_034453812.1 0 0.386561 2 0 0 0 0
## XM_034446851.1 0 0.000000 0 0 0 0 1
## N55_S190 N57_S191 N59_S189
## XM_034466747.1 0 0 0
## XM_034468017.1 0 0 0
## XM_011447437.3 0 0 1
## XM_011451485.3 0 0 0
## XM_034453812.1 0 0 0
## XM_034446851.1 0 3 0
We created a table showing differentially expressed genes.
countmatrix <- round(countmatrix, 0)
str(countmatrix)
## 'data.frame': 73307 obs. of 24 variables:
## $ D54_S145: num 0 0 0 0 0 0 0 18 5 0 ...
## $ D56_S136: num 4 0 0 0 0 3 1 18 0 0 ...
## $ D58_S144: num 0 0 0 0 0 12 12 0 4 0 ...
## $ M45_S140: num 2 0 0 0 0 0 1 0 12 0 ...
## $ M48_S137: num 0 0 0 0 3 0 0 0 8 0 ...
## $ M89_S138: num 3 0 0 0 0 0 0 0 1 0 ...
## $ D55_S146: num 0 0 1 0 0 0 1 17 0 0 ...
## $ D57_S143: num 2 0 0 0 0 0 0 0 5 0 ...
## $ D59_S142: num 0 0 0 0 0 0 4 44 0 0 ...
## $ M46_S141: num 0 0 0 0 0 0 2 32 0 0 ...
## $ M49_S139: num 0 0 0 0 0 0 3 11 0 0 ...
## $ M90_S147: num 0 0 0 0 0 0 0 0 10 0 ...
## $ N48_S194: num 0 0 0 0 0 3 1 0 6 0 ...
## $ N50_S187: num 0 0 0 0 0 0 1 0 5 0 ...
## $ N52_S184: num 0 0 0 0 0 0 1 0 0 0 ...
## $ N54_S193: num 0 0 0 0 0 0 0 0 1 0 ...
## $ N56_S192: num 0 0 0 0 2 0 0 0 0 0 ...
## $ N58_S195: num 0 0 0 0 0 0 0 0 16 0 ...
## $ N49_S185: num 0 0 0 0 0 0 6 0 14 0 ...
## $ N51_S186: num 0 0 0 0 0 0 2 16 20 0 ...
## $ N53_S188: num 0 0 0 0 0 1 0 0 6 0 ...
## $ N55_S190: num 0 0 0 0 0 0 2 0 0 0 ...
## $ N57_S191: num 0 0 0 0 0 3 0 0 4 0 ...
## $ N59_S189: num 0 0 1 0 0 0 2 0 4 0 ...
deseq2.colData <- data.frame(condition=factor(c(rep("control", 12), rep("desicated", 12))),
type=factor(rep("single-read", 24)))
rownames(deseq2.colData) <- colnames(data)
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
colData = deseq2.colData,
design = ~ condition)
Below we convert counts to integer mode.
deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds)
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]
head(deseq2.res)
## log2 fold change (MLE): condition desicated vs control
## Wald test p-value: condition desicated vs control
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## NM_001305288.1 0.181270 1.0453698 3.002647 0.348149 7.27728e-01
## NM_001305289.1 0.881457 -2.8119577 1.068276 -2.632239 8.48240e-03
## NM_001305290.1 145.913728 0.4580323 0.116185 3.942251 8.07203e-05
## NM_001305291.1 0.261701 0.5618449 1.587076 0.354013 7.23329e-01
## NM_001305292.1 2.902430 -1.2181330 0.763421 -1.595624 1.10573e-01
## NM_001305293.1 234.342117 0.0663449 0.131969 0.502731 6.15154e-01
## padj
## <numeric>
## NM_001305288.1 NA
## NM_001305289.1 NA
## NM_001305290.1 0.00956401
## NM_001305291.1 NA
## NM_001305292.1 0.59541971
## NM_001305293.1 0.95562321
This actually previews our analysis. log2 fold change (MLE): condition dessicated vs control Wald test p-value: condition desicated vs control DataFrame with 6 rows and 6 columns.
# Count number of hits with adjusted p-value less then 0.05
dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
We have 607 significantly differentially expressed genes, based on padj <= 0.05. There are 6 columns of data for each.
tmp <- deseq2.res
# The main plot
plot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
main="DEG Dessication (pval <= 0.05)",
xlab="mean of normalized counts",
ylab="Log2 Fold Change")
#Getting the significant points and plotting them again so they're a different color
tmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]
points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red")
# 2 FC lines
abline(h=c(-1,1), col="blue")
Now we save the differentially expressed genes (DEGs) to a file.
write.table(tmp.sig, "output/DEGlist.tab", row.names = T)
We can also take the data and see if samples cluster by condition (control vs desiccated).
vsd <- vst(deseq2.dds, blind = FALSE)
plotPCA(vsd, intgroup = "condition") +
ggtitle("PCA of RNA-Seq Samples (vsd transformed)")
There is some grouping/striation between the 2 conditions. So
dessication may explain the variation.
top_genes <- head(order(deseq2.res$padj), 50) # pulls the top DEGs
mat <- assay(vsd)[top_genes, ] # extract expression values for those genes
mat <- mat - rowMeans(mat) #centers the rows
ann_col <- as.data.frame(colData(vsd)[, "condition", drop = FALSE]) #annotation dataframe
rownames(ann_col) <- colnames(mat) # ensures rownames match matrix colnames
pheatmap(mat,
cluster_rows = TRUE,
cluster_cols = TRUE,
annotation_col = ann_col,
show_rownames = FALSE,
fontsize_col = 8,
main = "Top 50 Differentially Expressed Genes")