Conduct differential expression analysis from 4 of the 90 RNAseq files produced from Kowarsky et al., 2021.
Run | Bases | dev_stage | TISSUE |
---|---|---|---|
SRR10352205 | 10576501436 | b5-b6 | whole system |
SRR10352206 | 3609019200 | b5-b6 | whole system |
SRR10352224 | 4847587800 | TO | whole system |
SRR10352254 | 5482368900 | TO | whole system |
For the data we will generate in our own project, the SRA Toolkit will not be used. Instead I plan to curl the data directly from Gannet server, a computer available for data hosting in the Roberts Lab.
Created a reference with ‘index’ command of Kallisto using ~99k mRNA “complete records” from NCBI.
Use the ‘quant’ command on Kallisto to conduct the psuedo-alignment:
Make transcript count matrix using the perl abundance_estimates_to_matrix Trinity script:
perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
--gene_trans_map none \
--out_prefix ../output/kallisto_01 \
--name_sample_by_basedir \
../output/kallisto_01/SRR10352205/abundance.tsv \
../output/kallisto_01/SRR10352206/abundance.tsv \
../output/kallisto_01/SRR10352224/abundance.tsv \
../output/kallisto_01/SRR10352254/abundance.tsv
Here we will tidy up the count matrix and take a peak at the first 6 rows.
SRR10352205 SRR10352206 SRR10352224 SRR10352254
JG353837.1 0 0 0 0
JG390417.1 840 26 0 0
JG368003.1 0 28 0 0
JG365003.1 0 0 0 0
JG314272.1 0 0 17 2
JG329595.1 0 0 0 0
Checked how many genes are in the count matrix
[1] "Number of genes to evaluate: 99708"
Now we will follow the standard DESeq workflow
# Make a new data frame containing info about conditions in study
deseq2.colData <- data.frame(condition=factor(c(rep("b5-b6", 2), rep("TO", 2))),
type=factor(rep("paired-end", 4)))
rownames(deseq2.colData) <- colnames(data)
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
colData = deseq2.colData,
design = ~ condition)
# Run a DESeq analysis
deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds)
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]
hits <- nrow(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
print(paste("Number of differentially expressed genes with p-value less than 0.05:", hits))
[1] "Number of differentially expressed genes with p-value less than 0.05: 1971"