RNAseq Data Anaylsis Workflow Update

Celeste Valdivia

Project goal

Conduct differential expression analysis from 4 of the 90 RNAseq files produced from Kowarsky et al., 2021.

Run	Bases	dev_stage	TISSUE
SRR10352205	10576501436	b5-b6	whole system
SRR10352206	3609019200	b5-b6	whole system
SRR10352224	4847587800	TO	whole system
SRR10352254	5482368900	TO	whole system

Visualization of blastogenic stages

Methods Overview

Access RNAseq data
Create a refernce index
Align RNAseq data to reference
Make transcript count matrix
Conduct differential expression analysis using DESeq2

Accessing RNAseq data

For the data we will generate in our own project, the SRA Toolkit will not be used. Instead I plan to curl the data directly from Gannet server, a computer available for data hosting in the Roberts Lab.

/home/shared/sratoolkit.3.0.2-ubuntu64/bin/./fasterq-dump \
--outdir /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw \
--progress \
SRR10352205 \
SRR10352206 \
SRR10352224 \
SRR10352254

Transcript Quantification (psuedo-alignment)

Created a reference with ‘index’ command of Kallisto using ~99k mRNA “complete records” from NCBI.

/home/shared/kallisto/kallisto \
index -i \
/home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/data/bsc_rna.index \
/home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/data/sequence.fasta

Use the ‘quant’ command on Kallisto to conduct the psuedo-alignment:

/home/shared/kallisto/kallisto quant -i ../data/bsc_rna.index \
-o ../output/kallisto_01/SRR10352205 \
-t 40 \
/home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/SRR10352205_1.fastq \
/home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/SRR10352205_2.fastq

Making a transcript count matrix with Trinity

Make transcript count matrix using the perl abundance_estimates_to_matrix Trinity script:

perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
--gene_trans_map none \
--out_prefix ../output/kallisto_01 \
    --name_sample_by_basedir \
    ../output/kallisto_01/SRR10352205/abundance.tsv \
    ../output/kallisto_01/SRR10352206/abundance.tsv \
    ../output/kallisto_01/SRR10352224/abundance.tsv \
    ../output/kallisto_01/SRR10352254/abundance.tsv

Viewing the initial count matrix

Here we will tidy up the count matrix and take a peak at the first 6 rows.

           SRR10352205 SRR10352206 SRR10352224 SRR10352254
JG353837.1           0           0           0           0
JG390417.1         840          26           0           0
JG368003.1           0          28           0           0
JG365003.1           0           0           0           0
JG314272.1           0           0          17           2
JG329595.1           0           0           0           0

Checked how many genes are in the count matrix

[1] "Number of genes to evaluate: 99708"

6. Using DESeq2

Now we will follow the standard DESeq workflow

# Make a new data frame containing info about conditions in study
deseq2.colData <- data.frame(condition=factor(c(rep("b5-b6", 2), rep("TO", 2))), 
                             type=factor(rep("paired-end", 4)))
rownames(deseq2.colData) <- colnames(data)
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
                                     colData = deseq2.colData, 
                                     design = ~ condition)
# Run a DESeq analysis
deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds)
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]
hits <- nrow(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
print(paste("Number of differentially expressed genes with p-value less than 0.05:", hits))

[1] "Number of differentially expressed genes with p-value less than 0.05: 1971"

Preliminary results: Exploratory PCA

Preliminary results: Exploratory heatmap

Preliminary results: Exploratory Volcano Plot

Next Steps

Go back and do a proper transcriptome alignment. Found annotated genome on Botryllus schlosseri Genome Project web server.
Play with making a better looking plot for my DEG analysis
Go back and do further quality assessment and control since there is so much variance within one of my groups due to differences in sequencing depth amongst samples of the same group.