RNAseq Data Anaylsis Workflow Update

Celeste Valdivia

Project goal

Conduct differential expression analysis from 4 of the 90 RNAseq files produced from Kowarsky et al., 2021.

Run Bases dev_stage TISSUE
SRR10352205 10576501436 b5-b6 whole system
SRR10352206 3609019200 b5-b6 whole system
SRR10352224 4847587800 TO whole system
SRR10352254 5482368900 TO whole system

Visualization of blastogenic stages

Methods Overview

  1. Access RNAseq data
  2. Create a refernce index
  3. Align RNAseq data to reference
  4. Make transcript count matrix
  5. Conduct differential expression analysis using DESeq2

Accessing RNAseq data

For the data we will generate in our own project, the SRA Toolkit will not be used. Instead I plan to curl the data directly from Gannet server, a computer available for data hosting in the Roberts Lab.

/home/shared/sratoolkit.3.0.2-ubuntu64/bin/./fasterq-dump \
--outdir /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw \
--progress \
SRR10352205 \
SRR10352206 \
SRR10352224 \
SRR10352254

Transcript Quantification (psuedo-alignment)

Created a reference with ‘index’ command of Kallisto using ~99k mRNA “complete records” from NCBI.

/home/shared/kallisto/kallisto \
index -i \
/home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/data/bsc_rna.index \
/home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/data/sequence.fasta

Use the ‘quant’ command on Kallisto to conduct the psuedo-alignment:

/home/shared/kallisto/kallisto quant -i ../data/bsc_rna.index \
-o ../output/kallisto_01/SRR10352205 \
-t 40 \
/home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/SRR10352205_1.fastq \
/home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/SRR10352205_2.fastq

Making a transcript count matrix with Trinity

Make transcript count matrix using the perl abundance_estimates_to_matrix Trinity script:

perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
--gene_trans_map none \
--out_prefix ../output/kallisto_01 \
    --name_sample_by_basedir \
    ../output/kallisto_01/SRR10352205/abundance.tsv \
    ../output/kallisto_01/SRR10352206/abundance.tsv \
    ../output/kallisto_01/SRR10352224/abundance.tsv \
    ../output/kallisto_01/SRR10352254/abundance.tsv 

Viewing the initial count matrix

Here we will tidy up the count matrix and take a peak at the first 6 rows.

           SRR10352205 SRR10352206 SRR10352224 SRR10352254
JG353837.1           0           0           0           0
JG390417.1         840          26           0           0
JG368003.1           0          28           0           0
JG365003.1           0           0           0           0
JG314272.1           0           0          17           2
JG329595.1           0           0           0           0

Checked how many genes are in the count matrix

[1] "Number of genes to evaluate: 99708"

6. Using DESeq2

Now we will follow the standard DESeq workflow

# Make a new data frame containing info about conditions in study
deseq2.colData <- data.frame(condition=factor(c(rep("b5-b6", 2), rep("TO", 2))), 
                             type=factor(rep("paired-end", 4)))
rownames(deseq2.colData) <- colnames(data)
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
                                     colData = deseq2.colData, 
                                     design = ~ condition)
# Run a DESeq analysis
deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds)
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]
hits <- nrow(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
print(paste("Number of differentially expressed genes with p-value less than 0.05:", hits))
[1] "Number of differentially expressed genes with p-value less than 0.05: 1971"

Preliminary results: Exploratory PCA

Preliminary results: Exploratory heatmap

Preliminary results: Exploratory Volcano Plot

Next Steps

  1. Go back and do a proper transcriptome alignment. Found annotated genome on Botryllus schlosseri Genome Project web server.
  2. Play with making a better looking plot for my DEG analysis
  3. Go back and do further quality assessment and control since there is so much variance within one of my groups due to differences in sequencing depth amongst samples of the same group.