I have conducted a small scale replication of the transcriptomic BioProject PRJNA579844 which has 90 RNAseq run files itself.
Briefly, colonial marine tunicates like Botryllus schlosseri are capable of both sexually reproducing through a single fertilized egg (embryogenesis) and through asexual reproduction where new generations are derived from somatic cells of the parental body (blastogenesis). In healthy Botryllus schlosseri, blastogenesis occurs as a weekly synchronized process. Transcriptomic data available from this project isolated RNA across several developmental stages of Botryllus schlosseri from the whole body.
Within this project I compared the whole transcriptome of early blastogenic stage b5-b6 in the secondary bud to the late takeover (TO) blastogenic stage in the adult zooid. Below is a schematic that helps visualize these two stages:
Asexual developmental cycle of Botryllus schlosseri
Below is the metadata from SRA Run Selector.
Run | Bases | BioProject | Bytes | dev_stage | Organism | Platform | Tissue |
---|---|---|---|---|---|---|---|
SRR10352205 | 10576501436 | PRJNA579844 | 6996363570 | b5-b6 | Botryllus schlosseri | ILLUMINA | whole system |
SRR10352206 | 3609019200 | PRJNA579844 | 1836994542 | b5-b6 | Botryllus schlosseri | ILLUMINA | whole system |
SRR10352224 | 4847587800 | PRJNA579844 | 2533047747 | TO | Botryllus schlosseri | ILLUMINA | whole system |
SRR10352254 | 5482368900 | PRJNA579844 | 2984116333 | TO | Botryllus schlosseri | ILLUMINA | whole system |
# move to a large hard drive
cd /home/shared/8TB_HDD_01
# making a new directory
mkdir cvaldi/tunicate-fish546/raw
Here we used NCBI’s SRA Toolkit to download raw fastq files using
fasterq-dump
.
/home/shared/sratoolkit.3.0.2-ubuntu64/bin/./fasterq-dump \
\
--outdir /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw \
--progress \
SRR10352205 \
SRR10352206 \
SRR10352224 SRR10352254
Below we generated checksums for all the raw fastq files using the
bash md5sum
command.
cd /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw
md5sum *.fastq > md5.original.download.20230413
cat /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/md5.original.download.20230413
## 509022c6a21cdd5bf0f5066f9b9c346c SRR10352205_1.fastq
## 473b2a1e27fbe5ffed8003a00e48a684 SRR10352205_2.fastq
## afd3278c40c120e00dc8b2d109afd6b6 SRR10352206_1.fastq
## d895814f5a2238e9bb02767354c59b11 SRR10352206_2.fastq
## c5a9b8ee11c148374eab253bb5a31d61 SRR10352224_1.fastq
## 0822524636d5771899451bdff467f11a SRR10352224_2.fastq
## d021331125fe3f3b36ed00395b2521bb SRR10352254_1.fastq
## a6f10754f23cb7d0e07e5371aacc0c16 SRR10352254_2.fastq
Below, we used the FastQC software to generate html reports regarding the quality of the raw RNAseq data.
cd ../raw
/home/shared/FastQC/fastqc *.fastq
Botryllus schlosseri currently have several bioinformatic data files available for download on Stanford University’s Botryllus schlosseri Genome Project Server. To do any differential gene expression analysis, you will need either a reference (annotated) genome or reference transcriptome to be able to align your RNAseq data to that reference. Although the Botryllus schlosseri genome server does have annotated genomes, they are in a non-standard format that need to be normalized in order for it to run through a standard genome alignment workflow using HISat2.
As such, here we conducted a psuedo-alignment of the RNAseq files against a reference transcriptome that was generated from all the mRNA files available for this species on NCBI.
Below, we used the Kallisto index
command to
index the reference transcriptome and name the index
“bsc_rna.index”.
/home/shared/kallisto/kallisto \
-i \
index \
../data/bsc_rna.index ../data/sequence.fasta
Below we used the quant
command in Kallisto to
align all the raw fastq files ot the reference index. This quantified
abundance of transcripts for each query.
#!/bin/bash
# Set input directory and index file
input_dir=/home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/
index_file=../data/psuedo-alignment/bsc_rna.index
# Set output directory
output_dir=../output/kallisto_01
# Set number of threads
threads=10
# Loop through all paired-end fastq files in input directory
for f in ${input_dir}/*_1.fastq; do
# Get base filename
base=$(basename ${f} _1.fastq)
# Run kallisto to quantify transcripts
/home/shared/kallisto/kallisto quant -i ${index_file} \
-o ${output_dir}/${base} \
-t ${threads} \
${input_dir}/${base}_1.fastq ${input_dir}/${base}_2.fastq
done
We used the abundance_estimates_to_matrix.pl script from the Trinity RNA-seq assembly software package to create a gene expression count matrix from the “abundance.tsv” Kallisto output files.
perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
\
--est_method kallisto --gene_trans_map none \
--out_prefix ../output/kallisto_01 \
--name_sample_by_basedir \
\
../output/kallisto_01/SRR10352205/abundance.tsv \
../output/kallisto_01/SRR10352206/abundance.tsv \
../output/kallisto_01/SRR10352224/abundance.tsv ../output/kallisto_01/SRR10352254/abundance.tsv
Below, we read in and modify the count matrix we generated from Kallisto.
<- read.delim("../output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t')
countmatrix rownames(countmatrix) <- countmatrix$X #rename row names from 1-n to the value of the first column called x
<- countmatrix[,-1] #remove first column
countmatrix <- round(countmatrix, 0) #rounds values in count matrix
countmatrix head(countmatrix) #take a peak at the data frame
## SRR10352205 SRR10352206 SRR10352224 SRR10352254
## JG353837.1 0 0 0 0
## JG390417.1 840 26 0 0
## JG368003.1 0 28 0 0
## JG365003.1 0 0 0 0
## JG314272.1 0 0 17 2
## JG329595.1 0 0 0 0
We then made a new data frame that contains information about the
treatment groups and the type of RNAseq data produced. We then used the
DESeqDataSetFromMatrix
function from the DESeq2
package to prepare a data set for the DEG analysis.
<- data.frame(condition=factor(c(rep("b5-b6", 2), rep("TO", 2))),
deseq2.colData type=factor(rep("paired-end", 4)))
<- DESeqDataSetFromMatrix(countData = countmatrix,
deseq2.dds colData = deseq2.colData,
design = ~ condition)
Next we will conduct the differential gene expression analysis using
the DESeq
function:
<- DESeq(deseq2.dds)
deseq2.dds <- results(deseq2.dds) #this places the results in an object that we can open
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ] #reorders the rows deseq2.res
Evaluate the DEseq results.
head(deseq2.res) #take a peak
## log2 fold change (MLE): condition TO vs b5.b6
## Wald test p-value: condition TO vs b5.b6
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## AF201094.1 0.00000 NA NA NA NA NA
## AF201095.1 0.00000 NA NA NA NA NA
## AF201096.1 0.00000 NA NA NA NA NA
## AF201097.1 3.10213 -3.00407 3.36211 -0.893508 0.371585 NA
## AF201098.1 0.00000 NA NA NA NA NA
## AF201099.1 0.00000 NA NA NA NA NA
Check the number of “hits” in the DEseq results using the
nrow
function.
# Count number of hits with adjusted p-value less then 0.05
nrow(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
## [1] 1971
<- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ] #writes significant hits into an object tmp.sig
Finally, we write out the list of differentially expressed genes into a tab file that will later be annotated.
write.table(tmp.sig, "../output/DEGlist.tab", row.names = T)
# Select top 50 differentially expressed genes
<- results(deseq2.dds)
res <- res[order(res$padj), ]
res_ordered <- row.names(res_ordered)[1:50]
top_genes
# Extract counts and normalize
<- counts(deseq2.dds, normalized = TRUE)
counts <- counts[top_genes, ]
counts_top
# Log-transform counts
<- log2(counts_top + 1)
log_counts_top
# Generate heatmap
pheatmap(log_counts_top, scale = "row")
# Prepare the data for plotting
<- as.data.frame(deseq2.res)
res_df $gene <- row.names(res_df)
res_df
# Create volcano plot
<- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = padj < 0.05)) +
volcano_plot geom_point(alpha = 0.6, size = 1.5) +
scale_color_manual(values = c("grey", "red")) +
labs(title = "Volcano Plot",
x = "Log2 Fold Change",
y = "-Log10 Adjusted P-value",
color = "Significantly\nDifferentially Expressed") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "top")
print(volcano_plot)
Lets look at what’s in the reference transcriptome file before we BLAST it.
cd ../data/psuedo-alignment
head sequence.fasta
## >JG389647.1 CCTU20128.b1 CCTU Botryllus schlosseri total asexual and embronic development subtracted Botryllus schlosseri cDNA clone CCTU20128 5', mRNA sequence
## ATTCCTCCTCTATTTGCTTCGATGAACCTGTCATACCATACACCAGCTTGACTTGTCCTGGAGTGTCAGA
## CATTCCGTGTGATGTGTCATGTCCCACTGGACTAGTGTATGATGGATATAGATGCGTCAATATTGAAGAC
## TGTCCTTGCACCTACAAGGACGTTCGACATAAGTCCGGATCTGTGTGGAAGGAAGATGACGGTTGTGTCA
## ATTGCATCTGCATTGGCGGTCATTCTAATTGTGTCGCCCAAACCTGTGACGTAACAAGTTGTGAACCAGG
## ATACGCACTAATTGAAGTTGATGGTGAATGCTGTCCTCGTTGCTTGAAAACAAATGTGACCTGTAGCGAT
## GAAACTCATGAGATCGGAGAAGTCTGGCATGATGATTGTGACGTTTGCACATGCACTGAGTATGGAGTAG
## TCAAATGTGAACATGTTGACTGTCCAGCTACAACTCCGCCTGAATGCCCTGCGAGTCACGAATTAGTGCA
## GAAAGTTGGTGAATGCTGTCCGAAATATGAATGTGTCTGTAATACTGATCTTTGTGACACTTCTGAGCCG
## ACTTGCCCGGCGAACCACAAAGTTGTTGCATCCTTGACTGATGGATGTTGCCCAGAATATCAGTGCGTCT
cd ../data/psuedo-alignment
echo "How many sequences are there?"
grep -c ">" sequence.fasta
## How many sequences are there?
## 99708
Use the NCBI blast software command makeblastdb
to
create a blast database from the UniProt fasta file.
/home/shared/ncbi-blast-2.11.0+/bin/makeblastdb \
\
-in /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/data/uniprot_sprot_r2023_01.fasta \
-dbtype prot -out /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/blastdb/uniprot_sprot_r2023_01
Here we used the blastx
command to BLAST the reference
transcriptome and made a new table that includes the associated gene ID
for each sequence in the original reference transcriptome.
/home/shared/ncbi-blast-2.11.0+/bin/blastx \
\
-query /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/data/psuedo-alignment/sequence.fasta \
-db /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/blastdb/uniprot_sprot_r2023_01 \
-out /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/Bsc-uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 20 \
-max_target_seqs 1 -outfmt 6
Below, we clean up the tab file we created to separate out the information regarding gene ID.
tr '|' '\t' < ../output/Bsc-uniprot_blastx.tab | head -2
# replace the "|" with "\t" in the file "Ab_4-uniprot_blastx.tab"
# and save the output to the file "Ab_4-uniprot_blastx_sep.tab"
tr '|' '\t' < ../output/Bsc-uniprot_blastx.tab \
> ../output/Bsc-uniprot_blastx_sep.tab
Peak at the BLASTed reference:
head -2 ../output/Bsc-uniprot_blastx_sep.tab
## JG389643.1 sp Q8IUL8 CILP2_HUMAN 40.714 140 77 5 160 564 59 197 2.95e-25 107
## JG389632.1 sp Q12959 DLG1_HUMAN 82.143 56 10 0 325 492 746 801 2.30e-22 99.4
UniProt is a freely accessible database of protein sequence and functional information that is continuously updated every 8 weeks.
cd ../data
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz
gunzip -k uniprot_sprot_r2023_01.fasta.gz
First read in the BLASTed reference transcriptome table into the R global environment.
<- read.csv("../output/Bsc-uniprot_blastx_sep.tab", sep = '\t', header = FALSE) bltabl
Also read in a UniProt table into the R global environment.
#reading in annotation table from Gannet and saving as object....
<- "https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab"
url <- GET(url, config(ssl_verifypeer = FALSE))
response <- read.csv(text = rawToChar(response$content), sep = "\t", header = TRUE) spgo
Joining the two tables.
left_join(bltabl, spgo, by = c("V3" = "Entry")) %>%
select(V1, V3, V4, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>%
mutate(V1 = str_replace_all(V1,
pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>%
write_delim("../output/blast_annot_go.tab", delim = '\t')
Evaluate annotated reference transcriptome. It now contains the original gene ID, gene names, Gene Ontology (GO) terms, and protein names.
cd ../output
head -n 2 blast_annot_go.tab
## V1 V3 V4 V13 Protein.names Organism Gene.Ontology..biological.process. Gene.Ontology.IDs
## JG389643.1 Q8IUL8 CILP2_HUMAN 2.95e-25 Cartilage intermediate layer protein 2 (CILP-2) [Cleaved into: Cartilage intermediate layer protein 2 C1; Cartilage intermediate layer protein 2 C2] Homo sapiens (Human) GO:0005615; GO:0070062
Evaluate DEG list. Data needs to be cleaned before we can merge it with the annotated reference.
cd ../output
head -n 3 DEGlist.tab
## "baseMean" "log2FoldChange" "lfcSE" "stat" "pvalue" "padj"
## "AF201122.1" 34.0837884412689 8.76461374330488 2.30407565629183 3.80396091568043 0.000142400657707493 0.00543819960949714
## "AF201134.1" 48.4069541381205 -6.97289922608518 2.4215424716195 -2.8795279487383 0.00398270995509485 0.0345116898968149
Read in DEG list as object into R environment and clean up the data frame.
<- read.delim(file = "../output/DEGlist.tab", sep =" ", header = TRUE)
diff_genes $access_num <- row.names(diff_genes) #new column
diff_genesrownames(diff_genes) <- diff_genes$X #renames all the rows to 1-n
<- diff_genes[,c(7, 1:6)] #puts the gene column in front diff_genes
Next, read in the blast annotated GO terms table into our R environment.
<- read.delim(file = "../output/blast_annot_go.tab")
blast_go colnames(blast_go)[1] <- 'access_num' #renaming the first column
Now that we have both the blast go and differential gene list in the right format, we can merge the two so that we have information regarding gene ID and function paired with the list of differentially expressed genes.
# Perform inner join between diff_genes and blast_go data frames based on shared "access_num" column
<- merge(diff_genes, blast_go, by = "access_num", all = FALSE) annote_DEGlist
# Print the merged table
head(annote_DEGlist, 2)
## access_num baseMean log2FoldChange lfcSE stat pvalue
## 1 AF201134.1 48.40695 -6.972899 2.421542 -2.879528 0.0039827100
## 2 AF201259.1 189.65058 -8.943305 2.422191 -3.692237 0.0002222898
## padj V3 V4 V13
## 1 0.034511690 P27604 SAHH_CAEEL 4.60e-100
## 2 0.007245743 P62246 RS15A_RAT 2.22e-82
## Protein.names
## 1 Adenosylhomocysteinase (AdoHcyase) (EC 3.13.2.1) (Protein dumpy-14) (S-adenosyl-L-homocysteine hydrolase)
## 2 40S ribosomal protein S15a
## Organism
## 1 Caenorhabditis elegans
## 2 Rattus norvegicus (Rat)
## Gene.Ontology..biological.process.
## 1 one-carbon metabolic process [GO:0006730]; S-adenosylhomocysteine catabolic process [GO:0019510]; S-adenosylmethionine cycle [GO:0033353]
## 2 positive regulation of cell cycle [GO:0045787]; positive regulation of cell population proliferation [GO:0008284]; response to virus [GO:0009615]; translation [GO:0006412]
## Gene.Ontology.IDs
## 1 GO:0004013; GO:0005829; GO:0006730; GO:0019510; GO:0033353
## 2 GO:0003735; GO:0005737; GO:0006412; GO:0008284; GO:0009615; GO:0022626; GO:0022627; GO:0045202; GO:0045787
Finally, write out the annotated DEG list to save it to our output directory.
write.table(annote_DEGlist, "../output/annote_DEGlist.tab", row.names = T)
Make a vector of the top 100 differentially expressed genes:
# Select top 100 differentially expressed genes
<- results(deseq2.dds)
deseq2.res <- deseq2.res[order(deseq2.res$padj), ]
res_ordered #create object that we will reference later
<- row.names(res_ordered)[1:100] top_genes
Next in order to merge the annotated blast table with the counts data we will need to transform the counts matrix into a data frame so that we can use the merge() function.
#extract counts and normalizes them
<- counts(deseq2.dds, normalized = TRUE)
counts <- as.data.frame(counts)
counts $access_num <- row.names(counts) #new column
counts<- counts[ , c(5, 1:4)] #puts accession col in front counts
Read in blast annotated table:
<- read.delim(file = "../output/blast_annot_go.tab")
blast_go #renaming the first column so it has the same name for merging
colnames(blast_go)[1] <- 'access_num'
Merge blast data frame and count data frame by an inner join. Note, there will be duplicate protein entries in this new merged data frame. This is normal and so we are just going to get rid duplicates.
#merge the blast table and counts table together
#both must be data frames
<- merge(counts, blast_go, by = "access_num", all = FALSE)
counts_pro
#checking if there any duplicates
head(which(duplicated(counts_pro)))
## [1] 2 5 20 23 25 27
#eliminate duplicate accession numbers
<- distinct(counts_pro)
counts_pro_unico
#checking if there are any duplicates still, should be 0
head(which(duplicated(counts_pro_unico)))
## integer(0)
<- as.matrix(counts_pro_unico) #make this a matrix again so it can take the rownames()
counts_matrix_unico
<- counts_matrix_unico[,1] #making a vector of names
access_num_unico
rownames(counts_matrix_unico) <- access_num_unico #rename all rows the access_num
We needed to name the rows the accession number since we will be using this to pull out the top differentially expressed genes that we named in the beginning of this section.
# Identify top genes that are present in the counts_matrix_unico row names
<- intersect(top_genes, row.names(counts_matrix_unico))
present_genes
# Extract the rows corresponding to the present genes
<- counts_matrix_unico[present_genes, ]
counts_top
head(counts_top) #lets take a peak!
## access_num SRR10352205 SRR10352206 SRR10352224
## HO069547.1 "HO069547.1" "1.142747e+03" "0.000000e+00" " 0.000000"
## JG340639.1 "JG340639.1" "9.921528e+02" "0.000000e+00" " 0.000000"
## JG337431.1 "JG337431.1" "1.419221e+03" "0.000000e+00" " 0.000000"
## JG333944.1 "JG333944.1" "8.909001e+02" "0.000000e+00" " 0.000000"
## JG332820.1 "JG332820.1" "6.986705e+02" "0.000000e+00" " 0.000000"
## JG330205.1 "JG330205.1" "5.686276e+02" "0.000000e+00" " 0.000000"
## SRR10352254 V3 V4 V13
## HO069547.1 " 0.000000" "P86221" "TBB4B_MESAU" " 1.09e-96"
## JG340639.1 " 0.000000" "P27541" "HSP70_BRUMA" " 2.59e-56"
## JG337431.1 " 0.000000" "P62755" "RS6_RAT" " 1.79e-83"
## JG333944.1 " 0.000000" "Q90YS3" "RS2_ICTPU" "1.80e-114"
## JG332820.1 " 0.000000" "P18700" "TBB_STRPU" " 0.00e+00"
## JG330205.1 " 0.000000" "Q9U3U0" "RLA0_CERCA" " 6.91e-90"
## Protein.names
## HO069547.1 "Tubulin beta-4B chain (Tubulin beta-2C chain)"
## JG340639.1 "Heat shock 70 kDa protein"
## JG337431.1 "40S ribosomal protein S6"
## JG333944.1 "40S ribosomal protein S2"
## JG332820.1 "Tubulin beta chain (Beta-tubulin)"
## JG330205.1 "60S acidic ribosomal protein P0 (CcP0)"
## Organism
## HO069547.1 "Mesocricetus auratus (Golden hamster)"
## JG340639.1 "Brugia malayi (Filarial nematode worm)"
## JG337431.1 "Rattus norvegicus (Rat)"
## JG333944.1 "Ictalurus punctatus (Channel catfish) (Silurus punctatus)"
## JG332820.1 "Strongylocentrotus purpuratus (Purple sea urchin)"
## JG330205.1 "Ceratitis capitata (Mediterranean fruit fly) (Tephritis capitata)"
## Gene.Ontology..biological.process.
## HO069547.1 "microtubule-based process [GO:0007017]"
## JG340639.1 "response to heat [GO:0009408]"
## JG337431.1 "activation-induced cell death of T cells [GO:0006924]; cellular response to ethanol [GO:0071361]; cytoplasmic translation [GO:0002181]; erythrocyte development [GO:0048821]; G1/S transition of mitotic cell cycle [GO:0000082]; gastrulation [GO:0007369]; glucose homeostasis [GO:0042593]; mammalian oogenesis stage [GO:0022605]; mitotic cell cycle [GO:0000278]; negative regulation of apoptotic process [GO:0043066]; negative regulation of bicellular tight junction assembly [GO:1903347]; placenta development [GO:0001890]; positive regulation of apoptotic process [GO:0043065]; positive regulation of cell population proliferation [GO:0008284]; response to insulin [GO:0032868]; ribosomal small subunit assembly [GO:0000028]; ribosomal small subunit biogenesis [GO:0042274]; rRNA processing [GO:0006364]; T cell differentiation in thymus [GO:0033077]; T cell proliferation involved in immune response [GO:0002309]; TOR signaling [GO:0031929]"
## JG333944.1 "translation [GO:0006412]"
## JG332820.1 "microtubule cytoskeleton organization [GO:0000226]; mitotic cell cycle [GO:0000278]"
## JG330205.1 "ribosome biogenesis [GO:0042254]"
## Gene.Ontology.IDs
## HO069547.1 "GO:0003924; GO:0005200; GO:0005525; GO:0005737; GO:0005874; GO:0007017; GO:0046872"
## JG340639.1 "GO:0005524; GO:0009408; GO:0140662"
## JG337431.1 "GO:0000028; GO:0000082; GO:0000278; GO:0001890; GO:0002181; GO:0002309; GO:0003729; GO:0003735; GO:0005634; GO:0005730; GO:0005737; GO:0005840; GO:0005844; GO:0006364; GO:0006924; GO:0007369; GO:0008284; GO:0015935; GO:0019901; GO:0022605; GO:0022626; GO:0022627; GO:0030425; GO:0031929; GO:0032868; GO:0033077; GO:0036464; GO:0042274; GO:0042593; GO:0043065; GO:0043066; GO:0044297; GO:0048471; GO:0048821; GO:0071361; GO:0098793; GO:0098794; GO:0098982; GO:1903347; GO:1990904"
## JG333944.1 "GO:0003723; GO:0003735; GO:0006412; GO:0015935"
## JG332820.1 "GO:0000226; GO:0000278; GO:0003924; GO:0005200; GO:0005525; GO:0005737; GO:0005874"
## JG330205.1 "GO:0005840; GO:0042254; GO:1990904"
nrow(counts_top) #how many top DEGs are we going to graph?
## [1] 45
Renaming rows as protein names so that the axis has them.
<- counts_top[, c(2:5, 9)]
counts_top_pro rownames(counts_top_pro) <- counts_top_pro[,"Protein.names"]
<- counts_top_pro[ ,-5]
counts_top_pro <- as.matrix(counts_top_pro) counts_top_pro
Next we have to do a log transformation of the counts but this first means making a new matrix that takes the contents of the previous one (character values) and converts them into numeric values to apply the arithmetic function of log2().
#convert the counts into numbers from characters
#into new matrix and renaming the col and row names of that new matrix
<- matrix(as.numeric(counts_top_pro),
counts_top_pro2 ncol = ncol(counts_top_pro))
colnames(counts_top_pro2) <- colnames(counts_top_pro)
<- rownames(counts_top_pro)
protein_names
#scrub up protein names list so that each entry is not as long
<- protein_names %>%
protein_names2 sub(' \\([^)]+\\).*$', '', .) %>%
sub('\\[.*?\\]', ' ', .)
rownames(counts_top_pro2) <- protein_names2
# Perform log transformation
<- log2(counts_top_pro2 + 1) log_counts_top2
Here we see that the two blastogenic stages cluster into two distinct groups in the PCA plot.
<- vst(deseq2.dds, blind = FALSE)
vsd
plotPCA(vsd, intgroup = "condition")
Now that we have a log counts object we can use it to start generating better looking plots:
#define some color palettes to play with from RColorBrewer
<- colorRampPalette(brewer.pal(8, "PiYG"))(25)
pink <- colorRampPalette(brewer.pal(8, "Blues"))(25) blue
# Customize the heatmap using heatmap.2
heatmap.2(log_counts_top2,
scale="column",
cexRow=0.9,
margins =c(10,20),
col=pink,
Colv = FALSE, #Keeps x-axis from being reordered
ylab = "Protien Names", #y-axis label
xlab = "Sample Run ID", #x-axis label
key=TRUE, # Add a legend
keysize = 0.7, # Set the size of the legend
key.title = "Log Counts", # Legend title
key.xlab = "Expression", # Legend x-axis label
key.ylab = "Frequency", # Legend y-axis label
trace="none", # Remove the trace lines
cexCol = 1) # Adjust column label font size
Purine nucleoside phosphorylase: is an enzyme which catalyzes the first step in either purine base salvage or nucleoside catabolism.
Cytochrome c oxidase subunit 3: component of the respiratory chain that catalyzes the reduction of oxygen to water.
Pyridoxal 5’-phosphate synthase subunit SNZERR: coenzyme in all transamination reactions, and in certain decarboxylation, deamination, and racemization reactions of amino acids.
Several transcripts upregulated associated with catabolism. In line with the phenomenon of reabsorption of the parental zooid.
Several elongation factors: involved in protein synthesis.
Several actin related genes
The b5-b6 stage is a period of growth especially relative to the TO stage, it is thus expected to find an upregulation of protein coding genes related to protein synthesis in and cell structure maintenance.
In order to get gene names that are in the annote_DEGlist.tab printed onto the plot, we will have to do some more data wrangling again to modify the deseq2.res df to also include gene ID’s.
<- read.delim(file = "../output/blast_annot_go.tab")
blast_go colnames(blast_go)[1] <- 'access_num' #renaming the first column
# Merge together all results from DESeq analysis with blast_go data frame
<- as.data.frame(deseq2.res) %>%
res_df mutate(access_num = rownames(.)) %>%
distinct()
#clean up the blast annotated table to keep only unique rows
<- distinct(blast_go, access_num, .keep_all = TRUE)
blast_go_small
<- merge(blast_go_small, res_df, by = "access_num", all = TRUE) transcriptome_anno
# Create the volcano plot with labeled genes
EnhancedVolcano(transcriptome_anno,
lab = transcriptome_anno$V4,
x = "log2FoldChange",
y = "pvalue",
title = "TO vs. b5-b6 Blastogenic Stages")