Project Overview

Research Question

How do the transcriptomic profiles differ between two developmental stages in Botryllus schlosseri?

Background

I have conducted a small scale replication of the transcriptomic BioProject PRJNA579844 which has 90 RNAseq run files itself.

Briefly, colonial marine tunicates like Botryllus schlosseri are capable of both sexually reproducing through a single fertilized egg (embryogenesis) and through asexual reproduction where new generations are derived from somatic cells of the parental body (blastogenesis). In healthy Botryllus schlosseri, blastogenesis occurs as a weekly synchronized process. Transcriptomic data available from this project isolated RNA across several developmental stages of Botryllus schlosseri from the whole body.

Within this project I compared the whole transcriptome of early blastogenic stage b5-b6 in the secondary bud to the late takeover (TO) blastogenic stage in the adult zooid. Below is a schematic that helps visualize these two stages:

Asexual developmental cycle of Botryllus schlosseri

Bioinformatics Workflow

Metadata

Below is the metadata from SRA Run Selector.

Run	Bases	BioProject	Bytes	dev_stage	Organism	Platform	Tissue
SRR10352205	10576501436	PRJNA579844	6996363570	b5-b6	Botryllus schlosseri	ILLUMINA	whole system
SRR10352206	3609019200	PRJNA579844	1836994542	b5-b6	Botryllus schlosseri	ILLUMINA	whole system
SRR10352224	4847587800	PRJNA579844	2533047747	TO	Botryllus schlosseri	ILLUMINA	whole system
SRR10352254	5482368900	PRJNA579844	2984116333	TO	Botryllus schlosseri	ILLUMINA	whole system

Accessing Data

Go to the NCBI page for BioProject PRJNA579844 and select the numeric value next to “SRA Experiments” which indicates the number of runs in this project.
On the top right corner select “Send to” and from the drop down menu select “Run Selector.
Run selector will show you all the runs available for the BioProject of interest with all the metadata. You can download metadata or accession list (SRR numbers) to keep track of your runs of interest.

Identify Raw Data Storage Directory

# move to a large hard drive
cd /home/shared/8TB_HDD_01

# making a new directory 
mkdir cvaldi/tunicate-fish546/raw

Download RNAseq NCBI Files

Here we used NCBI’s SRA Toolkit to download raw fastq files using fasterq-dump.

/home/shared/sratoolkit.3.0.2-ubuntu64/bin/./fasterq-dump \
--outdir /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw \
--progress \
SRR10352205 \
SRR10352206 \
SRR10352224 \
SRR10352254

Gereate Checksums

Below we generated checksums for all the raw fastq files using the bash md5sum command.

cd /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw
md5sum *.fastq > md5.original.download.20230413

cat /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/md5.original.download.20230413

## 509022c6a21cdd5bf0f5066f9b9c346c  SRR10352205_1.fastq
## 473b2a1e27fbe5ffed8003a00e48a684  SRR10352205_2.fastq
## afd3278c40c120e00dc8b2d109afd6b6  SRR10352206_1.fastq
## d895814f5a2238e9bb02767354c59b11  SRR10352206_2.fastq
## c5a9b8ee11c148374eab253bb5a31d61  SRR10352224_1.fastq
## 0822524636d5771899451bdff467f11a  SRR10352224_2.fastq
## d021331125fe3f3b36ed00395b2521bb  SRR10352254_1.fastq
## a6f10754f23cb7d0e07e5371aacc0c16  SRR10352254_2.fastq

QAQC Raw Files

Below, we used the FastQC software to generate html reports regarding the quality of the raw RNAseq data.

cd ../raw
/home/shared/FastQC/fastqc *.fastq

Accessing Reference

Botryllus schlosseri currently have several bioinformatic data files available for download on Stanford University’s Botryllus schlosseri Genome Project Server. To do any differential gene expression analysis, you will need either a reference (annotated) genome or reference transcriptome to be able to align your RNAseq data to that reference. Although the Botryllus schlosseri genome server does have annotated genomes, they are in a non-standard format that need to be normalized in order for it to run through a standard genome alignment workflow using HISat2.

As such, here we conducted a psuedo-alignment of the RNAseq files against a reference transcriptome that was generated from all the mRNA files available for this species on NCBI.

Creating Reference Transcriptome

Navigate to the Botryllus schlosseri NCBI page.
Select “Nucleotides” of which there are currently about 100 K files currently available.
On the lefthand navigation menu, under “Molecule Type” select “mRNA” (there are about 99 K files). This will filter out the list we have.
Select “Send to” on the top part of the page and select “Complete Record”. This will download a file called “sequence.fasta” that will contain all ~99,000 mRNA files from that list. Download this file, because it is under 100 MB we can store it on GitHub.

Quantify Transcript Abundances

Indexing Reference Transcriptome

Below, we used the Kallisto index command to index the reference transcriptome and name the index “bsc_rna.index”.

/home/shared/kallisto/kallisto \
index -i \
../data/bsc_rna.index \
../data/sequence.fasta

Psuedo-alignment

Below we used the quant command in Kallisto to align all the raw fastq files ot the reference index. This quantified abundance of transcripts for each query.

#!/bin/bash

# Set input directory and index file
input_dir=/home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/
index_file=../data/psuedo-alignment/bsc_rna.index

# Set output directory
output_dir=../output/kallisto_01

# Set number of threads
threads=10

# Loop through all paired-end fastq files in input directory
for f in ${input_dir}/*_1.fastq; do
    # Get base filename
    base=$(basename ${f} _1.fastq)
    
    # Run kallisto to quantify transcripts
    /home/shared/kallisto/kallisto quant -i ${index_file} \
    -o ${output_dir}/${base} \
    -t ${threads} \
    ${input_dir}/${base}_1.fastq ${input_dir}/${base}_2.fastq
done

Creating Count Matrix

We used the abundance_estimates_to_matrix.pl script from the Trinity RNA-seq assembly software package to create a gene expression count matrix from the “abundance.tsv” Kallisto output files.

perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
    --gene_trans_map none \
    --out_prefix ../output/kallisto_01 \
    --name_sample_by_basedir \
    ../output/kallisto_01/SRR10352205/abundance.tsv \
    ../output/kallisto_01/SRR10352206/abundance.tsv \
    ../output/kallisto_01/SRR10352224/abundance.tsv \
    ../output/kallisto_01/SRR10352254/abundance.tsv

Differential Gene Expression Analysis

Below, we read in and modify the count matrix we generated from Kallisto.

countmatrix <- read.delim("../output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X #rename row names from 1-n to the value of the first column called x
countmatrix <- countmatrix[,-1] #remove first column
countmatrix <- round(countmatrix, 0) #rounds values in count matrix
head(countmatrix) #take a peak at the data frame

##            SRR10352205 SRR10352206 SRR10352224 SRR10352254
## JG353837.1           0           0           0           0
## JG390417.1         840          26           0           0
## JG368003.1           0          28           0           0
## JG365003.1           0           0           0           0
## JG314272.1           0           0          17           2
## JG329595.1           0           0           0           0

We then made a new data frame that contains information about the treatment groups and the type of RNAseq data produced. We then used the DESeqDataSetFromMatrix function from the DESeq2 package to prepare a data set for the DEG analysis.

deseq2.colData <- data.frame(condition=factor(c(rep("b5-b6", 2), rep("TO", 2))), 
                             type=factor(rep("paired-end", 4)))
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
                                     colData = deseq2.colData, 
                                     design = ~ condition)

Next we will conduct the differential gene expression analysis using the DESeq function:

deseq2.dds <- DESeq(deseq2.dds) 
deseq2.res <- results(deseq2.dds) #this places the results in an object that we can open
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ] #reorders the rows

Evaluate the DEseq results.

head(deseq2.res) #take a peak

## log2 fold change (MLE): condition TO vs b5.b6 
## Wald test p-value: condition TO vs b5.b6 
## DataFrame with 6 rows and 6 columns
##             baseMean log2FoldChange     lfcSE      stat    pvalue      padj
##            <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
## AF201094.1   0.00000             NA        NA        NA        NA        NA
## AF201095.1   0.00000             NA        NA        NA        NA        NA
## AF201096.1   0.00000             NA        NA        NA        NA        NA
## AF201097.1   3.10213       -3.00407   3.36211 -0.893508  0.371585        NA
## AF201098.1   0.00000             NA        NA        NA        NA        NA
## AF201099.1   0.00000             NA        NA        NA        NA        NA

Check the number of “hits” in the DEseq results using the nrow function.

# Count number of hits with adjusted p-value less then 0.05
nrow(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])

## [1] 1971

tmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ] #writes significant hits into an object

Finally, we write out the list of differentially expressed genes into a tab file that will later be annotated.

write.table(tmp.sig, "../output/DEGlist.tab", row.names = T)

Exploratory Visualization

Exploratory Heatmap

# Select top 50 differentially expressed genes
res <- results(deseq2.dds)
res_ordered <- res[order(res$padj), ]
top_genes <- row.names(res_ordered)[1:50]

# Extract counts and normalize
counts <- counts(deseq2.dds, normalized = TRUE)
counts_top <- counts[top_genes, ]

# Log-transform counts
log_counts_top <- log2(counts_top + 1)

# Generate heatmap
pheatmap(log_counts_top, scale = "row")

Exploratory Volcano Plot

# Prepare the data for plotting
res_df <- as.data.frame(deseq2.res)
res_df$gene <- row.names(res_df)

# Create volcano plot
volcano_plot <- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = padj < 0.05)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("grey", "red")) +
  labs(title = "Volcano Plot",
       x = "Log2 Fold Change",
       y = "-Log10 Adjusted P-value",
       color = "Significantly\nDifferentially Expressed") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "top")

print(volcano_plot)

Annotate Reference Transcriptome

Lets look at what’s in the reference transcriptome file before we BLAST it.

cd ../data/psuedo-alignment
head sequence.fasta

## >JG389647.1 CCTU20128.b1 CCTU Botryllus schlosseri total asexual and embronic development subtracted Botryllus schlosseri cDNA clone CCTU20128 5', mRNA sequence
## ATTCCTCCTCTATTTGCTTCGATGAACCTGTCATACCATACACCAGCTTGACTTGTCCTGGAGTGTCAGA
## CATTCCGTGTGATGTGTCATGTCCCACTGGACTAGTGTATGATGGATATAGATGCGTCAATATTGAAGAC
## TGTCCTTGCACCTACAAGGACGTTCGACATAAGTCCGGATCTGTGTGGAAGGAAGATGACGGTTGTGTCA
## ATTGCATCTGCATTGGCGGTCATTCTAATTGTGTCGCCCAAACCTGTGACGTAACAAGTTGTGAACCAGG
## ATACGCACTAATTGAAGTTGATGGTGAATGCTGTCCTCGTTGCTTGAAAACAAATGTGACCTGTAGCGAT
## GAAACTCATGAGATCGGAGAAGTCTGGCATGATGATTGTGACGTTTGCACATGCACTGAGTATGGAGTAG
## TCAAATGTGAACATGTTGACTGTCCAGCTACAACTCCGCCTGAATGCCCTGCGAGTCACGAATTAGTGCA
## GAAAGTTGGTGAATGCTGTCCGAAATATGAATGTGTCTGTAATACTGATCTTTGTGACACTTCTGAGCCG
## ACTTGCCCGGCGAACCACAAAGTTGTTGCATCCTTGACTGATGGATGTTGCCCAGAATATCAGTGCGTCT

cd ../data/psuedo-alignment
echo "How many sequences are there?"
grep -c ">" sequence.fasta

## How many sequences are there?
## 99708

Make a BLAST Database

Use the NCBI blast software command makeblastdb to create a blast database from the UniProt fasta file.

/home/shared/ncbi-blast-2.11.0+/bin/makeblastdb \
-in /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/data/uniprot_sprot_r2023_01.fasta \
-dbtype prot \
-out /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/blastdb/uniprot_sprot_r2023_01

BLAST Nucleotides to Protiens

Here we used the blastx command to BLAST the reference transcriptome and made a new table that includes the associated gene ID for each sequence in the original reference transcriptome.

/home/shared/ncbi-blast-2.11.0+/bin/blastx \
-query /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/data/psuedo-alignment/sequence.fasta \
-db /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/blastdb/uniprot_sprot_r2023_01 \
-out /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/Bsc-uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 20 \
-max_target_seqs 1 \
-outfmt 6

Below, we clean up the tab file we created to separate out the information regarding gene ID.

tr '|' '\t' < ../output/Bsc-uniprot_blastx.tab | head -2

# replace the "|" with "\t" in the file "Ab_4-uniprot_blastx.tab"
# and save the output to the file "Ab_4-uniprot_blastx_sep.tab"

tr '|' '\t' < ../output/Bsc-uniprot_blastx.tab \
> ../output/Bsc-uniprot_blastx_sep.tab

Peak at the BLASTed reference:

head -2 ../output/Bsc-uniprot_blastx_sep.tab

## JG389643.1   sp  Q8IUL8  CILP2_HUMAN 40.714  140 77  5   160 564 59  197 2.95e-25    107
## JG389632.1   sp  Q12959  DLG1_HUMAN  82.143  56  10  0   325 492 746 801 2.30e-22    99.4

Download UniProt Database

UniProt is a freely accessible database of protein sequence and functional information that is continuously updated every 8 weeks.

cd ../data
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz
gunzip -k uniprot_sprot_r2023_01.fasta.gz

Merge BLAST and UniProt Tables

First read in the BLASTed reference transcriptome table into the R global environment.

bltabl <- read.csv("../output/Bsc-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)

Also read in a UniProt table into the R global environment.

#reading in annotation table from Gannet and saving as object....
url <- "https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab"
response <- GET(url, config(ssl_verifypeer = FALSE))
spgo <- read.csv(text = rawToChar(response$content), sep = "\t", header = TRUE)

Joining the two tables.

left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V4, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% 
  mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>%
  write_delim("../output/blast_annot_go.tab", delim = '\t')

Evaluate annotated reference transcriptome. It now contains the original gene ID, gene names, Gene Ontology (GO) terms, and protein names.

cd ../output
head -n 2 blast_annot_go.tab

## V1   V3  V4  V13 Protein.names   Organism    Gene.Ontology..biological.process.  Gene.Ontology.IDs
## JG389643.1   Q8IUL8  CILP2_HUMAN 2.95e-25    Cartilage intermediate layer protein 2 (CILP-2) [Cleaved into: Cartilage intermediate layer protein 2 C1; Cartilage intermediate layer protein 2 C2]    Homo sapiens (Human)        GO:0005615; GO:0070062

Merge Annotated Reference with DEG List

Evaluate DEG list. Data needs to be cleaned before we can merge it with the annotated reference.

cd ../output
head -n 3 DEGlist.tab

## "baseMean" "log2FoldChange" "lfcSE" "stat" "pvalue" "padj"
## "AF201122.1" 34.0837884412689 8.76461374330488 2.30407565629183 3.80396091568043 0.000142400657707493 0.00543819960949714
## "AF201134.1" 48.4069541381205 -6.97289922608518 2.4215424716195 -2.8795279487383 0.00398270995509485 0.0345116898968149

Read in DEG list as object into R environment and clean up the data frame.

diff_genes <- read.delim(file = "../output/DEGlist.tab", sep =" ", header = TRUE)
diff_genes$access_num <- row.names(diff_genes) #new column
rownames(diff_genes) <- diff_genes$X #renames all the rows to 1-n 
diff_genes <- diff_genes[,c(7, 1:6)] #puts the gene column in front

Next, read in the blast annotated GO terms table into our R environment.

blast_go <- read.delim(file = "../output/blast_annot_go.tab")
colnames(blast_go)[1] <- 'access_num' #renaming the first column

Now that we have both the blast go and differential gene list in the right format, we can merge the two so that we have information regarding gene ID and function paired with the list of differentially expressed genes.

# Perform inner join between diff_genes and blast_go data frames based on shared "access_num" column
annote_DEGlist <- merge(diff_genes, blast_go, by = "access_num", all = FALSE)

# Print the merged table
head(annote_DEGlist, 2)

##   access_num  baseMean log2FoldChange    lfcSE      stat       pvalue
## 1 AF201134.1  48.40695      -6.972899 2.421542 -2.879528 0.0039827100
## 2 AF201259.1 189.65058      -8.943305 2.422191 -3.692237 0.0002222898
##          padj     V3         V4       V13
## 1 0.034511690 P27604 SAHH_CAEEL 4.60e-100
## 2 0.007245743 P62246  RS15A_RAT  2.22e-82
##                                                                                               Protein.names
## 1 Adenosylhomocysteinase (AdoHcyase) (EC 3.13.2.1) (Protein dumpy-14) (S-adenosyl-L-homocysteine hydrolase)
## 2                                                                                40S ribosomal protein S15a
##                  Organism
## 1  Caenorhabditis elegans
## 2 Rattus norvegicus (Rat)
##                                                                                                                                            Gene.Ontology..biological.process.
## 1                                   one-carbon metabolic process [GO:0006730]; S-adenosylhomocysteine catabolic process [GO:0019510]; S-adenosylmethionine cycle [GO:0033353]
## 2 positive regulation of cell cycle [GO:0045787]; positive regulation of cell population proliferation [GO:0008284]; response to virus [GO:0009615]; translation [GO:0006412]
##                                                                                            Gene.Ontology.IDs
## 1                                                 GO:0004013; GO:0005829; GO:0006730; GO:0019510; GO:0033353
## 2 GO:0003735; GO:0005737; GO:0006412; GO:0008284; GO:0009615; GO:0022626; GO:0022627; GO:0045202; GO:0045787

Finally, write out the annotated DEG list to save it to our output directory.

write.table(annote_DEGlist, "../output/annote_DEGlist.tab", row.names = T)

Wrangling Annotated DEG Data

Make a vector of the top 100 differentially expressed genes:

# Select top 100 differentially expressed genes
deseq2.res <- results(deseq2.dds)
res_ordered <- deseq2.res[order(deseq2.res$padj), ]
#create object that we will reference later
top_genes <- row.names(res_ordered)[1:100]

Next in order to merge the annotated blast table with the counts data we will need to transform the counts matrix into a data frame so that we can use the merge() function.

#extract counts and normalizes them 
counts <- counts(deseq2.dds, normalized = TRUE)
counts <- as.data.frame(counts)
counts$access_num <- row.names(counts) #new column
counts <- counts[ , c(5, 1:4)] #puts accession col in front

Read in blast annotated table:

blast_go <- read.delim(file = "../output/blast_annot_go.tab")
#renaming the first column so it has the same name for merging
colnames(blast_go)[1] <- 'access_num'

Merge blast data frame and count data frame by an inner join. Note, there will be duplicate protein entries in this new merged data frame. This is normal and so we are just going to get rid duplicates.

#merge the blast table and counts table together
#both must be data frames
counts_pro <- merge(counts, blast_go, by = "access_num", all = FALSE) 

#checking if there any duplicates
head(which(duplicated(counts_pro)))

## [1]  2  5 20 23 25 27

#eliminate duplicate accession numbers
counts_pro_unico <- distinct(counts_pro)

#checking if there are any duplicates still, should be 0
head(which(duplicated(counts_pro_unico)))

## integer(0)

counts_matrix_unico <- as.matrix(counts_pro_unico) #make this a matrix again so it can take the rownames()

access_num_unico <- counts_matrix_unico[,1] #making a vector of names

rownames(counts_matrix_unico) <- access_num_unico #rename all rows the access_num

We needed to name the rows the accession number since we will be using this to pull out the top differentially expressed genes that we named in the beginning of this section.

# Identify top genes that are present in the counts_matrix_unico row names
present_genes <- intersect(top_genes, row.names(counts_matrix_unico))

# Extract the rows corresponding to the present genes
counts_top <- counts_matrix_unico[present_genes, ]

head(counts_top) #lets take a peak!

##            access_num   SRR10352205    SRR10352206    SRR10352224   
## HO069547.1 "HO069547.1" "1.142747e+03" "0.000000e+00" "    0.000000"
## JG340639.1 "JG340639.1" "9.921528e+02" "0.000000e+00" "    0.000000"
## JG337431.1 "JG337431.1" "1.419221e+03" "0.000000e+00" "    0.000000"
## JG333944.1 "JG333944.1" "8.909001e+02" "0.000000e+00" "    0.000000"
## JG332820.1 "JG332820.1" "6.986705e+02" "0.000000e+00" "    0.000000"
## JG330205.1 "JG330205.1" "5.686276e+02" "0.000000e+00" "    0.000000"
##            SRR10352254    V3       V4            V13        
## HO069547.1 "    0.000000" "P86221" "TBB4B_MESAU" " 1.09e-96"
## JG340639.1 "    0.000000" "P27541" "HSP70_BRUMA" " 2.59e-56"
## JG337431.1 "    0.000000" "P62755" "RS6_RAT"     " 1.79e-83"
## JG333944.1 "    0.000000" "Q90YS3" "RS2_ICTPU"   "1.80e-114"
## JG332820.1 "    0.000000" "P18700" "TBB_STRPU"   " 0.00e+00"
## JG330205.1 "    0.000000" "Q9U3U0" "RLA0_CERCA"  " 6.91e-90"
##            Protein.names                                  
## HO069547.1 "Tubulin beta-4B chain (Tubulin beta-2C chain)"
## JG340639.1 "Heat shock 70 kDa protein"                    
## JG337431.1 "40S ribosomal protein S6"                     
## JG333944.1 "40S ribosomal protein S2"                     
## JG332820.1 "Tubulin beta chain (Beta-tubulin)"            
## JG330205.1 "60S acidic ribosomal protein P0 (CcP0)"       
##            Organism                                                           
## HO069547.1 "Mesocricetus auratus (Golden hamster)"                            
## JG340639.1 "Brugia malayi (Filarial nematode worm)"                           
## JG337431.1 "Rattus norvegicus (Rat)"                                          
## JG333944.1 "Ictalurus punctatus (Channel catfish) (Silurus punctatus)"        
## JG332820.1 "Strongylocentrotus purpuratus (Purple sea urchin)"                
## JG330205.1 "Ceratitis capitata (Mediterranean fruit fly) (Tephritis capitata)"
##            Gene.Ontology..biological.process.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## HO069547.1 "microtubule-based process [GO:0007017]"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
## JG340639.1 "response to heat [GO:0009408]"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## JG337431.1 "activation-induced cell death of T cells [GO:0006924]; cellular response to ethanol [GO:0071361]; cytoplasmic translation [GO:0002181]; erythrocyte development [GO:0048821]; G1/S transition of mitotic cell cycle [GO:0000082]; gastrulation [GO:0007369]; glucose homeostasis [GO:0042593]; mammalian oogenesis stage [GO:0022605]; mitotic cell cycle [GO:0000278]; negative regulation of apoptotic process [GO:0043066]; negative regulation of bicellular tight junction assembly [GO:1903347]; placenta development [GO:0001890]; positive regulation of apoptotic process [GO:0043065]; positive regulation of cell population proliferation [GO:0008284]; response to insulin [GO:0032868]; ribosomal small subunit assembly [GO:0000028]; ribosomal small subunit biogenesis [GO:0042274]; rRNA processing [GO:0006364]; T cell differentiation in thymus [GO:0033077]; T cell proliferation involved in immune response [GO:0002309]; TOR signaling [GO:0031929]"
## JG333944.1 "translation [GO:0006412]"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
## JG332820.1 "microtubule cytoskeleton organization [GO:0000226]; mitotic cell cycle [GO:0000278]"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
## JG330205.1 "ribosome biogenesis [GO:0042254]"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
##            Gene.Ontology.IDs                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## HO069547.1 "GO:0003924; GO:0005200; GO:0005525; GO:0005737; GO:0005874; GO:0007017; GO:0046872"                                                                                                                                                                                                                                                                                                                                                                                                            
## JG340639.1 "GO:0005524; GO:0009408; GO:0140662"                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## JG337431.1 "GO:0000028; GO:0000082; GO:0000278; GO:0001890; GO:0002181; GO:0002309; GO:0003729; GO:0003735; GO:0005634; GO:0005730; GO:0005737; GO:0005840; GO:0005844; GO:0006364; GO:0006924; GO:0007369; GO:0008284; GO:0015935; GO:0019901; GO:0022605; GO:0022626; GO:0022627; GO:0030425; GO:0031929; GO:0032868; GO:0033077; GO:0036464; GO:0042274; GO:0042593; GO:0043065; GO:0043066; GO:0044297; GO:0048471; GO:0048821; GO:0071361; GO:0098793; GO:0098794; GO:0098982; GO:1903347; GO:1990904"
## JG333944.1 "GO:0003723; GO:0003735; GO:0006412; GO:0015935"                                                                                                                                                                                                                                                                                                                                                                                                                                                
## JG332820.1 "GO:0000226; GO:0000278; GO:0003924; GO:0005200; GO:0005525; GO:0005737; GO:0005874"                                                                                                                                                                                                                                                                                                                                                                                                            
## JG330205.1 "GO:0005840; GO:0042254; GO:1990904"

nrow(counts_top) #how many top DEGs are we going to graph?

## [1] 45

Renaming rows as protein names so that the axis has them.

counts_top_pro <- counts_top[, c(2:5, 9)]
rownames(counts_top_pro) <- counts_top_pro[,"Protein.names"]
counts_top_pro <- counts_top_pro[ ,-5]
counts_top_pro <- as.matrix(counts_top_pro)

Next we have to do a log transformation of the counts but this first means making a new matrix that takes the contents of the previous one (character values) and converts them into numeric values to apply the arithmetic function of log2().

#convert the counts into numbers from characters 
#into new matrix and renaming the col and row names of that new matrix
counts_top_pro2 <- matrix(as.numeric(counts_top_pro),
                          ncol = ncol(counts_top_pro))
colnames(counts_top_pro2) <- colnames(counts_top_pro)
protein_names <- rownames(counts_top_pro)

#scrub up protein names list so that each entry is not as long 
protein_names2 <- protein_names %>%
  sub(' \\([^)]+\\).*$', '', .) %>%
  sub('\\[.*?\\]', ' ', .)

rownames(counts_top_pro2) <- protein_names2

# Perform log transformation
log_counts_top2 <- log2(counts_top_pro2 + 1)

Explanatory Visualization

Principle Components Analysis

Here we see that the two blastogenic stages cluster into two distinct groups in the PCA plot.

vsd <- vst(deseq2.dds, blind = FALSE)

plotPCA(vsd, intgroup = "condition")

Heatmaps

Now that we have a log counts object we can use it to start generating better looking plots:

#define some color palettes to play with from RColorBrewer
pink <- colorRampPalette(brewer.pal(8, "PiYG"))(25)
blue <- colorRampPalette(brewer.pal(8, "Blues"))(25)

# Customize the heatmap using heatmap.2
heatmap.2(log_counts_top2,
          scale="column",
          cexRow=0.9,
          margins =c(10,20),
          col=pink,
          Colv = FALSE,              #Keeps x-axis from being reordered
          ylab = "Protien Names",    #y-axis label
          xlab = "Sample Run ID",    #x-axis label
          key=TRUE,                  # Add a legend
          keysize = 0.7,             # Set the size of the legend
          key.title = "Log Counts",  # Legend title
          key.xlab = "Expression",   # Legend x-axis label
          key.ylab = "Frequency",    # Legend y-axis label
          trace="none",              # Remove the trace lines
          cexCol = 1)                # Adjust column label font size

Protein coding transcripts upregulated in takeover relative to b5-b6:

Purine nucleoside phosphorylase: is an enzyme which catalyzes the first step in either purine base salvage or nucleoside catabolism.
Cytochrome c oxidase subunit 3: component of the respiratory chain that catalyzes the reduction of oxygen to water.
Pyridoxal 5’-phosphate synthase subunit SNZERR: coenzyme in all transamination reactions, and in certain decarboxylation, deamination, and racemization reactions of amino acids.

Several transcripts upregulated associated with catabolism. In line with the phenomenon of reabsorption of the parental zooid.

Protein coding transcripts upregulated in b5-b6 relative to TO:

Several elongation factors: involved in protein synthesis.
Several actin related genes

The b5-b6 stage is a period of growth especially relative to the TO stage, it is thus expected to find an upregulation of protein coding genes related to protein synthesis in and cell structure maintenance.

Volcano Plot

In order to get gene names that are in the annote_DEGlist.tab printed onto the plot, we will have to do some more data wrangling again to modify the deseq2.res df to also include gene ID’s.

blast_go <- read.delim(file = "../output/blast_annot_go.tab")
colnames(blast_go)[1] <- 'access_num' #renaming the first column 

# Merge together all results from DESeq analysis with blast_go data frame
res_df <- as.data.frame(deseq2.res) %>%
  mutate(access_num = rownames(.)) %>%
  distinct()

#clean up the blast annotated table to keep only unique rows
blast_go_small <- distinct(blast_go, access_num, .keep_all = TRUE)

transcriptome_anno <- merge(blast_go_small, res_df, by = "access_num", all = TRUE)

# Create the volcano plot with labeled genes
EnhancedVolcano(transcriptome_anno,
                lab = transcriptome_anno$V4, 
                x = "log2FoldChange",
                y = "pvalue",
                title = "TO vs. b5-b6 Blastogenic Stages")

Upregulated

COX3: mitochondrial activity
PNPH: enzyme which reversibly catalyzes the phosphorolysis of purine nucleosides
ANX11: cell cycle regulation

Downregulated

TUBB4B: involved in microtubule cytoskeleton organization and mitotic cell cycle

Next Steps

Further work out the biological meaning of my findings.
Complete functional enrichment analysis.
Apply this workflow to transcriptomic data that I plan to generate for preliminary study evaluating changes in gene expression over time in primary cultures of Botryllus schlosseri.

Differential Gene Expression Analysis of two Distinct Asexual Stages in a Marine Tunicate

Celeste Valdivia

October 05, 2023