--- title: "**Differential Gene Expression Analysis of two Distinct Asexual Stages in a Marine Tunicate**" author: "Celeste Valdivia" date: "`r format(Sys.time(), '%B %d, %Y')`" output: html_document: theme: cosmo highlight: pygments toc: true toc_float: true toc_depth: 3 --- ```{r setup, include=FALSE} library(knitr) library(tidyverse) library(httr) library(DESeq2) library(pheatmap) library(RColorBrewer) library(data.table) library(kableExtra) library(gplots) library(EnhancedVolcano) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages fig.width = 10, # Set plot width in inches fig.height = 10, # Set plot height in inches fig.align = "center" # Align plots to the center ) ``` # **Project Overview** ## Research Question #### How do the transcriptomic profiles differ between two developmental stages in *Botryllus schlosseri*? ## Background I have conducted a small scale replication of the transcriptomic BioProject [PRJNA579844](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA579844) which has 90 RNAseq run files itself. Briefly, colonial marine tunicates like *Botryllus schlosseri* are capable of both sexually reproducing through a single fertilized egg (embryogenesis) and through asexual reproduction where new generations are derived from somatic cells of the parental body (blastogenesis). In healthy *Botryllus schlosseri*, blastogenesis occurs as a weekly synchronized process. Transcriptomic data available from this project isolated RNA across several developmental stages of *Botryllus schlosseri* from the whole body. Within this project I compared the whole transcriptome of early blastogenic stage b5-b6 in the secondary bud to the late takeover (TO) blastogenic stage in the adult zooid. Below is a schematic that helps visualize these two stages: ![Asexual developmental cycle of *Botryllus schlosseri*](https://github.com/valeste/valeste.github.io/blob/master/assets/img/sex_asex_devo_TO_b5.jpeg?raw=true) ## Bioinformatics Workflow ![](https://github.com/course-fish546-2023/celeste-tunicate-devo/blob/main/notes/workflow1.jpeg?raw=true) # **Metadata** Below is the metadata from SRA Run Selector. | Run | Bases | BioProject | Bytes | dev_stage | Organism | Platform | Tissue | |-------------|-------------|-------------|------------|------------|-----------------------|-----------|--------------| | SRR10352205 | 10576501436 | PRJNA579844 | 6996363570 | b5-b6 | Botryllus schlosseri | ILLUMINA | whole system | | SRR10352206 | 3609019200 | PRJNA579844 | 1836994542 | b5-b6 | Botryllus schlosseri | ILLUMINA | whole system | | SRR10352224 | 4847587800 | PRJNA579844 | 2533047747 | TO | Botryllus schlosseri | ILLUMINA | whole system | | SRR10352254 | 5482368900 | PRJNA579844 | 2984116333 | TO | Botryllus schlosseri | ILLUMINA | whole system | # **Accessing Data** 1. Go to the NCBI page for BioProject [PRJNA579844](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA579844) and select the numeric value next to "SRA Experiments" which indicates the number of runs in this project. 2. On the top right corner select "Send to" and from the drop down menu select "Run Selector. 3. Run selector will show you all the runs available for the BioProject of interest with all the metadata. You can download metadata or accession list (SRR numbers) to keep track of your runs of interest. ## Identify Raw Data Storage Directory ```{r make raw data directory, engine='bash', eval = FALSE} # move to a large hard drive cd /home/shared/8TB_HDD_01 # making a new directory mkdir cvaldi/tunicate-fish546/raw ``` ## Download RNAseq NCBI Files Here we used NCBI's SRA Toolkit to download raw fastq files using `fasterq-dump`. ```{r read in raw data, engine = 'bash'} /home/shared/sratoolkit.3.0.2-ubuntu64/bin/./fasterq-dump \ --outdir /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw \ --progress \ SRR10352205 \ SRR10352206 \ SRR10352224 \ SRR10352254 ``` ## Gereate Checksums Below we generated checksums for all the raw fastq files using the bash `md5sum` command. ```{r generate checksums, engine = 'bash'} cd /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw md5sum *.fastq > md5.original.download.20230413 ``` ```{r evaluate checksums, engine ='bash', eval = TRUE} cat /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/md5.original.download.20230413 ``` # **QAQC Raw Files** Below, we used the FastQC software to generate html reports regarding the quality of the raw RNAseq data. ```{r QAQC, engine='bash'} cd ../raw /home/shared/FastQC/fastqc *.fastq ``` # **Accessing Reference** *Botryllus schlosseri* currently have several bioinformatic data files available for download on Stanford University's [*Botryllus schlosseri* Genome Project Server](http://botryllus.stanford.edu/botryllusgenome/download/). To do any differential gene expression analysis, you will need either a reference (annotated) genome or reference transcriptome to be able to align your RNAseq data to that reference. Although the *Botryllus schlosseri* genome server does have annotated genomes, they are in a non-standard format that need to be normalized in order for it to run through a standard genome alignment workflow using HISat2. As such, here we conducted a **psuedo-alignment** of the RNAseq files against a reference transcriptome that was generated from all the mRNA files available for this species on NCBI. ## Creating Reference Transcriptome 1. Navigate to the [*Botryllus schlosseri* NCBI](https://www.ncbi.nlm.nih.gov/search/all/?term=boryllus%20schlosseri) page. 2. Select "Nucleotides" of which there are currently about 100 K files currently available. 3. On the lefthand navigation menu, under "Molecule Type" select "mRNA" (there are about 99 K files). This will filter out the list we have. 4. Select "Send to" on the top part of the page and select "Complete Record". This will download a file called "sequence.fasta" that will contain all ~99,000 mRNA files from that list. Download this file, because it is under 100 MB we can store it on GitHub. # **Quantify Transcript Abundances** ## Indexing Reference Transcriptome Below, we used the *Kallisto* `index` command to index the reference transcriptome and name the index "bsc_rna.index". ```{r index reference, engine = 'bash'} /home/shared/kallisto/kallisto \ index -i \ ../data/bsc_rna.index \ ../data/sequence.fasta ``` ## Psuedo-alignment Below we used the `quant` command in *Kallisto* to align all the raw fastq files ot the reference index. This quantified abundance of transcripts for each query. ```{r pseudo-align, engine='bash'} #!/bin/bash # Set input directory and index file input_dir=/home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/ index_file=../data/psuedo-alignment/bsc_rna.index # Set output directory output_dir=../output/kallisto_01 # Set number of threads threads=10 # Loop through all paired-end fastq files in input directory for f in ${input_dir}/*_1.fastq; do # Get base filename base=$(basename ${f} _1.fastq) # Run kallisto to quantify transcripts /home/shared/kallisto/kallisto quant -i ${index_file} \ -o ${output_dir}/${base} \ -t ${threads} \ ${input_dir}/${base}_1.fastq ${input_dir}/${base}_2.fastq done ``` ## Creating Count Matrix We used the abundance_estimates_to_matrix.pl script from the *Trinity* RNA-seq assembly software package to create a gene expression count matrix from the "abundance.tsv" *Kallisto* output files. ```{r generate count matrix, engine = 'bash'} perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \ --est_method kallisto \ --gene_trans_map none \ --out_prefix ../output/kallisto_01 \ --name_sample_by_basedir \ ../output/kallisto_01/SRR10352205/abundance.tsv \ ../output/kallisto_01/SRR10352206/abundance.tsv \ ../output/kallisto_01/SRR10352224/abundance.tsv \ ../output/kallisto_01/SRR10352254/abundance.tsv ``` # **Differential Gene Expression Analysis** Below, we read in and modify the count matrix we generated from *Kallisto*. ```{r read in and wrangle count matrix, eval = TRUE} countmatrix <- read.delim("../output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t') rownames(countmatrix) <- countmatrix$X #rename row names from 1-n to the value of the first column called x countmatrix <- countmatrix[,-1] #remove first column countmatrix <- round(countmatrix, 0) #rounds values in count matrix head(countmatrix) #take a peak at the data frame ``` We then made a new data frame that contains information about the treatment groups and the type of RNAseq data produced. We then used the `DESeqDataSetFromMatrix` function from the *DESeq2* package to prepare a data set for the DEG analysis. ```{r deseq2 conditions, eval=TRUE} deseq2.colData <- data.frame(condition=factor(c(rep("b5-b6", 2), rep("TO", 2))), type=factor(rep("paired-end", 4))) deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix, colData = deseq2.colData, design = ~ condition) ``` Next we will conduct the differential gene expression analysis using the `DESeq` function: ```{r deseq analysis, eval=TRUE} deseq2.dds <- DESeq(deseq2.dds) deseq2.res <- results(deseq2.dds) #this places the results in an object that we can open deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ] #reorders the rows ``` Evaluate the DEseq results. ```{r peak at deseq results, eval=TRUE} head(deseq2.res) #take a peak ``` Check the number of "hits" in the DEseq results using the `nrow` function. ```{r p-value hits, eval=TRUE, cache=TRUE} # Count number of hits with adjusted p-value less then 0.05 nrow(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]) tmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ] #writes significant hits into an object ``` Finally, we write out the list of differentially expressed genes into a tab file that will later be annotated. ```{r generate degList} write.table(tmp.sig, "../output/DEGlist.tab", row.names = T) ``` # **Exploratory Visualization** ## Exploratory Heatmap ```{r heatmap, eval=TRUE} # Select top 50 differentially expressed genes res <- results(deseq2.dds) res_ordered <- res[order(res$padj), ] top_genes <- row.names(res_ordered)[1:50] # Extract counts and normalize counts <- counts(deseq2.dds, normalized = TRUE) counts_top <- counts[top_genes, ] # Log-transform counts log_counts_top <- log2(counts_top + 1) # Generate heatmap pheatmap(log_counts_top, scale = "row") ``` ## Exploratory Volcano Plot ```{r volcano, eval=TRUE} # Prepare the data for plotting res_df <- as.data.frame(deseq2.res) res_df$gene <- row.names(res_df) # Create volcano plot volcano_plot <- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = padj < 0.05)) + geom_point(alpha = 0.6, size = 1.5) + scale_color_manual(values = c("grey", "red")) + labs(title = "Volcano Plot", x = "Log2 Fold Change", y = "-Log10 Adjusted P-value", color = "Significantly\nDifferentially Expressed") + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = "top") print(volcano_plot) ``` # **Annotate Reference Transcriptome** Lets look at what's in the reference transcriptome file before we BLAST it. ```{r evaluate reference for content, engine='bash', eval = TRUE} cd ../data/psuedo-alignment head sequence.fasta ``` ```{r, engine='bash', eval=TRUE} cd ../data/psuedo-alignment echo "How many sequences are there?" grep -c ">" sequence.fasta ``` ## Make a BLAST Database Use the NCBI blast software command `makeblastdb` to create a blast database from the UniProt fasta file. ```{r generate blast database, engine='bash'} /home/shared/ncbi-blast-2.11.0+/bin/makeblastdb \ -in /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/data/uniprot_sprot_r2023_01.fasta \ -dbtype prot \ -out /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/blastdb/uniprot_sprot_r2023_01 ``` ## BLAST Nucleotides to Protiens Here we used the `blastx` command to BLAST the reference transcriptome and made a new table that includes the associated gene ID for each sequence in the original reference transcriptome. ```{r blast reference, engine='bash'} /home/shared/ncbi-blast-2.11.0+/bin/blastx \ -query /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/data/psuedo-alignment/sequence.fasta \ -db /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/blastdb/uniprot_sprot_r2023_01 \ -out /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/Bsc-uniprot_blastx.tab \ -evalue 1E-20 \ -num_threads 20 \ -max_target_seqs 1 \ -outfmt 6 ``` Below, we clean up the tab file we created to separate out the information regarding gene ID. ```{r cleanup blast output, engine = 'bash'} tr '|' '\t' < ../output/Bsc-uniprot_blastx.tab | head -2 # replace the "|" with "\t" in the file "Ab_4-uniprot_blastx.tab" # and save the output to the file "Ab_4-uniprot_blastx_sep.tab" tr '|' '\t' < ../output/Bsc-uniprot_blastx.tab \ > ../output/Bsc-uniprot_blastx_sep.tab ``` Peak at the BLASTed reference: ```{r, engine='bash', eval=TRUE, cache=TRUE} head -2 ../output/Bsc-uniprot_blastx_sep.tab ``` ## Download UniProt Database UniProt is a freely accessible database of protein sequence and functional information that is continuously updated every 8 weeks. ```{r uniprot data download, engine = 'bash'} cd ../data curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz gunzip -k uniprot_sprot_r2023_01.fasta.gz ``` ## Merge BLAST and UniProt Tables First read in the BLASTed reference transcriptome table into the R global environment. ```{r read in blast output} bltabl <- read.csv("../output/Bsc-uniprot_blastx_sep.tab", sep = '\t', header = FALSE) ``` Also read in a UniProt table into the R global environment. ```{r read in uniprot table} #reading in annotation table from Gannet and saving as object.... url <- "https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab" response <- GET(url, config(ssl_verifypeer = FALSE)) spgo <- read.csv(text = rawToChar(response$content), sep = "\t", header = TRUE) ``` Joining the two tables. ```{r join uniprot table and blast table} left_join(bltabl, spgo, by = c("V3" = "Entry")) %>% select(V1, V3, V4, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>% write_delim("../output/blast_annot_go.tab", delim = '\t') ``` Evaluate annotated reference transcriptome. It now contains the original gene ID, gene names, Gene Ontology (GO) terms, and protein names. ```{r view blast annotated go table, engine ='bash', eval = TRUE, cache = TRUE} cd ../output head -n 2 blast_annot_go.tab ``` # **Merge Annotated Reference with DEG List** Evaluate DEG list. Data needs to be cleaned before we can merge it with the annotated reference. ```{r evaluate DEGlist content, engine='bash', eval= TRUE, cache = TRUE} cd ../output head -n 3 DEGlist.tab ``` Read in DEG list as object into R environment and clean up the data frame. ```{r clean up DEGlist, eval=TRUE, cache=TRUE} diff_genes <- read.delim(file = "../output/DEGlist.tab", sep =" ", header = TRUE) diff_genes$access_num <- row.names(diff_genes) #new column rownames(diff_genes) <- diff_genes$X #renames all the rows to 1-n diff_genes <- diff_genes[,c(7, 1:6)] #puts the gene column in front ``` Next, read in the blast annotated GO terms table into our R environment. ```{r read in blast annotated go table, eval=TRUE, cache=TRUE} blast_go <- read.delim(file = "../output/blast_annot_go.tab") colnames(blast_go)[1] <- 'access_num' #renaming the first column ``` Now that we have both the blast go and differential gene list in the right format, we can merge the two so that we have information regarding gene ID and function paired with the list of differentially expressed genes. ```{r, eval=TRUE, cache=TRUE} # Perform inner join between diff_genes and blast_go data frames based on shared "access_num" column annote_DEGlist <- merge(diff_genes, blast_go, by = "access_num", all = FALSE) ``` ```{r, eval=TRUE, cache=TRUE} # Print the merged table head(annote_DEGlist, 2) ``` Finally, write out the annotated DEG list to save it to our output directory. ```{r} write.table(annote_DEGlist, "../output/annote_DEGlist.tab", row.names = T) ``` # **Wrangling Annotated DEG Data** Make a vector of the top 100 differentially expressed genes: ```{r top 100 DEGs, eval=TRUE, cache=TRUE} # Select top 100 differentially expressed genes deseq2.res <- results(deseq2.dds) res_ordered <- deseq2.res[order(deseq2.res$padj), ] #create object that we will reference later top_genes <- row.names(res_ordered)[1:100] ``` Next in order to merge the annotated blast table with the counts data we will need to transform the counts matrix into a data frame so that we can use the merge() function. ```{r counts matrix to df, eval=TRUE, cache=TRUE} #extract counts and normalizes them counts <- counts(deseq2.dds, normalized = TRUE) counts <- as.data.frame(counts) counts$access_num <- row.names(counts) #new column counts <- counts[ , c(5, 1:4)] #puts accession col in front ``` Read in blast annotated table: ```{r blast annotated table, eval=TRUE, cache=TRUE} blast_go <- read.delim(file = "../output/blast_annot_go.tab") #renaming the first column so it has the same name for merging colnames(blast_go)[1] <- 'access_num' ``` Merge blast data frame and count data frame by an inner join. Note, there will be duplicate protein entries in this new merged data frame. This is normal and so we are just going to get rid duplicates. ```{r merge annotated blast table and counts df, eval=TRUE, cache=TRUE} #merge the blast table and counts table together #both must be data frames counts_pro <- merge(counts, blast_go, by = "access_num", all = FALSE) #checking if there any duplicates head(which(duplicated(counts_pro))) #eliminate duplicate accession numbers counts_pro_unico <- distinct(counts_pro) #checking if there are any duplicates still, should be 0 head(which(duplicated(counts_pro_unico))) counts_matrix_unico <- as.matrix(counts_pro_unico) #make this a matrix again so it can take the rownames() access_num_unico <- counts_matrix_unico[,1] #making a vector of names rownames(counts_matrix_unico) <- access_num_unico #rename all rows the access_num ``` We needed to name the rows the accession number since we will be using this to pull out the top differentially expressed genes that we named in the beginning of this section. ```{r, eval=TRUE, cache=TRUE} # Identify top genes that are present in the counts_matrix_unico row names present_genes <- intersect(top_genes, row.names(counts_matrix_unico)) # Extract the rows corresponding to the present genes counts_top <- counts_matrix_unico[present_genes, ] head(counts_top) #lets take a peak! nrow(counts_top) #how many top DEGs are we going to graph? ``` Renaming rows as protein names so that the axis has them. ```{r, eval=TRUE, cache=TRUE} counts_top_pro <- counts_top[, c(2:5, 9)] rownames(counts_top_pro) <- counts_top_pro[,"Protein.names"] counts_top_pro <- counts_top_pro[ ,-5] counts_top_pro <- as.matrix(counts_top_pro) ``` Next we have to do a log transformation of the counts but this first means making a new matrix that takes the contents of the previous one (character values) and converts them into numeric values to apply the arithmetic function of log2(). ```{r getting log counts, eval=TRUE, cache=TRUE} #convert the counts into numbers from characters #into new matrix and renaming the col and row names of that new matrix counts_top_pro2 <- matrix(as.numeric(counts_top_pro), ncol = ncol(counts_top_pro)) colnames(counts_top_pro2) <- colnames(counts_top_pro) protein_names <- rownames(counts_top_pro) #scrub up protein names list so that each entry is not as long protein_names2 <- protein_names %>% sub(' \\([^)]+\\).*$', '', .) %>% sub('\\[.*?\\]', ' ', .) rownames(counts_top_pro2) <- protein_names2 ``` ```{r, eval=TRUE, cache=TRUE} # Perform log transformation log_counts_top2 <- log2(counts_top_pro2 + 1) ``` # **Explanatory Visualization** ## Principle Components Analysis Here we see that the two blastogenic stages cluster into two distinct groups in the PCA plot. ```{r PCA, eval = TRUE, cache=TRUE} vsd <- vst(deseq2.dds, blind = FALSE) plotPCA(vsd, intgroup = "condition") ``` ## Heatmaps Now that we have a log counts object we can use it to start generating better looking plots: ```{r base Rcolorbrewer, eval=TRUE, cache=TRUE} #define some color palettes to play with from RColorBrewer pink <- colorRampPalette(brewer.pal(8, "PiYG"))(25) blue <- colorRampPalette(brewer.pal(8, "Blues"))(25) ``` ```{r playing with other heatmap.2 function, eval=TRUE, cache=TRUE} # Customize the heatmap using heatmap.2 heatmap.2(log_counts_top2, scale="column", cexRow=0.9, margins =c(10,20), col=pink, Colv = FALSE, #Keeps x-axis from being reordered ylab = "Protien Names", #y-axis label xlab = "Sample Run ID", #x-axis label key=TRUE, # Add a legend keysize = 0.7, # Set the size of the legend key.title = "Log Counts", # Legend title key.xlab = "Expression", # Legend x-axis label key.ylab = "Frequency", # Legend y-axis label trace="none", # Remove the trace lines cexCol = 1) # Adjust column label font size ``` ### Protein coding transcripts upregulated in takeover relative to b5-b6: - Purine nucleoside phosphorylase: is an enzyme which catalyzes the first step in either purine base salvage or nucleoside catabolism. - Cytochrome c oxidase subunit 3: component of the respiratory chain that catalyzes the reduction of oxygen to water. - Pyridoxal 5'-phosphate synthase subunit SNZERR: coenzyme in all transamination reactions, and in certain decarboxylation, deamination, and racemization reactions of amino acids. Several transcripts upregulated associated with **catabolism**. In line with the phenomenon of reabsorption of the parental zooid. ### Protein coding transcripts upregulated in b5-b6 relative to TO: - Several elongation factors: involved in protein synthesis. - Several actin related genes The b5-b6 stage is a period of growth especially relative to the TO stage, it is thus expected to find an upregulation of protein coding genes related to protein synthesis in and cell structure maintenance. ## Volcano Plot In order to get gene names that are in the annote_DEGlist.tab printed onto the plot, we will have to do some more data wrangling again to modify the deseq2.res df to also include gene ID's. ```{r, eval=TRUE, cache=TRUE} blast_go <- read.delim(file = "../output/blast_annot_go.tab") colnames(blast_go)[1] <- 'access_num' #renaming the first column # Merge together all results from DESeq analysis with blast_go data frame res_df <- as.data.frame(deseq2.res) %>% mutate(access_num = rownames(.)) %>% distinct() #clean up the blast annotated table to keep only unique rows blast_go_small <- distinct(blast_go, access_num, .keep_all = TRUE) transcriptome_anno <- merge(blast_go_small, res_df, by = "access_num", all = TRUE) ``` ```{r, eval=TRUE, cache=TRUE} # Create the volcano plot with labeled genes EnhancedVolcano(transcriptome_anno, lab = transcriptome_anno$V4, x = "log2FoldChange", y = "pvalue", title = "TO vs. b5-b6 Blastogenic Stages") ``` ### Upregulated - COX3: mitochondrial activity - PNPH: enzyme which reversibly catalyzes the phosphorolysis of purine nucleosides - ANX11: cell cycle regulation ### Downregulated - TUBB4B: involved in microtubule cytoskeleton organization and mitotic cell cycle # **Next Steps** 1. Further work out the biological meaning of my findings. 2. Complete functional enrichment analysis. 2. Apply this workflow to transcriptomic data that I plan to generate for preliminary study evaluating changes in gene expression over time in primary cultures of *Botryllus schlosseri*. ![](https://github.com/course-fish546-2023/celeste-tunicate-devo/blob/main/notes/next_schematic.jpeg?raw=true)