Project Summary

I have conducted a small scale replication of the transcriptomic BioProject PRJNA579844 which has 90 RNAseq run files itself.

Briefly, colonial marine tunicates like Botryllus schlosseri are capable of both sexually reproducing through a single fertilized egg (embryogenesis) and through asexual reproduction where new generations are derived from somatic cells of the parental body (blastogenesis). In healthy Botryllus schlosseri, blastogenesis occurs as a weekly synchronized process. Transcriptomic data available from this project isolated RNA across several developmental stages of Botryllus schlosseri from the whole body.

Within this project I will evaluate the whole transcriptome of an the early blastogenic stage b5 to b6 in the secondary bud and the late takeover (TO) blastogenic stage in the adult zooid. Below is a schematic that helps visualize these two stages:

Asexual developmental cycle of Botryllus schlosseri

Metadata for the four runs evalutated

Run	Bases	BioProject	Bytes	dev_stage	Organism	Platform	Tissue
SRR10352205	10576501436	PRJNA579844	6996363570	b5-b6	Botryllus schlosseri	ILLUMINA	whole system
SRR10352206	3609019200	PRJNA579844	1836994542	b5-b6	Botryllus schlosseri	ILLUMINA	whole system
SRR10352224	4847587800	PRJNA579844	2533047747	TO	Botryllus schlosseri	ILLUMINA	whole system
SRR10352254	5482368900	PRJNA579844	2984116333	TO	Botryllus schlosseri	ILLUMINA	whole system

Accessing Data

Go to the NCBI page for BioProject PRJNA579844 and select the numeric value next to “SRA Experiments” which indicates the number of runs in this project.
On the top right corener select “Send to” and from the drop down menu select “Run Selector.
Run selector will show you all the runs available for the BioProject of interest with all the metadata. You can download metadata or accession list (SRR numbers) to keep track of your runs of interest.

Identify raw data storage directory

Below we move to a large hard drive and make a new directory where we will store our data.

# move to a large hard drive
cd /home/shared/8TB_HDD_01

# making a new directory 
mkdir cvaldi/tunicate-fish546/raw

Downloading the RNAseq run files

Since we are downloading data from NCBI, we will use NCBI’s SRA Toolkit to do so. Below we bypass the prefetch command and use the fasterq-dump command directlry. Skipping the prefetch command just makes the downloading time longer. We specify the directory we want the raw data files stored in the ‘–outdir’ subcommand.

/home/shared/sratoolkit.3.0.2-ubuntu64/bin/./fasterq-dump \
--outdir /home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw \
--progress \
SRR10352205 \
SRR10352206 \
SRR10352224 \
SRR10352254

Gereate checksums

We will make checksums for all the raw fastq files using the ’*’ wildcard. This will help us evaluate data integrity downstream.

cd raw
md5sum *.fastq > md5.original.download.20230413

The following code chunk will show us the checksums in the new md5 file we made.

cat md5.original.download.20230413

QAQC Raw Files

Below, we run the fastqc software on all the fastq files in the raw directory. FastQC will produce html reports of the quality of the raw RNAseq data. We can use this information to trim the data, but we can loop back to this after conducting the initial exploratory flow.

cd ../raw
/home/shared/FastQC/fastqc *.fastq

Differential Gene Expression (DGE) Analysis

Botrylus schlosseri currently have several bioinformatic data files available for download on Stanford University’s Botryllus schlosseri Genome Project Server. To do any differential gene expression analysis, you will need either a reference (annotated) genome or reference transcriptome to be able to align your RNAseq data to that reference. Although the Botryllus schlosseri genome server does have annotated genomes in the form of gff files, they are in a non-standard format that need to be normalized in order for it to run through a standard genome alignment workflow using HISat2. As such, we will conduct a psuedo-alignment of our RNAseq files against a reference transcriptome that we generate from all the mRNA files available for this species on NCBI.

Creating a reference transcriptome

Navigate to the Botryllus schlosseri NCBI page.
Select “Nucleotides” of which there are currently about 100 K files currently available.
On the lefthand navigation menu, under “Molecule Type” select “mRNA” (there are about 99 K files). This will filter out the list we have.
Select “Send to” on the top part of the page and select “Complete Record”. This will download a file called “sequence.fasta” that will contain all ~99,000 mRNA files from that list. Download this file, because it is under 100 MB we can store it on GitHub.

Index reference transcriptome

Below, we use the kallisto index command to take the sequence.fasta file we downloaded, index it, name the index “bsc_rna.index” and store it in the data directory of our project.

/home/shared/kallisto/kallisto \
index -i \
../data/bsc_rna.index \
../data/sequence.fasta

Psuedo-alignment of RNAseq raw data

Below is a loop that will iterate through all the fastq files in the raw data directory and align them to the reference index using the quant command in Kallisto. It will quantify abundance of transcripts for each pair of fastq files (since we have paired-end reads for each query). It then takes the base name of the input file and creates and names a new folder fore each query based on the base name.

#!/bin/bash

# Set input directory and index file
input_dir=/home/shared/8TB_HDD_01/cvaldi/tunicate-fish546/raw/
index_file=../data/psuedo-alignment/bsc_rna.index

# Set output directory
output_dir=../output/kallisto_01

# Set number of threads
threads=10

# Loop through all paired-end fastq files in input directory
for f in ${input_dir}/*_1.fastq; do
    # Get base filename
    base=$(basename ${f} _1.fastq)
    
    # Run kallisto to quantify transcripts
    /home/shared/kallisto/kallisto quant -i ${index_file} \
    -o ${output_dir}/${base} \
    -t ${threads} \
    ${input_dir}/${base}_1.fastq ${input_dir}/${base}_2.fastq
done

Creating a gene expression count matrix

This command runs the abundance_estimates_to_matrix.pl script from the Trinity RNA-seq assembly software package to create a gene expression matrix from “abundance.tsv” Kallisto output files.

perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
    --gene_trans_map none \
    --out_prefix ../output/kallisto_01 \
    --name_sample_by_basedir \
    ../output/kallisto_01/SRR10352205/abundance.tsv \
    ../output/kallisto_01/SRR10352206/abundance.tsv \
    ../output/kallisto_01/SRR10352224/abundance.tsv \
    ../output/kallisto_01/SRR10352254/abundance.tsv

Visualization of DGE Analysis

Using DESeq2 to wrangle the count matrix in R

We will use the R package DESeq2 to help us digest the gene abundance count matrix we generated.

Below, I am reading in and modifying the count matrix we generated from Kallisto.

countmatrix <- read.delim("../output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X #this line renames the row names from 1-n to the value of the first column called x
countmatrix <- countmatrix[,-1] #this gets rid of the first column
countmatrix <- round(countmatrix, 0) #rounds values in count matrix
head(countmatrix) #take a peak at the data frame

##            SRR10352205 SRR10352206 SRR10352224 SRR10352254
## JG353837.1           0           0           0           0
## JG390417.1         840          26           0           0
## JG368003.1           0          28           0           0
## JG365003.1           0           0           0           0
## JG314272.1           0           0          17           2
## JG329595.1           0           0           0           0

We will then make a new data frame that will contain information about the treatment groups and the type of RNAseq data produced. This will be called the object “deseq2.colData”. We then use the DESeqDataSetFromMatrix function from the DESeq2 package.

deseq2.colData <- data.frame(condition=factor(c(rep("b5-b6", 2), rep("TO", 2))), 
                             type=factor(rep("paired-end", 4)))
rownames(deseq2.colData) <- colnames(data) 
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
                                     colData = deseq2.colData, 
                                     design = ~ condition)

Next we will conduct a DESeq analysis using the DESeq function

deseq2.dds <- DESeq(deseq2.dds) 
deseq2.res <- results(deseq2.dds) #this places the results in an object that we can open
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ] #reorders the rows
head(deseq2.res) #takes a peak

## log2 fold change (MLE): condition TO vs b5.b6 
## Wald test p-value: condition TO vs b5.b6 
## DataFrame with 6 rows and 6 columns
##             baseMean log2FoldChange     lfcSE      stat    pvalue      padj
##            <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
## AF201094.1   0.00000             NA        NA        NA        NA        NA
## AF201095.1   0.00000             NA        NA        NA        NA        NA
## AF201096.1   0.00000             NA        NA        NA        NA        NA
## AF201097.1   3.10213       -3.00407   3.36211 -0.893508  0.371585        NA
## AF201098.1   0.00000             NA        NA        NA        NA        NA
## AF201099.1   0.00000             NA        NA        NA        NA        NA

We can check the number of hits we got from our data using the nrow function which will count the number of rows where we have a p-adjusted value less than or equal to 0.05.

# Count number of hits with adjusted p-value less then 0.05
nrow(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])

Explatory visualization of results

tmp <- deseq2.res
# The main plot
plot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
     main="DEG Dessication  (pval <= 0.05)",
     xlab="mean of normalized counts",
     ylab="Log2 Fold Change")
# Getting the significant points and plotting them again so they're a different color
tmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]
points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red")
# 2 FC lines
abline(h=c(-1,1), col="blue")

PCA:

vsd <- vst(deseq2.dds, blind = FALSE)
plotPCA(vsd, intgroup = "condition")

Heatmap:

# Select top 50 differentially expressed genes
res <- results(deseq2.dds)
res_ordered <- res[order(res$padj), ]
top_genes <- row.names(res_ordered)[1:50]

# Extract counts and normalize
counts <- counts(deseq2.dds, normalized = TRUE)
counts_top <- counts[top_genes, ]

# Log-transform counts
log_counts_top <- log2(counts_top + 1)

# Generate heatmap
pheatmap(log_counts_top, scale = "row")

Volcano plot:

# Prepare the data for plotting
res_df <- as.data.frame(deseq2.res)
res_df$gene <- row.names(res_df)

# Create volcano plot
volcano_plot <- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = padj < 0.05)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("grey", "red")) +
  labs(title = "Volcano Plot",
       x = "Log2 Fold Change",
       y = "-Log10 Adjusted P-value",
       color = "Significantly\nDifferentially Expressed") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "top")

print(volcano_plot)

Finally, we should write out the list of differentially expressed genes into a tab file that we will need to annotate.

write.table(tmp.sig, "../output/DEGlist.tab", row.names = T)

This table contains information about our differentilly expressed genes but the gene ID’s are the accession numbers from NCBI. In order to make sense of the DEG list, we will need to annotate the file with actual gene names. To do this, we can use NCBI’s BLAST.

Using BLAST to annotate reference transcriptome

Downloading complete UniProt database.

UniProt is a freely accessible database of protein sequence and functional information that is continuously updated every 8 weeks.

cd ../data
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz
gunzip -k uniprot_sprot_r2023_01.fasta.gz
ls ../data

Making a BLAST database

Using the NCBI blast software command ‘makeblastdb’ to create a blast database from the UniProt fasta file.

/home/shared/ncbi-blast-2.11.0+/bin/makeblastdb \
-in /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/data/uniprot_sprot_r2023_01.fasta \
-dbtype prot \
-out /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/blastdb/uniprot_sprot_r2023_01

Blasting the reference transcriptome

Lets look at what’s in the sequence.fasta file before we BLAST it.

cd ../data/psuedo-alignment
head sequence.fasta
echo "How many sequences are there?"
grep -c ">" sequence.fasta

## >JG389647.1 CCTU20128.b1 CCTU Botryllus schlosseri total asexual and embronic development subtracted Botryllus schlosseri cDNA clone CCTU20128 5', mRNA sequence
## ATTCCTCCTCTATTTGCTTCGATGAACCTGTCATACCATACACCAGCTTGACTTGTCCTGGAGTGTCAGA
## CATTCCGTGTGATGTGTCATGTCCCACTGGACTAGTGTATGATGGATATAGATGCGTCAATATTGAAGAC
## TGTCCTTGCACCTACAAGGACGTTCGACATAAGTCCGGATCTGTGTGGAAGGAAGATGACGGTTGTGTCA
## ATTGCATCTGCATTGGCGGTCATTCTAATTGTGTCGCCCAAACCTGTGACGTAACAAGTTGTGAACCAGG
## ATACGCACTAATTGAAGTTGATGGTGAATGCTGTCCTCGTTGCTTGAAAACAAATGTGACCTGTAGCGAT
## GAAACTCATGAGATCGGAGAAGTCTGGCATGATGATTGTGACGTTTGCACATGCACTGAGTATGGAGTAG
## TCAAATGTGAACATGTTGACTGTCCAGCTACAACTCCGCCTGAATGCCCTGCGAGTCACGAATTAGTGCA
## GAAAGTTGGTGAATGCTGTCCGAAATATGAATGTGTCTGTAATACTGATCTTTGTGACACTTCTGAGCCG
## ACTTGCCCGGCGAACCACAAAGTTGTTGCATCCTTGACTGATGGATGTTGCCCAGAATATCAGTGCGTCT
## How many sequences are there?
## 99708

Here we use the blastx command to BLAST the reference transcriptome and make a new table that includes the estimated protein ID associated with the gene ID in our original reference transcriptome.

/home/shared/ncbi-blast-2.11.0+/bin/blastx \
-query /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/data/psuedo-alignment/sequence.fasta \
-db /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/blastdb/uniprot_sprot_r2023_01 \
-out /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/Bsc-uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 20 \
-max_target_seqs 1 \
-outfmt 6

Lets take a little peak at the tab file we just created:

head -5 ../output/Bsc-uniprot_blastx.tab
wc -l ../output/Bsc-uniprot_blastx.tab

## JG389643.1   sp|Q8IUL8|CILP2_HUMAN   40.714  140 77  5   160 564 59  197 2.95e-25    107
## JG389632.1   sp|Q12959|DLG1_HUMAN    82.143  56  10  0   325 492 746 801 2.30e-22    99.4
## JG389625.1   sp|Q920R6|VPP4_MOUSE    53.846  195 75  4   719 144 651 833 9.90e-58    199
## JG389624.1   sp|Q8AVM5|VPP1_XENLA    53.052  213 89  2   1   606 128 340 5.08e-71    229
## JG389624.1   sp|Q8AVM5|VPP1_XENLA    58.696  46  19  0   602 739 339 384 5.08e-71    60.5
## 36046 ../output/Bsc-uniprot_blastx.tab

Curl file from UniProt that contains protein data in a specified format (accession number, reviewed status, ID, protein name, gene names, organism name, length, GO function, GO, process, GO component, GO ID, interacting partners, EC number, Reactome pathway, UniPathway, and InterPro cross-references. This can be then used to

#this may not work, pulled uniprot tab from other source (gannet server)
cd ../data
curl -O -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength%2Cgo_f%2Cgo%2Cgo_p%2Cgo_c%2Cgo_id%2Ccc_interaction%2Cec%2Cxref_reactome%2Cxref_unipathway%2Cxref_interpro&format=tsv&query=%28%2A%29%20AND%20%28reviewed%3Atrue%29"

Below, we are going to clean up the tab file we created to seperate out the information regarding the estimated protein sequence.

tr '|' '\t' < ../output/Bsc-uniprot_blastx.tab | head -2

# replace the "|" with "\t" in the file "Ab_4-uniprot_blastx.tab"
# and save the output to the file "Ab_4-uniprot_blastx_sep.tab"

tr '|' '\t' < ../output/Bsc-uniprot_blastx.tab \
> ../output/Bsc-uniprot_blastx_sep.tab

# peaking at the output file earlier on
head -2 ../output/Bsc-uniprot_blastx_sep.tab
wc -l ../output/Bsc-uniprot_blastx_sep.tab

Merging the two tables

First read in the blast table

# reading into r environ the blast table we created
bltabl <- read.csv("../output/Bsc-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)

Then we are going to grab the UniProt table that is stored on Gannet.

#reading in annotation table from Gannet and saving as object....
library(httr)
url <- "https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab"
response <- GET(url, config(ssl_verifypeer = FALSE))
spgo <- read.csv(text = rawToChar(response$content), sep = "\t", header = TRUE)

Joining the two tables and saving this in our output directory as our blast annotated GO table called “blast_annot_go.tab”.

left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% 
  mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>%
  write_delim("../output/blast_annot_go.tab", delim = '\t')

Let’s look at the first 2 lines of our annotated reference transcriptome. It now contains the original gene ID and Gene Ontology (GO) terms that will help us understand our DEGlist from our differential gene expression analysis.

cd ../output
head -n 2 blast_annot_go.tab

## V1   V3  V13 Protein.names   Organism    Gene.Ontology..biological.process.  Gene.Ontology.IDs
## JG389643.1   Q8IUL8  2.95e-25    Cartilage intermediate layer protein 2 (CILP-2) [Cleaved into: Cartilage intermediate layer protein 2 C1; Cartilage intermediate layer protein 2 C2]    Homo sapiens (Human)        GO:0005615; GO:0070062

Merging the annotated table with our DEGlist

If you take a look at the DEGlist.tab, you will see that the row names themselves are the gene ID’s. This contrasts with the blast GO anotated table that has a whole column (V1) dedicated to the gene ID in the same format. Thus, we first need to take the row names of the DEGlist.tab and make them into a new column. It was originally in it’s own column but we needed to convert it so that it could go through the DESeq workflow.

cd ../output
head -n 3 DEGlist.tab

## "baseMean" "log2FoldChange" "lfcSE" "stat" "pvalue" "padj"
## "AF201122.1" 34.0837884412689 8.76461374330488 2.30407565629183 3.80396091568043 0.000142400657707493 0.00543819960949714
## "AF201134.1" 48.4069541381205 -6.97289922608518 2.4215424716195 -2.8795279487383 0.00398270995509485 0.0345116898968149

diff_genes <- read.delim(file = "../output/DEGlist.tab", sep =" ", header = TRUE)
diff_genes$access_num <- row.names(diff_genes) #adds the gene name into data frame as new column called gene
rownames(diff_genes) <- diff_genes$X #renames all the rows to 1-n instead of the gene name
diff_genes <- diff_genes[,c(7, 1:6)] #puts the gene column in front

Next we will read in the blast annotate GO terms table that we made above as an object into our R environment

blast_go <- read.delim(file = "../output/blast_annot_go.tab")
colnames(blast_go)[1] <- 'access_num' #renaming the first column so it has the same name as the diff_genes data frame above

Now that we have both the blast go and differential gene list in the right format, we can merge the two so that we have information regarding gene ID and function paired with differentially expressed genes.

# Perform inner join between diff_genes and blast_go data frames based on shared "access_num" column
annote_DEGlist <- merge(diff_genes, blast_go, by = "access_num", all = FALSE)

# Print the merged table
head(annote_DEGlist)

##   access_num  baseMean log2FoldChange    lfcSE      stat       pvalue
## 1 AF201134.1  48.40695      -6.972899 2.421542 -2.879528 3.982710e-03
## 2 AF201259.1 189.65058      -8.943305 2.422191 -3.692237 2.222898e-04
## 3 AY159281.1  49.51016      -7.005987 2.429210 -2.884060 3.925842e-03
## 4 EE743591.1 307.58870      -9.640638 2.355167 -4.093398 4.250971e-05
## 5 FF339616.1  67.95408      -7.461979 2.404802 -3.102949 1.916025e-03
## 6 FN178505.1  37.13056       8.578527 2.182042  3.931422 8.444502e-05
##          padj     V3       V13
## 1 0.034511690 P27604 4.60e-100
## 2 0.007245743 P62246  2.22e-82
## 3 0.034305090 P12716 5.79e-118
## 4 0.002401130 Q90474 1.99e-127
## 5 0.024035365 Q7TSK7  1.23e-26
## 6 0.003704028 P13789  5.68e-39
##                                                                                               Protein.names
## 1 Adenosylhomocysteinase (AdoHcyase) (EC 3.13.2.1) (Protein dumpy-14) (S-adenosyl-L-homocysteine hydrolase)
## 2                                                                                40S ribosomal protein S15a
## 3                     Actin, cytoplasmic (EC 3.6.4.-) [Cleaved into: Actin, cytoplasmic, intermediate form]
## 4                                                                         Heat shock protein HSP 90-alpha 1
## 5                              ADAMTS-like protein 2 (ADAMTSL-2) (TSP1-repeat-containing protein 1) (TCP-1)
## 6                                      Troponin T, cardiac muscle (TnTc) (Cardiac muscle troponin T) (cTnT)
##                                                  Organism
## 1                                  Caenorhabditis elegans
## 2                                 Rattus norvegicus (Rat)
## 3 Pisaster ochraceus (Ochre sea star) (Asterias ochracea)
## 4             Danio rerio (Zebrafish) (Brachydanio rerio)
## 5                                    Mus musculus (Mouse)
## 6                                     Bos taurus (Bovine)
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Gene.Ontology..biological.process.
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                             one-carbon metabolic process [GO:0006730]; S-adenosylhomocysteine catabolic process [GO:0019510]; S-adenosylmethionine cycle [GO:0033353]
## 2                                                                                                                                                                                                                                                                                                                                                                                                           positive regulation of cell cycle [GO:0045787]; positive regulation of cell population proliferation [GO:0008284]; response to virus [GO:0009615]; translation [GO:0006412]
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
## 4 cellular response to heat [GO:0034605]; leukocyte migration [GO:0050900]; muscle organ development [GO:0007517]; myofibril assembly [GO:0030239]; positive regulation of nitric oxide biosynthetic process [GO:0045429]; protein folding [GO:0006457]; protein stabilization [GO:0050821]; response to metal ion [GO:0010038]; sarcomerogenesis [GO:0048769]; skeletal muscle myosin thick filament assembly [GO:0030241]; skeletal muscle thin filament assembly [GO:0030240]; skeletal myofibril assembly [GO:0014866]; striated muscle myosin thick filament assembly [GO:0071688]
## 5                                                                                                                                                                                                                                                                                                                                                                                    extracellular matrix organization [GO:0030198]; lobar bronchus epithelium development [GO:0060481]; negative regulation of transforming growth factor beta receptor signaling pathway [GO:0030512]
## 6                                                                                                                                                                                                                                                                                                                                                                                                                          cardiac muscle contraction [GO:0060048]; muscle contraction [GO:0006936]; regulation of muscle contraction [GO:0006937]; sarcomere organization [GO:0045214]
##                                                                                                                                                                                                                                                                                                                                            Gene.Ontology.IDs
## 1                                                                                                                                                                                                                                                                                                 GO:0004013; GO:0005829; GO:0006730; GO:0019510; GO:0033353
## 2                                                                                                                                                                                                                                                 GO:0003735; GO:0005737; GO:0006412; GO:0008284; GO:0009615; GO:0022626; GO:0022627; GO:0045202; GO:0045787
## 3                                                                                                                                                                                                                                                                                                             GO:0005524; GO:0005737; GO:0005856; GO:0016787
## 4 GO:0005524; GO:0005634; GO:0005829; GO:0005886; GO:0006457; GO:0007517; GO:0010038; GO:0014866; GO:0016887; GO:0030018; GO:0030235; GO:0030239; GO:0030240; GO:0030241; GO:0031672; GO:0032991; GO:0034605; GO:0042470; GO:0043025; GO:0043209; GO:0045429; GO:0048471; GO:0048769; GO:0050821; GO:0050900; GO:0051082; GO:0071688; GO:0097718; GO:0140662
## 5                                                                                                                                                                                                                                                                                     GO:0005576; GO:0030198; GO:0030512; GO:0031012; GO:0050436; GO:0060481
## 6                                                                                                                                                                                                                                                             GO:0005523; GO:0005861; GO:0006936; GO:0006937; GO:0030172; GO:0031013; GO:0045214; GO:0060048

Finally, let’s write out the annotated DEG list to save it to our output directory.

write.table(annote_DEGlist, "../output/annote_DEGlist.tab", row.names = T)

Clean Figures

Modifying making a heat map with protein names on axis instead of gene accession number

make a list of top genes differentially expressed

# Get gene names
 # Select top 50 differentially expressed genes
deseq2.res <- results(deseq2.dds)
res_ordered <- deseq2.res[order(deseq2.res$padj), ]
top_genes <- row.names(res_ordered)[1:100] # here we are just going to make a list of the top de genes that we will apply later one

make a counts data frame to feed into merge() fxn

# Extract counts and normalizes them and then we put it all in a data frame so that it can
counts <- counts(deseq2.dds, normalized = TRUE)
countsdf <- as.data.frame(counts)
countsdf$access_num <- row.names(countsdf) #new column
countsdf <- countsdf[ , c(5, 1:4)] #puts accession col in front

read in blast annotated table

# read in blast annotated go table
blast_go <- read.delim(file = "../output/blast_annot_go.tab")
colnames(blast_go)[1] <- 'access_num' #renaming the first column so it has the same name as the diff_genes data frame above

merge blast df and count df

#merge the blast table and counts table together
counts_pro <- merge(countsdf, blast_go, by = "access_num", all = FALSE) #won't work unless they are both data frames

head(which(duplicated(counts_pro))) #checking if there any duplicates

## [1]  2  5 20 23 25 27

#eliminate duplicate accession numbers
counts_pro_unico <- distinct(counts_pro)

head(which(duplicated(counts_pro_unico))) #checking if there are any duplicates still, should be 0

## integer(0)

counts_matrix_unico <- as.matrix(counts_pro_unico) #make this a matrix again so it can take the rownames()

access_num_unico <- counts_matrix_unico[,1] #making a vector of names that I want to assign to the new matrix

rownames(counts_matrix_unico) <- access_num_unico #rename all rows the access num

We needed to name the rows the accession number since we will be using this to pull out the top differentially expressed genes that we named in the vector top_genes. Stuck here. Trying to extract only differentially expressed genes that have defined protein names.

# Identify top genes that are present in the counts_matrix_unico row names
present_genes <- intersect(top_genes, row.names(counts_matrix_unico))

# Extract the rows corresponding to the present genes
counts_top <- counts_matrix_unico[present_genes, ]

head(counts_top) #lets take a peak!

##            access_num   SRR10352205    SRR10352206    SRR10352224   
## HO069547.1 "HO069547.1" "1.142747e+03" "0.000000e+00" "    0.000000"
## JG340639.1 "JG340639.1" "9.921528e+02" "0.000000e+00" "    0.000000"
## JG337431.1 "JG337431.1" "1.419221e+03" "0.000000e+00" "    0.000000"
## JG333944.1 "JG333944.1" "8.909001e+02" "0.000000e+00" "    0.000000"
## JG332820.1 "JG332820.1" "6.986705e+02" "0.000000e+00" "    0.000000"
## JG330205.1 "JG330205.1" "5.686276e+02" "0.000000e+00" "    0.000000"
##            SRR10352254    V3       V13        
## HO069547.1 "    0.000000" "P86221" " 1.09e-96"
## JG340639.1 "    0.000000" "P27541" " 2.59e-56"
## JG337431.1 "    0.000000" "P62755" " 1.79e-83"
## JG333944.1 "    0.000000" "Q90YS3" "1.80e-114"
## JG332820.1 "    0.000000" "P18700" " 0.00e+00"
## JG330205.1 "    0.000000" "Q9U3U0" " 6.91e-90"
##            Protein.names                                  
## HO069547.1 "Tubulin beta-4B chain (Tubulin beta-2C chain)"
## JG340639.1 "Heat shock 70 kDa protein"                    
## JG337431.1 "40S ribosomal protein S6"                     
## JG333944.1 "40S ribosomal protein S2"                     
## JG332820.1 "Tubulin beta chain (Beta-tubulin)"            
## JG330205.1 "60S acidic ribosomal protein P0 (CcP0)"       
##            Organism                                                           
## HO069547.1 "Mesocricetus auratus (Golden hamster)"                            
## JG340639.1 "Brugia malayi (Filarial nematode worm)"                           
## JG337431.1 "Rattus norvegicus (Rat)"                                          
## JG333944.1 "Ictalurus punctatus (Channel catfish) (Silurus punctatus)"        
## JG332820.1 "Strongylocentrotus purpuratus (Purple sea urchin)"                
## JG330205.1 "Ceratitis capitata (Mediterranean fruit fly) (Tephritis capitata)"
##            Gene.Ontology..biological.process.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## HO069547.1 "microtubule-based process [GO:0007017]"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
## JG340639.1 "response to heat [GO:0009408]"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## JG337431.1 "activation-induced cell death of T cells [GO:0006924]; cellular response to ethanol [GO:0071361]; cytoplasmic translation [GO:0002181]; erythrocyte development [GO:0048821]; G1/S transition of mitotic cell cycle [GO:0000082]; gastrulation [GO:0007369]; glucose homeostasis [GO:0042593]; mammalian oogenesis stage [GO:0022605]; mitotic cell cycle [GO:0000278]; negative regulation of apoptotic process [GO:0043066]; negative regulation of bicellular tight junction assembly [GO:1903347]; placenta development [GO:0001890]; positive regulation of apoptotic process [GO:0043065]; positive regulation of cell population proliferation [GO:0008284]; response to insulin [GO:0032868]; ribosomal small subunit assembly [GO:0000028]; ribosomal small subunit biogenesis [GO:0042274]; rRNA processing [GO:0006364]; T cell differentiation in thymus [GO:0033077]; T cell proliferation involved in immune response [GO:0002309]; TOR signaling [GO:0031929]"
## JG333944.1 "translation [GO:0006412]"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
## JG332820.1 "microtubule cytoskeleton organization [GO:0000226]; mitotic cell cycle [GO:0000278]"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
## JG330205.1 "ribosome biogenesis [GO:0042254]"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
##            Gene.Ontology.IDs                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## HO069547.1 "GO:0003924; GO:0005200; GO:0005525; GO:0005737; GO:0005874; GO:0007017; GO:0046872"                                                                                                                                                                                                                                                                                                                                                                                                            
## JG340639.1 "GO:0005524; GO:0009408; GO:0140662"                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## JG337431.1 "GO:0000028; GO:0000082; GO:0000278; GO:0001890; GO:0002181; GO:0002309; GO:0003729; GO:0003735; GO:0005634; GO:0005730; GO:0005737; GO:0005840; GO:0005844; GO:0006364; GO:0006924; GO:0007369; GO:0008284; GO:0015935; GO:0019901; GO:0022605; GO:0022626; GO:0022627; GO:0030425; GO:0031929; GO:0032868; GO:0033077; GO:0036464; GO:0042274; GO:0042593; GO:0043065; GO:0043066; GO:0044297; GO:0048471; GO:0048821; GO:0071361; GO:0098793; GO:0098794; GO:0098982; GO:1903347; GO:1990904"
## JG333944.1 "GO:0003723; GO:0003735; GO:0006412; GO:0015935"                                                                                                                                                                                                                                                                                                                                                                                                                                                
## JG332820.1 "GO:0000226; GO:0000278; GO:0003924; GO:0005200; GO:0005525; GO:0005737; GO:0005874"                                                                                                                                                                                                                                                                                                                                                                                                            
## JG330205.1 "GO:0005840; GO:0042254; GO:1990904"

nrow(counts_top) #how many top DEGs are we going to graph?

## [1] 45

Renaming rows as protein names so that the axis has them

counts_top_pro <- counts_top[, c(2:5, 8)]
rownames(counts_top_pro) <- counts_top_pro[,"Protein.names"]
counts_top_pro <- counts_top_pro[ ,-5]
counts_top_pro <- as.matrix(counts_top_pro)

Next we have to do a log transformation of the counts but this first means making a new matrix that takes the contents of the previous one (charcter values) and converts them into numeric values to apply the arithmetic function of log2().

#convert the counts into numbers from characters into new matrix and renaming the col and row names of that new matrix
counts_top_pro2 <- matrix(as.numeric(counts_top_pro),
                          ncol = ncol(counts_top_pro))
colnames(counts_top_pro2) <- colnames(counts_top_pro)
protein_names <- rownames(counts_top_pro)

#scrub up protein names list so that each entry is not as long by removing things that are contained in parenthesis and brackets (i.e. additional but uncessary protein name info).
protein_names2 <- protein_names %>%
  sub(' \\([^)]+\\).*$', '', .) %>%
  sub('\\[.*?\\]', ' ', .)

rownames(counts_top_pro2) <- protein_names2

# Perform log transformation
log_counts_top2 <- log2(counts_top_pro2 + 1)

Now that we have a log counts object we can use it to start generating better looking plots:

#define some color palettes to play with from RColorBrewer
pink <- colorRampPalette(brewer.pal(8, "PiYG"))(25)
blue <- colorRampPalette(brewer.pal(8, "Blues"))(25)

# generating different plots with different colors/labels
heatmap(log_counts_top2, 
        scale="column", 
        col = blue, 
        cexCol = 0.8,
        margins = c(10,20),
        xlab = "Sample",
        ylab = "Protein Name")

library(gplots)

# Customize the heatmap using heatmap.2
heatmap.2(log_counts_top2,
          scale="column",
          cexRow=1,
          margins =c(10,20),
          col=pink,
          ylab = "Protien Names",
          xlab = "Sample Run ID",
          key=TRUE,                  # Add a legend
          keysize = 0.7,               # Set the size of the legend
          key.title = "Log Counts",  # Legend title
          key.xlab = "Expression",   # Legend x-axis label
          key.ylab = "Frequency",    # Legend y-axis label
          trace="none",              # Remove the trace lines
          cexCol = 1,              # Adjust column label font size
          cex.main = 1)              # Adjust main title font size

RNAseq Project Compendium

Celeste Valdivia

May 18, 2023