--- title: "Using NCBI BLAST" subtitle: "Taking a set of unknown sequence files and annotating them" date: "`r format(Sys.time(), '%B %d, %Y')`" output: html_document: theme: cosmo highlight: textmate toc: true toc_float: true toc_depth: 3 --- ```{r setup, include=FALSE} library(knitr) library(tidyverse) library(kableExtra) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Setting this to false will allow us to knit the codument without evaluating code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages fig.width = 6, # Set plot width in inches fig.height = 4, # Set plot height in inches fig.align = "center" # Align plots to the center ) ``` ----------------- ## Downloading complete UniProt database. UniProt is a freely accessible database of protein sequence and functional information that is continuously updated every 8 weeks. ```{r uniprot data download, engine = 'bash'} cd ../data curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz gunzip -k uniprot_sprot_r2023_01.fasta.gz ls ../data ``` ## Making a BLAST database Using the NCBI blast software command 'makeblastdb' to create a blast database from the UniProt fasta file. ```{r generate blast database, engine='bash'} /home/shared/ncbi-blast-2.11.0+/bin/makeblastdb \ -in /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/data/uniprot_sprot_r2023_01.fasta \ -dbtype prot \ -out /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/blastdb/uniprot_sprot_r2023_01 ``` ## Blasting the reference transcriptome Lets look at what's in the sequence.fasta file before we BLAST it. ```{r evaluate reference for content, engine='bash'} cd data/psuedo-alignment head sequence.fasta echo "How many sequences are there?" grep -c ">" sequence.fasta ``` Here we use the blastx command to BLAST the reference transcriptome and make a new table that includes the estimated protein ID associated with the gene ID in our original reference transcriptome. ```{r blast reference, engine='bash'} /home/shared/ncbi-blast-2.11.0+/bin/blastx \ -query /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/data/psuedo-alignment/sequence.fasta \ -db /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/blastdb/uniprot_sprot_r2023_01 \ -out /home/shared/8TB_HDD_02/cvaldi/celeste-tunicate-devo/output/Bsc-uniprot_blastx.tab \ -evalue 1E-20 \ -num_threads 20 \ -max_target_seqs 1 \ -outfmt 6 ``` Lets take a little peak at the tab file we just created: ```{r evaluate blast output contents, engine='bash'} head -5 ../output/Bsc-uniprot_blastx.tab wc -l ../output/Bsc-uniprot_blastx.tab ``` Curl file from UniProt that contains protein data in a specified format (accession number, reviewed status, ID, protein name, gene names, organism name, length, GO function, GO, process, GO component, GO ID, interacting partners, EC number, Reactome pathway, UniPathway, and InterPro cross-references. This can be then used to ```{r read in uniprot tab, engine='bash'} #this may not work, pulled uniprot tab from other source (gannet server) cd ../data curl -O -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength%2Cgo_f%2Cgo%2Cgo_p%2Cgo_c%2Cgo_id%2Ccc_interaction%2Cec%2Cxref_reactome%2Cxref_unipathway%2Cxref_interpro&format=tsv&query=%28%2A%29%20AND%20%28reviewed%3Atrue%29" ``` Below, we are going to clean up the tab file we created to seperate out the information regarding the estimated protein sequence. ```{r cleanup blast output, engine = 'bash'} tr '|' '\t' < ../output/Bsc-uniprot_blastx.tab | head -2 # replace the "|" with "\t" in the file "Ab_4-uniprot_blastx.tab" # and save the output to the file "Ab_4-uniprot_blastx_sep.tab" tr '|' '\t' < ../output/Bsc-uniprot_blastx.tab \ > ../output/Bsc-uniprot_blastx_sep.tab # peaking at the output file earlier on head -2 ../output/Bsc-uniprot_blastx_sep.tab wc -l ../output/Bsc-uniprot_blastx_sep.tab ``` ## Merging the two tables First read in the blast table ```{r read in blast output} # reading into r environ the blast table we created bltabl <- read.csv("../output/Bsc-uniprot_blastx_sep.tab", sep = '\t', header = FALSE) ``` Then we are going to grab the UniProt table that is stored on Gannet. ```{r read in uniprot table} #reading in annotation table from Gannet and saving as object.... library(httr) url <- "https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab" response <- GET(url, config(ssl_verifypeer = FALSE)) spgo <- read.csv(text = rawToChar(response$content), sep = "\t", header = TRUE) ``` Joining the two tables and saving this in our output directory as our blast annotated GO table called "blast_annot_go.tab". ```{r join uniprot table and blast table} left_join(bltabl, spgo, by = c("V3" = "Entry")) %>% select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>% write_delim("../output/blast_annot_go.tab", delim = '\t') ``` Let's look at the first three lines of our annotated reference transcriptome. It now contains the original gene ID and Gene Ontology (GO) terms that will help us understand our DEGlist from our differential gene expression analysis. ```{r view blast annotated go table, engine ='bash', eval = TRUE} cd ../output head -n 3 blast_annot_go.tab ``` # Merging the annotated table with our DEGlist If you take a look at the DEGlist.tab, you will see that the row names themselves are the gene ID's. This contrasts with the blast GO anotated table that has a whole column (V1) dedicated to the gene ID in the same format. Thus, we first need to take the row names of the DEGlist.tab and make them into a new column. It was originally in it's own column but we needed to convert it so that it could go through the DESeq workflow. ```{r evaluate DEGlist content, engine='bash'} cd ../output head -n 3 DEGlist.tab ``` ```{r clean up DEGlist, eval=TRUE} diff_genes <- read.delim(file = "../output/DEGlist.tab", sep =" ", header = TRUE) diff_genes$access_num <- row.names(diff_genes) #adds the gene name into data frame as new column called gene rownames(diff_genes) <- diff_genes$X #renames all the rows to 1-n instead of the gene name diff_genes <- diff_genes[,c(7, 1:6)] #puts the gene column in front head(diff_genes) ``` Next we will read in the blast annotate GO terms table that we made above as an object into our R environment ```{r read in blast annotated go table, eval=TRUE} blast_go <- read.delim(file = "../output/blast_annot_go.tab") colnames(blast_go)[1] <- 'access_num' #renaming the first column so it has the same name as the diff_genes data frame above head(blast_go) ``` Now that we have both the blast go and differential gene list in the right format, we can merge the two so that we have information regarding gene ID and function paired with differentially expressed genes. ```{r, eval=TRUE, cache=TRUE} # Perform inner join between diff_genes and blast_go data frames based on shared "access_num" column annote_DEGlist <- merge(diff_genes, blast_go, by = "access_num", all = FALSE) # Print the merged table head(annote_DEGlist) ``` Finally, let's write out the annotated DEG list to save it to our output directory. ```{r} write.table(annote_DEGlist, "../output/annote_DEGlist.tab", row.names = T) ```