--- title: "lurida blast" author: "Megan Ewing" date: "2024-04-16" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` There are some packages that are used in this process including Dplyr, Stringr, ggplot2, and DT. Packages are installed/loaded here (disregard chunk if already loaded) ```{r} # if you need to install, uncomment the following: # install.packages("BiocManager") # BiocManager::install("Biostrings") # install.packages("ggplot2") install.packages('DT') # install.packages('dplyr') # install.packages('stringr') # load installed packages library(BiocManager) library(Biostrings) library(ggplot2) library(DT) library(dplyr) library(stringr) ``` ## Database Creation This part is [**not unique to the file**]{.underline}, it is creating the database you will use when running blast on your file (ie. you don't need to change this section with each different file of interest). ### Obtain Fasta (UniProt/Swiss-Prot) ```{bash} cd ../data curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz mv uniprot_sprot.fasta.gz uniprot_sprot_r2024_02.fasta.gz gunzip -k uniprot_sprot_r2024_02.fasta.gz ``` ### Making the Database ```{bash} mkdir ../blastdb /home/shared/ncbi-blast-2.15.0+/bin/makeblastdb \ -in ../data/uniprot_sprot_r2024_02.fasta \ -dbtype prot \ -out ../blastdb/uniprot_sprot_r2024_02 ``` ## Getting the query fasta file This is where you start changing things for your specific [**file of interest.**]{.underline} You use curl to retrieve your file from gannet (or whatever server it is stored on). ```{bash} wget -O ../data/olurida_transcriptome_v3.fasta http://eagle.fish.washington.edu/cnidarian/Olurida_transcriptome_v3.fasta ``` ``` ``` Taking a peek at that file via `head()` and getting a count of sequences ```{bash} head -3 ../data/olurida_transcriptome_v3.fasta ``` ```{bash} echo "How many sequences are there?" grep -c ">" ../data/olurida_transcriptome_v3.fasta ``` ## Running Blast We run blast since it is a protein file. ```{bash} /home/shared/ncbi-blast-2.15.0+/bin/blastx \ -query ../data/olurida_transcriptome_v3.fasta \ -db ../blastdb/uniprot_sprot_r2024_02 \ -out ../output/olurida_blastout.tab \ -evalue 1E-20 \ -num_threads 20 \ -max_target_seqs 1 \ -outfmt 6 ``` Peeking at the output file ```{bash} head -2 ../blastdb/uniprot_sprot_r2024_04 ``` ```{bash} echo "Number of lines in output" wc -l ../output/olurida_blastout.tab ``` ## Joining Blast Table with Annotations ### Prepping Blast table for easy join ```{bash} tr '|' '\t' < ../output/olurida_blastout.tab \ > ../output/olurida_blastout_sep.tab #peeking to make sure it looks as expected head -1 ../output/olurida_blastout_sep.tab ``` ## Blast Result Visualization " Could do some cool stuff in R here reading in the table " Save our output file and reference uniprot table as objects ```{r} bltabl <- read.csv("../output/olurida_blastout_sep.tab", sep = '\t', header = FALSE) spgo <- read.csv("https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE) ``` Load packages and create data tables of output file and uniprot reference. Uses the DT package. ```{r} datatable(head(bltabl), options = list(scrollX = TRUE, scrollY = "400px", scrollCollapse = TRUE, paging = FALSE)) ``` ```{r} datatable(head(spgo), options = list(scrollX = TRUE, scrollY = "400px", scrollCollapse = TRUE, paging = FALSE)) ``` Join the two datatables you created. Uses dplyr and stringr ```{r} datatable( left_join(bltabl, spgo, by = c("V3" = "Entry")) %>% select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) ) ``` Same code as above but saved as object ```{r} annot_tab <- left_join(bltabl, spgo, by = c("V3" = "Entry")) %>% select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) ``` Get string counts, and create a plot of top 10 which will provide "top 10 species hits" ```{r} # Read dataset dataset <- read.csv("../output/olurida_blastout.tab", sep = '\t', header = FALSE) # Replace with the path to your dataset # Select the column of interest #dataset$V1 <- "Organism" # Replace with the name of the column of interest #column_data <- dataset[[V]] # Count the occurrences of the strings in the column string_counts <- table(dataset$V2) # Convert to a data frame, sort by count, and select the top 10 string_counts_df <- as.data.frame(string_counts) colnames(string_counts_df) <- c("String", "Count") string_counts_df <- string_counts_df[order(string_counts_df$Count, decreasing = TRUE), ] top_10_strings <- head(string_counts_df, n = 10) # Plot the top 10 most common strings using ggplot2 ggplot(top_10_strings, aes(x = reorder(String, -Count), y = Count, fill = String)) + geom_bar(stat = "identity", position = "dodge", color = "black") + labs(title = "Top 10 Species hits", x = dataset$V2, y = "Count") + theme_minimal() + theme(legend.position = "none") + coord_flip() ``` Produces graph of top 20 biological processes - note that the legend identifying what each column is, is saved as a png in the output folder in the last chunk of code. ```{r} data <- annot_tab # Rename the `Gene.Ontology..biological.process.` column to `Biological_Process` colnames(data)[colnames(data) == "Gene.Ontology..biological.process."] <- "Biological_Process" # Separate the `Biological_Process` column into individual biological processes data_separated <- unlist(strsplit(data$Biological_Process, split = ";")) # Trim whitespace from the biological processes data_separated <- gsub("^\\s+|\\s+$", "", data_separated) # Count the occurrences of each biological process process_counts <- table(data_separated) process_counts <- data.frame(Biological_Process = names(process_counts), Count = as.integer(process_counts)) process_counts <- process_counts[order(-process_counts$Count), ] # Select the 20 most predominant biological processes top_20_processes <- process_counts[1:20, ] # Create a color palette for the bars bar_colors <- rainbow(nrow(top_20_processes)) # Create a staggered vertical bar plot with different colors for each bar barplot(top_20_processes$Count, names.arg = rep("", nrow(top_20_processes)), col = bar_colors, ylim = c(0, max(top_20_processes$Count) * 1.25), main = "Occurrences of the 20 Most Predominant Biological Processes", xlab = "Biological Process", ylab = "Count") # Create a separate plot for the legend png("../output/GOlegend.png", width = 800, height = 600) par(mar = c(0, 0, 0, 0)) plot.new() legend("center", legend = top_20_processes$Biological_Process, fill = bar_colors, cex = 1, title = "Biological Processes") dev.off() ``` ```{r} knitr::include_graphics("../output/GOlegend.png") ``` we shall break this code :D we really will break this code!