---
title: "Geoduck Blast 2023"
output: html_document
date: "2023-11-07"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Blast Tutorial Using Sample Geoduck File

Following the blast tutorial outlined in [TUSK](https://robertslab.github.io/tusk/modules/04-blast.html) to blast and annotate geoduck genes in the file provided in [this github issue](https://github.com/RobertsLab/resources/issues/1710). Important to note that the geoduck file is a *protein* file, so need to run *blastp* instead of *blastx* as listed in the tutorial.

Before starting, make a directory/repo to do all this in that contains the following folders:

-   Code (this is where this/the code file should be)

-   Data

-   Output

There are some packages that are used in this process including Dplyr, Stringr, ggplot2, and DT. Packages are installed/loaded here (disregard chunk if already loaded)

```{r}

# if you need to install, uncomment the following: 
# install.packages("BiocManager")
# BiocManager::install("Biostrings")
# install.packages("ggplot2")
# install.packages('DT')
# install.packages('dplyr')
# install.packages('stringr')

# load installed packages
library(BiocManager)
library(Biostrings)
library(ggplot2)
library(DT)
library(dplyr)
library(stringr)


```

## 1. Database Creation

This part is not unique to the file, it is creating the database you will use when running blast on your file (ie. you don't need to change this section with each different file of interest).

### Obtain Fasta (UniProt/Swiss-Prot)

```{bash}
cd ../data
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_04.fasta.gz
gunzip -k uniprot_sprot_r2023_04.fasta.gz
```

### Making the Database

```{bash}
mkdir ../blastdb
/home/shared/ncbi-blast-2.11.0+/bin/makeblastdb \
-in ../data/uniprot_sprot_r2023_04.fasta \
-dbtype prot \
-out ../blastdb/uniprot_sprot_r2023_04
```

## 2. Getting the query fasta file

This is where you start changing things for your specific file of interest.

You use curl to retrieve your file from gannet (or whatever server it is stored on).

```{bash}
curl https://gannet.fish.washington.edu/seashell/snaps/Rphil-assembly_v4_NoCont.AGAT_predicted-proteins.fasta \
-k \
> ../data/Rphil-assembly_v4_NoCont.AGAT_predicted-proteins.fasta

```

Taking a peek at that file via `head()` and getting a count of sequences

```{bash}

head -3 ../data/Rphil-assembly_v4_NoCont.AGAT_predicted-proteins.fasta

```

```{bash}

echo "How many sequences are there?"
grep -c ">" ../data/Rphil-assembly_v4_NoCont.AGAT_predicted-proteins.fasta

```

## 3. File Visualization

At this point, we use the BiocManager, Biostrings, and ggplot2 packages. There were some issues with version conflicts between the biostring package and the `ReadDNAStringSet()` function. So we skip this step for now.

```{r}

# Read FASTA file
fasta_file <- "../data/Rphil-assembly_v4_NoCont.AGAT_predicted-proteins.fasta"  # Replace with the name of your FASTA file
sequences <- ReadDNAStringSet(fasta_file)

# Calculate sequence lengths
sequence_lengths <- width(sequences)

# Create a data frame
sequence_lengths_df <- data.frame(Length = sequence_lengths)

# Plot histogram using ggplot2
ggplot(sequence_lengths_df, aes(x = Length)) +
  geom_histogram(binwidth = 1, color = "grey", fill = "blue", alpha = 0.75) +
  labs(title = "Histogram of Sequence Lengths",
       x = "Sequence Length",
       y = "Frequency") +
  theme_minimal()

```

## 4. Running Blastp

We run blastp since it is a protein file.

```{bash}
/home/shared/ncbi-blast-2.11.0+/bin/blastp \
-query ../data/Rphil-assembly_v4_NoCont.AGAT_predicted-proteins.fasta \
-db ../blastdb/uniprot_sprot_r2023_04 \
-out ../output/Rphil-assembly_v4_NoCont.AGAT_predicted-proteins.tab \
-evalue 1E-20 \
-num_threads 20 \
-max_target_seqs 1 \
-outfmt 6
```

Peeking at the output file

```{bash}
head -2 ../output/Rphil-assembly_v4_NoCont.AGAT_predicted-proteins.tab
```

```{bash}
echo "Number of lines in output"
wc -l ../output/Rphil-assembly_v4_NoCont.AGAT_predicted-proteins.tab
```

## 5. Joining Blast Table with Annotations

### Prepping Blast table for easy join

```{bash}

tr '|' '\t' < ../output/Rphil-assembly_v4_NoCont.AGAT_predicted-proteins.tab \
> ../output/Rphil-assembly_v4_NoCont.AGAT_predicted-proteins_sep.tab

#peeking to make sure it looks as expected
head -1 ../output/Rphil-assembly_v4_NoCont.AGAT_predicted-proteins_sep.tab


```

## 6. Blast Result Visualization

" Could do some cool stuff in R here reading in the table "

Save our output file and reference uniprot table as objects

```{r}

bltabl <- read.csv("../output/Rphil-assembly_v4_NoCont.AGAT_predicted-proteins_sep.tab", sep = '\t', header = FALSE)

spgo <- read.csv("https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE)

```

Load packages and create data tables of output file and uniprot reference. Uses the DT package.

```{r}

datatable(head(bltabl), options = list(scrollX = TRUE, scrollY = "400px", scrollCollapse = TRUE, paging = FALSE))
```

```{r}
datatable(head(spgo), options = list(scrollX = TRUE, scrollY = "400px", scrollCollapse = TRUE, paging = FALSE))
```

Join the two datatables you created. Uses dplyr and stringr

```{r}

datatable(
  left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) 
  %>% mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab"))
)

```

Same code as above but saved as object

```{r}

annot_tab <-
  left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab"))

```

Get string counts, and create a plot of top 10 which will provide "top 10 species hits"

```{r}

# Read dataset
dataset <- read.csv("../output/Rphil-assembly_v4_NoCont.AGAT_predicted-proteins.tab", sep = '\t', header = FALSE)  # Replace with the path to your dataset

# Select the column of interest
#dataset$V1 <- "Organism"  # Replace with the name of the column of interest
#column_data <- dataset[[V]]

# Count the occurrences of the strings in the column
string_counts <- table(dataset$V2)

# Convert to a data frame, sort by count, and select the top 10
string_counts_df <- as.data.frame(string_counts)
colnames(string_counts_df) <- c("String", "Count")
string_counts_df <- string_counts_df[order(string_counts_df$Count, decreasing = TRUE), ]
top_10_strings <- head(string_counts_df, n = 10)

# Plot the top 10 most common strings using ggplot2
ggplot(top_10_strings, aes(x = reorder(String, -Count), y = Count, fill = String)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  labs(title = "Top 10 Species hits",
       x = dataset$V2,
       y = "Count") +
  theme_minimal() +
  theme(legend.position = "none") +
  coord_flip()


```

Produces graph of top 20 biological processes

-   note that the legend identifying what each column is, is saved as a png in the output folder in the last chunk of code.

```{r}
data <- annot_tab

# Rename the `Gene.Ontology..biological.process.` column to `Biological_Process`
colnames(data)[colnames(data) == "Gene.Ontology..biological.process."] <- "Biological_Process"

# Separate the `Biological_Process` column into individual biological processes
data_separated <- unlist(strsplit(data$Biological_Process, split = ";"))

# Trim whitespace from the biological processes
data_separated <- gsub("^\\s+|\\s+$", "", data_separated)

# Count the occurrences of each biological process
process_counts <- table(data_separated)
process_counts <- data.frame(Biological_Process = names(process_counts), Count = as.integer(process_counts))
process_counts <- process_counts[order(-process_counts$Count), ]

# Select the 20 most predominant biological processes
top_20_processes <- process_counts[1:20, ]

# Create a color palette for the bars
bar_colors <- rainbow(nrow(top_20_processes))

# Create a staggered vertical bar plot with different colors for each bar
barplot(top_20_processes$Count, names.arg = rep("", nrow(top_20_processes)), col = bar_colors,
        ylim = c(0, max(top_20_processes$Count) * 1.25),
        main = "Occurrences of the 20 Most Predominant Biological Processes", xlab = "Biological Process", ylab = "Count")


# Create a separate plot for the legend
png("../output/GOlegend.png", width = 800, height = 600)
par(mar = c(0, 0, 0, 0))
plot.new()
legend("center", legend = top_20_processes$Biological_Process, fill = bar_colors, cex = 1, title = "Biological Processes")
dev.off()
```

```{r}
knitr::include_graphics("../output/GOlegend.png")
```