---
title: "03-Blast md"
author: Hannia Larino 
output: html_document
date: "2025-04-17"
output: 
  html_document:
    theme: readable
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
editor_options: 
  markdown: 
    wrap: 72
---

**Introduction**

Using the instructions on <https://sr320.github.io/course-fish546-2025/assignments/01-blast.html>, this document will show the different components required to complete a BLAST protein analysis. There are X sections, with the final product displaying tables of the results.

Links used: *NCBI software: <https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/>* Swiss Prot protein sequences .gz file: <https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz> \*Query sequence FASTA file: <https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa>

The query sequences are being compared to the UniProt Swiss-Prot protein database that is created in section 2.

Libraries used: *tidyverse* kableExtra *knitr* DT *Biostrings* tm

------------------------------------------------------------------------

**DOWNLOADING NCBI SOFTWARE**

1.  Downloading latest NCBI software .gz file from <https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/>

```{r, engine='bash'}

cd /home/jovyan/applications #setting directory 
curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz #uploading NCBI software to directory 
tar -xf ncbi-blast-2.16.0+-x64-linux.tar.gz #extract files from .tar archive 

```

#Check to see if the software downloaded in the "applications" folder.

```{r, engine='bash'}
ls /home/jovyan/applications/ncbi-blast-2.16.0+-x64-linux.tar.gz
```

**MAKING THE DATABASE**

2.  Make a blast database. Download latest Swiss-Prot protein sequences.

```{r, engine='bash'}
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
#curl -O saves output to a file. This file is an unzipped folder. 

mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz
#mv used to move or rename files and directories 
#accidentally renamed 2023 instead of 2025 

gunzip -k uniprot_sprot_r2025_01.fasta.gz
#gunzip -k decompresses file but preserves original .gz file. 
ls 
#lists files and directories in current directory, when run ls w/out argument, it displays names of files and folders in current directory 
```

##Uniprot files downloaded in blastdb folder. Use 'head' to see file.

```{r, engine='bash'}
#checking to see first lines of file 
cd /home/jovyan/blastdb
head /home/jovyan/blastdb/uniprot* 
```

###This actually creates the BLAST database in the blastdb folder.

```{r, engine='bash'}
/home/jovyan/applications/ncbi-blast-2.16.0+/bin/makeblastdb \
-in ../../blastdb/uniprot_sprot_r2023_01.fasta \
-dbtype prot \
-out /home/jovyan/blastdb/uniprot_sprot_r2023_01

```

**GETTING QUERY SEQUENCE** 3. Download the FASTA file "Ab_4denovo_CLC6_a.fa"

```{r, engine='bash'}
cd /home/jovyan/applications/data
curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \
-k \
> /home/jovyan/applications/data/Ab_4denovo_CLC6_a.fa 
```

#View head of Ab_4denovo.

```{r, engine='bash'}
head /home/jovyan/applications/data/Ab_4denovo_CLC6_a.fa

```

##Checking more of the query sequences.

```{r, engine='bash'}
echo "How many sequences are there?"
grep -c ">" /home/jovyan/applications/data/Ab_4denovo_CLC6_a.fa

#grep -c Counts the number of lines that start with > — which correspond to sequence headers in a FASTA file. This gives you the number of sequences in the file.
```

#Histogram of blast file

```{r histogram, eval=TRUE}
# Read FASTA file

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Biostrings")
library(Biostrings)

fasta_file <- "/home/jovyan/applications/data/Ab_4denovo_CLC6_a.fa"  # Replace with the name of your FASTA file
sequences <- readDNAStringSet(fasta_file)

# Calculate sequence lengths
sequence_lengths <- width(sequences)

# Create a data frame
sequence_lengths_df <- data.frame(Length = sequence_lengths)

# Plot histogram using ggplot2
library(ggplot2)
ggplot(sequence_lengths_df, aes(x = Length)) +
  geom_histogram(binwidth = 1, color = "grey", fill = "blue", alpha = 0.75) +
  labs(title = "Histogram of Sequence Lengths",
       x = "Sequence Length",
       y = "Frequency") +
  theme_minimal()


```

#4.**RUNNING BLAST**

*Translates nucleotide sequences in Ab_4denovo_CLC6_a.fa file in all six reading frames.* Comparing those translated proteins against the Swiss-Prot section of UniProt (uniprot_sprot_r2023_01). *Outputs top hits (1 per query) with very stringent matches (evalue 1E-20).* Produces output in tabular format saved to: output/Ab_4-uniprot_blastx.tab.

```{r, engine='bash'}
/home/jovyan/applications/ncbi-blast-2.16.0+/bin/blastx \
-query /home/jovyan/applications/data/Ab_4denovo_CLC6_a.fa \
-db ../../blastdb/uniprot_sprot_r2023_01 \
-out /home/jovyan/applications/output/Ab_4-uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 1 \
-max_target_seqs 1 \
-outfmt 6
```

#Previewing the BLAST results.

```{r, engine='bash'}

head -2 /home/jovyan/applications/output/Ab_4-uniprot_blastx.tab
#-2 part says you only want to see first two lines of file 
#.tab = tab separated file w annotation data from UniProt's REST API. 
wc -l /home/jovyan/applications/output/Ab_4-uniprot_blastx.tab
#wc = word count, -1 = counts number of files 

```

```{r, engine='bash'}

head -2 /home/jovyan/applications/output/Ab_4-uniprot_blastx_sep.tab
#-2 part says you only want to see first two lines of file 
#.tab = tab separated file w annotation data from UniProt's REST API. 
wc -l /home/jovyan/applications/output/Ab_4-uniprot_blastx.tab
#wc = word count, -1 = counts number of files 

```

**CREATING TABLES** 5. check note: Joining blast table with annotation table. locate blast output table and locate blast annotation table.

#Replacing all pipe (\|) characters with tabs (). This will make further code easier.

```{r, engine='bash'}
tr '|' '\t' < /home/jovyan/applications/output/Ab_4-uniprot_blastx.tab | head -2

#tr '|' '\t'  Translates (replaces) all pipe (|) characters with tabs (\t) — useful for parsing when identifiers are delimited like sp|P12345|PROT_HUMAN. 
#'< ... .tab'  Feeds in your BLAST output file as input to tr. 
#'| head -2' Only shows the first 2 lines
#replacing pipes with tabs makes it easier to extract the second column with tools like cut, awk, or pandas.
```

##Creating a reformatted table of Ab_4-uniprot_blastx.tab called "Ab_4-uniprot_blastx_sep.tab"

```{bash}
cd /home/jovyan/applications/output 
tr '|' '\t' < ../output/Ab_4-uniprot_blastx.tab \
> ../output/Ab_4-uniprot_blastx_sep.tab
```

```{bash}
head -2 /home/jovyan/applications/output/Ab_4-uniprot_blastx_sep.tab #shows first 2 lines 
wc -l /home/jovyan/applications/output/Ab_4-uniprot_blastx_sep.tab #spits out lines in file.
```

#Downloading libraries to create tables.

```{r, engine='r'}
library(tidyverse)
install.packages("kableExtra")
library("kableExtra")
```

#reading in the blast .tab results, named bltabl.

```{r, engine='r'}
bltabl <- read.csv("/home/jovyan/applications/output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)
bltabl

spgo <- read.csv("https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE)
spgo
```

#displaying first few rows of bltabl using interactive HTML data table

```{r, eval=TRUE}
install.packages("DT")
library(DT)
datatable(head(bltabl), options = list(scrollX = TRUE, scrollY = "400px", scrollCollapse = TRUE, paging = FALSE))
```

```{r spgo-table, eval=TRUE}
datatable(head(spgo), options = list(scrollX = TRUE, scrollY = "400px", scrollCollapse = TRUE, paging = FALSE))
```

#combined bltabl and spgo results

```{r see, eval=TRUE}

library(dplyr)
library(stringr)

datatable(
  left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab"))
)
```

#annotated table

```{r join}
annot_tab <-
  left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab"))
annot_tab
```