---
title: "01-blast"
format: html
editor: visual
---
I'm going to download BLAST and use it to compare it to unknown sequences.
```{bash}
cd /home/jovyan/applications
curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz
tar -xf ncbi-blast-2.16.0+-x64-linux.tar.gz
```
```{bash}
/home/jovyan/applications/ncbi-blast-2.16.0+/bin/blastx -h #verify installation
```
# Make BLAST database
We're using Swiss-Prot in UniProtKB.
```{bash}
pwd
```
Our working directory is /home/jovyan/renee-myco/assignments, where this Quarto file is located. So this is the wd that this document defaults to. (Realizing this sooner would have saved me several headaches.)
```{bash}
cd ../../blastdb
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2025_01.fasta.gz
gunzip -k uniprot_sprot_r2025_01.fasta.gz
ls
```
```{bash}
head ../../blastdb/unip* #check the file
```
Looking good so far:
```
==> ../../blastdb/uniprot_sprot_r2025_01.fasta <==
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-001R PE=4 SV=1
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL
EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD
SFRKIYTDLGWKFTPL
>sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-002L PE=4 SV=1
MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR
IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL
AERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAADC
```
```{bash}
/home/jovyan/applications/ncbi-blast-2.16.0+/bin/makeblastdb \
-in ../../blastdb/uniprot_sprot_r2025_01.fasta \
-dbtype prot \
-out ../../blastdb/uniprot_sprot_r2025_01
```
```
Building a new DB, current time: 04/10/2025 10:45:01
New DB name: /home/jovyan/blastdb/uniprot_sprot_r2025_01
New DB title: ../../blastdb/uniprot_sprot_r2025_01.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /home/jovyan/blastdb/uniprot_sprot_r2025_01
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 572970 sequences in 16.0218 seconds.
```
# Get the query sequence
```{bash}
curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \
-k \
> ../../blastdb/Ab_4denovo_CLC6_a.fa
head ../../blastdb/Ab_4denovo_CLC6_a.fa
echo "How many sequences are there?"
grep -c ">" ../../blastdb/Ab_4denovo_CLC6_a.fa
```
There were some hiccups here, mainly with hidden characters (extra spaces after slashes) with the curl command. However, I was able to confirm 5490 sequences.
# Run BLAST
```{bash}
../../applications/ncbi-blast-2.16.0+/bin/blastx \
-query ../../blastdb/Ab_4denovo_CLC6_a.fa \
-db ../../blastdb/uniprot_sprot_r2025_01 \
-out ../../output/Ab_4-uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 20 \
-max_target_seqs 1 \
-outfmt 6
head -2 ../../output/Ab_4-uniprot_blastx.tab
wc -l ../../output/Ab_4-uniprot_blastx.tab
```
Running BLAST took about 45 minutes. Here was the output:
```
Warning: [Query 5476-5490] Examining 5 or more matches is recommended
Warning: [Query 5476-5490] Number of threads was reduced to 16 to match the number of available CPUs
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.83e-103 301
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT 75.385 65 16 0 3 197 121 185 1.41e-28 104
765 ../../output/Ab_4-uniprot_blastx.tab
```
So far, I can interpret that there were 765 hits (no. rows in Ab_4-uniprot_blastx.tab).
# Getting more information
Obtaining an annotation table to use with the BLAST output table.
```{r}
spgo <- read.csv("https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE)
```
## Joining blast table with annotation table
```{bash}
head -2 ../../output/Ab_4-uniprot_blastx.tab
wc -l ../../output/Ab_4-uniprot_blastx.tab
```
```
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.83e-103 301
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT 75.385 65 16 0 3 197 121 185 1.41e-28 104
81 ../../output/Ab_4-uniprot_blastx.tab
```
```{bash}
tr '|' '\t' < ../../output/Ab_4-uniprot_blastx.tab | head -2
```
```
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp O42248 GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.83e-103 301
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp Q08013 SSRG_RAT 75.385 65 16 0 3 197 121 185 1.41e-28 104
```
```{bash}
tr '|' '\t' < ../../output/Ab_4-uniprot_blastx.tab \
> ../../output/Ab_4-uniprot_blastx_sep.tab
ls ../../output
```
```
Ab_4-uniprot_blastx.tab
Ab_4-uniprot_blastx_sep.tab
```
```{bash}
head -2 ../../output/Ab_4-uniprot_blastx.tab
wc -l ../../output/Ab_4-uniprot_blastx.tab
```
```
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.83e-103 301
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT 75.385 65 16 0 3 197 121 185 1.41e-28 104
81 ../../output/Ab_4-uniprot_blastx.tab
```
# Creating a table
The following is all draft code needs editing!
```{r}
library(tidyverse)
library("kableExtra")
```
```{r}
bltabl <- read.csv("../../output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)
```
```{r echo=TRUE}
kbl(
head(
left_join(bltabl, spgo, by = c("V3" = "Entry")) %>%
select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1,
pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab"))
)
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
left_join(bltabl, spgo, by = c("V3" = "Entry")) %>%
select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1,
pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>%
write_delim("../../output/blast_annot_go.tab", delim = '\t')
```