This report documents the installation, verification, and use of a local NCBI BLAST database to query an unknown sequence. For this we will take the following steps:

  1. Make the BLAST database on a high performance university server (Raven),

  2. Obtain a query sequence,

  3. Run BLAST,

  4. Obtain annotations,

  5. Create a table with the results.

1 Make BLAST database

1.1 Download NCBI BLAST files

To begin, we will download BLAST files to an applications folder, a sibling folder to our GitHub repository to avoid the file limit constraints.

cd /home/jovyan/applications #is this a good relative file path? Reproducible?
curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz
tar -xf ncbi-blast-2.16.0+-x64-linux.tar.gz

The following code previews the file (header) to help us verify that the above curl commands downloaded the software correctly.

/home/jovyan/applications/ncbi-blast-2.16.0+/bin/blastx -h #verify installation
## USAGE
##   blastx [-h] [-help] [-import_search_strategy filename]
##     [-export_search_strategy filename] [-task task_name] [-db database_name]
##     [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
##     [-negative_gilist filename] [-negative_seqidlist filename]
##     [-taxids taxids] [-negative_taxids taxids] [-taxidlist filename]
##     [-negative_taxidlist filename] [-no_taxid_expansion] [-ipglist filename]
##     [-negative_ipglist filename] [-entrez_query entrez_query]
##     [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
##     [-subject subject_input_file] [-subject_loc range] [-query input_file]
##     [-out output_file] [-evalue evalue] [-word_size int_value]
##     [-gapopen open_penalty] [-gapextend extend_penalty]
##     [-qcov_hsp_perc float_value] [-max_hsps int_value]
##     [-xdrop_ungap float_value] [-xdrop_gap float_value]
##     [-xdrop_gap_final float_value] [-searchsp int_value]
##     [-sum_stats bool_value] [-max_intron_length length] [-seg SEG_options]
##     [-soft_masking soft_masking] [-matrix matrix_name]
##     [-threshold float_value] [-culling_limit int_value]
##     [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
##     [-subject_besthit] [-window_size int_value] [-ungapped] [-lcase_masking]
##     [-query_loc range] [-strand strand] [-parse_deflines]
##     [-query_gencode int_value] [-outfmt format] [-show_gis]
##     [-num_descriptions int_value] [-num_alignments int_value]
##     [-line_length line_length] [-html] [-sorthits sort_hits]
##     [-sorthsps sort_hsps] [-max_target_seqs num_sequences]
##     [-num_threads int_value] [-mt_mode int_value] [-remote]
##     [-comp_based_stats compo] [-use_sw_tback] [-version]
## 
## DESCRIPTION
##    Translated Query-Protein Subject BLAST 2.16.0+
## 
## Use '-help' to print detailed descriptions of command line arguments

1.2 Download UniProtK/Swiss-Pro files

UniProt is a universal protein knowledgebase, and UniProtKB/Swiss-Prot is its manually annotated and section. These will provide a central resource for protein sequences and similar functional information.

cd ../../blastdb #directs the file downloads into a blastdb folder

curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

mv uniprot_sprot.fasta.gz uniprot_sprot_r2025_01.fasta.gz

gunzip -k uniprot_sprot_r2025_01.fasta.gz #unzips the files

ls #shows a list of all files downloaded into this folder
cd ../../blastdb
head unip* #previews the file and ensures that everything downloaded correctly

An adequate head will look like this:

==> ../../blastdb/uniprot_sprot_r2025_01.fasta <==
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-001R PE=4 SV=1
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL
EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD
SFRKIYTDLGWKFTPL
>sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-002L PE=4 SV=1
MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR
IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL
AERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAADC

Then we create a local BLAST database from a FASTA file of protein sequences.

/home/jovyan/applications/ncbi-blast-2.16.0+/bin/makeblastdb \  # calling the makeblastdb program
-in ../../blastdb/uniprot_sprot_r2025_01.fasta \  # specifies the input file — in this case, a FASTA file of protein sequences from UniProt
-dbtype prot \  # tells BLAST to treat the input as protein sequences
-out ../../blastdb/uniprot_sprot_r2025_01 #defines the base name for the output database files
Building a new DB, current time: 04/10/2025 10:45:01
New DB name:   /home/jovyan/blastdb/uniprot_sprot_r2025_01
New DB title:  ../../blastdb/uniprot_sprot_r2025_01.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /home/jovyan/blastdb/uniprot_sprot_r2025_01
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 572970 sequences in 16.0218 seconds.

2 Obtain the query sequence

curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \  #grabs an unknown sequence
-k \
> ../../blastdb/Ab_4denovo_CLC6_a.fa 

head ../../blastdb/Ab_4denovo_CLC6_a.fa  #quick preview to confirm the download and inspect sequence structure
echo "How many sequences are there?"
grep -c ">" ../../blastdb/Ab_4denovo_CLC6_a.fa. #counts the number of sequences in the file

From running this code we are able to confirm 5490 sequences.

3 Run BLAST

Now we will take a query, align it to a local BLAST protein database, and find matches. The matches will be filtered and we will get a report of top hits.

../../applications/ncbi-blast-2.16.0+/bin/blastx \  #translates a nucleotide query and aligns it to a protein database
-query ../../blastdb/Ab_4denovo_CLC6_a.fa \  #points to the FASTA file of the unknown sequence
-db ../../blastdb/uniprot_sprot_r2025_01 \  #points to the local BLAST protein database from UniProt
-out ../../output/Ab_4-uniprot_blastx.tab \
-evalue 1E-20 \  #filters out weak matches, only highly significant alignments will be reported
-num_threads 20 \ 
-max_target_seqs 1 \  #reports only the top hit per query sequence
-outfmt 6

head -2 ../../output/Ab_4-uniprot_blastx.tab  #previews the first 2 hits
wc -l ../../output/Ab_4-uniprot_blastx.tab  #counts the number of hits in the output file
Warning: [Query 5476-5490] Examining 5 or more matches is recommended
Warning: [Query 5476-5490] Number of threads was reduced to 16 to match the number of available CPUs
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3    sp|O42248|GBLP_DANRE    82.456  171 30  0   1   513 35  205 2.83e-103   301
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5    sp|Q08013|SSRG_RAT  75.385  65  16  0   3   197 121 185 1.41e-28    104
765 ../../output/Ab_4-uniprot_blastx.tab

We can interpret that there were 765 hits (no. rows in Ab_4-uniprot_blastx.tab).

4 Annotations and more information

We will now obtain an annotation table to use with the BLAST output table. The following code downloads and reads a tab-delimited annotation table from UniProt.

spgo <- read.csv("https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE)

4.1 Joining BLAST with annotation table

Now we will check that the BLAST output file looks ok, and since each row represents a BLAST hit, this tells you how many significant alignments were reported.

head -2 ../../output/Ab_4-uniprot_blastx.tab
wc -l ../../output/Ab_4-uniprot_blastx.tab
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE    82.456  171 30  0   1   513 35  205 2.83e-103   301
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT  75.385  65  16  0   3   197 121 185 1.41e-28    104
## 81 ../../output/Ab_4-uniprot_blastx.tab
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3    sp|O42248|GBLP_DANRE    82.456  171 30  0   1   513 35  205 2.83e-103   301  
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5    sp|Q08013|SSRG_RAT  75.385  65  16  0   3   197 121 185 1.41e-28    104
81 ../../output/Ab_4-uniprot_blastx.tab

From the last line of the output, we can determine that there were 81 hits in BLAST.

Now we will transform this file into a format that is easier to join with the UniProt annotation table.

tr '|' '\t' < ../../output/Ab_4-uniprot_blastx.tab | head -2
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp  O42248  GBLP_DANRE  82.456  171 30  0   1   513 35  205 2.83e-103   301
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp  Q08013  SSRG_RAT    75.385  65  16  0   3   197 121 185 1.41e-28    104
tr '|' '\t' < ../../output/Ab_4-uniprot_blastx.tab \
> ../../output/Ab_4-uniprot_blastx_sep.tab
ls ../../output
## Ab_4-uniprot_blastx.tab
## Ab_4-uniprot_blastx_sep.tab
## blast_annot_go.tab

This code created a new file “Ab_4-uniprot_blastx_sep.tab.”

Ab_4-uniprot_blastx.tab
Ab_4-uniprot_blastx_sep.tab
head -2 ../../output/Ab_4-uniprot_blastx.tab
wc -l ../../output/Ab_4-uniprot_blastx.tab
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE    82.456  171 30  0   1   513 35  205 2.83e-103   301
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT  75.385  65  16  0   3   197 121 185 1.41e-28    104
## 81 ../../output/Ab_4-uniprot_blastx.tab
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3    sp|O42248|GBLP_DANRE    82.456  171 30  0   1   513 35  205 2.83e-103   301
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5    sp|Q08013|SSRG_RAT  75.385  65  16  0   3   197 121 185 1.41e-28    104
81 ../../output/Ab_4-uniprot_blastx.tab

5 Creating a table

First we will load the necessary packages to create a table from the data generated above. Then we will

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("kableExtra")
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

Then we will read in the data.

bltabl <- read.csv("../../output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)
kbl(
head(
  left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab"))
)
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
V1 V3 V13 Protein.names Organism Gene.Ontology..biological.process. Gene.Ontology.IDs
Ab_contig_3 O42248 0 Guanine nucleotide-binding protein subunit beta-2-like 1 (Receptor of activated protein kinase C) (RACK) Danio rerio (Zebrafish) (Brachydanio rerio) angiogenesis [GO:0001525]; convergent extension involved in gastrulation [GO:0060027]; negative regulation of Wnt signaling pathway [GO:0030178]; positive regulation of gastrulation [GO:2000543]; positive regulation of protein phosphorylation [GO:0001934]; regulation of cell division [GO:0051302]; regulation of establishment of cell polarity [GO:2000114]; regulation of protein localization [GO:0032880]; rescue of stalled ribosome [GO:0072344] GO:0001525; GO:0001934; GO:0005080; GO:0005634; GO:0005737; GO:0005829; GO:0005840; GO:0030178; GO:0032880; GO:0043022; GO:0045182; GO:0051302; GO:0060027; GO:0072344; GO:1990904; GO:2000114; GO:2000543
Ab_contig_5 Q08013 0 Translocon-associated protein subunit gamma (TRAP-gamma) (Signal sequence receptor subunit gamma) (SSR-gamma) Rattus norvegicus (Rat) SRP-dependent cotranslational protein targeting to membrane [GO:0006614] GO:0005783; GO:0005784; GO:0006614
Ab_contig_6 P12234 0 Phosphate carrier protein, mitochondrial (Phosphate transport protein) (PTP) (Solute carrier family 25 member 3) Bos taurus (Bovine) mitochondrial phosphate ion transmembrane transport [GO:1990547]; phosphate ion transmembrane transport [GO:0035435] GO:0005315; GO:0005739; GO:0005743; GO:0015293; GO:0035435; GO:0044877; GO:1990547
Ab_contig_9 Q41629 0 ADP,ATP carrier protein 1, mitochondrial (ADP/ATP translocase 1) (Adenine nucleotide translocator 1) (ANT 1) Triticum aestivum (Wheat) mitochondrial ADP transmembrane transport [GO:0140021]; mitochondrial ATP transmembrane transport [GO:1990544] GO:0005471; GO:0005743; GO:0140021; GO:1990544
Ab_contig_13 Q32NG4 0 Glutamine amidotransferase-like class 1 domain-containing protein 1 (Parkinson disease 7 domain-containing protein 1) Xenopus laevis (African clawed frog) GO:0005576
Ab_contig_23 Q9GNE2 0 60S ribosomal protein L23 (AeRpL17A) (L17A) Aedes aegypti (Yellowfever mosquito) (Culex aegypti) translation [GO:0006412] GO:0003735; GO:0005840; GO:0006412; GO:1990904
left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>%
  write_delim("../../output/blast_annot_go.tab", delim = '\t')

This table displays key information that will aid in further downstream interpretation of the BLAST run: organism,

6 Species hits

left_join(bltabl, spgo, by = c("V3" = "Entry")) %>%
  count(Organism, sort = TRUE) %>%
  top_n(10) %>%
  ggplot(aes(x = reorder(Organism, n), y = n, fill = Organism)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Top 10 Organisms in BLAST Hits", x = "Organism", y = "Number of Hits") +
  theme_minimal()
## Selecting by n

This chart pulls information from the organism column in the previous table to display the top 10 organisms in the BLAST hits.