This report documents the installation, verification, and use of a local NCBI BLAST database to query an unknown sequence. For this we will take the following steps:
Make the BLAST database on a high performance university server (Raven),
Obtain a query sequence,
Run BLAST,
Obtain annotations,
Create a table with the results.
To begin, we will download BLAST files to an applications folder, a sibling folder to our GitHub repository to avoid the file limit constraints.
cd /home/jovyan/applications #is this a good relative file path? Reproducible?
curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz
tar -xf ncbi-blast-2.16.0+-x64-linux.tar.gz
The following code previews the file (header) to help us verify that the above curl commands downloaded the software correctly.
/home/jovyan/applications/ncbi-blast-2.16.0+/bin/blastx -h #verify installation
## USAGE
## blastx [-h] [-help] [-import_search_strategy filename]
## [-export_search_strategy filename] [-task task_name] [-db database_name]
## [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
## [-negative_gilist filename] [-negative_seqidlist filename]
## [-taxids taxids] [-negative_taxids taxids] [-taxidlist filename]
## [-negative_taxidlist filename] [-no_taxid_expansion] [-ipglist filename]
## [-negative_ipglist filename] [-entrez_query entrez_query]
## [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
## [-subject subject_input_file] [-subject_loc range] [-query input_file]
## [-out output_file] [-evalue evalue] [-word_size int_value]
## [-gapopen open_penalty] [-gapextend extend_penalty]
## [-qcov_hsp_perc float_value] [-max_hsps int_value]
## [-xdrop_ungap float_value] [-xdrop_gap float_value]
## [-xdrop_gap_final float_value] [-searchsp int_value]
## [-sum_stats bool_value] [-max_intron_length length] [-seg SEG_options]
## [-soft_masking soft_masking] [-matrix matrix_name]
## [-threshold float_value] [-culling_limit int_value]
## [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
## [-subject_besthit] [-window_size int_value] [-ungapped] [-lcase_masking]
## [-query_loc range] [-strand strand] [-parse_deflines]
## [-query_gencode int_value] [-outfmt format] [-show_gis]
## [-num_descriptions int_value] [-num_alignments int_value]
## [-line_length line_length] [-html] [-sorthits sort_hits]
## [-sorthsps sort_hsps] [-max_target_seqs num_sequences]
## [-num_threads int_value] [-mt_mode int_value] [-remote]
## [-comp_based_stats compo] [-use_sw_tback] [-version]
##
## DESCRIPTION
## Translated Query-Protein Subject BLAST 2.16.0+
##
## Use '-help' to print detailed descriptions of command line arguments
UniProt is a universal protein knowledgebase, and UniProtKB/Swiss-Prot is its manually annotated and section. These will provide a central resource for protein sequences and similar functional information.
cd ../../blastdb #directs the file downloads into a blastdb folder
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2025_01.fasta.gz
gunzip -k uniprot_sprot_r2025_01.fasta.gz #unzips the files
ls #shows a list of all files downloaded into this folder
cd ../../blastdb
head unip* #previews the file and ensures that everything downloaded correctly
An adequate head will look like this:
==> ../../blastdb/uniprot_sprot_r2025_01.fasta <==
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-001R PE=4 SV=1
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL
EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD
SFRKIYTDLGWKFTPL
>sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-002L PE=4 SV=1
MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR
IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL
AERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAADC
Then we create a local BLAST database from a FASTA file of protein sequences.
/home/jovyan/applications/ncbi-blast-2.16.0+/bin/makeblastdb \ # calling the makeblastdb program
-in ../../blastdb/uniprot_sprot_r2025_01.fasta \ # specifies the input file — in this case, a FASTA file of protein sequences from UniProt
-dbtype prot \ # tells BLAST to treat the input as protein sequences
-out ../../blastdb/uniprot_sprot_r2025_01 #defines the base name for the output database files
Building a new DB, current time: 04/10/2025 10:45:01
New DB name: /home/jovyan/blastdb/uniprot_sprot_r2025_01
New DB title: ../../blastdb/uniprot_sprot_r2025_01.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /home/jovyan/blastdb/uniprot_sprot_r2025_01
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 572970 sequences in 16.0218 seconds.
curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \ #grabs an unknown sequence
-k \
> ../../blastdb/Ab_4denovo_CLC6_a.fa
head ../../blastdb/Ab_4denovo_CLC6_a.fa #quick preview to confirm the download and inspect sequence structure
echo "How many sequences are there?"
grep -c ">" ../../blastdb/Ab_4denovo_CLC6_a.fa. #counts the number of sequences in the file
From running this code we are able to confirm 5490 sequences.
Now we will take a query, align it to a local BLAST protein database, and find matches. The matches will be filtered and we will get a report of top hits.
../../applications/ncbi-blast-2.16.0+/bin/blastx \ #translates a nucleotide query and aligns it to a protein database
-query ../../blastdb/Ab_4denovo_CLC6_a.fa \ #points to the FASTA file of the unknown sequence
-db ../../blastdb/uniprot_sprot_r2025_01 \ #points to the local BLAST protein database from UniProt
-out ../../output/Ab_4-uniprot_blastx.tab \
-evalue 1E-20 \ #filters out weak matches, only highly significant alignments will be reported
-num_threads 20 \
-max_target_seqs 1 \ #reports only the top hit per query sequence
-outfmt 6
head -2 ../../output/Ab_4-uniprot_blastx.tab #previews the first 2 hits
wc -l ../../output/Ab_4-uniprot_blastx.tab #counts the number of hits in the output file
Warning: [Query 5476-5490] Examining 5 or more matches is recommended
Warning: [Query 5476-5490] Number of threads was reduced to 16 to match the number of available CPUs
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.83e-103 301
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT 75.385 65 16 0 3 197 121 185 1.41e-28 104
765 ../../output/Ab_4-uniprot_blastx.tab
We can interpret that there were 765 hits (no. rows in Ab_4-uniprot_blastx.tab).
We will now obtain an annotation table to use with the BLAST output table. The following code downloads and reads a tab-delimited annotation table from UniProt.
spgo <- read.csv("https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE)
Now we will check that the BLAST output file looks ok, and since each row represents a BLAST hit, this tells you how many significant alignments were reported.
head -2 ../../output/Ab_4-uniprot_blastx.tab
wc -l ../../output/Ab_4-uniprot_blastx.tab
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.83e-103 301
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT 75.385 65 16 0 3 197 121 185 1.41e-28 104
## 81 ../../output/Ab_4-uniprot_blastx.tab
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.83e-103 301
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT 75.385 65 16 0 3 197 121 185 1.41e-28 104
81 ../../output/Ab_4-uniprot_blastx.tab
From the last line of the output, we can determine that there were 81 hits in BLAST.
Now we will transform this file into a format that is easier to join with the UniProt annotation table.
tr '|' '\t' < ../../output/Ab_4-uniprot_blastx.tab | head -2
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp O42248 GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.83e-103 301
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp Q08013 SSRG_RAT 75.385 65 16 0 3 197 121 185 1.41e-28 104
tr '|' '\t' < ../../output/Ab_4-uniprot_blastx.tab \
> ../../output/Ab_4-uniprot_blastx_sep.tab
ls ../../output
## Ab_4-uniprot_blastx.tab
## Ab_4-uniprot_blastx_sep.tab
## blast_annot_go.tab
This code created a new file “Ab_4-uniprot_blastx_sep.tab.”
Ab_4-uniprot_blastx.tab
Ab_4-uniprot_blastx_sep.tab
head -2 ../../output/Ab_4-uniprot_blastx.tab
wc -l ../../output/Ab_4-uniprot_blastx.tab
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.83e-103 301
## solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT 75.385 65 16 0 3 197 121 185 1.41e-28 104
## 81 ../../output/Ab_4-uniprot_blastx.tab
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.83e-103 301
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT 75.385 65 16 0 3 197 121 185 1.41e-28 104
81 ../../output/Ab_4-uniprot_blastx.tab
First we will load the necessary packages to create a table from the data generated above. Then we will
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("kableExtra")
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
Then we will read in the data.
bltabl <- read.csv("../../output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)
kbl(
head(
left_join(bltabl, spgo, by = c("V3" = "Entry")) %>%
select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1,
pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab"))
)
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
V1 | V3 | V13 | Protein.names | Organism | Gene.Ontology..biological.process. | Gene.Ontology.IDs |
---|---|---|---|---|---|---|
Ab_contig_3 | O42248 | 0 | Guanine nucleotide-binding protein subunit beta-2-like 1 (Receptor of activated protein kinase C) (RACK) | Danio rerio (Zebrafish) (Brachydanio rerio) | angiogenesis [GO:0001525]; convergent extension involved in gastrulation [GO:0060027]; negative regulation of Wnt signaling pathway [GO:0030178]; positive regulation of gastrulation [GO:2000543]; positive regulation of protein phosphorylation [GO:0001934]; regulation of cell division [GO:0051302]; regulation of establishment of cell polarity [GO:2000114]; regulation of protein localization [GO:0032880]; rescue of stalled ribosome [GO:0072344] | GO:0001525; GO:0001934; GO:0005080; GO:0005634; GO:0005737; GO:0005829; GO:0005840; GO:0030178; GO:0032880; GO:0043022; GO:0045182; GO:0051302; GO:0060027; GO:0072344; GO:1990904; GO:2000114; GO:2000543 |
Ab_contig_5 | Q08013 | 0 | Translocon-associated protein subunit gamma (TRAP-gamma) (Signal sequence receptor subunit gamma) (SSR-gamma) | Rattus norvegicus (Rat) | SRP-dependent cotranslational protein targeting to membrane [GO:0006614] | GO:0005783; GO:0005784; GO:0006614 |
Ab_contig_6 | P12234 | 0 | Phosphate carrier protein, mitochondrial (Phosphate transport protein) (PTP) (Solute carrier family 25 member 3) | Bos taurus (Bovine) | mitochondrial phosphate ion transmembrane transport [GO:1990547]; phosphate ion transmembrane transport [GO:0035435] | GO:0005315; GO:0005739; GO:0005743; GO:0015293; GO:0035435; GO:0044877; GO:1990547 |
Ab_contig_9 | Q41629 | 0 | ADP,ATP carrier protein 1, mitochondrial (ADP/ATP translocase 1) (Adenine nucleotide translocator 1) (ANT 1) | Triticum aestivum (Wheat) | mitochondrial ADP transmembrane transport [GO:0140021]; mitochondrial ATP transmembrane transport [GO:1990544] | GO:0005471; GO:0005743; GO:0140021; GO:1990544 |
Ab_contig_13 | Q32NG4 | 0 | Glutamine amidotransferase-like class 1 domain-containing protein 1 (Parkinson disease 7 domain-containing protein 1) | Xenopus laevis (African clawed frog) | GO:0005576 | |
Ab_contig_23 | Q9GNE2 | 0 | 60S ribosomal protein L23 (AeRpL17A) (L17A) | Aedes aegypti (Yellowfever mosquito) (Culex aegypti) | translation [GO:0006412] | GO:0003735; GO:0005840; GO:0006412; GO:1990904 |
left_join(bltabl, spgo, by = c("V3" = "Entry")) %>%
select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1,
pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>%
write_delim("../../output/blast_annot_go.tab", delim = '\t')
This table displays key information that will aid in further downstream interpretation of the BLAST run: organism,
left_join(bltabl, spgo, by = c("V3" = "Entry")) %>%
count(Organism, sort = TRUE) %>%
top_n(10) %>%
ggplot(aes(x = reorder(Organism, n), y = n, fill = Organism)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title = "Top 10 Organisms in BLAST Hits", x = "Organism", y = "Number of Hits") +
theme_minimal()
## Selecting by n
This chart pulls information from the organism column in the previous table to display the top 10 organisms in the BLAST hits.