| Name | Last modified | Size | Description | |
|---|---|---|---|---|
| Parent Directory | - | |||
| 01-seastar.Rmd | 2026-02-03 11:35 | 817 | ||
| augustus.hints.codin..> | 2026-02-03 11:37 | 37M | ||
| blast2slim.sh | 2025-11-04 11:57 | 19K | ||
| blastdb/ | 2026-02-03 11:39 | - | ||
| derm-protein.fa | 2026-03-05 10:13 | 14M | ||
| example_transcripts...> | 2025-11-04 11:57 | 423 | ||
| example_workflow.qmd | 2025-11-04 11:57 | 8.5K | ||
| output/ | 2026-03-05 10:13 | - | ||
| workflow-annotation...> | 2026-06-10 11:16 | 205 | ||
A comprehensive bash script that annotates nucleotide or protein FASTA sequences with Swiss-Prot best hits, UniProt protein names, full Gene Ontology (GO) annotations, and GO-Slim terms.
The blast2slim.sh script is a self-contained pipeline
that:
1. Downloads & indexes Swiss-Prot database
(one-time setup)
2. Runs BLAST searches (BLASTX for nucleotides, BLASTP
for proteins)
3. Retrieves UniProt annotations via REST API
4. Maps GO terms to GO-Slim categories using
GOATOOLS
5. Outputs structured TSV files ready for downstream
analysis
bash blast2slim.sh -i <input.fasta> [OPTIONS]
| Argument | Description | Example |
|---|---|---|
-i, --input FILE |
Input FASTA file (supports .fasta, .fa, .fas, .cds extensions) | -i transcripts.cds |
| Argument | Default | Description | Example |
|---|---|---|---|
-o, --outdir DIR |
output |
Output subdirectory name (always under output/) |
-o my_project |
--dbdir DIR |
blastdb |
Directory for Swiss-Prot database files | --dbdir /shared/db |
--threads N |
40 |
Number of CPU threads for BLAST/DIAMOND | --threads 16 |
--protein |
(none) | Use BLASTP mode for protein input (default: BLASTX) | --protein |
--diamond |
(none) | Use DIAMOND instead of BLAST+ (faster for large datasets) | --diamond |
# FASTA format
bash blast2slim.sh -i transcripts.fasta
# CDS format (coding sequences)
bash blast2slim.sh -i genes.cds -o cds_analysis --threads 16
# Other supported formats
bash blast2slim.sh -i sequences.fa
bash blast2slim.sh -i transcripts.fas
# Protein sequences require --protein flag
bash blast2slim.sh -i proteins.faa --protein
# With custom output directory
bash blast2slim.sh -i proteins.faa --protein -o protein_annotation --threads 12
# DIAMOND BLASTX (much faster for large nucleotide datasets)
bash blast2slim.sh -i large_transcriptome.fasta --diamond --threads 40
# DIAMOND BLASTP (faster protein searches)
bash blast2slim.sh -i proteins.faa --protein --diamond -o fast_protein_search
# Combined with other options
bash blast2slim.sh -i genes.cds --diamond -o diamond_analysis --threads 64
# Download and annotate FASTA directly from URL
bash blast2slim.sh -i "https://example.org/sequences.fasta" --protein -o url_analysis
# Real example with public data
bash blast2slim.sh -i "https://raw.githubusercontent.com/urol-e5/timeseries_molecular/refs/heads/main/D-Apul/data/Machinery.fasta" --protein -o machinery_proteins
# Use shared database directory to avoid re-downloading
bash blast2slim.sh -i sequences.fasta --dbdir /shared/blastdb -o shared_db_run
# Useful for multiple projects using same reference
bash blast2slim.sh -i project1.fasta --dbdir /data/uniprot_db -o project1
bash blast2slim.sh -i project2.fasta --dbdir /data/uniprot_db -o project2
# Maximum threads for large datasets
bash blast2slim.sh -i large_transcriptome.fasta --threads $(nproc) -o large_analysis
# Monitor progress with logging
bash blast2slim.sh -i sequences.fasta -o logged_run --threads 16 2>&1 | tee analysis.log
# Process multiple FASTA files
for fasta in data/*.fasta; do
basename=$(basename "$fasta" .fasta)
bash blast2slim.sh -i "$fasta" -o "batch_${basename}" --threads 4
done
-i, --input FILE (REQUIRED) - Accepts
local FASTA files (.fasta, .fa, .fas, .cds) or HTTP/HTTPS URLs - For
URLs, file is automatically downloaded to output directory - Supports
both nucleotide and protein sequences - CDS files (.cds) are treated as
nucleotide sequences by default - Sequence headers: text before first
space becomes query ID
--protein (FLAG) - Changes from BLASTX
(nucleotide→protein) to BLASTP (protein→protein) - Required when input
contains protein sequences - Omit this flag for nucleotide sequences
(transcripts, genes, etc.)
--diamond (FLAG) - Use DIAMOND instead
of BLAST+ for much faster searches - Recommended for large datasets
(>10,000 sequences) - ~10-20x faster than BLAST+ with similar
sensitivity - Creates .dmnd database files instead of
BLAST+ .pin/.psq files - Compatible with both
--protein and nucleotide modes
-o, --outdir DIR (Default:
output) - Creates timestamped run directory:
output/{DIR}/run_YYYYMMDD_HHMMSS/ - Examples: -
-o myproject →
output/myproject/run_20250826_143022/ - Default →
output/run_20250826_143022/ - All files organized under
this directory structure
--dbdir DIR (Default:
blastdb) - Location for Swiss-Prot database files (~90MB
compressed, ~260MB uncompressed) - Reusable across multiple runs - Files
created: uniprot_sprot.fasta.gz,
uniprot_sprot.fasta, uniprot_sprot.p*
--threads N (Default: 40)
- Number of CPU threads for BLAST search - Higher values = faster
processing (up to available cores) - Recommended: match to available CPU
cores ($(nproc)) - Memory usage scales with thread
count
The script uses optimized BLAST parameters:
| Parameter | Value | Purpose |
|---|---|---|
| E-value threshold | 1e-20 |
Stringent significance cutoff |
| Max target sequences | 1 |
Only best hit per query |
| Output format | Tabular | qseqid sacc pident length evalue bitscore stitle |
| Database | Swiss-Prot | Reviewed UniProt entries only |
output/
└── {project_name}/
└── run_YYYYMMDD_HHMMSS/
├── {basename}.blast.tsv # Raw BLAST results
├── annotation_full_go.tsv # BLAST + UniProt + full GO
├── annotation_with_goslim.tsv # Main output with GO-Slim
├── summary.md # Concise summary report with stats & visualization
├── goslim_chart.png # GO-Slim category bar chart
├── postprocess_uniprot_go.py # Generated processing script
├── go-basic.obo # GO ontology
└── goslim_generic.obo # GO-Slim ontology
summary.md - Concise summary report -
Job metadata (duration, CPUs used, input file) - Key statistics (BLAST
hits, GO matches, GO-Slim mappings) - Top GO-Slim categories table and
visualization - Performance metrics
annotation_with_goslim.tsv - Primary
results file - BLAST hits with UniProt annotations and GO-Slim terms -
Columns: query, accession, protein_name, organism, BLAST metrics, GO
terms, GO-Slim categories
annotation_full_go.tsv - Intermediate
results - Same as above but with full GO terms instead of GO-Slim
{basename}.blast.tsv - Raw BLAST output
- Tabular BLAST results before UniProt annotation - Useful for
troubleshooting or custom post-processing
makeblastdb,
blastx, blastp (NCBI BLAST+ suite) ORdiamond (faster alternative
for large datasets)requests (UniProt REST API)pandas (data processing)goatools (GO term mapping)matplotlib (summary visualization)bash, curl,
gzipVia Conda (Recommended)
conda create -n annotation python=3.11 -y
conda activate annotation
# Install BLAST+ (default option)
conda install -c bioconda blast -y
# OR install DIAMOND (faster option)
conda install -c bioconda diamond -y
# Install Python packages
pip install requests pandas goatools matplotlib
Via Package Managers
# Ubuntu/Debian
sudo apt install ncbi-blast+ python3-pip
pip3 install requests pandas goatools
# macOS with Homebrew
brew install blast python3
pip3 install requests pandas goatools
| Metric | BLAST+ | DIAMOND | Notes |
|---|---|---|---|
| Speed | Baseline | 10-20x faster | DIAMOND optimized for large datasets |
| Sensitivity | High | Very similar | Minimal difference in annotation quality |
| Memory | 1-4GB | 2-6GB | DIAMOND uses slightly more RAM |
| Database size | ~350MB | ~120MB | DIAMOND databases are smaller |
| Best for | <10K sequences | >10K sequences | Recommendation by dataset size |
BLAST+ Mode: - Database setup: 5-15 minutes (first run only) - Search speed: ~0.1-1 second per query sequence - UniProt annotation: ~0.1 seconds per unique hit - GO mapping: ~1-10 seconds total
DIAMOND Mode: - Database setup: 2-5 minutes (first run only) - Search speed: ~0.005-0.05 seconds per query sequence - UniProt annotation: ~0.1 seconds per unique hit - GO mapping: ~1-10 seconds total
“command not found: blastx” or “command not found: diamond”
# Install BLAST+
conda install -c bioconda blast
# OR install DIAMOND (faster)
conda install -c bioconda diamond
# OR via package manager
sudo apt install ncbi-blast+ diamond-aligner
“ModuleNotFoundError: No module named ‘requests’”
pip install requests pandas goatools matplotlib
“Permission denied”
chmod +x blast2slim.sh
“No space left on device” - Check disk space:
df -h - Use custom --dbdir on larger partition
- Clean up old output directories
Long UniProt API delays - Network-dependent; just wait - Large numbers of unique hits take longer - Check firewall/proxy settings if timing out
# Run with logging
bash blast2slim.sh -i large_file.fasta -o big_run --threads 16 2>&1 | tee run.log
# Monitor progress in another terminal
tail -f run.log
# Check BLAST progress
ps aux | grep blast
# Check output file completeness
wc -l output/*/run_*/annotation_with_goslim.tsv
head -1 output/*/run_*/annotation_with_goslim.tsv
# Verify database files
ls -lh blastdb/
library(readr)
results <- read_tsv("output/myproject/run_*/annotation_with_goslim.tsv")
import pandas as pd
results = pd.read_csv("output/myproject/run_*/annotation_with_goslim.tsv", sep='\t')
# Part of larger workflow
bash preprocess_sequences.sh input.fasta > cleaned.fasta
bash blast2slim.sh -i cleaned.fasta -o analysis_$(date +%Y%m%d)
Rscript analyze_results.R output/analysis_*/run_*/annotation_with_goslim.tsv
When using this workflow in publications:
“Sequences were annotated using Swiss-Prot BLAST best hits (E-value ≤1e-20) with UniProt GO term retrieval and GO-Slim mapping via GOATOOLS.”
This script is provided as-is for research and educational use.