--- title: "01-blast" format: html editor: visual --- I'm going to download BLAST and use it to compare it to unknown sequences. ```{bash} cd /home/jovyan/applications curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz tar -xf ncbi-blast-2.16.0+-x64-linux.tar.gz ``` ```{bash} /home/jovyan/applications/ncbi-blast-2.16.0+/bin/blastx -h #verify installation ``` # Make BLAST database We're using Swiss-Prot in UniProtKB. ```{bash} pwd ``` Our working directory is /home/jovyan/renee-myco/assignments, where this Quarto file is located. So this is the wd that this document defaults to. (Realizing this sooner would have saved me several headaches.) ```{bash} cd ../../blastdb curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz mv uniprot_sprot.fasta.gz uniprot_sprot_r2025_01.fasta.gz gunzip -k uniprot_sprot_r2025_01.fasta.gz ls ``` ```{bash} head ../../blastdb/unip* #check the file ``` Looking good so far: ``` ==> ../../blastdb/uniprot_sprot_r2025_01.fasta <== >sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-001R PE=4 SV=1 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD SFRKIYTDLGWKFTPL >sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-002L PE=4 SV=1 MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL AERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAADC ``` ```{bash} /home/jovyan/applications/ncbi-blast-2.16.0+/bin/makeblastdb \ -in ../../blastdb/uniprot_sprot_r2025_01.fasta \ -dbtype prot \ -out ../../blastdb/uniprot_sprot_r2025_01 ``` ``` Building a new DB, current time: 04/10/2025 10:45:01 New DB name: /home/jovyan/blastdb/uniprot_sprot_r2025_01 New DB title: ../../blastdb/uniprot_sprot_r2025_01.fasta Sequence type: Protein Deleted existing Protein BLAST database named /home/jovyan/blastdb/uniprot_sprot_r2025_01 Keep MBits: T Maximum file size: 3000000000B Adding sequences from FASTA; added 572970 sequences in 16.0218 seconds. ``` # Get the query sequence ```{bash} curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \ -k \ > ../../blastdb/Ab_4denovo_CLC6_a.fa head ../../blastdb/Ab_4denovo_CLC6_a.fa echo "How many sequences are there?" grep -c ">" ../../blastdb/Ab_4denovo_CLC6_a.fa ``` There were some hiccups here, mainly with hidden characters (extra spaces after slashes) with the curl command. However, I was able to confirm 5490 sequences. # Run BLAST ```{bash} ../../applications/ncbi-blast-2.16.0+/bin/blastx \ -query ../../blastdb/Ab_4denovo_CLC6_a.fa \ -db ../../blastdb/uniprot_sprot_r2025_01 \ -out ../../output/Ab_4-uniprot_blastx.tab \ -evalue 1E-20 \ -num_threads 20 \ -max_target_seqs 1 \ -outfmt 6 head -2 ../../output/Ab_4-uniprot_blastx.tab wc -l ../../output/Ab_4-uniprot_blastx.tab ``` Running BLAST took about 45 minutes. Here was the output: ``` Warning: [Query 5476-5490] Examining 5 or more matches is recommended Warning: [Query 5476-5490] Number of threads was reduced to 16 to match the number of available CPUs solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.83e-103 301 solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT 75.385 65 16 0 3 197 121 185 1.41e-28 104 765 ../../output/Ab_4-uniprot_blastx.tab ``` So far, I can interpret that there were 765 hits (no. rows in Ab_4-uniprot_blastx.tab). # Getting more information Obtaining an annotation table to use with the BLAST output table. ```{r} spgo <- read.csv("https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE) ``` ## Joining blast table with annotation table ```{bash} head -2 ../../output/Ab_4-uniprot_blastx.tab wc -l ../../output/Ab_4-uniprot_blastx.tab ``` ``` solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.83e-103 301 solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT 75.385 65 16 0 3 197 121 185 1.41e-28 104 81 ../../output/Ab_4-uniprot_blastx.tab ``` ```{bash} tr '|' '\t' < ../../output/Ab_4-uniprot_blastx.tab | head -2 ``` ``` solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp O42248 GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.83e-103 301 solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp Q08013 SSRG_RAT 75.385 65 16 0 3 197 121 185 1.41e-28 104 ``` ```{bash} tr '|' '\t' < ../../output/Ab_4-uniprot_blastx.tab \ > ../../output/Ab_4-uniprot_blastx_sep.tab ls ../../output ``` ``` Ab_4-uniprot_blastx.tab Ab_4-uniprot_blastx_sep.tab ``` ```{bash} head -2 ../../output/Ab_4-uniprot_blastx.tab wc -l ../../output/Ab_4-uniprot_blastx.tab ``` ``` solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3 sp|O42248|GBLP_DANRE 82.456 171 30 0 1 513 35 205 2.83e-103 301 solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5 sp|Q08013|SSRG_RAT 75.385 65 16 0 3 197 121 185 1.41e-28 104 81 ../../output/Ab_4-uniprot_blastx.tab ``` # Creating a table The following is all draft code needs editing! ```{r} library(tidyverse) library("kableExtra") ``` ```{r} bltabl <- read.csv("../../output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE) ``` ```{r echo=TRUE} kbl( head( left_join(bltabl, spgo, by = c("V3" = "Entry")) %>% select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) ) ) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) left_join(bltabl, spgo, by = c("V3" = "Entry")) %>% select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>% write_delim("../../output/blast_annot_go.tab", delim = '\t') ```