--- title: "Blast" output: html_document date: "2023-11-28" --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` Before starting, make a directory/repo to do all this in that contains the following folders: - Code (this is where this/the code file should be) - Data - Output There are some packages that are used in this process including Dplyr, Stringr, ggplot2, and DT. Packages are installed/loaded here (disregard chunk if already loaded) ```{r} # if you need to install, uncomment the following: # install.packages("BiocManager") # BiocManager::install("Biostrings") # install.packages("ggplot2") # install.packages('DT') # install.packages('dplyr') # install.packages('stringr') # load installed packages library(BiocManager) library(Biostrings) library(ggplot2) library(DT) library(dplyr) library(stringr) ``` ## 1. Database Creation [**This part is not unique to your file**]{.underline}, it is creating the database you will use when running blast on your file (ie. you don't need to change this section with each different file of interest). ### Obtain Fasta (UniProt/Swiss-Prot) ```{bash} cd ../data curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz mv uniprot_sprot.fasta.gz uniprot_sprot_r2024_02.fasta.gz gunzip -k uniprot_sprot_r2024_02.fasta.gz ``` ### Making the Database ```{bash} mkdir ../blastdb /home/shared/ncbi-blast-2.15.0+/bin/makeblastdb \ -in ../data/uniprot_sprot_r2024_02.fasta \ -dbtype prot \ -out ../blastdb/uniprot_sprot_r2024_02 ``` ## 2. Getting the query fasta file This is where you start [**changing things for your specific file**]{.underline} of interest. We already have our query rna fasta file. Taking a peek at that file via `head()` and getting a count of sequences ```{bash} head -3 ../data/ncbi_dataset/data/GCF_026571515.1/rna.fna ``` ```{bash} echo "How many sequences are there?" grep -c ">" ../data/ncbi_dataset/data/GCF_026571515.1/rna.fna ``` ## 3. Running Blastx ```{bash} /home/shared/ncbi-blast-2.15.0+/bin/blastx \ -query ../data/ncbi_dataset/data/GCF_026571515.1/rna.fna \ -db ../blastdb/uniprot_sprot_r2024_02 \ -out ../output/GCF_026571515.1.tab \ -evalue 1E-20 \ -num_threads 20 \ -max_target_seqs 1 \ -outfmt 6 ``` Peeking at the output file ```{bash} head -2 ../output/GCF_026571515.1.tab ``` ```{bash} echo "Number of lines in output" wc -l ../output/GCF_026571515.1.tab ``` ##