---
title: "Blast"
output: html_document
date: "2023-11-28"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Before starting, make a directory/repo to do all this in that contains the following folders:
- Code (this is where this/the code file should be)
- Data
- Output
There are some packages that are used in this process including Dplyr, Stringr, ggplot2, and DT. Packages are installed/loaded here (disregard chunk if already loaded)
```{r}
# if you need to install, uncomment the following:
# install.packages("BiocManager")
# BiocManager::install("Biostrings")
# install.packages("ggplot2")
# install.packages('DT')
# install.packages('dplyr')
# install.packages('stringr')
# load installed packages
library(BiocManager)
library(Biostrings)
library(ggplot2)
library(DT)
library(dplyr)
library(stringr)
```
## 1. Database Creation
[**This part is not unique to your file**]{.underline}, it is creating the database you will use when running blast on your file (ie. you don't need to change this section with each different file of interest).
### Obtain Fasta (UniProt/Swiss-Prot)
```{bash}
cd ../data
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2024_02.fasta.gz
gunzip -k uniprot_sprot_r2024_02.fasta.gz
```
### Making the Database
```{bash}
mkdir ../blastdb
/home/shared/ncbi-blast-2.15.0+/bin/makeblastdb \
-in ../data/uniprot_sprot_r2024_02.fasta \
-dbtype prot \
-out ../blastdb/uniprot_sprot_r2024_02
```
## 2. Getting the query fasta file
This is where you start [**changing things for your specific file**]{.underline} of interest.
We already have our query rna fasta file.
Taking a peek at that file via `head()` and getting a count of sequences
```{bash}
head -3 ../data/ncbi_dataset/data/GCF_026571515.1/rna.fna
```
```{bash}
echo "How many sequences are there?"
grep -c ">" ../data/ncbi_dataset/data/GCF_026571515.1/rna.fna
```
## 3. Running Blastx
```{bash}
/home/shared/ncbi-blast-2.15.0+/bin/blastx \
-query ../data/ncbi_dataset/data/GCF_026571515.1/rna.fna \
-db ../blastdb/uniprot_sprot_r2024_02 \
-out ../output/GCF_026571515.1.tab \
-evalue 1E-20 \
-num_threads 20 \
-max_target_seqs 1 \
-outfmt 6
```
Peeking at the output file
```{bash}
head -2 ../output/GCF_026571515.1.tab
```
```{bash}
echo "Number of lines in output"
wc -l ../output/GCF_026571515.1.tab
```
##