---
title: "BLAST, but make it look nice"
author: "Sarah Yerrace"
date: "2023-04-13"
output: 
  html_document:
    theme: readable
    highlight: tango
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
---

In this script, we will be taking genetic sequence reads and giving them taxonomic assignments using BLAST. This was originally written using Jupyter

# Set Up

here are the packages that are used in this script

```{r setup, include=TRUE, message=FALSE, warning=FALSE}
library(knitr)
library(tidyverse)
library(kableExtra)
library(DT)
library(Biostrings)
library(tm)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,        # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  fig.width = 6,       # Set plot width in inches
  fig.height = 4,      # Set plot height in inches
  fig.align = "center" # Align plots to the center
)
```

# Making the Blast Database

## download the file

This is downloading the NCBI Blast software. I put it in a folder called applications. cd is indicating a new director and curl is downloading from the given website.

```{r, engine = 'bash'}
cd /home/jovyan/applications
curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.13.0+-x64-linux.tar.gz
```

This unzips the file

```{r, engine = 'bash'}
tar -xf ncbi-blast-2.13.0+-x64-linux.tar.gz
```

Check that the software is working by using the help function (-h)

```{r, engine = 'bash'}
~/applications/ncbi-blast-2.13.0+/bin/blastx -h
```

## Make a database

This is putting our sequences in a folder named Data. using ls will return all a list of files in the folder so we know that they went to the right place

```{r, engine = 'bash'}
cd ../data
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz
gunzip -k uniprot_sprot_r2023_01.fasta.gz
ls ../data
```

```{r, engine = 'bash'}
~/applications/ncbi-blast-2.13.0+/bin/makeblastdb \
-in ../data/uniprot_sprot_r2023_01.fasta \
-dbtype prot \
-out ../blastdb/uniprot_sprot_r2023_01
```

## Get the query sequence

This code downloads a query sequence and saves it in the data folder

```{r, engine = 'bash'}
curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \
-k \
> ../data/Ab_4denovo_CLC6_a.fa
```

The head function allows us to look at the first few lines of the data. This code will also return the number of sequences in our data.

```{r, engine = 'bash'}
head ../data/Ab_4denovo_CLC6_a.fa
echo "How many sequences are there?"
grep -c ">" ../data/Ab_4denovo_CLC6_a.fa
```

# Run Blast

This section uses the NCBI Blast software using the query sequence and database we downloaded. It creates an output file (.tab) with the results.

```{r, engine = 'bash'}
~/applications/ncbi-blast-2.13.0+/bin/blastx \
-query ../data/Ab_4denovo_CLC6_a.fa \
-db ../blastdb/uniprot_sprot_r2023_01 \
-out ../output/Ab_4-uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 20 \
-max_target_seqs 1 \
-outfmt 6
```

## Add Annotation information

Start by changing the format of blast output. This next section replaces "\|" with "\t" and saves a new output file. This change will make it easier to join with out second table.

```{r, engine = 'bash'}
tr '|' '\t' < ../output/Ab_4-uniprot_blastx.tab \
> ../output/Ab_4-uniprot_blastx_sep.tab
```

This code reads the CSVs and stores them as objects

```{r, engine = 'bash'}
bltabl <- read.csv("../output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)
spgo <- read.csv("https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE)
```

This chunk joins the two CSVs together in one table. There are many options for exploring this data set, this is just one of them.

```{r, engine = 'bash'}
left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>%
  write_delim("../output/blast_annot_go.tab", delim = '\t')
```