---
title: "Week 3 - BLAST Report - now Knitted!"
author: "Liz Boggs"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
output: 
  html_document:
    theme: readable
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
---

# Week 3 - Knitting BLAST Report

## This report uses NCBI BLAST and UniProt protein sequences to identify genes in a given set of unknown sequences.

### 1. Download BLAST from NCBI using `curl` 

*NOTE: make sure to pick the correct operating system version from the executables page from the URL!*

```{bash}

cd /home/shared/8TB_HDD_02/lizboggs/applications # absolute path - user needs to change to where they want it!
curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz
tar -xf ncbi-blast-2.16.0+-x64-linux.tar.gz # retrieves the files in a gzip

```

*An additional note here - I prefer not to use absolute paths, obviously, for the sake of the exercise and making sure things are where they should be I kept this one absolute*

### 2. Double checking that the download worked by using the `-h` command (help):

```{bash}

~/applications/ncbi-blast-2.16.0+/bin/blastx -h # pulls up blastx usage options

```

### 3. Downloading a file of known sequences from UniProt/SwissProt

```{bash}

cd /home/shared/8TB_HDD_02/lizboggs/blastdb # this is where the database files will live

curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz # pulling fasta from webpage

mv uniprot_sprot.fasta.gz uniprot_sprot_r2025_01.fasta.gz # rename the file

gunzip -k uniprot_sprot_r2025_01.fasta.gz # unzip the sequence folder

ls # list out the files to make sure they're there

```

### 4. Making our BLAST database

```{bash}

/home/shared/8TB_HDD_02/lizboggs/applications/ncbi-blast-2.16.0+/bin/makeblastdb \
-in ~/blastdb/uniprot_sprot_r2025_01.fasta \
-dbtype prot \
-out ~/blastdb/uniprot_sprot_r2025_01

```

### 5. Getting the query sequence (our unknown)

```{bash}

curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \ 
-k \
> ~/blastdb/Ab_4denovo_CLC6_a.fa

head ~/blastdb/Ab_4denovo_CLC6_a.fa
echo "How many sequences are there?" # output for next line
grep -c ">" ~/blastdb/Ab_4denovo_CLC6_a.fa # count of sequences

```

### 6. Running the BLAST program to find matches for our sequence

```{bash}

~/applications/ncbi-blast-2.16.0+/bin/blastx \
-query ~/blastdb/Ab_4denovo_CLC6_a.fa \
-db ~/blastdb/uniprot_sprot_r2025_01 \
-out ~/output/Ab_4-uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 16 \
-max_target_seqs 1 \
-outfmt 6

head -2 ~/output/Ab_4-uniprot_blastx.tab # showing the top 2 results
wc -l ~/output/Ab_4-uniprot_blastx.tab # counting the output

```

### 7. Retrieving the annotation table for the BLAST output

#### These next `curl` commands will grab annotation information from UniProt.

```{bash}

curl -O -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/search?query=reviewed:true+AND+organism_id:9606"

```

```{bash}
curl -O -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength&format=tsv&query=%28*%29+AND+%28reviewed%3Atrue%29"

```

### 8. Joining BLAST with annotations

#### Now we need to join the tables we've made in order to understand the unknown sequences' function(s).

#### Let's check the outputted file first:

```{bash}
head -2 ~/output/Ab_4-uniprot_blastx.tab
wc -l ~/output/Ab_4-uniprot_blastx.tab
```

#### Pull out the accession number:

```{bash}
head -2 ~/output/Ab_4-uniprot_blastx.tab | tr '|' '\t'

```

#### Save it so we can use it.

```{bash}
tr '|' '\t' < ~/output/Ab_4-uniprot_blastx.tab \
> ~/output/Ab_4-uniprot_blastx_sep.tab
```

### Time to join the tables!

#### `kable` will help us produce a prettier output

```{r}
library(tidyverse)
library(kableExtra)

bltabl <- read.csv("\~/output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)

spgo <- read.csv("\~/blastdb/uniprot_sprot_r2025_01.tab", sep = '\t', header = TRUE)

kbl( head( left_join(bltabl, spgo, by = c("V3" = "Entry")) %\>% select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %\>% mutate(V1 = str_replace_all(V1, pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) ) ) %\>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

left_join(bltabl, spgo, by = c("V3" = "Entry")) %\>% select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %\>% mutate(V1 = str_replace_all(V1, pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %\>% write_delim("\~/output/blast_annot_go.tab", delim = '\t')

```