---
title: "Blast with coding sequences"
output: html_document
date: "2024-05-10"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Before starting, make a directory/repo to do all this in that contains the following folders:

-   Code (this is where this/the code file should be)

-   Data

-   Output

There are some packages that are used in this process including Dplyr, Stringr, ggplot2, and DT. Packages are installed/loaded here (disregard chunk if already loaded)

```{r}

# if you need to install, uncomment the following: 
# install.packages("BiocManager")
# BiocManager::install("Biostrings")
# install.packages("ggplot2")
# install.packages('DT')
# install.packages('dplyr')
# install.packages('stringr')

# load installed packages
library(BiocManager)
library(Biostrings)
library(ggplot2)
library(DT)
library(dplyr)
library(stringr)
library(tidyr)
library(readr)


```

## 1. Database Creation

[**This part is not unique to your file**]{.underline}, it is creating the database you will use when running blast on your file (ie. you don't need to change this section with each different file of interest).

### Obtain Fasta (UniProt/Swiss-Prot)

```{bash}
cd ../data
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2024_02.fasta.gz
gunzip -k uniprot_sprot_r2024_02.fasta.gz
```

### Making the Database

```{bash}
mkdir ../blastdb
/home/shared/ncbi-blast-2.15.0+/bin/makeblastdb \
-in ../data/uniprot_sprot_r2024_02.fasta \
-dbtype prot \
-out ../blastdb/uniprot_sprot_r2024_02
```

## 2. Getting the query fasta file

This is where you start [**changing things for your specific file**]{.underline} of interest.

We already have our query rna fasta file.

Taking a peek at that file via `head()` and getting a count of sequences

```{bash}

head -3 ../data/ncbi_dataset/data/GCF_026571515.1/cds_from_genomic.fna

```

```{bash}

echo "How many sequences are there?"
grep -c ">" ../data/ncbi_dataset/data/GCF_026571515.1/cds_from_genomic.fna

```

## 3. Running Blastx

```{bash}
/home/shared/ncbi-blast-2.15.0+/bin/blastx \
-query ../data/ncbi_dataset/data/GCF_026571515.1/cds_from_genomic.fna \
-db ../blastdb/uniprot_sprot_r2024_02 \
-out ../output/0817-rphil_blastout_cds.tab \
-evalue 1E-5 \
-num_threads 20 \
-max_target_seqs 1 \
-outfmt 6
```

Peeking at the output file

```{bash}
head -2 ../output/0817-rphil_blastout_cds.tab
```

```{bash}
echo "Number of lines in output"
wc -l ../output/0817-rphil_blastout_cds.tab
```

## Blast cds output does not have LOC information in output, here we will retrieve that to prep for joining with deseq2 out.

```{bash}
 cat ../data/ncbi_dataset_202404/data/GCF_026571515.1/cds_from_genomic.fna | grep ">" > ../output/cds_from_genomic_headers.txt

```

```{r}
# from lauras comment: https://github.com/RobertsLab/resources/issues/1897

rphil.blast.cds.headers <- read_delim(file = "../output/cds_from_genomic_headers.txt", delim = "nothing", col_names = "all") %>% 
  mutate(cds=gsub(">| \\[", "", str_extract(all, ">lcl\\|(.*?) \\[")),
         gene=gsub("gene=|\\]", "", str_extract(all, "gene=(.*?)\\]")),
         ncbi_id=gsub("db_xref=GeneID:|\\]", "", str_extract(all, "db_xref=GeneID:(.*?)\\]")),
         protein=gsub("protein=|\\]", "", str_extract(all, "protein=(.*?)\\]")),
         protein_id=gsub("protein_id=|\\]", "", str_extract(all, "protein_id=(.*?)\\]")),
         location=gsub("location=|\\]", "", str_extract(all, "location=(.*?)\\]")),
         gbkey=gsub("gbkey=|\\]", "", str_extract(all, "gbkey=(.*?)\\]"))) %>% 
  dplyr::select(-all)
```

```{r}
# from lauras comment: https://github.com/RobertsLab/resources/issues/1897 

rphil.blast.cds <- right_join(rphil.blast.cds.headers,
                            read_delim(file = "../output/0817-rphil_blastout_cds.tab",
                                       delim = "\t", 
                                       col_names = c("cds", "saccver", "pident", "length",
                                                     "mismatch", "gapopen", "qstart", "qend",
                                                     "sstart", "send", "evalue", "bitscore")), 
                                        by="cds") 


write.table(rphil.blast.cds,"../output/0821-rphil_blast_cds.tab", col.names = T)

```

```{r}
# making new blast result that has just the best hit -- getting it down to the gene level instead of isoform

# Load necessary library
library(dplyr)

# Read BLAST results
blast_results <- rphil.blast.cds

# Extract the best hit based on the lowest e-value for each gene
best_hits <- blast_results %>% group_by(gene) %>% dplyr::slice(which.min(evalue)) %>% separate(saccver, sep="\\|", into=c("na", "SPID", "gene.Uni"), remove = F) %>% dplyr::select(-na) %>% separate(gene.Uni, sep="_", into=c("gene.Uni", "species"), remove=T)

# Save best hits
write.table(best_hits, "../output/blast_out_gene_level_20240820.tab", row.names = T, col.names = T)


```

```{r}


```