---
title: "Running Blast with alternative output format"
author: "Megan Ewing"
date: "2024-11-27"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## 1. Database Creation

[**This part is not unique to your file**]{.underline}, it is creating the database you will use when running blast on your file (ie. you don't need to change this section with each different file of interest).

### Obtain Fasta (UniProt/Swiss-Prot)

```{bash}
cd ../data
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2024_02.fasta.gz
gunzip -k uniprot_sprot_r2024_02.fasta.gz
```

### Making the Database

```{bash}
mkdir ../blastdb
/home/shared/ncbi-blast-2.15.0+/bin/makeblastdb \
-in ../data/uniprot_sprot_r2024_02.fasta \
-dbtype prot \
-out ../blastdb/uniprot_sprot_r2024_02
```

## 2. Getting the query fasta file

This is where you start [**changing things for your specific file**]{.underline} of interest.

We already have our query rna fasta file.

Taking a peek at that file via `head()` and getting a count of sequences

```{bash}

head -3 ../data/ncbi_dataset/data/GCF_026571515.1/cds_from_genomic.fna

```

```{bash}

echo "How many sequences are there?"
grep -c ">" ../data/ncbi_dataset/data/GCF_026571515.1/cds_from_genomic.fna

```

## 3. Running Blastx

```{bash}
/home/shared/ncbi-blast-2.15.0+/bin/blastx \
-query ../data/ncbi_dataset/data/GCF_026571515.1/cds_from_genomic.fna \
-db ../blastdb/uniprot_sprot_r2024_02 \
-out ../output/1125-GCF_026571515.1.tab \
-evalue 1E-20 \
-num_threads 20 \
-max_target_seqs 1 \
-outfmt 6
```

Peeking at the output file

```{bash}
head -2 ../output/1125-GCF_026571515.1.tab
```

```{bash}
echo "Number of lines in output"
wc -l ../output/1125-GCF_026571515.1.tab
```

## Pulling Gene ID from Headers (if needed)

11/25 – using blast output from `../output/10-blast_with_cds.rmd`

LOC### Info not present, adding back in here by pulling the headers from the cds file

```{r}

# File paths
cds_file <- "../data/ncbi_dataset/data/GCF_026571515.1/cds_from_genomic.fna"
blast_output_file <- "../output/0817-rphil_blastout_cds.tab"
output_file <- "../output/1125-blastout.tab"


# Step 1: Extract LOC### identifiers from the CDS file
original_headers <- readLines(cds_file) %>%
  .[grep("^>", .)] %>%
  data.frame(header = ., stringsAsFactors = FALSE) %>%
  mutate(
    LOC = gsub(".*\\[gene=(LOC[0-9]+)\\].*", "\\1", header),
    Query_ID = gsub("^>(\\S+).*", "\\1", header)
  )


# Step 2: Read BLAST results
blast_results <- read.table(blast_output_file, header = FALSE, sep = "\t",
                            col.names = c("Query_ID", "Subject_ID", "Identity", 
                                          "Alignment_Length", "Mismatches", 
                                          "Gap_Openings", "Query_Start", 
                                          "Query_End", "Subject_Start", 
                                          "Subject_End", "Evalue", "Bit_Score"))


# Step 3: Annotate BLAST results with LOC### identifiers
annotated_results <- blast_results %>%
  left_join(original_headers, by = "Query_ID") %>%
  select(LOC, everything())  # Move LOC to the first column

# Step 4: Save annotated BLAST results
write.table(annotated_results, file = output_file, quote = FALSE, sep = "\t", 
            row.names = FALSE, col.names = TRUE)

cat("Annotated BLAST results saved to:", output_file, "\n")
head(output_file)


```