---
title: "33-Ptuh-epi-machinery-BLAST"
author: "Kathleen Durkin"
date: "2026-04-28"
always_allow_html: true
output: 
  github_document:
    toc: true
    toc_depth: 3
    number_sections: true
    html_preview: true 
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
bibliography: ../../references.bib
link-citations: true
---

```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
library(kableExtra)
library(DT)
library(Biostrings)
library(tm)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,         # Evaluate code chunks
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  fig.width = 6,       # Set plot width in inches
  fig.height = 4,      # Set plot height in inches
  fig.align = "center", # Align plots to the center
  comment = ""         # Prevents appending '##' to beginning of lines in code output
)
```

Using Steven's code from doing this in `timeseries`: https://github.com/urol-e5/timeseries_molecular/blob/dada661a07a45bf80b42953c4f8f8b7dad6824db/F-Ptua/code/03-Ptua-epimods-blast.qmd
 
NOTE: Instead of using Steven's e-value threshold of 1e-05, I'll use the more stringent threshold used in Ashey et al. 2025 in Jill's check for the existence of ncRNA machinery, for consistency: **1e-40**

Will use the protein database already generated in 32-Ptuh-ncRNA-machinery-BLAST


Epimachinery fasta provided by Hollie, obtain from timeseries repo if needed:

https://raw.githubusercontent.com/urol-e5/timeseries_molecular/dada661a07a45bf80b42953c4f8f8b7dad6824db/D-Apul/data/Machinery.fasta

```{bash, eval=FALSE}
cd ../../data
curl -o Machinery.fasta https://raw.githubusercontent.com/urol-e5/timeseries_molecular/dada661a07a45bf80b42953c4f8f8b7dad6824db/D-Apul/data/Machinery.fasta

```

```{bash, eval=FALSE}
head ../../data/Machinery.fasta
```

## BLASTp

```{bash}
fasta="../../data/Machinery.fasta"

/home/shared/ncbi-blast-2.15.0+/bin/blastp \
-query $fasta \
-db ../output/32-Ptuh-ncRNA-machinery-BLAST/Ptuh-proteins \
-out ../output/33-Ptuh-epi-machinery-BLAST/Mach-blastp-Ptuh_out.tab \
-evalue 1E-40 \
-num_threads 48 \
-max_target_seqs 1 \
-max_hsps 1 \
-outfmt 6
```

```{r, engine='bash', eval=TRUE}       
wc -l ../output/33-Ptuh-epi-machinery-BLAST/Mach-blastp-Ptuh_out.tab
```

This blast search using a 1e-40 evlaue threshold produces 403 hits, compared to the 480 hits produced in [Steven's `timeseries` code](https://github.com/urol-e5/timeseries_molecular/blob/dada661a07a45bf80b42953c4f8f8b7dad6824db/F-Ptua/code/03-Ptua-epimods-blast.qmd) using 1e-05

```{r, engine='bash', eval=TRUE}        
head ../output/33-Ptuh-epi-machinery-BLAST/Mach-blastp-Ptuh_out.tab
```

## Reduce

Let's also make sure we have an easy-to-use csv associating each protein with any matches.

```{r}
raw <- read.table("../output/33-Ptuh-epi-machinery-BLAST/Mach-blastp-Ptuh_out.tab")

formatted <- raw  %>% dplyr::select(V1, V2)
formatted$target <- raw$V2
formatted$gene_name <- raw$V1
formatted <- formatted %>% dplyr::select(-V1, -V2)

# Remove trailing ##s after the target gene names (artifact of Ensembl)
formatted$gene_name <- sub("-2[0-9]{2}$", "", formatted$gene_name)

write.csv(formatted, "../output/33-Ptuh-epi-machinery-BLAST/Mach-blastp-Ptuh_reduced.csv", row.names = FALSE)
```