--- title: "32-Ptuh-ncRNA-machinery-BLAST" author: "Kathleen Durkin" date: "2026-04-15" always_allow_html: true output: github_document: toc: true toc_depth: 3 number_sections: true html_preview: false bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bibliography: ../../references.bib link-citations: true --- ```{r setup, include=FALSE} library(knitr) library(tidyverse) library(kableExtra) library(DT) library(Biostrings) library(tm) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages fig.width = 6, # Set plot width in inches fig.height = 4, # Set plot height in inches fig.align = "center", # Align plots to the center comment = "" # Prevents appending '##' to beginning of lines in code output ) ``` NOTE: Instead of using Steven's e-value threshold of 1e-05, I'll use the more stringent threshold used in Ashey et al. 2025 in Jill's check for the existence of ncRNA machinery, to maintain consistency: **1e-40** # Proteins Will be using P. meandrina protein sequences here: https://gannet.fish.washington.edu/seashell/snaps/Pocillopora_meandrina_HIv1.genes.pep.faa Download if necessary ```{bash} cd ../data curl -o Pocillopora_meandrina_HIv1.genes.pep.faa https://gannet.fish.washington.edu/seashell/snaps/Pocillopora_meandrina_HIv1.genes.pep.faa ``` ```{bash} head ../data/Pocillopora_meandrina_HIv1.genes.pep.faa ``` # Make BLAST db ```{bash} /home/shared/ncbi-blast-2.15.0+/bin/makeblastdb \ -in ../data/Pocillopora_meandrina_HIv1.genes.pep.faa \ -dbtype prot \ -out ../output/32-Ptuh-ncRNA-machinery-BLAST/Ptuh-proteins ``` ```{bash} head ../../data/ncRNA_machinery.fasta ``` # BLASTp ```{bash} fasta="../../data/ncRNA_machinery.fasta" /home/shared/ncbi-blast-2.15.0+/bin/blastp \ -query $fasta \ -db ../output/32-Ptuh-ncRNA-machinery-BLAST/Ptuh-proteins \ -out ../output/32-Ptuh-ncRNA-machinery-BLAST/ncRNAmach-blastp-Ptuh_out.tab \ -evalue 1E-40 \ -num_threads 48 \ -max_target_seqs 1 \ -max_hsps 1 \ -outfmt 6 ``` ```{r, engine='bash', eval=TRUE} wc -l ../output/32-Ptuh-ncRNA-machinery-BLAST/ncRNAmach-blastp-Ptuh_out.tab ``` ```{r, engine='bash', eval=TRUE} head ../output/32-Ptuh-ncRNA-machinery-BLAST/ncRNAmach-blastp-Ptuh_out.tab ``` # Parse and annotate This .tab file associates A. pulchra proteins with ncRNA machinery proteins found in the reference db. However, the names are just taken straight from the fasta headers, which ends up being mostly just the Uniprot accession number. Let's parse this into a more interpretable db. parse the reference ncRNA machinery fasta headers ```{r, engine='bash'} cd ../../data/ grep "^>" ncRNA_machinery.fasta | \ sed 's/^>//' | \ awk -F'|' '{ # Get accession acc = $2 # Everything after the third pipe rest = $3 # Strip entry_name (first token before space) sub(/^[^ ]+ /, "", rest) # Parse fields using OS=, OX=, GN=, CGN=, PE= protein = rest; sub(/ CGN=.*/, "", protein); sub(/ OS=.*/, "", protein) species = rest; sub(/.*OS=/, "", species); sub(/ OX=.*/, "", species) if (rest ~ /GN=/) { gene = rest; sub(/.*GN=/, "", gene); sub(/ PE=.*/, "", gene) } else { gene = "NA" } if (rest ~ /CGN=/) { cgn = rest; sub(/.*CGN=/, "", cgn); sub(/ OS=.*/, "", cgn) } else { cgn = "NA" } # Quote protein and species in case of commas gsub(/"/, "\"\"", protein) gsub(/"/, "\"\"", species) print acc "," "\"" protein "\"" "," gene "," cgn "," "\"" species "\"" }' | \ sed '1i accession,protein_name,gene_name,std_gene_name,species' \ > ncRNA_machinery_reference_table.csv ``` Now parse blastp output rows to reduce to accession numbers ```{bash} cd ../output/32-Ptuh-ncRNA-machinery-BLAST awk -F'\t' 'BEGIN{OFS="\t"} {split($1,a,"|"); $1=a[2]; print}' ncRNAmach-blastp-Ptuh_out.tab > ncRNAmach-blastp-Ptuh_parsed.tab ``` Now read in and match with reference db, so that blast results are associated with protein/gene names (not just accession numbers) (Note that I can't rely on just the gene names, because most of the Uniprot entries used non-standard gene names) ```{r} library(dplyr) # Read reference table ref <- read.csv("../../data/ncRNA_machinery_reference_table.csv", stringsAsFactors = FALSE) # Read parsed BLAST output blast <- read.delim("../output/32-Ptuh-ncRNA-machinery-BLAST/ncRNAmach-blastp-Ptuh_parsed.tab", header = FALSE, col.names = c("accession", "target", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")) # Join blast_annotated <- blast %>% left_join(ref, by = "accession") %>% dplyr::select(target, protein_name, gene_name, std_gene_name, species, accession, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) write.csv(blast_annotated, "../output/32-Ptuh-ncRNA-machinery-BLAST/ncRNAmach-blastp-Ptuh_annotated.csv", row.names = FALSE) ``` Finally, let's reduce this to unique associations. In the ncNRA machinery db I compiled, I sometimes included multiple sequences of a given protein sequenced in different species. This was partly due to convenience, but also because (a) I wanted "backups" in case one of the sequences was mis-annotated or had other problems (these are all unreviewed entries), and (b) in case a protein is highly divergent, and only a subset of its entries matched our species. Now that I've done the blast, though, I only really need to know the [p.evermanni mRNA] - [standardized gene name] association. Let's make this reduced db. ```{r} blast_reduc <- blast_annotated %>% dplyr::select(target, std_gene_name) %>% distinct() write.csv(blast_reduc, "../output/32-Ptuh-ncRNA-machinery-BLAST/ncRNAmach-blastp-Ptuh_annotated_reduced.csv", row.names = FALSE) ```