32-Ptuh-ncRNA-machinery-BLAST ================ Kathleen Durkin 2026-04-15 - [1 Proteins](#1-proteins) - [2 Make BLAST db](#2-make-blast-db) - [3 BLASTp](#3-blastp) - [4 Parse and annotate](#4-parse-and-annotate) NOTE: Instead of using Steven’s e-value threshold of 1e-05, I’ll use the more stringent threshold used in Ashey et al. 2025 in Jill’s check for the existence of ncRNA machinery, to maintain consistency: **1e-40** # 1 Proteins Will be using P. meandrina protein sequences here: Download if necessary ``` bash cd ../data curl -o Pocillopora_meandrina_HIv1.genes.pep.faa https://gannet.fish.washington.edu/seashell/snaps/Pocillopora_meandrina_HIv1.genes.pep.faa ``` ``` bash head ../data/Pocillopora_meandrina_HIv1.genes.pep.faa ``` # 2 Make BLAST db ``` bash /home/shared/ncbi-blast-2.15.0+/bin/makeblastdb \ -in ../data/Pocillopora_meandrina_HIv1.genes.pep.faa \ -dbtype prot \ -out ../output/32-Ptuh-ncRNA-machinery-BLAST/Ptuh-proteins ``` ``` bash head ../../data/ncRNA_machinery.fasta ``` # 3 BLASTp ``` bash fasta="../../data/ncRNA_machinery.fasta" /home/shared/ncbi-blast-2.15.0+/bin/blastp \ -query $fasta \ -db ../output/32-Ptuh-ncRNA-machinery-BLAST/Ptuh-proteins \ -out ../output/32-Ptuh-ncRNA-machinery-BLAST/ncRNAmach-blastp-Ptuh_out.tab \ -evalue 1E-40 \ -num_threads 48 \ -max_target_seqs 1 \ -max_hsps 1 \ -outfmt 6 ``` ``` bash wc -l ../output/32-Ptuh-ncRNA-machinery-BLAST/ncRNAmach-blastp-Ptuh_out.tab ``` 801 ../output/32-Ptuh-ncRNA-machinery-BLAST/ncRNAmach-blastp-Ptuh_out.tab ``` bash head ../output/32-Ptuh-ncRNA-machinery-BLAST/ncRNAmach-blastp-Ptuh_out.tab ``` tr|A0A2B4SX82|A0A2B4SX82_STYPI Pocillopora_meandrina_HIv1___RNAseq.g6540.t1 95.161 806 39 0 1 806 1 806 0.0 1602 tr|A0A6P8J329|A0A6P8J329_ACTTE Pocillopora_meandrina_HIv1___RNAseq.g6540.t1 65.675 807 271 6 1 802 1 806 0.0 1139 tr|A0A6S7G2D7|A0A6S7G2D7_PARCT Pocillopora_meandrina_HIv1___RNAseq.g6540.t1 52.847 808 370 5 1 799 1 806 0.0 915 tr|A0A7M5V3C0|A0A7M5V3C0_9CNID Pocillopora_meandrina_HIv1___RNAseq.g6540.t1 50.428 817 393 5 1 816 1 806 0.0 876 tr|A0A913XIF0|A0A913XIF0_EXADI Pocillopora_meandrina_HIv1___RNAseq.g6540.t1 64.764 806 279 4 1 801 1 806 0.0 1129 tr|A0A9W9ZWX6|A0A9W9ZWX6_9CNID Pocillopora_meandrina_HIv1___RNAseq.g6540.t1 83.643 807 131 1 1 807 1 806 0.0 1429 tr|A0AAD9QN77|A0AAD9QN77_ACRCE Pocillopora_meandrina_HIv1___RNAseq.g6540.t1 73.908 824 159 2 1 788 1 804 0.0 1268 tr|A0AAU9VQE5|A0AAU9VQE5_9CNID Pocillopora_meandrina_HIv1___RNAseq.g6540.t1 100.000 806 0 0 1 806 1 806 0.0 1671 tr|A0ABM4DB04|A0ABM4DB04_HYDVU Pocillopora_meandrina_HIv1___RNAseq.g6540.t1 50.860 814 390 6 1 812 1 806 0.0 865 tr|A0ABM4DDG6|A0ABM4DDG6_HYDVU Pocillopora_meandrina_HIv1___RNAseq.g6540.t1 50.860 814 390 6 1 812 1 806 0.0 864 # 4 Parse and annotate This .tab file associates A. pulchra proteins with ncRNA machinery proteins found in the reference db. However, the names are just taken straight from the fasta headers, which ends up being mostly just the Uniprot accession number. Let’s parse this into a more interpretable db. parse the reference ncRNA machinery fasta headers ``` bash cd ../../data/ grep "^>" ncRNA_machinery.fasta | \ sed 's/^>//' | \ awk -F'|' '{ # Get accession acc = $2 # Everything after the third pipe rest = $3 # Strip entry_name (first token before space) sub(/^[^ ]+ /, "", rest) # Parse fields using OS=, OX=, GN=, CGN=, PE= protein = rest; sub(/ CGN=.*/, "", protein); sub(/ OS=.*/, "", protein) species = rest; sub(/.*OS=/, "", species); sub(/ OX=.*/, "", species) if (rest ~ /GN=/) { gene = rest; sub(/.*GN=/, "", gene); sub(/ PE=.*/, "", gene) } else { gene = "NA" } if (rest ~ /CGN=/) { cgn = rest; sub(/.*CGN=/, "", cgn); sub(/ OS=.*/, "", cgn) } else { cgn = "NA" } # Quote protein and species in case of commas gsub(/"/, "\"\"", protein) gsub(/"/, "\"\"", species) print acc "," "\"" protein "\"" "," gene "," cgn "," "\"" species "\"" }' | \ sed '1i accession,protein_name,gene_name,std_gene_name,species' \ > ncRNA_machinery_reference_table.csv ``` Now parse blastp output rows to reduce to accession numbers ``` bash cd ../output/32-Ptuh-ncRNA-machinery-BLAST awk -F'\t' 'BEGIN{OFS="\t"} {split($1,a,"|"); $1=a[2]; print}' ncRNAmach-blastp-Ptuh_out.tab > ncRNAmach-blastp-Ptuh_parsed.tab ``` Now read in and match with reference db, so that blast results are associated with protein/gene names (not just accession numbers) (Note that I can’t rely on just the gene names, because most of the Uniprot entries used non-standard gene names) ``` r library(dplyr) # Read reference table ref <- read.csv("../../data/ncRNA_machinery_reference_table.csv", stringsAsFactors = FALSE) # Read parsed BLAST output blast <- read.delim("../output/32-Ptuh-ncRNA-machinery-BLAST/ncRNAmach-blastp-Ptuh_parsed.tab", header = FALSE, col.names = c("accession", "target", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")) # Join blast_annotated <- blast %>% left_join(ref, by = "accession") %>% dplyr::select(target, protein_name, gene_name, std_gene_name, species, accession, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) write.csv(blast_annotated, "../output/32-Ptuh-ncRNA-machinery-BLAST/ncRNAmach-blastp-Ptuh_annotated.csv", row.names = FALSE) ``` Finally, let’s reduce this to unique associations. In the ncNRA machinery db I compiled, I sometimes included multiple sequences of a given protein sequenced in different species. This was partly due to convenience, but also because (a) I wanted “backups” in case one of the sequences was mis-annotated or had other problems (these are all unreviewed entries), and (b) in case a protein is highly divergent, and only a subset of its entries matched our species. Now that I’ve done the blast, though, I only really need to know the $$p.evermanni mRNA$$ - $$standardized gene name$$ association. Let’s make this reduced db. ``` r blast_reduc <- blast_annotated %>% dplyr::select(target, std_gene_name) %>% distinct() write.csv(blast_reduc, "../output/32-Ptuh-ncRNA-machinery-BLAST/ncRNAmach-blastp-Ptuh_annotated_reduced.csv", row.names = FALSE) ```