---
title: "03.1-Knit-Blast"
author: "Claudia_halibut"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
output:
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
---

4/22/2025

Objective - Create a knit using previously coded files

```{bash}
## start by checking the current directory
pwd
```

```{bash}
## curl downloads the BLAST+ archive from the NCBI FTP site (via URL) and tar extracts it
cd /home/shared/8TB_HDD_02/cberry1/applications
pwd
curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz
tar -xf ncbi-blast-2.16.0+-x64-linux.tar.gz
```

```{bash}
pwd
```

```{bash}
## test the install from the lines above by printing the blastx help text
cd /home/shared/8TB_HDD_02/cberry1/applications
/home/shared/8TB_HDD_02/cberry1/applications/ncbi-blast-2.16.0+/bin/blastx -h
```

Start creation of the NCBI BLAST database from UniProt (https://www.uniprot.org/downloads)

```{bash}
# NOTE: each "../" moves up one directory level, so use one per level you are going backwards (e.g., from the assignments directory, going backwards twice is "../../").
pwd
```

```{bash}
## make sure the directory is where it should be before running more code
cd /home/shared/8TB_HDD_02/cberry1/blastdb
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
## mv renames the FASTA archive, adding "r2025_01" to the name as a release identifier; gunzip decompresses it (-k keeps the .gz copy)
mv uniprot_sprot.fasta.gz uniprot_sprot_r2025_01.fasta.gz
gunzip -k uniprot_sprot_r2025_01.fasta.gz
## could add '../blastdb' to ls, but we are already in the desired directory, so plain ls is enough
ls
```

```{bash}
## head shows the first few lines of the UniProt file.
# NOTE: the * symbol acts as a wildcard that matches any file fitting the typed pattern; ending the pattern in .fasta skips the compressed .gz copy
cd /home/shared/8TB_HDD_02/cberry1/blastdb
head uni*.fasta
```

```{bash}
pwd
```

```{bash}
## makeblastdb builds a protein BLAST database (-dbtype prot) from the UniProt FASTA file
# NOTE: a '\' at the end of a line continues the command, so while it is displayed as separate lines, it is all one command
/home/shared/8TB_HDD_02/cberry1/applications/ncbi-blast-2.16.0+/bin/makeblastdb \
-in /home/shared/8TB_HDD_02/cberry1/blastdb/uniprot_sprot_r2025_01.fasta \
-dbtype prot \
-out /home/shared/8TB_HDD_02/cberry1/blastdb/uniprot_sprot_r2025_01
```

```{bash}
pwd
```

```{bash}
## curl grabs the "Ab" query sequences and saves the fasta file in the data folder (-k skips the certificate check)
## '>' overwrites the file, so re-running the chunk does not append duplicate sequences
curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \
-k > data/Ab_4denovo_CLC6_a.fa
```

```{bash}
pwd
```

```{bash}
## head prints the first lines of the query file, echo prints a separate question, and grep -c counts the header lines
head data/Ab_4denovo_CLC6_a.fa
echo "How many sequences are there?"
## use a plain ">" here: "\>" is a word-boundary anchor in GNU grep and would count nearly every line
grep -c ">" data/Ab_4denovo_CLC6_a.fa
```

```{bash}
pwd
```

```{bash}
## blastx translates the query sequences in the data folder, searches them against the UniProt database in blastdb, and writes tabular results (-outfmt 6) to the output folder
/home/shared/8TB_HDD_02/cberry1/applications/ncbi-blast-2.16.0+/bin/blastx \
-query data/Ab_4denovo_CLC6_a.fa \
-db /home/shared/8TB_HDD_02/cberry1/blastdb/uniprot_sprot_r2025_01 \
-out output/Ab_4-uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 1 \
-max_target_seqs 1 \
-outfmt 6
```

```{bash}
## head prints the first two lines of the results file and wc -l counts its lines (one line per hit)
head -2 output/Ab_4-uniprot_blastx.tab
wc -l output/Ab_4-uniprot_blastx.tab
# NOTE: the chunk below uses {{bash}} in its header.
# The double braces mean knitr displays that chunk as plain code instead of running it.
```

``` {{bash}}
curl -O -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/search?query=reviewed:true+AND+organism_id:9606"
```

```{bash}
## change to the output directory, then download annotation fields (protein names, GO terms, etc.) for all reviewed UniProt entries as a compressed TSV
cd /home/shared/8TB_HDD_02/cberry1/Claudia-Halibut/assignments/output
curl -O -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength%2Cgo_f%2Cgo%2Cgo_p%2Cgo_c%2Cgo_id%2Ccc_interaction%2Cec%2Cxref_reactome%2Cxref_unipathway%2Cxref_interpro&format=tsv&query=%28%2A%29%20AND%20%28reviewed%3Atrue%29"
```

```{bash}
## print the first few lines (the number sets how many) of the blastx results in the designated directory, then count its lines
head -2 output/Ab_4-uniprot_blastx.tab
wc -l output/Ab_4-uniprot_blastx.tab
```

```{bash}
## tr replaces every pipe it runs across with a tab; head -5 previews the result without writing a file
pwd
tr '|' '\t' < /home/shared/8TB_HDD_02/cberry1/Claudia-Halibut/assignments/output/Ab_4-uniprot_blastx.tab | head -5
```

```{bash}
## translate the pipes to tabs again, this time writing the results into a dedicated file
cd /home/shared/8TB_HDD_02/cberry1/Claudia-Halibut/assignments
tr '|' '\t' < output/Ab_4-uniprot_blastx.tab \
> output/Ab_4-uniprot_blastx_sep.tab
```

```{bash}
## output the first two lines of the separated file created above
head -2 /home/shared/8TB_HDD_02/cberry1/Claudia-Halibut/assignments/output/Ab_4-uniprot_blastx_sep.tab
```

The following code joins the two tables to one another, but the code is invalidated since only one .tab file is created.

```{r}
library(tidyverse)
## install kableExtra only if it is missing, so knitting does not reinstall it every time
if (!requireNamespace("kableExtra", quietly = TRUE)) install.packages("kableExtra")
library(kableExtra)
library(dplyr) # mutate function below was throwing error messages, so I added this library

## the current working directory, in R, is:
getwd()

## read the tab-separated table:
bltabl <- read.csv("output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)

## pulls the file from Dr. Roberts' server; the download is slow and may throw a timeout error, but the following code does run
spgo <- read.csv("https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE)

## str() summarizes the structure of the blastx table and head() shows its first rows,
## a quick check that the data table IS what we want it to be
str(bltabl) # alternate function
head(bltabl)

## join the blastx hits to the UniProt annotations on the accession (V3 = Entry),
## keep the columns of interest, and shorten the long query prefix to "Ab"
annot <- left_join(bltabl, spgo, by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>%
  mutate(V1 = str_replace_all(V1,
    pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed",
    replacement = "Ab"))

## preview the first rows as a styled table
kbl(head(annot)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

## write the full annotated table (the path needs its leading "/" to be absolute)
write_delim(annot, "/home/shared/8TB_HDD_02/cberry1/Claudia-Halibut/assignments/output/blast_annot_go.tab", delim = '\t')
```
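
The join above matches on `V3` because splitting on pipes turns the subject ID `sp|ACCESSION|NAME` into three columns, which also shifts the e-value from column 11 to column 13. A minimal sketch of that shift on a single made-up record follows; the contig name, accession, entry name, and scores are illustrative placeholders, not values from the real output.

```bash
#!/usr/bin/env bash
# One toy hit in BLAST tabular format (-outfmt 6). All values are made up for illustration.
toy_hit=$(printf 'contig_1\tsp|P12345|TOY_HUMAN\t75.0\t88\t22\t0\t1\t264\t1\t88\t1e-40\t150')

# Replace each pipe with a tab, the same tr call used on the real file.
toy_sep=$(printf '%s\n' "$toy_hit" | tr '|' '\t')

# Count the tab-separated columns before and after the split.
n_before=$(printf '%s\n' "$toy_hit" | awk -F'\t' '{print NF}')
n_after=$(printf '%s\n' "$toy_sep" | awk -F'\t' '{print NF}')

# After the split, the accession sits in column 3 and the e-value in column 13.
accession=$(printf '%s\n' "$toy_sep" | awk -F'\t' '{print $3}')
evalue=$(printf '%s\n' "$toy_sep" | awk -F'\t' '{print $13}')

echo "columns before: $n_before, after: $n_after"
echo "accession (V3): $accession, e-value (V13): $evalue"
```

Because the fence is plain `bash` rather than `{bash}`, knitr shows this block without running it; paste it into a terminal to try it.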