---
title: "03.1-Knit-Blast"
author: "Claudia_halibut"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
output: 
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
---

4/22/2025 Objective: create a knitted document from previously coded files.

```{bash}
##start by checking the current directory
pwd
```

```{bash}
##curl retrieves the BLAST+ archive from the NCBI FTP site (via URL); tar -xf then extracts it

cd /home/shared/8TB_HDD_02/cberry1/applications
pwd

curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz

tar -xf ncbi-blast-2.16.0+-x64-linux.tar.gz
```
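Re-knitting repeats the download above every time. A minimal sketch of one way to skip it when the archive is already on disk, assuming a helper named `fetch_once` (hypothetical, not part of the original notebook):

```bash
# Hypothetical helper (not in the original notebook): download a URL only
# when the destination file is missing or empty.
fetch_once() {
  fetch_url="$1"
  fetch_dest="$2"
  # Default to the last path component of the URL, as curl -O would.
  [ -n "$fetch_dest" ] || fetch_dest="$(basename "$fetch_url")"
  if [ -s "$fetch_dest" ]; then
    echo "already present: $fetch_dest"
  else
    # -f fails on HTTP errors, -L follows redirects, -o names the output file
    curl -fL -o "$fetch_dest" "$fetch_url"
  fi
}

# Example call, using the URL from the chunk above:
# fetch_once https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz
```

The check only looks at whether a non-empty file exists; it does not verify that an earlier download finished completely.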

```{bash}
pwd
```
```{bash}
##tests that the download and extraction above worked by printing the blastx help text

cd /home/shared/8TB_HDD_02/cberry1/applications

/home/shared/8TB_HDD_02/cberry1/applications/ncbi-blast-2.16.0+/bin/blastx -h
```

Start creating the NCBI BLAST database from UniProt data (https://www.uniprot.org/downloads).

```{bash}
#NOTE: use one "../" for each directory level you go backwards (e.g., if you are in the assignments directory and need to go backwards twice, use "../../").

pwd
```
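The "../" note can be tried safely on a throwaway directory tree. This is only a sketch: the folder names `assignments/code` and `blastdb` below are hypothetical stand-ins created in a temporary directory, not the real layout on the server.

```bash
# Build a throwaway directory tree (hypothetical names) to see how "../" behaves.
base="$(mktemp -d)"
mkdir -p "$base/assignments/code" "$base/blastdb"

cd "$base/assignments/code"

# One "../" climbs one level; "../../" climbs two.
one_up="$(cd .. && pwd -P)"
two_up="$(cd ../.. && pwd -P)"
echo "one level up:  $one_up"
echo "two levels up: $two_up"

# From assignments/code, the blastdb directory in this toy tree is two levels up.
ls -d ../../blastdb
```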

```{bash}
##make sure the directory is where it should be before running more code

cd /home/shared/8TB_HDD_02/cberry1/blastdb

curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

##mv renames the FASTA file, adding "r2025_01" as a release identifier; gunzip -k decompresses it while keeping the original .gz file

mv uniprot_sprot.fasta.gz uniprot_sprot_r2025_01.fasta.gz
gunzip -k uniprot_sprot_r2025_01.fasta.gz

##'ls ../blastdb' would also work, but we are already in the desired directory, so plain ls is enough

ls
```
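To see what the rename and `gunzip -k` steps leave behind without touching the real download, here is a small sketch on a made-up file (`demo.fasta` and its contents are invented for illustration):

```bash
# Work in a temporary directory with a tiny, invented FASTA file.
work="$(mktemp -d)"
cd "$work"
printf '>demo_protein\nMKT\n' > demo.fasta

gzip demo.fasta                          # leaves only demo.fasta.gz
mv demo.fasta.gz demo_r2025_01.fasta.gz  # rename to add a release tag
gunzip -k demo_r2025_01.fasta.gz         # -k keeps the .gz next to the unpacked file

# Both the compressed and the decompressed copy should now be listed.
ls
```

Keeping the `.gz` costs disk space but means the decompression can be repeated without downloading again.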

```{bash}
##head shows the first few lines of the UniProt file(s)

#NOTE: the * symbol acts as a wildcard, so 'uni*' matches every file whose name starts with "uni"

cd /home/shared/8TB_HDD_02/cberry1/blastdb
head uni*
```

```{bash}
pwd
```

```{bash}
##makeblastdb builds a protein BLAST database (-dbtype prot) from the UniProt FASTA file

#NOTE: a trailing '\' continues the command on the next line, so while it is displayed as separate lines, it is all one command. There must be no blank line after a '\'.

/home/shared/8TB_HDD_02/cberry1/applications/ncbi-blast-2.16.0+/bin/makeblastdb \
-in /home/shared/8TB_HDD_02/cberry1/blastdb/uniprot_sprot_r2025_01.fasta \
-dbtype prot \
-out /home/shared/8TB_HDD_02/cberry1/blastdb/uniprot_sprot_r2025_01
```
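Before running a search it can help to confirm that the database was actually written. A minimal sketch, assuming a helper named `blast_db_exists` (hypothetical) that looks for the protein index file `.pin`, or the `.pal` alias file that a multi-volume database would have:

```bash
# Hypothetical helper: succeed when a protein BLAST database index exists for
# the given prefix (.pin for a single volume, .pal for a multi-volume alias).
blast_db_exists() {
  [ -f "$1.pin" ] || [ -f "$1.pal" ]
}

# Database prefix used in this notebook.
db=/home/shared/8TB_HDD_02/cberry1/blastdb/uniprot_sprot_r2025_01

if blast_db_exists "$db"; then
  echo "database found: $db"
else
  echo "database missing: $db (run makeblastdb first)"
fi
```

This only checks that the files are present. The BLAST+ tool `blastdbcmd -db <prefix> -info` gives a fuller report, including the number of sequences.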

```{bash}
pwd
```

```{bash}
##curl downloads the "Ab" query sequences and saves them as a FASTA file in the data folder (-k skips the certificate check; '>' overwrites instead of appending, so re-running does not duplicate sequences)

curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \
-k > data/Ab_4denovo_CLC6_a.fa
```
```{bash}
pwd
```

```{bash}
##head prints the first few lines of the query file, echo prints the question, and grep -c counts the header lines (each sequence starts with ">")

head data/Ab_4denovo_CLC6_a.fa
echo "How many sequences are there?"
grep -c "^>" data/Ab_4denovo_CLC6_a.fa
```
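The counting step can be checked on a tiny file where the answer is known. The FASTA below is invented for illustration; it has two records, so the count should be 2:

```bash
# Write a two-record FASTA file (invented content) to a temporary directory.
work="$(mktemp -d)"
printf '>seq1 first record\nATGC\nGGCC\n>seq2 second record\nTTAA\n' > "$work/demo.fa"

# Count only lines that START with ">", i.e. one per sequence header.
n_seqs="$(grep -c '^>' "$work/demo.fa")"
echo "sequences: $n_seqs"
```

Anchoring with `^` matters: an unanchored, escaped `"\>"` is read by GNU grep as an end-of-word match rather than a literal `>`, so it can count sequence lines as well as headers.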
```{bash}
pwd
```

```{bash}
##runs blastx with the query sequences from the data directory against the UniProt database in blastdb, and writes the tabular results to the output directory

/home/shared/8TB_HDD_02/cberry1/applications/ncbi-blast-2.16.0+/bin/blastx \
-query data/Ab_4denovo_CLC6_a.fa \
-db /home/shared/8TB_HDD_02/cberry1/blastdb/uniprot_sprot_r2025_01 \
-out output/Ab_4-uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 1 \
-max_target_seqs 1 \
-outfmt 6
```
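With `-outfmt 6` the results file has no header row, so it helps to know the default column order. The sketch below pulls fields out of a single made-up hit (all values are invented for illustration):

```bash
# One hypothetical hit in -outfmt 6 layout (values invented for illustration).
hit="$(printf 'contig_1\tsp|P12345|DEMO_HUMAN\t85.0\t100\t15\t0\t1\t300\t1\t100\t1e-50\t200')"

# Default -outfmt 6 columns, in order:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
n_cols="$(printf '%s\n' "$hit" | awk -F'\t' '{ print NF }')"
subject="$(printf '%s\n' "$hit" | awk -F'\t' '{ print $2 }')"
evalue="$(printf '%s\n' "$hit" | awk -F'\t' '{ print $11 }')"

echo "columns: $n_cols  subject: $subject  e-value: $evalue"
```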

```{bash}
##head -2 prints the first two lines of the results file; wc -l counts its lines (one per hit)

head -2 output/Ab_4-uniprot_blastx.tab

wc -l output/Ab_4-uniprot_blastx.tab

#NOTE: the chunk below is written with {{bash}}, so it is shown in the knitted document but not run
```

``` {{bash}}
curl -O -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/search?query=reviewed:true+AND+organism_id:9606"
```

```{bash}
##change to the output directory and download the UniProt annotation table (reviewed entries, TSV) to add information to the BLAST results

#change directory
cd /home/shared/8TB_HDD_02/cberry1/Claudia-Halibut/assignments/output

curl -O -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength%2Cgo_f%2Cgo%2Cgo_p%2Cgo_c%2Cgo_id%2Ccc_interaction%2Cec%2Cxref_reactome%2Cxref_unipathway%2Cxref_interpro&format=tsv&query=%28%2A%29%20AND%20%28reviewed%3Atrue%29"
```

```{bash}
##print the first two lines (the number after head sets how many) of the BLAST results file and count its lines

head -2 output/Ab_4-uniprot_blastx.tab
wc -l output/Ab_4-uniprot_blastx.tab
```

```{bash}
##tr replaces every pipe character ('|') with a tab; head -5 then shows the first five lines of the result

pwd
tr '|' '\t' < /home/shared/8TB_HDD_02/cberry1/Claudia-Halibut/assignments/output/Ab_4-uniprot_blastx.tab | head -5
```
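Splitting on the pipe changes the column count, which is why the R code later refers to V3 and V13. A sketch on one made-up hit (values invented for illustration) shows where the accession and e-value end up:

```bash
# One hypothetical -outfmt 6 hit; the subject ID contains two pipes.
hit="$(printf 'contig_1\tsp|P12345|DEMO_HUMAN\t85.0\t100\t15\t0\t1\t300\t1\t100\t1e-50\t200')"

# Turning each pipe into a tab splits the subject ID into three columns,
# so 12 columns become 14.
split_hit="$(printf '%s\n' "$hit" | tr '|' '\t')"

n_cols="$(printf '%s\n' "$split_hit" | awk -F'\t' '{ print NF }')"
accession="$(printf '%s\n' "$split_hit" | awk -F'\t' '{ print $3 }')"
evalue_col13="$(printf '%s\n' "$split_hit" | awk -F'\t' '{ print $13 }')"

echo "columns: $n_cols  accession (col 3): $accession  e-value (col 13): $evalue_col13"
```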

```{bash}
##translates each pipe into a tab, then writes the results into a dedicated file

cd /home/shared/8TB_HDD_02/cberry1/Claudia-Halibut/assignments

tr '|' '\t' < output/Ab_4-uniprot_blastx.tab \
> output/Ab_4-uniprot_blastx_sep.tab
```

```{bash}
##output the first two lines of the tab-separated file created above

head -2 /home/shared/8TB_HDD_02/cberry1/Claudia-Halibut/assignments/output/Ab_4-uniprot_blastx_sep.tab
```

The following code joins the two tables (the BLAST results and the UniProt annotation table) to one another; only one output table is created.

```{r}
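library(tidyverse)
##install kableExtra only if it is missing, so knitting does not reinstall it every time
if (!requireNamespace("kableExtra", quietly = TRUE)) install.packages("kableExtra", repos = "https://cloud.r-project.org")
library(kableExtra)
library(dplyr) #already attached by tidyverse; loaded explicitly because mutate() below was throwing error messages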

##the current working directory, in R, is:
getwd()

##read the 'tab-separated' table:
bltabl <- read.csv("output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)

##reads the UniProt annotation table from Dr. Roberts' server; the download can be slow, but the code does run
spgo <- read.csv("https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE)

##shows the first few rows of the blastx table, then a formatted preview of the joined table to make sure that it IS what we want. str(bltabl) is an alternative to head(bltabl)

head(bltabl)
kbl(head (left_join(bltabl, spgo, by = c("V3" = "Entry")) %>%
 select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% 
  mutate(V1 = str_replace_all(V1, pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab"))
 )) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, 
           pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>%
  write_delim("/home/shared/8TB_HDD_02/cberry1/Claudia-Halibut/assignments/output/blast_annot_go.tab", delim = '\t')
```
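For a quick sanity check outside R, the same left-join idea can be sketched in the shell with awk. The two toy tables below are invented for illustration; only the column roles (accession in column 3 of the BLAST table, `Entry` as the first annotation column) follow the R code above.

```bash
work="$(mktemp -d)"

# Hypothetical two-row BLAST table (accession in column 3) and annotation table.
printf 'contig_1\tsp\tP12345\tDEMO_HUMAN\ncontig_2\tsp\tQ99999\tNONE_HUMAN\n' > "$work/blast.tab"
printf 'Entry\tProtein.names\nP12345\tDemo protein\n' > "$work/annot.tab"

# First pass stores the annotation by Entry (skipping the header row); second
# pass appends it to each BLAST row, writing NA when there is no match, which
# is how a left join treats unmatched rows.
awk -F'\t' -v OFS='\t' '
  NR == FNR { if (FNR > 1) ann[$1] = $2; next }
  { print $1, $3, (($3 in ann) ? ann[$3] : "NA") }
' "$work/annot.tab" "$work/blast.tab" > "$work/joined.tab"

cat "$work/joined.tab"
```

Every BLAST row is kept, as with `left_join()`; a hit with no annotation gets `NA` instead of being dropped.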