--- title: "BLAST, but make it look nice" author: "Sarah Yerrace" date: "2023-04-13" output: html_document: theme: readable highlight: tango toc: true toc_float: true number_sections: true code_folding: show code_download: true --- In this script, we will be taking genetic sequence reads and giving them taxonomic assignments using BLAST. This was originally written using Jupyter # Set Up here are the packages that are used in this script ```{r setup, include=TRUE, message=FALSE, warning=FALSE} library(knitr) library(tidyverse) library(kableExtra) library(DT) library(Biostrings) library(tm) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages fig.width = 6, # Set plot width in inches fig.height = 4, # Set plot height in inches fig.align = "center" # Align plots to the center ) ``` # Making the Blast Database ## download the file This is downloading the NCBI Blast software. I put it in a folder called applications. cd is indicating a new director and curl is downloading from the given website. ```{r, engine = 'bash'} cd /home/jovyan/applications curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.13.0+-x64-linux.tar.gz ``` This unzips the file ```{r, engine = 'bash'} tar -xf ncbi-blast-2.13.0+-x64-linux.tar.gz ``` Check that the software is working by using the help function (-h) ```{r, engine = 'bash'} ~/applications/ncbi-blast-2.13.0+/bin/blastx -h ``` ## Make a database This is putting our sequences in a folder named Data. using ls will return all a list of files in the folder so we know that they went to the right place ```{r, engine = 'bash'} cd ../data curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz gunzip -k uniprot_sprot_r2023_01.fasta.gz ls ../data ``` ```{r, engine = 'bash'} ~/applications/ncbi-blast-2.13.0+/bin/makeblastdb \ -in ../data/uniprot_sprot_r2023_01.fasta \ -dbtype prot \ -out ../blastdb/uniprot_sprot_r2023_01 ``` ## Get the query sequence This code downloads a query sequence and saves it in the data folder ```{r, engine = 'bash'} curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \ -k \ > ../data/Ab_4denovo_CLC6_a.fa ``` The head function allows us to look at the first few lines of the data. This code will also return the number of sequences in our data. ```{r, engine = 'bash'} head ../data/Ab_4denovo_CLC6_a.fa echo "How many sequences are there?" grep -c ">" ../data/Ab_4denovo_CLC6_a.fa ``` # Run Blast This section uses the NCBI Blast software using the query sequence and database we downloaded. It creates an output file (.tab) with the results. ```{r, engine = 'bash'} ~/applications/ncbi-blast-2.13.0+/bin/blastx \ -query ../data/Ab_4denovo_CLC6_a.fa \ -db ../blastdb/uniprot_sprot_r2023_01 \ -out ../output/Ab_4-uniprot_blastx.tab \ -evalue 1E-20 \ -num_threads 20 \ -max_target_seqs 1 \ -outfmt 6 ``` ## Add Annotation information Start by changing the format of blast output. This next section replaces "\|" with "\t" and saves a new output file. This change will make it easier to join with out second table. ```{r, engine = 'bash'} tr '|' '\t' < ../output/Ab_4-uniprot_blastx.tab \ > ../output/Ab_4-uniprot_blastx_sep.tab ``` This code reads the CSVs and stores them as objects ```{r, engine = 'bash'} bltabl <- read.csv("../output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE) spgo <- read.csv("https://gannet.fish.washington.edu/seashell/snaps/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE) ``` This chunk joins the two CSVs together in one table. There are many options for exploring this data set, this is just one of them. ```{r, engine = 'bash'} left_join(bltabl, spgo, by = c("V3" = "Entry")) %>% select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>% write_delim("../output/blast_annot_go.tab", delim = '\t') ```