--- title: "Protein Annotation with workflow-annotation" author: "Your Name" date: today format: html: toc: true toc-depth: 3 code-fold: false embed-resources: true pdf: toc: true number-sections: true execute: warning: false message: false cache: false --- ## Overview This document demonstrates how to use the `workflow-annotation` pipeline to annotate protein or nucleotide sequences with Swiss-Prot best hits, UniProt annotations, and GO-Slim terms. ## Setup ### Clone the Repository First, let's clone the workflow repository and navigate to it: ```{bash} #| label: clone-repo #| eval: false git clone https://github.com/sr320/workflow-annotation.git cd workflow-annotation ``` ### Check Requirements Verify that all required tools are available: ```{bash} #| label: check-requirements #| eval: false # Check BLAST+ installation which makeblastdb blastx blastp # Check Python and required packages python3 -c "import requests, pandas, goatools; print('All packages available')" ``` If packages are missing, install them: ```{bash} #| label: install-packages #| eval: false python3 -m pip install requests pandas goatools ``` ## Example 1: Annotating Protein Sequences ### Download Example Data Let's download a sample protein FASTA file: ```{bash} #| label: download-proteins #| eval: false wget -O example_proteins.faa \ "https://raw.githubusercontent.com/urol-e5/timeseries_molecular/refs/heads/main/D-Apul/data/Machinery.fasta" # Check the file head -20 example_proteins.faa wc -l example_proteins.faa ``` ### Run Protein Annotation Now let's annotate these protein sequences: ```{bash} #| label: run-protein-annotation #| eval: false # Run the annotation pipeline bash blast2slim.sh -i example_proteins.faa --protein -o protein_example --threads 8 # Check the output structure ls -la output/protein_example/run_*/ ``` ### Examine Results Let's look at the main output file: ```{bash} #| label: check-protein-results #| eval: false # Find the latest run directory LATEST_RUN=$(ls -1t output/protein_example/run_* | head -1) echo "Latest run: $LATEST_RUN" # Look at the main results file head -5 "$LATEST_RUN/annotation_with_goslim.tsv" echo "---" tail -5 "$LATEST_RUN/annotation_with_goslim.tsv" # Count annotated sequences wc -l "$LATEST_RUN/annotation_with_goslim.tsv" ``` ## Example 2: Annotating Nucleotide Sequences ### Create Example Nucleotide Data For demonstration, let's create a small nucleotide FASTA file: ```{bash} #| label: create-nucleotides #| eval: false cat > example_transcripts.fasta << 'EOF' >transcript_001 ATGGCGAAGGTGCTGGACGAGCTGACCGAGGAGCTGCTGAAGGGCGACGAGGTGGTGATCGAG GTGACCGAGGGCGACGAGGTGGTGATCGAGGTGACCGAGGGCGACGAGGTGGTGATCGAG >transcript_002 ATGGTGAAGGTGCTGGACGAGCTGACCGAGGAGCTGCTGAAGGGCGACGAGGTGGTGATCGAG GTGACCGAGGGCGACGAGGTGGTGATCGAGGTGACCGAGGGCGACGAGGTGGTGATCGAG >transcript_003 ATGCTGAAGGTGCTGGACGAGCTGACCGAGGAGCTGCTGAAGGGCGACGAGGTGGTGATCGAG GTGACCGAGGGCGACGAGGTGGTGATCGAGGTGACCGAGGGCGACGAGGTGGTGATCGAG EOF # Check the file cat example_transcripts.fasta ``` ### Run Nucleotide Annotation (BLASTX) ```{bash} #| label: run-nucleotide-annotation #| eval: false # Run BLASTX annotation (default for nucleotide sequences) bash blast2slim.sh -i example_transcripts.fasta -o nucleotide_example --threads 4 # Check results ls -la output/nucleotide_example/run_*/ ``` ## Data Analysis in R Now let's load and analyze the results in R: ```{r} #| label: setup-r library(readr) library(dplyr) library(ggplot2) library(tidyr) library(knitr) ``` ### Load Annotation Results ```{r} #| label: load-data #| eval: false # Find the latest protein annotation results protein_files <- list.files("output/protein_example", pattern = "annotation_with_goslim.tsv", recursive = TRUE, full.names = TRUE) if (length(protein_files) > 0) { # Load the most recent results protein_results <- read_tsv(protein_files[1], show_col_types = FALSE) cat("Loaded", nrow(protein_results), "annotated sequences\n") cat("Columns:", ncol(protein_results), "\n") # Show structure glimpse(protein_results) } else { cat("No annotation results found. Please run the annotation pipeline first.\n") } ``` ### Summary Statistics ```{r} #| label: summary-stats #| eval: false if (exists("protein_results")) { # Basic statistics cat("=== Annotation Summary ===\n") cat("Total sequences:", nrow(protein_results), "\n") cat("Sequences with UniProt hits:", sum(!is.na(protein_results$accession)), "\n") cat("Sequences with GO annotations:", sum(!is.na(protein_results$go_ids)), "\n") cat("Sequences with GO-Slim terms:", sum(!is.na(protein_results$goslim_ids)), "\n") # Top organisms cat("\n=== Top Organisms ===\n") protein_results %>% count(organism, sort = TRUE) %>% filter(!is.na(organism)) %>% head(10) %>% kable() } ``` ### GO-Slim Analysis ```{r} #| label: goslim-analysis #| eval: false if (exists("protein_results")) { # Analyze GO-Slim categories goslim_counts <- protein_results %>% filter(!is.na(goslim_names), goslim_names != "") %>% separate_rows(goslim_names, sep = "; ") %>% count(goslim_names, sort = TRUE) %>% filter(goslim_names != "") cat("=== Top GO-Slim Categories ===\n") print(kable(head(goslim_counts, 15))) # Plot top categories if (nrow(goslim_counts) > 0) { p <- goslim_counts %>% head(15) %>% ggplot(aes(x = reorder(goslim_names, n), y = n)) + geom_col(fill = "steelblue", alpha = 0.7) + coord_flip() + labs(title = "Top GO-Slim Categories", x = "GO-Slim Term", y = "Number of Sequences") + theme_minimal() print(p) } } ``` ### BLAST Statistics ```{r} #| label: blast-stats #| eval: false if (exists("protein_results")) { # BLAST hit quality blast_stats <- protein_results %>% filter(!is.na(pident)) %>% summarise( mean_identity = mean(pident, na.rm = TRUE), median_identity = median(pident, na.rm = TRUE), mean_evalue = mean(as.numeric(evalue), na.rm = TRUE), sequences_with_hits = n() ) cat("=== BLAST Hit Quality ===\n") cat("Mean percent identity:", round(blast_stats$mean_identity, 2), "%\n") cat("Median percent identity:", round(blast_stats$median_identity, 2), "%\n") cat("Mean E-value:", format(blast_stats$mean_evalue, scientific = TRUE), "\n") # Plot identity distribution p2 <- protein_results %>% filter(!is.na(pident)) %>% ggplot(aes(x = pident)) + geom_histogram(bins = 30, fill = "darkgreen", alpha = 0.7) + labs(title = "Distribution of BLAST Percent Identity", x = "Percent Identity (%)", y = "Number of Sequences") + theme_minimal() print(p2) } ``` ## Output Files The pipeline generates several useful files: ```{bash} #| label: list-outputs #| eval: false echo "=== Output Files ===" find output/ -name "*.tsv" -o -name "*.py" -o -name "*.obo" | head -10 ``` ### File Descriptions - **`annotation_with_goslim.tsv`**: Main results with GO-Slim terms - **`annotation_full_go.tsv`**: Results with full GO annotations - **`*.blast.tsv`**: Raw BLAST results - **`postprocess_uniprot_go.py`**: Generated processing script - **`go-basic.obo`, `goslim_generic.obo`**: Ontology files ## Advanced Usage ### Batch Processing Multiple Files ```{bash} #| label: batch-processing #| eval: false # Process multiple FASTA files mkdir -p input_data # Example: process all FASTA files in a directory for fasta_file in input_data/*.fasta; do if [ -f "$fasta_file" ]; then basename=$(basename "$fasta_file" .fasta) echo "Processing $basename..." bash blast2slim.sh -i "$fasta_file" -o "batch_${basename}" --threads 4 fi done ``` ### Custom Database Location ```{bash} #| label: custom-database #| eval: false # Use a shared database location mkdir -p /tmp/shared_blastdb bash blast2slim.sh -i example_proteins.faa --protein \ --dbdir /tmp/shared_blastdb -o custom_db_example ``` ## Session Information ```{r} #| label: session-info sessionInfo() ``` ## Conclusion This workflow provides a streamlined approach to annotating sequences with: 1. **Swiss-Prot best hits** via BLAST 2. **UniProt protein annotations** via REST API 3. **GO term mappings** to functional categories 4. **GO-Slim terms** for high-level functional classification The output TSV files can be easily integrated into downstream analyses, visualization, and reporting workflows. --- **Next Steps:** - Modify parameters for your specific needs - Integrate results with other omics data - Create publication-ready figures and tables