# Ortholog Group Functional Annotation

## Overview

This directory contains scripts for annotating ortholog groups identified in the previous orthology analysis with comprehensive functional information. The annotation approach integrates multiple databases and tools to provide detailed functional characterization of orthologous proteins across three coral species.

## Approach

### 1. **Multi-Source Annotation Strategy**

The annotation pipeline combines information from multiple databases and tools:

- **InterProScan**: Comprehensive protein family and domain analysis
- **Swiss-Prot**: Curated protein annotations and functions
- **Pfam**: Protein domain families
- **Gene Ontology (GO)**: Biological processes, molecular functions, cellular components
- **KEGG**: Metabolic and signaling pathways
- **InterPro**: Integrated protein family and domain signatures

### 2. **Representative Sequence Selection**

For each ortholog group, a representative protein sequence is selected:

- **Three-way orthologs**: Use Acropora pulchra (Apul) sequence as representative
- **Two-way orthologs**: Use available sequence from the species pair
- **Rationale**: Ensures consistent annotation across ortholog groups while maximizing coverage

### 3. **Annotation Integration**

Annotations from multiple sources are integrated into a comprehensive database:

- **Cross-reference mapping**: Links ortholog groups to functional annotations
- **Quality filtering**: Applies confidence thresholds for annotation reliability
- **Comprehensive coverage**: Combines multiple annotation types for complete functional characterization

## Scripts

### 1. **11.3-ortholog-annotation.Rmd** (R Markdown)

Main annotation script with comprehensive analysis and visualization capabilities.

**Features:**

- Loads ortholog groups from previous analysis
- Extracts representative protein sequences
- Runs InterProScan and BLAST analyses
- Integrates annotations from multiple sources
- Creates comprehensive visualizations
- Generates functional enrichment analysis
- Produces annotation databases for downstream analysis

**Usage:**

```r
# Run in RStudio or R console
rmarkdown::render("11.3-ortholog-annotation.Rmd")
```

### 2. **11.3-ortholog-annotation.py** (Python)

Complementary Python script with additional functionality and utilities.

**Features:**

- Object-oriented annotation pipeline
- Robust error handling and logging
- Additional visualization options
- Flexible annotation database creation
- Command-line interface

**Usage:**

```bash
python3 11.3-ortholog-annotation.py
```

### 3. **run_ortholog_annotation.sh** (Shell Script)

Automated pipeline runner with dependency checking and error handling.

**Features:**

- Comprehensive dependency checking
- Automated pipeline execution
- Detailed logging and error reporting
- Results validation
- Summary report generation

**Usage:**

```bash
./run_ortholog_annotation.sh
```

## Input Requirements

### Required Files

1. **Ortholog Groups**: `../output/11-orthology-analysis/ortholog_groups.csv`
   - Output from previous orthology analysis
   - Contains ortholog group assignments and gene mappings

2. **Protein Sequences**:
   - `../../D-Apul/data/Apulchra-genome.pep.faa` (Acropora pulchra)
   - `../../E-Peve/data/Porites_evermanni_v1.annot.pep.fa` (Porites evermanni)
   - `../../F-Ptua/data/Pocillopora_meandrina_HIv1.genes.pep.faa` (Pocillopora tuahiniensis)

### Dependencies

- **R**: Core analysis and visualization
- **Python3**: Additional utilities and processing
- **BLAST+**: Swiss-Prot database searches
- **InterProScan**: Protein family and domain analysis
- **R Packages**: tidyverse, Biostrings, ggplot2, VennDiagram, pheatmap
- **Python Packages**: pandas, numpy, Bio, matplotlib, seaborn

## Output Files

### Core Results

- `integrated_annotations.csv` - Complete annotation table
- `annotation_summary_report.csv` - Summary statistics
- `functional_enrichment_analysis.csv` - Enriched functional terms

### Database Files

- `ortholog_annotations_database.csv` - Annotation database
- `gene_to_ortholog_mapping.csv` - Gene ID to ortholog group mapping
- `pfam_terms_database.csv` - Pfam domains database
- `go_terms_database.csv` - GO terms database
- `kegg_pathways_database.csv` - KEGG pathways database

### Analysis Results

- `pfam_domain_analysis.csv` - Pfam domain distribution analysis
- `go_term_analysis.csv` - GO term distribution analysis

### Visualizations

- `pfam_domain_distribution.png` - Top Pfam domains
- `go_term_distribution.png` - Top GO terms
- `annotation_coverage.png` - Annotation coverage by ortholog type
- `functional_enrichment.png` - Enriched functional terms

### Raw Data

- `representative_sequences.faa` - Representative protein sequences
- `interproscan_results.tsv` - Raw InterProScan results
- `swissprot_blast.tsv` - Raw BLAST results

## Annotation Process

### Step 1: Sequence Extraction

1. Load ortholog groups from previous analysis
2. Extract representative protein sequences for each ortholog group
3. Create FASTA file for annotation tools

### Step 2: Functional Annotation

1. **InterProScan Analysis**:
   - Run InterProScan on representative sequences
   - Extract Pfam domains, GO terms, KEGG pathways
   - Parse results for integration

2. **Swiss-Prot BLAST**:
   - Download Swiss-Prot database (if needed)
   - Run BLAST searches against Swiss-Prot
   - Extract best hits and descriptions

### Step 3: Annotation Integration

1. Combine annotations from multiple sources
2. Create comprehensive annotation table
3. Generate functional term databases
4. Create gene-to-ortholog mappings

### Step 4: Analysis and Visualization

1. Analyze functional distribution
2. Perform functional enrichment analysis
3. Create visualizations
4. Generate summary reports

## Applications

The annotated ortholog groups can be used for:

### 1. **Comparative Functional Genomics**

- Understand functional conservation across coral species
- Identify species-specific functional adaptations
- Study functional evolution of gene families

### 2. **Pathway Analysis**

- Identify conserved metabolic and signaling pathways
- Study pathway evolution across species
- Understand pathway-specific adaptations

### 3. **Gene Set Enrichment Analysis**

- Find overrepresented functional terms in expression studies
- Identify biological processes affected by experimental conditions
- Study functional responses to environmental stress

### 4. **Cross-Species Expression Analysis**

- Compare expression of functionally annotated orthologs
- Study expression conservation and divergence
- Identify expression patterns associated with functional categories

### 5. **Evolutionary Analysis**

- Study functional evolution of gene families
- Identify functional innovations and losses
- Understand evolutionary constraints on function

## Quality Control

### Annotation Confidence

- **E-value thresholds**: 1e-5 for BLAST searches
- **Coverage requirements**: Minimum 50% sequence coverage
- **Identity thresholds**: Minimum 30% sequence identity
- **Multiple evidence**: Require multiple annotation sources when possible

### Validation Steps

- Check annotation coverage across ortholog types
- Validate functional term distributions
- Verify cross-species annotation consistency
- Assess annotation quality metrics

## Troubleshooting

### Common Issues

1. **Missing Dependencies**: Install required software and packages
2. **File Path Issues**: Verify input file locations and permissions
3. **Memory Issues**: Use subset of data for testing
4. **InterProScan Errors**: Check InterProScan installation and configuration

### Performance Optimization

- Use parallel processing where available
- Process data in batches for large datasets
- Optimize memory usage for large protein sets
- Use efficient data structures for annotation storage

## Citation

If you use these annotations in your research, please cite:

- **InterProScan**: Mitchell et al. (2019) Nucleic Acids Research
- **Swiss-Prot**: Bairoch & Apweiler (2000) Nucleic Acids Research
- **Pfam**: El-Gebali et al. (2019) Nucleic Acids Research
- **Gene Ontology**: Ashburner et al. (2000) Nature Genetics
- **KEGG**: Kanehisa & Goto (2000) Nucleic Acids Research

## Contact

For questions or issues with the annotation pipeline, please contact the multi-species analysis team.

---

*This annotation approach provides a comprehensive foundation for functional genomics studies across coral species, enabling detailed comparative analysis of gene function and evolution.*
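
As an illustration of the Annotation Confidence thresholds listed under Quality Control, the sketch below shows one way the raw BLAST table could be filtered down to a single best hit per representative sequence. It is not taken from the pipeline scripts. The column layout of `swissprot_blast.tsv` is not stated in this README, so the `COLUMNS` list (matching `-outfmt "6 qseqid sseqid pident qcovs evalue bitscore"`), the function name `best_hits`, the choice of bit score as the tie-breaker, and the toy rows are all assumptions made for the example.

```python
import csv
import io

# Thresholds from the Quality Control section of this README.
MAX_EVALUE = 1e-5     # hits with a larger E-value are dropped
MIN_COVERAGE = 50.0   # minimum percent of the query covered
MIN_IDENTITY = 30.0   # minimum percent identity

# ASSUMPTION: hypothetical column order; the real swissprot_blast.tsv
# layout is not documented here and may differ.
COLUMNS = ["qseqid", "sseqid", "pident", "qcovs", "evalue", "bitscore"]


def best_hits(handle):
    """Return {qseqid: hit} keeping the highest-bitscore hit that passes all thresholds."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        hit = {
            "sseqid": row["sseqid"],
            "pident": float(row["pident"]),
            "qcovs": float(row["qcovs"]),
            "evalue": float(row["evalue"]),
            "bitscore": float(row["bitscore"]),
        }
        # Apply the three confidence filters.
        if hit["evalue"] > MAX_EVALUE:
            continue
        if hit["qcovs"] < MIN_COVERAGE or hit["pident"] < MIN_IDENTITY:
            continue
        # Keep only the strongest surviving hit for each query.
        current = best.get(row["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = hit
    return best


# Toy input (made-up identifiers and scores, not real results).
example = io.StringIO(
    "OG_1\tsp|P12345|AAA\t45.2\t80\t1e-30\t210\n"
    "OG_1\tsp|Q99999|BBB\t62.0\t91\t1e-50\t305\n"
    "OG_2\tsp|P00001|CCC\t28.0\t95\t1e-20\t150\n"  # fails the identity filter
    "OG_3\tsp|P00002|DDD\t55.0\t40\t1e-12\t120\n"  # fails the coverage filter
)
hits = best_hits(example)
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["bitscore"])
```

To try this on real output, the `io.StringIO` object would be replaced with an open file handle, and `COLUMNS` adjusted to whatever format the BLAST step actually writes.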