# WGCNA Module Annotation ## Overview This script annotates WGCNA modules with functional information by joining module assignments with ortholog group annotations. ## Script `19-annotate-wgcna-modules.py` ## Purpose The script performs the following tasks: 1. Joins `wgcna_ortholog_module_assignments.csv` with `ortholog_groups_annotated.csv` on OG ID (group_id) 2. Summarizes dominant GO processes, gene functions, and physiological pathways for each WGCNA module 3. Generates comprehensive annotation reports in the output directory ## Input Files - **WGCNA Module Assignments**: `M-multi-species/output/18-ortholog-wgcna/wgcna_ortholog_module_assignments.csv` - Contains ortholog group IDs and their assigned WGCNA modules - Format: `group_id,wgcna_module` (e.g., `OG_00705,7`) - **Ortholog Annotations**: `M-multi-species/output/12-ortho-annot/ortholog_groups_annotated.csv` - Contains functional annotations for ortholog groups - Includes GO terms (BP, CC, MF), GO Slim terms, and protein names ## Output Files All output files are saved to: `M-multi-species/output/18-ortholog-wgcna/` ### 1. Detailed Text Report - **File**: `wgcna_module_annotation_summary.txt` - **Content**: Comprehensive summary for each WGCNA module including: - Total orthologs and annotation coverage - Top 10 GO Biological Process terms - Top 10 GO Cellular Component terms - Top 10 GO Molecular Function terms - Top 10 GO Slim terms (physiological pathways) - Top 10 protein names (gene functions) ### 2. Module Overview - **File**: `wgcna_module_overview.csv` - **Content**: Summary statistics for all modules in tabular format - **Columns**: - `module`: Module ID (0-14) - `num_orthologs`: Total orthologs in module - `num_annotated`: Orthologs with functional annotations - `annotation_coverage_%`: Percentage of orthologs with annotations - `num_go_bp_terms`: Number of unique GO BP terms - `num_go_cc_terms`: Number of unique GO CC terms - `num_go_mf_terms`: Number of unique GO MF terms - `num_goslim_terms`: Number of unique GO Slim terms - `num_unique_proteins`: Number of unique protein names - `top_go_bp`: Most common GO BP term - `top_goslim`: Most common GO Slim term - `top_protein`: Most common protein name ### 3. GO Term Summary Files Each file contains term counts across all modules: - **`wgcna_module_go_bp_summary.csv`**: GO Biological Process terms - **`wgcna_module_go_cc_summary.csv`**: GO Cellular Component terms - **`wgcna_module_go_mf_summary.csv`**: GO Molecular Function terms - **`wgcna_module_goslim_summary.csv`**: GO Slim terms (physiological pathways) - **`wgcna_module_protein_summary.csv`**: Protein names (gene functions) **Format**: Each file has columns: - `term`: The GO term or protein name - `module_0` through `module_14`: Count of term occurrences in each module - `total`: Total occurrences across all modules Terms are sorted by total count (descending), showing the top 50 most common terms. ## Usage ### Run the annotation script: ```bash cd M-multi-species/scripts python3 19-annotate-wgcna-modules.py ``` ### Test the output: ```bash cd M-multi-species/scripts python3 test_wgcna_annotation.py ``` ## Requirements - Python 3.10+ - pandas Install dependencies: ```bash pip3 install pandas ``` ## Module Statistics Based on current data: - **Total WGCNA Modules**: 15 (numbered 0-14) - **Total Orthologs**: 9,827 across all modules - **Annotated Orthologs**: 3,801 (38.7% coverage) - **Largest Module**: Module 1 with 2,649 orthologs - **Highest Annotation Coverage**: Module 5 with 47.9% coverage ## Dominant Functional Themes The script identifies dominant functional themes for each module: 1. **GO Biological Process**: Cellular and molecular processes (e.g., transcription regulation, cell division, proteolysis) 2. **GO Cellular Component**: Subcellular localization (e.g., nucleus, cytoplasm, plasma membrane) 3. **GO Molecular Function**: Molecular activities (e.g., ATP binding, protein binding, catalytic activity) 4. **GO Slim Terms**: High-level physiological pathways (e.g., organelle function, catalytic activity, development) 5. **Protein Names**: Specific gene functions based on protein annotations ## Notes - Modules are numbered 0-14 (15 modules total) - Not all orthologs have functional annotations (~39% annotation coverage overall) - Terms are counted with multiplicity - if an ortholog has multiple GO terms, each is counted - GO Slim terms provide a higher-level summary of physiological pathways compared to detailed GO terms