# Component Gene List Annotation Script

## Overview

The `27-annotate-component-genes.py` script annotates component gene CSV files with ortholog group annotations from the comprehensive ortholog database.

## What it does

1. **Reads all CSV files** from `M-multi-species/output/26-rank35-optimization/lambda_gene_0.2/top_genes_per_component/`
2. **Joins each file** with `M-multi-species/output/12-ortho-annot/ortholog_groups_annotated.csv` based on OG ID (ortholog group identifier)
3. **Creates annotated files** with suffix `_annotation.csv` containing:
   - Original data (OG ID and component values)
   - Ortholog information (apul, peve, ptua gene IDs)
   - Protein annotations (protein name, organism, UniProt IDs)
   - GO annotations (GO biological process, cellular component, molecular function)
   - GOSlim annotations
4. **Generates a summary table** (`annotation_summary.md`) showing:
   - Total genes in each file
   - Number of annotated genes
   - Predominant GO Biological Processes (top 3)
   - Predominant Gene Functions (top 3)

## Usage

```bash
cd M-multi-species/scripts
python3 27-annotate-component-genes.py
```

The script will:
- Process all 35 CSV files in the top_genes_per_component directory
- Create 35 corresponding `_annotation.csv` files
- Generate an `annotation_summary.md` markdown table

## Output Files

### Annotation Files
Each component CSV file (e.g., `Component_1_top100.csv`) gets a corresponding annotation file (`Component_1_top100_annotation.csv`) with the following columns:

1. `group_id` - Ortholog group ID (e.g., OG_00531)
2. Original data columns (varies by file, typically component values)
3. `apul` - A. pulchra gene ID
4. `peve` - P. evermanni gene ID  
5. `ptua` - P. tuahiniensis gene ID
6. `type` - Ortholog type (e.g., three_way)
7. `avg_identity` - Average identity percentage
8. `query` - Query gene ID
9. `accession` - UniProt accession
10. `id` - UniProt ID
11. `reviewed` - UniProt review status
12. `protein_name` - Protein name and description
13. `organism` - Source organism
14. `pident`, `length`, `evalue`, `bitscore` - BLAST statistics
15. `title` - Full protein title
16. `go_ids` - GO term IDs
17. `go_bp` - GO Biological Process descriptions
18. `go_cc` - GO Cellular Component descriptions
19. `go_mf` - GO Molecular Function descriptions
20. `goslim_ids` - GOSlim IDs
21. `goslim_names` - GOSlim term names

### Summary Table
The `annotation_summary.md` file contains a markdown table with one row per input file showing:
- File name
- Total number of genes
- Number of genes with annotations
- Top 3 most common GO Biological Processes
- Top 3 most common Gene Functions

## Example

For a file like `Component_1_top100.csv` containing:
```csv
OG_ID,Component_1
OG_03116,0.39079580706554823
OG_02366,0.3561842607520677
...
```

The output `Component_1_top100_annotation.csv` will contain:
```csv
group_id,Component_1,apul,peve,ptua,type,avg_identity,...,protein_name,...,go_bp,...
OG_03116,0.39079580706554823,FUN_012430-T1,Peve_00026925,...,three_way,55.942,...,Phenolphthiocerol/phthiocerol polyketide synthase,...,...
OG_02366,0.3561842607520677,FUN_008973-T1,Peve_00044338,...,three_way,66.187,...,Potassium voltage-gated channel,...,...
...
```

## Dependencies

- Python 3.10+
- pandas

## Notes

- The script automatically skips files that end with `_annotation.csv` to avoid re-processing
- Genes without annotations will have empty values in the annotation columns
- The script uses a left join, so all original genes are retained even if they lack annotations
- The script is modeled after `23-annotate-rank-lists.py` which performs a similar function for rank output files