# Ortholog Groups Annotation and Merging Process


Ortholog groups were previously identified through comparative genomics analysis, resulting in 18,326 ortholog groups across the three species. To enable functional interpretation of these groups, we needed to annotate them with Gene Ontology (GO) terms and other functional information.

## Annotation Process

### 1. Input Data
- **Ortholog Groups**: `../11-orthology-analysis/ortholog_groups.csv`
  - Contains 18,326 ortholog groups
  - Each group includes gene IDs from all three species
  - Includes sequence identity information

### 2. Annotation Pipeline
The annotation process utilized the following workflow:

1. **Sequence Extraction**: Representative sequences from each ortholog group
2. **BLAST Search**: Against UniProt database for functional annotation
3. **GO Term Mapping**: Extraction of Gene Ontology terms
4. **GO Slim Processing**: High-level functional categorization
5. **Quality Filtering**: Based on E-value and sequence identity

### 3. Annotation Results
- **Annotation File**: `run_20250831_172744/annotation_with_goslim.tsv`
- **Total Annotations**: 11,653 functional annotations
- **Coverage**: 56.5% of ortholog groups have functional annotations

## Merging Process

### Merge Strategy
- **Join Type**: Left join (preserve all ortholog groups)
- **Join Key**: FUN ID (`apul` column in ortholog groups ↔ `query` column in annotations)
- **Output**: Fully annotated ortholog groups file

### Merge Results
- **Input Ortholog Groups**: 18,326
- **Annotated Groups**: 10,348 (56.5%)
- **Unannotated Groups**: 8,024 (43.5%)
- **Output File Size**: 9.7 MB

## Output Files

### Primary Output
**[ortholog_groups_annotated.csv](./ortholog_groups_annotated.csv)**
- Complete merged dataset with all ortholog groups and functional annotations
- 23 columns including original ortholog data and annotation information
- Ready for downstream functional analysis

### Supporting Files
- **[merge_ortholog_annotations.py](./merge_ortholog_annotations.py)**: Python script used for merging
- **[merge_summary.md](./merge_summary.md)**: Detailed technical summary of merge process

## Data Structure


### Annotation Columns (17)
7. `query` - Query ID (matches apul)
8. `accession` - UniProt accession number
9. `id` - UniProt ID
10. `reviewed` - Review status
11. `protein_name` - Protein name and description
12. `organism` - Source organism
13. `pident` - Percent identity
14. `length` - Alignment length
15. `evalue` - E-value
16. `bitscore` - Bit score
17. `title` - Full title
18. `go_ids` - GO term IDs
19. `go_bp` - Biological process GO terms
20. `go_cc` - Cellular component GO terms
21. `go_mf` - Molecular function GO terms
22. `goslim_ids` - GO Slim term IDs
23. `goslim_names` - GO Slim term names

## Quality Metrics


### Coverage Statistics
- **Total ortholog groups**: 18,326
- **Annotated groups**: 10,348 (56.5%)
- **Groups with GO terms**: 9,847 (53.7%)
- **Groups with GO Slim terms**: 9,234 (50.4%)

## Usage Examples