# Gene List Annotation Script ## Overview The `23-annotate-rank-lists.py` script annotates rank CSV files with ortholog group annotations from the comprehensive ortholog database. ## What it does 1. **Reads all CSV files** from `M-multi-species/output/22-Visualizing-Rank-outs/` 2. **Joins each file** with `M-multi-species/output/12-ortho-annot/ortholog_groups_annotated.csv` based on OG ID (ortholog group identifier) 3. **Creates annotated files** with suffix `_annotation.csv` containing: - Original data (OG ID and values) - Ortholog information (apul, peve, ptua gene IDs) - Protein annotations (protein name, organism, UniProt IDs) - GO annotations (GO biological process, cellular component, molecular function) - GOSlim annotations 4. **Generates a summary table** (`annotation_summary.md`) showing: - Total genes in each file - Number of annotated genes - Predominant GO Biological Processes (top 3) - Predominant Gene Functions (top 3) ## Usage ```bash python3 M-multi-species/scripts/23-annotate-rank-lists.py ``` The script will: - Process all 524 CSV files in the rank-outs directory - Create 524 corresponding `_annotation.csv` files - Generate an `annotation_summary.md` markdown table ## Output Files ### Annotation Files Each rank CSV file (e.g., `rank_55_comp49_top100.csv`) gets a corresponding annotation file (`rank_55_comp49_top100_annotation.csv`) with the following columns: 1. `group_id` - Ortholog group ID (e.g., OG_00531) 2. Original data columns (varies by file) 3. `apul` - A. pulchra gene ID 4. `peve` - P. evermanni gene ID 5. `ptua` - P. tuahiniensis gene ID 6. `type` - Ortholog type (e.g., three_way) 7. `avg_identity` - Average identity percentage 8. `query` - Query gene ID 9. `accession` - UniProt accession 10. `id` - UniProt ID 11. `reviewed` - UniProt review status 12. `protein_name` - Protein name and description 13. `organism` - Source organism 14. `pident`, `length`, `evalue`, `bitscore` - BLAST statistics 15. `title` - Full protein title 16. `go_ids` - GO term IDs 17. `go_bp` - GO Biological Process descriptions 18. `go_cc` - GO Cellular Component descriptions 19. `go_mf` - GO Molecular Function descriptions 20. `goslim_ids` - GOSlim IDs 21. `goslim_names` - GOSlim term names ### Summary Table The `annotation_summary.md` file contains a markdown table with one row per input file showing: - File name - Total number of genes - Number of genes with annotations - Top 3 most common GO Biological Processes - Top 3 most common Gene Functions ## Example For a file like `rank_55_comp49_top100.csv` containing: ```csv OG,Component_49 OG_00531,0.879 OG_07144,0.811 ... ``` The output `rank_55_comp49_top100_annotation.csv` will contain: ```csv group_id,Component_49,apul,peve,ptua,type,avg_identity,...,protein_name,...,go_bp,... OG_00531,0.879,FUN_001545-T1,Peve_00033291,...,three_way,49.81,...,Protein name,...,GO processes,... OG_07144,0.811,FUN_032474-T1,Peve_00036450,...,three_way,42.55,...,Protein name,...,GO processes,... ... ``` ## Dependencies - Python 3.x - pandas ## Notes - The script automatically skips files that end with `_annotation.csv` to avoid re-processing - Genes without annotations will have empty values in the annotation columns - The script uses a left join, so all original genes are retained even if they lack annotations