# GO Slim Term Summarization

## Script: 28-summarize-goslim-terms.py

This script summarizes GO Slim term occurrences across component gene annotation files.

### Purpose

Counts and summarizes the occurrences of GO Slim terms from the `goslim_names` column across all annotation files in the top genes per component directory.

### Input

- All CSV files containing "annotation" in their filename from:
  - `M-multi-species/output/26-rank35-optimization/lambda_gene_0.2/top_genes_per_component/`

### Output

Creates a single CSV file:
- `M-multi-species/output/26-rank35-optimization/lambda_gene_0.2/top_genes_per_component/goslim_term_counts.csv`

The output file contains:
- `term`: The GO Slim term name
- `Component_1` through `Component_35`: Count of the term in each component
- `total`: Total number of occurrences across all components

Results are sorted by total count in descending order (most common terms first).

### Usage

```bash
cd /path/to/timeseries_molecular
python3 M-multi-species/scripts/28-summarize-goslim-terms.py
```

### Testing

A test script is provided to validate the output:

```bash
python3 M-multi-species/scripts/test_goslim_summary.py
```

The test validates:
- Output file exists
- Correct file structure (columns: term, Component_1...Component_35, total)
- Data types are correct
- All counts are positive
- Totals match sum of component columns
- Results are sorted by total count
- Terms exist in source annotation files

### Example Output

```
term,Component_1,Component_2,Component_3,...,Component_35,total
organelle,39,27,34,...,21,966
catalytic activity,24,11,17,...,6,514
nucleus,18,20,17,...,7,452
cytosol,22,13,17,...,12,449
anatomical structure development,23,16,13,...,6,439
...
```

### Notes

- GO Slim terms in the `goslim_names` column are semicolon-separated
- The script handles missing values (NaN) appropriately
- Each term is counted once per occurrence in each row
- The script processes all annotation files found in the target directory