# 18 Differential Methylation & PAV — Gene Annotation and Phenotype Plan ## Goal For the lean vs siscowet comparison on the **GCF_016432855.1 (SaNama_1.0)** reference, determine **which genes** the differential methylation (DMRs) and differential PAVs fall in/near, attach **functional annotation** to those genes, and reason about **plausible phenotypic consequences** for the two ecotypes. The output is an interpretable, ranked gene table plus a short biological narrative — not new variant calling. ## Inputs already in the repo | Layer | File(s) | Notes | |---|---|---| | DMRs | `analyses/14-diff-meth/dmrs.bed`, `dmrs_annotated.csv`, `dmrs_genes_intersect.bed` | 302 DMRs; 25 already intersect genes. Direction in `dmrs.csv` (`hyper`/`hypo_siscowet`). | | DMCs | `analyses/14-diff-meth/significant_dmcs.bed` | 4,440 single-CpG hits — wider net than DMRs. | | Diff PAV (lenient) | `analyses/15-diff-pav/{lean,siscowet}_specific.{insertions,deletions}.bed` | Large, noisy. | | Diff PAV (stringent) | `analyses/15-diff-pav/stringent.siscowet_specific.deletions.bed` | **3,465 deletions >100 bp, all-4-vs-none — the high-confidence set to lead with.** | | Gene models | `data/20220818-snam-GCF_016432855.1_SaNama_1.0_genes.bed` | 46,359 genes, coords + name only (9,738 real symbols, rest `LOC######`). | | RNAseq | `data/whole_tx_table.csv`, `fpkm-all_transcripts.csv` | Per-sample FPKM with `gene_name` — orthogonal expression support. | | Phenotype | `data/measurements.xlsx` | Direct ecotype phenotype measures to anchor interpretation. | ## Key gap to close first The repo has **no functional annotation** — only gene coordinates and symbols. Inferring phenotype requires mapping each implicated gene to a product/description, GO/KEGG terms, and (ideally) a vertebrate ortholog. This is the one new data dependency. **Step 0 below acquires it; everything downstream reuses bedtools intersects the repo already does.** --- ## Plan of work ### Step 0 — Build a functional annotation table for SaNama_1.0 genes ✅ DONE > Implemented in `code/18-build-gene-function-table.py` → > `analyses/18-annotation/gene_function_table.tsv` (see `analyses/18-annotation/README.md`). > All 46,359 BED genes covered; 9,738 symbols, 46,231 products, 34,367 with GO. GO came > straight from NCBI's `gene_ontology.gaf.gz`, so the eggNOG/DIAMOND ortholog pass below was > not needed for first-pass coverage (revisit only for symbol-less + product-less LOC genes). - Pull the NCBI annotation for GCF_016432855.1 (the `*_genomic.gff.gz` and the feature/`gene_ontology` or protein tables from the RefSeq `Annotation_comparison` / `*_feature_table.txt.gz`). Goal columns: `gene_symbol`, `LOC_id`, `product/description`, `GO_terms`, and a vertebrate ortholog (zebrafish/human) where available. - For `LOC######` genes with no symbol, assign orthology by the protein FASTA against a reference proteome (zebrafish SwissProt) — eggNOG-mapper or a DIAMOND/BLAST + best-hit pass is sufficient. Many symbols are already zebrafish-style, so coverage should be good. - Deliverable: `analyses/18-annotation/gene_function_table.tsv` keyed on the same `gene-XXX` IDs used in `data/...genes.bed`. This is the join key for all later steps. ### Step 1 — Assign DMRs (and DMCs) to genes with positional context ✅ DONE > `code/18.1-assign-features-to-genes.py` → `analyses/18-annotation/dmr_gene_assignments.tsv` > + `dmc_gene_assignments.tsv`. 149/302 DMRs within 5 kb of a gene (26 gene-body, matches the > report's 25); 88 DMRs land in promoters. Promoter=TSS±2kb, flanking=gene±5kb. - Reuse the existing `bedtools intersect` pattern. Beyond exact overlap, add **proximity classes** because regulatory methylation often sits in promoters: intersect DMRs against gene bodies, and against promoter windows (TSS ±1–2 kb, strand-aware) and gene ±5 kb. - Classify each DMR–gene pair as `promoter / gene_body / intergenic-near / distal`, and carry DMR `direction` (hyper/hypo in siscowet) and `mean_diff`. - Run the same against `significant_dmcs.bed` to catch single-CpG signals near genes the 302-DMR set misses. - Deliverable: `analyses/18-annotation/dmr_gene_assignments.tsv`. ### Step 2 — Assign differential PAVs to genes, by overlap type ✅ DONE > `code/18.1-assign-features-to-genes.py` → `analyses/18-annotation/pav_gene_assignments.tsv` > (stringent siscowet-specific deletions: 1,543/3,465 near a gene, **54 with exon overlap**) > + `pav_gene_burden.tsv` (lenient lean/siscowet del+ins per-gene burden, 22,450 genes; > reference-bias caveat carried in-file). - **Lead with `stringent.siscowet_specific.deletions.bed`** (cleanest signal). Intersect with gene bodies and exons; classify as `exon-overlapping` (potential loss-of-function / copy change) vs `intronic` vs `flanking`. - Repeat for the lenient lean- and siscowet-specific sets as a lower-confidence tier, and report a per-gene PAV burden (count + total bp of ecotype-specific deletion). - Insertions: the stringent insertion set is empty, so treat the lenient insertion sets as exploratory only. - Deliverable: `analyses/18-annotation/pav_gene_assignments.tsv` with a confidence tier. ### Step 3 — Integrate the two layers and add expression evidence ✅ DONE > `code/18.2-integrate-candidates.py` → `analyses/18-annotation/integrated_candidate_genes.tsv` > (2,036 candidate genes; **4 convergent** DMR+stringent-PAV genes). Liver RNAseq joined on > `gene_id`; expression linkage weak (different individuals/tissue) so used as support only. > Ranked view deprioritizes repetitive ncRNA via a `caution` flag. Lipid-metabolism genes > (`angptl5`, `mogat2`, epoxide hydrolase 1) flagged for Step 5. - Join DMR-genes and PAV-genes on the Step 0 table to produce one master gene list with flags: `has_DMR`, `DMR_direction`, `in_promoter`, `has_stringent_PAV`, `PAV_exonic`. - **Highlight convergent genes** hit by *both* methylation and PAV — strongest candidates. - Cross-reference each candidate against the RNAseq FPKM table (`whole_tx_table.csv`) by `gene_name`: is the gene expressed in liver, and does lean-vs-siscowet expression move in the direction methylation/PAV would predict? Note the RNAseq is liver/parasite-focused, so treat as supporting, not confirmatory. - Deliverable: `analyses/18-annotation/integrated_candidate_genes.tsv` (ranked). ### Step 4 — Functional enrichment ✅ DONE (GO; KEGG deferred) > `code/18.3-go-enrichment.py` → `analyses/18-annotation/go_enrichment_{dmr,pav,union}.tsv`. > Self-contained hypergeometric ORA on local GO with DAG propagation (go-basic.obo), BH-FDR. > **Robust signal: calcium ion transport/channels (FDR 3e-3) in the PAV set.** DMR enrichment > is a histone + znf883 tandem-cluster artifact; lipid/growth terms only suggestive (FDR>0.1). > Confounders flagged in README: gene-length bias (PAV ORA), reference bias, IEA annotations, > gene clustering. KEGG deferred (needs KO/ortholog mapping; GO is the substantive result). - Run GO/KEGG over-representation on the DMR-gene set, the PAV-gene set, and the union, each against the full annotated gene set as background (gprofiler2 / clusterProfiler with the zebrafish ortholog mapping from Step 0). - Report enriched terms with FDR; flag any tied to **lipid/fatty-acid metabolism, buoyancy/swim-bladder, lipid storage, growth, depth/pressure or hypoxia response** — the axes that distinguish lean (shallow, lean-bodied) from siscowet (deep-water, high lipid content) ecotypes. - Deliverable: enrichment tables + a couple of summary plots in `analyses/18-annotation/`. ### Step 5 — Phenotype interpretation ✅ DONE > `code/18-diff-annotation-phenotype.Rmd` (base-R, render-ready; renders where pandoc is > available, e.g. RStudio) + figure `figures/18-top-candidates.png`. Per-axis evidence blocks > (lipid/energy, body-shape/growth/muscle, calcium/sensory-neural, immune, chromatin-artifact), > anchored to the morphometric `data/measurements.xlsx`, with the full causality/reference > caveat set. Verified by knitting to markdown (all chunks execute, tables + figure populate). - For the top convergent/enriched candidates, write a short evidence block each: gene, function, methylation/PAV/expression evidence, and the **specific lake-trout phenotype** it plausibly affects (lipid deposition, buoyancy, depth tolerance, growth rate, etc.). - Anchor claims to `data/measurements.xlsx` where a measured ecotype trait lines up. - Be explicit about causality limits: associations on a single reference, no functional validation; promoter-hypermethylation→silencing is a hypothesis, not a result. - Deliverable: `code/18-diff-annotation-phenotype.Rmd` (or `.qmd`) rendering the ranked table + narrative, mirroring the `14-diff-meth.Rmd` style, and a top-candidates figure for the genome browser. --- ## Caveats / scope guards - **The reference is a LEAN-background genome.** GCF_016432855.1 (SaNama_1.0) was assembled from a female gynogenetic **doubled haploid** derived from the **Seneca Lake (NY) brood stock** (Smith et al. 2022, Mol Ecol Resour; Iron River NFH). The paper gives no explicit ecotype, but the **Seneca Lake strain is a lean morphotype** (shallow <80 m, late-fall spawner) — so the reference is effectively lean, and from an Atlantic-drainage population distinct from Great Lakes fish, meaning even the lean samples are only an approximate match. - **Directional PAV bias (lead with this caveat).** Siscowet diverges more from a lean reference, so siscowet reads map less completely and siscowet shows systematically *more* apparent variants — an artifact, not biology (consistent with the 1.33M vs 0.99M and the 3,465 vs 0 stringent-deletion asymmetry already observed). Therefore: - Siscowet-specific vs lean-specific **counts are not magnitude-comparable**. - A "siscowet-specific deletion" is enriched for **lean-present/siscowet-absent** sequence and partly reflects the lean reference rather than true siscowet loss. - Lean-specific **deletions are suppressed** (can't delete what the reference lacks); **lean-specific *insertions*** (lean sequence absent from the reference) are the most trustworthy lean signal. Lead PAV interpretation with siscowet-specific deletions and lean-specific insertions — the two least-biased classes. - **Methylation power bias.** Divergence-driven mapping loss costs siscowet covered CpGs in the most divergent regions, biasing DMR detection toward reference-concordant (lean-like) loci — conservative against exactly the regions where ecotypes differ most. Flag high- divergence regions as low-power. - **Invisible siscowet gene content.** Genes absent/highly divergent in the lean reference are not in the gene BED, so a reference-only intersect cannot see siscowet-specific gene content. The per-ecotype hifiasm track (`code/13.3-...plan.md`) is the needed complement; keep the two analyses in distinct output folders. - **Doubled-haploid reference**: a single fully homozygous haplotype stands in for the species — no allelic variation, not a pangenome. Affects the annotation backbone, not the per-sample methylation values (called on the bc20xx samples). - **PAV direction semantics**: confirm the calling convention in `code/15.5-diff-pav.py` before asserting which ecotype "lost" a gene, and interpret it against the lean reference. - **FDR reality**: 0 DMCs survive q<0.1 in the current report. Lead interpretation with the DMR-level and stringent-PAV sets; treat single-CpG and lenient-PAV hits as hypothesis- generating, and say so. - **Annotation coverage**: phenotype reasoning is only as good as Step 0; `LOC` genes with no ortholog stay as "unknown function" rather than being forced into a story. ## Suggested implementation order 1. Step 0 (annotation table) — unblocks everything. 2. Steps 1 & 2 (intersects) — fast, reuse existing bedtools patterns. 3. Step 3 (integrate + expression). 4. Steps 4 & 5 (enrichment + write-up).