Overview

This report synthesizes the gene-annotation work (Steps 0–4 of code/18-diff-annotation-phenotype-plan.md) into a ranked candidate-gene interpretation and a set of hypothesized phenotypic consequences for the lean and siscowet ecotypes. It does not generate new variant calls; it asks where the existing differential methylation (DMRs/DMCs) and differential PAVs fall, what those genes do, and which ecotype phenotypes they could plausibly affect.

Read-this-first caveat: the reference is a lean genome

GCF_016432855.1 was assembled from a female gynogenetic doubled haploid of the Seneca Lake (NY) brood stock — a documented lean morphotype (Smith et al. 2022, Mol Ecol Resour). Every ecotype-asymmetric result here is therefore lean-reference biased: siscowet diverges more from the reference, inflating apparent siscowet-specific PAVs and reducing siscowet CpG coverage in divergent regions. Interpret all PAV magnitudes comparatively, never as absolute counts, and treat siscowet-specific gene content as partially invisible to a reference-only analysis.

Inputs

Layer File Built by
Gene function table gene_function_table.tsv code/18-build-gene-function-table.py
DMR → gene dmr_gene_assignments.tsv code/18.1-assign-features-to-genes.py
DMC → gene dmc_gene_assignments.tsv code/18.1-...
Stringent PAV → gene pav_gene_assignments.tsv code/18.1-...
Lenient PAV burden pav_gene_burden.tsv code/18.1-...
Integrated candidates integrated_candidate_genes.tsv code/18.2-integrate-candidates.py
GO enrichment go_enrichment_{dmr,pav,union}.tsv code/18.3-go-enrichment.py

Positional windows: promoter = TSS ± 2 kb (strand-aware); flanking = gene body ± 5 kb. GO ORA is hypergeometric on local NCBI GO annotations with DAG propagation, BH-FDR within each set.

Results

cand <- rd("integrated_candidate_genes.tsv")
dmr  <- rd("dmr_gene_assignments.tsv")
pav  <- rd("pav_gene_assignments.tsv")
go_pav <- rd("go_enrichment_pav.tsv")
num <- function(x) suppressWarnings(as.numeric(x))

Summary

summ <- data.frame(
  Metric = c("Candidate genes (DMR/DMC/stringent-PAV within 5 kb)",
             "  with a DMR within 5 kb",
             "  with a stringent siscowet-specific deletion within 5 kb",
             "  CONVERGENT (DMR and stringent PAV)",
             "DMRs landing in a promoter (TSS +/- 2 kb)",
             "Genes with an EXON-overlapping siscowet-specific deletion",
             "GO terms FDR<0.1 (PAV set)"),
  Value = c(nrow(cand),
            sum(num(cand$dmr_n) > 0),
            sum(num(cand$pav_strin_n) > 0),
            sum(cand$convergent == 1),
            sum(dmr$in_promoter == 1),
            length(unique(pav$gene_id[pav$exon_overlap == 1])),
            sum(num(go_pav$fdr) < 0.1)))
kt(summ, "Headline numbers")
Headline numbers
Metric Value
Candidate genes (DMR/DMC/stringent-PAV within 5 kb) 2036
with a DMR within 5 kb 181
with a stringent siscowet-specific deletion within 5 kb 1263
CONVERGENT (DMR and stringent PAV) 4
DMRs landing in a promoter (TSS +/- 2 kb) 88
Genes with an EXON-overlapping siscowet-specific deletion 54
GO terms FDR<0.1 (PAV set) 78

Convergent genes (methylation and structural variation)

The strongest candidates carry both a DMR and a stringent siscowet-specific deletion.

cv <- cand[cand$convergent == 1,
           c("symbol","product","dmr_directions","dmr_best_class","dmr_max_absdiff",
             "pav_strin_best_class","pav_strin_exonic","pav_strin_maxsize","rank_score")]
kt(cv, "Convergent candidate genes")
Convergent candidate genes
symbol product dmr_directions dmr_best_class dmr_max_absdiff pav_strin_best_class pav_strin_exonic pav_strin_maxsize rank_score
LOC120032414 zinc finger protein 883-like hyper_siscowet exon 28.77 exon 1 270 6
LOC120040411 gastrula zinc finger protein XlCGF57.1-like hypo_siscowet intron 56.06 intron 0 208 4
LOC120039781 hypo_siscowet intron 20.40 intron 0 729 4
LOC120043843 septin-9-like hyper_siscowet intron 16.51 intron 0 422 3

Two of the four (znf883-like paralogs, XlCGF57.1-like) are rapidly-evolving, tandemly-arrayed zinc-finger genes — convergence there partly reflects repetitive-region mapping, so they are flagged rather than asserted.

Protein-coding genes with promoter methylation

Promoter DMRs are the most interpretable methylation signal (regulatory potential). Repetitive ncRNA (rDNA/snRNA) loci are excluded.

pc <- cand[cand$dmr_promoter == 1 & cand$biotype == "protein_coding" & cand$caution == "",
           c("symbol","product","dmr_directions","dmr_max_absdiff",
             "log2fc_sisco_lean","meth_expr_concordant")]
pc <- pc[order(-num(pc$dmr_max_absdiff)), ]
kt(head(pc, 15), "Protein-coding promoter-DMR genes (top 15 by methylation difference)")
Protein-coding promoter-DMR genes (top 15 by methylation difference)
symbol product dmr_directions dmr_max_absdiff log2fc_sisco_lean meth_expr_concordant
LOC120029926 epoxide hydrolase 1-like hypo_siscowet 44.54 0.04 flat
LOC120051555 zinc finger protein 665-like hypo_siscowet 34.52 NA NA
rbm24b RNA binding motif protein 24b hyper_siscowet 32.77 NA NA
LOC120036681 OCIA domain-containing protein 1-like hypo_siscowet 30.73 0.10 flat
LOC120041959 uncharacterized LOC120041959 hypo_siscowet 24.72 NA NA
LOC120033896 small G protein signaling modulator 3-like hypo_siscowet 17.21 NA NA

Exonic siscowet-specific deletions (candidate copy-number / loss-of-function)

ex <- pav[pav$exon_overlap == 1, c("symbol","product","size_bp","chrom","start","end")]
ex <- ex[order(-num(ex$size_bp)), ]
ex <- ex[!duplicated(ex$symbol), ]
kt(head(ex, 15), "Largest exon-overlapping siscowet-specific deletions (one row per gene)")
Largest exon-overlapping siscowet-specific deletions (one row per gene)
symbol product size_bp chrom start end
LOC120023451 7696 NC_052308.1 73682627 73690323
LOC120043798 deleted in malignant brain tumors 1 protein-like 5648 NC_052309.1 96977772 96983420
LOC120058315 carcinoembryonic antigen-related cell adhesion molecule 1-like 5638 NC_052319.1 18640816 18646454
LOC120035267 uncharacterized LOC120035267 4413 NW_024057675.1 687095 691508
LOC120050008 angiopoietin-related protein 5-like 3144 NC_052312.1 40398448 40401592
LOC120050007 3144 NC_052312.1 40398448 40401592
LOC120036302 mucin-2-like 2565 NW_024058002.1 34668 37233
LOC120051533 OX-2 membrane glycoprotein-like 1672 NC_052313.1 59088687 59090359
LOC120043299 NLR family CARD domain-containing protein 3-like 1283 NW_024061633.1 41624 42907
LOC120023687 junctional adhesion molecule A-like 1261 NC_052308.1 78179875 78181136
LOC120040975 E3 ubiquitin/ISG15 ligase TRIM25-like 1258 NW_024060988.1 11154 12412
LOC120040528 uncharacterized LOC120040528 1240 NW_024060564.1 209111 210351
LOC120042792 uncharacterized LOC120042792 1167 NW_024061508.1 45203 46370
LOC120041182 rab effector MyRIP-like 1130 NW_024061099.1 87943 89073
LOC120044082 voltage-dependent T-type calcium channel subunit alpha-1I-like 715 NC_052309.1 83776879 83777594

GO enrichment — the defensible signal

g <- go_pav[num(go_pav$fdr) < 0.25,
            c("name","namespace","study_k","fold_enrichment","fdr","phenotype")]
g <- g[order(num(g$fdr)), ]
kt(head(g, 20), "PAV-set GO terms at FDR<0.25 (phenotype-relevant flagged in 'phenotype')")
PAV-set GO terms at FDR<0.25 (phenotype-relevant flagged in ‘phenotype’)
name namespace study_k fold_enrichment fdr phenotype
calcium ion transmembrane transporter activity molecular_function 23 3.52 0.0002561
calcium ion transmembrane transport biological_process 19 3.74 0.0005607
neuron projection development biological_process 33 2.35 0.0026120
calcium channel complex cellular_component 12 4.58 0.0029770
neuron development biological_process 34 2.22 0.0029770
cell projection morphogenesis biological_process 27 2.45 0.0029770
plasma membrane bounded cell projection morphogenesis biological_process 27 2.45 0.0029770
neuron projection morphogenesis biological_process 27 2.45 0.0029770
calcium ion transport biological_process 19 3.00 0.0029770 Y
calcium channel activity molecular_function 17 3.23 0.0029770
mechanosensitive monoatomic ion channel activity molecular_function 6 9.07 0.0042080
cell morphogenesis involved in neuron differentiation biological_process 24 2.45 0.0059990
neuron projection guidance biological_process 20 2.66 0.0067010
axon guidance biological_process 20 2.66 0.0067010
generation of neurons biological_process 40 1.91 0.0067010
cell development biological_process 45 1.83 0.0067010
semaphorin receptor activity molecular_function 7 6.19 0.0084130
voltage-gated calcium channel complex cellular_component 9 4.60 0.0089410
axonogenesis biological_process 22 2.42 0.0089410
neuron differentiation biological_process 38 1.90 0.0089410

Only calcium ion transport / channel terms pass FDR<0.1. Lipid- and growth-related terms are suggestive (FDR 0.1–0.5); broad lipid metabolic process is actually depleted. The calcium signal is also subject to a gene-length confounder (long channel genes accumulate deletions by chance) — see Limitations.

Top candidates figure

top <- cand[cand$caution == "" & cand$biotype == "protein_coding", ]
top <- top[order(-num(top$rank_score), -num(top$dmr_max_absdiff)), ]
top <- head(top, 15)
lab <- ifelse(top$symbol == "" | grepl("^LOC", top$symbol),
              substr(top$product, 1, 28), top$symbol)
lab[lab == ""] <- top$gene_id[lab == ""]
col <- ifelse(top$convergent == 1, "#b2182b",
       ifelse(top$pav_strin_exonic == 1, "#ef8a62",
       ifelse(top$dmr_promoter == 1, "#2166ac", "#67a9cf")))
op <- par(mar = c(4, 16, 3, 1))
bp <- barplot(rev(num(top$rank_score)), horiz = TRUE, names.arg = rev(lab),
              las = 1, col = rev(col), xlab = "integrated rank score",
              main = "Top protein-coding candidate genes", cex.names = 0.8)
legend("bottomright", bty = "n", cex = 0.8,
       fill = c("#b2182b","#ef8a62","#2166ac","#67a9cf"),
       legend = c("convergent (meth+PAV)","exonic siscowet deletion",
                  "promoter DMR","other"))

par(op)
dir.create("figures", showWarnings = FALSE)
png("figures/18-top-candidates.png", width = 900, height = 600)
par(mar = c(4, 16, 3, 1))
barplot(rev(num(top$rank_score)), horiz = TRUE, names.arg = rev(lab), las = 1,
        col = rev(col), xlab = "integrated rank score",
        main = "Top protein-coding candidate genes", cex.names = 0.8)
invisible(dev.off())

Phenotype interpretation

Lean and siscowet differ along well-characterized axes: lipid content / energy storage (siscowet are fatty, deep-water; lean are lean-bodied, shallow), body shape and size, depth / pressure / buoyancy, and sensory–neuromuscular performance. The repository’s own data/measurements.xlsx is a 17-landmark geometric-morphometric data set (body shape + centroid size) for ~100 fish per group — direct evidence that the ecotypes differ in body form and size, which is the phenotype the growth/muscle candidates below would feed into. Each block states the evidence and an explicitly hypothesized link.

Lipid metabolism / energy storage (gene-level, not genome-wide)

  • angptl5 (angiopoietin-related protein 5-like) — exonic siscowet-specific deletion. ANGPTL family members regulate triglyceride / lipoprotein-lipase activity. Hypothesis: altered lipid-handling capacity, consistent with the defining lean/siscowet fat-content difference. Caveat: lean-reference bias means this is a candidate, not a confirmed siscowet loss.
  • mogat2 (2-acylglycerol O-acyltransferase 2-A-like) — exonic deletion. Catalyses triacylglycerol synthesis (monoacylglycerol pathway). Hypothesis: differential dietary lipid re-esterification / fat deposition.
  • Epoxide hydrolase 1-like (LOC120029926) — promoter DMR (hypo in siscowet). Roles in lipid-epoxide and xenobiotic metabolism.
  • Genome-wide caveat: GO enrichment does not support a broad lipid-metabolism program (broad lipid terms are depleted; only narrow lipid-binding sub-terms are weakly enriched, FDR>0.1). The lipid signal here is individual-gene, not pathway-level.

Body shape, growth, and muscle

  • rbm24b (RNA-binding motif protein 24b) — promoter DMR, hyper-methylated in siscowet (Δ33%), the largest-effect promoter DMR. Rbm24 is a conserved regulator of muscle and heart development / sarcomere mRNA stability. Hypothesis: promoter hypermethylation → reduced expression → contributes to ecotype differences in musculature / body form captured by the morphometric data.
  • Myosin-7-like, striated-muscle / contractile-fiber GO terms, and suggestive developmental-growth GO enrichment (FDR~0.15) point the same direction.

Calcium signalling, excitability, and sensory–neural

  • Calcium ion transport is the one phenotype-relevant GO block passing FDR<0.1 in the PAV set; calcium-channel and axon-guidance / neuron-projection terms top the list. Specific genes include a T-type calcium channel α-1I-like subunit (exonic deletion), CaMKII β-like (near a DMR), and anoctamin-10a (Ca-gated channel, near a DMR). Hypothesis: differences in neuromuscular excitability / sensory tuning relevant to deep-water (siscowet) vs shallow (lean) life history. Strong gene-length caveat applies.

Immune / cell-surface recognition

  • Exonic siscowet-specific deletions cluster in immune & adhesion genes: CEACAM1-like, CD200/OX-2-like, DMBT1-like, NLRP3-like, α-2-macroglobulin-like, TRIM25-like, CD22-like. These gene families are rapidly evolving and copy-number-variable across vertebrates, so some of this is expected background rather than ecotype-specific selection. Hypothesis: ecotype differences in pathogen/parasite exposure (note the companion liver RNAseq is a parasitized/non-parasitized study) — but this is the class most confounded by reference bias and gene-family CNV.

Chromatin / transcriptional regulation (flagged as artifact)

  • The DMR GO enrichment (chromatin/nucleosome, transcription-repressor) is driven by one histone gene cluster (6 co-located histones) and three adjacent znf883 paralogs — a tandem-cluster artifact, not broad regulatory convergence. Reported for completeness only.

Limitations

  1. Reference is lean — all ecotype-asymmetric signals are divergence-biased; siscowet gene content absent from the reference is invisible. The per-ecotype hifiasm track (code/13.3-...plan.md) is the needed complement.
  2. No DMC survives FDR (q<0.1 = 0 in the diff-meth report); interpretation leans on DMR-level and stringent-PAV sets. Single-CpG and lenient-PAV signals are hypothesis-generating only.
  3. Expression is orthogonal, not confirmatory — the liver RNAseq is a different study / different individuals; most candidates are not liver-expressed (log2fc=NA).
  4. Gene-length bias inflates PAV-based GO enrichment toward long genes (calcium channels); a length-matched permutation null is the proper test (not yet run).
  5. Tandem-cluster / paralog inflation and IEA-only GO annotations affect enrichment.
  6. Associative, not causal — promoter-hypermethylation → silencing is a hypothesis; no functional validation, single reference, no replication across additional assemblies.

Conclusions

The cleanest, most defensible findings are: (i) a small set of convergent genes (led by a znf883 paralog), (ii) a calcium-transport / channel GO signal in the structural-variant set (with a gene-length caveat), and (iii) individual lipid-metabolism genes (angptl5, mogat2) and a high-effect muscle-development promoter DMR (rbm24b) that align with the known lipid- and body-shape differences between the ecotypes — the latter directly relevant to the morphometric phenotypes recorded in data/measurements.xlsx. None of these are confirmed; they are prioritized hypotheses for targeted follow-up on ecotype-specific (hifiasm) assemblies and in matched tissue.