This report synthesizes the gene-annotation work (Steps 0–4 of code/18-diff-annotation-phenotype-plan.md)
into a ranked candidate-gene interpretation and a set of
hypothesized phenotypic consequences for the lean and
siscowet ecotypes. It does not generate new variant calls; it asks
where the existing differential methylation (DMRs/DMCs) and
differential PAVs fall, what those genes do, and which
ecotype phenotypes they could plausibly affect.
Read-this-first caveat: the reference is a lean genome
GCF_016432855.1 was assembled from a female gynogenetic doubled haploid of the Seneca Lake (NY) brood stock — a documented lean morphotype (Smith et al. 2022, Mol Ecol Resour). Every ecotype-asymmetric result here is therefore lean-reference biased: siscowet diverges more from the reference, inflating apparent siscowet-specific PAVs and reducing siscowet CpG coverage in divergent regions. Interpret all PAV magnitudes comparatively, never as absolute counts, and treat siscowet-specific gene content as partially invisible to a reference-only analysis.
| Layer | File | Built by |
|---|---|---|
| Gene function table | gene_function_table.tsv |
code/18-build-gene-function-table.py |
| DMR → gene | dmr_gene_assignments.tsv |
code/18.1-assign-features-to-genes.py |
| DMC → gene | dmc_gene_assignments.tsv |
code/18.1-... |
| Stringent PAV → gene | pav_gene_assignments.tsv |
code/18.1-... |
| Lenient PAV burden | pav_gene_burden.tsv |
code/18.1-... |
| Integrated candidates | integrated_candidate_genes.tsv |
code/18.2-integrate-candidates.py |
| GO enrichment | go_enrichment_{dmr,pav,union}.tsv |
code/18.3-go-enrichment.py |
Positional windows: promoter = TSS ± 2 kb (strand-aware); flanking = gene body ± 5 kb. GO ORA is hypergeometric on local NCBI GO annotations with DAG propagation, BH-FDR within each set.
cand <- rd("integrated_candidate_genes.tsv")
dmr <- rd("dmr_gene_assignments.tsv")
pav <- rd("pav_gene_assignments.tsv")
go_pav <- rd("go_enrichment_pav.tsv")
num <- function(x) suppressWarnings(as.numeric(x))
summ <- data.frame(
Metric = c("Candidate genes (DMR/DMC/stringent-PAV within 5 kb)",
" with a DMR within 5 kb",
" with a stringent siscowet-specific deletion within 5 kb",
" CONVERGENT (DMR and stringent PAV)",
"DMRs landing in a promoter (TSS +/- 2 kb)",
"Genes with an EXON-overlapping siscowet-specific deletion",
"GO terms FDR<0.1 (PAV set)"),
Value = c(nrow(cand),
sum(num(cand$dmr_n) > 0),
sum(num(cand$pav_strin_n) > 0),
sum(cand$convergent == 1),
sum(dmr$in_promoter == 1),
length(unique(pav$gene_id[pav$exon_overlap == 1])),
sum(num(go_pav$fdr) < 0.1)))
kt(summ, "Headline numbers")
| Metric | Value |
|---|---|
| Candidate genes (DMR/DMC/stringent-PAV within 5 kb) | 2036 |
| with a DMR within 5 kb | 181 |
| with a stringent siscowet-specific deletion within 5 kb | 1263 |
| CONVERGENT (DMR and stringent PAV) | 4 |
| DMRs landing in a promoter (TSS +/- 2 kb) | 88 |
| Genes with an EXON-overlapping siscowet-specific deletion | 54 |
| GO terms FDR<0.1 (PAV set) | 78 |
The strongest candidates carry both a DMR and a stringent siscowet-specific deletion.
cv <- cand[cand$convergent == 1,
c("symbol","product","dmr_directions","dmr_best_class","dmr_max_absdiff",
"pav_strin_best_class","pav_strin_exonic","pav_strin_maxsize","rank_score")]
kt(cv, "Convergent candidate genes")
| symbol | product | dmr_directions | dmr_best_class | dmr_max_absdiff | pav_strin_best_class | pav_strin_exonic | pav_strin_maxsize | rank_score |
|---|---|---|---|---|---|---|---|---|
| LOC120032414 | zinc finger protein 883-like | hyper_siscowet | exon | 28.77 | exon | 1 | 270 | 6 |
| LOC120040411 | gastrula zinc finger protein XlCGF57.1-like | hypo_siscowet | intron | 56.06 | intron | 0 | 208 | 4 |
| LOC120039781 | hypo_siscowet | intron | 20.40 | intron | 0 | 729 | 4 | |
| LOC120043843 | septin-9-like | hyper_siscowet | intron | 16.51 | intron | 0 | 422 | 3 |
Two of the four (znf883-like paralogs,
XlCGF57.1-like) are rapidly-evolving, tandemly-arrayed
zinc-finger genes — convergence there partly reflects
repetitive-region mapping, so they are flagged rather
than asserted.
Promoter DMRs are the most interpretable methylation signal (regulatory potential). Repetitive ncRNA (rDNA/snRNA) loci are excluded.
pc <- cand[cand$dmr_promoter == 1 & cand$biotype == "protein_coding" & cand$caution == "",
c("symbol","product","dmr_directions","dmr_max_absdiff",
"log2fc_sisco_lean","meth_expr_concordant")]
pc <- pc[order(-num(pc$dmr_max_absdiff)), ]
kt(head(pc, 15), "Protein-coding promoter-DMR genes (top 15 by methylation difference)")
| symbol | product | dmr_directions | dmr_max_absdiff | log2fc_sisco_lean | meth_expr_concordant |
|---|---|---|---|---|---|
| LOC120029926 | epoxide hydrolase 1-like | hypo_siscowet | 44.54 | 0.04 | flat |
| LOC120051555 | zinc finger protein 665-like | hypo_siscowet | 34.52 | NA | NA |
| rbm24b | RNA binding motif protein 24b | hyper_siscowet | 32.77 | NA | NA |
| LOC120036681 | OCIA domain-containing protein 1-like | hypo_siscowet | 30.73 | 0.10 | flat |
| LOC120041959 | uncharacterized LOC120041959 | hypo_siscowet | 24.72 | NA | NA |
| LOC120033896 | small G protein signaling modulator 3-like | hypo_siscowet | 17.21 | NA | NA |
ex <- pav[pav$exon_overlap == 1, c("symbol","product","size_bp","chrom","start","end")]
ex <- ex[order(-num(ex$size_bp)), ]
ex <- ex[!duplicated(ex$symbol), ]
kt(head(ex, 15), "Largest exon-overlapping siscowet-specific deletions (one row per gene)")
| symbol | product | size_bp | chrom | start | end |
|---|---|---|---|---|---|
| LOC120023451 | 7696 | NC_052308.1 | 73682627 | 73690323 | |
| LOC120043798 | deleted in malignant brain tumors 1 protein-like | 5648 | NC_052309.1 | 96977772 | 96983420 |
| LOC120058315 | carcinoembryonic antigen-related cell adhesion molecule 1-like | 5638 | NC_052319.1 | 18640816 | 18646454 |
| LOC120035267 | uncharacterized LOC120035267 | 4413 | NW_024057675.1 | 687095 | 691508 |
| LOC120050008 | angiopoietin-related protein 5-like | 3144 | NC_052312.1 | 40398448 | 40401592 |
| LOC120050007 | 3144 | NC_052312.1 | 40398448 | 40401592 | |
| LOC120036302 | mucin-2-like | 2565 | NW_024058002.1 | 34668 | 37233 |
| LOC120051533 | OX-2 membrane glycoprotein-like | 1672 | NC_052313.1 | 59088687 | 59090359 |
| LOC120043299 | NLR family CARD domain-containing protein 3-like | 1283 | NW_024061633.1 | 41624 | 42907 |
| LOC120023687 | junctional adhesion molecule A-like | 1261 | NC_052308.1 | 78179875 | 78181136 |
| LOC120040975 | E3 ubiquitin/ISG15 ligase TRIM25-like | 1258 | NW_024060988.1 | 11154 | 12412 |
| LOC120040528 | uncharacterized LOC120040528 | 1240 | NW_024060564.1 | 209111 | 210351 |
| LOC120042792 | uncharacterized LOC120042792 | 1167 | NW_024061508.1 | 45203 | 46370 |
| LOC120041182 | rab effector MyRIP-like | 1130 | NW_024061099.1 | 87943 | 89073 |
| LOC120044082 | voltage-dependent T-type calcium channel subunit alpha-1I-like | 715 | NC_052309.1 | 83776879 | 83777594 |
g <- go_pav[num(go_pav$fdr) < 0.25,
c("name","namespace","study_k","fold_enrichment","fdr","phenotype")]
g <- g[order(num(g$fdr)), ]
kt(head(g, 20), "PAV-set GO terms at FDR<0.25 (phenotype-relevant flagged in 'phenotype')")
| name | namespace | study_k | fold_enrichment | fdr | phenotype |
|---|---|---|---|---|---|
| calcium ion transmembrane transporter activity | molecular_function | 23 | 3.52 | 0.0002561 | |
| calcium ion transmembrane transport | biological_process | 19 | 3.74 | 0.0005607 | |
| neuron projection development | biological_process | 33 | 2.35 | 0.0026120 | |
| calcium channel complex | cellular_component | 12 | 4.58 | 0.0029770 | |
| neuron development | biological_process | 34 | 2.22 | 0.0029770 | |
| cell projection morphogenesis | biological_process | 27 | 2.45 | 0.0029770 | |
| plasma membrane bounded cell projection morphogenesis | biological_process | 27 | 2.45 | 0.0029770 | |
| neuron projection morphogenesis | biological_process | 27 | 2.45 | 0.0029770 | |
| calcium ion transport | biological_process | 19 | 3.00 | 0.0029770 | Y |
| calcium channel activity | molecular_function | 17 | 3.23 | 0.0029770 | |
| mechanosensitive monoatomic ion channel activity | molecular_function | 6 | 9.07 | 0.0042080 | |
| cell morphogenesis involved in neuron differentiation | biological_process | 24 | 2.45 | 0.0059990 | |
| neuron projection guidance | biological_process | 20 | 2.66 | 0.0067010 | |
| axon guidance | biological_process | 20 | 2.66 | 0.0067010 | |
| generation of neurons | biological_process | 40 | 1.91 | 0.0067010 | |
| cell development | biological_process | 45 | 1.83 | 0.0067010 | |
| semaphorin receptor activity | molecular_function | 7 | 6.19 | 0.0084130 | |
| voltage-gated calcium channel complex | cellular_component | 9 | 4.60 | 0.0089410 | |
| axonogenesis | biological_process | 22 | 2.42 | 0.0089410 | |
| neuron differentiation | biological_process | 38 | 1.90 | 0.0089410 |
Only calcium ion transport / channel terms pass
FDR<0.1. Lipid- and growth-related terms are suggestive (FDR
0.1–0.5); broad lipid metabolic process is actually
depleted. The calcium signal is also subject to a gene-length
confounder (long channel genes accumulate deletions by chance)
— see Limitations.
top <- cand[cand$caution == "" & cand$biotype == "protein_coding", ]
top <- top[order(-num(top$rank_score), -num(top$dmr_max_absdiff)), ]
top <- head(top, 15)
lab <- ifelse(top$symbol == "" | grepl("^LOC", top$symbol),
substr(top$product, 1, 28), top$symbol)
lab[lab == ""] <- top$gene_id[lab == ""]
col <- ifelse(top$convergent == 1, "#b2182b",
ifelse(top$pav_strin_exonic == 1, "#ef8a62",
ifelse(top$dmr_promoter == 1, "#2166ac", "#67a9cf")))
op <- par(mar = c(4, 16, 3, 1))
bp <- barplot(rev(num(top$rank_score)), horiz = TRUE, names.arg = rev(lab),
las = 1, col = rev(col), xlab = "integrated rank score",
main = "Top protein-coding candidate genes", cex.names = 0.8)
legend("bottomright", bty = "n", cex = 0.8,
fill = c("#b2182b","#ef8a62","#2166ac","#67a9cf"),
legend = c("convergent (meth+PAV)","exonic siscowet deletion",
"promoter DMR","other"))
par(op)
dir.create("figures", showWarnings = FALSE)
png("figures/18-top-candidates.png", width = 900, height = 600)
par(mar = c(4, 16, 3, 1))
barplot(rev(num(top$rank_score)), horiz = TRUE, names.arg = rev(lab), las = 1,
col = rev(col), xlab = "integrated rank score",
main = "Top protein-coding candidate genes", cex.names = 0.8)
invisible(dev.off())
Lean and siscowet differ along well-characterized axes: lipid
content / energy storage (siscowet are fatty, deep-water; lean
are lean-bodied, shallow), body shape and size,
depth / pressure / buoyancy, and
sensory–neuromuscular performance. The repository’s own
data/measurements.xlsx is a 17-landmark
geometric-morphometric data set (body shape + centroid
size) for ~100 fish per group — direct evidence that the ecotypes differ
in body form and size, which is the phenotype the growth/muscle
candidates below would feed into. Each block states the evidence and an
explicitly hypothesized link.
angptl5 (angiopoietin-related protein
5-like) — exonic siscowet-specific deletion. ANGPTL family members
regulate triglyceride / lipoprotein-lipase activity.
Hypothesis: altered lipid-handling capacity, consistent with the
defining lean/siscowet fat-content difference. Caveat:
lean-reference bias means this is a candidate, not a confirmed siscowet
loss.mogat2 (2-acylglycerol
O-acyltransferase 2-A-like) — exonic deletion. Catalyses
triacylglycerol synthesis (monoacylglycerol pathway).
Hypothesis: differential dietary lipid re-esterification / fat
deposition.rbm24b (RNA-binding motif protein 24b)
— promoter DMR, hyper-methylated in siscowet (Δ33%),
the largest-effect promoter DMR. Rbm24 is a conserved regulator of
muscle and heart development / sarcomere mRNA
stability. Hypothesis: promoter hypermethylation → reduced
expression → contributes to ecotype differences in musculature / body
form captured by the morphometric data.znf883 paralogs — a tandem-cluster artifact, not
broad regulatory convergence. Reported for completeness only.code/13.3-...plan.md) is the needed complement.log2fc=NA).The cleanest, most defensible findings are: (i) a small set of
convergent genes (led by a znf883
paralog), (ii) a calcium-transport / channel GO signal
in the structural-variant set (with a gene-length caveat), and (iii)
individual lipid-metabolism genes
(angptl5, mogat2) and a high-effect
muscle-development promoter DMR (rbm24b)
that align with the known lipid- and body-shape differences between the
ecotypes — the latter directly relevant to the morphometric phenotypes
recorded in data/measurements.xlsx. None of these are
confirmed; they are prioritized hypotheses for targeted follow-up on
ecotype-specific (hifiasm) assemblies and in matched tissue.