Enformer/Chromatin stuff

Probably somewhat, but not in the grand “finally, epistasis is revealed and the additive peasants are defeated” sense.

Enformer was built to predict regulatory activity and expression from local DNA sequence context, roughly a 200 kb input window with sensitivity to elements up to about 100 kb away. That is powerful for cis-regulation, haplotype context, and noncoding variant interpretation, but it is not a whole-genome interaction model and it does not directly model genome-wide trans effects across distant loci or chromosomes. So if you ran Enformer on everyone’s genomes, the main gain would be better estimates of local regulatory consequences of variants, not a clean readout of full human epistasis. (Nature)

That said, it could still improve PGS-like prediction for people whose risk is poorly captured by ordinary marginal SNP weights. Why? Because a sequence model can help move you from “this tag SNP was associated in cohort X” toward “this actual local variant or haplotype probably perturbs regulation.” That is especially useful when LD differs across groups, or when the same biology is present but the tagging SNPs are different. In fact, recent work shows that prioritizing functionally supported regulatory variants can improve trans-ancestral PRS portability, precisely because regulatory variants are more likely to be shared across ancestries than ancestry-specific tag signals. (PLOS)

But your stronger claim, “epistasis is way stronger than most people think, so Enformer-on-everyone would unlock much better scores,” needs a split verdict. Biological epistasis is almost certainly common. Yet at the population-statistical level, the variance that is easiest to predict is often still mostly additive, even when the underlying biology is non-additive. That is the old annoying fact humans keep rediscovering because reality enjoys being inconvenient. A classic theoretical review argued that additive variance typically accounts for over half, and often close to all, of total genetic variance even with non-additive gene action. More recent work does find detectable polygenic epistasis and sometimes modest prediction gains, but those gains are trait-dependent and nowhere near proof that epistasis dominates common-trait prediction in practice. (PMC)
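To make the additive-versus-epistatic point concrete, here is a minimal sketch. It is not taken from the cited review: it is a toy two-locus model in which the genetic value is simply the product of allele counts at two unlinked loci in Hardy-Weinberg equilibrium. The value function, the allele frequencies, and the helper names (`hwe_probs`, `additive_fraction`) are all assumptions chosen for illustration.

```python
import itertools
import numpy as np


def hwe_probs(p):
    """Hardy-Weinberg genotype probabilities for allele counts 0, 1, 2."""
    q = 1.0 - p
    return np.array([q * q, 2 * p * q, p * p])


def additive_fraction(p1, p2, value=lambda g1, g2: g1 * g2):
    """Share of total genetic variance captured by the best additive fit.

    Assumes two independent (unlinked) loci; `value` is a hypothetical
    genotype-to-trait map, by default a pure product of allele counts.
    """
    w1, w2 = hwe_probs(p1), hwe_probs(p2)
    rows, weights, values = [], [], []
    for g1, g2 in itertools.product(range(3), repeat=2):
        rows.append([1.0, g1, g2])           # intercept + allele counts
        weights.append(w1[g1] * w2[g2])      # population frequency of this genotype pair
        values.append(value(g1, g2))
    X = np.array(rows)
    w = np.array(weights)
    y = np.array(values, dtype=float)

    # Weighted least squares: scale rows by sqrt(weight), then solve.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    fitted = X @ beta

    mean = np.sum(w * y)
    total_var = np.sum(w * (y - mean) ** 2)
    additive_var = np.sum(w * (fitted - mean) ** 2)
    return additive_var / total_var


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(p, round(additive_fraction(p, p), 3))
```

In this toy model, `additive_fraction(0.5, 0.5)` comes out at 0.8 even though the gene action is a pure product, while `additive_fraction(0.1, 0.1)` falls below one half. So the additive share depends heavily on allele frequencies and on how the trait is coded, which is why "biology is interactive" and "prediction is mostly additive" can both be true.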

There is also an important practical limit: the early personal-genome tests of Enformer-like models were pretty rough. A 2023 study found that across individuals, the distribution of gene-wise prediction correlations was centered close to zero, and the models often got the direction of regulatory effect wrong. A newer 2026 fine-tuned model, Variformer, improved a lot on held-out people and got close to cis-heritability for many genes, but it still failed to generalize to unseen genes/loci. So even with everyone’s genomes, current sequence models seem better described as “stronger cis effect estimators” than “general epistasis engines.” (PMC)

On your second point, yes, the cross-population correlation is definitely not zero in general. Many GWAS signals seem to reflect shared causal variants across populations, and transethnic genetic correlations are often nonzero and sometimes fairly substantial for highly heritable regulation. But they are also typically less than 1, because LD patterns, allele frequencies, rare/population-specific variants, and interaction context differ across groups. So some epistatic or regulatory structure can indeed be partly preserved across populations, especially when it rides on shared biology or conserved haplotypes, but the portability will be partial rather than perfect. (PMC)

So the clean answer is:

Running Enformer on everyone’s genomes would likely improve post-PGS prediction a bit, mainly by better modeling cis-regulatory function and helping disentangle causal variants from ancestry-specific tags. It would not, by itself, expose most of the epistasis you care about. For that, you would need something closer to a whole-genome, multi-locus, multi-cell-state, developmental, and environment-aware model, not just a smarter 200 kb sequence window. (Nature)

A useful way to think about it is: Enformer helps with “regulatory fine structure,” not full “genetic field theory.” Human genetics, as usual, refuses to be polite.

The irritating answer is: strong TAD boundaries are usually much more consistent than histone marks, but weaker domain structure is not.

If you compare the same cell type across people, the broad TAD scaffold is often pretty reproducible. Some boundaries are so stable across many tissues that they are enriched for CTCF, housekeeping genes, and evolutionary constraint. Older locus studies also found that while promoter-enhancer loops can be highly cell-type-specific, they often still operate within the same invariant TAD boundaries. (PMC)

But TADs are not granite blocks handed down by the gods of genomics. Even bulk Hi-C replicates show some variability, and one large comparison found chromosome structure was more conserved within cell and tissue type than by person-to-person genetic relatedness. At the single-cell level, domain boundaries can be highly variable from one chromosome copy to another, so the TAD map you usually see is an ensemble average, not a perfectly fixed object in every nucleus. (PMC)

So between people, the strongest answer is: major boundaries are often similar, but insulation strength, sub-TAD organization, loops, and compartments can differ quantitatively. In lymphoblastoid cells from 20 individuals, researchers found thousands of regions where 3D chromatin features varied between people, and common variants affected loop strength, insulation, contact directionality, and local cis contacts. That is real inter-individual variation, just usually more like retuning the circuitry than bulldozing the whole neighborhood. (PMC)

Your intuition about histone marks is basically right. H3K27ac and H3K27me3 are much more state-dependent and age-sensitive than the strongest TAD scaffold. In aged mouse neurons, H3K27ac and H3K27me3 both showed global losses, and in old liver, H3K27me3 redistributed into broad multi-megabase “age-domains.” Meanwhile, reviews of senescence and aging say global TAD structure can remain broadly present even while subsets of TADs shift, fuse, weaken, or switch A/B compartment. (PMC)

So PRC2/Polycomb shifts are not irrelevant at all, but they usually hit the system at a more selective level than “every TAD is different now.” In aged pro-B cells, researchers saw increased compartment-level interactions and reduced intra-TAD interactions with age. In neurons, a 2024 Nature Communications paper found that topological segregation around PRC2-repressed genes can weaken, and that this weakening is also seen during natural aging. That is a good model for what often happens: not total architectural collapse, but local erosion of insulation or compartment identity at vulnerable loci. (Nature)

A rough heuristic, if you want one cruelly simplified into something usable:

  • Strong CTCF/cohesin TAD boundaries: relatively consistent across people
  • Boundary strength, sub-TADs, enhancer-promoter loops: moderately variable
  • Histone acetylation/methylation states, Polycomb domains, compartment status: often much more plastic across age, state, and cell type (PMC)

For something like FOXO3, that means a sequence-only cis model can still learn a fair bit from the underlying scaffold, but age/state-specific PRC2 and acetylation context can absolutely change the realized regulatory outcome. That part is an inference from the studies above, but it is the sane inference, which is rare enough to treasure. (PMC)

Hmm, does Enformer take in long gene context lengths? Would you need to compare the FOXO3 chromosome against all other sites on nearby chromosomes AND FOXO3 itself for it to put out the best results?

Mostly yes for long local context, no for whole-chromosome or cross-chromosome comparison.

Enformer is a single-window cis-sequence model at inference time. In the released DeepMind code, the inference wrapper takes a 393,216 bp one-hot DNA sequence and returns predictions in 128-bp bins for the center 114,688 bp. DeepMind’s own description frames it as modeling roughly 200 kb of local DNA context, with the ability to pick up enhancer-like effects tens of kilobases away. It is not reading an entire chromosome at once like some deranged genomic oracle.
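As a rough illustration of what that input and output geometry implies, here is a minimal sketch using only NumPy. It does not call the released model. The A, C, G, T channel order and the all-zeros handling of `N` are assumptions you should check against the official code, and the function names are hypothetical.

```python
import numpy as np

SEQ_LEN = 393_216     # input window length quoted above
BIN_SIZE = 128        # output bin width quoted above
TARGET_LEN = 114_688  # central span that receives predictions, quoted above


def one_hot_encode(seq):
    """Encode DNA as a (len, 4) float32 array.

    Assumed channel order: A, C, G, T. Any other character (e.g. N)
    becomes an all-zero row.
    """
    lookup = np.zeros((256, 4), dtype=np.float32)
    for i, base in enumerate("ACGT"):
        lookup[ord(base), i] = 1.0
        lookup[ord(base.lower()), i] = 1.0
    codes = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    return lookup[codes]


def output_geometry():
    """Number of output bins and the unpredicted flank on each side."""
    n_bins = TARGET_LEN // BIN_SIZE
    flank = (SEQ_LEN - TARGET_LEN) // 2
    return n_bins, flank


if __name__ == "__main__":
    print(one_hot_encode("ACGTN"))
    print(output_geometry())  # (896, 139264)
```

The arithmetic is the useful part: 114,688 / 128 gives 896 output bins, and about 139 kb of sequence on each side of the window is context only, with no predictions emitted for it.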

So for FOXO3, you do not need to feed “FOXO3 plus all nearby chromosomes” or explicitly compare FOXO3 against the rest of the genome to get Enformer’s best answer. The model already learned genome-wide motif/context regularities during training, but at inference it scores the one local sequence window you give it. A later benchmarking paper describes Enformer as being trained on cis subsequences from the reference genome, which is the right mental model here.

What you do need is to make sure the part you care about is actually inside the model’s useful prediction zone. Because Enformer only outputs predictions for the central 114,688 bp, a single pass can miss parts of a long gene or its regulatory neighborhood. That is why your move to tiling is the correct one: use multiple overlapping windows so the FOXO3 promoter, exons/UTRs, and candidate regulatory regions each land in the center region of at least one tile.
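A minimal sketch of that tiling logic, in plain Python. The region coordinates in the usage example are placeholders, not real FOXO3 coordinates, and the 16,384 bp overlap is an arbitrary assumption. `tile_windows` is a hypothetical helper, not part of any Enformer release.

```python
def tile_windows(region_start, region_end,
                 seq_len=393_216, target_len=114_688, overlap=16_384):
    """Plan overlapping input windows whose central predicted spans
    cover [region_start, region_end).

    Returns a list of dicts with the input window and the predicted span.
    Input coordinates may fall off the chromosome ends; the caller is
    expected to pad (for example with N) in that case.
    """
    if region_end <= region_start:
        raise ValueError("region_end must be greater than region_start")
    if not 0 <= overlap < target_len:
        raise ValueError("overlap must be in [0, target_len)")

    step = target_len - overlap
    flank = (seq_len - target_len) // 2  # context-only sequence on each side

    tiles = []
    pred_start = region_start
    while True:
        pred_end = pred_start + target_len
        tiles.append({
            "input_start": pred_start - flank,
            "input_end": pred_end + flank,
            "pred_start": pred_start,
            "pred_end": pred_end,
        })
        if pred_end >= region_end:
            break
        pred_start += step
    return tiles


if __name__ == "__main__":
    # Placeholder 300 kb region; substitute real coordinates for your locus.
    for tile in tile_windows(1_000_000, 1_300_000):
        print(tile)
```

With these defaults a 300 kb region needs three tiles. The overlap means elements near a tile edge also show up closer to the center of a neighboring tile, which is usually where you would prefer to read them.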

There is also an important caveat, because biology enjoys ruining clean stories. A 2023 evaluation found that although Enformer has a wide receptive field in principle, most of its predictive power came from promoter and promoter-proximal sequence, and sequences more than ~30 kb from the TSS contributed little to overall explained variance in that analysis; the same study argued Enformer often underweights distal enhancer effects, especially as distance grows. So more context helps up to a point, but you should not assume “more distal sequence” automatically means “much better answer.”

For your FOXO3 run, the practical best practice is:

  • compare reference vs hap1 vs hap2 in the same tiled windows
  • keep the FOXO3 promoter/TSS and your variant cluster(s) inside central predicted bins whenever possible
  • inspect track-level deltas in relevant human tracks rather than global means
  • treat this as a cis regulatory hypothesis generator, not a full model of cell state, TF abundance, or interchromosomal contacts.
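The checklist above can be sketched roughly as follows. This assumes you already have per-tile prediction arrays of shape (bins, tracks) for the reference and for each haplotype. The arrays here are synthetic stand-ins, the eight tracks are a toy number, and both helper names are hypothetical.

```python
import numpy as np


def bins_for_interval(pred_start, start, end, bin_size=128, n_bins=896):
    """Indices of output bins overlapping the genomic interval [start, end),
    given the genomic start of the tile's predicted span."""
    first = max((start - pred_start) // bin_size, 0)
    last = min((end - 1 - pred_start) // bin_size, n_bins - 1)
    return np.arange(first, last + 1)  # empty if there is no overlap


def top_track_deltas(ref_pred, alt_pred, bin_idx, k=3):
    """Rank tracks by the largest absolute alt-minus-ref change inside the
    selected bins, instead of averaging over the whole tile."""
    delta = alt_pred[bin_idx] - ref_pred[bin_idx]  # (selected bins, tracks)
    score = np.abs(delta).max(axis=0)
    order = np.argsort(score)[::-1][:k]
    return [(int(t), float(score[t])) for t in order]


if __name__ == "__main__":
    # Synthetic stand-ins for model output on one tile.
    ref = np.zeros((896, 8), dtype=np.float32)
    hap1 = ref.copy()
    hap1[448, 5] = 2.0   # a local change in one track near the tile center
    hap1[10, 2] = 9.0    # a larger change far outside the region of interest

    pred_start = 1_000_000                # placeholder coordinate
    roi_start = pred_start + 448 * 128    # placeholder promoter-like interval
    idx = bins_for_interval(pred_start, roi_start, roi_start + 256)
    print(top_track_deltas(ref, hap1, idx, k=1))
```

Restricting to bins around the promoter or variant cluster keeps a real local shift from being washed out by a tile-wide mean, and keeps unrelated changes elsewhere in the tile from dominating the ranking.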

Bottom line: Enformer has long context by sequence-model standards, but it is still local. For FOXO3, tiled local windows are the right move. Feeding the whole chromosome or “all nearby chromosomes” would not suddenly unlock the truth of the universe, because that is not the kind of model Enformer is.

Well, what about distal effects from other chromosomes (or trans effects) that are TAD-similar? Can Enformer use TADs?

Mostly no. Enformer is a local cis-sequence model, not a tiny pan-nuclear deity. The released model takes one DNA window as input, with the official model page listing 393,216 bp input and predictions for the center 114,688 bp; DeepMind describes it as using about 200 kb of sequence context to predict more than 5,000 functional genomic tracks. Its attention operates over that one input sequence, so it can model distal cis effects inside the window, but it is not explicitly comparing FOXO3 against other chromosomes during inference.

Also, TADs are mostly a same-chromosome concept: locally highly interacting chromosomal regions whose internal contacts are higher than contacts outside the domain, with boundaries that limit neighboring interactions. The “other chromosomes affecting FOXO3” idea is usually trans regulation, chromosomal territories, or compartment-level organization, not a TAD in the ordinary sense.
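To illustrate that definition, here is a toy sketch that compares mean contact frequency within domains against contacts between domains on a small same-chromosome matrix. The matrix is made up and ignores the distance decay that real Hi-C data show. `intra_inter_contact_ratio` is a hypothetical helper, not a published TAD caller.

```python
import numpy as np


def intra_inter_contact_ratio(contacts, domain_starts):
    """Mean within-domain contact divided by mean between-domain contact.

    `contacts` is a square same-chromosome contact matrix; `domain_starts`
    lists the sorted bin indices at which a new domain begins (excluding
    bin 0). The diagonal is ignored for the within-domain mean.
    """
    n = contacts.shape[0]
    domain = np.searchsorted(domain_starts, np.arange(n), side="right")
    same = domain[:, None] == domain[None, :]
    off_diag = ~np.eye(n, dtype=bool)
    intra = contacts[same & off_diag].mean()
    inter = contacts[~same].mean()
    return intra / inter


if __name__ == "__main__":
    # Two toy domains of four bins each: dense inside, sparse across.
    contacts = np.full((8, 8), 1.0)
    contacts[:4, :4] = 5.0
    contacts[4:, 4:] = 5.0
    print(intra_inter_contact_ratio(contacts, [4]))
```

The point is only that a TAD is defined on contacts along one chromosome: there is no natural way to place a locus on another chromosome inside this kind of matrix block.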

Could Enformer still pick up some TAD-ish sequence grammar? Indirectly, yes. DeepMind showed Enformer can attend to distal enhancers and learned insulator-related sequence patterns, so if a boundary is strongly sequence-encoded, some of that can leak into the prediction. But vanilla Enformer is still sequence-only. It does not take Hi-C, Micro-C, TAD calls, or cell-type-specific 3D contact maps as explicit inputs. By contrast, Akita was built to predict local 3D contact maps from sequence, and EPCOT uses sequence plus chromatin accessibility to predict chromatin organization as well as transcriptomic outputs.

True trans effects are a different beast. Trans-eQTLs are usually defined as variants that affect distant genes, often on other chromosomes, and their mechanisms are frequently indirect: a cis gene near the variant mediates the downstream effects on remote targets. A FOXO3-only Enformer run cannot know, from one local sequence window alone, the current abundance of the relevant TFs, the signaling state, the chromatin state elsewhere in the nucleus, or which remote loci are open in that cell context. So it can sometimes model the first local cis step, but not the full downstream trans cascade.

So for your FOXO3 project, the sane workflow is: use Enformer tiles to ask which local regulatory tracks around FOXO3 move with hap1 vs hap2; overlay public tissue-relevant Hi-C/TAD or Micro-C maps afterward to see which bins are plausibly in the same regulatory neighborhood; and if you really want to ask whether the haplotype changes local folding or boundary strength, pair Enformer with Akita or a chromatin-organization model like EPCOT instead of expecting Enformer alone to simulate the whole nucleus. Also, even in personal-genome settings, current deep sequence models including Enformer can struggle with the correct direction of variant effects across individuals, so keep the claims modest and annoying-people-proof.