The cleanest answer is this:
JEPA learns good state representations from partial observation. Schrödinger bridges learn plausible dynamics between state distributions. GeneJEPA predicts latent representations of masked gene sets from visible context instead of reconstructing noisy expression counts, using a Perceiver-style encoder and EMA teacher, with the goal of producing transferable cell embeddings for downstream tasks like annotation, drug response, and perturbation reasoning. Tang’s guide, meanwhile, frames Schrödinger bridges as a unifying way to learn an optimal stochastic bridge between marginals while minimizing deviation from a reference process, tying together diffusion, score-based models, and flow matching.
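To make the masking-and-latent-prediction mechanic concrete, here is a minimal numpy sketch of a JEPA-style update. It assumes toy linear encoders in place of GeneJEPA's Perceiver encoder, and all names (`student_W`, `teacher_W`, `jepa_step`) are invented for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, d_latent = 100, 16

# Toy linear "encoders"; the real model uses a Perceiver-style network.
student_W = rng.normal(scale=0.1, size=(n_genes, d_latent))
teacher_W = student_W.copy()   # EMA teacher starts as a copy of the student
predictor = np.eye(d_latent)   # trivial stand-in for the predictor head

def jepa_step(x, mask, lr=0.005, ema=0.99):
    """One JEPA-style update: predict the teacher's latent for the
    masked gene set from the student's encoding of the visible genes."""
    global student_W, teacher_W
    visible = x * (~mask)                    # context: masked genes zeroed out
    target_z = (x * mask) @ teacher_W        # teacher embeds the masked genes
    pred_z = (visible @ student_W) @ predictor
    err = pred_z - target_z                  # error lives in latent space,
    loss = 0.5 * float(err @ err)            # not in noisy count space
    student_W -= lr * np.outer(visible, err @ predictor.T)  # student grad step
    teacher_W = ema * teacher_W + (1 - ema) * student_W     # slow EMA teacher
    return loss

x = rng.poisson(1.0, size=n_genes).astype(float)   # one cell's counts
mask = rng.random(n_genes) < 0.3                   # hide ~30% of genes
losses = [jepa_step(x, mask) for _ in range(50)]
```

The point of the toy: the loss is computed between latent vectors, never against raw counts, and the teacher receives no gradient, only an EMA copy of the student.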
So the two are not rivals so much as orthogonal pieces of the stack. A JEPA-style model asks, “what latent state is predictable from the visible part of this cell?” An SB-style model asks, “what stochastic path over time, dose, or condition best connects these observed cell-state distributions?” One is mostly about state abstraction under partial observability; the other is mostly about distributional dynamics on path space. Humans adore forcing everything into one model family, because apparently ambiguity is illegal. Here, the split is actually useful.
On granularity and resolution, JEPA and SB emphasize different things. JEPA is usually better at capturing within-cell redundancy, co-expression structure, and dropout-robust latent programs. Cell-JEPA states this explicitly: predicting latent embeddings from partial observations encourages higher-level, dropout-robust features rather than exact sparse-count reconstruction. But that same abstraction is also a limit: in Cell-JEPA, perturbation prediction improved absolute-state reconstruction, not effect-size estimation, which is a polite academic way of saying “nice representation, still not the same as causal perturbation modeling.”
Schrödinger bridges, by contrast, are stronger at population-level temporal resolution. They are built to connect distributions across time or condition and can represent continuous stochastic trajectories rather than just static embeddings. In single-cell settings, SB-related methods are used to infer cell dynamics from discrete snapshot data, reconstruct missing timepoints, model birth/death or mass change, and even recover developmental landscapes or regulatory structure. Recent variants extend this further to multi-marginal trajectories, unbalanced growth/death, branched dynamics, interacting populations, and discrete-state CTMC settings.
That is why your instinct that the release should be broader than a graph is basically right. A graph is usually a downstream projection of something richer: a kNN scaffold, an inferred GRN, a coarse state-transition summary, or a neighborhood approximation to manifold geometry. Even in SB-based cell-dynamics work, graphs often appear as geometric helpers, not as the core object. For example, SF²M uses a geodesic Sinkhorn cost built from a heat-kernel approximation on a k-nearest-neighbor graph, but the learned object is still a stochastic process or vector field over cell states, not "the graph" itself. A static graph throws away time, uncertainty, branching, dose dependence, and partial-observation structure too early.
On causality, neither family gives you magic. JEPA mostly learns predictive/statistical structure. It can tell you what hidden parts of a cell are inferable from visible parts, and it can produce embeddings that transfer well, but that is not the same thing as identifying intervention mechanisms. SB-style perturbation models get closer to an interventional language because they learn maps or path measures between control and perturbed populations, often from unpaired observations, under a minimal-action or reference-dynamics assumption. CellOT is a good example: it explicitly learns maps between untreated and treated single-cell populations rather than hoping a representation will somehow absorb the perturbation for free. But even that is still not full causal identification. It is better described as counterfactual transport under modeling assumptions.
So if your question is “which one maps perturbations better?”, the answer is:
- JEPA is better as a foundation substrate for representing cells, handling sparsity, partial observation, and transfer.
- SB / OT / flow-matching style models are better as dynamics and perturbation layers, especially when time, dose, branching, growth/death, or heterogeneous responses matter.
- True causal mapping wants both, plus actual intervention design, metadata, and ideally lineage or repeated measurements.
The strongest hybrid, in my view, is:
JEPA for the state space, SB for the dynamics on that state space.
In other words:
- learn a robust latent cell manifold from huge, messy atlas data with a JEPA-style objective,
- condition that latent space on perturbation/time/dose/cell context,
- learn an SB or related transport process in that latent space,
- decode only when needed back to genes, pathways, or coarse graphs.
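A toy end-to-end skeleton of that division of labor, hedged heavily: a frozen random projection stands in for a trained JEPA encoder, and exact assignment between latent point clouds stands in for a full Schrödinger bridge solver; every name here is invented for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # exact OT on equal-size samples

rng = np.random.default_rng(1)
n_cells, n_genes, d = 50, 200, 8

# Step 1 stand-in: a frozen random projection plays the trained JEPA encoder.
encoder = rng.normal(size=(n_genes, d)) / np.sqrt(n_genes)

control = rng.poisson(1.0, size=(n_cells, n_genes)).astype(float)
treated = control + rng.poisson(0.5, size=(n_cells, n_genes))  # shifted population

z0 = control @ encoder   # latent marginal under control
z1 = treated @ encoder   # latent marginal under perturbation

# Step 2 stand-in: exact optimal matching between the latent point clouds;
# a real pipeline would fit an entropic bridge / transport process here.
cost = ((z0[:, None, :] - z1[None, :, :]) ** 2).sum(-1)
rows, cols = linear_sum_assignment(cost)

# Step 3: the matching induces a latent response field, decoded only if needed.
response = z1[cols] - z0[rows]
```

Swapping step 2 for Sinkhorn or a bridge solver changes the coupling from hard to soft but keeps the same shape of pipeline.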
That gives you the best division of labor: JEPA handles what a cell “is”; SB handles how cell populations “move.” This is also consistent with how recent single-cell dynamics papers combine learned latent spaces with transport or bridge objectives, even if they do not use GeneJEPA specifically.
There is one more important caveat. Current perturbation prediction in single-cell biology is still pretty shaky as a field. A 2025 Nature Methods benchmark found that several deep learning and foundation-model approaches did not outperform simple linear baselines on genetic perturbation prediction tasks. So even if JEPA is the right representation bias and SB is the right dynamical bias, nobody should pretend the causality problem is solved because they have a cool acronym and a GPU budget large enough to inconvenience a small nation.
So I would tighten your principle to this:
Do not release only a graph. Release a substrate that preserves partial observation, time, perturbation identity, dose, branching, uncertainty, and population structure. Then let graphs, embeddings, bridges, trajectories, and regulatory summaries be downstream views of the same underlying data.
And the compact version of the relation is:
JEPA is about predictive state abstraction. Schrödinger bridges are about stochastic transport between states. JEPA tells you what should count as a state; SB tells you how states change.
That’s the real relation. Not identical. Not redundant. More like encoder and dynamics model, which, irritatingly, is exactly the sort of modular answer that ends up being true.
so they can complement each other surprisingly well but not perfectly complement
Exactly.
They can complement each other surprisingly well, but not perfectly, because they solve different missing pieces of the biology mess. JEPA-style models are good at learning robust latent cell states from sparse, noisy, partially observed expression, while Schrödinger bridge / OT-style models are good at learning how populations move between conditions, times, or perturbations. That division of labor is real, not marketing glitter.
Where they fit together is pretty elegant: JEPA can provide the state space; SB can provide the dynamics on that state space. Cell-JEPA explicitly argues that latent prediction is better suited than raw reconstruction for learning stable cellular programs under dropout and noise, and Tang’s SB guide frames bridges as a dynamic path-space formalism for transforming one distribution into another under a reference process. So a natural hybrid is: learn a denoised latent manifold with JEPA, then learn perturbation or developmental flows on top of it with SB-like dynamics.
Why the complement is not perfect:
- JEPA compresses. That is its whole job. Useful, yes. But compression can discard fine-grained information that an SB model might need, like subtle dose effects, rare branch-specific programs, or gene-level mechanistic detail. Cell-JEPA itself reports a telling limitation: it improved absolute-state reconstruction in perturbation prediction but not effect-size estimation, which is basically the paper politely admitting that better representations do not automatically become better perturbation models.
- SB needs structure JEPA does not give you for free. A bridge model still needs marginals across time/condition, and it needs some choice of reference dynamics or transport geometry. JEPA gives you a latent space, but it does not magically specify the correct stochastic law on that space. Humans keep hoping one model will quietly do the job of three. Tragically, mathematics continues to be rude about this.
- Neither one, by itself, is causality. JEPA is predictive under masking or partial observation. SB/OT methods can get closer to counterfactual matching, but plain transport between populations is still not the same as identifying mechanisms. That is exactly why methods like CINEMA-OT explicitly separate confounders from perturbation effects before matching cells, rather than pretending geometry alone solves causality.
- SB-style perturbation models and JEPA-style world models have different failure modes. CellOT argues that many latent-shift approaches rely on learning perturbation-invariant representations, which is hard when responses are heterogeneous; direct transport maps help with that. So if your JEPA latent washes out perturbation-relevant heterogeneity, the SB layer inherits a prettified but distorted state space.
So the sharp version is:
JEPA answers “what is a good cell state representation?”
SB answers “how do cell-state distributions evolve or transport under time and perturbation?”
That is a strong complement, but not a perfect one, because representation adequacy, dynamic adequacy, and causal adequacy are three different problems. A good release should preserve enough raw and structured information that people can build all three, instead of forcing everyone into a graph first and discovering later that the graph was just the fossilized remains of the wrong inductive bias.
ok but how do you get SB readouts without disrupting the cell
Mostly, you don’t get SB readouts by directly reading the whole cell without touching it. You get them by being clever about what has to be measured destructively, what can be tracked live, and what the bridge can infer from distributions rather than full same-cell trajectories. In single-cell biology, the standard problem is that scRNA-seq destroys the cell, so methods like OT and Schrödinger-bridge-style trajectory inference were built precisely to reconstruct dynamics from unaligned population snapshots across time, dose, or perturbation, not from perfect longitudinal movies of every single cell.
So the first big answer is:
You usually do not measure the bridge directly. You infer it from marginals.
If you have cells sampled at t_0, t_1, t_2, or control versus perturbation, SB can infer plausible stochastic paths connecting those observed distributions. That is why destructive assays are still usable for SB. The same-cell trajectory is not observed; it is modeled. This is the whole point of transport-based dynamics in single-cell work.
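To make "infer it from marginals" concrete: the static core of the Brownian-reference SB is an entropy-regularized OT problem, solvable by Sinkhorn iterations. A minimal numpy sketch on synthetic snapshots (a real solver would then fit continuous dynamics on top of this coupling):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two unpaired snapshot "marginals": cell states at t0 and at t1.
x0 = rng.normal(loc=0.0, size=(40, 5))
x1 = rng.normal(loc=1.0, size=(60, 5))

def sinkhorn_coupling(x0, x1, eps=0.5, n_iter=200):
    """Entropy-regularized OT coupling between two empirical marginals."""
    a = np.full(len(x0), 1 / len(x0))     # uniform mass on t0 cells
    b = np.full(len(x1), 1 / len(x1))     # uniform mass on t1 cells
    C = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)  # squared-distance cost
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):               # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]    # coupling whose marginals match a, b

P = sinkhorn_coupling(x0, x1)
# Row i of P says where cell i's probability mass plausibly goes by t1.
```

Same-cell pairing never enters: only the two point clouds and the regularization strength `eps`, which plays the role of the reference diffusion's noise level.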
If you want same-cell, minimally disruptive readouts to make the bridge less guessy, the practical stack is usually:
1. Continuous live-cell imaging for cheap temporal structure.
Use phase contrast, brightfield, or especially quantitative phase imaging (QPI) to track morphology, dry mass, motility, division behavior, and kinetic patterns over time without labels or endpoint lysis. QPI has been used for continuous, non-destructive monitoring, and time-lapse morphology can reveal state transitions and treatment-specific flows that snapshot assays miss.
2. Sparse live molecular checkpoints instead of continuous full transcriptomics.
If you need transcriptomic anchors on the same cell, Live-seq is the cleanest current answer: it takes a tiny cytoplasmic biopsy, preserves viability, and has been used to profile cells before and after stimulation or differentiation. It is not “zero perturbation,” because nothing in biology is, but it is much closer to “survivable checkpointing” than ordinary scRNA-seq.
3. A few dynamic reporters when you already know what matters.
Genetically encoded fluorescent reporters are excellent for signaling dynamics, cell cycle, metabolites, and pathway timing in living cells, and multiplexed reporters can quantify delays and correlations between pathways in the same cell. The catch, naturally, is that they only read out the channels you engineered, so they are high-temporal-resolution but low-breadth.
4. Label-free chemical/metabolic proxies when you want broader physiology without tags.
Raman-based live-cell omics and related label-free optical methods can read chemical state non-destructively, and combined Raman plus phase tomography can profile living cells in “living and unperturbed conditions.” Two-photon FLIM is also used as a label-free metabolic readout in live cells. These are promising because they preserve the cell, but they are still proxies or partial molecular views, not full causal state readouts.
So the practical experimental design is usually not “non-disruptive everything.” That fantasy belongs in the same cabinet as perpetual motion machines and clean academic abstractions. The real design is:
continuous low-burden live features + occasional high-information anchors.
For example: image every cell continuously, do rare Live-seq checkpoints on a subset, and use destructive omics on matched sacrificial cohorts at selected times. Then fit the SB in a latent space that is constrained by both the live trajectories and the molecular anchors. That hybrid is not directly stated as a turnkey recipe in one paper, but it is the obvious synthesis of what these methods are each good at: live imaging gives temporal continuity, while molecular assays provide identity and mechanism.
There is also an important limitation:
without disrupting the cell, you usually lose molecular completeness.
Live imaging gives dynamics but mostly morphology and reporter channels. Raman/FLIM give partial chemistry and metabolism. Live-seq gives transcriptomic checkpoints but not dense continuous movies of the full transcriptome. And the temporal transcriptomics review is explicit that most trajectory reconstructions remain model-based expectations unless you add nondisruptive measurements or lineage information.
If you care about branching and lineage, add lineage tracing, but that is not free either. CRISPR barcodes, CellTag, fluorescent lineage markers, and similar methods help identify who came from whom and improve branch fidelity, but they require cell engineering or labeling, so they are not “readouts without disruption” in the purist sense. They are better described as structured, tolerated perturbations that buy you causal-temporal constraints.
So the blunt answer is:
You do not get full SB readouts non-disruptively.
You get enough live, low-perturbation observables to constrain the bridge, and you let the SB infer the hidden path structure between sparse molecular anchors or snapshot marginals.
Best current stack, if you actually wanted to build this seriously:
- QPI / brightfield / phase-contrast time-lapse for continuous single-cell trajectories
- a few endogenous or genetically encoded reporters for key signaling axes
- Live-seq for occasional same-cell transcriptomic checkpoints
- Raman or FLIM for label-free metabolic/chemical context
- destructive multi-omics on matched cohorts as ground-truth anchors
- optional lineage tracing if branching matters enough to justify the engineering burden.
The core principle is annoyingly elegant:
measure continuously what is cheap and gentle, measure rarely what is rich and destructive, and let the bridge fill in the latent path.
HOW do you get readouts/inferences from marginals tho
By not pretending the marginals are trajectories. That would be a lovely fiction, and biology already has enough of those.
What you infer from marginals is a posterior over plausible paths, not the one true hidden movie of each cell. In Schrödinger bridge language, you observe distributions μ_{t_0}, μ_{t_1}, … at a few times or conditions, choose a reference process R that encodes your default idea of how states drift and diffuse, and then solve for the path measure P* that matches those marginals while staying as close as possible to R in KL divergence. Tang's guide is explicit that SB is a path-space constrained optimization problem, and recent single-cell SB work uses exactly this logic to reconstruct dynamics from unpaired snapshot distributions rather than same-cell longitudinal measurements.
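Written out, that constrained problem is the standard SB formulation (same symbols):

```latex
P^{*} \;=\; \underset{P}{\arg\min}\; \mathrm{KL}\!\left(P \,\middle\|\, R\right)
\quad \text{subject to} \quad P_{t_i} = \mu_{t_i} \;\; \text{for each observed time } t_i ,
```

where P ranges over path measures and P_{t_i} denotes its time-t_i marginal. With a Brownian reference and two marginals, the static projection of this problem is exactly entropy-regularized OT, which is why Sinkhorn-style solvers keep showing up.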
So the pipeline is basically this. First, you turn each timepoint into a distribution in some state space, usually PCA, a VAE latent, or potentially a JEPA latent if you want a more predictive representation. Then you pick a reference dynamics, often Brownian motion, Ornstein-Uhlenbeck, or some smoother Gaussian-process prior if plain Brownian paths are too jagged for the biology. Then you solve the SB problem over those marginals. In practice, modern methods often reduce part of this to entropy-regularized OT couplings between snapshots and then fit the continuous dynamics with score matching or flow matching, which is exactly the logic behind SF²M; smooth-SB variants replace Brownian motion with smoother Gaussian priors to get more regular inferred trajectories.
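A hedged sketch of the simulation-free step that logic describes: given endpoint pairs (here random stand-ins for draws from an entropic coupling), sample an intermediate time, draw the state from the Brownian bridge between the endpoints, and form the drift regression target a network would be trained on. This mirrors the SF²M recipe in spirit, not its actual implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.5   # diffusion scale of the Brownian reference process

# Stand-ins for endpoint pairs sampled from an entropic OT coupling
# between the t=0 and t=1 snapshots.
x0 = rng.normal(size=(256, 2))
x1 = x0 + 1.0 + 0.3 * rng.normal(size=(256, 2))

t = rng.uniform(0.05, 0.95, size=(256, 1))   # stay away from the endpoints

# Brownian bridge between x0 and x1: the mean interpolates linearly and
# the variance sigma^2 * t * (1 - t) vanishes at both endpoints.
mean = (1 - t) * x0 + t * x1
x_t = mean + sigma * np.sqrt(t * (1 - t)) * rng.normal(size=x0.shape)

# Conditional drift of the bridge at (x_t, t): pull toward the endpoint x1.
drift_target = (x1 - x_t) / (1 - t)
# A drift network would be regressed onto drift_target at inputs (x_t, t).
```

No trajectory is ever simulated; every training tuple (x_t, t, drift_target) is sampled in closed form, which is what makes the approach scale.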
The crucial thing is that the marginals do still contain information, just not enough by themselves to identify everything. If the distribution at day 0 is mostly stem-like and the distribution at day 4 splits into two lobes, the bridge that best fits all observed marginals plus the reference process tells you which mass transport patterns are most plausible: how much probability flows from one region to another, where branching likely happens, which regions are probable ancestors of later states, and what intermediate distributions should look like at unmeasured times. That is why SF²M and related methods can infer high-dimensional cell dynamics and even recover regulatory structure from snapshot data, despite never seeing exact same-cell paths.
So when people say “readouts from marginals,” they usually mean things like:
- a coupling or soft matching between cells or regions at adjacent times,
- an inferred velocity/drift field in latent space,
- interpolated populations at missing times,
- branch probabilities or fate probabilities,
- and, in perturbation settings, a learned transport map from control to treated populations.

CellOT is the clean perturbation example: it learns a map between unpaired control and perturbed cell distributions, so the "readout" is not a measured cell-by-cell before/after pair, but a model-implied transport of the population under the perturbation.
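Several of these readouts reduce to arithmetic on the coupling matrix once you have it. A toy example with a hand-built coupling and invented fate labels:

```python
import numpy as np

# Toy coupling: 4 early cells x 6 late cells; entries are joint mass.
P = np.array([
    [0.20, 0.05, 0.00, 0.00, 0.00, 0.00],
    [0.10, 0.10, 0.05, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.05, 0.10, 0.05, 0.05],
    [0.00, 0.00, 0.00, 0.05, 0.10, 0.10],
])

# Invented labels for the 6 late cells: 0 = fate A, 1 = fate B.
fates = np.array([0, 0, 0, 1, 1, 1])

# Fate probabilities: condition each row, then sum mass within each fate.
cond = P / P.sum(axis=1, keepdims=True)     # P(late cell | early cell)
fate_prob = np.stack([cond[:, fates == f].sum(axis=1) for f in (0, 1)], axis=1)
# Each row of fate_prob is [P(fate A), P(fate B)] for one early cell.
```

The same matrix also yields interpolants (barycenters of matched pairs) and drift estimates (mass-weighted displacements), which is why the coupling, not a hardened graph, is the natural object to release.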
Mathematically, you can think of it like this: the marginals tell you where mass starts and ends up, while the reference process tells you what kinds of paths are considered ordinary before seeing the data. The bridge combines those two. If your reference is Brownian, you are saying “absent evidence, assume noisy local wandering.” If your reference is smoother or includes momentum, you are saying “absent evidence, assume more persistent motion.” The inferred path measure is therefore not read directly off the marginals. It is the least-extra-assumption stochastic movie consistent with both the observed marginals and your prior dynamics. That is why the choice of reference process matters so much.
A useful intuition is: imagine you have three crowd photos of a train station at 8:00, 8:05, and 8:10, but no one is labeled. From the photos alone, there are many possible personal histories. If you also know people usually walk continuously, do not teleport, and tend to keep direction over a few seconds, you can infer a probabilistic flow of the crowd. You still do not know exactly that “this blue-shirt person became that one,” but you can infer the most likely mass movement pattern. SB is that, except with distributions in gene-expression or latent-state space instead of annoyed commuters. Civilization was apparently built so we could rediscover soft matching with fancier notation. The formal analog is exactly the KL-closest path measure to a reference process under marginal constraints.
What you cannot honestly extract from marginals alone is equally important. You do not get uniquely identified microscopic trajectories. You do not get guaranteed gene-level causality. You do not get hidden variables that never imprint on the observed state space. And if different mechanisms produce the same observed marginals, SB by itself cannot tell them apart. That is why causal single-cell methods such as CINEMA-OT bring in extra causal structure, and why lineage tracing, repeated perturbations, RNA velocity, spatial context, or live imaging can sharply improve identifiability.
This is also where JEPA can slot in. If you first learn a latent space where partial observations map to stable cell-state representations, then the marginals you feed into SB are distributions over a better state space. That can make the inferred bridge less hostage to raw count noise and dropout. But it still does not eliminate the fundamental ambiguity: SB on a JEPA latent gives you a better-constrained posterior over paths, not omniscience in a lab coat.
So the blunt answer is:
You get inferences from marginals by adding a dynamics prior, solving for the KL-nearest path measure that matches the observed distributions, and then reading off transport statistics from that path measure. The “readout” is a posterior object like a coupling, flow, or branch probability, not a direct non-disruptive measurement of each cell’s private little saga.