OT for generative modeling 1 — the Wasserstein geometry
Topic: Machine Learning
Date:
In Part 0 we defined the Wasserstein distance: the cheapest way to rearrange one distribution into another. We now know how much it costs to move mass. But that framing treats distributions as static objects: you compare two of them, get a number, and that is it. Transport, however, is an inherently dynamical process. Recall our water analogy: probability distributions can flow continuously. Guiding questions for this section:
- How to describe the dynamical aspects of transport? We need the geometric structure of a manifold: distributions as points, velocity fields as tangent vectors, and an inner product. We’ll carefully disentangle the sample domain from the distribution manifold.
- How is the static definition in Part 0 related to the continuous evolution of probability distributions? The Benamou-Brenier formula shows that the Wasserstein-2 distance is the geodesic distance on the distribution manifold: a nested action decomposition connecting static coupling costs to dynamical kinetic energy.
Part 1 is notably denser than Part 0, but also much more beautiful. From now on, we focus on
The manifold
Fixing a sample space
- A snapshot of water in the sample space, of total mass $1$, with a density at each point.
- Ordinary finite-dimensional vectors are functions from an index to a number; think of each distribution as an infinite-dimensional vector with one component at each sample point, whose value is the density there. It’s subject to the additional non-negativity and integrate-to-one constraints.
In this space, a point
Continuity equation, tangent space, vector fields
We next identify derivatives on
Velocities (fancily called tangent vectors) for general vectors are intuitive: just take the component-wise derivative! However, once we restrict ourselves to probability distributions, following a general velocity makes us “slide off” the manifold, even infinitesimally: the result may stop being non-negative or stop integrating to one.
Let’s go back to the fluid perspective. Generally, fluid density
This is a local conservation law which dictates that probability density (mass) cannot evolve by teleporting in the sample space
The continuity equation provides a many-to-one map from (smooth) sample-space vector fields to permissible density evolutions on the distribution manifold.
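For reference, the continuity equation discussed above can be written as follows in standard fluid notation. The symbols are assumptions of this sketch and may differ from the notation used elsewhere in this series: $\rho_t$ is the density at time $t$, and $v_t$ is the sample-space velocity field.

```latex
\partial_t \rho_t(x) + \nabla \cdot \big( \rho_t(x)\, v_t(x) \big) = 0
```

The map from $v_t$ to $\partial_t \rho_t$ is many-to-one because adding any field $w$ with $\nabla \cdot (\rho_t w) = 0$ to $v_t$ leaves $\partial_t \rho_t$ unchanged.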
We can do better. The Helmholtz decomposition says that any (regular) vector field on
Definition 1 (Tangent space of
The space
Intuitively, sample-space vector fields are like wind that can blow the fluid around. However, we don’t care about the component of the wind that makes water go around in infinitesimal circles (these don’t change water density); the remaining vector field degree of freedom can always be identified as a gradient field.
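The wind picture can be checked numerically. The sketch below is an illustration under assumed choices, not part of the original derivation: it takes a standard 2D Gaussian as the density, a pure rotation as the "circling" wind, and the gradient of $(x^2 + y^2)/2$ as a gradient wind, then estimates the rate of change of density $-\nabla \cdot (\rho v)$ by central differences. All function names are hypothetical.

```python
import math

def rho(x, y):
    """Standard 2D Gaussian density (an illustrative choice)."""
    return math.exp(-(x * x + y * y) / 2.0) / (2.0 * math.pi)

def rotation(x, y):
    """Wind that circulates around the origin."""
    return (-y, x)

def radial(x, y):
    """Gradient field of phi(x, y) = (x^2 + y^2) / 2, pushing mass outward."""
    return (x, y)

def density_rate(field, x, y, h=1e-4):
    """Estimate d(rho)/dt = -div(rho * field) at (x, y) by central differences."""
    def flux(px, py):
        vx, vy = field(px, py)
        r = rho(px, py)
        return r * vx, r * vy
    dfx = (flux(x + h, y)[0] - flux(x - h, y)[0]) / (2 * h)
    dfy = (flux(x, y + h)[1] - flux(x, y - h)[1]) / (2 * h)
    return -(dfx + dfy)

point = (0.7, -0.3)
rate_rotation = density_rate(rotation, *point)  # circling wind
rate_radial = density_rate(radial, *point)      # gradient wind
print(rate_rotation, rate_radial)
```

The rotating wind leaves the density unchanged up to discretization error, while the gradient wind drains density from the sampled point.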
The Wasserstein metric
The last section was ankle-deep in differential geometry. Now let’s go knee-deep by introducing a Riemannian metric; this makes
The metric is a bilinear form
- By integrating the norm of the velocity (the square root of the metric applied to the velocity with itself) along a curve, we obtain the length of the curve.
- Given two points, a geodesic is a minimal-length curve between them.
- The (geodesic) distance between two points is the length of the geodesic.
Remark (Euclidean geometry).
The standard Euclidean metric on
Geodesics are straight lines
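As a small sanity check of the Euclidean case, the sketch below approximates curve length in the plane by summing the lengths of short chords, which is a discretization of integrating the norm of the velocity. The two curves and all names are illustrative assumptions: a straight segment and a half-circle detour sharing the same endpoints.

```python
import math

def curve_length(curve, n_steps=10000):
    """Approximate the integral of |velocity| over t in [0, 1] by summing chord lengths."""
    total = 0.0
    prev = curve(0.0)
    for i in range(1, n_steps + 1):
        cur = curve(i / n_steps)
        total += math.dist(prev, cur)
        prev = cur
    return total

def straight(t):
    """Straight segment from (0, 0) to (1, 0)."""
    return (t, 0.0)

def detour(t):
    """Half circle of diameter 1 connecting the same two endpoints."""
    return (0.5 - 0.5 * math.cos(math.pi * t), 0.5 * math.sin(math.pi * t))

straight_length = curve_length(straight)
detour_length = curve_length(detour)
print(straight_length, detour_length)
```

The straight segment has length 1 and the detour has length close to $\pi/2$, consistent with straight lines being the Euclidean geodesics.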
What’s a natural metric on
Definition 2 (Wasserstein metric).
Definition 3 (Wasserstein length).
Given a curve
The Wasserstein distance
Let’s unpack this. We have a smooth, locally continuous deformation from distribution
Wasserstein length as free-fluid action
For a single free particle of mass
Now promote this to a free fluid with density
Up to a factor of
Definition 4 (Dynamical definition of
The Wasserstein-2 distance between
Unifying static and dynamical perspectives
There’s an elephant in the room: we have a static definition of the Wasserstein distance and a dynamical definition, and we have not yet shown that the two agree.
The unifying result is the Benamou-Brenier theorem. It is pivotal because it identifies straight-line transport under the optimal plan as the minimizer that realizes the Wasserstein distance, which is the engine at the heart of flow matching.
Theorem 1 (Benamou-Brenier).
The Kantorovich (static) and Riemannian (dynamical) definitions of
The optimum of the dynamical formulation is achieved by straight-line transport under the optimal plan
We state the result first, then prove it in four steps:
- Single-particle least action: Euler-Lagrange gives straight-line trajectories with the static quadratic cost.
- Single-particle transport plans: a transport plan assigns endpoints to each mass element; classical mechanics takes over.
- The nested decomposition: the macroscopic particle ensemble action splits into an inner infimum over single-particle trajectories and an outer infimum over coupling plans, recovering the static definition.
- From particle ensemble to fluid: the marginal velocity field is the conditional expectation of particle velocities; a variance-drop argument shows particle ensemble and fluid actions coincide for the optimal plan.
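For reference, the equality that these four steps establish can be written as follows in standard notation. The symbols are assumptions of this sketch and may differ from the ones used elsewhere in this series: $\mu_0, \mu_1$ are the endpoint distributions, $\Pi(\mu_0, \mu_1)$ is the set of couplings between them, and $(\rho_t, v_t)$ are the density and velocity field on the time interval $[0, 1]$. The convention without the factor of $\tfrac{1}{2}$ is used.

```latex
W_2^2(\mu_0, \mu_1)
  = \inf_{\pi \in \Pi(\mu_0, \mu_1)} \int |x_1 - x_0|^2 \, \mathrm{d}\pi(x_0, x_1)
  = \inf_{(\rho_t, v_t)} \left\{ \int_0^1 \!\! \int |v_t(x)|^2 \, \rho_t(x) \, \mathrm{d}x \, \mathrm{d}t
      \;:\; \partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0,\ \rho_0 = \mu_0,\ \rho_1 = \mu_1 \right\}
```

In this notation, the straight-line optimum moves each pair $(x_0, x_1)$ drawn from an optimal coupling along $x_t = (1 - t)\, x_0 + t\, x_1$.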
Single-particle least action
A single free particle of unit mass travels from
Plugging back, the minimum action of this single particle is exactly the squared distance between its endpoints (dropping the conventional factor of $\tfrac{1}{2}$).
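The least-action claim can be checked numerically in one dimension. The sketch below is an illustration under assumed values: a unit-mass particle on the time interval $[0, 1]$, hypothetical endpoints `x0` and `x1`, and a perturbed path that keeps the same endpoints. It keeps the conventional factor of $\tfrac{1}{2}$ in the action so the numbers match the textbook Lagrangian.

```python
import math

def action(path, n_steps=2000):
    """Discretized action: integral over [0, 1] of 0.5 * (dx/dt)^2, unit mass, 1D."""
    dt = 1.0 / n_steps
    total = 0.0
    for i in range(n_steps):
        velocity = (path((i + 1) * dt) - path(i * dt)) / dt
        total += 0.5 * velocity ** 2 * dt
    return total

x0, x1 = -1.0, 2.0  # illustrative endpoints

def straight(t):
    """Constant-velocity path from x0 to x1."""
    return (1.0 - t) * x0 + t * x1

def wiggly(t):
    """Same endpoints, plus a bump that vanishes at t = 0 and t = 1."""
    return straight(t) + 0.5 * math.sin(math.pi * t)

straight_action = action(straight)
wiggly_action = action(wiggly)
print(straight_action, wiggly_action)
```

The straight path attains half the squared endpoint distance, and the perturbed path costs strictly more.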
Boundary conditions vs. transport plans
Now scale up to the fluid. The boundary conditions are the marginal distributions:
A transport plan
The nested decomposition
Here is the core of Benamou-Brenier. First consider the dynamical fluid as a macroscopic ensemble of particles. Its action decomposes into two nested minimization problems:
- Inner infimum (classical mechanics): Fix endpoints from the plan. The action-minimizing trajectory is a straight line; the resulting cost is the squared distance between the endpoints.
- Outer infimum (static OT): Substitute the inner solution. What remains is the minimization over plans of the plan-averaged squared distance, which is precisely the static Kantorovich definition in Part 0.
There are three beautifully interleaved viewpoints at work here:
- Static OT is the outer infimum alone. It asks: given the two marginal distributions, what plan minimizes aggregate pairwise cost? Time and dynamics are absent.
- Classical mechanics is the inner infimum alone. It asks: fixing endpoints and mass, what path minimizes action?
- Dynamic OT is both simultaneously. Minimizing the global fluid action discovers the optimal particle trajectories and the optimal transport plan that binds them.
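The nested structure can be made concrete on a toy problem. The sketch below is an illustration under simplifying assumptions, not a general OT solver: both distributions are three equal-mass points in the plane (hypothetical coordinates), plans are restricted to one-to-one assignments, and the outer infimum is found by brute force over permutations. The inner infimum is taken as already solved, so each pair costs its squared distance.

```python
from itertools import permutations

# Illustrative equal-mass point clouds (assumed values).
sources = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
targets = [(3.0, 1.0), (0.0, 3.0), (1.0, 1.0)]

def inner_cost(a, b):
    """Inner infimum, already solved: a straight line costs |b - a|^2 (factor 1/2 dropped)."""
    return sum((bi - ai) ** 2 for ai, bi in zip(a, b))

def plan_cost(plan):
    """Average cost of a plan; plan[i] is the target index assigned to source i."""
    total = sum(inner_cost(sources[i], targets[j]) for i, j in enumerate(plan))
    return total / len(sources)

# Outer infimum: search over all one-to-one plans.
best_plan = min(permutations(range(len(targets))), key=plan_cost)
w2_squared = plan_cost(best_plan)
print(best_plan, w2_squared)
```

Brute force scales factorially, so this only serves to show the two levels of minimization; it is not how one would compute transport plans in practice.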
From particles to fluid
There remains a subtle yet important gap: we decomposed the action into individual particle costs under a plan
Remark.
This reconciliation is what lets us optimize the desired marginal flow matching objective through the tractable conditional flow matching objective.
Given a transport plan
Divide the two to get the macroscopic fluid velocity, which turns out to be the conditional expectation of particle velocities given position:
By the law of total variance (the bias-variance tradeoff in other contexts), the total particle action decomposes as:
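In standard notation, the fluid velocity and the variance decomposition can be sketched as below. The symbols are assumptions of this sketch: $x_t$ is the position at time $t$ of a particle whose endpoints are drawn from the plan, $\dot{x}_t$ is its velocity, and expectations are taken over the plan.

```latex
v_t(x) = \mathbb{E}\big[\dot{x}_t \,\big|\, x_t = x\big],
\qquad
\underbrace{\mathbb{E}\,|\dot{x}_t|^2}_{\text{particle ensemble}}
  = \underbrace{\mathbb{E}\,|v_t(x_t)|^2}_{\text{fluid}}
  + \underbrace{\mathbb{E}\,\big|\dot{x}_t - v_t(x_t)\big|^2}_{\text{conditional variance}}
```

Integrating both sides over $t \in [0, 1]$ turns the pointwise identity into a statement about actions.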
Fluid action is upper-bounded by particle ensemble action. Equality holds when the variance vanishes, i.e., when no two particles cross at the same point at the same time with different velocities. Under the optimal transport plan, particle trajectories do not cross, so the variance vanishes and the two actions coincide.
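The variance drop is easy to see at a single point in space and time. The sketch below is a minimal illustration with made-up velocities and masses: it takes the particles that occupy one position at one instant, computes their average kinetic term, the kinetic term of the averaged (fluid) velocity, and the variance between them. The function name `energy_split` is hypothetical.

```python
def energy_split(velocities, weights):
    """Return (particle_energy, fluid_energy, variance) for particles sharing one point.

    velocities: 1D velocities of particles at the same position and time.
    weights: their masses (need not be normalized).
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    mean_velocity = sum(p * v for p, v in zip(probs, velocities))  # fluid velocity
    particle_energy = sum(p * v ** 2 for p, v in zip(probs, velocities))
    fluid_energy = mean_velocity ** 2
    variance = sum(p * (v - mean_velocity) ** 2 for p, v in zip(probs, velocities))
    return particle_energy, fluid_energy, variance

crossing = energy_split([+2.0, -2.0], [1.0, 1.0])      # head-on crossing
non_crossing = energy_split([+2.0, +2.0], [1.0, 1.0])  # shared velocity, no crossing
print(crossing, non_crossing)
```

In the crossing case the fluid velocity averages to zero and all of the particle energy sits in the variance term; in the non-crossing case the variance is zero and the two energies agree.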
Footnotes
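- More precisely, the relevant decomposition is into $\nabla \phi$ and $w$ with $\nabla \cdot (\rho w) = 0$ ($\rho$-weighted divergence-free), which are orthogonal under the $\rho$-weighted inner product.
- Technically, the tangent space is the $L^2(\rho)$-closure of $\{\nabla \phi : \phi \in C_c^\infty\}$.
- Minimal-cost implies that for any two support pairs $(x_0, x_1)$, $(x_0', x_1')$, we have $|x_1 - x_0|^2 + |x_1' - x_0'|^2 \le |x_1' - x_0|^2 + |x_1 - x_0'|^2$, equivalently $\langle x_0 - x_0',\, x_1 - x_1' \rangle \ge 0$. You can draw some diagrams to convince yourself that this implies no crossing.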