
OT for generative modeling 1 — the Wasserstein geometry

Topic: Machine Learning


In Part 0 we defined the Wasserstein distance: the cheapest way to rearrange one distribution into another. We now know how much it costs to move mass. But that framing treats distributions as static objects — you compare two of them, get a number, and that’s it. Transport, however, is an inherently dynamical process; recall our water analogy: probability distributions can continuously flow. Guiding questions for this section:

  • How do we describe the dynamical aspects of transport? We need the geometric structure of a manifold: distributions as points, velocity fields as tangent vectors, and an inner product. We’ll carefully disentangle the sample domain $\mathcal{X}$ from the distribution manifold $\mathcal{P}_2(\mathcal{X})$.
  • How is the static definition in Part 0 related to the continuous evolution of probability distributions? The Benamou-Brenier formula shows that $W_2$ is the geodesic distance on $\mathcal{P}_2(\mathcal{X})$: a nested action decomposition connecting static coupling costs to dynamical kinetic energy.

Part 1 is notably denser than Part 0, but also much more beautiful. From now on, we focus on $W_2$ with quadratic penalty $\|x - y\|^2$; we’ll see physics meet statistics: the continuity equation in action, the Fokker-Planck equation falling out as a corollary, and the free-particle Lagrangian action providing the key bridge between static and dynamical perspectives on optimal transport.

Contents

The manifold $\mathcal{P}_2(\mathcal{X})$

Fixing a sample space $\mathcal{X} = \mathbb{R}^d$, we consider the set $\mathcal{P}_2(\mathcal{X})$ of all probability distributions over $\mathcal{X}$ with finite variance. For example, a point could be a distribution over all images in $\mathbb{R}^d$. $\mathcal{X}$ is endowed with the standard Euclidean topology. Two perspectives on a point $\mu \in \mathcal{P}_2(\mathcal{X})$:

  • A snapshot of water in $\mathcal{X}$ of total mass $1$, with density $\mu(x)$ at each point $x$.
  • A function $\mu : \mathcal{X} \to \mathbb{R}$; think of each $\mu$ as an infinite-dimensional vector with one component at each $x \in \mathcal{X}$, of value $\mu(x)$. It’s subject to the additional non-negativity and integrate-to-one constraints.

In this space, a point is an entire distribution over $\mathcal{X}$. There are two spaces at play — the sample domain $\mathcal{X}$ where data lives, and the distribution manifold $\mathcal{P}_2(\mathcal{X})$ where each point is itself a distribution — and it’s crucial to separate them.

Continuity equation, tangent space, vector fields

We next identify derivatives on $\mathcal{P}_2(\mathcal{X})$. Note that $\mathcal{P}_2(\mathcal{X})$ is a subset of the ambient space $L^2(\mathcal{X})$ of square-integrable functions.

Velocities (fancily called tangent vectors) for general vectors are intuitive: just take the component-wise derivative $\partial_t \mu_t(x)$ at each $x$! However, when we restrict ourselves to probability distributions, we’ll “slide off” the manifold if we follow general velocities, even if only infinitesimally.

Let’s go back to the fluid perspective. Generally, fluid density evolves according to the continuity equation
$$\partial_t \mu_t(x) = -\nabla \cdot \big( \mu_t(x) \, v_t(x) \big).$$

This is a local conservation law which dictates that probability density (mass) cannot evolve by teleporting in the sample space $\mathcal{X}$: the change $\partial_t \mu_t$ equals the negative divergence of the mass flux $\mu_t v_t$ for some vector field $v_t$ in the sample space.

The continuity equation provides a many-to-one map from (smooth) sample-space vector fields $v_t$ to permissible density evolutions $\partial_t \mu_t$ on $\mathcal{P}_2(\mathcal{X})$.
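As a sanity check on this flux form, here’s a minimal 1D finite-difference sketch (the grid size, initial bump, and constant velocity field are arbitrary illustrative choices): evolving a density by the continuity equation conserves total mass to machine precision.

```python
import numpy as np

# Periodic 1D grid on [0, 1).
n = 200
x = np.linspace(0.0, 1.0, n, endpoint=False)
dx = x[1] - x[0]

# Initial density: a normalized Gaussian bump.
mu = np.exp(-((x - 0.3) ** 2) / 0.005)
mu /= mu.sum() * dx

v = np.full(n, 0.5)                 # sample-space velocity field v(x)
dt = 0.5 * dx / np.abs(v).max()     # CFL-stable time step

for _ in range(100):
    flux = mu * v                   # mass flux mu * v
    # Upwind difference (v > 0) of the flux approximates div(mu v).
    div = (flux - np.roll(flux, 1)) / dx
    mu = mu - dt * div              # continuity: d mu / dt = -div(mu v)

print(abs(mu.sum() * dx - 1.0))     # total mass is conserved: ~0
```

The density translates rightward, but no mass is created or destroyed — exactly the “no teleporting” constraint.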

We can do better. The Helmholtz decomposition says that any (regular) vector field on $\mathcal{X}$ splits as $v = \nabla \phi + w$ where $w$ is divergence-free — this is the familiar curl-divergence decomposition in 3D. The divergence-free component $w$ swirls mass along the contours of $\mu$ without changing it — it satisfies¹ $\nabla \cdot (\mu w) = 0$ and contributes nothing to $\partial_t \mu_t$ in the continuity equation above. Quotienting out these invisible components:

Definition 1 (Tangent space of $\mathcal{P}_2(\mathcal{X})$).

The space $T_\mu \mathcal{P}_2$ of probability density velocities $\partial_t \mu_t$ on the Wasserstein manifold is one-to-one² with the space of gradient vector fields $\{\nabla \phi\}$ on the sample space, via the continuity equation $\partial_t \mu_t = -\nabla \cdot (\mu_t \nabla \phi_t)$.

Intuitively, sample-space vector fields are like wind that can blow the fluid around. However, we don’t care about the component of the wind that makes water go around in infinitesimal circles (it doesn’t change the water density); the remaining vector-field degree of freedom can always be identified with a gradient field $\nabla \phi$.
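To make the “invisible swirl” concrete, here’s a small numerical sketch (the Gaussian density and grid are arbitrary illustrative choices): for a radially symmetric density, the rotational field $w(x, y) = (-y, x)$ is divergence-free and runs along the density’s contours, so the flux divergence $\nabla \cdot (\mu w)$ vanishes.

```python
import numpy as np

# 2D grid and a radially symmetric (unnormalized Gaussian) density.
n = 129
lin = np.linspace(-4.0, 4.0, n)
h = lin[1] - lin[0]
X, Y = np.meshgrid(lin, lin, indexing="ij")
mu = np.exp(-(X**2 + Y**2) / 2.0)

# Rotational field w = (-y, x): divergence-free, tangent to mu's contours.
fx, fy = mu * (-Y), mu * X              # mass flux mu * w

# Central-difference divergence of the flux.
div = np.gradient(fx, h, axis=0) + np.gradient(fy, h, axis=1)

# The swirl moves mass along level sets, so it never changes the density.
print(np.abs(div[1:-1, 1:-1]).max())    # ~0 (up to discretization error)
```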

The Wasserstein metric

The last section was ankle-deep in differential geometry; now let’s go knee-deep by introducing a Riemannian metric, which makes $\mathcal{P}_2(\mathcal{X})$ a Riemannian manifold. A Riemannian metric endows a manifold with notions of angle and length.

The metric is a bilinear form that takes in two tangent vectors and computes an inner product.

  • By integrating the norm of the velocity (measured by the metric) along a curve, we obtain the length of the curve.
  • Given two points, a geodesic is a minimal-length curve between them.
  • The (geodesic) distance between two points is the length of the geodesic.

Remark (Euclidean geometry).

The standard Euclidean metric on $\mathbb{R}^d$ consumes two vectors $u, v$ and computes $\langle u, v \rangle = u^\top v$. The length of a curve $(x_t)_{t \in [0,1]}$ is
$$\int_0^1 \|\dot{x}_t\| \, dt.$$

Geodesics are straight lines $x_t = (1-t)\, x_0 + t\, x_1$, and the geodesic distance is $\|x_1 - x_0\|$.

What’s a natural metric on $\mathcal{P}_2(\mathcal{X})$ for two tangent vectors at density $\mu$, represented by gradient vector fields $\nabla \phi_1, \nabla \phi_2$? A natural candidate is their $\mu$-weighted inner product on the sample space:

Definition 2 (Wasserstein metric).
$$\langle \nabla \phi_1, \nabla \phi_2 \rangle_\mu = \int_{\mathcal{X}} \nabla \phi_1(x) \cdot \nabla \phi_2(x) \, \mu(x) \, dx.$$

Definition 3 (Wasserstein length).

Given a curve $(\mu_t)_{t \in [0,1]}$ connecting $\mu_0$ to $\mu_1$, we identify each velocity $\partial_t \mu_t$ with a gradient vector field $v_t = \nabla \phi_t$. The Wasserstein length of the curve is
$$\operatorname{Len}\big((\mu_t)_t\big) = \int_0^1 \|v_t\|_{\mu_t} \, dt = \int_0^1 \sqrt{\int_{\mathcal{X}} \|v_t(x)\|^2 \, \mu_t(x) \, dx} \; dt.$$

The Wasserstein distance $W_2(\mu_0, \mu_1)$ is the infimum of this length over all such curves.

Let’s unpack this. We have a smooth deformation from distribution $\mu_0$ to $\mu_1$ given by the family of distributions $(\mu_t)_{t \in [0,1]}$, where $\mu_t$ is generated by the flow of $v_t$ on the sample space. The squared Wasserstein length of a constant-speed curve is the integral over time and space of the infinitesimal “action”, which is just the mass $\mu_t(x)$ multiplied by the velocity squared $\|v_t(x)\|^2$.

Wasserstein length as free-fluid action

For a single free particle of mass $m$ traveling with velocity $\dot{x}_t$, the Lagrangian is purely kinetic: $L = \frac{1}{2} m \|\dot{x}_t\|^2$. The action over a trajectory $(x_t)_{t \in [0,1]}$ is
$$S = \int_0^1 \frac{1}{2} m \|\dot{x}_t\|^2 \, dt.$$

Now promote this to a free fluid with density $\mu_t$. Each infinitesimal parcel at $x$ carries mass $\mu_t(x)\, dx$ and moves with velocity $v_t(x)$. The total action of the fluid is
$$S = \int_0^1 \int_{\mathcal{X}} \frac{1}{2} \|v_t(x)\|^2 \, \mu_t(x) \, dx \, dt.$$

Up to the factor of $\frac{1}{2}$, the squared Wasserstein length of a constant-speed curve is exactly the action of a free fluid flowing from $\mu_0$ to $\mu_1$ under velocity field $v_t$. This is the mechanical meaning of the Wasserstein metric.

Definition 4 (Dynamical definition of ).

The Wasserstein-2 distance between $\mu_0$ and $\mu_1$ is equivalently given by the minimum fluid action between configurations $\mu_0$ and $\mu_1$ (with the conventional $\frac{1}{2}$ dropped):
$$W_2^2(\mu_0, \mu_1) = \inf_{(\mu_t, v_t)} \int_0^1 \int_{\mathcal{X}} \|v_t(x)\|^2 \, \mu_t(x) \, dx \, dt \quad \text{s.t.} \quad \partial_t \mu_t = -\nabla \cdot (\mu_t v_t), \;\; \mu_{t=0} = \mu_0, \;\; \mu_{t=1} = \mu_1.$$
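A quick numerical check of the dynamical definition in 1D (the Gaussian parameters are arbitrary illustrative choices): between two Gaussians, the monotone affine map is the optimal transport map, and the fluid action of the induced straight-line flow matches the known closed form $W_2^2 = (m_1 - m_0)^2 + (s_1 - s_0)^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
m0, s0 = 0.0, 1.0      # mu_0 = N(0, 1)
m1, s1 = 3.0, 2.0      # mu_1 = N(3, 4)

# In 1D the monotone (increasing) map is optimal; between Gaussians it is affine.
T = lambda x: m1 + (s1 / s0) * (x - m0)

x0 = rng.normal(m0, s0, size=1_000_000)
u = T(x0) - x0                    # constant velocity of each mass element
action = float(np.mean(u ** 2))   # fluid action of the straight-line flow

closed_form = (m1 - m0) ** 2 + (s1 - s0) ** 2   # known W2^2 for 1D Gaussians
print(action, closed_form)        # both ≈ 10.0
```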

Unifying static and dynamical perspectives

There’s an elephant in the room: we have the static definition of the Wasserstein distance (Part 0) and the dynamical definition (Definition 4), and they had better agree. The Kantorovich formulation optimizes over transport plans; the Riemannian formulation optimizes over fluid flows. Beautiful theories should have unique, canonical definitions — and these two are indeed the same.

The unifying result is the Benamou-Brenier theorem. This theorem is pivotal because it identifies straight-line (linear) transport under the optimal plan as realizing the Wasserstein distance — the engine at the heart of flow matching.

Theorem 1 (Benamou-Brenier).

The Kantorovich (static) and Riemannian (dynamical) definitions of $W_2^2$ coincide:
$$W_2^2(\mu_0, \mu_1) = \inf_{\pi \in \Pi(\mu_0, \mu_1)} \int \|x_1 - x_0\|^2 \, d\pi(x_0, x_1) = \inf_{(\mu_t, v_t)} \int_0^1 \int_{\mathcal{X}} \|v_t(x)\|^2 \, \mu_t(x) \, dx \, dt.$$

The optimum of the dynamical formulation is achieved by straight-line transport under the optimal plan $\pi^*$: each mass element $(x_0, x_1) \sim \pi^*$ follows the trajectory $x_t = (1-t)\, x_0 + t\, x_1$ with constant velocity $x_1 - x_0$. The optimal marginal velocity field is
$$v_t^*(x) = \mathbb{E}_{\pi^*}\left[\, x_1 - x_0 \mid x_t = x \,\right].$$

We state the result first, then prove it in four steps:

  1. Single-particle least action: Euler-Lagrange gives straight-line trajectories with the static quadratic cost $\|x_1 - x_0\|^2$.
  2. Single-particle transport plans: a transport plan assigns endpoints to each mass element; classical mechanics takes over.
  3. The nested decomposition: the macroscopic particle-ensemble action splits into an inner infimum over single-particle trajectories and an outer infimum over coupling plans, recovering the static definition.
  4. From particle ensemble to fluid: the marginal velocity field is the conditional expectation of particle velocities; a variance-drop argument shows that particle-ensemble and fluid actions coincide for the optimal plan $\pi^*$.

Single-particle least action

A single free particle of unit mass travels from $x_0$ at $t = 0$ to $x_1$ at $t = 1$. The Lagrangian is purely kinetic: $L = \frac{1}{2} \|\dot{x}_t\|^2$. By the Euler-Lagrange equation ($\frac{d}{dt} \frac{\partial L}{\partial \dot{x}} = \frac{\partial L}{\partial x}$, here $\ddot{x}_t = 0$), the action-minimizing trajectory is a straight line at constant velocity:
$$x_t = (1-t)\, x_0 + t\, x_1, \qquad \dot{x}_t = x_1 - x_0.$$

Plugging back in, the minimum action of this single particle is exactly
$$\int_0^1 \|x_1 - x_0\|^2 \, dt = \|x_1 - x_0\|^2$$
(dropping the conventional $\frac{1}{2}$). This is another perspective on the quadratic static cost from Part 0: it is the minimal action of free-particle travel between the two endpoints.
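A small numerical sketch of this least-action principle (the endpoints and the sinusoidal perturbation are arbitrary illustrative choices): discretizing $\int_0^1 \|\dot{x}_t\|^2 \, dt$, the straight line attains $\|x_1 - x_0\|^2$ and any wiggle costs strictly more.

```python
import numpy as np

def action(path, dt):
    # Discretized int_0^1 ||xdot||^2 dt (conventional 1/2 dropped).
    vel = np.diff(path, axis=0) / dt
    return float(np.sum(vel ** 2) * dt)

x0, x1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
K = 1000
t = np.linspace(0.0, 1.0, K + 1)[:, None]
dt = 1.0 / K

straight = (1 - t) * x0 + t * x1                           # Euler-Lagrange solution
wiggle = straight + 0.1 * np.sin(np.pi * t) * np.array([1.0, -1.0])

print(action(straight, dt))   # ≈ 5.0 = ||x1 - x0||^2
print(action(wiggle, dt))     # strictly larger
```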

Boundary conditions vs. transport plans

Now scale up to the fluid. The boundary conditions are the marginal distributions: $\mu_0$ at $t = 0$ and $\mu_1$ at $t = 1$. The fluid-level boundary condition is underspecified: it tells us the shape of the endpoint distributions, but not which mass goes where. Infinitely many arrangements are compatible.

A transport plan $\pi$ resolves this ambiguity. It’s a coupling that assigns specific endpoints to every infinitesimal unit of mass: “$\pi(x_0, x_1)$ mass travels from $x_0$ to $x_1$.” Once a plan is fixed, classical mechanics takes over — every mass element independently follows its own Euler-Lagrange straight-line path.

The nested decomposition

Here is the core of Benamou-Brenier. Viewing the dynamical fluid as a macroscopic ensemble of particles, its action decomposes into two nested minimization problems:

  1. Inner infimum (classical mechanics): Fix endpoints $(x_0, x_1)$ from the plan. The action-minimizing trajectory is a straight line; the resulting cost is $\|x_1 - x_0\|^2$.
  2. Outer infimum (static OT): Substitute the inner solution. What remains is $\inf_{\pi \in \Pi(\mu_0, \mu_1)} \int \|x_1 - x_0\|^2 \, d\pi$ — precisely the static Kantorovich definition in Part 0.

There are three beautifully interleaved viewpoints at work here:

  • Static OT is the outer infimum alone. It asks: given $\mu_0$ and $\mu_1$, what plan minimizes the aggregate pairwise cost? Time and dynamics are absent.
  • Classical mechanics is the inner infimum alone. It asks: fixing endpoints and mass, what path minimizes action?
  • Dynamic OT is both simultaneously. Minimizing the global fluid action discovers the optimal particle trajectories and the optimal transport plan that binds them.
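The nested decomposition can be verified by brute force on a tiny discrete example (the point clouds and their size are arbitrary illustrative choices): enumerate all permutation plans for the outer infimum, plug straight lines into the inner infimum, and confirm that the dynamic action of the resulting flow equals the static cost.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 6
X0 = rng.normal(size=(n, 2))          # source particles, mass 1/n each
X1 = rng.normal(size=(n, 2)) + 3.0    # target particles

# Outer infimum: brute force over permutation plans.
best_cost, best_perm = min(
    (float(np.sum((X1[list(p)] - X0) ** 2)) / n, p)
    for p in itertools.permutations(range(n))
)
print("static W2^2 =", best_cost)

# Inner infimum (already solved): each matched pair moves on a straight line
# with constant velocity, so the fluid action equals the static cost.
V = X1[list(best_perm)] - X0                  # constant velocity per particle
dynamic_action = float(np.sum(V ** 2)) / n    # int_0^1 ||v||^2 dt, v constant
print(np.isclose(dynamic_action, best_cost))  # -> True
```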

From particles to fluid

There remains a subtle yet important gap: we decomposed the action into individual particle costs under a plan $\pi$, but the dynamical definition is written in terms of a macroscopic velocity field $v_t$, not individual particle trajectories. How do we reconcile the particle and fluid perspectives?

Remark.

This reconciliation is the key engine that lets one optimize the desired marginal flow-matching objective via the tractable conditional flow-matching objective.

Given a transport plan $\pi$, each mass element $(x_0, x_1) \sim \pi$ follows its straight-line trajectory $x_t = (1-t)\, x_0 + t\, x_1$ with velocity $x_1 - x_0$. Multiple particles may pass through the same point $x$ at time $t$, potentially with different velocities. The momentum density and mass density at $x$ at time $t$ are:
$$p_t(x) = \mathbb{E}_{\pi}\big[(x_1 - x_0)\, \delta(x_t - x)\big], \qquad \mu_t(x) = \mathbb{E}_{\pi}\big[\delta(x_t - x)\big].$$

Divide the two to get the macroscopic fluid velocity, which turns out to be the conditional expectation of particle velocities given position:
$$v_t(x) = \frac{p_t(x)}{\mu_t(x)} = \mathbb{E}_{\pi}\left[\, x_1 - x_0 \mid x_t = x \,\right].$$

By the law of total variance (the bias-variance decomposition in other contexts), the total particle action decomposes as:
$$\int_0^1 \mathbb{E}_{\pi} \|x_1 - x_0\|^2 \, dt = \int_0^1 \int_{\mathcal{X}} \|v_t(x)\|^2 \, \mu_t(x) \, dx \, dt \;+\; \int_0^1 \mathbb{E}\big[\operatorname{Var}_{\pi}(x_1 - x_0 \mid x_t)\big] \, dt.$$

The fluid action is thus upper-bounded by the particle-ensemble action. Equality holds when the variance term vanishes — i.e., when no two particles cross the same point at the same time with different velocities. Under the optimal transport plan $\pi^*$, this is guaranteed by cyclical monotonicity: if two particles’ trajectories crossed, swapping their destinations would reduce the total cost³. Therefore, for the optimal plan, particle and fluid actions coincide exactly, completing the bridge.
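Here’s a Monte Carlo sketch of the variance drop (the independent coupling, Gaussian marginals, and binning estimator of the conditional expectation are all illustrative choices): under a deliberately non-optimal coupling, averaging crossing particle velocities into a marginal field $v_t$ strictly reduces the action.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
x0 = rng.standard_normal(N)           # mu_0 = N(0, 1)
x1 = rng.standard_normal(N)           # mu_1 = N(0, 1), independent coupling
u = x1 - x0                           # per-particle (constant) velocity

particle = fluid = 0.0
ts = np.linspace(0.025, 0.975, 20)    # midpoint rule in time
for t in ts:
    xt = (1 - t) * x0 + t * x1
    # Estimate v_t(x) = E[x1 - x0 | x_t = x] by binning positions.
    edges = np.linspace(xt.min(), xt.max(), 200)
    b = np.digitize(xt, edges)
    vbar = np.bincount(b, weights=u) / np.maximum(np.bincount(b), 1)
    particle += np.mean(u ** 2) / len(ts)
    fluid += np.mean(vbar[b] ** 2) / len(ts)

print(particle)   # ≈ 2, the particle-ensemble action
print(fluid)      # noticeably smaller: the conditional mean averaged away crossings
```

Many particle pairs here cross with opposite velocities, so the conditional-mean field is much cheaper than the particle ensemble; under the optimal (monotone) plan there would be no crossings and the two actions would coincide.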

Footnotes

  1. More precisely, the relevant decomposition is into $\nabla \phi$ and $w$ with $\nabla \cdot (\mu w) = 0$ ($\mu$-weighted divergence-free); the two components are orthogonal under the $\mu$-weighted inner product $\langle \cdot, \cdot \rangle_\mu$.

  2. Technically, $T_\mu \mathcal{P}_2$ is the $L^2(\mu)$-closure of $\{\nabla \phi : \phi \in C_c^\infty(\mathcal{X})\}$.

  3. Minimal cost implies that for any two support pairs $(x_0, x_1), (x_0', x_1')$ of $\pi^*$, $\|x_1 - x_0\|^2 + \|x_1' - x_0'\|^2 \leq \|x_1' - x_0\|^2 + \|x_1 - x_0'\|^2$, equivalently $(x_0' - x_0) \cdot (x_1' - x_1) \geq 0$. You can draw some diagrams to convince yourself that this implies no crossing.
