
OT for generative modeling 0 — the static perspective

Topic: Machine Learning


There are two spaces of mathematical objects at the heart of generative modeling. On the sample space $\mathcal{X}$, Euclidean geometry gives us clean, benign objectives such as mean squared error. On the space of distributions $\mathcal{P}(\mathcal{X})$, maximum likelihood singles out the Kullback-Leibler divergence as the principled objective, and the space has a geometry of its own. However, KL is agnostic towards the internal geometry of $\mathcal{X}$ and, consequently, the inductive biases associated with it; $D_{\mathrm{KL}}(p \,\|\, q)$ also blows up when $p$'s support is not contained in $q$'s.

Optimal transport provides one of the most mathematically beautiful bridges between the two spaces. It has also become the backbone of modern generative modeling. My personal acquaintance with the field began with the Wasserstein GAN. In school, information theory provided the necessary tools, and I’ve always been intrigued by the connections between e.g. flow matching, Wasserstein gradient flow, and even renormalization. Motivated by a highly impressive recent work on drifting models, I decided to write these posts to learn, and unpack, some of the concepts.

This series builds the optimal transport toolkit from scratch, with one eye on the mathematics and the other on the generative models it enables. The posts aspire to be curiosity-driven and hand-wave some epsilons and deltas. To start off, two themes run through everything; we’ve already seen the first:

  1. Bridging sample and distribution geometries. Optimal transport connects Euclidean geometry to statistical divergences. This empowers us to understand training processes with benign objectives through the lens of proper distributional optimization.
  2. Bridging static and dynamic definitions. The equivalence between static, closed-form quantities (e.g. transport cost) and dynamic quantities (e.g. integral of kinetic energy) hinges on a clean decomposition: solve the single-particle action-extremizing path, then marginalize over a fixed distribution. This is the engine under Benamou-Brenier and the key to making flow matching computable using conditional flow matching.

Layout of the posts:

  • Part 0 (this post): the static picture — static definition of Wasserstein distance, and the variational dual that empowers Wasserstein-GAN.
  • Part 1: the geometric picture — the Wasserstein distribution manifold, the dynamic definition of distance, and unification with the static picture.
  • Part 2: application of the optimal transport perspective to drifting models.

Contents

Wasserstein distance

We work on $\mathbb{R}^d$. Interpret a probability distribution $\mu$ on $\mathbb{R}^d$ as a snapshot of a fluid1 on $\mathbb{R}^d$, with $\mu(x)$ denoting the height of water at $x$; water has uniform density, so the mass of water in each region is proportional to the enclosed probability. In short, mass $=$ probability.

We trivially associate the cost of moving a unit of mass from $x$ to $y$ with $\|x - y\|^2$. You’re welcome to pick other powers or cost functions ($\|x - y\|$ is another popular one). In plain words:

Definition 1.

The Wasserstein distance between distributions $\mu$ and $\nu$ is the minimum cost of transporting $\mu$ to $\nu$ according to the quadratic cost model above.

Great! Nice intuition; now let’s go about rigorously defining it.

What kind of mathematical object conveniently describes a transport plan from $\mu$ to $\nu$? Kantorovich thought about this and came up with the set $\Pi(\mu, \nu)$:

Definition 2.

The coupling set $\Pi(\mu, \nu)$ is the set of all joint distributions $\pi$ on $\mathbb{R}^d \times \mathbb{R}^d$ whose marginals are $\mu$ and $\nu$:

$$\Pi(\mu, \nu) = \{\, \pi \in \mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d) : \pi_1 = \mu,\ \pi_2 = \nu \,\}.$$

Here $\pi_1, \pi_2$ denote the first and second marginals of $\pi$: for every measurable $A \subseteq \mathbb{R}^d$,

$$\pi_1(A) = \pi(A \times \mathbb{R}^d), \qquad \pi_2(A) = \pi(\mathbb{R}^d \times A).$$

Such a $\pi$ is called a coupling (or transport plan) of $\mu$ and $\nu$.

A joint distribution $\pi$ denotes that we will transport mass $\pi(x, y)$ from $x$ to $y$. We have already defined the point-to-point cost. The static definition of the Wasserstein distance is consequently manifest:

Definition 3.

The Wasserstein-2 distance between $\mu$ and $\nu$ is

$$W_2^2(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int \|x - y\|^2 \, \mathrm{d}\pi(x, y).$$

We’re taking an infimum over transport plans $\pi \in \Pi(\mu, \nu)$ of the transport cost associated with each $\pi$. Thus the name optimal transport.
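Because this infimum is a linear program (more on that below), small discrete instances can be solved exactly with an off-the-shelf LP solver. Here is a minimal sketch using scipy's `linprog`; the two point clouds and their weights are made up for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Discrete distributions: mu on points xs with weights a, nu on ys with weights b.
xs = np.array([0.0, 1.0])
ys = np.array([0.0, 2.0])
a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])

# Quadratic cost matrix c_ij = |x_i - y_j|^2; pi_ij is flattened row-major.
C = (xs[:, None] - ys[None, :]) ** 2
m, n = C.shape

# Equality constraints: row sums of pi equal a, column sums equal b.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0   # sum_j pi_ij = a_i
for j in range(n):
    A_eq[m + j, j::n] = 1.0            # sum_i pi_ij = b_j
b_eq = np.concatenate([a, b])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)  # 0.5: keep the mass at 0, move mass 0.5 from x=1 to y=2 at cost 1^2
```

The optimal plan here is a vertex of the coupling polytope: it leaves the atom at $0$ in place and ships the other half of the mass from $1$ to $2$.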

Remark.

We call the definition static because the transport plan only tells you how much to transfer from here to there, not what trajectory the transported mass travels through. The dynamic version will be introduced in the next post.

Remark.

Also recognize the Wasserstein distance as a linear program: the constraints and the cost are both linear in $\pi$. It’s screaming duality, which we’ll explore next.

Remark.

On finite sample spaces with uniform marginals, $\Pi(\mu, \nu)$ is represented by (suitably scaled) doubly-stochastic matrices. Look up Sinkhorn-Knopp and the Birkhoff polytope (the convex hull of permutation matrices).
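To make the Sinkhorn-Knopp pointer concrete, here is a sketch of entropy-regularized optimal transport via alternating matrix scaling. The regularization strength `eps` and the iteration count are arbitrary illustrative choices, and production implementations work in log-space for numerical stability:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, iters=500):
    """Entropy-regularized OT plan via Sinkhorn-Knopp matrix scaling.

    Alternately rescales the rows and columns of K = exp(-C / eps)
    until the marginals of pi = diag(u) K diag(v) match a and b.
    """
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # the (approximate) coupling pi

xs = np.linspace(0, 1, 5)
a = np.full(5, 0.2)
b = np.full(5, 0.2)
C = (xs[:, None] - xs[None, :]) ** 2
pi = sinkhorn(a, b, C)
print(pi.sum(axis=1))  # row marginals match a: [0.2 0.2 0.2 0.2 0.2]
```

As `eps` shrinks, the entropic plan approaches the true optimal coupling (here, the identity plan, since the two marginals coincide).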

Variational characterization and W-GAN

When we say a variational characterization of something, we generally mean writing “something” as a minimum or maximum. Such characterizations are extremely useful because (1) theoretically, they provide bounds, (2) SGD loves extremizing things, and (3) they provide valuable insight into what the quantity is doing.

Let’s rewrite the Wasserstein distance, generalized to $W_p^p$ (cost $c(x, y) = \|x - y\|^p$) from $W_2^2$, by replacing the hard constraint $\pi \in \Pi(\mu, \nu)$ with nested optimization:

$$W_p^p(\mu, \nu) = \inf_{\pi \ge 0} \sup_{f, g} \left[ \int c \, \mathrm{d}\pi + \int f \, \mathrm{d}\mu + \int g \, \mathrm{d}\nu - \int \big( f(x) + g(y) \big) \, \mathrm{d}\pi(x, y) \right].$$

Conceptualize this as an adversarial equilibrium between two players

  • The primal player controls $\pi$ and wants to minimize the transport cost, subject to the marginal constraints.
  • Instead of a hard constraint, we equivalently enforce it by introducing an adversarial dual player who controls $(f, g)$.
  • The dual player can crank up the penalty on any constraint deviation, at a cost to the primal player. This theme will appear again very shortly: penalties are used consistently in nested extremization to enforce constraints.
  • Note: in nested extremization, the inner optimizer gets to react to the outer choice as a constant, so chronologically, the outer operator moves first.

As a first pass, we give the dual player the power to choose arbitrary functions $f, g : \mathbb{R}^d \to \mathbb{R}$ in the nested optimization above.

Let’s re-interpret this game by swapping the $\inf$ and the $\sup$ (regularity conditions needed), regrouping, and specializing to $p = 1$, i.e. the cost $\|x - y\|$, in the distance:

$$W_1(\mu, \nu) = \sup_{f, g} \left[ \int f \, \mathrm{d}\mu + \int g \, \mathrm{d}\nu + \inf_{\pi \ge 0} \int \big( \|x - y\| - f(x) - g(y) \big) \, \mathrm{d}\pi(x, y) \right].$$

In this interpretation of the game (the minimax theorem guarantees equivalent value), the dual player moves first, and the primal player reacts to minimize their penalty. Note that the values of the $\inf \sup$ and the $\sup \inf$ are the same; only the order of play has changed.

Looking at the inner infimum: if the dual player chose $f, g$ such that $f(x) + g(y) > \|x - y\|$ anywhere, then the integrand is negative there and the primal player will happily put infinite mass on it, sending the value to $-\infty$. Mathematically, the inner infimum evaluates to an indicator function of the constraint2:

$$\inf_{\pi \ge 0} \int \big( \|x - y\| - f(x) - g(y) \big) \, \mathrm{d}\pi = \begin{cases} 0 & \text{if } f(x) + g(y) \le \|x - y\| \text{ for all } x, y, \\ -\infty & \text{otherwise.} \end{cases}$$

Because the dual player is trying to maximize the overall objective, they’ll avoid any choice that triggers the penalty. Therefore, we can safely restrict the domain of the supremum to the subset of functions where the penalty is zero:

$$W_1(\mu, \nu) = \sup_{f(x) + g(y) \le \|x - y\|} \left[ \int f \, \mathrm{d}\mu + \int g \, \mathrm{d}\nu \right].$$

This collapses to a single degree of freedom. Setting $y = x$ in the constraint gives $g(x) \le -f(x)$, and since the dual player wants to maximize $\int g \, \mathrm{d}\nu$, at optimality $g = -f$. Substituting back: the constraint $f(x) - f(y) \le \|x - y\|$ for all $x, y$ is exactly the 1-Lipschitz condition (by symmetry in $x \leftrightarrow y$).

Definition 4.

A function $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz if for all $x, y$, $|f(x) - f(y)| \le L \|x - y\|$. We write $\mathrm{Lip}_1$ for the set of all 1-Lipschitz functions.

Rearranging, we obtain the Kantorovich-Rubinstein dual of $W_1$ (a similar result exists for general $W_p$):

$$W_1(\mu, \nu) = \sup_{f \in \mathrm{Lip}_1} \left[ \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)] \right].$$
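A quick numerical sanity check of the dual, assuming scipy is available (`scipy.stats.wasserstein_distance` computes $W_1$ in 1D via the closed form $\int |F_\mu - F_\nu|$). For two Gaussians that differ only by a mean shift, $W_1$ equals the shift, and the 1-Lipschitz critic $f(x) = -x$ attains the supremum:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 20000)   # samples from mu = N(0, 1)
y = rng.normal(3.0, 1.0, 20000)   # samples from nu = N(3, 1)

# W1 between the two empirical distributions.
w1 = wasserstein_distance(x, y)

# Kantorovich-Rubinstein: any 1-Lipschitz f lower-bounds W1 through
# E_mu[f] - E_nu[f]. For a pure mean shift, f(x) = -x is optimal.
lower_bound = (-x).mean() - (-y).mean()
print(w1, lower_bound)  # both close to the mean shift, 3.0
```

The lower bound holds exactly for the empirical distributions, so `lower_bound <= w1` up to floating point, and both converge to $3$ as the sample size grows.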

Remark.

Why does this matter for generative modeling? Consider training a generator $G_\theta$ that pushes a noise distribution $p_z$ toward data $p_{\mathrm{data}}$. Let $p_\theta = (G_\theta)_\# p_z$ be the generated distribution. Then

$$W_1(p_{\mathrm{data}}, p_\theta) = \sup_{f \in \mathrm{Lip}_1} \left[ \mathbb{E}_{x \sim p_{\mathrm{data}}}[f(x)] - \mathbb{E}_{z \sim p_z}[f(G_\theta(z))] \right].$$

This is exactly the Wasserstein-GAN (WGAN) objective: train a neural-network critic $f$ to approximate the supremum, and train $G_\theta$ to minimize the resulting distance estimate. The Lipschitz constraint is enforced by weight clipping or a gradient penalty. Compared to the original GAN’s Jensen-Shannon divergence, $W_1$ provides useful gradients even when the supports of $p_{\mathrm{data}}$ and $p_\theta$ don’t overlap, which is precisely the vanishing-gradient pathology that plagued early GANs.
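To make the two-player loop concrete, here is a deliberately tiny sketch of WGAN-style training in 1D. Everything is an illustrative stand-in: the "generator" just shifts noise by a scalar $\theta$, the "critic" is linear, $f(x) = w x$, which is exactly 1-Lipschitz once $w$ is clipped to $[-1, 1]$, and the learning rates and batch size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = -4.0   # generator parameter: g(z) = z + theta
w = 0.1        # critic parameter:    f(x) = w * x

for step in range(2000):
    z = rng.normal(size=256)
    real = rng.normal(loc=2.0, size=256)   # "data" distribution: N(2, 1)
    fake = z + theta

    # Critic ascends E_real[f] - E_fake[f]; its gradient wrt w is
    # mean(real) - mean(fake).
    w += 0.05 * (real.mean() - fake.mean())
    w = np.clip(w, -1.0, 1.0)              # weight clipping: keep f 1-Lipschitz

    # Generator descends -E_fake[f]; its gradient wrt theta is -w.
    theta += 0.05 * w

print(theta)  # settles into oscillation around the data mean, 2.0
```

Even this toy version shows the key property: the critic's slope supplies a useful gradient to the generator no matter how far apart the two distributions start.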

Footnotes

  1. People like to use “dirt” instead of “water” for the mental model, as suggested by the alternative name for the Wasserstein distance, the earth mover’s distance. But we’ll be talking momentum and physics later, so water makes much more sense.

  2. In convex analysis, $\inf_{\pi \ge 0} \int h \, \mathrm{d}\pi$ equals $0$ if $h \ge 0$ everywhere and $-\infty$ otherwise. This is (minus) the support function of the nonnegative-measure cone, equivalently the negative indicator of its dual cone, the nonnegative functions.
