OT for generative modeling 0 — the static perspective
Topic: Machine Learning
There are two spaces of mathematical objects at the heart of generative modeling. On the sample space, individual data points live and interact through Euclidean geometry; on the space of probability distributions over samples, entire datasets and models live and are compared through statistical divergences.
Optimal transport provides one of the most mathematically beautiful bridges between the two spaces. It has also become the backbone of modern generative modeling. My personal acquaintance with the field began with Wasserstein GAN. In school, information theory provided the necessary tools, and I’ve always been intrigued by the connections between e.g. flow matching, Wasserstein gradient flow, and even renormalization. Motivated by understanding a highly impressive recent work on drifting models, I decided to write these posts to learn and unpack some of the concepts.
This series builds the optimal transport toolkit from scratch, with one eye on the mathematics and the other on the generative models it enables. The posts aspire to be curiosity-driven and hand-wave some epsilons and deltas. To start off, two themes run through everything; we’ve already seen the first:
- Bridging sample and distribution geometries. Optimal transport connects Euclidean geometry to statistical divergences. This empowers us to understand training processes with benign objectives through the lens of proper distributional optimization.
- Bridging static and dynamic definitions. The equivalence between static, closed-form quantities (e.g. transport cost) and dynamic quantities (e.g. the integral of kinetic energy) hinges on a clean decomposition: solve the single-particle action-extremizing path, then marginalize over a fixed distribution. This is the engine under Benamou-Brenier and the key to making flow matching computable using conditional flow matching.
Layout of the posts:
- Part 0 (this post): the static picture — static definition of the Wasserstein distance, and the variational dual that empowers Wasserstein-GAN.
- Part 1: the geometric picture — the Wasserstein distribution manifold, the dynamic definition of the distance, and unification with the static picture.
- Part 2: application of the optimal transport perspective to drifting models.
Wasserstein distance
We work on Euclidean space $\mathbb{R}^d$, and picture a probability distribution as a unit volume of water spread over it. We trivially associate the cost of moving mass: transporting a unit of water from $x$ to $y$ costs the squared distance $\|x - y\|^2$.
Definition 1 (informal).
The Wasserstein distance between distributions $\mu$ and $\nu$ is the minimum total cost of moving the water distributed as $\mu$ so that it ends up distributed as $\nu$.
Great, that’s a nice definition! Let’s now make it rigorous.
What kind of mathematical object conveniently describes a transport plan from $\mu$ to $\nu$?
Definition 2.
The coupling set of $\mu$ and $\nu$ is
$$\Pi(\mu, \nu) = \Big\{ \gamma \in \mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d) \;:\; \gamma(A \times \mathbb{R}^d) = \mu(A), \ \gamma(\mathbb{R}^d \times B) = \nu(B) \ \text{for all measurable } A, B \Big\}.$$
Here $\gamma(x, y)$ describes how much water is moved from location $x$ to location $y$. Such a $\gamma$ is called a coupling of $\mu$ and $\nu$, or a transport plan.
A joint distribution $\gamma$ is a valid transport plan precisely when its first marginal is $\mu$ (the water starts out distributed as $\mu$) and its second marginal is $\nu$ (it ends up distributed as $\nu$).
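In the discrete case a transport plan is just a nonnegative matrix whose row sums and column sums match the two marginals. A quick numpy sanity check (the weights below are made up for illustration), using the product coupling, which is always valid:

```python
import numpy as np

# Discrete distributions: mu has 3 atoms, nu has 2 atoms (illustrative weights).
mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.6, 0.4])

# The independent (product) coupling is always a member of Pi(mu, nu).
gamma = np.outer(mu, nu)

# Marginal constraints defining the coupling set: summing gamma over the
# second index recovers mu, and summing over the first recovers nu.
assert np.allclose(gamma.sum(axis=1), mu)
assert np.allclose(gamma.sum(axis=0), nu)
```

The product coupling is rarely the *cheapest* plan; the Wasserstein distance is what we get by optimizing over all of them.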
Definition 3.
The Wasserstein-2 distance between $\mu$ and $\nu$ is
$$W_2(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu, \nu)} \int \|x - y\|^2 \, d\gamma(x, y) \right)^{1/2}.$$
We’re taking an infimum over transport plans $\gamma \in \Pi(\mu, \nu)$: each plan is scored by its expected transport cost, and the distance is the cost of the cheapest plan.
Remark.
We call the definition static because the transport plan $\gamma$ only records where mass starts and where it ends up; it says nothing about the path the mass takes in between.
Remark.
Also recognize the Wasserstein distance as a linear program: the constraints and the cost are both linear in $\gamma$. For discrete distributions it can therefore be solved with off-the-shelf LP solvers.
Remark.
On finite-dimensional spaces,
Variational characterization and W-GAN
When we say variational characterization [^variational] of something, we generally mean writing “something” as a minimum or maximum. Such characterizations are extremely useful because (1) theoretically, they provide bounds, (2) SGD loves extremizing things, and (3) they provide valuable insight into what the quantity is doing.
Let’s rewrite the Wasserstein distance as a nested extremization. Writing $c(x, y)$ for the transport cost, we trade the hard marginal constraints for penalty terms:
$$\inf_{\gamma \in \Pi(\mu, \nu)} \int c \, d\gamma \;=\; \inf_{\gamma \ge 0} \, \sup_{f, g} \left[ \int c \, d\gamma + \int f \, d\mu - \int f(x) \, d\gamma(x, y) + \int g \, d\nu - \int g(y) \, d\gamma(x, y) \right].$$
Conceptualize this as an adversarial equilibrium between two players:
- The primal player controls the transport plan $\gamma$ and wants to minimize the transport cost as in the definition, subject to the marginal constraints.
- Instead of a hard constraint, we equivalently enforce the constraint by introducing an adversarial dual player who controls the potentials $f$ and $g$.
- The dual player can crank up any constraint deviation at arbitrarily large cost to the primal player. This theme will appear again very shortly: in nested extremization, penalties are used consistently to enforce constraints.
- Note: in nested extremization, the inner optimization gets to react to the outer choice as a constant; chronologically, the outer operator moves first.
As a first pass, we give the dual player the power to choose arbitrary functions $f$ and $g$.
Let’s re-interpret this game by swapping the order of the $\inf$ and the $\sup$:
$$\sup_{f, g} \left[ \int f \, d\mu + \int g \, d\nu + \inf_{\gamma \ge 0} \int \big( c(x, y) - f(x) - g(y) \big) \, d\gamma(x, y) \right].$$
In this interpretation of the game (minimax guarantees equivalent value), the dual player moves first, and the primal player reacts to minimize their penalty. Note that the value of the inner infimum is all-or-nothing: since $\gamma$ ranges over all nonnegative measures, it equals $0$ if $c(x, y) - f(x) - g(y) \ge 0$ everywhere and $-\infty$ otherwise.
Looking at the objective, the dual player thus faces a dichotomy. Because the dual player is trying to maximize the overall objective, they’ll avoid any choices that trigger the $-\infty$ penalty: they restrict themselves to potentials satisfying $f(x) + g(y) \le c(x, y)$ for all $x, y$, leaving the value $\int f \, d\mu + \int g \, d\nu$.
Specializing to the Wasserstein-1 cost $c(x, y) = \|x - y\|$, this collapses to a single degree of freedom. Setting $g = -f$, the constraint reads $f(x) - f(y) \le \|x - y\|$ for all $x, y$.
Definition 4.
A function $f : \mathbb{R}^d \to \mathbb{R}$ is 1-Lipschitz if $|f(x) - f(y)| \le \|x - y\|$ for all $x, y$.
Rearranging, we obtain the Kantorovich-Rubinstein dual of the Wasserstein-1 distance:
$$W_1(\mu, \nu) = \sup_{f \text{ 1-Lipschitz}} \ \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)].$$
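As a sanity check on the dual, here is a numpy sketch (sample sizes and the grid are arbitrary choices) for one-dimensional samples: we maximize $\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]$ over a grid-parameterized 1-Lipschitz $f$. The objective is linear in the increments of $f$, so the optimum sets each increment to $\pm$ the grid step, and the resulting value matches scipy’s primal $W_1$:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 5000)   # samples from mu = N(0, 1)
b = rng.normal(1.0, 1.0, 5000)   # samples from nu = N(1, 1)

# Parameterize f on a grid by its increments, each clipped to [-h, h] so that
# f is 1-Lipschitz. E_mu[f] - E_nu[f] is linear in the increments, so the
# optimal increment at t_k is h * sign(F_mu(t_k) - F_nu(t_k)), giving the
# dual value as a Riemann sum of |F_mu - F_nu|.
grid = np.linspace(-5.0, 6.0, 1101)
h = grid[1] - grid[0]
F_mu = np.searchsorted(np.sort(a), grid, side="right") / len(a)
F_nu = np.searchsorted(np.sort(b), grid, side="right") / len(b)
dual = float(np.sum(h * np.abs(F_mu - F_nu)[:-1]))   # sup over 1-Lipschitz f

primal = wasserstein_distance(a, b)                   # scipy's 1-D W1
print(dual, primal)
```

Both numbers land near $1$, the true $W_1$ between $N(0,1)$ and $N(1,1)$, and agree with each other up to discretization and sampling error.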
Remark.
Why does this matter for generative modeling? Consider training a generator $g_\theta$ that pushes noise $z \sim p_z$ through a neural network, inducing a model distribution $\mu_\theta$; we want $\mu_\theta$ to approach the data distribution $\nu$, so we minimize $W_1(\mu_\theta, \nu)$ over $\theta$. Plugging in the dual (using symmetry of $W_1$ to swap the two expectations):
$$\min_\theta \, \sup_{f \text{ 1-Lipschitz}} \ \mathbb{E}_{y \sim \nu}[f(y)] - \mathbb{E}_{z \sim p_z}[f(g_\theta(z))].$$
This is exactly the Wasserstein-GAN (WGAN) objective: train a neural-network critic $f_\phi$, kept (approximately) 1-Lipschitz in practice via weight clipping or a gradient penalty, to maximize $\mathbb{E}_{y \sim \nu}[f_\phi(y)] - \mathbb{E}_{z \sim p_z}[f_\phi(g_\theta(z))]$, while the generator descends on the same quantity.
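To see the critic picture in the simplest possible setting, here is a toy sketch (all constants are illustrative, and the linear critic is a deliberate simplification of WGAN’s neural one): a linear critic $f_w(x) = \langle w, x \rangle$ is 1-Lipschitz exactly when $\|w\| \le 1$, so projected gradient ascent on the critic objective yields a lower bound on $W_1$; for two translated Gaussians this bound happens to be tight.

```python
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(4000, 2))   # "data" samples
fake = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(4000, 2))   # "generator" samples

# Linear critic f_w(x) = <w, x> is 1-Lipschitz iff ||w|| <= 1.
# Projected gradient ascent on E_real[f_w] - E_fake[f_w].
w = np.zeros(2)
for _ in range(100):
    grad = real.mean(axis=0) - fake.mean(axis=0)   # gradient of the critic objective
    w += 0.1 * grad
    norm = np.linalg.norm(w)
    if norm > 1.0:
        w /= norm                                   # project back to the Lipschitz ball

critic_value = real.mean(axis=0) @ w - fake.mean(axis=0) @ w
print(critic_value)   # a lower bound on W1, here close to the mean shift of 2
```

Projecting $w$ onto the unit ball plays the role that weight clipping or gradient penalties play for neural critics: it restricts the sup in the dual to (a subset of) 1-Lipschitz functions.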
Footnotes
- people like to use “dirt” instead of “water” for the mental model, as suggested by the alternative naming of Wasserstein distance as the earth mover’s distance. But we’ll be talking momentum and physics later, so water seems to make much more sense. ↩
- In convex analysis, $\inf_{\gamma \ge 0} \int h \, d\gamma$ equals $0$ if $h \ge 0$ everywhere and $-\infty$ otherwise. This is the support function of the nonnegative-measure cone, equivalently the (negative) indicator of its dual cone. ↩