OT for generative modeling 2 — Wasserstein gradients and drifting models
Topic: Machine Learning
Date:
Diffusion and flow matching split the generative problem into two phases: at training time, learn a vector field; at inference time, integrate an ODE or SDE through that field to produce a sample. The integration is expensive, and a great deal of recent work has gone into compressing it: distillation, consistency models, progressive reduction.
Recent work on Drifting Models (Deng et al., 2026) defines a “drifting field” — a velocity on sample space that tells generated samples which direction to move — and trains a one-step generator by chasing the drifted targets.
We currently have a mechanical interpretation — the loss going to zero produces desirable behavior. But equipped with the Wasserstein machinery we built in parts 0–1, we can say something much sharper: the antisymmetric drifting field is the Wasserstein gradient of a distribution discrepancy, and the training dynamics execute gradient descent on the manifold of probability distributions.
For those looking for novel content, the main results of this post are as follows:
- Statistical interpretation of Gaussian drifting: we show that drifting with a Gaussian kernel implements Wasserstein gradient descent on the reverse, mode-seeking KL divergence between KDE-smoothed distributions. The stop-grad loss implements gradient pullback from sample space to parameter space.
- Maximum likelihood modification: we derive the drifting field for the maximum likelihood (forward KL) objective. The changes to the current paradigm are minimal: reweight by the density ratio and use the Gaussian (instead of Laplace) kernel. The resulting drifting field is notably not antisymmetric.
Formulation
The drifting models paradigm consists of:
- a drifting field that tells generated samples which direction to move
- a training loop that chases drifted targets
In this generative paradigm, we’re given samples from a data distribution $p$, and we train a generator whose output distribution $q_\theta$ should match $p$.
The work considers general antisymmetric drift fields $V(x;\,q,\,p)$ satisfying $V(x;\,q,\,p) = -V(x;\,p,\,q)$.
Note that antisymmetry forces $V(x;\,p,\,p) = 0$: at the optimum $q_\theta = p$, the drift vanishes and the loss goes to zero.
The stop-grad loss exactly implements the pullback of the gradient. If $V(x)$ is a gradient direction on sample space, then for a generator $x_\theta = f_\theta(z)$ with Jacobian $J_\theta = \partial x_\theta / \partial \theta$,
$$\nabla_\theta\, \tfrac{1}{2}\big\| x_\theta - \mathrm{sg}\big[x_\theta + V(x_\theta)\big] \big\|^2 \;=\; -\, J_\theta^\top V(x_\theta),$$
which is the pullback of the gradient into parameter space.
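As a sanity check, here is the pullback identity for a linear toy generator $x_\theta = A\theta$. The matrix $A$ and the drift $V$ below are illustrative stand-ins, not anything from the paper: with the target held constant under stop-grad, the parameter gradient of the squared loss is exactly $-A^\top V$.

```python
import numpy as np

# The stop-gradient loss 0.5 * ||x_theta - sg[x_theta + V]||^2 has parameter
# gradient -J^T V, the pullback of the sample-space direction V through the
# generator Jacobian J. Check this for a linear generator x_theta = A @ theta
# (a toy stand-in; the real generator is a neural network).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))      # Jacobian of the linear generator
theta = rng.normal(size=4)
V = rng.normal(size=3)           # drift direction at x_theta

x = A @ theta
target = x + V                   # sg[...]: treated as a constant
grad = A.T @ (x - target)        # gradient of 0.5 * ||A @ theta - target||^2
assert np.allclose(grad, -A.T @ V)
```

For a linear generator the identity is exact; for a neural generator it holds to first order, which is all gradient descent uses.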
The drifting field
Let’s consider the authors’ choice of the drifting field
Definition 1 (canonical drifting field).
Consider the following antisymmetric drifting field evaluated at a sample-space point $x$, built from data samples $\{y_i\}$ and model samples $\{x_j\}$:
$$V(x) \;=\; \underbrace{\sum_i \frac{k(x, y_i)}{\sum_{i'} k(x, y_{i'})}\, \nabla_x \log k(x, y_i)}_{\text{attraction toward data}} \;-\; \underbrace{\sum_j \frac{k(x, x_j)}{\sum_{j'} k(x, x_{j'})}\, \nabla_x \log k(x, x_j)}_{\text{repulsion from model samples}}.$$
The authors chose the Laplace kernel $k(x, y) = \exp\big(-\|x - y\|/\sigma\big)$.
The per-point normalization translates to softmax weighting: the attraction weights are $\mathrm{softmax}_i\big(-\|x - y_i\|/\sigma\big)$, so each generated sample is pulled most strongly toward its nearest data points.
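Putting this together in code. Note this is my reconstruction of the field (Laplace kernel, softmax weights, unit directions), not the authors' reference implementation; the sample clouds are toys:

```python
import numpy as np

def drift(x, data, model, sigma=1.0):
    # Reconstructed canonical drifting field: softmax-weighted unit-vector
    # attraction toward data samples, minus the same term over model samples.
    def attraction(ys):
        d = ys - x                              # vectors from x to samples
        dist = np.linalg.norm(d, axis=1)
        logits = -dist / sigma                  # Laplace kernel log-values
        w = np.exp(logits - logits.max())
        w /= w.sum()                            # softmax weights
        units = d / dist[:, None]               # unit directions
        return (w[:, None] * units).sum(axis=0) / sigma
    return attraction(data) - attraction(model)

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=(64, 2))    # toy data cloud
model = rng.normal(0.0, 1.0, size=(64, 2))   # toy model cloud
x = np.zeros(2)
# Antisymmetry: swapping data and model flips the sign of the field.
assert np.allclose(drift(x, data, model), -drift(x, model, data))
```

Swapping the two sample sets exchanges the attraction and repulsion terms, which is exactly the antisymmetry property; in particular the field vanishes when the two clouds coincide.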
Wasserstein Gradient Flow
We take a step back to develop the theory of Wasserstein gradient flow (and return to capital letters): given a probability distribution $p_t$ evolving on $\mathbb{R}^d$, its motion can be described by a velocity field $v_t$ on sample space through the continuity equation $\partial_t p_t = -\nabla \cdot (p_t v_t)$.
The Kullback-Leibler functional
Fixing the data distribution $p$, define the forward KL functional $F[q] = \mathrm{KL}(p\,\|\,q) = \int p \log\frac{p}{q}\, dx$; reversing the arguments gives the reverse KL $\mathrm{KL}(q\,\|\,p)$.
For more interesting properties of KL divergence, see these notes. From an SGD perspective, minimizing the forward KL is equivalent to maximizing likelihood when empirical samples are i.i.d. from $p$: $\mathrm{KL}(p\,\|\,q_\theta) = -\,\mathbb{E}_{x \sim p}\big[\log q_\theta(x)\big] - H[p]$, and the entropy term is constant in $\theta$.
Gradients on manifolds
I like to interpret differential geometry as the “lifting” of Euclidean constructs into locally Euclidean manifolds. Gradients are no different. In Euclidean space, given a curve $x(t)$ with velocity $\dot x(t)$ and a scalar function $f$, the chain rule reads $\frac{d}{dt} f(x(t)) = \langle \nabla f(x(t)),\, \dot x(t)\rangle$: the gradient is the vector that represents directional derivatives through the inner product.
Lifting the inner product to the manifold metric, we can use this to define gradients on manifolds:
Definition 2 (Wasserstein gradients).
Given a scalar functional $F$ on Wasserstein space, its gradient $\mathrm{grad}_W F$ at $p$ is the vector field on sample space satisfying
$$\frac{d}{dt} F[p_t] \;=\; \big\langle \mathrm{grad}_W F,\, v_t \big\rangle_{p_t} \;:=\; \int \big\langle \mathrm{grad}_W F(x),\, v_t(x) \big\rangle\, p_t(x)\, dx,$$
where $(p_t)$ is any curve through $p$ with velocity field $v_t$ in the sense of the continuity equation.
Now, we’re equipped to state a major result in Otto calculus. We’ll prove it shortly.
Theorem 1 (fundamental theorem of Otto calculus).
Given a probability functional $F$ with functional derivative $\frac{\delta F}{\delta p}$, the Wasserstein gradient is its ordinary spatial gradient:
$$\mathrm{grad}_W F \;=\; \nabla_x\, \frac{\delta F}{\delta p}.$$
The theorem should look fairly intuitive: on the RHS, we compute the pointwise derivative of $F$ with respect to the density, then take its plain Euclidean gradient in $x$.
Several remarks are in order:
- Despite appearing like a definition, this is a theorem! The general differential-geometry gradient exists, but it does not usually admit such a simple closed form.
- The expression $\frac{\delta F}{\delta p}$ is a scalar function on the sample space that’s usually known as the functional derivative. Its values are the point-wise derivatives of $F$ w.r.t. $p$: for a perturbation $\phi$,
$$F[p + \epsilon \phi] \;=\; F[p] + \epsilon \int \frac{\delta F}{\delta p}(x)\, \phi(x)\, dx + O(\epsilon^2).$$
Again, the functional derivative is a scalar function on the sample space. Its definition is best demonstrated by two useful examples:
Example.
For the entropy functional $H[p] = -\int p \log p\, dx$, perturbing the density gives
$$\frac{\delta H}{\delta p} \;=\; -\log p - 1.$$
The KL functional $\mathrm{KL}(p\,\|\,q) = \int p \log\frac{p}{q}\, dx$ has two arguments, so it has two functional derivatives.
Taking the functional derivative w.r.t. the data distribution $p$:
$$\frac{\delta}{\delta p}\, \mathrm{KL}(p\,\|\,q) \;=\; \log\frac{p}{q} + 1.$$
Similarly w.r.t. the model distribution $q$:
$$\frac{\delta}{\delta q}\, \mathrm{KL}(p\,\|\,q) \;=\; -\,\frac{p}{q}.$$
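These functional derivatives can be checked numerically. The sketch below perturbs the entropy functional on a grid with a zero-mean bump $\phi$ and compares the finite-difference directional derivative against $\int \frac{\delta H}{\delta p}\,\phi\,dx$; the grid, the bump, and the tolerance are all arbitrary choices:

```python
import numpy as np

# Discretize a 1-D standard normal and a zero-mean perturbation phi,
# so that p + eps*phi stays a (positive, normalized) density.
x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
phi = (x - 0.5) * np.exp(-(x - 0.5)**2)      # integrates to zero

def H(p):
    return -(p * np.log(p)).sum() * dx       # entropy functional on the grid

eps = 1e-4
fd = (H(p + eps * phi) - H(p - eps * phi)) / (2 * eps)   # directional derivative
analytic = ((-np.log(p) - 1) * phi).sum() * dx           # integral of (dH/dp) * phi
assert abs(fd - analytic) < 1e-7
```

The same check works for either KL functional derivative, with $\phi$ perturbing whichever argument is being differentiated.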
Example (applying Otto’s theorem to entropy).
From above, $\frac{\delta H}{\delta p} = -\log p - 1$, so
$$\mathrm{grad}_W H \;=\; \nabla\big(-\log p - 1\big) \;=\; -\nabla \log p.$$
The Wasserstein gradient of entropy is the negative score. Gradient ascent on entropy has velocity $v = -\nabla \log p$, and the continuity equation becomes
$$\partial_t p \;=\; -\nabla \cdot \big(p\, (-\nabla \log p)\big) \;=\; \nabla \cdot (\nabla p) \;=\; \Delta p.$$
This is the heat equation: heat diffusion is Wasserstein gradient ascent of entropy.
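A quick grid check of the identity behind this: for a 1-D standard normal, $\nabla \cdot (p\, \nabla \log p)$ computed by finite differences should match the analytic second derivative $p'' = p\,(x^2 - 1)$ (grid size and tolerance are arbitrary):

```python
import numpy as np

# "Heat flow = entropy ascent": the entropy-ascent continuity equation
# dp/dt = div(p * grad(log p)) should reduce to the heat equation dp/dt = p''.
x = np.linspace(-5, 5, 2001)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

flux = p * np.gradient(np.log(p), x)     # p * grad(log p)  (equals p')
lhs = np.gradient(flux, x)               # div(p * grad(log p))
rhs = p * (x**2 - 1)                     # analytic p'' for N(0, 1)
interior = slice(10, -10)                # avoid one-sided boundary stencils
assert np.allclose(lhs[interior], rhs[interior], atol=1e-4)
```

The cancellation $p\,\nabla\log p = \nabla p$ is the whole trick: the nonlinear-looking flow is secretly linear diffusion.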
Example (applying Otto’s theorem to forward KL).
Apply to $F[q] = \mathrm{KL}(p\,\|\,q)$ viewed as a functional of the model $q$. From the example above, $\frac{\delta F}{\delta q} = -\frac{p}{q}$, so
$$\mathrm{grad}_W F \;=\; -\nabla\, \frac{p}{q}.$$
Gradient descent velocity:
$$v \;=\; -\,\mathrm{grad}_W F \;=\; \nabla\, \frac{p}{q} \;=\; \frac{p}{q}\, \big(\nabla \log p - \nabla \log q\big).$$
Example (applying Otto’s theorem to reverse KL).
Apply to $F[q] = \mathrm{KL}(q\,\|\,p)$, again as a functional of $q$. The functional derivative is $\frac{\delta F}{\delta q} = \log\frac{q}{p} + 1$.
Applying the theorem:
$$\mathrm{grad}_W F \;=\; \nabla \log \frac{q}{p} \;=\; \nabla \log q - \nabla \log p.$$
The Wasserstein gradient is a score difference. Gradient descent velocity:
$$v \;=\; \nabla \log p - \nabla \log q.$$
Proving Otto’s theorem
We need to show that $\nabla \frac{\delta F}{\delta p}$ satisfies the defining property of the Wasserstein gradient.
By the gradient definition, we must verify
$$\frac{d}{dt} F[p_t] \;=\; \Big\langle \nabla \frac{\delta F}{\delta p},\, v \Big\rangle_{p_t}$$
for all test velocities $v$. Expanding the time derivative through the functional derivative and the continuity equation,
$$\frac{d}{dt} F[p_t] \;=\; \int \frac{\delta F}{\delta p}\, \partial_t p_t\, dx \;=\; -\int \frac{\delta F}{\delta p}\, \nabla \cdot (p_t v)\, dx.$$
Applying the divergence theorem (the boundary term vanishes since $p_t$ decays at infinity),
$$-\int \frac{\delta F}{\delta p}\, \nabla \cdot (p_t v)\, dx \;=\; \int \Big\langle \nabla \frac{\delta F}{\delta p},\, v \Big\rangle\, p_t\, dx \;=\; \Big\langle \nabla \frac{\delta F}{\delta p},\, v \Big\rangle_{p_t}. \qquad \blacksquare$$
Statistical interpretation of drifting
We now connect the drifting field to Wasserstein gradient flow and ask: what functional is being minimized, and what does this have to do with maximum likelihood?
Recall the antisymmetry property $V(x;\,q,\,p) = -V(x;\,p,\,q)$: swapping the roles of the model and data distributions flips the sign of the field.
Gaussian kernel smoothing implements Reverse KL
The reverse KL example above gave us the descent velocity $v = \nabla \log p - \nabla \log q$. Scores of empirical distributions are undefined, so we first smooth both with a kernel.
Definition 3 (KDE-smoothed distributions).
Given empirical distributions $q = \frac{1}{n}\sum_j \delta_{x_j}$ (model samples) and $p = \frac{1}{m}\sum_i \delta_{y_i}$ (data samples), define their kernel density estimates $\hat q = q * k$ and $\hat p = p * k$, i.e. $\hat p(x) = \frac{1}{m}\sum_i k(x, y_i)$.
Note that $\hat p$ and $\hat q$ are smooth, strictly positive densities, so their scores $\nabla \log \hat p$, $\nabla \log \hat q$ are well defined.
Now consider the reverse KL between the smoothed distributions, $\mathrm{KL}(\hat q\,\|\,\hat p)$. Its Wasserstein gradient descent velocity is the smoothed score difference $v = \nabla \log \hat p - \nabla \log \hat q$.
We expand each score using the log-derivative trick:
$$\nabla \log \hat p(x) \;=\; \frac{\nabla \hat p(x)}{\hat p(x)} \;=\; \sum_i \frac{k(x, y_i)}{\sum_{i'} k(x, y_{i'})}\, \nabla_x \log k(x, y_i).$$
For the Gaussian kernel $k(x, y) = \exp\big(-\|x - y\|^2 / 2h^2\big)$ we have $\nabla_x \log k(x, y) = \frac{y - x}{h^2}$, so
$$\nabla \log \hat p(x) \;=\; \sum_i \mathrm{softmax}_i\Big(-\tfrac{\|x - y_i\|^2}{2h^2}\Big)\, \frac{y_i - x}{h^2}.$$
This is precisely the data attraction field. The same calculation with the model samples produces the repulsion term $-\nabla \log \hat q(x)$.
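The softmax identity is easy to verify numerically: below, the KDE log-density is differentiated by central finite differences and compared against the closed-form weighted pull (samples, bandwidth, and query point are arbitrary):

```python
import numpy as np

# Check: grad(log p_hat)(x) = sum_i softmax_i(-||x - y_i||^2 / 2h^2) (y_i - x) / h^2
rng = np.random.default_rng(1)
ys = rng.normal(size=(50, 2))    # samples defining the KDE
h = 0.7                          # kernel bandwidth
x = np.array([0.3, -0.2])        # query point

def log_phat(x):
    # log of the Gaussian KDE (up to the constant 1/m normalizer)
    logits = -((ys - x) ** 2).sum(axis=1) / (2 * h**2)
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum())

logits = -((ys - x) ** 2).sum(axis=1) / (2 * h**2)
w = np.exp(logits - logits.max())
w /= w.sum()                                          # softmax weights
score_closed = (w[:, None] * (ys - x)).sum(axis=0) / h**2

eps = 1e-5
score_fd = np.array([
    (log_phat(x + eps * e) - log_phat(x - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
assert np.allclose(score_closed, score_fd, atol=1e-6)
```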
Theorem 2 (drifting field as Wasserstein gradient).
The canonical drifting field with Gaussian kernel is the negative Wasserstein gradient of the reverse KL between KDE-smoothed distributions:
$$V(x) \;=\; \nabla \log \hat p(x) - \nabla \log \hat q(x) \;=\; -\,\mathrm{grad}_W\, \mathrm{KL}(\hat q\,\|\,\hat p)(x).$$
Each training step executes the Jacobian pullback $-J_\theta^\top V$ of this sample-space gradient, so training performs gradient descent on $\mathrm{KL}(\hat q_\theta\,\|\,\hat p)$ over the generator parameters.
Remark (the Laplace deviation).
The derivation above assumes a Gaussian kernel throughout. The actual paper specifies a Laplace kernel $k(x, y) = \exp\big(-\|x - y\|/\sigma\big)$, for which
$$\nabla_x \log k(x, y) \;=\; \frac{y - x}{\sigma\, \|y - x\|}.$$
This is a unit-direction pull of constant magnitude $1/\sigma$: every weighted neighbor tugs equally hard regardless of distance, in contrast to the Gaussian pull $\frac{y - x}{h^2}$, which grows linearly with distance.
Implementing forward KL
The reverse KL functional $\mathrm{KL}(q\,\|\,p)$ is mode-seeking; maximum likelihood corresponds instead to the forward KL $\mathrm{KL}(p\,\|\,q)$, which is mass-covering. What drifting field does it induce?
Apply Otto calculus to $F[q] = \mathrm{KL}(p\,\|\,q)$. From the forward KL example, the descent velocity is
$$v \;=\; \nabla\, \frac{p}{q} \;=\; \frac{p}{q}\, \big(\nabla \log p - \nabla \log q\big).$$
The term in parentheses is exactly the reverse KL velocity — the canonical drifting field — now scaled pointwise by the density ratio $\frac{p}{q}$.
Theorem 3 (forward KL via density ratio scaling).
Since $v_{\mathrm{fwd}} = \frac{p}{q}\, v_{\mathrm{rev}}$, forward KL descent is obtained from reverse KL descent by reweighting the velocity with the density ratio. In the smoothed setting the ratio $\frac{\hat p(x)}{\hat q(x)}$ is directly computable: it is a ratio of the same kernel sums that already appear in the softmax normalization.
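The chain-rule identity underlying this, $\nabla \frac{p}{q} = \frac{p}{q}(\nabla \log p - \nabla \log q)$, can be checked on two 1-D Gaussians where every term has a closed form (the parameters below are arbitrary):

```python
import numpy as np

# Check the forward-KL velocity identity grad(p/q) = (p/q)(score_p - score_q)
# for two 1-D Gaussians.
def logpdf(x, mu, s):
    return -0.5 * ((x - mu) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))

mu_p, s_p, mu_q, s_q = 0.5, 1.0, -0.3, 1.4
x = np.linspace(-3, 3, 7)

ratio = np.exp(logpdf(x, mu_p, s_p) - logpdf(x, mu_q, s_q))   # p/q
score_diff = -(x - mu_p) / s_p**2 + (x - mu_q) / s_q**2       # score_p - score_q
rhs = ratio * score_diff

eps = 1e-6
ratio_at = lambda z: np.exp(logpdf(z, mu_p, s_p) - logpdf(z, mu_q, s_q))
lhs = (ratio_at(x + eps) - ratio_at(x - eps)) / (2 * eps)     # grad(p/q)
assert np.allclose(lhs, rhs, atol=1e-5)
```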
Proposition: maximum likelihood drifting
Combining the results above, we can state concretely what a maximum-likelihood variant of drifting looks like.
Theorem 4 (MLE drifting field).
The Wasserstein gradient descent velocity for the forward KL between the KDE-smoothed distributions is
$$v_{\mathrm{MLE}}(x) \;=\; \frac{\hat p(x)}{\hat q(x)}\, \big(\nabla \log \hat p(x) - \nabla \log \hat q(x)\big) \;=\; \frac{\hat p(x)}{\hat q(x)}\, V(x),$$
where $V$ is the canonical (Gaussian-kernel) drifting field and
$$\frac{\hat p(x)}{\hat q(x)} \;=\; \frac{\frac{1}{m} \sum_i k(x, y_i)}{\frac{1}{n} \sum_j k(x, x_j)}$$
is the KDE density ratio.
What does this mean in practice? The changes to the existing training protocol are minimal. Here is the current Deng et al. procedure:
Existing protocol (Deng et al.):
- Sample a batch of data points $\{y_i\}$ and generate model outputs $\{x_j\}$.
- For each model sample $x_j$, compute the Laplace kernel values $k(x_j, y_i)$ and $k(x_j, x_{j'})$ against all data and model samples respectively. Normalize to obtain the softmax weights and evaluate the drifting field $V(x_j)$.
- Form drifted targets $\mathrm{sg}\big[x_j + V(x_j)\big]$.
- Minimize $\sum_j \big\| x_j - \mathrm{sg}\big[x_j + V(x_j)\big] \big\|^2$.
MLE modification (two changes):
- Same.
- Same, but replace the Laplace kernel with the Gaussian kernel.
- Form drifted targets $\mathrm{sg}\big[x_j + \tfrac{\hat p(x_j)}{\hat q(x_j)}\, V(x_j)\big]$. That is, scale the drifting field by the density ratio.
- Same.
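A sketch of the modified target computation in NumPy. This is my reading of the proposal, with toy Gaussian clouds standing in for the generator and data; the bandwidth `h`, the sample counts, and the absence of any step-size scaling are all arbitrary choices:

```python
import numpy as np

def mle_drift(x, data, model, h=1.0):
    # Gaussian-kernel log-weights against data (p) and model (q) samples.
    lp = -((data - x) ** 2).sum(axis=1) / (2 * h**2)
    lq = -((model - x) ** 2).sum(axis=1) / (2 * h**2)

    def softmax(z):
        e = np.exp(z - z.max()); return e / e.sum()
    def logsumexp(z):
        m = z.max(); return m + np.log(np.exp(z - m).sum())

    attract = (softmax(lp)[:, None] * (data - x)).sum(axis=0) / h**2   # score of p_hat
    repel = (softmax(lq)[:, None] * (model - x)).sum(axis=0) / h**2    # score of q_hat
    # KDE density ratio p_hat / q_hat from the same kernel sums.
    log_ratio = (logsumexp(lp) - np.log(len(data))) \
              - (logsumexp(lq) - np.log(len(model)))
    return np.exp(log_ratio) * (attract - repel)

rng = np.random.default_rng(0)
data = rng.normal(2.0, 0.5, size=(128, 2))    # toy "data" cloud near (2, 2)
model = rng.normal(0.0, 0.5, size=(128, 2))   # toy "generator" cloud near (0, 0)
# Drifted targets; a real trainer would stop-grad these and regress onto them.
targets = np.array([xj + mle_drift(xj, data, model) for xj in model])
mean_drift = (targets - model).mean(axis=0)   # points from model toward data
```

Working in log space for the density ratio avoids under/overflow when the clouds are far apart, which is exactly the regime where the ratio matters most.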
| | Deng et al. | MLE modification |
|---|---|---|
| Kernel | Laplace | Gaussian |
| Drifting field | $V(x)$ | $\frac{\hat p(x)}{\hat q(x)}\, V(x)$ |
| Antisymmetric? | Yes | No |
| Functional minimized | Reverse KL (mode-seeking) | Forward KL (mass-covering, MLE) |
Several qualitative differences are worth noting:
Antisymmetry is lost. The density ratio breaks the symmetry: swapping the model and data distributions maps $\frac{\hat p}{\hat q}\, V$ to $\frac{\hat q}{\hat p}\, (-V)$, which is not its negation unless $\hat p = \hat q$.
Mode-covering vs mode-seeking. This is the main qualitative shift. Forward KL penalizes the model for assigning low density where data is present: dropped modes are actively hunted. The density ratio $\frac{\hat p(x)}{\hat q(x)}$ is exactly the mechanism: it blows up where data density exceeds model density, amplifying the drift toward under-covered modes, and shrinks where the model has already placed enough mass.
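The contrast can be made concrete with a toy experiment: fit a single Gaussian $q = \mathcal{N}(\mu, s^2)$ to a two-mode target under each divergence by grid search (the mixture, grids, and tolerances are arbitrary). Forward KL settles between the modes with a wide variance; reverse KL locks onto one mode.

```python
import numpy as np

# Bimodal target p = 0.5 N(-2, 0.5^2) + 0.5 N(2, 0.5^2), evaluated in log space.
x = np.linspace(-8, 8, 1601)
dx = x[1] - x[0]

def logN(x, mu, s):
    return -0.5 * ((x - mu) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))

logp = np.logaddexp(logN(x, -2, 0.5), logN(x, 2, 0.5)) + np.log(0.5)
p = np.exp(logp)

best = {"fwd": (np.inf, None), "rev": (np.inf, None)}
for mu in np.linspace(-3, 3, 61):
    for s in np.linspace(0.3, 3.0, 28):
        logq = logN(x, mu, s)
        q = np.exp(logq)
        fwd = (p * (logp - logq)).sum() * dx   # KL(p || q), mass-covering
        rev = (q * (logq - logp)).sum() * dx   # KL(q || p), mode-seeking
        if fwd < best["fwd"][0]: best["fwd"] = (fwd, (mu, s))
        if rev < best["rev"][0]: best["rev"] = (rev, (mu, s))

mu_f, s_f = best["fwd"][1]
mu_r, s_r = best["rev"][1]
assert abs(mu_f) < 0.5            # forward KL: centered between the modes
assert abs(abs(mu_r) - 2) < 0.5   # reverse KL: sits on one mode
assert s_f > s_r                  # and the forward-KL fit is much wider
```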