OT for generative modeling 3 — Diffusion as Maximum Likelihood Estimation
Topic: Machine Learning
Date:
This part uses the Wasserstein toolkit we’ve developed in parts 0 and 1 (links) to unpack the dominant generative paradigm from first principles. We begin with some physics of Brownian motion and stochastic processes; highlights include:
- Brownian motion as score flow provides the fundamental bridge between SDE and ODE formulations, as well as a valuable perspective on the score ODE, which shows up everywhere in generative modeling.
- As a corollary, we prove Anderson’s theorem, which allows one to run an SDE backwards in time.
- Unifying microscopic particle movement with macroscopic, thermodynamic optimization: diffusion with drift (Fokker-Planck) as Wasserstein gradient descent.
We also deep-dive into the two absolutely foundational pillars of scalable diffusion / flow matching. In my opinion, they are the first-principles reason why flow matching dominates modern generative modeling:
- Tweedie’s formula provides a dimension-scalable solution to the density estimation problem, the problem universal to all generative modeling. It turns density estimation — which scales nonparametrically with dimension — into function estimation, which scales with the amount and internal structure of the data.
- The dynamic de Bruijn identity provides a canonical noise-spectral decomposition of KL-divergence. It bridges score matching (exactly what Tweedie’s formula provides) with MSE; it also provides a canonical spectrum over which to make bias-variance tradeoffs.
Later todo for myself: look into diffusion as optimal Bayes engine (Polyanskiy).
Contents
- Contents
- Physics of diffusion
- MLE interpretation of diffusion models
- Tweedie’s formula and flow matching
Physics of diffusion
We begin by delving into some physics. We first adopt a microscopic, particle-level description of the diffusive process, then unify it with a macroscopic, information-theory level description:
- Microscopic language: Brownian motion, vector fields, and score.
- Macroscopic language: KL minimization, entropy.
The keystone unifying these two perspectives is the Wasserstein gradient flow.
Brownian motion as score flow
Remark (Brownian motion).
Formally, a standard Brownian motion $(W_t)_{t \ge 0}$ starts at $W_0 = 0$ and is characterized by:
- Independent increments: future displacements $W_{t+s} - W_t$ are entirely independent of past states $(W_u)_{u \le t}$.
- Gaussian increments: $W_{t+s} - W_t \sim \mathcal{N}(0, s\,I)$.
The infinitesimal increment is written $dW_t \sim \mathcal{N}(0, dt\,I)$.
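These two defining properties can be checked numerically; the sketch below simulates discretized Brownian paths (the step size, horizon, and sample count are arbitrary choices):

```python
import numpy as np

# Simulate many independent Brownian paths on [0, 1] with step dt:
# each increment W_{t+dt} - W_t is drawn as N(0, dt).
rng = np.random.default_rng(0)
n, dt, steps = 200_000, 0.01, 100
increments = rng.normal(0.0, np.sqrt(dt), size=(n, steps))
W = np.cumsum(increments, axis=1)

# Gaussian increments: Var(W_t) = t, so the sample variance at t = 1 is ≈ 1.
print(np.var(W[:, -1]))

# Independent increments: the displacement over (0.5, 1] is uncorrelated
# with the state W_{0.5}, so the sample correlation is ≈ 0.
print(np.corrcoef(W[:, 49], W[:, -1] - W[:, 49])[0, 1])
```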
Consider the standard heat-diffusion process where particles are purely driven by Brownian jittering:
$$dX_t = \sqrt{2D}\,dW_t.$$
Over time, the particle density $p_t$ spreads out; we will show it obeys the heat equation $\partial_t p = D\,\Delta p$.
Theorem 1 (Brownian motion as score flow).
The density evolution of Brownian particles $dX_t = \sqrt{2D}\,dW_t$ is identical to that of deterministic particles following the score ODE
$$\dot{x} = -D\,\nabla\log p_t(x);$$
both induce the heat equation $\partial_t p = D\,\Delta p$.
Remark.
This is the engine behind reducing SDE (diffusion) models to ODE (flow matching) models.
Proof sketch: 1D discretization argument
Imagine bins of width $h$ with occupancy $n_i$, and suppose that in each time step $\tau$ every particle jumps to a neighboring bin with probability $1/2$ each way. The flux from bin $i$ to bin $i+1$ is $\tfrac{1}{2}(n_i - n_{i+1})$. Converting to continuous density $p(x, t)$, the net change per step is
$$p(x, t+\tau) - p(x, t) = \tfrac{1}{2}\big(p(x-h, t) - 2\,p(x, t) + p(x+h, t)\big).$$
Recognizing the central difference approximation to the Laplacian, we obtain:
$$\partial_t p = \frac{h^2}{2\tau}\,\partial_{xx} p = D\,\partial_{xx} p.$$
The variance of Brownian motion grows as $\mathrm{Var}(X_t) = 2Dt$ for $D = h^2 / (2\tau)$, matching the diffusive scaling of the random walk. Finally, the heat equation can be rewritten as the continuity equation $\partial_t p = -\nabla\cdot(p\,v)$ with $v = -D\,\nabla\log p$, so the score ODE induces exactly the same density evolution. $\square$
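Theorem 1 can be sanity-checked numerically: deterministic particles integrating the score ODE should spread exactly like Brownian particles. The sketch below uses a Gaussian initial density, where the score of $p_t = \mathcal{N}(0, \sigma_0^2 + 2Dt)$ is available in closed form ($D$, $\sigma_0$, and the step size are arbitrary choices):

```python
import numpy as np

# Particles follow the score ODE  dx/dt = -D * d/dx log p_t(x)  with the
# analytic Gaussian score; their variance should track sigma0^2 + 2 D t,
# exactly as for Brownian particles dX = sqrt(2D) dW.
rng = np.random.default_rng(0)
D, sigma0, dt, T = 0.5, 1.0, 1e-3, 1.0

x = rng.normal(0.0, sigma0, size=100_000)
var = sigma0**2
for _ in range(int(T / dt)):
    score = -x / var       # analytic score of the Gaussian marginal p_t
    x += -D * score * dt   # score ODE: velocity = -D * score
    var += 2 * D * dt      # heat-equation variance growth

print(np.var(x), sigma0**2 + 2 * D * T)  # both ≈ 2.0
```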
Brownian motion with drift, Fokker-Planck
Now add drift to the Brownian motion:
$$dX_t = \mu(x, t)\,dt + \sqrt{2D}\,dW_t.$$
By linearity, the induced probability density evolves under the superposition of drift transport and Brownian score flow. Applying the continuity equation $\partial_t p = -\nabla\cdot(p\,v)$, and noting that curl components (divergence-free fields $w$ with $\nabla\cdot(p\,w) = 0$) don’t affect density evolution, write the effective velocity as
$$v = \mu - D\,\nabla\log p.$$
Theorem 2 (Fokker-Planck equation).
Particles evolving under the SDE
$$dX_t = \mu(x, t)\,dt + \sqrt{2D}\,dW_t$$
induce probability density evolution given by the Fokker-Planck equation:
$$\partial_t p = -\nabla\cdot(p\,\mu) + D\,\Delta p.$$
Writing $v = \mu - D\,\nabla\log p$, this is the continuity equation $\partial_t p = -\nabla\cdot(p\,v)$ of the ODE $\dot{x} = v(x, t)$.
Proof. By linearity and the Brownian motion as score flow theorem, the drift contributes transport velocity $\mu$ while the Brownian term contributes score-flow velocity $-D\,\nabla\log p$. Applying the continuity equation to $v = \mu - D\,\nabla\log p$ yields the result. $\square$
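As a quick simulation check of Theorem 2, take the linear drift $\mu = -x$ with $D = 1$: the Fokker-Planck equation then has stationary solution $\mathcal{N}(0, 1)$, so long simulations should forget their initial condition (a sketch; the step size and horizon are arbitrary choices):

```python
import numpy as np

# Euler-Maruyama for  dX = -X dt + sqrt(2) dW  (drift mu = -x, D = 1).
# Fokker-Planck:  dp/dt = d/dx(p x) + d^2 p/dx^2,  stationary at N(0, 1).
rng = np.random.default_rng(0)
dt, steps = 1e-2, 500                 # total time T = 5
x = np.full(100_000, 3.0)             # start far from equilibrium
for _ in range(steps):
    x += -x * dt + np.sqrt(2 * dt) * rng.normal(size=x.shape)

print(np.mean(x), np.var(x))  # ≈ 0.0, ≈ 1.0
```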
Diffusion as Wasserstein gradient flow
In this subsection, we’ll endow the Fokker-Planck (diffusion) equation with a macroscopic interpretation. This exemplifies the theme that Wasserstein geometry connects microscopic, particle-level evolution with distribution-level extremization. Let’s begin with a special case:
Corollary 1 (Brownian motion maximizes entropy).
Pure Brownian diffusion $\partial_t p = D\,\Delta p$ is Wasserstein gradient ascent on the entropy functional $H(p) = -\int p\log p$, with step size $D$.
Proof. The functional derivative of entropy is:
$$\frac{\delta H}{\delta p} = -\log p - 1.$$
By Otto’s theorem (see Part 2), the Wasserstein gradient is:
$$\operatorname{grad}_W H(p) = -\nabla\cdot\Big(p\,\nabla\frac{\delta H}{\delta p}\Big) = \nabla\cdot(p\,\nabla\log p) = \Delta p.$$
Wasserstein gradient ascent follows $\partial_t p = D\,\operatorname{grad}_W H(p) = D\,\Delta p$, which is exactly the heat equation. $\square$
Remark.
This corollary bridges the microscopic interpretation of the heat equation as Brownian motion (particle jittering) with the macroscopic interpretation of the heat equation as the process that maximizes entropy along the Wasserstein geometry.
Why the Wasserstein geometry? There exist other geometries on probability distributions, Fisher-Rao being a notable one. Is the Wasserstein geometry, in some sense, canonical? For one, Fisher-Rao is not physical because it’s agnostic towards rearrangements of the base space: it only sees density values, not where the mass actually sits.
Now the general case. For simplicity, we suppress time dependence on the potential and write the drift as $\mu = -\nabla U(x)$.
Theorem 3 (Fokker-Planck as Wasserstein gradient flow).
The Fokker-Planck equation
$$\partial_t p = \nabla\cdot(p\,\nabla U) + D\,\Delta p$$
is Wasserstein gradient descent, with step size $D$, on the KL divergence to the Boltzmann distribution $\pi \propto e^{-U/D}$.
Proof
Expanding the KL divergence using $\pi = e^{-U/D}/Z$:
$$\mathrm{KL}(p\,\|\,\pi) = \int p\log p + \frac{1}{D}\int p\,U + \log Z.$$
The functional derivative is:
$$\frac{\delta\,\mathrm{KL}}{\delta p} = \log p + 1 + \frac{U}{D}.$$
By Otto’s theorem, the Wasserstein gradient is:
$$\operatorname{grad}_W \mathrm{KL}(p\,\|\,\pi) = -\nabla\cdot\Big(p\,\nabla\frac{\delta\,\mathrm{KL}}{\delta p}\Big) = -\Delta p - \frac{1}{D}\,\nabla\cdot(p\,\nabla U).$$
Gradient descent follows $\partial_t p = -D\,\operatorname{grad}_W \mathrm{KL}(p\,\|\,\pi) = D\,\Delta p + \nabla\cdot(p\,\nabla U)$, recovering Fokker-Planck. Applying the continuity equation, the corresponding velocity field is $v = -\nabla U - D\,\nabla\log p$. $\square$
Remark (Interpretation).
Fokker-Planck evolution minimizes KL divergence to the Boltzmann equilibrium distribution $\pi \propto e^{-U/D}$ along the Wasserstein geometry: microscopic drift-diffusion is macroscopic free-energy minimization.
Remark (Perspectives on the equilibrium distribution).
The Boltzmann equilibrium distribution $\pi(x) \propto e^{-\beta U(x)}$ admits several perspectives:
- Thermodynamic principle: $\pi$ is the distribution that maximizes entropy subject to fixed expected energy:
$$\max_p\ H(p) \quad \text{s.t.} \quad \mathbb{E}_p[U] = E.$$
The Lagrange multiplier enforcing the energy constraint gives the inverse temperature $\beta$ (for the dynamics below, $\beta = 1/D$).
- Dynamical equilibrium: $\pi$ is the unique stationary solution to the Fokker-Planck equation $\partial_t p = \nabla\cdot(p\,\nabla U) + D\,\Delta p$. Setting $\partial_t p = 0$: this holds when the drift and diffusion balance, $p\,\nabla U = -D\,\nabla p$, i.e., $p \propto e^{-U/D}$.
- Optimal control: In reinforcement learning, $\pi$ is the reward-optimal policy that maximizes expected reward subject to KL proximity to a reference policy:
$$\max_\pi\ \mathbb{E}_\pi[r] - \tau\,\mathrm{KL}(\pi\,\|\,\pi_{\mathrm{ref}}).$$
The solution is $\pi \propto \pi_{\mathrm{ref}}\,e^{r/\tau}$, recovering the Boltzmann form when $\pi_{\mathrm{ref}}$ is uniform and $U = -r$.
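The dynamical-equilibrium bullet can be checked on a grid: for a Boltzmann density, the probability flux $p\,\nabla U + D\,\nabla p$ should vanish identically. The sketch below uses a hypothetical double-well potential $U(x) = x^4/4 - x^2/2$ (an arbitrary illustrative choice):

```python
import numpy as np

# For p ∝ exp(-U / D), the Fokker-Planck flux  p U' + D p'  is identically
# zero, which makes p stationary. Check by finite differences on a grid.
D, h = 0.8, 1e-3
x = np.arange(-3, 3, h)
U = x**4 / 4 - x**2 / 2              # double-well potential
p = np.exp(-U / D)
p /= p.sum() * h                     # normalize on the grid

flux = p * np.gradient(U, h) + D * np.gradient(p, h)
print(np.max(np.abs(flux)))          # ≈ 0 (up to discretization error)
```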
Bonus: Anderson’s theorem
The Brownian motion as score flow equivalence gives us leverage on a classical question about time reversal:
- We know how to describe forward-time trajectories of particles using the heat equation (or Fokker-Planck when there’s a potential).
- If we want to magically play the tape backwards, which particle evolution equation will replicate the reverse-time behavior?
Remark.
This is not a trivial problem because while the marginals $p_t$ can trivially be read in reverse, an SDE specifies trajectory statistics: naively flipping the sign of time in the forward SDE does not define a valid diffusion, since Brownian increments only make sense forward in time.
Additionally, the choice of the same diffusion coefficient $g(t)$ for the reverse process is a convention rather than a necessity: any split of the density evolution into drift and noise that preserves the marginals is admissible, with the score term absorbing the difference.
Theorem 4 (Anderson’s theorem).
Given a forward stochastic process on $t \in [0, T]$,
$$dX_t = \mu(x, t)\,dt + g(t)\,dW_t,$$
let $p_t$ denote its marginals. Then the reverse-time SDE
$$dX_t = \big(\mu(x, t) - g(t)^2\,\nabla\log p_t(x)\big)\,dt + g(t)\,d\bar{W}_t,$$
run backwards from $p_T$ (with $\bar{W}_t$ a reverse-time Brownian motion), induces the same marginals $p_t$.
Proof. By the Brownian motion as score flow theorem, the forward SDE shares marginals with the probability-flow ODE
$$\dot{x} = \mu(x, t) - \tfrac{1}{2}\,g(t)^2\,\nabla\log p_t(x).$$
For the reverse-time process, time flows backwards: the same ODE integrated in reverse replays the marginals exactly. Converting this ODE back to an SDE using the score flow equivalence — now with time reversed — we inject noise $g(t)\,d\bar{W}_t$ and compensate with an additional score drift $-\tfrac{1}{2}\,g^2\,\nabla\log p_t$, yielding the stated reverse SDE.
We have thus derived from first principles that the reverse drift is $\mu - g^2\,\nabla\log p_t$. The first term corresponds to playing the forward tape backwards; the score correction accounts for the re-injected noise. $\square$
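Anderson’s theorem can be checked on a fully tractable Gaussian case: for the forward process $dX = dW$ ($\mu = 0$, $g = 1$) started from $\mathcal{N}(0, 1)$, the marginals are $p_t = \mathcal{N}(0, 1+t)$ with closed-form score $-x/(1+t)$, so running the reverse SDE from $p_1$ should recover unit variance (a sketch; the schedule and step sizes are arbitrary choices):

```python
import numpy as np

# Reverse-time Euler-Maruyama for Anderson's SDE with mu = 0, g = 1:
# in reverse time the drift is g^2 * score = -x / (1 + t), plus fresh noise.
rng = np.random.default_rng(0)
dt, steps = 1e-3, 1000
x = rng.normal(0.0, np.sqrt(2.0), size=100_000)   # samples from p_1 = N(0, 2)
for i in range(steps):
    t = 1.0 - i * dt
    score = -x / (1.0 + t)                        # closed-form score of p_t
    x += score * dt + np.sqrt(dt) * rng.normal(size=x.shape)

print(np.var(x))  # ≈ 1.0, the variance of the initial marginal p_0
```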
MLE interpretation of diffusion models
In this section, we provide a first-principles, maximum likelihood interpretation of diffusion generative models by applying the (dynamic) de Bruijn identity. This is the crux behind the celebrated paper Maximum Likelihood Training of Score-Based Diffusion Models (Song et al., 2021).
The maximum likelihood objective
Recall that maximum likelihood estimation is equivalent to minimizing $\mathrm{KL}(p_{\mathrm{data}}\,\|\,q_\theta)$ over model parameters $\theta$; the rest of this section expresses this KL through the dynamics of the noising process.
The dynamic de Bruijn identity
Recall the definition of the Riemannian gradient: $\operatorname{grad} F(p)$ is the tangent vector satisfying $\langle \operatorname{grad} F(p), v\rangle = \frac{d}{d\epsilon}F(p + \epsilon v)\big|_{\epsilon=0}$ for all tangent directions $v$.
Consider the product manifold of distribution pairs $(p, q)$, with the Wasserstein geometry on each factor. Define the functional
$$F(p, q) = \mathrm{KL}(p\,\|\,q) = \int p\,\log\frac{p}{q}.$$
Given trajectories $(p_t, q_t)$, the chain rule decomposes $\frac{d}{dt}\,\mathrm{KL}(p_t\,\|\,q_t)$ into inner products of the Wasserstein gradients with the velocities of each trajectory. Note that the integral and gradient are over the sample space $\mathbb{R}^d$; $t$ merely indexes the trajectory.
Theorem 5 (the dynamic de Bruijn identity).
Given distribution trajectories $p_t, q_t$ that both evolve under the same Fokker-Planck equation $\partial_t \rho = -\nabla\cdot(\rho\,\mu) + \frac{g(t)^2}{2}\,\Delta\rho$, the KL divergence between them dissipates at a rate given by the relative Fisher information:
$$\frac{d}{dt}\,\mathrm{KL}(p_t\,\|\,q_t) = -\frac{g(t)^2}{2}\,\mathbb{E}_{p_t}\big\|\nabla\log p_t - \nabla\log q_t\big\|^2.$$
The drift terms cancel: only the noise injection dissipates KL.
Application to various processes.
In this section, we consider two common processes and the application of Theorem 5 to each:
- The heat process with noise schedule $g(t)$, defined on $t \in [0, \infty)$; the variance of the marginal explodes. This variance-exploding process injects Gaussian noise without attenuating the original signal — it’s simply the heat equation with a variable diffusion constant.
- The variance-preserving process, defined on $t \in [0, T]$ with noise schedule $\beta(t)$, which attenuates the signal as it adds noise so the variance stays bounded.
Heat process
Definition 1 (heat process).
The variance-exploding process is defined to obey the SDE:
$$dX_t = g(t)\,dW_t.$$
Also recall its Wasserstein gradient flow form $\partial_t p = \frac{g(t)^2}{2}\,\Delta p$: entropy ascent with step size $g(t)^2/2$. Here, the conditional marginal is $p_t(x \mid x_0) = \mathcal{N}\big(x_0,\ \sigma^2(t)\,I\big)$ with $\sigma^2(t) = \int_0^t g(s)^2\,ds$.
If both $p_t$ and $q_t$ evolve under the heat process, Theorem 5 applies with no drift terms at all:
$$\frac{d}{dt}\,\mathrm{KL}(p_t\,\|\,q_t) = -\frac{g(t)^2}{2}\,\mathbb{E}_{p_t}\big\|\nabla\log p_t - \nabla\log q_t\big\|^2.$$
Sometimes this specialized result is known as the dynamic de Bruijn identity.
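For two Gaussians under the heat process, everything is closed-form, so the identity can be verified directly (a sketch with arbitrary initial variances, taking $g \equiv \sqrt{2}$ so that variances grow as $a_0 + 2t$):

```python
import numpy as np

# Both marginals stay zero-mean Gaussian under the heat process dX = sqrt(2) dW,
# with variances a(t) = a0 + 2t and b(t) = b0 + 2t.
a0, b0 = 1.0, 3.0

def kl(t):  # KL between 1D zero-mean Gaussians N(0, a) and N(0, b)
    a, b = a0 + 2 * t, b0 + 2 * t
    return 0.5 * (np.log(b / a) + a / b - 1.0)

def rel_fisher(t):  # E_p[(d/dx log p - d/dx log q)^2]
    a, b = a0 + 2 * t, b0 + 2 * t
    return (b - a) ** 2 / (a * b**2)

t, eps = 0.7, 1e-6
dkl_dt = (kl(t + eps) - kl(t - eps)) / (2 * eps)  # numerical time derivative
print(dkl_dt, -(2.0 / 2.0) * rel_fisher(t))       # the two sides agree
```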
Remark.
Variance-preserving process
Definition 2 (variance-preserving process).
The variance-preserving (VP) process is defined by the following SDE with noise schedule $\beta(t)$:
$$dX_t = -\tfrac{1}{2}\,\beta(t)\,X_t\,dt + \sqrt{\beta(t)}\,dW_t.$$
By Fokker-Planck, the equivalent velocity field dictating Wasserstein gradient flow is
$$v(x, t) = -\tfrac{1}{2}\,\beta(t)\,x - \tfrac{1}{2}\,\beta(t)\,\nabla\log p_t(x).$$
It’s related to the heat process up to a time-dependent rescaling of the sample space. The conditional distribution is
$$p_t(x \mid x_0) = \mathcal{N}\big(\sqrt{\bar\alpha(t)}\,x_0,\ (1 - \bar\alpha(t))\,I\big), \qquad \bar\alpha(t) = e^{-\int_0^t \beta(s)\,ds}.$$
Substituting into the dynamic de Bruijn identity, the drift terms magically cancel, and the terminal divergence vanishes as $T$ grows: both marginals converge to the same standard Gaussian, so integrating the identity over $[0, T]$ expresses $\mathrm{KL}(p_0\,\|\,q_0)$ entirely as accumulated score-matching error.
Remark (interpretation as OU-process).
The variance-preserving process is related to the Ornstein-Uhlenbeck process, the canonical continuous-time model for mean-reverting behavior, given by
$$dX_t = -\theta\,X_t\,dt + \sigma\,dW_t.$$
Here $\theta$ controls the strength of mean reversion and $\sigma$ the noise level; the VP process is an OU process with time-varying coefficients $\theta = \beta(t)/2$, $\sigma = \sqrt{\beta(t)}$, whose stationary distribution is the standard Gaussian since $\sigma^2/(2\theta) = 1$.
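The closed-form conditional in Definition 2 can be checked by simulating the VP SDE from a fixed $x_0$ (a sketch with a constant schedule $\beta$, an arbitrary choice; $\bar\alpha(t) = e^{-\beta t}$ then):

```python
import numpy as np

# Simulate dX = -0.5*beta*X dt + sqrt(beta) dW from a fixed x0 and compare
# against the closed-form conditional N(sqrt(alpha_bar)*x0, 1 - alpha_bar).
rng = np.random.default_rng(0)
beta, x0, dt, T = 2.0, 1.5, 1e-3, 1.0
x = np.full(100_000, x0)
for _ in range(int(T / dt)):
    x += -0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.normal(size=x.shape)

alpha_bar = np.exp(-beta * T)
print(np.mean(x), np.sqrt(alpha_bar) * x0)  # both ≈ 0.552
print(np.var(x), 1 - alpha_bar)             # both ≈ 0.865
```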
Tweedie’s formula and flow matching
Here, we clarify some of the most heavily overloaded concepts in flow matching / diffusion, prove that flow matching is MLE by Tweedie’s formula, and conclude with some possibly backfit first-principles analysis on why flow matching has come to dominate generative modeling. We’ll see that Tweedie’s formula is the engine behind the high-dimensional scalability of flow matching models.
Tweedie’s formula
Tweedie’s formula states that for processes with Gaussian noise, the true posterior mean (optimal denoiser) can be computed purely from the score of the marginal distribution.
Application of this formula sheds light on the fundamental unification of ODE-style flow matching and SDE-style diffusion as maximum likelihood estimation. It unifies the following objectives:
- Predicting the score
- Predicting the posterior mean (denoising)
- Predicting the denoising vector field.
Theorem 6 (Tweedie’s formula).
Let $Y = X_0 + \sigma Z$ with $Z \sim \mathcal{N}(0, I)$ independent of $X_0 \sim p_0$, and let $p$ denote the marginal density of $Y$. Then
$$\mathbb{E}[X_0 \mid Y = y] = y + \sigma^2\,\nabla\log p(y),$$
where $\nabla\log p$ is the score of the marginal (not the conditional) distribution.
Proof sketch: substitute the Gaussian conditional score formula. The marginal is $p(y) = \int p_0(x)\,\varphi_\sigma(y - x)\,dx$, where the Gaussian kernel $\varphi_\sigma(u) \propto e^{-\|u\|^2/2\sigma^2}$ satisfies $\nabla_y \varphi_\sigma(y - x) = \frac{x - y}{\sigma^2}\,\varphi_\sigma(y - x)$. Substitute into the score:
$$\nabla\log p(y) = \frac{\int p_0(x)\,\frac{x - y}{\sigma^2}\,\varphi_\sigma(y - x)\,dx}{p(y)} = \frac{\mathbb{E}[X_0 \mid y] - y}{\sigma^2}.$$
Rearranging yields the result. $\square$
Remark.
This is a highly nontrivial theorem. In general, posterior quantities depend on the prior distribution of $X_0$ in complicated, nonlocal ways; Tweedie’s formula compresses all of that dependence into a single local quantity — the score of the noisy marginal.
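Tweedie’s formula can be checked on a toy prior where the posterior mean is known exactly: for $X_0 \in \{-1, +1\}$ with equal mass and $Y = X_0 + \sigma Z$, the posterior mean is $\tanh(y/\sigma^2)$. The sketch below recovers the same function purely from the marginal score:

```python
import numpy as np

# Two-point prior: marginal p(y) ∝ exp(-(y-1)^2/2σ²) + exp(-(y+1)^2/2σ²).
# Tweedie:  E[X0 | y] = y + σ² d/dy log p(y),  which must equal tanh(y/σ²).
sigma = 0.7

def log_marginal(y):  # log p(y) up to an additive constant
    return np.logaddexp(-(y - 1) ** 2 / (2 * sigma**2),
                        -(y + 1) ** 2 / (2 * sigma**2))

ys, eps = np.linspace(-2.5, 2.5, 11), 1e-6
score = (log_marginal(ys + eps) - log_marginal(ys - eps)) / (2 * eps)
tweedie = ys + sigma**2 * score
print(np.max(np.abs(tweedie - np.tanh(ys / sigma**2))))  # ≈ 0
```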
Remark (application, IMPORTANT!).
Flipping the theorem on its head, Tweedie reduces a density estimation problem to a regression problem: the optimal MSE denoiser is the posterior mean, so training a denoiser by regression implicitly estimates the marginal score $\nabla\log p$. Density estimation scales nonparametrically with dimension; denoising regression scales with the amount and internal structure of the data.
Next, let’s consider a compact, concrete instantiation: the flow matching process.
The flow matching process
We analyze the flow matching process by applying the following construction:
- We define an independent coupling between data and i.i.d. Gaussian noise.
- Given the independent coupling, we define straight-line transport conditioned upon endpoints. This supplies the trivial endpoints-conditioned vector field.
- Using 1-2 above, we can derive the data-conditioned vector field $v(x, t \mid x_1)$. This is a straight vector field.
- Using 3, we derive the marginal vector field $v(x, t)$. This is not a straight vector field.
- Note that the noise-conditioned vector field $v(x, t \mid x_0)$ is straight.
It’s extremely important to differentiate between vector fields by what they’re conditioned on.
Definition 3 (the flow matching ODE process).
Fixing data (or model) distribution $p_1$ and noise distribution $p_0 = \mathcal{N}(0, I)$, draw the independent coupling $(x_0, x_1) \sim p_0 \otimes p_1$ and define the interpolant $x_t = t\,x_1 + (1 - t)\,x_0$ for $t \in [0, 1]$.
Recalling our discussion of the optimal transport field from Part 1, the OT process for the conditional coupling implements straight line transport between endpoints. The conditional density implements linear interpolation between data sample and Gaussian noise:
$$p_t(x \mid x_1) = \mathcal{N}\big(t\,x_1,\ (1 - t)^2\,I\big).$$
The next step is deriving the conditional and marginal vector fields.
Fixing both data $x_1$ and noise $x_0$, the straight-line path has constant velocity $\dot{x}_t = x_1 - x_0$.
Proposition 1 (data-conditioned flow-matching vector field).
Fixing data $x_1$ only, the conditional vector field is
$$v(x, t \mid x_1) = \frac{x_1 - x}{1 - t},$$
which transports $p_0$ to the point mass at $x_1$ along straight lines.
For general endpoints-conditioned transport to initial position conditioned transport, apply the substitution $x_0 = \frac{x - t\,x_1}{1 - t}$, which gives $x_1 - x_0 = \frac{x_1 - x}{1 - t}$.
Let’s proceed to derive the marginal vector field. Recall the decomposition:
$$v(x, t) = \int v(x, t \mid x_1)\,p(x_1 \mid x_t = x)\,dx_1.$$
- The LHS denotes the macroscopic fluid velocity at position $x$, time $t$.
- Fluid at this space-time point is composed of fluid starting across a range of initial positions, each with its own velocity.
- The RHS denotes an ensemble of particle velocities $v(x, t \mid x_1)$, averaged by their constituent ratio $p(x_1 \mid x_t = x)$.
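This decomposition can be checked by Monte Carlo on a deliberately tractable toy: with Gaussian data $x_1 \sim \mathcal{N}(0,1)$ and noise $x_0 \sim \mathcal{N}(0,1)$, everything is jointly Gaussian and the marginal velocity is linear, $v(x, t) = \frac{2t - 1}{t^2 + (1-t)^2}\,x$:

```python
import numpy as np

# Regressing the conditional velocity x1 - x0 on x_t estimates the posterior
# mean E[x1 - x0 | x_t], i.e. the marginal velocity, which is linear here.
rng = np.random.default_rng(0)
t, n = 0.7, 1_000_000
x1 = rng.normal(size=n)        # data endpoint
x0 = rng.normal(size=n)        # noise endpoint
xt = t * x1 + (1 - t) * x0
target = x1 - x0               # endpoint-conditioned (straight-line) velocity

slope = np.sum(xt * target) / np.sum(xt * xt)  # least-squares slope
print(slope, (2 * t - 1) / (t**2 + (1 - t) ** 2))  # both ≈ 0.690
```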
Proposition 2 (flow matching formulas).
The marginal vector field can be written via Tweedie’s formula in terms of the marginal score:
$$v(x, t) = \frac{\hat{x}_1(x, t) - x}{1 - t}, \qquad \hat{x}_1(x, t) = \mathbb{E}[x_1 \mid x_t = x] = \frac{x + (1 - t)^2\,\nabla\log p_t(x)}{t}.$$
Remark (estimate v. transport).
Note that if we transport with the conditional field $v(x, t \mid x_1)$, every particle collapses onto the single endpoint $x_1$: the conditional field is a regression target, not the transport map we sample with. Only the posterior-averaged marginal field $v(x, t)$ transports the noise distribution onto the full data distribution.
Remark (straight vector fields).
Note that the data-conditioned vector field $v(x, t \mid x_1)$ is straight, but the marginal field $v(x, t)$ generally is not: a posterior-weighted average of straight fields is typically curved. This is why one-step integration of the marginal field fails, and few-step samplers need extra machinery such as distillation or reflow.
What happens when we depart from the KL-weighting? Below, we show that using non-MLE time-weightings still optimizes a principled objective: a weighted bulk of KL divergences along the noise spectrum.
Proposition 3 (non-MLE weightings produce bulk-KL).
Consider the weighted score-matching objective
$$J(\lambda) = \frac{1}{2}\int_0^T \lambda(t)\,\mathbb{E}_{p_t}\big\|\nabla\log p_t - \nabla\log q_t\big\|^2\,dt.$$
The identity
$$J(\lambda) = \Big[-\frac{\lambda(t)}{g(t)^2}\,\mathrm{KL}(p_t\,\|\,q_t)\Big]_0^T + \int_0^T \frac{d}{dt}\Big(\frac{\lambda(t)}{g(t)^2}\Big)\,\mathrm{KL}(p_t\,\|\,q_t)\,dt$$
holds whenever the boundary term is well-defined. In particular, whenever the boundary term vanishes, score matching with weight $\lambda$ minimizes a “bulk” of KL divergences across noise levels, weighted by $\frac{d}{dt}\big(\lambda/g^2\big)$.
The maximum-likelihood choice is $\lambda(t) = g(t)^2$: the bulk weight then vanishes and $J = \mathrm{KL}(p_0\,\|\,q_0) - \mathrm{KL}(p_T\,\|\,q_T)$, so minimizing the objective is maximizing likelihood up to the (small) terminal divergence.
Proof
From the dynamic de Bruijn identity,
$$\mathbb{E}_{p_t}\big\|\nabla\log p_t - \nabla\log q_t\big\|^2 = -\frac{2}{g(t)^2}\,\frac{d}{dt}\,\mathrm{KL}(p_t\,\|\,q_t).$$
Substitute into the weighted score objective and use integration by parts in $t$. $\square$
Corollary 2 (flow-matching / uniform-velocity weighting).
A very common choice is uniform-in-time velocity matching. Under the linear interpolant, velocity and score errors are related by $v - \hat{v} = \frac{1 - t}{t}\,(\nabla\log p_t - \nabla\log q_t)$, so uniform velocity weighting corresponds to the score weight $\lambda(t) = \big(\frac{1 - t}{t}\big)^2$.
The equivalent bulk-KL weighting is $w(t) = \frac{d}{dt}\big(\lambda(t)/g(t)^2\big)$: instead of the likelihood $\mathrm{KL}(p_0\,\|\,q_0)$ alone, uniform velocity matching minimizes a weighted average of KL divergences across the noise spectrum.
Flow matching in practice
Let’s look at the preceding proposition operationally:
- The straight-line ODE transport equation tells us that given the (noise) spectrum-indexed family of scores $\{\nabla\log p_t\}$, we can integrate from $t = 0$ to $t = 1$ to generate a sample.
- The de Bruijn formula tells us that minimizing score MSE yields maximum likelihood.
If we’re happy generating samples by integrating a vector field, we only need to approximate the score family — but the marginal score is exactly the intractable object we started with.
Our escape hatch is the conditional decomposition above: the marginal target is a posterior expectation of tractable conditional targets, so MSE regression against conditional targets converges to the marginal one.
Remark.
Rewriting the conditional-target MSE by inserting and subtracting the marginal target:
$$\mathbb{E}\big\|v_\theta(x_t, t) - (x_1 - x_0)\big\|^2 = \mathbb{E}\big\|v_\theta(x_t, t) - v(x_t, t)\big\|^2 + \mathbb{E}\big\|v(x_t, t) - (x_1 - x_0)\big\|^2.$$
Note that samples are all drawn from the data-noise coupling; no access to the marginal field is needed. The cross term vanishes; this is the law of total variance.
Remark.
This analysis shows that the irreducible noise at level $t$ — the conditional variance $\mathbb{E}\,\mathrm{Var}(x_1 - x_0 \mid x_t)$ — is a model-independent constant: the conditional and marginal objectives differ by an additive offset, so they share the same minimizer and the same gradients.
In practice, the literature has settled upon using function estimators to approximate the velocity field (or an equivalent reparameterization of it).
Crucially, both the score and velocity are linear in the denoiser $\hat{x}_1(x, t) = \mathbb{E}[x_1 \mid x_t = x]$,
and we regress one of the equivalent targets:
- Predict score. Using Tweedie, the conditional score target is $-\frac{x_0}{1 - t}$, with objective $\mathbb{E}\big\|s_\theta(x_t, t) + \frac{x_0}{1 - t}\big\|^2$.
- Predict velocity. Using the interpolant, the conditional straight-line velocity is $x_1 - x_0$, with objective $\mathbb{E}\big\|v_\theta(x_t, t) - (x_1 - x_0)\big\|^2$. This is the most popular parameterization. With the network absorbing the $\frac{1}{1 - t}$ term, this becomes a normalized-noise predictor.
- Predict $x_1$ / denoise. Train against $\mathbb{E}\big\|\hat{x}_\theta(x_t, t) - x_1\big\|^2$ with the MLE weight.
These are the same objective written in different coordinates; only the deterministic, time-dependent weighting across noise levels differs. The irreducible conditional-variance floor, by contrast, does not depend on the parameterization. Note that this is intrinsic to the data distribution.
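The coordinate changes between the three parameterizations can be sanity-checked on the Gaussian toy case, where every quantity is closed-form (a sketch under the linear-interpolant convention $x_t = t\,x_1 + (1-t)\,x_0$ with data $x_1 \sim \mathcal{N}(0, 1)$):

```python
import numpy as np

# For Gaussian data, the marginal is N(0, s2) with s2 = t^2 + (1-t)^2, and
# the score, Tweedie denoiser, and straight-line velocity are all linear in x.
t = 0.3
s2 = t**2 + (1 - t) ** 2
x = np.linspace(-3, 3, 13)

score = -x / s2                          # marginal score of N(0, s2)
x1_hat = (x + (1 - t) ** 2 * score) / t  # Tweedie: score -> denoiser
velocity = (x1_hat - x) / (1 - t)        # denoiser -> velocity

print(np.max(np.abs(x1_hat - t * x / s2)))              # ≈ 0
print(np.max(np.abs(velocity - (2 * t - 1) * x / s2)))  # ≈ 0
```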
Footnotes
- It doesn’t matter where the conditional vector field is elsewhere, because the conditional distribution does not have mass elsewhere. ↩