
OT for generative modeling 3 — Diffusion as Maximum Likelihood Estimation

Topic: Machine Learning


This part uses the Wasserstein toolkit we’ve developed in parts 0 and 1 (links) to unpack the dominant generative paradigm from first principles. We begin with some physics on Brownian motion and stochastic processes; highlights include:

  • Brownian motion score flow provides the fundamental bridge between SDE and ODE formulations as well as a valuable perspective on the score-ODE which shows up everywhere in generative modeling.
    • As a corollary, we prove Anderson’s theorem, which allows one to run an SDE backwards in time.
  • Unifying microscopic particle movement with macroscopic, thermodynamic optimization: diffusion with drift (Fokker-Planck) as Wasserstein gradient descent.

We also deep-dive into the two absolutely foundational pillars of scalable diffusion / flow matching. In my opinion, they are the first-principles reason why flow matching dominates modern generative modeling:

  • Tweedie’s formula provides a dimension-scalable solution to the density estimation problem, the universal problem underlying all generative modeling. It turns density estimation — which nonparametrically scales with dimension — into function estimation, which scales with the amount and internal structure of the data.
  • The dynamic de Bruijn identity provides a canonical noise-spectral decomposition of KL-divergence. It bridges score matching (exactly what Tweedie’s formula provides) with MSE; it also provides a canonical spectrum over which to commit bias-variance tradeoffs.

Later todo for myself: look into diffusion as optimal Bayes engine (Polyanskiy).


Physics of diffusion

We begin by delving into some physics. We first adopt a microscopic, particle-level description of the diffusive process, then unify it with a macroscopic, information-theoretic description:

  • Microscopic language: Brownian motion, vector fields, and score.
  • Macroscopic language: KL minimization, entropy.

The keystone unifying these two perspectives is the Wasserstein gradient flow.

Brownian motion as score flow

Remark (Brownian motion).

Formally, a standard Brownian motion is a continuous-time stochastic process characterized by:

  • Independent increments: future displacements are entirely independent of past states
  • Gaussian increments: $B_{t + \Delta t} - B_t \sim \mathcal{N}(0, \Delta t)$.

The infinitesimal increment $dB_t$ behaves as a zero-mean Gaussian variable with variance $dt$. It is the CLT limit of a discrete binomial random walk. We’ll deep-dive in a later series on the connections between Black-Scholes options pricing, heat diffusion, and quantum mechanics; stay tuned!
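As a quick numeric illustration (my own sketch, not part of the derivation), a binomial random walk with steps $\pm\sqrt{\Delta t}$ converges to Brownian motion, so its terminal displacement should be approximately $\mathcal{N}(0, t)$:

```python
import numpy as np

# Binomial random walk with steps ±sqrt(dt): its CLT limit is standard
# Brownian motion, so the terminal displacement should be ~ N(0, t).
rng = np.random.default_rng(0)
n_paths, n_steps, dt = 20_000, 500, 0.02
steps = rng.choice([-1.0, 1.0], size=(n_paths, n_steps)) * np.sqrt(dt)
b_t = steps.sum(axis=1)
t = n_steps * dt                      # t = 10.0

print(b_t.mean(), b_t.var())          # ≈ 0 and ≈ t
```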

Consider the standard heat-diffusion process where particles are purely driven by Brownian jittering:

Over time $dt$, particles jump according to $dx_t = g\, dB_t$.

Theorem 1 (Brownian motion as score flow).

The stochastic evolution $dx_t = g\, dB_t$ is equivalent at the distribution level to deterministic evolution under the velocity field

$v_t(x) = -\frac{g^2}{2}\, \nabla \log p_t(x)$

We’re saying that the net, distribution-level effect of Brownian motion can be described by probability mass evolving deterministically according to the score gradient $\nabla \log p_t$. Equivalently,

$\partial_t p_t = \frac{g^2}{2}\, \Delta p_t = -\nabla \cdot \left(p_t \cdot \left(-\tfrac{g^2}{2}\, \nabla \log p_t\right)\right)$

Remark.

This is the engine behind reducing SDE (diffusion) models to ODE (flow matching) models.
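A minimal numeric illustration of the reduction (my own sketch, with $g = 1$ and a Gaussian initial distribution, where the score is available in closed form): the SDE ensemble and the score-ODE ensemble end up with the same marginal variance.

```python
import numpy as np

# For p_0 = N(0, 1) and dx = dB_t (g = 1), the marginal is p_t = N(0, 1 + t),
# so the score is ∇log p_t(x) = -x / (1 + t). The SDE and the deterministic
# score ODE  x' = -(1/2) ∇log p_t(x)  should induce the same marginals.
rng = np.random.default_rng(1)
T, n_steps, n = 2.0, 2_000, 50_000
dt = T / n_steps

x_sde = rng.standard_normal(n)        # particles under Brownian jittering
x_ode = rng.standard_normal(n)        # particles under the score flow
for k in range(n_steps):
    t = k * dt
    x_sde += np.sqrt(dt) * rng.standard_normal(n)
    x_ode += 0.5 * x_ode / (1 + t) * dt    # -(1/2) * (-x / (1 + t))

print(x_sde.var(), x_ode.var())       # both ≈ 1 + T = 3
```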

Proof sketch: 1D discretization argument

Imagine bins of width $\Delta x$ at points $x_i$, each with probability mass $m_i$. Over time interval $\Delta t$, Brownian particles have equal probability of hopping left or right. The net flux of mass from bin $i$ to $i+1$ is:

$\tfrac{1}{2} m_i - \tfrac{1}{2} m_{i+1}$

The flux from $i-1$ to $i$ is $\tfrac{1}{2} m_{i-1} - \tfrac{1}{2} m_i$. The net change in mass at bin $i$ is:

$\Delta m_i = \tfrac{1}{2}\left(m_{i-1} - 2 m_i + m_{i+1}\right)$

Converting to continuous density $p(x_i) = m_i / \Delta x$:

$\Delta p(x_i) = \tfrac{1}{2}\left(p(x_i - \Delta x) - 2 p(x_i) + p(x_i + \Delta x)\right)$

Recognizing the central difference approximation to the Laplacian $\partial_x^2 p$:

$\partial_x^2 p(x_i) \approx \frac{p(x_i - \Delta x) - 2 p(x_i) + p(x_i + \Delta x)}{\Delta x^2}$

we obtain:

$\Delta p \approx \frac{\Delta x^2}{2}\, \partial_x^2 p$

The variance of Brownian motion over time $\Delta t$ is $\Delta x^2 = g^2\, \Delta t$, yielding:

$\partial_t p = \frac{g^2}{2}\, \partial_x^2 p$

for $dx_t = g\, dB_t$.

Brownian motion with drift, Fokker-Planck

Now add drift $f(x, t)$ to the Brownian motion:

$dx_t = f(x_t, t)\, dt + g\, dB_t$

By linearity, the induced probability density evolves under velocity field:

$v_t = f - \frac{g^2}{2}\, \nabla \log p_t$

Applying the continuity equation $\partial_t p_t = -\nabla \cdot (p_t v_t)$ yields the Fokker-Planck equation:

Since curl components don’t affect density evolution, write $f = -\nabla U$ for a potential $U$, and we’re ready to state the celebrated Fokker-Planck equation.

Theorem 2 (Fokker-Planck equation).

Particles evolving under the SDE

$dx_t = f(x_t, t)\, dt + g\, dB_t$

induce probability density evolution given by the Fokker-Planck equation:

$\partial_t p_t = -\nabla \cdot (p_t f) + \frac{g^2}{2}\, \Delta p_t$

Writing $f = -\nabla U$ (curl components vanish under divergence), this is equivalently deterministic flow under:

$v_t = -\nabla U - \frac{g^2}{2}\, \nabla \log p_t$

Proof. By linearity and the Brownian motion as score flow theorem, the following SDE and ODE are interchangeable:

$dx_t = f\, dt + g\, dB_t \quad\Longleftrightarrow\quad \dot{x}_t = f(x_t, t) - \frac{g^2}{2}\, \nabla \log p_t(x_t)$

Applying the continuity equation $\partial_t p_t = -\nabla \cdot (p_t v_t)$:

$\partial_t p_t = -\nabla \cdot (p_t f) + \frac{g^2}{2}\, \Delta p_t \qquad \square$

Diffusion as Wasserstein gradient flow

In this subsection, we’ll endow the Fokker-Planck (diffusion) equation with a macroscopic interpretation. This exemplifies the theme that Wasserstein geometry connects microscopic, particle-level evolution with distribution-level extremization. Let’s begin with a special case:

Corollary 1 (Brownian motion maximizes entropy).

Pure Brownian diffusion executes Wasserstein gradient ascent on the entropy functional $H[p] = -\int p \log p\, dx$.

Proof. The functional derivative of entropy is:

$\frac{\delta H}{\delta p} = -\log p - 1$

By Otto’s theorem (see Part 2), the Wasserstein gradient is the velocity field:

$\operatorname{grad}_W H = \nabla \frac{\delta H}{\delta p} = -\nabla \log p$

Wasserstein gradient ascent follows $v = \operatorname{grad}_W H = -\nabla \log p$, which matches the Brownian score flow $v = -\frac{g^2}{2}\, \nabla \log p$ with ascent rate $\frac{g^2}{2}$. $\square$

Remark.

This corollary bridges the microscopic interpretation of the heat equation as Brownian motion (particle jittering) with the macroscopic interpretation of the heat equation as the process that maximizes entropy along the Wasserstein geometry.

Why the Wasserstein geometry? There exist other geometries of probability distributions, Fisher-Rao being a notable one. Is the Wasserstein geometry, in some sense, canonical? For one, Fisher-Rao is not physical because it’s agnostic towards rearrangements of the base space; mass can be teleported around. The rigorous physical grounding of the Wasserstein geometry in non-equilibrium dynamics is a deep rabbit hole which we’ll not explore at the moment.

Now the general case. For simplicity, we suppress explicit time dependence of the potential $U$ and the diffusion rate $g$:

Theorem 3 (Fokker-Planck as Wasserstein gradient flow).

Define the Boltzmann equilibrium distribution from the potential $U$:

$p_{\mathrm{eq}}(x) \propto \exp\left(-\frac{2 U(x)}{g^2}\right)$

Define the scaled KL divergence:

$F[p] = \frac{g^2}{2}\, D_{\mathrm{KL}}(p \,\|\, p_{\mathrm{eq}})$

The Fokker-Planck equation is Wasserstein gradient descent on $F$:

$v_t = -\operatorname{grad}_W F[p_t]$

Proof

Expanding the KL divergence:

$D_{\mathrm{KL}}(p \,\|\, p_{\mathrm{eq}}) = \int p \log p\, dx - \int p \log p_{\mathrm{eq}}\, dx$

Using $\log p_{\mathrm{eq}} = -\frac{2U}{g^2} - \log Z$:

$F[p] = \frac{g^2}{2} \int p \log p\, dx + \int p\, U\, dx + \mathrm{const}$

The functional derivative is:

$\frac{\delta F}{\delta p} = \frac{g^2}{2}\left(\log p + 1\right) + U$

By Otto’s theorem, the Wasserstein gradient is:

$\operatorname{grad}_W F = \nabla \frac{\delta F}{\delta p} = \frac{g^2}{2}\, \nabla \log p + \nabla U$

Gradient descent follows $v = -\operatorname{grad}_W F = -\nabla U - \frac{g^2}{2}\, \nabla \log p$. From Theorem 2 with $f = -\nabla U$, this is exactly the Fokker-Planck velocity field.

Applying the continuity equation completes the proof. $\square$

Remark (Interpretation).

Fokker-Planck evolution minimizes KL divergence to the Boltzmann equilibrium distribution $p_{\mathrm{eq}} \propto e^{-2U/g^2}$, where $U$ plays the role of energy and $\frac{g^2}{2}$ acts as temperature. The gradient $-\nabla U$ pulls mass towards low potential while diffusion spreads it out, balancing until $p = p_{\mathrm{eq}}$.

Remark (Perspectives on the equilibrium distribution).

The Boltzmann equilibrium distribution admits three equivalent interpretations:

  1. Thermodynamic principle: $p_{\mathrm{eq}}$ is the distribution that maximizes entropy subject to fixed expected energy:

    $\max_p\, H[p] \quad \text{s.t.} \quad \mathbb{E}_p[U] = E$

    The Lagrange multiplier enforcing the energy constraint gives the inverse temperature $\beta = \frac{2}{g^2}$.

  2. Dynamical equilibrium: $p_{\mathrm{eq}}$ is the unique stationary solution to the Fokker-Planck equation. Setting $\partial_t p = 0$:

    $\nabla \cdot \left(p\, \nabla U + \frac{g^2}{2}\, \nabla p\right) = 0$

    This holds when the drift and diffusion balance: $p\, \nabla U = -\frac{g^2}{2}\, \nabla p$, i.e., $p \propto e^{-2U/g^2}$.

  3. Optimal control: In reinforcement learning, the reward-optimal policy maximizes expected reward subject to KL proximity to a reference policy:

    $\max_\pi\; \mathbb{E}_\pi[r] - \tau\, D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{ref}})$

    The solution is $\pi^*(x) \propto \pi_{\mathrm{ref}}(x)\, e^{r(x)/\tau}$, recovering the Boltzmann form when $\pi_{\mathrm{ref}}$ is uniform and $r = -U$, $\tau = \frac{g^2}{2}$.
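To make the equilibrium picture concrete, here is a small Langevin simulation (my own sketch, with $U(x) = x^2/2$ and diffusion constant $D = g^2/2$): the SDE’s empirical distribution relaxes to the Boltzmann distribution $\propto e^{-U/D}$, here $\mathcal{N}(0, D)$.

```python
import numpy as np

# Langevin dynamics dx = -∇U(x) dt + sqrt(2D) dB_t with U(x) = x²/2.
# The Fokker-Planck stationary solution is p_eq ∝ exp(-U(x)/D) = N(0, D):
# drift pulls mass toward low potential, diffusion spreads it out.
rng = np.random.default_rng(2)
D, dt, n_steps, n = 0.25, 0.01, 4_000, 20_000

x = rng.standard_normal(n) * 3.0      # start far from equilibrium
for _ in range(n_steps):
    x += -x * dt + np.sqrt(2 * D * dt) * rng.standard_normal(n)

print(x.mean(), x.var())              # ≈ 0 and ≈ D = 0.25
```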

Bonus: Anderson’s theorem

The Brownian motion as score flow equivalence allows us to elegantly prove a foundational theorem, Anderson’s theorem, which converts between forward and backward processes. Incidentally, it will also be helpful when we consider options pricing as a reverse-time heat diffusion process. To motivate it, consider the following question:

  1. We know how to describe forward-time trajectories of particles using the heat equation (or Fokker-Planck when there’s potential).
  2. If we want to magically play the tape backwards, which particle evolution equation will replicate the reverse-time behavior?

Remark.

This is not a trivial problem: while an ODE reverses simply by flipping its velocity field, the Brownian term doesn’t simply reverse, since $-dB_t$ is again a Brownian increment and still spreads mass outwards. Our proof consists of converting the forward and backward SDEs into ODEs via Brownian motion as score flow and matching the velocity components.

Additionally, the choice of the same $g$ is canonical because the forward and backward drift should be able to make use of the same amount of diffusion rate. Since we can microscopically tell apart Brownian motions with different diffusion constants, and forward and reverse-time microscopic dynamics should be indistinguishable, we require the same $g$ in the reverse-time process.

Theorem 4 (Anderson’s theorem).

Given a forward stochastic process on $t \in [0, T]$ governed by the SDE

$dx_t = f(x_t, t)\, dt + g(t)\, dB_t$

Let $p_t$ be the marginal probability density of $x_t$. The reverse-time process which traces the exact same marginal distributions backwards from $p_T$ to $p_0$ is driven by the reverse SDE

$dx_t = \left[f(x_t, t) - g(t)^2\, \nabla \log p_t(x_t)\right] dt + g(t)\, d\bar{B}_t$

where $\bar{B}$ is a reverse-time Brownian motion.

Proof. By the Brownian motion as score flow theorem, the forward SDE is equivalent to the ODE:

$\dot{x}_t = f(x_t, t) - \frac{g^2}{2}\, \nabla \log p_t(x_t)$

For the reverse-time process, time flows backwards: $\tau = T - t$ where $p^{\mathrm{rev}}_\tau = p_{T - \tau}$. The reverse velocity must be the negative of the forward velocity to retrace the path. Converting back to forward time $t$:

$v^{\mathrm{rev}} = -f + \frac{g^2}{2}\, \nabla \log p_t$

We have postulated from first principles that the reverse-time ODE is equivalent to an SDE with some drift $\tilde{f}$ and the same diffusion rate $g$. Applying the gauge degree of freedom (score flow, now run against the reverse-time marginals) to find $\tilde{f}$:

$\tilde{f} - \frac{g^2}{2}\, \nabla \log p_t = -f + \frac{g^2}{2}\, \nabla \log p_t$

The LHS corresponds to the ODE form of the candidate reverse SDE. The RHS comes from ODE reversal. Solving for $\tilde{f}$ yields

$\tilde{f} = -f + g^2\, \nabla \log p_t$

which, written in forward-time notation, is exactly the reverse SDE above. $\square$
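Here is a numeric check of the theorem (my own sketch with $f = 0$, $g = 1$, and Gaussian $p_0$, so the score is available in closed form): running the reverse SDE from $p_T$ recovers the variance of $p_0$.

```python
import numpy as np

# Forward: dx = dB_t from p_0 = N(0, s0²), so p_t = N(0, s0² + t).
# Anderson: the reverse SDE dx = [f - g²∇log p_t] dt + g dB̄_t (f = 0, g = 1)
# retraces the same marginals back to p_0.
rng = np.random.default_rng(3)
s0, T, n_steps, n = 0.5, 1.0, 2_000, 20_000
dt = T / n_steps

x = rng.standard_normal(n) * np.sqrt(s0**2 + T)   # sample the terminal p_T
for k in range(n_steps):
    t = T - k * dt                                # running time backwards
    score = -x / (s0**2 + t)                      # ∇log p_t for a Gaussian
    x += score * dt + np.sqrt(dt) * rng.standard_normal(n)

print(x.var())                                    # ≈ s0² = 0.25
```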

MLE interpretation of diffusion models

In this section, we provide a first-principles, maximum likelihood interpretation of diffusion generative models by applying the (dynamic) de Bruijn identity. This is the crux behind the celebrated paper Maximum Likelihood Training of Score-Based Diffusion Models.

The maximum likelihood objective is prone to divergence when the support of the model distribution does not cover that of the data. The de Bruijn identity decomposes maximum likelihood into score matching over a noise spectrum, allowing us to attenuate the divergence in a controllable limit of the integral.

The dynamic de Bruijn identity

Recall the definition of the Riemannian gradient:

$\frac{d}{dt} F(x_t) = \left\langle \operatorname{grad} F,\; \dot{x}_t \right\rangle$

Consider the product manifold $\mathcal{P} \times \mathcal{P}$ equipped with the standard product metric, where $(u, w)$ are tangent vectors on the components:

$\left\langle (u_1, w_1),\, (u_2, w_2) \right\rangle = \langle u_1, u_2 \rangle + \langle w_1, w_2 \rangle$

Define the functional $F(p, q) = D_{\mathrm{KL}}(p \,\|\, q)$ on the product manifold. The Wasserstein gradient factors (note that $\operatorname{grad}_p$ denotes the Wasserstein gradient w.r.t. $p$):

$\operatorname{grad} F = \left(\operatorname{grad}_p F,\, \operatorname{grad}_q F\right), \qquad \operatorname{grad}_p F = \nabla \log \frac{p}{q}, \qquad \operatorname{grad}_q F = -\nabla \frac{p}{q}$

Given trajectories $\partial_t p_t = -\nabla \cdot (p_t v^p_t)$, $\partial_t q_t = -\nabla \cdot (q_t v^q_t)$, we have

$\frac{d}{dt} D_{\mathrm{KL}}(p_t \,\|\, q_t) = \left\langle \operatorname{grad}_p F,\, v^p_t \right\rangle_{p_t} + \left\langle \operatorname{grad}_q F,\, v^q_t \right\rangle_{q_t} = \int p_t \left\langle \nabla \log \frac{p_t}{q_t},\; v^p_t - v^q_t \right\rangle dx$

Note that the integral and gradient $\nabla$ are over $x$ in sample space. This is the general dynamic de Bruijn identity: it’s just the gradient expansion of the KL-divergence on the product manifold.

Theorem 5 (the dynamic de Bruijn identity).

Given distribution trajectories $\partial_t p_t = -\nabla \cdot (p_t v^p_t)$ and $\partial_t q_t = -\nabla \cdot (q_t v^q_t)$, the KL divergence rate of change is given by

$\frac{d}{dt} D_{\mathrm{KL}}(p_t \,\|\, q_t) = \int p_t \left\langle \nabla \log \frac{p_t}{q_t},\; v^p_t - v^q_t \right\rangle dx$

Application to various processes.

The variance exploding process injects Gaussian noise without attenuating the original signal. It’s simply the heat equation with variable diffusion constant.

In this section, we consider two common processes and the application of the theorem to each:

  1. The heat process with noise schedule $g(t)$. It’s defined on $t \in [0, \infty)$ and the variance of the marginal explodes.
  2. The variance-preserving process, defined on $t \in [0, \infty)$ with noise schedule $\beta(t)$; its marginals converge to $\mathcal{N}(0, I)$.

Heat process

Definition 1 (heat process).

The variance-exploding process is defined to obey the SDE:

$dx_t = g(t)\, dB_t$

Also recall its Wasserstein gradient flow velocity field

$v_t = -\frac{g(t)^2}{2}\, \nabla \log p_t$

Here, $g(t)$ is also known as the noise schedule. Denote the accumulated variance as $\sigma^2(t) = \int_0^t g(s)^2\, ds$, then the marginal distribution is

$p_t = p_0 * \mathcal{N}\left(0,\; \sigma^2(t)\, I\right)$

If $p_t, q_t$ are both subject to the VE process, substituting into the dynamic de Bruijn identity yields

$\frac{d}{dt} D_{\mathrm{KL}}(p_t \,\|\, q_t) = -\frac{g(t)^2}{2}\, \mathbb{E}_{p_t}\left\|\nabla \log p_t - \nabla \log q_t\right\|^2$

Integrating over the noise spectrum:

$D_{\mathrm{KL}}(p_0 \,\|\, q_0) = \int_0^\infty \frac{g(t)^2}{2}\, \mathbb{E}_{p_t}\left\|\nabla \log p_t - \nabla \log q_t\right\|^2 dt + \lim_{T \to \infty} D_{\mathrm{KL}}(p_T \,\|\, q_T)$

Sometimes this specialized result is known as the dynamic de Bruijn identity.
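Since both sides are available in closed form for Gaussians, we can verify the identity numerically (my own sketch; $p_t = \mathcal{N}(0, a^2 + t)$, $q_t = \mathcal{N}(\mu, b^2 + t)$, constant $g = 1$):

```python
import numpy as np

# Check the dynamic de Bruijn identity for two 1-D Gaussians evolving under
# the same heat flow dx = g dB_t (g = 1, so σ²(t) = t):
#   d/dt KL(p_t || q_t) = -(g²/2) E_{p_t}[(∇log p_t - ∇log q_t)²]
# with p_t = N(0, a² + t), q_t = N(mu, b² + t).
a2, b2, mu, t = 1.0, 2.0, 1.5, 0.7

def kl(t):
    vp, vq = a2 + t, b2 + t
    return 0.5 * (np.log(vq / vp) + vp / vq + mu**2 / vq - 1.0)

# LHS: finite-difference time derivative of the KL divergence
h = 1e-5
lhs = (kl(t + h) - kl(t - h)) / (2 * h)

# RHS: E_{x~p_t}[(s_p - s_q)²] in closed form for Gaussian scores
vp, vq = a2 + t, b2 + t
score_gap_sq = vp * (1 / vq - 1 / vp) ** 2 + mu**2 / vq**2
rhs = -0.5 * score_gap_sq

print(lhs, rhs)   # the two sides agree
```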

Remark.

The equation demonstrates a fundamental interpretation of the de Bruijn identity. Note the Euclidean score matching loss $\mathbb{E}_{p_t}\left\|\nabla \log p_t - \nabla \log q_t\right\|^2$. This term is well-defined for all $t > 0$, where both marginals are Gaussian-smoothed. The divergence of disjoint support is hidden in the $t \to 0$ tail of the integral. We have expanded the KL divergence as score matching over a full noise spectrum.

Variance-preserving process

Definition 2 (variance-preserving process).

The variance-preserving (VP) process is defined by the following SDE with noise schedule $\beta(t)$:

$dx_t = -\frac{\beta(t)}{2}\, x_t\, dt + \sqrt{\beta(t)}\, dB_t$

By Fokker-Planck, the equivalent velocity field dictating Wasserstein gradient flow is

$v_t(x) = -\frac{\beta(t)}{2}\left(x + \nabla \log p_t(x)\right)$

It’s related to the heat process up to a time-dependent rescaling of the sample space. The conditional distribution is

$p_t(x_t \mid x_0) = \mathcal{N}\left(\alpha_t\, x_0,\; (1 - \alpha_t^2)\, I\right), \qquad \alpha_t = \exp\left(-\tfrac{1}{2}\int_0^t \beta(s)\, ds\right)$

Substituting into the dynamic de Bruijn identity, the drift terms magically cancel and the terminal divergence vanishes as $t \to \infty$ (both marginals converge to $\mathcal{N}(0, I)$), yielding

$D_{\mathrm{KL}}(p_0 \,\|\, q_0) = \int_0^\infty \frac{\beta(t)}{2}\, \mathbb{E}_{p_t}\left\|\nabla \log p_t - \nabla \log q_t\right\|^2 dt$
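A quick simulation check (my own sketch, constant $\beta$): the Euler-discretized VP SDE matches the closed-form conditional $x_t = \alpha_t x_0 + \sqrt{1 - \alpha_t^2}\,\varepsilon$, and unit data variance is preserved.

```python
import numpy as np

# VP SDE dx = -(β/2) x dt + sqrt(β) dB_t with constant β. The conditional
# marginal is x_t | x_0 ~ N(α x_0, 1 - α²), α = exp(-βT/2); if Var(x_0) = 1,
# the marginal variance stays at 1 for all t.
rng = np.random.default_rng(4)
beta, T, n_steps, n = 1.0, 2.0, 2_000, 20_000
dt = T / n_steps

x0 = rng.standard_normal(n)           # unit-variance "data"
x = x0.copy()
for _ in range(n_steps):
    x += -0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.standard_normal(n)

alpha = np.exp(-0.5 * beta * T)
corr = np.corrcoef(x, x0)[0, 1]       # signal retention should match α

print(x.var(), corr)                  # ≈ 1 and ≈ exp(-1)
```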

Remark (interpretation as OU-process).

The variance-preserving process is related to the Ornstein-Uhlenbeck process, which is the canonical continuous-time model for mean-reverting behavior, given by

$dx_t = \theta\left(\mu - x_t\right) dt + \sigma\, dB_t$

Here $\mu$ is the long-term mean, $\theta$ the restorative strength, and $\sigma$ the volatility.
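A short simulation (my own sketch) showing the mean reversion and the stationary law $\mathcal{N}(\mu, \sigma^2 / 2\theta)$:

```python
import numpy as np

# OU process dx = θ(μ - x) dt + σ dB_t: trajectories revert to the long-term
# mean μ, with stationary distribution N(μ, σ² / (2θ)).
rng = np.random.default_rng(6)
theta, mu, sigma = 2.0, 1.0, 0.5
dt, n_steps, n = 0.01, 5_000, 20_000

x = np.zeros(n)                       # start away from the mean
for _ in range(n_steps):
    x += theta * (mu - x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)

print(x.mean(), x.var())              # ≈ μ = 1 and ≈ σ²/(2θ) = 0.0625
```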

Tweedie’s formula and flow matching

Here, we clarify some of the most heavily overloaded concepts in flow matching / diffusion, prove that flow matching is MLE by Tweedie’s formula, and conclude with some possibly backfit first-principles analysis on why flow matching has come to dominate generative modeling. We’ll see that Tweedie’s formula is the engine behind the high-dimensional scalability of flow matching models.

Tweedie’s formula

Tweedie’s formula states that for processes with Gaussian noise, the true posterior mean (optimal denoiser) can be computed purely from the score of the marginal distribution.

Application of this formula sheds light on the fundamental unification of ODE-style flow matching and SDE-style diffusion as maximum likelihood estimation via flow matching. It unifies the following objectives:

  1. Predicting the score
  2. Predicting the posterior mean (denoising)
  3. Predicting the denoising vector field.

Theorem 6 (Tweedie’s formula).

Let $x$ be an unobserved true signal with an arbitrary prior $p(x)$, observed through Gaussian noise as $y = x + \sigma \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$. Then

$\mathbb{E}[x \mid y] = y + \sigma^2\, \nabla_y \log p(y)$

where $p(y)$ is the marginal distribution of $y$ and $\nabla_y \log p(y)$ its score.

Proof sketch: substitute the Gaussian conditional score formula

Without loss of generality we can substitute $\sigma = 1$, so $y = x + \varepsilon$. First compute the conditional distribution and score:

$p(y \mid x) = \mathcal{N}(y;\, x,\, I), \qquad \nabla_y \log p(y \mid x) = x - y$

Substitute into the score:

$\nabla_y \log p(y) = \frac{\int \nabla_y\, p(y \mid x)\, p(x)\, dx}{p(y)} = \int (x - y)\, p(x \mid y)\, dx = \mathbb{E}[x \mid y] - y$

Rearranging yields the result.
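The formula is easy to test numerically (my own sketch): take a two-point prior $x \in \{-1, +1\}$, where the exact posterior mean is $\tanh(y/\sigma^2)$, and compute the marginal score by finite differences, using only the marginal density.

```python
import numpy as np

# Tweedie: E[x | y] = y + σ² ∇_y log p(y) for a two-point prior x ∈ {-1, +1}
# (equiprobable) and y = x + σ ε. The exact posterior mean is tanh(y / σ²);
# the marginal score is taken by finite differences.
sigma = 0.8

def marginal(y):
    phi = lambda m: np.exp(-(y - m) ** 2 / (2 * sigma**2))
    return 0.5 * (phi(-1.0) + phi(1.0))     # unnormalized is fine for scores

y = np.linspace(-3, 3, 61)
h = 1e-5
score = (np.log(marginal(y + h)) - np.log(marginal(y - h))) / (2 * h)

tweedie = y + sigma**2 * score              # Tweedie posterior-mean estimate
exact = np.tanh(y / sigma**2)               # closed-form posterior mean

print(np.max(np.abs(tweedie - exact)))      # ≈ 0
```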

Remark.

This is a highly nontrivial theorem. In general, posterior quantities depend on the prior distribution of $x$. Here, the key property is that the Gaussian conditional score $\nabla_y \log p(y \mid x) = x - y$ is linear in $x$, which provided the closed form $\nabla_y \log p(y) = \mathbb{E}[x \mid y] - y$ that can be rearranged into the posterior mean.

Remark (application, IMPORTANT!).

Flipping the theorem on its head, Tweedie reduces a density estimation problem ($\nabla \log p$) into a mean-estimation problem ($\mathbb{E}[x \mid y]$). Suppose $p$ is an arbitrary distribution; we can apply $\sigma$ amount of noise and solve the mean-estimation problem to approximate the smoothed score $\nabla \log p_\sigma$; here $\sigma$ controls the bias-variance tradeoff.

Next, let’s consider compact $t \in [0, 1]$ with $p_1$ being the data / model boundary distribution, and $p_0$ being $\mathcal{N}(0, I)$. Recall in Part 1 that the Wasserstein-2 geodesic distance is realized by straight-line transport. Solving this ODE corresponds to flow matching, and we derive the de Bruijn MLE interpretation below:

The flow matching process

We analyze the flow matching process by applying the following construction:

  1. We define an independent coupling between data and i.i.d. Gaussian noise.
  2. Given the independent coupling, we define straight-line transport conditioned upon endpoints. This supplies the trivial endpoints-conditioned vector field.
  3. Using 1-2 above, we can derive the data-conditioned vector field $v_t(x \mid x_1)$. This is a straight vector field.
  4. Using 3, we derive the marginal vector field $v_t(x)$. This is not a straight vector field.
  5. Note that the noise-conditioned vector field $v_t(x \mid x_0)$ is straight.

It’s extremely important to differentiate between vector fields by what they’re conditioned on.

Definition 3 (the flow matching ODE process).

Fixing data (or model) distribution $p_1$ and stationary noise distribution $p_0 = \mathcal{N}(0, I)$, flow matching uses independent coupling between noise and data:

$(x_0, x_1) \sim p_0 \otimes p_1$

Recalling our discussion of the optimal transport field from Part 1, the OT process for the conditional coupling implements straight-line transport between endpoints. The conditional density implements linear interpolation between data sample and Gaussian noise:

$x_t = t\, x_1 + (1 - t)\, x_0$

The next step is deriving the conditional and marginal vector fields.

Fixing both data $x_1$ and noise $x_0$, the vector field on the valid interpolation line is the constant $v_t(x \mid x_0, x_1) = x_1 - x_0$.¹

Proposition 1 (data-conditioned flow-matching vector field).

Fixing data $x_1$, the conditional vector field is straight-line:

$v_t(x \mid x_1) = \frac{x_1 - x}{1 - t}$

At $(x, t)$, the trajectory endpoint is $x_1$. The projected displacement, which gives velocity under straight-line transport, is the current displacement $x_1 - x$ scaled by the remaining time $1 - t$.

To convert general endpoints-conditioned transport to data-conditioned transport, note that given $x_t = x$ and $x_1$, the noise endpoint is determined: $x_0 = \frac{x - t\, x_1}{1 - t}$, so $v_t(x \mid x_1) = x_1 - x_0 = \frac{x_1 - x}{1 - t}$.

Let’s proceed to derive the marginal vector field. Recall the decomposition over a conditioning variable $z$:

$v_t(x) = \int v_t(x \mid z)\, p(z \mid x_t = x)\, dz$

  • LHS denotes the macroscopic fluid velocity at position $x$, time $t$.
  • Fluid at this space-time point is composed of fluid starting across a range of initial conditions $z$, each with its own velocity.
  • The RHS denotes an ensemble of particle velocities from conditions $z$, averaged by their constituent ratio $p(z \mid x_t = x)$.

Proposition 2 (flow matching formulas).

Applying the decomposition with $z = x_1$ and $v_t(x \mid x_1) = \frac{x_1 - x}{1 - t}$ yields

$v_t(x) = \frac{\mathbb{E}[x_1 \mid x_t = x] - x}{1 - t}$

This looks difficult, but luckily, the independent coupling + straight-line endpoint transport implies conditional Gaussian noise $p_t(x \mid x_1) = \mathcal{N}\left(t\, x_1,\; (1 - t)^2 I\right)$. Applying Tweedie yields

$\mathbb{E}[x_1 \mid x_t = x] = \frac{x + (1 - t)^2\, \nabla \log p_t(x)}{t}$

Simplify to obtain

$v_t(x) = \frac{x + (1 - t)\, \nabla \log p_t(x)}{t}$

The drift terms cancel in $v^p_t - v^q_t = \frac{1 - t}{t}\left(\nabla \log p_t - \nabla \log q_t\right)$, yielding the KL decomposition

$D_{\mathrm{KL}}(p_1 \,\|\, q_1) = \int_0^1 \frac{1 - t}{t}\, \mathbb{E}_{p_t}\left\|\nabla \log p_t - \nabla \log q_t\right\|^2 dt$

(the boundary term vanishes since $p_0 = q_0 = \mathcal{N}(0, I)$).
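For 1-D Gaussian data everything here is closed-form, so we can integrate the marginal velocity field and watch it transport noise onto the data distribution (my own sketch; note the marginal field below is not the straight conditional field):

```python
import numpy as np

# For Gaussian data x1 ~ N(0, s²) and noise x0 ~ N(0, 1) under the straight
# interpolation x_t = t*x1 + (1-t)*x0, the marginal is N(0, t²s² + (1-t)²)
# and the marginal velocity works out (via Tweedie) to
#   v_t(x) = x * (t*s² - (1-t)) / (t²s² + (1-t)²).
# Integrating the ODE from t = 0 to 1 transports N(0, 1) onto N(0, s²).
rng = np.random.default_rng(5)
s, n_steps, n = 2.0, 2_000, 20_000
dt = 1.0 / n_steps

x = rng.standard_normal(n)            # start at the noise marginal
for k in range(n_steps):
    t = k * dt
    var_t = t**2 * s**2 + (1 - t) ** 2
    x += x * (t * s**2 - (1 - t)) / var_t * dt

print(x.var())                        # ≈ s² = 4
```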

Remark (estimate v. transport).

Note that if $x_1$ are e.g. images, then samples $x_1$ are generally sharp images, while the estimates $\mathbb{E}[x_1 \mid x_t]$ are generally blurry means. In particular, $\mathbb{E}[x_1 \mid x_0]$ is just the unconditional mean $\mathbb{E}[x_1]$, by independence of the coupling.

Remark (straight vector fields).

Note that the data-conditioned vector field $v_t(x \mid x_1) = \frac{x_1 - x}{1 - t}$ is straight. Similarly, the noise-conditioned vector field $v_t(x \mid x_0) = \frac{x - x_0}{t}$ is also straight. However, the marginal vector field $v_t(x)$ is not straight.

What happens when we depart from the KL-weighting? Below, we show that using non-MLE score weights produces a bulk-KL over intermediate marginals:

Proposition 3 (non-MLE weightings produce bulk-KL).

Consider $p_t, q_t$ evolving according to the independent-coupling process. Given weights $\lambda(t) \geq 0$, define

$K(t) = D_{\mathrm{KL}}(p_t \,\|\, q_t), \qquad \mu(t) = \lambda(t)\, \frac{t}{1 - t}$

The $\lambda$-weighted score-difference integral can be written as

$\int_0^1 \lambda(t)\, \mathbb{E}_{p_t}\left\|\nabla \log p_t - \nabla \log q_t\right\|^2 dt = \left[\mu(t)\, K(t)\right]_0^1 - \int_0^1 \mu'(t)\, K(t)\, dt$

whenever the boundary term is well-defined. In particular, whenever the boundary term vanishes, score matching with weight $\lambda$ is equivalent to minimizing the bulk-KL

$\int_0^1 \left(-\mu'(t)\right) D_{\mathrm{KL}}(p_t \,\|\, q_t)\, dt$

The maximum-likelihood choice $\lambda_{\mathrm{MLE}}(t) = \frac{1 - t}{t}$ is exceptional: then $\mu \equiv 1$ and $\mu' = 0$, so the integral collapses to the endpoint term $K(1) = D_{\mathrm{KL}}(p_1 \,\|\, q_1)$.

Proof

From the KL decomposition, differentiate both sides w.r.t. $t$ and rearrange:

$\mathbb{E}_{p_t}\left\|\nabla \log p_t - \nabla \log q_t\right\|^2 = \frac{t}{1 - t}\, K'(t)$

Substitute into the weighted score objective and use $\mu(t) = \lambda(t)\, \frac{t}{1 - t}$:

$\int_0^1 \lambda(t)\, \mathbb{E}_{p_t}\left\|\nabla \log p_t - \nabla \log q_t\right\|^2 dt = \int_0^1 \mu(t)\, K'(t)\, dt$

Integrate by parts to obtain $\left[\mu K\right]_0^1 - \int_0^1 \mu'(t)\, K(t)\, dt$. $\square$

Corollary 2 (flow-matching / uniform-velocity weighting).

A very common choice is uniform-in-time velocity matching, corresponding to

$\lambda(t) = \left(\frac{1 - t}{t}\right)^2, \qquad \text{since } \left\|v^p_t - v^q_t\right\|^2 = \left(\frac{1 - t}{t}\right)^2 \left\|\nabla \log p_t - \nabla \log q_t\right\|^2$

The equivalent bulk-KL weighting is

$-\mu'(t) = -\frac{d}{dt}\left(\frac{1 - t}{t}\right) = \frac{1}{t^2}$

Flow matching in practice

Let’s look at the preceding proposition operationally:

  1. The straight-line ODE transport equation tells us that given the (noise) spectrum-indexed family of scores $\left\{\nabla \log p_t\right\}_{t \in (0, 1]}$, we can integrate from $t = 0$ to $t = 1$ to generate a sample.
  2. The de Bruijn formula tells us that minimizing score MSE yields maximum likelihood.

If we’re happy generating samples by integrating a vector field, we only need to approximate the score $\nabla \log p_t$. This is a parametric density estimation problem. But the score target looks intractable.

Our escape hatch is Tweedie’s formula. Reparameterize the model through a denoiser $\hat{x}_\theta(x, t) \approx \mathbb{E}[x_1 \mid x_t = x]$; then the implied score of our generative model is

$\nabla \log q_t(x) = \frac{t\, \hat{x}_\theta(x, t) - x}{(1 - t)^2}$

Remark.

Note that $q_t$ is implicitly defined by the sampling process where we push the model distribution backwards through the ODE with the velocity implied by $\hat{x}_\theta$.

Rewriting $\nabla \log p_t$ using $\mathbb{E}[x_1 \mid x_t]$ via Tweedie, and using the implicit parameterization above, the de Bruijn equation yields

$D_{\mathrm{KL}}(p_1 \,\|\, q_1) = \int_0^1 \frac{t}{(1 - t)^3}\, \mathbb{E}_{p_t}\left\|\mathbb{E}[x_1 \mid x_t] - \hat{x}_\theta(x_t, t)\right\|^2 dt$

Note that samples $x_t$ are all drawn from the data $p$-process. But mean estimation is very easy; applying the MSE decomposition while fixing $(x_t, t)$:

$\mathbb{E}\left\|x_1 - \hat{x}_\theta(x_t, t)\right\|^2 = \mathbb{E}\left\|x_1 - \mathbb{E}[x_1 \mid x_t]\right\|^2 + \mathbb{E}\left\|\mathbb{E}[x_1 \mid x_t] - \hat{x}_\theta(x_t, t)\right\|^2$

The cross term vanishes; this is the law of total variance.

Remark.

This analysis shows that the irreducible noise at level $t$ is $\mathbb{E}\left[\operatorname{Var}\left(x_1 \mid x_t\right)\right]$.

In practice, the literature has settled upon using function estimators to approximate the velocity field $v_t$. Systematically, the three parameterizations are related by the identities

$\nabla \log p_t(x) = \frac{t\, \hat{x}_1 - x}{(1 - t)^2}, \qquad v_t(x) = \frac{\hat{x}_1 - x}{1 - t}, \qquad \hat{x}_1 = \mathbb{E}[x_1 \mid x_t = x]$

Crucially, both the score and velocity are linear in the denoiser $\hat{x}_1$, so the training protocol boils down to: sample

$x_1 \sim p_{\mathrm{data}}, \qquad x_0 \sim \mathcal{N}(0, I), \qquad t \sim \mathcal{U}[0, 1], \qquad x_t = t\, x_1 + (1 - t)\, x_0$

and regress one of the equivalent targets:

  1. Predict score. Using Tweedie, the conditional score target is $\nabla \log p_t(x_t \mid x_1) = -\frac{x_0}{1 - t}$ with objective $\mathbb{E}\left\|s_\theta(x_t, t) + \frac{x_0}{1 - t}\right\|^2$
  2. Predict velocity. Using straight-line transport, the conditional velocity target is $x_1 - x_0$ with objective $\mathbb{E}\left\|v_\theta(x_t, t) - (x_1 - x_0)\right\|^2$ This is the most popular parameterization. With the $t$-dependent scaling absorbed, this becomes a normalized-noise predictor.
  3. Predict $x_1$ / denoise. Train $\hat{x}_\theta$ against $x_1$ with the MLE weight: $\int_0^1 \frac{t}{(1 - t)^3}\, \mathbb{E}\left\|x_1 - \hat{x}_\theta(x_t, t)\right\|^2 dt$

These are the same objective written in different coordinates; only the deterministic $t$-dependent scaling changes. The irreducible component of the estimation problem at each noise level under MLE weighting is

$\frac{t}{(1 - t)^3}\, \mathbb{E}\left[\operatorname{Var}\left(x_1 \mid x_t\right)\right]$

Note that this is intrinsic to the data distribution.
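The coordinate changes between the three parameterizations can be checked in a few lines (my own sketch, on 1-D Gaussian data where the marginal score is known exactly):

```python
import numpy as np

# The three targets (score, denoiser, velocity) are deterministic rescalings
# of one another. With x_t = t*x1 + (1-t)*x0 and x0 ~ N(0, 1):
#   denoiser  x1_hat = (x + (1-t)² * score) / t        (Tweedie)
#   velocity  v      = (x1_hat - x) / (1 - t)          (straight-line field)
# Checked on Gaussian data x1 ~ N(0, s²), where all three are closed-form
# from the marginal N(0, t²s² + (1-t)²).
s, t = 2.0, 0.6
x = np.linspace(-3.0, 3.0, 13)
var_t = t**2 * s**2 + (1 - t) ** 2

score = -x / var_t                          # ∇log p_t for a Gaussian marginal
x1_hat = (x + (1 - t) ** 2 * score) / t     # Tweedie denoiser
v = (x1_hat - x) / (1 - t)                  # implied velocity

# Closed-form references computed directly from the Gaussian posterior
x1_hat_ref = t * s**2 * x / var_t
v_ref = x * (t * s**2 - (1 - t)) / var_t

print(np.max(np.abs(x1_hat - x1_hat_ref)), np.max(np.abs(v - v_ref)))
```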

Footnotes

  1. It doesn’t matter what the conditional vector field is elsewhere because the conditional path does not have mass elsewhere.
