Snakamoto
Research Feed
- 2026-03-23
A superlinear SGD noise-curvature power law reframes implicit regularization, selective SSMs are shown Bayes-optimal over Transformers for in-context learning, SLAY derives physics-inspired spherical linear attention via Bernstein's theorem, and Rigollet's mean-field PDE framework models transformer training dynamics (a sketch of the mean-field training PDE follows this list).
- 2026-03-16
Q-learning for controlled diffusions with near-optimality rates, a microstructural derivation of rough Bergomi from order flow, exact LQG equilibria with endogenous signals and Volterra information wedges, transformers trapped by simplicity bias on Boolean functions, and Tao & Davis launch a mathematics distillation challenge.
- 2026-03-09
Softmax gradient flow polarization explains attention sinks, f-divergence policy gradients fix exponential blowup, a 524M-param foundation model for trade microstructure, and Riemannian geometry reveals the optimal AMM rebalancing path.
- 2026-02-28
Research highlights: A Model-Free Universal AI, Conformalized Neural Networks for Federated Uncertainty Quantification under Dual Heterogeneity, and Mean Estimation from Coarse Data: Characterizations and Efficient Algorithms (+2 more).
- 2026-02-27
A set of recent papers and talks on model agreement, risk-aware POMDP evaluation, and viscous HJB control, with direct implications for practical learning systems.
- 2026-02-26
Three ML/finance papers on optimal transport, diffusion dynamics, and risk-adjusted prediction, plus two OT-focused talks.
- 2026-02-25
This cycle highlights discrete diffusion control advances, distributionally robust online learning, and gap-dependent reinforcement learning guarantees, plus two transport-focused videos.
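On the mean-field PDE item in the 2026-03-23 entry: a minimal sketch of the classical mean-field training limit that such frameworks build on (the two-layer setting of Mei-Montanari-Nguyen and Chizat-Bach; the transformer-specific equations in the new work are not reproduced here). Noisy gradient descent on the parameters of a width-\(n\) network tracks, as \(n \to \infty\), a Wasserstein gradient flow of the risk over the parameter distribution \(\mu_t\):

$$
\partial_t \mu_t = \nabla_\theta\!\cdot\!\Big(\mu_t\,\nabla_\theta \tfrac{\delta R}{\delta \mu}(\mu_t,\theta)\Big) + \beta^{-1}\Delta_\theta \mu_t,
\qquad
R(\mu) = \mathbb{E}_{(x,y)}\,\ell\big(\mathbb{E}_{\theta\sim\mu}[\varphi(x;\theta)],\,y\big),
$$

with \(\beta^{-1}\) the SGD noise temperature; this PDE viewpoint is what makes analytic tools available for training dynamics.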
Blogs
- OT for generative modeling 3 — Diffusion as Maximum Likelihood Estimation
We derive the interpretation of physical diffusion as a Wasserstein gradient flow (identity sketched below the list), a noise-spectrum decomposition of the KL divergence, diffusion models as maximum likelihood estimation, and a first-principles analysis of flow-matching scalability. Honorable mentions: the Fokker-Planck equation, Anderson's theorem, the de Bruijn identity, and Tweedie's formula.
- OT for generative modeling 0 — the static perspective
Why we care about optimal transport (OT), the static (Kantorovich) definition of the Wasserstein distance, the linear-programming (Kantorovich-Rubinstein) dual formulation, and WGAN; the key displays are sketched below the list.
- OT for generative modeling 1 — the Wasserstein geometry
We construct the Wasserstein manifold from first principles: probability distributions as points, sample-space vector fields as tangent vectors, and the density-weighted inner product that endows optimal transport with a rich Riemannian geometry. Physics intuition for the Benamou-Brenier theorem, which unifies the static and Riemannian definitions (sketched below the list).
- OT for generative modeling 2 — Wasserstein gradients and drifting models
We look at Kaiming Deng et al.'s Drifting Models, interpret the antisymmetric drifting field as a Wasserstein gradient flow on the reverse KL between kernel-smoothed distributions (a generic sketch follows the list), and develop the connection to maximum likelihood estimation.
- Rollout likelihood generalization of maximum likelihood reinforcement learning
A highly principled, foundational RL paper with easily actionable changes; we derive the continuous generalization (the underlying policy-gradient identity is sketched below).
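Sketches

For "Diffusion as Maximum Likelihood Estimation": the Jordan-Kinderlehrer-Otto identity the post builds on, stated as standard background rather than the post's full derivation. The Fokker-Planck equation of a Langevin diffusion is the Wasserstein-2 gradient flow of the KL divergence to its stationary law:

$$
\mathrm{d}X_t = -\nabla V(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t
\;\Longrightarrow\;
\partial_t \rho_t = \nabla\!\cdot\!(\rho_t \nabla V) + \Delta \rho_t
= \nabla\!\cdot\!\Big(\rho_t\,\nabla \tfrac{\delta F}{\delta \rho}(\rho_t)\Big),
\quad
F(\rho) = \mathrm{KL}\big(\rho \,\|\, e^{-V}/Z\big),
$$

i.e. \(\partial_t \rho_t = -\mathrm{grad}_{W_2} F(\rho_t)\): the density descends KL along the Wasserstein geometry.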
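For "the static perspective": the Kantorovich primal, its Kantorovich-Rubinstein dual, and the WGAN objective the duality licenses (all standard; the critic parameterization \(f_w\) is illustrative):

$$
W_1(\mu,\nu) = \inf_{\gamma \in \Pi(\mu,\nu)} \int \|x-y\|\,\mathrm{d}\gamma(x,y)
= \sup_{\mathrm{Lip}(f)\le 1} \mathbb{E}_{x\sim\mu}[f(x)] - \mathbb{E}_{y\sim\nu}[f(y)],
$$

$$
\min_G \;\max_{\mathrm{Lip}(f_w)\le 1} \;\mathbb{E}_{x\sim p_{\mathrm{data}}}[f_w(x)] - \mathbb{E}_{z\sim p_z}[f_w(G(z))].
$$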
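For "the Wasserstein geometry": the Benamou-Brenier theorem in one display, with the density-weighted inner product the post constructs. Minimizing kinetic energy over curves of densities recovers the static \(W_2\), which is the unification the post gives physics intuition for:

$$
W_2^2(\mu,\nu) = \inf_{(\rho,v)} \int_0^1\!\!\int \rho_t(x)\,\|v_t(x)\|^2\,\mathrm{d}x\,\mathrm{d}t
\quad\text{s.t.}\quad
\partial_t \rho_t + \nabla\!\cdot\!(\rho_t v_t) = 0,\;\; \rho_0=\mu,\;\rho_1=\nu,
\qquad
\langle v,w\rangle_\rho = \int \rho\,\langle v,w\rangle\,\mathrm{d}x.
$$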
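For "Wasserstein gradients and drifting models": a generic sketch of the velocity field of a reverse-KL Wasserstein gradient flow; the kernel-smoothed functional is one reading of the post's setup, and the paper's exact drifting field may differ. With \(\rho_\sigma = k_\sigma * \rho\) and \(p_\sigma = k_\sigma * p\) for a symmetric kernel \(k_\sigma\):

$$
F(\rho) = \mathrm{KL}(\rho_\sigma \,\|\, p_\sigma),
\qquad
\frac{\delta F}{\delta \rho} = k_\sigma * \log\frac{\rho_\sigma}{p_\sigma} + \mathrm{const},
\qquad
v(x) = -\nabla\Big(k_\sigma * \log\frac{\rho_\sigma}{p_\sigma}\Big)(x),
$$

and particles transported by \(\dot{x} = v(x)\) realize \(\dot{\rho} = -\mathrm{grad}_{W_2} F(\rho)\); as \(\sigma \to 0\) this recovers the unsmoothed reverse-KL flow \(v = -\nabla \log(\rho/p)\).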
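For the rollout-likelihood post: the standard identity behind maximum-likelihood readings of policy gradients (background, not the post's generalization). The dynamics terms drop under \(\nabla_\theta\), so a REINFORCE step is stochastic ascent on a reward-weighted rollout log-likelihood:

$$
p_\theta(\tau) = p(s_0)\prod_t \pi_\theta(a_t\mid s_t)\,p(s_{t+1}\mid s_t,a_t),
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim p_\theta}\big[R(\tau)\,\nabla_\theta \log p_\theta(\tau)\big]
= \nabla_{\theta'}\,\mathbb{E}_{\tau\sim p_\theta}\big[R(\tau)\log p_{\theta'}(\tau)\big]\Big|_{\theta'=\theta}.
$$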