Snakamoto
Research Feed
- 2026-03-23
A superlinear SGD noise-curvature power law reframes implicit regularization, selective SSMs are shown Bayes-optimal over Transformers for in-context learning, SLAY derives physics-inspired spherical linear attention via Bernstein's theorem, and Rigollet's mean-field PDE framework models transformer training dynamics (a sketch of the mean-field training PDE follows this list).
- 2026-03-16
Q-learning for controlled diffusions with near-optimality rates, a microstructural derivation of rough Bergomi from order flow, exact LQG equilibria with endogenous signals and Volterra information wedges, transformers trapped by simplicity bias on Boolean functions, and Tao & Davis launch a mathematics distillation challenge.
- 2026-03-09
Softmax gradient flow polarization explains attention sinks, f-divergence policy gradients fix exponential blowup, a 524M-param foundation model for trade microstructure, and Riemannian geometry reveals the optimal AMM rebalancing path.
- 2026-02-28
Research highlights: A Model-Free Universal AI, Conformalized Neural Networks for Federated Uncertainty Quantification under Dual Heterogeneity, and Mean Estimation from Coarse Data: Characterizations and Efficient Algorithms (+2 more).
- 2026-02-27
A set of recent papers and talks on model agreement, risk-aware POMDP evaluation, and viscous HJB control, with direct implications for practical learning systems.
- 2026-02-26
Three ML/finance papers on optimal transport, diffusion dynamics, and risk-adjusted prediction, plus two OT-focused talks.
- 2026-02-25
This cycle highlights discrete diffusion control advances, distributionally robust online learning, and gap-dependent reinforcement learning guarantees, plus two transport-focused videos.
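On the mean-field PDE item in the 2026-03-23 entry: a minimal sketch of the classical mean-field training limit that such frameworks build on (the two-layer setting of Mei-Montanari-Nguyen and Chizat-Bach; the transformer-specific equations in the new work are not reproduced here). Noisy gradient descent on the parameters of a width-\(n\) network tracks, as \(n \to \infty\), a Wasserstein gradient flow of the risk over the parameter distribution \(\mu_t\):

$$
\partial_t \mu_t = \nabla_\theta\!\cdot\!\Big(\mu_t\,\nabla_\theta \tfrac{\delta R}{\delta \mu}(\mu_t,\theta)\Big) + \beta^{-1}\Delta_\theta \mu_t,
\qquad
R(\mu) = \mathbb{E}_{(x,y)}\,\ell\big(\mathbb{E}_{\theta\sim\mu}[\varphi(x;\theta)],\,y\big),
$$

with \(\beta^{-1}\) the SGD noise temperature; this PDE viewpoint is what makes analytic tools available for training dynamics.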
Blogs
- OT for generative modeling 3 — Diffusion as Maximum Likelihood Estimation
We derive the interpretation of physical diffusion as a Wasserstein gradient flow (identity sketched below the list), a noise-spectrum decomposition of the KL divergence, diffusion models as maximum likelihood estimation, and a first-principles analysis of flow-matching scalability. Honorable mentions: the Fokker-Planck equation, Anderson's theorem, the de Bruijn identity, and Tweedie's formula.
- OT for generative modeling 0 — the static perspective
Why we care about optimal transport (OT), the static (Kantorovich) definition of the Wasserstein distance, the linear-programming (Kantorovich-Rubinstein) dual formulation, and WGAN; the key displays are sketched below the list.
- OT for generative modeling 1 — the Wasserstein geometry
We construct the Wasserstein manifold from first principles: probability distributions as points, sample-space vector fields as tangent vectors, and the density-weighted inner product that endows optimal transport with a rich Riemannian geometry. Physics intuition for the Benamou-Brenier theorem, which unifies the static and Riemannian definitions (sketched below the list).
- OT for generative modeling 2 — Wasserstein gradients and drifting models
We look at Kaiming Deng et al.'s Drifting Models, interpret the antisymmetric drifting field as a Wasserstein gradient flow on the reverse KL between kernel-smoothed distributions (a generic sketch follows the list), and develop the connection to maximum likelihood estimation.
- Rollout likelihood generalization of maximum likelihood reinforcement learning
A highly principled, foundational RL paper with easily actionable changes; we derive the continuous generalization (the underlying policy-gradient identity is sketched below).
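Sketches

For "Diffusion as Maximum Likelihood Estimation": the Jordan-Kinderlehrer-Otto identity the post builds on, stated as standard background rather than the post's full derivation. The Fokker-Planck equation of a Langevin diffusion is the Wasserstein-2 gradient flow of the KL divergence to its stationary law:

$$
\mathrm{d}X_t = -\nabla V(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t
\;\Longrightarrow\;
\partial_t \rho_t = \nabla\!\cdot\!(\rho_t \nabla V) + \Delta \rho_t
= \nabla\!\cdot\!\Big(\rho_t\,\nabla \tfrac{\delta F}{\delta \rho}(\rho_t)\Big),
\quad
F(\rho) = \mathrm{KL}\big(\rho \,\|\, e^{-V}/Z\big),
$$

i.e. \(\partial_t \rho_t = -\mathrm{grad}_{W_2} F(\rho_t)\): the density descends KL along the Wasserstein geometry.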
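For "the static perspective": the Kantorovich primal, its Kantorovich-Rubinstein dual, and the WGAN objective the duality licenses (all standard; the critic parameterization \(f_w\) is illustrative):

$$
W_1(\mu,\nu) = \inf_{\gamma \in \Pi(\mu,\nu)} \int \|x-y\|\,\mathrm{d}\gamma(x,y)
= \sup_{\mathrm{Lip}(f)\le 1} \mathbb{E}_{x\sim\mu}[f(x)] - \mathbb{E}_{y\sim\nu}[f(y)],
$$

$$
\min_G \;\max_{\mathrm{Lip}(f_w)\le 1} \;\mathbb{E}_{x\sim p_{\mathrm{data}}}[f_w(x)] - \mathbb{E}_{z\sim p_z}[f_w(G(z))].
$$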
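For "the Wasserstein geometry": the Benamou-Brenier theorem in one display, with the density-weighted inner product the post constructs. Minimizing kinetic energy over curves of densities recovers the static \(W_2\), which is the unification the post gives physics intuition for:

$$
W_2^2(\mu,\nu) = \inf_{(\rho,v)} \int_0^1\!\!\int \rho_t(x)\,\|v_t(x)\|^2\,\mathrm{d}x\,\mathrm{d}t
\quad\text{s.t.}\quad
\partial_t \rho_t + \nabla\!\cdot\!(\rho_t v_t) = 0,\;\; \rho_0=\mu,\;\rho_1=\nu,
\qquad
\langle v,w\rangle_\rho = \int \rho\,\langle v,w\rangle\,\mathrm{d}x.
$$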
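For "Wasserstein gradients and drifting models": a generic sketch of the velocity field of a reverse-KL Wasserstein gradient flow; the kernel-smoothed functional is one reading of the post's setup, and the paper's exact drifting field may differ. With \(\rho_\sigma = k_\sigma * \rho\) and \(p_\sigma = k_\sigma * p\) for a symmetric kernel \(k_\sigma\):

$$
F(\rho) = \mathrm{KL}(\rho_\sigma \,\|\, p_\sigma),
\qquad
\frac{\delta F}{\delta \rho} = k_\sigma * \log\frac{\rho_\sigma}{p_\sigma} + \mathrm{const},
\qquad
v(x) = -\nabla\Big(k_\sigma * \log\frac{\rho_\sigma}{p_\sigma}\Big)(x),
$$

and particles transported by \(\dot{x} = v(x)\) realize \(\dot{\rho} = -\mathrm{grad}_{W_2} F(\rho)\); as \(\sigma \to 0\) this recovers the unsmoothed reverse-KL flow \(v = -\nabla \log(\rho/p)\).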
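For the rollout-likelihood post: the standard identity behind maximum-likelihood readings of policy gradients (background, not the post's generalization). The dynamics terms drop under \(\nabla_\theta\), so a REINFORCE step is stochastic ascent on a reward-weighted rollout log-likelihood:

$$
p_\theta(\tau) = p(s_0)\prod_t \pi_\theta(a_t\mid s_t)\,p(s_{t+1}\mid s_t,a_t),
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim p_\theta}\big[R(\tau)\,\nabla_\theta \log p_\theta(\tau)\big]
= \nabla_{\theta'}\,\mathbb{E}_{\tau\sim p_\theta}\big[R(\tau)\log p_{\theta'}(\tau)\big]\Big|_{\theta'=\theta}.
$$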