Daily Feed - 2026-03-23
1. On the Superlinear Relationship between SGD Noise Covariance and Loss Landscape Curvature
Domain: ML (optimization theory) | Time cost: ~30 min read
Everyone “knows” SGD noise is proportional to the Hessian — that’s why SGD prefers flat minima. This paper shows the standard argument relies on Fisher ≈ Hessian, which fails in deep networks, and discovers the true relationship is a superlinear power law. Using Activity–Weight Duality, the paper derives this scaling directly from a decomposition of the SGD noise covariance.
Why it matters: This reframes SGD’s implicit regularization: the bias toward flat minima is stronger than the linear theory predicts.
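The quantities in play are easy to measure on a toy model. A minimal sketch (my own illustration, not the paper’s derivation or its Activity–Weight Duality decomposition) of estimating the empirical SGD noise covariance alongside the loss Hessian, using 1-D per-sample quadratic losses:

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch = 10_000, 32

# Toy per-sample losses L_i(w) = 0.5 * h_i * (w - t_i)^2
h = rng.uniform(0.5, 2.0, size=n)   # per-sample curvatures
t = rng.normal(0.0, 1.0, size=n)    # per-sample targets

def noise_cov_and_hessian(w):
    """Empirical SGD gradient-noise covariance (at a given batch size)
    and the full-batch loss Hessian, at parameter value w."""
    per_sample_grads = h * (w - t)            # dL_i/dw
    hessian = h.mean()                        # d^2 L / dw^2 (w-independent here)
    noise_cov = per_sample_grads.var() / batch
    return noise_cov, hessian

cov, hess = noise_cov_and_hessian(w=0.3)
print(f"noise covariance ~ {cov:.4f}, Hessian ~ {hess:.4f}")
```

In this toy setting one can sweep over loss basins of different curvature and fit the exponent relating the two quantities; the paper’s claim is that in real deep networks the fitted exponent exceeds 1.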
2. Bayesian Optimality of In-Context Learning with Selective State Spaces
Domain: ML (learning theory / architecture theory) | Time cost: ~35 min read
In-context learning is usually explained as “Transformers do implicit gradient descent on the context.” This paper argues that’s the wrong frame entirely — ICL is better understood as Bayesian posterior inference, and selective SSMs (Mamba-style) are provably better at it than Transformers. For tasks governed by Linear Gaussian State Space Models, a meta-trained selective SSM asymptotically implements the Bayes-optimal predictor (posterior predictive mean). The separation theorem: there exist tasks with temporally correlated noise where the Bayes-optimal predictor strictly beats any ERM estimator in asymptotic risk. Since Transformers perform implicit ERM, selective SSMs achieve lower risk — not by being bigger, but by being statistically more efficient at structured-noise tasks.
Why it matters: Reframes the Transformer-vs-SSM debate from “which scales better” to “which performs the right kind of inference.” For sequence tasks with latent temporal structure (finance, control, time series), this suggests SSMs aren’t just computationally cheaper — they’re doing fundamentally better statistical reasoning.
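For a Linear Gaussian State Space Model, the Bayes-optimal predictor the paper refers to is computable in closed form via Kalman filtering. A minimal scalar sketch (my own illustration with made-up parameters, not the paper’s construction) comparing the Kalman one-step-ahead predictive mean against a naive last-observation baseline on temporally correlated data:

```python
import numpy as np

rng = np.random.default_rng(1)
a, q, r, T = 0.9, 0.1, 0.5, 200   # transition, process var, obs var, length

# Simulate a scalar LGSSM: x_{t+1} = a x_t + w_t,  y_t = x_t + v_t
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal(0, np.sqrt(q))
y = x + rng.normal(0, np.sqrt(r), size=T)

def kalman_one_step(y, a, q, r):
    """Bayes-optimal one-step-ahead predictions E[y_{t+1} | y_{1:t}]."""
    m, p = 0.0, 1.0                      # prior mean/variance of x_0
    preds = []
    for obs in y:
        k = p / (p + r)                  # Kalman gain
        m = m + k * (obs - m)            # filtered mean
        p = (1 - k) * p                  # filtered variance
        m, p = a * m, a * a * p + q      # predict next state
        preds.append(m)                  # E[y_{t+1}|y_{1:t}] = E[x_{t+1}|y_{1:t}]
    return np.array(preds)

preds = kalman_one_step(y, a, q, r)
mse_kalman = np.mean((preds[:-1] - y[1:]) ** 2)
mse_naive = np.mean((y[:-1] - y[1:]) ** 2)   # "predict last observation"
print(f"Kalman MSE {mse_kalman:.3f} vs naive MSE {mse_naive:.3f}")
```

The gap between the two MSEs is the kind of statistical separation the paper formalizes: a predictor that tracks the latent posterior beats one that ignores the noise structure.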
3. SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel
Domain: ML (attention theory / kernel methods) | Time cost: ~35 min read | Venue: ICML 2026
Linear attention approximations (Performers, Cosformers) trade accuracy for speed. SLAY takes a different route: constrain queries and keys to the unit sphere so attention depends only on angular alignment, then use classical analysis to build an exact random-feature decomposition. The Yat-kernel is inspired by inverse-square interactions in physics. On the sphere, Bernstein’s theorem guarantees the kernel decomposes as a nonnegative mixture of polynomial-exponential product kernels. This yields a strictly positive random-feature map, so attention scores are always nonnegative and well-defined.
Why it matters: This is (reportedly) the tightest linear-time approximation to softmax attention to date. The construction is elegant: physics-inspired kernel → classical analysis (Bernstein) → practical random features. It suggests the geometry of the key-query space matters more than the specific nonlinearity.
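The general mechanism, positive random features on the unit sphere enabling linear-time attention, can be sketched without the Yat-kernel itself. Below is a minimal illustration using Performer-style positive exponential features as a stand-in for SLAY’s feature map (the actual Yat-kernel construction is not in the summary, so this is an assumption-laden analogue, not the paper’s method):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, m = 64, 16, 256   # sequence length, head dim, random-feature count

def sphere(x):
    """Project rows onto the unit sphere, as in SLAY's q/k constraint."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

Q = sphere(rng.normal(size=(N, d)))
K = sphere(rng.normal(size=(N, d)))
V = rng.normal(size=(N, d))

# Strictly positive random-feature map (Performer-style stand-in):
# E[phi(q) . phi(k)] = exp(q . k), and phi(x) > 0 everywhere,
# so every approximate attention score is nonnegative.
W = rng.normal(size=(m, d))
def phi(x):
    return np.exp(x @ W.T - 0.5 * np.sum(x * x, axis=-1, keepdims=True)) / np.sqrt(m)

Qf, Kf = phi(Q), phi(K)                                 # (N, m)
out = (Qf @ (Kf.T @ V)) / (Qf @ Kf.sum(0))[:, None]     # O(N m d), never O(N^2)

# Reference: exact softmax attention for comparison
scores = np.exp(Q @ K.T)
ref = (scores @ V) / scores.sum(1, keepdims=True)
err = np.abs(out - ref).mean()
print(f"mean abs error vs softmax attention: {err:.3f}")
```

The linear-time trick is the associativity rewrite: computing `Kf.T @ V` first costs O(Nmd) instead of materializing the N×N score matrix.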
4. 🎥 Philippe Rigollet: The Mean-Field Dynamics of Transformers
Domain: ML (mean-field theory / optimization) | Time cost: ~51 min watch | Venue: CIRM
Rigollet (MIT) analyzes transformer training through the mean-field lens: as width → ∞, the discrete parameter updates converge to a PDE governing the evolution of a measure over parameters. This connects transformer optimization to the rich theory of interacting particle systems and Wasserstein gradient flows. The talk develops the mean-field limit for attention layers, where the empirical measure of attention heads converges to a distributional dynamics. The key challenge vs. standard mean-field neural network theory (e.g., for two-layer nets) is that attention introduces data-dependent interactions between particles — the kernel itself depends on the measure, creating a nonlinear PDE rather than a McKean-Vlasov SDE.
Why it matters: Provides a principled PDE framework for understanding why transformer training works — moving beyond the NTK/lazy regime into the feature-learning (mean-field) regime that actually matters in practice. Rigollet is one of the sharpest mathematical statisticians working on ML foundations.
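The interacting-particle picture behind the talk can be simulated directly: treat tokens as particles on the unit sphere evolving under attention-weighted drift. A toy explicit-Euler discretization (my own sketch of this style of dynamics, with arbitrary parameters, not the exact PDE from the talk):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, beta, dt, steps = 32, 3, 4.0, 0.1, 300

# Tokens as particles on the unit sphere
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
align0 = (X @ X.T).mean()   # initial mean pairwise alignment

for _ in range(steps):
    A = np.exp(beta * (X @ X.T))
    A /= A.sum(1, keepdims=True)       # row-softmax self-attention weights
    drift = A @ X                      # attention-weighted mean of particles
    # Project the drift onto each particle's tangent space, then renormalize,
    # so the dynamics stays on the sphere.
    drift -= np.sum(drift * X, axis=1, keepdims=True) * X
    X += dt * drift
    X /= np.linalg.norm(X, axis=1, keepdims=True)

align1 = (X @ X.T).mean()
print(f"mean pairwise alignment: {align0:.3f} -> {align1:.3f}")
```

Note the measure-dependence the summary highlights: the weight matrix `A` is recomputed from the current particle configuration at every step, which is exactly what makes the mean-field limit a nonlinear PDE rather than a standard McKean-Vlasov SDE.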