
Daily Feed - 2026-03-09

1. Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions

Varre, Rofin, Flammarion · Mar 6, 2026

Analyzes the gradient flow dynamics of the value-softmax parameterization, the core building block of self-attention. Shows that gradient flow inherently drives optimization toward low-entropy (peaked) outputs, for both logistic and square loss. The punchline: this provides a formal mechanism for attention sinks and massive activations. In effect, it proves that the geometry of softmax-times-matrix training has a built-in polarization attractor: the implicit bias is not toward any particular solution quality, but toward concentration of the attention distribution itself.

Why it matters: Direct theoretical insight into why transformers develop the features they do during training. Connects optimization dynamics to empirical phenomena (attention sinks) that are otherwise hand-waved.
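The polarization effect is easy to see in a toy instance of the value-softmax building block. A minimal sketch, assuming a made-up one-layer setup (softmax attention over a fixed value vector, square loss) chosen purely for illustration, not the paper's general setting:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

# Toy value-softmax model: output = softmax(z) @ v, square loss to target y.
# Illustrative assumption only -- the paper's analysis is far more general.
rng = np.random.default_rng(0)
v = rng.normal(size=5)          # fixed "value" vector
y = v.max()                     # target reachable only by a peaked softmax
z = np.zeros(5)                 # logits: start with uniform attention
lr = 0.5

entropies = []
for _ in range(2000):
    p = softmax(z)
    out = p @ v
    # gradient of 0.5*(out - y)^2 w.r.t. z, using the softmax Jacobian
    grad = (out - y) * (p * (v - out))
    z -= lr * grad
    entropies.append(entropy(softmax(z)))

print(f"entropy: start={entropies[0]:.3f}, end={entropies[-1]:.3f}")
```

Gradient descent steadily concentrates the attention distribution on the argmax coordinate, so the entropy trace falls from near log 5 toward zero, which is the polarization attractor in miniature.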


2. Beyond Softmax and Entropy: f-SoftArgmax Policy Gradients with Coupled Regularization

Labbi, Tiapkin, Mangold, Moulines · Jan 18, 2026

Replaces softmax policy parameterization with a generalized f-softargmax family, coupled with an f-divergence regularizer. The coupling creates a Polyak-Łojasiewicz landscape, yielding the first explicit non-asymptotic last-iterate convergence for stochastic policy gradient without any preconditioning. Key result: with Tsallis divergences, f-PG achieves polynomial sample complexity — in contrast to the exponential blow-up that softmax + entropy regularization suffers from.

Why it matters: The softmax parameterization is everywhere in RL, and its exponential convergence pathology is well-known. This paper shows the fix isn’t just “use natural gradient” — it’s to change the parameterization itself. The f-divergence lens connects policy optimization to information geometry in a very clean way.
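The f-softargmax family isn't spelled out above, but one well-known member of the Tsallis branch is sparsemax (the q = 2 case), which is the Euclidean projection of the logits onto the simplex. A minimal sketch of that one member, assuming the standard sparsemax construction rather than anything specific to this paper:

```python
import numpy as np

def sparsemax(z):
    """Tsallis q=2 member of the generalized softargmax family:
    Euclidean projection of logits onto the probability simplex.
    Unlike softmax, it can put exactly zero mass on bad actions."""
    z = np.asarray(z, dtype=float)
    u = np.sort(z)[::-1]                 # logits sorted descending
    css = np.cumsum(u) - 1.0
    k = np.arange(1, z.size + 1)
    support = k * u > css                # coordinates kept in the support
    tau = css[support][-1] / k[support][-1]
    return np.maximum(z - tau, 0.0)

print(sparsemax([1.0, 0.5, 0.1]))       # -> [0.75 0.25 0.  ]
```

The exact zeros are the point: a softmax over the same logits keeps all three actions alive forever, which is one source of its exponentially slow convergence.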


3. TradeFM: A Generative Foundation Model for Trade-flow and Market Microstructure

Kawawa-Beaudan, Sood, Papasotiriou, Borrajo, Veloso · Feb 27, 2026

524M-parameter generative Transformer trained on billions of trade events across 9K+ equities. The core innovation: scale-invariant features and a universal tokenization scheme that maps heterogeneous order flow into discrete sequences — eliminating per-asset calibration. Generated rollouts reproduce heavy tails, volatility clustering, and absence of return autocorrelation. Achieves lower distributional error than Compound Hawkes baselines and generalizes zero-shot to APAC markets.

Why it matters: The “foundation model for microstructure” idea has been floating around, and this is the most serious attempt so far. The scale-invariant tokenization is the key insight — it’s essentially asking “what’s the right embedding for order flow?” and answering with something that transfers across markets. Opens paths to synthetic data generation and learning-based trading agents.
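The paper's actual tokenization scheme isn't detailed above. As a hedged illustration of what "scale-invariant" buys, here is a hypothetical toy tokenizer (my construction, not TradeFM's) that buckets trade sizes relative to each asset's own median, so the token stream is unchanged when every size is rescaled:

```python
import numpy as np

def tokenize_trades(sizes, prices, n_bins=8):
    # Hypothetical scale-invariant tokenizer (illustration only, not
    # TradeFM's actual scheme): size buckets are defined relative to the
    # asset's own median size, so a large-cap and a micro-cap share one vocab.
    sizes = np.asarray(sizes, dtype=float)
    prices = np.asarray(prices, dtype=float)
    rel = np.log1p(sizes / np.median(sizes))       # scale-free size feature
    edges = np.linspace(0.0, rel.max() + 1e-9, n_bins)
    size_tok = np.digitize(rel, edges)             # discrete size bucket
    up_tick = (np.diff(prices, prepend=prices[0]) >= 0).astype(int)
    return up_tick * n_bins + size_tok             # joint discrete token id

t_large = tokenize_trades([100, 5000, 200, 90], [10.00, 10.01, 10.00, 10.02])
t_micro = tokenize_trades([0.1, 5.0, 0.2, 0.09], [10.00, 10.01, 10.00, 10.02])
print((t_large == t_micro).all())   # same tokens despite 1000x size difference
```

Because the size feature is a ratio against the asset's own scale, the two assets produce identical token sequences, which is exactly the property that lets one vocabulary transfer across 9K+ equities and across markets.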


4. Riemannian Geometry of Optimal Rebalancing in Dynamic Weight AMMs

Willetts · Mar 5, 2026

Shows that in dynamic-weight AMMs (TFMMs), the per-step arbitrage loss from rebalancing is exactly the KL divergence between weight vectors — so the Fisher-Rao metric is the natural Riemannian metric on the weight simplex. The loss-minimizing trajectory is SLERP (spherical linear interpolation) in Hellinger coordinates, i.e., a geodesic on the positive orthant of the unit sphere. The prior AM-GM heuristic turns out to lie exactly on this geodesic.

Why it matters: A beautiful collision of information geometry and DeFi mechanism design. The fact that the “right” rebalancing path is a Fisher-Rao geodesic means all the machinery of information geometry (exponential families, natural parameters, divergence duality) becomes available for AMM design. Concise paper, clean math.
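The SLERP-in-Hellinger-coordinates result is concrete enough to sketch directly: map each weight vector to its elementwise square root (a point on the unit sphere), spherically interpolate, and square back. A minimal sketch of that construction, independent of any TFMM-specific code:

```python
import numpy as np

def fisher_rao_geodesic(w0, w1, t):
    """Point at fraction t along the Fisher-Rao geodesic between simplex
    weight vectors w0 and w1: SLERP in square-root (Hellinger) coordinates,
    i.e. a great-circle arc on the positive orthant of the unit sphere."""
    s0 = np.sqrt(np.asarray(w0, dtype=float))
    s1 = np.sqrt(np.asarray(w1, dtype=float))
    theta = np.arccos(np.clip(s0 @ s1, -1.0, 1.0))  # angle between sqrt-points
    if theta < 1e-12:                               # endpoints coincide
        return s0 ** 2
    s = (np.sin((1 - t) * theta) * s0 + np.sin(t * theta) * s1) / np.sin(theta)
    return s ** 2                                   # square back to the simplex

w0 = np.array([0.7, 0.2, 0.1])
w1 = np.array([0.2, 0.5, 0.3])
mid = fisher_rao_geodesic(w0, w1, 0.5)
print(mid, mid.sum())   # interpolant stays on the simplex at every t
```

Since SLERP between unit vectors stays on the unit sphere, the squared output sums to 1 automatically — the rebalancing path never leaves the weight simplex, with no renormalization step needed.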
