
Daily Feed - 2026-03-09

1. Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions

Varre, Rofin, Flammarion · Mar 6, 2026

Analyzes the gradient flow dynamics of the value-softmax parameterization, the core building block of self-attention. Shows that gradient flow inherently drives optimization toward low-entropy (peaked) outputs, for both logistic and square loss. The punchline: this provides a formal mechanism for attention sinks and massive activations. In effect, it proves that the geometry of softmax-times-matrix training has a built-in polarization attractor: the implicit bias is not toward any particular solution quality, but toward concentration of the attention distribution itself.

Why it matters: Direct theoretical insight into why transformers develop the features they do during training. Connects optimization dynamics to empirical phenomena (attention sinks) that are otherwise hand-waved.
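The polarization effect is easy to see in a toy instance of the value-softmax building block. A minimal sketch, assuming a made-up one-layer setup (softmax attention over a fixed value vector, square loss) chosen purely for illustration, not the paper's general setting:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

# Toy value-softmax model: output = softmax(z) @ v, square loss to target y.
# Illustrative assumption only -- the paper's analysis is far more general.
rng = np.random.default_rng(0)
v = rng.normal(size=5)          # fixed "value" vector
y = v.max()                     # target reachable only by a peaked softmax
z = np.zeros(5)                 # logits: start with uniform attention
lr = 0.5

entropies = []
for _ in range(2000):
    p = softmax(z)
    out = p @ v
    # gradient of 0.5*(out - y)^2 w.r.t. z, using the softmax Jacobian
    grad = (out - y) * (p * (v - out))
    z -= lr * grad
    entropies.append(entropy(softmax(z)))

print(f"entropy: start={entropies[0]:.3f}, end={entropies[-1]:.3f}")
```

Gradient descent steadily concentrates the attention distribution on the argmax coordinate, so the entropy trace falls from near log 5 toward zero, which is the polarization attractor in miniature.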


2. Beyond Softmax and Entropy: f-SoftArgmax Policy Gradients with Coupled Regularization

Labbi, Tiapkin, Mangold, Moulines · Jan 18, 2026

Replaces softmax policy parameterization with a generalized f-softargmax family, coupled with an f-divergence regularizer. The coupling creates a Polyak-Łojasiewicz landscape, yielding the first explicit non-asymptotic last-iterate convergence for stochastic policy gradient without any preconditioning. Key result: with Tsallis divergences, f-PG achieves polynomial sample complexity — in contrast to the exponential blow-up that softmax + entropy regularization suffers from.

Why it matters: The softmax parameterization is everywhere in RL, and its exponential convergence pathology is well-known. This paper shows the fix isn’t just “use natural gradient” — it’s to change the parameterization itself. The f-divergence lens connects policy optimization to information geometry in a very clean way.
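The f-softargmax family isn't spelled out above, but one well-known member of the Tsallis branch is sparsemax (the q = 2 case), which is the Euclidean projection of the logits onto the simplex. A minimal sketch of that one member, assuming the standard sparsemax construction rather than anything specific to this paper:

```python
import numpy as np

def sparsemax(z):
    """Tsallis q=2 member of the generalized softargmax family:
    Euclidean projection of logits onto the probability simplex.
    Unlike softmax, it can put exactly zero mass on bad actions."""
    z = np.asarray(z, dtype=float)
    u = np.sort(z)[::-1]                 # logits sorted descending
    css = np.cumsum(u) - 1.0
    k = np.arange(1, z.size + 1)
    support = k * u > css                # coordinates kept in the support
    tau = css[support][-1] / k[support][-1]
    return np.maximum(z - tau, 0.0)

print(sparsemax([1.0, 0.5, 0.1]))       # -> [0.75 0.25 0.  ]
```

The exact zeros are the point: a softmax over the same logits keeps all three actions alive forever, which is one source of its exponentially slow convergence.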


3. TradeFM: A Generative Foundation Model for Trade-flow and Market Microstructure

Kawawa-Beaudan, Sood, Papasotiriou, Borrajo, Veloso · Feb 27, 2026

524M-parameter generative Transformer trained on billions of trade events across 9K+ equities. The core innovation: scale-invariant features and a universal tokenization scheme that maps heterogeneous order flow into discrete sequences — eliminating per-asset calibration. Generated rollouts reproduce heavy tails, volatility clustering, and absence of return autocorrelation. Achieves lower distributional error than Compound Hawkes baselines and generalizes zero-shot to APAC markets.

Why it matters: The “foundation model for microstructure” idea has been floating around, and this is the most serious attempt so far. The scale-invariant tokenization is the key insight — it’s essentially asking “what’s the right embedding for order flow?” and answering with something that transfers across markets. Opens paths to synthetic data generation and learning-based trading agents.
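The paper's actual tokenization scheme isn't detailed above. As a hedged illustration of what "scale-invariant" buys, here is a hypothetical toy tokenizer (my construction, not TradeFM's) that buckets trade sizes relative to each asset's own median, so the token stream is unchanged when every size is rescaled:

```python
import numpy as np

def tokenize_trades(sizes, prices, n_bins=8):
    # Hypothetical scale-invariant tokenizer (illustration only, not
    # TradeFM's actual scheme): size buckets are defined relative to the
    # asset's own median size, so a large-cap and a micro-cap share one vocab.
    sizes = np.asarray(sizes, dtype=float)
    prices = np.asarray(prices, dtype=float)
    rel = np.log1p(sizes / np.median(sizes))       # scale-free size feature
    edges = np.linspace(0.0, rel.max() + 1e-9, n_bins)
    size_tok = np.digitize(rel, edges)             # discrete size bucket
    up_tick = (np.diff(prices, prepend=prices[0]) >= 0).astype(int)
    return up_tick * n_bins + size_tok             # joint discrete token id

t_large = tokenize_trades([100, 5000, 200, 90], [10.00, 10.01, 10.00, 10.02])
t_micro = tokenize_trades([0.1, 5.0, 0.2, 0.09], [10.00, 10.01, 10.00, 10.02])
print((t_large == t_micro).all())   # same tokens despite 1000x size difference
```

Because the size feature is a ratio against the asset's own scale, the two assets produce identical token sequences, which is exactly the property that lets one vocabulary transfer across 9K+ equities and across markets.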


4. Riemannian Geometry of Optimal Rebalancing in Dynamic Weight AMMs

Willetts · Mar 5, 2026

Shows that in dynamic-weight AMMs (TFMMs), the per-step arbitrage loss from rebalancing is exactly the KL divergence between weight vectors — so the Fisher-Rao metric is the natural Riemannian metric on the weight simplex. The loss-minimizing trajectory is SLERP (spherical linear interpolation) in Hellinger coordinates, i.e., a geodesic on the positive orthant of the unit sphere. The prior AM-GM heuristic turns out to lie exactly on this geodesic.

Why it matters: A beautiful collision of information geometry and DeFi mechanism design. The fact that the “right” rebalancing path is a Fisher-Rao geodesic means all the machinery of information geometry (exponential families, natural parameters, divergence duality) becomes available for AMM design. Concise paper, clean math.
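The SLERP-in-Hellinger-coordinates result is concrete enough to sketch directly: map each weight vector to its elementwise square root (a point on the unit sphere), spherically interpolate, and square back. A minimal sketch of that construction, independent of any TFMM-specific code:

```python
import numpy as np

def fisher_rao_geodesic(w0, w1, t):
    """Point at fraction t along the Fisher-Rao geodesic between simplex
    weight vectors w0 and w1: SLERP in square-root (Hellinger) coordinates,
    i.e. a great-circle arc on the positive orthant of the unit sphere."""
    s0 = np.sqrt(np.asarray(w0, dtype=float))
    s1 = np.sqrt(np.asarray(w1, dtype=float))
    theta = np.arccos(np.clip(s0 @ s1, -1.0, 1.0))  # angle between sqrt-points
    if theta < 1e-12:                               # endpoints coincide
        return s0 ** 2
    s = (np.sin((1 - t) * theta) * s0 + np.sin(t * theta) * s1) / np.sin(theta)
    return s ** 2                                   # square back to the simplex

w0 = np.array([0.7, 0.2, 0.1])
w1 = np.array([0.2, 0.5, 0.3])
mid = fisher_rao_geodesic(w0, w1, 0.5)
print(mid, mid.sum())   # interpolant stays on the simplex at every t
```

Since SLERP between unit vectors stays on the unit sphere, the squared output sums to 1 automatically — the rebalancing path never leaves the weight simplex, with no renormalization step needed.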
