Daily Feed - 2026-02-13 (Fri)
4 research items (3 papers + 1 tool) + 1 video.
FlashSinkhorn: IO-Aware Entropic Optimal Transport
Domain: ML / Systems / Optimal Transport | Time cost: 20 min abstract+figures, 60-90 min real read
Intuition: Entropic optimal transport (EOT) via Sinkhorn iterations is “just” repeated row/column normalizations in the log domain. The key observation here is that (for squared Euclidean costs) stabilized Sinkhorn updates can be rewritten as row-wise LogSumExp reductions of biased dot-product scores - i.e., the same stable-softmax primitive that FlashAttention optimizes. That unlocks a FlashAttention-style tiling/fusion story: stream tiles through on-chip SRAM to kill HBM IO.
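The IO story rests on the fact that LogSumExp can be computed tile-by-tile while carrying only a running (max, sum-of-exp) pair. A minimal sketch of that online-softmax trick in pure Python (illustrative only; the paper's kernels apply it per on-chip SRAM tile):

```python
import math

def lse_tiled(xs, tile=4):
    """LogSumExp over xs computed one tile at a time, carrying only a
    running (max, sum-of-exp) pair -- the online-softmax trick behind
    FlashAttention-style tiling that the Sinkhorn rewrite exposes."""
    m, s = -math.inf, 0.0
    for i in range(0, len(xs), tile):
        block = xs[i:i + tile]
        # local statistics for this tile
        bm = max(block)
        bs = sum(math.exp(x - bm) for x in block)
        # merge tile stats into the running stats, rescaling to the new max
        new_m = max(m, bm)
        s = s * math.exp(m - new_m) + bs * math.exp(bm - new_m)
        m = new_m
    return m + math.log(s)
```

Because the merge is associative, the same reduction parallelizes across tiles, which is what makes the HBM-traffic argument go through.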
Concrete punch: EOT solves
\[ \min_{P \in U(a,b)} \langle C, P \rangle - \varepsilon H(P), \qquad U(a,b) = \{ P \ge 0 : P\mathbf{1} = a,\; P^{\top}\mathbf{1} = b \}. \]
In the dual/log view, the coupling takes the exponential form
\[ P_{ij} = \exp\!\big( (f_i + g_j - C_{ij}) / \varepsilon \big), \]
with iterative updates that are (schematically) row/column LogSumExp reductions:
\[ f_i \leftarrow \varepsilon \log a_i - \varepsilon\, \mathrm{LSE}_j\big( (g_j - C_{ij}) / \varepsilon \big), \qquad g_j \leftarrow \varepsilon \log b_j - \varepsilon\, \mathrm{LSE}_i\big( (f_i - C_{ij}) / \varepsilon \big), \]
where \(\mathrm{LSE}_j(x_j) = \log \sum_j e^{x_j}\) is the stable log-sum-exp reduction.
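The log-domain updates can be sketched as a dense NumPy loop (no tiling or fusion here; this is the reference algorithm, not the paper's kernel):

```python
import numpy as np

def sinkhorn_log(C, a, b, eps=0.1, n_iters=200):
    """Stabilized (log-domain) Sinkhorn: each half-iteration is a row- or
    column-wise LogSumExp reduction over biased scores (g_j - C_ij)/eps."""
    def lse(M, axis):
        m = M.max(axis=axis, keepdims=True)  # subtract the max for stability
        return (m + np.log(np.exp(M - m).sum(axis=axis, keepdims=True))).squeeze(axis)

    f = np.zeros_like(a)
    g = np.zeros_like(b)
    for _ in range(n_iters):
        f = eps * np.log(a) - eps * lse((g[None, :] - C) / eps, axis=1)
        g = eps * np.log(b) - eps * lse((f[:, None] - C) / eps, axis=0)
    # recover the coupling from the dual potentials
    P = np.exp((f[:, None] + g[None, :] - C) / eps)
    return P, f, g
```

Every line that touches the full cost matrix is a stable-softmax-shaped reduction, which is exactly the structure the FlashAttention-style tiling exploits.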
Significance: If OT is in your training loop (barycenters, OT losses, Wasserstein geometry), runtime is often IO-bound. This paper reports up to 32× forward and 161× end-to-end speedups on A100 versus online baselines, plus streaming kernels for applying the transport plan (first- and second-order optimization). Open-source implementation: https://github.com/ot-triton-lab/ot_triton.
Why it matches: Strong “variational objective → duality/log-sum-exp geometry → kernel fusion” arc. Mechanism-first (HBM traffic) rather than benchmark-only, and directly tied to your OT/transformer interests.
Explainable Patterns in Cryptocurrency Microstructure
Domain: Finance / Market Microstructure / ML | Time cost: 15 min abstract+setup, 45-75 min skim + check execution assumptions
Intuition: They claim stable cross-asset patterns in crypto LOB prediction: the same engineered order-book/trade features show similar predictive importance and similar SHAP dependence shapes across assets spanning ~an order of magnitude in market cap (BTC, LTC, ETC, ENJ, ROSE). The interesting part is the robustness angle: they tie the learned effects back to classic microstructure mechanisms (order-flow imbalance, spread, adverse selection) and probe behavior during a flash crash.
Concrete punch: The canonical microstructure “mechanism” behind many top features is an order-flow imbalance (OFI) → price-change relation:
\[ \Delta p_t \approx \beta\, \mathrm{OFI}_t + \eta_t, \]
where \(\mathrm{OFI}_t\) aggregates signed depth changes at the best bid and ask over interval \(t\) (the Cont-Kukanov-Stoikov linear price-impact relation).
Their falsifiable empirical claim is that (i) feature rankings and (ii) the sign/shape of partial effects (via SHAP) are stable across assets even with heterogeneous liquidity/volatility, and that maker vs taker performance diverges during a flash crash in a way consistent with adverse selection.
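For concreteness, best-quote OFI can be computed from consecutive L1 snapshots; this follows the standard Cont-Kukanov-Stoikov definition, and the array-based layout is an assumption, not the paper's pipeline:

```python
import numpy as np

def ofi(bid_px, bid_sz, ask_px, ask_sz):
    """Order-flow imbalance from consecutive best-quote snapshots:
    signed depth change at the bid minus signed depth change at the ask.
    Price improvement counts the full new size; price retreat counts the
    full old size as removed; same price counts the size delta."""
    e_bid = np.where(bid_px[1:] > bid_px[:-1], bid_sz[1:],
                     np.where(bid_px[1:] < bid_px[:-1], -bid_sz[:-1],
                              bid_sz[1:] - bid_sz[:-1]))
    e_ask = np.where(ask_px[1:] < ask_px[:-1], ask_sz[1:],
                     np.where(ask_px[1:] > ask_px[:-1], -ask_sz[:-1],
                              ask_sz[1:] - ask_sz[:-1]))
    return e_bid - e_ask
```

Positive values indicate net buying pressure at the touch, which is what the linear price-impact relation regresses price changes on.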
Significance: If “portable representation” holds under stricter costs/latency modeling, it suggests a reusable microstructure state library for short-horizon models: fewer bespoke features per asset/venue; more emphasis on regime detection and execution-aware objectives.
Why it matches: Directly aligned with your microstructure + real-time modeling stack, and it tries to connect ML explainability artifacts to actual market mechanisms (not just “we got AUC”).
Is Flow Matching Just Trajectory Replay for Sequential Data?
Domain: ML / Time-series generative modeling / Continuous-time dynamics | Time cost: 20 min abstract+main derivation, 60-120 min to really digest
Intuition: Flow Matching (FM) objectives hide a simple truth: in the population limit, the optimal learned vector field is a conditional expectation. For sequential data under the common Gaussian conditional path construction, they make this conditional expectation explicit: the implied sampler is an ODE whose dynamics is a nonparametric, memory-augmented continuous-time dynamical system - essentially a similarity-weighted mixture of “past transition velocities.”
Concrete punch: FM typically minimizes
\[ \mathcal{L}(\theta) = \mathbb{E}_{t,\, x_1,\, x_t \sim p_t(\cdot \mid x_1)} \big\| v_\theta(x_t, t) - u_t(x_t \mid x_1) \big\|^2. \]
With perfect function approximation, the optimizer satisfies
\[ v^{\star}(x, t) = \mathbb{E}\big[\, u_t(x_t \mid x_1) \,\big|\, x_t = x \,\big], \]
and sampling is the ODE
\[ \frac{dx_t}{dt} = v^{\star}(x_t, t), \qquad x_0 \sim p_0. \]
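Under the common linear/Gaussian path \(x_t = (1-t)x_0 + t x_1\) with \(x_0 \sim \mathcal{N}(0, I)\), the conditional expectation has a closed form over a finite dataset: a softmax-weighted mixture of per-datapoint conditional velocities. A training-free sketch of that generic construction (my illustration; the paper's exact path and weighting may differ):

```python
import numpy as np

def fm_field(x, t, data):
    """Population-optimal FM field for the path x_t = (1-t) x0 + t x1,
    x0 ~ N(0, I): E[u_t(x_t | x1) | x_t = x], a softmax-weighted mixture
    of conditional velocities (x1 - x) / (1 - t) over the dataset."""
    # log N(x; t * x1, (1-t)^2 I) per data point, up to a shared constant
    d2 = ((x[None, :] - t * data) ** 2).sum(axis=1)
    logw = -d2 / (2 * (1 - t) ** 2)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    vel = (data - x[None, :]) / (1 - t)   # conditional velocities u_t(x | x1)
    return (w[:, None] * vel).sum(axis=0)

def sample(data, n_steps=100, rng=np.random.default_rng(0)):
    """Euler-integrate dx/dt = v*(x, t) from x0 ~ N(0, I): the 'no-training'
    nonparametric sampler implied by the population-limit argument."""
    x = rng.standard_normal(data.shape[1])
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * fm_field(x, k * dt, data)
    return x
```

The weights sharpen as \(t \to 1\), so the sampler "replays" toward the nearest training trajectory endpoint, which is exactly the memorization behavior the framing makes explicit.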
Significance: (i) You get a strong “no-training” baseline sampler (closed-form nonparametric field), and (ii) a clean way to reason about memorization/generalization in time-series FM - relevant for regime shift and out-of-support dynamics.
Why it matches: First-principles derivation + dynamical-systems framing (ODE sampler) + a concrete statement about what FM learns, not just that it works.
LOBSIM - deterministic L3 limit order book replay + paper execution engine
Domain: Finance / Systems / Research tooling | Time cost: 15-30 min to scan README + run a demo
Intuition: For microstructure ML/RL, the easiest way to produce false confidence is a sloppy simulator. LOBSIM is a deterministic, per-order (L3) replay engine with a C++20 core and Python bindings. The selling point is inspectability: it emits structured “facts” (fills, event-apply records, diagnostics) via a sink interface, making backtests debuggable.
Concrete punch: The key L3 invariant is that L2 displayed size is the sum of remaining quantities of the active order objects at each price level. They expose a canonical event schema (ADD/DELETE/SUBTRACT/SET/MATCH, plus strategy-side aggressive trades), and a minimal usage loop is literally “apply events, query state”:
from lobsim.engine import PaperTradingSimulator
from lobsim.sink import InMemoryLogSink
from lobsim.types import Side
engine = PaperTradingSimulator()
sink = InMemoryLogSink()
engine.set_log_sink(sink)
# ... engine.update(ev) for each NormalizedLobEvent ...
top10_bids = engine.l2_top_n(Side.BUY, 10) # [(price_ticks, qty_lots), ...]
fills = sink.get_fills()
diagnostics = sink.get_diagnostics()
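The L3 → L2 invariant itself is easy to state in plain Python, independent of LOBSIM's actual API (names and the order-table layout here are illustrative):

```python
from collections import defaultdict

def l2_from_l3(orders):
    """L3 -> L2 invariant: displayed size at each price level equals the
    sum of remaining quantities of the active orders resting there.
    `orders` maps order_id -> (price_ticks, remaining_qty_lots)."""
    book = defaultdict(int)
    for price, remaining in orders.values():
        book[price] += remaining
    return dict(book)

# toy check: three resting orders, two sharing a price level
orders = {1: (1000, 5), 2: (1000, 3), 3: (999, 7)}
```

A replay engine that emits per-order facts can be audited by recomputing this aggregate after every event and diffing it against the engine's L2 view.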
Significance: This is “mechanics layer” infrastructure: deterministic replay + rich observability is what you need to make execution-aware backtests/RL environments less fragile (especially when injecting strategy events with modeled latency).
Why it matches: Production-grade microstructure substrate (typed event schema, determinism, auditability) that supports first-principles research and real-time systems thinking.
Related HN thread (light but relevant): https://news.ycombinator.com/item?id=46733267
Optimal Transport, part 1 - Marco Cuturi (MLSS 2020)
Domain: Math / ML / Optimal Transport | Time cost: 1h 34m
Intuition: High-signal OT lecture that starts from Monge/Kantorovich and builds the duality/geometric picture that makes OT reusable as a tool (not just a distance). Excellent companion for FlashSinkhorn-style work.
Concrete punch: The Kantorovich dual is the portable variational lens:
\[ W(a, b) = \max_{f,\, g \,:\, f_i + g_j \le C_{ij}} \langle f, a \rangle + \langle g, b \rangle. \]
Once you have this, entropic regularization and Sinkhorn become controlled approximations rather than magic.
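Strong duality is easy to verify numerically on a toy instance by solving both LPs (a sketch with scipy.optimize.linprog; the instance is mine, not from the lecture):

```python
import numpy as np
from scipy.optimize import linprog

a = np.array([0.7, 0.3])
b = np.array([0.4, 0.6])
C = np.array([[1.0, 3.0],
              [2.0, 1.0]])

# Primal: min <C, P> s.t. row sums = a, column sums = b, P >= 0
# (variables are P flattened row-major; default bounds give P >= 0).
A_eq = np.array([[1, 1, 0, 0],   # row 0 of P sums to a[0]
                 [0, 0, 1, 1],   # row 1 of P sums to a[1]
                 [1, 0, 1, 0],   # column 0 of P sums to b[0]
                 [0, 1, 0, 1]])  # column 1 of P sums to b[1]
primal = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]))

# Dual: max <f, a> + <g, b> s.t. f_i + g_j <= C_ij, with free potentials
# (variables are [f0, f1, g0, g1]; negate the objective to maximize).
A_ub = np.array([[1, 0, 1, 0],   # f0 + g0 <= C00
                 [1, 0, 0, 1],   # f0 + g1 <= C01
                 [0, 1, 1, 0],   # f1 + g0 <= C10
                 [0, 1, 0, 1]])  # f1 + g1 <= C11
dual = linprog(-np.concatenate([a, b]), A_ub=A_ub, b_ub=C.ravel(),
               bounds=[(None, None)] * 4)
# Strong duality: primal.fun and -dual.fun coincide at the optimum.
```

Entropic regularization replaces the hard constraint \(f_i + g_j \le C_{ij}\) with a smooth penalty, which is where the Sinkhorn normalizations plug in.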
Significance: Directly supports reading IO-aware Sinkhorn papers: you can map (i) primal constraints, (ii) dual potentials, and (iii) where numerical normalization kernels plug in.
Why it matches: Variational/duality structure, rigorous exposition, and under-2-hour, well-paced delivery.
Notes
- For today’s three arXiv papers, I didn’t find obvious author/conference talks in a quick pass; the Cuturi lecture is the mandatory YouTube pick.
Feedback
Content
- FlashSinkhorn (IO-aware OT): only moderately interesting. OT over-explored.
- Crypto microstructure patterns: 30% (below average). Not illuminating.
- MFG: not an interest area.
- LOBSIM: absolutely not interesting; simulator/tooling recs are unwanted.
- OT lecture (Cuturi): over-explored topic.
- Yesterday’s (2026-02-12) recs strongly preferred over today’s.
- Additional signals (not feed items): Titans → positive; MIRAS → negative (20%, “abstraction for abstraction’s sake”); MaxRL → very impressive; Stanford CS236 deep generative modeling → very illuminating, key insights conveyable through feed format.
- Interested in unifying perspectives on VAE, GAN, diffusion.
Extrapolated content
- OT is over-explored; deprioritize unless genuinely surprising.
- MFG and simulators/tools are out of scope for the feed.
- Unifying generative model perspectives and concrete algorithmic novelty are high-signal.
- Pedagogical lecture series (CS236-style) can be feed items when key insights are written up.