
Research Feed - 2026-02-13 (Fri)

4 research items (3 papers + 1 tool) + 1 video.


FlashSinkhorn: IO-Aware Entropic Optimal Transport

Domain: ML / Systems / Optimal Transport | Time cost: 20 min abstract+figures, 60-90 min real read

Intuition: Entropic optimal transport (EOT) via Sinkhorn iterations is “just” repeated row/column normalizations in the log domain. The key observation here is that (for squared Euclidean costs) stabilized Sinkhorn updates can be rewritten as row-wise LogSumExp reductions of biased dot-product scores - i.e., the same stable-softmax primitive that FlashAttention optimizes. That unlocks a FlashAttention-style tiling/fusion story: stream tiles through on-chip SRAM to kill HBM IO.
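The IO story rests on the fact that LogSumExp can be computed in a single streaming pass over tiles, carrying only a running max and a rescaled partial sum - the same stable-softmax trick FlashAttention uses. A minimal Python sketch of that primitive (the paper's kernels are fused GPU code, not this loop):

```python
import math

def online_logsumexp(chunks):
    """Streaming LogSumExp over tiles: one pass, O(tile) memory.

    Maintains a running max m and a sum s of exp(z - m); when a tile
    raises the max, the old sum is rescaled instead of recomputed.
    """
    m, s = -math.inf, 0.0
    for chunk in chunks:
        new_m = max(m, max(chunk))
        s = s * math.exp(m - new_m) + sum(math.exp(z - new_m) for z in chunk)
        m = new_m
    return m + math.log(s)
```

Because each tile is consumed once and then discarded, the reduction never materializes the full score row in slow memory - which is exactly what makes the Sinkhorn updates below amenable to FlashAttention-style fusion.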

Concrete punch: EOT solves

\[
\min_{P \in U(a,b)} \; \langle C, P \rangle - \varepsilon H(P),
\qquad U(a,b) = \{ P \ge 0 : P\mathbf{1} = a,\; P^\top \mathbf{1} = b \}.
\]

In the dual/log view, the coupling takes the exponential form

\[
P_{ij} = \exp\!\big( (f_i + g_j - C_{ij}) / \varepsilon \big),
\]

with iterative updates that are (schematically) row/column LogSumExp reductions:

\[
f_i \leftarrow \varepsilon \log a_i - \varepsilon\, \mathrm{LSE}_j\big( (g_j - C_{ij})/\varepsilon \big),
\qquad
g_j \leftarrow \varepsilon \log b_j - \varepsilon\, \mathrm{LSE}_i\big( (f_i - C_{ij})/\varepsilon \big),
\]

where \(\mathrm{LSE}_j(z_j) = \log \sum_j e^{z_j}\) and, for squared Euclidean costs, \(C_{ij} = \|x_i - y_j\|^2\). This is exactly the normalization kernel family attention accelerators target.
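The stabilized row/column updates can be written as a plain loop of LogSumExp reductions. A minimal NumPy sketch of log-domain Sinkhorn (illustrative only - the paper's contribution is fusing and tiling these reductions in Triton, not this dense loop):

```python
import numpy as np

def logsumexp(Z, axis):
    # stable log-sum-exp reduction along one axis
    m = Z.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(Z - m).sum(axis=axis, keepdims=True))).squeeze(axis=axis)

def sinkhorn_log(C, a, b, eps=0.1, iters=200):
    """Log-domain Sinkhorn: each half-iteration is one row/column LSE."""
    f = np.zeros_like(a)
    g = np.zeros_like(b)
    for _ in range(iters):
        f = eps * np.log(a) - eps * logsumexp((g[None, :] - C) / eps, axis=1)
        g = eps * np.log(b) - eps * logsumexp((f[:, None] - C) / eps, axis=0)
    P = np.exp((f[:, None] + g[None, :] - C) / eps)  # recovered coupling
    return P, f, g
```

Note that the inner expression `(g[None, :] - C) / eps` is a biased score matrix reduced row-wise by LSE - structurally the same kernel attention accelerators optimize.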

Significance: If OT is in your training loop (barycenters, OT losses, Wasserstein geometry), runtime is often IO-bound. This paper reports up to 32× forward and 161× end-to-end speedups on an A100 versus online baselines, plus streaming kernels for applying the transport plan (useful in first- and second-order optimization). An open-source implementation is available: https://github.com/ot-triton-lab/ot_triton.

Why it matches: Strong “variational objective → duality/log-sum-exp geometry → kernel fusion” arc. Mechanism-first (HBM traffic) rather than benchmark-only, and directly tied to your OT/transformer interests.


Explainable Patterns in Cryptocurrency Microstructure

Domain: Finance / Market Microstructure / ML | Time cost: 15 min abstract+setup, 45-75 min skim + check execution assumptions

Intuition: They claim stable cross-asset patterns in crypto LOB prediction: the same engineered order-book/trade features show similar predictive importance and similar SHAP dependence shapes across assets spanning roughly an order of magnitude in market cap (BTC, LTC, ETC, ENJ, ROSE). The interesting part is the robustness angle: they tie the learned effects back to classic microstructure mechanisms (order-flow imbalance, spread, adverse selection) and probe behavior during a flash crash.

Concrete punch: The canonical microstructure “mechanism” behind many top features is an order-flow imbalance (OFI) → price-change relation:

\[
\Delta p_t = \beta\, \mathrm{OFI}_t + \epsilon_t,
\]

where \(\mathrm{OFI}_t\) nets signed depth changes at the best bid against those at the best ask over the interval.

Their falsifiable empirical claim is that (i) feature rankings and (ii) the sign/shape of partial effects (via SHAP) are stable across assets even with heterogeneous liquidity/volatility, and that maker vs taker performance diverges during a flash crash in a way consistent with adverse selection.
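The per-event OFI contribution can be computed directly from successive best-quote snapshots. A sketch following the Cont–Kukanov–Stoikov event definition (an assumption about the paper's exact feature; `ofi_events` and the snapshot tuple layout are hypothetical):

```python
def ofi_events(quotes):
    """Per-event OFI contributions from successive best bid/ask snapshots.

    quotes: list of (bid_px, bid_qty, ask_px, ask_qty) tuples.
    A bid price improvement or depth add counts positive; an ask-side
    improvement or depth add counts negative.
    """
    out = []
    for (pb0, qb0, pa0, qa0), (pb1, qb1, pa1, qa1) in zip(quotes, quotes[1:]):
        e = ((pb1 >= pb0) * qb1 - (pb1 <= pb0) * qb0
             - (pa1 <= pa0) * qa1 + (pa1 >= pa0) * qa0)
        out.append(e)
    return out
```

Summing these contributions over a window gives the \(\mathrm{OFI}_t\) regressor; the paper's claim is that the sign and shape of its effect stay stable across assets.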

Significance: If “portable representation” holds under stricter costs/latency modeling, it suggests a reusable microstructure state library for short-horizon models: fewer bespoke features per asset/venue; more emphasis on regime detection and execution-aware objectives.

Why it matches: Directly aligned with your microstructure + real-time modeling stack, and it tries to connect ML explainability artifacts to actual market mechanisms (not just “we got AUC”).


Is Flow Matching Just Trajectory Replay for Sequential Data?

Domain: ML / Time-series generative modeling / Continuous-time dynamics | Time cost: 20 min abstract+main derivation, 60-120 min to really digest

Intuition: Flow Matching (FM) objectives hide a simple truth: in the population limit, the optimal learned vector field is a conditional expectation. For sequential data under the common Gaussian conditional path construction, they make this conditional expectation explicit: the implied sampler is an ODE whose dynamics is a nonparametric, memory-augmented continuous-time dynamical system - essentially a similarity-weighted mixture of “past transition velocities.”

Concrete punch: FM typically minimizes

\[
\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_1,\, x_t \sim p_t(\cdot \mid x_1)}
\big\| v_\theta(x_t, t) - u_t(x_t \mid x_1) \big\|^2 .
\]

With perfect function approximation, the optimizer satisfies

\[
v^{*}(x, t) = \mathbb{E}\big[\, u_t(x_t \mid x_1) \,\big|\, x_t = x \,\big],
\]

and sampling is the ODE \(\dot{x} = v^{*}(x, t)\). Their contribution is deriving a closed-form expression for \(v^{*}\) on sequential data (Gaussian conditional paths), where \(v^{*}\) becomes a similarity-weighted mixture over dataset transitions - making “replay vs structure learning” a concrete, analyzable question.
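On a toy dataset the closed-form field is easy to write down for the standard conditional-OT Gaussian path \(x_t = (1-t)x_0 + t x_1,\; x_0 \sim N(0, I)\) - an assumption; the paper's exact path construction may differ. The optimal field is then a softmax-weighted mixture of per-sample velocities (`optimal_fm_field` and `sample` are illustrative names):

```python
import numpy as np

def optimal_fm_field(x, t, data, eps=1e-6):
    """Population-optimal FM field for the conditional-OT Gaussian path:
    a similarity-weighted mixture of per-sample velocities (x1 - x)/(1 - t)."""
    s = 1.0 - t + eps                                   # conditional path std
    logw = -((x[None, :] - t * data) ** 2).sum(-1) / (2 * s * s)
    w = np.exp(logw - logw.max())                       # stable softmax weights
    w /= w.sum()
    return (w[:, None] * (data - x[None, :])).sum(0) / s

def sample(data, steps=200, rng=None):
    """Euler integration of dx/dt = v*(x, t) from x(0) ~ N(0, I)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = rng.standard_normal(data.shape[1])
    for k in range(steps):
        x = x + optimal_fm_field(x, k / steps, data) / steps
    return x
```

This is the “no-training baseline” in miniature: with no learned network, the sampler already transports noise onto the dataset, which is what makes the replay-vs-generalization question sharp.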

Significance: (i) You get a strong “no-training” baseline sampler (closed-form nonparametric field), and (ii) a clean way to reason about memorization/generalization in time-series FM - relevant for regime shift and out-of-support dynamics.

Why it matches: First-principles derivation + dynamical-systems framing (ODE sampler) + a concrete statement about what FM learns, not just that it works.


LOBSIM - deterministic L3 limit order book replay + paper execution engine

Domain: Finance / Systems / Research tooling | Time cost: 15-30 min to scan README + run a demo

Intuition: For microstructure ML/RL, the easiest way to produce false confidence is a sloppy simulator. LOBSIM is a deterministic, per-order (L3) replay engine with a C++20 core and Python bindings. The selling point is inspectability: it emits structured “facts” (fills, event-apply records, diagnostics) via a sink interface, making backtests debuggable.

Concrete punch: The key L3 invariant is that L2 displayed size is the sum of remaining quantities of the active order objects at each price level. They expose a canonical event schema (ADD/DELETE/SUBTRACT/SET/MATCH, plus strategy-side aggressive trades), and a minimal usage loop is literally “apply events, query state”:

from lobsim.engine import PaperTradingSimulator
from lobsim.sink import InMemoryLogSink
from lobsim.types import Side

engine = PaperTradingSimulator()
sink = InMemoryLogSink()
engine.set_log_sink(sink)

# ... engine.update(ev) for each NormalizedLobEvent ...

top10_bids = engine.l2_top_n(Side.BUY, 10)  # [(price_ticks, qty_lots), ...]
fills = sink.get_fills()
diagnostics = sink.get_diagnostics()
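The L2-from-L3 invariant itself is easy to state in code. A hypothetical toy book (not LOBSIM's actual types) in which every mutation touches both the per-order map and the per-level aggregate:

```python
from collections import defaultdict

class L3Book:
    """Toy L3 book upholding the invariant: L2 displayed size at a price
    equals the sum of remaining quantities of its active orders."""

    def __init__(self):
        self.orders = {}                # order_id -> (price, remaining_qty)
        self.levels = defaultdict(int)  # price -> displayed L2 size

    def add(self, oid, price, qty):
        self.orders[oid] = (price, qty)
        self.levels[price] += qty

    def subtract(self, oid, qty):       # partial cancel / partial fill
        price, rem = self.orders[oid]
        self.orders[oid] = (price, rem - qty)
        self.levels[price] -= qty

    def delete(self, oid):              # full cancel
        price, rem = self.orders.pop(oid)
        self.levels[price] -= rem
```

Because both views are updated together, the L2 aggregate can be audited against the L3 state after any event - the kind of check a deterministic replay engine makes cheap.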

Significance: This is “mechanics layer” infrastructure: deterministic replay + rich observability is what you need to make execution-aware backtests/RL environments less fragile (especially when injecting strategy events with modeled latency).

Why it matches: Production-grade microstructure substrate (typed event schema, determinism, auditability) that supports first-principles research and real-time systems thinking.

Related HN thread (light but relevant): https://news.ycombinator.com/item?id=46733267


Optimal Transport, part 1 - Marco Cuturi (MLSS 2020)

Domain: Math / ML / Optimal Transport | Time cost: 1h 34m

Intuition: High-signal OT lecture that starts from Monge/Kantorovich and builds the duality/geometric picture that makes OT reusable as a tool (not just a distance). Excellent companion for FlashSinkhorn-style work.

Concrete punch: The Kantorovich dual is the portable variational lens:

\[
\max_{f,\, g} \; \langle f, a \rangle + \langle g, b \rangle
\quad \text{s.t.} \quad f_i + g_j \le C_{ij} \;\; \forall i, j.
\]

Once you have this, entropic regularization and Sinkhorn become controlled approximations rather than magic.

Significance: Directly supports reading IO-aware Sinkhorn papers: you can map (i) primal constraints, (ii) dual potentials, and (iii) where numerical normalization kernels plug in.

Why it matches: Variational/duality structure, rigorous exposition, and under-2-hour, well-paced delivery.


Notes

  • For today’s three arXiv papers, I didn’t find obvious author/conference talks in a quick pass; the Cuturi lecture is the mandatory YouTube pick.

Feedback

Content

  • FlashSinkhorn (IO-aware OT): only moderately interesting. OT over-explored.
  • Crypto microstructure patterns: 30% (below average). Not illuminating.
  • MFG: not an interest area.
  • LOBSIM: absolutely not interesting — not interested in simulator/tooling recs.
  • OT lecture (Cuturi): over-explored topic.
  • Yesterday’s (2026-02-12) recs strongly preferred over today’s.
  • Additional signals (not feed items): Titans → positive; MIRAS → negative (20%, “abstraction for abstraction’s sake”); MaxRL → very impressive; Stanford CS236 deep generative modeling → very illuminating, key insights conveyable through feed format.
  • Interested in unifying perspectives on VAE, GAN, diffusion.

Extrapolated content

  • OT is over-explored; deprioritize unless genuinely surprising.
  • MFG and simulators/tools are out of scope for the feed.
  • Unifying generative model perspectives and concrete algorithmic novelty are high-signal.
  • Pedagogical lecture series (CS236-style) can be feed items when key insights are written up.
