Daily Feed - 2026-02-20
3 paper picks + 2 video picks (same bundle for Telegram/email).
Author-talk check: I searched YouTube with exact paper titles for today’s paper picks and did not find clear author/conference talks yet, so I included two high-signal topic-adjacent lectures.
SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer
Domain: RL Theory / Algorithms | Time cost: ~20min abstract+method skim, ~70min full read
Intuition: Offline-to-online transfer often fails because gradient updates must cross low-reward valleys between the offline optimum and better online solutions. SMAC reshapes the offline objective so actor and critic gradients are locally aligned before online fine-tuning begins.
Concrete punch: The key regularization enforces a first-order compatibility between the policy score and the action-gradient of the Q-function, $\nabla_a \log \pi_\theta(a \mid s) \approx \tfrac{1}{\alpha}\nabla_a Q_\phi(s,a)$, implemented via a penalty like $\mathbb{E}_{s,a}\big[\|\nabla_a \log \pi_\theta(a \mid s) - \tfrac{1}{\alpha}\nabla_a Q_\phi(s,a)\|^2\big]$.
The paper reports smooth transfer for Soft Actor-Critic (SAC) and Twin Delayed Deep Deterministic Policy Gradient (TD3) on 6/6 D4RL tasks, with 34–58% regret reduction in 4/6 settings.
Significance: This gives a mechanistic criterion (gradient-field compatibility) for whether offline pretraining is likely to survive online adaptation.
Why it matches: Strong mechanism-first RL theory, explicit geometric/optimization structure, and practical transfer consequences beyond benchmark-only framing.
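A minimal numerical sketch of that compatibility penalty, using a toy diagonal-Gaussian policy and quadratic critic whose gradients are available in closed form (the function names, critic shape, and temperature alpha are illustrative assumptions, not the paper's code):

```python
import numpy as np

def gaussian_policy_score(a, mu, sigma):
    # Policy score ∇_a log N(a; mu, sigma^2 I) for a diagonal Gaussian policy
    return -(a - mu) / sigma**2

def quadratic_q_grad(a, a_star, k):
    # ∇_a Q for a toy quadratic critic Q(s, a) = -k * ||a - a_star||^2
    return -2.0 * k * (a - a_star)

def score_compatibility_penalty(actions, mu, sigma, a_star, k, alpha):
    # Penalty ≈ E[ || ∇_a log π(a|s) - (1/α) ∇_a Q(s, a) ||^2 ] over sampled actions
    diff = gaussian_policy_score(actions, mu, sigma) - quadratic_q_grad(actions, a_star, k) / alpha
    return float(np.mean(np.sum(diff**2, axis=-1)))
```

In this toy setting the penalty vanishes exactly when the policy mean sits at the critic's optimum and the variance matches the critic's curvature (sigma^2 = alpha / (2k)); any mismatch in either produces a strictly positive penalty the actor can descend before online fine-tuning.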
One-step Language Modeling via Continuous Denoising
Domain: ML / Generative Modeling | Time cost: ~20min abstract+figures, ~75min full read
Intuition: The paper challenges the assumption that discrete diffusion is necessary for text. It builds a flow-based language model that denoises continuous one-hot embeddings, then distills the flow into a few-step (even one-step) generator.
Concrete punch: Training is framed as clean-token prediction from noisy states with cross-entropy, $\mathcal{L} = \mathbb{E}_{t,\,x_0,\,x_t}\big[-\log p_\theta(x_0 \mid x_t, t)\big]$, where $x_t$ is the continuous one-hot embedding of $x_0$ interpolated toward noise along the flow path.
Significance: If this scaling behavior holds, it changes the speed/quality frontier for non-autoregressive text generation and narrows the practical gap to autoregressive systems.
Why it matches: Directly on your VAE↔diffusion↔flow unification thread, with concrete algorithmic novelty and explicit challenge to a prevailing assumption.
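A toy sketch of that training recipe, denoising a linearly-noised one-hot embedding back to clean tokens with cross-entropy (the linear flow path, names, and shapes are my assumptions for illustration, not the paper's architecture):

```python
import numpy as np

def noisy_onehot(tokens, vocab, t, rng):
    # Interpolate a one-hot embedding toward Gaussian noise:
    # x_t = (1 - t) * onehot(x_0) + t * eps   (one illustrative flow path)
    x0 = np.eye(vocab)[tokens]
    eps = rng.normal(size=x0.shape)
    return (1.0 - t) * x0 + t * eps

def clean_token_xent(logits, tokens):
    # Cross-entropy of predicted clean tokens given the noisy state x_t
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return float(-logp[np.arange(len(tokens)), tokens].mean())
```

At t = 0 the state is the exact one-hot embedding and at t near 1 it is almost pure noise; a denoiser trained across t is what the paper then distills into a few-step (even one-step) generator.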
Autodeleveraging as Online Learning
Domain: Blockchain / Quant Finance / Market Microstructure | Time cost: ~18min abstract+setup, ~65min full read
Intuition: Auto-deleveraging (ADL) on perpetual venues is usually treated as exchange ops policy; this paper formalizes it as a sequential online learning/control problem over positive-profit haircuts and solvency recovery.
Concrete punch: At round
Performance is measured by regret
In the Hyperliquid stress-event case study, the production queue lands at roughly 50% of an upper regret bound, while their optimized algorithm is at ~2.6% of that bound (a large reduction in overshoot).
Significance: This turns ADL from ad hoc “risk ops” into auditable mechanism design with measurable worst-case guarantees.
Why it matches: High signal for your market microstructure + control-theory interests, with concrete objective design and policy-level implications.
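The regret framing can be sketched with a generic exponential-weights learner over a small menu of candidate ADL policies (this is textbook online learning, not the paper's actual algorithm; the policy menu and losses are made up):

```python
import numpy as np

def hedge_weights(losses_so_far, eta):
    # Exponential-weights (Hedge) distribution over K candidate ADL policies,
    # given a t x K matrix of per-round losses observed so far
    w = np.exp(-eta * losses_so_far.sum(axis=0))
    return w / w.sum()

def static_regret(losses, actions):
    # R_T = sum_t loss_t(a_t) - min_a sum_t loss_t(a): cost of the sequence
    # actually played vs. the best single fixed policy in hindsight
    incurred = losses[np.arange(len(actions)), actions].sum()
    return float(incurred - losses.sum(axis=0).min())
```

Comparisons like "~50% of the bound vs ~2.6%" read as statements about where realized static regret lands relative to a worst-case guarantee for such a learner.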
Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 5: Off-Policy Actor Critic
Domain: RL (Video) | Time cost: 1h 9m
Intuition: A clean bridge from policy-gradient basics to modern off-policy actor-critic machinery (experience replay, bootstrapping, stability tricks), which is exactly the substrate SMAC-type methods modify.
Concrete punch: The canonical Soft Actor-Critic objective appears in entropy-regularized form, $J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[r(s_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot \mid s_t))\big]$, with Bellman backup $\mathcal{T}Q(s, a) = r(s, a) + \gamma\,\mathbb{E}_{s'}\big[\mathbb{E}_{a' \sim \pi}[Q(s', a') - \alpha \log \pi(a' \mid s')]\big]$.
Significance: Useful for debugging the exact place where offline-to-online transfer can break (critic landscape vs policy update geometry).
Why it matches: High-production lecture, first-principles derivations, and immediate transfer value to today’s SMAC pick.
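The entropy-regularized backup from the lecture, sketched as a sampled target computation (a minimal illustration with invented inputs, not the lecture's code):

```python
import numpy as np

def soft_bellman_target(r, gamma, q_next, logp_next, alpha):
    # Soft backup: T Q(s, a) = r + gamma * E_{a'~pi}[ Q(s', a') - alpha * log pi(a'|s') ]
    # q_next and logp_next hold sampled Q(s', a') and log pi(a'|s') for a' ~ pi
    return float(r + gamma * np.mean(q_next - alpha * logp_next))
```

The -alpha * log pi term is the entropy bonus; setting alpha = 0 recovers the plain expected-Q backup, which is one concrete knob to probe when diagnosing where offline-to-online transfer breaks.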
Advancing Diffusion Models for Text Generation
Domain: ML / Generative Modeling (Video) | Time cost: 1h 1m
Intuition: Research talk focused on pushing text diffusion quality while preserving the parallel-generation upside. It complements today’s continuous-denoising paper by emphasizing practical bottlenecks and algorithmic improvements.
Concrete punch: The core denoising factorization uses a time-indexed objective over partially corrupted sequences, $\mathcal{L} = \mathbb{E}_{t,\,x_0,\,x_t}\big[w(t) \sum_i -\log p_\theta(x_0^{(i)} \mid x_t)\big]$ with per-position cross-entropy on corrupted tokens, then studies scheduler/parameterization choices that improve the quality-speed Pareto frontier in few-step decoding.
Significance: Gives a concrete map of where diffusion text models still lose to autoregressive language models and which interventions actually move the boundary.
Why it matches: Directly aligned with your deep generative modeling theory focus; mathematically grounded and implementation-relevant.
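One common concrete instance of a time-indexed text-corruption objective is absorbing-mask corruption with cross-entropy on the corrupted positions; a sketch under that assumption (the talk may use a different corruption process, so treat the names and masking rule as illustrative):

```python
import numpy as np

def mask_corrupt(tokens, t, mask_id, rng):
    # Corrupt each position independently with probability t (absorbing mask)
    keep = rng.random(tokens.shape) >= t
    return np.where(keep, tokens, mask_id), ~keep

def masked_denoise_loss(logits, tokens, masked):
    # Cross-entropy on corrupted positions only, averaged over the masked set
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    tok_logp = logp[np.arange(len(tokens)), tokens]
    return float(-(tok_logp * masked).sum() / max(masked.sum(), 1))
```

Sweeping t from 0 to 1 traces the schedule the talk's quality-speed tradeoffs live on: small t is nearly supervised infilling, t near 1 is generation from scratch.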
Source-discovery note
- ArXiv: scanned recent candidates (eligible window 6–12 months, prioritizing the newest) in offline-to-online RL, diffusion/flow language modeling, and market-microstructure control.
- YouTube: searched exact paper-title talks first; no clear author/conference videos found yet for these new papers, so selected topic-adjacent high-signal lectures.
- Hacker News / Lobsters: scanned recent results; signal was mostly tool/showcase noise, so none cleared today’s mechanism-first + concrete-punch bar.