@sheriyuo on Backlist

54.

On-policy distillation (OPD) has the same systems bottleneck as RL: rollouts dominate training time on reasoning worklo…

On-policy distillation (OPD) has the same systems bottleneck as RL: rollouts dominate training time on reasoning workloads. Going async fixes throughput but feeds the learner stale-policy data, and what staleness does to OPD specifically was unst

by @sheriyuo (Xiuyu Li) · backlist 2026-06-24 · rubric 94.5

72.

Was chatting with friends across different AI companies about the classic algorithm vs infra debate.

Was chatting with friends across different AI companies about the classic algorithm vs infra debate. One thing people often miss is survivorship bias. The algorithm researchers you see are usually the ones with strong papers, strong projec

by @sheriyuo (Xiuyu Li) · backlist 2026-06-22 · rubric 95.5

66.

This paper trains RLVR reasoning models on token-level distributional deviations rather than uniform token update…

This paper trains RLVR reasoning models on token-level distributional deviations rather than uniform token updates, to avoid the entropy collapse that uniform updates cause. RLVR improves reasoning but suffers an optimization instability:

by @sheriyuo (Xiuyu Li) · backlist 2026-06-19 · rubric 67.5

54.

I only recently realized that Zhipu is far from the only lab that has moved away from GRPO. Some teams working on… (x.com)

I only recently realized that Zhipu is far from the only lab that has moved away from GRPO. Some teams working on long-horizon tasks still rely heavily on PPO or even REINFORCE, and a few have never seriously adopted GRPO at all. It is int

by @sheriyuo (Xiuyu Li) · backlist 2026-06-17 · rubric 76.0

75.

Qwen Tongyi Lab proposes RLCSD, a simple but important critique of on-policy self-distillation.

Qwen Tongyi Lab proposes RLCSD, a simple but important critique of on-policy self-distillation. Their key observation is that the distillation signal often concentrates on stylistic tokens rather than task-critical reasoning tokens. As a r

by @sheriyuo (Xiuyu Li) · backlist 2026-06-11 · rubric 86.0

61.

SARDI introduces a training-free self-augmenting retrieval framework for dLLMs.

SARDI introduces a training-free self-augmenting retrieval framework for dLLMs. Instead of treating low-confidence tokens as noise to be discarded, it shows that these tokens often contain useful lookahead signals about the final answer.

by @sheriyuo (Xiuyu Li) · backlist 2026-06-07 · rubric 82.0

40.

GRPO has a known dead zone: when the sampled trajectories are all correct or all wrong, the group-relative advantage …

GRPO has a known dead zone: when the sampled trajectories are all correct or all wrong, the group-relative advantage collapses and learning stalls. On-Policy Self-Distillation tried to give dense token-level guidance but its token preferences

by @sheriyuo (Xiuyu Li) · backlist 2026-06-02 · rubric 79.0
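
The dead zone mentioned in the entry above can be illustrated with a small sketch. It assumes the commonly described GRPO formulation in which each trajectory's reward is standardized by its group's mean and standard deviation; the function name `group_relative_advantages` and the `eps` value are illustrative choices, not taken from the post.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own sampling group.

    Assumed formulation: A_i = (r_i - mean(r)) / (std(r) + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed outcomes: correct samples get positive advantage, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Uniform outcomes: every reward equals the group mean, so every
# advantage is exactly zero and the policy gradient carries no signal.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_correct)
print(all_wrong)
```

With binary rewards, any group whose samples all agree contributes nothing to the update under this formulation, which is the stall the excerpt describes.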

36.

A single Meta engineer burned roughly $500K/month in token consumption (about 300 billion tokens/month) on the …

A single Meta engineer burned roughly $500K/month in token consumption (about 300 billion tokens/month) on the company's internal "Claudeonomics" leaderboard that ranked employees by token usage. The leaderboard ran from March, employee

by @sheriyuo (Xiuyu Li) · backlist 2026-05-27 · rubric 94.0

70.

Someone debugged for half a day, only to find their RL was forever stuck at

Someone debugged for half a day, only to find their RL was forever stuck at: (EntropyTaskRunner pid=x) self.use_critic = need_critic(self.config). Turns out this pig very thoughtfully reused the same submit_task.sh, allocating a full 16

by @sheriyuo (Xiuyu Li) · backlist 2026-05-27 · rubric 88.0