On-policy distillation has the same systems bottleneck as RL: rollouts dominate training time on reasoning workloads. Going async fixes throughput but feeds the learner stale-policy data, and what staleness does to OPD specifically was unst