@louieworth on Backlist


OPD is on-policy, but its supervision is still post-hoc and one-step.

The student generates a rollout. The teacher then supervises that fixed trajectory token by token. Our new paper argues that this can fail at the wrong scale. When the
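To make the "post-hoc and one-step" description concrete, here is a minimal toy sketch of the loop described above: the student samples a full rollout first, and only afterwards does the teacher score each position of that fixed trajectory. Everything in it is an assumption for illustration and not taken from the paper: the `toy_student` and `toy_teacher` functions are hypothetical stand-ins for real language models, and the per-token reverse KL is one common choice of token-level signal, not necessarily the one the authors use.

```python
import math
import random

VOCAB_SIZE = 5  # toy vocabulary size, purely illustrative


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_student(prefix):
    # Hypothetical stand-in for a student LM: next-token distribution given a prefix.
    return softmax([math.sin(0.7 * (len(prefix) + 1) * (v + 1)) for v in range(VOCAB_SIZE)])


def toy_teacher(prefix):
    # Hypothetical stand-in for a teacher LM with a different next-token distribution.
    return softmax([math.cos(0.3 * (sum(prefix) + 1) * (v + 1)) for v in range(VOCAB_SIZE)])


def sample_rollout(policy, length, rng):
    """On-policy step: the student generates the whole trajectory by itself."""
    tokens = []
    for _ in range(length):
        probs = policy(tokens)
        tokens.append(rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0])
    return tokens


def per_token_reverse_kl(student, teacher, tokens):
    """Post-hoc step: score each position of the already-fixed trajectory.

    Each loss term only compares next-token distributions at one prefix
    (one-step supervision); the teacher never changes the rollout itself.
    Reverse KL is an assumed choice of signal here.
    """
    losses = []
    for t in range(len(tokens)):
        prefix = tokens[:t]
        p_s = student(prefix)
        p_t = teacher(prefix)
        losses.append(sum(ps * math.log(ps / pt) for ps, pt in zip(p_s, p_t)))
    return losses


rng = random.Random(0)
rollout = sample_rollout(toy_student, length=6, rng=rng)
losses = per_token_reverse_kl(toy_student, toy_teacher, rollout)
print(rollout)
print([round(x, 4) for x in losses])
```

Note that the teacher only ever sees prefixes the student already committed to, which is what makes the supervision post-hoc; a real setup would replace the toy functions with model forward passes and backpropagate the summed loss into the student.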

by @louieworth (Li Jiang) · backlist 2026-06-09 · rubric 72.0