19.
Why critic-free RL works in LLM post-training
Sequence-level rewards can assign hidden token credit because gradients from positive and negative rollouts cancel in structured ways
1 appearance on the backlist front page in the last 30 days.