17.
John Schulman on PPO’s second wave in the LLM era
PPO’s importance ratio and clipped objective became useful for LLM training because of numerical error, asynchronous training, forward-pass noise, and entropy effects that the original paper did not anticipate.
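For readers who want the two ingredients named above in concrete form, here is a minimal sketch of the standard clipped surrogate from the original PPO paper, written in plain Python. It is not taken from the talk; the function name `ppo_clipped_surrogate`, its argument names, and the default `clip_eps=0.2` are illustrative assumptions. The importance ratio is the probability of a sampled token under the policy being updated divided by its probability under the policy that generated it, so anything that makes those two differ (such as the numerical and asynchrony effects listed above) pushes the ratio away from 1.

```python
import math

def ppo_clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized) over sampled tokens.

    logp_new:   log-probs of the sampled tokens under the policy being updated
    logp_old:   log-probs of the same tokens under the policy that sampled them
    advantages: advantage estimates, one per token
    clip_eps:   clipping half-width (hypothetical default, chosen for illustration)
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        # Importance ratio pi_new / pi_old, computed from log-probabilities.
        ratio = math.exp(ln - lo)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic choice: take the smaller of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the two sets of log-probabilities agree exactly, every ratio is 1 and the objective reduces to the mean advantage; the clip only matters once the sampling policy and the updated policy have drifted apart.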