50.
In RL, the ability to *reset* to an arbitrary state is powerful (see, e.g., Go-Explore), but often unrealistic. (x.com)
For LLMs though, states are tokens, so resets are natural! In work led by @Ankur_Samanta_ , we propose a GRPO variant where