53.
Any task, whether you call it an eval or an environment, decomposes into three parts: a dataset of task instances, a harness/rollout that lets the model act (multi-turn, with tools and state), and a verifier/reward function that scores the t
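As a rough illustration of that three-part split, here is a minimal Python sketch. Every name in it (`TaskInstance`, `Trajectory`, `rollout`, `exact_match`, `run_task`, `toy_policy`, the `ANSWER:` convention) is hypothetical and chosen for this example only; it is not taken from any particular framework, and it leaves out tool calls to stay short.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class TaskInstance:
    """Part 1: one row of the dataset (prompt plus what the verifier needs)."""
    prompt: str
    expected: str


@dataclass
class Trajectory:
    """State the harness accumulates while the model acts."""
    messages: list[str] = field(default_factory=list)
    final_answer: str = ""


# Hypothetical signatures for the two pluggable callables.
Policy = Callable[[list[str]], str]
Verifier = Callable[[TaskInstance, Trajectory], float]


def rollout(instance: TaskInstance, policy: Policy, max_turns: int = 4) -> Trajectory:
    """Part 2: the harness. Lets the policy act for up to max_turns turns."""
    traj = Trajectory(messages=[instance.prompt])
    for _ in range(max_turns):
        reply = policy(traj.messages)
        traj.messages.append(reply)
        # Assumed convention: a reply starting with "ANSWER:" ends the episode.
        if reply.startswith("ANSWER:"):
            traj.final_answer = reply[len("ANSWER:"):].strip()
            break
    return traj


def exact_match(instance: TaskInstance, traj: Trajectory) -> float:
    """Part 3: the verifier. Returns 1.0 on an exact match, else 0.0."""
    return 1.0 if traj.final_answer == instance.expected else 0.0


def run_task(dataset: Iterable[TaskInstance], policy: Policy, verifier: Verifier) -> list[float]:
    """Glue: run the harness on each instance, then score the result."""
    return [verifier(inst, rollout(inst, policy)) for inst in dataset]


def toy_policy(messages: list[str]) -> str:
    """Stand-in for a model: 'thinks' on the first turn, then always answers 4."""
    return "thinking..." if len(messages) == 1 else "ANSWER: 4"


dataset = [TaskInstance("What is 2 + 2?", "4"), TaskInstance("What is 3 + 3?", "6")]
scores = run_task(dataset, toy_policy, exact_match)
print(scores)
```

Under these assumptions, the same `run_task` loop serves as an eval when you average the scores and as an environment when you feed them back as rewards; only what happens to the numbers afterwards differs.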