31.
SWE-Bench-style grading has been the standard for years now: you ask the agent to solve an issue, then run its code against a pre-constructed unit test. The problem is that passing a unit test is only one part of writing production-ready co