33.
SWE-Marathon exposes whether agents actually solve the task or instead start searching for exploits in the verifier or environment. Across 100 GLM 5.2 rollouts, only 3 (3%) showed shortcut-seeking behavior, and none shipped exploit code.