excellent blog on how to actually make agents better instead of just benchmaxxing evals. some important points:
-> benchmaxxing fits tools where a human stays in control and catches mistakes. floor raising fits agents that work alone with no one