1/ One big reason not to trust benchmarks: agentic benchmarks are riddled with defects right now. (x.com)
How much? When Terminal-Bench fixed 31% of its tasks (2.0 → 2.1), every model's score jumped 6–12 points, Opus 4.6 by 12.1. (Credit to TB,