1. Agentic benchmarks are riddled with defects (x.com)
Fixing 31% of Terminal-Bench tasks moved every model’s score by 6–12 points, showing that benchmark maintenance can look like model progress
3 appearances on the backlist front page in the last 30 days.
1/ One big reason not to trust benchmarks: agentic benchmarks are riddled with defects right now. How much? When Terminal-Bench fixed 31% of its tasks (2.0 → 2.1), every model's score jumped 6–12 points — Opus 4.6 +12.1. (Credit to TB,
1/ Two great drops this week, both turning real repos into RL environments:
- MAI-Thinking-1 (@MicrosoftAI): an in-house SWE env pipeline feeding a frontier RL climb
- Repo2RLEnv (@adithya_s_k): open-source, repo → verifiable RL data