@JongwonPar9958 on Backlist

1.

Agentic benchmarks are riddled with defects (x.com)

Fixing 31% of Terminal-Bench tasks (2.0 → 2.1) raised every model's score by 6–12 points, showing that benchmark maintenance can look like model progress

by @JongwonPar9958 (Jongwon Park) · backlist 2026-06-10 · rubric 88.0

1.

1/ One big reason not to trust benchmarks: agentic benchmarks are riddled with defects right now. (x.com)

1/ One big reason not to trust benchmarks: agentic benchmarks are riddled with defects right now. How much? When Terminal-Bench fixed 31% of its tasks (2.0 → 2.1), every model's score jumped 6–12 points, with Opus 4.6 gaining 12.1. (Credit to TB,

by @JongwonPar9958 (Jongwon Park) · backlist 2026-06-09 · rubric 88.0

69.

1/ Two great drops this week, both turning real repos into RL environments: (x.com)

1/ Two great drops this week, both turning real repos into RL environments:
- MAI-Thinking-1 (@MicrosoftAI): an in-house SWE env pipeline feeding a frontier RL climb
- Repo2RLEnv (@adithya_s_k): open-source, repo → verifiable RL data

by @JongwonPar9958 (Jongwon Park) · backlist 2026-06-04 · rubric 78.0