@OfirPress on Backlist

2 appearances on the Backlist front page in the last 30 days.

65.

People frequently ask me how many tasks a benchmark should have. There's no exact answer, but here's my intuition (tl;dr: aim for 300-500 tasks).

by @OfirPress (Ofir Press) · backlist 2026-06-12 · rubric 84.0

77.

"SWE-bench/ProgramBench are based on publicly available data, so they're invalid because the models were trained on the answers." Nope: 1. Scores are ~0% at first, showing models don't memorize answers. 2. Cheating by post-training on answers

by @OfirPress (Ofir Press) · backlist 2026-06-04 · rubric 74.0