2.
FrontierCode: a coding benchmark for mergeable code
The benchmark tests whether maintainers would actually merge an agent’s code, not just whether it passes prewritten unit tests.
1 appearance on the backlist front page in the last 30 days.