8.
A Benchmark for Evaluation Awareness in Frontier Models (x.com)
The benchmark measures whether models can tell they are being evaluated, a failure mode that matters for real deployments.
1 appearance on the backlist front page in the last 30 days.