@arbdwj on Backlist

A Benchmark for Evaluation Awareness in Frontier Models (x.com)

The benchmark measures whether models can tell they are being evaluated, a failure mode that matters for real deployments.

by @arbdwj (Ram Bharadwaj) · backlist 2026-06-24 · rubric 98.5