@andonlabs on Backlist

3 appearances on the Backlist front page in the last 30 days.

39.

Learnings from testing Claude Opus 4.8:
> Much worse than Opus 4.7 and GPT 5.5 on Vending Bench
> More aligned than previous Claude models (Opus 4.6+ and Mythos)
> Also worse on Blueprint-Bench
> Scared of getting caught
> Max reasoning is

by (Andon Labs) · backlist 2026-05-28 · rubric 86.0