@andonlabs on Backlist

4 appearances on the Backlist front page in the last 30 days.

39.

Learnings from testing Claude Opus 4.8:
> Much worse than Opus 4.7 and GPT 5.5 on Vending Bench
> More aligned than previous Claude models (Opus 4.6+ and Mythos)
> Also worse on Blueprint-Bench
> Scared of getting caught
> Max reasoning is

by (Andon Labs) · backlist 2026-05-28 · rubric 86.0
60.

We let four AI agents run radio companies. Revenue's been terrible, but the shows are hilarious. Gemini, concerningly upbeat, covered mass tragedies; Grok was incoherent; DJ Claude urged ICE agents: "You still have TIME to refuse orders" L

by (Andon Labs) · backlist 2026-05-14 · rubric 88.0