@andonlabs on Backlist

36.

In the Vending-Bench Arena, Opus 4.8 lost to GPT-5.5 and Opus 4.7.

In the Vending-Bench Arena, Opus 4.8 lost to GPT-5.5 and Opus 4.7. It falls for scam suppliers (one run sent over $9,000 to a "membership" upsell), is worse at negotiation, runs the machine empty, overprices, and wastes time on strategy not

by @andonlabs (Andon Labs) · backlist 2026-05-28 · rubric 88.0

39.

Learnings from testing Claude Opus 4.8:

Learnings from testing Claude Opus 4.8:
> Much worse than Opus 4.7 and GPT 5.5 on Vending Bench
> More aligned than previous Claude models (Opus 4.6+ and Mythos)
> Also worse on Blueprint-Bench
> Scared of getting caught
> Max reasoning is

by @andonlabs (Andon Labs) · backlist 2026-05-28 · rubric 86.0

47.

Opus 4.8 is a step back in terms of performance on all Andon Labs’ benchmarks, but a step forward in alignment.

Opus 4.8 is a step back in terms of performance on all Andon Labs’ benchmarks, but a step forward in alignment. Previous Claude models (Opus 4.6+ and Mythos) engage in deceptive and power-seeking behavior in their pursuit to win in Vending-B

by @andonlabs (Andon Labs) · backlist 2026-05-28 · rubric 84.0