Backlist — 28 May 2026 UTC

Balanced toward durable technical artifacts and concrete current-news analysis, while avoiding the many near-duplicate Opus 4.8 reactions.

28.

OK FIRST EVAL: CODEX RUNNING /goal VS. CLAUDE CODE ORCHESTRATING CODEX AGENTS. I have ACTUAL long-form tasks I have to finish. I created two separate worktrees. This one is a full migration of services from Supabase to self-hosted Po

by (BOOTOSHI) · backlist 2026-05-28 · rubric 92.0
31.

Production agents also change state. If an agent claims it updated a CRM, opened a PR, changed cloud config, or triggered a workflow, the eval should verify what actually happened. Agent Judge can inspect tool evidence, database logs, aud

by (Judgment Labs) · backlist 2026-05-28 · rubric 90.0
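
The idea in this item, checking an agent's claimed state changes against what was actually recorded, can be illustrated with a minimal sketch. This is not Agent Judge's API; the `Claim` type, the `verify_claims` helper, and the audit-log record shape are all hypothetical, chosen only to show the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """A state change the agent says it made (hypothetical shape)."""
    action: str  # e.g. "open_pr"
    target: str  # e.g. "repo/service-a"


def verify_claims(claims, audit_log):
    """Map each claim to True only if a matching audit record exists."""
    observed = {(rec["action"], rec["target"]) for rec in audit_log}
    return {claim: (claim.action, claim.target) in observed for claim in claims}


# The agent reports two changes, but the (assumed) audit log shows only one.
claims = [Claim("open_pr", "repo/service-a"), Claim("update_crm", "account-42")]
audit_log = [{"action": "open_pr", "target": "repo/service-a"}]
results = verify_claims(claims, audit_log)
```

Here the unverified CRM update would be flagged, because the check relies on recorded evidence rather than on the agent's own report.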
35.

Congrats to the @liquidai team on LFM2.5-8B-A1B! Day-0 support is now live in SGLang. - 8B MoE, 1.5B active - Fast tool calling, punches 4x its size - 128K context + better non-Latin support - Runs local, no API keys, no data leaving

by (LMSYS Org) · backlist 2026-05-28 · rubric 88.0
39.

Learnings from testing Claude Opus 4.8: > Much worse than Opus 4.7 and GPT 5.5 on Vending Bench > More aligned than previous Claude models (Opus 4.6+ and Mythos) > Also worse on Blueprint-Bench > Scared of getting caught > Max reasoning is

by (Andon Labs) · backlist 2026-05-28 · rubric 86.0
46.

Democratizing compute with RLMs: you don't need a frontier model with a giant context window. Even relatively small models get massive gains (they trained an 8B RLM-Qwen3 that beats its base model by ~28% and gets close to much larger mode

by (spacy) · backlist 2026-05-28 · rubric 84.0
51.

3 weeks ago we open-sourced HALO. This led to talking with dozens of teams running agents at scale. We realized the current agent monitoring tools aren't built for the future that we so clearly see ahead of us. Today we’re releasing native

by (Sam Hogan) · backlist 2026-05-28 · rubric 84.0
53.

3 weeks after launch, the feedback on @lightseekorg TokenSpeed’s scheduler and kernel design has been encouraging. Kimi K2.5 and Qwen 3.5 reaching speed-of-light performance is amazing. Long road ahead — the lean and small team with high

by (zhyncs) · backlist 2026-05-28 · rubric 83.0
58.

MiniMax M3: >200B+ MoE, 1M context window, MSA (MiniMax Sparse Attention) architecture, released in a few days, open-sourced. From a tweet by an official MiniMax team member: Not inside info, just public stuff online. Open source mod

by (Elaina) · backlist 2026-05-28 · rubric 82.0
80.

State machines are The first POC I did with agent driven UIs was literally just giving the agent a reference to the reducer dispatch action and the serialized JSON schema to describe the payload. Worked incredibly well

by (Jonas) · backlist 2026-05-28 · rubric 74.0
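
A minimal sketch of the pattern described in this item: the agent gets only a `dispatch` function plus a schema describing valid payloads, and every action it emits is validated before reaching the reducer. The to-do domain, `ACTION_SCHEMA`, and `make_dispatch` are hypothetical, and the hand-rolled type check stands in for real JSON Schema validation.

```python
import json

# Hypothetical schema: action type -> required payload fields and their types.
ACTION_SCHEMA = {
    "add_todo": {"text": str},
    "toggle_todo": {"index": int},
}


def reducer(state, action):
    """Pure function: returns a new state list, never mutates the old one."""
    kind, payload = action["type"], action["payload"]
    if kind == "add_todo":
        return state + [{"text": payload["text"], "done": False}]
    if kind == "toggle_todo":
        i = payload["index"]
        return [dict(t, done=not t["done"]) if n == i else t
                for n, t in enumerate(state)]
    return state


def make_dispatch(store):
    """Build the dispatch function that would be handed to the agent."""
    def dispatch(raw_action):
        action = json.loads(raw_action)  # the agent emits serialized JSON
        fields = ACTION_SCHEMA.get(action.get("type"))
        if fields is None:
            raise ValueError(f"unknown action type: {action.get('type')!r}")
        payload = action.get("payload", {})
        for name, expected in fields.items():
            if not isinstance(payload.get(name), expected):
                raise ValueError(f"payload field {name!r} must be {expected.__name__}")
        store["state"] = reducer(store["state"],
                                 {"type": action["type"], "payload": payload})
    return dispatch


store = {"state": []}
dispatch = make_dispatch(store)
dispatch('{"type": "add_todo", "payload": {"text": "ship"}}')
dispatch('{"type": "toggle_todo", "payload": {"index": 0}}')
```

Because the agent can only change state through validated actions, a malformed or unknown action is rejected before it touches the UI state.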
81.

New post from @iapsAI on Cyber Superstorms. My colleagues argue that counting zero-days is not the way to measure the consequences of AI-accelerated vulnerability. Instead, they propose that the community should focus on how often AI-acc

by (Dave Banerjee) · backlist 2026-05-28 · rubric 74.0
87.

How far behind are open models? Across 17 selected benchmarks, private ones show a gap of 8-10 months today, almost 2x the gap on public ones (4-6 mo). More discussion (including limitations), code and blog in the thread.

by (Håvard Ihle) · backlist 2026-05-28 · rubric 74.0