@htihle on Backlist

39.

GLM 5.2 (max) scores 70.1% on WeirdML, narrowly beating Gemini 3 Pro, from 7 months ago.

GLM 5.2 (max) scores 70.1% on WeirdML, narrowly beating Gemini 3 Pro, from 7 months ago. It uses ~22k output tokens on average, compared to ~12k for the (high) setting. This gives a fairly clear but modest increase (3%) in score, showi

by @htihle (Håvard Ihle) · backlist 2026-06-20 · rubric 86.8

37.

GLM 5.2 (max) scores 70.1% on WeirdML, narrowly beating Gemini 3 Pro, from 7 months ago.

GLM 5.2 (max) scores 70.1% on WeirdML, narrowly beating Gemini 3 Pro, from 7 months ago. It uses ~22k output tokens on average, compared to ~12k for the (high) setting. This gives a fairly clear but modest increase (3%) in score, showi

by @htihle (Håvard Ihle) · backlist 2026-06-19 · rubric 86.8

40.

Why does GPT write 5x more code than Claude?

Why does GPT write 5x more code than Claude? As its last act, I had Fable analyze WeirdML data, and the short answer (link to full analysis in reply): "The gap is real code, not comments. Recent GPT models (since GPT-5) build portfolio

by @htihle (Håvard Ihle) · backlist 2026-06-15 · rubric 82.0

33.

Claude Fable 5 (high) scores 87.8% and takes the lead on WeirdML. It's the first model that scores above 70% on a…

Claude Fable 5 (high) scores 87.8% and takes the lead on WeirdML. It's the first model that scores above 70% on average on each separate task. It uses about 8k output tokens on average, almost as much as Opus 4.7 (high). EDIT: This post

by @htihle (Håvard Ihle) · backlist 2026-06-11 · rubric 90.0

38.

Nemotron 3 Ultra 505b a55b scores 43.5% on WeirdML, comparable to Mistral Medium 3.5 128b or o3 mini.

Nemotron 3 Ultra 505b a55b scores 43.5% on WeirdML, comparable to Mistral Medium 3.5 128b or o3 mini. It can sometimes do well on some of the hard tasks, but it's not very reliable. It also often emitted the "stop" token when done with

by @htihle (Håvard Ihle) · backlist 2026-06-08 · rubric 84.0

70.

Claude Opus 4.8 (xhigh) scores 82.9% on WeirdML, right behind GPT 5.5.

Claude Opus 4.8 (xhigh) scores 82.9% on WeirdML, right behind GPT 5.5. We now also (unlike 4.7) see clear scaling with output token use: - no thinking: 2.4k tokens, 70.5% - medium: 4.3k, 76.0% - xhigh:

by @htihle (Håvard Ihle) · backlist 2026-06-01 · rubric 88.0

87.

How far behind are open models?

How far behind are open models? Across 17 selected benchmarks, private ones show a gap of 8-10 months today, almost 2x the gap on public ones (4-6 months). More discussion (including limitations), code and blog in the thread.

by @htihle (Håvard Ihle) · backlist 2026-05-28 · rubric 74.0