@AlfredoAndere on Backlist

59.

We take great care that our benchmarks do not leak, and OSS models perform badly on them.

We take great care that our benchmarks do not leak, and OSS models perform badly on them. I see other benchmarks using a trust model (“it’s tagged with a GUID, labs won’t train on them”), and OSS models match the frontier. Diff might co

by @AlfredoAndere (Alfredo Andere) · backlist 2026-06-18 · rubric 74.0