59.
We take great care to keep our benchmarks from leaking, and OSS models perform badly on them. I see other benchmarks using a trust model (“it’s tagged with a GUID, labs won’t train on them”), and OSS models match the frontier. Diff might co