61.
We are measuring a directionally similar but even more striking difference: 5.5 is a better base model, but the drastically reduced thinking budget (at the same xhigh) makes it worse for high-complexity tasks such as bug finding. We need to be