For days, many folks here have been citing DeepSWE as the benchmark that restores reality, only because it shows GPT 5.5 on top. But it actually gets just about one entry right: the top one. Everything below it is shuffled.