48.
Flipping the loop order in the attention kernel, so that it iterates over KV blocks in the outer loop instead of over queries, made it 4× faster than open-source sparse attention kernels. Damn!!
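For anyone wondering what "flipping the loop order" looks like, here is a rough pure-Python sketch. It is not the kernel from the post, which isn't shown. The function names, the block size, and the block mask are all made up for illustration, and the running-max/running-sum (online softmax) bookkeeping is my assumption about how a KV-outer loop keeps per-query results correct. It only demonstrates that both loop orders give the same output; it says nothing about the speedup.

```python
import math

def attention_q_outer(Q, K, V, block, mask):
    """Usual order: outer loop over queries, inner loop over KV blocks.
    Assumes every query block has at least one active KV block."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores, vals = [], []
        for kb in range(0, len(K), block):
            if not mask[i // block][kb // block]:
                continue  # this (query block, KV block) pair is pruned
            for j in range(kb, min(kb + block, len(K))):
                scores.append(sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d))
                vals.append(V[j])
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, vals)) / z
                    for c in range(len(V[0]))])
    return out

def attention_kv_outer(Q, K, V, block, mask):
    """Flipped order: outer loop over KV blocks, inner loop over query blocks."""
    d, dv = len(Q[0]), len(V[0])
    m = [-math.inf] * len(Q)        # running max score per query
    l = [0.0] * len(Q)              # running softmax denominator per query
    acc = [[0.0] * dv for _ in Q]   # running unnormalized output per query
    for kb in range(0, len(K), block):
        k_blk = K[kb:kb + block]    # each KV block is fetched once
        v_blk = V[kb:kb + block]
        for qb in range(0, len(Q), block):
            if not mask[qb // block][kb // block]:
                continue
            for i in range(qb, min(qb + block, len(Q))):
                s = [sum(a * b for a, b in zip(Q[i], k)) / math.sqrt(d)
                     for k in k_blk]
                m_new = max(m[i], max(s))
                scale = math.exp(m[i] - m_new)  # 0.0 on the first visit
                p = [math.exp(x - m_new) for x in s]
                l[i] = l[i] * scale + sum(p)
                acc[i] = [a * scale + sum(pj * v[c] for pj, v in zip(p, v_blk))
                          for c, a in enumerate(acc[i])]
                m[i] = m_new
    return [[a / l[i] for a in acc[i]] for i in range(len(Q))]

# Toy data: 6 queries, 6 keys, block size 2, hypothetical 3x3 block mask.
Q = [[math.sin(i + c) for c in range(4)] for i in range(6)]
K = [[math.cos(2 * j + c) for c in range(4)] for j in range(6)]
V = [[float(j), float(j * j % 5)] for j in range(6)]
mask = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]

ref = attention_q_outer(Q, K, V, 2, mask)
out = attention_kv_outer(Q, K, V, 2, mask)
max_err = max(abs(a - b) for r, o in zip(ref, out) for a, b in zip(r, o))
```

In the flipped version each KV block is touched once and reused across every query block that attends to it, at the cost of keeping `m`, `l`, and `acc` around for all queries until the outer loop finishes.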