7. The custom allreduce kernel behind modern LLM decode by @SzymonOzog_ (SzymonOzog) · backlist 2026-05-08 · rubric 96.0