18.
KV-cache compression: 74x smaller, 99% of tokens evicted
A compression prototype looked spectacular until long-context quality tests revealed that it had simply deleted almost all of the tokens that future tasks might need.
2 appearances on the backlist front page in the last 30 days.
Built a speculative decoding inference engine in Triton (more tests still ongoing). For now, I've tested with the GPT family because that's what I can run on my personal GPU (4 GB), and I'm outperforming SGLang on both throughput and correctn