Topology-aware operators for physical ML
Encoding geometry into the operator itself can make physical ML both faster and more accurate
5 appearances on the backlist front page in the last 30 days.
The proposed method attacks BPTT’s sequential, unstable O(T) gradient path and reframes how expressive RNNs can be trained
Benign training text can now steer a model’s internal weights to carry a functional hidden artifact, blurring data curation and model supply-chain security
Merging Q, K, and V projections challenges a core transformer assumption and could reduce memory pressure in long-context models
1/ We have spent years optimizing the KV cache via head sharing (GQA/MQA), but we ignored a fundamental assumption: why do Transformers need three separate Q, K, and V projections in the first place? Turns out, they don't. Merging them unlocks