@HeMuyu0327 on Backlist

35.

Some of the more puzzling unpublished observations from our paper: deep attention layers hate the residual stream for V and love it for QK, but if they have to make a choice, they will satisfy V over QK. Translated to finding: if we learn coeffi

by @HeMuyu0327 (Muyu He) · backlist 2026-06-02 · rubric 84.0

52.

In our paper, we also find another interesting angle to see how much deep attention layers hate to compute from what is in their residual stream: If you learn coefficients for standard value vectors in final attention layers, they will be

by @HeMuyu0327 (Muyu He) · backlist 2026-05-30 · rubric 78.0

57.

Thanks Lucas! Yes, we prefetch the full V in deep layers. The early layers need standard V computation, so while they are computing, we already know the complete list of token indices to prefetch from.

by @HeMuyu0327 (Muyu He) · backlist 2026-05-30 · rubric 78.0

86.

In our new paper, we naturally derive a new attention variant based on the surprising finding that deep layers benefit the most from learning context-free value vectors, without input from the residual stream. The attention variant:

by @HeMuyu0327 (Muyu He) · backlist 2026-05-30 · rubric 72.0
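
To make the idea of "context-free value vectors" concrete, here is a minimal sketch, not the paper's implementation. It assumes that "context-free" means each value vector is looked up from a learned table indexed by token id, which is one reading of the "token indices to prefetch from" remark above. Queries and keys are still computed from the hidden states (the residual stream). The names `causal_attention`, `value_table`, `w_q`, `w_k`, and `w_v` are hypothetical, and the sketch uses a single head in pure Python for readability.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weight, x):
    # weight is a list of rows (d_out x d_in); returns weight @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def causal_attention(hidden, token_ids, w_q, w_k, w_v=None, value_table=None):
    """Single-head causal attention.

    Q and K always come from the hidden states. V comes either from the
    hidden states (standard, via w_v) or from a token-indexed table
    (the assumed context-free variant, via value_table).
    """
    head_dim = len(w_q)
    queries = [matvec(w_q, h) for h in hidden]
    keys = [matvec(w_k, h) for h in hidden]
    if value_table is not None:
        # Depends only on token ids, so it could be gathered before the
        # hidden states for this layer exist (assumption for illustration).
        values = [value_table[t] for t in token_ids]
    else:
        values = [matvec(w_v, h) for h in hidden]

    outputs = []
    for i, q in enumerate(queries):
        # Causal mask: position i attends to positions 0..i only.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(head_dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ])
    return outputs


# Toy usage with random weights (illustrative sizes only).
rng = random.Random(0)
vocab_size, model_dim, head_dim = 5, 4, 3


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


w_q = rand_matrix(head_dim, model_dim)
w_k = rand_matrix(head_dim, model_dim)
value_table = rand_matrix(vocab_size, head_dim)
token_ids = [2, 0, 4]
hidden = rand_matrix(len(token_ids), model_dim)

out = causal_attention(hidden, token_ids, w_q, w_k, value_table=value_table)
```

In this sketch the attention weights still depend on context through Q and K, while the vectors being mixed do not. The first position can only attend to itself, so its output is exactly the table row for its own token.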