Some of the more puzzling unpublished observations from our paper: deep attention layers hate the residual stream for V and love it for QK, but if they have to make a choice, they will satisfy V over QK. Translated into a finding: if we learn coeffi