52.
This may not be broadly known, but if instead of causal attention yᵢ = xᵢ + attn(norm(x)) you do a causal EMA yᵢ = xᵢ + α ∑ⱼ βⁱ⁻ʲxⱼ, where α and β are fixed scalars, e.g. α=0.1, β=0.9, it still works, with a healthy loss curve that converg