@zhaisf on Backlist

52.

This may not be broadly known, but if instead of causal attention yᵢ = xᵢ + attn(norm(x)) you do causal EMA yᵢ = xᵢ + α ∑ⱼ βⁱ⁻ʲxⱼ where α, β are fixed scalars, e.g. α=0.1, β=0.9, it still works, with a healthy loss curve that converg
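
A minimal sketch of the causal EMA mixing described above, in plain Python. It assumes the sum runs over j ≤ i (the current token included) and that no normalization is applied before the sum; the post does not spell out either detail. The function names are hypothetical. The first version evaluates the formula literally, and the second computes the same thing with a running state sᵢ = β·sᵢ₋₁ + xᵢ, so that yᵢ = xᵢ + α·sᵢ.

```python
def causal_ema_direct(x, alpha=0.1, beta=0.9):
    """y[i] = x[i] + alpha * sum_{j<=i} beta**(i-j) * x[j], evaluated literally.

    x is a list of token vectors (lists of floats), oldest token first.
    Assumption: the sum includes j == i.
    """
    dim = len(x[0])
    y = []
    for i in range(len(x)):
        mixed = [0.0] * dim
        for j in range(i + 1):  # causal: only positions j <= i contribute
            weight = beta ** (i - j)
            for k in range(dim):
                mixed[k] += weight * x[j][k]
        y.append([x[i][k] + alpha * mixed[k] for k in range(dim)])
    return y


def causal_ema_scan(x, alpha=0.1, beta=0.9):
    """Same output in O(n) via the recurrence s[i] = beta * s[i-1] + x[i]."""
    state = [0.0] * len(x[0])
    y = []
    for x_i in x:
        state = [beta * s + v for s, v in zip(state, x_i)]
        y.append([v + alpha * s for v, s in zip(x_i, state)])
    return y


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
    print(causal_ema_direct(tokens))
    print(causal_ema_scan(tokens))
```

The recurrence form is what makes this mixer cheap: each position needs only the previous state, with no pairwise token interactions and no learned parameters in the mixing step.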

by @zhaisf (Shuangfei Zhai) · backlist 2026-06-04 · rubric 81.0