Excited to release Delta Attention Residuals! A simple & powerful idea: route over layer deltas instead of cumulative hidden states to avoid routing collapse in deep transformers. Sharper cross-layer routing, lower perplexity, efficient fi
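The post does not spell out the formulation, so here is only a toy sketch of the general idea, under loud assumptions: each layer's delta is taken to be the difference between consecutive residual-stream states, and "routing" is taken to be a softmax-weighted mix over per-layer sources. The names `layer_deltas`, `route`, and `cosine` are hypothetical and not from the release.

```python
import math

def layer_deltas(states):
    """ASSUMPTION: delta_l = h_l - h_{l-1}, where states[0] is the embedding
    output and states[l] is the residual stream after layer l."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(states, states[1:])]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(query, sources):
    """ASSUMPTION: routing is a scaled dot-product softmax over per-layer
    sources, followed by a weighted sum of those sources."""
    d = len(query)
    scores = [sum(q * x for q, x in zip(query, s)) / math.sqrt(d)
              for s in sources]
    weights = softmax(scores)
    mixed = [sum(w * s[i] for w, s in zip(weights, sources)) for i in range(d)]
    return weights, mixed

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy residual stream: every layer writes into a fresh coordinate.
states = [
    [1.0, 0.0, 0.0, 0.0],  # embedding output
    [1.0, 1.0, 0.0, 0.0],  # after layer 1
    [1.0, 1.0, 1.0, 0.0],  # after layer 2
    [1.0, 1.0, 1.0, 1.0],  # after layer 3
]
deltas = layer_deltas(states)

# Cumulative states share most of their content; the deltas do not.
sim_states = cosine(states[2], states[3])
sim_deltas = cosine(deltas[1], deltas[2])

query = [0.0, 0.0, 4.0, 0.0]  # a query that "wants" what layer 2 wrote
w_states, _ = route(query, states[1:])
w_deltas, _ = route(query, deltas)
```

In this toy setup the cumulative states overlap heavily because each one contains everything written before it, so a query that matches layer 2's contribution scores layers 2 and 3 identically. Over the deltas, the same query puts most of its weight on layer 2 alone. Whether this matches the released method is not something the post confirms.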