38.
1/4 (x.com)
1/4 New paper with @weijie444 ! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU