72.
I just submitted a PR to modded-nanogpt with better hyperparams. With them, Muon can reach the target loss after 3250 steps instead of 3325. Always tune your baseline well when doing research. Weak baselines can make any idea look promising.