17.
Megaprop: preconditioned optimization across GPUs
Megaprop extends Megatron and TransformerEngine with distributed support for Muon, FOOF, KFAC, Newton-Muon, and MuP across width and depth
2 appearances on the backlist front page in the last 30 days.
Megaprop's PSGD implementation calculates the preconditioning matrices alongside the gradient: it collects and communicates X.T @ X and dY.T @ dY at the same time as it computes the weight gradient, dY.T @ X. It also has first-class support for dia
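As a rough illustration of the three products named above (this is not Megaprop's actual code), the NumPy sketch below computes the weight gradient dY.T @ X together with X.T @ X and dY.T @ dY in a single backward step. The function names linear_backward_with_factors and simulated_all_reduce are hypothetical, the layer is assumed to be Y = X @ W.T, and summing over batch shards stands in for whatever cross-GPU communication the real system uses.

```python
import numpy as np


def linear_backward_with_factors(X, dY):
    """Return the weight gradient and both second-moment factors.

    X:  (batch, d_in)  inputs to an assumed linear layer Y = X @ W.T
    dY: (batch, d_out) gradient of the loss with respect to Y
    """
    grad_W = dY.T @ X   # (d_out, d_in) gradient on the weights
    A = X.T @ X         # (d_in, d_in) input-side factor
    G = dY.T @ dY       # (d_out, d_out) output-gradient-side factor
    return grad_W, A, G


def simulated_all_reduce(per_worker):
    """Sum each statistic across workers (a stand-in for a sum all-reduce)."""
    return tuple(sum(parts) for parts in zip(*per_worker))


rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
dY = rng.standard_normal((8, 3))

# Pretend the batch is split across two data-parallel workers.
shards = [
    linear_backward_with_factors(X_shard, dY_shard)
    for X_shard, dY_shard in zip(np.split(X, 2), np.split(dY, 2))
]
grad_W, A, G = simulated_all_reduce(shards)

# Reference: the same quantities computed on the full batch at once.
full_grad_W, full_A, full_G = linear_backward_with_factors(X, dY)
```

All three products are sums over the batch dimension, so reducing the per-shard results gives the same values as the full-batch computation. That is why the factors can travel in the same communication step as the gradient in a sketch like this one.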