22.
Synthetic Data Playbook: G-Vendi predicts quality where other diversity metrics do not
Across 83 synthetic-pretraining experiments, most diversity metrics failed to predict data quality, while G-Vendi stood out as the one that did
2 appearances on the backlist front page in the last 30 days.
Two billion web pages are now directly accessible as training data, lowering the friction for pretraining experiments