89.
(x.com)
Pre-training is increasingly data-constrained: compute outruns text, models repeat tokens many times, and how much repetition you can afford is an open question. In "Mix, Don't Tune" (my @Apple MLR internship), we run ~1000 pre-training