FrontierMath v2 audit fixed errors in 42% of problems
Scores on a major math benchmark rose across the board after an audit corrected errors in 42% of its problems, underscoring how fragile frontier evals remain.
9 appearances on the backlist front page in the last 30 days.
Epoch’s audit corrected errors in 42% of FrontierMath Tiers 1–4 problems, raising scores while leaving rankings broadly similar
We’ve backfilled FrontierMath: Tiers 1–4 (v2) scores for a selection of notable models, including recent Claude Opus models. You can find these on our website. We will add scores for Claude Fable 5 and GPT Pro models shortly.
Epoch AI’s tracking places Colossus 1, Anthropic-Amazon New Carlisle, and Meta Prometheus in a rapid sequence of single-site compute records
Looking ahead, our research suggests that no data center will have meaningfully greater capacity than Colossus 2 until the second half of 2027. However, we expect a reversion to trend in late-2027/early-2028 when QTS Cedar Rapids and Meta
Are we nearing a compute crunch? In our latest Gradient Update, @luke__emberson and @Jsevillamol estimate how many tokens all the Blackwell chips on Earth could serve, and compare this to total token demand. Direct comparisons are diff
Our supply estimate is based on serving Kimi K2.6, a trillion-parameter model with 32B active parameters. Using 8k:1k input-to-output token requests, we estimate it would be possible to serve ~20B output tok/s, enough to serve every person
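As a rough back-of-envelope illustration of the supply figure above, the sketch below divides the ~20B output tok/s estimate by world population. The population figure of roughly 8 billion is an assumption introduced here, not a number from the source, and the helper name `per_person_rates` is hypothetical.

```python
# Figures taken from the snippet above.
OUTPUT_TOKENS_PER_SECOND = 20e9  # ~20B output tok/s
INPUT_TO_OUTPUT_RATIO = 8        # 8k input tokens per 1k output tokens

# Assumption (not from the source): roughly 8 billion people worldwide.
WORLD_POPULATION = 8e9
SECONDS_PER_DAY = 24 * 60 * 60


def per_person_rates(output_tps, population):
    """Return (output tokens per second, output tokens per day) per person."""
    per_second = output_tps / population
    per_day = per_second * SECONDS_PER_DAY
    return per_second, per_day


per_second, per_day = per_person_rates(OUTPUT_TOKENS_PER_SECOND, WORLD_POPULATION)

# Input throughput implied by the 8:1 request shape.
implied_input_tps = OUTPUT_TOKENS_PER_SECOND * INPUT_TO_OUTPUT_RATIO

print(f"Output tokens per person per second: {per_second:.2f}")
print(f"Output tokens per person per day: {per_day:,.0f}")
print(f"Implied input tokens per second: {implied_input_tps:.2e}")
```

Under these assumptions the supply works out to about 2.5 output tokens per person per second. The result scales linearly with both the throughput estimate and the population figure, so it should be read as an order-of-magnitude check only.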
Epoch estimates high-bandwidth memory rose from 52% to 63% of AI chip component spending between Q1 2024 and Q4 2025
Based on disclosures about data center power capacity and compute spend, the compute used by OpenAI, Anthropic, and xAI is likely <30% of the world total. Google and Meta are giant hyperscalers, but much of their compute goes to cloud and