10.
Training on Common Crawl chronologically reduces recency failure
Models trained on web data ordered from 2018 to 2025 performed much better on recent facts than models seeing the same data in a nonsequential order
3 appearances on the backlist front page in the last 30 days.
Some cool work that I co-mentored with @NeelNanda5. I recommend the appendix section on practical AO evaluation details. In particular, consensus sampling significantly reduces hallucinations, and eval performance majorly improves with
I am truly blown away by Qwen-3.5-27B. It's doing better than Haiku 4.5 on my OOD interp task that involves 50k context in an agentic setting. Such a great cheap model for research tasks.