We gave frontier LLMs your daily interaction history — they still score below 0.5.
Adding memory makes it worse. Findings from our VitaBench 2.0, the first agent benchmark for long-term dynamic user modeling, evaluating Personalized & P