28.
SciAgentArena: 200 real research tasks for science agents
A new benchmark targets messy, multi-step scientific workflows across six domains instead of reducing research ability to math, coding, or Q&A
1 appearance on the backlist front page in the last 30 days.