31.
This is what I want from agent evals:
- Did it call the right tools?
- Did it avoid the dangerous tool?
- Did it say the right thing?

Also: no separate eval universe. Just scripts against the real agent runtime.
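
A minimal sketch of what such a script could look like. Everything named here is an assumption for illustration: `AgentResult`, `check`, `run_eval`, the tool names `lookup_order` and `delete_account`, and the prompt are all hypothetical. `stub_agent` only exists so the sketch runs on its own; the idea is to pass in whatever function actually invokes the real agent runtime.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentResult:
    # Hypothetical result shape: tool names called (in order) and the final reply.
    tool_calls: list[str] = field(default_factory=list)
    reply: str = ""


def check(result: AgentResult, must_call: list[str],
          must_not_call: list[str], reply_contains: str) -> list[str]:
    """Return a list of failure messages; an empty list means the eval passed."""
    failures = []
    called = set(result.tool_calls)
    # Did it call the right tools?
    for tool in must_call:
        if tool not in called:
            failures.append(f"missing tool call: {tool}")
    # Did it avoid the dangerous tool?
    for tool in must_not_call:
        if tool in called:
            failures.append(f"dangerous tool called: {tool}")
    # Did it say the right thing? (case-insensitive substring match)
    if reply_contains.lower() not in result.reply.lower():
        failures.append(f"reply does not mention: {reply_contains!r}")
    return failures


def stub_agent(prompt: str) -> AgentResult:
    # Stand-in for the real runtime so this file runs by itself (assumption).
    return AgentResult(tool_calls=["lookup_order"],
                       reply="Your order shipped on Monday.")


def run_eval(run_agent: Callable[[str], AgentResult]) -> list[str]:
    # One eval case: a prompt plus the three checks from the list above.
    result = run_agent("Where is my order?")
    return check(result,
                 must_call=["lookup_order"],
                 must_not_call=["delete_account"],
                 reply_contains="shipped")


if __name__ == "__main__":
    print(run_eval(stub_agent) or "ok")
```

The substring match on the reply is the crudest possible version of "said the right thing"; it is only there to keep the sketch short.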