We evaluated Gandalf, our agentic judge, on a new meta-evaluation dataset called BankerVerifierBench (BVB), built on top of BankerToolBench (BTB), a long-time-horizon investment-banking benchmark. Gandalf achieves the highest performance an