Agent evaluation: measuring multi-step trajectories, tool use, and open-ended outputs. Why benchmarks alone don't tell you whether an agent works in production.