Customer service: measuring the agent beyond the right answer

A benchmark submitted to arXiv on March 30 takes real cloud support data and measures what correctness-only tests ignore: how much it costs to reach the solution. Not whether the answer is right, but how many turns it takes and how many escalations land on a human for no reason. Those are the numbers that count when you put an agent in front of real customers, not in a demo.

The practical result: the agent closes linear requests on its own, the ones with a single path where the tools respond. It stalls on multi-turn conversations and on tool use, exactly where the customer has already run out of patience. There it doesn't fail silently; it fails by escalating badly or by spinning in circles. These are two different failures: the useless escalation burns an operator, the spinning burns the customer.

For anyone designing automated customer service, the lesson is about scope. You judge the agent on turns and avoidable escalations, not on average accuracy. And you keep the human at the exact boundary where the agent loses the thread.
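
To make that scoring concrete, here is a minimal sketch in Python of how turns and avoidable escalations could be computed next to accuracy. The Ticket fields, the needed_human label and the sample data are all hypothetical, invented for illustration; they are not the benchmark's schema or its results.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    # All fields are illustrative assumptions, not the benchmark's format.
    turns: int          # customer-agent exchanges before the ticket closed
    resolved: bool      # the issue ended up solved
    escalated: bool     # the ticket was handed to a human operator
    needed_human: bool  # hypothetical label: a human was truly required


def support_metrics(tickets):
    """Return accuracy alongside the cost-side numbers for a batch of tickets."""
    n = len(tickets)
    if n == 0:
        raise ValueError("no tickets to score")
    escalated = [t for t in tickets if t.escalated]
    # An escalation is "avoidable" when no human was actually required.
    avoidable = [t for t in escalated if not t.needed_human]
    return {
        "accuracy": sum(t.resolved for t in tickets) / n,
        "mean_turns": sum(t.turns for t in tickets) / n,
        "escalation_rate": len(escalated) / n,
        "avoidable_escalation_rate": len(avoidable) / n,
    }


# Made-up sample: every ticket ends up resolved, so accuracy looks perfect.
tickets = [
    Ticket(turns=2, resolved=True, escalated=False, needed_human=False),
    Ticket(turns=9, resolved=True, escalated=False, needed_human=False),
    Ticket(turns=3, resolved=True, escalated=True, needed_human=False),
    Ticket(turns=4, resolved=True, escalated=True, needed_human=True),
]
metrics = support_metrics(tickets)
print(metrics)
```

In this invented sample accuracy is 100%, yet one ticket took nine turns and one escalation reached an operator without needing to. Those are the two costs an accuracy average hides.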

Why this matters for anyone building enterprise AI: a support agent is judged on the turns it takes and the escalations it avoids, not on how often its answer is right on average.

◆ ◆ ◆

Source

https://arxiv.org/abs/2603.28569