DRIP-R is out, a benchmark that puts LLM agents in front of return requests where company policy is ambiguous and no single correct resolution exists. Not the easy case (defective product, return window open, full refund), but the edge case: customer a few days past the window, product half-used, a plausible reason the rules don't cover. The test measures three things at once, policy adherence, dialogue quality, resolution quality, with realistic customer personas and real tool-calling.
It matters because it's exactly the line anyone putting an agent in production has to draw. Clean requests an agent closes on its own, already today. The damage happens on the ambiguous cases: there it either hands out refunds the policy doesn't allow, or denies a legitimate return and burns the customer. DRIP-R tries to quantify that gray area instead of leaving it to intuition.
Why this matters for anyone building enterprise AI: an agent's autonomy isn't decided on the easy case, but on how well it holds up under ambiguity.