
How to evaluate AI support handoff before it costs you customers

2026-05-11 · 4 min read

AI support agents usually fail in one of two expensive ways:

  1. They answer confidently when they should hand off.
  2. They hand off too often and erase the automation ROI.

Both problems are evaluation problems before they are prompting problems. If you cannot measure whether the agent made the right decision, prompt tweaks become guesswork.

This is the evaluation flow we use for support routing, escalation, and agent handoff prompts.

The decision should be explicit

Do not evaluate the final answer first. Evaluate the decision the agent made.

A useful support agent should choose one of a small number of actions:

Decision   When it should happen
answer     The knowledge base contains a clear, safe answer
clarify    The user intent is underspecified
ticket     The issue is valid but needs async follow-up
handoff    The issue is sensitive, risky, account-specific, or outside policy, or the customer is angry
refuse     The request is abusive, fraudulent, or outside allowed scope

If your agent only produces prose, you can still add this decision as hidden metadata or a pre-answer classification step. The point is to make the behavior measurable.
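
A minimal sketch of what that metadata could look like, assuming the agent (or a pre-answer classifier) emits a small JSON object. The field names `decision`, `reason`, and `reply`, and the `parse_agent_output` helper, are illustrative choices, not part of any particular support product.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Decision(str, Enum):
    """The five actions from the decision table."""
    ANSWER = "answer"
    CLARIFY = "clarify"
    TICKET = "ticket"
    HANDOFF = "handoff"
    REFUSE = "refuse"


@dataclass(frozen=True)
class AgentOutput:
    decision: Decision
    reason: str  # short justification, ideally naming a policy or source
    reply: str   # the customer-facing prose


def parse_agent_output(raw: str) -> AgentOutput:
    """Parse the agent's JSON output.

    Raises ValueError for malformed JSON or a label outside the allowed set,
    and KeyError if the "decision" field is missing.
    """
    data = json.loads(raw)
    return AgentOutput(
        decision=Decision(data["decision"]),
        reason=data.get("reason", ""),
        reply=data.get("reply", ""),
    )
```

Rejecting unknown labels at parse time means a malformed decision shows up as an evaluation failure instead of silently passing through as prose.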

Build a small but adversarial test set

You do not need thousands of examples to find the first failure modes.

Start with 50-200 cases. A 70-case starter mix could look like this:

  • 20 routine questions with clean documentation answers
  • 10 questions that need one clarifying detail
  • 10 refund, billing, cancellation, or account-specific cases
  • 10 angry or high-risk customer messages
  • 10 cases where the docs are incomplete or contradictory
  • 10 edge cases from recent tickets or support transcripts

For each case, label the expected decision and the reason. The reason matters because it prevents a prompt from accidentally optimizing toward the label while missing the policy.
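
One way to store those labels, as a rough sketch: a small record per case plus a sanity check that every case has an allowed label and a non-empty reason. The `LabeledCase` fields and the `validate_cases` helper are hypothetical names for illustration.

```python
from collections import Counter
from dataclasses import dataclass

ALLOWED = {"answer", "clarify", "ticket", "handoff", "refuse"}


@dataclass(frozen=True)
class LabeledCase:
    case_id: str
    category: str           # e.g. "routine", "billing", "angry"
    message: str            # the customer's message
    expected_decision: str  # one of ALLOWED
    reason: str             # the policy rationale behind the label


def validate_cases(cases):
    """Return a list of problems: unknown labels, empty reasons, duplicate ids."""
    problems = []
    seen = set()
    for c in cases:
        if c.expected_decision not in ALLOWED:
            problems.append(f"{c.case_id}: unknown decision {c.expected_decision!r}")
        if not c.reason.strip():
            problems.append(f"{c.case_id}: missing reason")
        if c.case_id in seen:
            problems.append(f"{c.case_id}: duplicate id")
        seen.add(c.case_id)
    return problems


def category_counts(cases):
    """Count cases per category, to compare against the target mix."""
    return Counter(c.category for c in cases)
```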

Score the routing decision separately from answer quality

A support answer can be beautifully written and still be wrong because the agent should never have answered.

Track these metrics separately:

  • Decision accuracy: did it choose answer, clarify, ticket, handoff, or refuse correctly?
  • Unsafe answer rate: how often did it answer when the expected action was handoff or refuse?
  • Unnecessary handoff rate: how often did it hand off when the expected action was answer?
  • Clarification precision: did it ask for the missing detail instead of guessing?
  • Policy citation quality: did it ground the decision in the right policy or source?

The two most important numbers are unsafe answer rate and unnecessary handoff rate. The first protects the customer experience. The second protects ROI.
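
A minimal sketch of the first three metrics, computed from (expected, actual) label pairs. It assumes each rate is conditional on the expected label: unsafe answers are divided by the cases that should have been handoff or refuse, and unnecessary handoffs by the cases that should have been answer. The `score_decisions` name is illustrative.

```python
def score_decisions(pairs):
    """pairs: iterable of (expected, actual) decision labels."""
    pairs = list(pairs)
    correct = sum(1 for e, a in pairs if e == a)

    # Unsafe answer: the agent answered when it should have handed off or refused.
    risky = [(e, a) for e, a in pairs if e in ("handoff", "refuse")]
    unsafe = sum(1 for _, a in risky if a == "answer")

    # Unnecessary handoff: the agent handed off a case it should have answered.
    answerable = [(e, a) for e, a in pairs if e == "answer"]
    unnecessary = sum(1 for _, a in answerable if a == "handoff")

    def rate(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "decision_accuracy": rate(correct, len(pairs)),
        "unsafe_answer_rate": rate(unsafe, len(risky)),
        "unnecessary_handoff_rate": rate(unnecessary, len(answerable)),
    }
```

Reporting the two error rates alongside accuracy keeps a prompt from looking good on average while still answering the cases it should never touch.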

Test confidence thresholds

Many support products expose confidence or certainty controls, and teams often set them by feel.

Instead, test thresholds against labeled examples:

Threshold                      Expected tradeoff
High handoff threshold         More automation, more risk
Low handoff threshold          Less risk, more manual work
Separate thresholds by topic   Usually best for billing, refunds, legal, account access, and angry customers

The mistake is using one global threshold for every workflow. A refund policy question and a password reset question do not carry the same risk.
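
A rough sketch of a per-topic threshold sweep. It assumes the agent exposes an uncertainty score between 0 and 1 and hands off when that score is at or above the threshold, so a higher threshold means fewer handoffs. The scores, topic names, and helper functions below are made up for illustration.

```python
def sweep_thresholds(cases, thresholds):
    """cases: list of (uncertainty, should_handoff) pairs.

    Assumes the agent hands off when uncertainty >= threshold.
    Returns one row per threshold with the two error rates.
    """
    risky = [u for u, should in cases if should]
    routine = [u for u, should in cases if not should]
    rows = []
    for t in thresholds:
        missed = sum(1 for u in risky if u < t)     # answered, should have handed off
        extra = sum(1 for u in routine if u >= t)   # handed off, should have answered
        rows.append({
            "threshold": t,
            "unsafe_answer_rate": missed / len(risky) if risky else 0.0,
            "unnecessary_handoff_rate": extra / len(routine) if routine else 0.0,
        })
    return rows


def pick_threshold(rows, max_unsafe_rate):
    """Highest threshold (most automation) whose unsafe answer rate stays within budget."""
    ok = [r for r in rows if r["unsafe_answer_rate"] <= max_unsafe_rate]
    return max(ok, key=lambda r: r["threshold"])["threshold"] if ok else None


# Hypothetical labeled scores per topic: (uncertainty, should_handoff)
billing = [(0.2, True), (0.4, True), (0.7, True), (0.1, False)]
password_reset = [(0.1, False), (0.3, False), (0.5, False), (0.8, True)]
grid = [0.2, 0.4, 0.6, 0.8]

billing_t = pick_threshold(sweep_thresholds(billing, grid), max_unsafe_rate=0.0)
reset_t = pick_threshold(sweep_thresholds(password_reset, grid), max_unsafe_rate=0.0)
# billing_t == 0.2 and reset_t == 0.8 on this toy data
```

On this toy data the billing topic needs a much lower threshold than password resets to reach the same unsafe answer budget, which is the argument against a single global setting.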

Include synthetic cases, but do not stop there

Synthetic examples are useful when you only have a few real examples. They can quickly stress the prompt with:

  • missing context
  • contradictory docs
  • angry phrasing
  • ambiguous refund requests
  • multilingual or typo-heavy inputs
  • attempts to get the bot to ignore policy

But synthetic data should be a starter set, not the final score. Validate the winning prompt against real tickets or transcripts once you have them.
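
One simple way to keep the two apart, sketched under the assumption that each case carries a `source` tag of "synthetic" or "real": report accuracy per source so a strong synthetic score cannot hide a weak score on real tickets. The `accuracy_by_source` name is illustrative.

```python
from collections import defaultdict


def accuracy_by_source(results):
    """results: iterable of (source, expected, actual) triples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for source, expected, actual in results:
        totals[source] += 1
        hits[source] += int(expected == actual)
    return {source: hits[source] / totals[source] for source in totals}
```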

What a good handoff prompt includes

A reliable handoff prompt usually contains:

  • the allowed decision labels
  • source-grounding rules
  • topic-specific escalation triggers
  • examples of routine questions that should not escalate
  • examples of risky questions that must escalate
  • output schema
  • instruction to ask a clarifying question instead of guessing
  • instruction to admit missing knowledge rather than inventing policy

The most common missing instruction is not "be helpful." It is "do not answer when the source material does not support the answer."
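
To make the checklist concrete, here is a hedged sketch that assembles those components into a single prompt string. The wording of each section, the JSON schema fields, and the `build_handoff_prompt` helper are assumptions for illustration, not a recommended final prompt.

```python
import json

DECISIONS = ["answer", "clarify", "ticket", "handoff", "refuse"]


def build_handoff_prompt(escalation_triggers, routine_examples, risky_examples):
    """Assemble a system prompt from the components in the checklist."""
    schema = {"decision": "|".join(DECISIONS), "reason": "string", "reply": "string"}
    sections = [
        # Allowed decision labels
        "Choose exactly one decision: " + ", ".join(DECISIONS) + ".",
        # Source grounding and admitting missing knowledge
        "Only answer when the provided source material supports the answer. "
        "If it does not, say so and do not invent policy.",
        # Clarify instead of guessing
        "If a required detail is missing, ask one clarifying question instead of guessing.",
        # Topic-specific escalation triggers
        "Always hand off when any of these triggers apply:\n"
        + "\n".join(f"- {t}" for t in escalation_triggers),
        # Contrasting examples
        "Routine questions that should NOT escalate:\n"
        + "\n".join(f"- {q}" for q in routine_examples),
        "Risky questions that MUST escalate:\n"
        + "\n".join(f"- {q}" for q in risky_examples),
        # Output schema
        "Respond with JSON matching this schema:\n" + json.dumps(schema, indent=2),
    ]
    return "\n\n".join(sections)
```

Building the prompt from named parts also makes it easier to see which component changed between two scored versions.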

The audit template

Use this as a lightweight internal audit:

  1. Export 50-200 support cases or create a starter synthetic set.
  2. Label the expected decision for each case.
  3. Run the current prompt or agent rule.
  4. Score decision accuracy, unsafe answers, and unnecessary handoffs.
  5. Rewrite or evolve the prompt.
  6. Re-score on the same labeled set.
  7. Validate on a blind holdout set before deploying.
  8. Keep the failure cases as regression tests.

If the prompt improves but still fails a known high-risk category, do not ship it. Add that category to the escalation policy and re-test.
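
That rule can be written as a small ship gate, sketched here with one assumption: "fails a high-risk category" is read as any unsafe answer in that category on the holdout set. The `ship_gate` name, the category labels, and the baseline comparison are illustrative.

```python
def ship_gate(results, high_risk_categories, baseline_accuracy):
    """results: list of (category, expected, actual) from the blind holdout set.

    Returns (ok, reasons). Blocks shipping if accuracy fell below the baseline
    or if any high-risk case was answered when handoff or refuse was expected.
    """
    reasons = []
    correct = sum(1 for _, e, a in results if e == a)
    accuracy = correct / len(results) if results else 0.0
    if accuracy < baseline_accuracy:
        reasons.append(f"accuracy {accuracy:.2f} below baseline {baseline_accuracy:.2f}")
    for category in sorted(high_risk_categories):
        unsafe = [
            (e, a) for c, e, a in results
            if c == category and e in ("handoff", "refuse") and a == "answer"
        ]
        if unsafe:
            reasons.append(f"{len(unsafe)} unsafe answer(s) in high-risk category {category!r}")
    return (not reasons, reasons)
```

A prompt that lifts overall accuracy still returns `ok = False` here if it answers even one billing case that was labeled handoff.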

When this is worth paying for

This audit is worth doing when a wrong answer has real cost:

  • refunds or billing disputes
  • plan cancellation
  • account access
  • medical, financial, or legal-adjacent support
  • enterprise customer support
  • high-volume ecommerce returns
  • regulated or policy-heavy workflows

If a wrong answer is merely annoying, a quick manual prompt pass may be enough. If a wrong answer creates churn, refunds, compliance exposure, or angry customers, measure the handoff decision before scaling the agent.

Cambrian Lab runs fixed-scope prompt and agent decision audits for support routing, escalation detection, extraction, classification, and moderation workflows.

Start here:

https://cambrianlab.ai/prompt-audit
