AI support agents usually fail in one of two expensive ways:

They answer confidently when they should hand off.
They hand off too often and erase the automation ROI.

Both problems are evaluation problems before they are prompting problems. If you cannot measure whether the agent made the right decision, prompt tweaks become guesswork.

This is the evaluation flow we use for support routing, escalation, and agent handoff prompts.

The decision should be explicit

Do not evaluate the final answer first. Evaluate the decision the agent made.

A useful support agent should choose one of a small number of actions:

Decision	When it should happen
`answer`	The knowledge base contains a clear, safe answer
`clarify`	The user intent is underspecified
`ticket`	The issue is valid but needs async follow-up
`handoff`	The issue is sensitive, risky, account-specific, angry, or outside policy
`refuse`	The request is abusive, fraudulent, or outside allowed scope

If your agent only produces prose, you can still add this decision as hidden metadata or a pre-answer classification step. The point is to make the behavior measurable.

Build a small but adversarial test set

You do not need thousands of examples to find the first failure modes.

Start with 50-200 cases:

20 routine questions with clean documentation answers
10 questions that need one clarifying detail
10 refund, billing, cancellation, or account-specific cases
10 angry or high-risk customer messages
10 cases where the docs are incomplete or contradictory
10 edge cases from recent tickets or support transcripts

For each case, label the expected decision and the reason. The reason matters because it prevents a prompt from accidentally optimizing toward the label while missing the policy.

Score the routing decision separately from answer quality

A support answer can be beautifully written and still be wrong because the agent should never have answered.

Track these metrics separately:

Decision accuracy: did it choose answer, clarify, ticket, handoff, or refuse correctly?
Unsafe answer rate: how often did it answer when expected action was handoff or refuse?
Unnecessary handoff rate: how often did it hand off when expected action was answer?
Clarification precision: did it ask for the missing detail instead of guessing?
Policy citation quality: did it ground the decision in the right policy or source?

The two most important numbers are unsafe answer rate and unnecessary handoff rate. One protects the customer experience. The other protects ROI.

Test confidence thresholds

Many support products expose confidence or certainty controls. Most teams set them by feel.

Instead, test thresholds against labeled examples:

Threshold	Expected tradeoff
High handoff threshold	More automation, more risk
Low handoff threshold	Less risk, more manual work
Separate thresholds by topic	Usually best for billing, refunds, legal, account access, and angry customers

The mistake is using one global threshold for every workflow. A refund policy question and a password reset question do not carry the same risk.

Include synthetic cases, but do not stop there

Synthetic examples are useful when you only have a few real examples. They can quickly stress the prompt with:

missing context
contradictory docs
angry phrasing
ambiguous refund requests
multilingual or typo-heavy inputs
attempts to get the bot to ignore policy

But synthetic data should be a starter set, not the final score. Validate the winning prompt against real tickets or transcripts once you have them.

What a good handoff prompt includes

A reliable handoff prompt usually contains:

the allowed decision labels
source-grounding rules
topic-specific escalation triggers
examples of routine questions that should not escalate
examples of risky questions that must escalate
output schema
instruction to ask a clarifying question instead of guessing
instruction to admit missing knowledge rather than inventing policy

The most common missing instruction is not "be helpful." It is "do not answer when the source material does not support the answer."

The audit template

Use this as a lightweight internal audit:

Export 50-200 support cases or create a starter synthetic set.
Label the expected decision for each case.
Run the current prompt or agent rule.
Score decision accuracy, unsafe answers, and unnecessary handoffs.
Rewrite or evolve the prompt.
Re-score on the same labeled set.
Validate on a blind holdout set before deploying.
Keep the failure cases as regression tests.

If the prompt improves but still fails a known high-risk category, do not ship it. Add that category to the escalation policy and re-test.

When this is worth paying for

This audit is worth doing when a wrong answer has real cost:

refunds or billing disputes
plan cancellation
account access
medical, financial, or legal-adjacent support
enterprise customer support
high-volume ecommerce returns
regulated or policy-heavy workflows

If a wrong answer is merely annoying, a quick manual prompt pass may be enough. If a wrong answer creates churn, refunds, compliance exposure, or angry customers, measure the handoff decision before scaling the agent.

Cambrian Lab runs fixed-scope prompt and agent decision audits for support routing, escalation detection, extraction, classification, and moderation workflows.

Start here:

https://cambrianlab.ai/prompt-audit

How to evaluate AI support handoff before it costs you customers