If you have an LLM classifier in production, you have a problem you probably don't talk about: you don't know whether your prompt is good.
You wrote it six months ago. It seemed to work in the first 10 test cases. It's been running against real traffic ever since. Accuracy is a number you've never measured end-to-end. And changing the prompt terrifies you, because "better" is a vibe check — there's no statistical way to tell whether the new version is actually an improvement or just feels different today.
That's the problem Cambrian Lab solves.
What it does
You bring a prompt and some labeled examples (as few as 5; we'll synthesize 50 more). We run an evolutionary search over prompt variants:
- Generation 0: 15 candidate prompts seeded from diverse reasoning angles
- Each candidate is scored against your labeled data
- Top performers ("elites") survive to the next generation
- A frontier LLM acts as the crossover/mutation operator, breeding elites into new candidates
- Repeat for up to 25 generations (~7,200 total evaluations, 10–15 minutes)
- The final winner is re-evaluated on a blind holdout set, reported with a 95% confidence interval
You get back the evolved prompt as plain text, along with the full lineage of how it got there.
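The loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our production harness: `classify` and `breed` are hypothetical callables standing in for the model under test and the frontier-LLM crossover/mutation operator, the elite count is an arbitrary choice, and the toy stand-ins at the bottom use keyword matching so the example runs without any API calls. The holdout re-evaluation step is left out.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def accuracy(prompt: str, data: List[Example],
             classify: Callable[[str, str], str]) -> float:
    """Fraction of examples where classify(prompt, text) matches the gold label."""
    hits = sum(classify(prompt, text) == label for text, label in data)
    return hits / len(data)


def evolve(seeds, data, classify, breed, generations=25, n_elites=2, rng_seed=0):
    """Score every candidate, keep the elites, breed replacements, repeat.

    Assumes n_elites >= 2 so that two distinct parents can be sampled.
    Returns the best prompt found and the per-generation lineage.
    """
    rng = random.Random(rng_seed)
    population = list(seeds)
    lineage = []  # (generation, best score, best prompt)
    for gen in range(generations):
        scored = sorted(
            ((accuracy(p, data, classify), p) for p in population),
            key=lambda pair: pair[0],
            reverse=True,
        )
        best_score, best_prompt = scored[0]
        lineage.append((gen, best_score, best_prompt))
        if best_score == 1.0:  # nothing left to gain on this data
            break
        elites = [p for _, p in scored[:n_elites]]
        children = []
        while len(children) < len(population) - n_elites:
            parent_a, parent_b = rng.sample(elites, 2)
            children.append(breed(parent_a, parent_b))
        population = elites + children
    return lineage[-1][2], lineage


# Toy stand-ins (hypothetical): a "prompt" is a bag of trigger keywords.
def toy_classify(prompt: str, text: str) -> str:
    return "urgent" if set(prompt.split()) & set(text.split()) else "normal"


def toy_breed(a: str, b: str) -> str:
    return " ".join(sorted(set(a.split()) | set(b.split())))


data = [
    ("server down asap", "urgent"),
    ("deadline today", "urgent"),
    ("weekly newsletter", "normal"),
    ("lunch menu", "normal"),
]
best, lineage = evolve(["asap", "deadline", "newsletter"], data,
                       toy_classify, toy_breed)
print(best, lineage)
```

Because elites are carried over unchanged, the best score on the scoring data never goes down between generations. That is also why the final number has to come from a holdout the search never saw.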
What the numbers look like
Across our internal benchmark suite (11 tasks: classification, routing, moderation, extraction, triage), the average absolute accuracy lift is roughly +19 points: starting baselines average about 71%, and evolved prompts average about 89% (figures rounded).
Some specific wins:
- Cold sales email quality classification: 60% → 87% (+26.7 pts)
- YouTube title CTR prediction: 53% → 87% (+34 pts, peak lift)
- Support ticket priority routing: 89% → 100% on holdout
- Brand safety screening: 100% maintained across 8 generations
The wins are not uniform. On tasks with already-strong baselines (hook strength detection at 97% hand-written), lift is modest (+2.5 pts). On tasks where hand-written prompts underspecify the problem (sales email, sentiment nuance, code review severity), lift is large.
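A headline number like "100% on holdout" only means something next to its interval. As a rough sketch of how a 95% confidence interval on holdout accuracy can be computed, here is the Wilson score interval. Both the choice of Wilson and the holdout size of 30 are assumptions made for illustration; neither is a description of a specific benchmark run.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical holdout: 30 examples, all classified correctly.
low, high = wilson_interval(30, 30)
print(f"{low:.3f} to {high:.3f}")
```

With a perfect score the lower bound reduces to n / (n + z²), so on 30 examples the interval still reaches down to about 89%. A small holdout cannot distinguish "perfect" from "very good".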
Why this works — what we learned
The interesting finding is what the evolver discovers that hand-written prompts miss. Not vocabulary tweaks. Structural rules. Cascading conditionals. Label boundary definitions. Tie-breakers for ambiguous cases. Output format enforcement.
A hand-written prompt for email priority classification looks like:
"Classify this email as urgent, normal, or low priority."
An evolved version of the same prompt on the same data (the generation 6 winner):
"Classify this email. URGENT if: sender in {VIP_LIST} OR contains deadline <24h OR subject matches /RE:|FWD:/ with thread age >3d. LOW if: mass-marketing patterns OR no specific ask OR auto-generated. Otherwise NORMAL. Output exactly one label."
Nobody wrote those rules. The evolutionary search found them by running the prompt against labeled data and keeping what worked. The evolved prompt reads like a senior ops person wrote it, because selection pressure rewards exactly what that person would know: the rules that actually work on this data, in this domain, for this task.
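To make the structure of that evolved prompt concrete, here is the same decision cascade written as ordinary code. This is only an illustration: in practice the model reads the rules in natural language and infers the signals from the email itself. The `Email` fields and the VIP list are hypothetical names introduced for this sketch.

```python
import re
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Email:
    # Hypothetical pre-extracted signals; a real model infers these from raw text.
    sender: str
    subject: str
    hours_to_deadline: Optional[float]  # None when no deadline is mentioned
    thread_age_days: float
    is_mass_marketing: bool
    has_specific_ask: bool
    is_auto_generated: bool


THREAD_PREFIX = re.compile(r"RE:|FWD:")


def classify(email: Email, vip_list: Set[str]) -> str:
    """Apply the rules in the order the prompt states them."""
    has_near_deadline = (
        email.hours_to_deadline is not None and email.hours_to_deadline < 24
    )
    is_stale_thread = (
        bool(THREAD_PREFIX.search(email.subject)) and email.thread_age_days > 3
    )
    if email.sender in vip_list or has_near_deadline or is_stale_thread:
        return "URGENT"
    if email.is_mass_marketing or not email.has_specific_ask or email.is_auto_generated:
        return "LOW"
    return "NORMAL"


vips = {"ceo@example.com"}
newsletter_from_vip = Email("ceo@example.com", "Monthly update", None, 0.0,
                            True, False, False)
plain_request = Email("peer@example.com", "Quick question", None, 0.0,
                      False, True, False)
print(classify(newsletter_from_vip, vips), classify(plain_request, vips))
```

The ordering is the tie-breaker: an email that matches both an URGENT rule and a LOW rule resolves to URGENT because that branch is checked first. A one-line "classify as urgent, normal, or low" prompt leaves that choice to the model's prior.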
This is the pattern behind every win. Hand-written prompts trust the model's prior. Evolved prompts teach the model what the right answer looks like in your specific context.
The synthetic data wedge
The biggest reason LLM features die in production is that nobody has labeled data. Labeling 200 emails by hand to validate a prompt is work nobody wants to do, so teams skip it and ship by feel.
Cambrian Lab removes the step. Give us 5 labeled examples. Our synthesis pipeline generates 50 more that match the distribution, with diversity across edge cases. We run evolution on the full set of 55, and you validate on whatever small real holdout you can spare.
Performance with 5 real + 50 synthetic is meaningfully worse than with 200 real labels — call it 60–75% of the lift. It's still enough to win in production, and it eliminates the adoption blocker for teams that don't have a data science workflow.
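One practical detail when mixing real and synthetic examples is keeping the real holdout out of the scoring set, since a synthesized example can end up as a near-copy of a real one. The sketch below shows one simple way to guard against that, assuming exact match after whitespace and case normalization. The function name and the matching rule are illustrative assumptions, not a description of our pipeline.

```python
from typing import List, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def build_sets(real_seed: List[Example], synthetic: List[Example],
               real_holdout: List[Example]):
    """Return (scoring set, holdout) with holdout texts removed from scoring."""
    holdout_keys = {normalize(text) for text, _ in real_holdout}
    scoring = [
        (text, label)
        for text, label in list(real_seed) + list(synthetic)
        if normalize(text) not in holdout_keys
    ]
    return scoring, list(real_holdout)


real_seed = [("Server is down", "urgent"), ("Lunch menu", "low")]
synthetic = [
    ("Database outage now", "urgent"),
    ("invoice   OVERDUE", "urgent"),  # collides with the holdout example below
    ("Weekly digest", "low"),
]
real_holdout = [("Invoice overdue", "urgent")]

scoring, holdout = build_sets(real_seed, synthetic, real_holdout)
print(len(scoring), len(holdout))
```

Exact matching only catches the obvious collisions. Paraphrased duplicates would need a fuzzier comparison, which is outside the scope of this sketch.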
What's different from existing tools
Honest differentiation:
vs. hand-writing prompts with Claude's help — same starting point, but we run a population-based search against your labeled data with a scoring harness, instead of a single-threaded "does this feel better?" loop.
vs. LangSmith / PromptLayer / Braintrust — those are observability and evaluation platforms. You run the tests, you make the changes, they surface the data. We run the search itself.
vs. DSPy / GEPA — research-grade frameworks that optimize prompts through bootstrapped few-shot or genetic-Pareto search. Different search strategy (we do LLM-driven crossover/mutation) and different workflow (we're a web upload, not a Python framework). If you're already fluent in DSPy, we're not going to unseat you today — we're for the teams who aren't.
vs. built-in provider prompt tools — the tools inside Claude Workbench and the OpenAI Playground help you iterate. They don't score against your data, don't report statistical confidence, and don't produce a model-portable output. We do all three.
What it costs
- First evolution: free. No credit card.
- Single runs after that: $25. One-off, no subscription.
- Unlimited: $99/month. Best for teams re-evolving quarterly as data drifts.
- Enterprise: custom — get in touch.
Your evolved prompt is yours. It's plain text. You can take it anywhere and run it on any model.
What we'd like from you
Try it. Bring a prompt you've been afraid to touch and run the first evolution free. If it works, tell us. If it breaks, tell us louder. The benchmark numbers we publish are honest, but they come from our internal suite; the next batch of case studies should be yours.
If you're a prompt engineer who thinks what we're doing is impossible or trivial, we'd really like to hear from you. Either way.