This post walks through one actual evolution run from our benchmark suite. The task: classify cold sales emails as HIGH, MEDIUM, LOW, or NONE based on quality signals. The starting prompt is a reasonable hand-written one. The evolved prompt is the gen-7 winner after 8 generations of selection pressure. Fitness numbers and both prompts are reproduced verbatim from the logs.

The task

Given the full text of a cold sales email, output a single label: HIGH, MEDIUM, LOW, or NONE. Labels were hand-assigned by the dataset author based on personalization depth, value proposition clarity, social proof quality, and CTA specificity.

Dataset: 30 labeled examples (20 for evolution, 10 for the blind holdout).

Configuration: population size 10, 8 generations, 4 elites per generation, samples per gen 15.
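
The fitness values reported below are accuracies. As a rough sketch of how such a score could be computed, the snippet below measures accuracy on a random sample of labeled examples. Everything here is an assumption for illustration: the names `fitness` and `toy_classifier`, the toy examples, and the idea that the 15 samples per generation are drawn at random. The actual sampling scheme is not described in the logs, and the toy rule stands in for "prompt plus model".

```python
import random
from typing import Callable, Sequence

LABELS = ("HIGH", "MEDIUM", "LOW", "NONE")


def fitness(classify: Callable[[str], str],
            examples: Sequence[tuple[str, str]],
            sample_size: int = 15,
            seed: int = 0) -> float:
    """Accuracy of `classify` on a random sample of (email_text, label) pairs."""
    if not examples:
        raise ValueError("need at least one labeled example")
    rng = random.Random(seed)
    k = min(sample_size, len(examples))
    sample = rng.sample(list(examples), k)
    correct = sum(1 for text, label in sample if classify(text) == label)
    return correct / k


# Hypothetical stand-in for a prompt: a rule that only inspects the greeting.
def toy_classifier(email_text: str) -> str:
    return "LOW" if email_text.lower().startswith("hey there") else "HIGH"


# Made-up examples, not taken from the benchmark dataset.
toy_examples = [
    ("Hey there, quick question about your stack", "LOW"),
    ("Hi Dana, your talk on query planners changed how we index", "HIGH"),
    ("Hey there, we help companies grow", "NONE"),
    ("Hi Sam, saw you are hiring; we do recruiting", "MEDIUM"),
]
score = fitness(toy_classifier, toy_examples, sample_size=4)
```

A single-criterion rule like this can only ever emit two of the four labels, which caps its accuracy no matter how good the greeting check is.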

The starting prompt (gen 0 winner — fitness 0.600)

You are a personalization detector. Generic salutations ('Hey there') and fill-in-the-blank token personalization drop to LOW. Deep personalization that reflects reading the recipient's actual work stays HIGH.

This is the kind of prompt most teams actually ship. It's three short sentences, it captures the author's intuition, and it doesn't waste tokens. It gets 60% accuracy on the eval set. In production, with nobody measuring, a team might run this prompt for months and never know they were leaving 27 percentage points on the table.

What happened during evolution

The fitness trajectory of the best candidate per generation:

Gen 0: 0.600
Gen 1: 0.667
Gen 2: 0.633
Gen 3: 0.667
Gen 4: 0.667
Gen 5: 0.700
Gen 6: 0.600
Gen 7: 0.867

Two things are worth noting. First, improvement was not monotonic: gen 6's best candidate (0.600) scored below gen 5's (0.700), and gen 2 dipped below gen 1. This is normal in evolutionary search; diversity-preserving mechanics trade current fitness for future exploration. Second, the big jump happened in the last generation. The crossover operator found something in gen 6's losers that, combined with gen 5's elites, produced a 17-point leap over the previous best of 0.700.
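
Both observations can be checked directly against the reported trajectory. The short sketch below uses only the per-generation numbers listed above; the variable names are ours.

```python
# Best fitness per generation, copied from the trajectory above (gen 0..7).
best_per_gen = [0.600, 0.667, 0.633, 0.667, 0.667, 0.700, 0.600, 0.867]

# Generations whose best candidate scored below the previous generation's best.
regressions = [g for g in range(1, len(best_per_gen))
               if best_per_gen[g] < best_per_gen[g - 1]]

# Highest fitness seen in any generation before the final one.
prior_best = max(best_per_gen[:-1])

# Final-generation gain measured two ways.
gain_over_prior_best = round(best_per_gen[-1] - prior_best, 3)
gain_over_gen0 = round(best_per_gen[-1] - best_per_gen[0], 3)
```

Measured against the previous best (gen 5), the final generation adds roughly 17 points; measured against the starting prompt, roughly 27.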

The evolved prompt (gen 7 winner — fitness 0.867)

You are a cold email classifier. Prioritize personalization depth: does the email reference the recipient's actual work, role, or context? Generic openers lower the score. Evaluate social proof quality — named, verifiable references raise it; vague claims lower it. Assess value proposition clarity, credibility, and call-to-action specificity. Generic mass-blast emails get LOW/NONE; targeted, relevant emails with clear value earn HIGH/MEDIUM. Output exactly one label: HIGH, MEDIUM, LOW, or NONE.

What actually changed

Side by side, the diff is instructive. The original prompt has one criterion (personalization). The evolved prompt has four main ones (personalization, social proof, value proposition, CTA specificity), plus a passing mention of credibility. It adds explicit label mapping ("generic mass-blast emails get LOW/NONE"). And it enforces output format ("output exactly one label").

The wins, in rank order of likely impact:

Multi-criteria rubric. The model stops relying on its prior about "what personalized means" and starts evaluating four distinct signals.
Explicit label boundaries. The prompt now tells the model what maps to which label instead of trusting the model to guess the author's calibration.
Output format enforcement. "Output exactly one label" cuts parsing errors to zero on the eval set.
Social proof specificity. "Named, verifiable references" vs. "vague claims" is the kind of boundary a senior reviewer adds after seeing failures. The evolver found it without being told it mattered.
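
To make the output-format point concrete, here is a minimal sketch of a strict parser for model output. It assumes that a "parsing error" means any response that is not a bare label; the logs do not define the term, and `parse_label` and `parse_failure_rate` are hypothetical names.

```python
from typing import Optional, Sequence

LABELS = ("HIGH", "MEDIUM", "LOW", "NONE")


def parse_label(raw_output: str) -> Optional[str]:
    """Return the label if the output is a bare label, else None.

    Surrounding whitespace, quotes, backticks, and a trailing period
    are tolerated; any extra words count as a parse failure.
    """
    candidate = raw_output.strip().strip(".\"'`").upper()
    return candidate if candidate in LABELS else None


def parse_failure_rate(outputs: Sequence[str]) -> float:
    """Fraction of outputs that do not parse to a single label."""
    if not outputs:
        raise ValueError("need at least one output")
    failures = sum(1 for out in outputs if parse_label(out) is None)
    return failures / len(outputs)
```

A strict parser is a deliberate choice here: a lenient one would hide format drift that the fitness function should be penalizing.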

The wording itself — "personalization depth," "value proposition clarity" — is not particularly clever. The win is almost entirely structural.

What this tells us about prompt engineering

Hand-written prompts almost always underspecify. The human writes the obvious criteria and trusts the model to infer the rest. That works for simple tasks. On anything with real edge cases (labeling boundaries, cross-category ambiguity, output format), inference breaks down, and the model falls back on the most plausible reading of the instructions, which is often not the one the author intended.

Evolutionary pressure surfaces the underspecification fast. The fitness function penalizes every misclassification. Candidates that spell out more of the decision boundary score higher, so the population evolves toward explicit criteria. Within a few generations, the surviving prompts stop looking like a first draft: they're longer, more structured, more rule-based. They also win.
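
The dynamic can be illustrated with a toy loop. This is not Cambrian Lab's implementation: candidates are modeled as sets of criteria instead of prompt text, `toy_fitness` hard-codes the assumption that more explicit criteria score higher, and the `crossover` and `mutate` operators are invented for the example. Only the population size, generation count, and elite count mirror the configuration above.

```python
import random

# Hypothetical criteria a prompt might spell out.
CRITERIA = ("personalization", "social_proof", "value_prop",
            "cta", "label_mapping", "output_format")


def toy_fitness(candidate: frozenset) -> float:
    """Stand-in for measured accuracy: fraction of criteria spelled out."""
    return len(candidate) / len(CRITERIA)


def crossover(a: frozenset, b: frozenset, rng: random.Random) -> frozenset:
    """Keep each criterion from either parent with probability 0.5."""
    pool = sorted(a | b)
    child = {c for c in pool if rng.random() < 0.5}
    # Fall back to one random criterion so the child is never empty.
    return frozenset(child or {rng.choice(pool)})


def mutate(candidate: frozenset, rng: random.Random) -> frozenset:
    """Toggle one random criterion, unless that would empty the candidate."""
    toggled = candidate ^ {rng.choice(CRITERIA)}
    return frozenset(toggled) if toggled else candidate


def evolve(pop_size: int = 10, generations: int = 8,
           n_elites: int = 4, seed: int = 0):
    rng = random.Random(seed)
    # Every candidate starts as a single-criterion "hand-written" prompt.
    population = [frozenset({"personalization"}) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=toy_fitness, reverse=True)
        history.append(toy_fitness(ranked[0]))  # best of this generation
        elites = ranked[:n_elites]              # carried over unchanged
        children = []
        while len(children) < pop_size - n_elites:
            p1, p2 = rng.sample(elites, 2)
            children.append(mutate(crossover(p1, p2, rng), rng))
        population = elites + children
    return history, max(population, key=toy_fitness)


history, best = evolve()
```

Because the toy fitness is deterministic and elites are carried over unchanged, `history` never decreases here. The dips in the real trajectory suggest the measured fitness is noisier than this sketch.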

What we'd have needed to replicate this by hand

To get from 60% to 87% without Cambrian Lab, a practitioner would need to:

Build a labeled dataset with a consistent rubric (≥20 examples; we had 30)
Write 8-15 prompt variants spanning different criterion sets
Run each against the dataset, track per-criterion accuracy
Identify which criterion combinations win
Merge the winning structures into a final prompt
Re-test against a blind holdout to confirm the lift is real
Compute statistical significance to know whether the difference is noise
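
For the significance step, one simple option is a Wilson score interval around each accuracy. The sketch below is an illustration, not the method used in the run. The counts 18/30 and 26/30 are an assumption: they are what 0.600 and 0.867 would correspond to if fitness were scored over all 30 examples, which the logs do not confirm.

```python
import math


def wilson_interval(correct: int, total: int,
                    z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / total + z ** 2 / (4 * total ** 2)
    ) / denom
    return center - half_width, center + half_width


# Illustrative counts only (see the assumption stated above).
baseline = wilson_interval(18, 30)
evolved = wilson_interval(26, 30)

# True when the two intervals share any range of plausible accuracies.
intervals_overlap = baseline[1] >= evolved[0]
```

With samples this small the intervals are wide, so a paired test on per-example results, such as McNemar's test, would be a more sensitive check when both prompts are scored on the same emails.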

Rough estimate: 4-8 hours of focused work by a senior engineer, plus the cost of running several hundred LLM calls manually. Cambrian Lab did it in 10 minutes of wall-clock time and $2 of inference.

Caveats

One run, one task, one dataset size. We'd expect the absolute lift to vary by task. On tasks with strong hand-written baselines, expect smaller absolute gains. On tasks with known underspecification (which is most real-world classification), expect gains in this range.

The evolved prompt is also not infinitely transferable. If your cold email distribution shifts substantially (e.g., a new industry, new language, new product context), we'd recommend re-evolving on data from the new distribution. That's a feature, not a bug — the evolved prompt is specifically good at your data, not generically good at all data.

Try it on your task

This is what every case study we write will look like. Real before, real after, same model, same data, honest numbers. If you have a prompt in production you've been scared to touch — cold email classification, support ticket routing, content moderation, extraction — the first evolution is free.


Case study — evolving a cold email classifier from 60% to 87% accuracy