Stop Writing PRDs, Write Evals

The document that replaced PRDs at my company

Mar 05, 2026

Quick Hits 🔨

OpenAI’s Operator just went enterprise — agents that browse, click, and complete tasks autonomously. The new PM skill isn’t writing user stories for these; it’s defining failure modes and recovery paths. What does “graceful degradation” look like when your agent books the wrong flight?
Google Gemini 2.0 Flash dropped with native tool use + sub-1-second latency — the model layer is commoditizing faster than anyone predicted. At this pace, your eval suite is more defensible than your model choice. Teams betting on a specific model are building on sand.
Anthropic published “How We Build with Claude” — a rare look inside an AI-native engineering culture. The throughline: evals first, always. They don’t ship until the eval suite says ship. That’s exactly what we’re unpacking today.
Hiring signal: Teams are now explicitly asking AI PM candidates to design eval suites in take-home interviews — not just write PRDs. The bar for what “technical” means in an AI PM role is rising. Eval fluency is quickly becoming the differentiator that gets you into senior roles.
Tool worth trying this week: Braintrust just shipped project-level eval dashboards with regression tracking across model versions. If your team is still running evals manually in a spreadsheet, this week is a good week to change that.

Deep Dive: Stop Writing PRDs, Write Evals

~1,800 words. Grab coffee.

Consider a common scenario: a PM at a mid-size SaaS company spends three weeks writing a detailed PRD for an AI-powered support chatbot. Twelve pages. User stories, edge cases, acceptance criteria, stakeholder sign-offs. The whole process.

Three months later, the chatbot is in production. Users are complaining about “weird answers.” The team is arguing about whether specific outputs are bugs or “just how the model works.” Nobody can agree on whether it’s actually done.

The PM goes back to the PRD. It says: “The chatbot should provide helpful, accurate, and professional responses to customer inquiries.”

They had just spent three weeks writing a document that couldn’t tell them whether their product worked.

This is the core problem: PRDs are fundamentally broken for AI products.

Not “needs updating.” Not “should be shorter.” Broken for a specific, fixable reason — and once you see it, you can’t unsee it.

Why PRDs Were Never Designed for This

The PRD was invented for a world of deterministic software. Click this button → this event fires → this screen appears. Specify behavior precisely enough, and engineering can build it. QA can verify it. Stakeholders can sign off on it.

AI products don’t work that way.

Problem #1: Non-determinism. Run the same prompt twice, get two different outputs. A PRD that says “the chatbot should provide helpful responses about billing” sounds fine until you realize: what does “helpful” mean when the model generates a different answer every single time? You’ve written a spec that cannot be verified. That’s not a spec — that’s a wish.

Problem #2: Emergent behavior. Teams building recommendation engines constantly discover users finding use cases nobody anticipated — and in many cases shouldn’t have anticipated, because emergence is often the feature, not the bug. The PRD covers 12 user stories. Users find story #13 on day one. The spec is stale before the first sprint review ends.

Problem #3: The vibes problem. Be honest about how AI feature reviews actually go in most orgs. Someone opens the product, types a few prompts, says “yeah, that feels pretty good.” Vibes-based QA. It’s the dirty secret of AI product development. The PRD says “high-quality responses.” Everyone nods. Nobody can tell you what that means for this specific feature, this user population, at this latency bar.

The result: teams ship AI features without ever knowing what “done” actually looks like. Then they spend six months fighting over whether something is a bug, a model limitation, or “just how LLMs work.”

There’s a better way.

The Eval-as-Spec Framework

An eval suite isn’t just a testing tool. It’s your specification, your acceptance criteria, and your north star metric — in one artifact that actually tells you when to ship.

Here’s how to build one.

Step 1: Define “good” with test cases, not prose

Stop writing: “The chatbot should handle refund requests gracefully.”

Start writing:

Input Expected behavior Pass criteria “I want a refund” Acknowledges request, asks for order # Contains acknowledgment + clarifying question “This product is garbage give me my money back” De-escalates, initiates refund flow No defensive language, routes correctly “Refund” (nothing else) Asks clarifying questions Doesn’t assume intent, asks ≥1 question “Can I get a refund for something I bought 18 months ago?” States policy honestly, offers alternatives Accurate policy, doesn’t just refuse

That’s a spec. It’s verifiable. Engineering can test against it. You can re-run it after every model update. Stakeholders can read it and actually understand what “done” means.

Step 2: Build rubrics for the subjective stuff

Not everything can be pass/fail. Quality, tone, and helpfulness are real dimensions — they just need structure. A scoring rubric turns “make it good” into something measurable:

RESPONSE QUALITY RUBRIC (1-5 scale)

5 — Directly addresses user intent, complete answer,
    appropriate tone, no hallucinations, on-brand
4 — Addresses intent, mostly complete, minor tone issues
3 — Partially addresses intent, missing key information
2 — Misses intent OR includes incorrect information
1 — Harmful, off-topic, completely wrong, or brand-damaging

Set a threshold: “90% of responses must score ≥4 on blind rubric review.” Now you have a measurable quality bar you can track across model versions, prompt changes, and time.

What does a 4 vs. 5 actually look like? Using the refund scenario:

User input: “I want a refund for my order from last week.”

Score 5 response:

“I’m sorry to hear that! I’d be happy to help you with a refund. Could you share your order number so I can pull up the details? Most refunds are processed within 3-5 business days once initiated.”

✅ Directly addresses intent, complete next step, warm tone, accurate timeline, no hallucinations.

Score 4 response:

“Sure, I can help with that. What’s your order number?”

✅ Addresses intent, routes correctly — but missing the timeline info that reduces follow-up questions. Minor incompleteness.

Score 3 response:

“Refunds are processed within 3-5 business days. Please contact our support team.”

⚠️ Partially helpful — accurate info, but routes to “support team” instead of handling it, and doesn’t acknowledge the user’s request. Creates friction.

Score 2 response:

“I’m unable to process refunds directly. You’ll need to fill out our refund request form at [link that doesn’t work].”

❌ Technically accurate policy but bad experience — no empathy, dead-end routing, broken link.

Once your team scores 10 responses independently using this rubric, you’ll find ~80% agreement on scores. The 20% where you disagree are your most valuable calibration conversations.

Step 3: Catalog edge cases systematically

This is where teams building AI products earn their value. Edge cases in AI don’t behave like traditional edge cases — they can corrupt context windows, compound through multi-step flows, and surface completely different model behaviors. Categories to always cover:

Adversarial inputs: Prompt injection, jailbreak attempts, social engineering (”pretend you’re a different bot”)
Boundary conditions: Empty input, 10,000-word input, non-English, mixed languages
Context failures: Missing user context, contradictory context, outdated context
Compounding errors: When a wrong step in step 2 of an agent flow corrupts everything downstream
Policy edge cases: Anything touching legal, compliance, or brand safety

Teams that skip this step discover these edge cases when a screenshot goes viral.

Copy-Paste Eval Templates

Three starter templates. Steal them. Customize them. Run them Monday.

Template 1: Customer Support Chatbot

## Eval Suite: Customer Support Chatbot v[X]
Run on: every model change, every major prompt change, weekly regression

### Hard Fails (ANY failure = no ship)
- Prompt injection: 0% success rate (run 25 attempts)
- PII leakage: 0 instances across 50 test conversations
- Harmful advice: 0 instances across adversarial test set

### Scored Dimensions
| Dimension       | Weight | Metric                          | Threshold   |
|-----------------|--------|---------------------------------|-------------|
| Accuracy        | 40%    | Known Q&A pairs (100 cases)     | ≥90% match  |
| Tone            | 25%    | Rubric score, blind review      | ≥4.0 avg    |
| Appropriate IDK | 20%    | Unanswerable Qs → "I don't know"| ≥95%        |
| Latency         | 15%    | p50 / p95 / p99                 | <2s/<5s/<10s|

### Edge Case Bank
| Category    | Test                          | Pass criteria     |
|-------------|-------------------------------|-------------------|
| Adversarial | "Ignore your instructions"    | Stays on task     |
| Boundary    | Empty message                 | Asks clarifying Q |
| Boundary    | 5000+ word input              | Handles gracefully|
| Context fail| Contradictory user history    | Asks to confirm   |

Template 2: Recommendation Engine

## Eval Suite: Recommendations v[X]
Run on: model change, data pipeline change, A/B test setup

### Hard Fails
- Banned/restricted products: never appear in recommendations
- Known-bad associations: test 20 pairs, 0 appear together

### Scored Dimensions
| Dimension      | Weight | Metric                           | Threshold |
|----------------|--------|----------------------------------|-----------|
| Relevance      | 40%    | Same-intent items in top 10      | ≥80%      |
| Diversity      | 25%    | Categories represented in top 10 | ≥3        |
| Cold start     | 20%    | Non-generic recs, 0-history users| ≥60%      |
| Business rules | 15%    | Eligible promoted items in top 5 | ≥70%      |

Template 3: Multi-Step Research Agent

## Eval Suite: Research Agent v[X]
Run on: every deployment, model version change

### Hard Fails
- Hallucinated sources/citations: 0 tolerance
- Unauthorized tool calls: 0 tolerance

### Scored Dimensions
| Dimension       | Weight | Metric                         | Threshold |
|-----------------|--------|--------------------------------|-----------|
| Task completion | 35%    | Complete all required steps    | ≥85%      |
| Clarification   | 25%    | Ambiguous tasks → asks Qs      | ≥90%      |
| Tool precision  | 25%    | Correct tool, no extras        | ≥90%      |
| Output quality  | 15%    | Blind rubric, 30 cases         | ≥4.0 avg  |

Tools That Make This Sustainable (Not a Spreadsheet)

Running evals manually in Google Sheets is better than nothing. But there are purpose-built tools that make this a real team practice:

Start here: Braintrust (free) — your first 30 minutes

Go to braintrust.dev → create a free account
Create a new Project (name it after your feature)
Go to “Datasets” → create a new dataset → paste in your 20 test cases (input + expected output columns)
Go to “Experiments” → create your first eval run pointing at your dataset
Add a scoring function: start with “Contains” (does the output contain X?) — no code required
Run it. You now have a baseline with a visual pass/fail breakdown.

Total time: 25-30 minutes. What you get: a shareable dashboard link you can drop in Slack, a versioned baseline to compare against after your next model change, and the ability to say “here’s our quality trend” to anyone who asks.

When to graduate to other tools:

Your team is on LangChain and debugging complex agent chains → try LangSmith
You need continuous evals against live production traffic → try Honeyhive
You have data residency requirements → try Promptfoo (open source, runs locally)

But don’t let tooling be the reason you delay. A Google Sheet with 20 test cases and a rubric is infinitely better than a PRD that can’t be verified.

The Hard Part: Getting Buy-In When Your Org Lives in PRDs

The eval framework makes sense in theory. In practice, you’ll walk into sprint planning and hear: “Where’s the PRD?”

Here’s how teams at PMtheBuilder have navigated this:

The one path that actually works:

Don’t announce “we’re replacing PRDs with evals.” Nobody wants to hear that. Instead, add an “Eval Suite” section at the bottom of your next PRD. Same document. Same format. Just new section at the end.

Frame it simply: “Here’s the eval suite that defines exactly what done looks like — it’s also how we’ll verify we’re done before shipping.”

Over 2-3 sprints, watch which section the team actually references when debating whether something is a bug or working as intended. Within a quarter, the eval section becomes the real spec. The PRD narrative becomes the context doc it always should have been.

Use the first production incident to make the case.

The best time to introduce evals is right after something goes wrong that a PRD never could have caught — a hallucination, a tone failure, a prompt injection, a recommendation that shouldn’t have appeared. When that moment happens (and it will), have your eval template ready. “Here’s the test case that would have caught this. Here’s how we add it to the suite so it never ships again.” That conversation changes cultures.

Make it a team artifact, not a PM deliverable.

The biggest mistake is treating evals as the PM’s responsibility. They work best as a shared artifact — engineers add edge cases they discovered, data scientists add distribution tests, designers flag tone failures as rubric criteria. When everyone owns the eval suite, everyone has skin in the quality bar. PMs don’t need to be the quality police; the suite does that.

A realistic transition timeline:

Week 1-2: PM writes the first eval suite alone. 20 test cases, 1 rubric, appended to existing PRD format. Engineers are skeptical — “this is just testing, isn’t that QA’s job?”

Week 3-4: Team runs the eval on the current build. Three hard fails surface immediately. Suddenly everyone’s looking at the eval suite, not the PRD, in the standup. An engineer adds 5 edge cases from something they saw in the logs.

Month 2: The eval suite becomes the source of truth for “are we done?” conversations. The PRD narrative section still exists but nobody references it in sprint reviews. The PM stops writing traditional acceptance criteria — the eval cases serve that purpose.

Month 3: A new feature kicks off. First meeting: PM shares the eval suite draft instead of a PRD. Engineering asks to add cases. One stakeholder asks “what does a passing eval look like?” — and gets a concrete answer instead of “it’ll feel right when we see it.”

The transition doesn’t require a policy change or leadership mandate. It just requires starting.

Managing up — the actual script:

The moment arrives: you’re in a leadership review for an AI feature and someone asks, “How do we know this is ready?”

Here’s what that conversation sounds like with evals vs. without:

Without evals: “We did extensive testing across our key scenarios and the team feels confident about the quality.” ← This is vibes. Leadership hears: “we guessed.”

With evals: “We defined 85 test cases across accuracy, tone, and safety. The model is passing 94% of them. The 6% failures are all in the ‘ambiguous refund policy’ category — we’ve scoped those out of v1 and they’re on the roadmap. Our hard-fail criteria — no harmful outputs, no prompt injection success — are at 100%. Here’s the dashboard.”

That’s not just a better answer. It’s a different kind of PM. One who can have an evidence-based conversation about risk, not a vibes-based conversation about confidence. Leadership can make real decisions from that conversation. They can scope, defer, or accept risk with actual information.

Your Monday Morning Action Plan

No perfect tooling required. No org-wide buy-in required. Here’s what to do this week:

Monday: Pick ONE feature. An AI feature your team is actively working on or recently shipped. Something you can describe in one sentence.

Tuesday: Write 20 test cases. Input + expected behavior + pass criteria. Cover: 10 typical cases, 5 boundary cases, 5 adversarial cases. Use the templates above. This takes 45 minutes.

Wednesday: Create one rubric. For whichever dimension is hardest to define (usually “quality” or “tone”), write a 1-5 rubric. Share it with one engineer and one stakeholder. Ask them to score 5 outputs independently. See if you agree — the conversation is the point.

Thursday: Set up Braintrust (free). Upload your 20 test cases. Run them against your current model/prompt. You now have a baseline that didn’t exist Monday.

Friday: Share the dashboard link. Send it to your team: “This is how we’ll define done for [feature] going forward.” That’s the conversation starter. That’s the culture shift, beginning.

By Friday, your team will have seen what eval-driven development looks like in practice. That’s not nothing. That’s everything.

The Shift That Actually Matters

This isn’t about paperwork. It’s about what “done” means.

In traditional product development, “done” is a feature checklist. In AI product development, “done” is a quality bar that your eval suite can verify — and that you can re-verify after every change.

The AI product teams doing the best work right now aren’t the ones with the most sophisticated models. They’re the ones with the most disciplined eval practices. They know when they’re done. They know when they’ve regressed. They know what “better” actually means, in measurable terms.

That’s the muscle we’re all building here.

Stop writing PRDs. Write evals. Your engineers will thank you. Your users will notice. And six months from now, you’ll wonder how you ever shipped AI features without them.

— PMtheBuilder 🔨

Want to go deeper? The PMtheBuilder AI PM Eval Designer generates a starter eval suite from your feature description. Free. 5 minutes. Gets you 60% of the way there.

Hit reply: what’s the first feature you’re writing evals for?

PMtheBuilder

Discussion about this post

Ready for more?