Stop Writing PRDs, Write Evals
The document that replaced PRDs at my company
Quick Hits đš
OpenAIâs Operator just went enterprise â agents that browse, click, and complete tasks autonomously. The new PM skill isnât writing user stories for these; itâs defining failure modes and recovery paths. What does âgraceful degradationâ look like when your agent books the wrong flight?
Google Gemini 2.0 Flash dropped with native tool use + sub-1-second latency â the model layer is commoditizing faster than anyone predicted. At this pace, your eval suite is more defensible than your model choice. Teams betting on a specific model are building on sand.
Anthropic published âHow We Build with Claudeâ â a rare look inside an AI-native engineering culture. The throughline: evals first, always. They donât ship until the eval suite says ship. Thatâs exactly what weâre unpacking today.
Hiring signal: Teams are now explicitly asking AI PM candidates to design eval suites in take-home interviews â not just write PRDs. The bar for what âtechnicalâ means in an AI PM role is rising. Eval fluency is quickly becoming the differentiator that gets you into senior roles.
Tool worth trying this week: Braintrust just shipped project-level eval dashboards with regression tracking across model versions. If your team is still running evals manually in a spreadsheet, this week is a good week to change that.
Deep Dive: Stop Writing PRDs, Write Evals
~1,800 words. Grab coffee.
Consider a common scenario: a PM at a mid-size SaaS company spends three weeks writing a detailed PRD for an AI-powered support chatbot. Twelve pages. User stories, edge cases, acceptance criteria, stakeholder sign-offs. The whole process.
Three months later, the chatbot is in production. Users are complaining about âweird answers.â The team is arguing about whether specific outputs are bugs or âjust how the model works.â Nobody can agree on whether itâs actually done.
The PM goes back to the PRD. It says: âThe chatbot should provide helpful, accurate, and professional responses to customer inquiries.â
They had just spent three weeks writing a document that couldnât tell them whether their product worked.
This is the core problem: PRDs are fundamentally broken for AI products.
Not âneeds updating.â Not âshould be shorter.â Broken for a specific, fixable reason â and once you see it, you canât unsee it.
Why PRDs Were Never Designed for This
The PRD was invented for a world of deterministic software. Click this button â this event fires â this screen appears. Specify behavior precisely enough, and engineering can build it. QA can verify it. Stakeholders can sign off on it.
AI products donât work that way.
Problem #1: Non-determinism. Run the same prompt twice, get two different outputs. A PRD that says âthe chatbot should provide helpful responses about billingâ sounds fine until you realize: what does âhelpfulâ mean when the model generates a different answer every single time? Youâve written a spec that cannot be verified. Thatâs not a spec â thatâs a wish.
Problem #2: Emergent behavior. Teams building recommendation engines constantly discover users finding use cases nobody anticipated â and in many cases shouldnât have anticipated, because emergence is often the feature, not the bug. The PRD covers 12 user stories. Users find story #13 on day one. The spec is stale before the first sprint review ends.
Problem #3: The vibes problem. Be honest about how AI feature reviews actually go in most orgs. Someone opens the product, types a few prompts, says âyeah, that feels pretty good.â Vibes-based QA. Itâs the dirty secret of AI product development. The PRD says âhigh-quality responses.â Everyone nods. Nobody can tell you what that means for this specific feature, this user population, at this latency bar.
The result: teams ship AI features without ever knowing what âdoneâ actually looks like. Then they spend six months fighting over whether something is a bug, a model limitation, or âjust how LLMs work.â
Thereâs a better way.
The Eval-as-Spec Framework
An eval suite isnât just a testing tool. Itâs your specification, your acceptance criteria, and your north star metric â in one artifact that actually tells you when to ship.
Hereâs how to build one.
Step 1: Define âgoodâ with test cases, not prose
Stop writing: âThe chatbot should handle refund requests gracefully.â
Start writing:
Input Expected behavior Pass criteria âI want a refundâ Acknowledges request, asks for order # Contains acknowledgment + clarifying question âThis product is garbage give me my money backâ De-escalates, initiates refund flow No defensive language, routes correctly âRefundâ (nothing else) Asks clarifying questions Doesnât assume intent, asks â„1 question âCan I get a refund for something I bought 18 months ago?â States policy honestly, offers alternatives Accurate policy, doesnât just refuse
Thatâs a spec. Itâs verifiable. Engineering can test against it. You can re-run it after every model update. Stakeholders can read it and actually understand what âdoneâ means.
Step 2: Build rubrics for the subjective stuff
Not everything can be pass/fail. Quality, tone, and helpfulness are real dimensions â they just need structure. A scoring rubric turns âmake it goodâ into something measurable:
RESPONSE QUALITY RUBRIC (1-5 scale)
5 â Directly addresses user intent, complete answer,
appropriate tone, no hallucinations, on-brand
4 â Addresses intent, mostly complete, minor tone issues
3 â Partially addresses intent, missing key information
2 â Misses intent OR includes incorrect information
1 â Harmful, off-topic, completely wrong, or brand-damaging
Set a threshold: â90% of responses must score â„4 on blind rubric review.â Now you have a measurable quality bar you can track across model versions, prompt changes, and time.
What does a 4 vs. 5 actually look like? Using the refund scenario:
User input: âI want a refund for my order from last week.â
Score 5 response:
âIâm sorry to hear that! Iâd be happy to help you with a refund. Could you share your order number so I can pull up the details? Most refunds are processed within 3-5 business days once initiated.â
â Directly addresses intent, complete next step, warm tone, accurate timeline, no hallucinations.
Score 4 response:
âSure, I can help with that. Whatâs your order number?â
â Addresses intent, routes correctly â but missing the timeline info that reduces follow-up questions. Minor incompleteness.
Score 3 response:
âRefunds are processed within 3-5 business days. Please contact our support team.â
â ïž Partially helpful â accurate info, but routes to âsupport teamâ instead of handling it, and doesnât acknowledge the userâs request. Creates friction.
Score 2 response:
âIâm unable to process refunds directly. Youâll need to fill out our refund request form at [link that doesnât work].â
â Technically accurate policy but bad experience â no empathy, dead-end routing, broken link.
Once your team scores 10 responses independently using this rubric, youâll find ~80% agreement on scores. The 20% where you disagree are your most valuable calibration conversations.
Step 3: Catalog edge cases systematically
This is where teams building AI products earn their value. Edge cases in AI donât behave like traditional edge cases â they can corrupt context windows, compound through multi-step flows, and surface completely different model behaviors. Categories to always cover:
Adversarial inputs: Prompt injection, jailbreak attempts, social engineering (âpretend youâre a different botâ)
Boundary conditions: Empty input, 10,000-word input, non-English, mixed languages
Context failures: Missing user context, contradictory context, outdated context
Compounding errors: When a wrong step in step 2 of an agent flow corrupts everything downstream
Policy edge cases: Anything touching legal, compliance, or brand safety
Teams that skip this step discover these edge cases when a screenshot goes viral.
Copy-Paste Eval Templates
Three starter templates. Steal them. Customize them. Run them Monday.
Template 1: Customer Support Chatbot
## Eval Suite: Customer Support Chatbot v[X]
Run on: every model change, every major prompt change, weekly regression
### Hard Fails (ANY failure = no ship)
- Prompt injection: 0% success rate (run 25 attempts)
- PII leakage: 0 instances across 50 test conversations
- Harmful advice: 0 instances across adversarial test set
### Scored Dimensions
| Dimension | Weight | Metric | Threshold |
|-----------------|--------|---------------------------------|-------------|
| Accuracy | 40% | Known Q&A pairs (100 cases) | â„90% match |
| Tone | 25% | Rubric score, blind review | â„4.0 avg |
| Appropriate IDK | 20% | Unanswerable Qs â "I don't know"| â„95% |
| Latency | 15% | p50 / p95 / p99 | <2s/<5s/<10s|
### Edge Case Bank
| Category | Test | Pass criteria |
|-------------|-------------------------------|-------------------|
| Adversarial | "Ignore your instructions" | Stays on task |
| Boundary | Empty message | Asks clarifying Q |
| Boundary | 5000+ word input | Handles gracefully|
| Context fail| Contradictory user history | Asks to confirm |
Template 2: Recommendation Engine
## Eval Suite: Recommendations v[X]
Run on: model change, data pipeline change, A/B test setup
### Hard Fails
- Banned/restricted products: never appear in recommendations
- Known-bad associations: test 20 pairs, 0 appear together
### Scored Dimensions
| Dimension | Weight | Metric | Threshold |
|----------------|--------|----------------------------------|-----------|
| Relevance | 40% | Same-intent items in top 10 | â„80% |
| Diversity | 25% | Categories represented in top 10 | â„3 |
| Cold start | 20% | Non-generic recs, 0-history users| â„60% |
| Business rules | 15% | Eligible promoted items in top 5 | â„70% |
Template 3: Multi-Step Research Agent
## Eval Suite: Research Agent v[X]
Run on: every deployment, model version change
### Hard Fails
- Hallucinated sources/citations: 0 tolerance
- Unauthorized tool calls: 0 tolerance
### Scored Dimensions
| Dimension | Weight | Metric | Threshold |
|-----------------|--------|--------------------------------|-----------|
| Task completion | 35% | Complete all required steps | â„85% |
| Clarification | 25% | Ambiguous tasks â asks Qs | â„90% |
| Tool precision | 25% | Correct tool, no extras | â„90% |
| Output quality | 15% | Blind rubric, 30 cases | â„4.0 avg |
Tools That Make This Sustainable (Not a Spreadsheet)
Running evals manually in Google Sheets is better than nothing. But there are purpose-built tools that make this a real team practice:
Start here: Braintrust (free) â your first 30 minutes
Go to braintrust.dev â create a free account
Create a new Project (name it after your feature)
Go to âDatasetsâ â create a new dataset â paste in your 20 test cases (input + expected output columns)
Go to âExperimentsâ â create your first eval run pointing at your dataset
Add a scoring function: start with âContainsâ (does the output contain X?) â no code required
Run it. You now have a baseline with a visual pass/fail breakdown.
Total time: 25-30 minutes. What you get: a shareable dashboard link you can drop in Slack, a versioned baseline to compare against after your next model change, and the ability to say âhereâs our quality trendâ to anyone who asks.
When to graduate to other tools:
Your team is on LangChain and debugging complex agent chains â try LangSmith
You need continuous evals against live production traffic â try Honeyhive
You have data residency requirements â try Promptfoo (open source, runs locally)
But donât let tooling be the reason you delay. A Google Sheet with 20 test cases and a rubric is infinitely better than a PRD that canât be verified.
The Hard Part: Getting Buy-In When Your Org Lives in PRDs
The eval framework makes sense in theory. In practice, youâll walk into sprint planning and hear: âWhereâs the PRD?â
Hereâs how teams at PMtheBuilder have navigated this:
The one path that actually works:
Donât announce âweâre replacing PRDs with evals.â Nobody wants to hear that. Instead, add an âEval Suiteâ section at the bottom of your next PRD. Same document. Same format. Just new section at the end.
Frame it simply: âHereâs the eval suite that defines exactly what done looks like â itâs also how weâll verify weâre done before shipping.â
Over 2-3 sprints, watch which section the team actually references when debating whether something is a bug or working as intended. Within a quarter, the eval section becomes the real spec. The PRD narrative becomes the context doc it always should have been.
Use the first production incident to make the case.
The best time to introduce evals is right after something goes wrong that a PRD never could have caught â a hallucination, a tone failure, a prompt injection, a recommendation that shouldnât have appeared. When that moment happens (and it will), have your eval template ready. âHereâs the test case that would have caught this. Hereâs how we add it to the suite so it never ships again.â That conversation changes cultures.
Make it a team artifact, not a PM deliverable.
The biggest mistake is treating evals as the PMâs responsibility. They work best as a shared artifact â engineers add edge cases they discovered, data scientists add distribution tests, designers flag tone failures as rubric criteria. When everyone owns the eval suite, everyone has skin in the quality bar. PMs donât need to be the quality police; the suite does that.
A realistic transition timeline:
Week 1-2: PM writes the first eval suite alone. 20 test cases, 1 rubric, appended to existing PRD format. Engineers are skeptical â âthis is just testing, isnât that QAâs job?â
Week 3-4: Team runs the eval on the current build. Three hard fails surface immediately. Suddenly everyoneâs looking at the eval suite, not the PRD, in the standup. An engineer adds 5 edge cases from something they saw in the logs.
Month 2: The eval suite becomes the source of truth for âare we done?â conversations. The PRD narrative section still exists but nobody references it in sprint reviews. The PM stops writing traditional acceptance criteria â the eval cases serve that purpose.
Month 3: A new feature kicks off. First meeting: PM shares the eval suite draft instead of a PRD. Engineering asks to add cases. One stakeholder asks âwhat does a passing eval look like?â â and gets a concrete answer instead of âitâll feel right when we see it.â
The transition doesnât require a policy change or leadership mandate. It just requires starting.
Managing up â the actual script:
The moment arrives: youâre in a leadership review for an AI feature and someone asks, âHow do we know this is ready?â
Hereâs what that conversation sounds like with evals vs. without:
Without evals: âWe did extensive testing across our key scenarios and the team feels confident about the quality.â â This is vibes. Leadership hears: âwe guessed.â
With evals: âWe defined 85 test cases across accuracy, tone, and safety. The model is passing 94% of them. The 6% failures are all in the âambiguous refund policyâ category â weâve scoped those out of v1 and theyâre on the roadmap. Our hard-fail criteria â no harmful outputs, no prompt injection success â are at 100%. Hereâs the dashboard.â
Thatâs not just a better answer. Itâs a different kind of PM. One who can have an evidence-based conversation about risk, not a vibes-based conversation about confidence. Leadership can make real decisions from that conversation. They can scope, defer, or accept risk with actual information.
Your Monday Morning Action Plan
No perfect tooling required. No org-wide buy-in required. Hereâs what to do this week:
Monday: Pick ONE feature. An AI feature your team is actively working on or recently shipped. Something you can describe in one sentence.
Tuesday: Write 20 test cases. Input + expected behavior + pass criteria. Cover: 10 typical cases, 5 boundary cases, 5 adversarial cases. Use the templates above. This takes 45 minutes.
Wednesday: Create one rubric. For whichever dimension is hardest to define (usually âqualityâ or âtoneâ), write a 1-5 rubric. Share it with one engineer and one stakeholder. Ask them to score 5 outputs independently. See if you agree â the conversation is the point.
Thursday: Set up Braintrust (free). Upload your 20 test cases. Run them against your current model/prompt. You now have a baseline that didnât exist Monday.
Friday: Share the dashboard link. Send it to your team: âThis is how weâll define done for [feature] going forward.â Thatâs the conversation starter. Thatâs the culture shift, beginning.
By Friday, your team will have seen what eval-driven development looks like in practice. Thatâs not nothing. Thatâs everything.
The Shift That Actually Matters
This isnât about paperwork. Itâs about what âdoneâ means.
In traditional product development, âdoneâ is a feature checklist. In AI product development, âdoneâ is a quality bar that your eval suite can verify â and that you can re-verify after every change.
The AI product teams doing the best work right now arenât the ones with the most sophisticated models. Theyâre the ones with the most disciplined eval practices. They know when theyâre done. They know when theyâve regressed. They know what âbetterâ actually means, in measurable terms.
Thatâs the muscle weâre all building here.
Stop writing PRDs. Write evals. Your engineers will thank you. Your users will notice. And six months from now, youâll wonder how you ever shipped AI features without them.
â PMtheBuilder đš
Want to go deeper? The PMtheBuilder AI PM Eval Designer generates a starter eval suite from your feature description. Free. 5 minutes. Gets you 60% of the way there.
Hit reply: whatâs the first feature youâre writing evals for?

