Where Prompts Belong in an Evaluated System
Learn why prompt changes require rigor, versioning, and tight integration with evaluation to prevent regressions.
Early in development, prompt changes tend to feel low-effort. Teams tweak wording, rerun a few examples, and move on. Once evaluation becomes systematic, with traces under review, failures categorized, and behavior protected over time, prompt changes stop being casual. They become some of the highest-leverage and highest-risk changes in the system.
This is where many teams lose rigor: prompts drift away from the rest of the system, living in dashboards, admin panels, notebooks, or shared documents while evaluation findings are stored elsewhere. When a failure appears in a trace, it becomes hard to answer basic questions: which prompt version caused this, who changed it, and whether the fix you discussed actually shipped. This lesson focuses on that breakdown and on how to treat prompts as part of the evaluable system, not as isolated text.
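One lightweight way to make those questions answerable is to treat each prompt revision as an immutable record with a content-derived version id, and to stamp that id onto every trace. The sketch below is illustrative, not a prescribed implementation: the names `PromptVersion` and `trace_record`, and the choice of a truncated SHA-256 hash as the version id, are assumptions, not part of any specific tooling discussed here.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class PromptVersion:
    """One immutable prompt revision, identified by a content hash."""
    name: str
    text: str
    changed_by: str
    note: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def version_id(self) -> str:
        # Content hash: identical text always maps to the same id,
        # so the id answers "which exact prompt produced this trace?"
        return hashlib.sha256(self.text.encode()).hexdigest()[:12]

def trace_record(prompt: PromptVersion, model_input: str, model_output: str) -> dict:
    """Attach the prompt version id to every trace, so a failure seen
    in review can be mapped back to the prompt that produced it."""
    return {
        "prompt_name": prompt.name,
        "prompt_version": prompt.version_id,
        "changed_by": prompt.changed_by,
        "input": model_input,
        "output": model_output,
    }

# Two revisions of the same prompt get distinct, stable version ids.
v1 = PromptVersion("support_agent", "Answer politely.", "alice", "initial")
v2 = PromptVersion("support_agent", "Answer politely. Refuse legal advice.",
                   "bob", "tighten refusal behavior")
```

Because the id is derived from the prompt text itself, it also confirms whether a discussed fix actually shipped: if production traces still carry the old version id, the change never landed.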
Why do prompts become a bottleneck once evaluation starts working?
Once you start reviewing traces regularly and running evaluations against real failures, prompt changes accelerate. A single failure may trigger several iterations, such as tightening an instruction, adding a constraint, clarifying refusal behavior, or restructuring the context. At this stage, many teams discover that they can no longer answer basic questions, such as which prompt produced a given trace, when it changed, or why a ...