Evaluation as a Core Part of Development

Explore continuous, investigation-driven evaluation that improves assistants through analysis of real traces and targeted fixes.

Evaluation is not a separate phase introduced after an assistant is built. It is the ongoing work of understanding how the system behaves, why it makes the decisions it does, and which fixes improve outcomes. Inspecting a trace, reviewing a conversation, or verifying that a fix holds up in production all constitute evaluation. In practice, this is the development process for LLM-based products.
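
Concretely, a single reviewed trace can already serve as a small evaluation. The sketch below assumes a hypothetical trace shape (a user message plus a list of typed steps) and flags the kinds of problems a reviewer would otherwise note by hand; it illustrates the idea rather than any specific product's schema or tooling.

```python
# A minimal sketch: treating one logged trace as an evaluation artifact.
# The trace shape (steps with a kind, a name, and an output) is a
# hypothetical example, not a particular product's schema.
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str    # e.g. "intent", "tool_call", "response"
    name: str
    output: str


@dataclass
class Trace:
    user_message: str
    steps: list[Step] = field(default_factory=list)


def check_trace(trace: Trace) -> list[str]:
    """Return human-readable findings for one trace; empty means it looks fine."""
    findings = []
    kinds = [s.kind for s in trace.steps]
    if "intent" not in kinds:
        findings.append("no intent classification recorded")
    if "tool_call" in kinds and "response" not in kinds:
        findings.append("tool was called but no final response was produced")
    for step in trace.steps:
        if step.kind == "tool_call" and not step.output:
            findings.append(f"tool '{step.name}' returned an empty result")
    return findings


# One trace reviewed the same way you would during development.
trace = Trace(
    user_message="Cancel my order from yesterday",
    steps=[
        Step("intent", "classify", "order_status"),  # likely misrouted
        Step("tool_call", "lookup_order", ""),       # empty tool output
    ],
)
for finding in check_trace(trace):
    print(finding)
```

The point is not the specific checks but the habit: every trace you inspect can leave behind a small, repeatable assertion about how the system should behave.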

As teams scale their assistants, they often expect most effort to be invested in prompt design or model selection. In practice, the opposite happens. Most progress comes from diagnosing failures, refining workflows, and validating that changes behave consistently in the wild. When you treat evaluation as core engineering rather than overhead, the product becomes easier to reason about and far more predictable.

How much development effort should evaluation take?

Rather than treating evaluation as a separate budget line, approach it the way you approach debugging or QA: as something that happens continuously while you build. A significant share of development time naturally goes into understanding why the assistant behaved a certain way. You examine misrouted intents, misread tool responses, incorrect intermediate steps, and unclear user messages. This investigative work is not optional. It is how the product advances.
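
One lightweight way to make this investigative work cumulative is to tag each reviewed trace with the failure mode it exhibits and track the distribution. The sketch below assumes hypothetical trace IDs and uses labels drawn from the categories above; a real project would substitute its own failure taxonomy and storage.

```python
# A sketch of the investigative loop: tag each reviewed trace with a failure
# mode and count them, so the next fix targets the most common problem.
# Trace IDs and notes are illustrative.
from collections import Counter

FAILURE_MODES = {
    "misrouted_intent",
    "misread_tool_response",
    "incorrect_intermediate_step",
    "unclear_user_message",
    "no_failure",
}


def tag(trace_id: str, label: str, notes: str = "") -> dict:
    """Record one reviewed trace with its failure mode and free-form notes."""
    if label not in FAILURE_MODES:
        raise ValueError(f"unknown failure mode: {label}")
    return {"trace_id": trace_id, "label": label, "notes": notes}


reviewed = [
    tag("t-101", "misrouted_intent", "refund request sent to order-status flow"),
    tag("t-102", "misread_tool_response", "empty search result treated as success"),
    tag("t-103", "misrouted_intent", "shipping question classified as billing"),
    tag("t-104", "no_failure"),
]

# The distribution tells you where the next fix should go.
counts = Counter(r["label"] for r in reviewed)
for label, count in counts.most_common():
    print(f"{label}: {count}")
```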

Teams that build robust assistants consistently find that most of their time ends up in this investigative layer. In many projects, more than half of the engineering effort is spent on understanding real traces rather than rewriting prompts. The reason is ...
