Evaluation as a Core Part of Development

Explore continuous, investigation-driven evaluation that improves assistants through analysis of real traces and targeted fixes.

Evaluation is not a separate phase introduced after an assistant is built. It is the ongoing work of understanding how the system behaves, why it makes the decisions it does, and which fixes improve outcomes. Inspecting a trace, reviewing a conversation, or verifying that a fix holds up in production all constitute evaluation. In practice, this is the development process for LLM-based products.
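
Concretely, a single reviewed trace can already serve as a small evaluation. The sketch below assumes a hypothetical trace shape (a user message plus a list of typed steps) and flags the kinds of problems a reviewer would otherwise note by hand; it illustrates the idea rather than any specific product's schema or tooling.

```python
# A minimal sketch: treating one logged trace as an evaluation artifact.
# The trace shape (steps with a kind, a name, and an output) is a
# hypothetical example, not a particular product's schema.
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str    # e.g. "intent", "tool_call", "response"
    name: str
    output: str


@dataclass
class Trace:
    user_message: str
    steps: list[Step] = field(default_factory=list)


def check_trace(trace: Trace) -> list[str]:
    """Return human-readable findings for one trace; empty means it looks fine."""
    findings = []
    kinds = [s.kind for s in trace.steps]
    if "intent" not in kinds:
        findings.append("no intent classification recorded")
    if "tool_call" in kinds and "response" not in kinds:
        findings.append("tool was called but no final response was produced")
    for step in trace.steps:
        if step.kind == "tool_call" and not step.output:
            findings.append(f"tool '{step.name}' returned an empty result")
    return findings


# One trace reviewed the same way you would during development.
trace = Trace(
    user_message="Cancel my order from yesterday",
    steps=[
        Step("intent", "classify", "order_status"),  # likely misrouted
        Step("tool_call", "lookup_order", ""),       # empty tool output
    ],
)
for finding in check_trace(trace):
    print(finding)
```

The point is not the specific checks but the habit: every trace you inspect can leave behind a small, repeatable assertion about how the system should behave.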

As teams scale their assistants, they often expect most effort to be invested in prompt design or model selection. In practice, the opposite happens. Most progress comes from diagnosing failures, refining workflows, and validating that changes behave consistently in the wild. When you treat evaluation as core engineering rather than overhead, the product becomes easier to reason about and far more predictable.

How much development effort should evaluation take?

Rather than treating evaluation as a separate budget line, approach it the way you approach debugging or QA: as something that happens continuously while you build. A significant share of development time naturally goes into understanding why the assistant behaved a certain way. You examine misrouted intents, misread tool responses, incorrect intermediate steps, and unclear user messages. This investigative work is not optional. It is how the product advances.
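
One lightweight way to make this investigative work cumulative is to tag each reviewed trace with the failure mode it exhibits and track the distribution. The sketch below assumes hypothetical trace IDs and uses labels drawn from the categories above; a real project would substitute its own failure taxonomy and storage.

```python
# A sketch of the investigative loop: tag each reviewed trace with a failure
# mode and count them, so the next fix targets the most common problem.
# Trace IDs and notes are illustrative.
from collections import Counter

FAILURE_MODES = {
    "misrouted_intent",
    "misread_tool_response",
    "incorrect_intermediate_step",
    "unclear_user_message",
    "no_failure",
}


def tag(trace_id: str, label: str, notes: str = "") -> dict:
    """Record one reviewed trace with its failure mode and free-form notes."""
    if label not in FAILURE_MODES:
        raise ValueError(f"unknown failure mode: {label}")
    return {"trace_id": trace_id, "label": label, "notes": notes}


reviewed = [
    tag("t-101", "misrouted_intent", "refund request sent to order-status flow"),
    tag("t-102", "misread_tool_response", "empty search result treated as success"),
    tag("t-103", "misrouted_intent", "shipping question classified as billing"),
    tag("t-104", "no_failure"),
]

# The distribution tells you where the next fix should go.
counts = Counter(r["label"] for r in reviewed)
for label, count in counts.most_common():
    print(f"{label}: {count}")
```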

Teams that build robust assistants consistently find that most of their time ends up in this investigative layer. In many projects, more than half of the engineering effort is spent on understanding real traces rather than rewriting prompts. The reason is ...
