
# Scoring & error analysis

There are two different forms of evaluation in AI development:

* **Offline Evaluation**: Also referred to as lab testing, this phase of evaluation happens during development, before your AI product or solution reaches users. You are testing your pipeline against a fixed "Golden Dataset" to see if it meets your target performance. This is a controlled environment used to measure baseline performance and identify and analyze errors in your AI system.
* **Online Evaluation**: Analysis of your AI system in the real world, which starts after your solution is deployed to users. This workflow involves measuring system performance on tasks created by real users, in real time.

#### Offline evaluation

Once your golden dataset is ready, you can begin to run your scorer code, which typically compares golden input/output pairs with the AI system’s response to each input. The result is a set of metrics that average performance across all inputs received. These evaluation scores are not a final grade; they are a diagnostic tool to reveal areas for improvement and guide refinement of your AI system. Where there are issues (like a poor score or performance regression), engineers can conduct error analysis to identify the root causes and develop potential solutions.
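As a rough illustration of the scorer loop described above, the sketch below runs a system over golden input/output pairs and averages a per-example score. The names `GoldenExample`, `exact_match`, and `run_offline_eval` are hypothetical, and exact string matching is only a stand-in; most real scorers need a task-specific metric.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GoldenExample:
    """One golden input/output pair (hypothetical structure)."""
    input: str
    expected: str


def exact_match(expected: str, actual: str) -> float:
    """Return 1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(expected.strip().lower() == actual.strip().lower())


def run_offline_eval(
    dataset: List[GoldenExample],
    system: Callable[[str], str],
    scorer: Callable[[str, str], float] = exact_match,
) -> Dict:
    """Score the system on every golden example and average the results."""
    rows = []
    for example in dataset:
        actual = system(example.input)
        rows.append(
            {
                "input": example.input,
                "expected": example.expected,
                "actual": actual,
                "score": scorer(example.expected, actual),
            }
        )
    mean_score = sum(r["score"] for r in rows) / len(rows) if rows else 0.0
    # Keep the per-example rows: the average shows that something is wrong,
    # while the rows show where to start the error analysis.
    return {"mean_score": mean_score, "rows": rows}
```

Returning the per-example rows alongside the average keeps the score usable as a diagnostic: low-scoring rows are the natural starting point for error analysis.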

Error analysis is performed by inspecting traces. A trace is the complete, end-to-end record of a single user request as it moves through each component of the AI system. A trace typically includes:

* **Each component’s inputs and outputs**: For the “answer generation” component of an AI agronomist product, this may include the raw user query (in Marathi), its English translation (the actual input to this component), system prompt, data retrieved from a knowledge base, and the answer generated.
* **Model selection and parameters**: If the system can call multiple models, the trace will include which foundation model was used (e.g. GPT-4o), along with the settings or configuration (e.g., temperature: 0.7, max\_tokens: 1024).
* **Usage, cost, latency**: Trace data also include the number of tokens used to produce a given output, the corresponding cost of generation, and the time required to deliver the output.
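One way to picture the fields listed above is as a small data structure with one record per component. This is a minimal sketch with hypothetical class and field names (`Span`, `Trace`, `cost_usd`, and so on), not the schema of any particular tracing tool.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Span:
    """Record of one component's work within a single request (hypothetical schema)."""
    component: str                     # e.g. "translation", "answer_generation"
    inputs: Dict[str, Any]
    outputs: Dict[str, Any]
    model: Optional[str] = None        # which foundation model was called, if any
    params: Dict[str, Any] = field(default_factory=dict)  # e.g. temperature, max_tokens
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    latency_ms: float = 0.0


@dataclass
class Trace:
    """End-to-end record of a single user request."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)

    def totals(self) -> Dict[str, float]:
        """Aggregate usage, cost, and latency across all components.

        Latency is summed, which assumes the components ran one after another.
        """
        return {
            "tokens": sum(s.input_tokens + s.output_tokens for s in self.spans),
            "cost_usd": sum(s.cost_usd for s in self.spans),
            "latency_ms": sum(s.latency_ms for s in self.spans),
        }
```

Keeping inputs and outputs per component, rather than only the final answer, is what makes it possible to see where in the pipeline a bad response originated.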

Since a modern AI solution has many components, a poor score may indicate the issue but not its source. Engineers must identify which component(s) have contributed to a failure; it could be ineffective document retrieval in a RAG system, a poorly structured prompt, or something else.
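Once failing traces have been reviewed and each one labelled with the component judged responsible, a simple tally can show where failures concentrate. The sketch below assumes a hypothetical `failed_component` label added during manual review; the labelling itself is human judgment, not something the code determines.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def failure_breakdown(reviewed: List[Dict[str, Optional[str]]]) -> List[Tuple[str, int]]:
    """Count reviewed traces per failing component, most frequent first.

    Each item is expected to carry a "failed_component" key (hypothetical);
    a value of None means the trace was reviewed and judged acceptable.
    """
    counts = Counter(
        item["failed_component"]
        for item in reviewed
        if item.get("failed_component") is not None
    )
    return counts.most_common()
```

A breakdown such as "retrieval: 2, prompt: 1" gives a first indication of which component to investigate before changing anything else.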

The Product Manager is the primary consumer of an error analysis, prioritizing modifications and refinements for testing. They must weigh the business impact of a given metric (e.g. an improvement in "Hallucination rate" or "Accuracy") against the engineering cost to address it. Most solutions will require multiple cycles of the measurement-refinement loop on your golden dataset before deploying to real users.

As you engage in cycles of evaluation and analysis, it is tempting to endlessly tweak the AI system to maximize its evaluation score. However, metrics are coarse proxies for how an AI pipeline will perform in the real world. This is especially true if you lack the historical transaction data needed to build a truly representative Golden Dataset. Instead of chasing incremental gains (e.g., trying to move accuracy from 93% to 95%), consider establishing a performance threshold for each metric (e.g. accuracy > 90%). Once the AI system passes this threshold, stop optimizing in the lab and move quickly toward a real-world deployment.
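A threshold-based stopping rule can be written down explicitly so the decision to leave the lab is not made ad hoc. In this sketch the accuracy threshold comes from the example above, while the `hallucination_rate` limit and the function name `check_thresholds` are illustrative assumptions.

```python
from typing import Dict, Tuple

# Each threshold is (direction, limit). "min" means the metric must exceed
# the limit; "max" means it must stay below it. Values are illustrative only.
THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "accuracy": ("min", 0.90),
    "hallucination_rate": ("max", 0.05),  # assumed limit, not from the playbook
}


def check_thresholds(
    metrics: Dict[str, float],
    thresholds: Dict[str, Tuple[str, float]],
) -> Dict[str, str]:
    """Return a dict of failing metrics with a short reason; empty means ready."""
    failures = {}
    for name, (direction, limit) in thresholds.items():
        if name not in metrics:
            failures[name] = "metric missing"
            continue
        value = metrics[name]
        if direction == "min" and not value > limit:
            failures[name] = f"{value} is not above {limit}"
        elif direction == "max" and not value < limit:
            failures[name] = f"{value} is not below {limit}"
    return failures
```

An empty result means every metric has cleared its threshold, which under this approach is the signal to stop optimizing offline.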

Shifting to a threshold-based approach accelerates your transition from a controlled environment to the real world, offering two critical advantages:

1. **Access to authentic behavior**: Real user behavior is often drastically different from what developers anticipate. Shipping the AI product early allows you to gather high-value user data to update your Golden Dataset, making it representative of actual edge cases.
2. **Prioritization of real problems**: The issues that frustrate users in the wild are rarely the same ones solved by chasing a 2% improvement on an internal metric. Real-world exposure helps you identify the most pressing failure modes so you can prioritize the fixes that actually improve user experience.

#### Online evaluation

Once your solution has been debugged and is ready for deployment to real users, you will need to continuously monitor performance metrics as well as guardrail metrics. This enables you to manage the trade-offs between accuracy, safety, and broader service-level performance (e.g. latency). Evaluation results should be actively monitored over time, and unexpected behavior (e.g. weak scores, performance regressions) should be flagged automatically.
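Automatic flagging can be as simple as comparing a recent window of scores against a baseline. The sketch below is one possible rule under assumed parameters (`window`, `tolerance`); the function name is hypothetical and production monitoring would usually rely on the alerting features of an observability tool.

```python
from statistics import mean
from typing import Sequence


def flag_regression(
    history: Sequence[float],
    baseline: float,
    window: int = 3,
    tolerance: float = 0.05,
) -> bool:
    """Flag when the mean of the last `window` scores drops more than
    `tolerance` below the baseline.

    `history` holds periodic values of a higher-is-better metric, oldest first.
    Returns False until at least `window` scores are available.
    """
    if len(history) < window:
        return False
    recent = mean(history[-window:])
    return recent < baseline - tolerance
```

Averaging over a window rather than reacting to a single score reduces false alarms from ordinary day-to-day noise, at the cost of slightly slower detection.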

To do this, we advise integrating an observability tool (e.g. Langfuse) into your AI system to implement logging or “tracing” of the inputs and outputs for various components in your AI system once you launch. Many of these platforms allow you to track and visualize your evaluation results. By reviewing the user interaction "traces" captured by these services, you can spot novel patterns, user request types, and (critically) failure modes. Add these new examples to your "golden dataset" to improve its coverage and representativeness.

By monitoring user traces during online evaluation, you can identify new and unexpected ways your users may be interacting with your solution, including failure modes not encountered during lab testing (offline evaluation). The resulting insights should be used to augment or modify your golden dataset, update metrics, and refine your product or solution in a continuous feedback loop.
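Folding reviewed production examples back into the golden dataset might look like the following sketch. It assumes each candidate has already been given a human-verified `expected` output, and the field names and the `merge_into_golden` helper are hypothetical.

```python
from typing import Dict, List


def merge_into_golden(
    golden: List[Dict[str, str]],
    candidates: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Return a new golden dataset extended with verified production examples.

    Candidates without a verified "expected" output are skipped, as are
    inputs already present (compared after trimming and lowercasing).
    """
    seen = {example["input"].strip().lower() for example in golden}
    merged = list(golden)
    for candidate in candidates:
        key = candidate["input"].strip().lower()
        if key in seen or not candidate.get("expected"):
            continue
        merged.append(
            {
                "input": candidate["input"],
                "expected": candidate["expected"],
                "source": "production",  # provenance tag for later analysis
            }
        )
        seen.add(key)
    return merged
```

Tagging the provenance of each example makes it possible to report scores separately for the original and production-derived portions of the dataset.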

Though we will reference online evaluations where relevant, we do not provide detailed guidance on LLM tracing. We will leave this for future extensions of this playbook, and [recommend this guide for reference](https://hamel.dev/blog/posts/evals-faq/).
