# Automate your evaluations

When you first score an AI system’s performance against your Golden Dataset, it is reasonable to use a notebook, where you can quickly test code, see the results, and make changes. However, manual evaluation becomes tedious, does not scale, and introduces inconsistency. We recommend gradually automating the process and integrating it directly into your engineering team's workflow.

Automated evals can run continuously (e.g. on every AI response in production) or be triggered by specific events (e.g. a change to the system prompt). The engineering team is responsible for the technical implementation of the evaluation pipeline. This includes managing how and how often evaluations run, ensuring results are reliable and accessible, and integrating the evals into your deployment flow. We recommend the following practices:

* **Find the right evaluation frequency**: Evaluation methods vary significantly in reliability, computational cost and latency. A tiered approach balances cost and information:
  * Low-cost evals: Statistical scorers and model-based scorers (covered earlier) are fast and inexpensive. They can be run frequently to provide rapid feedback. However, as noted earlier, they are limited in what they can measure.
  * High-cost evals: LLM-as-judge scorers can be more comprehensive but can incur significant token costs. You can run them less frequently once a stable version is deployed. Common triggers include nightly builds, weekly schedules, or a final validation step before a major release.
* **Check alignment periodically**: You may decide to run your LLM judge on every response in production (or on a sample of production data) as a form of online evaluation. However, the judge must stay aligned with human experts for its judgements to remain relevant. As with the initial alignment exercise explained earlier, we strongly recommend periodically sampling your production data (monthly, quarterly, or yearly, depending on the maturity and stability of your AI system) and repeating the alignment exercise.
* **Track performance over time**: Evaluation is a continuous process. Use your observability tool to plot your online evaluation metric scores over time. The resulting dashboard lets product owners track progress against rubric goals and verify that changes to the solution yield measurable improvements.
* **Perform A/B tests**: Instead of releasing every change to all your users, release it first to a small subset of users (e.g. 1%). Compare your metrics on the new and previous versions to confirm that the change is stable and performs at least as well as the previous version.
* **Integrate with CI/CD**: Once the evaluation suite is stable, it can be integrated into your deployment pipeline to ensure that all code/prompt/model changes are validated before being deployed, preventing regressions.
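
The tiered frequency and CI/CD practices above can be sketched as a small gate script. This is a minimal illustration, not a prescribed implementation. The trigger names, the `tiers_for` and `regression_gate` helpers, the metric names, and the 0.02 tolerance are all hypothetical, and the sketch assumes every metric is "higher is better".

```python
from dataclasses import dataclass, field

# Hypothetical trigger names; map them to the events your CI system emits.
LOW_COST_TRIGGERS = {"commit", "pull_request", "nightly", "release"}
HIGH_COST_TRIGGERS = {"nightly", "release"}


@dataclass
class GateResult:
    passed: bool
    # Metric name -> score change (candidate minus baseline) for each regression.
    regressions: dict = field(default_factory=dict)


def tiers_for(trigger: str) -> list:
    """Return the eval tiers to run for a given pipeline trigger."""
    tiers = []
    if trigger in LOW_COST_TRIGGERS:
        tiers.append("low_cost")
    if trigger in HIGH_COST_TRIGGERS:
        tiers.append("high_cost")
    return tiers


def regression_gate(baseline: dict, candidate: dict, tolerance: float = 0.02) -> GateResult:
    """Compare candidate scores with baseline scores, metric by metric.

    Assumes higher is better for every metric. A metric missing from
    `candidate` raises KeyError so that a skipped eval cannot pass silently.
    """
    regressions = {}
    for name, base_score in baseline.items():
        delta = candidate[name] - base_score
        # Flag only drops larger than the tolerance, to absorb scorer noise.
        if delta < -tolerance:
            regressions[name] = delta
    return GateResult(passed=not regressions, regressions=regressions)
```

In a pipeline, you would typically call `tiers_for` with the event that started the run, execute the matching scorers, and exit with a non-zero status when `regression_gate(...).passed` is false so that the deployment step is blocked.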

Once you start collecting real user inputs and feedback, you can update your metrics and golden datasets, test different system configurations, and deploy the version of your AI pipeline that performs best.
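
When comparing two versions, for example in an A/B test, it helps to check that a difference in scores is larger than sampling noise before declaring a winner. The sketch below uses a two-proportion z-test, which is one common choice and not something this playbook prescribes. It assumes a binary pass/fail metric and reasonably large samples. The function name and the counts in the usage note are hypothetical.

```python
import math


def two_proportion_z_test(pass_a: int, n_a: int, pass_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in pass rates B - A.

    pass_a / n_a: passing responses and total responses for version A.
    pass_b / n_b: the same counts for version B.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both sample sizes must be positive")
    rate_a = pass_a / n_a
    rate_b = pass_b / n_b
    # Pooled pass rate under the null hypothesis that both versions are equal.
    pooled = (pass_a + pass_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # All responses passed or all failed in both groups: no evidence of a difference.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(500, 1000, 600, 1000)` compares a 50% pass rate with a 60% pass rate. A positive `z` with a small p-value suggests version B is better on this metric. With a 1% rollout the treatment group can be small, so a non-significant result may simply mean more data is needed.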



