# How is Level 1 evaluation performed?

The full Level 1 evaluation workflow is complex and highly iterative (see Figure 7). We encourage you to start with a [Minimum Viable Evaluation](/additional-resources/minimum-viable-evaluations.md) and build on it incrementally as the product matures.

<figure><img src="/files/0hRYTw9iSvYpegyU6UQj" alt=""><figcaption><p>Figure 7: Level 1 Evals Workflow</p></figcaption></figure>

### 6-step process for evaluating AI systems <a href="#what-is-the-minimum-viable-evaluation-for-level-1" id="what-is-the-minimum-viable-evaluation-for-level-1"></a>

We will elaborate on each of these steps in turn. You can apply this process to each of the non-deterministic models in your AI system, individually at first (if needed) but eventually as an ensemble:

{% stepper %}
{% step %}

#### [Decide on an evaluation rubric](/model-behaviour/how-to-evaluate/1.-decide-on-an-evaluation-rubric.md)

The first step in Level 1 evals is to come up with your evaluation rubric.
{% endstep %}

{% step %}

#### [Decide on metrics](/model-behaviour/how-to-evaluate/2.-decide-on-metrics.md)

Once you have defined a rubric, the next step is to define the metrics you will use to track performance along each dimension in the rubric.
{% endstep %}

{% step %}

#### [Develop a golden dataset](/model-behaviour/how-to-evaluate/3.-develop-a-golden-dataset.md)

To verify whether your solution is actually improving along the rubric’s dimensions, you need a Golden Dataset: a set of records representing an optimal or ideal user interaction with the system.
{% endstep %}

{% step %}

#### [Scoring & error analysis](/model-behaviour/how-to-evaluate/4.-scoring-and-error-analysis.md)

Run online and offline evaluations, and conduct error analysis on the results.
{% endstep %}

{% step %}

#### [Automate your evaluations](/model-behaviour/how-to-evaluate/5.-automate-your-evaluations.md)

Manual evaluation can become tedious, is not scalable, and introduces inconsistency. We recommend gradually automating the process and integrating it directly into your engineering team's workflow.
{% endstep %}

{% step %}

#### [Red-teaming](/model-behaviour/how-to-evaluate/6.-red-teaming.md)

Beyond evaluating your solution against known criteria (e.g. those captured in your Golden Dataset), you may also want to actively try to break or pressure test your AI system before releasing it into the wild.
{% endstep %}
{% endstepper %}
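To make the flow concrete, here is a minimal Python sketch of how steps 1 to 5 might fit together in an offline evaluation. Everything in it is an illustrative assumption rather than a prescribed implementation: the rubric dimensions (`correctness`, `conciseness`), their scoring rules, the two-record golden dataset, the `stub_model` function standing in for your AI system, and the threshold check are all hypothetical. Red-teaming (step 6) is not covered.

```python
from statistics import mean

# Steps 1-2: hypothetical rubric dimensions, each mapped to a metric in [0, 1].
def correctness(ideal: str, actual: str) -> float:
    # Case-insensitive exact match against the ideal answer.
    return 1.0 if ideal.strip().lower() == actual.strip().lower() else 0.0

def conciseness(ideal: str, actual: str) -> float:
    # Full score if the answer is at most twice as long as the ideal answer.
    return 1.0 if len(actual) <= 2 * len(ideal) else 0.0

RUBRIC = {"correctness": correctness, "conciseness": conciseness}

# Step 3: a tiny illustrative golden dataset (input plus ideal output).
GOLDEN_DATASET = [
    {"input": "What is 2 + 2?", "ideal": "4"},
    {"input": "Capital of France?", "ideal": "Paris"},
]

def stub_model(prompt: str) -> str:
    # Stand-in for the system under test; replace with a real model call.
    answers = {
        "What is 2 + 2?": "4",
        "Capital of France?": "The capital is Paris",
    }
    return answers.get(prompt, "")

# Step 4: score every record on every dimension and collect failures
# so they can be reviewed during error analysis.
def evaluate(model, dataset, rubric):
    scores = {name: [] for name in rubric}
    failures = []
    for record in dataset:
        output = model(record["input"])
        for name, metric in rubric.items():
            score = metric(record["ideal"], output)
            scores[name].append(score)
            if score < 1.0:
                failures.append((name, record["input"], output))
    summary = {name: mean(values) for name, values in scores.items()}
    return summary, failures

# Step 5: a simple gate that an automated pipeline could run on each change.
def passes_thresholds(summary, thresholds):
    return all(summary[name] >= minimum for name, minimum in thresholds.items())

summary, failures = evaluate(stub_model, GOLDEN_DATASET, RUBRIC)
```

In this sketch, `summary` holds the mean score per rubric dimension and `failures` lists the records to inspect during error analysis. A call such as `passes_thresholds(summary, {"correctness": 0.9})` could then act as a pass/fail check in a CI job, with the threshold values chosen by your team.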


