> For the complete documentation index, see [llms.txt](https://eval.playbook.org.ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://eval.playbook.org.ai/model-behaviour/how-to-evaluate/how-is-level-1-evaluation-performed.md).

# How is Level 1 evaluation performed?

End-to-end, the entire Level 1 evaluation workflow is both complex and highly iterative (see Figure 7). However, we encourage you to start with a [Minimum Viable Evaluation](/additional-resources/minimum-viable-evaluations.md), and build incrementally as the product matures.

### 6-step process for evaluating AI systems.

We will elaborate on each of these steps in turn. You can apply this process to each of the non-deterministic models in your AI system, individually at first (if needed) but eventually as an ensemble:

{% stepper %}
{% step %}
#### [Decide on an evaluation rubric](/model-behaviour/how-to-evaluate/1.-decide-on-an-evaluation-rubric.md)

The first step in Level 1 evals is to come up with your evaluation rubric.
{% endstep %}

{% step %}
#### [Decide on metrics](/model-behaviour/how-to-evaluate/2.-decide-on-metrics.md)

Once you have defined a rubric, the next step is to define metrics you will use to track performance along each dimension in the rubric.
{% endstep %}

{% step %}
#### [Develop a golden dataset](/model-behaviour/how-to-evaluate/3.-develop-a-golden-dataset.md)

To verify if your solution is actually improving along the rubric’s dimensions, you need a Golden Dataset: a set of records representing an optimal or ideal user interaction with the system.
{% endstep %}

{% step %}
#### [Scoring & error analysis](/model-behaviour/how-to-evaluate/4.-scoring-and-error-analysis.md)

Run online and offline evaluations and conduct error analysis.
{% endstep %}

{% step %}
#### [Automate your evaluations](/model-behaviour/how-to-evaluate/5.-automate-your-evaluations.md)

Manual evaluation can become tedious, is not scalable, and introduces inconsistency. We recommend gradually automating the process and integrating it directly into your engineering team's workflow.
{% endstep %}

{% step %}
#### [Red-teaming](/model-behaviour/how-to-evaluate/6.-red-teaming.md)

Beyond evaluating your solution against known criteria (e.g. those captured in your Golden Dataset), you may also want to actively try to break or pressure test your AI system before releasing it into the wild.
{% endstep %}
{% endstepper %}
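To make the first four steps concrete, here is a minimal sketch of what an offline scoring pass might look like in Python. Everything in it is an illustrative assumption rather than part of the playbook: the `GoldenRecord` type, the two toy metrics, the `RUBRIC` mapping, the `evaluate` helper, and the stand-in `toy_system` are all hypothetical names, and real rubrics and metrics will be far richer than string matching.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class GoldenRecord:
    """One hypothetical golden-dataset entry: an input and its ideal response."""
    prompt: str
    ideal_response: str


def exact_match(output: str, ideal: str) -> float:
    """Toy metric: 1.0 if the output equals the ideal response, ignoring case and padding."""
    return 1.0 if output.strip().lower() == ideal.strip().lower() else 0.0


def keyword_coverage(output: str, ideal: str) -> float:
    """Toy metric: fraction of the ideal response's words that appear in the output."""
    ideal_words = set(ideal.lower().split())
    if not ideal_words:
        return 1.0
    return len(ideal_words & set(output.lower().split())) / len(ideal_words)


# Hypothetical rubric: each dimension is tracked by exactly one metric.
RUBRIC: Dict[str, Callable[[str, str], float]] = {
    "accuracy": exact_match,
    "completeness": keyword_coverage,
}


def evaluate(
    system: Callable[[str], str],
    golden: List[GoldenRecord],
    rubric: Dict[str, Callable[[str, str], float]],
    threshold: float = 0.5,
) -> Tuple[Dict[str, float], List[Tuple[str, str, float]]]:
    """Score a system on every golden record; return mean scores and low-scoring cases."""
    totals = {dimension: 0.0 for dimension in rubric}
    failures: List[Tuple[str, str, float]] = []
    for record in golden:
        output = system(record.prompt)
        for dimension, metric in rubric.items():
            score = metric(output, record.ideal_response)
            totals[dimension] += score
            if score < threshold:
                # Keep the failing case so it can be reviewed during error analysis.
                failures.append((record.prompt, dimension, score))
    count = len(golden)
    means = {d: (total / count if count else 0.0) for d, total in totals.items()}
    return means, failures


def toy_system(prompt: str) -> str:
    """Stand-in for the AI system under evaluation (assumption: canned answers only)."""
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "Paris is the capital",
    }
    return canned.get(prompt, "")


GOLDEN = [
    GoldenRecord("What is 2 + 2?", "4"),
    GoldenRecord("Capital of France?", "Paris"),
]

if __name__ == "__main__":
    means, failures = evaluate(toy_system, GOLDEN, RUBRIC)
    print("Mean score per rubric dimension:", means)
    print("Cases to review in error analysis:", failures)
```

Because `evaluate` is an ordinary function that returns plain data, a sketch like this could later be called from a test suite or CI job when you reach the automation step, with the failure list feeding your error analysis.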
