How is Level 1 evaluation performed?
A six-step process for evaluating AI systems.
End to end, the Level 1 evaluation workflow is complex and highly iterative (see Figure 7). Even so, we encourage you to start with a Minimum Viable Evaluation and build on it incrementally as the product matures.
We will elaborate on each of these steps in turn. You can apply this process to each of the non-deterministic models in your AI system, individually at first (if needed) but eventually as an ensemble:
The first step in Level 1 evals is to come up with your evaluation rubric.
Once you have defined a rubric, the next step is to define metrics you will use to track performance along each dimension in the rubric.
To verify whether your solution is actually improving along the rubric’s dimensions, you need a Golden Dataset: a set of records, each representing an optimal or ideal user interaction with the system.
Next, run online and offline evaluations and conduct error analysis on the results.
Manual evaluation becomes tedious, does not scale, and introduces inconsistency. We recommend gradually automating the process and integrating it directly into your engineering team's workflow.
Beyond evaluating your solution against known criteria (e.g. those captured in your Golden Dataset), you may also want to actively try to break or pressure test your AI system before releasing it into the wild.
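To make the steps above concrete, here is a minimal sketch of what a Minimum Viable Evaluation could look like in Python. Everything in it is an assumption made for illustration: the rubric dimensions ("correctness", "completeness"), the two toy metrics, the `GoldenRecord` type, the `toy_system` stand-in, and the 0.8 thresholds are hypothetical and are not prescribed by this guide. A real rubric, metric set, and Golden Dataset would come from your own product requirements.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class GoldenRecord:
    """One ideal interaction: a user input and the response we want (hypothetical schema)."""
    user_input: str
    ideal_response: str


def exact_match(response: str, ideal: str) -> float:
    """Return 1.0 if the response equals the ideal response, ignoring case and outer whitespace."""
    return 1.0 if response.strip().lower() == ideal.strip().lower() else 0.0


def token_overlap(response: str, ideal: str) -> float:
    """Fraction of the ideal response's distinct tokens that also appear in the response."""
    ideal_tokens = set(ideal.lower().split())
    if not ideal_tokens:
        return 0.0
    return len(ideal_tokens & set(response.lower().split())) / len(ideal_tokens)


# Steps 1 and 2: an illustrative rubric, with one metric per dimension.
RUBRIC: Dict[str, Callable[[str, str], float]] = {
    "correctness": exact_match,
    "completeness": token_overlap,
}

# Step 3: a tiny Golden Dataset (assumed to be non-empty).
GOLDEN: List[GoldenRecord] = [
    GoldenRecord("What is 2 + 2?", "4"),
    GoldenRecord("Capital of France?", "Paris"),
]


def toy_system(user_input: str) -> str:
    """Stand-in for the AI system under evaluation; replace with a real model call."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": "The capital is Paris"}
    return canned.get(user_input, "")


def evaluate(
    system: Callable[[str], str],
    golden: List[GoldenRecord],
    rubric: Dict[str, Callable[[str, str], float]],
) -> Dict[str, float]:
    """Step 4 (offline): average each rubric metric over the Golden Dataset."""
    scores: Dict[str, List[float]] = {name: [] for name in rubric}
    for record in golden:
        response = system(record.user_input)
        for name, metric in rubric.items():
            scores[name].append(metric(response, record.ideal_response))
    return {name: mean(values) for name, values in scores.items()}


# Step 5: an automated gate that could run in CI (thresholds are arbitrary examples).
THRESHOLDS = {"correctness": 0.8, "completeness": 0.8}

report = evaluate(toy_system, GOLDEN, RUBRIC)
failing = [name for name, score in report.items() if score < THRESHOLDS[name]]
print(report)
print("Dimensions below threshold:", failing)
```

In this toy run the second response contains the right answer but is not an exact match, so "correctness" falls below its threshold while "completeness" does not. Disagreements like this between metrics are a natural starting point for error analysis, and the same `evaluate` function can be pointed at adversarial inputs when you pressure test the system.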