Who is most involved in this level of evaluation?

Execute 🟢
Support (as product owners) 🟡

AI Engineers

ML Researchers

Domain Experts

Product Owners

User Researchers

Your engineering team drives the process. It leads technical development (e.g. implementing metrics and setting up automated evaluation pipelines) and works with domain experts to finalise the rubrics and develop the golden dataset.

Domain experts and product owners must support the engineering team, acting as product owners for the evaluation: deciding the rubrics, validating whether the proposed metrics measure those rubrics accurately, and helping to inform the design of the golden dataset.
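As a rough illustration of how a metric, a golden dataset, and an automated evaluation run fit together, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `GoldenExample`, `fact_coverage`, `evaluate`, and `stub_system` are hypothetical, the two dataset entries are placeholders rather than vetted domain content, and the substring-based metric is a deliberately simple stand-in for whatever rubrics your domain experts agree on.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenExample:
    """One curated prompt paired with facts a good answer must mention."""
    prompt: str
    required_facts: List[str]


def fact_coverage(output: str, required_facts: List[str]) -> float:
    """Fraction of required facts found in the output.

    Uses a case-insensitive substring match, which is a crude proxy for
    a rubric and would need validating with domain experts.
    """
    if not required_facts:
        return 1.0
    text = output.lower()
    hits = sum(1 for fact in required_facts if fact.lower() in text)
    return hits / len(required_facts)


def evaluate(
    system: Callable[[str], str],
    dataset: List[GoldenExample],
    threshold: float = 1.0,
) -> dict:
    """Run the system on every golden example and summarise the scores."""
    if not dataset:
        raise ValueError("The golden dataset must contain at least one example.")
    scores = [fact_coverage(system(ex.prompt), ex.required_facts) for ex in dataset]
    passed = sum(1 for score in scores if score >= threshold)
    return {
        "mean_score": sum(scores) / len(scores),
        "pass_rate": passed / len(scores),
        "scores": scores,
    }


def stub_system(prompt: str) -> str:
    """Placeholder for the real AI system; always returns the same text."""
    return "Wash hands with soap and water for 20 seconds."


# Hypothetical golden dataset; real entries would be curated with domain experts.
golden_dataset = [
    GoldenExample("How should I wash my hands?", ["soap", "20 seconds"]),
    GoldenExample("How should I store seeds?", ["dry", "cool"]),
]

report = evaluate(stub_system, golden_dataset)
print(report)
```

In practice, `stub_system` would be replaced by a call into the actual AI system, and the metric by one that domain experts have confirmed tracks the rubric. The shape of the loop (golden examples in, per-example scores and a summary out) is the part that carries over.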

Why is this level of evaluation important?

Level 1 evaluations focus on the AI system (see What is an AI system?) that forms the “smarts” of your product. While these AI systems are powerful, they have inherent blind spots. Large language models (LLMs) like GPT, Claude and Gemini do not understand content in the way humans do. Given an input, they generate output by predicting the next word (more precisely, the next token) in a sequence. Their predictions mimic the data used in model training, which is usually a vast collection of information published on the internet, including textbooks and computer code, as well as misinformation, unverified claims, and conspiracy theories. This is why they can appear fluent and convincing while remaining inaccurate, irrelevant, or harmful, a phenomenon known as hallucination.

Because of the way they are trained, AI models face several limitations:

Static Knowledge

Used alone, models cannot access real-time information (e.g., the current weather in a rural village), so they are limited to what was in their training data.

Limited Context

The model will not have access to personal information or your proprietary documents unless explicitly engineered to do so. As a result, models may lack the context to generate actionable, personalized, or even accurate outputs for a given task.

Instruction Following

Models may struggle to adhere to complex instructions or fail to follow constraints consistently, leading to results that do not fully meet expected criteria.

Task Mismatch

AI models are not the right “tool” for every task; for example, they may confidently make errors in math calculations which are trivial for a calculator. Understanding where they shine and augmenting them with capabilities they lack is key to using them well.

Product developers can often address these limitations, but doing so requires a structured, continuous evaluation process: a set of iterative workflows to verify that the AI system is useful, accurate, and safe, and that it reliably exhibits desirable behaviors and characteristics. For instance, an effective AI tutor will follow pedagogical best practices, such as withholding answers to encourage self-directed learning, or gauging a student’s abilities to better tailor instruction.

Level 1 evaluation verifies that the AI system performs reliably and is appropriate to the context. This is non-negotiable in sectors like education, health, and agriculture, where misalignment or unverified claims can cause real-world harm to vulnerable users. We recommend starting early with Level 1 evaluation, to prevent wasted effort and time.

You can begin by engaging key stakeholders, including users and domain experts, to define success criteria and a continuous evaluation strategy. This allows you to shape system behavior throughout the development process, and to avoid the high costs (and delays) of fixing a misaligned system after it has already been built.

