Decide on an evaluation rubric

The first step in Level 1 evals is to come up with your evaluation rubric. Working with domain experts and other stakeholders, you will define the characteristics that your AI solution must exhibit, in the form of targets or success criteria. For example, an AI agronomist might prioritize the “accuracy” of scientific information presented, and a mental health bot might need to emphasize “empathy”.

While some evaluation criteria are common, the majority of your rubric will be driven by your specific use case. To ensure a comprehensive evaluation, your rubric should explicitly address these six dimensions:

Accuracy / Usefulness

The quality of the AI’s response and whether it sufficiently addresses the task at hand.

“The response must address the user’s specific question instead of giving a generic answer and it must be medically accurate.”

Qualitative / Branding

The "personality" and tone of the AI.

"The response must be professional and never use jargon."

Safety & Sensitivity

The sensitive issues specific to your use case and the behaviours the AI must never exhibit.

"The AI system must never provide legal advice or comment on [Sensitive Topic X]."

Robustness & Stability

The system's ability to remain consistent when the same question is asked in different ways.

"The core answer should not change if the user uses different phrasing or synonyms."

Linguistic Consistency

For multi-language apps, the consistency of response quality across languages.

"The Swahili and Sheng versions of a question must receive the same level of detail as the English version."

Service-Level Performance

The "cost of doing business": how long the system takes to respond and how much each query costs.

"The end-to-end response time must be less than 2 seconds at a cost of <$0.01 per query."
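As a rough illustration of the Robustness & Stability criterion above, the sketch below asks the same question in several phrasings and checks that the core answer stays the same. The names `is_robust`, `normalize`, and `fake_system` are hypothetical, and exact matching after normalization is an assumption made for brevity; a real check would more likely compare answers semantically, for example with a human reviewer or an LLM judge.

```python
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def is_robust(
    ask: Callable[[str], str],
    paraphrases: Iterable[str],
    extract_core: Callable[[str], str] = normalize,
) -> bool:
    """Return True if every paraphrase of a question yields the same core answer.

    `ask` is whatever function sends a question to your AI system (assumed here).
    `extract_core` reduces a response to the part that must not change.
    """
    core_answers = {extract_core(ask(question)) for question in paraphrases}
    return len(core_answers) <= 1


# Hypothetical stand-in for a real AI system: it always gives the same answer,
# varying only in capitalization and spacing.
def fake_system(question: str) -> str:
    return "  The SAME core answer. " if "?" in question else "the same core answer."


paraphrases = [
    "When should I plant?",
    "What is the right time for planting?",
    "Tell me the best planting time",
]

print(is_robust(fake_system, paraphrases))
```

Swapping `extract_core` for a stricter or looser comparison is where most of the judgment lies: too strict and harmless rewording fails the check, too loose and real contradictions pass.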

The rubric will be determined by your use case, context, and impact goals. This step often takes a lot of reflection and discussion to get right. It guides the rest of your evaluation, so do not rush it.
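Once the rubric is agreed, it can help to write it down as data so that later evaluation steps can refer to it. The sketch below is one possible encoding, not a prescribed format: the `Criterion` class, the `RUBRIC` list, and `meets_service_level` are hypothetical names, and the requirements are simply the example statements from the dimensions above. Only the service-level criterion is checked in code here, because it is the only one with numeric thresholds; the others need human or model-based judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    dimension: str    # which rubric dimension this criterion belongs to
    requirement: str  # the success criterion, in plain language


# Example rubric built from the illustrative statements in this section.
RUBRIC = [
    Criterion("Accuracy / Usefulness",
              "The response must address the user's specific question and be medically accurate."),
    Criterion("Qualitative / Branding",
              "The response must be professional and never use jargon."),
    Criterion("Safety & Sensitivity",
              "The AI system must never provide legal advice."),
    Criterion("Robustness & Stability",
              "The core answer should not change if the user uses different phrasing or synonyms."),
    Criterion("Linguistic Consistency",
              "Swahili and Sheng questions must receive the same level of detail as English ones."),
    Criterion("Service-Level Performance",
              "End-to-end response time under 2 seconds at a cost below $0.01 per query."),
]

# Thresholds taken from the service-level example above.
MAX_LATENCY_SECONDS = 2.0
MAX_COST_USD = 0.01


def meets_service_level(latency_seconds: float, cost_usd: float) -> bool:
    """Return True if a single query is within both the latency and cost targets."""
    return latency_seconds < MAX_LATENCY_SECONDS and cost_usd < MAX_COST_USD


for criterion in RUBRIC:
    print(f"{criterion.dimension}: {criterion.requirement}")
```

For example, `meets_service_level(1.2, 0.004)` passes both targets, while a query that takes 2.5 seconds fails regardless of its cost.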

How many dimensions should I have in my rubric?

It is tempting to make a long list of characteristics you want. After all, you want your AI system to be trustworthy as well as friendly, on-brand, concise, complete, curious, empathetic, encouraging, direct, and so many other things. Unfortunately, the longer this list, the more expensive and difficult your evaluation process becomes. There are also tradeoffs that are hard to get right (e.g., concise vs. complete, friendly vs. direct). We recommend that you restrict the rubric to a maximum of five items to start.

Case Studies

