> For the complete documentation index, see [llms.txt](https://eval.playbook.org.ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://eval.playbook.org.ai/model-behaviour/how-to-evaluate/1.-decide-on-an-evaluation-rubric.md).

# Decide on an evaluation rubric

The first step in Level 1 evals is to come up with your evaluation rubric. Working with domain experts and other stakeholders, you will define the characteristics that your AI solution must exhibit, in the form of targets or success criteria. For example, an AI agronomist might prioritize the “accuracy” of scientific information presented, and a mental health bot might need to emphasize “empathy”.

While some evaluation criteria are common across projects, most of your rubric will be driven by your specific use case. To ensure a comprehensive evaluation, your rubric should explicitly address these six dimensions:

<table data-header-hidden><thead><tr><th width="152.31640625">Dimension</th><th>What to Measure</th><th>Target</th></tr></thead><tbody><tr><td>Accuracy / Usefulness</td><td>The quality of the AI's response and whether it sufficiently addresses the task at hand.</td><td>"The response must address the user's specific question instead of giving a generic answer, and it must be medically accurate."</td></tr><tr><td>Qualitative / Branding</td><td>The "personality" and tone of the AI.</td><td>"The response must be professional and never use jargon."</td></tr><tr><td>Safety &#x26; Sensitivity</td><td>Sensitive issues specific to your use case, and any behaviours that are unacceptable.</td><td>"The AI system must never provide legal advice or comment on [Sensitive Topic X]."</td></tr><tr><td>Robustness &#x26; Stability</td><td>The system's ability to remain consistent when the same question is asked in different ways.</td><td>"The core answer should not change if the user uses different phrasing or synonyms."</td></tr><tr><td>Linguistic Consistency</td><td>For multi-language apps, whether performance holds up across languages.</td><td>"Swahili and Sheng questions must receive the same level of detail as their English equivalents."</td></tr><tr><td>Service-Level Performance</td><td>The "cost of doing business": response latency and cost per query.</td><td>"The end-to-end response time must be less than 2 seconds at a cost of &#x3C;$0.01 per query."</td></tr></tbody></table>

Your rubric will be determined by your use case, context, and impact goals. Getting it right often takes a lot of reflection and discussion. Because the rubric guides the rest of your evaluation, do not rush it.
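
Once the rubric is agreed, it can help to write it down in a structured form that both domain experts and engineers can read. The following is a minimal sketch in Python, not a prescribed format: the `Criterion` class, the `RUBRIC` list, and the `meets_service_level` helper are hypothetical names introduced for illustration. The entries are copied from the table above, and the thresholds come from its example service-level target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric row: the dimension and its written target."""
    dimension: str
    target: str


# Illustrative subset of the table above; replace with your own criteria.
RUBRIC = [
    Criterion("Qualitative / Branding",
              "The response must be professional and never use jargon."),
    Criterion("Robustness & Stability",
              "The core answer should not change if the user uses "
              "different phrasing or synonyms."),
    Criterion("Service-Level Performance",
              "The end-to-end response time must be less than 2 seconds "
              "at a cost of <$0.01 per query."),
]


def meets_service_level(latency_seconds, cost_usd,
                        max_latency_seconds=2.0, max_cost_usd=0.01):
    """Return True if one query meets the example service-level target."""
    return latency_seconds < max_latency_seconds and cost_usd < max_cost_usd


# Print the rubric so reviewers can read it, then check one sample query.
for criterion in RUBRIC:
    print(f"{criterion.dimension}: {criterion.target}")
print(meets_service_level(latency_seconds=1.4, cost_usd=0.004))
```

Quantitative targets such as latency and cost lend themselves to a mechanical check like this one. Qualitative targets such as tone typically still need a human or model-based judgment against the written target.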

#### How many dimensions should I have in my rubric?

It is tempting to make a long list of characteristics you want. After all, you want your AI system to be trustworthy as well as friendly, on-brand, concise, complete, curious, empathetic, encouraging, direct, and so many other things. Unfortunately, the longer this list, the more expensive and difficult your evaluation process becomes. Some characteristics also pull against each other, and those tradeoffs are hard to get right (e.g., concise vs. complete, friendly vs. direct). We recommend that you restrict the rubric to a maximum of 5 items to start.
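
One lightweight way to hold yourself to that limit is to rank your wish list and see what falls beyond it. The sketch below assumes the list is already ordered by priority; `MAX_RUBRIC_ITEMS` and `check_rubric_size` are hypothetical names used only for illustration.

```python
MAX_RUBRIC_ITEMS = 5  # starting limit recommended above


def check_rubric_size(dimensions, max_items=MAX_RUBRIC_ITEMS):
    """Return the dimensions beyond the limit (empty list if within it)."""
    return list(dimensions[max_items:])


# Assumed to be ordered from most to least important.
wish_list = ["trustworthy", "friendly", "on-brand", "concise",
             "complete", "curious", "empathetic"]

overflow = check_rubric_size(wish_list)
if overflow:
    print(f"{len(overflow)} item(s) over the limit; "
          f"consider deferring: {', '.join(overflow)}")
```

The items that overflow are not discarded; they are candidates to revisit once the first version of the rubric is working.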

{% hint style="success" %}

### Case Studies

[Jacaranda Health (JH)](https://www.google.com/url?q=https://jacarandahealth.org/\&sa=D\&source=editors\&ust=1770879886943808\&usg=AOvVaw0nVZiASt1YHvQ0QpeJu4z-) pioneers the use of generative AI to transform how underserved mothers in Sub-Saharan Africa access, understand, and act on vital maternal and newborn health information. Their product (PROMPTS) is a two-way SMS service designed to promote positive care-seeking behaviors amongst new and expectant mothers through timely health information and support throughout the pregnancy and postpartum journey. Responses are generated by Jacaranda’s customized LLM, UlizaLlama, which is based on Meta’s Llama 2 and fine-tuned for use in Swahili and English ([Stanford Center for Digital Health, 2025](https://www.google.com/url?q=https://cdh.stanford.edu/our-research-portfolio/generative-ai-health-low-middle-income-countries\&sa=D\&source=editors\&ust=1770879886944918\&usg=AOvVaw1eFSz30uQd-nZcYNQEzTye)). The evaluation of PROMPTS’ LLM responses at Level 1 is based on rubrics for “medical accuracy and appropriateness, personability, and simplicity.”

\
Another example comes from [Digital Green (DG)](https://www.google.com/url?q=https://digitalgreen.org/\&sa=D\&source=editors\&ust=1770879886945632\&usg=AOvVaw2jB00YounbCX7945760PK0), which uses GenAI to democratize access to localized, actionable agricultural knowledge for smallholder farmers across Africa and Asia. Their product (Farmer.Chat) is a multilingual, multimodal conversational platform that delivers personalized, context-aware agricultural advice through familiar messaging apps like WhatsApp and Telegram. Built using a Retrieval-Augmented Generation (RAG) architecture and integrated with a dynamic knowledge base of expert-vetted documents, videos, and real-time data, Farmer.Chat provides reliable guidance on more than 40 crops across four countries (Kenya, India, Ethiopia, and Nigeria). Responses are generated by Digital Green’s custom large language model pipeline, optimized for low-literacy users and localized languages including Swahili, Amharic, Hausa, Hindi, Odiya, Telugu, and English. Their AI system synthesizes structured and unstructured agricultural data to produce clear, trustworthy, and culturally relevant information delivered via text, voice, and video formats. Hence, the evaluation of Farmer.Chat’s performance at Level 1 is based on rubrics for “faithfulness, relevance, and accessibility” ([Singh et al., 2024](https://www.google.com/url?q=https://arxiv.org/abs/2409.08916\&sa=D\&source=editors\&ust=1770879886947391\&usg=AOvVaw2SYezWTyiUl0qF6cl35RSL)).
{% endhint %}

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-1-model-evaluation%2Fhow-is-level-1-evaluation-performed%2F1.-decide-on-an-evaluation-rubric>" %}

</details>