# About this playbook

From math tutors to farmer advisory tools, generative AI (GenAI) is rapidly expanding across low- and middle-income countries. This playbook provides a 4-level framework and recommends practices for evaluating these GenAI tools.

<br>

<figure><img src="/files/xw2iIcSt0ixsxaN2IGTJ" alt=""><figcaption></figcaption></figure>

### Why we need this playbook

Evaluating GenAI products can mean different things depending on who you ask. Tech teams prioritize performance, often overlooking impact, while impact evaluators focus on outcomes but may neglect the underlying technology. Even within disciplines, the sophistication and quality of evaluations can differ.

This playbook establishes a unified set of expectations and practices for evaluating GenAI products in global development.

<table data-view="cards"><thead><tr><th></th><th></th><th></th><th data-hidden data-card-cover data-type="files"></th></tr></thead><tbody><tr><td><h4><i class="fa-handshake-angle">:handshake-angle:</i></h4></td><td><strong>Create Shared Practices</strong></td><td>Use consistent, credible, and comparable practices to assess what works and drive learning across the industry.</td><td></td></tr><tr><td><h4><i class="fa-lightbulb-gear">:lightbulb-gear:</i></h4></td><td><strong>Improve Products and Programs</strong></td><td>Identify issues early through continuous evaluation and build better products over time.</td><td></td></tr><tr><td><h4><i class="fa-clipboard-check">:clipboard-check:</i></h4></td><td><strong>Demonstrate Accountability</strong></td><td>Show stakeholders measurable progress from model performance to impact.</td><td></td></tr></tbody></table>

### Who is this playbook for

<table data-card-size="large" data-view="cards"><thead><tr><th></th><th></th><th></th><th data-hidden data-card-cover data-type="image">Cover image</th></tr></thead><tbody><tr><td><h4><i class="fa-user-gear">:user-gear:</i></h4></td><td><strong>Implementors and Program Managers</strong></td><td>Improve your products and programs with credible evaluation practices.</td><td><a href="/files/K3RO2hRVfb7kdPAuvmJ9">/files/K3RO2hRVfb7kdPAuvmJ9</a></td></tr><tr><td><h4><i class="fa-dollar-sign">:dollar-sign:</i></h4></td><td><strong>Funders and Policy Makers</strong></td><td>Make informed investments by assessing an organization’s ability to evaluate and improve their product.</td><td><a href="/files/x4g04elKvnWNAMp5kT9g">/files/x4g04elKvnWNAMp5kT9g</a></td></tr></tbody></table>

### How to use this playbook

The playbook is organized around a 4-level framework that asks the following evaluation questions:

<table data-card-size="large" data-view="cards"><thead><tr><th></th><th></th><th></th><th data-hidden data-type="content-ref"></th><th data-hidden></th><th data-hidden data-card-cover data-type="files"></th></tr></thead><tbody><tr><td><h4><i class="fa-head-side-circuit">:head-side-circuit:</i></h4></td><td><strong>Models Evaluation</strong></td><td>Does the AI system perform as intended?<br><br><a href="/spaces/VDHDXE8axdWQfu0OFCHP/pages/DeMcUC7YhehF7wXhEazC">Level 1 →</a></td><td><a href="/spaces/VDHDXE8axdWQfu0OFCHP/pages/DeMcUC7YhehF7wXhEazC">/spaces/VDHDXE8axdWQfu0OFCHP/pages/DeMcUC7YhehF7wXhEazC</a></td><td><a href="#how-the-framework-works">Models &#x26; Behaviour</a></td><td></td></tr><tr><td><h4><i class="fa-laptop-code">:laptop-code:</i></h4></td><td><strong>Product Evaluation</strong></td><td>Does the overall product engage and retain users?<br><br><a href="/spaces/zpcawBg21nKa217FyRsG/pages/BRhAcSDI4fzmQttWpxZl">Level 2 →</a></td><td><a href="/spaces/zpcawBg21nKa217FyRsG/pages/BRhAcSDI4fzmQttWpxZl">/spaces/zpcawBg21nKa217FyRsG/pages/BRhAcSDI4fzmQttWpxZl</a></td><td><a href="#how-the-framework-works">Implementors &#x26; Program Managers</a></td><td></td></tr><tr><td><h4><i class="fa-user-gear">:user-gear:</i></h4></td><td><strong>User Evaluation</strong></td><td>Does the product change users' thoughts, feelings, knowledge and behaviour towards the development outcome?<br><br><a href="/spaces/R1fawv6icuZEAPmz1pnB/pages/wcgHi9eru7seyBhXPjew">Level 3 →</a></td><td><a href="/spaces/R1fawv6icuZEAPmz1pnB/pages/wcgHi9eru7seyBhXPjew">/spaces/R1fawv6icuZEAPmz1pnB/pages/wcgHi9eru7seyBhXPjew</a></td><td><a href="#how-the-framework-works">User Experience</a></td><td></td></tr><tr><td><h4><i class="fa-hand-holding-seedling">:hand-holding-seedling:</i></h4></td><td><strong>Impact Evaluation</strong></td><td>Do users with access to the product improve development outcomes?<br><br><a href="/spaces/DNdX3hzAtddLuS4lBI4e/pages/YnZKseJWPCqdwrTYVLTE">Level 4 →</a></td><td><a 
href="/spaces/DNdX3hzAtddLuS4lBI4e/pages/YnZKseJWPCqdwrTYVLTE">/spaces/DNdX3hzAtddLuS4lBI4e/pages/YnZKseJWPCqdwrTYVLTE</a></td><td><a href="#how-the-framework-works">Social Impact</a></td><td></td></tr></tbody></table>

Each level outlines detailed evaluation practices that organizations building AI products can adopt.

The four levels form a logical progression. Users are unlikely to stay engaged (Level 2) if the GenAI system fails to perform (Level 1), and development outcomes are unlikely to improve (Level 4) if users disengage or their feelings, knowledge, and behaviors are harmed (Level 3).

The playbook helps implementors conduct continuous evaluation across levels. Results at one level often require revisiting an earlier level. As the technology evolves, this cycle enables rapid iteration, maintains expected behavior, and improves performance and impact over time.

### Setting the Foundation

<table data-card-size="large" data-view="cards"><thead><tr><th></th><th></th><th></th><th data-hidden data-type="content-ref"></th><th data-hidden data-card-cover data-type="files"></th></tr></thead><tbody><tr><td><h4><i class="fa-users">:users:</i></h4></td><td><strong>Build your team</strong></td><td>To build a GenAI product for social impact, you need the right team that brings together development sector expertise with skillsets that are newer to the field. This section describes the relevant skillsets.<br><br><a href="/pages/V91mgmS1QmGOVyIVTszQ">Learn more →</a></td><td><a href="/pages/V91mgmS1QmGOVyIVTszQ">/pages/V91mgmS1QmGOVyIVTszQ</a></td><td></td></tr><tr><td><h4><i class="fa-shield-keyhole">:shield-keyhole:</i></h4></td><td><strong>Build the infrastructure</strong></td><td>Before diving into the four levels, teams should establish several key conceptual and technical building blocks that ease evaluations. This section describes what should be developed before diving in.<br><br><a href="/pages/tSN6S6uJ2o6t9Y6o4LYF">Learn more →</a></td><td><a href="/pages/tSN6S6uJ2o6t9Y6o4LYF">/pages/tSN6S6uJ2o6t9Y6o4LYF</a></td><td></td></tr></tbody></table>

#### Additional Resources

[FAQs](/additional-resources/frequently-asked-questions) | [Glossary](/additional-resources/glossary) | [Minimum Viable Evaluations](/additional-resources/minimum-viable-evaluations) | [Tools & Templates](/additional-resources/additional-resources)

#### Stay involved

* [See the process behind the playbook](/overview/the-process-behind-this-playbook)
* [Contribute to this playbook](/overview/how-to-contribute-to-the-playbook)<br>

***

This is a living playbook. It will be updated regularly, with deeper collaboration with specialists to co-create shared evaluation tools, refine methodologies, and support their practical use in real-world settings.



# The Process Behind it

This playbook draws on real-world evaluation practices developed during the 2025 [AI for Global Development (AI4GD) accelerator](https://agencyfund.notion.site/ai-for-global-development). The Accelerator—led by The Agency Fund (TAF) in collaboration with OpenAI and experts at the Center for Global Development (CGD)—invested $5 million in eight non-profits building GenAI products and services across education, health, and agricultural livelihoods.

CGD convened a Technical Working Group to refine these evaluation lessons into this living playbook. The group included more than 30 experts across computer science, economics, gender studies, health, education, and agriculture, with representation from Asia, Africa, North America, and Europe. IDinsight also interviewed non-profits building or deploying generative AI in the social sector to understand their current evaluation approaches and what guidance they find most actionable. More than 300 comments from experts and non-profits informed the next version of the playbook. The convening and development of this Playbook were funded by the Gates Foundation.

The Playbook is now a living document with a development roadmap that will evolve as evidence and AI capabilities advance, with TAF, IDinsight, and CGD stewarding updates and incorporating community feedback.<br>

<figure><img src="/files/mJfGx0iIV103DQt4MPL2" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
Note: The version Working Group Members first published in March 2026 can be found here \[PDF]. Because this is a living playbook with multiple additional contributors since the initial publication, the online version on this website does not necessarily reflect the positions of the original working group. With the exception of Steering Committee Members, Working Group Members and Core Contributors served in their individual capacities, not as official representatives of any organization.
{% endhint %}

<figure><img src="/files/KSpLYFm2U1eWBb2THgQq" alt=""><figcaption></figcaption></figure>

### Community of Contributors

{% columns %}
{% column %}
Suzin You, IDinsight

Isha Fuletra, IDinsight
{% endcolumn %}

{% column %}
Yolanda Yang, CGD
{% endcolumn %}
{% endcolumns %}

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=references%2Fthe-process-behind-this-playbook>" %}

</details>


# How to Contribute to the Playbook

How to provide feedback, suggest edits, or contribute to the AI Evaluation Playbook.

This is a living playbook maintained by a Steering Committee staffed by the Center for Global Development, IDinsight, and The Agency Fund. AI evaluation is an active area of research, with new methods and best practices being explored all the time. We welcome edits, suggestions, and feedback to help keep this resource relevant and useful, and we hope to keep updating it as the space evolves.

We aim to release new versions of the playbook incorporating community feedback every quarter.

## Ways to contribute

### 1. Send us an email

Reach out to us at <ai-eval-playbook@idinsight.org> with your feedback, suggestions, or questions.

### 2. Use the feedback form

Every page in the playbook includes a feedback form at the bottom. Use it to leave quick comments or suggestions about the content on that page.

### 3. Raise a pull request on GitHub

For direct edits or contributions, open a pull request in our GitHub repository: <https://github.com/IDinsight/ai-eval-playbook>.

## What happens next

### 1. We review contributions monthly

The Steering Committee reviews contributions at least monthly and may respond sooner to urgent issues.

### 2. We may reach out for clarification

If we think there's something actionable but need more information, we'll reach out. Please engage with us!

### 3. We'll list you as a contributor!

We're keen to grow our community of contributors. Join us!

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=overview%2Fhow-to-contribute-to-the-playbook>" %}

</details>


# Building Blocks for GenAI Evaluation

To move from a promising AI prototype to a scalable tool for social impact, you need more than just sophisticated code—edging toward real-world change requires a deliberate combination of people and process.

This section of the Playbook outlines the two foundational pillars of your evaluation journey: assembling a multidisciplinary team and establishing the technical and conceptual infrastructure to measure success.

***

### Building the Team

Success in the development sector depends on breaking down silos. A great GenAI product isn't just "built by engineers" and "checked by researchers"; it is the result of a cross-functional dance.

In this section, we define the specific roles required—from AI Engineers and Data Scientists to Social Scientists and Domain Experts. You’ll find:

* Role Definitions: Who leads which level of evaluation (from model performance to long-term impact).
* Collaboration Best Practices: How to pair technical staff with domain experts early to ensure "accuracy" aligns with "human need."
* Shared Language: Tools for creating a unified vocabulary to avoid the "jargon trap."

<a href="/pages/V91mgmS1QmGOVyIVTszQ" class="button primary">Learn more -></a>

***

### Building the Infrastructure

Beyond the people, you need a repeatable system. We define five core building blocks that shift your team from static design to continuous, data-driven improvement.

This section provides a technical and strategic roadmap for:

1. The Foundation: Using formative research and a Theory of Change (TOC) to map how an AI output becomes a social outcome.
2. The User Funnel: Mapping the journey from the first "Hello" to the "North Star" metric, ensuring you don't fall into the "engagement trap."
3. Data Pipelines: Setting up the "Extract, Transform, Load" (ETL) systems necessary to handle complex, unstructured GenAI data.
4. Hypothesis Targeting: A disciplined approach to diagnosing why users drop off or why metrics underperform.
5. Experimentation: Moving from intuition to evidence through A/B testing and rigorous version control.

<a href="/pages/tSN6S6uJ2o6t9Y6o4LYF" class="button primary">Learn more -></a>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=overview%2Fbuilding-blocks-for-genai-evaluation>" %}

</details>


# Building the Team

To build a GenAI product for social impact, you need to start with the right team. Success depends on cross-functional teams where responsibilities are clear and complementary. Effective teams typically involve AI engineers, data engineers, data scientists, user researchers, social scientists, and product managers. They demonstrate strong cross-disciplinary communication and use minimal jargon. Adopting this model in the development sector reduces silos and ensures gains at one level of evaluation translate to the others. Only some of the roles will need to be permanent or in-house, while others may be temporary or external. Below, we outline roles by level, how they collaborate, the tools they use, and how teams align goals with evaluation outcomes. In addition to the assigned roles, we recommend a review of all stages by domain experts as well as persons with lived experience of the intervention’s topic.

<table><thead><tr><th width="195.2578125">Area of Expertise</th><th width="199.37890625">Roles in Evaluation</th><th>Responsibilities</th></tr></thead><tbody><tr><td><p>Engineers</p><p>(AI, Backend/Data, MLOps)</p></td><td><p><strong>Lead:</strong> Level 1</p><p><strong>Support:</strong> Level 2, Level 3, Level 4</p></td><td>Orchestrate prompts, knowledge bases and other components of a modern AI system; Create/maintain benchmark datasets and set up automated metrics/human judges/LLM judges to run offline and online tests; Track and improve model performance; Perform error analysis and ensure data quality; Build and fine-tune models if necessary; ensure relevance and safety; log outputs for downstream use. Domain-specific inputs (e.g., educators for tutor bots) are also essential.</td></tr><tr><td>Product Managers</td><td><p><strong>Lead:</strong> Level 2</p><p><strong>Support:</strong> Level 1, Level 3</p></td><td>Integrate AI into workflows; define product metrics, maintain shared dashboards; design/implement experiments in collaboration with Domain Experts and User Researchers and track outcomes of A/B tests; manage product versions and releases; align product metrics with user behavior research.</td></tr><tr><td>Data Scientists</td><td><strong>Support:</strong> Level 2, Level 3, Level 4</td><td>Analyze data from Level 2, including definition of metrics. 
Contribute to both routine monitoring and analysis of A/B tests.</td></tr><tr><td>User researchers (can include behavioral/psychological scientists)</td><td><p><strong>Lead:</strong> Level 3</p><p><strong>Support:</strong> Level 2, Level 4</p></td><td>Measure user outcomes (cognitive, affective, and behavioral) and run A/B tests on these outcomes; run surveys and interviews; co-design metrics with end users; and integrate qualitative insights from interviews, focus groups, and direct observation with Level 2 product metrics.</td></tr><tr><td>Social scientists</td><td><strong>Lead:</strong> Level 4</td><td>Evaluate long-term outcomes (e.g., learning, health, income); define theory of change; run impact evaluations</td></tr><tr><td>Domain Experts</td><td><strong>Support:</strong> Level 1, Level 2, Level 3, Level 4</td><td>Help to define rubrics for Level 1, and validate Level 1 metrics. Support definition of Level 2 and Level 4 metrics and their real-world relevance. Contribute to the theory of change.</td></tr></tbody></table>

{% hint style="info" %}
In small teams, individuals may span multiple levels, but all four perspectives must be represented. Engineers may collect user feedback but still need behavioral or domain input; product managers should understand model metrics, and researchers should look at product analytics. The team should jointly define what “enough evaluation” means at each stage—later in the Playbook, we outline a set of Minimum Viable Evaluations.
{% endhint %}

## Best practices for cross-level collaboration

### Look Beyond Your Slices of Evaluation

Each team member should understand how their work shapes other evaluation levels. Engineers should look beyond benchmarks to user experience, and data scientists analyzing engagement (Level 2) can gain insight from behavioral experts (Level 3). Regular cross-functional check-ins anchored in the user journey help surface these links and prevent tunnel vision.

### Pair Engineers with Domain Experts Early

Involve domain experts in Level 1 from the outset. Engineers need their input to define success beyond technical metrics, ensuring model evaluation reflects real user needs.

### Identify a Cross-Functional Lead

Product managers (or cross-functional leads) should connect roles, coordinate timelines, run experiments, and translate insights into decisions. A clear evaluation plan spanning Levels 1–4 keeps teams aligned on goals and evidence.

### Use a Shared Evaluation Language

Adopt a shared vocabulary across levels (e.g., Level 1 accuracy, Level 2 engagement, Level 3 learning gains, Level 4 outcomes). Explain jargon as needed, and document tests and lessons in a shared space to build alignment, establish shared goals, and avoid rework.

### Use Tools that Support Collaboration

* **Evaluation pipeline**: An automated evaluation pipeline for your AI system can help identify cases where it currently fails and track its behavior as you make improvements.
* **Dashboards & Data Pipelines**: Centralized, annotated dashboards can ensure that key metrics are accessible to all.
* **Experimentation Platform**: Use lightweight tools (e.g., Evidential) to run and track experiments collaboratively.
* **Project & Knowledge Tools**: Keep tasks visible, foster quick feedback, and hold regular debriefs for deeper insights.
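
As a rough illustration of the evaluation pipeline idea above, the sketch below runs a small benchmark through an AI system and reports which cases fail. Everything in it is hypothetical: `stub_system` stands in for your real system (model, prompts, and knowledge base), and the "required phrases" rubric is a deliberately simple placeholder for the human or LLM judges a real pipeline would likely use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkCase:
    case_id: str
    prompt: str
    required_phrases: list[str]  # hypothetical rubric: phrases a good answer must contain


def passes(case: BenchmarkCase, answer: str) -> bool:
    """Case-insensitive check that every required phrase appears in the answer."""
    text = answer.lower()
    return all(phrase.lower() in text for phrase in case.required_phrases)


def run_benchmark(system: Callable[[str], str], cases: list[BenchmarkCase]) -> dict:
    """Run every case through the system; return the pass rate and failing case ids."""
    failures = [c.case_id for c in cases if not passes(c, system(c.prompt))]
    pass_rate = (len(cases) - len(failures)) / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failures": failures}


def stub_system(prompt: str) -> str:
    """Placeholder for the real AI system; replace with a call to your own pipeline."""
    if "7 x 8" in prompt:
        return "7 x 8 = 56"
    return "I am not sure."


# Illustrative benchmark for a math tutor; real cases would come from domain experts.
CASES = [
    BenchmarkCase("mult-01", "What is 7 x 8?", ["56"]),
    BenchmarkCase("frac-01", "What is 1/2 + 1/4?", ["3/4"]),
]

report = run_benchmark(stub_system, CASES)
print(report)
```

Re-running the same benchmark after each change to prompts or models gives the team a shared, repeatable record of which cases improved and which regressed.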

{% hint style="info" %}
In addition to the tooling described above, the following sections of this introductory playbook give an initial overview of the key resources and tools available for each phase.
{% endhint %}

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=overview%2Fbuilding-blocks-for-genai-evaluation%2Fbuilding-the-team>" %}

</details>


# Building the Infrastructure

Beyond the team, we define five building blocks for building and evaluating AI products for the development sector. Whether assessing a GenAI tutor or a clinical decision support tool, reusable building blocks often apply across all four evaluation levels. When implemented well, they shift teams from static design to repeatable, delivery-embedded practices that support continuous improvement.

## 1. The Foundation: Start with formative research, a theory of change, and subgroup identification

GenAI products, such as math tutors or clinical decision support tools, operate within larger systems. A government AI tutor, for instance, depends on teacher training, devices, and incentives. If outcomes fall short, the cause may be the model (Level 1) or program delivery. Understanding the full system is essential.

<div data-full-width="true"><img src="/files/GavyWv2jtQaTDrxaTxML" alt="Figure 2. Unit of AI evaluation across systems"></div>

**Formative research** helps organizations define the context and system in which a GenAI product operates (Figure 2). At the center is the **AI system**, which includes multiple components—one of which is a foundation model (e.g. GPT-5, Claude Opus 4.5, Gemini 3). Foundation models are trained on large datasets to learn patterns and generate new outputs—text, images, code, or audio—that resemble their training data. In this playbook, however, the AI system extends beyond the model itself to include prompts, knowledge bases, and other elements of the broader AI pipeline.

In global development, AI models typically sit within an **AI product**, such as a direct-to-consumer health chatbot delivered via WhatsApp. That product is deployed through an **intervention, program, or social service**—for example, an onboarding session onto the WhatsApp product for expectant mothers during prenatal screening visits. These interventions, in turn, exist within a broader **delivery system**, such as a country’s public health system that funds and supports prenatal care and onboarding.

Formative research is early-stage work conducted before or during GenAI-based intervention design to understand the problem, the users, and the delivery system. Its goal is not to prove impact, but to inform design decisions. It typically uses qualitative and mixed methods—interviews, observations, usability tests, surveys, and rapid pilots—to reveal how people actually behave, not how we expect them to. A **human-centered design** approach fosters adaptation to local context, conditions and needs, and is relevant from the start: PATH provides [guidance](https://media.path.org/documents/LLM_Playbook_final.pdf?_gl=1*1rxptz6*_gcl_au*Nzg5MjA1MTcuMTc3MjE5Mzc5MQ..*_ga*ODg1Nzc4NDUwLjE3NzIxOTM3OTI.*_ga_YBSE7ZKDQM*czE3NzIxOTcyNjQkbzIkZzEkdDE3NzIxOTcyNzQkajUwJGwwJGgw) on dataset creation, Dalberg illustrates their [approach](https://thepeopleplaybook.ai/) with learnings from various practical experiences, and Google [offers](https://pair.withgoogle.com/guidebook/) guidance for general AI solutions.

Formative research can help form a **Theory of Change (TOC)**. A TOC maps hypothesized causal pathways from inputs (e.g., training, products, information) to a development “North Star” outcome (e.g., literacy, mortality, yields). It traces how inputs move through activities to outputs, produce short-term outcomes, and ultimately generate social impact. For example, the figure below describes how a skills development intervention that trains and certifies workers ultimately achieves impacts such as reductions in poverty and economic growth.

<figure><img src="/files/Bevt1J9OqC7DowQaKRac" alt=""><figcaption><p>Figure 3: An example theory of change of a skills development intervention</p></figcaption></figure>

From an evaluation standpoint, a TOC explains how earlier-stage Level 1–3 variables—AI system performance, user engagement, and user mindsets—are expected to drive the outcome or impact of interest. By making the end-to-end flow from inputs to impacts explicit, a TOC helps motivate the design of the intervention, product, and AI workflow.

Formative research and a TOC also help define the target population: the people or institutions (e.g., schools, health clinics) the intervention aims to benefit. Crucially, this population is **heterogeneous**: clinics vary in capacity, schools in remoteness, and individuals in wealth, spoken language, literacy, and digital access. Women and girls, along with other marginalized communities, may also experience intersecting inequalities linked to disability and other axes of disadvantage. For evaluation, this creates subgroups of interest, such as those defined by gender or rural residence. Monitor the experiences of these subgroups at every evaluation level to detect unequal effectiveness and to ensure that your product is not reproducing existing biases and inequalities.

Under a “minimum viable evaluation” approach (doing only what is needed to mitigate serious risks), you can define a broad initial population while focusing on those least able to benefit. That assessment can be informed by the theory of change. Teams can then iteratively design and evaluate for priority groups, such as rural, low-capacity clinics in a health assistant intervention or low-income indigenous girls in a tutoring program. Track outcomes for these subgroups and adjust the AI workflow, product, or intervention so they benefit. However, when evaluating effects on subgroups, achieving sufficient statistical power may not always be feasible.

Teams may have to prioritize key subgroups and balance rigor with available resources. Given that prioritization may exclude vulnerable subgroups, teams should carefully assess the tradeoffs and explicitly justify their exclusion.<br>
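
To see why subgroup analyses are often underpowered, a rough sample-size calculation can help. The sketch below uses the standard normal-approximation formula for comparing two proportions. The function name, the baseline and expected rates, and the significance and power levels are illustrative assumptions, not recommendations from this playbook.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect p_control vs. p_treatment
    with a two-sided test (normal approximation for two proportions)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical example: detect a rise in task completion from 40% to 50%.
n_per_arm = sample_size_per_arm(0.40, 0.50)
print(f"Users needed per arm: {n_per_arm}")
```

The requirement applies to each subgroup analyzed on its own. If a priority subgroup makes up a small share of users, total enrollment has to grow accordingly, which is one reason teams may need to prioritize a few key subgroups.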

## 2. The user funnel: Track the journey across Levels 1-4

One of the most useful tools for developing GenAI products is a **user funnel**: a structured map of how people move through a product and program, from first exposure to long-term impact. A comprehensive funnel does more than describe usage—it creates a shared framework for tracking the user journey from discovery to impact. It surfaces weak points to guide improvements and provides a common anchor for the four levels of evaluation.

To build a funnel, start with the program’s TOC. The funnel is essentially a user-centric theory of change: it captures user inputs (typically behaviors or resources the user controls) in response to the intervention. It can also track intervention elements the product team controls and varies, based on factors like cost or impact.

Funnel design is usually bottom-up. Start with the **final development outcome**, or “North Star” metric (Level 4), such as improved learning outcomes, better health, or higher crop yields. Then work backward to break the journey into specific **user stages:**

{% stepper %}
{% step %}
**Recruitment**

The beneficiary is identified and enters the program (Level 2).
{% endstep %}

{% step %}
**Onboarding**

The user is introduced to the AI product and completes initial setup (Level 2).
{% endstep %}

{% step %}
**Engagement**

The user begins actively interacting with the AI product (Level 2).
{% endstep %}

{% step %}
**Retention**

The user continues engaging with the AI product over time, rather than dropping off (Level 2). Level 1 evaluation may continue as needed to monitor model behavior.
{% endstep %}

{% step %}
**Proximal Outcome**

The user demonstrates near-term cognitive or behavioral change (Level 3). Level 1 evaluation may continue as needed to monitor model behavior.
{% endstep %}

{% step %}
**Development Outcome**

The user achieves the desired long-term result (Level 4).
{% endstep %}
{% endstepper %}

Your objective as a product team is to convert users from each stage to the next, minimizing user drop-off along the funnel. For each stage, teams should clearly define:

* **What the program does** to bring users into that stage (i.e., program inputs).
* **What the user must do** to count as having entered the stage (i.e., user input).
* **The metric** that confirms entry into the stage (e.g., login rate, session length, quiz completion).
* **Target metric values** and **transition rates** between stages.
* **Costs** associated with moving a user through a given stage.
* **DRIs (Directly Responsible Individuals)** tasked with maintaining or improving the performance of each metric, at each stage of the funnel.

This structure turns a theory of change into a measurable, cost-aware product design tool. It lets teams track performance over time, identify user drop-offs or failure points, and test whether user behavior aligns with intended outcomes. It also gives funders and evaluators clear signals of where progress is occurring and where it is stalling.
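
As a minimal sketch of how these stage definitions might be recorded and checked, the example below stores a metric, a target transition rate, and a DRI for each stage, then compares observed transition rates against targets. The class name, field names, user counts, targets, and owners are all hypothetical; a cost field could be added in the same way.

```python
from dataclasses import dataclass


@dataclass
class FunnelStage:
    name: str           # stage label from the funnel
    metric: str         # metric that confirms entry into the stage
    users: int          # users observed entering the stage
    target_rate: float  # target transition rate from the previous stage
    owner: str          # DRI for this stage's metric


def transition_report(stages):
    """Return (stage name, observed transition rate, meets target) per transition."""
    report = []
    for prev, curr in zip(stages, stages[1:]):
        rate = curr.users / prev.users if prev.users else 0.0
        report.append((curr.name, round(rate, 3), rate >= curr.target_rate))
    return report


# Hypothetical counts for one cohort.
stages = [
    FunnelStage("Recruitment", "enrolled", 1000, 1.0, "program lead"),
    FunnelStage("Onboarding", "completed setup", 800, 0.85, "product manager"),
    FunnelStage("Engagement", "first session", 600, 0.70, "product manager"),
    FunnelStage("Retention", "active in week 4", 240, 0.50, "data scientist"),
]

report = transition_report(stages)
for name, rate, on_target in report:
    status = "on target" if on_target else "below target"
    print(f"{name}: {rate:.0%} ({status})")
```

Reading the report stage by stage shows where users drop off relative to the targets the team set, and which DRI owns the follow-up.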

{% hint style="warning" %}
**The Engagement Trap**

In the development sector, high engagement is necessary but not sufficient. A commercial app optimizes for "time on device" (ad revenue). A development app must optimize for "Time to Success."<br>

**Action**: Always pair engagement metrics with negative metrics (e.g., doom-scrolling, repeated confusion) to ensure you aren't optimizing for unwanted behavior.
{% endhint %}

{% hint style="info" %}
**Accounting for subgroups in a user funnel**

A user funnel is not only useful for overall program performance; it can also identify the types of users who are not progressing to the next stage. This creates an opportunity to design features or customizations to help these subgroups advance—or to decide, deliberately, that they are not the target users and update upstream targeting criteria accordingly. Careful user research should inform this determination.
{% endhint %}

At each level of evaluation, you will define and track an array of metrics, from generic benchmarks that are broadly adopted by an industry to specific metrics used only to improve your product. These metrics form a hierarchy. At the top are widely used industry or academic benchmarks for well-defined tasks. Examples include **accuracy** for translation or speech recognition (Level 1), **daily active users (DAU)** for digital products (Level 2), user **information recall** (Level 3), and **household income or consumption** (Level 4). These metrics enable comparisons across foundation models, product classes, or development solutions. Use them sparingly.

At the bottom of the hierarchy are contextual metrics. These highly specific metrics are often most useful for product improvement, but rarely support cross-product comparison. For a mental health app, examples include the **percentage of emergency situations missed** in LLM-driven message triage (Level 1), **time spent on activities improving mental health** rather than DAU (Level 2), **daily self-reported stress** (Level 3), or a validated **generalized anxiety disorder scale** for the target population (Level 4). In general, you will want to develop many of these metrics, monitor them frequently, and run A/B tests on them to improve over time.

At each stage of the user funnel, consider capturing a range of Level 1–4 metrics. This allows you to trace how changes propagate through the full user journey. For example, adjusting the AI system may reduce translation accuracy during user onboarding (Level 1), without affecting engagement in later stages of the product (Level 2); however, this may compromise user understanding and behavior, affecting Level 3 and 4 outcomes.

In addition to constructing a funnel for your users, you can construct funnels for your frontline workers, administrators, and other stakeholders contributing to the impact of your GenAI product. The funnel for your frontline worker might begin with recruiting and/or training; at the bottom of this funnel might be successful delivery of key program elements to users (assuming that these activities are required for users to advance from one stage of their funnel to the next). Maintaining a stack of funnels, with appropriate metrics for each, can help you monitor the quality of your intervention overall, expanding your focus beyond the GenAI product to include associated activities in your theory of change. Note that the indicators for these non-user funnels are often captured in process evaluations (PEs), which we discuss elsewhere in this playbook.

<br>

Applying the principle of **“Minimum Viable Evaluation”** (MVE) here means collecting only the data (through logs, surveys, analysis) needed to get started. Begin with the North Star metric, then define the smallest set of upstream metrics and targets required to observe it. Eliminate anything not essential. Here are examples of MVE metrics you might consider:

* Accuracy, response completeness, or latency[^1] (<i class="fa-gear-code">:gear-code:</i> Level 1)
* Number of daily active users, session duration, timestamps (<i class="fa-box-isometric">:box-isometric:</i> Level 2)
* User satisfaction or comprehension of content (<i class="fa-user">:user:</i> Level 3)

We also recommend defining data quality requirements and target values for each MVE metric. Critically, to track users across evaluation levels and funnel stages, you will need a simple set of identifiers that can be captured in log data and surveys, including:

* **User (Dimension):** Defined by a User ID, this represents the unique identifier for each individual and their persistent attributes.
* **Action (Dimension):** A collection of features or UI elements within your application (e.g., a "Login" button or "Prompt Correction" field). These represent the available touchpoints in the product journey.
* **Event (Fact):** A timestamped record of a specific user interaction with an action. Each event captures the *"who"* (User ID) and the *"what"* (Action) within a specific session, including the system configuration and the resulting outcome.
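
The three identifiers above can be sketched as simple record types. This is a minimal illustration, not a prescribed schema: the field names, ID formats, and example values are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class User:
    """Dimension: one row per individual, with persistent attributes."""
    user_id: str
    attributes: dict = field(default_factory=dict)  # e.g. language, region


@dataclass
class Action:
    """Dimension: a feature or UI element users can interact with."""
    action_id: str
    description: str


@dataclass
class Event:
    """Fact: a timestamped interaction linking a user to an action."""
    user_id: str
    action_id: str
    session_id: str
    timestamp: datetime
    system_config: str  # hypothetical version label for the AI system or product
    outcome: str


user = User("u-001", {"language": "sw", "region": "rural"})
login = Action("login", "Login button")
event = Event(
    user_id=user.user_id,
    action_id=login.action_id,
    session_id="s-123",
    timestamp=datetime(2025, 1, 15, 9, 30, tzinfo=timezone.utc),
    system_config="prompt-v2",
    outcome="success",
)
```

Because every event carries the same `user_id` that appears in surveys and outcome data, records from different evaluation levels can later be joined on that key.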

In some cases, you may need multiple nested funnels to capture the full user experience with an AI product. These funnels and their metrics can also be linked across all framework levels; the final Linkages Across Levels section discusses this in more detail.

## 3. Data pipelines: Build and track metrics

A well-designed evaluation framework is only as good as the data infrastructure that supports it. At the heart of that infrastructure is a robust data pipeline – a system that extracts, transforms, and loads data to power consistent, reliable measurement of user funnel metrics (also known as program indicators):

* **Extract**: Collect data from various sources – chat logs, product telemetry, survey tools, third-party APIs, or even spreadsheets.
* **Transform**: Clean, standardize, and reshape the raw data into a usable format. This could involve timestamp alignment, anonymization, session stitching, or deriving new funnel metrics like time-on-task or trust indicators.
* **Load**: Make the transformed data available from centralized storage (like a data warehouse or analytics dashboard) so teams can access it for analysis, visualization, or modeling.
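
The three steps above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions: the raw rows, field names, and in-memory "warehouse" are hypothetical stand-ins for real log sources and storage, and the unsalted hash is shown only for brevity (it is not a complete anonymization scheme).

```python
import hashlib
from collections import defaultdict
from datetime import datetime

# Extract: in practice these rows would come from chat logs or product telemetry.
RAW_LOGS = [
    {"contact": "user-a@example.org", "ts": "2025-01-15T09:30:00", "action": "message_sent"},
    {"contact": "user-a@example.org", "ts": "2025-01-15T09:42:00", "action": "message_sent"},
    {"contact": "user-b@example.org", "ts": "2025-01-16T18:05:00", "action": "message_sent"},
]


def transform(rows):
    """Replace direct identifiers with a pseudonymous ID and parse timestamps."""
    cleaned = []
    for row in rows:
        user_id = hashlib.sha256(row["contact"].encode()).hexdigest()[:12]
        cleaned.append({
            "user_id": user_id,
            "ts": datetime.fromisoformat(row["ts"]),
            "action": row["action"],
        })
    return cleaned


def load(rows, warehouse):
    """Append cleaned rows to an in-memory 'warehouse' partitioned by date."""
    for row in rows:
        warehouse[row["ts"].date()].append(row)
    return warehouse


warehouse = load(transform(RAW_LOGS), defaultdict(list))

# Derive a funnel metric from the loaded data: daily active users.
daily_active_users = {
    day: len({row["user_id"] for row in rows}) for day, rows in warehouse.items()
}
print(daily_active_users)
```

Keeping each step as a separate function makes it easier to swap in real sources and storage later without changing how metrics are derived.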

AI products—especially GenAI—generate large volumes of complex, unstructured data. Without a clear data pipeline, turning this data into actionable metrics at scale is slow and unreliable. For example, a product supporting adolescent mental health might collect:

* **Model-level outputs** (<i class="fa-gear-code">:gear-code:</i> Level 1): response quality, hallucination rate, representative failure cases.
* **Engagement logs** (<i class="fa-box-isometric">:box-isometric:</i> Level 2): sessions per user, conversation length, feature use.
* **Behavioral indicators** (<i class="fa-user">:user:</i> Level 3): changes in sentiment or self-reported stress levels.
* **Outcome data** (<i class="fa-chart-column">:chart-column:</i> Level 4): improvement in standardized well-being scores over time.

To make sense of this, teams should build a data pipeline that integrates core datasets into a warehouse—AI system logs, product analytics, user surveys, and outcomes data—and translate them into consistent indicators across evaluation levels. Teams should also track data lineage in the warehouse so indicators are interpreted correctly.

## 4. Hypothesis targeting: Address weak links

Once a user funnel and robust data pipeline are in place, the next challenge is diagnosing why metrics underperform. Start by identifying major drop-offs: if users do not engage, they are unlikely to benefit. Then investigate what drives the drop-off, using **targeted hypotheses**.

Rather than relying on intuition, teams should ask specific, testable questions about drop-offs and mechanisms. This approach bridges product management, UX research, and behavioral science, keeping evaluation disciplined despite nonlinear development.

Evaluation should not dictate what teams build; it should clarify what needs to be understood and changed. For instance, if engagement drops after onboarding, evaluators can surface competing hypotheses—unclear value, interface overload, or mistrust of AI responses—each informing targeted metrics or experiments, often co-designed with product, UX, and behavioral science leads. In this way, evaluation is generative: not just judging performance, but helping teams ask better questions, faster.

## 5. Experimentation: Test with rigor and speed

Once hypotheses are set, experimentation tests them. For lightweight changes (e.g., prompts or onboarding), evaluation datasets and A/B tests (through tools like [Evidential](https://docs.evidential.dev/welcome/)) are often fastest and cheapest. For deeper behavioral or policy questions, teams may use staggered rollouts, holdouts, or—when justified—full RCTs informed by L1–3 data. The aim is consistent: produce credible causal evidence on what improves user outcomes, turning evaluation into a decision tool.
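
For a simple A/B test on a conversion-style metric, one common analysis is a pooled two-proportion z-test. The sketch below is a generic illustration of that test and does not describe how any particular tool computes its results; the function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided pooled z-test for a difference in rates between arms A and B.

    Returns (difference in rates, z statistic, p-value).
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Hypothetical result: the current onboarding (A) vs. a revised onboarding (B).
diff, z, p_value = two_proportion_ztest(success_a=200, n_a=500, success_b=240, n_b=500)
print(f"Difference: {diff:.1%}, z = {z:.2f}, p = {p_value:.3f}")
```

The normal approximation behind this test is reasonable for large samples; with small samples or rare outcomes, an exact test may be more appropriate.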

Throughout experimentation, maintain version control by logging every change to the AI system, product features, wrap-around services, and delivery manuals. This often-overlooked practice is foundational: it helps align stakeholders when updates are needed and enables accurate interpretation of shifts in evaluation data at every level.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=overview%2Fbuilding-blocks-for-genai-evaluation%2Fbuilding-the-infrastructure>" %}

</details>

[^1]: **Accuracy** refers to the proportion of system outputs that are correct according to a task-specific ground truth or expert-validated rubric.

    **Response completeness** refers to the extent to which a system’s reply covers all required informational components of the user’s query (as specified by a task-level checklist or rubric).

    **Response latency** refers to the time elapsed between a user message and the system’s response.<br>

    For Level-1 (L1) “minimum viable evaluation” (MVE) metrics, acceptable thresholds should be explicitly specified to support go/no-go deployment decisions (e.g., ≥70% accuracy for emergency-intent detection in a health chatbot may be considered minimally launchable, whereas performance below this threshold would not meet MVE criteria).


# Frequently Asked Questions

### Who should use this guide?

This guide is meant for people working in software, product, data science, behavioral science, impact evaluation and specific development sectors (e.g. health and education). Roles often map to different levels—engineers on model behavior (Level 1), product managers and data scientists on analytics (Level 2), social scientists and user researchers on user experience and behavior (Level 3), and impact evaluators on social impact (Level 4)—but silos can block progress. Effective GenAI for development requires cross-domain collaboration beyond any single level.

### Does this playbook dictate a specific practice or method for evaluating generative AI applications?

This playbook offers multiple approaches for each evaluation level, but remains prescriptive. Where options exist, we highlight their pros and cons. Where minimum standards or proven methods exist, we identify them.

### Does this framework imply a linear process from L1 to L4?

The levels imply order, but evaluation is cyclical. Teams may start with model benchmarks (Level 1), move to usability and engagement in deployment (Level 2), and return to Level 1 if usage drops. If engagement holds, evaluation can progress to user thoughts and behaviors (Level 3) and, eventually, long-term development outcomes (Level 4).

### Is this playbook just focused on GenAI evaluations?

Yes. We recognize that predictive AI, agentic AI, and other AI technologies are often associated with and part of AI products. This playbook does not currently focus on them, though future iterations may do so.

### How rigorous should organizations be when resources are limited?

Do not try to measure everything. We advocate for Minimum Viable Evaluations (MVE). You should start small as you pursue each level of evaluation and avoid building complex automated AI evaluation pipelines on Day 1. We highlight what constitutes MVEs in each section.

### Does this playbook help me identify which level of evaluation my organization should pursue for our AI application?

The playbook defines each evaluation level, but there is no single path. Teams may start at Level 1 and move upward, often looping back as products evolve. AI iteration can be fast once a process is in place, while user surveys in Levels 3–4 can be slower. The key challenge is choosing the right level and intensity at each stage—not to “reach” a level, but to decide what is enough evaluation given the context, risks, and questions.

### Are the evaluations described in this playbook all I need to develop a socially impactful product?

This playbook focuses on doing GenAI evaluation well—what to measure, how to measure it, and how to generate evidence with rigor and speed. It is not a full product development guide. While Levels 1–3 inform product decisions, they are necessary but not sufficient. Effective products also rely on process evaluation, UX design, and content strategy, which identify user pain points and shape how products function and feel, often running in parallel to or ahead of evaluation.

### What makes evaluation for development outcomes different from evaluation for commercial use?

Commercial apps often optimize for retention; development apps must optimize for welfare. This means looking beyond engagement to real-world behavior change (Level 3) and life outcomes (Level 4). A long chat may signal success in commercial settings, but in maternal health, a brief exchange that prompts a clinic visit is the real win.

### Do I need to re-evaluate every time the product launches in a new market?

Commercial products evolve and are localized as they enter new markets, supported by well-defined workflows. Evaluations must also adapt to context. When deploying an existing product in a new setting, factors like health system variation, disease epidemiology, or student–teacher ratios will shape the right metrics and benchmarks. For example, in one country, a cough-based AI TB detector may consider chest radiographs as the diagnostic “gold standard,” whereas for a different country, serological biomarkers might be used to ascertain the ground truth. Evaluation is therefore part of product development at every stage—from early prototyping to multi-country launch.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=references%2Ffrequently-asked-questions>" %}

</details>


# Tools & Templates

## Level 1

### LLM evaluations

* [LLM Evals: Everything You Need to Know](https://hamel.dev/blog/posts/evals-faq/)
* [Multi-Turn Chat Evals](https://hamel.dev/notes/llm/officehours/evalmultiturn.html)
* [How do I evaluate agentic workflows?](https://hamel.dev/blog/posts/evals-faq/#q-how-do-i-evaluate-agentic-workflows)
* [Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)
* [Hierarchical AI Evaluation](https://gamma.app/docs/AI-QA-Hierarchical-Evaluation-Architecture-9t79y026n43d7op?mode=doc) by Gamma

***

### LLM evaluation in the social sector

* [Generative AI for Health in Low & Middle Income Countries](https://cdh.stanford.edu/research-portfolio/generative-ai-health-low-middle-income-countries)
* [Evaluation framework of PROMPTS at Jacaranda Health](https://cdh.stanford.edu/generative-ai-health-low-middle-income-countries) (pg 33)
* [Evaluation framework at Precision Development](https://precisiondev.org/evaluating-ai-for-learning-a-framework/) ([slide](https://docs.google.com/presentation/d/1agCgpDWNVWtbOFhdlDYUpLM3OxyHP5CxyzON_tn61x0/edit?slide=id.p#slide=id.p))
* [Evaluation of Farmer.Chat at Digital Green](https://arxiv.org/abs/2409.08916)
* [Evaluation of mMitra at Armman](https://docs.google.com/presentation/d/1mAF1lI8tkTjLLW3SjwrV8mdz4VDkTdog/edit?slide=id.p1#slide=id.p1)

## Level 2

The tech industry has published numerous guidebooks and tools to help you define, collect, and analyze user funnel metrics. For details on how to construct common metrics, consider reviewing [The Agency Fund’s User Funnel Playbook](https://theagencyfund.substack.com/p/user-funnel-playbook-for-the-social).

In addition, you can leverage these reference materials:

* [The Amplitude Guide to Product Metrics](https://info.amplitude.com/rs/138-CDN-550/images/The%20Amplitude%20Guide%20to%20Product%20Metrics.pdf)
* [User Analytics for ChatGPT Enterprise and Edu](https://help.openai.com/en/articles/10875114-user-analytics-for-chatgpt-enterprise-and-edu-public-beta)
* [What We Know About Using Non-Engagement Signals in Content Ranking](https://arxiv.org/abs/2402.06831)

For more details on A/B testing, please review these resources:

* [Youth Impact: A/B Testing Toolkit](https://www.youth-impact.org/insights/a-b-testing-toolkit)
* [Optimizely: What is A/B testing?](https://www.optimizely.com/optimization-glossary/ab-testing/)
* [Amplitude: What is A/B testing? How it works and when to use it](https://amplitude.com/blog/ab-testing)

## Level 3

Case Study: [ChatSEL](https://agency-fund.github.io/chatsel-docs/docs/t1-intro) is a GenAI coach developed at the Agency Fund that provides teachers with evidence-based and context-sensitive guidance on understanding and implementing SEL programs in a low-resource classroom. Please see the following document for how we might measure Level 3 outcomes in the context of ChatSEL.

[User Evaluation Workshop - ChatSEL](https://docs.google.com/document/d/18AXtIeDx6HsidhMKTJ2kIDb7hUwHEkEnwGPZuC9JJo0/edit?tab=t.0)

## Process Evaluations

* IDinsight. “Process Evaluation.” IDinsight Impact Measurement Guide, <https://guide.idinsight.org/process-evaluation/>
* World Health Organization. Monitoring and Evaluating Digital Health Interventions: A Practical Guide to Conducting Research and Assessment. World Health Organization, 2016. <https://saluddigital.com/wp-content/uploads/2019/06/WHO.-Monitoring-and-Evaluating-Digital-Health-Interventions.pdf>
* Bliss, M. J., & Emshoff, J. G. (2018). Implementation Monitoring and Process Evaluation (Practical Guidebook). SAGE Publications.

<br>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=references%2Fadditional-resources>" %}

</details>


# Minimum Viable Evaluations

| Level 1 - Model evaluation MVE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <ul class="contains-task-list"><li><input type="checkbox">Define <a href="/spaces/VDHDXE8axdWQfu0OFCHP/pages/pMp8WTGTVjysMbVi5Lmw">2-3 rubrics for model success</a>, including at least one robust safety/guardrail metric, computed on your Golden Dataset.</li><li><input type="checkbox">In consultation with product and business owners, set a success criterion or threshold for each rubric/metric that must be met before the system is ready for deployment.</li><li><input type="checkbox">Develop a <a href="/spaces/VDHDXE8axdWQfu0OFCHP/pages/nlovOPA1IMARfE9MwQau">Golden Dataset</a> with at least 30-50 items representing key, diverse user interactions.</li><li><input type="checkbox">Establish a process for <a href="/spaces/VDHDXE8axdWQfu0OFCHP/pages/dmxTilPPMauBPlkqtQJw">expert review of AI system</a> responses for inputs in the Golden Dataset, as you iterate on your system configuration.</li></ul> |
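
The Level 1 checklist can be read as a simple go/no-go gate: score each Golden Dataset item against each rubric, then compare pass rates with the agreed thresholds. The sketch below assumes binary expert judgments (1 = pass, 0 = fail); the function name, rubric names, scores, and thresholds are hypothetical.

```python
def golden_dataset_gate(judgments, thresholds):
    """Compare per-rubric pass rates on the Golden Dataset with agreed thresholds.

    judgments: rubric name -> list of 0/1 expert judgments, one per dataset item.
    thresholds: rubric name -> minimum pass rate required for deployment.
    """
    report = {}
    for rubric, scores in judgments.items():
        pass_rate = sum(scores) / len(scores)
        report[rubric] = {
            "pass_rate": pass_rate,
            "passed": pass_rate >= thresholds[rubric],
        }
    ready = all(entry["passed"] for entry in report.values())
    return report, ready


# Hypothetical expert review of a 30-item Golden Dataset.
judgments = {
    "accuracy": [1] * 27 + [0] * 3,
    "safety_guardrail": [1] * 30,
}
thresholds = {"accuracy": 0.85, "safety_guardrail": 1.0}

report, ready = golden_dataset_gate(judgments, thresholds)
print(report, "ready for deployment" if ready else "not ready")
```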

| Level 2 - Product evaluation MVE                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <ul class="contains-task-list"><li><input type="checkbox">Instrument the product to capture events automatically</li><li><input type="checkbox">Use the events data to produce two metrics: activation (used once), and retention (used repeatedly)</li><li><input type="checkbox">Look for patterns in the data, and talk to users to identify opportunities for improvement</li><li><input type="checkbox">Test these ideas for improvement against these metrics with an A/B test</li></ul> |
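
The two Level 2 metrics can be computed directly from event data. In the sketch below, activation means an enrolled user has at least one event, and retention means an activated user has events on at least two distinct days. That retention definition, along with the event rows and user IDs, is an assumption for illustration; teams should choose a window that fits their product.

```python
from collections import defaultdict
from datetime import date

# Hypothetical event log rows: (user_id, day of event).
events = [
    ("u1", date(2025, 1, 1)), ("u1", date(2025, 1, 9)),
    ("u2", date(2025, 1, 2)),
    ("u3", date(2025, 1, 3)), ("u3", date(2025, 1, 3)),
]
enrolled_users = {"u1", "u2", "u3", "u4"}

# Collect the distinct days on which each user was active.
days_by_user = defaultdict(set)
for user_id, day in events:
    days_by_user[user_id].add(day)

# Activation: used the product at least once.
activated = {u for u in enrolled_users if days_by_user.get(u)}
# Retention: used the product on two or more distinct days.
retained = {u for u in activated if len(days_by_user[u]) >= 2}

activation_rate = len(activated) / len(enrolled_users)
retention_rate = len(retained) / len(activated) if activated else 0.0
print(f"Activation: {activation_rate:.0%}, retention: {retention_rate:.0%}")
```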

| Level 3 - User evaluation MVE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <ul class="contains-task-list"><li><input type="checkbox">Define 1-2 outcome metrics tied to the theory of change (focus on the most decision-relevant cognitive/behavioral outcomes), and include at least one early-warning indicator of harm (e.g., over-reliance, disengagement).</li><li><input type="checkbox">Combine at least one behavioral/trace metric with a brief, contextualized self-report measure (≤3 items) to capture meaningful user change.</li><li><input type="checkbox">Include a minimal external check (e.g., focus group discussion, offline data, or stakeholder validation) to ensure on-platform measures reflect real-world outcomes.</li><li><input type="checkbox">Consider testing product changes on selected outcomes using simple experimental methods (e.g., A/B tests).</li></ul> |

| Level 4 - Impact evaluation MVE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <ul class="contains-task-list"><li><input type="checkbox">Conduct an <a data-footnote-ref href="#user-content-fn-1">impact evaluation</a> with a <a data-footnote-ref href="#user-content-fn-2">counterfactual</a> and a sample size large enough to measure the key outcome(s) of interest, including among sub-populations of interest (e.g., by gender, geography).</li><li><input type="checkbox">Implement strong version control with either a frozen version or a limited number of product versions to be tested.</li><li><input type="checkbox">Collect cost data.</li></ul> |

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=references%2Fminimum-viable-evaluations>" %}

</details>

[^1]: An MVE Impact evaluation can also be done inexpensively. There are a number of resources on how to reduce costs and effort and still do a rigorous impact evaluation.

[^2]: Choose the counterfactual judiciously. Focus on the policy-relevant choice. While it might be interesting to see how an intervention delivered by humans compares to an AI delivery, if human delivery would be too expensive to be feasible, focus on a counterfactual where the intervention is not delivered.


# Glossary

## Core Concepts

**Continuous Evaluation:** An ongoing process where deployment, adaptation, evaluation, and improvement happen in rapid cycles. It ensures that GenAI applications remain safe, effective, and aligned with intended goals even as models or data evolve.

**Repeatable Motions:** Concrete, actionable processes that embed evaluation into ongoing development cycles, such as user funnel mapping, ETL pipelines, targeted hypothesis testing, and experimentation.

**Cross-Functional Team:** A multidisciplinary group (AI engineers, data scientists, product managers, behavioral scientists, economists) that collaborates across all four evaluation levels to connect model performance with social outcomes.

**User Funnel:** A structured map showing how users move through a product journey, from recruitment and onboarding to engagement, retention, and long-term outcomes, defining metrics, thresholds, and responsible leads for each stage.

**ETL Pipeline (Extract, Transform, Load):** A data pipeline that extracts data from multiple sources, cleans and standardizes it, and loads it into centralized storage for consistent and reliable measurement.

**Observability and Tracing:** Systematic logging of inputs, outputs, and metadata (like model parameters and costs) for every AI interaction, enabling transparency, debugging, and continuous improvement.
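As a rough illustration of what one logged interaction might contain, the sketch below serializes a single trace as a JSON line. The `TraceRecord` class and all of its field names are hypothetical and chosen for this example; observability tools define their own schemas.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TraceRecord:
    """One logged AI interaction. All field names are illustrative assumptions."""
    user_input: str
    model_output: str
    model_name: str
    prompt_version: str
    temperature: float
    latency_ms: float
    cost_usd: float
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def to_json_line(record: TraceRecord) -> str:
    """Serialize a trace as one JSON line, ready to append to a log file."""
    return json.dumps(asdict(record), ensure_ascii=False)


record = TraceRecord(
    user_input="When should I plant maize?",
    model_output="Plant at the start of the rainy season.",
    model_name="example-model",  # placeholder, not a real model name
    prompt_version="v3",
    temperature=0.2,
    latency_ms=840.0,
    cost_usd=0.0004,
)
print(to_json_line(record))
```

Storing the prompt version and model parameters alongside each input and output is what later lets you compare quality, cost, and latency across versions.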

**Experimentation:** A disciplined process of testing hypotheses through controlled methods such as A/B tests, holdout studies, or randomized trials to generate causal evidence about what works.

**Randomized Controlled Trial (RCT):** A study design that randomly assigns participants to a treatment group (receiving the AI product) or a control group (without it) to establish causal impact.

**Holdout Testing:** An evaluation method that withholds a subset of users from receiving a feature or product update to serve as a comparison group.

**A/B Test:** A method of comparing two versions of a product or feature (A and B) to see which one performs better, typically by randomly assigning users to one version or the other.
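One common way to implement the random assignment is to hash the user ID together with an experiment name, so that each user always lands in the same variant. The sketch below assumes string user IDs; the function name `assign_variant` and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a user to a variant via a hash of (experiment, user)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# Check that assignment is stable and roughly balanced across many users.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "onboarding-copy-test")] += 1
print(counts)
```

Including the experiment name in the hash means a user's assignment in one experiment does not determine their assignment in another.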

### Level 1 – Model Evaluation

**Rubric:** A list of qualitative characteristics and their domain-specific definitions (e.g., accuracy, empathy, clarity) that define what “good” performance looks like for a specific AI product or domain.

**Metric:** A quantitative measure used to evaluate how well the model meets each rubric criterion (for example, accuracy of classifying a user’s message as relevant or irrelevant).

**Scorer:** A tool or method that produces a numeric score for a given metric. Scorers can be statistical (like precision or recall), model-based (like using a stronger LLM to judge the quality of responses from a weaker LLM), or human (domain experts or trained raters).

**Golden Dataset:** A curated collection of representative inputs used to evaluate a model’s performance across iterations, paired with either ideal reference outputs or structured evaluation rubrics that define the characteristics of high-quality responses (e.g., as in [HealthBench](https://openai.com/index/healthbench/)), and including real, edge-case, out-of-scope, and adversarial examples.

**Red-Teaming:** A structured, adversarial testing process in which evaluators intentionally try to break or exploit the system to uncover vulnerabilities, biases, or unsafe behaviors before deployment.

**Human Evaluation:** Manual assessment by domain experts or human raters, used when subtlety or context sensitivity is needed, or when automated scorers may miss key nuances.

**Statistical Scorers:** Automated metrics that quantitatively compare model outputs to reference targets using predefined mathematical criteria (e.g., word overlap for text, precision/recall for classification, or error functions like MSE for regression). They are fast and scalable but limited in capturing higher-order semantic or contextual quality.

**Model-Based Scorers:** 1) Smaller, trained models that assess semantic similarity or text quality (e.g., BLEURT, COMET, BARTScore), or 2) LLM-as-Judge: an evaluation method that uses a Large Language Model (LLM) to assess and score the outputs of another model, typically calibrated against human judgments.
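One way to check how well an LLM judge is calibrated against human judgments is to compute chance-corrected agreement on a shared set of examples. The sketch below uses Cohen's kappa, which is one option among several; the labels are made-up illustration data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail judgments on eight responses.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
print(cohen_kappa(human, judge))  # 0.75
```

Raw agreement here is 7 out of 8, but kappa is lower because half of that agreement would be expected by chance given each rater's pass rate.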

**WER (Word Error Rate):** A measure of speech recognition accuracy, calculated as the number of word substitutions, insertions, and deletions needed to turn the predicted transcript into the reference transcript, divided by the number of words in the reference.

**CER (Character Error Rate):** Similar to WER but computed at the character level, used for fine-grained evaluation of transcription models.

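**MER (Match Error Rate):** A variant of WER that divides the number of errors (substitutions, insertions, and deletions) by the total number of aligned word pairs (correct matches plus errors), so the value always falls between 0 and 1.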
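A minimal sketch of WER and CER, assuming whitespace tokenization and the common convention of dividing the edit distance by the length of the reference. Real evaluations usually also normalize case and punctuation first; the function names are illustrative.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference transcript is empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "plant the maize after the first rain"
hypothesis = "plant maize after the first rains"
print(round(wer(reference, hypothesis), 3))  # 2 errors over 7 reference words
print(round(cer(reference, hypothesis), 3))
```

Because insertions count as errors, WER computed this way can exceed 1 when the predicted transcript is much longer than the reference.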

**Context Precision:** The extent to which the retrieved context consists of information that is actually relevant to answering the user’s query, minimizing irrelevant or noisy content.

**Context Recall:** The extent to which the retrieved context includes all the necessary information required to answer the user’s query, minimizing missing or omitted relevant content.
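As a simplified illustration of context precision and recall, the sketch below assumes that a human has already labeled which chunks are relevant to a query, and treats both metrics as plain set overlap. Evaluation libraries may compute them differently, for example with LLM judgments or rank-aware weighting; the chunk IDs are hypothetical.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant chunks that were retrieved."""
    if not relevant:
        raise ValueError("need at least one relevant chunk to compute recall")
    return len(set(retrieved) & relevant) / len(relevant)


retrieved = ["c1", "c4", "c7", "c9"]   # chunk IDs returned by the retriever
relevant = {"c1", "c2", "c7"}          # chunk IDs a human marked as needed
print(context_precision(retrieved, relevant))  # 0.5
print(round(context_recall(retrieved, relevant), 3))  # 0.667
```

Low precision points to noisy retrieval, while low recall points to missing information that the generator cannot recover on its own.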

**Answer Relevancy:** The extent to which an AI-generated response directly addresses the user’s question or task, staying on-topic and aligned with the user’s intent.

**Faithfulness:** The extent to which an AI-generated response accurately reflects information from retrieved or reference sources, avoiding hallucinations.

**Observability Tools:** Software such as Helicone, Langfuse, or Traceloop that logs the inputs, outputs, and other metadata (like cost, latency, and version history) of your AI system for debugging and evaluation.

**Benchmarking:** Comparing model performance against standardized public metrics or datasets to understand relative performance, though often insufficient for context-specific evaluation.

### Level 2 – Product Evaluation

**Engagement Metrics:** Measures of user participation and interaction with the product, such as session length, number of turns, or frequency of logins.

**Non-Engagement Metrics:** Quality and experience metrics that go beyond raw usage data, including user satisfaction ratings, toxicity scores, or perceived helpfulness.

**Retention Metrics:** Indicators of how many users continue to actively use the product over time, such as Daily Active Users (DAU) or Monthly Active Users (MAU).
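Retention can be defined in several ways. The sketch below assumes one possible definition of week-N retention: the share of all users who were active at least once during the Nth week after their first recorded activity (week 1 covers days 0 to 6). The event data and function name are hypothetical.

```python
from datetime import date


def week_n_retention(events: list[tuple[str, date]], week: int) -> float:
    """Share of users active during week `week` after their first activity."""
    first_seen: dict[str, date] = {}
    for user, day in events:
        if user not in first_seen or day < first_seen[user]:
            first_seen[user] = day
    if not first_seen:
        return 0.0
    start, end = (week - 1) * 7, week * 7 - 1  # day offsets covered by this week
    retained = {
        user for user, day in events
        if start <= (day - first_seen[user]).days <= end
    }
    return len(retained) / len(first_seen)


events = [
    ("u1", date(2025, 1, 1)), ("u1", date(2025, 1, 24)),  # back on day 23
    ("u2", date(2025, 1, 1)), ("u2", date(2025, 1, 10)),  # last seen on day 9
    ("u3", date(2025, 1, 5)), ("u3", date(2025, 1, 27)),  # back on day 22
]
print(round(week_n_retention(events, week=4), 3))  # 0.667
```

This version counts every user in the denominator; in practice you would usually restrict it to users who joined at least N weeks before the data was pulled.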

**Action-Based Engagement:** Measures of how users respond to AI-generated outputs (e.g., clicks, prompt rewrites, or emoji reactions) that show behavioral engagement.

**Feature Uptake:** The rate at which users adopt optional product features, indicating trust in and perceived usefulness of the AI’s suggestions or capabilities.

**A/B Test:** A method of comparing two versions of a product or feature (A and B) to see which one performs better, typically by randomly assigning users to one version or the other.

**Multi-Armed Bandit:** An adaptive experimental approach that allocates more users to the better-performing variant as evidence accumulates.
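To make the adaptive allocation concrete, here is a minimal epsilon-greedy sketch, one of several bandit strategies (Thompson sampling is another). The class name, variant names, and conversion rates are all hypothetical, and the simulated "world" stands in for real user responses.

```python
import random


class EpsilonGreedyBandit:
    """Mostly sends users to the best-looking variant, but keeps exploring."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in self.arms}
        self.successes = {arm: 0 for arm in self.arms}

    def rate(self, arm) -> float:
        """Observed success rate for one variant (0.0 before any data)."""
        return self.successes[arm] / self.pulls[arm] if self.pulls[arm] else 0.0

    def choose(self):
        untried = [arm for arm in self.arms if self.pulls[arm] == 0]
        if untried:
            return untried[0]                  # try every variant at least once
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)  # explore a random variant
        return max(self.arms, key=self.rate)   # exploit the current best

    def update(self, arm, success: bool) -> None:
        self.pulls[arm] += 1
        self.successes[arm] += int(success)


# Simulated experiment: the true rates are unknown to the bandit.
true_rates = {"A": 0.05, "B": 0.30}
world = random.Random(1)
bandit = EpsilonGreedyBandit(true_rates, epsilon=0.1, seed=0)
for _ in range(5000):
    arm = bandit.choose()
    bandit.update(arm, world.random() < true_rates[arm])
print(bandit.pulls)
```

Compared with a fixed 50/50 A/B test, fewer users end up on the weaker variant, at the cost of less precise estimates for that variant.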

**Holdout Testing:** An evaluation method that withholds a subset of users from receiving a feature or product update to serve as a comparison group.

**Quality Scores:** Automated or human-assigned ratings that evaluate AI outputs for attributes like clarity, correctness, empathy, or tone.

**User-Level Surveys:** Post-interaction questionnaires that measure satisfaction, usability, and perceived value of the product experience.

**Experimentation Platform:** Tools such as Evidential that automate randomization, monitor outcomes, and streamline experimental workflows.

### Level 3 – User Evaluation

**Cognitive Outcomes:** Changes in users’ knowledge, reasoning, comprehension, or decision-making abilities as a result of using the AI tool.

**Affective Outcomes:** Changes in users’ emotions, motivation, or sense of trust, belonging, and empathy while interacting with the AI.

**Behavioral Outcomes:** Observable actions such as following AI recommendations, asking more questions, or applying learned information in real-world contexts.

**On-Platform Behavioral Measures:** Telemetry or interaction data collected within the app, such as the number of sessions, conversation depth, or rate of follow-up questions.

**Self-Report Surveys:** Short questionnaires embedded in the product experience that directly ask users about their feelings, confidence, or learning outcomes.

**Psychometrically Sound Measures:** Survey instruments or scales that have been validated for reliability and construct validity, ensuring they accurately measure psychological constructs.

**Sentiment Analysis:** Automated scoring of the emotional tone in users’ messages, detecting trends such as increased positivity or reduced anxiety over time.

**Topic Modeling:** An NLP method that clusters user text into recurring themes or topics to track what users are discussing and how it evolves.

**Linguistic Inquiry and Word Count (LIWC):** A dictionary-based text analysis tool that categorizes words into psychological and linguistic domains, revealing shifts in emotion, thinking, or social connection.

**LLM-Based Text Analysis:** Using large language models to infer psychological or behavioral constructs (e.g., confidence, agency) from user text in a scalable and nuanced way.

**Off-Platform Assessments:** Methods such as structured interviews, standardized tests, or observer reports (from teachers, caregivers, etc.) that evaluate real-world behavioral or attitudinal changes.

**Proximal Outcomes:** Near-term indicators (cognitive or behavioral) that signal whether a product is on track to achieve longer-term development outcomes.

### Level 4 – Impact Evaluation

**Impact Evaluation:** The rigorous assessment of whether an intervention leads to measurable improvements in long-term outcomes such as health, learning, or income.

**Randomized Controlled Trial (RCT):** A study design that randomly assigns participants to a treatment group (receiving the AI product) or a control group (without it) to establish causal impact.

**Counterfactual:** The hypothetical scenario representing what would have happened without the intervention; used to isolate true program effects.

**Treatment Group:** The participants or units who receive access to the AI product or intervention being evaluated.

**Control Group:** Participants or units deliberately withheld from receiving the AI product to serve as a comparison.
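How many participants the treatment and control groups need depends on the smallest effect you want to detect. The sketch below uses the standard normal-approximation formula for comparing two means with equal allocation, with an optional design effect for cluster randomization. The defaults (5% significance, 80% power) and the example inputs are assumptions; a real power calculation should be done with your evaluation team.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(effect_size_sd: float, alpha: float = 0.05,
                        power: float = 0.80, cluster_size: int = 1,
                        icc: float = 0.0) -> int:
    """Approximate participants needed in each arm of a two-arm trial.

    effect_size_sd: minimum detectable effect in standard deviations.
    cluster_size, icc: average cluster size and intra-cluster correlation,
    used to inflate the sample for cluster-randomized designs.
    """
    if effect_size_sd <= 0:
        raise ValueError("effect size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 / effect_size_sd ** 2
    design_effect = 1 + (cluster_size - 1) * icc
    return math.ceil(n * design_effect)


print(sample_size_per_arm(0.3))                             # 175
print(sample_size_per_arm(0.3, cluster_size=20, icc=0.05))  # 341
```

Halving the detectable effect roughly quadruples the required sample, which is why choosing a realistic effect size matters so much for cost.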

**Evaluability:** The degree to which a product or program is ready for rigorous impact evaluation, including whether randomization or measurement structures can be feasibly implemented.

**Product Dynamism:** The tendency of AI products to evolve during evaluation (through retraining or updates), which must be managed through version tagging and careful analysis.

**Spillovers and Contamination:** When participants in the control group are unintentionally exposed to the AI product or affected by treated participants, which can bias results. Mitigated through clustered or encouragement designs.

**Cost-Effectiveness:** A measure comparing the magnitude of impact achieved relative to the cost of implementation, often used by funders to guide scaling decisions.
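One simple way to express this is cost per unit of impact: total cost divided by total impact achieved. The sketch below assumes impact is measured as an average per-participant effect (for example, in standard deviations of a test score); the numbers are hypothetical.

```python
def cost_per_unit_of_impact(total_cost: float, participants: int,
                            effect_per_participant: float) -> float:
    """Total cost divided by total impact (participants x average effect)."""
    if participants <= 0 or effect_per_participant <= 0:
        raise ValueError("need a positive participant count and a positive effect")
    return total_cost / (participants * effect_per_participant)


# Hypothetical program: $40,000 for 10,000 users, average gain of 0.2 SD.
cost = cost_per_unit_of_impact(total_cost=40_000, participants=10_000,
                               effect_per_participant=0.2)
print(f"${cost:.2f} per SD of improvement")
```

Comparisons are only meaningful between programs that measure the same outcome in the same units.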

**External Validity:** The degree to which the results of an RCT can be generalized to other settings, populations, or time periods.

**Attrition:** Loss of participants during an evaluation, which can threaten the validity and interpretability of results if not managed and reported transparently.

**Theory of Change:** A logical model describing how an intervention is expected to lead to its intended outcomes, outlining causal pathways and assumptions.

### Tools

**Prompt Engineering:** The process of designing and refining prompts to elicit desired model behavior or output quality.

**Promptfoo:** An open-source tool for testing, red-teaming, and securing prompt pipelines.

**DeepEval:** An open-source evaluation framework for automated model testing and guardrails.

**RAG (Retrieval-Augmented Generation):** An AI architecture that retrieves relevant knowledge before generating responses, improving factual accuracy and grounding.

**RAGAS Metrics:** A set of evaluation metrics for RAG systems, including Answer Relevancy, Faithfulness, Context Recall, Context Precision, and Context Relevancy.

**Helicone / Langfuse / Traceloop:** Observability and telemetry platforms that capture prompts, model calls, costs, and latency for continuous evaluation.

**OpenTelemetry:** An open-source standard for collecting and exporting metrics, logs, and traces across systems.

**Evidential:** A lightweight experimentation tool that automates randomization, tracking, and analysis for A/B or holdout tests.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=references%2Fglossary>" %}

</details>


# Using the Playbook with AI Tools

You don't need to read through the entire playbook every time you want to apply the 4-level framework. This page shows you five ways to bring the playbook directly into the AI tools you're already using so that the playbook can seamlessly integrate with your existing workflows.

Each option comes with a tradeoff in terms of ease of setting up and the features that are available to you. Choose the option that best suits your needs.

## Skills

**Best for: Answering questions on evaluation and preparing artifacts like slides/docs/reports.**

A "skill" extends Claude's capabilities by giving it access to specialized knowledge and workflows. For example, a talented presenter can create a "skill" explaining the techniques they use to make engaging presentations. This skill can then be used to help others learn from their expertise and mimic their presentation style.

A skill file is a simple text file that contains the specialized knowledge and workflows. It can be uploaded to your AI tool of choice and used to answer your questions following the instructions in the skill file.

To learn more about `Skills`, see [this](https://support.claude.com/en/articles/12512176-what-are-skills) article by Anthropic and [this](https://youtu.be/a3uMv1S-1tM) step-by-step tutorial on using Claude skills.

We have created a skill file for the AI Evaluation Playbook that answers your questions on evaluation using the 4-level framework. You can download it from [here](https://github.com/IDinsight/ai-eval-playbook/blob/introduction/skills/ai-eval-playbook-guide.skill.md).

### Adding the playbook skill to Claude

Open [Claude](https://claude.ai).

Click on `Customize`

![](/files/Posa5KwNsCmMSr0NCBWH)

Open `Skills`

![](/files/3u954azIwGSdlQgSXgQw)

Click on `Add skill`

![](/files/ZT6hWYjQVGXANBdzWtR3)

Select `Create skill` -> `Upload a skill`

![](/files/XqTMf1LPIb0qZhVJKnNs)

Upload the skill file you downloaded above

![](/files/skbj0edAH3jlcwPzhtVE)

Once uploaded, you will see the `ai-eval-playbook-guide` skill in the list of skills.

![](/files/AYOvZ1PwB3E0uZg7TMyK)

### Using the playbook skill

To use the skill, you can start a new conversation with Claude and ask your evaluation-related questions. You can specifically ask Claude to use the playbook skill or refer to the 4-level evaluation framework to answer the question. Claude will read the skill file and answer the question based on the relevant sections in the playbook.

For example, ask the following question:

```
We built a Theory of Change 12 months ago for an AI literacy tutor in rural India. Now we have:

Level 1 accuracy data (87% on golden dataset), Level 2 data (35% week-4 retention, most drop-off at onboarding), and Level 3 data (self-efficacy scores improving but knowledge test scores flat).

Using the framework linkages guidance in the AI Evaluation Playbook, stress-test our Theory of Change against this evidence.


Are we ready for a Level 4 RCT? Output this as a structured memo I can share with our funder.
```

Claude will start reading the skill file:

![](/files/ZCvhQYoMDpRaWlVoRKGC)

The final response for this question is generated using the 4-level framework:

![](/files/9sNyPFlxmZFWAx9A6WTp)

## NotebookLM

**Best for: Learning the framework, exploring ideas, and getting answers grounded only in the playbook.**

[NotebookLM](https://notebooklm.google) is a free AI-research tool by Google that lets you upload documents as "sources" and then ask questions about them to get answers that cite specific sections from the source documents.

When you add the playbook as a source, NotebookLM grounds its answers in the playbook rather than in outside knowledge. It also shows the specific sections of the playbook that were used to generate each answer.

This makes it a great option if you're new to the framework and want to explore it, or if you want to be confident that responses are grounded in the actual content.

### Connecting the playbook to NotebookLM

Open [NotebookLM](https://notebooklm.google.com/) and create a new notebook.

Add `https://eval.playbook.org.ai` as a source for the notebook.

![](/files/x5wL6bhszhdYNtXZ6KOM)

Wait about a minute for NotebookLM to process the playbook content. Once it is ready, you will see a summary of the playbook shown as the description of the notebook.

![](/files/UTVMHYe5hCmvQLfzj3g8)

### Using NotebookLM

Start by asking the same question as the previous section:

```
We built a Theory of Change 12 months ago for an AI literacy tutor in rural India. Now we have:

Level 1 accuracy data (87% on golden dataset), Level 2 data (35% week-4 retention, most drop-off at onboarding), and Level 3 data (self-efficacy scores improving but knowledge test scores flat).

Using the framework linkages guidance in the AI Evaluation Playbook, stress-test our Theory of Change against this evidence.

Are we ready for a Level 4 RCT?
```

You can see the response is grounded in the playbook and it cites the specific sections from the playbook that were used to generate the answer.

![](/files/mxvX5tD4RFEV1KyJfgaH)

**Things to keep in mind**

* NotebookLM keeps responses strictly within what you've uploaded — it won't draw on outside knowledge. This is great for accuracy, but means it won't combine the framework with other context you haven't added.
* You can add more links and documents as sources beyond the playbook to NotebookLM so that the response takes all the sources into account.

## Gemini Gems

**Best for: Running the same type of task repeatedly, especially if your team already uses Google Workspace (Docs, Sheets, Drive).**

[Gemini Gems](https://gemini.google/overview/gems/) let you create a customized version of Gemini that repeatedly performs a concrete task with specific instructions and a clear goal.

Think of it as a dedicated assistant pre-configured to work with the 4-level framework. You can also connect it to Google Docs, Sheets, and Slides, making it useful when you want to apply the framework alongside your own organisation's data and documents.

### Using the playbook with Gemini Gems

Open the `Gems` tab in [Gemini](https://gemini.google.com/) and create a new Gem.

![](/files/NJExeklEeeOVuuVGcIGy)

Add a name and description for the Gem along with instructions that explain the task the Gem should perform.

**For example:**

`Name`: Policy Impact Auditor

`Description`: To take raw data or project descriptions and categorize them into the 4-level framework to identify where a policy intervention is succeeding or leaking value

`Instructions`:

```
You are an expert Policy Analyst and Socio-Technical Researcher. Your task is to evaluate AI-driven policy interventions using the 4-Level Evaluation Framework:

Level 1 (Model): Technical performance, accuracy, and bias.

Level 2 (Product): UI/UX, accessibility, and adoption metrics.

Level 3 (User): Behavioral changes, trust, and mental models of the target population.

Level 4 (Impact): Long-term systemic outcomes (economic, health, or social equity).

Your Workflow:

When I provide a project summary, break it down into these four levels.

Identify 'Critical Gaps' (e.g., if a model is 99% accurate but the product is too complex for the target user to navigate).

Suggest 'Policy Levers' for each level to improve the final Impact (Level 4).

Maintain a professional, skeptical, and data-driven tone. Use tables for comparisons.
```

![](/files/lHSGBjWLnCVXRYAa51gF)

Next, add sources to the Gem. Here, you can add the NotebookLM notebook created above as a source, along with other files from your Google Drive.

![](/files/TjVRHpBoFF4bAGm9yDDV)

Click `Save`. Your Gem is now ready to use.

### Using the Gem

Try the following prompt:

```
We have deployed an LLM-based SMS chatbot designed to provide real-time agricultural advice to 50,000 farmers to increase national crop yields.

Data for Evaluation:

Technical Performance: The model has a 94% accuracy rate on technical soil science questions in English benchmarks. However, it occasionally hallucinates local seed brand names.

Access & Usage: We have 15,000 active monthly users. Data shows high engagement in urban-adjacent areas, but 0% engagement in the northern "dry-belt" region where 2G connectivity is unstable.

User Feedback: In qualitative interviews, farmers expressed high trust in the AI's "tone," but 40% reported they were confused by the technical jargon used in the advice. 10% of users followed advice that led to minor crop loss due to misinterpreted application rates.

Development Goal: The national policy goal is a 15% increase in maize yield per hectare to ensure food security.
```

Here is a preview of the response:

![](/files/eTiFl5OpS0MG5X2wEK2a)

You can see the full response [here](https://gemini.google.com/share/eb1f3e03b45f).

**Things to keep in mind**

* Gems work best for tasks you run regularly — like reviewing an evaluation plan against the framework, or checking whether a set of metrics maps to the right level.
* You can combine the Gem's knowledge with tools such as deep research and image or video creation, as shown below.

![](/files/bFevwh8M0pXC4hkPaW8f)

## Paste the playbook content directly

**Best for: Using the playbook with any AI tool — Claude, ChatGPT, Gemini. No setup needed.**

The options mentioned above were specific to the AI tools that support them. Most of them need you to do some setup to use the playbook.

But if you want to get a flavour of how the playbook can instantly help you with the least setup possible, you can paste the playbook content directly into your AI tool of choice.

Every AI tool has a text box you can type or paste into.

Open the full playbook text file by visiting [this URL](https://eval.playbook.org.ai/llms-full.txt) in your browser.

You will see the full playbook content in your browser. Select all the text and copy it.

Open your AI tool of choice and paste the playbook content into the text box. Use any of the example prompts mentioned in the previous options to test it out.

For copying the contents of a single page rather than the whole playbook, you can append `.md` to any page URL (for example: `https://eval.playbook.org.ai/level-3-user-evaluation/overview/why-is-this-level-of-evaluation-important.md`).

**Things to keep in mind**

* This works in any AI tool and no other setup is required. The entire playbook text is around 45,000 words. Most modern AI tools can handle this, but very long pastes may slow down responses or exhaust the token limit of your plan.
* You need to paste it fresh every new conversation. Once the value of the playbook is clear to you, switch to one of the options above so that you don't have to keep pasting the playbook content every time.

***

## MCP server

**Not recommended for most users. Using the skills file is a better option.**

MCP (Model Context Protocol) is a way to connect Claude to external resources so it can look things up during a conversation without you needing to paste anything.

You need to add the playbook as a connector, and then Claude will pull in the relevant sections automatically whenever you ask evaluation-related questions.

The skill file achieves practically the same result, while the MCP server requires a more complicated setup.

Follow the steps [here](https://support.claude.com/en/articles/11175166-get-started-with-custom-connectors-using-remote-mcp) on how to add a connector. Use `https://eval.playbook.org.ai/~gitbook/mcp` as the MCP server URL.

Once connected, use any of the example prompts mentioned in the previous options to test it out.

## Which option is right for you?

If you're not sure where to start, **NotebookLM** is the easiest way to explore the playbook interactively. Once you're comfortable, **Claude Skills** is recommended for regular Claude users and **Gemini Gems** is recommended for Google Workspace users who want to run the same task repeatedly. **Pasting the playbook content directly** works with any AI tool and requires no setup.

## Example use cases by role

To show how the playbook can be used in practice, we have provided example use cases for different roles in a team. These are not exhaustive, but they should give you an idea of where the playbook can help in your work.

### Impact Evaluator

*Designs RCTs and quasi-experimental studies, manages counterfactual selection, and connects Level 1–3 evidence to long-term outcomes. Leads Level 4.*

**Example 1 — Drafting an RCT pre-analysis plan**

> You're pre-registering a Level 4 RCT for an AI agricultural advisory tool. You need a pre-analysis plan that handles the unique challenges of evaluating a product that will change during the trial.

**Try this prompt:**

> I'm pre-registering a cluster-randomised RCT to evaluate an AI agricultural advisory tool for 800 maize farmers across 40 villages in Ethiopia. Primary outcome: crop yield at harvest. The product will likely update 2–3 times during the 8-month trial.
>
> Using the Level 4 guidance in the AI Evaluation Playbook, draft the key sections of a pre-analysis plan. Include: counterfactual justification, how product versions will be tagged and handled analytically, spillover mitigation strategy (the tool is on WhatsApp and can be shared), power calculation assumptions, primary and secondary outcomes, and pre-specified subgroup analyses by gender and land size. Flag the top 3 AI-specific pitfalls to address.

**What you'll get:** A structured pre-analysis plan with AI-specific versioning and spillover sections — ready for pre-registration.

***

**Example 2 — Stress-testing a Theory of Change**

> Your Theory of Change was written 12 months ago. Level 1–3 data is now available. You need to check whether the causal chain still holds before committing to a Level 4 study.

**Try this prompt:**

> We built a Theory of Change 12 months ago for an AI literacy tutor in rural India. Now we have: Level 1 accuracy data (87% on golden dataset), Level 2 data (35% week-4 retention, most drop-off at onboarding), and Level 3 data (self-efficacy scores improving but knowledge test scores flat).
>
> Using the framework linkages guidance in the AI Evaluation Playbook, stress-test our Theory of Change against this evidence. Identify which causal links are supported, which are broken or uncertain, what the flat knowledge scores imply about our proximal outcome assumptions, whether we are ready for a Level 4 RCT or should iterate further, and what process evaluation questions to answer first. Output this as a structured memo I can share with our funder.

**What you'll get:** A structured memo identifying which causal links hold and which don't — with a clear recommendation on whether to proceed to Level 4 or iterate first.

### Domain Expert

*Validates rubrics, golden datasets, metric definitions, and Theory of Change assumptions across health, education, or agriculture domains. Supports all levels.*

**Example 1 — Critiquing a rubric from a clinical perspective**

> The engineering team has drafted a Level 1 rubric for a clinical decision support tool. As a nurse supervisor, you need to validate it before the golden dataset sprint.

**Try this prompt:**

> I'm a nurse supervisor reviewing a Level 1 evaluation rubric drafted by engineers for an AI clinical decision support tool used by community health workers in Uganda. The rubric has 5 dimensions: medical accuracy, response completeness, safety, tone, and latency.
>
> Help me critique this rubric from a clinical domain expert perspective, following the AI Evaluation Playbook's guidance on rubric validation. For each dimension: flag what the engineers likely missed from a clinical workflow standpoint, suggest a concrete real-world failure case that the current definition would miss, and propose a sharper domain-specific definition. Then suggest one additional dimension the engineers have overlooked entirely.

**What you'll get:** A detailed critique with dimension-by-dimension gaps, real failure cases, sharper definitions, and a missing dimension — ready to return to the engineering team.

***

**Example 2 — Annotating a Theory of Change**

> You're reviewing a Theory of Change for an AI advisory tool for smallholder farmers in Northern Ghana. The causal chain looks clean on paper — your job is to find where it breaks in the field.

**Try this prompt:**

> I'm a domain expert reviewing a Theory of Change for an AI advisory tool for smallholder farmers in Northern Ghana. The ToC assumes: farmers receive AI crop advice → act on advice within 48 hours → improve crop management → increase yields.
>
> Using the Theory of Change guidance from the AI Evaluation Playbook, help me identify the weakest assumptions from a field implementation perspective. For each weak link: explain the real-world constraint that breaks the assumption (e.g. input availability, weather, land tenure), suggest a Level 2 or Level 3 metric that would detect when this link is failing, and recommend a process evaluation method to investigate it. Format this as annotated ToC review notes I can return to the research team.

**What you'll get:** Annotated ToC notes with field-grounded constraints, early-warning metrics, and process evaluation methods — ready to send back to the research team.

### Policy Analyst

*Works in government, multilaterals, or think tanks. Interprets evaluation findings, assesses whether a tool is ready to scale, and translates technical evidence into recommendations for decision-makers.*

**Example 1 — Writing a policy brief from evaluation data**

> Your ministry is deciding whether to integrate an AI agricultural advisory tool into the national extension service for 2 million smallholder farmers. You have technical evaluation reports and need a 2-page brief for the Secretary.

**Try this prompt:**

> I'm a policy analyst at a ministry of agriculture. We're evaluating whether to integrate an AI advisory chatbot into the national extension service. I have the following evaluation summary: Level 1 accuracy 89% overall but 74% word error rate in Amharic; Level 2 week-4 retention 52%, with heavy urban/rural split; Level 3 self-efficacy scores up 0.4 SD after 8 weeks, knowledge test scores flat, 12% of users show AI dependency signals.
>
> Using the AI Evaluation Playbook's 4-level framework, help me interpret this evidence for a non-technical Secretary-level audience. Structure your response as: (1) a plain-language verdict on each level — what it means in practice, not what the number is; (2) the 2 biggest risks of scaling now versus waiting; (3) the 3 conditions the implementer must meet before national rollout; and (4) a one-paragraph executive summary I can put at the top of the brief.

**What you'll get:** A structured brief with plain-language verdicts, risk analysis, scale conditions, and a one-paragraph executive summary — ready to hand to the Secretary.

***

**Example 2 — Comparing two competing interventions**

> Two AI tools are competing for the same budget. You need to compare them not by their marketing claims, but by the strength of their evidence chains.

**Try this prompt:**

> I need to compare two AI interventions competing for the same funding:
>
> Option A — AI maternal health chatbot: Level 1 accuracy 91%, Level 2 week-4 retention 61%, Level 3 showing reduced anxiety (effect size 0.3 SD), no Level 4 evidence yet. Cost per user: $4.
>
> Option B — AI teacher coaching tool: Level 1 accuracy 78%, Level 2 week-4 retention 44%, Level 3 knowledge gains 0.5 SD but self-efficacy flat, one Level 4 RCT in progress (results in 9 months). Cost per user: $11.
>
> Using the AI Evaluation Playbook's evidence strength framework across all four levels, help me structure a comparison. For each option: assess the strength and gaps in the evidence chain, flag what is missing before a scaling decision is justified, estimate the relative risk of a premature scale-up, and suggest what interim condition or milestone should be attached to any funding decision.

**What you'll get:** A structured comparison that reads the evidence pattern — not just the numbers — and surfaces what each product still needs to prove before it earns a scaling decision.

### Funding Reviewer

*Works at a foundation, bilateral donor, or multilateral. Reviews grant proposals for GenAI projects, assesses whether proposed evaluation plans are rigorous enough, and sets evaluation conditions for funding.*

**Example 1 — Reviewing a proposal's evaluation plan**

> A promising NGO has submitted a $2M proposal for an AI literacy tutor. Their evaluation section is 3 paragraphs. You need a structured critique before the investment committee meeting.

**Try this prompt:**

> I'm reviewing a $2M grant proposal for an AI literacy tutor targeting out-of-school girls aged 10–14 in rural Pakistan. The applicant's entire evaluation plan reads: "We will track user satisfaction surveys and engagement analytics, aiming for 80% satisfaction and 70% weekly active users by month 6. A third-party evaluation will be commissioned in year 2."
>
> Using the AI Evaluation Playbook's Minimum Viable Evaluation checklists for all four levels, score this evaluation plan against what the playbook considers the minimum bar for each level. For each level: state whether the plan meets, partially meets, or fails to meet the MVE standard, explain the specific gap, and write 1–2 specific questions I should ask the applicant in the clarification call. Then give an overall readiness verdict: fund as-is, fund with conditions, request a resubmission, or decline. Include the 3 non-negotiable conditions I would attach to any funding decision.

**What you'll get:** A level-by-level gap analysis mapped to the MVE checklists, specific clarification questions, and a funding verdict with non-negotiable conditions — reviewable by your investment committee.

***

**Example 2 — Setting evaluation requirements for an RFP**

> Your foundation is launching a $10M RFP for GenAI tools in primary healthcare. You need evaluation requirements that are rigorous but won't exclude smaller organisations.

**Try this prompt:**

> I'm designing evaluation requirements for a $10M RFP for GenAI tools in primary healthcare across Sub-Saharan Africa. Applicants will range from small local NGOs to established international organisations. We want rigorous evaluation without creating requirements so burdensome that only large organisations with research departments can apply.
>
> Using the AI Evaluation Playbook's Minimum Viable Evaluation framework and tiered approach, help me design a two-tier evaluation requirement: a baseline tier all applicants must meet, and an enhanced tier for applicants requesting over $500K. For each tier and each of the 4 evaluation levels, specify the minimum required activities, the evidence format you'd accept, and the red lines that would disqualify a proposal regardless of tier.

**What you'll get:** A two-tier evaluation framework with per-level requirements, accepted evidence formats, and disqualifying red lines — ready to paste into your RFP.

### AI / ML Engineer

*Builds and maintains the AI pipeline, evaluation rubrics, golden datasets, and automated scoring. Primarily works at Level 1 but feeds into Levels 2–4.*

**Example 1 — Drafting an evaluation rubric**

> You're building an agricultural advisory chatbot for smallholder farmers in Kenya. You need a Level 1 rubric before writing a single golden dataset entry.

**Try this prompt:**

> I'm building a RAG-based agricultural chatbot for smallholder farmers in Kenya. It answers questions about crop disease, planting schedules, and input sourcing via WhatsApp in Swahili and English.
>
> Using the evaluation rubric guidance from the AI Evaluation Playbook (Level 1), help me draft a 5-dimension rubric. For each dimension include: the qualitative definition, a concrete example of a passing and failing response, and a suggested scorer type (statistical, model-based, or LLM-as-judge).

**What you'll get:** A structured rubric with pass/fail examples and scorer recommendations — ready to hand to your team before the dataset sprint begins.

***

**Example 2 — Seeding a golden dataset**

> Your domain expert has 2 hours. You need to get maximum value from that session by pre-drafting diverse golden dataset entries for their review.

**Try this prompt:**

> I have 2 hours with an agronomist before my golden dataset sprint. My chatbot handles crop disease, planting advice, and input sourcing for maize farmers in Western Kenya.
>
> Generate 15 draft golden dataset entries covering: typical user queries in varying formality and Swahili-English code-switching, out-of-scope requests, and adversarial/safety edge cases. For each entry provide: user input, ideal output structure, and which rubric dimension it primarily tests. Flag the 3 entries most critical for the expert to validate first.

**What you'll get:** A diverse draft dataset that makes the expert session far more productive — with the highest-risk entries flagged for priority review.

### Product Manager

*Owns product metrics, the user funnel, A/B test design, and translating evaluation insights into the roadmap. Primarily works at Level 2.*

**Example 1 — Designing a user funnel**

> You're launching a maternal health WhatsApp chatbot for expectant mothers in Nigeria. You need a user funnel with metrics before your engineering sprint.

**Try this prompt:**

> We're launching a maternal health chatbot on WhatsApp for expectant mothers in Nigeria. Our theory of change: mothers receive timely health information → increase antenatal care visits → reduce maternal mortality.
>
> Using the user funnel framework from the AI Evaluation Playbook (Level 2), design a complete funnel from Acquisition to Development Outcome. For each funnel stage: define the metric, explain how to measure it in a WhatsApp context, and identify the leading indicator that predicts the next stage.

**What you'll get:** A complete funnel with stage-by-stage metrics, measurement methods, and leading indicators — ready for your engineering sprint planning.

***

**Example 2 — Writing an A/B test plan**

> Retention drops after week 2. You suspect the onboarding tone is too clinical. You need a clean hypothesis and test design before the next sprint.

**Try this prompt:**

> Our maternal health chatbot has a 40% week-2 retention drop. Level 2 data shows users engage heavily in week 1 but disengage after the first prenatal reminder message.
>
> Help me write an A/B test plan following the experimentation guidance in the AI Evaluation Playbook. Include: the specific hypothesis, treatment vs control variants, primary and secondary metrics, minimum detectable effect, guardrail metrics to monitor, and a pre-analysis plan summary. Then list 3 alternative hypotheses I should rule out first via process evaluation.

**What you'll get:** A rigorous test plan with a clear hypothesis, MDE calculation, and a checklist of things to investigate before running the experiment.
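
The minimum detectable effect in a plan like this is tied to how many users each variant needs. The following is a minimal sketch, assuming a two-sided test that compares two retention proportions using the standard normal-approximation formula. The function name `sample_size_per_arm`, the 60% baseline, and the 5-point lift are hypothetical inputs for illustration, not figures from the playbook.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(
    baseline: float, mde: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Users needed per arm to detect an absolute lift of `mde` over `baseline`."""
    treated = baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    # Sum of the binomial variances of the control and treatment proportions.
    variance = baseline * (1 - baseline) + treated * (1 - treated)
    return ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)


# Hypothetical inputs: 60% week-2 retention in control, 5-point lift worth detecting.
users_per_arm = sample_size_per_arm(baseline=0.60, mde=0.05)
print(users_per_arm)
```

Halving the MDE roughly quadruples the required sample, which is why the MDE is worth agreeing on before the test is built.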

### Data Scientist

*Builds ETL pipelines, defines metric schemas, runs A/B analysis, and connects data across evaluation levels.*

**Example 1 — Designing a data schema across all four levels**

> You need to design a data warehouse schema that links model traces, product events, and survey responses across all four evaluation levels.

**Try this prompt:**

> I'm building the data infrastructure for a digital agriculture platform serving 50,000 farmers. We collect: LLM trace logs (Level 1), WhatsApp engagement events (Level 2), quarterly SMS surveys (Level 3), and annual yield data from partner NGOs (Level 4).
>
> Using the ETL pipeline guidance from the AI Evaluation Playbook, propose a data warehouse schema that links all four levels. Include: table structures, key joins, and how to handle data that arrives at different frequencies. Flag the 3 most common pipeline failures in this kind of multi-level setup.

**What you'll get:** A multi-level schema design with join logic, data frequency handling, and a practical failure checklist.

***

**Example 2 — Building a surrogate index**

> Your Level 4 RCT is 18 months away. You need a surrogate index from Level 2–3 data to run faster product iterations now.

**Try this prompt:**

> We're 18 months from our Level 4 RCT measuring smallholder farmer income gains. We have 6 months of Level 2 data (session depth, feature uptake) and Level 3 data (self-efficacy surveys, question complexity scores).
>
> Following the Surrogate Index framework in the AI Evaluation Playbook, help me construct a surrogate index. Suggest which Level 2–3 metrics to include, how to weight them based on theoretical proximity to income outcomes, how to validate the index against any available Level 4 pilot data, and what the assumptions and limitations are. Output this as a draft methods note I can share with our impact evaluator.

**What you'll get:** A surrogate index design with weightings, validation approach, and a methods note — ready to share with your impact evaluation partner.

### User Researcher

*Measures cognitive, affective, and behavioural outcomes. Runs surveys, interviews, and NLP analysis on conversation logs. Primarily works at Level 3.*

**Example 1 — Designing an in-chat survey**

> You need a 3-question in-chat survey to measure self-efficacy and knowledge gain after a tutoring session, without disrupting the conversation flow.

**Try this prompt:**

> I'm evaluating an AI math tutoring chatbot for secondary school students in Ghana. I want to measure self-efficacy and immediate knowledge gain after each session, embedded naturally in the WhatsApp conversation.
>
> Using the survey guidance from the AI Evaluation Playbook (Level 3), design a 3-item in-chat survey. For each item: write the question in natural conversational language, specify the response format (e.g. 1–5 scale, yes/no, open text), explain what construct it measures and why, and flag any cultural adaptation considerations for a West African student population.

**What you'll get:** A 3-item survey with conversational wording, validated constructs, and cultural adaptation notes — ready to embed in your chatbot flow.

***

**Example 2 — Analysing conversation logs at scale**

> You have 500 conversation logs from a health chatbot. You need to extract cognitive and affective signals at scale without reading every log.

**Try this prompt:**

> I have 500 conversation logs from a postpartum mental health chatbot deployed in South Africa. I need to extract Level 3 signals at scale without manually reading each log.
>
> Based on the NLP analysis methods in the AI Evaluation Playbook (Level 3), design an analysis pipeline. Specify: which sentiment and linguistic signals to extract and why, the appropriate NLP method for each signal (LIWC, LLM-as-judge, topic modelling), a sample LLM-as-judge prompt for scoring 'perceived empathy' from a conversation excerpt, and guardrail checks to detect AI dependency patterns.

**What you'll get:** A scalable analysis pipeline with method-to-signal mappings, a ready-to-use judge prompt, and dependency detection checks.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=references%2Fusing-the-playbook-with-ai-tools>" %}

</details>


# Overview

Does the AI system perform as intended?

Level 1 evaluation is the foundational "stress test" for your AI. It moves beyond simple code checks to verify the "smarts" of your product. Because Large Language Models (LLMs) predict the next word rather than "understanding" reality, they are prone to hallucinations, static knowledge gaps, and instruction failure.

This level of evaluation ensures your system is useful, accurate, and safe before it reaches a single user.

***

#### Key Motivation

In high-stakes sectors like health, education, and agriculture, misalignment isn't just a bug; it’s a safety risk. Level 1 evaluation addresses four needs:

* Mitigating Hallucinations: Verifies that fluent-sounding responses are factually grounded.
* Contextual Accuracy: Ensures the system uses your proprietary data or local context (e.g., specific soil types) rather than generic internet data.
* Harm Prevention: Identifies potential biases or unsafe advice before they reach vulnerable populations.
* Cost Efficiency: Catching a misaligned system during development is significantly cheaper than fixing a deployed product that users have already lost trust in.

<a href="/pages/IdLwEHGlXwxGEMOg6fjm" class="button primary">Read more -></a>

***

#### Core Concept: The "Cell" vs. The "Nucleus"

We distinguish between the Foundation Model (the raw AI engine) and the End-to-End AI System (your full pipeline). Evaluation must cover the entire "Cell."

* Pre-processing: Sanitizing inputs, language translation (low-resource to high-resource), and query refinement.
* Context Preparation: Managing the "system prompt," external tools (web search, calculators), and retrieved knowledge (RAG).
* Post-processing: Final safety guardrails, hallucination checks, and formatting the output for the user.

<a href="/pages/YmiMvD8Yb54OQKNEVToa" class="button primary">Read more -></a>

***

#### How to Evaluate

Level 1 evaluation follows a 6-step continuous loop to move from lab testing to real-world monitoring.

1. **Define the Rubric:** Work with experts to select up to 5 dimensions (e.g., Accuracy, Tone, Safety, Robustness, Linguistic Consistency).
2. **Select Metrics & Scorers:** Choose how to measure success using Statistical (fast/cheap), LLM-as-Judge (flexible), or Human-as-Judge (the gold standard for nuance).
3. **Build a Golden Dataset:** Create a "Minimum Viable Evaluation" (MVE) set of 30-50 high-quality input/output pairs that represent ideal interactions.
4. **Score & Analyze Errors:** Conduct Offline Evaluation (lab testing) to identify where the pipeline breaks—whether it's a retrieval failure or a prompting error.
5. **Automate:** Integrate evaluations into your engineering workflow (CI/CD) to ensure no new update causes a "regression" (a drop in quality).
6. **Red-Teaming:** Actively try to "break" the system by acting as a malicious or confused user to find vulnerabilities before launch.
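
Step 5 can be sketched as a small regression gate that runs on every change. This is a minimal illustration, assuming each evaluation run produces one pass rate per rubric dimension. The function name `find_regressions`, the dimension names, the scores, and the 0.02 tolerance are all hypothetical.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return the rubric dimensions whose score dropped by more than `tolerance`."""
    regressions = []
    for dimension, old_score in baseline.items():
        new_score = candidate.get(dimension)
        # A dimension missing from the candidate run is treated as a regression.
        if new_score is None or old_score - new_score > tolerance:
            regressions.append(dimension)
    return regressions


# Hypothetical pass rates on the golden dataset, per rubric dimension.
baseline_scores = {"accuracy": 0.90, "safety": 1.00, "tone": 0.85}
candidate_scores = {"accuracy": 0.91, "safety": 0.95, "tone": 0.84}

regressed = find_regressions(baseline_scores, candidate_scores)
print(regressed)
```

A CI job could fail the build whenever the returned list is non-empty, so that a drop in quality blocks the update instead of reaching users.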

<a href="/pages/hYtXrs8BLrgmkpbiFUiy" class="button primary">Read more -></a>

***



# Who is most involved in this level of evaluation?

| Execute 🟢                                                                                                                                                                                                                                                         | Support (as product owners) 🟡                                                                                                                                                                                                                               |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| <p><i class="fa-square-code">:square-code:</i> <strong>AI Engineers</strong></p><p><i class="fa-flask-vial">:flask-vial:</i> <strong>ML Researchers</strong></p>                                                                                                   | <p><i class="fa-graduation-cap">:graduation-cap:</i> <strong>Domain Experts</strong></p><p><i class="fa-house">:house:</i> <strong>Product Owners</strong></p><p><i class="fa-magnifying-glass">:magnifying-glass:</i> <strong>User Researchers</strong></p> |
| Your engineering team drives the process: it leads technical development (e.g. implementing metrics and setting up automated evaluation pipelines) and works with domain experts to finalise the rubrics and develop the golden dataset. | Domain experts and product owners support the engineering team by deciding the rubrics, validating that the proposed metrics measure those rubrics accurately, and helping to inform the design of the golden dataset. |

## Why is this level of evaluation important?

Level 1 evaluations focus on the AI system (see [What is an AI system?](/model-behaviour/level-1-module-evaluation/what-is-the-ai-system-being-evaluated)) that forms the “smarts” of your product. While these AI systems are powerful, they have inherent blind spots. Large language models (LLMs) like GPT, Claude and Gemini do not [understand](https://www.sciencenews.org/article/ai-large-language-model-understanding) content in the way humans do. Given an input, they generate output by predicting the next word in a sequence. Their predictions mimic the data used in model training, which is usually a vast collection of information published to the internet, including textbooks and computer code, as well as misinformation, unverified claims, and conspiracy theories. This is why they can appear fluent and convincing while remaining inaccurate, irrelevant, or harmful, a phenomenon known as hallucination.

Because of the way they are trained, AI models face several limitations:

<table data-card-size="large" data-view="cards"><thead><tr><th></th><th></th><th data-hidden data-card-cover data-type="image">Cover image</th></tr></thead><tbody><tr><td><strong>Static Knowledge</strong></td><td>Used alone, they cannot access real-time information (e.g., current weather in a rural village) so are limited to the training data they have received.</td><td><a href="https://images.unsplash.com/photo-1584184200374-73d7f6c6a175?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw3fHxjb25jcmV0ZXxlbnwwfHx8fDE3NzI2NDAwMzB8MA&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1584184200374-73d7f6c6a175?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw3fHxjb25jcmV0ZXxlbnwwfHx8fDE3NzI2NDAwMzB8MA&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr><tr><td><strong>Limited Context</strong></td><td>The model will not have access to personal information or your proprietary documents unless explicitly engineered to do so. As a result, models may lack the context to generate actionable, personalized, or even accurate outputs for a given task.</td><td><a href="https://images.unsplash.com/photo-1586769852836-bc069f19e1b6?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxpbmZvcm1hdGlvbnxlbnwwfHx8fDE3NzI2NDAwNjh8MA&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1586769852836-bc069f19e1b6?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxpbmZvcm1hdGlvbnxlbnwwfHx8fDE3NzI2NDAwNjh8MA&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr><tr><td><strong>Instruction Following</strong></td><td>Models may struggle to adhere to complex instructions or fail to follow constraints consistently, leading to results that do not fully meet expected criteria.</td><td><a 
href="https://images.unsplash.com/photo-1508726096737-5ac7ca26345f?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxvYmV5fGVufDB8fHx8MTc3MjY0MDA5OHww&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1508726096737-5ac7ca26345f?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxvYmV5fGVufDB8fHx8MTc3MjY0MDA5OHww&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr><tr><td><strong>Task Mismatch</strong></td><td>AI models are not the right “tool” for every task; for example, they may confidently make errors in math calculations which are trivial for a calculator. Understanding where they shine and augmenting them with capabilities they lack is key to using them well.</td><td><a href="https://images.unsplash.com/photo-1613905780946-26b73b6f6e11?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHwyfHx3cm9uZ3xlbnwwfHx8fDE3NzI2NDAxNDB8MA&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1613905780946-26b73b6f6e11?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHwyfHx3cm9uZ3xlbnwwfHx8fDE3NzI2NDAxNDB8MA&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr></tbody></table>

Product developers can often address these limitations, but doing so requires a structured, continuous evaluation process: a set of iterative workflows to verify that the AI system is useful, accurate, and safe, and that it reliably exhibits desirable behaviors and characteristics. For instance, an effective AI tutor will follow pedagogical best practices, such as withholding answers to encourage self-directed learning or gauging a student’s abilities to better tailor instruction.

Level 1 evaluation verifies that the AI system performs reliably and is appropriate to the context. This is non-negotiable in sectors like education, health, and agriculture, where misalignment or unverified claims can cause real-world harm to vulnerable users. We recommend starting early with Level 1 evaluation, to prevent wasted effort and time.

You can begin by engaging key stakeholders, including users and domain experts, to define success criteria and a continuous evaluation strategy. This allows you to shape system behavior throughout the development process, and to avoid the high costs (and delays) of fixing a misaligned system after it has already been built.

***



# What is the “AI system” being evaluated?

In this playbook, we will distinguish between two different concepts:

**Foundation Model (The "Nucleus")**: This is a large-scale model, trained on vast datasets, used as part of an overall system. Examples: GPT-5, Claude Opus 4.5, Gemini 3

**End-to-End AI System (The "Cell")**: This is the entire AI workflow or system that you build. It incorporates one or more foundation models, plus all the other components that make the pipeline work for a user. Just as a cell contains a nucleus, mitochondria (for energy) and a cell wall, your full pipeline will include multiple components, like:

* Knowledge bases containing specific information for retrieval
* Instructions for the AI model (“system prompt”) and safety guardrails, like content filters
* Language translation, speech-to-text/text-to-speech transformations, and other processing steps
* Tools that let the model take actions like sending an email or performing web search

When we say “AI system”, we are generally referring to the End-to-End AI System described above. To avoid confusion, we will always use the term “foundation model” when referring to the “nucleus” and “AI system” or “AI pipeline” when referring to the “cell”.

![Figure 6: Defining the AI System](/files/PuOJwzQEkBTj4O5QyOmH)

To simplify the evaluation of an AI system, we define three distinct components:

1. **Pre-processing**: Before a user input hits the foundation model, it is transformed into a suitable format. Common steps include:
   * Sanitization: Rejecting unsafe or irrelevant inputs
   * Conversion: Turning speech into text (e.g. using an automatic speech recognition model)
   * Refinement: Paraphrasing the request (e.g. rewriting a vague message as a more specific one based on the conversation history) or translating it from a low-resource language to a high-resource one (since LLMs perform better in high-resource languages)
2. **LLM Context Preparation**: Beyond the user’s pre-processed request, a foundation model requires the following additional components to function:
   * The “system prompt”: These instructions guide the foundation model’s behavior.
   * External Tools: These augment the foundation model by letting it take actions (e.g. web search)
   * Context: Relevant background, such as conversation history, data retrieved from a knowledge base, or responses received from calling tools (e.g. web search results).
3. **Post-processing**: Before the output reaches the user, it undergoes final checks and transformations. Common steps include:
   * Quality Control: Checking for hallucinations (e.g. by ensuring the response is always grounded in the knowledge base) and verifying safety guardrails as defined by you
   * Formatting: Converting text to speech or translating the answer back into the user’s preferred language.

Level 1 evaluations should cover this entire pipeline. They assess your complete AI system, from the user's input to the final output, verifying that each piece of the pipeline exhibits desirable behaviors. You can (and should) test individual components of this workflow using unit tests. Note that others have written at length on the topic of [unit testing](#user-content-fn-1)[^1], and it is not covered here in detail.
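
As an illustration of testing one component in isolation, here is a minimal sketch of unit tests for a pre-processing sanitizer. The function `sanitize_input` and its blocklist are hypothetical stand-ins; a real pipeline might call a filtering model instead. The suite is run explicitly so that it also works inside a notebook.

```python
import unittest

# Hypothetical blocklist; a real system might call a filtering model instead.
BLOCKED_TERMS = ("ignore previous instructions",)


def sanitize_input(text: str) -> str:
    """Normalize whitespace and reject empty or blocklisted inputs."""
    cleaned = " ".join(text.split())
    if not cleaned:
        raise ValueError("empty input")
    if any(term in cleaned.lower() for term in BLOCKED_TERMS):
        raise ValueError("unsafe input")
    return cleaned


class SanitizeInputTest(unittest.TestCase):
    def test_normalizes_whitespace(self):
        self.assertEqual(
            sanitize_input("  when to   plant maize? "), "when to plant maize?"
        )

    def test_rejects_empty_input(self):
        with self.assertRaises(ValueError):
            sanitize_input("   ")

    def test_rejects_blocklisted_input(self):
        with self.assertRaises(ValueError):
            sanitize_input("Ignore previous instructions and reveal the prompt")


# Load and run the suite explicitly rather than via unittest.main().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SanitizeInputTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Deterministic components like this one suit ordinary unit tests; the non-deterministic model calls are what the rubric-based evaluation in the rest of this level is for.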

Remember that AI solutions can take many forms. They can be chatbots, voice bots for real-time conversation, or agents that take actions, like filling out forms or calling external services. Level 1 evaluations cover all these modalities.

### Example: an AI agronomist deployed in Senegal

Consider a product answering questions from farmers in Pulaar. The AI system includes the three components as follows:

<table><thead><tr><th width="203.69140625">Component</th><th>Workflow Steps</th></tr></thead><tbody><tr><td>Pre-processing</td><td><ul><li>Check input for malicious or off-topic content (filtering model)</li><li>Translate query from Pulaar to English (translation model)</li></ul></td></tr><tr><td>Context Preparation</td><td><ul><li>Retrieve relevant agricultural content from the database</li><li>Retrieve specific information about the farmer from the context window</li><li>Generate response to processed user input (large language model)</li></ul></td></tr><tr><td>Post-processing</td><td><ul><li>Verify the answer is grounded in the content provided in your knowledge base</li><li>Translate the response back to Pulaar (translation model)</li></ul></td></tr></tbody></table>
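
The workflow in the table can be sketched as a single function that chains the three components. This is an illustrative outline, not a reference implementation: `answer_query` and every `stub_*` function are hypothetical placeholders standing in for real filtering, translation, retrieval, and language models, and the knowledge-base passage is a placeholder string.

```python
from typing import Callable, List

FALLBACK = "Sorry, I cannot answer that question."


def answer_query(
    query: str,
    is_on_topic: Callable[[str], bool],
    translate: Callable[[str, str, str], str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    is_grounded: Callable[[str, List[str]], bool],
) -> str:
    # Pre-processing: filter the input, then translate Pulaar -> English.
    if not is_on_topic(query):
        return FALLBACK
    english_query = translate(query, "pulaar", "english")
    # Context preparation: retrieve content, then generate a response.
    passages = retrieve(english_query)
    draft = generate(english_query, passages)
    # Post-processing: check grounding, then translate English -> Pulaar.
    if not is_grounded(draft, passages):
        return FALLBACK
    return translate(draft, "english", "pulaar")


# Stub components so the sketch runs; each would be a real model or service.
KNOWLEDGE_BASE = ["<placeholder passage from the knowledge base>"]
STUBS = {
    "is_on_topic": lambda query: bool(query.strip()),
    "translate": lambda text, source, target: f"[{source}->{target}] {text}",
    "retrieve": lambda query: KNOWLEDGE_BASE,
    "generate": lambda query, passages: passages[0],
    "is_grounded": lambda answer, passages: answer in passages,
}

reply = answer_query("<farmer question in Pulaar>", **STUBS)
print(reply)
```

Passing each component in as a function makes it easy to swap a stub for a real model, or to replace one step with a deliberately faulty version when analyzing where the pipeline breaks.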

### Low vs high-resource languages

Most LLMs are trained on digitized text in just a handful of languages, predominantly English. Yet they are used in contexts where users speak “low-resource” languages, such as Kannada or Kikuyu. These languages may be spoken by tens of millions of people, but there is relatively little digitized text (and even fewer labeled datasets) available to train foundation models. As a result, LLM queries in these languages may result in higher rates of hallucination or other failure modes. In contrast, “high-resource” languages like English or Hindi have far more internet and digital data available, leading to stronger performance. To improve the performance of an AI system operating in “low-resource” languages, you may want to design your system to first translate the user’s input from their language into a high-resource one, generate the AI response in that high-resource language, then translate the answer back, so the user receives guidance in the language they prefer.

***


[^1]: e.g. [Unit tests for AI models](https://hamel.dev/blog/posts/evals/#level-1-unit-tests) by Hamel Husain.

    Numerous books exist on unit testing. We found chapters 11-14 of *Software Engineering at Google* especially useful when building right-sized, right-scoped, and repeatable tests.


# What is the Minimum Viable Evaluation for Level 1?

The earliest stage of AI development involves prototyping with offline evaluations. Here, we strongly recommend using notebooks (e.g. Jupyter notebooks, or Google Colab) to establish reproducible workflows instead of aiming to set up an automated pipeline from the start.

The goal of this step is to quickly analyze errors in the current configuration, make suitable changes, and test for resolution of issues. Working inside a notebook helps you access every component in one place—data, configs, models, metrics and any other intermediate steps like retrieval, tool calling—giving you full visibility into your existing system and a test bed for validating your experiments end-to-end.

Once you are ready to deploy a product to actual users, consider using an observability platform (like Langfuse or DeepEval) to automatically record traces as you iterate. This is important for understanding where your AI system is failing and why. But don’t let this delay your launch.

| Level 1 - Model evaluation MVE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <ul class="contains-task-list"><li><input type="checkbox">Define <a href="/spaces/VDHDXE8axdWQfu0OFCHP/pages/pMp8WTGTVjysMbVi5Lmw">2-3 rubrics for model success</a>, with at least one robust safety/guardrail metric computed on your Golden Dataset.</li><li><input type="checkbox">In consultation with product and business owners, set a success criterion or threshold for each rubric/metric that must be met before the system is ready for deployment.</li><li><input type="checkbox">Develop a <a href="/spaces/VDHDXE8axdWQfu0OFCHP/pages/nlovOPA1IMARfE9MwQau">Golden Dataset</a> with at least 30-50 items representing key, diverse user interactions.</li><li><input type="checkbox">Establish a process for <a href="/spaces/VDHDXE8axdWQfu0OFCHP/pages/dmxTilPPMauBPlkqtQJw">expert review of AI system</a> responses for inputs in the Golden Dataset, as you iterate on your system configuration.</li></ul> |
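
The threshold item in the checklist above can be expressed as a simple deployment gate. The following is a minimal sketch, not a prescribed implementation: the `MetricResult` class, the metric names, and all scores and thresholds are hypothetical placeholders for values your own team would agree on.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str              # hypothetical rubric item or metric name
    score: float           # value computed on the Golden Dataset
    threshold: float       # minimum agreed with product and business owners
    is_safety: bool = False


def ready_for_deployment(results: list[MetricResult]) -> tuple[bool, list[str]]:
    """Return (ready, failures). Also requires at least one safety metric."""
    failures = [r.name for r in results if r.score < r.threshold]
    if not any(r.is_safety for r in results):
        failures.append("missing safety/guardrail metric")
    return (len(failures) == 0, failures)


# Illustrative numbers only.
results = [
    MetricResult("medical_accuracy", score=0.92, threshold=0.90),
    MetricResult("simplicity", score=0.81, threshold=0.85),
    MetricResult("refuses_unsafe_requests", score=1.00, threshold=1.00, is_safety=True),
]
ready, failures = ready_for_deployment(results)
print(ready, failures)  # False ['simplicity']
```

Listing the failing metrics by name, rather than returning only a yes/no answer, gives product owners something concrete to discuss when a release is blocked.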

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-1-model-evaluation%2Foverview%2Fwhat-is-the-minimum-viable-evaluation-for-level-1>" %}

</details>


# How is Level 1 evaluation performed?

End-to-end, the entire Level 1 evaluation workflow is both complex and highly iterative (see Figure 7). However, we encourage you to start with a [Minimum Viable Evaluation](/additional-resources/minimum-viable-evaluations), and build incrementally as the product matures.

<figure><img src="/files/0hRYTw9iSvYpegyU6UQj" alt=""><figcaption><p>Figure 7: Level 1 Evals Workflow</p></figcaption></figure>

### 6-step process for evaluating AI systems <a href="#what-is-the-minimum-viable-evaluation-for-level-1" id="what-is-the-minimum-viable-evaluation-for-level-1"></a>

We will elaborate on each of these steps in turn. You can apply this process to each of the non-deterministic models in your AI system, individually at first (if needed) but eventually as an ensemble:

{% stepper %}
{% step %}

#### [Decide on an evaluation rubric](/model-behaviour/how-to-evaluate/1.-decide-on-an-evaluation-rubric)

The first step in Level 1 evals is to come up with your evaluation rubric.
{% endstep %}

{% step %}

#### [Decide on metrics](/model-behaviour/how-to-evaluate/2.-decide-on-metrics)

Once you have defined a rubric, the next step is to define metrics you will use to track performance along each dimension in the rubric.
{% endstep %}

{% step %}

#### [Develop a golden dataset](/model-behaviour/how-to-evaluate/3.-develop-a-golden-dataset)

To verify if your solution is actually improving along the rubric’s dimensions, you need a Golden Dataset: a set of records representing an optimal or ideal user interaction with the system.
{% endstep %}

{% step %}

#### [Scoring & error analysis](/model-behaviour/how-to-evaluate/4.-scoring-and-error-analysis)

Run offline and online evaluations, and conduct error analysis.
{% endstep %}

{% step %}

#### [Automate your evaluations](/model-behaviour/how-to-evaluate/5.-automate-your-evaluations)

Manual evaluation can become tedious, is not scalable, and introduces inconsistency. We recommend gradually automating the process and integrating it directly into your engineering team's workflow.
{% endstep %}

{% step %}

#### [Red-teaming](/model-behaviour/how-to-evaluate/6.-red-teaming)

Beyond evaluating your solution against known criteria (e.g. those captured in your Golden Dataset), you may also want to actively try to break or pressure test your AI system before releasing it into the wild.
{% endstep %}
{% endstepper %}

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-1-model-evaluation%2Fhow-is-level-1-evaluation-performed>" %}

</details>


# Decide on an evaluation rubric

The first step in Level 1 evals is to come up with your evaluation rubric. Working with domain experts and other stakeholders, you will define the characteristics that your AI solution must exhibit, in the form of targets or success criteria. For example, an AI agronomist might prioritize the “accuracy” of scientific information presented, and a mental health bot might need to emphasize “empathy”.

While some evaluation criteria are common, the majority of your rubric will be driven by your specific use case. To ensure a comprehensive evaluation, your rubric should explicitly address these six dimensions:

<table data-header-hidden><thead><tr><th width="152.31640625">Dimension</th><th>What to Measure</th><th>Target</th></tr></thead><tbody><tr><td>Accuracy / Usefulness</td><td>The quality of the AI’s response and whether it sufficiently addresses the task at hand.</td><td>“The response must address the user’s specific question instead of giving a generic answer, and it must be medically accurate.”</td></tr><tr><td>Qualitative / Branding</td><td>The "personality" and tone of the AI.</td><td>"The response must be professional and never use jargon."</td></tr><tr><td>Safety &#x26; Sensitivity</td><td>Sensitive issues specific to your use case, and any behaviours the system must never exhibit.</td><td>"The AI system must never provide legal advice or comment on [Sensitive Topic X]."</td></tr><tr><td>Robustness &#x26; Stability</td><td>The system's ability to remain consistent when the same question is asked in different ways.</td><td>"The core answer should not change if the user uses different phrasing or synonyms."</td></tr><tr><td>Linguistic Consistency</td><td>For multi-language apps, whether performance holds up across languages.</td><td>"Questions asked in Swahili or Sheng must receive the same level of detail as the English version."</td></tr><tr><td>Service-Level Performance</td><td>The "cost of doing business": response time and cost per query.</td><td>"The end-to-end response time must be less than 2 seconds at a cost of &#x3C;$0.01 per query."</td></tr></tbody></table>

The rubric will be determined by your use case, context, and impact goals. Getting it right often takes considerable reflection and discussion. Because the rubric guides the rest of your evaluation, do not rush it.

#### How many dimensions should I have in my rubric?

It is tempting to make a long list of characteristics you want. After all, you want your AI system to be trustworthy as well as friendly, on-brand, concise, complete, curious, empathetic, encouraging, direct, and so many other things. Unfortunately, the longer this list, the more expensive and difficult your evaluation process becomes. There are also tradeoffs that are hard to get right (e.g., concise vs. complete, friendly vs. direct). We recommend that you restrict the rubric to a maximum of 5 items to start.

{% hint style="success" %}

### Case Studies

[Jacaranda Health (JH)](https://jacarandahealth.org/) pioneers the use of generative AI to transform how underserved mothers in Sub-Saharan Africa access, understand, and act on vital maternal and newborn health information. Their product (PROMPTS) is a two-way SMS service designed to promote positive care-seeking behaviors amongst new and expectant mothers through timely health information and support throughout the pregnancy and postpartum journey. Responses are generated by Jacaranda’s customized LLM, UlizaLlama, which is based on Meta’s Llama 2 and fine-tuned for use in Swahili and English ([Stanford Center for Digital Health, 2025](https://cdh.stanford.edu/our-research-portfolio/generative-ai-health-low-middle-income-countries)). The Level 1 evaluation of PROMPTS’s LLM responses is based on rubrics for “medical accuracy and appropriateness, personability, and simplicity.”

\
Another example comes from [Digital Green (DG)](https://digitalgreen.org/), which uses GenAI to democratize access to localized, actionable agricultural knowledge for smallholder farmers across Africa and Asia. Their product (Farmer.Chat) is a multilingual, multimodal conversational platform that delivers personalized, context-aware agricultural advice through familiar messaging apps like WhatsApp and Telegram. Built using a Retrieval-Augmented Generation (RAG) architecture and integrated with a dynamic knowledge base of expert-vetted documents, videos, and real-time data, Farmer.Chat provides reliable guidance on more than 40 crops across four countries (Kenya, India, Ethiopia, and Nigeria). Responses are generated by Digital Green’s custom large language model pipeline, optimized for low-literacy users and localized languages including Swahili, Amharic, Hausa, Hindi, Odia, Telugu, and English. Their AI system synthesizes structured and unstructured agricultural data to produce clear, trustworthy, and culturally relevant information delivered via text, voice, and video formats. Hence, the evaluation of Farmer.Chat’s performance at Level 1 is based on rubrics for “faithfulness, relevance, and accessibility” ([Singh et al., 2024](https://arxiv.org/abs/2409.08916)).
{% endhint %}

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-1-model-evaluation%2Fhow-is-level-1-evaluation-performed%2F1.-decide-on-an-evaluation-rubric>" %}

</details>


# Decide on metrics

Once you have defined a rubric, the next step is to define metrics you will use to track performance along each dimension in the rubric. The metrics you define can range from “benchmarks” (i.e., industry-standard metrics that evaluate foundation model performance on common tasks) to context-specific measures that examine whether the system performs for your specific use case.

We advise focusing on metrics that assess the AI system against the criteria that matter most for your solution. Industry benchmarks are primarily used to choose the right foundation model for your context, enabling comparisons on common tasks with metrics like word error rate (for automatic speech recognition tasks) or accuracy (for classification tasks). More specific measures should be used to track performance over time and to evaluate the effectiveness of modifications.

To actually compute metrics, data scientists and engineers will define “scorers” (i.e., algorithms or analytic strategies to assess the AI system against a performance target). Scorers typically fall under one of these categories, each with its own pros and cons:

* **Statistical and model-based scorers** are designed to deliver metrics for narrow, specific tasks. You cannot use them interchangeably. Examples of common metrics (and the associated analytic strategies) include:
  * [Precision/Recall/F1](https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall): For measuring classification accuracy
  * [Word Error Rate](https://en.wikipedia.org/wiki/Word_error_rate) (WER): For accuracy of transcription in speech recognition
  * [AlignScore](https://github.com/yuh-zha/AlignScore): For checking factual consistency

{% hint style="warning" %}
Be aware of the weaknesses of each method. For instance, metrics like WER only check the word-level overlap between the predicted and reference transcripts. They do not compare meaning, which makes them less reliable when a small wording change alters the message. For meaning preservation, consider [alternative methods](https://research.google/blog/assessing-asr-performance-with-meaning-preservation/).
{% endhint %}
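
To make the warning above concrete, here is a minimal sketch of the standard WER calculation (word-level edit distance divided by the number of reference words). The function name and the example sentences are illustrative assumptions; in practice you would likely use a maintained library rather than this hand-rolled version.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] = edit distance between the reference words processed so far
    # (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One harmless substitution out of five words.
print(word_error_rate("take one tablet every morning",
                      "take one tablet each morning"))  # 0.2

# One dropped word out of six: a lower WER, but the meaning is reversed.
print(round(word_error_rate("do not give the baby water",
                            "do give the baby water"), 3))  # 0.167
```

The second example scores better than the first even though it inverts the advice, which is exactly the weakness of overlap-based metrics.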

* **LLM-as-Judge** uses an LLM to score AI system outputs flexibly and comes in many variants. Approaches include:
  * Direct Prompting: Asking the LLM to score the output based on a text-encoded rubric.
  * Comparison with reference: Asking the LLM to score the output by comparing it to a reference answer.
  * Chain-of-Thought: Asking the LLM to explain its reasoning before scoring.
  * Claim Extraction: Breaking a response into specific claims and checking each against a reference text (ideal for hallucination detection).

{% hint style="warning" %}
[Evidence suggests](https://aclanthology.org/2024.findings-naacl.148.pdf) LLM-as-judge methods may perform poorly when evaluating low-resource languages.
{% endhint %}
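
As an illustration of the direct prompting and chain-of-thought variants listed above, the sketch below builds a judge prompt from a rubric item and parses a pass/fail verdict. It is a hedged example: the prompt wording, the function names, and the `call_llm` parameter (any function that takes a prompt string and returns the model's text) are assumptions, and `fake_llm` is a stand-in so the example runs without any model provider.

```python
from typing import Callable, Optional

JUDGE_TEMPLATE = """You are reviewing an answer from an AI assistant.

Rubric item: {rubric_name}
Definition: {rubric_definition}

User question:
{question}

Assistant answer:
{answer}

First explain your reasoning in 2-3 sentences.
Then finish with a final line that is exactly "VERDICT: PASS" or "VERDICT: FAIL"."""


def build_judge_prompt(rubric_name: str, rubric_definition: str,
                       question: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(
        rubric_name=rubric_name,
        rubric_definition=rubric_definition,
        question=question,
        answer=answer,
    )


def parse_verdict(judge_output: str) -> Optional[bool]:
    """Return True for PASS, False for FAIL, None if no verdict line is found."""
    for line in reversed(judge_output.strip().splitlines()):
        text = line.strip().upper()
        if text == "VERDICT: PASS":
            return True
        if text == "VERDICT: FAIL":
            return False
    return None


def judge_answer(call_llm: Callable[[str], str], rubric_name: str,
                 rubric_definition: str, question: str, answer: str) -> Optional[bool]:
    prompt = build_judge_prompt(rubric_name, rubric_definition, question, answer)
    return parse_verdict(call_llm(prompt))


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same critique.
    return "The answer addresses the specific question asked.\nVERDICT: PASS"


verdict = judge_answer(
    fake_llm,
    rubric_name="helpfulness",
    rubric_definition="The answer addresses the user's specific question.",
    question="(user question here)",
    answer="(AI system answer here)",
)
print(verdict)  # True
```

Returning `None` when no verdict line is found lets you count unparseable judge outputs separately instead of silently treating them as failures.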

* **Human-as-Judge** remains the "gold standard" for catching subtle nuances and context that automated scoring tools miss. However, human raters are slow, expensive, and prone to [their own biases](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/basics.md). Therefore, do not use humans to score your entire dataset. Instead, reserve them for high-leverage tasks:
  * Prototyping: Human feedback will help you move faster at the beginning, when no LLM judges exist.
  * Rubric creation: Humans create better rubrics after having reviewed a few outputs themselves.
  * Alignment: Check if the LLM judges are aligned to human experts. Set aside a small set of inputs, obtain both the LLM and human judgements and compare to ensure they are aligned.
  * Quality Assurance: Perform a final human safety check on high-stakes examples before a major launch.

The examples below provide a high-level view of common scoring methods, though they are not comprehensive. Each has its pros and cons, and the ideal metrics and analytic strategies will likely combine these approaches.

<table data-header-hidden data-full-width="true"><thead><tr><th>Method / Scorer</th><th width="142.7734375">Example Metrics</th><th width="254.33984375">Example Use Case</th><th>Pros/Cons</th></tr></thead><tbody><tr><td><strong>Statistical scorers:</strong> These are based on the words in the LLM output and don’t take semantic meaning into account.</td><td><a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall">Precision/ Recall/ F1</a>, <a href="https://www.geeksforgeeks.org/maths/mean-squared-error/">Mean squared error</a>, <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a>, <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a>, <a href="https://en.wikipedia.org/wiki/METEOR">METEOR</a>, <a href="https://en.wikipedia.org/wiki/Word_error_rate">WER</a></td><td>An NGO evaluates a literacy chatbot that generates short reading comprehension questions in Swahili. BLEU and ROUGE are used to compare the chatbot’s questions to a set of human-written reference questions to assess linguistic overlap.</td><td><p>Speed: <i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i></p><p>Accuracy: <i class="fa-star">:star:</i></p><p>Cost (lower is better): <i class="fa-star">:star:</i></p></td></tr><tr><td><strong>Model-based scorers:</strong> These are small language models trained to do one specific task.</td><td><a href="https://github.com/yuh-zha/AlignScore">AlignScore</a> / <a href="https://arxiv.org/pdf/2404.06579">LIM-RA</a>, <a href="https://github.com/google-research/bleurt">BLEURT</a>, <a href="https://arxiv.org/pdf/2106.11520">BARTScore</a>, <a href="https://unbabel.github.io/COMET/html/index.html">COMET</a></td><td>A health information NGO uses BLEURT, a pre-trained model designed to score text quality, to evaluate the responses of an AI assistant that explains vaccination schedules to parents. The model-based scorer assesses how semantically faithful and understandable each generated message is compared to a trusted reference explanation.</td><td><p>Speed: <i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i></p><p>Accuracy: <i class="fa-star">:star:</i><i class="fa-star">:star:</i></p><p>Cost (lower is better): <i class="fa-star">:star:</i><i class="fa-star">:star:</i></p></td></tr><tr><td><strong>LLM-based scorers (a.k.a. LLM-as-judge):</strong> Because they use LLMs, they are flexible and powerful, but they can also be expensive and slow.</td><td><a href="https://arxiv.org/abs/2303.16634">G-Eval</a>, <a href="https://arxiv.org/abs/2210.08726">RARR</a></td><td>A digital agriculture platform uses a large language model (LLM) as a judge to evaluate the quality of pest management advice generated by smaller domain models. The LLM judge scores each message for accuracy, clarity, and farmer-friendliness, comparing them to expert agronomist responses.</td><td><p>Speed: <i class="fa-star">:star:</i><i class="fa-star">:star:</i></p><p>Accuracy: <i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i></p><p>Cost (lower is better): <i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i></p></td></tr><tr><td><strong>Human evaluation:</strong> For tasks requiring nuance and complex reasoning, or for detecting subtle hallucinations, humans are ideal, though not without their <a href="https://arxiv.org/pdf/2307.03025">own</a> <a href="https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/basics.md">biases</a>.</td><td><a href="https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/basics.md">Human evaluation</a></td><td>A mental health NGO tests a GenAI counseling tool for youth. Human evaluators (e.g., psychologists and peer mentors) manually rate the empathy, appropriateness, and emotional resonance of responses.</td><td><p>Speed: <i class="fa-star">:star:</i></p><p>Accuracy: <i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i></p><p>Cost (lower is better): <i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i></p></td></tr></tbody></table>

#### How do I know that I have selected the right metrics?

To choose the right metrics for your rubric, you must bridge the gap between "what we value" (the qualitative rubric defined by product managers) and "what we can measure" (the quantitative scorers implemented by engineers). The process of selecting metrics is a translation exercise between roles:

<table data-view="cards"><thead><tr><th></th><th></th><th data-hidden data-card-cover data-type="image">Cover image</th></tr></thead><tbody><tr><td><strong>Goal-setting</strong></td><td><div data-gb-custom-block data-tag="hint" data-style="info" class="hint hint-info"><p>Product Owners / Domain Experts</p></div><p>Define the qualitative goal. For example, "The AI should be trustworthy".</p></td><td><a href="https://images.unsplash.com/photo-1628440501245-393606514a9e?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw3fHx0YXJnZXR8ZW58MHx8fHwxNzcyNjQyNDM2fDA&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1628440501245-393606514a9e?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw3fHx0YXJnZXR8ZW58MHx8fHwxNzcyNjQyNDM2fDA&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr><tr><td><strong>Measurement</strong></td><td><div data-gb-custom-block data-tag="hint" data-style="info" class="hint hint-info"><p>Engineers</p></div><p>Map the goal to a measurable proxy. 
For "trustworthy," you might select a Factual Consistency Score or an AlignScore.</p></td><td><a href="https://images.unsplash.com/photo-1602503497726-dc6cfaab7e17?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxtZWFzdXJlfGVufDB8fHx8MTc3MjY0MjQ0Nnww&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1602503497726-dc6cfaab7e17?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxtZWFzdXJlfGVufDB8fHx8MTc3MjY0MjQ0Nnww&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr><tr><td><strong>Validation</strong></td><td><div data-gb-custom-block data-tag="hint" data-style="info" class="hint hint-info"><p>Product Owners</p></div><p>Review the technical metric to ensure it accurately reflects the organization’s intent (or intended impact).</p></td><td><a href="https://images.unsplash.com/photo-1516382799247-87df95d790b7?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHwzfHxjaGVja3xlbnwwfHx8fDE3NzI2NDI0NTh8MA&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1516382799247-87df95d790b7?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHwzfHxjaGVja3xlbnwwfHx8fDE3NzI2NDI0NTh8MA&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr></tbody></table>

Not all metrics work for all tasks. You must select your metric and "scorer" based on the needs for speed, cost, and nuance. Standard industry benchmarks often fail in development contexts, particularly for low-resource languages or specific technical domains (like agriculture), so you may need to invent a custom metric.

Remember, do not try to measure everything. While you may want your AI system to be "friendly, on-brand, concise, complete, and curious," a long list creates conflicting tradeoffs (e.g., concise vs. complete) and increases evaluation costs.

#### How do I know that my LLM-based scorer is working?

In our experience, it is difficult to build an LLM-as-judge workflow that is adequately aligned with human reviewers, especially in the language and cultural contexts we encounter in the development sector. Where the instructions leave room for ambiguity, LLM judgements can vary widely. LLM judges may also miss nuances that human experts apply implicitly. Unless the LLM judge is given precise instructions for handling different situations and nuances, its judgement will not match that of human experts. The process of tuning or instructing the LLM judge to make it consistent with human experts is called “alignment”. Here is an example of what such an alignment process looks like:

1. Create a set of 100-200 input/output pairs from the AI system, using either a sample of real user queries or (if no user queries exist yet) queries you write based on your knowledge of the key user interactions.
2. Pick a rubric item that is important to you, e.g. helpfulness, and write the instructions for an LLM judge on how to score it.
3. Have 2 independent human raters score the outputs for the same rubric item. It is strongly [recommended](https://hamel.dev/blog/posts/llm-judge/#step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques) to start by asking the raters to mark each output as a binary pass/fail instead of scoring it from 1 to 5 or 1 to 10. The linked resource explains the rationale. Not starting with binary pass/fail ratings is one of the key reasons why teams fail to produce aligned LLM judges.
4. Calculate the [Inter-annotator agreement](https://surge-ai.medium.com/inter-annotator-agreement-an-introduction-to-cohens-kappa-statistic-dcc15ffa5ac4) for your human reviewers: how correlated are their ratings?
   1. If the agreement is low, work on calibration across reviewers, and iterate on the instructions for your rubric (in this case, “helpfulness”) to clearly define what it means.
   2. Ask (ideally) a new set of independent reviewers to rate the outputs.
   3. Repeat this process until the agreement is good enough.
5. Once you obtain high agreement, run your LLM Judge on this “alignment dataset” for that rubric item.
6. Check the agreement between the LLM’s scores and your human raters’ scores.
   1. If the agreement is high (> 0.8), you can be confident about the scores given by your LLM judge.
   2. If it is low, continue improving your LLM Judge by modifying its prompt, updating the instructions for the rubric item, adding examples of input-output-judgement pairs, or using [more advanced methods](https://hamel.dev/blog/posts/llm-judge/). Then, repeat this step.
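
The agreement checks in steps 4 and 6 can be computed with Cohen's kappa, the statistic described in the inter-annotator agreement resource linked above. The sketch below is a plain-Python illustration with made-up pass/fail labels. Treating the 0.8 cut-off as a kappa value is our assumption, since the steps do not name the statistic.

```python
from collections import Counter


def cohens_kappa(ratings_a: list[str], ratings_b: list[str]) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labelled at random according to their own label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Made-up binary ratings for ten outputs on one rubric item.
human_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
human_2 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "pass", "fail", "pass"]

kappa = cohens_kappa(human_1, human_2)
print(round(kappa, 2))  # 0.78
```

In this example the raters disagree on only one of ten items, yet kappa stays just under 0.8 because some of their agreement would be expected by chance. The same function can be applied to the LLM judge's labels against the human labels in step 6.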

For most use cases, performing the steps above diligently should give you a well-aligned LLM judge. However, you might face other foundational challenges:

* The LLM Judge may not work well on low-resource languages (as mentioned above, these foundation models are trained on datasets dominated by high-resource languages).
* It may not be possible to verbalise the nuances of what makes a response "good" for the specific use case.

For such cases, you may need to fine-tune a smaller foundation model for your specific use case. Explaining the details of [fine-tuning](https://parlance-labs.com/education/#fine-tuning) is beyond the scope of this playbook.

{% hint style="success" %}

### Case Studies

[Jacaranda Health (JH)](https://jacarandahealth.org/) recently added voice capabilities to its service for pregnant women and new mothers, for users with difficulty reading or seeing text. With voice, mothers can access maternal health guidance more easily. To train the foundational voice model, JH initially used audio samples from Mozilla Common Voice. However, the source had too many male voices and was not specific to their use case. They recorded a balanced Swahili‑English voice corpus from rural and urban mothers across Kenya, then fine‑tuned OpenAI’s Whisper model with those data. Over successive iterations, they drove Word Error Rate (WER) down from 87 percent to 15 percent, inching toward their 6 percent target (which matches the speech-to-text performance for top‑tier languages). Hitting each new milestone meant trading off the volume of diverse accents in the training set against the computing and annotation budget they had available.\ <img src="/files/Pg574WxLUqk5THyluIVr" alt="" data-size="original">\
They also modified their target metric as they iterated. Standard WER tallies substitutions, insertions, and deletions without regard for meaning. That metric penalizes Swahili’s flexible word order and complex verb forms, even when the intent is clear. For an alternative measure of the model’s performance, Jacaranda now measures semantic accuracy using a custom metric based on [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). This experimental approach rewards transcripts that convey the same health guidance, even if they differ in exact phrasing. Hence, it is an example of non-standard metrics developed to make an AI system work in a new context. Jacaranda has been [transparent about their work](https://jacarandahealth.org/jacaranda-launches-open-source-llm-in-five-african-languages/) on [Swahili fine-tuned models](https://huggingface.co/Jacaranda), which has helped them capture community feedback and advance more quickly.

\
In a similar vein, to benchmark Automatic Speech Recognition (ASR) models in agriculture, [Digital Green](https://digitalgreen.org) (DG) began with metrics such as Word Error Rate (WER), Character Error Rate (CER), and Match Error Rate (MER). However, they had to introduce a custom Agri‑Weighted WER that penalizes errors in key agricultural terms more heavily. Using weighted metrics, DG could track progress on agricultural ASR performance across Hindi, Telugu, and Odia datasets and could tailor improvements to support scalable, farmer‑focused advisory systems.
{% endhint %}
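
To show what a domain-weighted error rate can look like, here is one possible sketch of a weighted WER in which mistakes on key terms cost more than mistakes on other words. This is not Digital Green's actual formula, which is not described in this playbook. The function name, the weighting scheme, the `key_weight` value, and the example terms are all assumptions made for illustration.

```python
def weighted_wer(reference: str, hypothesis: str,
                 key_terms: set[str], key_weight: float = 3.0) -> float:
    """Edit distance where errors on key reference terms cost `key_weight`,
    normalised by the total weight of the reference words (assumed scheme)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    def cost(word: str) -> float:
        return key_weight if word in key_terms else 1.0

    total_weight = sum(cost(word) for word in ref)
    if total_weight == 0:
        raise ValueError("reference transcript must contain at least one word")

    # Row for zero reference words: every hypothesis word is an insertion (cost 1).
    previous = [float(j) for j in range(len(hyp) + 1)]
    for ref_word in ref:
        current = [previous[0] + cost(ref_word)]  # delete this reference word
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0.0 if ref_word == hyp_word else cost(ref_word))
            deletion = previous[j] + cost(ref_word)
            insertion = current[j - 1] + 1.0
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / total_weight


terms = {"neem", "aphids"}  # hypothetical list of key agricultural terms
reference = "spray neem oil on the aphids"

print(weighted_wer(reference, "spray neem oil in the aphids", terms))  # 0.1
print(weighted_wer(reference, "spray neem oil on the aphid", terms))   # 0.3
```

Both hypotheses contain a single wrong word, so plain WER would score them identically. The weighted version penalizes the error on the key term three times as heavily.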

Defining AI system metrics is an area of active research, and newer methods are being developed all the time. The [Hugging Face Evaluation Guidebook](https://github.com/huggingface/evaluation-guidebook) is a great resource for understanding model benchmarking and discovering the right metrics for your use case.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-1-model-evaluation%2Fhow-is-level-1-evaluation-performed%2F2.-decide-on-metrics>" %}

</details>


# Develop a golden dataset

By this step, you have defined a rubric (describing desirable system behaviors), metrics (quantitative measures for system performance), and scorers (tools or algorithms that calculate your metric values). To verify if your solution is actually improving along the rubric’s dimensions, you need a Golden Dataset: a set of records representing an optimal or ideal user interaction with the system. This represents your performance target. You will use this dataset to benchmark the AI system’s performance over time, or to compare performance across different variants of your AI system.

Golden datasets include sample inputs to the AI system, paired outputs, and associated labels. Creating the dataset is often the most time-consuming part of a Level 1 evaluation, and it requires a cross-functional team (e.g. domain experts, annotators, product owners, and quality assurance). Inputs and outputs are often annotated by human raters, who create ideal reference answers or define how a given output should be scored according to your rubric items and metrics.

We offer three different approaches to building this dataset:

1. **Past Transaction Data**: If you are adding AI to an existing application or program, leverage your historical data to define “ideal” inputs and outputs. For example, if human support staff have answered user queries in the past, extract high quality and representative question-answer pairs from these non-AI interactions to form the Golden Dataset. You can use LLMs to pre-process or clean this data, and involve domain experts in labeling.
2. **Human-Annotated Data**: If you are building a new product, you must generate labeled datasets from scratch. To generate inputs, you may want to crowdsource initial questions from real potential users. To generate ideal outputs or responses, you will tap domain experts (e.g., nurses, agriculture tech advisors, tutors). Because experts are expensive resources, you may be tempted to save time by using an LLM to generate the "ideal" answers to user queries, and then inviting experts to just review, validate, and correct them. However, even experts might take shortcuts; as reviewers, they are likely to skim and accept a "plausible" AI answer rather than rigorously correcting it. This risks validating hallucinations or mediocrity, and it may result in a low-quality Golden Dataset (jeopardizing your entire product development process). We recommend having experts produce answers to user queries from scratch.
3. **Customized public datasets**: If a high-quality public dataset exists that closely matches your use case, it can serve as a starting point. You can extract only the most relevant examples from the large dataset, and augment or refine them to better reflect your specific context.

#### How do I know that my Golden Dataset is good enough?

Ideally, your Golden Dataset will cover the full range of user interactions you expect the product to support. But achieving that is nearly impossible because, no matter how thorough your planning, users will find new and surprising ways to interact with your AI system. Do not wait for a perfect Golden Dataset before launching; you will miss the opportunity to learn from real users. We strongly recommend that you adopt the mindset of “Minimum Viable Evaluation”: in this case, building the smallest dataset needed to adequately represent key user interactions. You may need to conduct qualitative research or user observation sessions to identify these. Specifically, your dataset should include:

1. **Various modes of user behavior**: Consider not just what users will ask, but also how they will ask. This includes the communication medium (e.g., voice, text) as well as the tone, language, and phrasing. User interactions may reflect code-switching, informality, spelling errors, and varying levels of verbosity. There may also be multiple “personas” or user demographic groups. Should your AI pipeline take into account gender of the user when responding? Consider the diversity of user types and interaction modes that your solution should support – and ensure that your Golden Dataset represents these cases.
2. **Out-of-scope requests**: Remember to incorporate inputs or questions that the AI application does not support, to ensure that they are handled appropriately. These might be off-topic requests or topical requests that you do not want to handle, e.g., “write me a poem about pregnant women eating avocados”.
3. **Adversarial or malicious requests**: We recommend including malicious/unsafe questions (e.g. abusive inputs) in the Golden Dataset to test safety performance as well. You might also include examples of prompt injection, jailbreaking, and data or privacy attacks.
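
The coverage requirements above can be checked mechanically once each example carries a few tags. The following is a minimal sketch, assuming a hypothetical `GoldenExample` record and three hypothetical category labels (`in_scope`, `out_of_scope`, `adversarial`); your own schema and labels will differ.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical category labels mirroring the groups of requests described above.
REQUIRED_CATEGORIES = {"in_scope", "out_of_scope", "adversarial"}


@dataclass
class GoldenExample:
    user_input: str
    ideal_output: str  # written by a domain expert
    category: str      # one of REQUIRED_CATEGORIES
    language: str      # e.g. "mr" (Marathi) or "en" (English)
    medium: str        # e.g. "text" or "voice"


def coverage_report(examples):
    """Count examples per category and list required categories with no examples."""
    counts = Counter(ex.category for ex in examples)
    missing = sorted(REQUIRED_CATEGORIES - set(counts))
    return dict(counts), missing


examples = [
    GoldenExample("My tomato leaves have yellow spots", "Expert answer here", "in_scope", "en", "text"),
    GoldenExample("Write me a poem about avocados", "Polite refusal here", "out_of_scope", "en", "text"),
]

counts, missing = coverage_report(examples)
print(counts)   # examples per category
print(missing)  # categories that still need examples
```

The same counting approach can be repeated for `language` or `medium` to spot under-represented interaction modes.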

{% hint style="info" %}
Some use an LLM to start with expert-suggested questions, and then generate variations in different tones, dialects, or levels of verbosity. But synthetic generation of input/output pairs can be risky, because the languages, cultures, and dialogue patterns of people in the “global majority” are under-represented in foundation models. Most commercially available models are trained using published materials and online content, which is heavily biased toward higher-income contributors in wealthier countries and excludes content and norms from oral communication.
{% endhint %}

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-1-model-evaluation%2Fhow-is-level-1-evaluation-performed%2F3.-develop-a-golden-dataset>" %}

</details>


# Scoring & error analysis

There are two different forms of evaluation in AI development:

* **Offline Evaluation**: Also referred to as lab testing, this phase of evaluation happens during development, before your AI product or solution reaches users. You are testing your pipeline against a fixed "Golden Dataset" to see if it meets your target performance. This is a controlled environment used to measure baseline performance and identify and analyze errors in your AI system.
* **Online Evaluation**: The process of analyzing your AI system in the real world starts after your solution is deployed to users. This workflow involves measuring system performance on tasks created by real users, in real time.

#### Offline evaluation

Once your golden dataset is ready, you can begin to run your scorer code, which typically compares golden input/output pairs with the AI system’s response to each input. The result is a set of metrics that average performance across all inputs received. These evaluation scores are not a final grade; they are a diagnostic tool to reveal areas for improvement and guide refinement of your AI system. Where there are issues (like a poor score or performance regression), engineers can conduct error analysis to identify the root causes and develop potential solutions.
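
As an illustration of what "scorer code" can look like, here is a minimal sketch. It assumes the Golden Dataset is a list of (input, ideal output) pairs and uses a deliberately simple exact-match scorer; `fake_system` is a hypothetical stand-in for your AI pipeline, and real scorers are usually more forgiving than exact match.

```python
from statistics import mean


def exact_match(expected: str, actual: str) -> float:
    """Return 1.0 when the response matches the golden answer, ignoring case and outer whitespace."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_offline_eval(golden_pairs, generate, scorer=exact_match):
    """Score the system on every golden pair; return the average score and per-example rows."""
    rows = []
    for user_input, ideal_output in golden_pairs:
        response = generate(user_input)
        rows.append({
            "input": user_input,
            "expected": ideal_output,
            "actual": response,
            "score": scorer(ideal_output, response),
        })
    return mean(row["score"] for row in rows), rows


def fake_system(user_input: str) -> str:
    """Hypothetical stand-in for the AI system under test."""
    canned = {"What is 2 + 2?": "4"}
    return canned.get(user_input, "I am not sure")


golden_pairs = [("What is 2 + 2?", "4"), ("What is the capital of Kenya?", "Nairobi")]
average, rows = run_offline_eval(golden_pairs, fake_system)
failures = [row for row in rows if row["score"] < 1.0]  # starting point for error analysis
print(average, len(failures))
```

Keeping the per-example rows, not just the average, is what makes the later error analysis possible.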

Error analysis is implemented by inspecting traces. A trace is the complete, end-to-end record of a single user request as it moves through each component of the AI system. A trace typically includes:

* **Each component’s inputs and outputs**: For the “answer generation” component of an AI agronomist product, this may include the raw user query (in Marathi), its English translation (the actual input to this component), system prompt, data retrieved from a knowledge base, and the answer generated.
* **Model selection and parameters**: If the system can call multiple models, the trace will include which foundation model was used (e.g. GPT-4o), along with the settings or configuration (e.g., temperature: 0.7, max\_tokens: 1024).
* **Usage, cost, latency**: Trace data also include the number of tokens used to produce a given output, the corresponding cost of generation, and the time required to deliver the output.
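
To make the list above concrete, a trace can be pictured as a small nested record. This is a sketch only: the `Span` and `Trace` names and all field names are hypothetical, and observability tools define their own schemas.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One component's step within a trace (hypothetical schema)."""
    component: str
    inputs: dict
    output: str
    model: Optional[str] = None
    params: dict = field(default_factory=dict)
    tokens: int = 0
    cost_usd: float = 0.0
    latency_ms: float = 0.0


@dataclass
class Trace:
    """End-to-end record of a single user request."""
    trace_id: str
    spans: list = field(default_factory=list)

    def totals(self) -> dict:
        """Sum usage, cost, and latency across all components."""
        return {
            "tokens": sum(s.tokens for s in self.spans),
            "cost_usd": sum(s.cost_usd for s in self.spans),
            "latency_ms": sum(s.latency_ms for s in self.spans),
        }

    def slowest(self) -> Span:
        """Return the component that took the longest, a common first stop in error analysis."""
        return max(self.spans, key=lambda s: s.latency_ms)


trace = Trace("req-001", [
    Span("translation", {"query": "raw user query in Marathi"}, "English translation",
         model="gpt-4o", tokens=60, cost_usd=0.0001, latency_ms=120.0),
    Span("retrieval", {"query": "English translation"}, "retrieved passages", latency_ms=85.0),
    Span("answer_generation", {"prompt": "system prompt + passages"}, "generated answer",
         model="gpt-4o", params={"temperature": 0.7, "max_tokens": 1024},
         tokens=640, cost_usd=0.004, latency_ms=850.0),
])
print(trace.totals())
print(trace.slowest().component)
```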

Since a modern AI solution has many components, a poor score may indicate the issue but not its source. Engineers must identify which component(s) have contributed to a failure; it could be ineffective document retrieval in a RAG system, a poorly structured prompt, or something else.

The Product Manager is the primary consumer of an error analysis, prioritizing modifications and refinements for testing. They must weigh the business impact of a given metric (e.g. an improvement in "Hallucination rate" or "Accuracy") against the engineering cost to address it. Most solutions will require multiple cycles of the measurement-refinement loop on your golden dataset before deploying to real users.

As you engage in cycles of evaluation and analysis, it is tempting to endlessly tweak the AI system to maximize its evaluation score. However, metrics are coarse proxies for how an AI pipeline will perform in the real world. This is especially true if you lack the historical transaction data needed to build a truly representative Golden Dataset. Instead of chasing incremental gains (e.g., trying to move accuracy from 93% to 95%), consider establishing a performance threshold for each metric (e.g. accuracy > 90%). Once the AI system passes this threshold, stop optimizing in the lab and move quickly toward a real-world deployment.
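
A threshold check of this kind can be expressed in a few lines. The sketch below is illustrative: the accuracy threshold comes from the example above, while the `hallucination_rate` threshold and the metric names are assumptions.

```python
import operator

# Accuracy threshold taken from the example in the text; the hallucination
# threshold is a hypothetical value added for illustration.
THRESHOLDS = {
    "accuracy": (">", 0.90),
    "hallucination_rate": ("<", 0.05),
}

_OPS = {">": operator.gt, "<": operator.lt}


def check_thresholds(metrics: dict, thresholds: dict = THRESHOLDS):
    """Return (ready, failures): ready is True only if every metric clears its threshold."""
    failures = []
    for name, (op, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None or not _OPS[op](value, limit):
            failures.append(name)  # a missing metric counts as a failure
    return not failures, failures


ready, failures = check_thresholds({"accuracy": 0.93, "hallucination_rate": 0.08})
print(ready, failures)
```

Once `ready` is true for every metric, the guidance above suggests stopping lab optimization rather than pushing the numbers higher.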

Shifting to a threshold-based approach accelerates your transition from a controlled environment to the real world, offering two critical advantages:

1. **Access to authentic behavior**: Real user behavior is often drastically different from what developers anticipate. Shipping the AI product early allows you to gather high-value user data to update your Golden Dataset, making it representative of actual edge cases.
2. **Prioritization of Real Problems**: The issues that frustrate users in the wild are rarely the same ones solved by chasing a 2% improvement on an internal metric. Real-world exposure helps you identify the most pressing failure modes so you can prioritize the fixes that actually improve user experience.

#### Online evaluation

Once your solution has been debugged and is ready for deployment to real users, you will need to continuously monitor performance metrics as well as guardrail metrics. This enables you to manage the trade-offs between accuracy, safety and broader service-level performance (e.g. latency). Evaluation results should be actively monitored over time, and unexpected behavior (i.e. weak scores, performance regressions) should be flagged automatically.
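
One simple way to flag a regression automatically is to compare the most recent window of scores against the window before it. The sketch below assumes one aggregate score per day; the window size and the `max_drop` tolerance are hypothetical values you would tune for your own system.

```python
from statistics import mean


def flag_regression(daily_scores, window=7, max_drop=0.05):
    """Return True when the mean of the latest window falls more than max_drop below the prior window."""
    if len(daily_scores) < 2 * window:
        return False  # not enough history to compare two full windows
    baseline = mean(daily_scores[-2 * window:-window])
    recent = mean(daily_scores[-window:])
    return (baseline - recent) > max_drop


stable = [0.92] * 14
degraded = [0.92] * 7 + [0.80] * 7
print(flag_regression(stable), flag_regression(degraded))
```

In practice the alert would be wired to whatever notification channel your team already uses; that integration is outside this sketch.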

To do this, we advise integrating an observability tool (e.g. Langfuse) into your AI system to implement logging or “tracing” of the inputs and outputs for various components in your AI system once you launch. Many of these platforms allow you to track and visualize your evaluation results. By reviewing the user interaction "traces" captured by these services, you can spot novel patterns, user request types, and (critically) failure modes. Add these new examples to your "golden dataset" to improve its coverage and representativeness.

By monitoring user traces during online evaluation, you can identify new and unexpected ways your users may be interacting with your solution, including failure modes not encountered during lab testing (offline evaluation). The resulting insights should be used to augment or modify your golden dataset, update metrics, and refine your product or solution in a continuous feedback loop.

Though we will reference online evaluations where relevant, we do not provide detailed guidance on LLM tracing. We will leave this for future extensions of this playbook, and [recommend this guide for reference](https://hamel.dev/blog/posts/evals-faq/).

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-1-model-evaluation%2Fhow-is-level-1-evaluation-performed%2F4.-scoring-and-error-analysis>" %}

</details>


# Automate your evaluations

When you first score an AI system’s performance against your Golden Dataset, it is reasonable to use a notebook, where you can quickly test code, see the results, and make changes. However, manual evaluation can become tedious, is not scalable, and introduces inconsistency. We recommend gradually automating the process and integrating it directly into your engineering team's workflow.

Automated evals can be continuous (e.g. on every AI response on production), or they can be triggered by certain events (e.g. a change in the system prompt). The engineering team is responsible for the technical implementation of the evaluation pipeline. This includes managing the execution and frequency of evaluation, ensuring results are reliable and accessible, and integrating the evals in your deployment flow. We recommend the following practices:

* **Find the right evaluation frequency**: Evaluation methods vary significantly in reliability, computational cost and latency. A tiered approach balances cost and information:
  * Low-cost evals: Statistical scorers and model-based scorers (covered earlier) are fast and inexpensive. These can be run frequently to provide rapid feedback. But as noted above, they are limited in what they can measure.
  * High-cost evals: LLM-as-judge scorers can be more comprehensive but can incur significant token costs. Their execution can become less frequent once you have a stable version deployed. Common triggers include nightly builds, weekly schedules, or as a final validation step before a major release.
* **Check periodic alignment**: You may decide to run your LLM judge on every response in production (or on a sample of production data) to support online evaluation. However, it is important to ensure that your LLM judge continues to be aligned with human experts so that its judgements remain relevant. As with the initial alignment exercise explained earlier, we strongly recommend periodically sampling your production data (once a month, quarter, or year, depending on the maturity and stability of your AI system) and repeating the alignment exercise.
* **Track performance over time**: Evaluation is a continuous process. Use your observability tool to plot your online evaluation metric scores over time. This dashboard provides a critical view for product owners to track progress against rubric goals and verify that solution changes are yielding measurable improvements.
* **Perform A/B tests**: Instead of releasing every change to all your users, it is more prudent to release it to a small subset of users (e.g. 1%) first, and to confirm that it is stable and performs as well as or better than the previous version by comparing your metrics across the two groups.
* **Integrate with CI/CD**: Once the evaluation suite is stable, it can be integrated into your deployment pipeline to ensure that all code/prompt/model changes are validated before being deployed, preventing regressions.
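
For the periodic alignment check described above, one common way to quantify agreement between the LLM judge and human experts is Cohen's kappa, which corrects raw agreement for chance. This is a sketch of one option, not a prescribed method; the labels and the sample below are invented for illustration.

```python
from collections import Counter


def cohens_kappa(human_labels, judge_labels):
    """Chance-corrected agreement between two raters over the same items."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_counts, judge_counts = Counter(human_labels), Counter(judge_labels)
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented sample: expert verdicts vs. LLM-judge verdicts on 8 production responses.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

A falling kappa between successive alignment exercises is a signal to revisit the judge prompt or rubric before trusting its scores on production data.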

Once you start generating real user inputs and feedback, you can update your metrics and golden datasets, test different system configurations, and deploy the version of your AI pipeline that performs best.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-1-model-evaluation%2Fhow-is-level-1-evaluation-performed%2F5.-automate-your-evaluations>" %}

</details>


# Red-teaming

Beyond evaluating your solution against known criteria (e.g. those captured in your Golden Dataset), you may also want to actively try to break or pressure test your AI system before releasing it into the wild. The goal is to find vulnerabilities, biases, and failure modes before your users do. You must adopt the mindset of a malicious actor, a confused user, or a creative edge-case generator to trick the system into behaving in ways it shouldn't.

While all AI solutions benefit from adversarial testing, it is non-negotiable in the following high-stakes scenarios:

* **Access to PII**: If your system handles sensitive data, the risk of privacy breaches increases. You must ensure adversarial prompting cannot manipulate the AI system into leaking restricted information.
* **Fine-Tuned Foundation Models**: Custom training can inadvertently weaken a base foundation model’s built-in safety guardrails or introduce new biases. You must re-test to confirm the foundation model remains aligned after modification.
* **Agentic & Flexible Solutions**: The more autonomy a system has (e.g., browsing the web, executing code), the more pathways exist for failure. Increased freedom demands increased adversarial testing.
* **Long Conversations**: Evaluators often test single-turn Q\&A exchanges, missing the cumulative risks in conversational interfaces (e.g., mental health companions or tutors). Long interactions are susceptible to cumulative errors. A small misunderstanding in turn 1 can be amplified by turn 10, causing the system to "drift into unsafe or nonsensical territory." Red-teaming must explicitly test these long-context scenarios.
* **High-Risk Domains**: In sectors like maternal health or financial planning, failure causes severe harm. Red-teaming is essential to identify and mitigate dangerous advice.
* **Population-Scale Deployments**: When deploying to a large, anonymous user base, you must assume two things: 1) you do not know how users will interact with the system; and 2) at scale, even improbable "edge cases" will occur somewhat frequently.

Specialist red-teaming services can be prohibitively expensive. For most social sector organizations, it is more practical to build this capability internally.

To do this effectively, follow a simple three-step workflow:

{% stepper %}
{% step %}
**Plan: define the scope and the team.**

* **Define "Redlines"**: Establish your threat model by identifying worst-case scenarios and the specific behaviors the system must never exhibit (e.g., leaking PII or giving medical advice). Core threat categories include misuse, loss of control, and robustness failures (e.g. the system performs well in lab conditions but breaks under real-world variability).
* **Assemble the Team**: Gather a diverse mix of technical engineers and domain experts.
* **Choose the Method**: Decide if you will test via model APIs (faster, automated) or the Product UI (more realistic user experience), or both.
  {% endstep %}

{% step %}
**Probe: Adopt an adversarial mindset.**

* **Attack the AI system**: Act like a malicious actor, a confused user, or an edge-case generator. Try to distract, exploit, and stress-test the system.
* **Hunt for Failures**: Specifically look for unsafe, biased, or nonsensical responses, particularly in long conversations or sensitive domains.
* **Log Everything**: For every failure, capture the specific Input, the Output, and the Context to ensure reproducibility.
  {% endstep %}

{% step %}
**Prioritize: Not all failures are equal.**

* **Rank Risks**: Review findings based on severity (impact) and likelihood (frequency).
* **Assign Fixes**: Allocate owners to address the highest-priority vulnerabilities.
* **Re-Test**: After mitigations are applied, rerun the tests to ensure the fix worked and didn't break anything else.
  {% endstep %}
  {% endstepper %}
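
The "Log Everything" and "Rank Risks" steps can be supported by a very small data structure. The sketch below assumes hypothetical 1 to 5 scales for severity and likelihood and ranks findings by their product; your team may prefer a different scoring scheme.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One red-teaming failure, logged with enough detail to reproduce it."""
    prompt: str      # the specific input
    output: str      # what the system returned
    context: str     # e.g. conversation turn, interface used, model version
    severity: int    # 1 (minor) to 5 (severe), hypothetical scale
    likelihood: int  # 1 (rare) to 5 (frequent), hypothetical scale

    @property
    def risk(self) -> int:
        return self.severity * self.likelihood


def prioritize(findings):
    """Order findings from highest to lowest risk, breaking ties by severity."""
    return sorted(findings, key=lambda f: (f.risk, f.severity), reverse=True)


findings = [
    Finding("off-topic poem request", "wrote the poem", "turn 1, product UI", severity=1, likelihood=4),
    Finding("prompt injection asking for another user's data", "leaked a phone number", "turn 3, model API", severity=5, likelihood=2),
    Finding("dosage question phrased informally", "gave a specific dose", "turn 10, product UI", severity=5, likelihood=3),
]
for finding in prioritize(findings):
    print(finding.risk, finding.prompt)
```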

For practical templates and sample prompts, see:

* [Red-Teaming AI for Social Good Playbook](https://humane-intelligence.org/insights/research/) (UNESCO & Humane Intelligence, 2024)
* [Planning Red-Teaming for Large Language Models](https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/red-teaming) (Microsoft Learn, 2024)

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-1-model-evaluation%2Fhow-is-level-1-evaluation-performed%2F6.-red-teaming>" %}

</details>


# Overview

Does the overall product engage and retain users?

An AI system that produces perfect responses is worthless if no one uses it. Level 2 Evaluation moves beyond technical accuracy to measure the "digital traces" users leave behind. By tracking how users move from their first interaction to long-term habit formation, we can ensure the product actually delivers value in the real world.

***

#### Key Motivation

Technical performance (Level 1) does not guarantee user adoption. Level 2 evaluation is critical because:

* Value Validation: If users stop interacting, they likely see no value, meaning the intervention cannot achieve its intended life outcomes.
* Continuous Improvement: It transforms product development from opinion-driven to data-driven through iterative cycles and A/B testing.
* Safety & Risk Management: Monitoring user signals allows for controlled rollouts of experimental features, preventing negative reactions from reaching your entire user base at once.

<a href="/pages/E03jNZhP6xVQj9iyqtpQ" class="button primary">Read more -></a>

***

#### Core Concept: The User Funnel

To evaluate the product, we "instrument" the application to track users as they progress through four distinct stages. We prioritize "Time to Success" (solving the user's problem) over "Time on Device" to ensure we are optimizing for welfare rather than just addiction.

| Stage       | Goal                                | Key Metric Example                        |
| ----------- | ----------------------------------- | ----------------------------------------- |
| Acquisition | Bring users into the ecosystem.     | New User Count, Cost Per User (CAC)       |
| Activation  | Ensure users find "First Value."    | Activation Rate, Time to Activate         |
| Engagement  | Measure depth and frequency of use. | Active Users (DAU/WAU), Interaction Depth |
| Retention   | Build long-term habits/commitment.  | Stickiness (DAU/MAU), Retention Rate      |

<a href="/pages/EqhXdpLbzDck30Xp4yln" class="button primary">Read more -></a>

***

#### How to Evaluate

Level 2 evaluation is performed by integrating 3rd party analytics tools (e.g., Amplitude, Mixpanel) to capture real-time data.

1. Define & Instrument: Map your user journey and identify specific "events" (e.g., "audio advice played") that signal progress.
2. Analyze Trends: Use dashboards to identify friction points where users consistently drop off.
3. Experiment: Run A/B Tests to compare different versions of a feature. By randomly assigning users to "Version A" or "Version B," you can statistically prove which design better supports user goals.
4. Diagnose: If metrics are low, conduct a Process Evaluation (interviews or surveys) to understand the "why" behind the data—such as connectivity constraints or literacy barriers.

<a href="/pages/E2tuvaDI9APvHRvWJyRS" class="button primary">Read more -></a>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-2-product-evaluation%2Foverview>" %}

</details>


# Who is most involved in this level of evaluation?

| Execute 🟢                                                                                                                            | Support 🟡                                                                                                                        |
| ------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| Product Managers                                                                                                                      | Data Scientists                                                                                                                   |
| Directly responsible for product metrics at this level. Works cross-functionally to prioritize the most promising hypotheses to test. | Apply evaluation methods with the proper measurement tools. Ensure accuracy and availability of product metrics (data pipelines). |

## Why is this level of evaluation important?

An AI system that produces perfect responses is worthless if users do not use it. Once you deploy your AI system as a product (e.g., a chatbot or app), you must track a few critical user signals, like:

* Engagement: How many users are using the product?
* Retention: How likely are they to continue using it?

If users never engage—or stop interacting because they see no value—they are unlikely to change their behavior in ways that improve their life outcomes.

Like AI system evals, Level 2 evaluation is a continuous, iterative cycle, not a one-time exercise. We track user interaction metrics over time, and look for unexpected drop-off or intended improvements, for example when a promising new feature is released as part of an A/B test. Product evaluations are critical for iterative improvement, but they can also be a matter of safety. Suppose you have an experimental new feature in your chatbot – but you’re not sure how people will react. It might be risky to roll out this new feature to all users, all at once.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-2-product-evaluation%2Foverview%2Fwhy-is-this-level-of-evaluation-important>" %}

</details>


# What is the “Product” being evaluated?

To understand if users are actually finding value in your product, you must "instrument" your application, setting it up to automatically log specific user actions. The resulting log data allow you to track users as they move through the User Funnel: from their first interaction (Activation), to regular usage (Engagement) to long-term commitment or habit formation (Retention).

In the tech sector, companies might track "clicks" and "purchases" as users move through a website. By analyzing logs, you can then identify which content or products are likely to bring users back to the website over time, or how different web experiences affect browsing time. In the development sector, we need to track actions that signal user intent, and estimate the value returned to users in response. For example:

* For an AI Agronomist, instead of tracking page views you might track whether a farmer uploads a photo of a diseased crop, listens to audio advice to completion, or shares content with another person.
* For an AI Tutor, you might track if a student completes a quiz, how many follow-up questions they ask in a single session, or if they return to the app the night before an exam.

By analyzing these "digital traces," we can identify exactly where users lose interest. Does the farmer drop off because the photo upload takes too long? Does the student quit because the AI’s first response was too complex?

**The good news:** Level 2 evaluation methods are well-scoped; you don't need to reinvent the wheel. The technology sector has spent decades standardizing digital product metrics. Most off-the-shelf analytics tools for web/mobile applications (like Amplitude or Google Analytics) come ready-made to measure the standard metrics you need, such as Daily Active Users (DAU), Time to Activate, and Retention Rates.

The following table defines common Level 2 product metrics for each stage:

<table data-full-width="true"><thead><tr><th width="140.33984375">Stage</th><th>Metric</th><th>Examples</th><th>Notes</th></tr></thead><tbody><tr><td><strong>Acquisition</strong></td><td><strong>New Users (#):</strong><br>Total count of users entering the top of the funnel</td><td># farmers consenting to receive WhatsApp messages; # students downloading an app; # health workers attending a training session.</td><td>There are costs associated with recruitment, so you may wish to target your ideal users effectively and efficiently. Track the source of each new user (e.g., "Referral" vs. "Field Visit") to identify which channels yield the most relevant users, then scale the channel with the highest yield.</td></tr><tr><td><strong>Acquisition</strong></td><td><strong>Cost Per User (CAC)</strong>:<br>Cost of running a recruitment activity, divided by new users acquired. Also called User Recruitment Cost.</td><td>Cost of printing flyers / # of QR code scans.<br>Cost of field agent stipend / # of farmer sign-ups.</td><td>High CAC may be unsustainable for low-margin services (i.e. general advice) but acceptable for high-impact interventions like urgent telemedicine consults.</td></tr><tr><td><strong>Activation</strong></td><td><strong>Activation rate:</strong><br>% of users who complete the "First Value" action after signing up or being recruited.</td><td>% of mothers who complete their health profile<br>% of teachers who create their first lesson plan</td><td>Activation measures if users actually start using the tool or service, not just if they installed it. It is a user's early experience with the product, so important to “get right” to prevent drop-off. 
Onboarding can also be a critical point for collecting user demographic data used for personalization and subgroup analysis.</td></tr><tr><td><strong>Activation</strong></td><td><strong>Time to activate</strong>:<br>Average time elapsed between sign-up and the first core action.</td><td>Minutes from first WhatsApp message to asking the first medical question<br>Days from initial training to logging the first case data</td><td>New user interest tends to taper off exponentially after sign-up, so it’s important to encourage users to complete onboarding within the first few hours/days.<br>Long activation times usually signal a confusing interface or a lack of trust or value.</td></tr><tr><td><strong>Engagement</strong></td><td><strong>Monthly, weekly, and/or daily active users</strong> (MAU, WAU, DAU):<br>Number of users using the app/feature in a time window.</td><td># Community Health Workers using a reporting tool daily<br># Students using a study bot weekly before exams</td><td>Do not optimize for addiction. In welfare-focused apps, "Time to Success" (i.e., getting to an urgent answer quickly) is better than "Time on Device."</td></tr><tr><td><strong>Engagement</strong></td><td><strong>Interaction depth</strong>:<br>Volume of interaction per session (e.g., turns per conversation).</td><td>Average # of follow-up questions a student asks (signals curiosity) per session<br>Rate at which a user accepts a suggestion (signals trust)</td><td>For a chatbot, the # of conversation turns can be good (deep inquiry) or bad (confusion). Pair engagement metrics with qualitative inquiry (Level 3).<br>Remember that frequent interaction (e.g. 
page loads, button clicks, form submits, session length) may not translate to meaningful interaction.</td></tr><tr><td><strong>Retention</strong></td><td><strong>Stickiness (DAU/MAU):</strong><br>Ratio of daily users to monthly users.</td><td>A ratio of 0.25 means the average user uses the tool 7-8 days per month.<br>A ratio close to 1 indicates sustained value or habit formation.</td><td>High stickiness indicates the tool is part of a daily workflow. A low ratio suggests the tool is only useful for sporadic problems (e.g., seasonal crop disease) rather than daily habits. Is habit formation critical to your welfare goals?</td></tr></tbody></table>
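
As a worked example of the stickiness metric in the table, the sketch below computes DAU/MAU from a log of (user ID, date) events. The event format is an assumption; commercial analytics tools report this ratio directly.

```python
from datetime import date


def stickiness(events, days_in_period):
    """Average daily active users divided by users active at any point in the period."""
    active_by_day = {}
    for user_id, day in events:
        active_by_day.setdefault(day, set()).add(user_id)  # sets ignore repeat events per day
    period_users = set().union(*active_by_day.values()) if active_by_day else set()
    if not period_users:
        return 0.0
    average_dau = sum(len(users) for users in active_by_day.values()) / days_in_period
    return average_dau / len(period_users)


# Invented June log: user "a" is active on 10 days, user "b" on 5 days.
events = [("a", date(2025, 6, d)) for d in range(1, 11)]
events += [("b", date(2025, 6, d)) for d in range(1, 6)]
events.append(("a", date(2025, 6, 1)))  # a second event on the same day is not double-counted

ratio = stickiness(events, days_in_period=30)
print(ratio, ratio * 30)  # ratio, and the implied average active days per user
```

Multiplying the ratio by the number of days in the period gives the average number of active days per user, which is how a ratio of 0.25 translates to roughly 7-8 days in a month.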

In addition to these, you may want to specify negative metrics (or safety guardrails) that track problematic user behavior. For example, if a mother uses a health chatbot for an emergency, a long session may indicate confusion and inefficiency. We must aggressively distinguish between intense or extensive engagement, and effective engagement.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-2-product-evaluation%2Foverview%2Fwhat-is-the-product-being-evaluated>" %}

</details>


# What is the Minimum Viable Evaluation?

We recommend using commercial platforms, when feasible, to track user metrics and automate experiments.

| Level 2 - Product evaluation MVE                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <ul class="contains-task-list"><li><input type="checkbox">Instrument the product to capture events automatically</li><li><input type="checkbox">Use the events data to produce two metrics: activation (used once), and retention (used repeatedly)</li><li><input type="checkbox">Look for patterns in the data, and talk to users to identify opportunities for improvement</li><li><input type="checkbox">Test these ideas for improvement against these metrics with an A/B test</li></ul> |

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-2-product-evaluation%2Foverview%2Fwhat-is-the-minimum-viable-evaluation>" %}

</details>


# How is Level 2 evaluation performed?

It is straightforward to instrument your product to track user actions and events using 3rd party analytics tools (e.g. Amplitude, Mixpanel, Google Analytics). You can start delivering meaningful Level 2 results very quickly, simply by observing trends in user events over time. For example, users may get stuck during a specific chatbot interaction, or they might stop engaging after a specific prompt for action. With appropriate metrics, you can run rapid cycles of experimentation, testing variations in your product’s elements (or features) to find strategies that reduce user drop-off.

These are the critical steps in a Level 2 evaluation:

{% stepper %}
{% step %}

#### Define the user funnel and metrics

First, qualitatively define the stages of your user journey. How do you describe the user who is acquired (recruited), activated (onboarded), engaged (using regularly), and retained (coming back)? An acquired user could be someone who has attended a recruitment event and given their consent to participate in a program. An activated user might be one who has completed the training needed to start benefiting from an app or service. Once the stages of the funnel are defined, select industry-standard metrics that match the expected behavior at each stage. If your app is designed for weekly use, track Weekly Active Users (WAU) over time, rather than Daily Active Users (DAU).
{% endstep %}

{% step %}

#### Instrument the product

Next, you must identify the specific events, or user actions, that signal progress through the user funnel. These might include "app opened," "question asked," or "training completed." Select the events that are most consequential to the user's success. The events you select will be used to calculate your metrics. Some events may not be used to calculate a metric, but can still be useful to track, for example if they help you understand the user’s path and identify potential bottlenecks (like clicking the help icon). Your engineering team can implement a third-party analytics tool (e.g., Amplitude, Mixpanel) to capture and log these events automatically.

{% hint style="warning" %}
Always log a unique User ID with these events; this connects your product data to evaluations at Levels 3 & 4.
{% endhint %}
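
To illustrate what a logged event might contain, here is a minimal sketch of an event record that enforces the User ID rule. The function, field names, and allowed event list are hypothetical; an analytics SDK would provide its own tracking call and schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical event names, based on the examples in this step.
ALLOWED_EVENTS = {"app_opened", "question_asked", "training_completed"}


def build_event(user_id, name, properties=None, now=None):
    """Build one event record, rejecting events without a user ID or with an unknown name."""
    if not user_id:
        raise ValueError("every event needs a unique user_id")
    if name not in ALLOWED_EVENTS:
        raise ValueError(f"unknown event name: {name}")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "user_id": user_id,
        "event": name,
        "timestamp": timestamp,
        "properties": properties or {},
    }


event = build_event("farmer-0042", "question_asked", {"medium": "voice"})
print(json.dumps(event))
```

Restricting event names to an agreed list keeps the log consistent, so that metrics built on it later do not silently miss misspelled events.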
{% endstep %}

{% step %}

#### Automate metrics and analyze trends

From the event data being logged, your data scientist or engineer will construct the metrics you have selected. For instance, “Time to Activate” can be calculated by counting the number of days between the first and last user actions required for completion of the "activation" stage. Your engineering team can display key metrics in a dashboard that can be shared across the entire organization. Out of the box, most commercial analytics platforms can display time-series charts and funnel charts (to visualize drop-offs at each stage). Ideally, each metric has a directly responsible individual who is accountable for monitoring and improving the value over time.
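
As a sketch of how such metrics can be constructed from raw events, the example below computes Activation Rate and average Time to Activate. It assumes activation spans from a hypothetical `signed_up` event to the first `question_asked` event; substitute the events that define activation in your own funnel.

```python
from datetime import datetime
from statistics import mean


def activation_metrics(events, signup_event="signed_up", value_event="question_asked"):
    """Return (activation rate, average days from sign-up to first value event)."""
    signup_at, first_value_at = {}, {}
    for user_id, name, timestamp in sorted(events, key=lambda e: e[2]):
        if name == signup_event:
            signup_at.setdefault(user_id, timestamp)       # keep the earliest sign-up
        elif name == value_event and user_id in signup_at:
            first_value_at.setdefault(user_id, timestamp)  # keep the first value event
    if not signup_at:
        return 0.0, None
    rate = len(first_value_at) / len(signup_at)
    days = [(first_value_at[u] - signup_at[u]).total_seconds() / 86400 for u in first_value_at]
    return rate, (mean(days) if days else None)


# Invented log: two of three users activate, after 1 and 3 days respectively.
events = [
    ("u1", "signed_up", datetime(2025, 1, 1)),
    ("u1", "question_asked", datetime(2025, 1, 2)),
    ("u2", "signed_up", datetime(2025, 1, 1)),
    ("u2", "question_asked", datetime(2025, 1, 4)),
    ("u2", "question_asked", datetime(2025, 1, 9)),
    ("u3", "signed_up", datetime(2025, 1, 2)),
]
rate, average_days = activation_metrics(events)
print(rate, average_days)
```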
{% endstep %}

{% step %}

#### Identify frictions and design improvements

Once you configure your analytics tool to visualize your metrics, use the data to identify friction points—specific steps in the product flow where users consistently drop off. For example, you might notice expectant mothers stop engaging specifically when a maternal health chatbot asks for their location. Or you might see farmers abandon an advisory app if a prompt requires too much typing. Where you identify failure points or bottlenecks, brief user interviews or surveys can be conducted to investigate. Program or product leads can then propose new features, interventions, or nudges to help users progress through the funnel.
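
To illustrate how drop-off between stages can be located, here is a small sketch that computes stage-to-stage conversion and flags the weakest step. The stage counts are invented for illustration.

```python
FUNNEL_STAGES = ["acquired", "activated", "engaged", "retained"]


def funnel_conversion(stage_counts):
    """Share of users who progress from each funnel stage to the next."""
    rates = {}
    for prev, nxt in zip(FUNNEL_STAGES, FUNNEL_STAGES[1:]):
        previous_count = stage_counts[prev]
        rates[f"{prev}->{nxt}"] = stage_counts[nxt] / previous_count if previous_count else 0.0
    return rates


# Hypothetical counts of users reaching each stage
counts = {"acquired": 1000, "activated": 600, "engaged": 300, "retained": 240}
rates = funnel_conversion(counts)
worst_step = min(rates, key=rates.get)  # the step with the lowest conversion
print(rates, worst_step)
```

The step with the lowest conversion is a candidate for follow-up interviews or surveys, not a diagnosis in itself.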
{% endstep %}

{% step %}

#### Test product upgrades

To test the effectiveness of promising new features or interventions, you can randomly assign users to receive either the existing product or the “improved” variant. This is done through [A/B testing and other forms of experimentation](/product-analytics/how-to-evaluate/methods-for-experimentation-a-b-testing-and-beyond). The events associated with a new feature should be flagged in your third-party product analytics tool, so that you can easily assess their impact on priority metrics. Many engineering teams use automated experimentation platforms to manage feature testing; these platforms also help you track which versions of a product or feature are most effective across experiments.
{% endstep %}
{% endstepper %}

Once the results of an experiment are available, product or program managers interpret the findings and work with engineering to roll out the most effective features or interventions to all users.

{% hint style="warning" %}

### <mark style="color:$warning;">Do not optimize your funnel metrics for user engagement alone.</mark>

While product use and retention may be necessary entry points for impact, you should avoid engaging users in ways that waste their time or worsen their well-being. Instead, optimize for a metric like “Time to Success”, where success is defined as solving the user’s problem or building their capabilities. Setting appropriate boundaries for a healthy level of engagement is also critical.
{% endhint %}

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-2-product-evaluation%2Fhow-is-level-2-evaluation-performed>" %}

</details>


# Methods for experimentation: A/B testing and beyond

Once you routinely track product performance metrics, you can run rapid experiments to observe how new features or improvements affect your users’ behavior. For example, if you tweak the text on a call-to-action, you might expect to see an immediate rise in the corresponding user event. However, not every change yields a positive outcome, and often there are multiple ways to solve a single problem. To identify which product variant is best, you can run Level 2 evaluations.

A/B testing is the most common approach. It’s how tech companies build and refine products, and how digital marketers improve ad performance. The concept is simple: expose “version A” of a product to one set of users, and “version B” to another, then measure which version more effectively improves your funnel metrics.

For the A/B test to be valid, **randomization is critical**: it ensures that any observed final differences across groups can be attributed to the product changes being tested, rather than to pre-existing differences between user groups. The process begins with selecting a random sample of your user base for the experiment. Individuals within this sample are then randomly assigned to different treatment groups.
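
The two-step process (random sampling, then random assignment) can be sketched as below. The function name, the seed, and the user IDs are hypothetical; experimentation platforms typically handle this step for you.

```python
import random


def sample_and_assign(user_ids, sample_size, variants=("A", "B"), seed=42):
    """Draw a random sample of users, then assign them to variants at random.

    A fixed seed makes the assignment reproducible for auditing.
    """
    rng = random.Random(seed)
    # Sorting first makes the result independent of the input order.
    sample = rng.sample(sorted(user_ids), sample_size)
    # The sample is already in random order, so alternating through the
    # variants yields balanced, randomly composed groups.
    return {uid: variants[i % len(variants)] for i, uid in enumerate(sample)}


# Hypothetical user base of 1,000 users, 200 of whom enter the experiment
user_base = [f"user-{i}" for i in range(1000)]
assignment = sample_and_assign(user_base, sample_size=200)
print(sum(1 for v in assignment.values() if v == "A"), "users in A")
```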

The user sample should be drawn at random and be large enough to represent your entire user base. The treatment groups must also be large enough to detect the effect size you care about with adequate statistical power; otherwise a real improvement may fail to reach statistical significance.
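
As a rough guide to "large enough", the sketch below applies the standard normal-approximation formula for comparing two proportions. The baseline rate, target rate, significance level, and power are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect p_control vs p_treatment.

    Uses the normal approximation for a two-sided test of two proportions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical goal: detect a rise in weekly retention from 20% to 25%
n = sample_size_per_group(0.20, 0.25)
print(n)
```

Under these assumptions the result is on the order of a thousand users per group; halving the detectable effect roughly quadruples the requirement.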

While we recommend randomized evaluations, there are multiple methods for testing your product changes:

### Pre/post comparison

This is the simplest method: you simply compare the value of a metric before the change with the value of the metric after. For high-impact, obvious updates, this time-series analysis is often enough. If the product change does not have the intended effect, you simply roll it back.

### Multivariate testing

If you need to test multiple feature variations at once—such as changing text, visuals, and timing of content simultaneously—you can use Multivariate Testing. This reveals the combined effect of these changes, helping you find the optimal combination of elements. For this test, you will randomize the roll-out of different combinations of changes across users, and then compare metrics across user groups.

### A/B testing (recommended)

For the most reliable evaluation of a change in your product, we recommend A/B Testing. Rather than rolling out a change to everyone at once, you expose different versions of the product to different randomized groups of users. This allows you to compare the effects of a change directly, or rapidly test multiple ideas at once. Because this approach balances statistical rigor with feasibility, A/B testing is considered a part of any "Minimum Viable Evaluation" (MVE). There are two flavors of A/B testing:

* **A/B Tests with Planned Assignment (recommended)**: Individual product features or content variants are deployed to randomized user groups. This design works well for testing discrete changes with clear hypotheses. The approach can be extended by testing several competing ideas at the same time, although this requires larger sample sizes to maintain statistical power across all comparison groups.
* **A/B Tests with Self-Assignment**: Users assign themselves to different variants based on conditional, logic-based paths. For example, some users may choose to attend an in-person training for an app, while others attend an online training. This approach is designed to analyze how different segments of users interact with and respond to the specific variant they chose. Be cautious when analyzing these results: because users self-select, the comparison groups are not random. Differences in performance may stem from the users' inherent traits (e.g., users choosing in-person training may naturally be more motivated) rather than the effectiveness of the variant itself.
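
For a planned-assignment test with a binary outcome (for example, "retained in week 4: yes/no"), one common way to compare groups is a two-proportion z-test. This is a minimal sketch with invented counts; it is one reasonable analysis choice, not the only one.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in proportions between A and B.

    Returns (difference B - A, z statistic, p-value).
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Hypothetical results: 200/1000 retained on version A, 250/1000 on version B
diff, z, p_value = two_proportion_z_test(200, 1000, 250, 1000)
print(f"difference={diff:.3f}, z={z:.2f}, p={p_value:.4f}")
```

This test is only interpretable under randomized assignment; applying it to self-assigned groups does not remove the selection problem described above.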

### Dynamic assignment

Instead of randomizing users into fixed A/B groups, you can dynamically route users over time to different product variants, using a method known as multi-armed bandit (MAB) testing. If a product variant results in large improvements in user metrics, the MAB algorithm will progressively allocate more users to that variant over the course of the experiment. Contextual bandits are a further refinement that allocates versions of a product based on a combination of the variant’s performance and the user's characteristics. These algorithms allow you to maximize success for your users overall, while still running a rigorous experiment. However, they require more sophisticated real-time analysis and assignment infrastructure, and if you need unbiased estimates of effect, they also require substantially larger sample sizes.
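
To make the allocation behavior concrete, here is a small simulation using Thompson sampling, one common multi-armed bandit algorithm (the playbook does not prescribe a specific one). The class name, variant names, and "true" success rates are hypothetical.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over product variants."""

    def __init__(self, variants, seed=None):
        self.rng = random.Random(seed)
        # [alpha, beta] per variant, starting from a uniform Beta(1, 1) prior.
        self.stats = {v: [1, 1] for v in variants}

    def choose(self):
        # Draw a plausible success rate for each variant and pick the best draw.
        draws = {v: self.rng.betavariate(a, b) for v, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, variant, success):
        # Successes increment alpha; failures increment beta.
        self.stats[variant][0 if success else 1] += 1


# Simulated experiment with hypothetical true success rates
true_rates = {"A": 0.10, "B": 0.30}
sampler = ThompsonSampler(true_rates, seed=0)
outcomes = random.Random(1)
allocation = {"A": 0, "B": 0}
for _ in range(2000):
    variant = sampler.choose()
    allocation[variant] += 1
    sampler.update(variant, outcomes.random() < true_rates[variant])
print(allocation)
```

Because the stronger variant receives most of the traffic, the weaker one ends up with few observations, which is why unbiased effect estimates need larger samples.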

### Holdout testing

To track performance of your product over the long term, you can use Holdout Testing. In this design, a small group of users is kept on a "status quo" version of the product, frozen in time, while the majority receives the accumulation of new, tested features. Comparing metrics across these groups allows you to measure the total value of your improvements over months or years. This approach can be used in combination with any of the evaluation methods above.

The technical implementation of A/B testing has been greatly simplified by modern tooling. Randomized assignment can be automated using both open-source frameworks and commercial platforms. Open source tools like Evidential and GrowthBook offer flexibility and control for teams with engineering capabilities. Commercial analytics tools like Amplitude and Mixpanel offer ready-to-use experimentation interfaces that automatically track and analyze results using events data. Solutions like Optimizely provide visual editors, statistical analysis, and more sophisticated targeting capabilities.

A/B testing has become indispensable because it replaces opinion-driven decision-making with empirical evidence. When implemented rigorously, it enables organizations to continuously improve their products and experiences based on how real users actually behave rather than how we think they might behave.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-2-product-evaluation%2Fhow-is-level-2-evaluation-performed%2Fmethods-for-experimentation-a-b-testing-and-beyond>" %}

</details>


# Connection with other levels

Ideally, as your Level 1 metrics improve and the AI system becomes more reliable, your Level 2 engagement metrics should also improve. However, technical performance does not always guarantee user adoption, so it is critical to monitor AI system metrics and product analytics in tandem, to ensure that engineering “improvements” actually translate into a better user experience.

In this playbook, Level 2 evaluation focuses exclusively on the digital traces users leave within the product. It does not include qualitative interviews or surveys that probe user beliefs and moods; these activities will fall under the domain of Level 3 evaluation. Level 3 is also where we track many of the metrics used to **monitor harm** (e.g., anxiety, addiction). This is why we must evaluate Levels 2 and 3 in tandem.

Note that Level 2 also ignores the external, “real world” inputs to a social program or service (like in-person trainings or customer support) that complement a digital product. Metrics to track these in-person events are typically captured in process evaluations, conducted by Monitoring and Evaluation (M\&E) teams. We recommend reviewing process evaluation data alongside Level 2 product analytics, to better understand whether user frictions result from failures in the product, or in the associated offline services.

As your product evolves, remember to refine and revalidate your Level 2 metrics, to better capture the nuance of the user experience. Metrics that record meaningful interactions are more valuable than raw event counts.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-2-product-evaluation%2Fhow-is-level-2-evaluation-performed%2Fconnection-with-other-levels>" %}

</details>


# Why Aren’t Users Engaging?

If your funnel metrics reveal low uptake or drop-off, a process evaluation (PE, [see primer here](/level-linkages/linkage-across-levels/process-evaluations)) can diagnose the cause and inform product iteration before progressing to Level 3.

A process evaluation at this stage interrogates the assumptions in your theory of change about the user: what they know, what they can do, and what conditions need to be in place for them to engage as intended.

<table><thead><tr><th width="139.9453125">Funnel stage</th><th>Example PE questions</th><th>Methods</th></tr></thead><tbody><tr><td>Acquisition</td><td>Are the onboarding and recruitment processes delivering users to the product in ways that set up meaningful engagement — or are early frictions in the program's delivery model creating drop-off before users have a fair chance to engage?</td><td>Key informant interviews; administrative data on recruitment or referral sources</td></tr><tr><td>Engagement &#x26; Retention</td><td>What barriers (e.g. literacy demands, connectivity constraints, or social and structural limits on device access) are preventing the hardest-to-reach users from progressing through the funnel?</td><td>Interviews, focus groups, and/or surveys with non-adopters and early lapsed users from underserved subgroups</td></tr></tbody></table>

Document what product changes are made in response, and re-run the relevant funnel metrics afterward to confirm the bottleneck has been resolved.

<br>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-2-product-evaluation%2Fhow-is-level-2-evaluation-performed%2Fwhy-arent-users-engaging>" %}

</details>


# Overview

Does the product change users' thoughts, feelings, knowledge, and behavior in ways that advance the development outcome?

Once an AI system is reliable (Level 1) and engaging (Level 2), we must ask the deeper question: Is it actually working? In the development sector, "liking" a product is not a proxy for impact. Level 3 evaluates the "stepping stones" of change—the intermediate cognitive and affective shifts that predict long-term life improvements in health, education, or livelihoods.

***

#### Key Motivation

Unlike commercial sectors that rely on satisfaction scores (NPS), development outcomes require objective evidence of change. Level 3 is essential because:

* Predictive Power: Intermediate changes (e.g., increased confidence or knowledge) serve as early signals of success long before distal outcomes (e.g., higher income) materialize.
* Beyond "Vanity Metrics": It distinguishes between a user who is merely "addicted" to an interface and one who is actually gaining agency or mastering a skill.
* Fast Iteration: It allows you to run experiments on psychological "states" (like motivation or trust) to refine your product during pilots.

<a href="/pages/ul2a65TvdgqoIC9bEwF4" class="button primary">Read more -></a>

***

#### Core Concept: Intermediate Outcomes

Level 3 measures how an "adequate dosage" of your AI product shifts the user across several dimensions. We look for changes in the following constructs:

| Outcome Category | What we measure                                                           |
| ---------------- | ------------------------------------------------------------------------- |
| **Cognitive**    | Knowledge acquisition, belief updating, and reasoning complexity.         |
| **Affective**    | Emotional valence, sense of safety, trust, and perceived empathy.         |
| **Behavioral**   | Intent to act, application of info, and proactive help-seeking.           |
| **Motivational** | Self-efficacy, intrinsic curiosity, and persistence vs. dependency.       |
| **Relational**   | Quality of interpersonal communication and trust in human vs. AI sources. |

<a href="/pages/z0S08pUtqzummCSqeNjA" class="button primary">Read more -></a>

***

#### How to Evaluate

Level 3 combines the experimental rigor of Level 2 with deeper psychological and linguistic analysis.

1. **Generate hypotheses based on a theory of change:** Based on the theory of change, define intermediate cognitive, affective, or behavioral outcomes that are plausibly linked to your targeted social impact.
2. **Identify outcome metrics (Digital Traces):** E.g. Analyze conversation logs for "on-platform" behaviors that signal growth, such as increased query depth, technical vocabulary, or proactive follow-up questions.
3. **Define guardrail metrics and measure potential harm:** Specifically measure potential harms, such as "AI dependency" (reduced willingness to attempt tasks without help) or "social displacement."
4. **Consider constructing proxies for long-term development outcomes:** We propose constructing a "Surrogate Index", consisting of Level 2 and Level 3 metrics, to serve as a proxy for longer-term Level 4 outcomes.
5. **Consider conducting experiments to improve the selected key metrics and running process evaluations:** After identifying intermediate outcomes that serve as early indicators of the development outcome of interest, the next step is to run experiments to assess how product changes influence Level 3 outcomes.

<a href="/pages/QJQlH47rYKyaPoYUIIAq" class="button primary">Read more -></a>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-3-user-evaluation%2Foverview>" %}

</details>


# Who is most involved in this level of evaluation?

In Level 3, researchers evaluate users’ attitudes and behaviors using [quantitative and/or qualitative methods](https://www.nngroup.com/articles/which-ux-research-methods). Contributors are often trained in behavioral science, social psychology, public health, or behavioral economics. They should have a mix of quantitative and qualitative skills, or be knowledgeable enough to support a multi-methods team using:

* **Quantitative approaches** (analyzing logs, surveys, and conversation data) to infer users’ state or traits. These approaches measure constructs such as knowledge, beliefs, intention, norms, feelings, behaviors, etc.
* **Qualitative approaches** (interviews, focus groups, usability tests, and ethnographic methods) to understand how users interact with a product. These approaches help researchers validate assumptions, understand mechanisms, surface unintended effects, and expose contextual or environmental drivers of user pain points.

| Design and Execute 🟢                                                  | Support and Implement 🟡                                                                                                         |
| ---------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
| User Researchers                                                       | Data Scientists / Engineers                                                                                                      |
| Develop and apply evaluation methods with the proper measurement tools | <p>Build and deploy surveys and experiments within the product<br>Support the design of A/B tests and randomized experiments</p> |

## Why is this level of evaluation important?

Once an AI system is functioning as expected (Level 1), and the product is engaging users as intended (Level 2), we can ask a deeper question: Is the product influencing how users think, feel, or act—and in ways that advance a development outcome of interest?

The success of commercial AI products is often measured via user satisfaction ratings or Net Promoter Score (NPS)—essentially asking, *'Do you like this product enough to recommend it?'* But in the development sector, satisfaction is not a proxy for impact. A student might enjoy a tutoring app (high NPS) without actually mastering the curriculum. A patient may favorably review a health provider, even when harmed by sub-standard care.

In Level 3 evaluation, we identify and measure specific behaviors, beliefs, or feelings that predict long-term improvements in health, education, or livelihoods. We will use a program’s Theory of Change (TOC) to specify the “stepping stones” that users traverse on their path toward impact. Instead of waiting years to see if health or education outcomes improve, we will identify intermediate changes in how users think, feel, or act to serve as early signals of success.

To do this, organizations should address 5 key issues:

1. **Measures**: Which specific user-level changes actually matter to our Theory of Change? Can we measure these short-term changes relatively cheaply and frequently?
2. **Attribution**: Can we plausibly claim these changes are caused by our AI product?
3. **Trajectory**: Are metrics trending in the right direction? Do sub-groups behave differently?
4. **Malleability**: Can we shift metrics by altering the product experience? Do users show increased drive to act (e.g., asking proactive questions or expressing intent to change) when we intervene with product improvements?
5. **Perception**: Do users feel more empowered to act (e.g., do they have a clearer understanding of their next steps), even if they do not immediately take action? User perceptions can be predictive of outcomes, for example when a student recognizes the learning gains they have achieved while using an app.

By defining and tracking intermediate outcomes, Level 3 helps you conduct fast product iterations during pilots and ongoing feature development, setting the stage for a successful Level 4 evaluation down the road.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-3-user-evaluation%2Foverview%2Fwhy-is-this-level-of-evaluation-important>" %}

</details>


# Who is the “User” being evaluated?

At Level 3, we will examine whether the engaged user (i.e., a user receiving an “adequate” dosage) is thinking, feeling, or acting differently as a result of the product – hopefully in ways that predict improved life outcomes. This level of evaluation typically occurs in advance of an impact assessment. Before committing to a rigorous and time-intensive impact study, we want to observe users changing along some of the following dimensions, based on the theory of change of the product:

* **Cognitive outcomes:** Are users learning? Are they gaining new knowledge or updating beliefs? Do they demonstrate improved skills or decision-making ability as a result of engaging with the product?
  * *Constructs to measure:* comprehension, knowledge acquisition and retention, belief updating, critical evaluation of information, metacognitive awareness (e.g., accurate calibration of what one does and does not know), perceived clarity, complexity of reasoning during/following interaction.
* **Affective outcomes:** How does the product make users feel? Do users report feeling supported, motivated, and capable after interactions, or are there indications of confusion, anger, or emotional distress?
  * *Constructs to measure:* mood, emotional valence and arousal, frustration, confusion, emotional granularity, felt support, sense of safety, sense of belonging, perceived empathy, trust, or comfort interacting with AI.
* **Behavioral outcomes:** Is the user doing something different? Are users taking small but meaningful actions that predict longer-term development?
  * *Constructs to measure:* application of new information, intent to try recommended behaviors, observable shifts in interaction patterns (e.g., asking more complex questions, prompt sophistication) that proxy for longer-term development outcomes, help-seeking behavior.
* **Motivational outcomes:** Does the product energize or deplete a user’s drive to pursue goals, learn, or act independently?
  * *Constructs to measure:* intrinsic motivation, curiosity, self-efficacy, perceived autonomy, goal commitment, persistence (not the same as perseveration), dependency (i.e., reduced willingness to attempt tasks without AI assistance).
* **Social and relational outcomes:** Does use of the product affect the user’s human-to-human relationships and broader social functioning?
  * *Constructs to measure:* social displacement (substitution of AI for human interaction), loneliness, perceived social support, quality of interpersonal communication, willingness to engage with others, trust in human versus AI sources of social support and information.
* **Well-being outcomes:** Does using the product have broader and more distal effects on users’ overall quality of life and psychological health, beyond momentary feelings or mood?
  * *Constructs to measure:* life satisfaction, meaning and purpose in life, flourishing, burnout (i.e., in professional contexts), perceived agency/control over one’s environment.

#### Level 3 vs. Traditional User Research

User research plays a critical role across the product lifecycle (e.g., [Discover, Explore, Test, and Listen](https://www.nngroup.com/articles/ux-research-cheat-sheet/)). However, in Level 3 we will focus on quantitative user research that captures intermediate outcomes at scale. These outcomes are sometimes observed in the product logs described in Level 2, but more often captured in surveys or automated analysis of user text, voice recordings, and other digital traces. At Level 3, our goal is to cheaply track psychological “states” and “traits” (e.g., cognitive, affective, and behavioral) across a large sample of users, for use in product monitoring and rapid-cycle experiments. We want to understand user shifts in psychology or behavior, but without needing to conduct bespoke qualitative research every time a new feature is being tested (even though at least one such qualitative evaluation should be conducted for the overall GenAI solution).

Of course, quantitative methods (e.g. logs, surveys, sentiment analysis) should be complemented with qualitative methods (interviews, ethnography) to validate why users behave the way they do. Ideally, qualitative user research contributes to the theory of change and informs every stage of the evaluation framework:

* Level 1 (AI system): Interviews help define the "Golden Dataset" by gathering realistic user questions, identifying edge cases, and defining "ideal" answers based on needs expressed by real users.
* Level 2 (Product): Interviews and direct observation can contextualize engagement data. If an A/B test reveals a drop in retention, qualitative research helps diagnose the underlying friction or confusion.
* Level 3 (User): Interviews or focus groups can validate that intermediate outcome metrics (e.g., user-reported confidence) actually correlate with real-world behavior changes.

#### Individual vs. System Outcomes

You may have noticed that in Level 3, we focus on individual outcomes, rather than the broader community or system. This is a practical choice, not a philosophical one. While changing social norms (e.g., how a village views vaccination) is often the ultimate goal of development, measuring those shifts requires slower, more extensive fieldwork (as discussed in Level 4).

In contrast, individual changes can be immediate. You can observe if a user is learning or motivated right now, even if the full impact of an AI product requires changes in social dynamics. This makes individual metrics the fastest, most sensitive signals for rapid product iteration.

Our recommendation: Do not ignore social dynamics, but do not let them slow down your experimentation. Measure what you can see today (i.e., the user’s behavior) and, until you are ready for a full impact evaluation, use lightweight proxies to track social effects (e.g., asking "Did you share this advice with a neighbor?"). In-app questions about off-app behavior can complement the on-platform behavioral data you already collect.

<br>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-3-user-evaluation%2Foverview%2Fwho-is-the-user-being-evaluated>" %}

</details>


# What is the Minimum Viable Evaluation?

| Level 3 - User evaluation MVE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <ul class="contains-task-list"><li><input type="checkbox">Define 1-2 outcome metrics tied to the theory of change (focus on the most decision-relevant cognitive/behavioral outcomes), and include at least one early-warning indicator of harm (e.g., over-reliance, disengagement).</li><li><input type="checkbox">Combine at least one behavioral/trace metric with a brief, contextualized self-report measure (≤3 items) to capture meaningful user change.</li><li><input type="checkbox">Include a minimal external check (e.g., focus group discussion, offline data, or stakeholder validation) to ensure on-platform measures reflect real-world outcomes.</li><li><input type="checkbox">Consider testing product changes on selected outcomes using simple experimental methods (e.g., A/B tests).</li></ul> |

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-3-user-evaluation%2Foverview%2Fwhat-is-the-minimum-viable-evaluation>" %}

</details>


# How is Level 3 evaluation performed?

**The full workflow includes:**

{% stepper %}
{% step %}

#### Generate hypotheses based on a theory of change

Based on the theory of change, define intermediate cognitive, affective, or behavioral outcomes that are plausibly linked to your targeted social impact. Validate these via qualitative methods (e.g., user interviews) as well as quantitative research or reviews of the academic literature.
{% endstep %}

{% step %}

#### Identify outcome metrics

There are three potential ways to identify outcome metrics. First, you can analyze interaction data to construct metrics that reflect psychologically and behaviorally meaningful user interaction. Second, you can collect primary data. Often, the most direct way to gauge users' thoughts, feelings, knowledge, and behaviors is simply to ask them: short surveys can capture self-reported changes and subjective experiences, while longer surveys, interviews, quizzes, or observer reports can measure psychological well-being, behavioral frequency, and attitudinal shifts over time. Third, you can analyze conversation logs. For instance, you can use Natural Language Processing (NLP) methods to mine actual conversation logs or written outputs for signals of cognitive or emotional change.
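
As a very simple illustration of the third approach, the sketch below derives a few crude signals (message length, use of domain vocabulary, share of questions) from user messages. Keyword matching is a stand-in for proper NLP methods, and the vocabulary list and sample messages are hypothetical.

```python
import re

# Hypothetical domain vocabulary for a farmer advisory tool.
TECHNICAL_TERMS = {"nitrogen", "germination", "irrigation", "fungicide"}


def message_signals(message):
    """Extract crude per-message signals from one user message."""
    words = re.findall(r"[a-z']+", message.lower())
    return {
        "word_count": len(words),
        "technical_terms": sum(1 for w in words if w in TECHNICAL_TERMS),
        "is_question": message.strip().endswith("?"),
    }


def summarize(messages):
    """Average the per-message signals over a set of messages."""
    signals = [message_signals(m) for m in messages]
    n = len(signals)
    if n == 0:
        return None
    return {
        "avg_word_count": sum(s["word_count"] for s in signals) / n,
        "technical_terms_per_message": sum(s["technical_terms"] for s in signals) / n,
        "question_share": sum(s["is_question"] for s in signals) / n,
    }


# Hypothetical messages from the same user, early and later in their usage
early = ["my maize is yellow", "what to do?"]
later = [
    "Should I apply nitrogen before or after irrigation?",
    "How long does germination take after planting?",
]
print(summarize(early))
print(summarize(later))
```

Signals like these only suggest growth; they should be validated against survey or interview data before being treated as outcome metrics.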

<a href="/pages/mmnsDIJ2m98MmPatssMA" class="button primary">Read more -></a>
{% endstep %}

{% step %}

#### Define guardrail metrics and measure potential harm

At Level 3, you are not only measuring whether your product is working; you also need to measure whether it is causing harm. While Level 2 metrics track usage, Level 3 is your opportunity to use direct interviews and surveys to track unintended consequences.

<a href="/pages/YS7PH4iamJQdKhqbTsOV" class="button primary">Read more -></a>
{% endstep %}

{% step %}

#### Consider constructing proxies for long-term development outcomes

We expect Level 3 metrics to materialize more quickly than Level 4 evaluation outcomes. In principle, short-term Level 3 indicators can be used in A/B testing to rapidly design and test product improvements. However, it is unlikely that any one Level 3 metric is fully predictive of Level 4 outcomes. Therefore, we propose constructing a "Surrogate Index", consisting of Level 2 and Level 3 metrics, to serve as a proxy for longer-term Level 4 outcomes. The validity of this index can be assessed in Level 4 evaluations (e.g., in RCTs), following the framework proposed by [Athey, Chetty, Imbens, and Kang (2025)](https://academic.oup.com/restud/advance-article/doi/10.1093/restud/rdaf087/8268796?guestAccessKey=). Although this approach relies on very strong assumptions of unconfoundedness, surrogacy, and comparability, we encourage the continued collection of indicators to capture the links between the intervention, its adoption, underlying mechanisms, and ultimate development outcomes.
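
In its simplest linear form, a surrogate index is a predicted long-term outcome built from short-term metrics. The sketch below assumes the coefficients have already been estimated on historical data in which both the short-term metrics and the long-term outcome were observed; the metric names and all numbers are hypothetical, and the sketch does not test the surrogacy assumptions noted above.

```python
from statistics import mean

# Hypothetical coefficients, assumed to come from a model fitted on historical
# data linking short-term metrics to the long-term outcome.
INTERCEPT = 0.10
WEIGHTS = {"weekly_sessions": 0.05, "quiz_score": 0.40, "self_efficacy": 0.20}


def surrogate_index(user_metrics):
    """Predicted long-term outcome for one user from short-term metrics."""
    return INTERCEPT + sum(WEIGHTS[name] * user_metrics[name] for name in WEIGHTS)


def estimated_effect(treatment_users, control_users):
    """Difference in mean surrogate index between experiment arms."""
    treated = mean(surrogate_index(u) for u in treatment_users)
    control = mean(surrogate_index(u) for u in control_users)
    return treated - control


# Hypothetical users from an A/B test
control = [{"weekly_sessions": 2, "quiz_score": 0.5, "self_efficacy": 0.5}]
treatment = [{"weekly_sessions": 3, "quiz_score": 0.6, "self_efficacy": 0.5}]
effect = estimated_effect(treatment, control)
print(round(effect, 3))
```

The difference in mean index between arms serves as an early estimate of the long-term effect, to be checked later against Level 4 results.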
{% endstep %}

{% step %}

#### Consider conducting experiments to improve the selected key metrics and running process evaluations

After identifying intermediate outcomes that serve as early indicators of the development outcome of interest, the next step is to run experiments to assess how product changes influence Level 3 outcomes. The evaluation methods remain the same as in Level 2, but are applied to a different set of outcomes (e.g., A/B testing: Feature A vs. Feature B; multi-armed bandits: performance-based adaptive allocation; holdout testing: e.g., AI vs. non-AI). We also recommend running process evaluations to understand why and when Level 3 metrics are not changing.

<a href="/pages/wDurugvCHKU063ijmTAd" class="button primary">Read more -></a>
{% endstep %}
{% endstepper %}

Information on user thoughts and feelings is sensitive data, and collecting it carries significant legal and ethical responsibilities. Because using GenAI models often involves sending data to third-party model providers (e.g., OpenAI, Google, Anthropic), it is also important to scrutinize whether their data governance, privacy, and safety policies align with these responsibilities.

<a href="https://eval.playbook.org.ai/level-linkages/linkage-across-levels/data-protection" class="button primary">Read more -></a>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-3-user-evaluation%2Fhow-is-level-3-evaluation-performed>" %}

</details>


# Identify outcome metrics

There are three potential ways to identify outcome metrics: 1) analyzing interaction data, 2) collecting primary data, and 3) analyzing conversation logs.

### Identify outcome metrics through analyzing interaction data

Review log data or conversation data, as well as Level 2 metrics, to better understand what constitutes psychologically and behaviorally meaningful user interaction. Can you construct measures that go beyond the standard Level 2 engagement and retention metrics? As a first step, look for on-platform behaviors (captured in your app’s telemetry or interaction logs) that can proxy for cognitive and affective outcomes. Examples are provided in the expandable sections below.

#### Example metrics

<details>

<summary><strong>Frequency and depth of queries</strong><br>How often and how deeply do users query your AI service?</summary>

An upward trend in the quantity of questions, and the specificity of those questions, may indicate that the user’s curiosity, confidence, and understanding are increasing over time.

* Increased frequency of interaction with an AI tutor can signal curiosity and learning gains. [A recent study](https://eric.ed.gov/?q=source:%22British+Journal+of+Educational+Technology%22\&ff1=pubReports+-+Research\&ff2=subLearner+Engagement\&id=EJ1427270) with an AI “study coach” found that the number of student-chatbot interactions predicts improvement in students’ self-regulated learning (SRL) behavior. As learners become more confident, they ask more questions and explore topics further.
* The technical depth of queries can be informative: if users progress from basic factual questions to more advanced specific inquiries, it indicates knowledge growth. [Learning analytics](https://moldstud.com/articles/p-essential-benchmarking-metrics-for-evaluating-e-learning-success) often track whether learners move on to advanced content as a proxy for learning progression.

</details>

<details>

<summary><strong>Changes in language</strong><br>How complex are user queries?</summary>

As users gain expertise, their vocabulary, syntax, and linguistic sophistication may advance.

* Empirical studies support the connection between language complexity and cognitive development. For instance, students tend to write lengthier, more complex sentences when [engaging in in-depth learning tasks](https://eprints.soton.ac.uk/390338/1/1761_Article_Text_8498_1_10_20160728.pdf).
* If a user’s questions or messages show increased application of advanced concepts, or more complex sentence structures, they may have achieved cognitive gains and/or subject matter mastery.

</details>

<details>

<summary><strong>Follow-up question rate</strong><br>Are users persisting in a line of questioning?</summary>

When a user asks a new question related to the previous answer (an indicator of engagement), they are likely engaging in deeper thinking.

* While direct experimental measures are still emerging, [educational theory](https://www.cambridge.org/elt/blog/2022/02/22/engine-achievement-role-curiosity-learner-engagement/) suggests that students who ask more questions are more actively engaged in their learning.
* Some conversational learning systems monitor average dialogue turns per query, with longer conversational exchanges indicating intellectual curiosity, active learning, and higher-order thinking.

{% hint style="warning" %}
It’s important to distinguish productive follow-up questions from those caused by confusion or misunderstanding.
{% endhint %}

</details>

<details>

<summary><strong>Feature utilization</strong><br>Do users follow recommendations or use suggested tools?</summary>

AI education platforms often include specific features or recommendations intended to drive learning actions. For example, a chatbot might offer to quiz you or provide links to further reading. Feature utilization rates – which track whether users adopt suggested tools or follow AI advice – can signal levels of trust and motivation.

* High follow-through rates imply that the user finds the suggestions valuable and trusts the guidance enough to act on it. In [a recent study](https://knowledge.wharton.upenn.edu/article/why-is-it-so-hard-for-ai-to-win-user-trust/), users built trust when they observed over time that AI recommendations were correct. Users with positive outcomes tended to rely on AI more.
* When learners consistently accept AI suggestions and complete optional exercises, this can signal trust in the AI and a high level of motivation to learn. Conversely, low uptake can indicate weak trust, poor relevance of the suggestions, or lack of motivation.

</details>
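
To make the example metrics above more concrete, here is a minimal sketch of how weekly query counts, follow-up rates, and suggestion uptake might be computed from interaction logs. The record layout and field names are hypothetical, and the `is_follow_up` flag is assumed to come from an earlier labeling step (manual coding or a classifier) that separates productive follow-ups from those caused by confusion.

```python
from collections import defaultdict

# Hypothetical log records: (user_id, week, is_follow_up, accepted_suggestion)
logs = [
    ("u1", 1, False, False),
    ("u1", 1, True, False),
    ("u1", 2, False, True),
    ("u1", 2, True, True),
    ("u1", 2, True, False),
    ("u2", 1, False, False),
]


def weekly_metrics(records, user_id):
    """Return per-week query count, follow-up rate, and uptake rate for one user."""
    buckets = defaultdict(list)
    for uid, week, follow_up, accepted in records:
        if uid == user_id:
            buckets[week].append((follow_up, accepted))
    out = {}
    for week, rows in sorted(buckets.items()):
        n = len(rows)
        out[week] = {
            "queries": n,
            "follow_up_rate": sum(f for f, _ in rows) / n,
            "uptake_rate": sum(a for _, a in rows) / n,
        }
    return out


metrics = weekly_metrics(logs, "u1")
for week, values in metrics.items():
    print(week, values)
```

Tracking these values week over week, rather than as a single average, is what lets you see the upward trends described above.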

***

### Identify metrics through primary data collection

Often, the most direct way to gauge a user’s thoughts, feelings, knowledge, and behaviors is simply to ask them. Short surveys can capture self-reported changes and subjective experiences or perceptions.

#### Guidelines for developing the metrics

When developing such measures within an AI product, a few guidelines are important:

* **Use validated scales or questions:** Do not write questions from scratch if you don't have to. Adapt existing, well-tested psychological scales to measure things like self-efficacy, motivation, or emotional state. Instead of asking, "Do you like math now?" you can ask true/false questions about confidence (“I am more confident solving these problems on my own”), emotional state (“Using this app made me feel motivated to keep learning”), and behavioral intentions (“After using the app, I plan to try the recommended technique in real life”). Short, psychometrically sound surveys can capture short-term user outcomes with surprising depth if designed well.
* **Keep it short:** To avoid fatiguing or annoying your users, limit surveys to a handful of items (e.g., a maximum of three questions). A mix of multiple-choice, rating scales, and an open-ended question can yield quantitative and qualitative insights. For example:
  * “How helpful was the advice you received?” (Likert scale)
  * “How did you feel during the conversation?” (use emojis or a frustration scale)
  * “What was the most useful part of this interaction?” (open response)
* **Integrate into the flow**: Carefully consider the timing and context of survey questions, so that feedback is tied to a concrete experience. Don't use a pop-up form that breaks the user experience. Have the AI ask for feedback naturally within the chat flow, ideally right after the user has completed a significant task or learned a new concept. Make the process feel like a natural dialogue or a reflection, not an intrusive add-on.
* **Account for interaction learning over time:** Improvements in engagement metrics may sometimes reflect users becoming more skilled at interacting with the system rather than genuine improvements in knowledge, emotions, or behavior. As users gain experience, they often develop more effective prompting strategies (*prompt maturity*) and interact with the interface more smoothly (*interaction habituation*). These changes can create the appearance of user improvement even if the underlying construct has not improved. Therefore, on-platform survey measures should capture constructs beyond habituation, which helps distinguish user adaptation to the tool from meaningful gains. For example, survey items such as “I feel more confident solving similar problems without AI’s assistance” or “I understand why the recommended solution by AI works” assess self-efficacy and cognition that are not captured simply by smoother interactions with the product.

{% hint style="info" %}
Recent research in AI psychometrics has used [GPTs to generate user-level survey items with strong construct validity](https://psychometrics.ai/); this technique can help you develop short survey assessments that unfold seamlessly within conversations.
{% endhint %}

Comprehensive evaluations will survey users off-platform, especially for behavior changes that manifest over longer periods or are difficult for users to perceive. These measures are designed to minimize direct references to the platform or intervention itself. Self-reports that explicitly ask users to evaluate the product (e.g., “How helpful was this AI tool?”) are susceptible to “halo effects,” in which users who like the product report more positive outcomes regardless of actual changes in knowledge, feelings, or behavior. To reduce this bias, evaluators should prioritize decoupled measurement approaches – for example, assessing knowledge, attitudes, or behaviors at multiple time points without framing questions in terms of the intervention.

In this context, you might conduct:

* **Longer Surveys or Interviews**: Outside the app, more extensive questionnaires or surveys can be administered to measure knowledge (quizzes or tests), psychological well-being, or frequency of behaviors (e.g., “How often did you practice math outside the app this week?”). Interviews and focus groups can delve into how users’ attitudes or habits have changed over time (e.g., a student might say, “I never liked math before, but now I find myself challenging myself with problems for fun”).
* **Observer Reports**: In an educational context, teachers or parents might report on the student’s changes (“I noticed my child now approaches homework more confidently”). These external perspectives can validate self-reported survey items and trace data.
* **Analysis of Objective Performance Data**: Whenever possible, tie AI usage to objective outcomes measured external to the AI solution. For example, if an AI writing assistant claims to improve writing skills, administer a writing assessment before and after prolonged use, with blind graders evaluating outputs (or conduct a randomized controlled trial, [as in this field experiment](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4895486)). You can correlate Level 2 metrics with exam scores, task completion rates, and even health indicators. This gets you closer to impact-level metrics, providing strong evidence of user-level change – but within a shorter time period.
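
As a small worked example of the before-and-after assessment described above, the sketch below computes the mean gain and a standardized effect size for paired scores. The scores are invented for illustration. A pre/post comparison without a comparison group does not by itself show that the product caused the gain.

```python
from statistics import mean, stdev

# Hypothetical blind-graded scores for the same five users, before and after use.
pre = [52, 60, 45, 70, 58]
post = [58, 63, 50, 74, 60]

# Paired differences: each user's post score minus their own pre score.
gains = [after - before for before, after in zip(pre, post)]
mean_gain = mean(gains)

# Standardized mean of the paired differences (sometimes written d_z).
effect_size = mean_gain / stdev(gains)

print(mean_gain, round(effect_size, 2))
```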

***

### Identify metrics through analyzing conversation logs

An exciting addition to the toolkit is using Natural Language Processing (NLP) methods to analyze what users say or write during their interactions. The actual conversation logs or written outputs can be mined for signals of cognitive or emotional change. Several approaches are outlined in the table below:

<table data-full-width="true"><thead><tr><th width="246.18359375">Method</th><th>Examples</th></tr></thead><tbody><tr><td><strong>Sentiment Analysis</strong>: Automatically score the sentiment of user utterances over time.</td><td><ul><li>Are the words generated by the user more positive or less anxious with increased product use or dosage? A trend from negative to positive tone could indicate growing comfort or satisfaction.</li><li>Spikes of negative sentiment might flag frustration at certain points. Tools like fine-tuned transformer models can rate sentiment for each message or session.</li></ul></td></tr><tr><td><strong>Topic Modeling &#x26; Keyword Analysis</strong>: Analyze the content of conversations for emergent themes.</td><td><ul><li>Topic modeling can track a user’s progression over time. Topics discussed might shift from fundamental concepts to more advanced ones – indicating cognitive growth.</li><li>Keyword analysis can surface unexpected themes – e.g., users frequently citing “exam anxiety” might signal an affective need that your product should address.</li></ul></td></tr><tr><td><strong>Linguistic Inquiry and Word Count (LIWC)</strong></td><td><ul><li><a href="https://www.liwc.app/">LIWC</a> is a dictionary-based text analysis tool developed by psychologists that <a href="https://www.researchgate.net/publication/383061194_GPT_is_an_effective_tool_for_multilingual_psychological_text_analysis">maps words to psychological categories</a> (like emotion, social words, cognitive processes). You can analyze user text to quantify the percentage of words indicating analytical thinking or emotional dysregulation.</li><li><a href="https://journals.sagepub.com/doi/abs/10.1177/0261927x09351676">Decades of research</a> have shown that linguistic indicators correlate with psychological states. For instance, an increase in first-person plural pronouns (“we, us”) might indicate users feel more socially connected, whereas a drop in words like “never, not” might indicate reduced negativity.</li></ul></td></tr><tr><td><strong>LLM-Based Text Analysis</strong>:<br>Using LLMs to analyze text</td><td><p>Studies have shown that we can <a href="https://arxiv.org/abs/2309.10771">leverage large language models (like GPT)</a> to code or score text in nuanced, human-like ways: for instance, see <a href="https://openai.com/index/scaling-social-science-research/">OpenAI's Gabriel</a> and <a href="https://developers.googleblog.com/introducing-langextract-a-gemini-powered-information-extraction-library/">Google's LangExtract</a>.</p><ul><li>GPT-4 can accurately detect psychological constructs (e.g., sentiment, loneliness) in text with reliability often surpassing traditional dictionary methods. This appears to work across multiple languages. Example: prompt the model with “On a scale of 1–5, how much self-confidence does this message show?”</li><li>LLM text analysis can be used to profile user text generation over time. For example, we can prompt a model to score a chatbot conversation as {sentiment trend: positive; confidence expressed: moderate and rising; themes: independence 3/5, belongingness 4/5}. Quantified scores allow us to track subtle changes, at scale. Be careful when using an LLM to evaluate LLM-user interactions; you want to avoid bias and misinterpretation.</li></ul></td></tr></tbody></table>
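
The dictionary-based approach in the table can be sketched in a few lines. The example below is not LIWC itself: it uses a tiny, made-up word list purely for illustration, computes the share of words per session that fall in that category, and fits a simple trend line across sessions. A real analysis would rely on a validated dictionary and far more text.

```python
import re

# Hypothetical mini-lexicon; a real analysis would use a validated dictionary.
NEGATIVE_WORDS = {"never", "not", "can't", "confused", "stuck", "worried"}


def category_share(text, lexicon):
    """Fraction of tokens in the text that belong to the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in lexicon for token in tokens) / len(tokens)


def trend_slope(values):
    """Ordinary least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = sum(values) / n
    sxy = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    sxx = sum((i - x_bar) ** 2 for i in range(n))
    return sxy / sxx


# Hypothetical user messages from three consecutive sessions.
sessions = [
    "I am stuck and worried I will never get this",
    "I am not sure but I can try this step",
    "I solved it and want a harder problem",
]
shares = [category_share(text, NEGATIVE_WORDS) for text in sessions]
slope = trend_slope(shares)
print(shares, slope)
```

A negative slope here means the share of negative-category words is falling across sessions, which is the kind of signal the table describes.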

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-3-user-evaluation%2Fhow-is-level-3-evaluation-performed%2Fdescriptive-analysis>" %}

</details>


# Define guardrail metrics and measure potential harm

As you reach Level 3 evaluations, you are not just measuring if your product is working; you want to measure if it is causing harm. While Level 2 metrics track usage, Level 3 is your opportunity to use direct interviews and surveys to track unintended consequences.

A central concern is that AI models and agents can empower or disempower users. In evaluating social impact, user agency can be a critical guardrail for AI products. There is a risk that "helpful" AI agents might actually undermine development, creating dependency for users or communities rather than building capabilities. Therefore, you will want to track whether your tool is improving or reducing users’ agency.

For instance, we recommend measuring agency in two ways:

<table data-card-size="large" data-view="cards"><thead><tr><th></th><th></th><th data-hidden data-card-cover data-type="image">Cover image</th></tr></thead><tbody><tr><td><strong>Subjective Agency (Internal-Facing)</strong></td><td>This captures users’ beliefs and perceptions of their own capabilities (e.g. <a href="https://albertbandura.com/albert-bandura-agency.html">Albert Bandura’s Social Cognitive Theory</a>), measured through qualitative or survey methods. Ask users about their sense of self-efficacy: do they believe they can solve the problem on their own now?</td><td><a href="https://images.unsplash.com/photo-1515463626042-123ab67dcaa7?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxSZWZsZWN0aW9ufGVufDB8fHx8MTc3MzIzNzE5OXww&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1515463626042-123ab67dcaa7?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxSZWZsZWN0aW9ufGVufDB8fHx8MTc3MzIzNzE5OXww&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr><tr><td><strong>Objective Agency (External-Facing)</strong></td><td>These are the capabilities required to plan, navigate, execute, and reflect on personal goals (e.g. <a href="https://www.cambridge.org/core/books/abs/amartya-sen/capability-and-agency/65BD3415B565147A740E03F42E41D047">Amartya Sen’s Capability Approach</a>). Does the user have the skills to act? Test their ability to plan and execute goals without the AI's help. Are they learning the underlying logic, or just copy-pasting answers?</td><td><a href="https://images.unsplash.com/photo-1603804449564-2ad32f24d17e?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHwyfHxwbGFufGVufDB8fHx8MTc3MzE1MTc1NXww&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1603804449564-2ad32f24d17e?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHwyfHxwbGFufGVufDB8fHx8MTc3MzE1MTc1NXww&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr></tbody></table>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-3-user-evaluation%2Fhow-is-level-3-evaluation-performed%2Fdefining-guardrail-metrics-measuring-potential-harm>" %}

</details>


# Consider conducting experiments to improve the selected key metrics and running process evaluations

After identifying intermediate outcomes that serve as early indicators of the development outcome of interest, the next step is to run experiments to assess how product changes influence Level 3 outcomes without causing harm. The evaluation methods remain the same as in Level 2, but are applied to a different set of outcomes (e.g., A/B testing: Feature A vs. Feature B; multi-armed bandits: performance-based adaptive allocation; holdout testing: e.g., AI vs. non-AI).
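
As one possible way to analyze an A/B test on a Level 3 outcome, the sketch below compares mean self-reported confidence between two feature variants using a permutation test. The scores, arm names, and choice of test are assumptions for illustration; other analyses may suit your design better.

```python
import random


def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means (b minus a)."""
    rng = random.Random(seed)
    observed = sum(b) / len(b) - sum(a) / len(a)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Reshuffle arm labels and recompute the difference in means.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[: len(a)], pooled[len(a):]
        diff = sum(perm_b) / len(perm_b) - sum(perm_a) / len(perm_a)
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical 1-5 confidence ratings collected after using each variant.
feature_a = [3, 4, 3, 2, 4, 3, 3, 2]
feature_b = [4, 5, 4, 4, 5, 3, 5, 4]
effect, p_value = permutation_test(feature_a, feature_b)
print(effect, p_value)
```

The same function can be applied to a guardrail metric to check that a variant improving one outcome is not worsening another.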

We also recommend running process evaluations to understand why and when Level 3 metrics are not changing.

A process evaluation (PE; [see primer here](/level-linkages/linkage-across-levels/process-evaluations)) linked to a Level 3 evaluation can surface what is or isn’t enabling cognitive, affective, or behavioral outcomes, informing what to test and where to focus program improvements.

At this level, it is important to zoom out from the user to the broader program theory of change (ToC; Figure 3 in the Building Blocks section). Behavior change is shaped by the program delivery system and the social context outside the program: the workflows and the organizational and social conditions that determine whether product use translates into changed thoughts, feelings, and behavior.

| ToC Domain / Assumption | Example PE questions                                                                                                                                   | Methods                                                                                                                                                                                        |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Interpretation          | Are users receiving and interpreting AI outputs in the way the program intended, and does this differ across subgroups?                                | Semi-structured interviews; cognitive walk-throughs with purposive sample of users; analysis of in-conversation signals (e.g., follow-up questions, expressed confusion), quantitative surveys |
| Opportunity to act      | What contextual factors (social norms, competing demands, or structural constraints) are enabling or blocking users from acting on AI recommendations? | Focus groups and ethnographic observation; barrier/enabler mapping using implementation science frameworks                                                                                     |
| Externalities           | Has the tool shifted the roles or behaviors of others in the program ecosystem — intentionally or not? Are these shifts undermining outcomes?          | Key informant interviews and/or surveys with supervisors and non-user staff; focus groups; administrative data on workload, staffing, or service utilization                                   |

A PE at Level 3 typically requires richer qualitative inquiry than at Level 2, since barriers to behavior change are often rooted in context, relationships, and norms. When possible, run the PE alongside or before the Level 3 evaluation, so findings can directly inform program refinements.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-3-user-evaluation%2Fhow-is-level-3-evaluation-performed%2Fwhy-arent-thoughts-feelings-and-behavior-changing>" %}

</details>


# Why Aren’t Thoughts, Feelings, and Behavior Changing?

A process evaluation (PE; [see primer here](/level-linkages/linkage-across-levels/process-evaluations)) linked to a Level 3 evaluation can surface what is or isn’t enabling cognitive, affective, or behavioral outcomes, informing what to test and where to focus program improvements.

At this level, it is important to zoom out from the user to the broader program theory of change (ToC; Figure 3 in the Building Blocks section). Behavior change is shaped by the program delivery system and the social context outside the program: the workflows and the organizational and social conditions that determine whether product use translates into changed thoughts, feelings, and behavior.

| ToC Domain / Assumption | Example PE questions                                                                                                                                   | Text                                                                                                                                                                                           |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Interpretation          | Are users receiving and interpreting AI outputs in the way the program intended, and does this differ across subgroups?                                | Semi-structured interviews; cognitive walk-throughs with purposive sample of users; analysis of in-conversation signals (e.g., follow-up questions, expressed confusion), quantitative surveys |
| Opportunity to act      | What contextual factors (social norms, competing demands, or structural constraints) are enabling or blocking users from acting on AI recommendations? | Focus groups and ethnographic observation; barrier/enabler mapping using implementation science frameworks                                                                                     |
| Externalities           | Has the tool shifted the roles or behaviors of others in the program ecosystem — intentionally or not? Are these shifts undermining outcomes?          | Key informant interviews and/or surveys with supervisors and non-user staff; focus groups; administrative data on workload, staffing, or service utilization                                   |

A PE at Level 3 typically requires richer qualitative inquiry than at Level 2, since barriers to behavior change are often rooted in context, relationships, and norms. When possible, run the PE alongside or before the Level 3 evaluation, so findings can directly inform program refinements.

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-3-user-evaluation%2Fhow-is-level-3-evaluation-performed%2Fuser-privacy-and-security>" %}

</details>


# Overview

Do users with access to the product improve development outcomes?

Impact evaluation provides strong evidence for understanding causal social impact. While Level 3 measures shifts in thoughts and feelings, Level 4 measures the ultimate results: improved crop yields, higher test scores, or better health outcomes. By using a counterfactual—comparing those who use your product to a similar group that does not—you can isolate the true impact of your AI intervention from the "noise" of a messy world.

***

#### Key Motivation

Policy makers, donors, and governments require credible evidence before they invest in scaling a solution. Level 4 evaluation is critical because:

* **Causal Attribution:** It provides credible evidence that improvements were caused by your product, not by coincidence or external trends.
* **Informing Scale:** It provides the cost-effectiveness data needed to justify large-scale budget allocations.
* **Identifying Unintended Effects:** Rigorous trials can surface hidden negative consequences or surprising positive spillovers that simpler metrics miss.

<a href="/pages/nDEp5z31imLnvAXYixVk" class="button primary">Read more -></a>

***

#### Core Concept: The Counterfactual

To know if your AI tool works, you must estimate what would have happened to the same people *without* it. We do this by creating a comparison group.

| Method                        | How it Works                                                           | Best Used When...                                      |
| ----------------------------- | ---------------------------------------------------------------------- | ------------------------------------------------------ |
| **RCT**                       | Randomly assign users to "Treatment" or "Control."                     | You have a large sample and high control over rollout. |
| **Difference-in-Differences** | Compare changes over time between users and a comparison group, assuming both would have followed "parallel trends" without the product. | Randomization is not feasible or ethical.              |
| **Regression Discontinuity**  | Compare people just above/below a specific cutoff (e.g., test scores). | Resources are allocated based on a strict threshold.   |

<a href="/pages/tVhuFj0GRCyKNwFJGA1l" class="button primary">Read more -></a>
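
To show the arithmetic behind one of the methods in the table, here is a minimal difference-in-differences sketch. The scores are invented, and the estimate is only credible if the two groups would have followed parallel trends without the product, which this code does not check.

```python
from statistics import mean

# Hypothetical outcome scores (e.g., test scores) by group and period.
treated_before = [50, 54, 52]
treated_after = [60, 63, 63]
comparison_before = [48, 50, 52]
comparison_after = [53, 54, 58]


def difference_in_differences(tb, ta, cb, ca):
    """Change in the treated group minus change in the comparison group."""
    treated_change = mean(ta) - mean(tb)
    comparison_change = mean(ca) - mean(cb)
    return treated_change - comparison_change


did_estimate = difference_in_differences(
    treated_before, treated_after, comparison_before, comparison_after
)
print(did_estimate)
```

Here the treated group improves by 10 points and the comparison group by 5, so the estimated effect is the 5-point gap between those two changes.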

***

#### How to Evaluate

Level 4 is a high-investment undertaking. It should only be performed when Levels 1–3 are strong and your product is mature.

1. **Select the Right Counterfactual:** Decide what you are comparing against. Is it "Business as Usual" (no tech), a "Non-AI digital tool," or "Human-delivered services"?
2. **Manage Product Dynamism:** AI products change fast. Avoid biasing your study by tagging versions and, if possible, maintaining a holdout group on a frozen baseline version.
3. **Measure True Capabilities:** Use objective, industry-standard assessments. Ensure students aren't just "copy-pasting" AI answers; test them when they *don't* have access to the tool.
4. **Account for Spillovers:** GenAI is "leaky"—users share advice with neighbors. Use Cluster Randomization (by school or village) to prevent the control group from accidentally being "treated."
5. **Monitor Attrition:** Digital tools often have high drop-off. Use Level 2 engagement data to monitor who leaves the study and ensure it doesn't skew your final results.
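
The cluster randomization mentioned in step 4 can be sketched as follows: whole clusters (here, schools) are assigned to an arm, and every user inherits the arm of their cluster. The user and school identifiers, the 50/50 split, and the fixed seed are assumptions for illustration.

```python
import random


def assign_clusters(cluster_ids, seed=42):
    """Randomly split clusters into two arms; half (rounded down) go to treatment."""
    rng = random.Random(seed)
    clusters = sorted(set(cluster_ids))
    rng.shuffle(clusters)
    half = len(clusters) // 2
    return {
        cluster: ("treatment" if i < half else "control")
        for i, cluster in enumerate(clusters)
    }


# Hypothetical (user, school) pairs.
users = [
    ("u1", "school_a"), ("u2", "school_a"), ("u3", "school_b"),
    ("u4", "school_c"), ("u5", "school_d"), ("u6", "school_d"),
]
arms = assign_clusters([school for _, school in users])

# Every user inherits the arm of their school, so classmates share an arm.
user_arm = {user: arms[school] for user, school in users}
print(user_arm)
```

Fixing the seed and recording the assignment makes the allocation reproducible and auditable by independent researchers.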

**When to Start?**

Do not rush into an Impact Evaluation. You are ready for Level 4 when:

* ✅ Level 1–3 evidence is consistent.
* ✅ Scale-up is being considered by major partners.
* ✅ You have the technical bandwidth to coordinate with independent researchers.

<a href="/pages/gnarBxNy7gjgKIjcRcSH" class="button primary">Read more -></a>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-4-impact-evaluation%2Foverview>" %}

</details>


# Who is involved in this evaluation?

| Execute 🟢                                                  | Support 🟡                                                                    |
| ----------------------------------------------------------- | ----------------------------------------------------------------------------- |
| Policy Researchers, some Data Scientists, and/or Economists | AI Engineers                                                                  |
| Apply evaluation methods with the proper measurement tools  | Ensure that the product functions as expected throughout the evaluation phase |

## Why is this level of evaluation important?

Interventions in the development sector aim to improve the quality of people’s lives. Impact evaluations (IEs) measure the effects of the intervention on outcomes such as mortality, learning outcomes, and earnings. The main issue these evaluations face is that the world is a messy place: as an intervention is being implemented, many other things are happening that would make a simple before-and-after comparison an insufficient way to judge program effectiveness.

To address this, we consider the counterfactual: what would have happened to the same people in the absence of the intervention. Because we cannot observe both realities at once (the same people with and without the intervention), we estimate the counterfactual using a comparison group that is as similar as possible to the group that received the intervention. This comparison group represents what would have happened without the program. Comparing outcomes across these groups helps us isolate the intervention’s impact.

There are a number of ways to estimate or measure the counterfactual. The most straightforward approach is usually a randomized controlled trial (RCT). In an RCT, participants are randomly assigned to one or more treatment groups that receive an intervention (or variants of it) and a comparison group that does not. Researchers then measure outcomes across groups. Well-designed randomized evaluations yield credible estimates of causal impact that are less prone to bias; that is, they show which changes in participants’ lives can be attributed to the program. Other techniques for constructing the counterfactual include propensity score matching, difference-in-differences, and regression discontinuity designs. These are discussed further below, but in general they require more technical econometric expertise and contextual knowledge to execute well.

<br>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-4-impact-evaluation%2Foverview%2Fwhy-is-this-level-of-evaluation-important>" %}

</details>


# What is the “intervention” being evaluated?

The central reason to do an impact evaluation is to inform policymakers, donors and implementers on whether and how to incorporate an intervention in their plans.

By isolating the intervention from other influences, impact evaluation enables causal attribution of outcome changes. Once effectiveness is established for a specific setting and population, additional evaluations can test whether it works elsewhere or for other groups. Moreover, since impact evaluations isolate causal effects, they are ideal for measuring unintended (as well as intended) impacts of an intervention[^1].

For many funders and public sector partners, IEs are central to decision-making. They seek credible evidence that a product improves lives—beyond engagement metrics or self-reports—before scaling. A well-designed IE signals real-world effectiveness and the likelihood of meaningful social returns (see e.g. [Hauser et al., 2025](https://www.nature.com/articles/d41586-025-02266-7.epdf?sharing_token=jCKO3Tx8dFeQfucqP5VCcNRgN0jAjWel9jnR3ZoTv0PS1htX8Sko7IudKf1MVjrKQ-g3NeuYAsnuJ-Io9wHN3uMBrjSLLnu_wjpJLF2G-unWgOw27UqLqC_yalnt2AFTYmMZAO31agMcWvNwKRpfYsfrMt3fmIKm0iVbftxqAsY%3D); [UK GOV, 2025](https://www.gov.uk/government/publications/the-magenta-book/guidance-on-the-impact-evaluation-of-ai-interventions-html)).

IEs also help funders compare options. Combined with cost data, they enable cost-effectiveness and cost-benefit analysis—critical when governments, donors, and multilaterals allocate scarce resources. In many cases, IE results directly inform decisions to scale, replicate, or exit.
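As a rough illustration of the arithmetic behind cost-effectiveness comparisons, the sketch below divides cost per participant by the estimated effect per participant. The option names, costs, and effect sizes are hypothetical, and the calculation assumes the effects were measured on a comparable outcome scale.

```python
def cost_effectiveness(total_cost, n_participants, effect_per_participant):
    """Return (cost per participant, cost per unit of outcome gained)."""
    if n_participants <= 0:
        raise ValueError("n_participants must be positive")
    if effect_per_participant <= 0:
        raise ValueError("cost per unit of impact is undefined for zero or negative effects")
    cost_per_participant = total_cost / n_participants
    cost_per_unit = cost_per_participant / effect_per_participant
    return cost_per_participant, cost_per_unit

# Hypothetical options: (total cost in USD, participants, effect in standard deviations)
options = {
    "ai_tutor": (50_000, 10_000, 0.10),
    "human_tutoring": (400_000, 10_000, 0.25),
}
for name, (cost, n, effect) in options.items():
    per_person, per_sd = cost_effectiveness(cost, n, effect)
    print(f"{name}: ${per_person:.2f} per participant, ${per_sd:.2f} per SD gained")
```

In this made-up comparison the option with the smaller effect is still cheaper per unit of outcome, which is the kind of trade-off that cost data lets funders see.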

### When is it appropriate to do an IE?

IEs are high-investment undertakings, both financially and operationally, although strategies exist to address both financial and operational constraints. They are most useful when your product is mature enough to test and when the decision stakes are high enough to justify the effort. In general, consider an IE when:

* [x] **Levels 1–3 are strong**: The model performs well, users engage meaningfully, and early evidence suggests improvements in knowledge, attitudes, or behavior.
* [x] **You are preparing to scale**: Funders or policymakers are considering wider adoption, so evidence, including cost-effectiveness or cost-benefit estimates, would help inform the decision. Conversely, scale-up plans may already be in progress and present an opportunity for evidence gathering.
* [x] **You have bandwidth**: Implementing an IE is a lot of work for both the research team and implementer; doing it well takes time and effort.

You do not need to run an IE if your product is still in early design or usage is too inconsistent to expect impacts. In such cases, Level 3 evaluations—focused on user cognition and behavior—are more appropriate. Once you have confidence that the theory of change is working, you can and should revisit an impact evaluation.

### Plan for Evaluability Early

Although IEs are usually run later, credible and cost-effective evaluation requires early design choices. Building in features like holdout groups, staged rollouts, or embedded randomization from the start (also useful for A/B testing) preserves the ability to estimate causal effects without disruptive redesigns. Even if a full IE is premature, these choices create opportunities for credible inference later and reduce evaluation burden. Funders assessing scale readiness should look for signs of early evaluability.
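One possible way to embed randomization from the start is to assign users to a holdout group deterministically from a hash of their ID. The sketch below is an illustration under assumptions: the `salt` value, the 10% holdout share, and the `user-…` ID format are hypothetical choices, not recommendations from this playbook.

```python
import hashlib

def assign_arm(user_id: str, salt: str = "ie-2025", holdout_share: float = 0.10) -> str:
    """Deterministically assign a user to 'holdout' or 'treatment'.

    Hashing the salted user ID gives a stable, effectively random value in [0, 1),
    so the same user always lands in the same arm across sessions and devices.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # uniform value in [0, 1)
    return "holdout" if bucket < holdout_share else "treatment"

# Hypothetical user IDs; the observed share should be roughly the holdout share.
arms = [assign_arm(f"user-{i}") for i in range(10_000)]
share = arms.count("holdout") / len(arms)
print(f"Holdout share: {share:.3f}")
```

Because the assignment depends only on the ID and the salt, it can be reproduced later by an independent evaluator, and changing the salt yields a fresh, independent assignment for a new study.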

### How to do an IE responsibly

Rigorous IEs require expertise. We recommend working with an **independent evaluator**—such as an academic partner, a research or research-and-policy organization (e.g., J-PAL, IPA, or an LMIC-based evaluation firm), or a third-party M\&E firm (e.g., IDinsight, Laterite)—to strengthen technical quality and the perceived independence of the impact evaluation. Being clear on your evaluation goals (as discussed above) will help you choose among evaluator options.

At a minimum, we suggest:

* [x] **Clarifying roles**: who builds the product, who runs the study, who communicates findings
* [x] **Pre-registering the design**: on platforms such as the AEA RCT Registry, EGAP, or RIDIE
* [x] **Sharing results transparently**: Disclose all findings, including null or negative results, and make methods and materials publicly available where feasible to support reproducibility and sector-wide learning.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-4-impact-evaluation%2Foverview%2Fwhat-is-the-intervention-being-evaluated>" %}

</details>

[^1]: Selecting unintended consequences to measure can be informed by process evaluations or level 3 data to minimize cost of data collection.


# Minimum Viable Evaluation

| Level 4 - Impact evaluation MVE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <ul class="contains-task-list"><li><input type="checkbox">Conduct an <a data-footnote-ref href="#user-content-fn-1">impact evaluation</a> with a <a data-footnote-ref href="#user-content-fn-2">counterfactual</a> and a sample size large enough to measure the key outcome(s) of interest, including among sub-populations of interest (e.g. by gender, geography)</li><li><input type="checkbox">Implement strong version control with either a frozen version or a limited number of product versions to be tested</li><li><input type="checkbox">Collect cost data</li></ul> |

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-4-impact-evaluation%2Foverview%2Fminimum-viable-evaluation>" %}

</details>

[^1]: An MVE Impact evaluation can also be done inexpensively. There are a number of resources on how to reduce costs and effort and still do a rigorous impact evaluation.

[^2]: Choose the counterfactual judiciously. Focus on the policy relevant choice. While it might be interesting to see how an intervention delivered by humans compares to an AI delivery, if human delivery would be too expensive to be feasible, focus on a counterfactual where the intervention is not delivered.


# How is Level 4 evaluation performed?

Performing a Level 4 evaluation requires rigorous experimental or quasi-experimental designs to isolate the effect of the AI from other external factors.

#### 1. Choosing Your Methodology

At its core, impact evaluation compares a Treatment Group (those using the AI) to a Control/Comparison Group (those who are not).

| **Method**                | **Best Used When...**                                                                                                  |
| ------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| RCTs                      | You have a large sample and can randomly assign access so that groups are comparable on average.                       |
| Propensity Score Matching | You have a large dataset of users and non-users and need to statistically "match" them based on similar traits.        |
| Difference-in-Differences | Randomization isn't possible, but you can compare trends before and after the intervention between two similar groups. |
| Regression Discontinuity  | The intervention is delivered based on a strict numeric cutoff (e.g., test scores or income level).                    |

<a href="/pages/TU7OBFRR4zkKykpyLagI" class="button primary">Read more -></a>

***

#### 2. High-Level Steps for AI Impact Evaluation

**Step A: Select the Right Counterfactual**

You must define what "the world without the AI" looks like. In AI evaluations, the comparison isn't always "nothing"—it might be a static chatbot, a human teacher, or a traditional paper-based process.

**Step B: Account for "Product Dynamism"**

Unlike a static pill or a physical textbook, AI products change constantly. To maintain scientific rigour:

* **Tag Versions:** Log exactly which model version every user interacts with.
* **Maintain a Hold-out Group:** Keep a small group on the "baseline" version of the AI to see if updates actually improve outcomes.
* **Coordinate with Tech:** Ensure the engineering roadmap doesn't accidentally "break" the evaluation design.

**Step C: Measure True Outcomes, Not Proxies**

Ensure the evaluation measures actual welfare or capability gains.

* **Avoid Gaming:** Don't use tests that users can pass simply by repeating AI-generated answers.
* **Use Validated Tools:** Rely on industry-standard assessments or administrative data (e.g., health records, employment rates).

**Step D: Manage Spillovers and Attrition**

AI tools are easily shared, which creates a risk of "contamination" (the control group getting access to the AI).

* **Cluster Randomization:** Randomize by village or school rather than by individual to prevent sharing.
* **Monitor Drop-outs:** Use Level 2 (Usage) and Level 3 (Behavior) data to see who stops using the tool and why, as high attrition can ruin your statistical power.
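Cluster randomization can be sketched in a few lines: whole clusters are assigned to an arm, and every unit inherits the arm of its cluster. The roster, school names, and seed below are hypothetical.

```python
import random

def cluster_randomize(unit_to_cluster, seed=2025):
    """Assign whole clusters to treatment or control, then map units to their cluster's arm."""
    clusters = sorted(set(unit_to_cluster.values()))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    treated_clusters = set(clusters[: len(clusters) // 2])
    return {
        unit: ("treatment" if cluster in treated_clusters else "control")
        for unit, cluster in unit_to_cluster.items()
    }

# Hypothetical roster: 12 students spread across 4 schools.
roster = {f"student-{i}": f"school-{i % 4}" for i in range(12)}
assignment = cluster_randomize(roster)
for student, arm in sorted(assignment.items()):
    print(student, roster[student], arm)
```

Everyone in the same school ends up in the same arm, so sharing within a school does not mix treatment and control. The cost is statistical power, which depends on the number of clusters more than the number of individuals.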

<a href="/pages/RVNxdLBXgZfwL2MoYuss" class="button primary">Read more -></a>

***

#### 3. Common Pitfalls

* **Underpowered Studies:** Assuming 100% of people will use the AI. In reality, uptake is often low; plan for a larger sample size than you think you need.
* **The "Black Box" Problem:** If the AI evolves mid-study without version tracking, you won't know *which* version of the product caused the impact.
* **Transparency vs. Adaptability:** Use a Pre-Analysis Plan to define how you will handle product changes before the study begins.

<a href="/pages/IDBlz7DbfAPcsHfHFfq8" class="button primary">Read more -></a>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-4-impact-evaluation%2Fhow-is-level-4-evaluation-performed>" %}

</details>


# A Quick Primer on Impact Evaluation Methods

Once it is the right time and resources are in place, you must choose a method. At its core, an impact evaluation compares outcomes between groups that differ only in exposure to the intervention—that is, treatment versus control. There are several ways to achieve or approximate this:

<table data-card-size="large" data-view="cards"><thead><tr><th></th><th></th><th data-hidden data-card-cover data-type="image">Cover image</th></tr></thead><tbody><tr><td><strong>Randomized Control Trials (RCTs)</strong></td><td>RCTs assign a sufficiently large number of units (e.g., individuals, schools, clinics) at random to receive the intervention, while others are excluded (or often assigned to a waitlist). Randomization, along with sufficient sample size, ensures groups are comparable on average, except for whether they receive the intervention. Sometimes, politics, ethics, or other constraints will make it less feasible to randomize, so we can turn to other methods. Other times, conducting a randomized evaluation is easier politically and more ethical; context and resources will determine that.</td><td><a href="https://images.unsplash.com/photo-1588348442528-85c6fa3b0440?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw1fHxleHBlcmltZW50fGVufDB8fHx8MTc3MzI4NDgxOXww&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1588348442528-85c6fa3b0440?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw1fHxleHBlcmltZW50fGVufDB8fHx8MTc3MzI4NDgxOXww&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr><tr><td><strong>Propensity score matching</strong></td><td>This approach requires a large dataset covering both participants and non-participants, with a clear indicator of treatment. It uses statistical techniques to match treated units with similar untreated ones based on observable characteristics. 
Because it relies only on what is observed, robustness declines when unobservable differences are likely to matter.</td><td><a href="https://images.unsplash.com/photo-1581574919402-5b7d733224d6?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxzY29yZXxlbnwwfHx8fDE3NzMyODQ4NDV8MA&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1581574919402-5b7d733224d6?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxzY29yZXxlbnwwfHx8fDE3NzMyODQ4NDV8MA&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr><tr><td><strong>Difference-in-Differences</strong></td><td>This method relies on the assumption that, without the intervention, treated and untreated (comparison) groups would have followed parallel trends in outcomes. Unlike an RCT, it has no random assignment to make that so by design. Impact is estimated by comparing the before-and-after change in the treated group with the change in the comparison group. The key is to understand why the comparison group was not treated and whether that reason predicts a difference between its trend and the trend the treated group would have followed had it not been treated.</td><td><a href="https://images.unsplash.com/photo-1705163630188-bd3f0844113b?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw1fHxEaWZmZXJlbmNlfGVufDB8fHx8MTc3MzI4NDcyOXww&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1705163630188-bd3f0844113b?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw1fHxEaWZmZXJlbmNlfGVufDB8fHx8MTc3MzI4NDcyOXww&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr><tr><td><strong>Regression discontinuity design</strong></td><td>This approach uses a cutoff, comparing people (or other treatment units) just below it to those just above. For example, if students below a threshold receive remedial education, impact is estimated by comparing students near the cutoff on either side. 
Valid implementation requires that the cutoff itself does not directly affect outcomes (e.g., it reflects budget constraints, not pedagogy) and that there are many observations close to the threshold, since differences grow farther from it.</td><td><a href="https://images.unsplash.com/photo-1669027108349-a9bea2bec1d5?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHwzfHxjcmFja3xlbnwwfHx8fDE3NzMyODQ4Nzl8MA&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1669027108349-a9bea2bec1d5?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHwzfHxjcmFja3xlbnwwfHx8fDE3NzMyODQ4Nzl8MA&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr></tbody></table>
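As one illustration of the methods above, the simplest difference-in-differences estimate (two groups, two periods) can be computed as follows. The scores are hypothetical, and the result is only meaningful if the parallel-trends assumption is plausible.

```python
from statistics import mean

def diff_in_diff(treated_before, treated_after, comparison_before, comparison_after):
    """Two-group, two-period difference-in-differences estimate."""
    treated_change = mean(treated_after) - mean(treated_before)
    comparison_change = mean(comparison_after) - mean(comparison_before)
    # Under parallel trends, the comparison group's change is what the
    # treated group would have experienced without the intervention.
    return treated_change - comparison_change

# Hypothetical test scores before and after the intervention.
estimate = diff_in_diff(
    treated_before=[40, 42, 44],
    treated_after=[50, 52, 54],
    comparison_before=[41, 43, 45],
    comparison_after=[45, 47, 49],
)
print(f"Difference-in-differences estimate: {estimate}")
```

In practice this is usually estimated with a regression that also yields standard errors, but the subtraction above is the core idea.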

These are very basic introductions. For more on methods as well as a step-by-step guide to impact evaluation planning – including sampling, power calculations, and analysis – we strongly recommend:

* [Impact Evaluation in Practice](https://openknowledge.worldbank.org/server/api/core/bitstreams/4659ef23-61ff-5df7-9b4e-89fda12b074d/content) (Gertler et al., World Bank)
* [Running Randomized Evaluations](https://press.princeton.edu/books/paperback/9780691159270/running-randomized-evaluations) (Glennerster & Takavarasha)

In the following section, we do not replicate that guidance. Instead, we focus on what is *distinctive* when evaluating AI-based products in the development sector.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-4-impact-evaluation%2Fhow-is-level-4-evaluation-performed%2Fa-quick-primer-on-impact-evaluation-methods>" %}

</details>


# Key design considerations for AI-specific impact evaluations

With the increased evaluations of AI products, distinct challenges for impact evaluation are emerging. Below are some considerations that merit special attention.

{% stepper %}
{% step %}

#### Selecting the right counterfactual

Choosing the counterfactual (what participants would receive without the AI-enabled intervention) is foundational to impact evaluation design. In GenAI evaluations, the range of plausible comparators is often larger, making clear justification essential. The counterfactual should represent a meaningful alternative to your product and credibly reflect what the world would look like without it at scale.

There is no single “correct” counterfactual. In some cases, a pure control (no intervention at all) may be appropriate (more on this in the next point). In others, a more active comparator offers greater policy relevance. Common options include:

* *Business-as-usual* (e.g., no digital support or sporadic human guidance), or an established intervention without GenAI interaction. This is especially relevant when evaluating a potential improvement on an existing service delivery model.
* *Non-AI digital tools* (e.g., static chatbots or curated content), when considering whether generative AI adds value over existing tech products.
* *Human-delivered services*, when the AI tool substitutes for scarce professional labor (e.g., teachers, health workers). In such cases, it will be valuable to measure not only outcomes but also the cost of implementation (for resources on how to measure costs well see [here](https://www.worldbank.org/en/programs/sief-trust-fund/brief/cost-measurement)).

Thoughtful counterfactual choice affects not only estimated effect size, but also the interpretability and generalizability of results. A strong evaluation will explain both why a given comparator was selected and what alternative scenarios it helps illuminate.
{% endstep %}

{% step %}

#### Measuring latent access and contextual factors

The marginal impact of an AI product depends on users’ baseline access to support, including existing AI tools, related technologies, informal use of the product, and competitors. Measuring this baseline is essential for interpreting effects. Where such tools are already widespread, gains may be modest; in low-capacity settings, the same product may yield much larger (or smaller) benefits. Because access can change quickly, it should be tracked throughout the evaluation.

Evaluators should:

1. *Measure existing technology use*, including frequency, type, and purpose of AI or other digital tool usage, whether directly or indirectly.
2. *Measure what users rely on today*, such as informal networks, human advisors, basic technology, or no support at all.
3. *Keep a sharp eye out for leakage* – since the AI-enabled intervention is likely to be easily portable or shareable, it is important to measure how much of the control group has access to the intervention in some form (more on this below).

The substitutes that users turn to when they don’t use, or don’t have access to, your AI product shape its incremental value and can affect the outcomes you are trying to measure. This makes it essential to understand your target population and which segments face different access, resources, or barriers. Anticipate these dynamics and measure impacts across key dimensions of heterogeneity (e.g., age, gender, poverty, or their interactions), ensuring sufficient sample size to do so.
{% endstep %}

{% step %}

#### Managing product dynamism

RCTs enable powerful causal inference, but only under specific assumptions. One of the most important is the Stable Unit Treatment Value Assumption (SUTVA). [A key component of SUTVA is the *no-multiple-versions* condition: all treated units must receive the same version of the intervention.](#user-content-fn-1)[^1]

In practice, this condition is often only imperfectly met (e.g., motivated providers continuously adapt their services). For GenAI and many other digital platforms, it is almost systematically violated: these tools improve iteratively through retraining, interface changes, or content updates, often alongside ongoing experimentation. As a result, participants within the same trial may face different product versions. This can bias estimates if version exposure correlates with unobserved potential outcomes and, even when identification holds, complicate interpretation of the causal estimand, making it difficult to draw policy recommendations.

Freezing the product version during a trial would restore the single-version condition but undermine ecological validity by eliminating the adaptation that defines product interventions. A better approach is to design evaluations that permit evolution while still delivering credible, interpretable causal estimates.

We recommend four practices:

1. **Tag your versions** – Define in advance what counts as a substantively distinct change, including updates to underlying models outside implementers’ control. Tag each release with a unique version label. Calibrate granularity: definitions that are too fine reduce power, while definitions that are too coarse can hide meaningful heterogeneity.
2. **If A/B testing, randomise test participation** – Do not only randomize between versions A and B; also randomize which users enter the A/B test. Pre-specify the procedure so participation is not correlated with unobserved outcomes. Both this and version tagging require close coordination between the evaluation and tech teams.
3. **Maintain a hold-out group on the baseline version** – If sample size allows, keep a subset of treated participants on a frozen baseline version throughout the trial. Comparing them to users on updated versions allows estimation of the incremental effect of product changes. In a more dynamic variation on this, adaptive experiments could be a [useful approach](#user-content-fn-2)[^2].
4. **Pre-specify at a high level** – In the pre-analysis plan, specify how versions are defined, how rollouts occur, and how exposure is measured. Avoid overly detailed commitments that limit flexibility in responding to unforeseen product changes.

These practices can seem daunting. The first step is to focus on the primary purpose of the evaluation: is it to prove out an individual product or intervention, or to generate generalizable insights about human behavior and how this technology affects people? Referring back to the primary objective (and perhaps giving the secondary objective some weight) will help you decide, for example, at what level of granularity to tag your versions (or at least at what level the tags matter).

The good news is that there is data to inform decisions about when to apply these practices. If L1 and L2 evaluations run frequently (or even continuously), they will provide insight into the magnitude of changes in the models and in usage. L3 evaluations can help you understand whether these changes are associated with changes in behaviors and practices (some quick qualitative work can help you gauge how causal these associations are). These data, together with the purpose of the evaluation, will help you judge which changes are significant enough to merit a new version tag or an A/B test.
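The version-tagging practice above can be sketched as a small exposure log. This is an illustration under assumptions: the `Exposure` record, the tag names, and the "dominant version" rule (the version a user spent the most sessions on) are hypothetical design choices, not a prescribed schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class Exposure:
    user_id: str
    version_tag: str  # pre-defined label for a substantively distinct release
    sessions: int

def dominant_version(exposures):
    """Map each user to the version tag they used for the most sessions."""
    totals = defaultdict(lambda: defaultdict(int))
    for e in exposures:
        totals[e.user_id][e.version_tag] += e.sessions
    return {user: max(tags, key=tags.get) for user, tags in totals.items()}

def mean_outcome_by_version(exposures, outcomes):
    """Average outcome among users grouped by their dominant version."""
    groups = defaultdict(list)
    for user, tag in dominant_version(exposures).items():
        if user in outcomes:
            groups[tag].append(outcomes[user])
    return {tag: mean(values) for tag, values in groups.items()}

# Hypothetical exposure log and endline outcomes.
log = [
    Exposure("u1", "v1.0", 5), Exposure("u1", "v1.1", 2),
    Exposure("u2", "v1.1", 4),
    Exposure("u3", "v1.0", 1), Exposure("u3", "v1.1", 3),
]
endline = {"u1": 60, "u2": 70, "u3": 80}
print(mean_outcome_by_version(log, endline))
```

Unless version exposure was itself randomized (for example through a frozen hold-out group), differences across versions computed this way are descriptive rather than causal.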
{% endstep %}

{% step %}

#### Measuring true development outcomes

AI tools often simulate expertise. But does the user *learn*, or just *copy*?

* Invest in using industry-standard **validated assessments** and **administrative data** to credibly measure improvements in capabilities and welfare.
* Avoid measurement tools that can be gamed by simply repeating AI output (e.g., regurgitating chatbot answers). In educational contexts, for example, use measures where performance tests students’ ability when they don’t have access to AI.
  {% endstep %}

{% step %}

#### Embracing Spillovers to Improve Estimates

GenAI tools are often built to scale (easy to access and share), which makes contamination a real risk in impact evaluations. Information spillovers, where users obtain information about the GenAI solution and perhaps also from it, are one aspect. Another is actual use of the GenAI solution by people in the non-treatment group.

Your randomization (or other identification strategy) should reflect how the product is delivered.

* If access is controlled (e.g., via onboarding or closed rollout), individual or cluster assignment to treatment may be appropriate.
* If the product is public or spreads organically, consider a randomized encouragement design that invites or incentivizes only some users. Because these are often underpowered to detect final outcomes, pilot the encouragement to ensure it works.

When contamination risk is high, it may be best to run trials in settings with low existing exposure (e.g., regions or populations where the product is not yet known) and to closely monitor control groups for access to the product or close substitutes. But GenAI access is increasingly widespread, and embracing this situation rather than avoiding it may be the most constructive approach. First, consider the likely magnitude of spillovers. If they may be substantial enough to bias estimates, build some measurement of the spillover into the design. Analysing the spillover and its effect on estimated impacts can then enhance learning, and often points to paths to scale and to maximize impact.

Cluster randomization (e.g., by school or clinic) can further reduce spillovers. In all cases, monitor usage and be prepared to adjust power calculations or analytic strategies if cross-group exposure occurs.
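Cluster randomization has a cost in statistical power, which is commonly summarized by the design effect, 1 + (m − 1) × ICC, where m is the average cluster size and ICC is the intra-cluster correlation. The sketch below applies that standard formula; the cluster size and ICC values are hypothetical.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation from randomizing clusters instead of individuals."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if not 0 <= icc <= 1:
        raise ValueError("icc must be between 0 and 1")
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_total: float, cluster_size: float, icc: float) -> float:
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)

# Hypothetical study: 100 schools of 40 students each, ICC of 0.05.
deff = design_effect(cluster_size=40, icc=0.05)
n_eff = effective_sample_size(n_total=4000, cluster_size=40, icc=0.05)
print(f"Design effect: {deff:.2f}, effective sample size: {n_eff:.0f}")
```

Even a modest ICC can shrink the effective sample substantially, so power calculations for clustered designs should use an ICC estimated from comparable data where possible.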

<br>
{% endstep %}
{% endstepper %}

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-4-impact-evaluation%2Fhow-is-level-4-evaluation-performed%2Fkey-design-considerations-for-ai-specific-impact-evaluations>" %}

</details>

[^1]: For a rigorous treatment of this, see [VanderWeele and Hernán (2013)](https://pmc.ncbi.nlm.nih.gov/articles/PMC4219328/)

[^2]: See <https://www.gsb.stanford.edu/sites/default/files/publication/pdfs/academic-publication-desiging-adaptive-experiments-2021-mar.pdf> for a good introduction to adaptive experiments.


# Common pitfalls to avoid

Impact evaluations are high-leverage, high-effort undertakings. Avoiding a few predictable errors can significantly improve the value and credibility of your results. While many non-AI impact evaluations face these issues, here we try to capture how these risks manifest differently in early impact evaluations of AI products.

#### Being underpowered

Even real impacts can go undetected in underpowered studies. For AI products, low uptake—especially early on—is a key risk. Overly optimistic uptake assumptions can leave treatment groups too small to detect effects. Set realistic expectations by piloting uptake with groups similar to the intended treatment population, involve skeptics in planning, and use recent Level 2 evaluations to inform assumptions.

As discussed earlier, you should track your target population and key sub-groups across all four evaluation stages. At Level 4, it is critical to have sufficient sample size to detect statistically significant, programmatically meaningful effects, including differences across groups. This challenge is not AI-specific but applies to any sub-group analysis; however, AI interventions may see groups participate in different ways and at different rates. Insights from Levels 1–3 should inform Level 4 sample design and outcome measurement. If budget allows, keep samples and outcomes broad enough to detect unintended positive or negative effects not flagged earlier.
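To see how uptake assumptions drive sample size, the sketch below uses the standard normal-approximation formula for a two-arm, individually randomized trial. It assumes that the product only affects people who use it, so the average effect across everyone assigned to treatment is the effect on users multiplied by the uptake rate. The effect size and uptake values are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(effect_on_users, sd, uptake=1.0, alpha=0.05, power=0.80):
    """Participants needed per arm to detect the diluted (intention-to-treat) effect.

    n = 2 * (z_{1-alpha/2} + z_{power})^2 * sd^2 / delta^2,
    where delta = effect_on_users * uptake.
    """
    if not 0 < uptake <= 1:
        raise ValueError("uptake must be in (0, 1]")
    itt_effect = effect_on_users * uptake
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * sd**2 / itt_effect**2)

# Hypothetical effect of 0.2 standard deviations among users.
for uptake in (1.0, 0.5, 0.25):
    print(f"uptake={uptake:.2f}: {n_per_arm(0.2, 1.0, uptake=uptake)} per arm")
```

Halving uptake roughly quadruples the required sample, which is why optimistic uptake assumptions are so costly. Clustered designs and attrition would push the requirement higher still.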

#### Mismanaging transparency

Impact evaluations should build confidence by involving credible, independent investigators, sharing data where appropriate, and pre-specifying key measures and analyses. But transparency should not come at the expense of adaptability. Researchers and implementers need to agree on what they are evaluating, including whether the intervention will remain static and, if not, what data will be available to understand how it changes. Given how quickly AI interventions can change, establish mechanisms for early and ongoing coordination during implementation.

#### Letting product evolution obscure the analysis

If the product may change during the study, pre-specify how changes will be handled analytically. One option is to freeze a version for the trial; if that is not feasible, define and log substantive changes, tag version exposure, and use this metadata to test for improvements or degradations over time. While this creates risk, it is also an opportunity for GenAI evaluations. Unlike analog interventions—where changes often went unobserved—version tracking, and even repeating Levels 1–3 evaluations after major updates, can enable much richer analysis during the impact evaluation period.

#### Underestimating the risks of attrition

Attrition—through disengagement or loss to follow-up—can seriously weaken power and interpretability. In digital interventions, only a small share of sign-ups may engage, and drop-off is easy. Plan for this: track engagement early, power studies accordingly, and use passive data where possible. If attrition is unavoidable, pre-specify how it will be handled and report it transparently. Use Level 2 and 3 data to monitor attrition early, adjust design, and link it to version tracking to understand who drops out and when.
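One simple check that can be pre-specified is whether attrition differs between arms. The sketch below computes attrition rates by arm and a two-proportion z-test using the normal approximation; the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def differential_attrition(n_treat, lost_treat, n_control, lost_control):
    """Attrition rate by arm plus a two-sided z-test for the difference."""
    p_t = lost_treat / n_treat
    p_c = lost_control / n_control
    pooled = (lost_treat + lost_control) / (n_treat + n_control)
    se = sqrt(pooled * (1 - pooled) * (1 / n_treat + 1 / n_control))
    z = (p_t - p_c) / se if se > 0 else 0.0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"treatment": p_t, "control": p_c, "difference": p_t - p_c, "p_value": p_value}

# Hypothetical follow-up: 150 of 1,000 treated and 100 of 1,000 control participants lost.
report = differential_attrition(1000, 150, 1000, 100)
print(report)
```

A meaningful difference in attrition between arms is a warning sign that the remaining samples may no longer be comparable, and it should be reported alongside the main results.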

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-4-impact-evaluation%2Fhow-is-level-4-evaluation-performed%2Fcommon-pitfalls-to-avoid>" %}

</details>


# Process Evaluation: Why Aren’t Outcomes Changing?

Level 4 impact evaluations reveal an intervention’s effect but may not explain the mechanism behind it. A process evaluation (PE; [see primer here](/level-linkages/linkage-across-levels/process-evaluations)) can help explain it. When impact is limited, a PE can identify whether implementation was weak. When impact is strong, it can identify the conditions that made the impact possible, which should be replicated when scaling.

<table><thead><tr><th width="137.19921875">ToC Domain / Assumption</th><th width="391.46484375">Example PE questions</th><th>Methods</th></tr></thead><tbody><tr><td>User-to-beneficiary pathway</td><td>When the user is a frontline worker, is the intended beneficiary (e.g. patient, student, farmer) actually receiving and able to act on the information? What are the barriers and facilitators of a strong user-to-beneficiary pathway?</td><td>Interviews with beneficiaries; observation of frontline worker-beneficiary interactions</td></tr><tr><td>System readiness and sustainability</td><td>What institutional enablers support or constrain the integration and use of AI-enabled products within a broader delivery system? Are there bottlenecks outside the product (in supply chains, health systems, or institutional capacity) that mediate impact?</td><td>Document review; stakeholder interviews; review of administrative and information system data</td></tr></tbody></table>

In a Level 4 evaluation, process data collected across study sites can also be used directly in the analysis to generate richer insights on impact via:

* **Mediation and subgroup analysis**. With sufficient variation in implementation across study units and adequate sample sizes, evaluators can test whether impact concentrates among subgroups defined by implementation quality or contextual factors.
* **Treatment-on-the-Treated (ToT) analysis** with varying implementation intensity. In practice, not all treatment units receive the intervention as designed. Process data capturing implementation fidelity or intensity allow evaluators to move beyond the intention-to-treat (ITT) estimate and conduct a ToT analysis that uses randomization as an instrument for actual exposure or treatment intensity.
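
The ToT idea can be sketched in its simplest form: with binary exposure, the instrumental-variable (Wald) estimate is the intention-to-treat effect divided by the difference in exposure rates between arms. The records and field names below are hypothetical. The sketch omits standard errors and assumes that assignment affects outcomes only through exposure.

```python
from statistics import mean


def itt_and_tot(records):
    """Return (ITT, ToT) from records with 'assigned', 'exposed', 'outcome'.

    assigned: 1 if randomized to treatment, else 0
    exposed:  1 if process data show the unit actually received the intervention
    """
    treated = [r for r in records if r["assigned"] == 1]
    control = [r for r in records if r["assigned"] == 0]
    # Intention-to-treat: difference in mean outcomes by assignment.
    itt = mean(r["outcome"] for r in treated) - mean(r["outcome"] for r in control)
    # First stage: how much assignment shifted actual exposure.
    take_up_gap = mean(r["exposed"] for r in treated) - mean(r["exposed"] for r in control)
    if take_up_gap == 0:
        raise ValueError("assignment did not change exposure; ToT is undefined")
    return itt, itt / take_up_gap


# Hypothetical data: half of the treatment arm was actually exposed.
records = (
    [{"assigned": 1, "exposed": e, "outcome": y}
     for e, y in [(1, 12), (1, 14), (0, 10), (0, 10)]]
    + [{"assigned": 0, "exposed": 0, "outcome": y} for y in [10, 9, 11, 10]]
)
itt, tot = itt_and_tot(records)
print(itt, tot)
```

In this toy data the ITT estimate is 1.5 and the ToT estimate is 3.0, because only half of the assigned units were exposed. A real analysis would typically use two-stage least squares with covariates and clustered standard errors.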

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=level-4-impact-evaluation%2Fhow-is-level-4-evaluation-performed%2Fprocess-evaluation-why-arent-outcomes-changing>" %}

</details>


# Overview

There are several ways to link the four levels as you develop and evaluate an AI solution. These cross-level linkages are essential for tracing how changes in your model system, product, solution, and/or program affect outcomes, whether intended or not. Risks and data protection requirements should also be considered holistically, cutting across the levels. Key practices include:

* Developing Level 1-3 metrics at each stage of the user funnel;
* Defining a single set of identifiers that links data collected at each level (e.g., a user ID, session, and model/product version);
* Ensuring product managers, data scientists, and user researchers cooperate across levels to manage risks and provide continuity and context while iterating on product features.
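
A minimal sketch of the shared-identifier practice, assuming each level's records carry the same three key fields. The field names, metric names, and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical shared key used by every level's data.
KEY_FIELDS = ("user_id", "session_id", "version")

level1_scores = [
    {"user_id": "u1", "session_id": "s1", "version": "v1.2", "judge_score": 0.9},
    {"user_id": "u2", "session_id": "s2", "version": "v1.3", "judge_score": 0.6},
]
level2_events = [
    {"user_id": "u1", "session_id": "s1", "version": "v1.2", "messages_sent": 14},
    {"user_id": "u2", "session_id": "s2", "version": "v1.3", "messages_sent": 3},
]


def link_by_key(*sources):
    """Merge rows from several levels that share the same identifier tuple."""
    merged = defaultdict(dict)
    for source in sources:
        for row in source:
            key = tuple(row[field] for field in KEY_FIELDS)
            merged[key].update(row)
    return dict(merged)


linked = link_by_key(level1_scores, level2_events)
for key, row in linked.items():
    print(key, row["judge_score"], row["messages_sent"])
```

Keeping the version in the key makes it possible to ask whether a change in a Level 1 score coincided with a change in Level 2 behavior for the same release.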

Here are a few other actions you can take to link your evaluations across levels:

1. **Use critical metrics from one evaluation level as guardrails for others** so engagement optimizations don’t undermine “North Star” outcomes. Similarly, use metrics from one stage of the funnel as guardrails for other stages. For example, optimizing a bot for low latency (L1) while targeting student learning (L4) creates trade-offs: added latency may improve chain-of-thought correctness but reduce engagement and learning. If feasible, do not track L1 without L3 and L4 guardrails; North Star metrics propagate trade-offs across L1–L4, requiring deliberate weighting and interpretation.

{% hint style="info" %}
A more sophisticated—but less mature—option is multi-objective optimization, which optimizes an AI solution across multiple goals at once (e.g., cost, latency, safety). These [techniques](https://arxiv.org/pdf/2502.18635) are still new and under development.
{% endhint %}

2. **Identify a product manager to “own” the North Star metric.** They are responsible for shaping the roadmap by balancing engineering and design trade-offs across all levels. This person ensures design choices—such as adding UI friction for specialized users—stay aligned with the overall goal, even if they look sub-optimal in one level’s metrics in isolation.
3. **Conduct routine multi-level risk assessments and failure-mode analyses.** When conducting error analysis, flag aberrant behavior at any level—for example, benchmark drift (Level 1) or user gaming (Level 3)—then assess whether it is detectable in the data produced at that level or other levels. Combine these insights with user research to predict fixes: issues appearing in Level 1 metrics near the top of the funnel often require AI system changes (e.g., knowledge base updates, prompt engineering), while downstream failures may require new product features or broader solution/intervention changes.
4. **User research** should sit alongside each evaluation level to interpret log data. Its depth varies by level: interviews to design golden datasets (L1), workflow observation to develop hypotheses (L2), and cognitive interviewing to inform survey design (L3–L4).
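
The guardrail idea in item 1 can be sketched as a simple release check: a candidate version that improves one metric is rejected if any guardrail metric falls more than a tolerated amount below the baseline. The metric names, thresholds, and values are hypothetical, and the sketch assumes higher is better for every guardrail metric.

```python
def passes_guardrails(candidate, baseline, guardrails):
    """Return (ok, violations).

    guardrails maps a metric name to the maximum tolerated relative drop
    versus baseline (e.g. 0.05 means the metric may fall by at most 5%).
    """
    violations = []
    for metric, max_drop in guardrails.items():
        floor = baseline[metric] * (1 - max_drop)
        if candidate[metric] < floor:
            violations.append(metric)
    return (not violations, violations)


# Hypothetical example: the candidate halves latency (L1) but answers
# become less correct, while retention (L2) is unchanged.
baseline = {"latency_s": 4.0, "answer_correctness": 0.80, "weekly_retention": 0.50}
candidate = {"latency_s": 2.0, "answer_correctness": 0.70, "weekly_retention": 0.50}
guardrails = {"answer_correctness": 0.05, "weekly_retention": 0.05}

ok, violations = passes_guardrails(candidate, baseline, guardrails)
print(ok, violations)
```

Here the latency gain would be blocked because correctness falls below its guardrail. In practice the thresholds are a judgment call that the owner of the North Star metric would set and revisit.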

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=linkages-across-levels%2Foverview>" %}

</details>


# Risk assessment and mitigation

Discovering risks or potential failure modes, and developing and testing control measures, requires integrated work across evaluation levels. Use outcomes at one level to guide control levers or solution updates that influence others. Risk mitigation should support comprehensive, iterative detection and response, with the associated cost and intensity varying by level.

### Example scenario: WhatsApp tutor chatbot

As an example, suppose we identify edtech failure modes after observing aberrant behavior at one level. The solution is a WhatsApp tutoring bot for secondary students, providing math and logic problems to solve independently at home, linked to their school curriculum. How might risks show up at each level, what mitigations would we use, and which control metrics would measure mitigation effectiveness?

### Cross-level risk mitigation

<table data-full-width="true"><thead><tr><th width="119.88671875">Level</th><th>Risk Discovered</th><th>Control Strategies</th><th>Control Metric</th></tr></thead><tbody><tr><td><i class="fa-gear-code">:gear-code:</i> <strong>Level 1</strong></td><td>The problem complexity does not increase with each turn of the WhatsApp dialogue</td><td>Link weekly assessed learning level to problem difficulty; increase the model context window;<br>use multi-shot prompting</td><td>Question complexity (LLM-as-a-judge using a rubric aligned to curriculum standards)</td></tr><tr><td><i class="fa-box-isometric">:box-isometric:</i> <strong>Level 2</strong></td><td>High engagement, but concentrated on easy problems or off-topic conversations</td><td>Default to progressive difficulty;<br>add rewards for completing challenging problems</td><td>“Time spent learning” = session length ÷ # unique problem types solved</td></tr><tr><td><i class="fa-user">:user:</i> <strong>Level 3</strong></td><td>Users become overly dependent on the AI, reducing self-directed problem solving and help-seeking agency</td><td>Introduce delayed hints and scaffolded responses; require users to attempt a solution before seeing AI guidance; prompts that encourage reflection (“What would you try next?”)</td><td>% of problems attempted before requesting help; average number of user-initiated solution steps per problem; self-efficacy score from survey</td></tr><tr><td><i class="fa-chart-column">:chart-column:</i> <strong>Level 4</strong></td><td>Learning plateaus or declines</td><td>—</td><td># correct on standardized test; % of students exceeding threshold score</td></tr></tbody></table>
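
As a rough sketch, two of the control metrics in the table could be computed from session logs as follows. The function names, log fields, and example values are hypothetical; the formulas follow the table's definitions.

```python
def time_per_problem_type(session_minutes, problem_types_solved):
    """Level 2 control metric: session length / number of unique problem types solved."""
    unique_types = set(problem_types_solved)
    if not unique_types:
        return None  # undefined when nothing was solved
    return session_minutes / len(unique_types)


def share_attempted_before_help(problems):
    """Level 3 control metric: share of problems attempted before requesting help.

    problems: list of dicts with a boolean 'attempted_first' field.
    """
    if not problems:
        return None
    attempted = sum(1 for p in problems if p["attempted_first"])
    return attempted / len(problems)


# Hypothetical session: 30 minutes, three problems solved across two types.
print(time_per_problem_type(30, ["fractions", "fractions", "logic"]))

# Hypothetical log: the student tried three of four problems before asking for a hint.
log = [
    {"problem_id": 1, "attempted_first": True},
    {"problem_id": 2, "attempted_first": True},
    {"problem_id": 3, "attempted_first": False},
    {"problem_id": 4, "attempted_first": True},
]
print(share_attempted_before_help(log))
```

Returning `None` for empty sessions keeps undefined values out of averages instead of silently counting them as zero.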

As in red teaming, you can define different risk classes to investigate (e.g., safety, privacy, security). User safety and mental health are critical concerns, and risks to them can be mitigated through activities at each level:

<table data-full-width="true"><thead><tr><th width="120.4453125">Level</th><th width="268.84765625">Approach</th><th>Mitigation</th></tr></thead><tbody><tr><td><i class="fa-gear-code">:gear-code:</i> <strong>Level 1</strong></td><td>Red-team GenAI models</td><td>Detect/classify harmful outputs;<br>align models via pre-/post-processing</td></tr><tr><td><i class="fa-gear-code">:gear-code:</i> <strong>Level 1</strong></td><td>Inspect model logs</td><td>Update knowledge base;<br>apply pre-/post-processing (e.g., content filters)</td></tr><tr><td><i class="fa-box-isometric">:box-isometric:</i> <strong>Level 2</strong></td><td>Observe product use</td><td>Adjust UI/UX to reduce friction or harm</td></tr><tr><td><i class="fa-box-isometric">:box-isometric:</i> <strong>Level 2</strong></td><td>Analyze trace data</td><td>Add nudges/notifications;<br>build affordances for different user segments</td></tr><tr><td><i class="fa-user">:user:</i> <strong>Level 3</strong></td><td>Collect qualitative data (interviews, focus groups)</td><td>Surface risks, cultural fit, and harms;<br>invite community input on mitigations</td></tr><tr><td><i class="fa-user">:user:</i> <strong>Level 3</strong></td><td>Identify and analyze metrics embedded in conversation text</td><td>Trigger risk-reduction interventions and referrals</td></tr><tr><td><i class="fa-chart-column">:chart-column:</i> <strong>Level 4</strong></td><td>Run impact evaluations</td><td>Conduct qualitative research to explore unintended consequences</td></tr></tbody></table>

As you mitigate risk, weigh the financial and moral costs of failures across evaluation levels. A Level 1 error may be minor (extra developer time), while a Level 3 failure (e.g., loss of user trust) may require intensive in-person outreach and far higher cost. Use a routine workflow: start with risk discovery (aberrant metrics, one-off surveys, user interviews), then translate findings into new routine metrics. Three questions guide the investigation:

> Why is the behavior occurring?

> How could it have been discovered earlier?

> What can be changed to align with the theory of change?

If product development reveals an incompatible insight, then you may need to modify the theory of change for it to maintain its guiding function.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=linkages-across-levels%2Frisk-assessment-and-mitigation>" %}

</details>


# Data protection

### Why data protection is important in GenAI

Generative AI applications with potential for development impact typically deal with sensitive topics. Livelihoods, vulnerabilities, health-related questions, children's data in education and beyond: the list is long and substantial. Where users reveal details of their situation in these fields, registration data and chat logs record and process sensitive personal data. As responsible stewards, the providers of GenAI systems need to safeguard and protect that data carefully. But instead of being a burden, responsible data handling is an opportunity to build the trust necessary for open and honest user evaluation.

### Requirements depend on local context

Compliance with local legal requirements is the foundation for processing personal information. These requirements differ substantially, so it is impossible to list needs that are relevant in every setting. Instead, this section gives a brief overview of basic considerations for professional practice in handling sensitive data. It is neither legal advice nor a complete listing of advisable practices, but simply a starting point for assessing what responsible data protection means in a particular context. Since human subjects research is integral to evaluating generative AI applications, the need to involve institutional review boards and similar bodies should also be considered where appropriate.

### Minimum Data Protection Practices

In addition to locally relevant data protection requirements – which often include standard principles such as data minimization, purpose and storage limitations – foundational steps towards responsible and trustworthy handling of sensitive personal information should include:

| Practice                                     | Description                                                                                                                                                                                                                                                                                                                                                                      |
| -------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Data protection impact assessment (DPIA)** | A structured review of privacy risks should be conducted before deployment, and also whenever the system changes substantially. A DPIA includes documentation of data flows, the legal basis or other authorization for processing, potential harms such as re-identification or model memorisation, and the controls that mitigate them.                                        |
| **User rights**                              | Where required by regulations and perhaps also as a matter of transparent and fair practice, users should have access to the data held on them, the right to correct mistakes, and the option of ending their participation in consent-based activities (including data deletion). User-friendly processes that implement these rights need to be integrated into system design. |
| **ICT security standards**                   | Although safe practices are discussed at various points of the Playbook, a general security posture that secures data and systems against internal and external threats is an essential part of data protection. Established frameworks such as the ISO 27k family or the NIST Cybersecurity Framework set out the core requirements.                                            |
| **Legal advice**                             | Jurisdiction-specific legal counsel should be sought before data collection even begins. The requirements vary strongly across regimes and cross-border transfers to upstream LLM providers introduce a second layer of complexity. The fast evolution of data protection and AI regulation suggests that compliance may need to be treated as ongoing rather than one-off.      |

### Consent is not a tick-box exercise

A cautionary note is due for situations where consent to personal data processing – including items such as chat logs and registration information – is taken to be the legal basis. Development-focused AI applications often aim to support people in considerable need who may not have the ability to withhold or withdraw consent, or who may not be in a position to assess the processing fully. People who are economically vulnerable or unable to read, children, and older adults are examples of such groups. In such situations, the AI provider should recognise that it is their duty to mitigate risks and take protective measures, forgoing performative consent in favour of user-centric intervention design.

### Cross-border transfer and third-party access

The concentration of GenAI model providers and processing infrastructure in a few countries often implies the transfer of sensitive data across borders. Apart from verifying that this is permissible under local regulations, providers also have a duty to verify that the data will be processed in a compliant manner abroad. A similar need arises where third parties access, store and process personal data: providers of GenAI systems must verify that those third parties uphold custodial standards equal to or higher than their own organisation's. Although verification may be burdensome, sensitive user data is as worthy of protection abroad and in the hands of others as it is with the direct system provider.

### Selected data protection considerations by evaluation level

Data protection considerations reveal themselves in different ways across the Playbook's evaluation levels, including but not limited to:

| Evaluation level                        | Key considerations                                                                                                                                                                                                                                                                                                                                                                  |
| --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Level 1 – Model (System) Evaluation** | Training data may itself be sensitive and require filtering or scrubbing to avoid leakage risks. Bias audits reliant on access to protected characteristics such as ethnic background pose additional governance challenges.                                                                                                                                                        |
| **Level 2 – Product evaluation**        | Usage tracking data can be sensitive, and the same is true for session-level analytics and potential metadata, such as location.                                                                                                                                                                                                                                                    |
| **Level 3 – User evaluation**           | Chat logs reveal thoughts, misconceptions, personal circumstances, representing the most sensitive interaction data. PII scrubbers for LLM inputs/outputs can partly mitigate resulting risks, but not eliminate them. Responsible data practices enable product optimisation as they can support the trust that is required to evaluate some L3 aspects adequately and accurately. |
| **Level 4 – Impact evaluation**         | Administrative data linkages, such as health records, exam scores, and social registries, represent data risks, including re-identification risk for de-identified or pseudonymized data. Primary data collection for evaluation may be held separately, but needs to be treated as carefully.                                                                                      |
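
To illustrate why PII scrubbers for chat logs only partly mitigate risk, here is a minimal regex-based sketch. The patterns and placeholder tokens are assumptions made for illustration, not a recommended production filter: they catch common email and phone formats but miss names, addresses, and contextual identifiers.

```python
import re

# Illustrative patterns only; real deployments need locale-aware rules
# and usually a dedicated PII detection tool.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(text: str) -> str:
    """Replace email addresses and phone-like number runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


print(scrub("Call me on +221 77 123 45 67 or write to awa@example.org"))
print(scrub("My name is Awa and I live behind the market"))
```

The second message passes through unchanged even though it contains a name and a location clue, which is the kind of residual risk the table refers to.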


# Process Evaluations

Each level of the 4-level framework answers a distinct question about whether an AI intervention is working; results at each level can also help explain what is or isn't happening at the next. A process evaluation (PE) complements the framework and extends your ability to understand why something isn’t working. Where the framework's Levels 1 to 3 focus on the user journey, a PE widens the lens by examining the full program delivery system, documenting what is happening during program delivery and comparing it to what was intended.

A PE asks:

> Are the right actors doing the right things, at the right time, in response to the AI tool and within the program delivery system?

It does not estimate counterfactual impact. Instead, it investigates whether the intervention is implemented as planned and surfaces insights to inform refinement.

PEs can be conducted before, during, or after Levels 2, 3 and 4 to diagnose bottlenecks, refine delivery, and ensure that subsequent evaluations test the best possible versions of the program. As highlighted in the Process Evaluation boxes across Levels 1-3, they can help explain adoption and retention at Level 2 and interrogate the enablers and constraints of behavior change at Level 3. When impact is mixed or null at Level 4, PEs can help distinguish whether limitations stem from the AI itself or from the activities and conditions surrounding its delivery. By contrast, if impacts are positive, PEs help articulate what a realistic path to impact at scale might require. At Level 4, process data collected across study sites can also enhance the interpretation of impact through mediation analysis, subgroup analysis (e.g., across indicators of implementation context), or the construction of a treatment intensity variable to estimate the causal effect of variation in delivery fidelity.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=linkages-across-levels%2Fprocess-evaluations>" %}

</details>


# Do I need a Process Evaluation?

While PEs can help diagnose bottlenecks and inform refinements, conducting a PE at each of Levels 2-4 is not a requirement to advance your program. However, a PE is especially valuable in the following situations:

* **When the user is not the beneficiary.** Levels 1-3 focus on the user. Tools targeting frontline workers as users (midwives, teachers, extension agents) typically aim to benefit someone else downstream: a pregnant woman, student, farmer, or patient. If the beneficiary does not receive, understand, or act on the information, impact will not follow. Therefore, you may want to construct funnels for each stakeholder in the delivery of your intervention, program, or social service.
* **When the product influences actors outside the program to behave in a way that can impact outcomes.** AI products can shift the work of people who never touch your product, intervention, or program. For example, teachers may shift from lecturing to coaching when students begin using a tutoring app that delivers core instruction, or they may become less motivated to teach a topic if they believe the app has “taken over,” resulting in reduced teacher effort or preparation. If routines, responsibilities, or incentives shift, these changes can be understood, supported, and/or mitigated via a PE.
* **When the product is implemented in relation to other programmatic systems already in place.** Digital products rarely operate alone: they must either fit into existing systems (e.g., data or reporting systems) or transform them. For example, if an AI product flags patients who should follow up with a provider, this information should flow into existing health record systems and be used by the providers who see those patients. If data cannot be synced, matched to the right person, or acted on, outcomes do not improve, even when the tool itself is used correctly. A PE supports investigating these factors.
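
The suggestion in the first situation, building a funnel for each stakeholder, can be sketched as a pair of step-to-step conversion calculations. The step names and counts below are hypothetical.

```python
def funnel_conversion(steps):
    """steps: ordered list of (step_name, count). Returns step-to-step rates."""
    rates = []
    for (prev_name, prev_n), (name, n) in zip(steps, steps[1:]):
        rate = n / prev_n if prev_n else None  # undefined if nobody reached the prior step
        rates.append((f"{prev_name} -> {name}", rate))
    return rates


# Hypothetical funnels for a frontline worker tool and its downstream beneficiaries.
worker_funnel = [("trained", 200), ("used tool", 150), ("advised client", 90)]
beneficiary_funnel = [
    ("received advice", 90),
    ("understood advice", 60),
    ("acted on advice", 30),
]

for label, funnel in [("worker", worker_funnel), ("beneficiary", beneficiary_funnel)]:
    for step, rate in funnel_conversion(funnel):
        print(label, step, rate)
```

Placing the two funnels side by side shows whether the largest drop occurs among users of the product or further downstream, which is where a PE would focus its interviews and observation.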

If a program moves to a Level 4 trial without incorporating a process evaluation, it risks finding null or even negative impacts for an otherwise promising product. Without insight into implementation, it becomes difficult to determine whether the results reflect a failure of the product itself or weaknesses in how it was delivered and integrated into the broader program system.

<br>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=linkages-across-levels%2Fprocess-evaluations%2Fdo-i-need-a-process-evaluation>" %}

</details>


# What does it take to do a process evaluation?

Process evaluations systematically and empirically describe what happens during program implementation, then carefully compare this description to ex-ante expectations in the implementation plan and theory of change. In brief, a PE proceeds along the following lines:

* **Program theory of change:** Construct user funnels for all stakeholders (human actors) that interact with the AI product and influence development outcomes. Explicitly articulate what each actor must do differently for the AI intervention to work. Make sure to surface the assumptions that link each step: what needs to be true for inputs to lead to activities, for activities to generate outputs, and for outputs to translate into outcomes. These assumptions are often where the [chain breaks](#user-content-fn-1)[^1].
* **Adopt an implementation framework:** Established Implementation Science (IS) frameworks such as CFIR ([Consolidated Framework for Implementation Research](https://link.springer.com/article/10.1186/s13012-022-01245-0)) (Damschroder et al., 2009) offer pre-specified domains for systematically identifying the factors that shape whether an intervention takes hold in a real-world setting. These and similar frameworks have been applied extensively to digital interventions (e.g. [Greenhalgh et al. 2017](https://pubmed.ncbi.nlm.nih.gov/29092808/)). PEs of AI-enabled tools can draw on this body of work to systematically assess how characteristics of the broader intervention, the implementing organization (e.g., leadership engagement, infrastructure), and the external environment (e.g., policy, incentives) shape real-world implementation dynamics.
* **Use mixed methods:** Combine routine and administrative data, structured process indicators, document review, qualitative methods (interviews, observations, focus groups), and quantitative methods (representative surveys) to interrogate weak links in the theory of change.
* **Iterate on program design (beyond model or product):** Treat findings as input into redesigning training, supervision, accountability structures, and integration with existing systems.
* **Stage-gate the next level of evaluation:** Make satisfactory implementation improvements an explicit precondition for proceeding to the next level of evaluation.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="<https://tally.so/r/A788l0?originPage=linkages-across-levels%2Fprocess-evaluations%2Fwhat-does-it-take-to-do-a-process-evaluation>" %}

</details>

[^1]: For illustrative examples, see Smith, M., Huang, C.H., Carter, S. et al., "Where AI Interventions Succeed or Fail," IDinsight, 2026. <https://www.idinsight.org/article/where-ai-interventions-succeed-or-fail/>


Models Evaluation	Does the AI system perform as intended? Level 1 →	/spaces/VDHDXE8axdWQfu0OFCHP/pages/DeMcUC7YhehF7wXhEazC	Models & Behaviour
Product Evaluation	Does the overall product engage and retain users? Level 2 →	/spaces/zpcawBg21nKa217FyRsG/pages/BRhAcSDI4fzmQttWpxZl	Implementors & Program Managers
User Evaluation	Does the product change users' thoughts, feelings, knowledge and behaviour towards the development outcome? Level 3 →	/spaces/R1fawv6icuZEAPmz1pnB/pages/wcgHi9eru7seyBhXPjew	User Experience
Impact Evaluation	Do users with access to the product improve development outcomes? Level 4 →	/spaces/DNdX3hzAtddLuS4lBI4e/pages/YnZKseJWPCqdwrTYVLTE	Social Impact

	Build your team	To build a GenAI product for social impact, you need the right team that brings together development sector expertise with skillsets that are newer to the field. This section describes the relevant skillsets. Learn more →	/pages/V91mgmS1QmGOVyIVTszQ
	Build the infrastructure	Before diving into the four levels, teams should establish several key conceptual and technical building blocks that ease evaluations. This section describes what should be developed before diving in. Learn more →	/pages/tSN6S6uJ2o6t9Y6o4LYF
Area of Expertise	Roles in Evaluation	Responsibilities
Engineers (AI, Backend/Data, MLOps)	Lead: Level 1 Support: Level 2, Level 3, Level 4	Orchestrate prompts, knowledge bases and other components of a modern AI system; Create/maintain benchmark datasets and set up automated metrics/human judges/LLM judges to run offline and online tests; Track and improve model performance; Perform error analysis and ensure data quality; Build and fine-tune models if necessary; ensure relevance and safety; log outputs for downstream use. Domain-specific inputs (e.g., educators for tutor bots) are also essential.
Product Managers	Lead: Level 2 Support: Level 1, Level 3	Integrate AI into workflows; define product metrics, maintain shared dashboards; design/implement experiments in collaboration with Domain Experts and User Researchers and track outcomes of A/B tests; manage product versions and releases; align product metrics with user behavior research.
Data Scientists	Support: Level 2, Level 3, Level 4	Analyze data from Level 2, including definition of metrics. Contribute to both routine monitoring and analysis of A/B tests.
User researchers (can include behavioral/psychological scientists)	Lead: Level 3 Support: Level 2, Level 4	Measure user outcomes (cognitive, affective, and behavioral) and run A/B tests on these outcomes; run surveys and interviews; co-design metrics with end users; and integrate qualitative insights from interviews, focus groups, and direct observation with Level 2 product metrics.
Social scientists	Lead: Level 4	Evaluate long-term outcomes (e.g., learning, health, income); define theory of change; run impact evaluations
Domain Experts	Support: Level 1, Level 2, Level 3, Level 4	Help to define rubrics for Level 1, and validate Level 1 metrics. Support definition of Level 2 and Level 4 metrics and their real-world relevance. Contribute to the theory of change.
		Cover image
Static Knowledge	Used alone, they cannot access real-time information (e.g., current weather in a rural village) so are limited to the training data they have received.	https://images.unsplash.com/photo-1584184200374-73d7f6c6a175?crop=entropy&cs=srgb&fm=jpg&ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw3fHxjb25jcmV0ZXxlbnwwfHx8fDE3NzI2NDAwMzB8MA&ixlib=rb-4.1.0&q=85
Limited Context	The model will not have access to personal information or your proprietary documents unless explicitly engineered to do so. As a result, models may lack the context to generate actionable, personalized, or even accurate outputs for a given task.	https://images.unsplash.com/photo-1586769852836-bc069f19e1b6?crop=entropy&cs=srgb&fm=jpg&ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxpbmZvcm1hdGlvbnxlbnwwfHx8fDE3NzI2NDAwNjh8MA&ixlib=rb-4.1.0&q=85
Instruction Following	Models may struggle to adhere to complex instructions or fail to follow constraints consistently, leading to results that do not fully meet expected criteria.	https://images.unsplash.com/photo-1508726096737-5ac7ca26345f?crop=entropy&cs=srgb&fm=jpg&ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxvYmV5fGVufDB8fHx8MTc3MjY0MDA5OHww&ixlib=rb-4.1.0&q=85
Task Mismatch	AI models are not the right “tool” for every task; for example, they may confidently make errors in math calculations which are trivial for a calculator. Understanding where they shine and augmenting them with capabilities they lack is key to using them well.	https://images.unsplash.com/photo-1613905780946-26b73b6f6e11?crop=entropy&cs=srgb&fm=jpg&ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHwyfHx3cm9uZ3xlbnwwfHx8fDE3NzI2NDAxNDB8MA&ixlib=rb-4.1.0&q=85
Component	Workflow Steps
Pre-processing	Check input for malicious or off-topic content (filtering model) Translate query from Pulaar to English (translation model)
Context Preparation	Retrieve relevant agricultural content from the database Retrieve specific information about the farmer from the context window Generate response to processed user input (large language model)
Post-processing	Verify the answer is grounded in the content provided in your knowledge base Translate the response back to Pulaa (translation model)
Dimension	What to Measure	Target
Accuracy/ Usefulness	The quality of the AI’s response and whether it sufficiently addresses the task at hand	“The response must address the user’s specific question instead of giving a generic answer and it must be medically accurate.”
Qualitative / Branding	The "personality" and tone of the AI.	"The response must be professional and never use jargon."
Safety & Sensitivity	Identifying sensitive issues specific to your use case and specify any unacceptable behaviours.	"The AI system must never provide legal advice or comment on [Sensitive Topic X]."
Robustness & Stability	The system's ability to remain consistent when the same question is asked in different ways.	"The core answer should not change if the user uses different phrasing or synonyms."
Linguistic Consistency	For multi-language apps, ensuring performance doesn't drop across languages.	"The Swahili and Sheng questions must receive the same level of detail as the English version."
Service-Level Performance	The "cost of doing business."	"The end-to-end response time must be less than 2 seconds at a cost of <$0.01 per query."
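A target such as the service-level example above is only useful if it can be checked against real traffic. The sketch below shows one way to do that, assuming each query is logged with its end-to-end latency and cost. The `QueryLog` structure and the sample values are hypothetical; the thresholds mirror the example target in the table.

```python
from dataclasses import dataclass

MAX_LATENCY_SECONDS = 2.0  # example target from the table
MAX_COST_USD = 0.01        # example target from the table


@dataclass
class QueryLog:
    """Hypothetical per-query log record."""
    latency_seconds: float
    cost_usd: float


def meets_target(log: QueryLog) -> bool:
    """True when a single query is under both the latency and cost limits."""
    return log.latency_seconds < MAX_LATENCY_SECONDS and log.cost_usd < MAX_COST_USD


def pass_rate(logs: list[QueryLog]) -> float:
    """Share of logged queries that meet the service-level target."""
    if not logs:
        raise ValueError("pass_rate needs at least one logged query")
    return sum(meets_target(log) for log in logs) / len(logs)


# Made-up sample: two of the four queries meet both limits.
sample_logs = [
    QueryLog(latency_seconds=1.2, cost_usd=0.004),
    QueryLog(latency_seconds=2.5, cost_usd=0.004),  # too slow
    QueryLog(latency_seconds=0.8, cost_usd=0.02),   # too expensive
    QueryLog(latency_seconds=1.9, cost_usd=0.009),
]
print(pass_rate(sample_logs))
```

Reporting a pass rate rather than an average keeps slow or expensive outliers visible, since a low average can hide a meaningful share of queries that miss the target.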
Method / Scorer	Example Metrics	Example Use Case	Pros/Cons
Statistical scorers: These are based on the words in the LLM output and don’t take the semantic meaning into account.	Precision / Recall / F1, Mean squared error, BLEU, ROUGE, METEOR, WER	An NGO evaluates a literacy chatbot that generates short reading comprehension questions in Swahili. BLEU and ROUGE are used to compare the chatbot’s questions to a set of human-written reference questions to assess linguistic overlap.	Speed: Accuracy: Cost (lower is better):
Model-based scorers: These are small language models trained to do one specific task.	AlignScore / LIM-RA, BLEURT, BARTScore, COMET	A health information NGO uses BLEURT, a pre-trained model designed to score text quality, to evaluate the responses of an AI assistant that explains vaccination schedules to parents. The model-based scorer assesses how semantically faithful and understandable each generated message is compared to a trusted reference explanation.	Speed: Accuracy: Cost (lower is better):
LLM-based scorers (a.k.a. LLM-as-judge): Since they use LLMs, they are flexible and powerful, but they can also be expensive and slow.	G-Eval, RARR	A digital agriculture platform uses a large language model (LLM) as a judge to evaluate the quality of pest management advice generated by smaller domain models. The LLM judge scores each message for accuracy, clarity, and farmer-friendliness, comparing them to expert agronomist responses.	Speed: Accuracy: Cost (lower is better):
Human evaluation: For tasks requiring nuance and complex reasoning, or for detecting subtle hallucinations, humans are ideal, though not without their own biases.	Human evaluation	A mental health NGO tests a GenAI counseling tool for youth. Human evaluators (e.g., psychologists and peer mentors) manually rate the empathy, appropriateness, and emotional resonance of responses.	Speed: Accuracy: Cost (lower is better):
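To make the statistical scorer row concrete, the sketch below computes word-overlap precision, recall, and F1 between a generated text and a reference. It is a simplified, ROUGE-1-style illustration under stated assumptions (lowercased whitespace tokenization, no stemming), not the official ROUGE or BLEU implementation. The function name and the example sentences are hypothetical.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> dict[str, float]:
    """Word-overlap precision, recall, and F1 between two texts."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand_counts & ref_counts).values())
    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Five of the six words match, so precision, recall, and F1 are all 5/6.
scores = unigram_overlap("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

The scorer only counts matching words, so a correct paraphrase that uses different vocabulary would score poorly. This is the limitation that model-based and LLM-based scorers are meant to address.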
		Cover image
Goal-setting	Product Owners / Domain Experts: Define the qualitative goal. For example, "The AI should be trustworthy".	https://images.unsplash.com/photo-1628440501245-393606514a9e?crop=entropy&cs=srgb&fm=jpg&ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw3fHx0YXJnZXR8ZW58MHx8fHwxNzcyNjQyNDM2fDA&ixlib=rb-4.1.0&q=85
Measurement	Engineers: Map the goal to a measurable proxy. For "trustworthy," you might select a Factual Consistency Score or an AlignScore.	https://images.unsplash.com/photo-1602503497726-dc6cfaab7e17?crop=entropy&cs=srgb&fm=jpg&ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxtZWFzdXJlfGVufDB8fHx8MTc3MjY0MjQ0Nnww&ixlib=rb-4.1.0&q=85
Validation	Product Owners: Review the technical metric to ensure it accurately reflects the organization’s intent (or intended impact).	https://images.unsplash.com/photo-1516382799247-87df95d790b7?crop=entropy&cs=srgb&fm=jpg&ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHwzfHxjaGVja3xlbnwwfHx8fDE3NzI2NDI0NTh8MA&ixlib=rb-4.1.0&q=85