# Introduction

{% hint style="info" %}
This is a living playbook. We’ll keep updating it and collaborating more deeply with specialists to co-create shared evaluation tools, refine methodologies, and support their practical use in real-world settings.
{% endhint %}

The use of generative AI (GenAI) tools in low- and middle-income countries is multiplying – from AI-powered math tutors for children to digital advisory tools for farmers. While some studies have shown that AI-powered applications can improve human and economic development outcomes (e.g., [Henkel et al., 2024](https://arxiv.org/abs/2402.09809)), others warn of harmful effects (e.g., [Bastani et al., 2024](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4895486)). The common thread is that outcomes depend critically on how AI is designed and used.

**Evaluations can help developers build more impactful AI tools, yet there has been little agreement on what evaluation actually means.** Organizations often find themselves caught between two extremes: tech teams prioritize product performance but often overlook impact, while impact evaluators focus on outcomes but may neglect the underlying technology.

**These foci are often pursued in isolation, even though they are inherently complementary.** This playbook offers a unified framework to bridge the gap, laying out how to evaluate GenAI services in the development sector: what evaluation should include, and the standard practices implementers should follow. It is organized around a four-level framework:

<table data-card-size="large" data-view="cards"><thead><tr><th></th><th></th><th></th><th data-hidden data-card-target data-type="content-ref"></th><th data-hidden data-card-cover data-type="image">Cover image</th></tr></thead><tbody><tr><td><i class="fa-gear-code">:gear-code:</i> </td><td><h4><strong>Level 1 - Model evaluation</strong></h4></td><td>Does the AI system perform as intended?</td><td><a href="../level-1-model-evaluation">level-1-model-evaluation</a></td><td></td></tr><tr><td><i class="fa-box-isometric">:box-isometric:</i></td><td><h4><strong>Level 2 - Product evaluation</strong></h4></td><td>Does the overall product engage and retain users?</td><td><a href="../level-2-product-evaluation/overview">overview</a></td><td></td></tr><tr><td><i class="fa-user">:user:</i></td><td><h4><strong>Level 3 - User evaluation</strong></h4></td><td>Does the product change users’ thoughts, feelings, knowledge, and behavior towards the development outcome?</td><td><a href="../level-3-user-evaluation/overview">overview</a></td><td></td></tr><tr><td><i class="fa-chart-column">:chart-column:</i></td><td><h4><strong>Level 4 - Impact evaluation</strong></h4></td><td>Do users with access to the product improve development outcomes?</td><td><a href="../level-4-impact-evaluation/overview">overview</a></td><td></td></tr></tbody></table>

Even though the boundaries can blur in practice, the four levels form a logical progression. Users are unlikely to stay engaged (Level 2) if the GenAI system fails to perform (Level 1), and development outcomes are unlikely to improve (Level 4) if users disengage or their feelings, knowledge, and behaviors are harmed (Level 3).

The central element of this framework is continuous evaluation. Unlike that of earlier rule-based tools, GenAI performance is highly sensitive to the underlying models, training data, prompts, context, configuration, and other parameters. This complexity demands new evaluation methods. Moreover, AI’s core inputs evolve far faster than those of earlier technologies. Amidst this continuously evolving technology, continuous evaluation enables rapid iteration, maintains expected behavior, and improves performance and impact over time.
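To make the idea of continuous evaluation concrete, here is a minimal, purely illustrative sketch in Python: a fixed evaluation set is re-run after every change to the system (new model version, prompt tweak, configuration update), and the new score is compared against the previous baseline. The `generate` function, the exact-match metric, and the tolerance threshold are all hypothetical stand-ins, not part of this playbook's framework.

```python
# Illustrative continuous-evaluation loop (hypothetical names throughout).
# A fixed evaluation set is scored after every change and compared to a baseline.

EVAL_SET = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "What is the capital of Kenya?", "expected": "Nairobi"},
]

def generate(prompt: str) -> str:
    """Stand-in for the GenAI system under test (stubbed for illustration)."""
    canned = {
        "What is 2 + 2?": "4",
        "What is the capital of Kenya?": "Nairobi",
    }
    return canned.get(prompt, "")

def score(response: str, expected: str) -> float:
    """Toy exact-match metric; real evaluations use richer, task-specific metrics."""
    return 1.0 if expected.lower() in response.lower() else 0.0

def run_eval() -> float:
    """Mean score over the fixed evaluation set."""
    scores = [score(generate(case["prompt"]), case["expected"]) for case in EVAL_SET]
    return sum(scores) / len(scores)

def no_regression(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True if the current score has not dropped more than `tolerance` below baseline."""
    return current >= baseline - tolerance

baseline = run_eval()   # recorded before a change to model, prompt, or config
# ... make a change, then re-run the same evaluation set ...
current = run_eval()
if not no_regression(current, baseline):
    print("Performance regressed; investigate before shipping")
```

The key design choice is holding the evaluation set fixed across runs, so that score changes reflect changes to the system rather than changes to the test.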

<figure><img src="https://364364967-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F7QyW332zLXP50hE2nhPE%2Fuploads%2FX81qFLfIUIvvXaEZW0ZZ%2Fimage.png?alt=media&#x26;token=37a0fd09-9644-4432-9775-54538e8d2bc2" alt=""><figcaption></figcaption></figure>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="https://tally.so/r/A788l0?originPage=introduction" %}

</details>
