About this playbook

From math tutors to farmer advisory tools, generative AI (GenAI) is rapidly expanding across low- and middle-income countries. This playbook provides a 4-level framework and recommends practices for evaluating these GenAI tools.

Why we need this playbook

Evaluating GenAI products can mean different things depending on who you ask. Tech teams prioritize performance, often overlooking impact, while impact evaluators focus on outcomes but may neglect the underlying technology. Even within disciplines, the sophistication and quality of evaluations can differ.

This playbook establishes a unified set of expectations and practices for evaluating GenAI products in global development.

Create Shared Practices

Use consistent, credible, and comparable practices to assess what works and drive learning across the industry.

Improve Products and Programs

Identify issues early through continuous evaluation and build better products over time.

Demonstrate Accountability

Show stakeholders measurable progress from model performance to impact.

Who is this playbook for?

Implementers and Program Managers

Improve your products and programs with credible evaluation practices.

Funders and Policy Makers

Make informed investments by assessing an organization’s ability to evaluate and improve its product.

How to use this playbook

The playbook is organized around a 4-level framework that asks the following evaluation questions:

Model Evaluation

Does the AI system perform as intended? Level 1 →

Product Evaluation

Does the overall product engage and retain users? Level 2 →

User Evaluation

Does the product change users' thoughts, feelings, knowledge, and behavior towards the development outcome? Level 3 →

Impact Evaluation

Do development outcomes improve for users with access to the product? Level 4 →

Each level outlines detailed evaluation practices that organizations building AI products can follow.

The four levels form a logical progression. Users are unlikely to stay engaged (Level 2) if the GenAI system fails to perform (Level 1), and development outcomes are unlikely to improve (Level 4) if users disengage (Level 2) or if their thoughts, feelings, knowledge, and behaviors are harmed (Level 3).

The playbook helps implementers conduct continuous evaluation across levels. The results at one level often require revisiting a preceding level. As the underlying technology evolves, this continuous loop enables rapid iteration, maintains expected behavior, and improves performance and impact over time.
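For teams that track evaluations in code, the progression can be sketched as a small data structure. This is an illustrative sketch only, not part of the playbook: the `Level` class, the `first_level_to_revisit` helper, and the idea of reducing each level to a single pass/fail flag are all assumptions made for the example. Real evaluations will rarely reduce to one boolean per level.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Level:
    """One level of the framework (hypothetical representation)."""
    number: int
    name: str


# The four levels, in the order the playbook presents them.
LEVELS = [
    Level(1, "Model Evaluation"),
    Level(2, "Product Evaluation"),
    Level(3, "User Evaluation"),
    Level(4, "Impact Evaluation"),
]


def first_level_to_revisit(passed: Mapping[int, bool]) -> Optional[Level]:
    """Return the lowest level whose latest evaluation did not pass.

    `passed` maps a level number to True or False. Levels missing from
    the mapping are treated as not yet passed (an assumed convention).
    Returns None when every level passed.
    """
    for level in LEVELS:
        if not passed.get(level.number, False):
            return level
    return None


# Example: Levels 2 and 4 both look weak, so start with the earlier one.
status = {1: True, 2: False, 3: True, 4: False}
level = first_level_to_revisit(status)
print(level.name if level else "All levels passed")
```

Checking the lowest failing level first reflects the progression described above: a problem at an earlier level is a likely cause of problems at later ones.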

Setting the Foundation

Build your team

To build a GenAI product for social impact, you need the right team that brings together development sector expertise with skillsets that are newer to the field. This section describes the relevant skillsets. Learn more →

Build the infrastructure

Before diving into the four levels, teams should establish several key conceptual and technical building blocks that make evaluations easier. This section describes what to put in place first. Learn more →

Additional Resources

FAQs | Glossary | Minimal Viable Evaluations | Tools & Templates

Stay involved


This is a living playbook. It will be updated regularly, in deeper collaboration with specialists, to co-create shared evaluation tools, refine methodologies, and support their practical use in real-world settings.
