
# Frequently Asked Questions

### Who should use this guide?

This guide is meant for people working in software, product, data science, behavioral science, impact evaluation and specific development sectors (e.g. health and education). Roles often map to different levels—engineers on model behavior (Level 1), product managers and data scientists on analytics (Level 2), social scientists and user researchers on user experience and behavior (Level 3), and impact evaluators on social impact (Level 4)—but silos can block progress. Effective GenAI for development requires cross-domain collaboration beyond any single level.

### Does this playbook dictate a specific practice or method for evaluating generative AI applications?

This playbook offers multiple approaches for each evaluation level rather than dictating a single method, but it is prescriptive where it can be. Where options exist, we highlight their pros and cons. Where minimum standards or proven methods exist, we identify them.

### Does this framework imply a linear process from L1 to L4?

The levels imply order, but evaluation is cyclical. Teams may start with model benchmarks (Level 1), move to usability and engagement in deployment (Level 2), and return to Level 1 if usage drops. If engagement holds, evaluation can progress to user thoughts and behaviors (Level 3) and, eventually, long-term development outcomes (Level 4).

### Is this playbook just focused on GenAI evaluations?

Yes. We recognize that predictive AI, agentic AI, and other AI technologies are often associated with, and part of, AI products. This playbook does not currently focus on them, though future iterations may do so.

### How rigorous should organizations be when resources are limited?

Do not try to measure everything. We advocate for Minimum Viable Evaluations (MVEs): start small at each level of evaluation, and avoid building complex automated AI evaluation pipelines on Day 1. Each section highlights what constitutes an MVE.

### Does this playbook help me identify which level of evaluation my organization should pursue for our AI application?

The playbook defines each evaluation level, but there is no single path. Teams may start at Level 1 and move upward, often looping back as products evolve. AI iteration can be fast once a process is in place, while user surveys in Levels 3–4 can be slower. The key challenge is choosing the right level and intensity at each stage—not to “reach” a level, but to decide what is enough evaluation given the context, risks, and questions.

### Are the evaluations described in this playbook all I need to develop a socially impactful product?

This playbook focuses on doing GenAI evaluation well—what to measure, how to measure it, and how to generate evidence with rigor and speed. It is not a full product development guide. While Levels 1–3 inform product decisions, they are necessary but not sufficient. Effective products also rely on process evaluation, UX design, and content strategy, which identify user pain points and shape how products function and feel, often running in parallel to or ahead of evaluation.

### What makes evaluation for development outcomes different from evaluation for commercial use?

Commercial apps often optimize for retention; development apps must optimize for welfare. This means looking beyond engagement to real-world behavior change (Level 3) and life outcomes (Level 4). A long chat may signal success in commercial settings, but in maternal health, a brief exchange that prompts a clinic visit is the real win.

### Do I need to re-evaluate every time the product launches in a new market?

Commercial products evolve and are localized as they enter new markets, supported by well-defined workflows. Evaluations must adapt to context in the same way. When deploying an existing product in a new setting, factors like health system variation, disease epidemiology, or student–teacher ratios will shape the right metrics and benchmarks. For example, in one country, a cough-based AI TB detector may treat chest radiographs as the diagnostic “gold standard,” whereas in another country, serological biomarkers might be used to establish the ground truth. Evaluation is therefore part of product development at every stage, from early prototyping to multi-country launch.
