
# Overview

Impact evaluation provides strong evidence of whether your product causes social impact. While Level 3 measures shifts in thoughts and feelings, Level 4 measures the ultimate results: improved crop yields, higher test scores, or better health outcomes. By using a counterfactual (comparing those who use your product with a similar group that does not), you can isolate the impact of your AI intervention from the "noise" of a messy world.

***

#### Key Motivation

Policy makers, donors, and governments require credible evidence before they invest in scaling a solution. Level 4 evaluation is critical because:

* **Causal Attribution:** It provides credible evidence that improvements were caused by your product, not by coincidence or external trends.
* **Informing Scale:** It provides the cost-effectiveness data needed to justify large-scale budget allocations.
* **Identifying Unintended Effects:** Rigorous trials can surface hidden negative consequences or surprising positive spillovers that simpler metrics miss.

<a href="/pages/nDEp5z31imLnvAXYixVk" class="button primary">Read more -></a>

***

#### Core Concept: The Counterfactual

To know if your AI tool works, you must estimate what would have happened to the same people *without* it. We do this by creating a comparison group.

| Method                        | How it Works                                                                                   | Best Used When...                                      |
| ----------------------------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------------ |
| **RCT**                       | Randomly assign users to "Treatment" or "Control."                                             | You have a large sample and high control over rollout. |
| **Difference-in-Differences** | Compare the change over time for users against a comparison group, assuming "parallel trends." | Randomization is not feasible or ethical.              |
| **Regression Discontinuity**  | Compare people just above/below a specific cutoff (e.g., test scores).                         | Resources are allocated based on a strict threshold.   |
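
To make the counterfactual idea concrete, here is a minimal sketch of a difference-in-differences estimate next to a naive endline comparison. The function names and the test scores are hypothetical, invented only for illustration, and the estimate is only meaningful if the parallel-trends assumption holds. A real study would also report uncertainty (standard errors or confidence intervals), which this sketch omits.

```python
from statistics import mean


def difference_in_means(treatment, control):
    """Naive comparison: average outcome of users minus average outcome of non-users."""
    return mean(treatment) - mean(control)


def difference_in_differences(treat_pre, treat_post, comp_pre, comp_post):
    """Change in the treated group minus change in the comparison group.

    The comparison group's change stands in for what would have happened
    to the treated group without the product (the counterfactual).
    """
    treat_change = mean(treat_post) - mean(treat_pre)
    comp_change = mean(comp_post) - mean(comp_pre)
    return treat_change - comp_change


# Hypothetical test scores, invented for illustration only.
treat_pre = [50.0, 52.0, 48.0, 50.0]   # users, before rollout
treat_post = [60.0, 63.0, 57.0, 60.0]  # users, after rollout
comp_pre = [51.0, 53.0, 52.0, 52.0]    # comparison group, before rollout
comp_post = [56.0, 58.0, 57.0, 57.0]   # comparison group, after rollout

naive_estimate = difference_in_means(treat_post, comp_post)
did_estimate = difference_in_differences(treat_pre, treat_post, comp_pre, comp_post)

print(f"Naive endline gap: {naive_estimate:.1f} points")  # 3.0
print(f"DiD estimate:      {did_estimate:.1f} points")    # 5.0
```

In this made-up data the comparison group started ahead, so the naive endline gap (3 points) understates the gain that the difference-in-differences estimate attributes to the product (5 points).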

<a href="/pages/tVhuFj0GRCyKNwFJGA1l" class="button primary">Read more -></a>

***

#### How to Evaluate

Level 4 is a high-investment undertaking. It should only be performed when the evidence from Levels 1–3 is strong and your product is mature.

1. **Select the Right Counterfactual:** Decide what you are comparing against. Is it "Business as Usual" (no tech), a "Non-AI digital tool," or "Human-delivered services"?
2. **Manage Product Dynamism:** AI products change fast. Reduce bias in your study by tagging each product version and, if possible, maintaining a holdout group on a frozen baseline version.
3. **Measure True Capabilities:** Use objective, industry-standard assessments. Ensure students aren't just "copy-pasting" AI answers; test them when they *don't* have access to the tool.
4. **Account for Spillovers:** GenAI is "leaky": users share advice with neighbors. Use cluster randomization (assigning whole schools or villages, rather than individuals, to a group) to prevent the control group from accidentally being "treated."
5. **Monitor Attrition:** Digital tools often have high drop-off. Use Level 2 engagement data to monitor who leaves the study, and check that drop-off does not differ between groups in a way that skews your final results.
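
Steps 4 and 5 above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a study protocol: `assign_clusters` and `attrition_by_arm` are hypothetical helper names, the school IDs and completion records are invented, and a real trial would typically stratify the randomization and pre-register it with independent researchers.

```python
import random


def assign_clusters(cluster_ids, seed=0):
    """Randomly assign whole clusters (e.g., schools) to an arm.

    Everyone in a cluster shares one arm, which limits spillover from
    treated users to control users in the same school or village.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    clusters = sorted(set(cluster_ids))
    rng.shuffle(clusters)
    half = len(clusters) // 2
    return {
        cluster: ("treatment" if index < half else "control")
        for index, cluster in enumerate(clusters)
    }


def attrition_by_arm(participants, assignment):
    """Share of enrolled participants in each arm who did not complete the endline.

    `participants` is a list of (cluster_id, completed_endline) pairs.
    """
    enrolled = {"treatment": 0, "control": 0}
    dropped = {"treatment": 0, "control": 0}
    for cluster_id, completed_endline in participants:
        arm = assignment[cluster_id]
        enrolled[arm] += 1
        if not completed_endline:
            dropped[arm] += 1
    return {
        arm: (dropped[arm] / enrolled[arm] if enrolled[arm] else 0.0)
        for arm in enrolled
    }


# Hypothetical schools and completion records, invented for illustration.
schools = ["school_a", "school_b", "school_c", "school_d", "school_e", "school_f"]
assignment = assign_clusters(schools, seed=42)

participants = [(school, completed) for school in schools
                for completed in (True, True, True, False)]
rates = attrition_by_arm(participants, assignment)
print(assignment)
print(rates)
```

A large gap between the two attrition rates is a warning sign that the people remaining in each group are no longer comparable, and it is worth raising with your research partners before interpreting results.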

**When to Start?**

Do not rush into an Impact Evaluation. You are ready for Level 4 when:

* ✅ Level 1–3 evidence is consistent.
* ✅ Scale-up is being considered by major partners.
* ✅ You have the technical bandwidth to coordinate with independent researchers.

<a href="/pages/gnarBxNy7gjgKIjcRcSH" class="button primary">Read more -></a>
