Overview

Do users with access to the product achieve better development outcomes?

Impact evaluation is often called the "gold standard" of evidence. While Level 3 measures shifts in thoughts and feelings, Level 4 measures the ultimate results: improved crop yields, higher test scores, or better health outcomes. By using a counterfactual (comparing those who use your product to a similar group that does not), you can isolate the impact of your AI intervention from the "noise" of a messy world.


Key Motivation

Policy makers, donors, and governments require credible evidence before they invest in scaling a solution. Level 4 evaluation is critical because:

  • Causal Attribution: It provides credible evidence that improvements were caused by your product, not by coincidence or external trends.

  • Informing Scale: It provides the cost-effectiveness data needed to justify large-scale budget allocations.

  • Identifying Unintended Effects: Rigorous trials can surface hidden negative consequences or surprising positive spillovers that simpler metrics miss.



Core Concept: The Counterfactual

To know if your AI tool works, you must estimate what would have happened to the same people without it. We do this by creating a comparison group.

| Method | How it Works | Best Used When... |
| --- | --- | --- |
| RCT | Randomly assign users to "Treatment" or "Control." | You have a large sample and high control over rollout. |
| Quasi-Experimental | Compare groups that follow "parallel trends" over time. | Randomization is not feasible or ethical. |
| Regression Discontinuity | Compare people just above/below a specific cutoff (e.g., test scores). | Resources are allocated based on a strict threshold. |
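
To make the comparison-group idea concrete, here is a minimal sketch of the arithmetic behind the first two methods: a difference in means for an RCT, and a difference-in-differences estimate as one common quasi-experimental approach that relies on parallel trends. The function names and the test scores are hypothetical and purely illustrative. A real study would also report uncertainty (standard errors, confidence intervals), which this sketch leaves out.

```python
from statistics import fmean


def difference_in_means(treatment, control):
    """RCT estimate: mean outcome of the treatment group minus the control group."""
    return fmean(treatment) - fmean(control)


def difference_in_differences(treat_before, treat_after, ctrl_before, ctrl_after):
    """Quasi-experimental estimate, assuming parallel trends:
    change in the treated group minus change in the comparison group."""
    treated_change = fmean(treat_after) - fmean(treat_before)
    comparison_change = fmean(ctrl_after) - fmean(ctrl_before)
    return treated_change - comparison_change


# Hypothetical test scores (illustrative numbers only, not real data).
rct_effect = difference_in_means(
    treatment=[62, 70, 68, 72],
    control=[60, 64, 61, 63],
)
did_effect = difference_in_differences(
    treat_before=[50, 54], treat_after=[60, 64],
    ctrl_before=[48, 52], ctrl_after=[52, 56],
)
print(f"RCT difference in means: {rct_effect}")
print(f"Difference-in-differences: {did_effect}")
```

In the second example the comparison group also improves over time, so subtracting its change removes the shared trend from the treated group's gain.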



How to Evaluate

Level 4 is a high-investment undertaking. It should only be performed when the evidence from Levels 1–3 is strong and your product is mature.

  1. Select the Right Counterfactual: Decide what you are comparing against. Is it "Business as Usual" (no tech), a "Non-AI digital tool," or "Human-delivered services"?

  2. Manage Product Dynamism: AI products change fast. Avoid biasing your study by tagging versions and, if possible, maintaining a holdout group on a frozen baseline version.

  3. Measure True Capabilities: Use objective, industry-standard assessments. Ensure students aren't just "copy-pasting" AI answers; test them when they don't have access to the tool.

  4. Account for Spillovers: GenAI is "leaky"—users share advice with neighbors. Use Cluster Randomization (by school or village) to prevent the control group from accidentally being "treated."

  5. Monitor Attrition: Digital tools often have high drop-off. Use Level 2 engagement data to monitor who leaves the study and ensure it doesn't skew your final results.
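
Steps 4 and 5 can be sketched in a few lines. The example below assumes a simple two-arm design: it assigns whole clusters (schools or villages) to an arm so that every user in a cluster shares the same assignment, then compares drop-off between arms. The function names, user IDs, and school names are hypothetical. A real trial would typically add stratification and a pre-registered assignment procedure.

```python
import random


def cluster_randomize(unit_to_cluster, seed=0):
    """Assign whole clusters to an arm; each unit inherits its cluster's arm."""
    clusters = sorted(set(unit_to_cluster.values()))
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    rng.shuffle(clusters)
    treated = set(clusters[: len(clusters) // 2])  # half the clusters get the tool
    return {
        unit: "treatment" if cluster in treated else "control"
        for unit, cluster in unit_to_cluster.items()
    }


def attrition_by_arm(assignment, completed):
    """Share of enrolled units in each arm with no endline measurement."""
    rates = {}
    for arm in ("treatment", "control"):
        enrolled = [u for u, a in assignment.items() if a == arm]
        dropped = [u for u in enrolled if u not in completed]
        rates[arm] = len(dropped) / len(enrolled) if enrolled else 0.0
    return rates


# Hypothetical students mapped to hypothetical schools.
students = {
    "s1": "school_a", "s2": "school_a",
    "s3": "school_b", "s4": "school_b",
    "s5": "school_c", "s6": "school_d",
}
assignment = cluster_randomize(students, seed=42)

# Hypothetical set of students who completed the endline assessment.
completed = {"s1", "s2", "s3", "s5"}
print(attrition_by_arm(assignment, completed))
```

A large gap between the two attrition rates is a warning sign that the users who remain in each arm are no longer comparable.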

When to Start?

Do not rush into an Impact Evaluation. You are ready for Level 4 when:

  • ✅ Level 1–3 evidence is consistent.

  • ✅ Scale-up is being considered by major partners.

  • ✅ You have the technical bandwidth to coordinate with independent researchers.


