Overview
Does access to the product improve development outcomes for users?
Impact evaluation is often called the "gold standard" of evidence. While Level 3 measures shifts in thoughts and feelings, Level 4 measures the ultimate results: improved crop yields, higher test scores, or better health outcomes. By using a counterfactual, that is, by comparing those who use your product with a similar group that does not, you can isolate the impact of your AI intervention from the "noise" of a messy world.
Key Motivation
Policy makers, donors, and governments require credible evidence before they invest in scaling a solution. Level 4 evaluation is critical because:
Causal Attribution: It provides credible evidence that improvements were caused by your product, not by coincidence or external trends.
Informing Scale: It provides the cost-effectiveness data needed to justify large-scale budget allocations.
Identifying Unintended Effects: Rigorous trials can surface hidden negative consequences or surprising positive spillovers that simpler metrics miss.
Core Concept: The Counterfactual
To know if your AI tool works, you must estimate what would have happened to the same people without it. We do this by creating a comparison group.
RCT: Randomly assign users to "Treatment" or "Control." Use when you have a large sample and high control over rollout.
Quasi-Experimental: Compare groups that follow "parallel trends" over time. Use when randomization is not feasible or ethical.
Regression Discontinuity: Compare people just above and just below a specific cutoff (e.g., test scores). Use when resources are allocated based on a strict threshold.
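To make the comparison-group idea concrete, here is a minimal Python sketch of two of the approaches above. It is an illustration under stated assumptions, not a prescribed analysis plan: the function names, the 50/50 split, and the fixed seed are all hypothetical choices, and the "parallel trends" comparison is implemented as a simple difference-in-differences on group averages, which is one common way to use that assumption.

```python
import random
import statistics


def assign_treatment(user_ids, seed=42):
    """Randomly split users into two equal-sized arms (hypothetical 50/50 design)."""
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


def difference_in_means(treated_outcomes, control_outcomes):
    """RCT estimate: mean outcome gap between arms, with its standard error."""
    effect = statistics.mean(treated_outcomes) - statistics.mean(control_outcomes)
    standard_error = (
        statistics.variance(treated_outcomes) / len(treated_outcomes)
        + statistics.variance(control_outcomes) / len(control_outcomes)
    ) ** 0.5
    return effect, standard_error


def difference_in_differences(treat_before, treat_after, control_before, control_after):
    """Quasi-experimental estimate: change in the treated group minus change in
    the comparison group. Only valid if both groups would have followed
    parallel trends without the product."""
    return (treat_after - treat_before) - (control_after - control_before)
```

For example, with made-up average test scores of 50 then 62 for product users and 48 then 53 for the comparison group, `difference_in_differences(50, 62, 48, 53)` returns 7: the treated group gained 12 points, of which 5 would be expected from the general trend. A real study would typically use a regression framework with standard errors clustered at the unit of assignment.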
How to Evaluate
Level 4 is a high-investment undertaking. It should only be performed when the evidence from Levels 1–3 is strong and your product is mature.
Select the Right Counterfactual: Decide what you are comparing against. Is it "Business as Usual" (no tech), a "Non-AI digital tool," or "Human-delivered services"?
Manage Product Dynamism: AI products change fast. To avoid biasing your study, tag each product version and, if possible, maintain a holdout group on a frozen baseline version.
Measure True Capabilities: Use objective, industry-standard assessments. Ensure students aren't just "copy-pasting" AI answers; test them when they don't have access to the tool.
Account for Spillovers: GenAI is "leaky": users share advice with neighbors. Use Cluster Randomization (by school or village) to reduce the risk that the control group is accidentally "treated."
Monitor Attrition: Digital tools often have high drop-off. Use Level 2 engagement data to monitor who leaves the study, and check whether drop-off differs between the treatment and control groups, since uneven attrition can skew your final results.
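The spillover and attrition steps above can be sketched in a few lines of Python. This is a hedged illustration only: the function names, the input shapes (a user-to-cluster mapping and a set of users who completed the endline), and the equal split of clusters are assumptions made for the example, not part of any specific study design.

```python
import random
from collections import defaultdict


def cluster_randomize(user_to_cluster, seed=7):
    """Assign whole clusters (e.g. schools or villages) to an arm.

    Every user inherits the arm of their cluster, so neighbors who share
    advice end up in the same arm.
    """
    rng = random.Random(seed)
    clusters = sorted(set(user_to_cluster.values()))  # sort for reproducibility
    rng.shuffle(clusters)
    treated_clusters = set(clusters[: len(clusters) // 2])
    return {
        user: "treatment" if cluster in treated_clusters else "control"
        for user, cluster in user_to_cluster.items()
    }


def attrition_by_arm(assignment, completed_users):
    """Share of enrolled users in each arm who did not complete the study."""
    enrolled = defaultdict(int)
    lost = defaultdict(int)
    for user, arm in assignment.items():
        enrolled[arm] += 1
        if user not in completed_users:
            lost[arm] += 1
    return {arm: lost[arm] / enrolled[arm] for arm in enrolled}
```

A large gap between the two attrition rates is a warning sign that the remaining groups are no longer comparable. Note that with cluster assignment, the analysis should also treat the cluster, not the individual user, as the unit of randomization when computing uncertainty.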
When to Start?
Do not rush into an Impact Evaluation. You are ready for Level 4 when:
✅ Level 1–3 evidence is consistent.
✅ Scale-up is being considered by major partners.
✅ You have the technical bandwidth to coordinate with independent researchers.