# How is Level 4 evaluation performed?

Performing a Level 4 evaluation requires rigorous experimental or quasi-experimental designs to isolate the effect of the AI from external factors.

#### 1. Choosing Your Methodology

At its core, impact evaluation compares a Treatment Group (those using the AI) to a Control/Comparison Group (those who are not).

| **Method**                                     | **Best Used When...**                                                                                                    |
| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| Randomized Controlled Trials (RCTs)            | You have a large sample and can randomly assign access, so the groups are comparable on average at baseline.             |
| Propensity Score Matching (PSM)                | You have a large dataset of users and non-users and need to statistically "match" them on observable traits.             |
| Difference-in-Differences (quasi-experimental) | Randomization isn't possible, but you can compare trends before and after the intervention between two similar groups.   |
| Regression Discontinuity Design (RDD)          | The intervention is delivered based on a strict numeric cutoff (e.g., test scores or income level).                      |
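
As a minimal illustration of the difference-in-differences row, the sketch below fits the standard two-period interaction model in Python. The file name and the `outcome`, `treated`, and `post` columns are assumptions for illustration, not a prescribed pipeline.

```python
# A minimal difference-in-differences sketch (illustrative only).
# Assumes a panel with one row per person-period and columns:
#   outcome - the measured outcome
#   treated - 1 if the person is in the AI group, 0 otherwise
#   post    - 1 for observations after the AI rollout, 0 before
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("panel_data.csv")  # hypothetical input file

# The coefficient on treated:post is the difference-in-differences
# estimate of the AI's effect on the outcome.
model = smf.ols("outcome ~ treated * post", data=df).fit()
print(model.summary())
```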

<a href="how-is-level-4-evaluation-performed/a-quick-primer-on-impact-evaluation-methods" class="button primary">Read more -></a>

***

#### 2. High-Level Steps for AI Impact Evaluation

**Step A: Select the Right Counterfactual**

You must define what "the world without the AI" looks like. In AI evaluations, the comparison isn't always "nothing"—it might be a static chatbot, a human teacher, or a traditional paper-based process.
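
One way to make the counterfactual explicit is to randomize users into named study arms. In the sketch below, the arm names and the `assign_arms` helper are hypothetical, and the comparison arms deliberately include more than just "nothing".

```python
# A minimal sketch of randomizing users into explicit counterfactual arms.
# Arm names are illustrative assumptions; the comparison need not be "nothing".
import random

ARMS = ["ai_tutor", "static_chatbot", "business_as_usual"]

def assign_arms(user_ids, seed=42):
    """Shuffle users and assign them evenly across the study arms."""
    rng = random.Random(seed)
    ids = list(user_ids)
    rng.shuffle(ids)
    return {uid: ARMS[i % len(ARMS)] for i, uid in enumerate(ids)}

assignments = assign_arms(["u01", "u02", "u03", "u04", "u05", "u06"])
print(assignments)
```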

**Step B: Account for "Product Dynamism"**

Unlike a static pill or a physical textbook, AI products change constantly. To maintain scientific rigour:

* **Tag Versions:** Log exactly which model version every user interacts with (a minimal logging sketch follows this list).
* **Maintain a Hold-out Group:** Keep a small group on the "baseline" version of the AI to see if updates actually improve outcomes.
* **Coordinate with Tech:** Ensure the engineering roadmap doesn't accidentally "break" the evaluation design.
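
The sketch below shows one way to tag versions; the `log_interaction` helper, field names, and JSONL file are assumptions, not a prescribed schema.

```python
# A minimal sketch of tagging every interaction with the exact model version
# the user saw. Field names and the JSONL log file are illustrative assumptions.
import json
from datetime import datetime, timezone

def log_interaction(user_id, model_version, prompt, response,
                    path="interaction_log.jsonl"):
    """Append one interaction record, including the model version, to a JSONL log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "model_version": model_version,  # e.g. "tutor-v1.3.2" (hypothetical tag)
        "prompt": prompt,
        "response": response,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_interaction("u01", "tutor-v1.3.2", "What is 7 x 8?", "7 x 8 = 56")
```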

**Step C: Measure True Outcomes, Not Proxies**

Ensure the evaluation measures actual welfare or capability gains.

* **Avoid Gaming:** Don't use tests that users can pass simply by repeating AI-generated answers.
* **Use Validated Tools:** Rely on industry-standard assessments or administrative data (e.g., health records, employment rates).

**Step D: Manage Spillovers and Attrition**

AI tools are easily shared, which creates a risk of "contamination" (the control group gaining access to the AI).

* **Cluster Randomization:** Randomize by village or school rather than by individual to prevent sharing (see the sketch after this list).
* **Monitor Drop-outs:** Use Level 2 (Usage) and Level 3 (Behavior) data to see who stops using the tool and why, as high attrition can ruin your statistical power.
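
A minimal cluster-randomization sketch is shown below; the school identifiers and the 50/50 split are illustrative assumptions.

```python
# A minimal sketch of cluster randomization: assignment happens at the
# school/village level, so members of the same cluster share one arm.
import random

def cluster_randomize(cluster_ids, seed=7):
    """Randomly split clusters into treatment and control arms."""
    rng = random.Random(seed)
    clusters = list(cluster_ids)
    rng.shuffle(clusters)
    half = len(clusters) // 2
    return {c: ("treatment" if i < half else "control")
            for i, c in enumerate(clusters)}

schools = [f"school_{n:02d}" for n in range(1, 11)]  # hypothetical cluster IDs
print(cluster_randomize(schools))
```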

<a href="how-is-level-4-evaluation-performed/key-design-considerations-for-ai-specific-impact-evaluations" class="button primary">Read more -></a>

***

#### 3. Common Pitfalls

* **Underpowered Studies:** Assuming 100% of people will use the AI. In reality, uptake is often low; plan for a larger sample size than you think you need (see the power-calculation sketch below).
* **The "Black Box" Problem:** If the AI evolves mid-study without version tracking, you won't know *which* version of the product caused the impact.
* **Transparency vs. Adaptability:** Use a Pre-Analysis Plan to define how you will handle product changes before the study begins.
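
The sketch below deflates an assumed effect size for partial uptake before running a standard power calculation with `statsmodels`; all the numbers are placeholders, not benchmarks.

```python
# A minimal sketch of adjusting a power calculation for partial uptake.
# All numbers are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

full_compliance_effect = 0.30   # assumed effect size if everyone used the AI
expected_uptake = 0.40          # assumed share of the treatment group that actually uses it
diluted_effect = full_compliance_effect * expected_uptake  # intention-to-treat effect

n_per_arm = TTestIndPower().solve_power(
    effect_size=diluted_effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"Required sample size per arm: {n_per_arm:.0f}")
```

If you also randomize by cluster (Step D), the required sample grows further, because outcomes within the same cluster are correlated (the design effect).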

<a href="how-is-level-4-evaluation-performed/common-pitfalls-to-avoid" class="button primary">Read more -></a>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="https://tally.so/r/A788l0?originPage=level-4-impact-evaluation%2Fhow-is-level-4-evaluation-performed" %}

</details>
