# Common pitfalls to avoid

Impact evaluations are high-leverage, high-effort undertakings. Avoiding a few predictable errors can significantly improve the value – and credibility – of your results. While many non-AI impact evaluations face these same issues, here we have tried to capture the ways these risks manifest differently in early impact evaluations of AI products.

#### Being underpowered

Even real impacts can go undetected in underpowered studies. For AI products, low uptake, especially early on, is a key risk: overly optimistic uptake assumptions can leave treatment groups too small to detect effects. Set realistic expectations: pilot uptake with groups similar to the intended treatment population, involve skeptics in planning, and use recent Level 2 evaluations to inform assumptions.

As discussed earlier, you should track your target population and key sub-groups across all four evaluation stages. At Level 4, it is critical to have a sufficient sample size to detect statistically significant, programmatically meaningful effects, including differences across groups. This challenge is not AI-specific; it applies to any sub-group analysis. However, AI interventions may see different groups participate in different ways and at different rates. Insights from Levels 1–3 should inform Level 4 sample design and outcome measurement. If budget allows, keep samples and outcomes broad enough to detect unintended positive or negative effects not flagged earlier.
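To make the uptake point concrete, the sketch below shows how low uptake dilutes the detectable effect and inflates the required sample size. It assumes a two-arm trial with a continuous outcome analysed by intention-to-treat; the effect size, uptake rate, and use of a standard statsmodels power calculation are illustrative assumptions, not prescriptions from this guide.

```python
# Minimal power sketch: required sample size per arm under imperfect uptake.
# All numeric values below are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

effect_among_users = 0.30   # standardized effect for participants who actually use the product
uptake_rate = 0.40          # share of the treatment group expected to engage at all

# If non-users experience no effect, the intention-to-treat effect is roughly
# the effect among users scaled down by the uptake rate.
itt_effect = effect_among_users * uptake_rate

analysis = TTestIndPower()
n_naive = analysis.solve_power(effect_size=effect_among_users, alpha=0.05, power=0.8)
n_realistic = analysis.solve_power(effect_size=itt_effect, alpha=0.05, power=0.8)

print(f"Per-arm n assuming full uptake:  {n_naive:,.0f}")
print(f"Per-arm n at {uptake_rate:.0%} uptake (ITT): {n_realistic:,.0f}")

# Sub-group comparisons need roughly this size *within* each group of interest,
# so the total sample grows with the number of sub-groups you plan to compare.
```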

#### Mismanaging transparency

Impact evaluations should build confidence by involving credible, independent investigators, sharing data where appropriate, and pre-specifying key measures and analyses. But transparency should not come at the expense of adaptability. Researchers and implementers need to agree on what is being evaluated: for example, whether the intervention will remain static during the study and, if not, what data will be available to understand how it changes. Given how quickly AI interventions can change, establish mechanisms for early and ongoing coordination during implementation.

#### Letting product evolution obscure the analysis

If the product may change during the study, pre-specify how changes will be handled analytically. One option is to freeze a version for the trial; if that is not feasible, define and log substantive changes, tag each participant's version exposure, and use this metadata to test for improvements or degradations over time. While product evolution creates risk, it is also an opportunity for GenAI evaluations. Unlike with analog interventions, where changes often went unobserved, version tracking, and even repeating Levels 1–3 evaluations after major updates, can enable much richer analysis during the impact evaluation period.
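As one way to operationalise version tagging, the sketch below records which product versions were live during each participant's observation window. The changelog structure, field names, and dates are hypothetical; adapt them to however your team actually logs releases.

```python
# Minimal sketch of version-exposure tagging: given a changelog of substantive
# releases and a participant's observation window, list the versions they could
# have used. All names and dates are hypothetical.
from dataclasses import dataclass
from datetime import date

@dataclass
class Release:
    version: str
    released_on: date
    description: str  # what substantively changed (model, prompts, UI, guardrails, ...)

CHANGELOG = [
    Release("v1.0", date(2024, 1, 15), "initial deployment"),
    Release("v1.1", date(2024, 3, 1), "new base model and revised system prompt"),
    Release("v1.2", date(2024, 5, 10), "added retrieval over local guidelines"),
]

def versions_exposed(start: date, end: date) -> list[str]:
    """Versions live at any point between enrolment (start) and last follow-up (end)."""
    exposed = []
    for i, release in enumerate(CHANGELOG):
        next_release = CHANGELOG[i + 1].released_on if i + 1 < len(CHANGELOG) else date.max
        # A version is "live" from its release date until the next release.
        if release.released_on <= end and next_release > start:
            exposed.append(release.version)
    return exposed

# Example: a participant enrolled in February and followed until June
print(versions_exposed(date(2024, 2, 1), date(2024, 6, 1)))  # ['v1.0', 'v1.1', 'v1.2']
```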

#### Underestimating the risks of attrition

Attrition, whether through disengagement or loss to follow-up, can seriously weaken power and interpretability. In digital interventions, only a small share of sign-ups may engage, and drop-off is easy. Plan for this: track engagement early, power studies accordingly, and use passive data where possible. If some attrition is unavoidable, pre-specify how it will be handled and report it transparently. Use Level 2 and 3 data to monitor attrition early, adjust the design, and link it to version tracking to understand who drops out and when.
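A minimal sketch of two routine calculations follows: inflating the recruitment target to offset expected attrition, and checking for differential attrition between arms. The attrition figures are illustrative assumptions; in practice they should come from Level 2–3 engagement data or a pilot.

```python
# Minimal sketch: plan for attrition up front and monitor whether it differs by arm.
# All numbers below are illustrative assumptions.
import math

def recruit_target(n_needed_at_endline: int, expected_attrition: float) -> int:
    """Participants to recruit so roughly n_needed_at_endline remain after drop-off."""
    if not 0 <= expected_attrition < 1:
        raise ValueError("expected_attrition must be in [0, 1)")
    return math.ceil(n_needed_at_endline / (1 - expected_attrition))

def differential_attrition(treat_dropped: int, treat_n: int,
                           control_dropped: int, control_n: int) -> float:
    """Gap in drop-out rates between arms; large gaps threaten internal validity
    even when overall attrition looks acceptable."""
    return treat_dropped / treat_n - control_dropped / control_n

# Example: 1,100 completers needed per arm, 30% drop-off expected
print(recruit_target(1_100, 0.30))                      # 1572 to recruit per arm
print(differential_attrition(330, 1_572, 180, 1_572))   # ~0.095: treatment arm loses far more
```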

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="https://tally.so/r/A788l0?originPage=level-4-impact-evaluation%2Fhow-is-level-4-evaluation-performed%2Fcommon-pitfalls-to-avoid" %}

</details>
