Who is most involved in this level of evaluation?
In Level 3, researchers evaluate users’ attitudes and behaviors using quantitative and/or qualitative methods. Contributors are often trained in behavioral science, social psychology, public health, or behavioral economics. They should have a mix of quantitative and qualitative skills, or be knowledgeable enough to support a multi-methods team using:
Quantitative approaches (analyzing logs, surveys, and conversation data) to infer users’ states or traits. These approaches measure constructs such as knowledge, beliefs, intentions, norms, feelings, and behaviors.
Qualitative approaches (interviews, focus groups, usability tests, and ethnographic methods) to understand how users interact with a product. These approaches help researchers validate assumptions, understand mechanisms, surface unintended effects, and expose contextual or environmental drivers of user pain points.
User Researchers: Develop and apply evaluation methods with the proper measurement tools.
Data Scientists / Engineers: Build and deploy surveys and experiments within the product. Support the design of A/B tests and randomized experiments.
Why is this level of evaluation important?
Once an AI system is functioning as expected (Level 1), and the product is engaging users as intended (Level 2), we can ask a deeper question: Is the product influencing how users think, feel, or act—and in ways that advance a development outcome of interest?
The success of commercial AI products is often measured via user satisfaction ratings or Net Promoter Score (NPS)—essentially asking, 'Do you like this product enough to recommend it?' But in the development sector, satisfaction is not a proxy for impact. A student might enjoy a tutoring app (high NPS) without actually mastering the curriculum. A patient may favorably review a health provider, even when harmed by sub-standard care.
In Level 3 evaluation, we identify and measure specific behaviors, beliefs, or feelings that predict long-term improvements in health, education, or livelihoods. We use a program’s Theory of Change (TOC) to specify the “stepping stones” that users traverse on their path toward impact. Instead of waiting years to see if health or education outcomes improve, we identify intermediate changes in how users think, feel, or act, and treat them as early signals of success.
To do this, organizations should address five key issues:
Measures: Which specific user-level changes actually matter to our Theory of Change? Can we measure these short-term changes relatively cheaply and frequently?
Attribution: Can we plausibly claim these changes are caused by our AI product?
Trajectory: Are metrics trending in the right direction? Do sub-groups behave differently?
Malleability: Can we shift metrics by altering the product experience? Do users show increased drive to act (e.g., asking proactive questions or expressing intent to change) when we intervene with product improvements?
Perception: Do users feel more empowered to act (e.g., do they have a clearer understanding of their next steps), even if they do not immediately take action? User perceptions can be predictive of outcomes, for example when a student recognizes the learning gains they have achieved while using an app.
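As an illustration of the Attribution and Trajectory questions, the sketch below compares an intermediate outcome between randomly assigned treatment and control users, then tabulates weekly rates by sub-group. Everything in it is an assumption made for illustration: the record layout, the sub-group labels, the function names, and the data itself, which is made up. It presumes users were randomized into arms, and it uses a normal-approximation confidence interval that is only a rough guide at small sample sizes.

```python
from collections import defaultdict
from math import sqrt

# Made-up illustrative records: (arm, subgroup, week, outcome).
# outcome is 1 if the user showed the intermediate change of interest
# (e.g., stated an intent to act), else 0. All values are hypothetical.
records = (
    [("treatment", "rural", 1, o) for o in (1, 0)]
    + [("treatment", "rural", 2, o) for o in (1, 1)]
    + [("treatment", "urban", 1, o) for o in (1, 0)]
    + [("treatment", "urban", 2, o) for o in (1, 0)]
    + [("control", "rural", 1, o) for o in (0, 0)]
    + [("control", "rural", 2, o) for o in (1, 0)]
    + [("control", "urban", 1, o) for o in (0, 0)]
    + [("control", "urban", 2, o) for o in (1, 0)]
)


def rate(values):
    """Share of users showing the outcome."""
    return sum(values) / len(values)


def diff_in_proportions(rows):
    """Attribution: treatment-minus-control difference in outcome rates,
    with a 95% normal-approximation confidence interval."""
    by_arm = defaultdict(list)
    for arm, _subgroup, _week, outcome in rows:
        by_arm[arm].append(outcome)
    treated, control = by_arm["treatment"], by_arm["control"]
    p_t, p_c = rate(treated), rate(control)
    se = sqrt(p_t * (1 - p_t) / len(treated) + p_c * (1 - p_c) / len(control))
    diff = p_t - p_c
    return diff, (diff - 1.96 * se, diff + 1.96 * se)


def weekly_rates_by_subgroup(rows, arm="treatment"):
    """Trajectory: outcome rate per (subgroup, week) within one arm."""
    buckets = defaultdict(list)
    for row_arm, subgroup, week, outcome in rows:
        if row_arm == arm:
            buckets[(subgroup, week)].append(outcome)
    return {key: rate(vals) for key, vals in sorted(buckets.items())}


diff, (ci_low, ci_high) = diff_in_proportions(records)
trends = weekly_rates_by_subgroup(records)

print(f"Difference in outcome rate: {diff:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
for (subgroup, week), value in trends.items():
    print(f"{subgroup}, week {week}: {value:.2f}")
```

With so few records the interval is wide and spans zero, which is the expected reading at pilot scale: the direction of the difference is suggestive, but a claim of attribution needs a larger randomized sample. The sub-group table serves the Trajectory question by showing whether one group is improving week over week while another stays flat.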
By defining and tracking intermediate outcomes, Level 3 helps you conduct fast product iterations during pilots and ongoing feature development, setting the stage for a successful Level 4 evaluation down the road.