Overview
Does the product change users' thoughts, feelings, knowledge and behaviour towards the development outcome?
Once an AI system is reliable (Level 1) and engaging (Level 2), we must ask the deeper question: Is it actually working? In the development sector, "liking" a product is not a proxy for impact. Level 3 evaluates the "stepping stones" of change—the intermediate cognitive and affective shifts that predict long-term life improvements in health, education, or livelihoods.
Key Motivation
Commercial products can often rely on satisfaction scores such as Net Promoter Score (NPS), but development programs need objective evidence of change. Level 3 is essential because:
Predictive Power: Intermediate changes (e.g., increased confidence or knowledge) serve as early signals of success long before distal outcomes (e.g., higher income) materialize.
Beyond "Vanity Metrics": It distinguishes between a user who is merely "addicted" to an interface and one who is actually gaining agency or mastering a skill.
Fast Iteration: It allows you to run experiments on psychological "states" (like motivation or trust) to refine your product during pilots.
Core Concept: Intermediate Outcomes
Level 3 measures how an "adequate dosage" of your AI product shifts the user across several dimensions. We look for changes in the following constructs:
Cognitive: Knowledge acquisition, belief updating, and reasoning complexity.
Affective: Emotional valence, sense of safety, trust, and perceived empathy.
Behavioral: Intent to act, application of info, and proactive help-seeking.
Motivational: Self-efficacy, intrinsic curiosity, and persistence vs. dependency.
Relational: Quality of interpersonal communication and trust in human vs. AI sources.
How to Evaluate
Level 3 combines the experimental rigor of Level 2 with deeper psychological and linguistic analysis.
Generate hypotheses based on a theory of change: Define intermediate cognitive, affective, or behavioral outcomes that are plausibly linked to your targeted social impact.
Identify outcome metrics (Digital Traces): For example, analyze conversation logs for "on-platform" behaviors that signal growth, such as increased query depth, technical vocabulary, or proactive follow-up questions.
Define guardrail metrics and measure potential harm: Track harms explicitly, such as "AI dependency" (reduced willingness to attempt tasks without help) or "social displacement."
Consider constructing proxies for long-term development outcomes: We propose constructing a "Surrogate Index", consisting of Level 2 and Level 3 metrics, to serve as a proxy for longer-term Level 4 outcomes.
Consider conducting experiments to improve the selected key metrics and running process evaluations: After identifying intermediate outcomes that serve as early indicators of the development outcome of interest, the next step is to run experiments to assess how product changes influence Level 3 outcomes.
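As a rough illustration of the digital-trace and Surrogate Index steps above, the sketch below computes two on-platform metrics from conversation logs and combines them into a single composite score. Everything here is an assumption made for illustration: the function names (`query_depth`, `follow_up_rate`, `surrogate_index`), the word-count and question-mark heuristics, the example logs, and the equal weights are hypothetical. The composite is a simple weighted sum of standardized metrics, not a fitted surrogate model; in practice the weights would need to come from data linking intermediate metrics to the long-term outcome.

```python
from statistics import mean, pstdev


def query_depth(messages):
    """Average words per user message (a crude, assumed proxy for query depth)."""
    return mean(len(m.split()) for m in messages) if messages else 0.0


def follow_up_rate(messages):
    """Share of messages after the first that contain a question mark."""
    later = messages[1:]
    if not later:
        return 0.0
    return sum("?" in m for m in later) / len(later)


def z_scores(values):
    """Standardize values across users; return zeros if there is no variation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def surrogate_index(metrics_by_user, weights):
    """Weighted sum of standardized metrics per user (weights are hypothetical)."""
    users = list(metrics_by_user)
    index = {u: 0.0 for u in users}
    for name, weight in weights.items():
        standardized = z_scores([metrics_by_user[u][name] for u in users])
        for user, z in zip(users, standardized):
            index[user] += weight * z
    return index


# Hypothetical conversation logs: one list of user messages per user.
logs = {
    "user_a": [
        "What is anemia?",
        "Which foods are rich in iron?",
        "How much iron do I need during pregnancy?",
    ],
    "user_b": ["hi", "ok"],
}

metrics = {
    user: {
        "query_depth": query_depth(msgs),
        "follow_up_rate": follow_up_rate(msgs),
    }
    for user, msgs in logs.items()
}

index = surrogate_index(metrics, {"query_depth": 0.5, "follow_up_rate": 0.5})
```

Standardizing each metric before weighting keeps metrics on different scales (word counts vs. rates) from dominating the composite. A guardrail metric such as AI dependency could be tracked in the same per-user structure but reported separately rather than folded into the index, so that harm is not masked by gains elsewhere.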