Overview

Does the product change users' thoughts, feelings, knowledge, and behavior in relation to the development outcome?

Once an AI system is reliable (Level 1) and engaging (Level 2), we must ask the deeper question: Is it actually working? In the development sector, "liking" a product is not a proxy for impact. Level 3 evaluates the "stepping stones" of change—the intermediate cognitive and affective shifts that predict long-term life improvements in health, education, or livelihoods.


Key Motivation

Commercial products often rely on satisfaction scores such as the Net Promoter Score (NPS); development outcomes require objective evidence of change. Level 3 is essential because:

  • Predictive Power: Intermediate changes (e.g., increased confidence or knowledge) serve as early signals of success long before distal outcomes (e.g., higher income) materialize.

  • Beyond "Vanity Metrics": It distinguishes between a user who is merely "addicted" to an interface and one who is actually gaining agency or mastering a skill.

  • Fast Iteration: It allows you to run experiments on psychological "states" (like motivation or trust) to refine your product during pilots.



Core Concept: Intermediate Outcomes

Level 3 measures how an "adequate dosage" of your AI product shifts the user across several dimensions. We look for changes in the following constructs:

| Outcome Category | What we measure |
| --- | --- |
| Cognitive | Knowledge acquisition, belief updating, and reasoning complexity. |
| Affective | Emotional valence, sense of safety, trust, and perceived empathy. |
| Behavioral | Intent to act, application of info, and proactive help-seeking. |
| Motivational | Self-efficacy, intrinsic curiosity, and persistence vs. dependency. |
| Relational | Quality of interpersonal communication and trust in human vs. AI sources. |



How to Evaluate

Level 3 combines the experimental rigor of Level 2 with deeper psychological and linguistic analysis.

  1. Generate hypotheses based on a theory of change: Define the intermediate cognitive, affective, or behavioral outcomes that are plausibly linked to your targeted social impact.

  2. Identify outcome metrics (Digital Traces): Analyze conversation logs for "on-platform" behaviors that signal growth, for example increased query depth, more technical vocabulary, or proactive follow-up questions.

  3. Define guardrail metrics and measure potential harm: Track potential harms explicitly, such as "AI dependency" (reduced willingness to attempt tasks without help) or "social displacement."

  4. Consider constructing proxies for long-term development outcomes: We propose constructing a "Surrogate Index" that combines Level 2 and Level 3 metrics and serves as a proxy for longer-term Level 4 outcomes.

  5. Consider conducting experiments to improve the selected key metrics and running process evaluations: After identifying intermediate outcomes that serve as early indicators of the development outcome of interest, the next step is to run experiments to assess how product changes influence Level 3 outcomes.
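As a rough illustration of steps 2 and 5, the sketch below derives two simple digital-trace metrics from conversation logs and compares them across experiment arms. Everything in it is an assumption made for illustration: the `Message` record, the `arm` labels, the use of words per message as a crude stand-in for "query depth", and the share of messages ending in a question mark as a stand-in for proactive questions. A real evaluation would need validated measures and a proper significance test, not just a difference in means.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Message:
    """One user message from a conversation log (hypothetical schema)."""
    user_id: str
    arm: str   # assumed experiment arm label: "control" or "treatment"
    text: str  # raw text the user sent


def word_count(text: str) -> int:
    # Crude proxy for query depth: number of whitespace-separated tokens.
    return len(text.split())


def is_question(text: str) -> bool:
    # Crude proxy for a proactive question: message ends with "?".
    return text.strip().endswith("?")


def user_metrics(messages: list[Message]) -> dict[str, dict]:
    """Aggregate per-user trace metrics from a flat list of messages."""
    by_user: dict[str, list[Message]] = {}
    for m in messages:
        by_user.setdefault(m.user_id, []).append(m)

    metrics = {}
    for user_id, msgs in by_user.items():
        metrics[user_id] = {
            "arm": msgs[0].arm,  # assumes a user stays in one arm
            "avg_words": mean(word_count(m.text) for m in msgs),
            "question_share": mean(
                1.0 if is_question(m.text) else 0.0 for m in msgs
            ),
        }
    return metrics


def arm_difference(metrics: dict[str, dict], key: str) -> float:
    """Treatment mean minus control mean for one per-user metric."""
    treatment = [v[key] for v in metrics.values() if v["arm"] == "treatment"]
    control = [v[key] for v in metrics.values() if v["arm"] == "control"]
    return mean(treatment) - mean(control)


# Toy log with made-up messages, only to show the shape of the data.
logs = [
    Message("u1", "control", "help"),
    Message("u1", "control", "what is this"),
    Message("u2", "treatment", "How do I treat leaf rust on wheat?"),
    Message("u2", "treatment", "Which fungicide dose is safe?"),
]

metrics = user_metrics(logs)
print(arm_difference(metrics, "avg_words"))       # 4.5
print(arm_difference(metrics, "question_share"))  # 1.0
```

Aggregating per user before comparing arms keeps highly active users from dominating the averages. The same per-user table could also hold guardrail metrics from step 3, so that a gain in an outcome metric can be read alongside any sign of harm.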


