# Overview

Once an AI system is reliable (Level 1) and engaging (Level 2), we must ask the deeper question: Is it actually working? In the development sector, "liking" a product is not a proxy for impact. Level 3 evaluates the "stepping stones" of change—the intermediate cognitive and affective shifts that predict long-term life improvements in health, education, or livelihoods.

***

#### Key Motivation

Unlike commercial products, which can rely on satisfaction scores (e.g., NPS), development interventions require objective evidence of change. Level 3 is essential because:

* Predictive Power: Intermediate changes (e.g., increased confidence or knowledge) serve as early signals of success long before distal outcomes (e.g., higher income) materialize.
* Beyond "Vanity Metrics": It distinguishes between a user who is merely "addicted" to an interface and one who is actually gaining agency or mastering a skill.
* Fast Iteration: It allows you to run experiments on psychological "states" (like motivation or trust) to refine your product during pilots.

<a href="overview/why-is-this-level-of-evaluation-important" class="button primary">Read more -></a>

***

#### Core Concept: Intermediate Outcomes

Level 3 measures how an "adequate dosage" of your AI product shifts the user across several dimensions. We look for changes in the following constructs:

| Outcome Category | What we measure                                                           |
| ---------------- | ------------------------------------------------------------------------- |
| **Cognitive**    | Knowledge acquisition, belief updating, and reasoning complexity.         |
| **Affective**    | Emotional valence, sense of safety, trust, and perceived empathy.         |
| **Behavioral**   | Intent to act, application of info, and proactive help-seeking.           |
| **Motivational** | Self-efficacy, intrinsic curiosity, and persistence vs. dependency.       |
| **Relational**   | Quality of interpersonal communication and trust in human vs. AI sources. |

<a href="overview/who-is-the-user-being-evaluated" class="button primary">Read more -></a>

***

#### How to Evaluate

Level 3 combines the experimental rigor of Level 2 with deeper psychological and linguistic analysis.

1. **Identify Proxies (Digital Traces):** Analyze conversation logs for "on-platform" behaviors that signal growth, such as increased query depth, technical vocabulary, or proactive follow-up questions.
2. **Collect Survey Data:** Use short, validated scales integrated directly into the chat flow to capture self-reported shifts in confidence or emotional state.
3. **Analyze Content (NLP):** Use tools like Sentiment Analysis or LLM-based Text Analysis to score user utterances for themes like "independence" or "anxiety" at scale.
4. **Define Guardrail Metrics:** Specifically measure potential harms, such as "AI dependency" (reduced willingness to attempt tasks without help) or "social displacement."
5. **Decoupled Assessment:** Conduct off-platform quizzes, interviews, or observer reports (e.g., from teachers) to validate that skills learned in-app translate to the real world.
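To make step 1 concrete, here is a minimal sketch of computing "digital trace" proxies from conversation logs. It assumes a simplified, hypothetical log format (each session is a list of user messages as strings) and uses deliberately crude proxies: word count for query depth and question-mark frequency for proactive follow-ups. A real deployment would use validated measures and more robust NLP.

```python
# Sketch: on-platform proxy metrics from conversation logs.
# Assumed (hypothetical) log format: sessions = list of lists of user messages.
from statistics import mean

def query_depth(message: str) -> int:
    """Crude proxy for query depth: word count of a user message."""
    return len(message.split())

def follow_up_rate(session: list[str]) -> float:
    """Share of messages in a session that look like follow-up questions."""
    if not session:
        return 0.0
    return sum("?" in m for m in session) / len(session)

def trace_metrics(sessions: list[list[str]]) -> dict:
    """Aggregate on-platform proxies across a user's sessions."""
    all_msgs = [m for s in sessions for m in s]
    return {
        "avg_query_depth": mean(query_depth(m) for m in all_msgs) if all_msgs else 0.0,
        "avg_follow_up_rate": mean(follow_up_rate(s) for s in sessions) if sessions else 0.0,
    }

# Comparing an early pilot session against a later one can signal growth
# in query depth, one of the "on-platform" behaviors described above.
early = [["How do I plant maize?"]]
late = [["What spacing maximizes yield for maize in sandy soil?",
         "Should I adjust fertilizer timing after the first rains?"]]
print(trace_metrics(early))
print(trace_metrics(late))
```

In practice these raw proxies would be tracked per user over time and validated against the off-platform assessments in step 5 before being trusted as signals of growth.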

<a href="how-is-level-3-evaluation-performed" class="button primary">Read more -></a>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="https://tally.so/r/A788l0?originPage=level-3-user-evaluation%2Foverview" %}

</details>
