# Overview

Once an AI system is reliable (Level 1) and engaging (Level 2), we must ask the deeper question: Is it actually working? In the development sector, "liking" a product is not a proxy for impact. Level 3 evaluates the "stepping stones" of change—the intermediate cognitive and affective shifts that predict long-term life improvements in health, education, or livelihoods.

***

#### Key Motivation

Unlike commercial products, which can rely on satisfaction scores (e.g., NPS), development interventions require objective evidence of change. Level 3 is essential because:

* Predictive Power: Intermediate changes (e.g., increased confidence or knowledge) serve as early signals of success long before distal outcomes (e.g., higher income) materialize.
* Beyond "Vanity Metrics": It distinguishes between a user who is merely "addicted" to an interface and one who is actually gaining agency or mastering a skill.
* Fast Iteration: It allows you to run experiments on psychological "states" (like motivation or trust) to refine your product during pilots.

<a href="overview/why-is-this-level-of-evaluation-important" class="button primary">Read more -></a>

***

#### Core Concept: Intermediate Outcomes

Level 3 measures how an "adequate dosage" of your AI product shifts the user across several dimensions. We look for changes in the following constructs:

| Outcome Category | What we measure                                                           |
| ---------------- | ------------------------------------------------------------------------- |
| **Cognitive**    | Knowledge acquisition, belief updating, and reasoning complexity.         |
| **Affective**    | Emotional valence, sense of safety, trust, and perceived empathy.         |
| **Behavioral**   | Intent to act, application of info, and proactive help-seeking.           |
| **Motivational** | Self-efficacy, intrinsic curiosity, and persistence vs. dependency.       |
| **Relational**   | Quality of interpersonal communication and trust in human vs. AI sources. |

<a href="overview/who-is-the-user-being-evaluated" class="button primary">Read more -></a>

***

#### How to Evaluate

Level 3 combines the experimental rigor of Level 2 with deeper psychological and linguistic analysis.

1. **Identify Proxies (Digital Traces):** Analyze conversation logs for "on-platform" behaviors that signal growth, such as increased query depth, technical vocabulary, or proactive follow-up questions.
2. **Collect Survey Data:** Use short, validated scales integrated directly into the chat flow to capture self-reported shifts in confidence or emotional state.
3. **Analyze Content (NLP):** Use tools like Sentiment Analysis or LLM-based Text Analysis to score user utterances for themes like "independence" or "anxiety" at scale.
4. **Define Guardrail Metrics:** Specifically measure potential harms, such as "AI dependency" (reduced willingness to attempt tasks without help) or "social displacement."
5. **Decoupled Assessment:** Conduct off-platform quizzes, interviews, or observer reports (e.g., from teachers) to validate that skills learned in-app translate to the real world.
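To make step 1 concrete, here is a minimal sketch of computing "digital trace" proxies from conversation logs. It assumes a simplified, hypothetical log format (each session is a list of user messages as strings) and uses deliberately crude proxies: word count for query depth and question-mark frequency for proactive follow-ups. A real deployment would use validated measures and more robust NLP.

```python
# Sketch: on-platform proxy metrics from conversation logs.
# Assumed (hypothetical) log format: sessions = list of lists of user messages.
from statistics import mean

def query_depth(message: str) -> int:
    """Crude proxy for query depth: word count of a user message."""
    return len(message.split())

def follow_up_rate(session: list[str]) -> float:
    """Share of messages in a session that look like follow-up questions."""
    if not session:
        return 0.0
    return sum("?" in m for m in session) / len(session)

def trace_metrics(sessions: list[list[str]]) -> dict:
    """Aggregate on-platform proxies across a user's sessions."""
    all_msgs = [m for s in sessions for m in s]
    return {
        "avg_query_depth": mean(query_depth(m) for m in all_msgs) if all_msgs else 0.0,
        "avg_follow_up_rate": mean(follow_up_rate(s) for s in sessions) if sessions else 0.0,
    }

# Comparing an early pilot session against a later one can signal growth
# in query depth, one of the "on-platform" behaviors described above.
early = [["How do I plant maize?"]]
late = [["What spacing maximizes yield for maize in sandy soil?",
         "Should I adjust fertilizer timing after the first rains?"]]
print(trace_metrics(early))
print(trace_metrics(late))
```

In practice these raw proxies would be tracked per user over time and validated against the off-platform assessments in step 5 before being trusted as signals of growth.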

<a href="how-is-level-3-evaluation-performed" class="button primary">Read more -></a>

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="https://tally.so/r/A788l0?originPage=level-3-user-evaluation%2Foverview" %}

</details>
