# Consider conducting experiments to improve the selected key metrics and running process evaluations

After identifying intermediate outcomes that serve as early indicators of the development outcome of interest, the next step is to run experiments assessing how product changes influence Level 3 outcomes without causing harm. The evaluation methods are the same as at Level 2 but are applied to a different set of outcomes (e.g., A/B testing: Feature A vs. Feature B; multi-armed bandits: performance-based adaptive allocation; holdout testing: AI vs. non-AI).
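To make the "performance-based adaptive allocation" idea concrete, here is a minimal Thompson-sampling sketch for a two-arm bandit. The variant names, conversion rates, and sample size are hypothetical illustrations, not values from any real evaluation; a production experiment would also need pre-registered metrics, guardrails, and subgroup checks.

```python
import random

# Each arm keeps a Beta(successes + 1, failures + 1) posterior over its
# conversion rate; each new user is allocated to the arm whose posterior
# draw is highest, so traffic shifts toward the better-performing variant.

class BetaArm:
    def __init__(self, name):
        self.name = name
        self.successes = 0
        self.failures = 0

    def sample(self):
        # Draw from the Beta posterior over this arm's success rate.
        return random.betavariate(self.successes + 1, self.failures + 1)

    def update(self, converted):
        if converted:
            self.successes += 1
        else:
            self.failures += 1

def choose_arm(arms):
    # Thompson sampling: pick the arm with the highest posterior draw.
    return max(arms, key=lambda arm: arm.sample())

# Simulated run with assumed "true" conversion rates (illustration only).
random.seed(0)
true_rates = {"feature_a": 0.10, "feature_b": 0.15}
arms = [BetaArm("feature_a"), BetaArm("feature_b")]

for _ in range(5000):
    arm = choose_arm(arms)
    arm.update(random.random() < true_rates[arm.name])

for arm in arms:
    n = arm.successes + arm.failures
    print(f"{arm.name}: allocated {n} users, observed rate {arm.successes / max(n, 1):.3f}")
```

In a simulation like this, most of the 5,000 users end up allocated to the higher-performing arm, which is the practical advantage of a bandit over a fixed 50/50 A/B split when exposing users to a weaker variant carries a cost.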

We also recommend running process evaluations to understand why and when Level 3 metrics are not moving.

A process evaluation (PE; [see primer here](https://eval.playbook.org.ai/linkages-across-levels/process-evaluations)) linked to a Level 3 evaluation can surface what is or is not enabling cognitive, affective, or behavioral outcomes, informing what to test and where to focus program improvements.

At this level it is important to zoom out from the individual user to the broader program ToC (Figure 3 in the Building Blocks section). Behavior change is shaped by the program delivery system and the social context outside the program: the workflow, organizational, and social conditions that determine whether product use translates into changed thoughts, feelings, and behavior.

| ToC Domain / Assumption | Example PE questions                                                                                                                                   | Methods                                                                                                                                                                                        |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Interpretation          | Are users receiving and interpreting AI outputs in the way the program intended, and does this differ across subgroups?                                | Semi-structured interviews; cognitive walk-throughs with a purposive sample of users; analysis of in-conversation signals (e.g., follow-up questions, expressed confusion); quantitative surveys |
| Opportunity to act      | What contextual factors (social norms, competing demands, or structural constraints) are enabling or blocking users from acting on AI recommendations? | Focus groups and ethnographic observation; barrier/enabler mapping using implementation science frameworks                                                                                     |
| Externalities           | Has the tool shifted the roles or behaviors of others in the program ecosystem, intentionally or not? Are these shifts undermining outcomes?           | Key informant interviews and/or surveys with supervisors and non-user staff; focus groups; administrative data on workload, staffing, or service utilization                                   |

A PE at Level 3 typically requires richer qualitative inquiry than at Level 2, since barriers to behavior change are often rooted in context, relationships, and norms. When possible, run the PE alongside or before the Level 3 evaluation, so findings can directly inform program refinements.

***

<details>

<summary>💬 Want to suggest edits or provide feedback?</summary>

{% embed url="https://tally.so/r/A788l0?originPage=level-3-user-evaluation%2Fhow-is-level-3-evaluation-performed%2Fwhy-arent-thoughts-feelings-and-behavior-changing" %}

</details>
