# Minimum Viable Evaluations

| Level 1 - Model evaluation MVE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <ul class="contains-task-list"><li><input type="checkbox">Define <a href="/spaces/VDHDXE8axdWQfu0OFCHP/pages/pMp8WTGTVjysMbVi5Lmw">2-3 rubrics for model success</a>, including at least one robust safety/guardrail metric, computed on your Golden Dataset.</li><li><input type="checkbox">In consultation with product and business owners, set a success criterion or threshold for each rubric/metric that the system must meet before it is ready for deployment.</li><li><input type="checkbox">Develop a <a href="/spaces/VDHDXE8axdWQfu0OFCHP/pages/nlovOPA1IMARfE9MwQau">Golden Dataset</a> of at least 30-50 items representing key, diverse user interactions.</li><li><input type="checkbox">Establish a process for <a href="/spaces/VDHDXE8axdWQfu0OFCHP/pages/dmxTilPPMauBPlkqtQJw">expert review of AI system</a> responses to inputs in the Golden Dataset as you iterate on your system configuration.</li></ul> |
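
The Level 1 checklist can be read as a simple deployment gate: score each Golden Dataset item against each rubric, compute a pass rate per rubric, and compare it with the agreed threshold. The sketch below is one minimal way to express that. The rubric names, the pass/fail scoring scheme, the 40-item dataset, and the threshold values are all hypothetical and would be replaced by whatever your team agrees with product and business owners.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    name: str
    pass_rate: float   # share of Golden Dataset items that passed this rubric
    threshold: float   # minimum pass rate agreed with product/business owners

    @property
    def passed(self) -> bool:
        return self.pass_rate >= self.threshold


def evaluate_rubrics(scores, thresholds):
    """scores: {rubric: [bool per Golden Dataset item]}; thresholds: {rubric: min pass rate}."""
    results = []
    for name, item_scores in scores.items():
        pass_rate = sum(item_scores) / len(item_scores)
        results.append(RubricResult(name, pass_rate, thresholds[name]))
    return results


def ready_for_deployment(results):
    # Every rubric, including the safety/guardrail one, must clear its threshold.
    return all(r.passed for r in results)


# Hypothetical pass/fail scores for a 40-item Golden Dataset.
scores = {
    "accuracy": [True] * 36 + [False] * 4,
    "safety_guardrail": [True] * 39 + [False] * 1,
}
# Hypothetical thresholds; the safety rubric is deliberately stricter.
thresholds = {"accuracy": 0.85, "safety_guardrail": 0.99}

results = evaluate_rubrics(scores, thresholds)
for r in results:
    print(f"{r.name}: {r.pass_rate:.3f} (threshold {r.threshold}) passed={r.passed}")
print("ready for deployment:", ready_for_deployment(results))
```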

| Level 2 - Product evaluation MVE                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <ul class="contains-task-list"><li><input type="checkbox">Instrument the product to capture events automatically</li><li><input type="checkbox">Use the event data to produce two metrics: activation (used once) and retention (used repeatedly)</li><li><input type="checkbox">Look for patterns in the data, and talk to users to identify opportunities for improvement</li><li><input type="checkbox">Test these improvement ideas against the two metrics with an A/B test</li></ul> |
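
As an illustration of the two Level 2 metrics, the sketch below derives activation and retention from a list of captured events. It assumes, for the sake of the example only, that activation means a signed-up user performed the core action at least once and that retention means an activated user performed it on at least two distinct days. The function name, the event shape, and the sample data are hypothetical; your own definitions may use different windows or actions.

```python
from collections import defaultdict
from datetime import date


def activation_and_retention(signed_up_users, events):
    """events: iterable of (user_id, day) pairs for the core product action."""
    active_days = defaultdict(set)
    for user_id, day in events:
        if user_id in signed_up_users:
            active_days[user_id].add(day)

    # Assumed definitions: activated = used at least once,
    # retained = used on two or more distinct days.
    activated = {u for u, days in active_days.items() if len(days) >= 1}
    retained = {u for u, days in active_days.items() if len(days) >= 2}

    activation_rate = len(activated) / len(signed_up_users) if signed_up_users else 0.0
    retention_rate = len(retained) / len(activated) if activated else 0.0
    return activation_rate, retention_rate


# Hypothetical data: four sign-ups, three of whom used the product.
users = {"u1", "u2", "u3", "u4"}
events = [
    ("u1", date(2024, 1, 1)),
    ("u1", date(2024, 1, 3)),   # u1 returns on a second day
    ("u2", date(2024, 1, 2)),
    ("u2", date(2024, 1, 2)),   # same-day repeat does not count as retention here
    ("u3", date(2024, 1, 5)),
]

activation_rate, retention_rate = activation_and_retention(users, events)
print(f"activation={activation_rate:.2f} retention={retention_rate:.2f}")
```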

| Level 3 - User evaluation MVE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <ul class="contains-task-list"><li><input type="checkbox">Define 1-2 outcome metrics tied to the theory of change (focus on the most decision-relevant cognitive/behavioral outcomes), and include at least one early-warning indicator of harm (e.g., over-reliance, disengagement).</li><li><input type="checkbox">Combine at least one behavioral/trace metric with a brief, contextualized self-report measure (≤3 items) to capture meaningful user change.</li><li><input type="checkbox">Include a minimal external check (e.g., focus group discussion, offline data, or stakeholder validation) to ensure on-platform measures reflect real-world outcomes.</li><li><input type="checkbox">Consider testing product changes on selected outcomes using simple experimental methods (e.g., A/B tests).</li></ul> |
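
For the A/B tests mentioned above, one common and simple analysis for a binary outcome is a two-proportion z-test. The sketch below is a generic textbook version rather than a method prescribed by this playbook; the function name and the counts are hypothetical, and it assumes users were randomly assigned to the two arms and that samples are large enough for the normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Compare outcome rates of control (a) and variant (b); returns (difference, z, two-sided p)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both arms share one rate.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Hypothetical counts: users reaching the outcome out of 1000 per arm.
diff, z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print(f"difference={diff:.3f} z={z:.2f} p={p_value:.3f}")
```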

| Level 4 - Impact evaluation MVE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <ul class="contains-task-list"><li><input type="checkbox">Conduct an <a data-footnote-ref href="#user-content-fn-1">impact evaluation</a> with a <a data-footnote-ref href="#user-content-fn-2">counterfactual</a> and a sample size large enough to measure the key outcome(s) of interest, including among sub-populations of interest (e.g., by gender or geography)</li><li><input type="checkbox">Implement strong version control with either a frozen version or a limited number of product versions to be tested</li><li><input type="checkbox">Collect cost data</li></ul> |
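
To give a rough sense of what "large enough" can mean for a binary outcome, the sketch below applies the standard normal-approximation formula for comparing two proportions, n per arm = (z_(1-α/2) + z_power)² · [p₁(1-p₁) + p₂(1-p₂)] / (p₂ - p₁)². The baseline and treatment rates, the 5% significance level, and the 80% power target are illustrative assumptions, and the formula ignores clustering and attrition, which real designs usually need to account for. If an effect must be detectable within a sub-population, a similar number of participants is needed within that sub-population.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate participants needed per arm to detect p_treatment vs p_control."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)            # desired statistical power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical outcome rates: 40% in the counterfactual group, 50% with the product.
n_large_effect = sample_size_per_arm(0.40, 0.50)
# Halving the expected effect roughly quadruples the required sample.
n_small_effect = sample_size_per_arm(0.40, 0.45)
print(n_large_effect, n_small_effect)
```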

***


[^1]: An MVE impact evaluation can be done inexpensively. A number of resources describe how to reduce cost and effort while still conducting a rigorous impact evaluation.

[^2]: Choose the counterfactual judiciously. Focus on the policy-relevant choice. While it might be interesting to see how an intervention delivered by humans compares to AI delivery, if human delivery would be too expensive to be feasible, focus on a counterfactual where the intervention is not delivered.

