# Minimum Viable Evaluations

## Level 1 - Model evaluation MVE

* Define 2-3 rubrics for model success, with at least one robust safety/guardrail metric, computed on your Golden Dataset.
* In consultation with product and business owners, set a success criterion or threshold for each rubric/metric that the system must pass before it is ready for deployment.
* Develop a Golden Dataset with at least 30-50 items representing key, diverse user interactions.
* Establish a process for expert review of AI system responses to inputs in the Golden Dataset as you iterate on your system configuration.
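
A deployment gate over the rubric thresholds can be sketched as follows. This is a minimal illustration, not a prescribed implementation: it assumes each Golden Dataset item has already been scored pass/fail (1 or 0) per rubric, for example by expert review, and the rubric names, scores, and thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    name: str
    pass_rate: float   # share of Golden Dataset items that passed this rubric
    threshold: float   # minimum pass rate agreed with product/business owners

    @property
    def passed(self) -> bool:
        return self.pass_rate >= self.threshold


def evaluate_rubrics(scores, thresholds):
    """scores: rubric name -> list of 0/1 item scores.
    thresholds: rubric name -> minimum acceptable pass rate."""
    results = []
    for name, item_scores in scores.items():
        if not item_scores:
            raise ValueError(f"no scored items for rubric {name!r}")
        pass_rate = sum(item_scores) / len(item_scores)
        results.append(RubricResult(name, pass_rate, thresholds[name]))
    return results


def ready_for_deployment(results):
    # Every rubric, including the safety/guardrail one, must clear its threshold.
    return all(r.passed for r in results)


# Hypothetical scores for a 30-item Golden Dataset (assumed values).
example_scores = {
    "answer_accuracy": [1] * 27 + [0] * 3,
    "safety_guardrail": [1] * 30,
}
example_thresholds = {"answer_accuracy": 0.85, "safety_guardrail": 1.0}
example_results = evaluate_rubrics(example_scores, example_thresholds)
```

Calling `ready_for_deployment(example_results)` then gives a single go/no-go signal, while the individual `RubricResult` entries show which rubric blocked a release.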

## Level 2 - Product evaluation MVE

* Instrument the product to capture events automatically.
* Use the event data to produce two metrics: activation (used once) and retention (used repeatedly).
* Look for patterns in the data and talk to users to identify opportunities for improvement.
* Test these improvement ideas against these metrics with an A/B test.
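
The two metrics above could be computed from raw event data roughly as follows. This is a sketch under stated assumptions: events are `(user_id, date)` pairs, "used once" means at least one event, and "used repeatedly" is taken to mean activity on at least two distinct days. That last definition, the function name, and the sample data are assumptions; choose the retention window that fits your product.

```python
from collections import defaultdict
from datetime import date


def activation_and_retention(events, signups):
    """events: iterable of (user_id, date) pairs captured by instrumentation.
    signups: set of user_ids who had access to the product.
    Returns (activation_rate, retention_rate)."""
    days_by_user = defaultdict(set)
    for user_id, day in events:
        if user_id in signups:
            days_by_user[user_id].add(day)

    # Activation: signed-up users with at least one recorded event.
    activated = [u for u in signups if len(days_by_user.get(u, ())) >= 1]
    # Retention (assumed definition): activated users active on 2+ distinct days.
    retained = [u for u in activated if len(days_by_user[u]) >= 2]

    activation_rate = len(activated) / len(signups) if signups else 0.0
    retention_rate = len(retained) / len(activated) if activated else 0.0
    return activation_rate, retention_rate


# Hypothetical event log (assumed values).
example_signups = {"a", "b", "c", "d"}
example_events = [
    ("a", date(2024, 1, 1)),
    ("a", date(2024, 1, 2)),
    ("b", date(2024, 1, 1)),
    ("b", date(2024, 1, 1)),  # same-day repeat does not count as a return visit
    ("c", date(2024, 1, 3)),
]
activation_rate, retention_rate = activation_and_retention(example_events, example_signups)
```

Counting distinct days rather than raw events keeps a single busy session from being mistaken for repeat usage.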

## Level 3 - User evaluation MVE

* Define 1-2 outcome metrics tied to the theory of change (focus on the most decision-relevant cognitive/behavioral outcomes), and include at least one early-warning indicator of harm (e.g., over-reliance, disengagement).
* Combine at least one behavioral/trace metric with a brief, contextualized self-report measure (≤3 items) to capture meaningful user change.
* Include a minimal external check (e.g., focus group discussion, offline data, or stakeholder validation) to ensure on-platform measures reflect real-world outcomes.
* Consider testing product changes on selected outcomes using simple experimental methods (e.g., A/B tests).
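
For a binary outcome (for example, whether a user reached a target behavior), one simple way to analyze an A/B test is a two-proportion z-test. The sketch below uses only the Python standard library. The function name and the counts are hypothetical, and the test assumes users were randomly assigned and that samples are large enough for the normal approximation to hold.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Compare the outcome rate in arm B against arm A.
    Returns (z statistic, two-sided p-value) using a pooled standard error."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("outcome has no variation; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts (assumed values): 100/1000 in control, 150/1000 in variant.
z_stat, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value suggests the difference between arms is unlikely to be noise alone, but it says nothing about whether the change is large enough to matter, so read it alongside the effect size and the early-warning indicators.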

## Level 4 - Impact evaluation MVE

* Conduct an impact evaluation with a counterfactual and a sample size large enough to measure the key outcome(s) of interest, including among sub-populations of interest (e.g., by gender, geography).
* Implement strong version control, with either a frozen version or a limited number of product versions to be tested.
* Collect cost data.
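
To get a rough sense of what "enough of a sample size" means, the standard normal-approximation formula for comparing two proportions can be sketched as below. The function name, default significance level and power, and example rates are assumptions for illustration. The formula ignores clustering, attrition, and covariate adjustment, so treat the result as a starting point rather than a final design.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate participants needed in each arm to detect the difference
    between two outcome rates with a two-sided test."""
    effect = p_treatment - p_control
    if effect == 0:
        raise ValueError("the two rates must differ to compute a sample size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical planning scenario (assumed values): 10% outcome rate without
# the intervention, 15% expected with it.
n_per_arm = sample_size_per_arm(0.10, 0.15)
```

If the effect needs to be detected separately within a sub-population (for example, by gender), each subgroup needs roughly this many participants per arm on its own, which is why subgroup analysis tends to drive up the total sample.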


[^1]: An MVE Impact evaluation can also be done inexpensively. There are a number of resources on how to reduce costs and effort and still do a rigorous impact evaluation.

[^2]: Choose the counterfactual judiciously. Focus on the policy relevant choice. While it might be interesting to see how an intervention delivered by humans compares to an AI delivery, if human delivery would be too expensive to be feasible, focus on a counterfactual where the intervention is not delivered.