# Minimum Viable Evaluations

## Level 1 - Model evaluation MVE

- Define 2-3 rubrics for model success, including at least one robust safety/guardrail metric, each computed on your Golden Dataset.
- In consultation with product and business owners, set a success criterion or threshold for each rubric/metric that the system must pass before it is ready for deployment.
- Develop a Golden Dataset of at least 30-50 items representing key, diverse user interactions.
- Establish a process for expert review of AI system responses to the inputs in the Golden Dataset as you iterate on your system configuration.
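
The threshold check described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `Rubric` class, the rubric names, the threshold values, and the pass/fail score format are all assumptions made for the example.

```python
from dataclasses import dataclass

MIN_ITEMS = 30  # lower bound of the 30-50 item guidance for the Golden Dataset


@dataclass
class Rubric:
    name: str
    threshold: float            # minimum pass rate agreed with product/business owners
    is_guardrail: bool = False  # True for safety/guardrail metrics


def pass_rates(scores, rubrics):
    """scores: one dict per Golden Dataset item, mapping rubric name -> bool."""
    return {
        r.name: sum(1 for item in scores if item[r.name]) / len(scores)
        for r in rubrics
    }


def ready_for_deployment(scores, rubrics):
    """Return (ready, pass rates, names of rubrics below their threshold)."""
    if len(scores) < MIN_ITEMS:
        raise ValueError(f"Golden Dataset has {len(scores)} items; need >= {MIN_ITEMS}")
    if not any(r.is_guardrail for r in rubrics):
        raise ValueError("At least one safety/guardrail rubric is required")
    rates = pass_rates(scores, rubrics)
    failures = [r.name for r in rubrics if rates[r.name] < r.threshold]
    return len(failures) == 0, rates, failures


# Hypothetical rubrics and expert-review results for a 40-item Golden Dataset.
rubrics = [
    Rubric("accuracy", 0.80),
    Rubric("tone", 0.75),
    Rubric("safety", 1.00, is_guardrail=True),
]
scores = [{"accuracy": i >= 4, "tone": True, "safety": i != 7} for i in range(40)]

ready, rates, failures = ready_for_deployment(scores, rubrics)
print(ready, failures)
```

In this made-up run, a single failed safety item blocks deployment because the guardrail threshold is set to 100%. Whether a guardrail should tolerate any failures is a decision for the product and business owners.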

## Level 2 - Product evaluation MVE

- Instrument the product to capture events automatically.
- Use the event data to produce two metrics: activation (used once) and retention (used repeatedly).
- Look for patterns in the data, and talk to users to identify opportunities for improvement.
- Test these improvement ideas against the two metrics with an A/B test.
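
As a rough sketch of how the two metrics might be derived from captured events, the example below assumes each event is a `(user_id, date)` pair for the product's core action. It treats "used once" as at least one event and "used repeatedly" as activity on two or more distinct days. Both definitions, and the function name, are assumptions for illustration; the right definitions depend on the product.

```python
from collections import defaultdict
from datetime import date


def activation_and_retention(events, signups):
    """events: iterable of (user_id, date) pairs; signups: set of user ids.

    Returns (activation rate among signups, retention rate among activated users).
    """
    days_active = defaultdict(set)
    for user_id, day in events:
        if user_id in signups:
            days_active[user_id].add(day)

    activated = set(days_active)  # used the core action at least once
    retained = {u for u, days in days_active.items() if len(days) >= 2}

    activation_rate = len(activated) / len(signups) if signups else 0.0
    retention_rate = len(retained) / len(activated) if activated else 0.0
    return activation_rate, retention_rate


# Hypothetical event log: user "c" signed up but never used the product.
signups = {"a", "b", "c", "d"}
events = [
    ("a", date(2024, 1, 1)), ("a", date(2024, 1, 8)),
    ("b", date(2024, 1, 2)), ("b", date(2024, 1, 2)),  # two events, same day
    ("d", date(2024, 1, 3)), ("d", date(2024, 1, 4)),
]
activation, retention = activation_and_retention(events, signups)
print(f"activation={activation:.2f} retention={retention:.2f}")
```

Measuring retention among activated users rather than all signups keeps the two metrics from moving together; reporting retention over all signups is an equally reasonable choice as long as it is applied consistently.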

## Level 3 - User evaluation MVE

- Define 1-2 outcome metrics tied to the theory of change (focus on the most decision-relevant cognitive/behavioral outcomes), and include at least one early-warning indicator of harm (e.g., over-reliance, disengagement).
- Combine at least one behavioral/trace metric with a brief, contextualized self-report measure (≤3 items) to capture meaningful user change.
- Include a minimal external check (e.g., focus group discussion, offline data, or stakeholder validation) to ensure on-platform measures reflect real-world outcomes.
- Consider testing product changes on selected outcomes using simple experimental methods (e.g., A/B tests).
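
For a binary outcome (for example, whether a user completed a target behavior), one simple way to analyze an A/B test is a two-proportion z-test. The sketch below uses only the Python standard library; the function name and the counts are hypothetical, and the test is one common choice rather than a method prescribed here.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Compare outcome rates between control (a) and variant (b).

    Returns (difference in rates, z statistic, two-sided p-value).
    Uses the pooled-variance normal approximation, which assumes
    reasonably large samples in both arms.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Hypothetical counts: 100 of 1,000 control users and 150 of 1,000
# variant users reached the outcome.
diff, z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"difference={diff:.3f} z={z:.2f} p={p_value:.4f}")
```

The same comparison should also be run on the early-warning harm indicator, since a change that improves the outcome metric can still increase over-reliance or disengagement.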

## Level 4 - Impact evaluation MVE

- Conduct an impact evaluation[^1] with a counterfactual[^2] and a sample size large enough to measure the key outcome(s) of interest, including among sub-populations of interest (e.g., by gender or geography).
- Implement strong version control, testing either a frozen version or a limited number of product versions.
- Collect cost data.
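
To give a sense of what "large enough" can mean, the sketch below applies the standard normal-approximation formula for comparing two proportions with equal-sized arms. The baseline and target rates, the 5% significance level, and the 80% power are illustrative assumptions; a real evaluation design would also need to account for clustering, attrition, and similar factors.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate participants needed in EACH arm to detect the difference
    between two outcome rates with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: the outcome rate is 40% without the intervention
# and is expected to rise to 50% with it.
n = sample_size_per_arm(0.40, 0.50)
print(n)  # 385 per arm under these assumptions
```

If the effect must be measured separately within a sub-population (e.g., by gender), each subgroup needs roughly this many participants per arm on its own, which is why subgroup analysis drives total sample size up quickly.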

***


[^1]: An MVE impact evaluation can also be done inexpensively. There are a number of resources on how to reduce costs and effort while still doing a rigorous impact evaluation.

[^2]: Choose the counterfactual judiciously. Focus on the policy-relevant choice. While it might be interesting to see how an intervention delivered by humans compares to AI delivery, if human delivery would be too expensive to be feasible, focus on a counterfactual where the intervention is not delivered.