Define 2-3 rubrics for model success, including at least one robust safety/guardrail metric, all computed on your Golden Dataset.
In consultation with product and business owners, set a success criterion or threshold for each rubric/metric that must be met before the system is ready for deployment
Develop a Golden Dataset with at least 30-50 items representing key, diverse user interactions
Establish a process for expert review of AI system responses for inputs in the Golden Dataset, as you iterate on your system configuration
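The rubric and threshold steps above can be sketched as a simple deployment gate. This is a minimal illustration, not a prescribed implementation: the `Threshold` class, the function names, the 0-to-1 score scale, and the use of a mean score per rubric are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str      # rubric/metric name (hypothetical, e.g. "safety")
    minimum: float   # lowest acceptable mean score, assumed to be on a 0..1 scale


def mean_scores(reviews):
    """Average each metric over all reviewed Golden Dataset items.

    `reviews` is a list of dicts, one per item, mapping metric name -> score in [0, 1].
    """
    collected = {}
    for review in reviews:
        for metric, score in review.items():
            collected.setdefault(metric, []).append(score)
    return {metric: sum(values) / len(values) for metric, values in collected.items()}


def deployment_gate(reviews, thresholds):
    """Return (passed, scores, failures) for the agreed thresholds.

    A metric with no scores at all is treated as failing.
    """
    scores = mean_scores(reviews)
    failures = [t.metric for t in thresholds if scores.get(t.metric, 0.0) < t.minimum]
    return len(failures) == 0, scores, failures


if __name__ == "__main__":
    # Illustrative expert-review scores for three Golden Dataset items.
    reviews = [
        {"accuracy": 1.0, "safety": 1.0},
        {"accuracy": 0.0, "safety": 1.0},
        {"accuracy": 1.0, "safety": 1.0},
    ]
    thresholds = [Threshold("accuracy", 0.8), Threshold("safety", 1.0)]
    passed, scores, failures = deployment_gate(reviews, thresholds)
    print(passed, scores, failures)
```

Re-running the gate after each configuration change keeps the comparison against the agreed thresholds consistent across iterations.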
Instrument the product to capture events automatically
Use the event data to produce two metrics: activation (used once) and retention (used repeatedly)
Look for patterns in the data, and talk to users to identify opportunities for improvement
Test these ideas for improvement against these metrics with an A/B test
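As a rough sketch of how the two metrics above might be derived from captured events: the definitions used here are assumptions for illustration only. Activation is taken as the share of signed-up users with at least one event, and retention as the share of activated users with events on two or more distinct days. The function name and the event format are hypothetical.

```python
from collections import defaultdict
from datetime import date


def activation_and_retention(signed_up_users, events):
    """Compute (activation, retention) from raw usage events.

    `events` is an iterable of (user_id, day) pairs, where `day` is a datetime.date.
    """
    active_days = defaultdict(set)
    for user_id, day in events:
        active_days[user_id].add(day)

    signed_up = set(signed_up_users)
    # Activated: used the product at least once.
    activated = {user for user in signed_up if user in active_days}
    # Retained: came back on at least one other day (assumed definition).
    retained = {user for user in activated if len(active_days[user]) >= 2}

    activation = len(activated) / len(signed_up) if signed_up else 0.0
    retention = len(retained) / len(activated) if activated else 0.0
    return activation, retention


if __name__ == "__main__":
    users = ["a", "b", "c", "d"]
    events = [
        ("a", date(2024, 1, 1)),
        ("a", date(2024, 1, 8)),
        ("b", date(2024, 1, 1)),
        ("b", date(2024, 1, 1)),  # second event on the same day, not a return visit
    ]
    print(activation_and_retention(users, events))
```

The retention window (any two days here) is the main design choice; a product team would normally fix it to a period that matches the expected usage rhythm.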
Define 1-2 outcome metrics tied to the theory of change (focus on the most decision-relevant cognitive/behavioral outcomes), and include at least one early-warning indicator of harm (e.g., over-reliance, disengagement).
Combine at least one behavioral/trace metric with a brief, contextualized self-report measure (≤3 items) to capture meaningful user change.
Include a minimal external check (e.g., focused group discussion, offline data, or stakeholder validation) to ensure on-platform measures reflect real-world outcomes.
Consider testing product changes on selected outcomes using simple experimental methods (e.g., A/B tests)
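One simple way to read the result of such an A/B test on a binary outcome (for example, whether a user was retained) is a two-proportion z-test. The sketch below is an illustration under stated assumptions, not the method the checklist mandates: the function name is hypothetical, and the test assumes independent users and samples large enough for the normal approximation.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference in proportions between arms A and B.

    Returns (z, p_value). A positive z means arm B has the higher rate.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both arms share one rate.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        raise ValueError("No variation in outcomes; the test is undefined.")
    z = (p_b - p_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Illustrative counts: 50 of 100 users retained in control, 70 of 100 in the variant.
    z, p_value = two_proportion_z_test(50, 100, 70, 100)
    print(round(z, 2), round(p_value, 4))
```

For continuous outcomes such as a self-report score, a different test (for example, a t-test) would be the usual choice.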
Conduct an impact evaluation with a counterfactual and a sample size large enough to measure the key outcome(s) of interest, including among sub-populations of interest (e.g., by gender, geography)
Implement strong version control with either a frozen version or a limited number of product versions to be tested
Collect cost data
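To make "enough of a sample size" in the impact evaluation item concrete, the following sketch applies the standard normal-approximation formula for comparing two proportions with equal-sized arms. The function name, the default significance level (0.05), and the default power (0.8) are assumptions for illustration; a real evaluation design would also account for clustering, attrition, and multiple outcomes.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate participants needed in EACH arm to detect the given difference.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2.
    """
    effect = p_treatment - p_control
    if effect == 0:
        raise ValueError("The two rates must differ to compute a sample size.")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Illustrative rates: 50% outcome rate in the counterfactual group, 60% with the product.
    print(sample_size_per_arm(0.50, 0.60))
```

If the same effect needs to be detected within a sub-population (for example, one gender or one region), that sub-population on its own needs roughly this many participants per arm, which is why subgroup analysis tends to drive the overall sample size.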