Who is most involved in this level of evaluation?

Execute 🟢: Product Managers
Directly responsible for product metrics at this level. Work cross-functionally to prioritize the most promising hypotheses to test.

Support 🟡: Data Scientists
Apply evaluation methods with the proper measurement tools. Ensure the accuracy and availability of product metrics (data pipelines).

Why is this level of evaluation important?

An AI system that produces perfect responses is worthless if users do not use it. Once you deploy your AI system as a product (e.g., a chatbot or app), you must track a few critical user signals, like:

  • Engagement: How many users are using the product?

  • Retention: How likely are they to continue using it?

If users never engage, or stop interacting because they see no value, they are unlikely to change their behavior in ways that improve their life outcomes.
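As a rough illustration of the two signals above, the sketch below computes weekly active users (engagement) and week-over-week retention from a small event log. The log format, the `active_users` and `retention_rate` helpers, the seven-day window, and the sample data are all assumptions made for this example; a real product would define these metrics to match its own usage patterns.

```python
from datetime import date, timedelta

# Hypothetical event log: (user_id, day the user interacted with the product).
events = [
    ("u1", date(2024, 1, 1)),
    ("u1", date(2024, 1, 8)),
    ("u2", date(2024, 1, 1)),
    ("u3", date(2024, 1, 2)),
    ("u3", date(2024, 1, 9)),
    ("u4", date(2024, 1, 9)),
]


def active_users(events, start, end):
    """Engagement: distinct users with at least one event in [start, end)."""
    return {user for user, day in events if start <= day < end}


def retention_rate(events, cohort_start, period=timedelta(days=7)):
    """Retention: share of users active in one period who return in the next."""
    cohort = active_users(events, cohort_start, cohort_start + period)
    if not cohort:
        return 0.0
    returned = active_users(
        events, cohort_start + period, cohort_start + 2 * period
    )
    return len(cohort & returned) / len(cohort)


week1 = date(2024, 1, 1)
weekly_active = len(active_users(events, week1, week1 + timedelta(days=7)))
week1_retention = retention_rate(events, week1)
print(f"Weekly active users: {weekly_active}")
print(f"Week 1 retention: {week1_retention:.0%}")
```

Counting distinct users rather than raw events keeps one very active user from inflating the engagement figure.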

Like AI system evals, Level 2 evaluation is a continuous, iterative cycle, not a one-time exercise. We track user interaction metrics over time and look for unexpected drop-offs as well as intended improvements, for example when a promising new feature is released as part of an A/B test.

Product evaluations are critical for iterative improvement, but they can also be a matter of safety. Suppose you have an experimental new feature in your chatbot, but you are not sure how people will react. Rolling it out to all users at once could be risky.
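One common way to limit that risk is to expose a new feature to a small share of users first and compare their metrics with everyone else's. The sketch below shows one possible approach, not a prescribed method: users are assigned deterministically by hashing their ID, and retention in the two groups is compared with a two-proportion z-test. The function names, the 10% treatment share, and the experiment name are hypothetical choices for illustration.

```python
import hashlib
import math


def assign_variant(user_id, experiment, treatment_share=0.1):
    """Deterministically place a user in 'treatment' or 'control'.

    Hashing the experiment name with the user ID gives each user a stable
    bucket, so the same user always sees the same variant.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return the z statistic and two-sided p-value for rate B minus rate A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical experiment name and user ID.
variant = assign_variant("user-123", "new-chatbot-feature")

# Made-up counts: retained users out of all users in each group.
z, p_value = two_proportion_z(
    success_a=400, n_a=1000,  # control
    success_b=60, n_b=100,    # treatment
)
print(variant, round(z, 2), round(p_value, 4))
```

Keeping the treatment share small limits how many users see a feature that might backfire, at the cost of needing more time to collect enough data for a reliable comparison.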

