Develop a golden dataset
By this step, you have defined a rubric (describing desirable system behaviors), metrics (quantitative measures of system performance), and scorers (tools or algorithms that calculate your metric values). To verify that your solution is actually improving along the rubric’s dimensions, you need a Golden Dataset: a set of records that each represent an ideal user interaction with the system. This dataset is your performance target. You will use it to benchmark the AI system’s performance over time, or to compare performance across different variants of the system.
Golden datasets include sample inputs to the AI system, paired outputs, and associated labels. Creating the dataset is often the most time-consuming part of a Level 1 evaluation, and it requires a cross-functional team (e.g., domain experts, annotators, product owners, and quality assurance staff). Inputs and outputs are often annotated by human raters, who create ideal reference answers or define how a given output should be scored against your rubric items and metrics.
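To make the structure concrete, here is a minimal sketch of what a single golden record could look like in Python. The class name, field names, label keys, and the sample question are illustrative assumptions, not a prescribed schema; adapt them to your own rubric and metrics.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class GoldenRecord:
    """One entry in a golden dataset (all field names are hypothetical)."""
    record_id: str
    user_input: str        # sample input to the AI system
    reference_output: str  # ideal answer, written by a domain expert
    labels: dict = field(default_factory=dict)  # annotations tied to rubric items
    annotator: str = ""    # who wrote or reviewed the reference answer


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


# Made-up example record, for illustration only.
records = [
    GoldenRecord(
        record_id="gd-0001",
        user_input="How often should I water maize seedlings?",
        reference_output="<expert-written reference answer goes here>",
        labels={"category": "in_scope", "language": "en"},
        annotator="agronomist_01",
    )
]

jsonl = to_jsonl(records)
```

Storing one record per line keeps the dataset easy to diff, review, and extend as annotators add or correct entries.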
We offer three different approaches to building this dataset:
Past Transaction Data: If you are adding AI to an existing application or program, leverage your historical data to define “ideal” inputs and outputs. For example, if human support staff have answered user queries in the past, extract high-quality, representative question-answer pairs from these non-AI interactions to form the Golden Dataset. You can use LLMs to pre-process or clean this data, and involve domain experts in labeling.
Human-Annotated Data: If you are building a new product, you must generate labeled datasets from scratch. To generate inputs, you may want to crowdsource initial questions from real potential users. To generate ideal outputs or responses, you will need domain experts (e.g., nurses, agriculture tech advisors, tutors). Because experts are expensive, you may be tempted to save time by using an LLM to generate the "ideal" answers to user queries and then inviting experts only to review, validate, and correct them. However, even experts take shortcuts; as reviewers, they are likely to skim and accept a "plausible" AI answer rather than rigorously correct it. This risks validating hallucinations or mediocrity, and it may result in a low-quality Golden Dataset (jeopardizing your entire product development process). We recommend having experts write answers to user queries from scratch.
Customized Public Datasets: If a high-quality public dataset exists that closely matches your use case, it can serve as a starting point. You can extract only the most relevant examples from the larger dataset, then augment or refine them to better reflect your specific context.
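For the Past Transaction Data approach, a first filtering pass over historical transcripts might look like the sketch below. It assumes each transcript is a dictionary with `question`, `answer`, and `rating` keys (a hypothetical user-satisfaction score); the thresholds are placeholders. The output is only a candidate list: domain experts still need to review and label each pair before it enters the Golden Dataset.

```python
def select_candidates(transcripts, min_rating=4, min_answer_chars=40):
    """Keep well-rated, non-trivial, de-duplicated question-answer pairs.

    The `rating` field and both thresholds are illustrative assumptions.
    """
    seen = set()
    candidates = []
    for t in transcripts:
        question = t["question"].strip()
        answer = t["answer"].strip()
        # Normalize case and whitespace so near-identical questions collapse.
        key = " ".join(question.lower().split())
        if t.get("rating", 0) < min_rating:
            continue  # low-rated interactions are poor "ideal" examples
        if len(answer) < min_answer_chars:
            continue  # very short answers are rarely useful references
        if key in seen:
            continue  # duplicate question
        seen.add(key)
        candidates.append({
            "user_input": question,
            "reference_output": answer,
            "expert_reviewed": False,  # flipped only after expert review
        })
    return candidates


# Made-up transcripts, for illustration only.
transcripts = [
    {"question": "How do I reset my PIN?", "rating": 5,
     "answer": "Open Settings, choose Security, then select Reset PIN and follow the prompts."},
    {"question": "how do I  reset my pin?", "rating": 5,
     "answer": "Open Settings, choose Security, then select Reset PIN and follow the prompts."},
    {"question": "Where is my refund?", "rating": 5, "answer": "Soon."},
    {"question": "Can I change my plan?", "rating": 2,
     "answer": "Yes, you can switch plans at any time from the Billing page of your account."},
]

candidates = select_candidates(transcripts)
```

In this example only the first transcript survives: the second is a duplicate question, the third answer is too short, and the fourth is rated too low.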
How do I know that my Golden Dataset is good enough?
Ideally, your Golden Dataset will cover the full range of user interactions you expect the product to support. Achieving that is nearly impossible because, no matter how thorough your planning, users will find new and surprising ways to interact with your AI system. Do not wait until the dataset is complete before moving forward; you will miss the opportunity to learn from real users. We strongly recommend adopting a mindset of “Minimum Viable Evaluation”: build the smallest dataset that adequately represents key user interactions. You may need to conduct qualitative research or user observation sessions to identify these. Specifically, your dataset should include:
Various modes of user behavior: Consider not just what users will ask, but also how they will ask. This includes the communication medium (e.g., voice, text) as well as the tone, language, and phrasing. User interactions may reflect code-switching, informality, spelling errors, and varying levels of verbosity. There may also be multiple “personas” or user demographic groups. Should your AI pipeline take the gender of the user into account when responding? Consider the diversity of user types and interaction modes that your solution should support, and ensure that your Golden Dataset represents these cases.
Out-of-scope requests: Remember to include inputs or questions that the AI application does not support, to ensure that they are handled appropriately. These might be off-topic requests or topical requests that you do not want to handle, e.g., “write me a poem about pregnant women eating avocados”.
Adversarial or malicious requests: We recommend including malicious or unsafe inputs (e.g., abusive messages) in the Golden Dataset to test safety performance as well. You might also include examples of prompt injection, jailbreaking, and data or privacy attacks.
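One lightweight way to check a “Minimum Viable Evaluation” dataset against the three kinds of input listed above is to count records per category and flag any gaps. The sketch below assumes each record carries a `category` label with the hypothetical values `in_scope`, `out_of_scope`, and `adversarial`; the same pattern could be repeated for language, medium, or persona labels.

```python
from collections import Counter

# Hypothetical category labels; rename to match your own annotation scheme.
REQUIRED_CATEGORIES = {"in_scope", "out_of_scope", "adversarial"}


def coverage_report(records, required=REQUIRED_CATEGORIES, min_per_category=1):
    """Return per-category counts and the required categories that fall short."""
    counts = Counter(r["category"] for r in records)
    gaps = sorted(c for c in required if counts[c] < min_per_category)
    return dict(counts), gaps


# Made-up records, for illustration only.
sample = [
    {"category": "in_scope"},
    {"category": "in_scope"},
    {"category": "out_of_scope"},
]

counts, gaps = coverage_report(sample)
```

Here `gaps` contains `adversarial`, signalling that the dataset has no adversarial examples yet. A count threshold only shows that a category is present; whether the examples are representative still calls for human judgment.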
Some teams start with expert-suggested questions and then use an LLM to generate variations in different tones, dialects, or levels of verbosity. But synthetic generation of input/output pairs can be risky, because the languages, cultures, and dialogue patterns of people in the “global majority” are under-represented in foundation models. Most commercially available models are trained on published materials and online content, which are heavily biased toward higher-income contributors in wealthier countries and exclude content and norms from oral communication.