# Decide on metrics

Once you have defined a rubric, the next step is to define metrics you will use to track performance along each dimension in the rubric. The metrics you define can range from “benchmarks” (i.e., industry-standard metrics that evaluate foundation model performance on common tasks) to context-specific measures that examine whether the system performs for your specific use case.

We advise focusing on metrics that assess the AI system against the criteria that matter most for your solution. Industry benchmarks are primarily used to choose the right foundation model for your context, enabling comparisons on common measures like word error rate (for automatic speech recognition tasks) or accuracy (for classification tasks). More specific measures should be used to track performance over time and to evaluate the effectiveness of modifications.

To actually compute metrics, data scientists and engineers will define “scorers” (i.e., algorithms or analytic strategies to assess the AI system against a performance target). Scorers typically fall under one of these categories, each with its own pros and cons:

* **Statistical and model-based scorers** are designed to deliver metrics for narrow, specific tasks. You cannot use them interchangeably. Examples of common metrics (and the associated analytic strategies) include:
  * [Precision/Recall/F1](https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall): For measuring classification accuracy
  * [Word Error Rate](https://en.wikipedia.org/wiki/Word_error_rate) (WER): For accuracy of transcription in speech recognition
  * [AlignScore](https://github.com/yuh-zha/AlignScore): For checking factual consistency

{% hint style="warning" %}
Be aware of the weaknesses of each method. For instance, metrics like WER only check the word overlap between the predicted and reference transcripts and do not compare their meaning, which makes them less reliable. For meaning preservation, consider [alternative methods](https://research.google/blog/assessing-asr-performance-with-meaning-preservation/).
{% endhint %}
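To make the warning above concrete, here is a minimal sketch of WER as a word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not part of any particular library; in practice you would likely use an established package rather than this hand-rolled version.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (made up for this sketch).
reference = "take the tablet twice a day"
same_meaning = "take the pill two times a day"   # paraphrase, same guidance
wrong_meaning = "take the tablet once a day"     # one word off, wrong dose

print(word_error_rate(reference, same_meaning))   # 0.5
print(word_error_rate(reference, wrong_meaning))  # about 0.17
```

In this toy example the harmless paraphrase receives a worse score than the transcript that changes the dose, which is exactly the weakness the warning describes.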

* **LLM-as-Judge** uses an LLM to score AI system outputs flexibly and comes in many variants. Approaches include:
  * Direct Prompting: Asking the LLM to score the output based on a text-encoded rubric.
  * Comparison with reference: Asking the LLM to score the output by comparing it to a reference answer.
  * Chain-of-Thought: Asking the LLM to explain its reasoning before scoring.
  * Claim Extraction: Breaking a response into specific claims and checking each against a reference text (ideal for hallucination detection).

{% hint style="warning" %}
[Evidence suggests](https://aclanthology.org/2024.findings-naacl.148.pdf) LLM-as-judge methods may perform poorly when evaluating low-resource languages.
{% endhint %}
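As a rough illustration of direct prompting combined with chain-of-thought, the sketch below builds a judge prompt from a rubric item and parses a binary verdict from the reply. Everything named here (`JUDGE_PROMPT`, `judge`, `call_llm`, `fake_llm`, and the example question) is hypothetical: `call_llm` stands for whatever function sends a prompt to your chosen model and returns its text, and `fake_llm` exists only so the sketch runs without a model.

```python
from typing import Callable, Optional

# Hypothetical prompt template; adapt the wording to your own rubric.
JUDGE_PROMPT = """You are reviewing an answer produced by an AI assistant.

Rubric item: {rubric_item}
What it means: {definition}

User question:
{question}

Assistant answer:
{answer}

Explain your reasoning in one or two sentences.
Then finish with a final line that reads exactly VERDICT: PASS or VERDICT: FAIL.
"""


def judge(question: str, answer: str, rubric_item: str, definition: str,
          call_llm: Callable[[str], str]) -> dict:
    """Ask an LLM for reasoning plus a binary verdict on one rubric item."""
    prompt = JUDGE_PROMPT.format(
        rubric_item=rubric_item, definition=definition,
        question=question, answer=answer,
    )
    raw = call_llm(prompt)
    lines = raw.strip().splitlines()
    last_line = lines[-1].strip().upper() if lines else ""
    verdicts = {"VERDICT: PASS": True, "VERDICT: FAIL": False}
    # None flags an unparseable reply so it can be retried or reviewed by hand.
    passed: Optional[bool] = verdicts.get(last_line)
    return {"passed": passed, "raw": raw}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return "The answer gives a concrete next step for the farmer.\nVERDICT: PASS"


result = judge(
    question="My maize leaves have holes. What should I do?",
    answer="Check the whorl for fall armyworm larvae and remove them by hand.",
    rubric_item="helpfulness",
    definition="The answer gives the user a clear, actionable next step.",
    call_llm=fake_llm,
)
print(result["passed"])
```

Keeping the raw reply alongside the parsed verdict makes it easier to audit the judge's reasoning later, and returning `None` for malformed replies avoids silently counting them as failures.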

* **Human-as-Judge** remains the "gold standard" for catching subtle nuances and context that automated scoring tools miss. However, human raters are slow, expensive, and prone to [their own biases](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/basics.md). Therefore, do not use humans to score your entire dataset. Instead, reserve them for high-leverage tasks:
  * Prototyping: Human feedback will help you move faster at the beginning, when no LLM judges exist.
  * Rubric creation: Humans create better rubrics after having reviewed a few outputs themselves.
  * Alignment: Check if the LLM judges are aligned to human experts. Set aside a small set of inputs, obtain both the LLM and human judgements and compare to ensure they are aligned.
  * Quality Assurance: Perform a final human safety check on high-stakes examples before a major launch.

The examples below provide a high-level view of common existing scoring methods, though they are not comprehensive. Each has its pros and cons, and the ideal set of metrics and analytic strategies will likely combine several of these approaches.

<table data-header-hidden data-full-width="true"><thead><tr><th>Method / Scorer</th><th width="142.7734375">Example Metrics</th><th width="254.33984375">Example Use Case</th><th>Pros/Cons</th></tr></thead><tbody>
<tr><td><p><strong>Statistical scorers</strong></p><p>These are based on the words in the LLM output and don’t take the semantic meaning into account.</p></td><td><a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall">Precision/Recall/F1</a>, <a href="https://www.geeksforgeeks.org/maths/mean-squared-error/">Mean squared error</a>, <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a>, <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a>, <a href="https://en.wikipedia.org/wiki/METEOR">METEOR</a>, <a href="https://en.wikipedia.org/wiki/Word_error_rate">WER</a></td><td>An NGO evaluates a literacy chatbot that generates short reading comprehension questions in Swahili. BLEU and ROUGE are used to compare the chatbot’s questions to a set of human-written reference questions to assess linguistic overlap.</td><td><p>Speed: <i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i></p><p>Accuracy: <i class="fa-star">:star:</i></p><p>Cost (lower is better): <i class="fa-star">:star:</i></p></td></tr>
<tr><td><p><strong>Model-based scorers</strong></p><p>These are small language models trained to do one specific task.</p></td><td><a href="https://github.com/yuh-zha/AlignScore">AlignScore</a> / <a href="https://arxiv.org/pdf/2404.06579">LIM-RA</a>, <a href="https://github.com/google-research/bleurt">BLEURT</a>, <a href="https://arxiv.org/pdf/2106.11520">BARTScore</a>, <a href="https://unbabel.github.io/COMET/html/index.html">COMET</a></td><td>A health information NGO uses BLEURT, a pre-trained model designed to score text quality, to evaluate the responses of an AI assistant that explains vaccination schedules to parents. The model-based scorer assesses how semantically faithful and understandable each generated message is compared to a trusted reference explanation.</td><td><p>Speed: <i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i></p><p>Accuracy: <i class="fa-star">:star:</i><i class="fa-star">:star:</i></p><p>Cost (lower is better): <i class="fa-star">:star:</i><i class="fa-star">:star:</i></p></td></tr>
<tr><td><p><strong>LLM-based scorers (a.k.a. LLM-as-judge)</strong></p><p>Since they use LLMs, they are flexible and powerful, but they can also be expensive and slow.</p></td><td><a href="https://arxiv.org/abs/2303.16634">G-Eval</a>, <a href="https://arxiv.org/abs/2210.08726">RARR</a></td><td>A digital agriculture platform uses a large language model (LLM) as a judge to evaluate the quality of pest management advice generated by smaller domain models. The LLM judge scores each message for accuracy, clarity, and farmer-friendliness, comparing them to expert agronomist responses.</td><td><p>Speed: <i class="fa-star">:star:</i><i class="fa-star">:star:</i></p><p>Accuracy: <i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i></p><p>Cost (lower is better): <i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i></p></td></tr>
<tr><td><p><strong>Human evaluation</strong></p><p>For tasks requiring nuance and complex reasoning, or detecting subtle hallucinations, humans are ideal, though not without their <a href="https://arxiv.org/pdf/2307.03025">own</a> <a href="https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/basics.md">biases</a>.</p></td><td><a href="https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/basics.md">Human evaluation</a></td><td>A mental health NGO tests a GenAI counseling tool for youth. Human evaluators (e.g., psychologists and peer mentors) manually rate the empathy, appropriateness, and emotional resonance of responses.</td><td><p>Speed: <i class="fa-star">:star:</i></p><p>Accuracy: <i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i></p><p>Cost (lower is better): <i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i><i class="fa-star">:star:</i></p></td></tr>
</tbody></table>

#### How do I know that I have selected the right metrics?

To choose the right metrics for your rubric, you must bridge the gap between "what we value" (the qualitative rubric defined by product managers) and "what we can measure" (the quantitative scorers implemented by engineers). The process of selecting metrics is a translation exercise between roles:

<table data-view="cards"><thead><tr><th></th><th></th><th data-hidden data-card-cover data-type="image">Cover image</th></tr></thead><tbody><tr><td><strong>Goal-setting</strong></td><td><div data-gb-custom-block data-tag="hint" data-style="info" class="hint hint-info"><p>Product Owners / Domain Experts</p></div><p>Define the qualitative goal. For example, "The AI should be trustworthy".</p></td><td><a href="https://images.unsplash.com/photo-1628440501245-393606514a9e?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw3fHx0YXJnZXR8ZW58MHx8fHwxNzcyNjQyNDM2fDA&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1628440501245-393606514a9e?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw3fHx0YXJnZXR8ZW58MHx8fHwxNzcyNjQyNDM2fDA&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr><tr><td><strong>Measurement</strong></td><td><div data-gb-custom-block data-tag="hint" data-style="info" class="hint hint-info"><p>Engineers</p></div><p>Map the goal to a measurable proxy. 
For "trustworthy," you might select a Factual Consistency Score or an AlignScore.</p></td><td><a href="https://images.unsplash.com/photo-1602503497726-dc6cfaab7e17?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxtZWFzdXJlfGVufDB8fHx8MTc3MjY0MjQ0Nnww&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1602503497726-dc6cfaab7e17?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHw0fHxtZWFzdXJlfGVufDB8fHx8MTc3MjY0MjQ0Nnww&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr><tr><td><strong>Validation</strong></td><td><div data-gb-custom-block data-tag="hint" data-style="info" class="hint hint-info"><p>Product Owners</p></div><p>Review the technical metric to ensure it accurately reflects the organization’s intent (or intended impact).</p></td><td><a href="https://images.unsplash.com/photo-1516382799247-87df95d790b7?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHwzfHxjaGVja3xlbnwwfHx8fDE3NzI2NDI0NTh8MA&#x26;ixlib=rb-4.1.0&#x26;q=85">https://images.unsplash.com/photo-1516382799247-87df95d790b7?crop=entropy&#x26;cs=srgb&#x26;fm=jpg&#x26;ixid=M3wxOTcwMjR8MHwxfHNlYXJjaHwzfHxjaGVja3xlbnwwfHx8fDE3NzI2NDI0NTh8MA&#x26;ixlib=rb-4.1.0&#x26;q=85</a></td></tr></tbody></table>

Not all metrics work for all tasks. You must select your metric and "scorer" based on your needs for speed, cost, and nuance. Standard industry benchmarks often fail in development-sector contexts, particularly for low-resource languages or specific technical domains (like agriculture), so you may need to create a custom metric.

Remember, do not try to measure everything. While you may want your AI system to be "friendly, on-brand, concise, complete, and curious," a long list creates conflicting tradeoffs (e.g., concise vs. complete) and increases evaluation costs.

#### How do I know that my LLM-based scorer is working?

In our experience, it is difficult to build an LLM-as-judge workflow that is adequately aligned with human reviewers, especially in the language and cultural contexts we encounter in the development sector. If there is any room for ambiguity, LLMs will produce wild variation in judgement. They may also miss nuances that human experts apply implicitly in their evaluations. Unless the LLM judge is given precise instructions for handling different situations and nuances, its judgement will not match that of human experts. The process of tuning or instructing the LLM judge to make it consistent with human experts is called “alignment”. Here is an example of what such an alignment process looks like:

1. Create a set of 100-200 input/output pairs from the AI system, either by sampling real user queries or (if no user queries exist) by generating queries based on your knowledge of the key user interactions.
2. Pick a rubric item that is important to you, e.g. helpfulness, and write the instructions for an LLM judge on how to score it.
3. Have 2 independent human raters score the outputs for the same rubric item. It is strongly [recommended](https://hamel.dev/blog/posts/llm-judge/#step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques) to start by asking the raters to mark each output as a binary pass/fail instead of scoring it on a scale of 1 to 5 or 1 to 10. The linked resource explains the rationale; not starting with binary pass/fail ratings is one of the key reasons why teams fail to produce aligned LLM judges.
4. Calculate the [inter-annotator agreement](https://surge-ai.medium.com/inter-annotator-agreement-an-introduction-to-cohens-kappa-statistic-dcc15ffa5ac4) for your human reviewers: how closely do their ratings agree?
   1. If the agreement is low, work on calibration across reviewers, and iterate on the instructions for your rubric item (in this case, “helpfulness”) to clearly define what it means.
   2. Ask (ideally) a new set of independent reviewers to rate the outputs.
   3. Repeat this process until the agreement is high enough.
5. Once you obtain high agreement, run your LLM judge on this “alignment dataset” for that rubric item.
6. Check the agreement between the LLM judge’s scores and those of your human raters:
   1. If the agreement is high (> 0.8), you can be confident about the scores given by your LLM judge.
   2. If it is low, continue improving your LLM judge by modifying its prompt, updating the instructions for the rubric item, adding examples of input-output-judgement pairs, or using [more advanced methods](https://hamel.dev/blog/posts/llm-judge/). Then, repeat this step.
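The agreement checks in steps 4 and 6 can be sketched as follows. This assumes Cohen's kappa as the agreement statistic (the measure introduced in the resource linked in step 4) and reads the 0.8 threshold as a kappa value, which is our interpretation. The function name `cohens_kappa` and the ten pass/fail ratings (1 = pass, 0 = fail) are made up for illustration; a real alignment set would have 100-200 items.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(ratings_a: Sequence[Hashable], ratings_b: Sequence[Hashable]) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(ratings_a)
    # Share of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected if each rater assigned labels at random
    # according to their own label frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Made-up pass/fail ratings for ten outputs.
human_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
human_2 = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
llm_judge = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]

print(round(cohens_kappa(human_1, human_2), 2))    # 0.8  (step 4: humans vs. humans)
print(round(cohens_kappa(human_1, llm_judge), 2))  # 0.58 (step 6: judge vs. human)
```

In this toy data the two humans agree well with each other, while the judge falls short of the threshold, so the next move would be to revise the judge prompt and rerun step 6.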

For most use cases, performing the steps above diligently should give you a well-aligned LLM judge. However, you might face other foundational challenges:

* The LLM Judge may not work well on low-resource languages (as mentioned above, these foundation models are trained on datasets dominated by high-resource languages).
* It may not be possible to verbalise the nuances of what makes a response "good" for the specific use case.

For such cases, training a smaller foundation model for your specific use case (“fine-tuning”) might be needed. Explaining the details of [fine-tuning](https://parlance-labs.com/education/#fine-tuning) is beyond the scope of this playbook.

{% hint style="success" %}

### Case Studies

[Jacaranda Health (JH)](https://jacarandahealth.org/) recently added voice capabilities to its service for pregnant women and new mothers, for users with difficulty reading or seeing text. With voice, mothers can access maternal health guidance more easily. To train the foundational voice model, JH initially used audio samples from Mozilla Common Voice. However, the source had too many male voices and was not specific to their use case. They recorded a balanced Swahili‑English voice corpus from rural and urban mothers across Kenya, then fine‑tuned OpenAI’s Whisper model on that data. Over successive iterations, they drove Word Error Rate (WER) down from 87 percent to 15 percent, inching toward their 6 percent target (which matches the speech-to-text performance for top‑tier languages). Hitting each new milestone meant trading off the volume of diverse accents in the training set against the computing and annotation budget they had available.\ <img src="/files/Pg574WxLUqk5THyluIVr" alt="" data-size="original">\
They also modified their target metric as they iterated. Standard WER tallies substitutions, insertions, and deletions without regard for meaning. That metric penalizes Swahili’s flexible word order and complex verb forms, even when the intent is clear. As an alternative measure of the model’s performance, Jacaranda now measures semantic accuracy using a custom metric based on [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). This experimental approach rewards transcripts that convey the same health guidance, even if they differ in exact phrasing. It is therefore an example of a non-standard metric developed to make an AI system work in a new context. Jacaranda has been [transparent about their work](https://jacarandahealth.org/jacaranda-launches-open-source-llm-in-five-african-languages/) on [Swahili fine-tuned models](https://huggingface.co/Jacaranda), which has helped them capture community feedback and advance more quickly.
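To illustrate the general idea of a cosine-similarity score (not Jacaranda's actual metric, whose details are not described here), the sketch below compares two transcripts as simple word-count vectors. The function name and example sentences are assumptions made for illustration. A production metric would more likely compare sentence embeddings so that synonyms also count as similar; word counts only capture insensitivity to word order.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Made-up English sentences standing in for reference and transcript.
reference = "drink clean water every day"
reordered = "every day drink clean water"
unrelated = "the clinic opens on monday"

print(round(cosine_similarity(reference, reordered), 2))  # 1.0
print(round(cosine_similarity(reference, unrelated), 2))  # 0.0
```

The reordered transcript scores as a perfect match here, whereas a position-sensitive metric like WER would count most of its words as errors.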

\
In a similar vein, to benchmark Automatic Speech Recognition (ASR) models in agriculture, [Digital Green](https://digitalgreen.org) (DG) began with metrics such as Word Error Rate (WER), Character Error Rate (CER), and Match Error Rate (MER). However, they found they needed to introduce a custom Agri‑Weighted WER that penalizes errors in key agricultural terms more heavily. Using weighted metrics, DG could track progress on agricultural ASR performance across Hindi, Telugu, and Odia datasets and could tailor improvements to support scalable, farmer‑focused advisory systems.
{% endhint %}
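One way a term-weighted error rate like the one in the Digital Green case study could work is sketched below. This is not DG's actual formula, which is not given here. It assumes that substituting or deleting a reference word costs that word's weight, that insertions cost 1, and that the total is divided by the summed reference weights. The function name, the `key_weight` default of 3, and the example sentences are all illustrative.

```python
from typing import Set


def weighted_wer(reference: str, hypothesis: str,
                 key_terms: Set[str], key_weight: float = 3.0) -> float:
    """Edit distance in which errors on key reference terms cost more."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    weights = [key_weight if word in key_terms else 1.0 for word in ref]
    # prev[j]: cost of turning the reference words processed so far
    # into the first j hypothesis words (initially j insertions).
    prev = [float(j) for j in range(len(hyp) + 1)]
    for ref_word, weight in zip(ref, weights):
        curr = [prev[0] + weight]  # delete every reference word so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0.0 if ref_word == hyp_word else weight)
            deletion = prev[j] + weight
            insertion = curr[j - 1] + 1.0
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / sum(weights)


# Made-up example: both transcripts contain exactly one wrong word.
reference = "spray neem oil on the cotton crop"
key_terms = {"neem", "cotton"}
minor_error = "spray neem oil on a cotton crop"     # error on an ordinary word
key_error = "spray name oil on the cotton crop"     # error on a key term

print(round(weighted_wer(reference, minor_error, key_terms), 3))  # 0.091
print(round(weighted_wer(reference, key_error, key_terms), 3))    # 0.273
```

Plain WER would score both transcripts identically (one error in seven words), while the weighted version separates the mistake that changes the advice from the one that does not.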

Defining AI system metrics is an area of active research, and newer methods are being developed all the time. The [Hugging Face Evaluation Guidebook](https://github.com/huggingface/evaluation-guidebook) is a great resource for understanding model benchmarking and discovering the right metrics for your use case.
