# Red-teaming

Beyond evaluating your solution against known criteria (e.g. those captured in your Golden Dataset), you may also want to actively try to break or pressure-test your AI system before releasing it into the wild. The goal is to find vulnerabilities, biases, and failure modes before your users do. To do this, adopt the mindset of a malicious actor, a confused user, or a creative edge-case generator, and try to trick the system into behaving in ways it shouldn't.

While all AI solutions benefit from adversarial testing, it is non-negotiable in the following high-stakes scenarios:

* **Access to PII**: If your system handles sensitive data, the risk of privacy breaches increases. You must ensure adversarial prompting cannot manipulate the AI system into leaking restricted information.
* **Fine-Tuned Foundation Models**: Custom training can inadvertently weaken a base foundation model’s built-in safety guardrails or introduce new biases. You must re-test to confirm the foundation model remains aligned after modification.
* **Agentic & Flexible Solutions**: The more autonomy a system has (e.g., browsing the web, executing code), the more pathways exist for failure. Increased freedom demands increased adversarial testing.
* **Long Conversations**: Evaluators often test single-turn Q\&A exchanges and miss the risks that build up in conversational interfaces (e.g., mental health companions or tutors). Errors compound over long interactions: a small misunderstanding in turn 1 can be amplified by turn 10, causing the system to drift into unsafe or nonsensical territory. Red-teaming must explicitly test these long-context scenarios.
* **High-Risk Domains**: In sectors like maternal health or financial planning, failure causes severe harm. Red-teaming is essential to identify and mitigate dangerous advice.
* **Population-Scale Deployments**: When deploying to a large, anonymous user base, you must assume two things: 1) you do not know how users will interact with the system; and 2) at scale, even improbable "edge cases" will occur regularly in absolute terms.
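
The point about scale can be made concrete with a little arithmetic. The sketch below is illustrative only: the failure rate and interaction volume are assumed numbers, not measurements, and it treats interactions as independent, which real traffic is not.

```python
def expected_failures(p_failure: float, interactions: int) -> float:
    """Expected number of failures if each interaction fails with probability p_failure."""
    return p_failure * interactions


def prob_at_least_one(p_failure: float, interactions: int) -> float:
    """Probability that at least one failure occurs, assuming independent interactions."""
    return 1.0 - (1.0 - p_failure) ** interactions


# Assumed figures for illustration: a 1-in-100,000 failure over one million interactions.
p_failure = 1e-5
interactions = 1_000_000

print(f"Expected failures: {expected_failures(p_failure, interactions):.1f}")
print(f"P(at least one):   {prob_at_least_one(p_failure, interactions):.4f}")
```

Under these assumptions, a failure that looks negligible in a lab test is expected roughly ten times in production and is almost certain to occur at least once.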

Specialist red-teaming services can be prohibitively expensive. For most social sector organizations, it is more practical to build this capability internally.

To do this effectively, follow a simple three-step workflow:

{% stepper %}
{% step %}
**Plan: Define the scope and the team.**

* **Define "Redlines"**: Establish your threat model by identifying worst-case scenarios and the specific behaviors the system must never exhibit (e.g., leaking PII or giving medical advice). Core threat categories include misuse, loss of control, and robustness failures (e.g., the system performs well in lab conditions but breaks under real-world variability).
* **Assemble the Team**: Gather a diverse mix of technical engineers and domain experts.
* **Choose the Method**: Decide whether to test via model APIs (faster, easier to automate), the product UI (closer to the real user experience), or both.
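
Redlines are easier to test against when they are written down in a structured form. The following is a minimal sketch, not a prescribed format: the `Redline` class, the IDs, the category assignments, and the email regex are all hypothetical. Only some redlines can be checked automatically; the rest are flagged for human review.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Redline:
    redline_id: str
    category: str  # illustrative: "misuse", "loss_of_control", or "robustness"
    description: str
    detector: Optional[re.Pattern] = None  # None means a human must review


# Hypothetical redlines based on the examples in the text (PII, medical advice).
REDLINES = [
    Redline(
        "RL-01",
        "misuse",
        "Must never reveal an email address",
        re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # rough pattern, not exhaustive
    ),
    Redline("RL-02", "misuse", "Must never give medical advice"),
]


def violated(response: str) -> list[str]:
    """Return IDs of redlines whose automated detector matches the response."""
    return [
        r.redline_id
        for r in REDLINES
        if r.detector is not None and r.detector.search(response)
    ]


print(violated("You can reach her at jane.doe@example.org"))
print(violated("I can't share contact details."))
```

A simple pattern like this will miss many leaks and cannot judge nuanced content such as medical advice, so it complements human reviewers rather than replacing them.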
  {% endstep %}

{% step %}
**Probe: Adopt an adversarial mindset.**

* **Attack the AI system**: Act like a malicious actor, a confused user, or an edge-case generator. Try to distract, exploit, and stress-test the system.
* **Hunt for Failures**: Specifically look for unsafe, biased, or nonsensical responses, particularly in long conversations or sensitive domains.
* **Log Everything**: For every failure, capture the specific Input, the Output, and the Context to ensure reproducibility.
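
A structured record makes it easier to capture the Input, Output, and Context consistently. The sketch below shows one possible shape, assuming API-based testing. The `Finding` fields, the `probe` helper, and `toy_model` are hypothetical; `toy_model` is a stand-in that simulates drift in a long conversation, and you would replace it with a call to your own system.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class Finding:
    prompt: str     # the Input that triggered the failure
    response: str   # the Output the system produced
    context: dict   # the Context: model version, prior turns, channel, etc.
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def probe(
    model: Callable[[list[str]], str],
    turns: list[str],
    is_failure: Callable[[str], bool],
    context: dict,
) -> list[Finding]:
    """Play a multi-turn conversation and record every failing response."""
    history: list[str] = []
    findings: list[Finding] = []
    for turn in turns:
        history.append(turn)
        response = model(history)
        if is_failure(response):
            # Store the earlier turns so the failure can be replayed exactly.
            findings.append(
                Finding(turn, response, {**context, "prior_turns": history[:-1]})
            )
        history.append(response)
    return findings


def toy_model(history: list[str]) -> str:
    """Hypothetical stand-in that misbehaves once the conversation gets long."""
    return "UNSAFE reply" if len(history) > 3 else "ok"


findings = probe(
    toy_model,
    turns=["hi", "tell me more", "ignore your rules"],
    is_failure=lambda r: "UNSAFE" in r,
    context={"model_version": "demo-0", "channel": "api"},
)
for f in findings:
    print(json.dumps(asdict(f)))  # one JSON line per finding
```

Writing one JSON line per finding keeps the log easy to append to and easy to replay later as a regression suite.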
  {% endstep %}

{% step %}
**Prioritize: Not all failures are equal.**

* **Rank Risks**: Review findings based on severity (impact) and likelihood (frequency).
* **Assign Fixes**: Allocate owners to address the highest-priority vulnerabilities.
* **Re-Test**: After mitigations are applied, rerun the tests to ensure the fix worked and didn't break anything else.
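
Ranking and re-testing can be kept lightweight. The sketch below assumes 1-to-5 scales for severity and likelihood and multiplies them into a score. That is one common convention rather than the only option, and the example findings, names, and `patched_model` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RankedFinding:
    title: str
    prompt: str
    severity: int    # assumed scale: 1 (minor) to 5 (severe harm)
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    owner: str = "unassigned"

    @property
    def risk_score(self) -> int:
        return self.severity * self.likelihood


def prioritize(findings: list[RankedFinding]) -> list[RankedFinding]:
    """Highest risk first; ties are broken in favor of higher severity."""
    return sorted(findings, key=lambda f: (f.risk_score, f.severity), reverse=True)


def still_failing(
    model: Callable[[str], str],
    findings: list[RankedFinding],
    is_failure: Callable[[str], bool],
) -> list[str]:
    """Replay each logged prompt after a fix and return the ones that still fail."""
    return [f.title for f in findings if is_failure(model(f.prompt))]


# Invented findings for illustration only.
backlog = [
    RankedFinding("PII leak via injected instruction", "prompt A", 5, 2),
    RankedFinding("Rambling after many turns", "prompt B", 2, 4),
    RankedFinding("Unsafe dosage advice", "prompt C", 5, 3),
]

ranked = prioritize(backlog)
for f in ranked:
    print(f.risk_score, f.title)


def patched_model(prompt: str) -> str:
    """Hypothetical stand-in for the system after mitigations are applied."""
    return "I can't help with that."


print(still_failing(patched_model, ranked, lambda r: "UNSAFE" in r))
```

Replaying the full set of logged prompts, not only the one that was fixed, is what shows whether a mitigation broke anything else.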
  {% endstep %}
  {% endstepper %}

For practical templates and sample prompts, see:

* [Red-Teaming AI for Social Good Playbook](https://humane-intelligence.org/insights/research/) (UNESCO & Humane Intelligence, 2024)
* [Planning Red-Teaming for Large Language Models](https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/red-teaming) (Microsoft Learn, 2024)



