# Tools & Templates

## Level 1

### LLM evaluations

* [LLM Evals: Everything You Need to Know](https://hamel.dev/blog/posts/evals-faq/)
* [Multi-Turn Chat Evals](https://hamel.dev/notes/llm/officehours/evalmultiturn.html)
* [How do I evaluate agentic workflows?](https://hamel.dev/blog/posts/evals-faq/#q-how-do-i-evaluate-agentic-workflows)
* [Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)
* [Hierarchical AI Evaluation](https://gamma.app/docs/AI-QA-Hierarchical-Evaluation-Architecture-9t79y026n43d7op?mode=doc) by Gamma

***

### LLM evaluation in the social sector

* [Generative AI for Health in Low & Middle Income Countries](https://cdh.stanford.edu/research-portfolio/generative-ai-health-low-middle-income-countries)
* [Evaluation framework of PROMPTS at Jacaranda Health](https://cdh.stanford.edu/generative-ai-health-low-middle-income-countries) (p. 33)
* [Evaluation framework at Precision Development](https://precisiondev.org/evaluating-ai-for-learning-a-framework/) ([slide](https://docs.google.com/presentation/d/1agCgpDWNVWtbOFhdlDYUpLM3OxyHP5CxyzON_tn61x0/edit?slide=id.p#slide=id.p))
* [Evaluation of Farmer.Chat at Digital Green](https://arxiv.org/abs/2409.08916)
* [Evaluation of mMitra at Armman](https://docs.google.com/presentation/d/1mAF1lI8tkTjLLW3SjwrV8mdz4VDkTdog/edit?slide=id.p1#slide=id.p1)

## Level 2

The tech industry has published numerous guidebooks and tools to help you define, collect, and analyze user funnel metrics. For details on how to construct common metrics, see [The Agency Fund’s User Funnel Playbook](https://theagencyfund.substack.com/p/user-funnel-playbook-for-the-social).

These reference materials may also be useful:

* [The Amplitude Guide to Product Metrics](https://info.amplitude.com/rs/138-CDN-550/images/The%20Amplitude%20Guide%20to%20Product%20Metrics.pdf)
* [User Analytics for ChatGPT Enterprise and Edu](https://help.openai.com/en/articles/10875114-user-analytics-for-chatgpt-enterprise-and-edu-public-beta)
* [What We Know About Using Non-Engagement Signals in Content Ranking](https://arxiv.org/abs/2402.06831)

For more details on A/B testing, please review these resources:

* [Youth Impact: A/B Testing Toolkit](https://www.youth-impact.org/insights/a-b-testing-toolkit)
* [Optimizely: What is A/B testing?](https://www.optimizely.com/optimization-glossary/ab-testing/)
* [Amplitude: What is A/B testing? How it works and when to use it](https://amplitude.com/blog/ab-testing)

## Level 3

Case study: [ChatSEL](https://agency-fund.github.io/chatsel-docs/docs/t1-intro) is a GenAI coach developed at the Agency Fund. It gives teachers evidence-based, context-sensitive guidance on understanding and implementing social and emotional learning (SEL) programs in low-resource classrooms. The following document shows how we might measure Level 3 outcomes in the context of ChatSEL.

[User Evaluation Workshop - ChatSEL](https://docs.google.com/document/d/18AXtIeDx6HsidhMKTJ2kIDb7hUwHEkEnwGPZuC9JJo0/edit?tab=t.0)

## Process Evaluations

* IDinsight. “Process Evaluation.” IDinsight Impact Measurement Guide. <https://guide.idinsight.org/process-evaluation/>
* World Health Organization. Monitoring and Evaluating Digital Health Interventions: A Practical Guide to Conducting Research and Assessment. World Health Organization, 2016. <https://saluddigital.com/wp-content/uploads/2019/06/WHO.-Monitoring-and-Evaluating-Digital-Health-Interventions.pdf>
* Bliss, M. J., & Emshoff, J. G. (2018). Implementation Monitoring and Process Evaluation (Practical Guidebook). SAGE Publications.
