Agent Quality
Evaluation
Evals Flashcards – Hamel’s Blog - Hamel Husain
Notes on applied AI engineering, machine learning, and data science.
Evaluating Deep Agents: Our Learnings
Over the past month at LangChain, we shipped four applications on top of the Deep Agents harness:
* DeepAgents CLI: a coding agent
* LangSmith Assist: an in-app agent to help with various things in LangSmith
* Personal Email Assistant: an email assistant that learns from interactions with each user
* Agent Builder: a no-code agent building platform powered by meta deep agents
Building and shipping these agents meant adding evals for each of them, and we learned a lot along the way!
Parloa's Bayesian Framework to A/B Test AI Agents
Learn about our hierarchical Bayesian model for A/B testing AI agents. It combines deterministic binary metrics and LLM-judge scores into a single framework that accounts for variation across different groups
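The post covers a full hierarchical model; as a rough illustration of the simpler building block it rests on (not Parloa's actual framework), here is a minimal Beta-Binomial comparison of two agent variants on a single deterministic binary metric. The trial counts and the flat Beta(1, 1) prior are assumptions for the sketch.

```python
# Minimal sketch of Bayesian A/B testing on one binary agent metric
# (e.g. "conversation resolved: yes/no"). This is NOT Parloa's
# hierarchical model; it only illustrates the Beta-Binomial idea.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts for agent variants A and B.
n_a, k_a = 400, 312   # variant A: 400 conversations, 312 resolved
n_b, k_b = 380, 319   # variant B: 380 conversations, 319 resolved

# Flat Beta(1, 1) prior; posterior is Beta(1 + successes, 1 + failures).
post_a = rng.beta(1 + k_a, 1 + (n_a - k_a), size=100_000)
post_b = rng.beta(1 + k_b, 1 + (n_b - k_b), size=100_000)

# Posterior probability that B's resolution rate exceeds A's.
p_b_better = (post_b > post_a).mean()
print(f"P(B > A) ~ {p_b_better:.3f}")
```

A hierarchical version, as described in the post, additionally models variation across groups (e.g. customers or intents) by placing a shared prior over the per-group rates instead of pooling everything into one pair of counts.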
How to Correctly Report LLM-as-a-Judge Evaluations
Evaluation-Driven Development of LLM Agents: A Process Model and Reference Architecture
Unlike deterministic systems, an LLM agent’s output is often probabilistic, meaning multiple responses may be valid within a given scenario.
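Because several distinct responses can be valid for the same scenario, exact-string assertions make brittle tests; one common pattern is to check for required facts (or use a rubric/LLM judge) rather than compare against a single gold answer. A minimal sketch, with an illustrative checker and test case that are not from the paper:

```python
# Sketch: asserting on required facts instead of an exact gold string,
# since an LLM agent may phrase a correct response in many valid ways.
# The test case and checker below are illustrative assumptions.

def contains_required_facts(response: str, required: list[str]) -> bool:
    """Pass if every required fact appears in the response (case-insensitive)."""
    text = response.lower()
    return all(fact.lower() in text for fact in required)

valid_responses = [
    "Your refund of $42.50 was issued on March 3rd.",
    "We issued the $42.50 refund on March 3rd, so you should see it shortly.",
]
required_facts = ["$42.50", "march 3"]

# Both phrasings pass, even though they differ as strings.
assert all(contains_required_facts(r, required_facts) for r in valid_responses)
```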
Eval Driven System Design - From Prototype to Production
This cookbook provides a practical, end-to-end guide on how to effectively use evals as the core process in creating a production-grade a...
An LLM-as-Judge Won't Save The Product—Fixing Your Process Will
Applying the scientific method, building via eval-driven development, and monitoring AI output.
Building product evals is simply the scientific method in disguise. That’s the secret sauce. It’s a cycle of inquiry, experimentation, and analysis.
Building resilient prompts using an evaluation flywheel
This cookbook provides a practical guide on how to use the OpenAI Platform to easily build resilience into your prompts. A resilient prom...
Turbocharging Customer Support Chatbot Development with LLM-Based Automated Evaluation
Key Contributors: Lily Sierra, Nour Alkhatib, Steven Gross, Jacquelene Obeid, Kyle Swint, Monta Shen, Gary Song, Riddhima Sejpal, Jatin…
Evaluating Long-Context Question & Answer Systems
Evaluation metrics, how to build eval datasets, eval methodology, and a review of several benchmarks.
The "think" tool: Enabling Claude to stop and think \ Anthropic
A blog post for developers, describing a new method for complex tool-use situations
The primary evaluation metric used in τ-bench is pass^k, which measures the probability that all k independent task trials are successful for a given task, averaged across all tasks. Unlike the pass@k metric that is common for other LLM evaluations (which measures if at least one of k trials succeeds), pass^k evaluates consistency and reliability—critical qualities for customer service applications where consistent adherence to policies is essential.
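Given n trials of a task with c successes, both metrics can be estimated combinatorially; the sketch below uses the standard pass@k estimator and its all-trials-succeed counterpart for pass^k (the trial counts are made up for illustration).

```python
# Sketch: estimating pass@k vs pass^k from n trials of one task with c successes.
# pass@k = P(at least one of k trials succeeds)
# pass^k = P(all k trials succeed) -> rewards consistency, as in tau-bench
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# A task solved in 7 of 10 trials looks strong under pass@k,
# but much weaker once all k attempts must succeed.
print(pass_at_k(10, 7, 3))   # ~0.99
print(pass_hat_k(10, 7, 3))  # ~0.29
```

Averaging these per-task estimates across all tasks gives the benchmark-level score described above.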
Evaluating Quality in Large Language Models: A Comprehensive Approach using the legal industry as a…
Evaluating the quality of outputs from Large Language Models (LLMs) is an intricate task due to the open-ended nature of many LLM tasks…
Check grounding with RAG | Vertex AI Agent Builder | Google Cloud
Creating a LLM-as-a-Judge That Drives Business Results
A step-by-step guide with my learnings from 30+ AI implementations.
SCIPE - Systematic Chain Improvement and Problem Evaluation
Related to LangChain
Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)
Use cases, techniques, alignment, finetuning, and critiques against LLM-evaluators.
RAG Evaluation - Hugging Face Open-Source AI Cookbook
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
LlamaIndex: RAG Evaluation Showdown with GPT-4 vs. Open-Source Prometheus Model — LlamaIndex, Data Framework for LLM Applications
LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models (LLMs).
Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation - Hugging Face Open-Source AI Cookbook
We’re on a journey to advance and democratize artificial intelligence through open source and open science.