Over the past month at LangChain, we shipped four applications on top of the Deep Agents harness:
* DeepAgents CLI: a coding agent
* LangSmith Assist: an in-app agent that helps users with tasks across LangSmith
* Personal Email Assistant: an email assistant that learns from interactions with each user
* Agent Builder: a no-code agent-building platform powered by meta deep agents
Building and shipping these agents meant adding evals for each of them, and we learned a lot along the way! In this post, we share those lessons.
The primary evaluation metric used in τ-bench is pass^k, which measures the probability that all k independent task trials are successful for a given task, averaged across all tasks. Unlike the pass@k metric that is common for other LLM evaluations (which measures if at least one of k trials succeeds), pass^k evaluates consistency and reliability—critical qualities for customer service applications where consistent adherence to policies is essential.
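To make the contrast concrete, here is a minimal sketch of how both metrics can be estimated from recorded trial outcomes (the function names and data layout are illustrative, not τ-bench's actual code). For a task with n trials and c successes, pass^k uses the unbiased estimator C(c, k) / C(n, k), while pass@k uses 1 - C(n - c, k) / C(n, k); both are averaged across tasks.

```python
from math import comb

def pass_hat_k(trial_results: list[list[bool]], k: int) -> float:
    """pass^k: probability that ALL k i.i.d. trials of a task succeed,
    averaged across tasks. Unbiased estimator: C(c, k) / C(n, k),
    where n = trials run for the task and c = observed successes."""
    scores = []
    for outcomes in trial_results:
        n, c = len(outcomes), sum(outcomes)
        if n < k:
            raise ValueError(f"need at least k={k} trials per task, got {n}")
        scores.append(comb(c, k) / comb(n, k))
    return sum(scores) / len(scores)

def pass_at_k(trial_results: list[list[bool]], k: int) -> float:
    """pass@k: probability that AT LEAST ONE of k i.i.d. trials succeeds,
    averaged across tasks. Unbiased estimator: 1 - C(n - c, k) / C(n, k)."""
    scores = []
    for outcomes in trial_results:
        n, c = len(outcomes), sum(outcomes)
        if n < k:
            raise ValueError(f"need at least k={k} trials per task, got {n}")
        scores.append(1 - comb(n - c, k) / comb(n, k))
    return sum(scores) / len(scores)

# Two tasks, 4 trials each: an agent that is "usually right" on task 1
# scores much lower on pass^2 than on pass@2.
results = [
    [True, True, True, False],  # 3/4 trials succeed
    [True, True, True, True],   # 4/4 trials succeed
]
print(pass_hat_k(results, k=2))  # (C(3,2)/C(4,2) + 1.0) / 2 = 0.75
print(pass_at_k(results, k=2))   # (1.0 + 1.0) / 2 = 1.0
```

Note how the same agent scores a perfect pass@2 but only 0.75 on pass^2: one flaky task is invisible to "at least one success" but heavily penalized when every trial must succeed, which is exactly the reliability property pass^k is designed to measure.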