Over the past month at LangChain, we shipped four applications on top of the Deep Agents harness:
* DeepAgents CLI: a coding agent
* LangSmith Assist: an in-app agent that helps users with tasks across LangSmith
* Personal Email Assistant: an email assistant that learns from interactions with each user
* Agent Builder: a no-code agent-building platform powered by meta deep agents
Building and shipping these agents meant adding evals for each of them, and we learned a lot along the way! In this post, we share those lessons.
The primary evaluation metric used in τ-bench is pass^k, which measures the probability that all k independent task trials are successful for a given task, averaged across all tasks. Unlike the pass@k metric that is common for other LLM evaluations (which measures if at least one of k trials succeeds), pass^k evaluates consistency and reliability—critical qualities for customer service applications where consistent adherence to policies is essential.
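To make the contrast concrete, here is a minimal sketch of how both metrics can be estimated from recorded trial outcomes (the function names and data layout are illustrative, not τ-bench's actual code). For a task with n trials and c successes, pass^k uses the unbiased estimator C(c, k) / C(n, k), while pass@k uses 1 - C(n - c, k) / C(n, k); both are averaged across tasks.

```python
from math import comb

def pass_hat_k(trial_results: list[list[bool]], k: int) -> float:
    """pass^k: probability that ALL k i.i.d. trials of a task succeed,
    averaged across tasks. Unbiased estimator: C(c, k) / C(n, k),
    where n = trials run for the task and c = observed successes."""
    scores = []
    for outcomes in trial_results:
        n, c = len(outcomes), sum(outcomes)
        if n < k:
            raise ValueError(f"need at least k={k} trials per task, got {n}")
        scores.append(comb(c, k) / comb(n, k))
    return sum(scores) / len(scores)

def pass_at_k(trial_results: list[list[bool]], k: int) -> float:
    """pass@k: probability that AT LEAST ONE of k i.i.d. trials succeeds,
    averaged across tasks. Unbiased estimator: 1 - C(n - c, k) / C(n, k)."""
    scores = []
    for outcomes in trial_results:
        n, c = len(outcomes), sum(outcomes)
        if n < k:
            raise ValueError(f"need at least k={k} trials per task, got {n}")
        scores.append(1 - comb(n - c, k) / comb(n, k))
    return sum(scores) / len(scores)

# Two tasks, 4 trials each: an agent that is "usually right" on task 1
# scores much lower on pass^2 than on pass@2.
results = [
    [True, True, True, False],  # 3/4 trials succeed
    [True, True, True, True],   # 4/4 trials succeed
]
print(pass_hat_k(results, k=2))  # (C(3,2)/C(4,2) + 1.0) / 2 = 0.75
print(pass_at_k(results, k=2))   # (1.0 + 1.0) / 2 = 1.0
```

Note how the same agent scores a perfect pass@2 but only 0.75 on pass^2: one flaky task is invisible to "at least one success" but heavily penalized when every trial must succeed, which is exactly the reliability property pass^k is designed to measure.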