Evaluation

An LLM-as-Judge Won't Save The Product—Fixing Your Process Will
Applying the scientific method, building via eval-driven development, and monitoring AI output.
Building product evals is simply the scientific method in disguise. That’s the secret sauce. It’s a cycle of inquiry, experimentation, and analysis.
·eugeneyan.com·
The "think" tool: Enabling Claude to stop and think \ Anthropic
The "think" tool: Enabling Claude to stop and think \ Anthropic
A blog post for developers describing a new method for complex tool-use situations.
The primary evaluation metric used in τ-bench is pass^k, which measures the probability that all k independent task trials are successful for a given task, averaged across all tasks. Unlike the pass@k metric that is common for other LLM evaluations (which measures if at least one of k trials succeeds), pass^k evaluates consistency and reliability—critical qualities for customer service applications where consistent adherence to policies is essential.
·anthropic.com·
The "think" tool: Enabling Claude to stop and think \ Anthropic