Building an internal agent: Evals to validate workflows
Whenever a new pull request is submitted to our agent’s GitHub repository,
we run a set of CI/CD checks on it: an opinionated linter, typechecking,
and a suite of unit tests.
All of these work well, but none of them test entire workflows end to end.
For that end-to-end testing, we introduced an eval pipeline.
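To make that concrete, here is roughly how such a pipeline can plug into CI
as a pytest suite that runs next to the other checks. `run_workflow`,
`grade_result`, the cases, and the 0.8 threshold are all hypothetical
stand-ins for illustration, not our actual pipeline.

```python
# A sketch of an eval gate CI could run alongside the linter, typechecker,
# and unit tests. `run_workflow` and `grade_result` are hypothetical
# stand-ins for the real harness entry point and grader.
import pytest

from agent_harness import run_workflow  # hypothetical harness entry point
from agent_evals import grade_result    # hypothetical grader

WORKFLOW_CASES = [
    "triage the newest open issue",
    "draft release notes for the latest tag",
]

@pytest.mark.parametrize("prompt", WORKFLOW_CASES)
def test_workflow_end_to_end(prompt):
    # Unlike a unit test, this drives the whole workflow: prompt in,
    # tool calls and final answer out, then a grade over the result.
    result = run_workflow(prompt)
    assert grade_result(prompt, result) >= 0.8, f"eval failed for {prompt!r}"
```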
This is part of the Building an internal agent series.
Why evals matter
The harnesses that run agents have plenty of interesting nuance, but they’re
structurally pretty simple: some virtual file management, some tool invocation,
and some context window management.
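To make “structurally pretty simple” concrete, here is a stripped-down sketch
of that kind of loop. The `model.complete` call, the message shape, and the
truncation policy are all illustrative stand-ins, not our real harness.

```python
MAX_CONTEXT_MESSAGES = 50  # illustrative context budget

def run_turn(model, tools, messages):
    """Run one user turn to completion, executing tool calls along the way."""
    while True:
        # Context window management: keep the system prompt, drop the
        # oldest middle messages once the conversation grows too long.
        if len(messages) > MAX_CONTEXT_MESSAGES:
            messages = [messages[0]] + messages[-(MAX_CONTEXT_MESSAGES - 1):]

        reply = model.complete(messages)  # one model call
        if reply.tool_call is None:
            return reply.text             # final answer; the turn is done

        # Tool invocation: look up the requested tool and run it.
        tool = tools[reply.tool_call.name]
        result = tool(**reply.tool_call.arguments)
        messages.append({"role": "tool", "content": str(result)})
```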
However, it’s very easy to write prompts that don’t work well even when all
of the underlying pieces are correct.
Evals are one tool for catching that: they exercise your prompts and tools
together and grade the results.
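As a sketch, an eval loop over the hypothetical `run_turn` harness above
might look like this; the cases and graders are made up for illustration.

```python
# A toy eval loop: each case pairs a prompt with a grading function, and we
# report the pass rate. Real graders might be string checks, rubrics, or a
# model acting as judge.
def run_evals(model, tools, cases):
    passed = 0
    for prompt, grade in cases:
        messages = [
            {"role": "system", "content": "You are our internal agent."},
            {"role": "user", "content": prompt},
        ]
        output = run_turn(model, tools, messages)
        if grade(output):
            passed += 1
        else:
            print(f"FAIL: {prompt!r} -> {output[:80]!r}")
    print(f"{passed}/{len(cases)} evals passed")
    return passed == len(cases)
```

The useful property here is that each case runs the same prompt-and-tool
stack the agent uses for real work, so a regression in any one piece shows up
as a failed grade rather than a green build.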