There are seven key patterns: evals, RAG, fine-tuning, caching, guardrails, defensive UX, and collecting user feedback.
We can group metrics into two categories: context-dependent (which take the task's context into account) and context-free (which compare output only against gold references).
First, there's poor correlation between these metrics and human judgments.
Second, these metrics often adapt poorly to a wider variety of tasks.
Third, these metrics have poor reproducibility: different implementations and parameterizations can report different scores for the same output.
Building solid evals should be the starting point for any LLM-based system or product.
We can start by collecting a set of task-specific evals.
These evals will then guide prompt engineering, model selection, fine-tuning, and so on.
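A minimal sketch of what such a task-specific eval harness might look like; the model call and the eval cases here are hypothetical stand-ins, and a real harness would call an actual LLM and use far more cases:

```python
def model(prompt: str) -> str:
    # Stand-in for a real LLM call; replace with your provider's client.
    return "Paris is the capital of France."

EVAL_CASES = [
    # (input prompt, predicate the output must satisfy)
    ("What is the capital of France?", lambda out: "Paris" in out),
    ("What is the capital of France?", lambda out: len(out) < 200),
]

def run_evals(llm) -> float:
    """Return the fraction of eval cases the model passes."""
    passed = sum(check(llm(prompt)) for prompt, check in EVAL_CASES)
    return passed / len(EVAL_CASES)

print(f"pass rate: {run_evals(model):.0%}")
```

Re-running this harness after every prompt tweak, model swap, or fine-tune turns those decisions into measurable comparisons rather than vibes.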
Eval-Driven Development (EDD)
Rather than asking an LLM for a direct evaluation (i.e., outputting a score), try giving it a reference and asking for a comparison; this helps reduce noise.
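A sketch of this reference-based comparison, with a stubbed-out judge in place of a real LLM call (all names here are hypothetical):

```python
def build_comparison_prompt(question: str, reference: str, candidate: str) -> str:
    # Ask for a binary consistency judgment against a trusted reference,
    # rather than an absolute 1-10 score, which tends to be noisier.
    return (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Is the candidate answer consistent with the reference? Reply YES or NO."
    )

def judge_llm(prompt: str) -> str:
    # Stand-in for a real LLM judge; replace with an actual API call.
    candidate_section = prompt.split("Candidate answer:")[1]
    return "YES" if "Paris" in candidate_section else "NO"

prompt = build_comparison_prompt(
    question="What is the capital of France?",
    reference="The capital of France is Paris.",
    candidate="Paris.",
)
print(judge_llm(prompt))
```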
Dense vector retrieval serves as the non-parametric component (external memory), while a pre-trained LLM acts as the parametric component (knowledge stored in its weights).
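A toy illustration of this split, using a deliberately crude character-count "embedding" so the sketch stays self-contained; a real system would use a learned embedding model and a vector index:

```python
import numpy as np

DOCS = [
    "The Eiffel Tower is in Paris.",
    "The Colosseum is in Rome.",
    "The Brandenburg Gate is in Berlin.",
]

def embed(text: str) -> np.ndarray:
    # Hypothetical embedding: normalized letter counts, just to be runnable.
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1
    return v / (np.linalg.norm(v) + 1e-9)

doc_vecs = np.stack([embed(d) for d in DOCS])  # the non-parametric memory

def retrieve(query: str, k: int = 1) -> list[str]:
    sims = doc_vecs @ embed(query)  # cosine similarity (unit vectors)
    return [DOCS[i] for i in np.argsort(-sims)[:k]]

question = "Which city is the Eiffel Tower in?"
context = retrieve(question)[0]
prompt = f"Context: {context}\nQuestion: {question}"
# `prompt` would then be passed to the parametric LLM for generation.
```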
To retrieve documents with low latency at scale, we use approximate nearest neighbors (ANN).
Some popular techniques include locality-sensitive hashing (LSH), FAISS, HNSW, and ScaNN.
When evaluating an ANN index, some factors to consider include recall (relative to exact search), latency and throughput, memory footprint, and the ease of adding new items.
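To make the recall trade-off concrete, here is a toy random-projection LSH index evaluated by recall@1 against exact search; a production system would use a library such as FAISS or hnswlib rather than this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_docs = 32, 2000
docs = rng.normal(size=(n_docs, dim))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# Index: hash each vector by the sign pattern of 8 random hyperplanes,
# so similar vectors tend to land in the same bucket.
planes = rng.normal(size=(8, dim))

def bucket(v: np.ndarray) -> tuple:
    return tuple((planes @ v > 0).astype(int))

index: dict = {}
for i, d in enumerate(docs):
    index.setdefault(bucket(d), []).append(i)

def ann_search(q: np.ndarray) -> int:
    # Only score candidates in the query's bucket (may miss the true NN).
    cand = list(index.get(bucket(q), range(n_docs)))
    return max(cand, key=lambda i: docs[i] @ q)

def exact_search(q: np.ndarray) -> int:
    return int(np.argmax(docs @ q))

# Evaluate: recall@1 = fraction of queries where ANN finds the exact NN.
queries = docs[:200] + 0.1 * rng.normal(size=(200, dim))
recall = np.mean([ann_search(q) == exact_search(q) for q in queries])
print(f"recall@1: {recall:.2f}")
```

Adding more hash tables (or scanning neighboring buckets) would raise recall at the cost of latency and memory, which is exactly the trade-off the evaluation factors above capture.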