This post is about evaluating production LLM applications such as support copilots, RAG assistants, document extraction agents, and workflow automations that users depend on every day.
The core problem is simple to describe and hard to solve. Offline tests can look good while production quality drops fast. You can ship with a high pass rate and still get brittle behavior under real user requests.
This happens because LLM applications are pipelines, not single models. Retrieval, prompting, tool calls, post processing, and safety checks all affect the final answer. Each step can fail in a different way, so evaluation must happen both at the component level and end to end.
A common example is a policy assistant that passes synthetic tests but fails on real documents with mixed versions. Another example is a document parser that works on clean PDFs but breaks when tables are merged or scanned at low quality.
I start by defining a strict contract for each test case. This keeps runs reproducible and helps isolate failures.
{
  "id": "case_0127",
  "task_type": "rag_qa",
  "user_query": "What are the refund conditions?",
  "gold_answer": "...",
  "gold_citations": ["doc_44#p2", "doc_11#p5"],
  "required_tools": ["search_policy"],
  "risk_level": "medium"
}
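A contract is only useful if it is enforced before a run starts. Below is a minimal sketch of such a validator; the field names mirror the example case above, while the function name `validate_case` and the allowed risk levels are my own assumptions, not part of any existing library.

```python
# Minimal contract check for a test case. Field names follow the example
# case above; the allowed risk levels are an assumed convention.
REQUIRED_FIELDS = {
    "id": str,
    "task_type": str,
    "user_query": str,
    "gold_answer": str,
    "gold_citations": list,
    "required_tools": list,
    "risk_level": str,
}

ALLOWED_RISK_LEVELS = {"low", "medium", "high"}

def validate_case(case: dict) -> list[str]:
    """Return a list of contract violations; an empty list means valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in case:
            errors.append(f"missing field: {field}")
        elif not isinstance(case[field], expected_type):
            errors.append(f"wrong type for {field}: {type(case[field]).__name__}")
    if "risk_level" in case and case["risk_level"] not in ALLOWED_RISK_LEVELS:
        errors.append("risk_level must be one of low, medium, high")
    return errors
```

Rejecting malformed cases up front means a failing run always points at the pipeline, never at a broken test file.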
With this structure, every run can produce comparable artifacts such as final answer, retrieved chunks, tool traces, and judge output.
For example, in an extraction task I hard fail any response that misses mandatory keys. In a RAG answer task I hard fail any citation that does not map to retrieved context.
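Both hard-fail rules reduce to simple set checks. A sketch, assuming the response and retrieved context are already parsed into plain Python structures (the function names here are illustrative, not from any framework):

```python
def extraction_hard_fail(response: dict, mandatory_keys: set[str]) -> bool:
    """Hard fail when any mandatory key is missing from the extracted record."""
    return not mandatory_keys <= response.keys()

def citation_hard_fail(cited_ids: list[str], retrieved_ids: set[str]) -> bool:
    """Hard fail when any citation does not map to a retrieved chunk."""
    return any(c not in retrieved_ids for c in cited_ids)
```

A hard fail short-circuits scoring entirely, so a response can never buy back a contract violation with fluent prose.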
I report per dimension distributions, not only a global mean. A mean score can look stable while failure clusters grow in one dimension.
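To make that concrete, here is a small sketch of a per-dimension summary, assuming each case produces a dict of dimension scores in [0, 1]. Reporting a low percentile alongside the mean is what surfaces a growing failure cluster; the exact statistics chosen here are an assumption, not a standard.

```python
from statistics import mean

def per_dimension_report(scores: list[dict[str, float]]) -> dict[str, dict[str, float]]:
    """Summarize each scoring dimension separately instead of one global mean.

    The p10 and min values expose failure clusters that a mean hides.
    """
    report = {}
    for dim in scores[0]:
        values = sorted(s[dim] for s in scores)
        report[dim] = {
            "mean": mean(values),
            "p10": values[int(0.1 * (len(values) - 1))],
            "min": values[0],
        }
    return report
```

If groundedness p10 drops while every mean holds steady, the release still deserves scrutiny.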
For RAG systems, retrieval is evaluated separately from generation. This avoids wasting time tuning prompts when the real issue is poor context recall.
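The retrieval side can be scored with a plain recall metric over gold chunk identifiers, independent of whatever the generator produces. A minimal sketch, assuming chunk IDs like the `doc_44#p2` citations above (the function name and default k are my own choices):

```python
def context_recall(retrieved_ids: list[str], gold_ids: list[str], k: int = 10) -> float:
    """Fraction of gold chunks present in the top-k retrieved chunks.

    Scored independently of generation, so a bad answer over good
    context is distinguishable from good prose over missing context.
    """
    if not gold_ids:
        return 1.0
    top_k = set(retrieved_ids[:k])
    return sum(1 for g in gold_ids if g in top_k) / len(gold_ids)
```

When this number is low, no amount of prompt tuning on the generator will fix the answers.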
In practice, groundedness catches many polished but unsupported answers. This is especially common when retrieval recall is low and the model still responds with high confidence.
Every candidate release must pass explicit quality gates. Example gates include a minimum correctness score, a minimum groundedness score, no increase in high risk failures, and a bounded latency increase versus the previous baseline.
{
  "min_correctness": 0.85,
  "min_groundedness": 0.9,
  "max_high_risk_failure_rate": 0.01,
  "max_p95_latency_increase_pct": 15
}
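Applying the gates is a mechanical comparison of run metrics against those thresholds. A sketch, assuming the metric names below are produced by the run summary (the `release_passes` function and metric keys are illustrative):

```python
GATES = {
    "min_correctness": 0.85,
    "min_groundedness": 0.9,
    "max_high_risk_failure_rate": 0.01,
    "max_p95_latency_increase_pct": 15,
}

def release_passes(metrics: dict, gates: dict = GATES) -> tuple[bool, list[str]]:
    """Check candidate metrics against every gate; any failure blocks release."""
    failures = []
    if metrics["correctness"] < gates["min_correctness"]:
        failures.append("correctness below minimum")
    if metrics["groundedness"] < gates["min_groundedness"]:
        failures.append("groundedness below minimum")
    if metrics["high_risk_failure_rate"] > gates["max_high_risk_failure_rate"]:
        failures.append("high risk failure rate above limit")
    if metrics["p95_latency_increase_pct"] > gates["max_p95_latency_increase_pct"]:
        failures.append("p95 latency regression too large")
    return (not failures, failures)
```

Returning every failed gate, not just the first, keeps the release report actionable in one pass.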
I also track cost per request and failure rate together with quality. A small quality gain is not useful if cost and latency explode.
After each run, I classify failures into a compact taxonomy. Missing context, wrong tool selection, weak reasoning over valid context, and unsafe output are usually enough categories to start. Then each category maps to one owner action such as retriever tuning, tool policy update, prompt constraints, or guardrail refinement.
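The taxonomy-to-owner mapping is small enough to keep in code. A sketch, assuming each failed case has already been labeled with one of the categories above (the category strings and `triage` helper are my own naming):

```python
from collections import Counter

# Maps each failure category to the single owner action named above.
OWNER_ACTIONS = {
    "missing_context": "retriever tuning",
    "wrong_tool_selection": "tool policy update",
    "weak_reasoning": "prompt constraints",
    "unsafe_output": "guardrail refinement",
}

def triage(failure_labels: list[str]) -> list[tuple[str, int, str]]:
    """Count failures per category and attach the owner action, worst first."""
    counts = Counter(failure_labels)
    return [
        (category, n, OWNER_ACTIONS.get(category, "needs triage"))
        for category, n in counts.most_common()
    ]
```

Sorting by count is what keeps the loop focused: fix the largest cluster first, rerun, then reassess.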
This loop is what makes progress predictable. Build, measure, inspect, fix one class of errors, and run again.
The main takeaway is that evaluation should be treated like product infrastructure. Clear measurement creates clear decisions, and clear decisions create reliable systems.