The Almanack of Pablo

AI Builder and Data Storyteller

Navigating the Chaos: A Pragmatic Approach to LLM Evaluation

December 26, 2025

This post is about evaluating production LLM applications such as support copilots, RAG assistants, document extraction agents, and workflow automations that users depend on every day.

The core problem is simple to describe and hard to solve. Offline tests can look good while production quality drops fast. You can ship with a high pass rate and still get brittle behavior under real user requests.

This happens because LLM applications are pipelines, not single models. Retrieval, prompting, tool calls, post-processing, and safety checks all affect the final answer. Each step can fail in a different way, so evaluation must happen both at the component level and end to end.

A common example is a policy assistant that passes synthetic tests but fails on real documents with mixed versions. Another example is a document parser that works on clean PDFs but breaks when tables are merged or scanned at low quality.

Evaluation Contract First

I start by defining a strict contract for each test case. This keeps runs reproducible and helps isolate failures.

{
  "id": "case_0127",
  "task_type": "rag_qa",
  "user_query": "What are the refund conditions?",
  "gold_answer": "...",
  "gold_citations": ["doc_44#p2", "doc_11#p5"],
  "required_tools": ["search_policy"],
  "risk_level": "medium"
}

With this structure, every run can produce comparable artifacts such as final answer, retrieved chunks, tool traces, and judge output.
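A minimal sketch of enforcing that contract before a run, assuming the field names shown above; the `RISK_LEVELS` set and the validator shape are illustrative assumptions, not a definitive schema implementation:

```python
# Illustrative contract check for eval cases; field names come from the
# JSON example above, the allowed risk levels are an assumption.
REQUIRED_FIELDS = {"id", "task_type", "user_query", "gold_answer",
                   "gold_citations", "required_tools", "risk_level"}
RISK_LEVELS = {"low", "medium", "high"}  # assumed enumeration

def validate_case(case: dict) -> list[str]:
    """Return a list of contract violations; an empty list means valid."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - case.keys())]
    if case.get("risk_level") not in RISK_LEVELS:
        errors.append(f"invalid risk_level: {case.get('risk_level')!r}")
    if not isinstance(case.get("gold_citations"), list):
        errors.append("gold_citations must be a list")
    return errors

case = {
    "id": "case_0127",
    "task_type": "rag_qa",
    "user_query": "What are the refund conditions?",
    "gold_answer": "...",
    "gold_citations": ["doc_44#p2", "doc_11#p5"],
    "required_tools": ["search_policy"],
    "risk_level": "medium",
}
assert validate_case(case) == []
```

Running the validator once per case at load time keeps a single malformed fixture from silently skewing a whole run.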

Three-Layer Evaluation Stack

  1. Deterministic checks. Validate schema, required fields, citation existence, tool usage constraints, and policy rules.
  2. Model-based judging. Use a fixed judge prompt with explicit scoring dimensions and a short reason per score.
  3. Human audit. Review high-risk and high-disagreement cases to catch subtle quality regressions.

For example, in an extraction task I hard-fail any response that misses mandatory keys. In a RAG answer task I hard-fail any citation that does not map to the retrieved context.
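The two hard-fail checks above can be sketched as follows; the record shapes (`mandatory` key set, `retrieved_ids`) are assumed names for illustration, not a fixed interface:

```python
# Deterministic hard-fail checks; input shapes are illustrative assumptions.
def check_mandatory_keys(extracted: dict, mandatory: set[str]) -> bool:
    """Pass only if every mandatory key is present in the extraction result."""
    return mandatory <= extracted.keys()

def check_citations_grounded(cited_ids: list[str], retrieved_ids: set[str]) -> bool:
    """Pass only if every cited chunk was actually retrieved."""
    return all(c in retrieved_ids for c in cited_ids)

assert check_mandatory_keys({"invoice_id": "A1", "total": 10}, {"invoice_id", "total"})
assert not check_mandatory_keys({"invoice_id": "A1"}, {"invoice_id", "total"})
assert check_citations_grounded(["doc_44#p2"], {"doc_44#p2", "doc_11#p5"})
assert not check_citations_grounded(["doc_99#p1"], {"doc_44#p2"})
```

Because these checks are deterministic and cheap, they can gate every case before any model-based judge runs.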

Scoring Dimensions I Actually Use

I score each case on explicit dimensions, such as the correctness and groundedness used in the release gates, and I report per-dimension distributions, not only a global mean. A mean score can look stable while a failure cluster grows in one dimension.
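A toy sketch of that effect, with made-up scores and dimension names chosen purely for illustration:

```python
# Toy example: the global mean stays acceptable while the groundedness
# distribution collapses. All numbers here are fabricated for illustration.
from statistics import mean

runs = [
    {"correctness": 0.9, "groundedness": 0.9},
    {"correctness": 1.0, "groundedness": 0.4},  # failure cluster forming
    {"correctness": 1.0, "groundedness": 0.5},
]

global_mean = mean(mean(r.values()) for r in runs)
per_dim = {d: [r[d] for r in runs] for d in runs[0]}

# The aggregate hides the problem; the per-dimension view does not.
print(round(global_mean, 2))
print({d: min(v) for d, v in per_dim.items()})
```

A histogram or worst-k list per dimension makes the same point visually in a dashboard.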

RAG-Specific Metrics

For RAG systems, retrieval is evaluated separately from generation. This avoids wasting time tuning prompts when the real issue is poor context recall.

In practice, groundedness catches many polished but unsupported answers. This is especially common when retrieval recall is low and the model still responds with high confidence.
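A minimal sketch of measuring context recall separately from generation, reusing the `gold_citations` field from the case contract; the metric shape is an assumption, not a standard definition:

```python
# Illustrative context-recall metric: what fraction of the gold evidence
# chunks actually made it into the retrieved context for this case.
def context_recall(gold_citations: list[str], retrieved_ids: set[str]) -> float:
    """Return the fraction of gold citations present in retrieval output."""
    if not gold_citations:
        return 1.0  # nothing to retrieve counts as full recall
    hits = sum(1 for c in gold_citations if c in retrieved_ids)
    return hits / len(gold_citations)

assert context_recall(["doc_44#p2", "doc_11#p5"], {"doc_44#p2"}) == 0.5
assert context_recall(["doc_44#p2"], {"doc_44#p2", "doc_07#p1"}) == 1.0
```

When this number is low, tuning the prompt is wasted effort; the fix belongs in the retriever.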

Release Gates

Every candidate release must pass explicit quality gates. Example gates include minimum correctness, minimum groundedness, no increase in high-risk failures, and a bounded latency increase versus the previous baseline.

{
  "min_correctness": 0.85,
  "min_groundedness": 0.9,
  "max_high_risk_failure_rate": 0.01,
  "max_p95_latency_increase_pct": 15
}

I also track cost per request and failure rate together with quality. A small quality gain is not useful if cost and latency explode.

Failure Triage Loop

After each run, I classify failures into a compact taxonomy. Missing context, wrong tool selection, weak reasoning over valid context, and unsafe output are usually enough categories to start. Then each category maps to one owner action such as retriever tuning, tool policy update, prompt constraints, or guardrail refinement.
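The taxonomy-to-owner-action mapping above can be sketched as a simple lookup plus a per-category count; the category keys are taken from the text, while the report structure is an illustrative assumption:

```python
# Illustrative triage report: count failures per category and attach the
# owner action named in the text. Category keys are assumed identifiers.
from collections import Counter

TRIAGE = {
    "missing_context": "retriever tuning",
    "wrong_tool_selection": "tool policy update",
    "weak_reasoning_over_valid_context": "prompt constraints",
    "unsafe_output": "guardrail refinement",
}

def triage_report(failures: list[str]) -> list[tuple[str, int, str]]:
    """Return (category, count, owner action) sorted by frequency."""
    counts = Counter(failures)
    return [(cat, n, TRIAGE[cat]) for cat, n in counts.most_common()]

report = triage_report(["missing_context", "missing_context", "unsafe_output"])
assert report[0] == ("missing_context", 2, "retriever tuning")
```

Sorting by frequency keeps the loop focused on fixing the largest error class first, which is what makes progress predictable.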

This loop is what makes progress predictable. Build, measure, inspect, fix one class of errors, and run again.

The main takeaway is that evaluation should be treated like product infrastructure. Clear measurement creates clear decisions, and clear decisions create reliable systems.
