LLM Evaluation Harness in Python — Deep Dive
Evaluation is what separates teams that ship reliable LLM applications from teams that ship and hope. A production evaluation harness needs custom metrics, statistical rigor, cost management, and CI integration. This guide covers how to build one in Python.
1) Evaluation dataset design
Your dataset is the foundation. Design it with coverage in mind:
from dataclasses import dataclass

@dataclass
class EvalCase:
    id: str
    input: str
    expected_output: str | None  # None for reference-free evaluation
    category: str  # "factual", "reasoning", "creative", "adversarial"
    difficulty: str  # "easy", "medium", "hard"
    metadata: dict  # domain-specific labels

eval_dataset = [
    EvalCase(
        id="q001",
        input="What is the capital of France?",
        expected_output="Paris",
        category="factual",
        difficulty="easy",
        metadata={"domain": "geography"},
    ),
    EvalCase(
        id="q002",
        input="Explain why the sky appears blue in two sentences.",
        expected_output=None,  # reference-free; judge by criteria
        category="reasoning",
        difficulty="medium",
        metadata={"domain": "physics"},
    ),
]
Build the dataset incrementally: start with common cases, then add failures you discover in production. Tag categories and difficulty levels so you can analyze performance by segment.
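As a rough illustration of segment analysis, the sketch below groups per-case scores by a tag and reports the count and mean for each group. The `scores_by_segment` helper and the flat `results` rows are hypothetical names chosen for this example, not part of the harness described here.

```python
from collections import defaultdict
from statistics import mean

def scores_by_segment(results: list[dict], key: str = "category") -> dict[str, dict]:
    """Group per-case scores by a tag and report count and mean per group."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for row in results:
        grouped[row[key]].append(row["score"])
    return {
        segment: {"n": len(values), "mean": round(mean(values), 3)}
        for segment, values in grouped.items()
    }

# Hypothetical per-case results, one score per case
results = [
    {"id": "q001", "category": "factual", "difficulty": "easy", "score": 1.0},
    {"id": "q002", "category": "reasoning", "difficulty": "medium", "score": 0.5},
    {"id": "q003", "category": "factual", "difficulty": "hard", "score": 0.0},
]

print(scores_by_segment(results))                    # breakdown by category
print(scores_by_segment(results, key="difficulty"))  # breakdown by difficulty
```

Reporting `n` next to each mean helps avoid over-reading a segment that contains only a handful of cases.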
2) Custom metric framework
Build metrics as composable functions:
import json
from abc import ABC, abstractmethod

class Metric(ABC):
    name: str

    @abstractmethod
    def score(self, input: str, output: str, expected: str | None) -> float:
        """Return a score between 0 and 1."""
        ...

class ExactMatch(Metric):
    name = "exact_match"

    def score(self, input: str, output: str, expected: str | None) -> float:
        if expected is None:
            return 0.0
        # Case- and whitespace-insensitive comparison
        return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

class ContainsKeyFacts(Metric):
    name = "contains_key_facts"

    def __init__(self, key_facts: list[str]):
        self.key_facts = key_facts

    def score(self, input: str, output: str, expected: str | None) -> float:
        output_lower = output.lower()
        # Fraction of key facts that appear verbatim in the output
        hits = sum(1 for fact in self.key_facts if fact.lower() in output_lower)
        return hits / len(self.key_facts) if self.key_facts else 0.0

class JsonValid(Metric):
    name = "json_valid"

    def score(self, input: str, output: str, expected: str | None) -> float:
        try:
            json.loads(output)
            return 1.0
        except json.JSONDecodeError:
            return 0.0
Compose multiple metrics per evaluation case. A customer-support bot might need factual_accuracy + tone_appropriateness + json_valid.
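One way to picture that composition is shown below: several scorers run on one case, and their scores are optionally folded into a weighted total. This is a minimal sketch that uses plain functions in place of the `Metric` classes so it runs on its own. `score_case`, `weighted_total`, and the weights are illustrative assumptions.

```python
import json
from typing import Callable, Optional

# A scorer takes (input, output, expected) and returns a float in [0, 1]
Scorer = Callable[[str, str, Optional[str]], float]

def exact_match(input: str, output: str, expected: Optional[str]) -> float:
    if expected is None:
        return 0.0
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def json_valid(input: str, output: str, expected: Optional[str]) -> float:
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def score_case(scorers: dict[str, Scorer], input: str, output: str,
               expected: Optional[str]) -> dict[str, float]:
    """Run every scorer on one case and keep the scores separate."""
    return {name: scorer(input, output, expected) for name, scorer in scorers.items()}

def weighted_total(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Collapse per-metric scores into one number (weights are a judgment call)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        return 0.0
    return sum(scores[name] * weights[name] for name in scores) / total_weight

scorers = {"exact_match": exact_match, "json_valid": json_valid}
weights = {"exact_match": 0.75, "json_valid": 0.25}  # illustrative weights only

scores = score_case(scorers, "What is the capital of France?", '{"answer": "Paris"}', "Paris")
print(scores, weighted_total(scores, weights))
```

Keeping the per-metric scores alongside any weighted total is usually worthwhile, since a single number hides which dimension regressed.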
3) LLM-as-judge implementation
For nuanced quality assessment, use a strong model as an evaluator:
import json

from openai import OpenAI

client = OpenAI()

class LLMJudge(Metric):
    name = "llm_judge"

    def __init__(self, criteria: str, model: str = "gpt-4o"):
        self.criteria = criteria
        self.model = model

    def score(self, input: str, output: str, expected: str | None) -> float:
        # The prompt lines stay flush left so no indentation leaks into the prompt text
        prompt = f"""Evaluate this AI response on a scale of 1-5.
Criteria: {self.criteria}
User input: {input}
AI response: {output}
{f'Expected answer: {expected}' if expected else ''}
Return ONLY a JSON object: {{"score": <1-5>, "reasoning": "<brief explanation>"}}"""
        resp = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            response_format={"type": "json_object"},
        )
        result = json.loads(resp.choices[0].message.content)
        return result["score"] / 5.0  # maps the 1-5 rating onto 0.2-1.0
Calibrating the judge
LLM judges have biases: they favor verbose responses, their own writing style, and first-presented options. Mitigate by:
- Using low temperature (0) for consistency.
- Providing specific rubrics instead of vague criteria.
- Running the same evaluation twice with swapped order (for comparison tasks) and averaging.
- Periodically validating judge scores against human ratings on a subset.
4) RAG-specific evaluation
RAG systems need metrics for both retrieval and generation:
# Reuses `json`, `client`, and `Metric` from the earlier snippets.
# Both metrics take an extra `contexts` keyword argument beyond the base
# Metric.score() signature, so the caller must pass the retrieved contexts.

class ContextRelevance(Metric):
    """Measures whether retrieved contexts are relevant to the query."""
    name = "context_relevance"

    def score(self, input: str, output: str, expected: str | None,
              contexts: list[str] | None = None) -> float:
        if not contexts:
            return 0.0
        # Each context is truncated to 200 characters to keep the prompt short
        prompt = f"""For each context, rate relevance to the query (1=irrelevant, 5=highly relevant).
Query: {input}
Contexts:
{chr(10).join(f'{i+1}. {c[:200]}' for i, c in enumerate(contexts))}
Return JSON: {{"scores": [<score1>, <score2>, ...]}}"""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            response_format={"type": "json_object"},
        )
        scores = json.loads(resp.choices[0].message.content)["scores"]
        if not scores:  # guard against an empty list from the judge
            return 0.0
        return sum(scores) / (len(scores) * 5)

class Faithfulness(Metric):
    """Measures whether the answer is grounded in the provided contexts."""
    name = "faithfulness"

    def score(self, input: str, output: str, expected: str | None,
              contexts: list[str] | None = None) -> float:
        if not contexts:
            return 0.0
        context_text = "\n".join(contexts)
        # Only the first 2000 characters of the joined contexts are sent
        prompt = f"""Evaluate whether every claim in the answer is supported by the contexts.
Contexts: {context_text[:2000]}
Answer: {output}
Return JSON: {{"faithfulness_score": <0.0-1.0>, "unsupported_claims": [<list>]}}"""
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            response_format={"type": "json_object"},
        )
        return json.loads(resp.choices[0].message.content)["faithfulness_score"]
The RAGAS framework provides these metrics out of the box, but building custom versions lets you tune criteria for your domain.
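Because these RAG metrics take an extra `contexts` argument, a runner that calls `score(input, output, expected)` would never supply it. One possible workaround is a small adapter that looks up the retrieved contexts for each input. In the sketch below, `WithContexts` and the word-overlap stand-in scorer are hypothetical; the stand-in replaces the LLM-backed metric only so the example runs offline.

```python
from typing import Callable, Optional

class WithContexts:
    """Adapter that gives a context-aware scorer the three-argument score() shape."""

    def __init__(self, name: str, scorer: Callable[..., float],
                 contexts_by_input: dict[str, list[str]]):
        self.name = name
        self._scorer = scorer
        self._contexts_by_input = contexts_by_input

    def score(self, input: str, output: str, expected: Optional[str]) -> float:
        # Look up what the retriever returned for this input (None if unknown)
        contexts = self._contexts_by_input.get(input)
        return self._scorer(input, output, expected, contexts=contexts)

def word_overlap_faithfulness(input: str, output: str, expected: Optional[str],
                              contexts: Optional[list[str]] = None) -> float:
    """Offline stand-in: fraction of answer words that occur in the contexts."""
    if not contexts:
        return 0.0
    context_words = set(" ".join(contexts).lower().split())
    answer_words = output.lower().split()
    if not answer_words:
        return 0.0
    return sum(word in context_words for word in answer_words) / len(answer_words)

retrieved = {"Where is the Eiffel Tower?": ["The Eiffel Tower is in Paris"]}
metric = WithContexts("faithfulness", word_overlap_faithfulness, retrieved)

print(metric.score("Where is the Eiffel Tower?", "the tower is in paris", None))
print(metric.score("Unknown question", "some answer", None))
```

Keying the lookup on the input string is a simplification; keying on the case `id` would be more robust if inputs can repeat.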
5) Statistical significance
A 2% improvement might be noise. Use bootstrap confidence intervals:
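import numpy as np

def bootstrap_ci(scores: list[float], n_bootstrap: int = 1000, ci: float = 0.95) -> tuple[float, float, float]:
    """Return (mean, lower, upper) for a percentile bootstrap CI of the mean."""
    scores_arr = np.array(scores)
    means = []
    for _ in range(n_bootstrap):
        # Resample with replacement, same size as the original sample
        sample = np.random.choice(scores_arr, size=len(scores_arr), replace=True)
        means.append(sample.mean())
    # Percentile bounds of the bootstrap distribution of the mean
    lower = float(np.percentile(means, (1 - ci) / 2 * 100))
    upper = float(np.percentile(means, (1 + ci) / 2 * 100))
    return float(scores_arr.mean()), lower, upper

# Usage: mean, lower, upper = bootstrap_ci(metric_scores)
# If the confidence intervals of two models don't overlap, the difference is significant.
# Overlapping intervals are inconclusive: the difference may still be real,
# so compare per-case score differences before drawing a conclusion.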
Report confidence intervals alongside point estimates. “Accuracy: 84.2% (CI: 81.1-87.0%)” is far more informative than just “84.2%”.
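When two models or prompts are scored on the same cases, a paired bootstrap over per-case differences is usually a sharper test than comparing two separate intervals. The sketch below is one way to do it with the standard library only. `paired_bootstrap_diff` and the sample scores are hypothetical, and the fixed seed is there only for reproducibility.

```python
import random
from statistics import fmean

def paired_bootstrap_diff(scores_a: list[float], scores_b: list[float],
                          n_bootstrap: int = 2000, ci: float = 0.95,
                          seed: int = 0) -> tuple[float, float, float]:
    """Return (mean difference B - A, lower, upper) from a paired bootstrap."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    # Pairing by case removes per-case difficulty from the comparison
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    boot_means = sorted(
        fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_bootstrap)
    )
    lo_idx = int((1 - ci) / 2 * n_bootstrap)
    hi_idx = min(int((1 + ci) / 2 * n_bootstrap), n_bootstrap - 1)
    return fmean(diffs), boot_means[lo_idx], boot_means[hi_idx]

# Hypothetical per-case scores for the same eight cases
baseline = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
candidate = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]

mean_diff, lower, upper = paired_bootstrap_diff(baseline, candidate)
print(f"mean improvement: {mean_diff:.3f} (95% CI: {lower:.3f} to {upper:.3f})")
```

If the interval for the difference excludes zero, the improvement is unlikely to be noise. With only eight cases the interval will be wide, which is itself a useful signal that the dataset is too small to separate the two variants.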
6) Harness runner with cost tracking
import asyncio
import time

# Uses EvalCase, Metric, and bootstrap_ci from the earlier sections.

class EvalRunner:
    def __init__(self, model: str, metrics: list[Metric]):
        self.model = model
        self.metrics = metrics
        self.results: list[dict] = []

    async def _generate(self, prompt: str) -> tuple[str, int]:
        """Call the model under test; return (output_text, tokens_used)."""
        # Placeholder: plug in your model client call here.
        raise NotImplementedError

    async def run(self, dataset: list[EvalCase]) -> dict:
        total_tokens = 0
        start_time = time.time()
        for case in dataset:
            output, tokens = await self._generate(case.input)
            total_tokens += tokens
            scores = {}
            for metric in self.metrics:
                scores[metric.name] = metric.score(case.input, output, case.expected_output)
            self.results.append({
                "id": case.id,
                "category": case.category,
                "scores": scores,
                "output": output,
            })
        duration = time.time() - start_time
        return {
            "model": self.model,
            "total_cases": len(dataset),
            "duration_s": round(duration, 1),
            "total_tokens": total_tokens,
            "metrics": self._aggregate(),
        }

    def _aggregate(self) -> dict:
        # Mean and 95% bootstrap confidence interval per metric
        agg = {}
        for metric in self.metrics:
            scores = [r["scores"][metric.name] for r in self.results]
            mean, ci_low, ci_high = bootstrap_ci(scores)
            agg[metric.name] = {
                "mean": round(mean, 4),
                "ci_95": [round(ci_low, 4), round(ci_high, 4)],
            }
        return agg
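The runner above records a token total but not a dollar figure. A rough way to turn token counts into an estimated cost is sketched below, assuming prompt and completion tokens are tracked separately and that you supply your provider's current prices. The `Usage` class, `estimate_cost`, and the prices shown are placeholders, not real rates.

```python
from dataclasses import dataclass

@dataclass
class Usage:
    """Running token totals for one evaluation run."""
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens

def estimate_cost(usage: Usage, usd_per_1m_prompt: float,
                  usd_per_1m_completion: float) -> float:
    """Estimate spend in USD from token totals and per-million-token prices."""
    return (
        usage.prompt_tokens * usd_per_1m_prompt
        + usage.completion_tokens * usd_per_1m_completion
    ) / 1_000_000

usage = Usage()
usage.add(prompt_tokens=600_000, completion_tokens=200_000)  # first batch of cases
usage.add(prompt_tokens=400_000, completion_tokens=300_000)  # second batch of cases

# Placeholder prices; look up the real rates for the model you evaluate
cost = estimate_cost(usage, usd_per_1m_prompt=2.0, usd_per_1m_completion=8.0)
print(f"estimated cost: ${cost:.2f}")
```

Remember that LLM-as-judge metrics make their own API calls, so their tokens would need to be counted too for the estimate to reflect the full cost of a run.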
7) CI integration pattern
Run evaluations as part of your deployment pipeline:
# .github/workflows/eval.yml
name: LLM Evaluation
on:
  pull_request:
    paths: ['prompts/**', 'src/llm/**']
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements-eval.txt
      - run: python -m eval.run --dataset eval/dataset.json --threshold 0.80
      - uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval/results/
The --threshold flag fails the pipeline if any key metric drops below the minimum, so a change that regresses quality is blocked before it merges or deploys.
8) Evaluation anti-patterns
- Testing on training data — if your evaluation cases leaked into prompt examples or fine-tuning data, scores are meaningless.
- Single metric — accuracy alone misses tone, safety, and format. Use multiple metrics.
- Stale datasets — as your product evolves, old test cases become irrelevant. Review and update quarterly.
- No human baseline — without knowing how well a human performs on your tasks, you cannot interpret whether 85% is good or bad.
- Overfitting to the eval — repeatedly tweaking prompts to pass the test set without understanding why. Treat evaluation as a diagnostic tool, not an optimization target.
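The first anti-pattern, evaluation cases leaking into prompts or training data, can be partly checked mechanically. The sketch below flags evaluation inputs that appear verbatim, after normalization, in a corpus of prompt examples. `find_leaks` and the sample corpus are hypothetical, and the check catches only near-exact copies, not paraphrases.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    without_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", without_punct).strip()

def find_leaks(eval_inputs: list[str], prompt_corpus: list[str]) -> list[str]:
    """Return evaluation inputs that occur inside any corpus document."""
    corpus = [normalize(doc) for doc in prompt_corpus]
    leaks = []
    for question in eval_inputs:
        needle = normalize(question)
        if needle and any(needle in doc for doc in corpus):
            leaks.append(question)
    return leaks

eval_inputs = [
    "What is the capital of France?",
    "Explain why the sky appears blue in two sentences.",
]
# Hypothetical few-shot examples embedded in a system prompt
prompt_corpus = ["Example: Q: what is the capital of France?   A: Paris"]

leaks = find_leaks(eval_inputs, prompt_corpus)
print(leaks)
```

Running a check like this whenever prompts or fine-tuning data change gives an early warning, although paraphrased leaks would need fuzzy or embedding-based matching to detect.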
The one thing to remember: A production evaluation harness combines curated datasets, composable metrics (including LLM-as-judge), statistical rigor, and CI integration — it is the quality infrastructure that makes confident model and prompt changes possible.