LLM Evaluation Harness in Python — Deep Dive

Architect production LLM evaluation systems in Python with custom metric pipelines, LLM-as-judge calibration, statistical significance testing, RAG-specific metrics, and CI integration patterns.

Evaluation is what separates teams that ship reliable LLM applications from teams that ship and hope. A production evaluation harness needs custom metrics, statistical rigor, cost management, and CI integration. This guide covers how to build one in Python.

1) Evaluation dataset design

Your dataset is the foundation. Design it with coverage in mind:

from dataclasses import dataclass

@dataclass
class EvalCase:
    id: str
    input: str
    expected_output: str | None  # None for reference-free evaluation
    category: str  # "factual", "reasoning", "creative", "adversarial"
    difficulty: str  # "easy", "medium", "hard"
    metadata: dict  # domain-specific labels

eval_dataset = [
    EvalCase(
        id="q001",
        input="What is the capital of France?",
        expected_output="Paris",
        category="factual",
        difficulty="easy",
        metadata={"domain": "geography"},
    ),
    EvalCase(
        id="q002",
        input="Explain why the sky appears blue in two sentences.",
        expected_output=None,  # reference-free; judge by criteria
        category="reasoning",
        difficulty="medium",
        metadata={"domain": "physics"},
    ),
]

Build the dataset incrementally: start with common cases, then add failures you discover in production. Tag categories and difficulty levels so you can analyze performance by segment.
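One way to act on those tags is to count cases per (category, difficulty) cell and flag the cells that are thin. The sketch below is only an illustration: `coverage_report`, `thin_segments`, and the `min_cases` cutoff are hypothetical names and values, and the cases are reduced to plain tuples so the snippet stands alone.

```python
from collections import Counter

def coverage_report(cases: list[tuple[str, str]]) -> Counter:
    """Count eval cases per (category, difficulty) pair."""
    return Counter(cases)

def thin_segments(report: Counter, categories: list[str],
                  difficulties: list[str], min_cases: int = 5) -> list[tuple[str, str]]:
    """Return every (category, difficulty) cell with fewer than min_cases cases."""
    return [
        (c, d)
        for c in categories
        for d in difficulties
        if report[(c, d)] < min_cases  # a Counter returns 0 for unseen cells
    ]

# Hypothetical labels mirroring the EvalCase fields above.
cases = [("factual", "easy"), ("factual", "easy"), ("reasoning", "medium")]
report = coverage_report(cases)
gaps = thin_segments(report, ["factual", "reasoning"], ["easy", "medium"], min_cases=2)
print(gaps)
```

With real `EvalCase` objects, the input would be `[(c.category, c.difficulty) for c in eval_dataset]`.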

2) Custom metric framework

Build metrics as composable functions:

from abc import ABC, abstractmethod

class Metric(ABC):
    name: str

    @abstractmethod
    def score(self, input: str, output: str, expected: str | None) -> float:
        """Return a score between 0 and 1."""
        ...

class ExactMatch(Metric):
    name = "exact_match"

    def score(self, input: str, output: str, expected: str | None) -> float:
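        if expected is None:
            # No reference to compare against. Returning 0.0 counts the case as a miss,
            # so filter out reference-free cases before aggregating this metric.
            return 0.0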
        return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

class ContainsKeyFacts(Metric):
    name = "contains_key_facts"

    def __init__(self, key_facts: list[str]):
        self.key_facts = key_facts

    def score(self, input: str, output: str, expected: str | None) -> float:
        output_lower = output.lower()
        hits = sum(1 for fact in self.key_facts if fact.lower() in output_lower)
        return hits / len(self.key_facts) if self.key_facts else 0.0

class JsonValid(Metric):
    name = "json_valid"

    def score(self, input: str, output: str, expected: str | None) -> float:
        import json
        try:
            json.loads(output)
            return 1.0
        except json.JSONDecodeError:
            return 0.0

Compose multiple metrics per evaluation case. A customer-support bot might need factual_accuracy + tone_appropriateness + json_valid.

3) LLM-as-judge implementation

For nuanced quality assessment, use a strong model as an evaluator:

import json

from openai import OpenAI

client = OpenAI()

class LLMJudge(Metric):
    name = "llm_judge"

    def __init__(self, criteria: str, model: str = "gpt-4o"):
        self.criteria = criteria
        self.model = model

    def score(self, input: str, output: str, expected: str | None) -> float:
        prompt = f"""Evaluate this AI response on a scale of 1-5.

Criteria: {self.criteria}

User input: {input}
AI response: {output}
{f'Expected answer: {expected}' if expected else ''}

Return ONLY a JSON object: {{"score": <1-5>, "reasoning": "<brief explanation>"}}"""

        resp = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            response_format={"type": "json_object"},
        )
        result = json.loads(resp.choices[0].message.content)
        return result["score"] / 5.0  # maps the 1-5 scale onto 0.2-1.0

Calibrating the judge

LLM judges have biases: they favor verbose responses, their own writing style, and first-presented options. Mitigate by:

Using low temperature (0) for consistency.
Providing specific rubrics instead of vague criteria.
Running the same evaluation twice with swapped order (for comparison tasks) and averaging.
Periodically validating judge scores against human ratings on a subset.
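The swapped-order mitigation can be sketched as follows. This is a sketch under stated assumptions rather than a full judge: `judge_fn` is a hypothetical callable standing in for the LLM call, assumed to return a preference in [0, 1] for whichever response it is shown first.

```python
from typing import Callable

# judge_fn(question, first, second) -> preference for the response shown FIRST.
# It is a stand-in for a real LLM judge call (hypothetical interface).
JudgeFn = Callable[[str, str, str], float]

def debiased_preference(judge_fn: JudgeFn, question: str,
                        response_a: str, response_b: str) -> float:
    """Preference for response_a, averaged over both presentation orders."""
    a_first = judge_fn(question, response_a, response_b)  # A shown first
    b_first = judge_fn(question, response_b, response_a)  # B shown first
    # In the second call the score refers to B, so A's share is 1 - b_first.
    return (a_first + (1.0 - b_first)) / 2.0

def first_position_fan(question: str, first: str, second: str) -> float:
    """Toy judge with pure position bias: always prefers the first response."""
    return 0.9

def prefers_longer(question: str, first: str, second: str) -> float:
    """Toy judge whose preference depends on content, not on position."""
    return 1.0 if len(first) > len(second) else 0.0

print(debiased_preference(first_position_fan, "q", "answer A", "answer B"))
print(debiased_preference(prefers_longer, "q", "a longer answer", "short"))
```

A judge driven purely by position ends up at a tie after averaging, while a content-driven preference survives the swap. A large gap between the two orderings for the same pair is itself a useful signal that the judge is unreliable on that case.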

4) RAG-specific evaluation

RAG systems need metrics for both retrieval and generation:

class ContextRelevance(Metric):
    """Measures whether retrieved contexts are relevant to the query."""
    name = "context_relevance"

    def score(self, input: str, output: str, expected: str | None,
              contexts: list[str] | None = None) -> float:
        if not contexts:
            return 0.0
        prompt = f"""For each context, rate relevance to the query (1=irrelevant, 5=highly relevant).

Query: {input}

Contexts:
{chr(10).join(f'{i+1}. {c[:200]}' for i, c in enumerate(contexts))}

Return JSON: {{"scores": [<score1>, <score2>, ...]}}"""

        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            response_format={"type": "json_object"},
        )
        scores = json.loads(resp.choices[0].message.content)["scores"]
        return sum(scores) / (len(scores) * 5)

class Faithfulness(Metric):
    """Measures whether the answer is grounded in the provided contexts."""
    name = "faithfulness"

    def score(self, input: str, output: str, expected: str | None,
              contexts: list[str] | None = None) -> float:
        if not contexts:
            return 0.0
        context_text = "\n".join(contexts)
        prompt = f"""Evaluate whether every claim in the answer is supported by the contexts.

Contexts: {context_text[:2000]}
Answer: {output}

Return JSON: {{"faithfulness_score": <0.0-1.0>, "unsupported_claims": [<list>]}}"""

        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            response_format={"type": "json_object"},
        )
        return json.loads(resp.choices[0].message.content)["faithfulness_score"]

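Both classes take an extra contexts argument, so the caller must pass the retrieved passages for each case; without them the score is 0.0. The RAGAS framework provides these metrics out of the box, but building custom versions lets you tune criteria for your domain.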
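The two classes above rely on an LLM judge. On the retrieval side, a cheaper complement is to score the ranking directly, assuming each eval case is labeled with the ids of the documents that should be retrieved. That labeling is an assumption beyond the dataset schema shown earlier, and `recall_at_k` and `reciprocal_rank` are illustrative helper names.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical document ids for a single query.
retrieved = ["d7", "d2", "d9", "d4"]
relevant = {"d2", "d4"}
print(recall_at_k(retrieved, relevant, k=3))
print(reciprocal_rank(retrieved, relevant))
```

Averaging `reciprocal_rank` across queries gives mean reciprocal rank (MRR). Because these scores need no model call, they can run on every commit while the judge-based metrics run on a sample.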

5) Statistical significance

A 2% improvement might be noise. Use bootstrap confidence intervals:

import numpy as np

def bootstrap_ci(scores: list[float], n_bootstrap: int = 1000, ci: float = 0.95) -> tuple[float, float, float]:
    scores_arr = np.array(scores)
    means = []
    for _ in range(n_bootstrap):
        sample = np.random.choice(scores_arr, size=len(scores_arr), replace=True)
        means.append(sample.mean())
    means = sorted(means)
    lower = means[int((1 - ci) / 2 * n_bootstrap)]
    upper = means[int((1 + ci) / 2 * n_bootstrap)]
    return float(scores_arr.mean()), lower, upper

# Usage: mean, lower, upper = bootstrap_ci(metric_scores)
# Non-overlapping intervals indicate a significant difference. Overlapping
# intervals are inconclusive; compare per-case paired differences instead.

Report confidence intervals alongside point estimates. “Accuracy: 84.2% (95% CI: 81.1-87.0%)” is far more informative than just “84.2%”.
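When two models or prompts are scored on the same cases, a paired bootstrap over per-case differences is usually more sensitive than comparing two separate intervals. The sketch below swaps NumPy for the standard library's `random` module so it runs without dependencies. `paired_bootstrap_diff`, its defaults, and the sample scores are illustrative assumptions.

```python
import random

def paired_bootstrap_diff(scores_a: list[float], scores_b: list[float],
                          n_bootstrap: int = 2000, ci: float = 0.95,
                          seed: int = 0) -> tuple[float, float, float]:
    """Return (mean of b - a, ci_lower, ci_upper) over paired cases."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # seeded so reruns give the same interval
    n = len(diffs)
    # Resample the per-case differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_bootstrap))
    lo_idx = int((1 - ci) / 2 * n_bootstrap)
    hi_idx = min(int((1 + ci) / 2 * n_bootstrap), n_bootstrap - 1)
    return sum(diffs) / n, means[lo_idx], means[hi_idx]

# Made-up per-case scores for the same eight cases under two prompts.
baseline = [0.6, 0.7, 0.5, 0.8, 0.6, 0.7, 0.5, 0.6]
candidate = [0.7, 0.8, 0.6, 0.9, 0.7, 0.8, 0.6, 0.7]
mean_diff, lower, upper = paired_bootstrap_diff(baseline, candidate)
print(f"diff={mean_diff:.3f}, CI=({lower:.3f}, {upper:.3f})")
```

If the interval for the difference excludes zero, the change is unlikely to be noise at the chosen confidence level. If it straddles zero, collect more cases before drawing a conclusion.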

6) Harness runner with cost tracking

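import asyncio
import time

from openai import AsyncOpenAI

class EvalRunner:
    def __init__(self, model: str, metrics: list[Metric]):
        self.model = model
        self.metrics = metrics
        self.results: list[dict] = []
        self.client = AsyncOpenAI()

    async def _generate(self, prompt: str) -> tuple[str, int]:
        # Minimal sketch of the generation call (an assumption; swap in your
        # own application code). Returns the output text and total tokens used.
        resp = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        tokens = resp.usage.total_tokens if resp.usage else 0
        return resp.choices[0].message.content or "", tokens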

    async def run(self, dataset: list[EvalCase]) -> dict:
        total_tokens = 0
        start_time = time.time()

        for case in dataset:
            output, tokens = await self._generate(case.input)
            total_tokens += tokens

            scores = {}
            for metric in self.metrics:
                scores[metric.name] = metric.score(case.input, output, case.expected_output)

            self.results.append({
                "id": case.id,
                "category": case.category,
                "scores": scores,
                "output": output,
            })

        duration = time.time() - start_time
        return {
            "model": self.model,
            "total_cases": len(dataset),
            "duration_s": round(duration, 1),
            "total_tokens": total_tokens,
            "metrics": self._aggregate(),
        }

    def _aggregate(self) -> dict:
        agg = {}
        for metric in self.metrics:
            scores = [r["scores"][metric.name] for r in self.results]
            mean, ci_low, ci_high = bootstrap_ci(scores)
            agg[metric.name] = {
                "mean": round(mean, 4),
                "ci_95": [round(ci_low, 4), round(ci_high, 4)],
            }
        return agg
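The runner records a category and a token total, but it does not yet break scores down by segment or convert tokens into money. A minimal sketch of both, assuming result records shaped like the ones appended in `run`. `mean_by_category`, `estimate_cost_usd`, and the per-1k-token price are hypothetical, and a single blended price is a simplification because providers typically bill input and output tokens at different rates.

```python
from collections import defaultdict

def mean_by_category(results: list[dict], metric_name: str) -> dict[str, float]:
    """Average one metric per category, using the runner's result records."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in results:
        buckets[r["category"]].append(r["scores"][metric_name])
    return {cat: sum(vals) / len(vals) for cat, vals in buckets.items()}

def estimate_cost_usd(total_tokens: int, usd_per_1k_tokens: float) -> float:
    """Rough cost from a blended per-1k-token price (an assumed input)."""
    return total_tokens / 1000 * usd_per_1k_tokens

# Made-up records in the same shape as EvalRunner.results.
results = [
    {"id": "q001", "category": "factual", "scores": {"exact_match": 1.0}},
    {"id": "q003", "category": "factual", "scores": {"exact_match": 0.0}},
    {"id": "q002", "category": "reasoning", "scores": {"exact_match": 1.0}},
]
by_category = mean_by_category(results, "exact_match")
cost = estimate_cost_usd(120_000, usd_per_1k_tokens=0.002)  # placeholder price
print(by_category, round(cost, 2))
```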

7) CI integration pattern

Run evaluations as part of your deployment pipeline:

# .github/workflows/eval.yml
name: LLM Evaluation
on:
  pull_request:
    paths: ['prompts/**', 'src/llm/**']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements-eval.txt
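      - run: python -m eval.run --dataset eval/dataset.json --threshold 0.80
        env:
          # Assumes the API key is stored as a repository secret under this name.
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()  # upload results even when the threshold check fails
        with:
          name: eval-results
          path: eval/results/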

The --threshold flag fails the pipeline if any key metric drops below the minimum, which blocks the pull request before a regression is merged.
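The gate behind that flag can be a few lines. The sketch below assumes a summary dict shaped like the one returned by `EvalRunner.run`. `failing_metrics` and `gate` are hypothetical helpers, since the guide does not show the internals of `eval.run`.

```python
import sys

def failing_metrics(metrics: dict[str, dict], threshold: float) -> list[str]:
    """Names of metrics whose mean falls below the threshold."""
    return sorted(name for name, m in metrics.items() if m["mean"] < threshold)

def gate(summary: dict, threshold: float) -> int:
    """Return a process exit code: 0 if all metrics pass, 1 otherwise."""
    failed = failing_metrics(summary["metrics"], threshold)
    for name in failed:
        mean = summary["metrics"][name]["mean"]
        print(f"FAIL {name}: mean {mean} < {threshold}", file=sys.stderr)
    return 1 if failed else 0

# Shaped like the dict returned by EvalRunner.run; the values are made up.
summary = {"metrics": {"exact_match": {"mean": 0.86}, "llm_judge": {"mean": 0.74}}}
exit_code = gate(summary, threshold=0.80)
print(exit_code)
```

In a command-line entry point this would end with `sys.exit(gate(summary, args.threshold))`, because a nonzero exit code is what makes the CI step fail. Gating on the lower confidence bound instead of the mean is a stricter variant worth considering for small datasets.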

8) Evaluation anti-patterns

Testing on training data — if your evaluation cases leaked into prompt examples or fine-tuning data, scores are meaningless.
Single metric — accuracy alone misses tone, safety, and format. Use multiple metrics.
Stale datasets — as your product evolves, old test cases become irrelevant. Review and update quarterly.
No human baseline — without knowing how well a human performs on your tasks, you cannot interpret whether 85% is good or bad.
Overfitting to the eval — repeatedly tweaking prompts to pass the test set without understanding why. Treat evaluation as a diagnostic tool, not an optimization target.
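The first anti-pattern is the easiest to check mechanically. As a rough sketch, the helper below flags eval inputs that appear near-verbatim inside prompt templates or few-shot examples. `normalize` and `find_leaks` are hypothetical names, and substring matching will miss paraphrased leaks, so treat a clean result as necessary rather than sufficient.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-copies match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def find_leaks(eval_inputs: dict[str, str], prompt_texts: list[str]) -> list[str]:
    """Return ids of eval cases whose normalized input occurs inside any
    normalized prompt template or few-shot example."""
    haystacks = [normalize(p) for p in prompt_texts]
    leaks = []
    for case_id, text in eval_inputs.items():
        needle = normalize(text)
        if needle and any(needle in h for h in haystacks):
            leaks.append(case_id)
    return leaks

# Eval inputs from the dataset above, plus a made-up few-shot example.
eval_inputs = {
    "q001": "What is the capital of France?",
    "q002": "Explain why the sky appears blue in two sentences.",
}
prompt_texts = ["Example: what is the capital of france? -> Paris"]
leaks = find_leaks(eval_inputs, prompt_texts)
print(leaks)
```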

The one thing to remember: A production evaluation harness combines curated datasets, composable metrics (including LLM-as-judge), statistical rigor, and CI integration — it is the quality infrastructure that makes confident model and prompt changes possible.
