Snapshot Testing for AI Systems

Snapshot and golden file testing for AI: capturing expected outputs, managing updates, structural snapshots, semantic similarity assertions, and tools for snapshot testing.

Added 28 Mar 2026 4 min read Updated 30 May 2026

#snapshot-testing #testing #regression #ai-engineering #semantic-similarity

Learn this your way

Read Guided course

Snapshot testing captures a known-good output and compares future outputs against it. When the output changes, the test fails, forcing a developer to review the change and either fix a regression or intentionally update the snapshot. For AI systems, traditional exact-match snapshots are too brittle because model outputs vary. This guide covers snapshot strategies adapted for non-deterministic AI outputs.

Traditional Snapshots for Deterministic Components

Parts of your AI pipeline are deterministic and suit exact-match snapshots perfectly.

Prompt template snapshots. Capture the fully rendered prompt and compare it on every test run. This catches unintentional prompt changes immediately.

python

def test_system_prompt_snapshot(snapshot):
    prompt = build_system_prompt(
        persona="helpful assistant",
        guardrails=["no medical advice", "no code generation"],
        version="v3"
    )
    snapshot.assert_match(prompt, "system_prompt_v3.txt")

When a developer changes the prompt template, the snapshot test fails. They review the diff, confirm the change is intentional, and update the snapshot with pytest --snapshot-update.

Configuration snapshots. Capture model parameters, pipeline configuration, and feature flags.

python

def test_pipeline_config_snapshot(snapshot):
    config = pipeline.get_config()
    snapshot.assert_match(json.dumps(config, indent=2), "pipeline_config.json")

Structural Snapshots

Instead of comparing exact content, compare the structure of AI outputs. This is useful when the exact text varies but the shape should be consistent.

python

def extract_structure(response: dict) -> dict:
    """Extract the structural skeleton of a response, ignoring exact values."""
    def structure_of(value):
        if isinstance(value, dict):
            return {k: structure_of(v) for k, v in value.items()}
        if isinstance(value, list):
            if len(value) == 0:
                return "list(empty)"
            return [structure_of(value[0]), f"...({len(value)} items)"]
        return type(value).__name__
    return structure_of(response)

def test_response_structure_snapshot(snapshot):
    response = pipeline.run("What is Python?")
    structure = extract_structure(response.to_dict())
    snapshot.assert_match(json.dumps(structure, indent=2), "response_structure.json")

A structural snapshot might look like:

json

{
  "answer": "str",
  "confidence": "float",
  "sources": [
    {"id": "str", "text": "str", "score": "float"},
    "...(3 items)"
  ],
  "metadata": {
    "model": "str",
    "latency_ms": "int"
  }
}

This catches structural regressions (missing fields, wrong types, changed nesting) without breaking on text variation.

Semantic Similarity Assertions

Replace exact string comparison with semantic similarity. Two responses that say the same thing in different words should both pass.

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def assert_semantically_similar(actual: str, expected: str, threshold: float = 0.80):
    """Assert that two texts are semantically similar above a threshold."""
    embeddings = model.encode([actual, expected])
    similarity = float(util.cos_sim(embeddings[0], embeddings[1]))
    assert similarity >= threshold, (
        f"Semantic similarity {similarity:.3f} below threshold {threshold}. "
        f"Expected: '{expected[:100]}...', Got: '{actual[:100]}...'"
    )

def test_response_semantically_matches_golden():
    golden = "Python is a high-level programming language created by Guido van Rossum."
    response = pipeline.run("What is Python?")
    assert_semantically_similar(response.answer, golden, threshold=0.75)

Semantic similarity assertions tolerate paraphrasing but catch responses that are off-topic or factually wrong. Set the threshold based on empirical testing: too high (0.95) and you get false failures from paraphrasing; too low (0.50) and you miss real regressions.

Managing Snapshot Updates When Models Change

Model updates change outputs. When you upgrade from one model version to another, many snapshots will break. Manage this with a process.

Batch update workflow:

Run the full snapshot suite against the new model
Review all failures as a batch: are changes improvements, regressions, or neutral variations?
Update snapshots for improvements and neutral changes
Fix regressions before accepting the model update

Version your snapshots. Include the model version in snapshot filenames or directories.

tests/snapshots/
  gpt-4o-2024-08-06/
    system_prompt_v3.txt
    response_structure.json
  gpt-4o-2025-01-15/
    system_prompt_v3.txt
    response_structure.json

This creates a clear history of how outputs change across model versions.

LLM-as-Judge for Snapshot Comparison

For complex outputs where semantic similarity is insufficient, use an LLM to judge whether the new output is equivalent to the golden output.

python

def llm_judge_equivalent(golden: str, actual: str, criteria: str) -> bool:
    """Use an LLM to judge whether actual output is equivalent to golden."""
    prompt = f"""Compare these two responses and determine if they are functionally equivalent.

Golden response: {golden}
Actual response: {actual}
Criteria: {criteria}

Answer with only "EQUIVALENT" or "NOT_EQUIVALENT" followed by a brief reason."""

    result = judge_model.generate(prompt, temperature=0)
    return result.strip().startswith("EQUIVALENT")

def test_response_judged_equivalent():
    golden = load_fixture("golden/python_answer.txt")
    actual = pipeline.run("What is Python?").answer
    assert llm_judge_equivalent(golden, actual, "Factually accurate and complete")

This is expensive (it requires a model call per assertion) and should be used sparingly: for critical outputs in a scheduled eval suite, not in every PR test run.

Tools

pytest-snapshot (Python): Integrates snapshot testing with pytest. Simple API: snapshot.assert_match(value, snapshot_name).

syrupy (Python): A more full-featured snapshot library with custom serializers and inline snapshots.

Jest snapshots (JavaScript): Built into Jest with expect(value).toMatchSnapshot(). Supports inline snapshots and custom serializers.

Custom semantic diff: Build a thin wrapper that combines structural comparison for JSON shape and semantic similarity for text fields. This gives you the best of both approaches: strict validation of structure, flexible validation of content.

python

def semantic_snapshot_compare(actual: dict, golden: dict, text_fields: list[str]):
    """Compare structure exactly, text fields semantically."""
    for key, value in golden.items():
        assert key in actual, f"Missing key: {key}"
        if key in text_fields:
            assert_semantically_similar(actual[key], value, threshold=0.75)
        elif isinstance(value, dict):
            semantic_snapshot_compare(actual[key], value, text_fields)
        else:
            assert actual[key] == value, f"Mismatch on {key}: {actual[key]} != {value}"

Open source projects

Freelancer Templates Contracts, proposals, SOWs

Freelancer Automation Workflow recipes, AI playbooks

Work with Linda

Workshop Series €2,000/mo x 3

1:1 Consulting 60 min session