Snapshot Testing for AI Systems
Snapshot and golden file testing for AI: capturing expected outputs, managing updates, structural snapshots, semantic similarity assertions, and tools for snapshot testing.
Snapshot testing captures a known-good output and compares future outputs against it. When the output changes, the test fails, forcing a developer to review the change and either fix a regression or intentionally update the snapshot. For AI systems, traditional exact-match snapshots are too brittle because model outputs vary. This guide covers snapshot strategies adapted for non-deterministic AI outputs.
Traditional Snapshots for Deterministic Components
Parts of your AI pipeline are deterministic and suit exact-match snapshots perfectly.
Prompt template snapshots. Capture the fully rendered prompt and compare it on every test run. This catches unintentional prompt changes immediately.
def test_system_prompt_snapshot(snapshot):
prompt = build_system_prompt(
persona="helpful assistant",
guardrails=["no medical advice", "no code generation"],
version="v3"
)
snapshot.assert_match(prompt, "system_prompt_v3.txt")
When a developer changes the prompt template, the snapshot test fails. They review the diff, confirm the change is intentional, and update the snapshot with pytest --snapshot-update.
Configuration snapshots. Capture model parameters, pipeline configuration, and feature flags.
def test_pipeline_config_snapshot(snapshot):
config = pipeline.get_config()
snapshot.assert_match(json.dumps(config, indent=2), "pipeline_config.json")
Structural Snapshots
Instead of comparing exact content, compare the structure of AI outputs. This is useful when the exact text varies but the shape should be consistent.
def extract_structure(response: dict) -> dict:
"""Extract the structural skeleton of a response, ignoring exact values."""
def structure_of(value):
if isinstance(value, dict):
return {k: structure_of(v) for k, v in value.items()}
if isinstance(value, list):
if len(value) == 0:
return "list(empty)"
return [structure_of(value[0]), f"...({len(value)} items)"]
return type(value).__name__
return structure_of(response)
def test_response_structure_snapshot(snapshot):
response = pipeline.run("What is Python?")
structure = extract_structure(response.to_dict())
snapshot.assert_match(json.dumps(structure, indent=2), "response_structure.json")
A structural snapshot might look like:
{
"answer": "str",
"confidence": "float",
"sources": [
{"id": "str", "text": "str", "score": "float"},
"...(3 items)"
],
"metadata": {
"model": "str",
"latency_ms": "int"
}
}
This catches structural regressions (missing fields, wrong types, changed nesting) without breaking on text variation.
Semantic Similarity Assertions
Replace exact string comparison with semantic similarity. Two responses that say the same thing in different words should both pass.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
def assert_semantically_similar(actual: str, expected: str, threshold: float = 0.80):
"""Assert that two texts are semantically similar above a threshold."""
embeddings = model.encode([actual, expected])
similarity = float(util.cos_sim(embeddings[0], embeddings[1]))
assert similarity >= threshold, (
f"Semantic similarity {similarity:.3f} below threshold {threshold}. "
f"Expected: '{expected[:100]}...', Got: '{actual[:100]}...'"
)
def test_response_semantically_matches_golden():
golden = "Python is a high-level programming language created by Guido van Rossum."
response = pipeline.run("What is Python?")
assert_semantically_similar(response.answer, golden, threshold=0.75)
Semantic similarity assertions tolerate paraphrasing but catch responses that are off-topic or factually wrong. Set the threshold based on empirical testing: too high (0.95) and you get false failures from paraphrasing; too low (0.50) and you miss real regressions.
Managing Snapshot Updates When Models Change
Model updates change outputs. When you upgrade from one model version to another, many snapshots will break. Manage this with a process.
Batch update workflow:
- Run the full snapshot suite against the new model
- Review all failures as a batch: are changes improvements, regressions, or neutral variations?
- Update snapshots for improvements and neutral changes
- Fix regressions before accepting the model update
Version your snapshots. Include the model version in snapshot filenames or directories.
tests/snapshots/
gpt-4o-2024-08-06/
system_prompt_v3.txt
response_structure.json
gpt-4o-2025-01-15/
system_prompt_v3.txt
response_structure.json
This creates a clear history of how outputs change across model versions.
LLM-as-Judge for Snapshot Comparison
For complex outputs where semantic similarity is insufficient, use an LLM to judge whether the new output is equivalent to the golden output.
def llm_judge_equivalent(golden: str, actual: str, criteria: str) -> bool:
"""Use an LLM to judge whether actual output is equivalent to golden."""
prompt = f"""Compare these two responses and determine if they are functionally equivalent.
Golden response: {golden}
Actual response: {actual}
Criteria: {criteria}
Answer with only "EQUIVALENT" or "NOT_EQUIVALENT" followed by a brief reason."""
result = judge_model.generate(prompt, temperature=0)
return result.strip().startswith("EQUIVALENT")
def test_response_judged_equivalent():
golden = load_fixture("golden/python_answer.txt")
actual = pipeline.run("What is Python?").answer
assert llm_judge_equivalent(golden, actual, "Factually accurate and complete")
This is expensive (it requires a model call per assertion) and should be used sparingly: for critical outputs in a scheduled eval suite, not in every PR test run.
Tools
pytest-snapshot (Python): Integrates snapshot testing with pytest. Simple API: snapshot.assert_match(value, snapshot_name).
syrupy (Python): A more full-featured snapshot library with custom serializers and inline snapshots.
Jest snapshots (JavaScript): Built into Jest with expect(value).toMatchSnapshot(). Supports inline snapshots and custom serializers.
Custom semantic diff: Build a thin wrapper that combines structural comparison for JSON shape and semantic similarity for text fields. This gives you the best of both approaches: strict validation of structure, flexible validation of content.
def semantic_snapshot_compare(actual: dict, golden: dict, text_fields: list[str]):
"""Compare structure exactly, text fields semantically."""
for key, value in golden.items():
assert key in actual, f"Missing key: {key}"
if key in text_fields:
assert_semantically_similar(actual[key], value, threshold=0.75)
elif isinstance(value, dict):
semantic_snapshot_compare(actual[key], value, text_fields)
else:
assert actual[key] == value, f"Mismatch on {key}: {actual[key]} != {value}"
Need help implementing this?
Turn this knowledge into a working prototype. Our structured workshop methodology takes you from idea to deployed AI solution in three sessions.
Explore AI Workshops