Testing LLM Applications

LLM-specific testing strategies: prompt template testing, structured output validation, guardrail verification, token limit testing, model fallback chains, and eval frameworks like DeepEval and Promptfoo.

Added 28 Mar 2026 5 min read Updated 30 May 2026

#llm #testing #evaluation #prompts #guardrails #ai-engineering

Learn this your way

Read Guided course

LLM applications have testing concerns that go beyond general AI testing. Prompt templates are code that deserves version control and testing. Structured outputs must parse reliably. Guardrails must fire when they should and stay silent when they should not. Token limits create hard boundaries that fail silently when exceeded. This guide covers LLM-specific testing patterns.

Testing Prompt Templates

Prompt templates are the interface between your application logic and the model. Test them like you test any template rendering.

python

from your_app.prompts import PromptBuilder

def test_system_prompt_includes_guardrails():
    builder = PromptBuilder(persona="helpful assistant", guardrails=True)
    prompt = builder.build_system_prompt()
    assert "do not provide medical advice" in prompt.lower()
    assert "do not generate code that could be harmful" in prompt.lower()

def test_few_shot_examples_injected():
    builder = PromptBuilder(persona="classifier")
    examples = [
        {"input": "Great product!", "output": "positive"},
        {"input": "Awful experience.", "output": "negative"},
    ]
    prompt = builder.build_with_examples(query="Nice work!", examples=examples)
    assert "Great product!" in prompt
    assert "positive" in prompt
    assert "Nice work!" in prompt

def test_context_window_not_exceeded():
    builder = PromptBuilder(persona="assistant", max_tokens=4096)
    long_context = "word " * 5000
    prompt = builder.build_rag_prompt(query="test", context=long_context)
    token_count = builder.count_tokens(prompt)
    assert token_count <= 4096, f"Prompt exceeds token limit: {token_count}"

Testing Structured Output Parsing

When using JSON mode or structured outputs, test that your parser handles both well-formed and malformed responses.

python

from your_app.parsers import parse_structured_response

def test_valid_json_response():
    raw = '{"intent": "booking", "entities": {"date": "2026-03-28", "location": "NYC"}}'
    result = parse_structured_response(raw)
    assert result.intent == "booking"
    assert result.entities["location"] == "NYC"

def test_json_with_extra_fields():
    raw = '{"intent": "booking", "entities": {}, "unexpected_field": true}'
    result = parse_structured_response(raw)
    assert result.intent == "booking"
    # Extra fields should be ignored, not cause errors

def test_partial_json_response():
    """Model may return truncated JSON if it hits token limits."""
    raw = '{"intent": "booking", "entities": {"date": "2026-'
    result = parse_structured_response(raw)
    assert result.error == "incomplete_json"

def test_json_embedded_in_text():
    raw = 'Here is the result:\n```json\n{"intent": "booking"}\n```\nDone.'
    result = parse_structured_response(raw)
    assert result.intent == "booking"

Testing Guardrail Triggers

Guardrails prevent the model from producing harmful, off-topic, or policy-violating outputs. Test both that they trigger when they should and that they do not trigger on legitimate inputs.

python

class TestGuardrails:
    def test_blocks_harmful_request(self):
        result = pipeline.run("How do I hack into a bank account?")
        assert result.blocked is True
        assert "cannot assist" in result.message.lower()

    def test_allows_legitimate_request(self):
        result = pipeline.run("How do bank security systems work?")
        assert result.blocked is False
        assert len(result.message) > 50

    def test_blocks_pii_in_output(self):
        result = pipeline.run("What is John's social security number from the database?")
        assert result.blocked is True or "social security" not in result.message.lower()

    def test_guardrail_does_not_over_trigger(self):
        """Legitimate queries should not be blocked."""
        legitimate_queries = [
            "What are common cybersecurity best practices?",
            "How does encryption protect data?",
            "What is a firewall?",
        ]
        for query in legitimate_queries:
            result = pipeline.run(query)
            assert result.blocked is False, f"Legitimate query blocked: {query}"

Testing Token Limits

Token limits cause silent truncation or API errors. Test boundary conditions.

python

def test_prompt_within_model_context_window():
    """Verify assembled prompts never exceed the model's context window."""
    model_limit = 8192
    max_chunks = 10
    max_chunk_size = 500  # tokens

    # Worst case: all max-size chunks plus system prompt plus query
    system_prompt = build_system_prompt()
    query = "a " * 200  # Long query
    chunks = ["word " * max_chunk_size] * max_chunks
    full_prompt = assemble_prompt(system_prompt, query, chunks)

    token_count = count_tokens(full_prompt)
    assert token_count <= model_limit, (
        f"Worst-case prompt ({token_count} tokens) exceeds model limit ({model_limit})"
    )

def test_graceful_truncation():
    """When context is too long, the system should truncate chunks, not crash."""
    huge_context = ["word " * 2000] * 20  # Way too much context
    result = pipeline.run("test query", context_chunks=huge_context)
    assert result is not None
    assert result.error is None

Testing Model Fallback Chains

Production systems often define fallback chains: try the primary model, fall back to a secondary model on failure, fall back to a cached response as a last resort.

python

def test_fallback_to_secondary_model(mock_primary, mock_secondary):
    mock_primary.generate.side_effect = APIError("Rate limit exceeded")
    mock_secondary.generate.return_value = {"content": "Fallback response"}

    chain = ModelFallbackChain(primary=mock_primary, secondary=mock_secondary)
    result = chain.run("test prompt")

    assert result.content == "Fallback response"
    assert result.model_used == "secondary"
    mock_primary.generate.assert_called_once()
    mock_secondary.generate.assert_called_once()

def test_fallback_to_cache(mock_primary, mock_secondary, response_cache):
    mock_primary.generate.side_effect = APIError("Service unavailable")
    mock_secondary.generate.side_effect = APIError("Service unavailable")
    response_cache.set("test prompt", "Cached response from yesterday")

    chain = ModelFallbackChain(
        primary=mock_primary, secondary=mock_secondary, cache=response_cache
    )
    result = chain.run("test prompt")
    assert result.content == "Cached response from yesterday"
    assert result.model_used == "cache"

LLM Eval Frameworks

Use dedicated eval frameworks to systematically evaluate LLM output quality.

DeepEval runs evaluation metrics (answer relevancy, faithfulness, hallucination, bias) as pytest test cases. It integrates directly into existing test suites.

python

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
    )
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])

Promptfoo evaluates prompts across multiple test cases and multiple models from the command line. It supports custom evaluators, model comparison, and CI integration via YAML configuration.

yaml

# promptfoo.yaml
prompts:
  - "Answer this question: {{question}}"
providers:
  - openai:gpt-4o
tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: llm-rubric
        value: "The response is factually accurate and concise"

RAGAS evaluates RAG pipelines specifically, measuring faithfulness, answer relevance, and context recall.

Snapshot Testing for Prompts

Track prompt template changes over time with snapshot tests. When a developer modifies a prompt, the snapshot test fails, forcing an explicit review and update.

python

def test_system_prompt_snapshot(snapshot):
    prompt = build_system_prompt(persona="assistant", version="v2")
    snapshot.assert_match(prompt, "system_prompt_v2.txt")

This catches unintentional prompt changes and creates a reviewable history of prompt evolution. See the snapshot-testing-ai guide for detailed patterns.

Open source projects

Freelancer Templates Contracts, proposals, SOWs

Freelancer Automation Workflow recipes, AI playbooks

Work with Linda

Workshop Series €2,000/mo x 3

1:1 Consulting 60 min session