Prompt engineering is the practice of constructing inputs to language models that reliably produce the outputs your application needs. In a prototype, a prompt is often a string written in an afternoon. In production, a prompt is a versioned artifact with a test suite, a deployment process, and a change history. This guide covers the techniques and operational practices that make the difference.

System Prompts

The system prompt establishes the model’s context, persona, constraints, and output requirements. In enterprise applications, it carries most of the engineering load. A well-structured system prompt has four sections:

Role and context - Tell the model what it is and what system it is operating within. Be specific about the domain, the organization, and the intended audience. Vague role definitions produce vague outputs.

Instructions - Explicit rules for what the model should and should not do. Use positive framing (do this) rather than negative framing (do not do that) where possible. Negative instructions are more often violated.

Output format - Specify exactly what structure the response should take. If the downstream system parses JSON, say so explicitly and provide the schema. If the response should be a markdown list with a specific number of items, state that.

Examples - One or two worked examples embedded in the system prompt are among the most effective techniques available. They demonstrate the standard of quality and the exact format expected.

For Amazon Bedrock deployments, store system prompts in AWS Bedrock Prompt Management rather than hardcoding them in application code. This decouples prompt changes from code deploys.

Few-Shot Examples

Few-shot prompting means providing examples of input/output pairs within the prompt itself. The model infers the pattern from the examples and applies it to new inputs.

Principles for effective few-shot examples:

  • Use examples that cover the range of input variation, not just the easy case
  • Include at least one example that shows correct handling of an edge case or ambiguous input
  • Keep examples consistent in format and length
  • For classification tasks, include examples of each category

The number of examples matters less than their quality. Two precise, representative examples outperform six mediocre ones.

Chain-of-Thought

Chain-of-thought (CoT) prompting instructs the model to reason step by step before producing a final answer. For complex tasks - multi-step calculations, legal analysis, diagnostic reasoning - CoT substantially improves accuracy.

The simplest implementation adds “Think through this step by step before answering” to the user message. More controlled implementations use a structured scratchpad format: the model emits its reasoning in a <thinking> block, then produces the final answer separately. The application can discard the thinking block before displaying output to users.

Use CoT for tasks where the quality of the reasoning matters or where errors in intermediate steps would corrupt the final answer. For straightforward classification or extraction tasks, CoT adds latency and tokens without improving quality.

Structured JSON Output

Production systems need deterministic output formats that downstream code can parse. The most reliable approach for JSON output:

  1. Specify the schema explicitly in the system prompt, including field names, types, and descriptions
  2. Provide a complete example of a valid response in the system prompt
  3. Instruct the model to output only the JSON object, with no preamble or explanation
  4. For Anthropic models, use the Prefill feature to start the assistant turn with { - this constrains the model to begin immediately with JSON

Even with these techniques, validate all model output before passing it to downstream systems. Use a JSON schema validator and catch parse errors. Design error handling to either retry with a clarifying message or fall back to a default value.

Guardrails in Prompts

Prompt-level guardrails define what the model should refuse or redirect. They are a first layer of defense, not a complete solution. For enterprise applications, combine prompt guardrails with AWS Bedrock Guardrails (a managed service that applies content filtering, PII detection, and topic restrictions at the API layer, outside the model itself).

Prompt-level guardrail patterns that work:

  • Explicitly list topics the model should decline to discuss in the system prompt
  • Instruct the model to respond with a specific refusal message when a request is out of scope
  • For sensitive operations (actions with real-world consequences), require the model to restate what it is about to do and ask for confirmation before proceeding

Prompt Versioning

Every production prompt should have a version number. When a prompt changes, the version changes. This allows:

  • Correlation of output quality changes with specific prompt changes in application logs
  • Rollback when a prompt update degrades quality
  • A/B testing between prompt versions

AWS Bedrock Prompt Management stores prompt versions with metadata. Reference prompts by version ARN in application code so deployments are deterministic.

A/B Testing Prompts

Prompt quality is empirical. The only reliable way to know whether a new prompt is better is to test it on representative inputs and measure the outputs against defined quality criteria.

A simple A/B testing approach:

  1. Define quality criteria in advance (accuracy on a labeled test set, user rating distribution, task completion rate)
  2. Route a percentage of production traffic to the new prompt variant using a feature flag
  3. Log inputs, outputs, and quality signals for both variants
  4. Compare distributions after sufficient sample size
  5. Promote the winning variant

Amazon Bedrock Prompt Management supports prompt variants natively. Pair it with a feature flag system (such as AWS AppConfig) for traffic routing control.

Operational Practices

  • Store prompts as version-controlled files alongside application code, even if they are also in a prompt management service
  • Write evaluation tests for every prompt: a set of inputs with expected outputs or quality criteria
  • Run evaluations in CI on every prompt change
  • Log the prompt version with every model call so outputs can be traced to the exact prompt that produced them
  • Set token budgets explicitly: both max_tokens for the response and, where supported, input token limits to prevent runaway costs from unexpectedly large inputs

Sources