Context engineering is the practice of deciding which tokens a model sees on each call, in what order, and at what cost. It treats the context window as a budget, not a bucket. Done well, it cuts token spend and raises answer accuracy at the same time, because the model stops drowning in irrelevant text.

Figure: a long bank of dark filing cabinets with a single drawer open and glowing amber. You pull one drawer from a hundred: that selection is the work, the same way curating the optimal set of tokens is the work the model cannot do for you.

Anthropic defines context engineering as “the set of strategies for curating and maintaining the optimal set of tokens during LLM inference,” and frames it as the natural progression of prompt engineering for multi-turn agents (Anthropic, 2025). This guide shows you what to select, how to order it, when to compact it, and how caching changes the cost math.

The context lifecycle

Every agent turn moves tokens through the same stages. Name them, and you can optimize each one.

Step 1: Select. Pick only the chunks, tools, and history this turn needs. Drop the rest.
Step 2: Order. Put the most relevant tokens at the start and end, not buried in the middle.
Step 3: Compact. Summarize old history near the limit and reinitialize the window.
Step 4: Cache. Reuse the stable prefix across calls at a steep discount.
Step 5: Retrieve. Load extra data at runtime through tools, only when the turn calls for it.

Context engineering vs prompt engineering

Prompt engineering is writing the instructions: the system prompt, the few-shot examples, the wording of the task. You craft those once and reuse them.

Context engineering is broader. It decides the whole set of tokens that reaches the model on each inference, across many turns. Anthropic calls it the natural progression of prompt engineering for multi-turn agents (Anthropic, 2025). A single chatbot reply is mostly a prompt-engineering problem. A 40-turn agent that calls tools, reads files, and tracks state is a context-engineering problem.

The shift matters because context is finite. Anthropic frames it as “a finite resource with diminishing marginal returns” and uses the term “context rot”: as token count grows, accuracy and recall degrade (Anthropic, 2025; Anthropic context windows docs). More tokens is not more intelligence. Past a point, more tokens is less.

The lost-in-the-middle and context-rot problem

Two studies explain why stuffing the window backfires.

Liu et al. found a U-shaped accuracy curve. Models answer best when the relevant information sits at the start or the end of the context, and accuracy degrades when that information lands in the middle. The effect holds even for models built for long contexts (Liu et al., TACL 2024). Picture a reader who remembers the first page and the last page of a report, but skims the hundred pages between.

Chroma’s “Context Rot” study tested 18 models and found performance degrades non-uniformly as input length grows, even on simple retrieval tasks (Hong, Troynikov, Huber, Chroma, 2025). The model does not fail gracefully at a clean cutoff. It gets unreliable in patches as the window fills.

Together these mean two things. First, position is a lever you control: put the load-bearing tokens at the edges. Second, length is a cost even when you are under the limit: a 1M-token window does not mean you should fill it. 1M-token windows are now common across frontier model lines as of 2026 (Anthropic context windows docs), which makes disciplined selection more important, not less.

Techniques to cut tokens

Anthropic names a set of techniques for keeping the window lean (Anthropic, 2025). Use them together.

Selection and curation

Load only what the turn needs. If the user asks about one invoice, do not load the whole ledger. Selection is the highest-leverage step, because every token you do not add is a token that cannot rot the rest.

Ordering

Place the most relevant content where the model reads it best: at the start or the end. Liu et al. give you the rule directly (Liu et al., TACL 2024). Keep retrieved chunks ranked by relevance, and put the question last so it sits at the high-recall tail.

Compaction

When history approaches the window limit, summarize the earlier turns and reinitialize the context with that summary plus recent turns (Anthropic, 2025). You trade exact transcript for a compact gist, and you reclaim room to keep going. A long support conversation becomes a three-line summary plus the last few exchanges.
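A minimal sketch of that compaction step, under stated assumptions: messages are plain dicts with "role" and "content" keys, `summarize` is a hypothetical callable you supply (for example, a model call that condenses a transcript), and `estimate_tokens` is a rough four-characters-per-token guess rather than a real tokenizer. The threshold and the number of turns kept are illustrative defaults, not recommended values.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token); use a real tokenizer in practice."""
    return max(1, len(text) // 4)


def compact_history(
    messages: list[dict],
    summarize: Callable[[str], str],
    limit_tokens: int = 100_000,
    threshold: float = 0.8,
    keep_recent: int = 6,
) -> list[dict]:
    """Return messages unchanged while under the threshold, else a summary plus recent turns."""
    used = sum(estimate_tokens(m["content"]) for m in messages)
    if used < threshold * limit_tokens or len(messages) <= keep_recent:
        return messages

    # Split into the old turns to summarize and the recent turns to keep verbatim.
    split = len(messages) - keep_recent
    old, recent = messages[:split], messages[split:]

    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in old)
    summary = summarize(transcript)

    # Reinitialize the window: one summary message, then the recent exchanges.
    summary_message = {
        "role": "user",
        "content": f"Summary of earlier conversation:\n{summary}",
    }
    return [summary_message] + recent
```

Calling it before every model request keeps the check cheap: the history passes through untouched until it crosses the threshold.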

Structured note-taking and memory

Persist notes outside the window. The agent writes findings to a file or a store, then reads them back later instead of carrying every detail in-context (Anthropic, 2025). This is how an agent remembers across sessions without paying to re-read its whole history each turn.
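One way to picture this is a small file-backed store that the agent writes to and reads from through tools. The `NoteStore` class, its method names, and the JSON file layout below are all hypothetical, chosen only to illustrate the pattern; a database or key-value store would serve equally well.

```python
import json
from pathlib import Path


class NoteStore:
    """Hypothetical file-backed notes that live outside the context window."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def _load(self) -> dict[str, str]:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def write(self, key: str, note: str) -> None:
        """Persist one finding under a key, overwriting any earlier note."""
        notes = self._load()
        notes[key] = note
        self.path.write_text(json.dumps(notes, indent=2), encoding="utf-8")

    def read(self, keys: list[str]) -> str:
        """Return only the requested notes, formatted for insertion into context."""
        notes = self._load()
        return "\n".join(f"- {k}: {notes[k]}" for k in keys if k in notes)
```

The point of `read` taking a list of keys is that a later session pulls back only the notes the current turn needs, not the whole file.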

Just-in-time retrieval

Load data at runtime through tools rather than front-loading it (Anthropic, 2025). The model asks for a record when it needs one, and the tool result enters context only then. Tool-result clearing removes those results once they are no longer needed, keeping the window from filling with stale lookups.
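Tool-result clearing can be sketched as a pass over the message history before each call. This assumes a simplified message format in which tool results are dicts with role "tool"; real provider formats differ, and the function name, the placeholder text, and the keep-last-two default are illustrative only.

```python
def clear_stale_tool_results(
    messages: list[dict],
    keep_last: int = 2,
    placeholder: str = "[tool result cleared]",
) -> list[dict]:
    """Replace all but the most recent `keep_last` tool results with a short placeholder."""
    tool_indexes = [i for i, m in enumerate(messages) if m["role"] == "tool"]

    # Everything except the newest `keep_last` tool results counts as stale.
    if keep_last > 0:
        stale = set(tool_indexes[:-keep_last])
    else:
        stale = set(tool_indexes)

    # Build a new list so the caller's original history is left untouched.
    return [
        {**m, "content": placeholder} if i in stale else m
        for i, m in enumerate(messages)
    ]
```

Leaving a placeholder rather than deleting the message keeps the turn structure intact, so the model can still see that a lookup happened.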

Sub-agent architectures

Hand focused work to sub-agents with clean windows. Each sub-agent runs its own narrow task and returns a condensed summary to the main agent (Anthropic, 2025). The main agent never sees the sub-agent’s full scratch work, only the result.
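The shape of that hand-off might look like the sketch below. It assumes `call_model` is a hypothetical callable that takes a prompt string and returns the model's text; the function names, prompt wording, and the character cap on summaries are illustrative, not part of any library.

```python
from typing import Callable


def run_subagent(
    task: str,
    documents: list[str],
    call_model: Callable[[str], str],
    max_summary_chars: int = 500,
) -> str:
    """Run one narrow task in a fresh context and return only a condensed result."""
    # The sub-agent's window holds only its own task and its own documents.
    prompt = (
        f"Task: {task}\n\n"
        + "\n\n".join(documents)
        + "\n\nReply with a short summary of your findings."
    )
    return call_model(prompt)[:max_summary_chars]


def orchestrate(
    question: str,
    subtasks: dict[str, list[str]],
    call_model: Callable[[str], str],
) -> str:
    """Answer a question from sub-agent summaries, never from the raw documents."""
    summaries = [
        f"{task}: {run_subagent(task, docs, call_model)}"
        for task, docs in subtasks.items()
    ]
    return call_model(f"Question: {question}\n\nFindings:\n" + "\n".join(summaries))
```

The main agent's prompt grows with the number of sub-tasks, not with the size of the material each sub-agent read.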

Here is a context-assembly function that selects and orders only the relevant chunks before the model ever sees them. The Chunk type and the word-overlap score function are illustrative placeholders: swap in your own chunk representation and an embedding or reranker score.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    tokens: int  # token count, precomputed with your tokenizer


def score(query, chunk):
    """Placeholder relevance score (shared words). Swap in embeddings or a reranker."""
    query_words = set(query.lower().split())
    return len(query_words & set(chunk.text.lower().split()))


def assemble_context(query, chunks, max_chunk_tokens=4000):
    """Select and order chunks so the model reads the best ones at the edges."""
    # 1. Score every candidate chunk and rank by relevance, best first.
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)

    # 2. Select only chunks that fit the budget. Skip any chunk that would
    #    overflow it, so one oversized chunk does not block smaller ones.
    selected, used = [], 0
    for chunk in ranked:
        if used + chunk.tokens > max_chunk_tokens:
            continue
        selected.append(chunk)
        used += chunk.tokens

    # 3. Order so the strongest chunk sits first and the second strongest sits
    #    last, keeping weaker chunks in the middle where recall is weakest.
    edges = selected[:1] + selected[2:] + selected[1:2]

    # 4. Put the question last so it lands at the high-recall tail.
    return "\n\n".join(c.text for c in edges) + f"\n\nQuestion: {query}"
```

Prompt caching and the real discount

Caching reuses a processed prompt prefix across calls, so you pay full price for the stable part once instead of every request. The discounts are steep and worth designing around.

OpenAI caches automatically for prompts at or above 1,024 tokens, and states up to 80% lower latency and up to 90% lower input cost on cache hits (OpenAI prompt caching). Anthropic prices cache reads at 0.1x base input, about 90% off, with cache writes at 1.25x base input for a 5-minute time-to-live or 2x for a 1-hour time-to-live (Anthropic prompt caching docs). Google Gemini turns implicit caching on by default for Gemini 2.5 and newer, with a 4,096-token minimum, and passes the savings on automatically (Google Gemini context caching).

The design rule that follows: put stable content first, volatile content last. A cached prefix only pays off if its bytes stay identical across calls. Below, the long system prompt and the document carry the cache breakpoint, while the per-request question stays outside it.

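```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholders: supply your own stable document and per-request question.
# Prefixes shorter than the model's minimum cacheable length are not cached.
LARGE_STABLE_DOCUMENT = "<long reference document, identical on every call>"
user_question = "<the question for this request>"

# Anthropic: mark the stable prefix as cacheable; keep the question outside it.
response = client.messages.create(
    model="claude-opus-4-8",  # model ID as given here; substitute one available to you
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_STABLE_DOCUMENT,        # reused every call
            "cache_control": {"type": "ephemeral"},  # cache write 1.25x, read 0.1x
        }
    ],
    messages=[{"role": "user", "content": user_question}],  # changes per call, not cached
)

# Verify the cache is working. The first call writes the cache, so expect zero there.
# If this stays zero on later calls, a silent change is breaking the prefix.
print(response.usage.cache_read_input_tokens)
```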
Watch one trap: any byte change in the prefix invalidates the cache for everything after it. A timestamp or a per-request ID near the top means you pay the write premium every call and never get a read.
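The arithmetic behind that trap is easy to check. The sketch below uses the Anthropic multipliers quoted above (1.25x write, 0.1x read for the 5-minute time-to-live) and assumes every call lands inside the cache lifetime. The function name and parameters are illustrative, and the result is measured in units of one uncached pass over the prefix.

```python
def relative_prefix_cost(
    calls: int,
    write_mult: float = 1.25,
    read_mult: float = 0.1,
    prefix_stable: bool = True,
) -> float:
    """Cost of the shared prefix over `calls` requests, in uncached-pass units."""
    if calls <= 0:
        return 0.0
    if not prefix_stable:
        # Every call changes the prefix, so every call pays a write and none reads.
        return calls * write_mult
    # One write on the first call, then discounted reads on every later call.
    return write_mult + (calls - 1) * read_mult
```

With these multipliers, ten calls over a stable prefix cost about 2.15 passes against 10 uncached, and caching is already ahead by the second call (1.35 against 2). With a prefix that changes every call, the same ten calls cost 12.5 passes, which is worse than not caching at all.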

| | Selection | Compaction | Structured notes | Just-in-time retrieval | Prompt caching |
| --- | --- | --- | --- | --- | --- |
| What it does | Loads only relevant tokens | Summarizes old history | Persists state outside window | Loads data at runtime via tools | Reuses a stable prefix |
| When to use | Every turn | History nears the limit | State spans sessions | Data is large or optional | Prefix repeats across calls |
| Typical saving | Removes the long tail | Reclaims window room | Avoids re-reading history | Avoids front-loading data | Up to 90% off cached input |

Long context vs RAG

A 1M-token window invites a tempting shortcut: skip retrieval and paste everything in. The evidence says it depends.

Li et al. found that long context generally beats retrieval-augmented generation (RAG) on question-answering tasks. But summarization-based retrieval is comparable, RAG keeps advantages on dialogue and general queries, and neither approach is universally best (Li et al., 2024). RAG is the pattern of fetching relevant chunks from a store and adding them to the prompt, instead of carrying the full corpus.

So treat it as a routing decision, not a dogma. For a single dense document and a factual question, long context often wins. For dialogue, broad queries, or a corpus too large to ever fit, retrieval earns its place. And the two are not exclusive: you can retrieve into a long window, then apply selection and ordering inside it. Context engineering is what makes either approach pay off, because both still face context rot once the window fills.
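As a rough illustration of that routing decision, the heuristic below encodes the cases from the paragraph above. The function name, the query categories, and the strategy labels are assumptions made for this sketch, and the rules are a starting point to tune against your own evaluations rather than a finding from the cited study.

```python
def choose_strategy(corpus_tokens: int, window_tokens: int, query_kind: str) -> str:
    """Return 'long_context', 'rag', or 'rag_into_long_context' (heuristic only)."""
    if corpus_tokens > window_tokens:
        # The corpus cannot fit in the window, so retrieval is the only option.
        return "rag"
    if query_kind in {"dialogue", "general"}:
        # Dialogue and broad queries are where retrieval keeps its advantages.
        return "rag"
    if query_kind == "factual_qa":
        # A dense document plus a factual question often favors long context.
        return "long_context"
    # Unknown query type: retrieve into a long window, then select and order.
    return "rag_into_long_context"
```

Whichever branch is taken, selection and ordering still apply to what ends up in the window.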

Further reading