AI Agent Memory Management

How to give AI agents short-term and long-term memory using summarization, memory files, and vector retrieval, with patterns from MemGPT, Mem0, and Anthropic.

Added 23 Jun 2026 8 min read Updated 23 Jun 2026

#agents #memory #context-engineering #rag #embeddings #llm

An AI agent forgets everything the moment its context window fills or a session ends. Memory management is the practice of storing what matters outside the window and bringing it back when the agent needs it. This guide shows you the architecture, the memory types, and the concrete techniques that production agents use today.

[Image: Rows of dark metal storage lockers with red-glowing seams in an industrial corridor.] A labelled locker holds one item until you open it: an agent's long-term store keeps facts out of the window until retrieval pulls the right one back.

The memory architecture at a glance

Agent memory works in layers. The working context lives inside the model’s window. Everything else lives in stores you control, and a retrieval step decides what to load.

Layer	What it holds	Characteristics
Working context	System prompt, current messages	In-window, finite, fast
Short-term thread memory	Message history, running summary	One session, recent turns
Long-term store	Memory files, vector database, knowledge graph	Across sessions, durable
Retrieval	Similarity search, top-k selection	Picks what to load into context
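
As a rough sketch of how these layers combine, the function below assembles a working context from a system prompt, a running summary, retrieved memories, and recent turns. The names `Layers` and `build_context`, and the character budget standing in for a token budget, are illustrative assumptions rather than part of any framework.

```python
from dataclasses import dataclass, field


@dataclass
class Layers:
    """Hypothetical container for the memory layers described above."""
    system_prompt: str
    running_summary: str = ""                             # short-term: compacted history
    recent_messages: list = field(default_factory=list)   # short-term: verbatim turns
    retrieved: list = field(default_factory=list)         # long-term: top-k hits


def build_context(layers: Layers, max_chars: int = 2000) -> str:
    """Assemble the working context, dropping the oldest turns if over budget."""
    def assemble(turns):
        parts = [layers.system_prompt]
        if layers.running_summary:
            parts.append("Summary so far: " + layers.running_summary)
        if layers.retrieved:
            parts.append("Relevant memory:\n" + "\n".join(layers.retrieved))
        parts.extend(turns)
        return "\n\n".join(parts)

    turns = list(layers.recent_messages)
    # Trim from the oldest turn while the assembled text exceeds the budget.
    while len(turns) > 1 and len(assemble(turns)) > max_chars:
        turns.pop(0)
    return assemble(turns)
```

The design choice worth noting is the order of sacrifice: the system prompt and retrieved memory are kept, and the oldest verbatim turns go first.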

Why agents need memory: the token-cost problem

The naive fix for forgetting is to paste everything into the prompt. That breaks fast. Anthropic frames context as “a finite resource with diminishing marginal returns” and warns about “context rot,” where quality degrades as the window fills [7]. Long prompts cost more tokens, run slower, and bury the signal the model needs.

Anthropic lists three mitigations: compaction, structured note-taking (agentic memory), and sub-agents [7]. Each one keeps the working context small while preserving what the agent learned. Memory management is the discipline of choosing which technique fits which need.
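
A back-of-the-envelope calculation shows why pasting everything breaks down. The figures below (200 tokens per turn, a 2,000-token cap) are made-up assumptions chosen only to show the shape of the growth, and `cumulative_input_tokens` is a hypothetical helper.

```python
from typing import Optional


def cumulative_input_tokens(turns: int, tokens_per_turn: int, cap: Optional[int] = None) -> int:
    """Total input tokens sent across a conversation.

    On turn n the prompt contains all n turns so far, unless a cap
    (for example, a compacted context) bounds the prompt size.
    """
    total = 0
    for n in range(1, turns + 1):
        prompt = n * tokens_per_turn
        if cap is not None:
            prompt = min(prompt, cap)
        total += prompt
    return total


full = cumulative_input_tokens(turns=100, tokens_per_turn=200)
capped = cumulative_input_tokens(turns=100, tokens_per_turn=200, cap=2_000)
print(full, capped)
```

Under these assumptions the uncapped conversation sends 1,010,000 input tokens over 100 turns, while the capped one sends 191,000. Resending full history grows quadratically with the number of turns; a bounded context grows linearly.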

Short-term versus long-term memory

LangChain draws a clear line between the two scopes [8]. Short-term memory is thread-scoped: it holds the message history inside one session. Long-term memory is shared across threads and recallable at any time [8]. A chat that remembers the last ten turns uses short-term memory. An agent that recalls your name three weeks later uses long-term memory.

	Short-term memory	Long-term memory
Scope	One thread or session [8]	Shared across all threads [8]
Lifespan	Ends with the session	Persists indefinitely
Storage	Message history, running summary	Memory files, vector database, graph
Example	Recent turns in this chat	A user preference recalled weeks later
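
A minimal sketch of the two scopes follows. The `thread_id` and `user_id` keys, the class name, and the in-memory dictionaries are illustrative assumptions, not LangChain's API.

```python
from collections import defaultdict


class ScopedMemory:
    """Toy in-memory store separating thread-scoped and cross-thread memory."""

    def __init__(self):
        self._threads = defaultdict(list)    # thread_id -> message history
        self._long_term = defaultdict(dict)  # user_id -> durable facts

    def add_message(self, thread_id, role, content):
        self._threads[thread_id].append({"role": role, "content": content})

    def history(self, thread_id, last_n=10):
        """Short-term: the most recent turns of one thread."""
        return self._threads[thread_id][-last_n:]

    def end_thread(self, thread_id):
        # Short-term memory ends with the session.
        self._threads.pop(thread_id, None)

    def remember(self, user_id, key, value):
        """Long-term: a fact that outlives any single thread."""
        self._long_term[user_id][key] = value

    def recall(self, user_id, key, default=None):
        return self._long_term[user_id].get(key, default)
```

Ending a thread discards its history, but anything written with `remember` is still available to the next thread for the same user.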

Memory types: episodic, semantic, procedural

LangChain maps human memory onto agents using three categories [8]:

Semantic memory stores facts and knowledge [8]. Example: “The user works in healthcare.”
Episodic memory stores past experiences and actions [8]. Example: “Last session the agent booked a flight to Lisbon.”
Procedural memory stores instructions and behavior [8]. Example: “Always confirm before sending an email.”

You design storage around these types. Semantic facts suit a vector store or key-value file. Episodic traces suit an append-only log. Procedural rules often live in the system prompt or a dedicated instructions file.
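
The routing described above can be sketched as follows. `MemoryType` and `TypedMemory` are hypothetical names, and the three containers are the simplest stand-ins for a key-value file, an append-only log, and an instructions file.

```python
from enum import Enum


class MemoryType(Enum):
    SEMANTIC = "semantic"      # facts and knowledge
    EPISODIC = "episodic"      # past experiences and actions
    PROCEDURAL = "procedural"  # instructions and behavior


class TypedMemory:
    def __init__(self):
        self.facts = {}     # semantic: key-value
        self.episodes = []  # episodic: append-only log
        self.rules = []     # procedural: instructions

    def write(self, kind, content, key=None):
        if kind is MemoryType.SEMANTIC:
            if key is None:
                raise ValueError("semantic facts need a key")
            self.facts[key] = content       # newer value replaces older
        elif kind is MemoryType.EPISODIC:
            self.episodes.append(content)   # never rewritten, only appended
        elif kind is MemoryType.PROCEDURAL:
            if content not in self.rules:   # avoid duplicate rules
                self.rules.append(content)
        else:
            raise ValueError(f"unknown memory type: {kind!r}")

    def system_prompt(self, base):
        """Render procedural rules into the system prompt."""
        if not self.rules:
            return base
        return base + "\n\nRules:\n" + "\n".join(f"- {r}" for r in self.rules)
```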

Technique 1: summarization and compaction

When the window approaches its limit, you compress older turns into a summary. Anthropic ships this as compaction, a server-side feature that auto-summarizes older context when the conversation nears the window limit (beta, January 2026) [6]. The agent keeps going without you managing the cutoff by hand.

You can also write compaction yourself. The pattern: watch the token count, and when it crosses a threshold, replace old messages with a model-generated summary.

python

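def render(messages):
    """Flatten messages into plain text for the summarizer."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


def maybe_compact(messages, token_count, llm, limit=150_000, keep_recent=6):
    """Summarize older turns when context nears the window limit.

    `llm` is a placeholder for any client exposing complete(prompt) -> str.
    """
    # Nothing to do while under the limit or when there are no older turns.
    if token_count < limit or len(messages) <= keep_recent:
        return messages

    # Split by index: messages[:-0] would be empty if keep_recent were 0.
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]

    summary = llm.complete(
        "Summarize the conversation below. Keep decisions, facts, "
        "and open tasks. Drop small talk.\n\n" + render(older)
    )

    # One summary message replaces the older turns; recent turns stay verbatim.
    return [{"role": "system", "content": f"Summary so far: {summary}"}, *recent]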

This keeps the recent turns verbatim and folds the rest into one compact note. You trade some detail for a much smaller, cheaper context.
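
The threshold check needs a token count from somewhere. Accurate counts come from the model provider's tokenizer; the sketch below substitutes a crude heuristic of roughly four characters per token, which is an assumption and will drift for code or non-English text. The `headroom` parameter is also an assumption: it triggers compaction early so the reply still fits.

```python
import math


def estimate_tokens(messages, chars_per_token=4.0):
    """Rough token estimate: total characters divided by an assumed ratio."""
    chars = sum(len(m["content"]) for m in messages)
    return math.ceil(chars / chars_per_token)


def should_compact(messages, limit=150_000, headroom=0.8):
    """Trigger compaction at a fraction of the limit to leave room for the reply."""
    return estimate_tokens(messages) >= limit * headroom
```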

Technique 2: structured note-taking and memory files

Compaction loses detail by design. Memory files do not. Anthropic’s memory tool is a client-side, file-based store: a /memories directory with view, create, edit, and delete commands [5]. It persists across sessions, and it survives compaction [5]. The agent writes notes to disk and reads them back later.

This is structured note-taking, one of Anthropic’s three context mitigations [7]. The agent decides what to record. A common pattern: write semantic facts and procedural rules to memory files, then load the relevant file at the start of each session.

python

# `memory` is a placeholder for your client-side handler of the memory commands.
# The agent records a durable fact during a session.
memory.create("/memories/user-profile.md",
              "- Works in healthcare\n- Prefers concise replies")

# A later session reads it back into context.
profile = memory.view("/memories/user-profile.md")
context = f"What you remember about this user:\n{profile}"
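
One way the client side of such a store might look is sketched below. `FileMemory` and its method names are hypothetical and only loosely mirror the view, create, edit, and delete commands; this is not Anthropic's implementation. The path check is there because the agent chooses the file names, so the handler should refuse anything that resolves outside the memory directory.

```python
from pathlib import Path


class FileMemory:
    """Hypothetical file-backed memory rooted at a single directory."""

    def __init__(self, root):
        self.root = Path(root).resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def _resolve(self, name):
        # Reject paths that escape the memory directory (e.g. "../secrets").
        path = (self.root / name).resolve()
        if self.root not in path.parents:
            raise ValueError(f"path escapes memory directory: {name}")
        return path

    def create(self, name, text):
        path = self._resolve(name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")

    def view(self, name):
        return self._resolve(name).read_text(encoding="utf-8")

    def edit(self, name, old, new):
        """Replace the first occurrence of `old` with `new`."""
        text = self.view(name)
        if old not in text:
            raise ValueError(f"text to replace not found in {name}")
        self.create(name, text.replace(old, new, 1))

    def delete(self, name):
        self._resolve(name).unlink()
```

Because the notes are ordinary files, they can be inspected, versioned, or corrected by hand, which is harder to do with an opaque store.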

Technique 3: vector memory and similarity retrieval

For large or open-ended memory, you cannot load every note. Vector databases serve as long-term memory by embedding stored experiences and facts, then retrieving by similarity [4][7]. This is the same RAG and embeddings pattern you use for documents, applied to the agent’s own history.

An embedding turns text into a numeric vector. Similar meanings land near each other. At recall time, you embed the query, search for the nearest stored vectors, and load the top matches into context.

python

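# `embed`, `store`, and `user_message` are placeholders for your embedding
# model, vector database client, and the incoming user turn.
def remember(text, store):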
    """Store a fact with its embedding for later similarity search."""
    vector = embed(text)
    store.add(text=text, vector=vector)

def recall(query, store, k=5):
    """Retrieve the top-k most relevant memories for this query."""
    query_vector = embed(query)
    hits = store.search(query_vector, top_k=k)
    return "\n".join(hit.text for hit in hits)

# Inject retrieved memory into the prompt.
relevant = recall("What does the user do for work?", store)
prompt = f"Relevant memory:\n{relevant}\n\nUser: {user_message}"

You store many facts cheaply and load only the few that match the current turn. This is how an agent recalls one detail from thousands without filling the window.
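
To make the snippet above concrete without external services, here is a self-contained toy. The bag-of-words `embed` function and the `ToyVectorStore` class are stand-ins invented for illustration; a real system would call an embedding model and a vector database.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self):
        self.items = []  # list of (text, vector)

    def add(self, text):
        self.items.append((text, embed(text)))

    def search(self, query, top_k=5):
        """Return up to top_k stored texts, most similar first, skipping zero scores."""
        q = embed(query)
        scored = [(cosine(q, vec), text) for text, vec in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for score, text in scored[:top_k] if score > 0]
```

The toy only matches shared words, so "work" and "works" count as different terms. Learned embeddings exist precisely to close that gap by placing similar meanings near each other.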

MemGPT: virtual context management and Letta

MemGPT introduces “virtual context management,” an operating-system-inspired technique [1]. It pages data between a fixed in-window main context and external out-of-window context [1]. This extends the effective context beyond the model’s window, the way an OS pages memory between RAM and disk [1]. The agent itself decides what to page in and out. The open-source successor to MemGPT is Letta [1][9].

The lesson for builders: treat the window as scarce RAM and your store as cheap disk. Move data across that boundary on demand rather than holding everything resident.
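
The RAM-and-disk analogy can be sketched as a fixed-size main context that evicts its oldest entries to an external archive and pages matches back in on request. `PagedContext` is a hypothetical toy: in MemGPT the agent itself decides what to page, whereas here eviction is a simple first-in, first-out rule and search is substring matching.

```python
from collections import deque


class PagedContext:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.main = deque()  # in-window "RAM"
        self.archive = []    # out-of-window "disk"

    def append(self, item):
        """Add an item to main context, paging out the oldest if over capacity."""
        self.main.append(item)
        while len(self.main) > self.capacity:
            self.archive.append(self.main.popleft())

    def page_in(self, query, limit=1):
        """Move archived items matching the query back into main context."""
        matches = [m for m in self.archive if query.lower() in m.lower()][:limit]
        for m in matches:
            self.archive.remove(m)
            self.append(m)  # may page out something else to make room
        return matches
```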

Generative Agents: the memory stream and reflection

The Generative Agents work from Stanford and Google, published at UIST ’23, shows a richer model [2]. Each agent stores a complete natural-language “memory stream” of its experiences [2]. It synthesizes higher-level “reflections” from those raw memories and retrieves memories dynamically to plan behavior [2].

Step 1: Observe. Record each experience to the memory stream as natural language.
Step 2: Reflect. Synthesize higher-level insights from clusters of raw memories.
Step 3: Retrieve. Pull relevant memories dynamically to plan the next action.

Reflection matters because raw logs grow noisy. Summarizing experiences into stable insights gives the agent a usable long-term view instead of an unbounded transcript.
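
The three steps can be sketched as a small class. Everything here is a simplification under stated assumptions: `summarize` is a caller-supplied function standing in for a model call, reflection fires after a fixed number of observations, and retrieval ranks by shared words. The paper's own reflection and retrieval logic is richer than this.

```python
import re


class MemoryStream:
    def __init__(self, summarize, reflect_every=3):
        self.summarize = summarize  # callable: list[str] -> str
        self.reflect_every = reflect_every
        self.records = []           # (kind, text), in arrival order
        self._unreflected = []

    def observe(self, text):
        """Step 1: append the experience as natural language."""
        self.records.append(("observation", text))
        self._unreflected.append(text)
        if len(self._unreflected) >= self.reflect_every:
            self.reflect()

    def reflect(self):
        """Step 2: fold recent observations into one higher-level insight."""
        if self._unreflected:
            insight = self.summarize(self._unreflected)
            self.records.append(("reflection", insight))
            self._unreflected = []

    def retrieve(self, query, k=3):
        """Step 3: return the k records sharing the most words with the query."""
        words = set(re.findall(r"\w+", query.lower()))
        scored = []
        for _kind, text in self.records:
            overlap = len(words & set(re.findall(r"\w+", text.lower())))
            if overlap:
                scored.append((overlap, text))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]
```

Reflections are stored in the same stream as observations, so retrieval can surface either a raw event or the insight derived from it.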

Mem0: production results

Mem0 packages these ideas for production. It dynamically extracts, consolidates, and retrieves salient information, and a graph variant captures relationships between memories [3]. Rather than storing every message, it keeps the facts worth keeping.

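On the LOCOMO benchmark, Mem0 reports a 26% relative improvement in LLM-judged answer quality over OpenAI’s memory system, along with about 91% lower p95 latency and over 90% token cost savings compared with passing the full context [3]. Note the two baselines: the quality gain is measured against another memory system, while the latency and cost gains are measured against full context. Together the numbers show the payoff: a curated memory layer can stay accurate while running far faster and cheaper than stuffing the window [3].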
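
Consolidation is the step that keeps such a store small. The sketch below is not Mem0's algorithm; it is a deliberately simple illustration that assumes facts arrive already extracted as hypothetical `(subject, attribute, value)` triples and resolves conflicts by letting the newest value win.

```python
def consolidate(store, new_facts):
    """Merge extracted facts into `store`, returning the operations applied.

    store: dict mapping (subject, attribute) -> value
    new_facts: iterable of (subject, attribute, value) triples
    """
    ops = []
    for subject, attribute, value in new_facts:
        key = (subject, attribute)
        if key not in store:
            store[key] = value
            ops.append(("ADD", key, value))
        elif store[key] != value:
            store[key] = value  # newest value wins
            ops.append(("UPDATE", key, value))
        else:
            ops.append(("NOOP", key, value))  # duplicate, nothing stored
    return ops
```

Three messages about the user's city end up as one stored fact, which is the sense in which the layer keeps facts rather than messages.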

Putting it together

Use short-term memory for the live session and long-term stores for everything that must outlast it [8]. Compact when you need to stay under the window and can lose detail [6]. Write memory files when the agent must keep a fact exactly [5]. Use vector retrieval when the store grows past what fits in context [4]. Match the technique to the memory type, and the agent remembers what matters without paying for what it does not.
