LLMs need access to knowledge beyond their training data. The two primary approaches are RAG (retrieve relevant chunks at query time) and long context (stuff the full knowledge base into the context window). As context windows have grown from 4K to millions of tokens, the tradeoffs between these approaches have shifted.

Overview

AspectRAGLong Context
Knowledge VolumeUnlimited (external store)Limited by context window
Retrieval QualityDepends on retrieval pipelineAll information available
LatencyRetrieval adds latencyHigher first-token latency
Cost Per QueryLower (smaller prompts)Higher (large context)
FreshnessReal-time (if index is current)Requires re-constructing context
AccuracyCan miss relevant chunksCan lose focus in large contexts
InfrastructureVector DB + embeddings + chunkingNone beyond the LLM

How RAG Works

RAG retrieves relevant document chunks based on the user’s query, then includes those chunks in the LLM’s context. The pipeline involves document chunking, embedding generation, vector storage, retrieval (often hybrid keyword + vector), and prompt construction. The model sees only the retrieved chunks, not the full knowledge base.

How Long Context Works

Long context approaches load the entire relevant knowledge base into the prompt. With context windows reaching 1M+ tokens, you can include hundreds of pages of documentation, codebases, or conversation history. The model has access to everything and can find relevant information through its attention mechanism.

Retrieval Quality

RAG’s retrieval step is both its strength and weakness. Good retrieval surfaces the most relevant information efficiently. Bad retrieval misses critical chunks, leading to incomplete or incorrect answers. Retrieval quality depends on chunking strategy, embedding model quality, query formulation, and hybrid search configuration. Building a high-quality retrieval pipeline requires significant engineering effort.

Long context avoids retrieval errors entirely - everything is in the context. However, models can exhibit “lost in the middle” behavior where information in the center of very long contexts gets less attention than information at the beginning or end. This effect varies by model and context length.

Cost and Latency

RAG is more cost-effective per query. A typical RAG prompt includes 1-5K tokens of retrieved context. Long context prompts can be 100K-1M+ tokens. At current per-token pricing, the cost difference is orders of magnitude for large knowledge bases.

Latency profiles differ. RAG adds retrieval latency (50-200ms for vector search) but has lower model latency due to shorter prompts. Long context eliminates retrieval latency but increases time-to-first-token proportional to context length. For very large contexts, the model processing time can be substantial.

Freshness and Updates

RAG supports real-time knowledge updates. When documents change, you re-index them, and subsequent queries retrieve the updated content. This makes RAG ideal for rapidly changing knowledge bases.

Long context requires reconstructing the prompt when knowledge changes. For static or slowly changing knowledge bases, this is manageable. For frequently updated content, long context is less practical.

When to Choose RAG

Choose RAG when your knowledge base exceeds the practical context window size, when cost per query matters, when knowledge updates frequently, when you need to cite specific sources in responses, or when your knowledge base spans diverse topics where only a small fraction is relevant to any given query.

When to Choose Long Context

Choose long context when your knowledge base fits within the context window, when retrieval quality is a concern, when the full context is needed for reasoning (code analysis, document comparison), or when the engineering cost of building and maintaining a RAG pipeline is not justified for your use case.

Practical Recommendation

RAG and long context are not mutually exclusive. A practical approach is to use RAG for large knowledge bases and long context for the retrieved chunks plus conversation history. This combination gets relevant information through retrieval and lets the model reason over sufficient context. For small knowledge bases (under 100 pages), try long context first - it is simpler and avoids retrieval errors. For larger knowledge bases, RAG is essential for both cost and quality.