RAG Implementation Patterns - Retrieval Augmented Generation in Practice
Practical patterns for building production RAG systems: chunking strategies, retrieval optimization, re-ranking, and the most common failure modes.
Retrieval Augmented Generation (RAG) is the most commonly deployed AI pattern in enterprise settings. It solves a fundamental limitation of LLMs: they do not know about your private data, your recent documents, or your organization’s specific knowledge. RAG provides that knowledge at query time by retrieving relevant documents and passing them to the model along with the question.
Building a RAG system that works in demos is straightforward. Building one that works reliably in production requires attention to a set of patterns that are not obvious at the outset.
Pattern 1 - Semantic Chunking Over Fixed-Size Chunking
The most common RAG implementation splits documents into fixed-size token chunks (e.g., 512 tokens) with overlap. This works but is not optimal. A 512-token chunk that spans two sections of a document contains mixed context that confuses retrieval.
Semantic chunking splits at natural content boundaries: paragraph breaks, section headers, list item transitions. For most enterprise document types (policies, contracts, reports), semantic chunking improves retrieval precision by 15-25% compared to fixed-size chunking with similar overlap.
Implementation: use a lightweight LLM or rule-based detector to identify section boundaries, then split there. Chunks that are too short (under 100 tokens) can be merged with adjacent chunks.
Pattern 2 - Hybrid Retrieval
Pure vector search finds semantically similar content but can miss exact matches. A query for “clause 14.3(b)” may not retrieve the correct document section because the vector representation of that query is not semantically close to the section header.
Hybrid retrieval combines vector similarity (semantic search) with BM25 (keyword search) using a weighted merge. The relative weights of the two scores are tuned based on query type - for enterprise knowledge bases with many specific terms, BM25 weight is typically 30-40% of the combined score.
OpenSearch, Elasticsearch, and pgvector all support hybrid retrieval. Amazon Bedrock Knowledge Bases implements hybrid retrieval by default.
Pattern 3 - Re-Ranking After Retrieval
Retrieve more, then re-rank. Initial retrieval returns the top-20 candidates by hybrid score. A cross-encoder re-ranker model then scores each candidate against the specific query and returns the top-5. Cross-encoders are more expensive than embedding similarity (they process query and document together) but significantly more accurate.
Cohere Rerank and cross-encoder models from the sentence-transformers library are common choices. Adding re-ranking to a naive RAG system typically shows the largest single quality improvement of any optimization step.
Pattern 4 - Query Rewriting
User queries are often short, colloquial, and reference context that is not in the query text. “What did we agree on with the vendor about support?” is a poor retrieval query. Rewriting it to “vendor support agreement terms service level SLA response time” before retrieval significantly improves results.
A lightweight LLM call rewrites the user query into a retrieval-optimized form before the retrieval step. For conversational systems, also include the last 2-3 conversation turns in the rewrite context to resolve coreferences (“the same policy” becomes the explicit policy name).
Pattern 5 - Citation and Source Attribution
Production RAG systems must return source references with every response. This is both a trust mechanism (users can verify answers) and a debugging tool (engineers can trace incorrect answers to their source chunks).
Store chunk source metadata (document name, section, page number, date) alongside embeddings in the vector store. Include this metadata in the LLM prompt and instruct the model to cite sources inline. Post-process the response to extract citations and return them as structured data alongside the answer text.
Failure Mode Reference
Retrieval recall failure - The correct document is not retrieved because its embedding is not close enough to the query embedding. Diagnosis: run your test queries and check whether the correct source appears in the top-10 results. Fix: improve chunking, add more semantically rich text around key facts, or lower the similarity threshold.
Context window overflow - Retrieved chunks plus query plus system prompt exceeds the model’s context window. Fix: limit retrieved chunks to the top-5 re-ranked results and enforce a maximum chunk size.
Answer grounded in wrong source - The model answers correctly from the retrieved context, but the retrieved context is outdated or incorrect. Fix: add document freshness metadata and filter the retrieval results to documents updated within a relevance window.
Need help implementing this?
Turn this knowledge into a working prototype. Our structured workshop methodology takes you from idea to deployed AI solution in three sessions.
Explore AI Workshops