Building RAG Systems - A Step-by-Step Guide

Document ingestion, chunking strategies, embedding models, vector stores, retrieval tuning, and generation with context for production RAG implementations.

Added 24 Mar 2026 7 min read Updated 14 Jun 2026

#ai-agents #advanced #rag #retrieval #vector-search #embeddings #knowledge-base


Retrieval-Augmented Generation (RAG) is the standard architecture for giving AI models access to private knowledge without fine-tuning. Instead of baking knowledge into model weights, RAG retrieves relevant documents at query time and includes them in the model’s context. The concept is simple; building a production system that works reliably is not.

Indexing - Ingest documents, clean and chunk, embed chunks, store vectors. Runs once at setup, then incrementally as the knowledge base updates (Steps 1-4 below).

Query - Embed query, retrieve top-k, re-rank, generate answer. Runs on every user request (Steps 5-6 below).

Step 1 - Document Ingestion

Before documents can be retrieved, they need to be in a form the system can work with.

Source (raw documents) - PDFs, Word, HTML, Markdown, plain text; each format needs its own parser
Clean (strip noise) - Remove headers, footers, page numbers, boilerplate; preserve structure and headings
Chunk (split content) - Semantic boundaries or 512-token windows with 50-100 token overlap
Index (embed and store) - Titan Text Embeddings V2 turns chunks into vectors; stored in OpenSearch, pgvector, S3 Vectors, or Bedrock Knowledge Bases

Document ingestion covers:

Format handling - PDFs, Word documents, HTML, Markdown, and plain text all need different extraction pipelines. Amazon Textract handles PDFs with complex layouts (tables, multi-column, scanned documents). For clean text formats, simpler parsers suffice.

Content cleaning - Headers, footers, page numbers, boilerplate text, and navigation elements should be stripped before processing. These fragments reduce retrieval quality by adding noise.

Document structure preservation - Section headings, table structure, and list formatting carry meaning. Preserving them in the processed text improves both retrieval accuracy and generation quality.
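The content cleaning step above can be sketched in a few lines. This is a minimal illustration, not a production cleaner: it assumes each page has already been extracted to plain text, and the function name `clean_pages`, the `repeat_threshold` parameter, and the page-number pattern are all hypothetical choices made for this example.

```python
import re
from collections import Counter

# Matches lines such as "7", "Page 7", or "Page 7 of 12" (an assumed pattern).
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def clean_pages(pages: list[str], repeat_threshold: float = 0.6) -> list[str]:
    """Drop page-number lines and lines that repeat on most pages."""
    # Count on how many pages each distinct non-empty line appears.
    line_pages = Counter()
    for page in pages:
        for line in {ln.strip() for ln in page.splitlines() if ln.strip()}:
            line_pages[line] += 1

    # Lines seen on at least this many pages are treated as headers or footers.
    min_pages = max(2, int(len(pages) * repeat_threshold))
    boilerplate = {line for line, n in line_pages.items() if n >= min_pages}

    cleaned = []
    for page in pages:
        kept = [
            ln for ln in page.splitlines()
            if ln.strip() not in boilerplate and not PAGE_NUMBER.match(ln)
        ]
        cleaned.append("\n".join(kept).strip())
    return cleaned
```

Section headings such as "1 Introduction" survive because they appear on one page only and are not bare numbers. A line that is legitimately just a number (a table cell, for instance) would be dropped, so a real pipeline needs format-aware rules on top of this.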

Step 2 - Chunking Strategy

Documents are split into chunks for embedding and retrieval. Chunking strategy is one of the most impactful variables in RAG performance.

Fixed-size chunking - Split every N tokens with an overlap of M tokens. Simple and consistent. Overlap reduces the chance that important context is cut in two at a chunk boundary. 512-token chunks with a 50-100 token overlap are a common starting point.

Semantic chunking - Split at meaningful boundaries (paragraphs, sections) rather than fixed token counts. Preserves semantic coherence better but produces variable-size chunks.

Hierarchical chunking - Maintain relationships between parent and child chunks. A parent chunk (full section) and child chunks (paragraphs within the section) enable retrieving both specific passages and broader context.
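Fixed-size chunking, the first strategy above, is the easiest to show in code. The sketch below assumes the text has already been tokenized into a list; in practice you would use the tokenizer that matches your embedding model. The function name `chunk_fixed` and the default overlap of 64 (a value inside the 50-100 range mentioned above) are illustrative choices.

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")

    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end of the document,
        # so the tail is not emitted again as a tiny extra chunk.
        if start + size >= len(tokens):
            break
    return chunks
```

For example, `chunk_fixed("a b c d e f g h i j".split(), size=4, overlap=1)` yields three windows in which each chunk starts with the last token of the previous one.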

For most use cases, start with semantic chunking at the paragraph level and iterate based on observed retrieval failures.
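A paragraph-level starting point might look like the sketch below. It splits on blank lines and then packs adjacent paragraphs together until a token budget is reached. The packing step and the whitespace-based token count are assumptions added for this example, and `chunk_paragraphs` is a hypothetical helper, not part of any library.

```python
import re


def chunk_paragraphs(text: str, max_tokens: int = 512) -> list[str]:
    """Split on blank lines, then pack adjacent paragraphs up to max_tokens."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # whitespace tokens as a rough proxy for model tokens
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph is never split, so a single paragraph longer than `max_tokens` becomes its own oversized chunk. That is the variable-size trade-off described above, and such chunks are a natural place to fall back to fixed-size windows.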

Step 3 - Embedding Models

Chunks are converted to dense vectors using embedding models. Vector similarity is the basis for retrieval. Embedding model choice affects:

Semantic richness (how well the vector captures meaning)
Multilingual capability (critical for European deployments)
Vector dimensions (affects storage and search cost)
Latency (embedding happens both at indexing time and query time)
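Vector similarity is commonly measured with cosine similarity, which compares the direction of two vectors and ignores their length. The guide does not name a specific metric, so treat the choice here as an assumption; the sketch uses plain Python lists to stay dependency-free.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)
```

Vectors that point the same way score close to 1 and orthogonal vectors score 0. The length check also shows why the query and the chunks must be embedded with the same model and the same output dimensions.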

Amazon Titan Text Embeddings V2 is a practical default for AWS-native implementations. It lets you configure the output vector size (256, 512, or 1024 dimensions), so you can trade a small amount of accuracy for lower storage and search cost: AWS reports that 512-dimension vectors retain about 99 percent and 256-dimension vectors about 97 percent of the accuracy of the full 1024-dimension output. For multilingual requirements, test against your specific language combinations - some models handle mixed-language content better than others. Amazon Bedrock also offers Cohere embedding models if you need an alternative.
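To make the dimension trade-off concrete, the sketch below builds a Titan Text Embeddings V2 request and estimates raw vector storage. It assumes the `boto3` Bedrock runtime client and 4-byte float values; the region default is a placeholder, and `titan_request_body`, `embed`, and `index_size_bytes` are hypothetical helper names. The `embed` function needs AWS credentials and model access, so check the request fields against the current Bedrock documentation before relying on them.

```python
import json

TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"
ALLOWED_DIMENSIONS = (256, 512, 1024)


def titan_request_body(text: str, dimensions: int = 512, normalize: bool = True) -> str:
    """Build the JSON request body for one text input."""
    if dimensions not in ALLOWED_DIMENSIONS:
        raise ValueError(f"dimensions must be one of {ALLOWED_DIMENSIONS}")
    return json.dumps({"inputText": text, "dimensions": dimensions, "normalize": normalize})


def embed(text: str, dimensions: int = 512, region: str = "us-east-1") -> list[float]:
    """Call Bedrock and return the embedding (region is a placeholder)."""
    import boto3  # imported lazily so the helpers above work without AWS installed

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId=TITAN_V2_MODEL_ID,
        body=titan_request_body(text, dimensions),
    )
    return json.loads(response["body"].read())["embedding"]


def index_size_bytes(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only; excludes index structures and metadata."""
    return num_chunks * dimensions * bytes_per_value
```

Under these assumptions, one million chunks need about 4.1 GB of raw vectors at 1024 dimensions, 2.0 GB at 512, and 1.0 GB at 256, before any index overhead.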

Step 4 - Vector Store

Embeddings are stored in a vector database that supports approximate nearest neighbor (ANN) search. Options on AWS:

Amazon OpenSearch Service - OpenSearch Serverless or a managed cluster, both with a vector engine. Good for hybrid search (keyword + semantic) and existing OpenSearch users.
pgvector on Amazon Aurora PostgreSQL - The PostgreSQL pgvector extension, a good fit at smaller scale for teams already on Aurora. Aurora PostgreSQL is the PostgreSQL option that Amazon Bedrock Knowledge Bases can quick-create as a vector store. You can also run pgvector on standard Amazon RDS for PostgreSQL outside of Knowledge Bases.
Amazon S3 Vectors - A storage-first vector store that became generally available on December 2, 2025. It stores and queries vectors directly in Amazon S3, scales to billions of vectors per index, and AWS positions it for large but less latency-sensitive RAG workloads at up to 90 percent lower cost than warm storage. It integrates with Bedrock Knowledge Bases as a quick-create option.
Amazon Bedrock Knowledge Bases - A managed service that handles ingestion, chunking, embedding, and retrieval, reducing operational overhead. It can sit in front of OpenSearch, Aurora PostgreSQL, Amazon Neptune Analytics, S3 Vectors, or third-party stores such as Pinecone and MongoDB Atlas.

For production systems with more than 1 million chunks, or with high query throughput, a dedicated low-latency vector store such as OpenSearch generally outperforms a general-purpose database with a vector extension. S3 Vectors trades some query latency for much lower storage cost, so it suits large archives that are queried less frequently.
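It helps to see what a vector store does conceptually before choosing one. The sketch below is an exact, brute-force nearest neighbor search over an in-memory dictionary. It is a stand-in for illustration only: the stores listed above use approximate indexes rather than a linear scan, and `top_k` is a hypothetical helper.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = ((cosine(query, vec), chunk_id) for chunk_id, vec in vectors.items())
    # nlargest keeps only k items in memory while scanning all candidates.
    return [(chunk_id, score) for score, chunk_id in heapq.nlargest(k, scored)]
```

Every query touches every stored vector, which is the cost that ANN indexes are designed to avoid as the collection grows. A scan like this is still useful as a correctness baseline when evaluating an approximate index.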

Step 5 - Retrieval Tuning

Retrieval quality is what makes or breaks a RAG system. At query time, each request passes through four stages:

Query (user question) - Optionally expanded to multiple phrasings before retrieval to improve recall
Retrieve (hybrid search) - Semantic similarity plus BM25 keyword search; fetch the top 20 candidates from the vector store
Re-rank (cross-encoder) - Score all 20 candidates; pass only the top 5 to generation; significantly improves precision
Generate (answer with context) - Context and question go to the LLM, which returns a cited answer at temperature 0.1-0.3
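The retrieve and re-rank stages above can be sketched as one small function. The retriever and the scoring model are passed in as callables because the guide does not prescribe specific ones; `retrieve_then_rerank` and `toy_overlap_score` are hypothetical names, and the word-overlap scorer is only a placeholder for a real cross-encoder or a managed rerank service.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    score: Callable[[str, str], float],
    fetch_k: int = 20,
    final_k: int = 5,
) -> list[str]:
    """Fetch a wide candidate set, re-score each candidate, keep the best few."""
    candidates = retrieve(query, fetch_k)
    ranked = sorted(candidates, key=lambda passage: score(query, passage), reverse=True)
    return ranked[:final_k]


def toy_overlap_score(query: str, passage: str) -> float:
    """Placeholder scorer: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0
```

Keeping `fetch_k` larger than `final_k` is the point of the pattern: the first stage is cheap and recall-oriented, while the second stage spends more compute on a small set to improve precision.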

Tuning approaches:

Hybrid search - Combine semantic similarity with keyword (BM25) search. Semantic handles conceptual queries; keyword handles specific terms, codes, and names. The combination outperforms either alone for most enterprise knowledge bases.

Re-ranking - Retrieve a larger candidate set (top 20) and re-rank with a cross-encoder model before passing the top 5 to generation. More computationally expensive but improves relevance significantly. On AWS you can use the managed Amazon Bedrock Rerank API, which currently supports Amazon Rerank 1.0 and Cohere Rerank 3.5, and wire it directly into Bedrock Knowledge Bases through the Retrieve and RetrieveAndGenerate APIs.

Query expansion - Use the LLM to generate multiple phrasings of the query before retrieving. Catches relevant documents that use different terminology than the query.
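Hybrid search and query expansion both produce several ranked lists that must be merged into one. Reciprocal rank fusion (RRF) is one common way to do that. It is not something this guide prescribes, so treat it as an assumed technique; `reciprocal_rank_fusion` is a hypothetical helper and `k=60` is a widely used default constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs into a single ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks near the top.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because RRF uses only rank positions, it sidesteps the problem that BM25 scores and vector similarity scores are on different scales. The same function can fuse one list per query phrasing when query expansion is enabled.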

Step 6 - Generation with Context

The final step assembles the retrieved chunks and the query into a prompt that produces a good answer:

Put retrieved context before the question
Instruct the model to answer only from the provided context and say so when it cannot
Include source citations in the model’s output format
Set temperature low (0.1-0.3) for factual Q&A - you want consistent, grounded answers, not creative variation
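The checklist above can be turned into a prompt builder along these lines. The wording of the instructions, the chunk dictionary keys (`source`, `text`), and the name `build_prompt` are assumptions made for this sketch; adapt them to your model and output format.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Place numbered context first, then the rules, then the question."""
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say that you cannot "
        "answer from the provided documents. "
        "Cite the sources you used by their bracketed numbers, for example [1].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# A value inside the 0.1-0.3 range recommended above for factual Q&A.
GENERATION_SETTINGS = {"temperature": 0.2}
```

Numbering the chunks gives the model a short, unambiguous handle for citations, and the numbers can be mapped back to source documents when rendering the answer.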

Production RAG systems benefit from evaluation infrastructure: regularly testing retrieval recall and answer quality against a ground truth set, and alerting when performance degrades after knowledge base updates.
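A minimal version of that evaluation loop measures retrieval recall against a ground truth set and flags regressions. The sketch assumes ground truth is a mapping from each test query to the set of chunk IDs that should be retrieved; the function names and the 0.05 tolerance are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs found in the top k results."""
    if not relevant:
        raise ValueError("each test query needs at least one relevant chunk")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    results: dict[str, list[str]], ground_truth: dict[str, set[str]], k: int = 5
) -> float:
    """Average recall@k over every query in the ground truth set."""
    scores = [
        recall_at_k(results.get(query, []), relevant, k)
        for query, relevant in ground_truth.items()
    ]
    return sum(scores) / len(scores)


def has_regressed(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True when recall dropped by more than the allowed tolerance."""
    return current < baseline - tolerance
```

Running this after each knowledge base update and alerting when `has_regressed` returns true covers the retrieval half of the evaluation; answer quality needs a separate check.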

Sources

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., … and Kiela, D. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS (2020). https://arxiv.org/abs/2005.11401 - The original RAG paper introducing the retrieve-then-generate architecture.
Gao, Y. et al. “Retrieval-Augmented Generation for Large Language Models: A Survey.” (2023). https://arxiv.org/abs/2312.10997 - Comprehensive survey of RAG variants including Naive RAG, Advanced RAG, and Modular RAG architectures, with a taxonomy of chunking, retrieval, and generation strategies.
Robertson, S. and Zaragoza, H. “The Probabilistic Relevance Framework: BM25 and Beyond.” Foundations and Trends in Information Retrieval 3, no. 4 (2009): 333–389 - The BM25 algorithm referenced in the hybrid search section; the standard keyword retrieval baseline.
Nogueira, R. and Cho, K. “Passage Re-ranking with BERT.” (2019). https://arxiv.org/abs/1901.04085 - Foundational work on cross-encoder re-ranking, which underpins the two-stage retrieve-then-rerank pattern described above.
Amazon Web Services. “Amazon S3 Vectors is now generally available with 40 times the scale of preview.” (2025). https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-s3-vectors-generally-available/ - Primary source for the S3 Vectors general availability date (December 2, 2025), scale, and the up-to-90-percent cost reduction claim.
Amazon Web Services. “Amazon Bedrock now supports Rerank API to improve accuracy of RAG applications.” (2024). https://aws.amazon.com/about-aws/whats-new/2024/12/amazon-bedrock-rerank-api-accuracy-rag-applications - Primary source for the managed Rerank API and the supported Amazon Rerank 1.0 and Cohere Rerank 3.5 models.
Amazon Web Services. “Amazon Titan Text Embeddings V2 now available in Amazon Bedrock.” (2024). https://aws.amazon.com/about-aws/whats-new/2024/04/amazon-titan-text-embeddings-v2-amazon-bedrock - Primary source for the configurable 256, 512, and 1024 dimension output sizes and their accuracy trade-offs.
