Building Production RAG Pipelines: A Senior Engineer's Guide
Every RAG tutorial on the internet follows the same pattern: load a PDF, split it into chunks, embed the chunks with OpenAI, store them in a vector database, retrieve the top-k results, and send them to GPT-4 with a prompt. This works in a Jupyter notebook. It fails in production.
I have built RAG systems that serve regulated healthcare data under HIPAA, defense intelligence documents under ITAR, and internal knowledge bases at scale for enterprise customers. The notebook demo and the production system share approximately 20% of their architecture. The other 80% — the chunking strategy, the embedding pipeline, the retrieval optimization, the evaluation framework, the failure handling, and the operational monitoring — is what determines whether your RAG system gives accurate answers or confidently hallucinates with citations that look correct but are not.
This article covers the production 80%. I assume you already understand the basic RAG concept. We are going straight to the engineering decisions that matter.
The Production RAG Architecture
Here is the architecture running in production at enterprise scale. Every box represents a component that took at least a week to get right:
INGESTION PIPELINE
  Document Sources (S3, SharePoint, Confluence, databases)
    → Format Detection (PDF, DOCX, HTML, Markdown, PPTX, XLSX)
    → Extraction (text, tables, images via OCR, structured data)
    → Cleaning (dedup, PII detection)
    → Chunking (strategy selection)
    → Embedding (model selection)
    → Vector Store (Pinecone, Weaviate, pgvector, Qdrant)

RETRIEVAL PIPELINE
  User Query
    → Query Analysis (intent, entity extraction)
    → Query Transformation (expansion, HyDE, decomposition)
    → Hybrid Search (vector + keyword + metadata filter)
    → Re-ranking (cross-encoder, Cohere Rerank)
    → Context Assembly (dedup, ordering, metadata injection, token budget)

GENERATION PIPELINE
  Context + Query
    → Prompt Assembly (system prompt, few-shot examples, guardrails)
    → LLM Call (Claude, GPT-4, Llama, Mistral)
    → Citation Extraction (map claims to source chunks)
    → Hallucination Check (verify against retrieved context)
    → Response Formatting (markdown, tables, source links)
    → User Response

EVALUATION & MONITORING
  Retrieval metrics:  Precision@k, Recall@k, MRR, NDCG
  Generation metrics: Faithfulness, Relevance, Coherence
  Latency:            P50, P95, P99 per pipeline stage
  Cost:               embedding tokens, LLM tokens, vector DB queries
Chunking: The Decision That Determines Everything
Chunking strategy is the single highest-leverage decision in a RAG pipeline. Get it wrong, and no amount of retrieval optimization or prompt engineering will save you. The chunk must contain enough context to be useful on its own, but not so much that it dilutes the specific information the query needs.
Fixed-Size Chunking (Baseline)
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=64,
separators=["\n\n", "\n", ". ", " ", ""],
length_function=len,
)
chunks = splitter.split_text(document_text)
Fixed-size chunking splits text into segments of a target token/character count with overlap between adjacent chunks. It is the default in every tutorial. It works adequately for homogeneous text — novels, articles, transcripts — where the information density is relatively uniform.
It fails for:
- Technical documentation: A 512-token chunk might split a code block in half, or separate a function signature from its description.
- Legal and compliance documents: Clause references span paragraphs. A chunk boundary in the middle of a regulatory citation produces two useless chunks.
- Tables and structured data: Fixed-size splitting destroys table structure completely.
Semantic Chunking (Better)
Semantic chunking uses embedding similarity between consecutive sentences to find natural breakpoints:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
semantic_splitter = SemanticChunker(
embeddings,
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=85,
)
chunks = semantic_splitter.split_text(document_text)
The algorithm computes embeddings for each sentence, calculates cosine similarity between consecutive sentences, and splits where the similarity drops below a threshold. This produces chunks that are topically coherent — each chunk covers one concept or topic.
The tradeoff: semantic chunking is 10-50x slower than fixed-size chunking because it requires an embedding call for every sentence during ingestion. For a corpus of 100,000 documents, this adds hours to the ingestion pipeline. In production, I use semantic chunking for high-value, slowly-changing documents (policy manuals, product documentation, compliance frameworks) and fixed-size chunking for high-volume, frequently-updated content (support tickets, chat logs, news articles).
Document-Structure-Aware Chunking (Best for Technical Content)
For technical documentation, the best approach respects the document's inherent structure:
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class StructuredChunk:
    content: str
    metadata: dict
    section_hierarchy: tuple  # e.g. ("Chapter 3", "Section 3.2", "Subsection 3.2.1")
    chunk_type: str  # "prose", "code_block", "table", "list"

def chunk_by_structure(markdown_text: str, max_chunk_tokens: int = 1024) -> list:
    """Split markdown by headers, preserving hierarchy and code blocks.

    Note: enforcing max_chunk_tokens on oversized sections is left as an
    extension; sections are emitted whole in this sketch.
    """
    chunks = []
    current_hierarchy = []
    current_content = []
    current_type = "prose"
    in_code_block = False

    def flush():
        if current_content:
            chunks.append(StructuredChunk(
                content="\n".join(current_content),
                metadata={"headers": list(current_hierarchy)},
                section_hierarchy=tuple(current_hierarchy),
                chunk_type=current_type,
            ))

    for line in markdown_text.split("\n"):
        # Track code fences so '#' lines inside code blocks are not
        # mistaken for markdown headers
        if line.startswith("```"):
            in_code_block = not in_code_block
            current_type = "code_block" if in_code_block else "prose"
        header_match = None if in_code_block else re.match(r'^(#{1,6})\s+(.+)', line)
        if header_match:
            flush()  # emit the section accumulated so far
            current_content = []
            level = len(header_match.group(1))
            title = header_match.group(2)
            # Trim the hierarchy to the parent level, then append this header
            current_hierarchy = current_hierarchy[:level - 1] + [title]
        current_content.append(line)

    flush()  # emit the final section
    return chunks
The key insight: the section hierarchy becomes metadata attached to every chunk. When a user asks "What are the authentication requirements in section 3.2?", the metadata filter narrows retrieval to chunks from that section before vector similarity runs. This is hybrid retrieval — metadata filtering plus semantic search — and it outperforms pure vector search by 15-25% on precision@5 in my production benchmarks.
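A minimal sketch of that filter-then-search pattern with Qdrant, assuming the chunks above were upserted with their headers metadata as payload (the collection name and exact header value are illustrative):

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://qdrant:6333")

# Narrow to chunks whose header metadata contains the requested section,
# then rank the survivors by vector similarity
results = client.query_points(
    collection_name="citadel_docs",
    query=query_embedding,  # embedding of the user query
    query_filter=Filter(
        must=[FieldCondition(key="headers", match=MatchValue(value="Section 3.2"))]
    ),
    limit=5,
)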
Chunk Size Optimization
There is no universal optimal chunk size. It depends on your embedding model, your content type, and your query patterns. Here is a framework for finding it:
| Embedding Model | Recommended Chunk Size | Why |
|---|---|---|
| text-embedding-3-small | 256-512 tokens | Smaller model; larger chunks dilute the embedding |
| text-embedding-3-large | 512-1024 tokens | Larger model captures more nuance |
| Cohere embed-v3 | 512-1024 tokens | Optimized for longer passages |
| voyage-3 | 512-1024 tokens | Strong on technical content |
| BGE-M3 (open source) | 256-512 tokens | Best at shorter, focused chunks |
Run this experiment on your actual data: create chunk sets at 256, 512, 768, and 1024 tokens. Build a test set of 50 queries with known correct answers. Measure retrieval precision@5 and recall@10 at each chunk size. The optimal size is where precision peaks without recall dropping below an acceptable threshold (I target 85% recall@10).
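A sketch of that sweep, assuming a build_index helper that re-chunks and re-embeds the corpus at a given size, a retrieve function that returns chunk IDs, and a test set mapping each query to the chunk IDs known to contain its answer (all hypothetical names):

from statistics import mean

def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of all relevant chunks found in the top k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

for chunk_size in (256, 512, 768, 1024):
    index = build_index(corpus, chunk_size=chunk_size)  # re-chunk + re-embed
    p5 = mean(precision_at_k(retrieve(index, q, k=10), rel, 5) for q, rel in test_set)
    r10 = mean(recall_at_k(retrieve(index, q, k=10), rel, 10) for q, rel in test_set)
    print(f"{chunk_size} tokens: precision@5={p5:.2f} recall@10={r10:.2f}")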
Embedding Model Selection
The embedding model converts text into dense vectors that capture semantic meaning. Your choice here affects retrieval quality, latency, and cost.
Production embedding models (2026):
| Model | Dimensions | Max Tokens | Cost (per 1M tokens) | Strengths |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (or 256-3072 via dimensions param) | 8,191 | $0.13 | Best general-purpose; dimension reduction without reindexing |
| OpenAI text-embedding-3-small | 1536 | 8,191 | $0.02 | 5x cheaper; 90% of large model quality |
| Cohere embed-v3.0 | 1024 | 512 | $0.10 | Input type parameter (search_query vs search_document) |
| Voyage voyage-3 | 1024 | 32,000 | $0.06 | Best for code and technical content |
| BGE-M3 (self-hosted) | 1024 | 8,192 | Infrastructure cost | No API dependency; multilingual; runs on a single GPU |
For production systems with sensitive data (healthcare, defense, finance), I use self-hosted BGE-M3 on AWS SageMaker endpoints. The data never leaves your VPC. The latency is 15-30ms per batch of 32 chunks on an ml.g5.xlarge instance ($1.41/hour).
For non-sensitive workloads, OpenAI text-embedding-3-small offers the best cost-to-quality ratio. At $0.02 per million tokens, embedding a 100,000-document corpus (averaging 2,000 tokens per document) costs approximately $4.
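The arithmetic behind that estimate:

docs = 100_000
avg_tokens_per_doc = 2_000
price_per_million = 0.02  # USD, text-embedding-3-small
cost = docs * avg_tokens_per_doc / 1_000_000 * price_per_million
print(f"${cost:.2f}")  # $4.00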
Embedding Pipeline Architecture
Do not call the embedding API synchronously during ingestion. Use a pipeline:
import asyncio

from openai import AsyncOpenAI, RateLimitError

async def embed_corpus(
    chunks: list,
    client: AsyncOpenAI,
    model: str = "text-embedding-3-small",
    batch_size: int = 256,  # API accepts up to 2048 inputs, but 256 is safer for rate limits
    max_retries: int = 3,
) -> list:
    """Embed chunks in batches with exponential-backoff retry on rate limits."""
    results = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [chunk.content for chunk in batch]
        for attempt in range(max_retries):
            try:
                response = await client.embeddings.create(input=texts, model=model)
                break
            except RateLimitError:
                if attempt == max_retries - 1:
                    raise
                await asyncio.sleep(2 ** attempt)  # back off 1s, 2s, 4s
        for j, embedding_data in enumerate(response.data):
            results.append({
                "chunk": batch[j],
                "embedding": embedding_data.embedding,
                # usage is reported per request, not per chunk
                "batch_token_count": response.usage.total_tokens,
            })
    return results
In production, this pipeline runs as an AWS Step Functions workflow:
S3 Upload Event → Lambda (format detection + extraction)
→ SQS Queue (chunking jobs)
→ Lambda (chunking + metadata extraction)
→ SQS Queue (embedding jobs)
→ Lambda or SageMaker Endpoint (embedding)
→ Lambda (vector store upsert)
→ DynamoDB (ingestion tracking + status)
Each stage is independently scalable. When you add 10,000 new documents, the SQS queues absorb the burst and Lambda scales out to process them in parallel. Total cost for embedding 10,000 documents: approximately $0.50-$2.00 depending on document length and model choice.
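A minimal sketch of one stage's Lambda handler (the embedding stage consuming chunking jobs from SQS). The payload shape, table name, and the embed_batch and upsert_vectors helpers are assumptions for illustration:

import json
import boto3

dynamodb = boto3.resource("dynamodb")
tracking = dynamodb.Table("ingestion-tracking")  # hypothetical table name

def handler(event, context):
    """Embedding-stage Lambda: consumes chunking jobs delivered by SQS."""
    for record in event["Records"]:  # SQS delivers messages in batches
        job = json.loads(record["body"])  # assumed shape: {"doc_id": ..., "chunks": [...]}
        vectors = embed_batch(job["chunks"])  # hypothetical: calls the embedding endpoint
        upsert_vectors(job["doc_id"], vectors)  # hypothetical: vector store client
        tracking.update_item(
            Key={"doc_id": job["doc_id"]},
            UpdateExpression="SET stage = :s",
            ExpressionAttributeValues={":s": "embedded"},
        )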
Retrieval Optimization: Where Most RAG Systems Fail
The retrieval stage is where the quality gap between a demo and a production system becomes enormous. Pure vector similarity search returns contextually related chunks, but not necessarily the chunks that answer the question.
Hybrid Search (Vector + Keyword)
Vector search captures semantic similarity. Keyword search (BM25) captures exact term matches. The best production systems combine both:
from qdrant_client import QdrantClient
from qdrant_client.models import FusionQuery, Prefetch
client = QdrantClient(url="http://qdrant:6333")
# Hybrid search with Qdrant's built-in fusion
results = client.query_points(
collection_name="citadel_docs",
prefetch=[
# Dense vector search
Prefetch(
query=query_embedding,
using="dense",
limit=20,
),
# Sparse vector search (BM25-equivalent)
Prefetch(
query=sparse_query_vector,
using="sparse",
limit=20,
),
],
query=FusionQuery(fusion="rrf"), # Reciprocal Rank Fusion
limit=10,
)
Reciprocal Rank Fusion (RRF) combines the rankings from dense and sparse search without needing to normalize scores. In my benchmarks across three production datasets:
| Search Method | Precision@5 | Recall@10 | MRR |
|---|---|---|---|
| Dense vector only | 0.72 | 0.81 | 0.68 |
| Sparse (BM25) only | 0.65 | 0.74 | 0.61 |
| Hybrid (RRF fusion) | 0.83 | 0.89 | 0.79 |
Hybrid search improved precision@5 by 15% over dense-only search. The keyword component catches exact matches that semantic search misses — acronyms, error codes, version numbers, and proper nouns.
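If your vector store does not provide fusion natively, RRF is simple enough to implement directly. This is the standard formulation, with k=60 as the commonly used constant:

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs: each ID scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse the dense and sparse result lists, keep the top 10
fused_ids = reciprocal_rank_fusion([dense_result_ids, sparse_result_ids])[:10]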
Re-ranking: The 10% That Changes Everything
After initial retrieval returns 20-50 candidates, a cross-encoder re-ranker scores each candidate against the query with much higher accuracy than the initial retrieval:
import cohere
co = cohere.Client(api_key="your-key")
# Initial retrieval returns 20 candidates
initial_results = hybrid_search(query, limit=20)
# Re-rank with Cohere
rerank_response = co.rerank(
model="rerank-v3.5",
query=query,
documents=[r.content for r in initial_results],
top_n=5,
return_documents=True,
)
# Use top 5 re-ranked results as context
final_context = [
{
"content": result.document.text,
"relevance_score": result.relevance_score,
"original_rank": initial_results[result.index].rank,
}
for result in rerank_response.results
]
Re-ranking adds 100-300ms of latency per query but improves answer accuracy by 10-20% in my production measurements. The cross-encoder model reads the query and each candidate together (unlike bi-encoders which embed them separately), allowing it to capture fine-grained relevance that embedding similarity misses.
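If data-residency constraints rule out Cohere's API (the same constraints that push you toward self-hosted embeddings), a self-hosted cross-encoder is a workable alternative. A sketch using sentence-transformers; the model named here is one common choice, not a benchmarked recommendation:

from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, document) pair in a single forward pass
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

pairs = [(query, r.content) for r in initial_results]
scores = reranker.predict(pairs)  # one relevance score per pair

reranked = sorted(zip(initial_results, scores), key=lambda x: x[1], reverse=True)
final_context = [result for result, _ in reranked[:5]]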
Query Transformation
Users do not write queries optimized for retrieval. A production RAG system transforms the user query before searching:
HyDE (Hypothetical Document Embeddings): Ask the LLM to generate a hypothetical answer, then embed that answer and search for real documents similar to it. This works because the hypothetical answer uses the same vocabulary and structure as the real answer.
async def hyde_search(query: str, llm_client, embed_client, vector_store):
    # Generate a hypothetical answer (Anthropic Messages API)
    hypothetical = await llm_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Write a short, factual paragraph that answers this question: {query}"
        }],
    )
    # Embed the hypothetical answer
    hyde_embedding = await embed_client.embeddings.create(
        input=hypothetical.content[0].text,
        model="text-embedding-3-small",
    )
    # Search for real documents near the hypothetical answer
    results = vector_store.search(
        vector=hyde_embedding.data[0].embedding,
        limit=20,
    )
    return results
In my testing, HyDE improved retrieval recall@10 by 8-12% for complex, multi-part questions but decreased performance by 3-5% for simple factual queries. Use it selectively — detect query complexity first, then apply HyDE only for complex queries.
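A sketch of such a gate; the markers and thresholds are illustrative, not tuned values:

COMPLEX_MARKERS = ("compare", "difference between", "trade-off", "how does", "why")

def is_complex_query(query: str) -> bool:
    """Cheap heuristic router: only complex queries pay the HyDE round-trip."""
    q = query.lower()
    return len(q.split()) > 12 or q.count("?") > 1 or any(m in q for m in COMPLEX_MARKERS)

async def route_search(query: str, llm_client, embed_client, vector_store):
    if is_complex_query(query):
        return await hyde_search(query, llm_client, embed_client, vector_store)
    return hybrid_search(query, limit=20)  # plain hybrid path for simple lookups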
Evaluation: Measuring What Matters
You cannot improve what you cannot measure. Production RAG systems need automated evaluation pipelines:
The RAGAS Framework
RAGAS (Retrieval Augmented Generation Assessment) provides four core metrics:
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
from datasets import Dataset
# Prepare evaluation dataset
eval_data = Dataset.from_dict({
"question": questions,
"answer": generated_answers,
"contexts": retrieved_contexts,
"ground_truth": reference_answers,
})
results = evaluate(
eval_data,
metrics=[
faithfulness, # Is the answer supported by the context?
answer_relevancy, # Does the answer address the question?
context_precision, # Are the retrieved contexts relevant?
context_recall, # Do the contexts contain the needed information?
],
)
print(results)
# {'faithfulness': 0.87, 'answer_relevancy': 0.91,
# 'context_precision': 0.83, 'context_recall': 0.79}
Target metrics for production systems:
| Metric | Minimum | Good | Excellent |
|---|---|---|---|
| Faithfulness | 0.85 | 0.90 | 0.95+ |
| Answer Relevancy | 0.80 | 0.88 | 0.93+ |
| Context Precision | 0.75 | 0.83 | 0.90+ |
| Context Recall | 0.80 | 0.87 | 0.92+ |
If faithfulness drops below 0.85, your system is hallucinating too often for production use. Investigate: are chunks too short (lacking context), is the re-ranker selecting irrelevant chunks, or is the LLM ignoring the context in favor of parametric knowledge?
Monitoring in Production
Every RAG query in production should log:
{
"query_id": "uuid",
"timestamp": "2026-04-02T14:32:00Z",
"user_query": "What are the IAM requirements for HIPAA?",
"transformed_query": "IAM policy requirements HIPAA compliance healthcare",
"retrieval_results": [
{"chunk_id": "doc-42-chunk-7", "score": 0.91, "rerank_score": 0.95},
{"chunk_id": "doc-15-chunk-3", "score": 0.87, "rerank_score": 0.82}
],
"llm_model": "claude-sonnet-4-20250514",
"llm_tokens_in": 3200,
"llm_tokens_out": 450,
"latency_ms": {
"query_transform": 180,
"retrieval": 45,
"reranking": 220,
"generation": 1200,
"total": 1645
},
"cost_usd": 0.0087,
"user_feedback": null
}
Build CloudWatch or Grafana dashboards tracking:
- P50/P95/P99 total latency
- Retrieval precision (via periodic automated evaluation against test sets)
- Faithfulness score (sample 5% of queries for automated RAGAS evaluation)
- Cost per query
- Error rate by pipeline stage
- User feedback (thumbs up/down) correlation with automated metrics
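A sketch of pushing two of those numbers to CloudWatch from the log record above; the namespace and dimension names are arbitrary choices:

import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_query_metrics(log: dict) -> None:
    """Publish per-query latency and cost from the structured log record."""
    cloudwatch.put_metric_data(
        Namespace="RAG/Pipeline",  # arbitrary namespace choice
        MetricData=[
            {
                "MetricName": "TotalLatencyMs",
                "Value": log["latency_ms"]["total"],
                "Unit": "Milliseconds",
                "Dimensions": [{"Name": "Model", "Value": log["llm_model"]}],
            },
            {
                "MetricName": "CostPerQueryUsd",
                "Value": log["cost_usd"],
                "Unit": "None",
            },
        ],
    )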
Common Production Failures and Fixes
Failure 1: "The system retrieves relevant documents but the answer is wrong."
Root cause: The LLM is using parametric knowledge instead of the retrieved context.
Fix: Strengthen the system prompt: explicitly instruct the model to answer only from the provided context, and add an "If the context does not contain the answer, say so" instruction. Reduce the model's temperature to 0.1. A starting point is sketched below.
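One starting-point grounding prompt; the wording is illustrative, not a tested template:

GROUNDED_SYSTEM_PROMPT = """\
Answer using ONLY the context provided below.
1. If the context does not contain the answer, reply: "The provided
   documents do not contain this information." Do not guess.
2. Cite the source chunk ID for every factual claim, e.g. [doc-42-chunk-7].
3. Do not use knowledge that is not in the context, even if you are
   confident it is correct.

Context:
{context}
"""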
Failure 2: "The system returns 'I don't know' when the answer exists in the corpus."
Root cause: Chunking split the answer across two chunks, and neither chunk alone contains enough context.
Fix: Increase chunk overlap. Try parent-child chunking: retrieve the child chunk but send the parent (larger) chunk to the LLM, as sketched below.
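A sketch of the parent-child pattern; the parent_id payload field and the two store interfaces are assumptions about your ingestion schema:

def retrieve_with_parents(query_embedding, vector_store, parent_store, limit=5):
    """Match on small child chunks, but hand the LLM the enclosing parents."""
    child_hits = vector_store.search(vector=query_embedding, limit=limit)
    parent_ids = {hit.payload["parent_id"] for hit in child_hits}  # dedupe siblings
    return [parent_store.get(pid) for pid in parent_ids]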
Failure 3: "Latency is 5+ seconds per query."
Root cause: Usually the LLM generation step.
Fix: Reduce context size (fewer chunks, shorter chunks). Use a faster model for simple queries (Claude Haiku for straightforward lookups, Claude Sonnet for complex reasoning). Stream the response so the user sees tokens immediately.
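Streaming with the Anthropic SDK looks roughly like this (the grounding prompt from Failure 1 is reused for illustration):

from anthropic import Anthropic

client = Anthropic()

# Stream tokens to the user as they are generated
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=GROUNDED_SYSTEM_PROMPT.format(context=assembled_context),
    messages=[{"role": "user", "content": user_query}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)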
Failure 4: "The system works well for English but fails for other languages."
Root cause: The embedding model and LLM may not support the target language well.
Fix: Use a multilingual embedding model (BGE-M3 and Cohere embed-v3 both support 100+ languages). Test retrieval quality per language independently.
Go Build It
The Cloud AI/ML course on Citadel Cloud Management covers RAG architectures from fundamentals through production deployment, including hands-on labs for building retrieval pipelines on AWS SageMaker with vector databases, embedding optimization, and evaluation frameworks. Free enrollment.
For production-ready RAG templates, embedding pipeline configurations, vector database setup guides, and evaluation harness code, browse the AI & ML Resources collection.
For engineers building agentic RAG systems — where the LLM decides which tools to call, which documents to retrieve, and how to decompose complex queries — the Claude Agent Systems course covers tool-use architectures, multi-step reasoning chains, and production agent orchestration patterns.
The difference between a RAG demo and a RAG product is engineering discipline: structured chunking, hybrid retrieval, re-ranking, automated evaluation, and operational monitoring. Apply these techniques to your corpus, measure the results, and iterate. The evaluation framework will tell you exactly where your system is failing and what to fix next.