Building Production RAG Pipelines: A Senior Engineer's Guide

Retrieval-Augmented Generation (RAG) has moved from research papers to production workloads serving millions of queries daily. Nearly every major enterprise is building or planning a RAG system to make its proprietary data accessible through natural language interfaces. Yet the gap between a demo RAG pipeline (50 lines of LangChain) and a production system that handles real traffic, maintains accuracy, and operates within cost constraints is enormous.

This guide covers the architecture decisions, component selections, and operational patterns required to build RAG systems that survive contact with production traffic. No toy examples. No "just use LangChain" hand-waving. Every recommendation is grounded in systems that process real queries against real document corpora.

RAG Architecture: Beyond the Basic Pattern

The standard RAG pattern is deceptively simple: chunk documents, generate embeddings, store in a vector database, retrieve relevant chunks at query time, and pass them to an LLM as context. Production systems add several critical layers.

Production RAG Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Query Pipeline                           │
│                                                                 │
│  User Query → Query Understanding → Query Rewriting             │
│       → Hybrid Retrieval (Dense + Sparse) → Reranking           │
│       → Context Assembly → LLM Generation → Response Validation │
│       → Citation Extraction → Response Delivery                 │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                     Ingestion Pipeline                          │
│                                                                 │
│  Source Documents → Format Extraction → Cleaning                │
│       → Chunking → Metadata Enrichment → Embedding Generation   │
│       → Vector Store Upsert → Index Optimization                │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                    Evaluation Pipeline                          │
│                                                                 │
│  Test Queries → Retrieval Metrics → Generation Quality          │
│       → Hallucination Detection → Latency/Cost Tracking         │
│       → Regression Alerts                                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Each of these three pipelines has distinct requirements, failure modes, and optimization surfaces. Treating RAG as a single pipeline is the first mistake most teams make.

Document Ingestion: Where Quality Starts

Format Extraction

Production document corpora are messy. You will encounter PDFs with embedded tables, scanned documents, PowerPoint presentations, HTML with complex layouts, and Markdown with inconsistent formatting.

PDF extraction tools ranked by quality:

| Tool | Table Handling | OCR | Structured Output | Speed |
|---|---|---|---|---|
| Unstructured.io | Excellent | Yes (Tesseract/PaddleOCR) | JSON elements | Medium |
| PyMuPDF (fitz) | Good | No (text-based only) | Text blocks with coordinates | Fast |
| Amazon Textract | Excellent | Native | JSON with confidence scores | Medium |
| Azure Document Intelligence | Excellent | Native | JSON with bounding boxes | Medium |
| LlamaParse | Very Good | Yes | Markdown | Slow |

For enterprise deployments, Unstructured.io's partition_pdf() with the hi_res strategy provides the best balance of extraction quality and self-hosted control:

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="technical-manual.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    languages=["eng"],
    extract_images_in_pdf=True,
    extract_image_block_types=["Image", "Table"],
)

# Elements are typed: NarrativeText, Title, Table, ListItem, etc.
for element in elements:
    print(f"Type: {type(element).__name__}, Text: {element.text[:100]}")

Chunking Strategies

Chunking is the single most impactful decision in your RAG pipeline. Bad chunking destroys retrieval quality regardless of how good your embedding model or vector database is.

Chunking methods compared:

Fixed-size chunking (naive approach):

# DO NOT use this in production
chunks = [text[i:i+512] for i in range(0, len(text), 512)]

Splits mid-sentence, mid-paragraph, mid-thought. Results in chunks that are semantically incoherent and retrieve poorly.

Recursive character splitting (LangChain default):

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)
chunks = splitter.split_text(document_text)

Better than fixed-size. Respects paragraph boundaries. But still unaware of document semantic structure.

Semantic chunking (production-grade):

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunk(sentences: list[str], threshold: float = 0.75) -> list[str]:
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = np.dot(embeddings[i], embeddings[i-1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i-1])
        )
        if similarity >= threshold:
            current_chunk.append(sentences[i])
        else:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]

    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

Groups semantically related sentences together. Produces more coherent chunks but requires an additional embedding pass during ingestion.

Document-structure-aware chunking (recommended):

The best approach uses document structure (headings, sections, subsections) as primary boundaries:

import tiktoken

# Token-counting helper used below (cl100k_base matches recent OpenAI models;
# swap in your embedding model's tokenizer if it differs)
_tokenizer = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(_tokenizer.encode(text))

def structure_aware_chunk(elements: list, max_chunk_tokens: int = 500) -> list[dict]:
    chunks = []
    current_chunk = {"text": "", "metadata": {}, "tokens": 0}
    current_section = ""

    for element in elements:
        element_type = type(element).__name__
        element_text = element.text.strip()

        if not element_text:
            continue

        # Section headers create chunk boundaries
        if element_type in ("Title", "Header"):
            if current_chunk["text"]:
                chunks.append(current_chunk.copy())
            current_section = element_text
            current_chunk = {
                "text": element_text + "\n\n",
                "metadata": {"section": current_section},
                "tokens": count_tokens(element_text),
            }
            continue

        element_tokens = count_tokens(element_text)

        # Split if adding this element exceeds limit
        if current_chunk["tokens"] + element_tokens > max_chunk_tokens:
            if current_chunk["text"]:
                chunks.append(current_chunk.copy())
            current_chunk = {
                "text": element_text + "\n\n",
                "metadata": {"section": current_section},
                "tokens": element_tokens,
            }
        else:
            current_chunk["text"] += element_text + "\n\n"
            current_chunk["tokens"] += element_tokens

    if current_chunk["text"]:
        chunks.append(current_chunk)

    return chunks

Chunk Size: The Evidence

Optimal chunk size depends on your embedding model's context window and the nature of your documents. Based on benchmark results across multiple production systems:

| Document Type | Optimal Chunk Size | Overlap | Rationale |
|---|---|---|---|
| Technical documentation | 400-600 tokens | 50-100 tokens | Procedures need complete context |
| Legal/compliance | 300-500 tokens | 100-150 tokens | Clauses must not be split |
| Knowledge base articles | 500-800 tokens | 100 tokens | Self-contained answer units |
| Code documentation | 200-400 tokens | 50 tokens | Functions/methods as natural units |
| Research papers | 500-700 tokens | 100 tokens | Paragraph-level semantic units |
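These ranges can be encoded as a lookup table with a conservative default for unknown document types. A minimal sketch using midpoint values from the table above (the type keys are illustrative, not a standard taxonomy):

```python
# Midpoints of the recommended ranges above; tune per corpus.
CHUNK_CONFIGS = {
    "technical_docs": {"max_tokens": 500, "overlap": 75},
    "legal": {"max_tokens": 400, "overlap": 125},
    "kb_articles": {"max_tokens": 650, "overlap": 100},
    "code_docs": {"max_tokens": 300, "overlap": 50},
    "research": {"max_tokens": 600, "overlap": 100},
}

def chunk_params(doc_type: str) -> dict:
    # Fall back to a middle-ground setting when the type is unknown
    return CHUNK_CONFIGS.get(doc_type, {"max_tokens": 500, "overlap": 100})
```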

Embedding Models: Selection and Deployment

Model Comparison (2026 Benchmarks)

| Model | Dimensions | MTEB Score | Max Tokens | Inference Cost | Self-Hostable |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 64.6 | 8191 | $0.13/1M tokens | No |
| OpenAI text-embedding-3-small | 1536 | 62.3 | 8191 | $0.02/1M tokens | No |
| Cohere embed-v3 | 1024 | 64.5 | 512 | $0.10/1M tokens | No |
| Voyage AI voyage-3 | 1024 | 67.1 | 32000 | $0.06/1M tokens | No |
| BGE-M3 (BAAI) | 1024 | 63.5 | 8192 | Self-hosted | Yes |
| E5-Mistral-7B | 4096 | 66.6 | 32768 | Self-hosted | Yes |
| Nomic Embed v1.5 | 768 | 62.3 | 8192 | Self-hosted/API | Yes |
| GTE-Qwen2-7B | 3584 | 65.5 | 32768 | Self-hosted | Yes |

For most production systems, Voyage AI voyage-3 or OpenAI text-embedding-3-large provides the best quality-to-cost ratio when using an API. For self-hosted deployments (data sovereignty requirements or cost optimization at scale), BGE-M3 running on NVIDIA T4 GPUs delivers strong performance.

Self-Hosted Embedding Deployment

# Deploying BGE-M3 on AWS with SageMaker
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

hub_config = {
    "HF_MODEL_ID": "BAAI/bge-m3",
    "HF_TASK": "feature-extraction",
}

huggingface_model = HuggingFaceModel(
    transformers_version="4.37.0",
    pytorch_version="2.1.0",
    py_version="py310",
    env=hub_config,
    role=sagemaker.get_execution_role(),
)

predictor = huggingface_model.deploy(
    initial_instance_count=2,
    instance_type="ml.g5.xlarge",  # NVIDIA A10G, good price/performance
    endpoint_name="bge-m3-embeddings",
)

At scale (>10 million documents), self-hosted embedding saves 60-80% versus API costs. The breakeven point is typically around 1 million embedding operations per month.
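Whether self-hosting actually pays off depends on your volume (ingestion plus query-time embeddings) and your real instance spend, so run the arithmetic with your own numbers. A sketch — all inputs are illustrative placeholders, not vendor quotes:

```python
def breakeven_docs_per_month(
    api_price_per_1m_tokens: float,
    avg_tokens_per_doc: int,
    selfhost_monthly_cost: float,
) -> float:
    """Monthly embedding volume above which self-hosting is cheaper
    than the API, ignoring ops overhead."""
    api_cost_per_doc = avg_tokens_per_doc * api_price_per_1m_tokens / 1_000_000
    return selfhost_monthly_cost / api_cost_per_doc

# Example: $0.13/1M-token API, 500-token documents, $700/month GPU budget
volume = breakeven_docs_per_month(0.13, 500, 700.0)
```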

Embedding Dimensionality Reduction

OpenAI's text-embedding-3 models support Matryoshka Representation Learning (MRL), allowing you to truncate embedding dimensions without retraining:

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Cloud security best practices for multi-tenant environments",
    dimensions=1024,  # Reduced from 3072 default
)

# 1024 dimensions retains ~95% of retrieval quality at 1/3 storage cost
embedding = response.data[0].embedding

This is significant for production cost management. Reducing from 3072 to 1024 dimensions cuts vector storage costs by 67% with minimal retrieval quality loss.
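If you have already stored full 3072-dimension vectors, you can also truncate client-side. Note that a truncated MRL embedding is no longer unit-length, so it must be re-normalized before cosine or dot-product retrieval — a small helper:

```python
import numpy as np

def truncate_and_renormalize(embedding: list[float], dims: int) -> np.ndarray:
    """Shorten an MRL embedding to `dims` dimensions and rescale to unit norm."""
    v = np.asarray(embedding[:dims], dtype=np.float32)
    return v / np.linalg.norm(v)
```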

Vector Database Selection

Comparison for Production Workloads

| Database | Hosted Option | Self-Hosted | Max Vectors | Filtering | Hybrid Search | Production Maturity |
|---|---|---|---|---|---|---|
| Pinecone | Yes (primary) | No | Billions | Metadata | Yes | High |
| Weaviate | Yes | Yes (Docker/K8s) | Billions | GraphQL | Yes | High |
| Qdrant | Yes | Yes (Docker/K8s) | Billions | Payload | Yes | High |
| Milvus/Zilliz | Yes (Zilliz) | Yes (K8s) | Billions | Expressions | Yes | High |
| pgvector | Via cloud PG | Yes (PostgreSQL ext) | Millions | SQL | With tsvector | Medium |
| ChromaDB | No | Yes (embedded) | Millions | Metadata | No | Low (dev/prototype) |

For production systems with <10M vectors and existing PostgreSQL: pgvector 0.7+ with HNSW indexes provides good retrieval quality with zero additional infrastructure.

-- pgvector setup for RAG
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_embeddings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    embedding vector(1024) NOT NULL,
    metadata JSONB NOT NULL DEFAULT '{}',
    source_document TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index for approximate nearest neighbor search
CREATE INDEX ON document_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Metadata index for filtered retrieval
CREATE INDEX idx_metadata ON document_embeddings USING GIN (metadata);
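To query this table from Python, the embedding must be passed in pgvector's text format ('[0.1,0.25,...]'). A small formatting helper, with the query itself sketched in comments (assumes psycopg-style parameter binding; table and column names as defined above):

```python
def to_pgvector_literal(embedding: list[float]) -> str:
    """Format a Python list as pgvector's input literal, e.g. '[0.1,0.25]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

# Usage with psycopg (sketch):
# cur.execute(
#     """SELECT content, metadata,
#               1 - (embedding <=> %s::vector) AS similarity
#        FROM document_embeddings
#        ORDER BY embedding <=> %s::vector
#        LIMIT 5""",
#     (to_pgvector_literal(q), to_pgvector_literal(q)),
# )
```

The `<=>` operator is pgvector's cosine distance, so `1 - distance` recovers cosine similarity.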

For production systems with >10M vectors or strict latency requirements: Qdrant or Weaviate deployed on Kubernetes provide better performance, more sophisticated filtering, and built-in hybrid search.

Retrieval Optimization

Hybrid Search: Dense + Sparse

Pure vector similarity search misses exact keyword matches. Pure keyword search misses semantic similarity. Production systems combine both.

from qdrant_client import QdrantClient
from qdrant_client.models import Prefetch, SparseVector, FusionQuery, Fusion

client = QdrantClient(url="http://qdrant:6333")

# Hybrid search with Reciprocal Rank Fusion (RRF)
results = client.query_points(
    collection_name="documents",
    prefetch=[
        # Dense vector search
        Prefetch(
            query=dense_embedding,
            using="dense",
            limit=20,
        ),
        # Sparse vector search (BM25-weighted terms)
        Prefetch(
            query=sparse_vector,  # a models.SparseVector(indices=[...], values=[...])
            using="bm25",
            limit=20,
        ),
    ],
    # Fuse results using RRF
    query=FusionQuery(fusion=Fusion.RRF),
    limit=10,
)

Hybrid search consistently outperforms either method alone by 5-15% on retrieval benchmarks across diverse document types.
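RRF itself is simple enough to implement directly when your vector store does not provide built-in fusion. A minimal reference implementation (k=60 is the constant proposed in the original RRF paper):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A document ranked well by both retrievers outscores one ranked highly by only one — the property that makes RRF a robust default fusion method.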

Query Rewriting

User queries are often vague, misspelled, or use different terminology than the source documents. Query rewriting improves retrieval without touching the index.

import json

QUERY_REWRITE_PROMPT = """Given the user's question, generate 3 alternative
phrasings that might better match relevant documents. Focus on:
1. Technical terminology variations
2. More specific formulations
3. Broader context framing

User question: {query}

Return JSON: {{"queries": ["...", "...", "..."]}}"""

async def expand_query(query: str, llm_client) -> list[str]:
    response = await llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": QUERY_REWRITE_PROMPT.format(query=query)}],
        response_format={"type": "json_object"},
        temperature=0.3,
    )
    expanded = json.loads(response.choices[0].message.content)
    return [query] + expanded["queries"]  # Include original query

Run retrieval for each query variant and deduplicate results. This technique (sometimes called Multi-Query RAG) improves recall by 10-25% at the cost of additional retrieval latency and embedding API calls.
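Merging the per-variant result lists amounts to deduplicating by chunk id while keeping each chunk's best score. A sketch (assumes each hit is a dict carrying "id" and "score" keys — adjust to your retriever's result shape):

```python
def merge_multi_query_results(result_sets: list[list[dict]], top_k: int = 10) -> list[dict]:
    """Deduplicate hits across query variants, keeping each chunk's best score."""
    best: dict[str, dict] = {}
    for results in result_sets:
        for hit in results:
            prev = best.get(hit["id"])
            if prev is None or hit["score"] > prev["score"]:
                best[hit["id"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top_k]
```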

Reranking

After initial retrieval, a cross-encoder reranker scores each (query, document) pair for relevance. Cross-encoders are more accurate than bi-encoders (embedding models) but too slow for full-corpus search, making them ideal as a second-stage filter.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank_results(query: str, documents: list[dict], top_k: int = 5) -> list[dict]:
    pairs = [(query, doc["content"]) for doc in documents]
    scores = reranker.predict(pairs)

    for doc, score in zip(documents, scores):
        doc["rerank_score"] = float(score)

    ranked = sorted(documents, key=lambda x: x["rerank_score"], reverse=True)
    return ranked[:top_k]

Cohere's Rerank API (rerank-v3.5) and Voyage AI's reranker are strong commercial alternatives that eliminate self-hosting overhead. Reranking typically improves answer quality by 15-30% compared to raw vector similarity ranking.

Context Assembly and Prompt Engineering

Context Window Management

With retrieved chunks in hand, assembling the context for the LLM requires careful token budget management:

def assemble_context(
    query: str,
    retrieved_chunks: list[dict],
    max_context_tokens: int = 6000,
    system_prompt_tokens: int = 500,
    max_response_tokens: int = 1500,
) -> str:
    # Reserve tokens for system prompt + query + response
    available_tokens = max_context_tokens - system_prompt_tokens - count_tokens(query) - max_response_tokens

    context_parts = []
    used_tokens = 0

    for chunk in retrieved_chunks:
        chunk_tokens = count_tokens(chunk["content"])
        if used_tokens + chunk_tokens > available_tokens:
            break

        source_label = f"[Source: {chunk['metadata'].get('source', 'Unknown')}, Section: {chunk['metadata'].get('section', 'N/A')}]"
        context_parts.append(f"{source_label}\n{chunk['content']}")
        used_tokens += chunk_tokens

    return "\n\n---\n\n".join(context_parts)

Generation Prompt Template

SYSTEM_PROMPT = """You are a technical assistant answering questions based on
the provided reference documents. Follow these rules strictly:

1. Answer ONLY based on the provided context. If the context does not contain
   sufficient information to answer, say "I don't have enough information to
   answer this question based on the available documents."
2. Cite your sources using [Source: filename, Section: section_name] format.
3. If multiple sources provide conflicting information, note the discrepancy.
4. Do not speculate or add information not present in the context.
5. Use precise technical terminology from the source documents."""

USER_TEMPLATE = """Context documents:
{context}

---

Question: {query}

Provide a detailed, accurate answer based on the above context."""

Evaluation: Measuring What Matters

Retrieval Metrics

def evaluate_retrieval(
    test_queries: list[dict],  # {"query": str, "relevant_doc_ids": list[str]}
    retriever,
    k_values: list[int] = [1, 3, 5, 10],
) -> dict:
    metrics = {f"recall@{k}": [] for k in k_values}
    metrics.update({f"precision@{k}": [] for k in k_values})
    metrics["mrr"] = []

    for test_case in test_queries:
        results = retriever.retrieve(test_case["query"], top_k=max(k_values))
        retrieved_ids = [r["id"] for r in results]
        relevant_ids = set(test_case["relevant_doc_ids"])

        # MRR (Mean Reciprocal Rank)
        for rank, rid in enumerate(retrieved_ids, 1):
            if rid in relevant_ids:
                metrics["mrr"].append(1.0 / rank)
                break
        else:
            metrics["mrr"].append(0.0)

        # Recall@K and Precision@K
        for k in k_values:
            top_k_ids = set(retrieved_ids[:k])
            hits = len(top_k_ids & relevant_ids)
            metrics[f"recall@{k}"].append(hits / len(relevant_ids) if relevant_ids else 0)
            metrics[f"precision@{k}"].append(hits / k)

    return {key: sum(values) / len(values) for key, values in metrics.items()}

Target benchmarks for production RAG systems:

| Metric | Acceptable | Good | Excellent |
|---|---|---|---|
| Recall@5 | >0.70 | >0.80 | >0.90 |
| Precision@5 | >0.50 | >0.65 | >0.80 |
| MRR | >0.60 | >0.75 | >0.85 |
| Answer correctness | >0.70 | >0.80 | >0.90 |
| Faithfulness (no hallucination) | >0.85 | >0.92 | >0.97 |

LLM-as-Judge for Generation Quality

FAITHFULNESS_JUDGE_PROMPT = """Evaluate whether the ANSWER is faithfully
grounded in the provided CONTEXT.

CONTEXT:
{context}

ANSWER:
{answer}

Score on a scale of 1-5:
1: Answer contains fabricated information not in context
2: Answer mostly fabricated with minor grounding
3: Answer partially grounded, some claims unsupported
4: Answer mostly grounded, minor extrapolations
5: Answer fully grounded in provided context

Return JSON: {{"score": int, "reasoning": str, "unsupported_claims": list[str]}}"""

Run this evaluation asynchronously on a sample of production responses (judging every response is usually cost-prohibitive; 5-10% sampling is a common starting point). Aggregate scores to detect regression over time.

Production Operations

Monitoring and Alerting

Track these metrics in your monitoring stack (Prometheus/Grafana, Datadog, or CloudWatch):

| Metric | Alert Threshold | Impact |
|---|---|---|
| P95 retrieval latency | >500ms | User experience degradation |
| P95 end-to-end latency | >5s | User abandonment |
| Retrieval empty rate | >5% | Missing content or index issues |
| Hallucination rate (sampled) | >8% | Trust erosion |
| LLM token cost per query | >$0.05 | Budget overrun |
| Embedding throughput (ingestion) | <100 docs/min | Ingestion pipeline bottleneck |

Cost Management

RAG system costs break down into three components:

Monthly cost = Embedding costs + Vector storage costs + LLM generation costs

Example for 1M document corpus, 100K queries/month:

Embedding (one-time ingestion):
  1M docs × 500 tokens avg × $0.02/1M tokens = $10

Vector storage (monthly):
  1M vectors × 1024 dims × 4 bytes = 4 GB
  Pinecone: ~$70/month (s1.x1 pod)
  pgvector: ~$0 (existing PostgreSQL)
  Qdrant Cloud: ~$50/month

LLM generation (monthly):
  100K queries × 2000 input tokens × $2.50/1M tokens (GPT-4o) = $500
  100K queries × 500 output tokens × $10/1M tokens (GPT-4o) = $500

Total: ~$1,070/month with GPT-4o
       ~$130/month with GPT-4o-mini ($0.15/$0.60 per 1M tokens: ~$60 LLM + $70 storage)

Use GPT-4o-mini or Claude 3.5 Haiku for straightforward Q&A queries. Reserve GPT-4o or Claude 3.5 Sonnet for complex reasoning queries. Route based on query complexity classification.
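A keyword/length heuristic is a reasonable first cut at such a router — a sketch (marker list and thresholds are illustrative; many teams eventually replace this with a small trained classifier):

```python
# Phrases that tend to indicate multi-step reasoning rather than lookup
COMPLEX_MARKERS = ("compare", "why", "trade-off", "tradeoff", "design", "versus")

def route_model(query: str) -> str:
    """Route simple lookups to a cheap model, reasoning queries to a frontier model."""
    q = query.lower()
    if len(q.split()) > 25 or any(marker in q for marker in COMPLEX_MARKERS):
        return "gpt-4o"
    return "gpt-4o-mini"
```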

Common Failure Modes and Mitigations

Failure 1: Chunking splits critical information across chunks. Mitigation: Use document-structure-aware chunking with overlap. Add parent-document retrieval (retrieve the chunk, but send the full parent section to the LLM).
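Parent-document retrieval is straightforward to bolt on if each chunk's metadata records which section it came from. A sketch (the "parent_id" metadata key and the parent_store id-to-section mapping are assumptions about your schema):

```python
def expand_to_parents(
    hits: list[dict],
    parent_store: dict[str, str],
    max_parents: int = 3,
) -> list[str]:
    """Match on small chunks, but hand the LLM each chunk's full parent section."""
    seen: set[str] = set()
    parents: list[str] = []
    for hit in hits:
        pid = hit["metadata"]["parent_id"]
        if pid not in seen:
            seen.add(pid)
            parents.append(parent_store[pid])
        if len(parents) >= max_parents:
            break
    return parents
```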

Failure 2: Retrieval returns semantically similar but factually irrelevant chunks. Mitigation: Add metadata filtering (date ranges, document types, categories). Use reranking to promote factually relevant results.

Failure 3: LLM hallucinates despite having correct context. Mitigation: Reduce temperature to 0.1-0.3. Add explicit citation instructions. Implement post-generation faithfulness checking.

Failure 4: Stale documents in the index. Mitigation: Implement document versioning with updated_at timestamps. Run incremental re-indexing on a schedule. Delete outdated chunks when source documents are updated.

Failure 5: Embedding model drift after upgrade. Mitigation: Never mix embeddings from different models in the same collection. When upgrading embedding models, re-embed the entire corpus and swap the collection atomically.

Next Steps

For hands-on experience building AI/ML systems on cloud infrastructure, the Cloud AI & Machine Learning course covers the full pipeline from data engineering through model deployment, including RAG-specific modules. The AI & ML Resources collection provides production-ready templates for RAG pipelines, vector database configurations, and evaluation frameworks.

For building autonomous AI agents that leverage RAG as a tool, the Claude Agent Systems course covers agent architectures, tool use patterns, and multi-agent orchestration with retrieval-augmented components.

RAG systems are infrastructure, not magic. Treat them with the same engineering rigor you apply to any production data pipeline: version your data, test your retrieval, monitor your outputs, and iterate based on measured performance. The teams that win with RAG are the ones that invest in evaluation infrastructure, not the ones chasing the latest embedding model announcement.

Kehinde Ogunlowo

Senior Multi-Cloud DevSecOps Architect & AI Engineer

AWS, Azure, GCP Certified | Secret Clearance | FedRAMP, CMMC, HIPAA

Enterprise experience at Cigna Healthcare, Lockheed Martin, NantHealth, BP Refinery, and Patterson UTI.
