Building Production RAG Pipelines: A Senior Engineer's Guide

Build production-grade RAG systems with vector databases, chunking strategies, retrieval optimization, and evaluation frameworks. Real architectures, not demos.

Every RAG tutorial on the internet follows the same pattern: load a PDF, split it into chunks, embed the chunks with OpenAI, store them in a vector database, retrieve the top-k results, and send them to GPT-4 with a prompt. This works in a Jupyter notebook. It fails in production.

I have built RAG systems that serve regulated healthcare data under HIPAA, defense intelligence documents under ITAR, and internal knowledge bases at scale for enterprise customers. The notebook demo and the production system share approximately 20% of their architecture. The other 80% — the chunking strategy, the embedding pipeline, the retrieval optimization, the evaluation framework, the failure handling, and the operational monitoring — is what determines whether your RAG system gives accurate answers or confidently hallucinates with citations that look correct but are not.

This article covers the production 80%. I assume you already understand the basic RAG concept. We are going straight to the engineering decisions that matter.

The Production RAG Architecture

Here is the architecture running in production at enterprise scale. Every box represents a component that took at least a week to get right:

┌─────────────────────────────────────────────────────────┐
│                    INGESTION PIPELINE                    │
│                                                         │
│  Document Sources → Format Detection → Extraction       │
│  (S3, SharePoint,   (PDF, DOCX,       (text, tables,   │
│   Confluence,        HTML, Markdown,    images via OCR,  │
│   databases)         PPTX, XLSX)        structured data) │
│                                                         │
│  → Cleaning → Chunking → Embedding → Vector Store       │
│    (dedup,    (strategy   (model       (Pinecone,       │
│     PII        selection)  selection)    Weaviate,       │
│     detection)                          pgvector,        │
│                                         Qdrant)          │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                    RETRIEVAL PIPELINE                    │
│                                                         │
│  User Query → Query Analysis → Query Transformation     │
│               (intent,         (expansion, HyDE,        │
│                entity           decomposition)           │
│                extraction)                               │
│                                                         │
│  → Hybrid Search → Re-ranking → Context Assembly        │
│    (vector +       (cross-      (dedup, ordering,       │
│     keyword +       encoder,     metadata injection,    │
│     metadata        Cohere       token budget)          │
│     filter)         Rerank)                             │
│                                                         │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                   GENERATION PIPELINE                    │
│                                                         │
│  Context + Query → Prompt Assembly → LLM Call           │
│                    (system prompt,    (Claude, GPT-4,   │
│                     few-shot,         Llama, Mistral)   │
│                     guardrails)                         │
│                                                         │
│  → Citation Extraction → Hallucination Check            │
│    (map claims to        (verify against                │
│     source chunks)        retrieved context)            │
│                                                         │
│  → Response Formatting → User Response                  │
│    (markdown, tables,                                   │
│     source links)                                       │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                  EVALUATION & MONITORING                  │
│                                                         │
│  Retrieval Metrics: Precision@k, Recall@k, MRR, NDCG   │
│  Generation Metrics: Faithfulness, Relevance, Coherence │
│  Latency: P50, P95, P99 per pipeline stage              │
│  Cost: Embedding tokens, LLM tokens, vector DB queries  │
└─────────────────────────────────────────────────────────┘

Chunking: The Decision That Determines Everything

Chunking strategy is the single highest-leverage decision in a RAG pipeline. Get it wrong, and no amount of retrieval optimization or prompt engineering will save you. The chunk must contain enough context to be useful on its own, but not so much that it dilutes the specific information the query needs.

Fixed-Size Chunking (Baseline)

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)
chunks = splitter.split_text(document_text)

Fixed-size chunking splits text into segments of a target token/character count with overlap between adjacent chunks. It is the default in every tutorial. It works adequately for homogeneous text — novels, articles, transcripts — where the information density is relatively uniform.

It fails for:

- Technical documentation: a 512-token chunk might split a code block in half, or separate a function signature from its description.
- Legal and compliance documents: clause references span paragraphs. A chunk boundary in the middle of a regulatory citation produces two useless chunks.
- Tables and structured data: fixed-size splitting destroys table structure completely.

Semantic Chunking (Better)

Semantic chunking uses embedding similarity between consecutive sentences to find natural breakpoints:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85,
)
chunks = semantic_splitter.split_text(document_text)

The algorithm computes embeddings for each sentence, calculates cosine similarity between consecutive sentences, and splits where the similarity drops below a threshold. This produces chunks that are topically coherent — each chunk covers one concept or topic.
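The same algorithm is easy to sketch from scratch. This toy version takes any `embed_fn` (a name introduced here for illustration; plug in your embedding client) and mirrors the percentile-threshold behavior described above:

```python
import numpy as np

def semantic_chunk(sentences: list[str], embed_fn, percentile: float = 85.0) -> list[str]:
    """Merge consecutive sentences into chunks, splitting wherever the
    cosine similarity between adjacent sentences drops below a
    percentile-derived threshold."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    vectors = np.asarray([embed_fn(s) for s in sentences], dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    # Cosine similarity between each sentence and the next
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    # 85th-percentile breakpoints == split at the lowest 15% of similarities
    threshold = np.percentile(sims, 100.0 - percentile)
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], sims):
        if sim < threshold:  # similarity drop -> topic boundary
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

With a toy two-topic embedding, sentences about different subjects land in separate chunks while same-topic sentences stay together.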

The tradeoff: semantic chunking is 10-50x slower than fixed-size chunking because it requires an embedding call for every sentence during ingestion. For a corpus of 100,000 documents, this adds hours to the ingestion pipeline. In production, I use semantic chunking for high-value, slowly-changing documents (policy manuals, product documentation, compliance frameworks) and fixed-size chunking for high-volume, frequently-updated content (support tickets, chat logs, news articles).

Document-Structure-Aware Chunking (Best for Technical Content)

For technical documentation, the best approach respects the document's inherent structure:

import re
from dataclasses import dataclass

@dataclass(frozen=True)
class StructuredChunk:
    content: str
    metadata: dict
    section_hierarchy: tuple  # ("Chapter 3", "Section 3.2", "Subsection 3.2.1")
    chunk_type: str  # "prose", "code_block", "table", "list"

def chunk_by_structure(markdown_text: str, max_chunk_tokens: int = 1024) -> list:
    """Split markdown by headers, preserving hierarchy and code blocks.

    max_chunk_tokens is a target budget; further splitting oversized
    sections is left to a downstream pass.
    """
    chunks = []
    current_hierarchy = []
    current_content = []
    in_code_block = False

    def flush():
        if current_content:
            has_code = any(l.startswith("```") for l in current_content)
            chunks.append(StructuredChunk(
                content="\n".join(current_content),
                metadata={"headers": list(current_hierarchy)},
                section_hierarchy=tuple(current_hierarchy),
                chunk_type="code_block" if has_code else "prose",
            ))
            current_content.clear()

    for line in markdown_text.split("\n"):
        # Track fence state so '#' lines inside code blocks are never
        # mistaken for markdown headers
        if line.startswith("```"):
            in_code_block = not in_code_block

        header_match = None if in_code_block else re.match(r'^(#{1,6})\s+(.+)', line)

        if header_match:
            flush()  # close out the previous section as a chunk
            level = len(header_match.group(1))
            title = header_match.group(2)
            # Trim deeper levels, then append this header's title
            current_hierarchy = current_hierarchy[:level - 1] + [title]

        current_content.append(line)

    flush()  # final section
    return chunks

The key insight: the section hierarchy becomes metadata attached to every chunk. When a user asks "What are the authentication requirements in section 3.2?", the metadata filter narrows retrieval to chunks from that section before vector similarity runs. This is hybrid retrieval — metadata filtering plus semantic search — and it outperforms pure vector search by 15-25% on precision@5 in my production benchmarks.
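A minimal in-memory sketch of that filter-then-rank flow (the `section_hierarchy` field follows the StructuredChunk above; a real deployment pushes the filter into the vector store rather than Python):

```python
import numpy as np

def hybrid_retrieve(query_vec, section_filter, chunks, top_k=5):
    """Metadata filter first, then cosine similarity over the survivors.
    `chunks` is a list of dicts: {"id", "vector", "section_hierarchy"}."""
    candidates = [
        c for c in chunks
        if section_filter is None or section_filter in c["section_hierarchy"]
    ]
    q = np.asarray(query_vec, dtype=float)
    q = q / max(np.linalg.norm(q), 1e-12)
    scored = []
    for c in candidates:
        v = np.asarray(c["vector"], dtype=float)
        score = float(v @ q / max(np.linalg.norm(v), 1e-12))
        scored.append((score, c["id"]))
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:top_k]]
```

The filter shrinks the candidate set before any similarity math runs, which is exactly why the section-scoped query never sees chunks from unrelated chapters.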

Chunk Size Optimization

There is no universal optimal chunk size. It depends on your embedding model, your content type, and your query patterns. Here is a framework for finding it:

Embedding Model          Recommended Chunk Size    Why
────────────────────────────────────────────────────────────
text-embedding-3-small   256-512 tokens            Small context window; larger chunks dilute embedding
text-embedding-3-large   512-1024 tokens           Larger model captures more nuance
Cohere embed-v3          512-1024 tokens            Optimized for longer passages
voyage-3                 512-1024 tokens            Strong on technical content
BGE-M3 (open source)     256-512 tokens            Best at shorter, focused chunks

Run this experiment on your actual data: create chunk sets at 256, 512, 768, and 1024 tokens. Build a test set of 50 queries with known correct answers. Measure retrieval precision@5 and recall@10 at each chunk size. The optimal size is where precision peaks without recall dropping below an acceptable threshold (I target 85% recall@10).
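The metrics themselves are a few lines each. A sketch of the harness, assuming a `search_fn` you supply per chunk-size configuration:

```python
def precision_at_k(retrieved: list, relevant, k: int = 5) -> float:
    """Fraction of the top-k retrieved chunk ids that are relevant."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def recall_at_k(retrieved: list, relevant, k: int = 10) -> float:
    """Fraction of the relevant chunk ids found in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in set(relevant))
    return hits / len(relevant)

def evaluate_chunk_size(test_set, search_fn):
    """test_set: list of (query, relevant_ids); search_fn(query) -> ranked ids.
    Returns (mean precision@5, mean recall@10) for one chunk-size config."""
    p5 = [precision_at_k(search_fn(q), rel, 5) for q, rel in test_set]
    r10 = [recall_at_k(search_fn(q), rel, 10) for q, rel in test_set]
    return sum(p5) / len(p5), sum(r10) / len(r10)
```

Run it once per chunk-size variant against the same 50-query test set and pick the size where precision peaks while recall@10 stays above your floor.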

Embedding Model Selection

The embedding model converts text into dense vectors that capture semantic meaning. Your choice here affects retrieval quality, latency, and cost.

Production embedding models (2026):

Model                            Dimensions  Max Tokens  Cost per 1M  Strengths
───────────────────────────────────────────────────────────────────────────────
text-embedding-3-large (OpenAI)  3072†       8,191       $0.13        Best general-purpose; dimension reduction without reindexing
text-embedding-3-small (OpenAI)  1536        8,191       $0.02        5x cheaper; ~90% of the large model's quality
embed-v3.0 (Cohere)              1024        512         $0.10        input_type parameter (search_query vs search_document)
voyage-3 (Voyage)                1024        32,000      $0.06        Best for code and technical content
BGE-M3 (self-hosted)             1024        8,192       Infra cost   No API dependency; multilingual; runs on a single GPU

† Reducible to 256-3072 via the dimensions parameter.

For production systems with sensitive data (healthcare, defense, finance), I use self-hosted BGE-M3 on AWS SageMaker endpoints. The data never leaves your VPC. The latency is 15-30ms per batch of 32 chunks on an ml.g5.xlarge instance ($1.41/hour).

For non-sensitive workloads, OpenAI text-embedding-3-small offers the best cost-to-quality ratio. At $0.02 per million tokens, embedding a 100,000-document corpus (averaging 2,000 tokens per document) costs approximately $4.
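That $4 figure falls out of simple arithmetic. A helper like this (the name is mine, not from any library) keeps cost estimates honest when comparing models:

```python
def embedding_cost_usd(num_docs: int, avg_tokens_per_doc: int,
                       price_per_million_tokens: float) -> float:
    """Estimated one-time cost to embed a corpus at a given token price."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens
```

At text-embedding-3-small pricing, `embedding_cost_usd(100_000, 2_000, 0.02)` comes to $4; the same corpus on text-embedding-3-large at $0.13 is $26.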

Embedding Pipeline Architecture

Do not call the embedding API synchronously during ingestion. Use a pipeline:

import asyncio
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingBatch:
    chunks: tuple
    model: str = "text-embedding-3-small"
    batch_size: int = 256

async def embed_corpus(chunks: list, client, model: str = "text-embedding-3-small",
                       max_retries: int = 3):
    """Embed chunks in batches with exponential-backoff retry."""
    results = []
    batch_size = 256  # OpenAI accepts up to 2048 inputs per call; 256 is safer for rate limits

    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [chunk.content for chunk in batch]

        for attempt in range(max_retries):
            try:
                response = await client.embeddings.create(input=texts, model=model)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                await asyncio.sleep(2 ** attempt)  # back off 1s, 2s, 4s

        for j, embedding_data in enumerate(response.data):
            results.append({
                "chunk": batch[j],
                "embedding": embedding_data.embedding,
                # usage is reported per request, so record the batch total once
                "batch_token_count": response.usage.total_tokens,
            })

    return results

In production, this pipeline runs as an AWS Step Functions workflow:

S3 Upload Event → Lambda (format detection + extraction)
  → SQS Queue (chunking jobs)
  → Lambda (chunking + metadata extraction)
  → SQS Queue (embedding jobs)
  → Lambda or SageMaker Endpoint (embedding)
  → Lambda (vector store upsert)
  → DynamoDB (ingestion tracking + status)

Each stage is independently scalable. When you add 10,000 new documents, the SQS queues absorb the burst and Lambda scales out to process them in parallel. Total cost for embedding 10,000 documents: approximately $0.50-$2.00 depending on document length and model choice.

Retrieval Optimization: Where Most RAG Systems Fail

The retrieval stage is where the quality gap between a demo and a production system becomes enormous. Pure vector similarity search returns contextually related chunks, but not necessarily the chunks that answer the question.

Hybrid Search (Vector + Keyword)

Vector search captures semantic similarity. Keyword search (BM25) captures exact term matches. The best production systems combine both:

from qdrant_client import QdrantClient
from qdrant_client.models import FusionQuery, Prefetch

client = QdrantClient(url="http://qdrant:6333")

# Hybrid search with Qdrant's built-in fusion
results = client.query_points(
    collection_name="citadel_docs",
    prefetch=[
        # Dense vector search
        Prefetch(
            query=query_embedding,
            using="dense",
            limit=20,
        ),
        # Sparse vector search (BM25-equivalent)
        Prefetch(
            query=sparse_query_vector,
            using="sparse",
            limit=20,
        ),
    ],
    query=FusionQuery(fusion="rrf"),  # Reciprocal Rank Fusion
    limit=10,
)

Reciprocal Rank Fusion (RRF) combines the rankings from dense and sparse search without needing to normalize scores. In my benchmarks across three production datasets:

Search Method        Precision@5  Recall@10  MRR
─────────────────────────────────────────────────
Dense vector only    0.72         0.81       0.68
Sparse (BM25) only   0.65         0.74       0.61
Hybrid (RRF fusion)  0.83         0.89       0.79

Hybrid search improved precision@5 by 15% over dense-only search. The keyword component catches exact matches that semantic search misses — acronyms, error codes, version numbers, and proper nouns.
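RRF needs no score normalization because it operates on ranks alone. A from-scratch sketch (k=60 is the conventional smoothing constant):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, limit: int = 10) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over ranked lists of 1/(k + rank).
    Documents appearing high in several lists accumulate the largest scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:limit]
```

A document ranked second by dense search and first by sparse search outranks one that topped only a single list, which is exactly the behavior that makes hybrid search robust to either retriever having a bad day.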

Re-ranking: The 10% That Changes Everything

After initial retrieval returns 20-50 candidates, a cross-encoder re-ranker scores each candidate against the query with much higher accuracy than the initial retrieval:

import cohere

co = cohere.Client(api_key="your-key")

# Initial retrieval returns 20 candidates
initial_results = hybrid_search(query, limit=20)

# Re-rank with Cohere
rerank_response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[r.content for r in initial_results],
    top_n=5,
    return_documents=True,
)

# Use top 5 re-ranked results as context
final_context = [
    {
        "content": result.document.text,
        "relevance_score": result.relevance_score,
        "original_rank": result.index,  # position in the pre-rerank candidate list
    }
    for result in rerank_response.results
]

Re-ranking adds 100-300ms of latency per query but improves answer accuracy by 10-20% in my production measurements. The cross-encoder model reads the query and each candidate together (unlike bi-encoders which embed them separately), allowing it to capture fine-grained relevance that embedding similarity misses.

Query Transformation

Users do not write queries optimized for retrieval. A production RAG system transforms the user query before searching:

HyDE (Hypothetical Document Embeddings): Ask the LLM to generate a hypothetical answer, then embed that answer and search for real documents similar to it. This works because the hypothetical answer uses the same vocabulary and structure as the real answer.

async def hyde_search(query: str, llm_client, embed_client, vector_store):
    # Generate a hypothetical answer (Anthropic Messages API, since the
    # model is Claude; swap in chat.completions for an OpenAI model)
    hypothetical = await llm_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Write a short, factual paragraph that answers this question: {query}"
        }],
    )

    # Embed the hypothetical answer
    hyde_embedding = await embed_client.embeddings.create(
        input=hypothetical.content[0].text,
        model="text-embedding-3-small",
    )

    # Search with the hypothetical-answer embedding
    results = vector_store.search(
        vector=hyde_embedding.data[0].embedding,
        limit=20,
    )
    return results

In my testing, HyDE improved retrieval recall@10 by 8-12% for complex, multi-part questions but decreased performance by 3-5% for simple factual queries. Use it selectively — detect query complexity first, then apply HyDE only for complex queries.
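One cheap way to gate HyDE is a heuristic complexity check before spending the extra LLM call. The markers below are illustrative, not a tuned list; production systems often use a small classifier instead:

```python
COMPLEX_MARKERS = ("compare", "difference", "why", "how does", "explain",
                   "trade-off", "tradeoff", "versus", " vs ")

def is_complex_query(query: str) -> bool:
    """Cheap heuristic: long, multi-clause, or analytical queries get HyDE;
    short factual lookups go straight to hybrid search."""
    q = query.lower()
    if len(q.split()) > 15:
        return True
    if q.count("?") > 1 or " and " in q:
        return True
    return any(marker in q for marker in COMPLEX_MARKERS)
```

The router then calls `hyde_search` only when this returns True, and plain hybrid search otherwise.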

Evaluation: Measuring What Matters

You cannot improve what you cannot measure. Production RAG systems need automated evaluation pipelines:

The RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) provides four core metrics:

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,
    "ground_truth": reference_answers,
})

results = evaluate(
    eval_data,
    metrics=[
        faithfulness,       # Is the answer supported by the context?
        answer_relevancy,   # Does the answer address the question?
        context_precision,  # Are the retrieved contexts relevant?
        context_recall,     # Do the contexts contain the needed information?
    ],
)

print(results)
# {'faithfulness': 0.87, 'answer_relevancy': 0.91,
#  'context_precision': 0.83, 'context_recall': 0.79}

Target metrics for production systems:

Metric             Minimum  Good  Excellent
───────────────────────────────────────────
Faithfulness       0.85     0.90  0.95+
Answer Relevancy   0.80     0.88  0.93+
Context Precision  0.75     0.83  0.90+
Context Recall     0.80     0.87  0.92+

If faithfulness drops below 0.85, your system is hallucinating too often for production use. Investigate: are chunks too short (lacking context), is the re-ranker selecting irrelevant chunks, or is the LLM ignoring the context in favor of parametric knowledge?

Monitoring in Production

Every RAG query in production should log:

{
  "query_id": "uuid",
  "timestamp": "2026-04-02T14:32:00Z",
  "user_query": "What are the IAM requirements for HIPAA?",
  "transformed_query": "IAM policy requirements HIPAA compliance healthcare",
  "retrieval_results": [
    {"chunk_id": "doc-42-chunk-7", "score": 0.91, "rerank_score": 0.95},
    {"chunk_id": "doc-15-chunk-3", "score": 0.87, "rerank_score": 0.82}
  ],
  "llm_model": "claude-sonnet-4-20250514",
  "llm_tokens_in": 3200,
  "llm_tokens_out": 450,
  "latency_ms": {
    "query_transform": 180,
    "retrieval": 45,
    "reranking": 220,
    "generation": 1200,
    "total": 1645
  },
  "cost_usd": 0.0087,
  "user_feedback": null
}

Build CloudWatch dashboards or Grafana dashboards tracking:

- P50/P95/P99 total latency
- Retrieval precision (via periodic automated evaluation against test sets)
- Faithfulness score (sample 5% of queries for automated RAGAS evaluation)
- Cost per query
- Error rate by pipeline stage
- User feedback (thumbs up/down) correlation with automated metrics
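A small tracing helper makes the per-stage latency fields in the log schema above cheap to collect. This is a sketch, not a library API:

```python
import time
import uuid
from contextlib import contextmanager

class QueryTrace:
    """Accumulate per-stage latencies for one RAG query."""

    def __init__(self, user_query: str):
        self.record = {
            "query_id": str(uuid.uuid4()),
            "user_query": user_query,
            "latency_ms": {},
        }

    @contextmanager
    def stage(self, name: str):
        # Wrap each pipeline stage: with trace.stage("retrieval"): ...
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = (time.perf_counter() - start) * 1000
            self.record["latency_ms"][name] = round(elapsed, 1)

    def finalize(self) -> dict:
        # Sum stage latencies into the "total" field, then emit the record
        self.record["latency_ms"]["total"] = round(sum(
            v for k, v in self.record["latency_ms"].items() if k != "total"), 1)
        return self.record
```

Ship the finalized record to CloudWatch Logs or your log pipeline and the dashboards above can be built entirely from these structured events.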

Common Production Failures and Fixes

Failure 1: "The system retrieves relevant documents but the answer is wrong." Root cause: The LLM is using parametric knowledge instead of the retrieved context. Fix: Strengthen the system prompt — explicitly instruct the model to answer only from the provided context. Add a "If the context does not contain the answer, say so" instruction. Reduce the model's temperature to 0.1.

Failure 2: "The system returns 'I don't know' when the answer exists in the corpus." Root cause: Chunking split the answer across two chunks, and neither chunk alone contains enough context. Fix: Increase chunk overlap. Try parent-child chunking — retrieve the child chunk but send the parent (larger) chunk to the LLM.
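Parent-child chunking is simple to prototype. This word-count-based sketch stands in for a token-aware splitter:

```python
def build_parent_child_index(parents: list[str], child_size: int = 200):
    """Split each parent chunk into smaller children. Retrieval matches
    against the focused children; generation receives the full parent."""
    children, child_to_parent = [], {}
    for p_idx, parent in enumerate(parents):
        words = parent.split()
        for start in range(0, len(words), child_size):
            child = " ".join(words[start:start + child_size])
            child_to_parent[len(children)] = p_idx
            children.append(child)
    return children, child_to_parent

def expand_to_parents(child_hits: list[int], child_to_parent, parents):
    """Map retrieved child indices back to deduplicated parent chunks."""
    seen, context = set(), []
    for c in child_hits:
        p = child_to_parent[c]
        if p not in seen:
            seen.add(p)
            context.append(parents[p])
    return context
```

Embed and index the children; at query time, expand the hits to parents before prompt assembly, so the LLM sees the surrounding context the child chunk lacked.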

Failure 3: "Latency is 5+ seconds per query." Root cause: Usually the LLM generation step. Fix: Reduce context size (fewer chunks, shorter chunks). Use a faster model for simple queries (Claude Haiku for straightforward lookups, Claude Sonnet for complex reasoning). Stream the response so the user sees tokens immediately.

Failure 4: "The system works well for English but fails for other languages." Root cause: Embedding model and LLM may not support the target language well. Fix: Use multilingual embedding models (BGE-M3 supports 100+ languages, Cohere embed-v3 supports 100+ languages). Test retrieval quality per language independently.

Go Build It

The Cloud AI/ML course on Citadel Cloud Management covers RAG architectures from fundamentals through production deployment, including hands-on labs for building retrieval pipelines on AWS SageMaker with vector databases, embedding optimization, and evaluation frameworks. Free enrollment.

For production-ready RAG templates, embedding pipeline configurations, vector database setup guides, and evaluation harness code, browse the AI & ML Resources collection.

For engineers building agentic RAG systems — where the LLM decides which tools to call, which documents to retrieve, and how to decompose complex queries — the Claude Agent Systems course covers tool-use architectures, multi-step reasoning chains, and production agent orchestration patterns.

The difference between a RAG demo and a RAG product is engineering discipline: structured chunking, hybrid retrieval, re-ranking, automated evaluation, and operational monitoring. Apply these techniques to your corpus, measure the results, and iterate. The evaluation framework will tell you exactly where your system is failing and what to fix next.

Kehinde Ogunlowo

Senior Multi-Cloud DevSecOps Architect & AI Engineer

AWS, Azure, GCP Certified | Secret Clearance | FedRAMP, CMMC, HIPAA

Enterprise experience at Cigna Healthcare, Lockheed Martin, NantHealth, BP Refinery, and Patterson UTI.

Start Your Cloud Career Today

Access 17 free courses covering AWS, Azure, GCP, DevOps, AI/ML, and cloud security — built by a practicing Senior Cloud Architect with enterprise experience.

Get Free Cloud Career Resources
