title: "Building Production RAG Pipelines: Engineer's Guide"
meta_title: "Building Production RAG Pipelines: Senior Guide"
meta_description: "Senior engineer's guide to production RAG pipelines. Covers chunking strategies, embedding models, vector databases, retrieval optimization, and evaluation frameworks."
keywords:
- RAG pipeline production
- retrieval augmented generation
- LLM RAG architecture
- vector database comparison
- RAG evaluation
author: "Kenny Ogunlowo"
date: 2026-04-02
category: "AI & Machine Learning"
Building Production RAG Pipelines: A Senior Engineer's Guide
Retrieval-Augmented Generation (RAG) has moved from research papers to production workloads serving millions of queries daily. Many enterprises are building or planning RAG systems to make proprietary data accessible through natural-language interfaces. Yet the gap between a demo RAG pipeline (50 lines of LangChain) and a production system that handles real traffic, maintains accuracy, and operates within cost constraints is enormous.
This guide covers the architecture decisions, component selections, and operational patterns required to build RAG systems that survive contact with production traffic. No toy examples. No "just use LangChain" hand-waving. Every recommendation is grounded in systems that process real queries against real document corpora.
RAG Architecture: Beyond the Basic Pattern
The standard RAG pattern is deceptively simple: chunk documents, generate embeddings, store in a vector database, retrieve relevant chunks at query time, and pass them to an LLM as context. Production systems add several critical layers.
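For reference, the five steps of the basic pattern can be sketched in a few lines of standard-library Python. This is only an illustration of the data flow, not production code: `embed` is a hashed bag-of-words stand-in for a real embedding model, `InMemoryStore` is a hypothetical stand-in for a vector database, and the LLM call is left out (the sketch stops at the assembled prompt).

```python
import hashlib
import math
import re


def embed(text: str, dim: int = 1024) -> list[float]:
    """Stand-in embedding (assumption): hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in re.findall(r"\w+", text.lower()):
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def chunk(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


class InMemoryStore:
    """Hypothetical stand-in for a vector database: brute-force cosine search."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, list[float]]] = []

    def upsert(self, text: str) -> None:
        self.rows.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, v)), t) for t, v in self.rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


def build_prompt(query: str, contexts: list[str]) -> str:
    """Assemble retrieved chunks and the question into a single LLM prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


store = InMemoryStore()
for doc in [
    "Qdrant supports payload filtering.",
    "Pinecone is a hosted vector database.",
    "Legal clauses must not be split across chunks.",
]:
    store.upsert(doc)

top_chunks = store.search("payload filtering in Qdrant", k=1)
prompt = build_prompt("payload filtering in Qdrant", top_chunks)
```

Everything a production system adds (query rewriting, hybrid retrieval, reranking, validation) wraps around these same five steps.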
Production RAG Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Query Pipeline │
│ │
│ User Query → Query Understanding → Query Rewriting │
│ → Hybrid Retrieval (Dense + Sparse) → Reranking │
│ → Context Assembly → LLM Generation → Response Validation │
│ → Citation Extraction → Response Delivery │
│ │
├─────────────────────────────────────────────────────────────────┤
│ Ingestion Pipeline │
│ │
│ Source Documents → Format Extraction → Cleaning │
│ → Chunking → Metadata Enrichment → Embedding Generation │
│ → Vector Store Upsert → Index Optimization │
│ │
├─────────────────────────────────────────────────────────────────┤
│ Evaluation Pipeline │
│ │
│ Test Queries → Retrieval Metrics → Generation Quality │
│ → Hallucination Detection → Latency/Cost Tracking │
│ → Regression Alerts │
│ │
└─────────────────────────────────────────────────────────────────┘
Each of these three pipelines has distinct requirements, failure modes, and optimization surfaces. Treating RAG as a single pipeline is the first mistake most teams make.
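The "Hybrid Retrieval (Dense + Sparse)" stage in the query pipeline needs some way to merge two ranked lists before reranking. One common option is reciprocal rank fusion (RRF); the sketch below assumes RRF with the conventional constant `k = 60`, which is an illustrative choice rather than a method prescribed by this guide. The document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs: score(d) = sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["a", "b", "c"]   # hypothetical IDs from the vector index
sparse_hits = ["b", "d", "a"]  # hypothetical IDs from a keyword (BM25-style) index
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF uses only rank positions, so it avoids having to normalize dense similarity scores against sparse keyword scores, which live on different scales.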
Document Ingestion: Where Quality Starts
Format Extraction
Production document corpora are messy. You will encounter PDFs with embedded tables, scanned documents, PowerPoint presentations, HTML with complex layouts, and Markdown with inconsistent formatting.
PDF extraction tools ranked by quality:
| Tool | Table Handling | OCR | Structured Output | Speed |
|---|---|---|---|---|
| Unstructured.io | Excellent | Yes (Tesseract/PaddleOCR) | JSON elements | Medium |
| PyMuPDF (fitz) | Good | No (text-based only) | Text blocks with coordinates | Fast |
| Amazon Textract | Excellent | Native | JSON with confidence scores | Medium |
| Azure Document Intelligence | Excellent | Native | JSON with bounding boxes | Medium |
| LlamaParse | Very Good | Yes | Markdown | Slow |
Recommended chunk size and overlap by document type:
| Document Type | Optimal Chunk Size | Overlap | Rationale |
|---|---|---|---|
| Technical documentation | 400-600 tokens | 50-100 tokens | Procedures need complete context |
| Legal/compliance | 300-500 tokens | 100-150 tokens | Clauses must not be split |
| Knowledge base articles | 500-800 tokens | 100 tokens | Self-contained answer units |
| Code documentation | 200-400 tokens | 50 tokens | Functions/methods as natural units |
| Research papers | 500-700 tokens | 100 tokens | Paragraph-level semantic units |
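The chunk size and overlap columns above translate into a sliding window over the token sequence. The following is a minimal sketch, assuming the input is already tokenized; in the example, whitespace-split words stand in for real tokenizer tokens, so actual token counts will differ by model.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, so no trailing chunk is
        # made up entirely of tokens already covered by the previous one.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = [f"t{i}" for i in range(10)]  # placeholder tokens for illustration
windows = chunk_tokens(tokens, chunk_size=4, overlap=1)
```

For technical documentation, a call such as `chunk_tokens(tokens, chunk_size=500, overlap=75)` would sit inside the ranges in the table. A fixed window like this ignores sentence and section boundaries, so it is a baseline rather than a structure-aware chunker.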
Embedding model comparison:
| Model | Dimensions | MTEB Score | Max Tokens | Inference Cost | Self-Hostable |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 64.6 | 8191 | $0.13/1M tokens | No |
| OpenAI text-embedding-3-small | 1536 | 62.3 | 8191 | $0.02/1M tokens | No |
| Cohere embed-v3 | 1024 | 64.5 | 512 | $0.10/1M tokens | No |
| Voyage AI voyage-3 | 1024 | 67.1 | 32000 | $0.06/1M tokens | No |
| BGE-M3 (BAAI) | 1024 | 63.5 | 8192 | Self-hosted | Yes |
| E5-Mistral-7B | 4096 | 66.6 | 32768 | Self-hosted | Yes |
| Nomic Embed v1.5 | 768 | 62.3 | 8192 | Self-hosted/API | Yes |
| GTE-Qwen2-7B | 3584 | 65.5 | 32768 | Self-hosted | Yes |
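The price and dimension columns can be turned into a rough budget before committing to a model. The sketch below uses the per-token prices and dimensions from the table; the corpus size (1 million chunks of 500 tokens each) is a hypothetical figure chosen for illustration, and the storage estimate assumes uncompressed float32 vectors with no index overhead.

```python
def embedding_cost_usd(total_tokens: int, price_per_million: float) -> float:
    """One-time API cost to embed `total_tokens` at a per-1M-token price."""
    return total_tokens / 1_000_000 * price_per_million


def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 by default, no index overhead)."""
    return num_vectors * dimensions * bytes_per_value


# Hypothetical corpus: 1M chunks of 500 tokens each.
num_chunks = 1_000_000
total_tokens = num_chunks * 500

large_cost = embedding_cost_usd(total_tokens, 0.13)  # text-embedding-3-large
small_cost = embedding_cost_usd(total_tokens, 0.02)  # text-embedding-3-small

large_bytes = raw_vector_bytes(num_chunks, 3072)
small_bytes = raw_vector_bytes(num_chunks, 1536)

print(f"large: ${large_cost:.2f}, {large_bytes / 1e9:.2f} GB raw vectors")
print(f"small: ${small_cost:.2f}, {small_bytes / 1e9:.2f} GB raw vectors")
```

Under these assumptions the one-time embedding bill is small for either model, while the dimension count doubles the vector storage, which is a recurring cost in the vector store.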
Vector database comparison:
| Database | Hosted Option | Self-Hosted | Max Vectors | Filtering | Hybrid Search | Production Maturity |
|---|---|---|---|---|---|---|
| Pinecone | Yes (primary) | No | Billions | Metadata | Yes | High |
| Weaviate | Yes | Yes (Docker/K8s) | Billions | GraphQL | Yes | High |
| Qdrant | Yes | Yes (Docker/K8s) | Billions | Payload | Yes | High |
| Milvus/Zilliz | Yes (Zilliz) | Yes (K8s) | Billions | Expressions | Yes | High |
| pgvector | Via cloud PG | Yes (PostgreSQL ext) | Millions | SQL | With tsvector | Medium |
| ChromaDB | No | Yes (embedded) | Millions | Metadata | No | Low (dev/prototype) |
Retrieval and generation quality targets:
| Metric | Acceptable | Good | Excellent |
|---|---|---|---|
| Recall@5 | >0.70 | >0.80 | >0.90 |
| Precision@5 | >0.50 | >0.65 | >0.80 |
| MRR | >0.60 | >0.75 | >0.85 |
| Answer correctness | >0.70 | >0.80 | >0.90 |
| Faithfulness (no hallucination) | >0.85 | >0.92 | >0.97 |
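The three retrieval metrics in the table have short standard definitions, sketched below. The sketch assumes each test query has a labeled set of relevant document IDs and that retrieved IDs are unique; the IDs in the example are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k result slots filled by relevant documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Hypothetical labeled test queries.
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

r5 = recall_at_k(retrieved, relevant, k=5)
p5 = precision_at_k(retrieved, relevant, k=5)
mrr = mean_reciprocal_rank([(retrieved, relevant), (["d6"], {"d6"})])
```

Answer correctness and faithfulness are judged against the generated text rather than the ranked list, so they need a separate grader (human or model-based) and are not covered by this sketch.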
Production monitoring alert thresholds:
| Metric | Alert Threshold | Impact |
|---|---|---|
| P95 retrieval latency | >500ms | User experience degradation |
| P95 end-to-end latency | >5s | User abandonment |
| Retrieval empty rate | >5% | Missing content or index issues |
| Hallucination rate (sampled) | >8% | Trust erosion |
| LLM token cost per query | >$0.05 | Budget overrun |
| Embedding throughput | <100 docs/min (ingestion) | Ingestion pipeline bottleneck |
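The thresholds above can be encoded as a small rule table and checked against each metrics snapshot. This is a sketch under stated assumptions: the metric key names are hypothetical, the P95 uses the nearest-rank method (monitoring backends often interpolate instead), and the latency samples are synthetic.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Thresholds from the table above; key names are hypothetical.
ALERT_RULES = {
    "p95_retrieval_latency_ms": (">", 500),
    "p95_end_to_end_latency_s": (">", 5),
    "retrieval_empty_rate": (">", 0.05),
    "hallucination_rate": (">", 0.08),
    "llm_cost_per_query_usd": (">", 0.05),
    "embedding_docs_per_min": ("<", 100),
}


def breached_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that cross their alert threshold."""
    breached = []
    for name, (op, limit) in ALERT_RULES.items():
        if name not in metrics:
            continue  # metric not reported in this snapshot
        value = metrics[name]
        if (op == ">" and value > limit) or (op == "<" and value < limit):
            breached.append(name)
    return breached


# Synthetic retrieval latencies: 10 ms, 20 ms, ..., 1000 ms.
latencies_ms = [10.0 * i for i in range(1, 101)]
snapshot = {
    "p95_retrieval_latency_ms": percentile(latencies_ms, 95),
    "retrieval_empty_rate": 0.02,
    "embedding_docs_per_min": 80,
}
alerts = breached_alerts(snapshot)
```

Keeping the rules in data rather than in code makes it easier to review threshold changes alongside the evaluation results that justify them.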