Building Production RAG Pipelines with LangChain and Claude in 2026

Retrieval Augmented Generation has moved from research curiosity to production requirement. Every enterprise deploying LLM-powered applications — customer support, internal knowledge search, document analysis, compliance Q&A — needs a RAG pipeline. The concept is straightforward: retrieve relevant context from your data, inject it into the LLM prompt, and generate grounded responses. The implementation details are where teams spend months iterating.

This guide covers building a production-grade RAG pipeline using LangChain 0.3, Anthropic's Claude 3.5 Sonnet (and Opus for complex reasoning), pgvector for embeddings storage, and Cohere Rerank for retrieval quality. The architecture handles documents from 100 pages to 500,000+ pages with consistent retrieval accuracy.

[IMAGE: RAG pipeline architecture diagram showing document ingestion (PDF, Markdown, HTML), chunking with overlap, embedding generation via voyage-3, storage in pgvector, retrieval with hybrid search, Cohere reranking, and Claude 3.5 generation with citation tracking]

Architecture Overview

A production RAG pipeline has five stages, and each one matters:

  1. Document Ingestion — Parse source documents into clean text
  2. Chunking — Split text into retrieval-optimized segments
  3. Embedding & Indexing — Generate vector embeddings and store them
  4. Retrieval & Reranking — Find and score relevant chunks for a query
  5. Generation — Synthesize an answer using retrieved context + LLM

The naive approach — split documents into 500-character chunks, embed with text-embedding-ada-002, retrieve top-3, and send to GPT — works for demos. It fails in production because retrieval precision drops below 60% on domain-specific queries, chunk boundaries split critical information, and there is no mechanism to evaluate or improve quality.

Stage 1: Document Ingestion

Parser Selection by Document Type

| Document Type | Parser | Why |
| --- | --- | --- |
| PDF (text) | PyMuPDF (v1.24) | Fastest, preserves layout structure |
| PDF (scanned/image) | Amazon Textract or Azure Document Intelligence | OCR with table/form extraction |
| Markdown | langchain_text_splitters.MarkdownHeaderTextSplitter | Preserves header hierarchy as metadata |
| HTML | BeautifulSoup4 + langchain_community.document_loaders.BSHTMLLoader | Strips boilerplate, extracts content |
| DOCX | python-docx via langchain_community.document_loaders.Docx2txtLoader | Handles formatting, tables |
| Confluence/Notion | Native API loaders in langchain_community | Preserves page structure, links |

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import MarkdownHeaderTextSplitter

# PDF ingestion with metadata preservation
loader = PyMuPDFLoader("compliance_handbook.pdf")
raw_docs = loader.load()

# Each document carries metadata: page number, source file, section headers
for doc in raw_docs:
    doc.metadata["source_type"] = "pdf"
    doc.metadata["ingestion_date"] = "2026-04-03"
    doc.metadata["document_category"] = "compliance"

Metadata matters. In production, users need to know where an answer came from — page number, document title, section heading. Attach metadata during ingestion, not after.

Stage 2: Chunking Strategy

Chunking is the most underestimated stage. Poor chunking destroys retrieval quality regardless of how good your embedding model or LLM is.

Recursive Character Splitting with Semantic Awareness

LangChain 0.3's RecursiveCharacterTextSplitter remains the workhorse, but the parameters require tuning:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,       # Characters — not tokens
    chunk_overlap=200,     # 15-20% overlap prevents information loss at boundaries
    separators=[
        "\n## ",           # H2 headers (highest priority split point)
        "\n### ",          # H3 headers
        "\n\n",            # Paragraph breaks
        "\n",              # Line breaks
        ". ",              # Sentence boundaries
        " ",               # Word boundaries (last resort)
    ],
    length_function=len,
    is_separator_regex=False,
)

chunks = splitter.split_documents(raw_docs)

Why 1,200 characters? Voyage AI's voyage-3 embedding model (our recommended choice, and the embeddings provider Anthropic points Claude users to) handles inputs up to 32K tokens, but retrieval precision peaks with chunks of 200-400 tokens (~800-1,600 characters). Chunks that are too small lose context; chunks that are too large dilute the relevant signal.
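
If you prefer to size chunks in tokens rather than characters, swap in a token-aware length function. A minimal sketch, assuming tiktoken is installed and using its cl100k_base encoding as a rough proxy (voyage-3 uses its own tokenizer, so counts are approximate):

import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Approximate token counting; cl100k_base is a stand-in, not voyage-3's tokenizer
encoding = tiktoken.get_encoding("cl100k_base")

def token_length(text: str) -> int:
    return len(encoding.encode(text))

token_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,      # Now measured in tokens, targeting the 200-400 token sweet spot
    chunk_overlap=50,
    length_function=token_length,
)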

Parent-Child Chunking for Long Documents

For documents exceeding 100 pages, use a two-level chunking strategy:

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers import ParentDocumentRetriever

# Parent chunks: large context windows for generation
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=400)

# Child chunks: small, precise for retrieval matching
child_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)

store = InMemoryStore()  # Use Redis or PostgreSQL in production

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,   # pgvector store from Stage 3; indexes the small child chunks
    docstore=store,            # holds the full parent chunks, keyed by ID
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

The child chunk matches the query with high precision. The parent chunk provides the LLM with surrounding context needed for a complete answer.
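
Note that ParentDocumentRetriever does the splitting itself, so you feed it the raw documents rather than pre-split chunks. A brief usage sketch, assuming the raw_docs loaded in Stage 1:

# Splits raw_docs into parents and children, embeds the children, stores the parents
retriever.add_documents(raw_docs)

# Queries match against the small child chunks but return the larger parent chunks
parent_docs = retriever.invoke("What are the encryption requirements for data at rest?")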

Stage 3: Embedding and Indexing

Embedding Model Selection (April 2026)

| Model | Dimensions | Max Tokens | Cost (per 1M tokens) | MTEB Score |
| --- | --- | --- | --- | --- |
| voyage-3 (Voyage AI) | 1024 | 32K | $0.06 (input) | 67.3 |
| text-embedding-3-large (OpenAI) | 3072 | 8K | $0.13 | 64.6 |
| Cohere embed-v4 | 1024 | 512 | $0.10 | 66.1 |
| BGE-M3 (open source) | 1024 | 8K | Self-hosted | 65.8 |

For Claude pipelines, voyage-3 is the recommended pairing: Anthropic does not offer its own embeddings API and points Claude users to Voyage AI, and in the comparison above it leads on both cost and MTEB score among the hosted options.

pgvector on PostgreSQL 16

pgvector (v0.7+) running on PostgreSQL 16 is the production-grade vector store choice for teams already running PostgreSQL. No additional infrastructure to manage.

-- Enable extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create embeddings table with HNSW index
CREATE TABLE document_embeddings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    metadata JSONB NOT NULL DEFAULT '{}',
    embedding vector(1024) NOT NULL,  -- voyage-3 dimensions
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index for approximate nearest neighbor search
CREATE INDEX ON document_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 128);

-- GIN index on metadata for filtered retrieval
CREATE INDEX ON document_embeddings
USING gin (metadata jsonb_path_ops);

LangChain integration:

import os

from langchain_postgres import PGVector
from langchain_voyageai import VoyageAIEmbeddings

embeddings = VoyageAIEmbeddings(
    model="voyage-3",
    voyage_api_key=os.environ["VOYAGE_API_KEY"],
)

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="compliance_docs",
    connection=os.environ["DATABASE_URL"],  # postgresql://...
    use_jsonb=True,
)

# Index documents
vectorstore.add_documents(chunks)
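
Because access control and provenance live in the metadata attached during ingestion, retrieval can be scoped at query time. A minimal sketch, assuming the document_category field from Stage 1 and langchain_postgres's dict-based filter syntax:

# Restrict similarity search to chunks tagged as compliance documents
filtered_retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 10,
        "filter": {"document_category": "compliance"},
    }
)

docs = filtered_retriever.invoke("Which audit logs must be retained, and for how long?")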

Hybrid Search: Vector + Full-Text

Pure vector similarity misses exact keyword matches. Hybrid search combines semantic similarity with BM25 keyword matching:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(chunks, k=10)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7],  # 70% semantic, 30% keyword
)

Stage 4: Retrieval and Reranking

Cohere Rerank v3

Initial retrieval (top-20 from hybrid search) casts a wide net. Reranking with Cohere Rerank v3 narrows to the most relevant chunks:

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

reranker = CohereRerank(
    model="rerank-english-v3.0",
    cohere_api_key=os.environ["COHERE_API_KEY"],
    top_n=5,
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever,
)

# Retrieve + rerank
results = compression_retriever.invoke("What are the data retention requirements for HIPAA?")

Reranking consistently improves retrieval precision by 15-25% compared to raw vector similarity, because cross-encoder models evaluate query-document relevance jointly rather than comparing independent embeddings.

[IMAGE: Side-by-side comparison showing retrieval results before and after Cohere reranking, with relevance scores and highlighting showing how reranking promotes the most contextually relevant chunks to the top positions]

Stage 5: Generation with Claude

Prompt Architecture

The generation prompt must instruct Claude to use only the retrieved context, cite sources, and acknowledge when context is insufficient:

from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

llm = ChatAnthropic(
    model="claude-sonnet-4-20250514",
    temperature=0,
    max_tokens=4096,
    anthropic_api_key=os.environ["ANTHROPIC_API_KEY"],
)

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a compliance knowledge assistant for an enterprise organization.
Answer questions using ONLY the provided context documents. For each claim in your answer,
cite the source using [Source: document_name, page X] format.

If the context does not contain sufficient information to answer the question,
say "I don't have enough information in the available documents to answer this question"
and suggest what additional documents might help.

Do not use prior knowledge. Do not speculate. Be precise and specific."""),
    ("human", """Context documents:
{context}

Question: {question}

Answer with citations:"""),
])

chain = prompt | llm

Streaming with Citation Tracking

For production applications, stream responses and track which chunks were used:

from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    # Flatten retrieved chunks into prompt text, keeping source metadata for citations
    return "\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}, page {d.metadata.get('page', '?')}]\n{d.page_content}"
        for d in docs
    )

rag_chain = (
    {
        "context": compression_retriever | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
)

# Stream response
async for chunk in rag_chain.astream("What are HIPAA data retention requirements?"):
    print(chunk.content, end="", flush=True)
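
The chain above streams tokens but discards the retrieved documents, so there is nothing to cite in the UI. A sketch that carries the sources through alongside the answer, following the standard LangChain returning-sources pattern and reusing format_docs and compression_retriever from earlier:

from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# Re-inject the flattened context text for the prompt, keeping the raw docs intact
answer_chain = (
    RunnablePassthrough.assign(context=lambda x: format_docs(x["context"]))
    | prompt
    | llm
)

# The output dict contains both the answer and the documents that produced it
rag_chain_with_sources = RunnableParallel(
    {"context": compression_retriever, "question": RunnablePassthrough()}
).assign(answer=answer_chain)

result = rag_chain_with_sources.invoke("What are HIPAA data retention requirements?")
print(result["answer"].content)
for doc in result["context"]:
    print(doc.metadata.get("source"), "page", doc.metadata.get("page"))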

Evaluation: Measuring RAG Quality

Without evaluation, you are guessing. Four metrics matter:

RAGAS Framework

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# eval_dataset: curated question/answer/contexts/ground_truth rows (see the sketch below);
# evaluation_llm: the judge model used for scoring
eval_results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=evaluation_llm,
    embeddings=embeddings,
)

print(eval_results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.79}

| Metric | Target | What It Measures |
| --- | --- | --- |
| Faithfulness | >0.90 | Does the answer only use information from retrieved context? |
| Answer Relevancy | >0.85 | Is the answer actually relevant to the question? |
| Context Precision | >0.80 | Are the retrieved chunks relevant to the question? |
| Context Recall | >0.75 | Did retrieval find all the relevant information? |

Run evaluations on a curated set of 50-100 question-answer pairs covering your domain. Automate with CI/CD — every change to chunking strategy, embedding model, or prompt template triggers a RAGAS evaluation.
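
RAGAS expects the evaluation set as a Hugging Face Dataset with question, answer, contexts, and ground_truth columns (recent RAGAS releases rename these, so check your version). A minimal assembly sketch, assuming a curated_questions list of (question, ground_truth) pairs plus the rag_chain and compression_retriever defined above:

from datasets import Dataset

records = {"question": [], "answer": [], "contexts": [], "ground_truth": []}

# curated_questions: the 50-100 (question, ground_truth_answer) pairs you maintain
for question, ground_truth in curated_questions:
    docs = compression_retriever.invoke(question)
    answer = rag_chain.invoke(question)
    records["question"].append(question)
    records["answer"].append(answer.content)
    records["contexts"].append([d.page_content for d in docs])
    records["ground_truth"].append(ground_truth)

eval_dataset = Dataset.from_dict(records)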

Production Deployment Checklist

| Concern | Solution |
| --- | --- |
| Latency | Cache frequent queries with Redis; use streaming for perceived speed |
| Cost | Batch embedding generation; use Claude Haiku for simple queries, Sonnet/Opus for complex |
| Observability | LangSmith for trace logging; track retrieval scores, latency, token usage per request |
| Security | Sanitize inputs; implement document-level access control in metadata filters |
| Freshness | Incremental ingestion pipeline triggered by document changes (S3 events, webhook) |
| Scaling | Horizontal scaling of retrieval service; pgvector handles 10M+ vectors with HNSW |
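
As one example of the latency row, a thin cache in front of the chain short-circuits repeated questions. A minimal sketch, assuming redis-py, a local Redis instance, and exact-match keys (semantic caching is a further refinement):

import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_answer(question: str, ttl_seconds: int = 3600) -> str:
    # Exact-match key on the normalized question text
    key = "rag:" + hashlib.sha256(question.strip().lower().encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit
    answer = rag_chain.invoke(question).content
    cache.set(key, answer, ex=ttl_seconds)
    return answer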

Where to Go from Here

Production RAG is iterative. Start with the architecture above, measure with RAGAS, and improve. Common iteration paths: fine-tuning embedding models on domain data, experimenting with chunk sizes, adding query expansion, and implementing multi-step retrieval for complex questions.
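
Query expansion is the lowest-effort of these to try: LangChain's MultiQueryRetriever asks the LLM to rephrase the question several ways and merges the unique results. A sketch, assuming the hybrid_retriever and llm defined above:

from langchain.retrievers.multi_query import MultiQueryRetriever

# Claude generates multiple phrasings of the user question; retrieved chunks are deduplicated
expanded_retriever = MultiQueryRetriever.from_llm(
    retriever=hybrid_retriever,
    llm=llm,
)

docs = expanded_retriever.invoke("What are the breach notification deadlines?")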

Citadel Cloud Management's AI & ML Resources collection includes production RAG templates, LangChain project scaffolds, and evaluation datasets. Our courses cover LLM engineering from fundamentals through production deployment, including dedicated modules on RAG architecture, prompt engineering, and MLOps. Check the pricing page for access tiers — the free tier includes all 17 courses.

Ready to build production AI systems? Enroll free at Citadel Cloud Management and start building with LangChain and Claude today.


#RAG #LangChain #Claude #AI #LLM #VectorDatabase #pgvector #MachineLearning #AIEngineering #ProductionAI

Kehinde Ogunlowo

Senior Multi-Cloud DevSecOps Architect & AI Engineer

AWS, Azure, GCP Certified | Secret Clearance | FedRAMP, CMMC, HIPAA

Enterprise experience at Cigna Healthcare, Lockheed Martin, NantHealth, BP Refinery, and Patterson UTI.

Start Your Cloud Career Today

Access 17 free courses covering AWS, Azure, GCP, DevOps, AI/ML, and cloud security — built by a practicing Senior Cloud Architect with enterprise experience.

Get Free Cloud Career Resources
