Building Production RAG Pipelines with LangChain and Claude in 2026
Retrieval Augmented Generation has moved from research curiosity to production requirement. Every enterprise deploying LLM-powered applications — customer support, internal knowledge search, document analysis, compliance Q&A — needs a RAG pipeline. The concept is straightforward: retrieve relevant context from your data, inject it into the LLM prompt, and generate grounded responses. The implementation details are where teams spend months iterating.
This guide covers building a production-grade RAG pipeline using LangChain 0.3, Anthropic's Claude Sonnet 4 (with Opus for complex reasoning), pgvector for embeddings storage, and Cohere Rerank for retrieval quality. The architecture handles documents from 100 pages to 500,000+ pages with consistent retrieval accuracy.
[IMAGE: RAG pipeline architecture diagram showing document ingestion (PDF, Markdown, HTML), chunking with overlap, embedding generation via voyage-3, storage in pgvector, retrieval with hybrid search, Cohere reranking, and Claude generation with citation tracking]
Architecture Overview
A production RAG pipeline has five stages, and each one matters:
- Document Ingestion — Parse source documents into clean text
- Chunking — Split text into retrieval-optimized segments
- Embedding & Indexing — Generate vector embeddings and store them
- Retrieval & Reranking — Find and score relevant chunks for a query
- Generation — Synthesize an answer using retrieved context + LLM
The naive approach — split documents into 500-character chunks, embed with text-embedding-ada-002, retrieve top-3, and send to GPT — works for demos. It fails in production because retrieval precision drops below 60% on domain-specific queries, chunk boundaries split critical information, and there is no mechanism to evaluate or improve quality.
Stage 1: Document Ingestion
Parser Selection by Document Type
| Document Type | Parser | Why |
|---|---|---|
| PDF (text) | PyMuPDF (v1.24) | Fastest, preserves layout structure |
| PDF (scanned/image) | Amazon Textract or Azure Document Intelligence | OCR with table/form extraction |
| Markdown | langchain_text_splitters.MarkdownHeaderTextSplitter | Preserves header hierarchy as metadata |
| HTML | BeautifulSoup4 + langchain_community.document_loaders.BSHTMLLoader | Strips boilerplate, extracts content |
| DOCX | python-docx via langchain_community.document_loaders.Docx2txtLoader | Handles formatting, tables |
| Confluence/Notion | Native API loaders in langchain_community | Preserves page structure, links |
```python
from langchain_community.document_loaders import PyMuPDFLoader

# PDF ingestion with metadata preservation
loader = PyMuPDFLoader("compliance_handbook.pdf")
raw_docs = loader.load()

# Each document carries metadata: page number, source file, section headers
for doc in raw_docs:
    doc.metadata["source_type"] = "pdf"
    doc.metadata["ingestion_date"] = "2026-04-03"
    doc.metadata["document_category"] = "compliance"
```
Metadata matters. In production, users need to know where an answer came from — page number, document title, section heading. Attach metadata during ingestion, not after.
Stage 2: Chunking Strategy
Chunking is the most underestimated stage. Poor chunking destroys retrieval quality regardless of how good your embedding model or LLM is.
Recursive Character Splitting with Semantic Awareness
LangChain 0.3's RecursiveCharacterTextSplitter remains the workhorse, but the parameters require tuning:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,    # Characters, not tokens
    chunk_overlap=200,  # 15-20% overlap prevents information loss at boundaries
    separators=[
        "\n## ",   # H2 headers (highest priority split point)
        "\n### ",  # H3 headers
        "\n\n",    # Paragraph breaks
        "\n",      # Line breaks
        ". ",      # Sentence boundaries
        " ",       # Word boundaries (last resort)
    ],
    length_function=len,
    is_separator_regex=False,
)

chunks = splitter.split_documents(raw_docs)
```
Why 1,200 characters? Voyage AI's voyage-3 embedding model (our recommended choice) handles up to 32K tokens, but retrieval precision peaks with chunks of 200-400 tokens (~800-1,600 characters). Chunks that are too small lose context; chunks that are too large dilute the relevant signal.
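To make the splitting behavior concrete, here is a stdlib-only sketch (not LangChain's implementation) of the same greedy idea: pack paragraphs until the chunk budget is spent, then seed the next chunk with the last `overlap` characters. One deliberate simplification: a single paragraph longer than `chunk_size` is kept whole here, whereas the real splitter falls back to finer separators.

```python
def split_text(text: str, chunk_size: int = 1200, overlap: int = 200) -> list[str]:
    """Greedy paragraph-first splitter: accumulate paragraphs up to chunk_size,
    then carry the trailing `overlap` characters into the next chunk so
    information at a boundary appears in both neighbors."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = current[-overlap:] + "\n\n" + para
        else:
            # Oversized single paragraph: kept whole in this sketch
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Running this over ten 300-character paragraphs yields chunks that stay under the budget, with each chunk opening on the previous chunk's last 200 characters.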
Parent-Child Chunking for Long Documents
For documents exceeding 100 pages, use a two-level chunking strategy:
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Parent chunks: large context windows for generation
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=400)

# Child chunks: small, precise for retrieval matching
child_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)

store = InMemoryStore()  # Use Redis or PostgreSQL in production

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,  # the PGVector store configured in Stage 3
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```
The child chunk matches the query with high precision. The parent chunk provides the LLM with surrounding context needed for a complete answer.
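The mechanics are easy to see in a stdlib-only sketch (hypothetical helper names, with substring matching standing in for embedding similarity): children are indexed alongside their parent's id, and retrieval returns the parent.

```python
import uuid

def build_parent_child_index(documents, parent_size=4000, child_size=800):
    """Split each document into parent chunks, then into children that
    remember their parent's id."""
    docstore = {}     # parent_id -> parent chunk (what the LLM sees)
    child_index = []  # (child text, parent_id) pairs (what retrieval matches)
    for doc in documents:
        for i in range(0, len(doc), parent_size):
            parent = doc[i:i + parent_size]
            parent_id = str(uuid.uuid4())
            docstore[parent_id] = parent
            for j in range(0, len(parent), child_size):
                child_index.append((parent[j:j + child_size], parent_id))
    return docstore, child_index

def retrieve_parent(query_term, docstore, child_index):
    # Toy matcher: substring search stands in for vector similarity.
    # A real retriever scores children by embedding distance, then swaps
    # in the matching child's parent as generation context.
    for child_text, parent_id in child_index:
        if query_term in child_text:
            return docstore[parent_id]
    return None
```

A query that matches one 800-character child comes back with its full 4,000-character parent attached.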
Stage 3: Embedding and Indexing
Embedding Model Selection (April 2026)
| Model | Dimensions | Max Tokens | Cost (per 1M tokens) | MTEB Score |
|---|---|---|---|---|
| voyage-3 (Voyage AI) | 1024 | 32K | $0.06 (input) | 67.3 |
| text-embedding-3-large (OpenAI) | 3072 | 8K | $0.13 | 64.6 |
| Cohere embed-v4 | 1024 | 512 | $0.10 | 66.1 |
| BGE-M3 (open source) | 1024 | 8K | Self-hosted | 65.8 |
For Claude pipelines, voyage-3 is the recommended pairing: Anthropic points to Voyage AI as its preferred embedding provider, and voyage-3 leads the table above on both retrieval quality and cost.
pgvector on PostgreSQL 16
pgvector (v0.7+) running on PostgreSQL 16 is the production-grade vector store choice for teams already running PostgreSQL. No additional infrastructure to manage.
```sql
-- Enable extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create embeddings table
CREATE TABLE document_embeddings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    metadata JSONB NOT NULL DEFAULT '{}',
    embedding vector(1024) NOT NULL,  -- voyage-3 dimensions
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index for approximate nearest neighbor search
CREATE INDEX ON document_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 128);

-- GIN index on metadata for filtered retrieval
CREATE INDEX ON document_embeddings
USING gin (metadata jsonb_path_ops);
```
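For intuition, the quantity the HNSW index approximates nearest neighbors under (`vector_cosine_ops`, pgvector's `<=>` operator) is plain cosine distance, which in stdlib Python is:

```python
import math

def cosine_distance(a, b):
    """Cosine distance, as computed by pgvector's <=> operator:
    1 - (a . b) / (|a| * |b|). Range: 0 (identical direction) to 2 (opposite)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Identical directions score 0, orthogonal vectors 1, opposite directions 2, which is why `ORDER BY embedding <=> query ASC` returns the most similar chunks first.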
LangChain integration:
```python
import os

from langchain_postgres import PGVector
from langchain_voyageai import VoyageAIEmbeddings

embeddings = VoyageAIEmbeddings(
    model="voyage-3",
    voyage_api_key=os.environ["VOYAGE_API_KEY"],
)

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="compliance_docs",
    connection=os.environ["DATABASE_URL"],  # postgresql://...
    use_jsonb=True,
)

# Index documents
vectorstore.add_documents(chunks)
```
Hybrid Search: Vector + Full-Text
Pure vector similarity misses exact keyword matches. Hybrid search combines semantic similarity with BM25 keyword matching:
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(chunks, k=10)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7],  # 30% keyword, 70% semantic, matching the retriever order
)
```
Stage 4: Retrieval and Reranking
Cohere Rerank v3
Initial retrieval (top-20 from hybrid search) casts a wide net. Reranking with Cohere Rerank v3 narrows to the most relevant chunks:
```python
import os

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

reranker = CohereRerank(
    model="rerank-english-v3.0",
    cohere_api_key=os.environ["COHERE_API_KEY"],
    top_n=5,
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever,
)

# Retrieve + rerank
results = compression_retriever.invoke("What are the data retention requirements for HIPAA?")
```
Reranking consistently improves retrieval precision by 15-25% compared to raw vector similarity, because cross-encoder models evaluate query-document relevance jointly rather than comparing independent embeddings.
[IMAGE: Side-by-side comparison showing retrieval results before and after Cohere reranking, with relevance scores and highlighting showing how reranking promotes the most contextually relevant chunks to the top positions]
Stage 5: Generation with Claude
Prompt Architecture
The generation prompt must instruct Claude to use only the retrieved context, cite sources, and acknowledge when context is insufficient:
```python
import os

from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

llm = ChatAnthropic(
    model="claude-sonnet-4-20250514",
    temperature=0,
    max_tokens=4096,
    anthropic_api_key=os.environ["ANTHROPIC_API_KEY"],
)

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a compliance knowledge assistant for an enterprise organization.

Answer questions using ONLY the provided context documents. For each claim in your answer,
cite the source using [Source: document_name, page X] format.

If the context does not contain sufficient information to answer the question,
say "I don't have enough information in the available documents to answer this question"
and suggest what additional documents might help.

Do not use prior knowledge. Do not speculate. Be precise and specific."""),
    ("human", """Context documents:
{context}

Question: {question}

Answer with citations:"""),
])

chain = prompt | llm
```
Streaming with Citation Tracking
For production applications, stream responses and track which chunks were used:
```python
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    """Render retrieved chunks with source metadata so Claude can cite them."""
    return "\n\n".join(
        f"[{d.metadata.get('source', 'unknown')}, page {d.metadata.get('page', '?')}]\n{d.page_content}"
        for d in docs
    )

rag_chain = (
    {
        "context": compression_retriever | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
)

# Stream the response (run inside an async function)
async for chunk in rag_chain.astream("What are HIPAA data retention requirements?"):
    print(chunk.content, end="", flush=True)
```
Evaluation: Measuring RAG Quality
Without evaluation, you are guessing. Four metrics matter:
RAGAS Framework
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# eval_dataset: curated question/ground-truth/context records
# evaluation_llm: a judge model wrapped for RAGAS
eval_results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=evaluation_llm,
    embeddings=embeddings,
)

print(eval_results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.79}
```
| Metric | Target | What It Measures |
|---|---|---|
| Faithfulness | >0.90 | Does the answer only use information from retrieved context? |
| Answer Relevancy | >0.85 | Is the answer actually relevant to the question? |
| Context Precision | >0.80 | Are the retrieved chunks relevant to the question? |
| Context Recall | >0.75 | Did retrieval find all the relevant information? |
Run evaluations on a curated set of 50-100 question-answer pairs covering your domain. Automate with CI/CD — every change to chunking strategy, embedding model, or prompt template triggers a RAGAS evaluation.
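The CI gate itself can be as simple as comparing each RAGAS score against the targets in the table above; a hypothetical sketch:

```python
# Targets from the metrics table
THRESHOLDS = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.75,
}

def gate_release(metrics: dict, thresholds: dict = THRESHOLDS) -> dict:
    """Return the metrics that fell below target; an empty dict means the
    chunking/embedding/prompt change is safe to ship."""
    return {
        name: value
        for name, value in metrics.items()
        if value < thresholds.get(name, 0.0)
    }
```

Wire it into the pipeline so a regression in any metric fails the build rather than reaching production.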
Production Deployment Checklist
| Concern | Solution |
|---|---|
| Latency | Cache frequent queries with Redis; use streaming for perceived speed |
| Cost | Batch embedding generation; use Claude Haiku for simple queries, Sonnet/Opus for complex |
| Observability | LangSmith for trace logging; track retrieval scores, latency, token usage per request |
| Security | Sanitize inputs; implement document-level access control in metadata filters |
| Freshness | Incremental ingestion pipeline triggered by document changes (S3 events, webhook) |
| Scaling | Horizontal scaling of retrieval service; pgvector handles 10M+ vectors with HNSW |
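As an illustration of the latency row, here is a minimal in-process stand-in for the Redis cache (a hypothetical class): answers are keyed on a hash of the normalized query, so near-duplicate queries hit the same entry, and entries expire after a TTL.

```python
import hashlib
import time

class QueryCache:
    """TTL cache for RAG answers keyed on a normalized query hash."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[str, float]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase + collapse whitespace so trivial variants share an entry
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry is None:
            return None
        answer, expires_at = entry
        if time.monotonic() > expires_at:
            return None  # stale entry; caller re-runs the pipeline
        return answer

    def set(self, query: str, answer: str) -> None:
        self._store[self._key(query)] = (answer, time.monotonic() + self.ttl)
```

In production, swap the dict for Redis with `SETEX` and keep the same key scheme; invalidate on document re-ingestion so cached answers never outlive their sources.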
Where to Go from Here
Production RAG is iterative. Start with the architecture above, measure with RAGAS, and improve. Common iteration paths: fine-tuning embedding models on domain data, experimenting with chunk sizes, adding query expansion, and implementing multi-step retrieval for complex questions.
Citadel Cloud Management's AI & ML Resources collection includes production RAG templates, LangChain project scaffolds, and evaluation datasets. Our courses cover LLM engineering from fundamentals through production deployment, including dedicated modules on RAG architecture, prompt engineering, and MLOps. Check the pricing page for access tiers — the free tier includes all 17 courses.
Ready to build production AI systems? Enroll free at Citadel Cloud Management and start building with LangChain and Claude today.
#RAG #LangChain #Claude #AI #LLM #VectorDatabase #pgvector #MachineLearning #AIEngineering #ProductionAI