AI Agents April 10, 2025 11 min read

Building a Production RAG System with LangChain and Pinecone

Architecture and code for a production RAG system — chunking strategies, embedding models, hybrid search, reranking, and hallucination mitigation.

Architecture

Documents → Chunker → Embedder → Pinecone
                                    ↓
User Query → Embedder → Pinecone Search → Reranker → LLM → Answer

Chunking Strategy

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=['\n\n', '\n', '. ', ' '],
)

Hybrid Search (BM25 + Dense)

BM25 for keyword match + cosine similarity for semantic — combine with RRF:

rrf_score = sum(1/(k + rank_i) for rank_i in [bm25_rank, dense_rank])

Hallucination Mitigation

Cite source chunks in the answer
Set temperature=0 for factual queries
Check if answer tokens overlap with context (RAGAS faithfulness metric)

RAGLangChainPineconeLLMProduction

Ossama Elhakki

AI Engineer & ML Systems Builder — Morocco

About me →Contact →