    Technical Guide

    How to Build a RAG Pipeline
    The LLM Engineer's Practical Guide

    Priya Sharma

    Technical Roles Editor

    May 2, 2026
    10 min read

    Most RAG tutorials show you how to build a demo. This guide covers how to build a system that works in production — with reliable retrieval, acceptable latency, low hallucination rates, and a way to measure whether it's actually working.

    What Is RAG and Why Does It Matter?

    Retrieval-Augmented Generation solves a fundamental limitation of large language models: their knowledge is frozen at training time and doesn't include your proprietary data. RAG addresses this by retrieving relevant documents from an external knowledge base at inference time and providing them as context to the model.

    The result is a system that can answer questions about documents it was never trained on, cite sources for its answers, and stay up to date without model retraining. This makes RAG the dominant architecture for enterprise LLM applications: document Q&A systems, internal knowledge bases, customer support automation, and any system that needs to ground model outputs in specific facts.

    The Full Pipeline Architecture

    A production RAG system has two distinct phases: indexing (run once, or incrementally) and retrieval + generation (run on every query).

    Indexing phase:

    • Document ingestion: Load documents from their source (PDFs, databases, APIs, web pages). Extract text, handling format-specific challenges.
    • Chunking: Split documents into chunks appropriate for your embedding model and context window. This is more important than most tutorials suggest — see below.
    • Embedding: Convert each chunk to a dense vector using an embedding model.
    • Storage: Store vectors and metadata in a vector database with appropriate indices.

    Retrieval + generation phase:

    • Query embedding: Embed the user query using the same model used for document embedding.
    • Retrieval: Find the most semantically similar chunks using vector similarity search (typically cosine similarity or dot product).
    • Re-ranking: Optionally apply a cross-encoder model to re-rank retrieved results for better precision.
    • Context assembly: Combine retrieved chunks into a coherent context for the model.
    • Generation: Pass the query and context to the LLM and generate a response.
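
    To make the two phases concrete, here is a minimal sketch of the whole loop using sentence-transformers and a brute-force in-memory index. The model name, the toy chunks, and the final prompt are illustrative assumptions rather than a recommended stack; any embedding model and vector store slot into the same shape.

    ```python
    # Minimal RAG sketch: indexing, then retrieval + context assembly.
    # Model name, chunks, and prompt are illustrative, not a prescribed stack.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

    # --- Indexing phase (run once, or incrementally) ---
    chunks = [
        "RAG retrieves documents from a knowledge base at inference time.",
        "Chunk size and overlap strongly affect retrieval quality.",
        "Cross-encoders re-rank retrieved candidates for better precision.",
    ]
    # normalize_embeddings=True makes dot product equal to cosine similarity
    index = embedder.encode(chunks, normalize_embeddings=True)

    # --- Retrieval + generation phase (run on every query) ---
    def retrieve(query: str, top_k: int = 2) -> list[str]:
        q = embedder.encode([query], normalize_embeddings=True)[0]
        scores = index @ q                       # similarity to every chunk
        best = np.argsort(scores)[::-1][:top_k]  # indices of the top-k chunks
        return [chunks[i] for i in best]

    query = "Why does chunking matter?"
    context = "\n\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # `prompt` then goes to whichever LLM you use for generation.
    ```

    In production, the in-memory matrix is replaced by a vector database (see below) and the candidate set is usually re-ranked before context assembly, but the control flow stays the same.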

    Chunking Strategy: The Most Underrated Decision

    Poor chunking is the most common cause of RAG systems that don't work well. The goal is chunks that are semantically coherent (a single chunk should cover one idea or topic), appropriately sized for your embedding model, and overlapping enough with adjacent chunks that you don't lose context at boundaries.

    Fixed-size chunking (e.g., 512 tokens with a 50-token overlap) is simple to implement and a reasonable default, but it frequently splits text mid-sentence and mid-concept. For well-structured documents, semantic chunking (splitting at paragraph or section boundaries) produces much better retrieval quality. For documents with consistent structure (legal contracts, technical documentation), structure-aware chunking that preserves headers and section hierarchy is often best.

    A common mistake is using chunks that are too small (poor semantic density, high noise) or too large (retrieval returns too much irrelevant context and wastes the context window). For most use cases, 256–512 tokens per chunk with 10–15% overlap is a reasonable starting point.
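
    As a reference point, here is a minimal sketch of fixed-size chunking with overlap. Splitting on whitespace is a deliberate simplification; in a real pipeline you would count tokens with your embedding model's own tokenizer.

    ```python
    # Fixed-size chunking with overlap. Word splitting stands in for real
    # tokenization; swap in your embedding model's tokenizer for production.
    def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
        assert 0 <= overlap < chunk_size, "overlap must be smaller than chunk_size"
        words = text.split()
        step = chunk_size - overlap  # how far the window advances each time
        chunks = []
        for start in range(0, len(words), step):
            window = words[start:start + chunk_size]
            if window:
                chunks.append(" ".join(window))
            if start + chunk_size >= len(words):
                break  # this window already reaches the end of the document
        return chunks
    ```

    Semantic and structure-aware chunking follow the same contract (text in, list of chunks out), which makes it cheap to swap strategies and compare them with the evaluation setup described later.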

    Choosing a Vector Store

    Pinecone is fully managed and easy to scale, with good performance on approximate nearest neighbour search. Good default choice for teams that want to minimise infrastructure work. The cost can be meaningful at scale.

    Weaviate supports both vector and keyword search (BM25) natively, making hybrid search straightforward. Good if your use case benefits from combining semantic and keyword matching.

    Qdrant is open-source, self-hostable, and has excellent performance. Good for teams with data residency requirements or cost constraints.

    pgvector is a PostgreSQL extension that turns your existing database into a vector store. Excellent for teams already running PostgreSQL who don't want another service. Performance is limited at very large scale but sufficient for many use cases.
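
    If pgvector fits your constraints, the whole integration is a few SQL statements. A hedged sketch with psycopg follows; the connection string, table layout, and 384-dimension vectors (matching many sentence-transformers models) are assumptions for this example.

    ```python
    # pgvector sketch: schema, insert, and a cosine-similarity query via psycopg.
    # DSN, table name, and the 384-dim vector size are assumptions.
    import psycopg

    def to_vec(values: list[float]) -> str:
        # pgvector's text input format, e.g. "[0.1,0.2,0.3]"
        return "[" + ",".join(str(v) for v in values) + "]"

    with psycopg.connect("postgresql://localhost/ragdb") as conn:
        with conn.cursor() as cur:
            cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
            cur.execute("""
                CREATE TABLE IF NOT EXISTS chunks (
                    id bigserial PRIMARY KEY,
                    content text NOT NULL,
                    embedding vector(384)
                )
            """)
            emb = [0.1] * 384  # stand-in for a real embedding
            cur.execute(
                "INSERT INTO chunks (content, embedding) VALUES (%s, %s::vector)",
                ("example chunk", to_vec(emb)),
            )
            # <=> is pgvector's cosine-distance operator (lower = more similar)
            cur.execute(
                "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5",
                (to_vec(emb),),
            )
            print(cur.fetchall())
    ```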

    Common RAG Failure Modes

    • Poor retrieval recall: Relevant chunks aren't retrieved. Fix: improve chunking, add hybrid search (see the fusion sketch after this list), tune top-k.
    • Context pollution: Retrieved chunks contain lots of irrelevant content. Fix: stricter similarity threshold, re-ranking.
    • Context stuffing: Too many chunks overwhelm the model and dilute the relevant signal. Fix: reduce top-k, apply re-ranking more aggressively.
    • Stale index: Documents updated but index not refreshed. Fix: implement incremental indexing with change detection.
    • Embedding mismatch: Query and document embedded with different models. Fix: always use the same model for both.
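
    Several of these fixes come down to merging or re-scoring candidate lists. As one concrete illustration of the hybrid-search fix, here is a hedged sketch of reciprocal rank fusion (RRF), a common way to combine a vector ranking with a BM25 ranking without tuning score weights; the constant k=60 is the conventional choice.

    ```python
    # Reciprocal rank fusion (RRF): merge ranked lists of chunk IDs.
    # A common implementation of the "hybrid search" fix; k=60 is conventional.
    def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
        scores: dict[str, float] = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking):
                # 1 / (k + rank) with 1-based ranks; high ranks dominate
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    vector_hits = ["c3", "c1", "c7"]  # from cosine-similarity search
    bm25_hits = ["c1", "c9", "c3"]    # from keyword (BM25) search
    fused = rrf([vector_hits, bm25_hits])  # ["c1", "c3", ...]
    ```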

    Re-ranking for Better Precision

    A bi-encoder embedding model (used for initial retrieval) is fast but approximate — it encodes query and document independently. A cross-encoder re-ranker processes query and document together, producing better relevance scores at higher compute cost. The typical approach is to retrieve a larger candidate set (top-20 or top-50) with the fast bi-encoder, then re-rank with a cross-encoder and keep only the top 3–5 for the model context.

    Cross-encoder models trained on MS MARCO (published on Hugging Face as the ms-marco family) work well for English text re-ranking. For production deployments, Cohere's Rerank API provides a managed re-ranking option.
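
    A minimal sketch of that retrieve-wide, re-rank-narrow pattern with the sentence-transformers CrossEncoder class; the specific ms-marco checkpoint is one reasonable choice from that family, not the only option.

    ```python
    # Retrieve wide with a bi-encoder, then re-rank narrow with a cross-encoder.
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
        # The cross-encoder scores each (query, document) pair jointly,
        # which is slower but more precise than independent embeddings.
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
        return [doc for doc, _ in ranked[:keep]]

    # candidates = retrieve(query, top_k=50)  # wide recall from the bi-encoder
    # context_chunks = rerank(query, candidates, keep=5)
    ```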

    Evaluating RAG Quality

    The RAGAS framework provides automated evaluation metrics for RAG systems:

    • Context precision: Are the retrieved chunks actually relevant to the question?
    • Context recall: Did retrieval find all the relevant information available?
    • Answer faithfulness: Does the generated answer stick to the retrieved context, or hallucinate?
    • Answer relevance: Does the answer actually address what was asked?

    Build a golden dataset of 50–100 representative question-answer pairs and run RAGAS evaluations regularly as you make changes. This is the only reliable way to know if a change to your chunking strategy or retrieval parameters is actually an improvement.
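
    A hedged sketch of running those four metrics with RAGAS, assuming the framework's classic evaluate() interface (newer releases have reorganised the API) and an API key for the LLM judge it calls under the hood:

    ```python
    # Evaluate a RAG system with RAGAS over a (tiny) golden dataset.
    # Assumes the classic evaluate() API; RAGAS uses an LLM judge internally,
    # so credentials for the configured judge model must be available.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy,
        context_precision,
        context_recall,
        faithfulness,
    )

    golden = Dataset.from_dict({
        "question": ["What does RAG retrieve?"],
        "answer": ["RAG retrieves relevant chunks at inference time."],   # system output
        "contexts": [["RAG retrieves documents at inference time."]],     # retrieved chunks
        "ground_truth": ["Relevant documents from an external knowledge base."],
    })

    result = evaluate(
        golden,
        metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
    )
    print(result)  # per-metric scores; track these across pipeline changes
    ```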

    Frequently Asked Questions

    Do I need to build RAG from scratch?

    No, and for most teams you shouldn't at first. For prototyping, use LangChain or LlamaIndex to reduce boilerplate. For production, many teams progressively replace framework components with custom implementations as requirements become clearer.
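
    For a sense of how little boilerplate the frameworks require, a minimal LlamaIndex prototype might look like this (assuming the post-0.10 llama_index.core package layout and the default OpenAI-backed configuration):

    ```python
    # Minimal RAG prototype with LlamaIndex. Assumes llama_index.core (0.10+)
    # and default settings (an OpenAI key in the environment for embeddings/LLM).
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    documents = SimpleDirectoryReader("data").load_data()  # ingest a folder of files
    index = VectorStoreIndex.from_documents(documents)     # chunk, embed, store
    query_engine = index.as_query_engine()                 # retrieval + generation
    print(query_engine.query("What is our refund policy?"))
    ```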

    When is RAG better than fine-tuning?

    RAG is better when your knowledge base changes frequently, you need source attribution, or you're working with confidential documents. Fine-tuning is better when you need to change model behaviour or style, or when latency is critical.

    How do I handle PDFs and structured data?

    For PDFs: use PyMuPDF or pdfplumber for text extraction, with special handling for tables and multi-column layouts. For structured data: consider generating natural language descriptions of records rather than naive text chunking.
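
    A minimal PyMuPDF extraction sketch (the file name is a placeholder, and tables or multi-column layouts still need the extra handling noted above):

    ```python
    # Basic text extraction with PyMuPDF. Plain get_text() flattens complex
    # layouts; tables and multi-column pages need dedicated handling.
    import fitz  # PyMuPDF

    with fitz.open("report.pdf") as doc:
        pages = [page.get_text() for page in doc]
    text = "\n\n".join(pages)
    ```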

    What's the cheapest RAG stack?

    An open-source embedding model (sentence-transformers) + pgvector (a free PostgreSQL extension) + a cheap model for generation: a low-cost hosted LLM such as GPT-3.5-turbo, or a local model served through Ollama. Costs almost nothing beyond compute.

    How do I know if my RAG is working?

    Use RAGAS metrics (context precision, recall, answer faithfulness, answer relevance) on a golden dataset of representative Q&A pairs. Run evaluations regularly as you make changes.

    About the Author

    Priya Sharma

    Technical Roles Editor @ ObiTech

    Priya covers ML engineering, LLM deployment, and technical career paths in UK AI.
