    Technical Guide

    How to Build a RAG Pipeline
    The LLM Engineer's Practical Guide

    Priya Sharma

    Technical Roles Editor

    May 2, 2026
    10 min read

    Most RAG tutorials show you how to build a demo. This guide covers how to build a system that works in production — with reliable retrieval, acceptable latency, low hallucination rates, and a way to measure whether it's actually working.

    What Is RAG and Why Does It Matter?

    Retrieval-Augmented Generation solves a fundamental limitation of large language models: their knowledge is frozen at training time and doesn't include your proprietary data. RAG addresses this by retrieving relevant documents from an external knowledge base at inference time and providing them as context to the model.

    The result is a system that can answer questions about documents it was never trained on, cite sources for its answers, and stay up to date without model retraining. This makes RAG the dominant architecture for enterprise LLM applications: document Q&A systems, internal knowledge bases, customer support automation, and any system that needs to ground model outputs in specific facts.

    The Full Pipeline Architecture

    A production RAG system has two distinct phases: indexing (run once, or incrementally) and retrieval + generation (run on every query).

    Indexing phase:

    • Document ingestion: Load documents from their source (PDFs, databases, APIs, web pages). Extract text, handling format-specific challenges.
    • Chunking: Split documents into chunks appropriate for your embedding model and context window. This is more important than most tutorials suggest — see below.
    • Embedding: Convert each chunk to a dense vector using an embedding model.
    • Storage: Store vectors and metadata in a vector database with appropriate indices.

    Retrieval + generation phase:

    • Query embedding: Embed the user query using the same model used for document embedding.
    • Retrieval: Find the most semantically similar chunks using vector similarity search (typically cosine similarity or dot product).
    • Re-ranking: Optionally apply a cross-encoder model to re-rank retrieved results for better precision.
    • Context assembly: Combine retrieved chunks into a coherent context for the model.
    • Generation: Pass the query and context to the LLM and generate a response.
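
    To make the two phases concrete, here is a minimal sketch of the whole loop using sentence-transformers and a brute-force in-memory index. The model name, the toy chunks, and the final prompt are illustrative assumptions rather than a recommended stack; any embedding model and vector store slot into the same shape.

    ```python
    # Minimal RAG sketch: indexing, then retrieval + context assembly.
    # Model name, chunks, and prompt are illustrative, not a prescribed stack.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

    # --- Indexing phase (run once, or incrementally) ---
    chunks = [
        "RAG retrieves documents from a knowledge base at inference time.",
        "Chunk size and overlap strongly affect retrieval quality.",
        "Cross-encoders re-rank retrieved candidates for better precision.",
    ]
    # normalize_embeddings=True makes dot product equal to cosine similarity
    index = embedder.encode(chunks, normalize_embeddings=True)

    # --- Retrieval + generation phase (run on every query) ---
    def retrieve(query: str, top_k: int = 2) -> list[str]:
        q = embedder.encode([query], normalize_embeddings=True)[0]
        scores = index @ q                       # similarity to every chunk
        best = np.argsort(scores)[::-1][:top_k]  # indices of the top-k chunks
        return [chunks[i] for i in best]

    query = "Why does chunking matter?"
    context = "\n\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # `prompt` then goes to whichever LLM you use for generation.
    ```

    In production, the in-memory matrix is replaced by a vector database (see below) and the candidate set is usually re-ranked before context assembly, but the control flow stays the same.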

    Chunking Strategy: The Most Underrated Decision

    Poor chunking is the most common cause of RAG systems that don't work well. The goal is chunks that are semantically coherent (a single chunk should cover one idea or topic), appropriately sized for your embedding model, and overlapping enough with adjacent chunks that you don't lose context at boundaries.

    Fixed-size chunking (e.g., 512 tokens with a 50-token overlap) is simple to implement and a reasonable default, but it frequently splits text mid-sentence and mid-concept. For well-structured documents, semantic chunking (splitting at paragraph or section boundaries) produces much better retrieval quality. For documents with consistent structure (legal contracts, technical documentation), structure-aware chunking that preserves headers and section hierarchy is often best.

    A common mistake is using chunks that are too small (poor semantic density, high noise) or too large (retrieval returns too much irrelevant context and wastes the context window). For most use cases, 256–512 tokens per chunk with 10–15% overlap is a reasonable starting point.
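
    As a reference point, here is a minimal sketch of fixed-size chunking with overlap. Splitting on whitespace is a deliberate simplification; in a real pipeline you would count tokens with your embedding model's own tokenizer.

    ```python
    # Fixed-size chunking with overlap. Word splitting stands in for real
    # tokenization; swap in your embedding model's tokenizer for production.
    def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
        assert 0 <= overlap < chunk_size, "overlap must be smaller than chunk_size"
        words = text.split()
        step = chunk_size - overlap  # how far the window advances each time
        chunks = []
        for start in range(0, len(words), step):
            window = words[start:start + chunk_size]
            if window:
                chunks.append(" ".join(window))
            if start + chunk_size >= len(words):
                break  # this window already reaches the end of the document
        return chunks
    ```

    Semantic and structure-aware chunking follow the same contract (text in, list of chunks out), which makes it cheap to swap strategies and compare them with the evaluation setup described later.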

    Choosing a Vector Store

    Pinecone is fully managed and easy to scale, with good performance on approximate nearest neighbour search. Good default choice for teams that want to minimise infrastructure work. The cost can be meaningful at scale.

    Weaviate supports both vector and keyword search (BM25) natively, making hybrid search straightforward. Good if your use case benefits from combining semantic and keyword matching.

    Qdrant is open-source, self-hostable, and has excellent performance. Good for teams with data residency requirements or cost constraints.

    pgvector is a PostgreSQL extension that turns your existing database into a vector store. Excellent for teams already running PostgreSQL who don't want another service. Performance is limited at very large scale but sufficient for many use cases.
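
    If pgvector fits your constraints, the whole integration is a few SQL statements. A hedged sketch with psycopg follows; the connection string, table layout, and 384-dimension vectors (matching many sentence-transformers models) are assumptions for this example.

    ```python
    # pgvector sketch: schema, insert, and a cosine-similarity query via psycopg.
    # DSN, table name, and the 384-dim vector size are assumptions.
    import psycopg

    def to_vec(values: list[float]) -> str:
        # pgvector's text input format, e.g. "[0.1,0.2,0.3]"
        return "[" + ",".join(str(v) for v in values) + "]"

    with psycopg.connect("postgresql://localhost/ragdb") as conn:
        with conn.cursor() as cur:
            cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
            cur.execute("""
                CREATE TABLE IF NOT EXISTS chunks (
                    id bigserial PRIMARY KEY,
                    content text NOT NULL,
                    embedding vector(384)
                )
            """)
            emb = [0.1] * 384  # stand-in for a real embedding
            cur.execute(
                "INSERT INTO chunks (content, embedding) VALUES (%s, %s::vector)",
                ("example chunk", to_vec(emb)),
            )
            # <=> is pgvector's cosine-distance operator (lower = more similar)
            cur.execute(
                "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5",
                (to_vec(emb),),
            )
            print(cur.fetchall())
    ```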

    Common RAG Failure Modes

    • Poor retrieval recall: Relevant chunks aren't retrieved. Fix: improve chunking, add hybrid search (see the fusion sketch after this list), tune top-k.
    • Context pollution: Retrieved chunks contain lots of irrelevant content. Fix: stricter similarity threshold, re-ranking.
    • Context stuffing: Too many chunks overwhelm the model and dilute the relevant signal. Fix: reduce top-k, apply re-ranking more aggressively.
    • Stale index: Documents updated but index not refreshed. Fix: implement incremental indexing with change detection.
    • Embedding mismatch: Query and document embedded with different models. Fix: always use the same model for both.
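
    Several of these fixes come down to merging or re-scoring candidate lists. As one concrete illustration of the hybrid-search fix, here is a hedged sketch of reciprocal rank fusion (RRF), a common way to combine a vector ranking with a BM25 ranking without tuning score weights; the constant k=60 is the conventional choice.

    ```python
    # Reciprocal rank fusion (RRF): merge ranked lists of chunk IDs.
    # A common implementation of the "hybrid search" fix; k=60 is conventional.
    def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
        scores: dict[str, float] = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking):
                # 1 / (k + rank) with 1-based ranks; high ranks dominate
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    vector_hits = ["c3", "c1", "c7"]  # from cosine-similarity search
    bm25_hits = ["c1", "c9", "c3"]    # from keyword (BM25) search
    fused = rrf([vector_hits, bm25_hits])  # ["c1", "c3", ...]
    ```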

    Re-ranking for Better Precision

    A bi-encoder embedding model (used for initial retrieval) is fast but approximate — it encodes query and document independently. A cross-encoder re-ranker processes query and document together, producing better relevance scores at higher compute cost. The typical approach is to retrieve a larger candidate set (top-20 or top-50) with the fast bi-encoder, then re-rank with a cross-encoder and keep only the top 3–5 for the model context.

    Cross-encoder models trained on MS MARCO (published on Hugging Face as the ms-marco family) work well for English text re-ranking. For production deployments, Cohere's Rerank API provides a managed re-ranking option.
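
    A minimal sketch of that retrieve-wide, re-rank-narrow pattern with the sentence-transformers CrossEncoder class; the specific ms-marco checkpoint is one reasonable choice from that family, not the only option.

    ```python
    # Retrieve wide with a bi-encoder, then re-rank narrow with a cross-encoder.
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
        # The cross-encoder scores each (query, document) pair jointly,
        # which is slower but more precise than independent embeddings.
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
        return [doc for doc, _ in ranked[:keep]]

    # candidates = retrieve(query, top_k=50)  # wide recall from the bi-encoder
    # context_chunks = rerank(query, candidates, keep=5)
    ```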

    Evaluating RAG Quality

    The RAGAS framework provides automated evaluation metrics for RAG systems:

    • Context precision: Are the retrieved chunks actually relevant to the question?
    • Context recall: Did retrieval find all the relevant information available?
    • Answer faithfulness: Does the generated answer stick to the retrieved context, or hallucinate?
    • Answer relevance: Does the answer actually address what was asked?

    Build a golden dataset of 50–100 representative question-answer pairs and run RAGAS evaluations regularly as you make changes. This is the only reliable way to know if a change to your chunking strategy or retrieval parameters is actually an improvement.
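
    A hedged sketch of running those four metrics with RAGAS, assuming the framework's classic evaluate() interface (newer releases have reorganised the API) and an API key for the LLM judge it calls under the hood:

    ```python
    # Evaluate a RAG system with RAGAS over a (tiny) golden dataset.
    # Assumes the classic evaluate() API; RAGAS uses an LLM judge internally,
    # so credentials for the configured judge model must be available.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy,
        context_precision,
        context_recall,
        faithfulness,
    )

    golden = Dataset.from_dict({
        "question": ["What does RAG retrieve?"],
        "answer": ["RAG retrieves relevant chunks at inference time."],   # system output
        "contexts": [["RAG retrieves documents at inference time."]],     # retrieved chunks
        "ground_truth": ["Relevant documents from an external knowledge base."],
    })

    result = evaluate(
        golden,
        metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
    )
    print(result)  # per-metric scores; track these across pipeline changes
    ```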

    Frequently Asked Questions

    Do I need to build RAG from scratch?

    No, and for most teams you shouldn't at first. For prototyping, use LangChain or LlamaIndex to reduce boilerplate. For production, many teams progressively replace framework components with custom implementations as requirements become clearer.
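
    For a sense of how little boilerplate the frameworks require, a minimal LlamaIndex prototype might look like this (assuming the post-0.10 llama_index.core package layout and the default OpenAI-backed configuration):

    ```python
    # Minimal RAG prototype with LlamaIndex. Assumes llama_index.core (0.10+)
    # and default settings (an OpenAI key in the environment for embeddings/LLM).
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    documents = SimpleDirectoryReader("data").load_data()  # ingest a folder of files
    index = VectorStoreIndex.from_documents(documents)     # chunk, embed, store
    query_engine = index.as_query_engine()                 # retrieval + generation
    print(query_engine.query("What is our refund policy?"))
    ```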

    When is RAG better than fine-tuning?

    RAG is better when your knowledge base changes frequently, you need source attribution, or you're working with confidential documents. Fine-tuning is better when you need to change model behaviour or style, or when latency is critical.

    How do I handle PDFs and structured data?

    For PDFs: use PyMuPDF or pdfplumber for text extraction, with special handling for tables and multi-column layouts. For structured data: consider generating natural language descriptions of records rather than naive text chunking.
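
    A minimal PyMuPDF extraction sketch (the file name is a placeholder, and tables or multi-column layouts still need the extra handling noted above):

    ```python
    # Basic text extraction with PyMuPDF. Plain get_text() flattens complex
    # layouts; tables and multi-column pages need dedicated handling.
    import fitz  # PyMuPDF

    with fitz.open("report.pdf") as doc:
        pages = [page.get_text() for page in doc]
    text = "\n\n".join(pages)
    ```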

    What's the cheapest RAG stack?

    An open-source embedding model (sentence-transformers) + pgvector (a free PostgreSQL extension) + a cheap model for generation: a low-cost hosted LLM such as GPT-3.5-turbo, or a local model served through Ollama. Costs almost nothing beyond compute.

    How do I know if my RAG is working?

    Use RAGAS metrics (context precision, recall, answer faithfulness, answer relevance) on a golden dataset of representative Q&A pairs. Run evaluations regularly as you make changes.

    About the Author

    Priya Sharma

    Technical Roles Editor @ ObiTech

    Priya covers ML engineering, LLM deployment, and technical career paths in UK AI.
