Retrieval-Augmented Generation
The 2026 Skills Guide
RAG has become the default architecture for building AI applications that need to work with private or up-to-date information. This guide covers the full stack — from chunking and embeddings to advanced retrieval patterns, reranking, and production evaluation.
Why RAG Became the Default Architecture
Two fundamental limitations of LLMs make RAG necessary for most enterprise use cases. First, knowledge cutoffs: every LLM's training data has a cutoff date, meaning recent events, updated documentation, or new product information are outside its knowledge. Second, hallucination: LLMs generate the most statistically likely next token, not necessarily the most factually accurate one. For tasks requiring reliable facts, this is unacceptable.
RAG addresses both by retrieving relevant information at inference time and providing it directly in the prompt as ground-truth context. The LLM's role shifts from knowledge retrieval to reasoning and synthesis over provided information — a task LLMs do well.
When RAG is the right choice:
- Answering questions over private document collections (internal wikis, contracts, technical docs)
- Customer support over product documentation that changes frequently
- Research assistance requiring citation and source attribution
- Any task where factual accuracy is non-negotiable and data changes over time
When RAG is not the right choice: General reasoning, creative generation, code generation (where the model's parametric knowledge is sufficient), or tasks where document retrieval latency is unacceptable. For these, prompting or fine-tuning is more appropriate.
The RAG Pipeline: Step by Step
1. Ingestion: Load raw documents from PDFs (PyMuPDF), DOCX, web pages, databases, or APIs. Extract clean text; discard headers, footers, and noise.
2. Chunking: Split documents into chunks suitable for embedding, either fixed-size with overlap (RecursiveCharacterTextSplitter) or semantic chunking at natural boundary points.
3. Embedding and indexing: Encode each chunk as a dense vector using an embedding model. Store (chunk text, vector, metadata) in a vector database.
4. Query embedding: At query time, embed the user's question using the same embedding model used at indexing time.
5. Retrieval: Perform approximate nearest-neighbour search (HNSW) in the vector database, optionally applying metadata filters. Return the top-k chunks.
6. Reranking: Pass the query and retrieved chunks through a cross-encoder reranker. Cross-encoders attend to query and document jointly and are more accurate than bi-encoders at relevance scoring.
7. Generation: Assemble the prompt: system instruction + retrieved context + user question. Pass it to the LLM and stream the response to the user.
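Pulled together, the whole flow fits in a few dozen lines. Below is a minimal, un-reranked sketch of steps 1–5 and 7, assuming sentence-transformers and faiss-cpu are installed; the file name, chunking scheme, and query are purely illustrative.

```python
# Minimal naive RAG sketch: chunk -> embed -> FAISS index -> retrieve -> prompt.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # same model for indexing and querying

# Steps 1-2: ingestion and (toy) fixed-size character chunking with overlap
text = open("handbook.txt").read()
chunks = [text[i:i + 1000] for i in range(0, len(text), 800)]  # 200-char overlap

# Step 3: embed and index
embeddings = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on unit vectors
index.add(np.asarray(embeddings, dtype="float32"))

# Steps 4-5: embed the query with the SAME model, retrieve top-k chunks
query = "What is the parental leave policy?"
query_vec = np.asarray(model.encode([query], normalize_embeddings=True), dtype="float32")
scores, ids = index.search(query_vec, 4)
context = "\n\n".join(chunks[i] for i in ids[0])

# Step 7: assemble the prompt (the LLM call itself is omitted here)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```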
Choosing an Embedding Model
The embedding model is one of the highest-impact choices in a RAG system. It determines retrieval quality more than almost any other component. Key selection criteria: retrieval performance on benchmarks (MTEB), embedding dimension vs. storage cost, max sequence length, and whether you need multilingual support.
- text-embedding-3-small / text-embedding-3-large (OpenAI) — Strong all-round performance. text-embedding-3-small offers the best cost/performance ratio for most production deployments. text-embedding-3-large sets a performance ceiling at higher cost. Available via API only — data leaves your infrastructure.
- bge-m3 (BAAI) — State-of-the-art open-source embedding model supporting dense, sparse (lexical), and multi-vector (ColBERT-style late interaction) retrieval in a single model. Multilingual (100+ languages). Excellent MTEB scores. Good choice for on-premises deployments where data residency matters.
- e5-mistral-7b-instruct (Microsoft) — Instruction-tuned embedding model derived from a 7B LLM. Excellent performance on asymmetric retrieval (query and document have different styles). High compute cost — typically deployed for offline batch indexing rather than online query embedding.
- all-MiniLM-L6-v2 (Sentence Transformers) — Lightweight 22M parameter model. Fast inference; suitable for latency-critical or resource-constrained deployments. Lower quality ceiling than larger models but often sufficient for clearly structured document collections.
Critical rule: always use the same embedding model for indexing and query encoding. Switching models requires re-embedding and re-indexing your entire corpus.
Advanced RAG Patterns
HyDE (Hypothetical Document Embeddings): Rather than embedding the raw query (which is often short and stylistically different from indexed documents), prompt an LLM to generate a hypothetical answer to the question, then embed that hypothetical answer for retrieval. The hypothesis embedding is more similar to actual relevant documents in the embedding space. Effective for questions phrased very differently from the document corpus style.
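A compact sketch of the HyDE step, assuming the OpenAI Python SDK for hypothesis generation and sentence-transformers for embedding; the model names and prompt wording are illustrative rather than prescribed.

```python
# HyDE sketch: embed a hypothetical answer instead of the short raw query.
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_query_vector(question: str):
    # 1. Ask an LLM to write a plausible (possibly wrong) answer passage.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{
            "role": "user",
            "content": f"Write a short passage that answers this question: {question}",
        }],
    )
    hypothetical_answer = response.choices[0].message.content
    # 2. Embed the hypothesis and use this vector for nearest-neighbour search,
    #    instead of embedding the question directly.
    return embedder.encode([hypothetical_answer], normalize_embeddings=True)
```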
Parent Document Retrieval: Index small chunks for precise retrieval, but return the larger parent document or section when building the context for the LLM. Small chunks have focused, distinctive embeddings that retrieve more precisely; large context windows give the LLM sufficient surrounding information to answer well. Implement by storing (small_chunk_id → parent_id) mappings.
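A minimal sketch of the lookup side, assuming the small-chunk index returns chunk ids and the chunk-to-parent mapping was stored at indexing time; the data structures are illustrative.

```python
# Parent document retrieval sketch: search over small chunks, return their parents.
parent_docs = {"sec-1": "full section text ...", "sec-2": "another full section ..."}
chunk_to_parent = {"c-1": "sec-1", "c-2": "sec-1", "c-3": "sec-2"}

def fetch_parents(retrieved_chunk_ids):
    # Map each small-chunk hit back to its parent section, de-duplicating
    # so the LLM sees each parent only once.
    seen, parents = set(), []
    for chunk_id in retrieved_chunk_ids:
        parent_id = chunk_to_parent[chunk_id]
        if parent_id not in seen:
            seen.add(parent_id)
            parents.append(parent_docs[parent_id])
    return parents
```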
Multi-hop RAG: For complex questions requiring reasoning across multiple documents, decompose the query into sub-questions, retrieve for each, then synthesise. Example: "Which UK universities produced more AI PhDs than Imperial College in 2023?" requires retrieving data for multiple institutions before comparison. LangGraph and LlamaIndex both provide multi-hop retrieval primitives.
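A rough sketch of the decompose-retrieve-synthesise loop; retrieve and llm stand in for whatever retriever and chat-completion call you already have.

```python
# Multi-hop sketch: decompose the question, retrieve per sub-question, synthesise.
def multi_hop_answer(question, retrieve, llm):
    # retrieve(q) -> list of relevant text chunks; llm(prompt) -> completion string.
    sub_questions = llm(
        f"Break this question into independent sub-questions, one per line:\n{question}"
    ).splitlines()
    evidence = []
    for sub_q in sub_questions:
        evidence.extend(retrieve(sub_q))  # one retrieval hop per sub-question
    context = "\n\n".join(evidence)
    return llm(f"Using only this evidence:\n{context}\n\nAnswer: {question}")
```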
Contextual Compression: After retrieval, use an LLM to compress each retrieved chunk to only the sentences directly relevant to the query before including them in the final prompt. Reduces context window usage and improves answer quality by removing irrelevant noise from long retrieved passages.
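One way to implement this is a per-chunk extraction prompt, sketched below with a placeholder llm call; LangChain's ContextualCompressionRetriever packages the same idea.

```python
# Contextual compression sketch: keep only the query-relevant sentences of a chunk.
def compress_chunk(chunk: str, query: str, llm) -> str:
    prompt = (
        "Extract only the sentences from the passage that are directly relevant "
        "to the question. Return them verbatim, or NONE if nothing is relevant.\n\n"
        f"Question: {query}\n\nPassage:\n{chunk}"
    )
    compressed = llm(prompt)
    return "" if compressed.strip() == "NONE" else compressed
```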
Reranking: Pass the query and each retrieved candidate through a cross-encoder reranker (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2 from Sentence Transformers, or the Cohere Rerank API). Cross-encoders jointly attend to query and document and are significantly more accurate than bi-encoder similarity at relevance scoring. Retrieve 20–50 candidates, rerank, and take the top 3–5. Consistently improves RAG quality at modest latency cost.
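A minimal reranking sketch with the Sentence Transformers CrossEncoder class; the candidate list and cut-off are illustrative.

```python
# Cross-encoder reranking sketch: score (query, candidate) pairs jointly, keep the best.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=5):
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```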
RAG Frameworks
- LangChain — The most widely used RAG framework. Rich ecosystem of document loaders, text splitters, vector store integrations, and retriever abstractions. LCEL (LangChain Expression Language) for composing chains declaratively. LangGraph for multi-step agentic RAG.
- LlamaIndex — More index-centric than LangChain; particularly strong for structured data retrieval and multi-modal RAG. Excellent abstractions for parent document retrieval, knowledge graphs, and query routing across multiple indices.
- Haystack (deepset) — Pipeline-based framework with strong enterprise features. Popular in UK fintech and legal tech. Well-suited for hybrid search pipelines with Elasticsearch.
- RAGAS — The standard evaluation framework for RAG pipelines. Measures faithfulness, answer relevancy, context precision, and context recall using LLM-as-judge.
Learning Path for RAG Skills
Foundations (0–4 weeks)
- Understand embeddings and similarity metrics: cosine similarity, dot product, L2 distance (see the worked example after this list)
- Build a naive RAG pipeline: load PDF → chunk → embed → FAISS index → query
- Use LangChain or LlamaIndex to compose the pipeline
- Evaluate with RAGAS on a simple QA dataset
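A tiny worked example of the three similarity measures mentioned above, with made-up vectors.

```python
# Cosine similarity, dot product, and L2 distance on two toy vectors.
import numpy as np

a = np.array([0.2, 0.9, 0.1])
b = np.array([0.3, 0.8, 0.0])

dot = a @ b                                              # dot product
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # cosine similarity
l2 = np.linalg.norm(a - b)                               # Euclidean (L2) distance

# For unit-normalised embeddings, cosine similarity equals the dot product,
# and ranking by L2 distance produces the same order as ranking by cosine.
print(dot, cosine, l2)
```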
Core Skills (1–3 months)
- Vector databases: deploy Qdrant or Weaviate, understand HNSW indexing
- Hybrid search: combine BM25 and dense retrieval with Reciprocal Rank Fusion
- Implement a cross-encoder reranker with Sentence Transformers
- Parent document retrieval and contextual compression
- Metadata filtering: filter by date, author, document type at retrieval time
Advanced (3–6 months)
- HyDE: hypothetical document embeddings for query expansion
- Multi-hop RAG with LangGraph for complex question decomposition
- Self-RAG: model-driven retrieval decisions and quality assessment
- Production considerations: latency profiling, caching embeddings, async retrieval
- Security: prompt injection via retrieved content, content filtering on retrieved documents
Frequently Asked Questions
What is RAG and why does it matter for AI engineering?
RAG retrieves relevant document chunks from an external knowledge base at inference time and provides them as context in the LLM prompt. It addresses knowledge cutoffs and hallucination — LLMs only know what was in their training data, and they can generate plausible-sounding false information. RAG is the standard architecture for enterprise knowledge bases, documentation assistants, and any use case requiring current or private information.
What is the difference between naive RAG and advanced RAG?
Naive RAG uses fixed chunking and a single embedding-based search. Advanced RAG adds: query rewriting, HyDE (hypothetical document embeddings), parent document retrieval, multi-hop retrieval for complex questions, reranking, hybrid search (BM25 + vectors), and self-RAG (model decides when retrieval helps).
What chunk size should you use?
256–512 tokens for precise factual content; 512–1024 tokens for narrative text. 10–20% overlap preserves cross-boundary context. Semantic chunking (splitting at embedding similarity boundaries) often outperforms fixed-size. Test empirically with RAGAS — chunk size is one of the highest-leverage hyperparameters.
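For fixed-size splitting, a typical configuration with LangChain's RecursiveCharacterTextSplitter looks like the sketch below. Note that its sizes are measured in characters unless you pass a token-aware length function, so these numbers are rough character equivalents of the token guidance above.

```python
# Chunking sketch with LangChain's RecursiveCharacterTextSplitter (character-based sizes).
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,      # roughly 350-400 tokens of English prose
    chunk_overlap=200,    # ~13% overlap to preserve cross-boundary context
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_text(open("handbook.txt").read())
```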
What is hybrid search and why use it?
Hybrid search combines sparse (BM25) and dense (embedding) retrieval. BM25 handles keyword queries and precise terms; dense retrieval handles semantic similarity and synonyms. Merge result lists with Reciprocal Rank Fusion (RRF). Production RAG systems use hybrid search because real queries span both types — keyword-sensitive (product codes, names) and semantically rich.
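RRF itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over the ranked lists it appears in, with k typically around 60. A sketch, using toy id lists in place of real retriever output:

```python
# Reciprocal Rank Fusion: merge BM25 and dense-retrieval rankings into one list.
def rrf(rankings, k=60):
    # rankings: list of ranked lists of document ids, best first.
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ids = ["d3", "d1", "d7"]   # toy BM25 ranking
dense_ids = ["d1", "d5", "d3"]  # toy dense-retrieval ranking
fused = rrf([bm25_ids, dense_ids])  # d1 and d3 rise to the top
```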
How do you evaluate a RAG pipeline?
RAGAS measures: Faithfulness (no hallucinated claims), Answer Relevancy, Context Precision, and Context Recall. Use an LLM judge for scalable scoring. Also run end-to-end QA on a labelled benchmark, profile latency (retrieval + generation), and monitor embedding API costs.
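A minimal RAGAS run looks roughly like the sketch below; metric names and expected dataset columns have shifted across RAGAS releases, so treat this as indicative rather than exact.

```python
# RAGAS evaluation sketch using the classic evaluate() entry point.
# evaluate() calls an LLM judge under the hood, so an API key must be configured.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

eval_data = Dataset.from_dict({
    "question": ["What is the notice period?"],
    "answer": ["The notice period is one month."],
    "contexts": [["Employees must give one month's notice in writing."]],
    "ground_truth": ["One month."],
})
report = evaluate(eval_data, metrics=[faithfulness, answer_relevancy,
                                      context_precision, context_recall])
print(report)
```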