Overview

1 How RAG research prevents disasters

Retrieval-Augmented Generation (RAG) connects large language models to authoritative, current knowledge sources so responses are grounded rather than guessed. The chapter opens with a cautionary lesson: a customer-service chatbot confidently contradicted the airline’s own policy, illustrating a factually inconsistent hallucination that led to legal and financial consequences. Beyond one incident, the authors argue that organizations repeatedly face three enduring constraints—keeping knowledge current, preventing hallucinations during synthesis, and accessing private, fast-changing internal data—making RAG not a fad but a durable architectural pattern. Research literacy becomes a strategic edge: teams that understand how RAG really works can anticipate limitations, engineer safeguards, and turn reliability into a competitive advantage.

To move from trial-and-error to engineering discipline, the chapter introduces a seven-point taxonomy of RAG failures—missing content, missed top rank, factually inconsistent hallucination, not in context, not extracted, incorrect specificity, and incomplete answers—and shows how these issues quietly erode trust even when systems are “usually” right. It maps failure modes to concrete remedies across the indexing and query pipelines: better curation and chunking, query expansion and hypothetical document generation, hybrid retrieval and re-ranking, result fusion, context compression and organization, and generation-time grounding and verification. Research-backed methods such as Self-RAG (confidence and self-critique), FLARE (active, need-aware retrieval), HyDE (bridging query–document vocabulary gaps), and result-fusion strategies help detect uncertainty, trigger additional retrieval, and align synthesis with sources. The chapter also weighs costs and infrastructure trade-offs, noting that while longer contexts, fine-tuning, and caching strengthen RAG, they do not replace retrieval-first grounding.
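One of the techniques named above, hypothetical document generation (HyDE), is compact enough to sketch. This is a toy illustration only: the bag-of-words "embedding" and the stubbed LLM stand in for a real dense encoder and a real generator, and all names here are hypothetical.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a dense encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyde_retrieve(query: str, docs: list[str], generate_hypothetical) -> str:
    # HyDE's core idea: embed a hypothetical *answer* instead of the terse
    # query, so retrieval compares answer-like text against answer-like docs,
    # bridging the query-document vocabulary gap.
    hypo = generate_hypothetical(query)
    hypo_vec = embed(hypo)
    return max(docs, key=lambda d: cosine(embed(d), hypo_vec))

# Stubbed "LLM" that drafts a plausible answer for the query.
fake_llm = lambda q: "refunds are issued within 30 days of ticket purchase"

docs = [
    "Baggage allowance depends on fare class and destination.",
    "Refunds are issued within 30 days of the ticket purchase date.",
]
best = hyde_retrieve("refund policy?", docs, fake_llm)
```

The terse query "refund policy?" shares almost no vocabulary with the documents, but the hypothetical answer does, which is exactly the gap HyDE is designed to bridge.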

Finally, the authors chart an architectural progression that guides investment: Naive RAG for rapid, low-cost foundations; Advanced RAG to optimize preprocessing, retrieval, and context for production; Modular RAG to compose specialized components across diverse data and tasks; and Agentic RAG for autonomous, iterative information seeking with self-monitoring and multi-step reasoning. Each stage has clear upgrade triggers tied to accuracy demands, domain complexity, risk, and scale, along with operational considerations like orchestration, evaluation, and cost controls. The book’s teaching approach emphasizes research literacy, prevention of known failure modes, progressive complexity, hands-on implementation, and systematic evaluation—so practitioners can predict, measure, and improve reliability before failures undermine user trust.

Figure: The RAG workflow, from user queries to grounded responses through retrieval and generation.
Figure: RAG architectural evolution from Naive to Agentic implementations. Each paradigm builds on its predecessors while adding specialized components and capabilities to address increasingly complex requirements.
Figure: RAG implementation decision tree, with business decisions guiding RAG choices.

Summary

  • Retrieval-Augmented Generation solves three critical limitations that make standalone language models unreliable for production applications: knowledge boundaries that prevent access to current information, hallucinations that generate unverifiable claims, and the inability to incorporate private organizational knowledge that drives business decisions.
  • The seven-point failure taxonomy provides a systematic diagnosis for RAG system problems rather than guessing at solutions. Failure points, such as "Missed the Top Rank" and "Factually Inconsistent Hallucination," enable the precise identification of issues and the selection of research-backed solutions that address specific failure modes.
  • RAG systems evolve through four architectural stages based on complexity requirements and business needs. Naive RAG establishes basic retrieve-then-generate functionality for proof-of-concept applications. Advanced RAG optimizes retrieval quality and context processing for production performance. Modular RAG implements adaptive strategies and quality control for mission-critical applications. Agentic RAG introduces autonomous planning and self-correction to the retrieval and generation loop.
  • The core RAG architecture coordinates two specialized components: retrieval systems that find relevant information from external knowledge sources, and generation systems that synthesize retrieved context with user queries to produce grounded, factual responses. This coordination enables AI systems that combine broad language capabilities with specific, current, and verifiable knowledge.
  • Research literacy transforms technology evaluation from reactive debugging to proactive problem-solving. Understanding the academic foundations behind RAG techniques enables independent assessment of new approaches, strategic planning for system evolution, and adaptation to changing requirements without waiting for tutorials or expert opinions.

FAQ

What is Retrieval-Augmented Generation (RAG), and how is it different from search or pure LLMs?
RAG connects a language model to external knowledge sources at query time so responses are grounded in retrieved evidence. Unlike search, which returns documents for humans to read, and unlike pure LLMs, which rely only on training data, RAG retrieves relevant passages and uses them to inform generation in real time.
What went wrong in Air Canada’s chatbot incident, and which failure point did it illustrate?
The bot retrieved the correct policy page but generated an answer that contradicted it, promising a retroactive bereavement discount that didn’t exist. This is Failure Point 3: Factually inconsistent hallucination (ungrounded generation). The tribunal held the company responsible, underscoring the need for grounding, uncertainty signals, and oversight.
What is the “knowledge boundary” problem, and why can’t bigger context windows fix it?
Organizational knowledge changes daily; new facts don’t exist in a model’s training data. Even very large context windows can’t add information that hasn’t been ingested. RAG addresses this by retrieving current, authoritative sources at inference time and clarifying what the system does and doesn’t know.
Why don’t larger LLMs eliminate hallucinations?
Bigger models can make hallucinations more plausible and harder to detect. The core issue is unverifiable synthesis: combining information without clear provenance. Research like Self-RAG adds confidence indicators and retrieval-aware self-critique, but hallucination control remains an ongoing challenge requiring grounding and verification.
Why can’t fine-tuning on private data replace RAG for enterprise use?
Private knowledge changes frequently, comes with access controls, and spans varied formats. Training alone can’t keep up or enforce permissions. RAG integrates private, proprietary, and real-time data with the right access, preserving both general model competence and up-to-date, user-specific accuracy.
Why is RAG an enduring architectural pattern despite longer contexts, LoRA fine-tuning, and caching?
These techniques enhance RAG but don’t replace it. Retrieval-first pipelines deliver precision and cost efficiency by pre-filtering relevant content, outperforming long-context alone on knowledge-intensive tasks. LoRA improves individual components (e.g., generator behavior) without sacrificing modular updating. Caching optimizes performance inside the same retrieval-plus-generation framework.
How do the two RAG pipelines (indexing and query) work together?
The indexing pipeline chunks and encodes documents, storing them in a searchable store. At query time, the system: (1) processes the question, (2) retrieves relevant passages, (3) combines them with the query, and (4) generates an answer grounded in the retrieved evidence. Retrieval techniques target missing/hidden context issues; generation techniques target faithful use of that context.
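The four steps above can be sketched end to end. Everything here is illustrative: keyword overlap stands in for a vector store, and the fixed-window chunking and prompt format are simplified stand-ins for real pipeline components.

```python
def index(documents: list[str], chunk_size: int = 8) -> list[str]:
    """Indexing pipeline: split each document into small word-window chunks."""
    chunks = []
    for doc in documents:
        words = doc.split()
        for i in range(0, len(words), chunk_size):
            chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Query pipeline, steps 1-2: score chunks by term overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Steps 3-4: combine retrieved evidence with the question for the generator."""
    evidence = "\n".join(f"- {c}" for c in context)
    return f"Answer using ONLY this context:\n{evidence}\n\nQuestion: {query}"

chunks = index(["Bereavement fares must be requested before travel. "
                "Refunds are not retroactive."])
prompt = build_prompt("Are bereavement refunds retroactive?",
                      retrieve("bereavement refunds retroactive", chunks))
```

The resulting prompt constrains the generator to the retrieved evidence, which is the grounding step that the Air Canada bot skipped.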
What are the seven failure points in RAG, and how does the taxonomy help?
The taxonomy includes: FP1 Missing content, FP2 Missed the top rank, FP3 Factually inconsistent hallucination, FP4 Not in context, FP5 Not extracted, FP6 Incorrect specificity, FP7 Incomplete answers. It turns debugging into diagnosis: identify the specific failure and apply targeted fixes (e.g., hybrid search or RRF for FP2, better context filtering for FP4, extraction prompts for FP5).
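The diagnose-then-fix workflow lends itself to a simple lookup. The failure-point labels follow the chapter; the remedy strings are an illustrative summary of the fixes named above, not an exhaustive diagnostic tool.

```python
# Failure point -> (name, candidate remedies). Remedies are a condensed,
# non-exhaustive summary for illustration.
FAILURE_POINTS = {
    "FP1": ("Missing content", "ingest the missing sources; curate the corpus"),
    "FP2": ("Missed the top rank", "hybrid search, re-ranking, reciprocal rank fusion"),
    "FP3": ("Factually inconsistent hallucination", "generation-time grounding and verification"),
    "FP4": ("Not in context", "context filtering, compression, and organization"),
    "FP5": ("Not extracted", "extraction-focused prompts"),
    "FP6": ("Incorrect specificity", "query expansion; match answer granularity"),
    "FP7": ("Incomplete answers", "result fusion; multi-step retrieval"),
}

def diagnose(code: str) -> str:
    """Map an identified failure point to its name and candidate fixes."""
    name, fix = FAILURE_POINTS[code]
    return f"{code} ({name}): try {fix}"
```

Encoding the taxonomy this way makes the point concrete: a failed answer gets a specific label first, and only then a targeted remedy.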
Which research-backed techniques mitigate common RAG failures?
Examples include: HyDE (bridges query–document vocabulary gaps, helping FP2/FP6), Self-RAG (reflection tokens for confidence and evidence checks, addressing FP3/FP5), FLARE (active, iterative retrieval when confidence is low, addressing FP1/FP3/FP4), and RAG-Fusion/reciprocal rank fusion plus hybrid search (improve retrieval ranking for FP2).
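Of these, reciprocal rank fusion is compact enough to show directly. The sketch below follows the standard RRF formula, where each document scores the sum of 1/(k + rank) across the ranked lists it appears in, with k = 60 as commonly used; the two input rankings are hypothetical.

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: score each doc by sum of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of a keyword retriever and a vector retriever.
keyword_ranking = ["policy_faq", "refund_terms", "baggage_rules"]
vector_ranking = ["refund_terms", "press_release", "policy_faq"]
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
```

Note how `refund_terms`, ranked second by one retriever and first by the other, edges out `policy_faq`, which one retriever put first but the other only third; rewarding this cross-retriever agreement is why fusion helps with FP2.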
When should I choose Naive, Advanced, Modular, or Agentic RAG?
Use Naive RAG for quick proofs in well-structured domains and tolerant users. Move to Advanced RAG when retrieval quality limits accuracy or scale demands optimization. Adopt Modular RAG for diverse use cases, multi-source integration, and independent scaling/QA. Choose Agentic RAG for dynamic, multi-step tasks that need autonomous retrieval decisions, confidence-based triggers, and iterative refinement.
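The guidance above can be read as a decision function. The boolean triggers here are hypothetical simplifications of the upgrade criteria, ordered from most to least demanding so the most capable required paradigm wins.

```python
def choose_rag_paradigm(*, proof_of_concept: bool,
                        retrieval_limits_accuracy: bool,
                        diverse_sources: bool,
                        multi_step_tasks: bool) -> str:
    """Pick the simplest paradigm that satisfies the stated requirements."""
    if multi_step_tasks:
        return "Agentic RAG"      # autonomous, iterative retrieval decisions
    if diverse_sources:
        return "Modular RAG"      # specialized components per source/task
    if retrieval_limits_accuracy:
        return "Advanced RAG"     # optimized preprocessing, retrieval, context
    return "Naive RAG"            # fast, low-cost retrieve-then-generate
```

Checking the most demanding trigger first mirrors the chapter's progression: each stage subsumes the capabilities of the one before it.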
