Enterprise RAG

Scaling Retrieval Augmented Generation

Tyler Suard

MEAP began March 2025
Last updated July 2025
Publication in Fall 2026 (estimated)

ISBN 9781633435476
225 pages (estimated)

Included with a Manning Online subscription

printed in black & white

available in Korean

resources: Source code, Book forum, Source code on GitHub

table of contents

PART 1: BUILDING YOUR RAG

1 Intro to enterprise RAG

1.1 A brief intro to RAG

1.2 The difference between naive RAG and enterprise RAG

1.3 Why businesses need enterprise RAG

1.4 Example use cases

1.5 Building a RAG system

1.6 Summary

2 Nothing happens until someone writes an eval

2.1 Introducing evals and eval-driven development

2.2 Why you can’t write a single line of code until you have an eval

2.3 How to use evals

2.4 Automatic evals in end-to-end testing

2.4.1 How to use LLMs in the eval process

2.4.2 Testing all data sources

2.4.3 Implementing automated evals

2.5 Summary

3 Search service ingestion

3.1 Setting up Azure AI Search

3.2 Preparing and uploading SQL records to AI Search

3.2.1 Code setup

3.2.2 Summarize one SQL record

3.2.3 Embed the record summary

3.2.4 Upload the record and embedding to Azure AI Search

3.2.5 Putting it all together

3.2.6 Loop through all SQL records

3.2.7 Test AI Search

3.2.8 Search filters and selecting fields

3.2.9 Ordering search results for consistency

3.2.10 Updating AI Search Index: adding new records

3.2.11 Updating AI Search Index: deleting records

3.3 Preparing and uploading unstructured data to AI Search

3.4 Summary

4 Retrieval using AutoGen agents

4.1 Using agents for smarter retrieval in RAG

4.1.1 Getting into agents

4.1.2 Setting up AutoGen agents

4.2 Assigning a search function to the agents

4.2.1 Creating query embeddings for the AI search index

4.2.2 Setting up the Azure AI Search client

4.3 Searching multiple databases simultaneously

4.3.1 Using group chats for simultaneous queries

4.4 Writing the final answer

4.5 Making sure the bot answers in the right language

4.5.1 Define the language checker function

4.6 Storing answer information

4.7 The RAGChat function: one RAG to rule them all

4.8 Question triage

4.8.1 Code setup

4.9 Websocket integration for streaming responses

4.9.1 Essential setup

4.9.2 Handling websocket messages

4.9.3 Starting the websocket server

4.10 Testing the websocket server with a client

4.11 Summary

PART 2: DEPLOYING AND IMPROVING

5 Hosting, scaling, and load testing

5.1 Packaging and containers

5.1.1 Removing state from the container

5.1.2 Creating a CosmosDB database for chat logs

5.1.3 Modifying chapter 4 code to hit the new CosmosDB database

5.1.4 Returning the final answer for integration tests

5.1.5 Getting ready to containerize

5.2 Deploying

5.2.1 Creating a container registry

5.2.2 Creating a web app

5.2.3 Setting up the web app

5.2.4 Setting up connections with GitHub

5.3 Trying out your deployed app

5.4 Breaking things, on purpose: load testing with Locust

5.4.1 Adding logging

5.4.2 Load testing using Locust

5.4.3 Fixing the things that Locust breaks

5.5 Summary

6 Communication strategies: Disclaimers, feedback, and prompt tuning

6.1 Disclaimer message and user guide

6.2 Introduction to user feedback

6.3 Building a user feedback system part 1: CosmosDB

6.4 Building a user feedback system part 2: Function App

6.5 Why not direct? Why not use our web app?

6.6 App setup

6.7 Code

6.8 Deploying the function app

6.9 Environment variables

6.10 Trying out our feedback loop

6.11 Ask a question and give negative feedback

6.12 Check the logs and diagnose the problem

6.13 Update the prompt and/or your code

6.14 Redeploy and try it again

6.15 Email the user

6.16 A fully fleshed-out writer prompt

6.17 Responding (nicely) to dumb user feedback

6.18 Summary

7 Security and governance

PART 3: MAINTAINING

8 Monitoring and observability

9 Writing readable code

10 Tips and Troubleshooting

Overview

1 Intro to enterprise RAG

Imagine a tireless digital assistant that understands natural questions, finds the right data instantly, and writes clear answers. That is the promise of Retrieval Augmented Generation (RAG): it fuses a language model’s conversational ability with targeted retrieval from company sources like databases, documents, and internal apps. The chapter introduces RAG at a high level, explains how it works end to end, and contrasts simple “naive” setups with the enterprise-grade approach needed for real-world complexity. It frames the core goal as fast, accurate, and trustworthy access to knowledge so teams can make better decisions with less friction.

The text details why naive RAG struggles in business settings—returning irrelevant context, hallucinating, and failing at scale—and defines Enterprise RAG as a robust pipeline designed for reliability. Key capabilities include input validation, question triage, query rewriting, asynchronous agent orchestration, and hybrid enterprise search that blends keyword and vector retrieval. Results are ranked and filtered before a writer agent produces consistent, grounded answers. Beyond speed and accuracy, the chapter emphasizes multilingual support, up-to-date data, guardrails for safety, privacy controls, cost management, and observability. Practical benefits span faster customer support, smoother collaboration, and better decision-making, illustrated across small business, large enterprise, healthcare, finance, and education scenarios.
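
The hybrid search step above blends keyword and vector retrieval into one ranked list. The source does not say which fusion method the book uses, so the following is only a minimal sketch of one common option, reciprocal rank fusion. The function name, the constant `k=60`, and the document IDs are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) for every list it appears in,
    where rank starts at 1. The constant k is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword search and a vector search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is the behavior a combined keyword and vector search is after.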

Finally, the chapter previews how to build such a system: start with evaluation, then ingest both unstructured and structured data; chunk documents thoughtfully, attach metadata, and generate embeddings for vector search. On retrieval, rewrite messy user questions into precise queries and combine multiple searches to gather the best evidence. For generation, assemble concise, user-friendly answers grounded in retrieved context with traceability. The result is a scalable, adaptable Enterprise RAG that turns scattered information into an accessible, trusted asset—and a practical roadmap for readers to implement it in their own organizations.
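
The ingestion step above (chunk documents, attach metadata) could be sketched roughly as follows. This is an illustration, not the book's code: the word-based splitting, the `chunk_size` and `overlap` defaults, and the metadata field names are all assumptions.

```python
def chunk_text(text, source, chunk_size=200, overlap=50):
    """Split text into overlapping word-based chunks with simple metadata.

    chunk_size and overlap are counted in words (assumed defaults).
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        chunks.append({
            "id": f"{source}-{len(chunks)}",  # hypothetical ID scheme
            "source": source,                 # where the chunk came from
            "start_word": start,              # position, useful for traceability
            "text": " ".join(piece),
        })
        # Stop once this chunk reaches the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(sample, "doc.pdf", chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk["id"], "|", chunk["text"])
```

Each chunk's `text` would then be sent to an embedding model, and the metadata stored next to the vector so that answers can point back to their source.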

The left column shows the multiple steps and complexity of manually searching a SQL database for records. Compare this with the relative ease of asking a RAG chatbot the same question, shown in the right column.

In a RAG system, the user question, the prompt, and the retrieved data are combined and sent to an LLM, which generates an answer using all that input information.

Traditional manual workflow for retrieving answers, requiring database queries, corrections, and manual review. This process is time-consuming and requires a lot of effort.

Basic RAG process with embedding, vector search, and a large language model. This simple approach is efficient but prone to errors and lacks context handling.

Enterprise RAG pipeline improves speed, accuracy, and scalability by incorporating validation, query rewriting, and asynchronous agents, reducing response times to 30 seconds.

A naive RAG pipeline with limited steps for retrieving answers. Suitable for simple queries but insufficient for handling complex or large-scale enterprise needs.

Key questions for designing enterprise RAG systems, addressing user input limits, database performance, context accuracy, and feedback management for better scalability and reliability.

Enterprise RAG system architecture showing ingestion, retrieval, and generation steps. Raw data is preprocessed, embedded, and searched to deliver accurate, context-aware answers.

Summary

Retrieval Augmented Generation (RAG) is an advanced AI technology that combines conversational skills with real-time data retrieval, like an efficient assistant.
RAG allows users to ask questions in plain language and receive detailed, specific information tailored to their needs, accessing data from databases, documents, and applications like Slack.
Naive RAG, while easy to set up, often falls short in business environments due to misunderstandings of context, retrieving incorrect data, or providing inaccurate ("hallucinated") answers.
Enterprise RAG is designed to handle complex business scenarios, accurately processing diverse questions in different languages and grasping user intent.
Implementing Enterprise RAG leads to streamlined operations, faster decision-making, improved collaboration, and enhanced customer service by resolving issues quickly.
The book will guide readers step-by-step in building their own Enterprise RAG system, empowering them to harness the full potential of AI-driven data retrieval.

FAQ

What is Retrieval Augmented Generation (RAG)?

RAG is an AI approach that combines a language model’s conversational abilities with real-time retrieval from your data sources. You ask a question in plain language, the system searches relevant databases and documents, and the language model generates a clear, context-aware answer using the retrieved information.

How is RAG different from a traditional database search?

Traditional search requires knowing where to look, crafting exact queries, and manually reviewing results. RAG lets you ask natural questions; it finds the right sources (across databases, files, apps), retrieves the most relevant snippets, and writes a coherent answer, dramatically reducing steps and effort.

What is “Naive RAG,” and how does it work?

Naive RAG typically turns the user question into an embedding, searches a vector database for semantically similar chunks, and feeds the top matches plus the question to a language model to draft an answer. It’s simple and works for basic queries but struggles with scale and complexity.
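
The naive flow described above can be sketched in a few lines. To keep the example self-contained, `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector database" is a plain list, and the prompt wording is invented; none of these come from the book.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Combine the retrieved chunks and the question into one LLM prompt."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical document chunks.
chunks = [
    "Invoices are due within 30 days.",
    "The office is closed on public holidays.",
    "Late invoices incur a 2% fee.",
]
question = "when are invoices due"
top = retrieve(question, chunks)
prompt = build_prompt(question, top)
print(prompt)
```

The resulting prompt would be sent to a language model. Everything here depends on the top matches being the right ones, which is where this approach starts to struggle as the data grows.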

Why does Naive RAG often fail in business settings?

Common issues include retrieving the wrong context, mixing unrelated content, and hallucinations—especially with large, messy, or multi-source datasets. In complex, enterprise scenarios, this leads to inaccurate answers and poor reliability; estimates suggest many early business RAG attempts fail due to these pitfalls.

What is “Enterprise RAG,” and what makes it different?

Enterprise RAG is a robust pipeline designed for real-world business needs. It adds steps such as input validation, question triage, query rewriting, asynchronous agents, multi-source enterprise search (keyword + vector), ordering and filtering results, and a writer agent. The result is higher accuracy, faster performance, and better scalability and usability.

How does Enterprise RAG reduce errors and hallucinations?

It validates inputs, routes questions to the right paths, rewrites queries to match data schemas, searches across sources with combined methods, filters irrelevant results, and has a writer agent compose grounded answers. It also prompts for clarification when needed and applies guardrails to avoid unsafe or misleading outputs.
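
The first two steps above, validating input and routing the question, might look roughly like this. It is a sketch under stated assumptions: the length limit, the route names, and the keyword rules are hypothetical, and a production triage step would more likely rely on an LLM or a classifier than on keyword matching.

```python
MAX_QUESTION_CHARS = 500  # hypothetical limit on question length

# Hypothetical routes and the keywords that trigger them.
ROUTES = {
    "sql": ("order", "invoice", "customer"),
    "documents": ("policy", "manual", "handbook"),
}


def validate_question(question):
    """Normalize whitespace and reject empty or oversized questions."""
    cleaned = " ".join(question.split())
    if not cleaned:
        raise ValueError("Question is empty.")
    if len(cleaned) > MAX_QUESTION_CHARS:
        raise ValueError("Question is too long.")
    return cleaned


def triage(question):
    """Pick the data sources to search, or ask the user to clarify."""
    lowered = question.lower()
    matched = [
        route
        for route, keywords in ROUTES.items()
        if any(keyword in lowered for keyword in keywords)
    ]
    return matched or ["clarify"]


cleaned = validate_question("  What is   the refund policy? ")
print(cleaned, "->", triage(cleaned))
```

Returning a `"clarify"` route when nothing matches mirrors the idea of prompting the user for clarification instead of guessing.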

What kinds of data sources can Enterprise RAG use?

Enterprise RAG connects to structured and unstructured data: SQL/operational databases, PDFs and documents, internal communications (e.g., Slack), and other business apps and indexes. It’s designed to search across many systems simultaneously and keep results current.
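
Searching several systems at once can be sketched with Python's `asyncio`. The source names and the `search_source` coroutine below are placeholders that only simulate latency; a real system would call its search services there.

```python
import asyncio


async def search_source(name, query, delay):
    """Placeholder search: waits briefly, then returns a fake hit."""
    await asyncio.sleep(delay)  # simulates network latency
    return {"source": name, "hits": [f"{name} result for '{query}'"]}


async def search_all(query):
    """Query every source concurrently and collect the results."""
    tasks = [
        search_source("sql", query, 0.02),
        search_source("pdf", query, 0.01),
        search_source("slack", query, 0.03),
    ]
    # gather runs the coroutines concurrently and returns results
    # in the same order as the tasks, not in completion order.
    return await asyncio.gather(*tasks)


results = asyncio.run(search_all("late invoices"))
for result in results:
    print(result["source"], result["hits"])
```

Because the searches overlap, total wait time is close to that of the slowest source, not the sum of all three.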

What benefits can a business expect from Enterprise RAG?

- Faster decisions and responses (e.g., cutting a five-minute lookup to ~30 seconds in examples)
- Higher accuracy and consistency across queries
- Productivity gains and better collaboration
- Improved customer support (e.g., notable reductions in time-per-issue in practice)

What risks and considerations should we plan for?

- Cost and operations: model usage, search services, and compute can add up; monitor and optimize workflows
- Skills and maintenance: requires AI/ML/data engineering expertise and ongoing updates
- Safety and compliance: apply guardrails, control access, prevent data leaks, and handle legal risks from incorrect answers
- Multilingual and ambiguous queries: ensure triage and rewriting for reliability across languages and styles

How do we start building an Enterprise RAG system?

Follow a three-part flow: (1) Ingestion—extract text (e.g., from PDFs), chunk content, add metadata, and create embeddings or use a managed search service; (2) Retrieval—validate inputs, triage, rewrite queries, and search across sources with keyword + vector; (3) Generation—have a writer agent compose a clear answer grounded in retrieved context, with links/citations when possible. The book then guides you step-by-step to implement and scale this pipeline.

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$47.99 $35.99

you save $12.00 (25%)
