Enterprise RAG you own this product

Scaling Retrieval Augmented Generation

Tyler Suard

MEAP began March 2025
Last updated July 2025
Publication in Fall 2026 (estimated)

ISBN 9781633435476
225 pages (estimated)

Included with a Manning Online subscription

printed in black & white

available in Korean

catalog / Data Science / Deep Learning / Generative AI

resources: Source code Book forum Source code on Github

table of content

PART 1: BUILDING YOUR RAG

1 Intro to enterprise RAG

1.1 A brief intro to RAG

1.2 The difference between naive RAG and enterprise RAG

1.3 Why businesses need enterprise RAG

1.4 Example use cases

1.5 Building a RAG system

1.6 Summary

2 Nothing happens until someone writes an eval

2.1 Introducing evals and eval-driven development

2.2 Why you can’t write a single line of code until you have an eval

2.3 How to use evals

2.4 Automatic evals in end-to-end testing

2.4.1 How to use LLMS in the eval process

2.4.2 Testing all data sources

2.4.3 Implementing automated evals

2.5 Summary

3 Search service ingestion

3.1 Setting up Azure AI Search

3.2 Preparing and uploading SQL records to AI Search

3.2.1 Code setup

3.2.2 Summarize one SQL record

3.2.3 Embed the record summary

3.2.4 Upload the record and embedding to Azure AI Search

3.2.5 Putting it all together

3.2.6 Loop through all SQL records

3.2.7 Test AI Search

3.2.8 Search filters and selecting fields

3.2.9 Ordering search results for consistency

3.2.10 Updating AI Search Index: adding new records

3.2.11 Updating AI Search Index: deleting records

3.3 Preparing and uploading unstructured data to AI Search

3.4 Summary

4 Retrieval using AutoGen agents

4.1 Using agents for smarter retrieval in RAG

4.1.1 Getting into agents

4.1.2 Setting up AutoGen agents

4.2 Assigning a search function to the agents

4.2.1 Creating query embeddings for the AI search index

4.2.2 Setting up the Azure AI Search client

4.3 Searching multiple databases simultaneously

4.3.1 Using group chats for simultaneous queries

4.4 Writing the final answer

4.5 Making sure the bot answers in the right language

4.5.1 Define the language checker function

4.6 Storing answer information

4.7 The RAGChat function: one RAG to rule them all

4.8 Question triage

4.8.1 Code Setup

4.9 Websocket integration for streaming responses

4.9.1 Essential setup

4.9.2 Handling websocket messages

4.9.3 Starting the websocket server

4.10 Testing the websocket server with a client

4.11 Summary

PART 2: DEPLOYING AND IMPROVING

5 Hosting, scaling, and load testing

5.1 Packaging and containers

5.1.1 Removing state from the container

5.1.2 Creating a CosmosDB database for chat logs

5.1.3 Modifying chapter 4 code to hit the new CosmosDB database

5.1.4 Returning the final answer for integration tests

5.1.5 Getting ready to containerize

5.2 Deploying

5.2.1 Creating a container registry

5.2.2 Creating a web app

5.2.3 Setting up the web app

5.2.4 Setting up connections with Github

5.3 Trying out your deployed app

5.4 Breaking things, on purpose: load testing with Locust

5.4.1 Adding logging

5.4.2 Load testing using Locust

5.4.3 Fixing the things that Locust breaks

5.5 Summary

== 6 Communication strategies: Disclaimers, feedback, and prompt tuning

6.1 Disclaimer message and user guide

6.2 Introduction to user feedback

6.3 Building a user feedback system part 1: CosmosDB

6.4 Building a user feedback system part 2: Function App

6.5 Why not direct? Why not use our web app?

6.6 App setup

6.7 Code

6.8 Deploying the function app

6.9 Environment variables

6.10 Trying out our feedback loop

6.11 Ask a question and give negative feedback

6.12 Check the logs and diagnose the problem

6.13 Update the prompt and/or your code

6.14 Redeploy and try it again

6.15 Email the user

6.16 A fully fleshed-out writer prompt

6.17 Responding (nicely) to dumb user feedback

6.18 Summary

7 Security and governance

PART 3: MAINTAINING

8 Monitoring and observability

9 Writing readable code

10 Tips and Troubleshooting

Overview

5 Hosting, scaling, and load testing

This chapter moves a RAG chatbot from a single-process prototype to a resilient, production-ready service. It frames statelessness as the guiding principle: any replica should handle any request, with durable data and configuration living outside the process. Containers become the vehicle for this reliability, packaging a consistent runtime that can boot identically on a laptop, an Azure Web App, or Kubernetes. The end goal is a repeatable, observable, and scalable system that survives real traffic rather than just working on a developer’s machine.

Practically, the chapter replaces local SQLite logs with a cloud database for durability, pins dependencies, and builds a slim, fast Docker image while keeping secrets out of the artifact via environment variables. After validating the container locally over WebSockets, the image is pushed to a private container registry and deployed to an Azure Web App with Application Insights enabled. The app’s configuration is injected via environment variables, and GitHub Actions is wired in for CI/CD that runs tests, builds the image, publishes to the registry, and deploys on every push. Small adjustments—like a test-mode flag that returns full answers for integration tests—ensure automated checks stay reliable as streaming behavior remains production-friendly.

To ensure the service performs under load, the chapter adds structured logging and uses Locust to simulate concurrent users against the public endpoint. Real-time logs and telemetry surface bottlenecks and guide targeted fixes: scale the App Service up or out for throughput, raise search service capacity to prevent throttling, and address model rate limits by adjusting quotas or usage patterns. The result is a disciplined loop—deploy, observe, load-test, tune, repeat—so capacity planning, cost control, and reliability are driven by data, turning “git push” into a safe, measurable path to production.

This flowchart shows the key stages of this chapter. We will start by packaging our app as a container, then deploy it to Azure’s app service, then test it remotely. Finally, we will perform load testing using logging and simulated users with Locust.

The RAG container keeps no secrets and writes nothing to its own disk; it simply streams prompts from the browser, fans out to AI Search and OpenAI for retrieval and generation, and drops a JSON record into Cosmos DB for posterity. All configuration arrives at launch via environment variables or a key vault. The dashed line to “Local Disk” drives home the point: local storage is strictly off-limits in a truly stateless service.

An overview of the new logging path: each answer that streams to the user is also packaged into a JSON document and sent over HTTPS to Cosmos DB, where it can be queried later to debug slow responses, audit model behaviour, or train future ranking tweaks..

In this section, we will get our web app’s URL, add it to our test website, open it, and send a query, then wait for the streaming answer to come back.

We will add logging to our file to catch any errors, then run Locust to simulate users and cause our app to fail. We will use the Locust dashboard and logs to find bottlenecks so we can fix them.

A Locust swarm batters the public Web App endpoint while Application Insights logs every request and latency spike. Those metrics show up in real time, giving you the data needed to nudge the App Service scale slider from one replica to many.

If we experience any failures during out Locust tests, we first look at our logs in our web app. Next, once we have identified the error, we fix it by scaling out that particular part of our system. For example, if we are getting rate limits from OpenAI, we increase the amount of requests we can send.

Summary

Rolling out a RAG app is easier when you pair an Azure Web App with GitHub Actions, because every code push magically turns into a live app.
By moving chat logs into Cosmos DB and pinning every package in requirements.txt, we turned our scripts into cloud-worthy code that can reboot safely and retain chat logs.
Setting up a CI/CD pipeline means tests run automatically on each commit, so broken code gets caught in the pipeline instead of embarrassing you in front of real users.
Adding friendly logging.info() calls throughout the code gives us play-by-play insight, so when something misbehaves under load we can trace the exact moment and function where it stumbles.
Locust lets us unleash a pretend stampede of users, revealing performance potholes in a safe sandbox, so we can prepare for the real users.
When bottlenecks appear, we learned to crank up AI Search tiers, boost our OpenAI token quotas, or spin up extra Web App instances—scaling only what’s necessary to keep response times snappy without lighting our budget on fire.

FAQ

Why does the chapter recommend containers on Azure Web App instead of serverless for a RAG chatbot?

Serverless sounds attractive but hits practical limits for RAG: WebSockets can be unreliable on some serverless platforms, cold starts add latency, and package size limits (for example AWS Lambda’s 250 MB unzipped) don’t fit typical Python stacks. A managed container host (Azure Web App for Containers) runs a stable, long-lived process that streams tokens reliably, eliminates cold starts, and lets you control the full runtime.

What does “stateless” mean here, and where should chat history live?

Stateless means any replica can serve any request and nothing important is kept on the container’s local disk. Persist chat logs externally (not in SQLite inside the container). The chapter moves logs to Azure Cosmos DB (NoSQL, serverless RU billing, global access, millisecond writes). Use environment variables for configuration and upsert JSON documents with fields like id, user_question, timestamp, final_answer, user_email, and agents_search_results.

How do I package the app into a Docker image safely and reproducibly?

- Use a slim, pinned base image (python:3.11-slim) and install needed OS libs (e.g., libgomp1 for OpenMP dependencies). - Pin Python dependencies in requirements.txt; install with pip --no-cache-dir. - EXPOSE 8000 and start with CMD ["python","app.py"]. - Keep secrets out of the image: put them in a .env file and add .env to .dockerignore. Inject at runtime with --env-file. - Build and run: docker build -t ch5-image . and docker run -it --env-file .env -p 8000:8000 ch5-image. On Windows you may need -p 8080:8000 locally.

How can I test the container locally end-to-end with WebSockets?

Use the static test page try_api_websockets.html from the repo. Point it to ws://localhost:8000 (or ws://localhost:8080 on Windows if you remapped). Launch your container with the env file, open the HTML locally, send a question, and verify the streamed tokens. Watch the container logs to diagnose issues quickly.

What are the key steps to deploy the container to Azure?

- Create an Azure Container Registry (enable Admin user), tag and push your image: docker tag and docker push to .azurecr.io/your-image. - Create an Azure Web App (Linux, Publish: Container). Use at least B1 for stable WebSockets. Enable Application Insights. - Configure environment variables in Settings → Environment Variables (OpenAI, AI Search, Cosmos DB, AUTOGEN_USE_DOCKER=0, test_environment as needed). - Set startup command to: python app.py. The app streams via its own asyncio loop; no Gunicorn/Uvicorn required.

How do I set up automatic builds and releases with GitHub Actions?

- In the Web App’s Deployment Center, select GitHub Actions and your repo/branch; point it at your ACR and image tag. - In GitHub, add environment Secrets for all variables your app needs (same as in Azure). - In the workflow YAML, set environment: Production (or your env name) so secrets are injected. - Add a test step that runs python -m pytest -s with test_environment="True". On each push, the workflow builds the image, runs tests, pushes to ACR, and deploys to the Web App if tests pass.

How should I scale to handle more users and bursty traffic?

- Scale out: App Service → Scale out (App Service plan) → Manual → increase instance count. Stateless containers make this easy. - Scale up: move to a larger App Service plan SKU for more CPU/RAM and better WebSocket handling. - Also raise external limits when needed: purchase Azure OpenAI quota/PTUs and increase Azure AI Search tier for higher QPS.

How do I load test a WebSocket-based RAG app with Locust?

- Install: pip install locust websocket-client. - Write locustfile.py that opens a wss:// connection to your Web App, sends a JSON payload (email, chat_history), receives the streamed answer, and records timings via events.request.fire. - Run locust and use the web UI at http://localhost:8089 to configure users and spawn rate. Monitor requests/sec, latency, and errors. Point both the script and UI to your wss:// Web App URL.

What telemetry should I add and where do I observe it during tests?

- Add logging.info calls in key functions (agent construction, embeddings, search, write path, RAGChat entry/exit) to trace flow and timings. - Enable Application Insights when creating the Web App. - Use Azure Portal → Monitoring → Log stream during Locust runs to see live behavior and pinpoint bottlenecks.

What are common failures under load and how do I fix them?

- 429 rate limits from OpenAI: move to Azure OpenAI in a nearby region and consider Provisioned Throughput Units; also reduce tokens (e.g., fewer retrieved chunks). - Slow or stalled responses: scale out App Service instances or scale up the plan for more CPU/RAM and stable sockets. - Azure AI Search throttling: raise the service tier and capacity in AI Search → Settings → Scale. Note: Load tests can get expensive—run short bursts, fix issues, and iterate.

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$47.99 $33.59

you save $14.40 (30%)

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

$47.99 $33.59

you save $14.40 (30%)

eBook

pdf, ePub, online

$47.99 $33.59

you save $14.40 (30%)

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more