Overview

5 Hosting, scaling, and load testing

This chapter moves a RAG chatbot from a single-process prototype to a resilient, production-ready service. It frames statelessness as the guiding principle: any replica should handle any request, with durable data and configuration living outside the process. Containers become the vehicle for this reliability, packaging a consistent runtime that can boot identically on a laptop, an Azure Web App, or Kubernetes. The end goal is a repeatable, observable, and scalable system that survives real traffic rather than just working on a developer’s machine.

Practically, the chapter replaces local SQLite logs with a cloud database for durability, pins dependencies, and builds a slim, fast Docker image while keeping secrets out of the artifact via environment variables. After validating the container locally over WebSockets, the image is pushed to a private container registry and deployed to an Azure Web App with Application Insights enabled. The app’s configuration is injected via environment variables, and GitHub Actions is wired in for CI/CD that runs tests, builds the image, publishes to the registry, and deploys on every push. Small adjustments—like a test-mode flag that returns full answers for integration tests—keep automated checks reliable while streaming behavior stays production-friendly.

To ensure the service performs under load, the chapter adds structured logging and uses Locust to simulate concurrent users against the public endpoint. Real-time logs and telemetry surface bottlenecks and guide targeted fixes: scale the App Service up or out for throughput, raise search service capacity to prevent throttling, and address model rate limits by adjusting quotas or usage patterns. The result is a disciplined loop—deploy, observe, load-test, tune, repeat—so capacity planning, cost control, and reliability are driven by data, turning “git push” into a safe, measurable path to production.

This flowchart shows the key stages of this chapter. We will start by packaging our app as a container, then deploy it to Azure App Service and test it remotely. Finally, we will perform load testing using logging and simulated users with Locust.
The RAG container keeps no secrets and writes nothing to its own disk; it simply streams prompts from the browser, fans out to AI Search and OpenAI for retrieval and generation, and drops a JSON record into Cosmos DB for posterity. All configuration arrives at launch via environment variables or a key vault. The dashed line to “Local Disk” drives home the point: local storage is strictly off-limits in a truly stateless service.
An overview of the new logging path: each answer that streams to the user is also packaged into a JSON document and sent over HTTPS to Cosmos DB, where it can be queried later to debug slow responses, audit model behaviour, or train future ranking tweaks.
In this section, we will get our web app’s URL, add it to our test website, open the page, send a query, and wait for the streaming answer to come back.
We will add logging to our application code to catch any errors, then run Locust to simulate users and push our app until it fails. We will use the Locust dashboard and the logs to find bottlenecks so we can fix them.
A Locust swarm batters the public Web App endpoint while Application Insights logs every request and latency spike. Those metrics show up in real time, giving you the data needed to nudge the App Service scale slider from one replica to many.
If we experience any failures during our Locust tests, we first look at the logs in our web app. Once we have identified the error, we fix it by scaling out that particular part of the system. For example, if we are hitting OpenAI rate limits, we increase the number of requests we are allowed to send.

Summary

  • Rolling out a RAG app is easier when you pair an Azure Web App with GitHub Actions, because every code push magically turns into a live app.
  • By moving chat logs into Cosmos DB and pinning every package in requirements.txt, we turned our scripts into cloud-worthy code that can reboot safely and retain chat logs.
  • Setting up a CI/CD pipeline means tests run automatically on each commit, so broken code gets caught in the pipeline instead of embarrassing you in front of real users.
  • Adding friendly logging.info() calls throughout the code gives us play-by-play insight, so when something misbehaves under load we can trace the exact moment and function where it stumbles.
  • Locust lets us unleash a pretend stampede of users, revealing performance potholes in a safe sandbox, so we can prepare for the real users.
  • When bottlenecks appear, we learned to crank up AI Search tiers, boost our OpenAI token quotas, or spin up extra Web App instances—scaling only what’s necessary to keep response times snappy without lighting our budget on fire.

FAQ

Why does the chapter recommend containers on Azure Web App instead of serverless for a RAG chatbot?
Serverless sounds attractive but hits practical limits for RAG: WebSockets can be unreliable on some serverless platforms, cold starts add latency, and package size limits (for example, AWS Lambda’s 250 MB unzipped limit) don’t fit typical Python stacks. A managed container host (Azure Web App for Containers) runs a stable, long-lived process that streams tokens reliably, eliminates cold starts, and lets you control the full runtime.
What does “stateless” mean here, and where should chat history live?
Stateless means any replica can serve any request and nothing important is kept on the container’s local disk. Persist chat logs externally (not in SQLite inside the container). The chapter moves logs to Azure Cosmos DB (NoSQL, serverless RU billing, global access, millisecond writes). Use environment variables for configuration, and upsert JSON documents with fields like id, user_question, timestamp, final_answer, user_email, and agents_search_results.
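As a minimal sketch of that write path, the helper below builds a log document with the field names listed above; the commented upsert uses the azure-cosmos SDK, with the database and container names being illustrative (your own names and connection settings would come from environment variables):

```python
import uuid
from datetime import datetime, timezone

def build_log_record(user_question, final_answer, user_email, agents_search_results):
    """Build the JSON document described in the chapter; field names match the text."""
    return {
        "id": str(uuid.uuid4()),
        "user_question": user_question,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "final_answer": final_answer,
        "user_email": user_email,
        "agents_search_results": agents_search_results,
    }

# Upserting with the azure-cosmos SDK (names here are illustrative):
#
#   import os
#   from azure.cosmos import CosmosClient
#   client = CosmosClient(os.environ["COSMOS_URL"], os.environ["COSMOS_KEY"])
#   container = client.get_database_client("ragchat").get_container_client("logs")
#   container.upsert_item(build_log_record(question, answer, email, results))
```

Because upsert_item is idempotent on id, a retried write after a transient failure does not create duplicate log entries.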
How do I package the app into a Docker image safely and reproducibly?
  • Use a slim, pinned base image (python:3.11-slim) and install the OS libraries you need (e.g., libgomp1 for OpenMP dependencies).
  • Pin Python dependencies in requirements.txt; install with pip --no-cache-dir.
  • EXPOSE 8000 and start with CMD ["python","app.py"].
  • Keep secrets out of the image: put them in a .env file and add .env to .dockerignore. Inject them at runtime with --env-file.
  • Build and run with docker build -t ch5-image . and docker run -it --env-file .env -p 8000:8000 ch5-image. On Windows you may need -p 8080:8000 locally.
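Putting those steps together, a minimal Dockerfile consistent with the list above might look like this (the app.py entry point and requirements.txt come from the chapter; everything else is a conventional sketch):

```dockerfile
# Slim, pinned base image
FROM python:3.11-slim

# OS library needed by OpenMP-backed Python packages
RUN apt-get update && apt-get install -y --no-install-recommends libgomp1 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy pinned dependencies first so this layer caches between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Application code; secrets stay in .env, which .dockerignore keeps out of the image
COPY . .

EXPOSE 8000
CMD ["python", "app.py"]
```

Copying requirements.txt before the rest of the code means dependency installation only reruns when the pins change, not on every code edit.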
How can I test the container locally end-to-end with WebSockets?
Use the static test page try_api_websockets.html from the repo. Point it to ws://localhost:8000 (or ws://localhost:8080 on Windows if you remapped the port). Launch your container with the env file, open the HTML page locally, send a question, and verify the streamed tokens. Watch the container logs to diagnose issues quickly.
What are the key steps to deploy the container to Azure?
  • Create an Azure Container Registry (enable the Admin user), then tag and push your image with docker tag and docker push to your registry’s .azurecr.io address.
  • Create an Azure Web App (Linux, Publish: Container). Use at least the B1 tier for stable WebSockets, and enable Application Insights.
  • Configure environment variables under Settings → Environment Variables (OpenAI, AI Search, Cosmos DB, AUTOGEN_USE_DOCKER=0, and test_environment as needed).
  • Set the startup command to python app.py. The app streams via its own asyncio loop, so no Gunicorn/Uvicorn is required.
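Since all configuration arrives through environment variables, a fail-fast check at startup surfaces a missing setting immediately instead of mid-request. A sketch, where the variable names other than AUTOGEN_USE_DOCKER and test_environment are illustrative stand-ins for your actual OpenAI, AI Search, and Cosmos DB settings:

```python
import os

# Illustrative names for the required settings; replace with your app's actual ones.
REQUIRED = ["OPENAI_API_KEY", "SEARCH_ENDPOINT", "COSMOS_URL"]

def load_settings(env=os.environ):
    """Fail fast at startup if any required setting is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    settings = {name: env[name] for name in REQUIRED}
    # Optional flags from the chapter, with safe defaults
    settings["AUTOGEN_USE_DOCKER"] = env.get("AUTOGEN_USE_DOCKER", "0")
    settings["test_environment"] = env.get("test_environment", "False")
    return settings
```

The same function works locally with --env-file and in Azure, where the Web App injects its configured environment variables into the container.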
How do I set up automatic builds and releases with GitHub Actions?
  • In the Web App’s Deployment Center, select GitHub Actions and your repo/branch, and point it at your ACR and image tag.
  • In GitHub, add environment Secrets for all the variables your app needs (the same ones as in Azure).
  • In the workflow YAML, set environment: Production (or your environment name) so the secrets are injected.
  • Add a test step that runs python -m pytest -s with test_environment="True".
On each push, the workflow builds the image, runs the tests, pushes to ACR, and deploys to the Web App if the tests pass.
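The workflow the Deployment Center generates varies, but a hand-written sketch of the same test-build-push-deploy shape could look like the following. The secret names (ACR_LOGIN_SERVER, ACR_USERNAME, ACR_PASSWORD, WEBAPP_NAME) and the ch5-image tag are assumptions for illustration:

```yaml
name: build-test-deploy
on:
  push:
    branches: [main]

jobs:
  build-test-deploy:
    runs-on: ubuntu-latest
    environment: Production   # makes that environment's secrets available to steps
    steps:
      - uses: actions/checkout@v4
      - name: Run tests in test mode
        env:
          test_environment: "True"
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # repeat for each app setting
        run: |
          pip install -r requirements.txt
          python -m pytest -s
      - name: Log in to ACR
        uses: docker/login-action@v3
        with:
          registry: ${{ secrets.ACR_LOGIN_SERVER }}
          username: ${{ secrets.ACR_USERNAME }}
          password: ${{ secrets.ACR_PASSWORD }}
      - name: Build and push image
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ secrets.ACR_LOGIN_SERVER }}/ch5-image:${{ github.sha }}
      - name: Deploy to Azure Web App
        uses: azure/webapps-deploy@v2
        with:
          app-name: ${{ secrets.WEBAPP_NAME }}
          images: ${{ secrets.ACR_LOGIN_SERVER }}/ch5-image:${{ github.sha }}
```

Because the test step runs before the push and deploy steps, a failing test stops the release before anything reaches the registry.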
How should I scale to handle more users and bursty traffic?
  • Scale out: App Service → Scale out (App Service plan) → Manual → increase the instance count. Stateless containers make this easy.
  • Scale up: move to a larger App Service plan SKU for more CPU/RAM and better WebSocket handling.
  • Raise external limits when needed: purchase Azure OpenAI quota/PTUs and increase your Azure AI Search tier for higher QPS.
How do I load test a WebSocket-based RAG app with Locust?
  • Install the tooling: pip install locust websocket-client.
  • Write a locustfile.py that opens a wss:// connection to your Web App, sends a JSON payload (email, chat_history), receives the streamed answer, and records timings via events.request.fire.
  • Run locust and use the web UI at http://localhost:8089 to configure the number of users and the spawn rate. Monitor requests/sec, latency, and errors. Point both the script and the UI at your wss:// Web App URL.
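A sketch of such a locustfile.py, meant to be run via the locust CLI rather than directly. The URL placeholder, the example question, and the assumption that the server signals a finished answer by closing the socket are all illustrative; adapt them to your app's actual protocol:

```python
import json
import time

import websocket  # from websocket-client
from locust import User, between, task

WS_URL = "wss://<your-web-app>.azurewebsites.net"  # placeholder for your Web App URL

class RAGChatUser(User):
    wait_time = between(1, 3)

    @task
    def ask_question(self):
        payload = json.dumps({
            "email": "loadtest@example.com",
            "chat_history": [{"role": "user", "content": "What is RAG?"}],
        })
        start = time.perf_counter()
        ws, exc, received = None, None, 0
        try:
            ws = websocket.create_connection(WS_URL)
            ws.send(payload)
            # Read streamed tokens until the server closes the socket
            # (adjust if your app sends an explicit end-of-answer marker).
            while True:
                chunk = ws.recv()
                if not chunk:
                    break
                received += len(chunk)
        except Exception as e:
            exc = e
        finally:
            if ws is not None:
                ws.close()
        # Report the full question-to-last-token time to Locust's statistics
        self.environment.events.request.fire(
            request_type="WSS",
            name="chat",
            response_time=(time.perf_counter() - start) * 1000,
            response_length=received,
            exception=exc,
        )
```

Firing events.request.fire manually is what makes the custom WebSocket traffic show up in the Locust UI's requests/sec, latency, and failure charts alongside any HTTP traffic.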
What telemetry should I add, and where do I observe it during tests?
  • Add logging.info calls in key functions (agent construction, embeddings, search, the write path, RAGChat entry/exit) to trace flow and timings.
  • Enable Application Insights when creating the Web App.
  • Use Azure Portal → Monitoring → Log stream during Locust runs to watch live behavior and pinpoint bottlenecks.
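For instance, instrumenting a function with entry/exit timing might look like this sketch, where search_documents is a hypothetical stand-in for one of the app's real retrieval functions:

```python
import logging
import time

# Timestamps plus function names make it easy to correlate the Azure Log stream
# with latency spikes seen in the Locust dashboard.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(funcName)s: %(message)s",
)
logger = logging.getLogger("ragchat")

def search_documents(query: str) -> list[str]:
    """Hypothetical retrieval step, instrumented the way the chapter suggests."""
    logger.info("search started for query of %d chars", len(query))
    start = time.perf_counter()
    results = ["chunk-1", "chunk-2"]  # stand-in for the actual AI Search call
    logger.info("search finished in %.1f ms", (time.perf_counter() - start) * 1000)
    return results
```

Using %-style lazy formatting in the log calls avoids building the message string when the log level would discard it anyway.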
What are common failures under load, and how do I fix them?
  • 429 rate limits from OpenAI: move to Azure OpenAI in a nearby region and consider Provisioned Throughput Units; also reduce token usage (e.g., retrieve fewer chunks).
  • Slow or stalled responses: scale out App Service instances or scale up the plan for more CPU/RAM and stable sockets.
  • Azure AI Search throttling: raise the service tier and capacity under AI Search → Settings → Scale.
Note: load tests can get expensive, so run short bursts, fix issues, and iterate.
