Building Reliable AI Systems

Production-ready methods to reduce hallucination, bias, and more

Rush Shahani

MEAP began October 2024
Last updated February 2026
Publication in June 2026 (estimated)

ISBN 9781633436732
325 pages (estimated)

Included with a Manning Online subscription

printed in black & white

available in Korean, Russian


table of contents

1 AI Reliability: Building LLMs for the Real World

1.1 The current AI revolution: Reasoning

1.2 The tangible impact of LLMs in the real world

1.2.1 Legal industry transformation

1.2.2 Customer service revolution

1.2.3 Programming and development

1.2.4 Agentic AI – Systems that can take action

1.3 Understanding hallucinations and Reliable AI

1.3.1 When AI lies convincingly

1.3.2 What exactly is a hallucination?

1.3.3 What is Reliable AI?

1.4 The AI reliability framework

1.4.1 Layer 1: Reliable outputs (Chapters 2-5)

1.4.2 Layer 2: Reliable agents (Chapters 6-8)

1.4.3 Layer 3: Reliable operations (Chapters 9-11)

1.5 The reliability toolbox

1.6 Why reliable AI systems matter now

1.7 Requirements for Following Along

1.8 Summary

1.9 References

PART 1: RELIABLE OUTPUTS

2 Generating Trustworthy Responses with Prompt Engineering

2.1 Tailoring LLM Settings for Maximum Reliability

2.1.1 Optimizing temperature for predictable outputs

2.2 Limit output and turns for reduction of hallucination risk

2.3 Applying frequency and presence penalties for balanced content

2.4 Minimizing intrinsic randomness for stable performance

2.5 Foundations of prompt engineering for reliable LLMs

2.5.1 Designing components of a prompt for reliability

2.5.2 Crafting basic prompts with a focus on dependability

2.6 Prompt engineering techniques for preventing hallucinations

2.6.1 Zero-shot prompting

2.6.2 Few-shot prompting for contextual stability

2.6.3 Chain-of-thought prompting for transparent reasoning

2.6.4 Automatic chain of thought (Auto-CoT) for scalable reasoning

2.6.5 Self-consistency for cross-verification

2.6.6 Tree-of-thought (ToT) prompting for structured decision making

2.7 Project: Creating a reliable weather assistant with OpenAI’s function calls

2.7.1 Building a weather assistant

2.7.2 Getting started and defining the weather function

2.7.3 Integrating the chatbot

2.8 Summary

2.9 References

3 Grounding Outputs with RAG

3.1 What is retrieval augmented generation?

3.2 RAG system architecture

3.2.1 The role of the retriever

3.2.2 The factual generators

3.2.3 RAG system flow

3.2.4 Example: E-commerce shopping assistant

3.3 Reducing hallucinations with RAG

3.3.1 Grounding the language model

3.3.2 Asking a question about climate change

3.3.3 Advantages of RAG in reducing hallucinations

3.3.4 Critical applications

3.3.5 Enhancing transparency

3.4 Data preparation and reliable indexing for RAG systems

3.4.1 Introduction to LangChain

3.4.2 Structuring product catalogs

3.4.3 Creating the searchable vector index

3.4.4 Implementing a FAISS Index

3.5 Building an effective RAG system

3.5.1 Building an e-commerce chatbot powered by RAG

3.5.2 Processing user queries

3.5.3 Crafting an accurate response using an LLM

3.6 Evaluating and optimizing RAG systems

3.6.1 The role of evaluator LLMs in assessing hallucinations

3.6.2 Crafting evaluation datasets

3.6.3 Practical metrics for RAG evaluation

3.6.4 Implementing RAGAS evaluation

3.6.5 Reducing hallucinations by filtering retrievals and leveraging metadata

3.6.6 Embedding chunk sizes and model choice

3.7 Advanced techniques for improving RAG

3.7.1 Fine-tuning embeddings for in-domain relevance

3.7.2 Incorporating metadata to re-establish context

3.7.3 Implementing hybrid retrieval

3.7.4 Enhancing queries via rephrasing and augmentation

3.7.5 Implementing query routing

3.7.6 GraphRAG: Knowledge graph-enhanced retrieval

3.7.7 Agentic RAG: Agent-controlled retrieval

3.8 Building a RAG system for an ecommerce chatbot with LangChain

3.8.1 Part 1: Installation of required libraries

3.8.2 Part 2: Importing libraries

3.8.3 Part 3: Document loading

3.8.4 Part 4: Document transformers

3.8.5 Part 5: Text embedding

3.8.6 Part 6: Vector stores

3.8.7 Part 7: Preparing the LLM model

3.8.8 Part 8: Retrievers

3.8.9 Part 9: Retrieval QA chain

3.8.10 Part 10: Testing the chatbot

3.8.11 Part 11: Evaluating and preventing hallucinations with LangChain

3.8.12 Challenges and adaptations of the chatbot

3.9 Summary

3.10 References

4 Embeddings and Vector Search

4.1 What are embeddings (really)?

4.1.1 A mental model: Embeddings as coordinates in meaning space

4.1.2 Why the model matters

4.2 Embedding models in production

4.2.1 Commercial models: Powerful but opaque

4.2.2 Open-source models: Flexible and transparent

4.2.3 Domain-specific models: Purpose-built precision

4.2.4 How to choose the right model

4.2.5 Embeddings across industries

4.2.6 Case study: Learning from Airbnb’s embedding journey

4.3 Beyond simple vectors: Hybrid and multi-stage retrieval

4.3.1 The limitations of pure vector search

4.3.2 Hybrid retrieval: Combining dense and sparse approaches

4.3.3 Multi-stage retrieval: Building precision through layers

4.3.4 Building a hybrid retriever: Hands-on implementation for an employee chatbot

4.4 Essential retrieval optimizations with embeddings and RAG

4.4.1 Metadata filtering: Beyond pure semantic search

4.4.2 Chunking strategies: Finding the right granularity

4.5 Exploring vector storage and databases

4.5.1 HNSW: The algorithm powering modern vector search

4.5.2 FAISS: Industry-standard vector indexing

4.5.3 Vector databases: Complete vector storage solutions

4.5.4 Vector database project: Building a healthcare policy assistant with Pinecone

4.6 Practical challenges: Drift and compression

4.6.1 Diving deeper into vector compression

4.7 Summary

4.8 References

5 Fine-Tuning LLMs for Improved Performance

5.1 Choosing the right approach: Prompting, RAG, or fine-tuning?

5.1.1 When to use each approach: A practical decision framework

5.1.2 Hybrid approach: Combining RAG and fine-tuning

5.2 Case studies: Finetuning in the real world

5.2.1 Medicine: Med-PaLM 2

5.2.2 Finance: BloombergGPT and FinGPT

5.2.3 Code generation: Codex

5.3 Understanding the finetuning process

5.3.1 The data preparation phase

5.3.2 Fine-tuning closed source models (OpenAI)

5.3.3 Understanding key fine-tuning approaches

5.3.4 Building a customer support assistant with Meta’s LLaMA open-source model

5.4 Knowledge Distillation for LLMs

5.4.1 How knowledge distillation works in LLMs

5.5 Summary

5.6 References

PART 2: RELIABLE AGENTS

6 Creating Effective AI Agents

6.1 Why do we need agents?

6.1.1 Static knowledge

6.1.2 Inability to act

6.1.3 No workflow management

6.1.4 How AI agents solve these challenges

6.1.5 Broader applications of AI agents

6.2 Core components of reliable agents

6.2.1 Memory: Context that sticks

6.2.2 Balancing memory systems with LangChain

6.2.3 Active Tool Usage vs. Passive Retrieval

6.2.4 Strategies to reduce hallucinations with tools

6.2.5 Decision making: Making AI agents intelligent

6.3 Agentic RAG for reliable agents

6.3.1 Project: Building an agentic RAG system with LangChain

6.3.2 Setting up dependencies

6.3.3 Building the knowledge base

6.3.4 Creating interactive tools

6.3.5 Configuring the agent’s behavior

6.3.6 Implementing transparent decision making

6.3.7 Managing conversation context

6.3.8 Making it work in production

6.4 Advanced techniques for preventing hallucinations and improving accuracy in agents

6.4.1 Cross-referencing

6.4.2 Guardrails for security and preventing hallucinations

6.4.3 Reinforcement Learning from AI Feedback (RLAIF)

6.4.4 Transparency in multi-step reasoning

6.4.5 Case studies for agents in production

6.5 Summary

6.6 References

7 Tool Integration and MCP

7.1 What is MCP?

7.1.1 The hidden complexity: The N×M problem

7.1.2 The solution: Model Context Protocol (MCP)

7.2 Building your first real MCP tool with a CSV-powered product catalog

7.2.1 Loading your product catalog

7.2.2 Setting up the MCP server

7.2.3 Running and testing your server

7.3 Teaching your AI to use MCP tools

7.3.1 How models learn what tools they can use

7.3.2 A complete example: model → MCP → answer

7.3.3 What you didn’t have to write

7.3.4 Tool design becomes interface design

7.4 Adding a second tool: Inventory status

7.4.1 How models chain tools

7.4.2 Tool descriptions guide model behavior

7.5 Handling failures gracefully

7.5.1 Catching and communicating errors

7.5.2 Why structured responses matter

7.5.3 Handling empty results vs. errors

7.5.4 Timeout and rate limiting

7.6 Summary

7.7 References

8 Multi-Agent Systems

8.1 Multi-agent systems

8.1.1 Why multi-agent architecture matters

8.1.2 Introducing LangGraph: The multi-agent framework

8.1.3 Building intelligent agent coordination

8.1.4 Real-world example: ShopBot multi-agent system

8.1.5 Running your multi-agent system

8.1.6 Adding vision: Image-based product search with VisionAgent

8.1.7 What we built—why it matters

8.2 Testing multi-agent workflows—not just agents

8.2.1 Category 1: Blended queries and multi-model interactions

8.2.2 Category 2: Ambiguous intent and robust routing decisions

8.2.3 Category 3: Agent failure and resilience verification

8.2.4 Category 4: State flow and information preservation across workflow

8.2.5 Regression: “Are you moving forward—without breaking what worked?”

8.3 Alternative frameworks: CrewAI

8.3.1 LangGraph vs CrewAI: Different philosophies

8.3.2 ShopBot with CrewAI

8.3.3 When to use which multi-agent framework

8.3.4 The framework landscape

8.4 Summary

8.5 References

PART 3: RELIABLE OPERATIONS

9 Evaluation and Performance for LLMs and Agents

9.1 Identifying and measuring hallucinations

9.1.1 Four steps to identify and measure hallucinations

9.1.2 FActScore: Fine-grained factual evaluation

9.1.3 ROUGE metric for summarization evaluation

9.1.4 LLM as a judge: A holistic approach

9.1.5 Red teaming and stress testing

9.1.6 Using monitoring frameworks for hallucination detection

9.2 Essential architectural patterns for performance

9.2.1 Token streaming: Presenting answers incrementally or all at once

9.2.2 Handling surges with batching—System-level vs. OpenAI’s Batch API

9.2.3 Caching for efficiency

9.2.4 Multi-model fallback: Matching each query to the right model

9.2.5 Project: Building an e-commerce LLM service with batching, caching, and model fallback

9.2.6 Further improvements

9.3 Evaluating agent performance

9.3.1 Core metrics for evaluating LLMs

9.3.2 Evaluating agents: Beyond traditional LLM metrics

9.4 Summary

9.5 References

10 Deploying and Monitoring

10.1 Introducing LLMOps

10.2 Serving LLMs: Hosted APIs vs. open-source models

10.2.1 Using hosted APIs

10.2.2 The open-source alternative

10.2.3 The hybrid solution: Best of both worlds

10.3 Building LLM-native monitoring systems

10.3.1 What really matters: The four questions

10.3.2 Logging what actually matters

10.3.3 Setting up alerts that actually help

10.3.4 Catching cost explosions before they hurt

10.3.5 Building dashboards that drive action

10.3.6 Output quality monitoring

10.4 User experience and feedback monitoring

10.4.1 Explicit feedback collection

10.4.2 Implicit feedback signals

10.4.3 Building actionable feedback loops

10.5 Ensuring high-quality outputs in production

10.5.1 The three-pillar quality framework

10.5.2 Prompt engineering for consistent quality

10.5.3 Continuous quality monitoring with automated testing

10.6 Observability in practice: Introducing Langfuse with a real-world case study

10.6.1 Case Study: How Huntr uses Langfuse to power the AI Resume Builder

10.7 Summary

11 Bias, Privacy and Responsible AI

11.1 The responsible AI imperative

11.1.1 Regulatory pressure is accelerating

11.1.2 User expectations have shifted

11.1.3 Business risks have multiplied

11.1.4 Real examples of AI bias in production

11.1.5 The four failure modes

11.1.6 The responsible AI defense system

11.2 Data layer: Where bias begins

11.2.1 The fine-tuning bias trap

11.2.2 Detecting bias in chat logs

11.2.3 The name experiment

11.2.4 Three proven bias mitigation strategies

11.3 Model layer: Where bias evolves

11.3.1 Why this matters for open-source

11.3.2 Example: Adding fairness to a LoRA fine-tuning loop

11.3.3 Anthropic’s constitutional AI: LLM-as-judge at training scale

11.4 Safety layer: Your last line of defense

11.4.1 Multi-layered safety architecture

11.4.2 Layer 3: Enhanced safety with commercial APIs

11.5 Privacy layer: Protecting personal data

11.5.1 Why LLM privacy failures are uniquely dangerous

11.5.2 Building Sensitive Data Detection

11.5.3 Understanding HIPAA: Healthcare privacy protection

11.5.4 Understanding GDPR: European data protection

11.6 Real-world project: SafeMedAssist

11.6.1 Why a medical AI assistant?

11.6.2 Professional testing with LangTest

11.6.3 Production Deployment Considerations

11.6.4 The business case for responsible AI

11.6.5 Completing the reliability framework

11.6.6 The Six Principles of Reliable AI

11.6.7 Final Thoughts

11.7 Summary

11.8 References

Overview

5 Fine-Tuning LLMs for Improved Performance

Modern LLMs are powerful generalists but often miss the mark in specialized domains or demanding workflows. This chapter explains how to choose among prompting, Retrieval-Augmented Generation (RAG), and fine-tuning, using a pragmatic decision framework: start by testing whether improved prompting meets the bar; if knowledge changes frequently, prefer RAG; and when you need consistent style, strict formats, or deep domain behavior, fine-tune. It also highlights a common production pattern that combines a fine-tuned core for reliable reasoning with RAG for up-to-date facts, yielding systems that are both accurate and current.

Through case studies in medicine (Med-PaLM 2), finance (BloombergGPT and FinGPT), and code generation (Codex), the chapter shows that domain specialization reliably boosts performance—provided the data is high quality. It details the fine-tuning workflow across three phases—data preparation, model selection, and training—emphasizing representative, well-formatted datasets, careful splits, and avoidance of leakage and bias. Both closed-source (e.g., managed APIs) and open-source paths are covered, along with trade-offs between full fine-tuning and parameter-efficient methods like LoRA and QLoRA. A hands-on project builds a customer support assistant with LLaMA-2-7B using QLoRA on consumer hardware, illustrating practical setup, formatting conventions, and sensible hyperparameters.

The chapter then moves from training to evaluation and deployment: validating with held-out data, monitoring checkpoints, and testing for tone, structure, reduced hallucinations, and adherence to policy. For real-world scalability, it introduces knowledge distillation—training a smaller student model to imitate a stronger teacher’s outputs—delivering major gains in cost and latency while preserving task quality. Best practices include starting with a strong teacher, collecting diverse, representative prompts and responses, evaluating frequently against realistic benchmarks, and retraining as needs evolve. Together, fine-tuning, RAG, and distillation form a practical toolkit for building reliable, efficient, and domain-accurate AI systems.

The Fine-Tuning Process - Training a pre-trained model on specialized data to adapt its behavior for specific domains or tasks.

The Different Approaches: Prompting, Fine-Tuning, RAG (Retrieval Augmented Generation) - A comparison of techniques for adapting LLMs to specialized tasks.

The Fine-tuning process for LLMs: Data Preparation, Model Selection, and Training

The Fine-Tuning LLM Data Preparation Workflow: the steps needed to prepare your data for fine-tuning

Our Fine-tuning setup for building a customer support chatbot

The Knowledge Distillation Process

Summary

Fine-tuning is a powerful method for adapting LLMs to specialized tasks when prompting and RAG alone aren’t sufficient.
Prompting is quick and easy for basic tasks, while RAG is ideal for dynamic, factual information retrieval. Fine-tuning provides deep domain customization by updating the model’s weights.
A hybrid of RAG + fine-tuning often provides the best of both worlds: live knowledge access and consistent reasoning.
Successful fine-tuning depends on high-quality, well-formatted training data. Use prompt-completion pairs or structured chat examples.
Parameter-efficient techniques like LoRA and QLoRA allow you to fine-tune large models on modest hardware with strong results.
You can fine-tune both open-source models (e.g., LLaMA 2) and proprietary ones (e.g., OpenAI's GPT-4o) depending on your use case.
Knowledge distillation enables you to train a smaller model to mimic a larger fine-tuned model—reducing cost and latency for deployment.
Fine-tuning opens the door to building domain-aware assistants, structured generation tools, and highly customized LLM applications.

FAQ

How do I choose between prompting, RAG, and fine-tuning?

Use this quick decision flow:
1) Prompting test: improve your prompt and try 10 real cases; if ≥8 meet quality, stop at prompting (fastest, cheapest).
2) Data freshness: if facts change (daily/weekly/monthly), use RAG to fetch up-to-date info at query time.
3) Consistency and domain nuance: if you need strict style/format adherence or deep domain behavior on stable knowledge, fine-tune.
In production, a hybrid often wins: fine-tune for behavior and use RAG for fresh facts.
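The decision flow above can be sketched as a small function. This is only an illustration: the function name, its parameters, and the fallback branch are assumptions made for the example, while the 8-of-10 threshold comes from the text.

```python
def choose_approach(prompt_passes: int,
                    prompt_trials: int = 10,
                    facts_change: bool = False,
                    needs_strict_behavior: bool = False) -> str:
    """Return 'prompting', 'rag', 'fine-tuning', or 'hybrid'.

    Hypothetical helper that mirrors the three-step flow in the text.
    """
    # Step 1: prompting test. If at least 80% of real cases pass, stop here.
    if prompt_trials > 0 and prompt_passes / prompt_trials >= 0.8:
        return "prompting"
    # Production pattern: fine-tune for behavior, RAG for fresh facts.
    if facts_change and needs_strict_behavior:
        return "hybrid"
    # Step 2: facts that change over time point to RAG.
    if facts_change:
        return "rag"
    # Step 3: strict style/format or deep domain behavior points to fine-tuning.
    if needs_strict_behavior:
        return "fine-tuning"
    # Assumption: when no signal applies, keep iterating on the prompt.
    return "prompting"


if __name__ == "__main__":
    print(choose_approach(prompt_passes=5, facts_change=True))
```

In practice the two boolean inputs are judgment calls rather than measurements, so a function like this is best read as a checklist, not an oracle.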

What exactly is fine-tuning and what gains can I expect?

Fine-tuning updates model weights on domain- or task-specific data so the model becomes a specialist. It improves precision, tone and style control, and format compliance, and it reduces hallucinations in your domain. Evidence: OpenAI reported that 50–100+ high-quality examples yield clear gains, and a few hundred can dramatically improve task performance versus zero-shot base models. Trade-offs: higher cost and effort, and the model's knowledge stays static until it is retrained.

What are the core phases of the fine-tuning workflow?

Three phases:
1) Data Preparation: collect, clean, balance, annotate, and format representative examples.
2) Model Selection: pick a base model and a method (full FT, LoRA, QLoRA) that fits your hardware and goals.
3) Training and Evaluation: configure hyperparameters, run jobs, monitor metrics/checkpoints, validate to avoid overfitting, and prepare for deployment.

How should I collect and curate high-quality training data?

- Use real, domain-specific text you have rights to (e.g., support logs, FAQs, legal/contracts, code).
- Add human annotation where needed (labels, rankings, style corrections).
- Prioritize quality over quantity; involve domain experts.
- Remove incorrect facts and sensitive data; mitigate bias and ensure coverage across user intents/classes.
- Match real deployment inputs/outputs (tone, jargon, formats).
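A minimal sketch of the cleaning part of that checklist, assuming each example is a dict with `prompt` and `completion` keys (a hypothetical schema chosen for illustration). The email pattern is deliberately crude and stands in for whatever sensitive-data detection your domain needs; it is not a complete PII filter.

```python
import re

# Crude email matcher, used only to illustrate redaction (assumption:
# real pipelines need far broader sensitive-data detection).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def curate(examples):
    """Drop empty and duplicate examples and redact email addresses."""
    seen = set()
    cleaned = []
    for ex in examples:
        prompt = ex["prompt"].strip()
        completion = ex["completion"].strip()
        if not prompt or not completion:
            continue  # skip examples with a missing side
        prompt = EMAIL_RE.sub("[EMAIL]", prompt)
        completion = EMAIL_RE.sub("[EMAIL]", completion)
        key = (prompt.lower(), completion.lower())
        if key in seen:
            continue  # skip case-insensitive duplicates
        seen.add(key)
        cleaned.append({"prompt": prompt, "completion": completion})
    return cleaned
```

Checks for factual correctness, bias, and intent coverage usually need domain experts or labeled data, so they are left out of this sketch.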

How do I format training examples and respect token limits?

- Match your platform’s expected format: prompt–completion pairs or chat-style messages.
- Keep prompts and outputs consistent (role tags, JSON schemas, stop sequences).
- Stay well under the model’s context length; split long docs into smaller examples.
- For OpenAI, use JSONL with consistent roles and, if needed, explicit stop tokens.
Consistent formatting strongly improves reliability.
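As an illustration of the chat-style JSONL format, the sketch below builds one JSON object per line, each with a `messages` list of role/content pairs. The helper names are made up for this example, and the validation rules (known roles, non-empty content, final assistant turn) are a reasonable minimum rather than any platform's official checker.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def to_chat_example(system, user, assistant):
    """Build one chat-format training example."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}


def validate_example(example):
    """Check roles, non-empty content, and that the assistant speaks last."""
    messages = example.get("messages", [])
    if not messages:
        return False
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES or not message.get("content"):
            return False
    return messages[-1]["role"] == "assistant"


def to_jsonl(examples):
    """Serialize valid examples as JSONL: one JSON object per line."""
    return "\n".join(
        json.dumps(ex, ensure_ascii=False)
        for ex in examples
        if validate_example(ex)
    )
```

Token-length checks are omitted here because they depend on the tokenizer of the model you are tuning.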

How do I evaluate, monitor, and avoid overfitting?

- Create a validation set (10–20%) with the same distribution as training; prevent data leakage.
- Track validation loss/metrics during training; stop or select the best checkpoint when validation worsens.
- For managed APIs, supply validation files and inspect per-epoch checkpoints.
- Evaluate qualitative behavior too: tone, format adherence, factuality, and hallucination rates on real-world scenarios.
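The first two points can be sketched in plain Python. The function names, the default 10% split, and the `patience` parameter are assumptions for illustration; examples are assumed to be hashable (for instance strings), and exact duplicates are removed before splitting so the same example cannot land in both sets.

```python
import random


def train_val_split(examples, val_fraction=0.1, seed=42):
    """Shuffle and split examples, deduplicating first to limit leakage."""
    unique = list(dict.fromkeys(examples))  # keeps first occurrence, preserves order
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_val = max(1, int(len(unique) * val_fraction))
    return unique[n_val:], unique[:n_val]


def best_checkpoint(val_losses, patience=2):
    """Return the index of the lowest validation loss seen before stopping.

    Scanning stops once the loss has failed to improve for `patience`
    consecutive checkpoints.
    """
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss
        elif i - best_idx >= patience:
            break
    return best_idx
```

A random split only approximates "same distribution"; if your data has distinct intents or classes, a stratified split is the safer choice.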

Full fine-tuning vs LoRA vs QLoRA: which should I pick?

- Full fine-tuning: highest performance and flexibility, but requires large VRAM/multi-GPU and risks catastrophic forgetting.
- LoRA: train small adapters (≈0.1–1% of params) for 95–99% of full-FT quality at a fraction of cost.
- QLoRA: LoRA plus 4-bit quantized base weights; enables big models on limited GPUs with ~90–95% of full-FT performance.
Choose full FT if you have heavy compute and need maximal adaptation; otherwise prefer LoRA/QLoRA.
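A back-of-the-envelope calculation shows where the savings come from. A LoRA adapter of rank r on a d_out × d_in weight matrix adds r × (d_in + d_out) trainable parameters. The numbers below assume a 7B-class model with hidden size 4096, 32 layers, adapters on two projection matrices per layer, and roughly 6.74 billion base parameters; these figures are assumptions for the example, and the helper names are hypothetical.

```python
def lora_trainable_params(d_in, d_out, rank, n_matrices):
    """Parameters added by rank-`rank` adapters on `n_matrices` weight matrices."""
    return rank * (d_in + d_out) * n_matrices


def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    base_params = 6.74e9  # assumed size of a 7B-class model
    adapters = lora_trainable_params(4096, 4096, rank=8, n_matrices=2 * 32)
    print(f"trainable adapter params: {adapters:,}")
    print(f"share of base model: {adapters / base_params:.4%}")
    print(f"fp16 base weights:  {weight_memory_gb(base_params, 16):.2f} GB")
    print(f"4-bit base weights: {weight_memory_gb(base_params, 4):.2f} GB")
```

With a small rank on only two matrices per layer, the trainable share can fall below the ≈0.1–1% range quoted above; higher ranks or more target modules push it up. The memory estimate covers weights only, not activations, optimizer state, or quantization overhead.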

How do I fine-tune a closed-source model with OpenAI’s API?

Steps:
1) Prepare chat-format JSONL examples (system/user/assistant messages).
2) Upload the file with purpose=fine-tune.
3) Create a fine-tuning job (choose model, e.g., gpt-4o-mini; set n_epochs and optional validation).
4) Monitor events and checkpoints.
5) Use the returned fine-tuned model name (ft:...) in Chat Completions.
Result: consistent tone/policies/behaviors without long prompts.
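Steps 2 and 3 might look like the sketch below. It assumes `client` is an instance of `openai.OpenAI()` from the official Python SDK (v1 or later) and uses its `files.create` and `fine_tuning.jobs.create` methods. The function name `start_fine_tune` and its defaults are hypothetical, the exact base model identifier depends on what your account can fine-tune, and newer SDK versions may prefer a different way of passing hyperparameters, so check the current API reference before relying on this.

```python
def start_fine_tune(client, training_path, base_model,
                    n_epochs=3, validation_path=None):
    """Upload JSONL data and create a fine-tuning job; return the job id.

    `client` is assumed to be an openai.OpenAI() instance.
    """
    # Step 2: upload the training file with purpose="fine-tune".
    with open(training_path, "rb") as f:
        training_file = client.files.create(file=f, purpose="fine-tune")

    job_args = {
        "training_file": training_file.id,
        "model": base_model,
        "hyperparameters": {"n_epochs": n_epochs},
    }

    # Optional: a validation file lets the service report validation metrics.
    if validation_path:
        with open(validation_path, "rb") as f:
            validation_file = client.files.create(file=f, purpose="fine-tune")
        job_args["validation_file"] = validation_file.id

    # Step 3: create the fine-tuning job.
    job = client.fine_tuning.jobs.create(**job_args)
    return job.id

# Steps 4 and 5 (not run here): poll client.fine_tuning.jobs.retrieve(job_id)
# until the job finishes, then pass the job's fine_tuned_model name as the
# `model` argument of a Chat Completions request.
```

Passing the client in as an argument keeps the function easy to test with a stub and avoids hard-coding credentials.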

How do I fine-tune an open-source model (e.g., LLaMA-2) with QLoRA?

Outline:
1) Load a chat-tuned base (e.g., LLaMA-2-7B) with 4-bit quantization (bitsandbytes).
2) Add LoRA adapters (e.g., r=8, lora_alpha=16, target q_proj/v_proj, dropout=0.1).
3) Prepare a representative dataset (e.g., Bitext support) and format for the model’s instruction tokens ([INST]…[/INST]).
4) Train with SFTTrainer using small batch + gradient accumulation; fp16 for efficiency.
5) Save and deploy by loading base weights plus PEFT adapters.
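Two pieces of that outline are easy to show without a GPU: the instruction formatting from step 3 and the batch arithmetic from step 4. The sketch assumes the LLaMA 2 chat template with `[INST]` and `<<SYS>>` markers; the function names and the settings dict are illustrative, with keys chosen to match the argument names of PEFT's `LoraConfig` and values taken from the outline. Some tokenizers add the `<s>` token themselves, in which case you would drop it from the string.

```python
# Adapter settings from the outline, written as a plain dict (illustrative).
LORA_SETTINGS = {
    "r": 8,
    "lora_alpha": 16,
    "target_modules": ["q_proj", "v_proj"],
    "lora_dropout": 0.1,
}


def format_llama2_example(user_message, assistant_reply, system_prompt=None):
    """Wrap one training pair in LLaMA 2 chat-style instruction tokens."""
    if system_prompt:
        user_message = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message}"
    return f"<s>[INST] {user_message} [/INST] {assistant_reply} </s>"


def effective_batch_size(per_device_batch, grad_accum_steps, n_devices=1):
    """Examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * n_devices


if __name__ == "__main__":
    print(format_llama2_example("Where is my order?", "Let me check that for you."))
    print(effective_batch_size(per_device_batch=4, grad_accum_steps=4))
```

Gradient accumulation trades wall-clock time for memory: a small per-device batch fits on a consumer GPU while the optimizer still sees a larger effective batch.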

What is knowledge distillation and when should I use it?

Distillation trains a smaller student model to mimic a larger teacher by learning from the teacher’s input–output pairs (sequence-level distillation; no logits needed). Steps: collect teacher completions, define an eval benchmark, fine-tune the student (e.g., GPT-4o mini), then evaluate and iterate. Use it to cut cost and latency (the student is often 5–20× smaller) and to deploy in constrained environments while preserving quality.
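The first step, collecting teacher completions, can be sketched as follows. Here `teacher` is any callable that maps a prompt string to a completion string; in a real pipeline it would wrap a call to the larger model, which this example deliberately leaves out. The function name and the chat-style record layout are assumptions for illustration.

```python
import json


def build_distillation_set(prompts, teacher, system_prompt):
    """Turn teacher outputs into chat-format JSONL lines for student training."""
    lines = []
    for prompt in prompts:
        completion = teacher(prompt)
        if not completion.strip():
            continue  # skip empty teacher answers instead of training on them
        record = {"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines
```

Because the student learns only from what the teacher wrote, filtering or reviewing teacher outputs before training is usually worth the effort; the student will copy the teacher's mistakes as readily as its strengths.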

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$47.99 $23.99

you save $24.00 (50%)
