Overview

1 Deploying Large Language Models Reliably in the Real World

Large language models have progressed rapidly from the Transformer breakthrough to systems that read, write, and reason with striking fluency, but turning demos into dependable products remains hard. This chapter sets the stage: despite impressive capabilities and industry momentum, many pilots fail to deliver ROI because reliability, efficiency, and responsibility are treated as afterthoughts. The book positions itself as a practical guide to close that gap—focusing on engineering techniques that make LLMs accurate, trustworthy, and durable in production.

It highlights the real-world impact across domains—legal research acceleration (e.g., contract analysis), customer service automation at scale, faster software development with AI coding assistants, and enterprise “agentic” systems woven into core workflows—while surfacing why deployments break in practice. LLMs are probabilistic, can hallucinate, carry hidden bias, and increasingly take actions with real-world consequences. The chapter reframes success around rigorous evaluation and operations, not just capability demos: reliability must be measured, monitored, and continuously improved.

The chapter then outlines a toolkit for dependable systems: retrieval-augmented generation, structured reasoning prompts, semantic search, confidence and uncertainty signaling, and source attribution to curb hallucinations; proactive bias programs with adversarial tests, fairness metrics, dataset curation, and CI-style audits; and efficiency strategies such as distillation, quantization, caching, hybrid routing, and comprehensive quality monitoring. For agentic AI, it stresses least-privilege permissions and layered safety controls. With regulation, public trust, and ROI on the line, the book offers hands-on projects and end-to-end workflows—using accessible tools like Python and API-based models—to help teams build systems that stay reliable from day one to day 1,000.

Figure: Exponential growth in language model parameter counts. The newest GPT and Claude models do not disclose their parameter counts, but they are estimated to exceed a trillion parameters.
Figure: Performance comparison of GPT models on AIME 2025 competition mathematics problems [10].
Figure: Global AI agents market growth forecast by region, 2018–2030, showing rapid acceleration to $50.3B by 2030.

Summary

  • LLMs have immense potential to transform industries. Their applications span content creation, customer service, healthcare, and more.
  • Core challenges like hallucinations, bias, efficiency and performance must be addressed to successfully use LLMs in production.
  • Agentic AI systems that take real-world actions introduce new categories of risk requiring sophisticated reliability engineering.
  • Mitigating bias is crucial to prevent perpetuating harmful assumptions and ensure fair, equitable treatment.
  • Improving efficiency is vital to making large models economically and environmentally viable at scale.
  • Curbing hallucination risks is key to keeping outputs honest and grounded in facts.
  • Performance optimization ensures LLMs meet the speed, responsiveness, and quality demands of real-world applications.
  • Multi-agent systems require coordination protocols, error handling, and monitoring to prevent cascading failures.
  • This book covers promising solutions to these challenges, enabling teams to harness LLMs safely for groundbreaking innovations across healthcare, science, education, entertainment, and more, while building vital public trust.

FAQ

Why do so many LLM pilots fail to deliver ROI?
Most pilots stumble when moving from demo to production. An MIT study found 95% of generative AI pilots fail to deliver ROI due to hallucinations, flaky outputs, brittle toolchains, and weak evaluations. Without reliability engineering, systems that look magical in the lab become unreliable in the real world.

How do LLMs work, and what changed with Transformers?
LLMs predict the most likely next token based on patterns learned from massive text corpora. The 2017 Transformer architecture enabled models to capture long-range context, and scaling to billions of parameters and beyond unlocked strong abilities in generation, comprehension, reasoning, and translation. Crucially, they’re probabilistic—so identical inputs can yield different outputs.

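To make "predict the most likely next token" concrete, here is a toy sketch that counts which word follows which in a tiny corpus and then samples continuations from those counts. Real LLMs use Transformer networks over subword tokens, so treat this purely as an illustration of the probabilistic, pattern-based idea; the corpus and helper name are made up for the example.

```python
import random
from collections import Counter, defaultdict

# Toy illustration only: count which word follows which in a tiny corpus,
# then sample the next word in proportion to those counts. Real LLMs use
# Transformer networks over subword tokens, but the probabilistic idea is similar.
corpus = "the model predicts the next token and the next token depends on context".split()

follow_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follow_counts[current][nxt] += 1

def sample_next(word: str) -> str:
    """Sample a likely next word given the current word."""
    counts = follow_counts[word]
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

# Identical input, potentially different output on each run (probabilistic).
print([sample_next("the") for _ in range(5)])
```
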
Where are LLMs driving impact today—and what risks come with it?
  • Law: JPMorgan’s COIN automates ~360k hours of contract review; risk: fabricated citations.
  • Customer service: Klarna’s assistant replaces the work of ~700 agents and saves ~$40M/year; Intercom’s bot resolves 67% of inquiries; risk: incorrect policy info.
  • Development: GitHub Copilot speeds coding by ~55%; risk: buggy or insecure suggestions.
  • Enterprise AI: Salesforce Einstein runs 1B+ predictions/day, boosting revenue 25–35%; risk: biased outputs in high-stakes contexts.

What is “hallucination,” and how can we minimize it?
Hallucination is when a model produces confident, plausible, but false content—like fabricated legal cases that look real. Mitigations include retrieval-augmented generation (RAG) and semantic search for grounding, chain-of-thought prompting for stepwise reasoning, confidence/uncertainty signaling, and source attribution. Treat it like a security threat: assume it will happen and deploy layered defenses.

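Here is a minimal sketch of the grounding flow behind those mitigations: retrieve a few relevant passages (keyword overlap stands in for real semantic search) and build a prompt that requires citations or an explicit "I don't know." The document set and helper names are illustrative assumptions, not an API from the book.

```python
# Minimal grounding sketch: keyword-overlap retrieval stands in for semantic
# search, and the prompt forces the model to cite sources or abstain.
# Document contents and helper names are illustrative assumptions.

DOCS = {
    "refund-policy.md": "Refunds are issued within 14 days of purchase.",
    "shipping.md": "Standard shipping takes 3 to 5 business days.",
    "warranty.md": "Hardware carries a one year limited warranty.",
}

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Rank documents by crude keyword overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(
        DOCS.items(),
        key=lambda item: len(words & set(item[1].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def grounded_prompt(question: str) -> str:
    """Build a prompt that requires cited answers or an explicit abstention."""
    sources = "\n".join(
        f"[{i + 1}] ({name}) {text}" for i, (name, text) in enumerate(retrieve(question))
    )
    return (
        "Answer using ONLY the sources below and cite them like [1]. "
        "If they are insufficient, reply \"I don't know.\"\n\n"
        f"{sources}\n\nQuestion: {question}\nAnswer:"
    )

print(grounded_prompt("How long do refunds take?"))
```
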
How should teams detect and mitigate bias in LLM systems?
Bias often hides in correlations learned from historical data, yielding unequal outcomes across demographics. Build bias checks into the pipeline: adversarial tests, clear fairness metrics, routine audits, and curated, representative datasets. Automate bias testing in CI/CD and create feedback loops to rapidly catch and correct issues.

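As one example of an automatable fairness check, here is a minimal sketch of a demographic-parity test that could run in CI. The group labels, outcomes, and threshold are illustrative assumptions; real bias programs combine several metrics and audits.

```python
# Minimal sketch of a demographic-parity check suitable for a CI test.
# Group labels, outcomes, and the gap threshold are illustrative assumptions.
from collections import defaultdict

def positive_rates(records: list[dict]) -> dict[str, float]:
    """Share of positive model outcomes per demographic group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        positives[r["group"]] += int(r["outcome"] == "approved")
    return {g: positives[g] / totals[g] for g in totals}

def check_demographic_parity(records: list[dict], max_gap: float = 0.05) -> None:
    """Fail the pipeline if the gap between group approval rates is too large."""
    rates = positive_rates(records)
    gap = max(rates.values()) - min(rates.values())
    assert gap <= max_gap, f"Parity gap {gap:.2%} exceeds threshold: {rates}"

# Example usage inside a test suite (toy data, loose threshold):
records = [
    {"group": "A", "outcome": "approved"}, {"group": "A", "outcome": "denied"},
    {"group": "B", "outcome": "approved"}, {"group": "B", "outcome": "approved"},
]
check_demographic_parity(records, max_gap=0.60)
print("parity check passed:", positive_rates(records))
```
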
How can we improve LLM efficiency and cost without losing quality?
Use model distillation (smaller “student” models that retain ~90% performance at ~10% compute), quantization (shrink models by up to ~75% with minimal quality loss), intelligent caching/precompute, and hybrid routing (small models for routine tasks, large models for hard cases). Real-world results show it’s feasible to achieve performance at sustainable cost.

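Here is a minimal sketch of hybrid routing combined with response caching: short, routine prompts go to a cheaper model and identical prompts are served from memory. The `call_model` stub, model names, and the length-based routing rule are illustrative assumptions; production routers typically use classifiers or confidence scores instead.

```python
# Minimal sketch of hybrid routing plus caching. The call_model stub, model
# names, and the length-based routing heuristic are illustrative assumptions.
from functools import lru_cache

def call_model(model: str, prompt: str) -> str:
    """Stand-in for an API call to the named model."""
    return f"[{model}] answer to: {prompt}"

@lru_cache(maxsize=1024)  # intelligent caching: identical prompts hit memory, not the API
def answer(prompt: str) -> str:
    # Hybrid routing: cheap small model for routine queries, large model otherwise.
    model = "small-model" if len(prompt.split()) < 20 else "large-model"
    return call_model(model, prompt)

print(answer("What are your opening hours?"))
print(answer("What are your opening hours?"))  # second call is served from the cache
```
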
What should production monitoring include beyond standard metrics?
Track technical metrics (latency, error rates) and quality metrics (hallucination rate, bias incidents, user satisfaction). Comprehensive monitoring surfaces issues early, preserves trust, and prevents small degradations from becoming production outages or PR/legal risks.

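A minimal sketch of tracking quality metrics alongside latency is shown below: each request is logged, and an approximate p95 latency, hallucination rate, and average user rating are reported. The record fields are illustrative assumptions, and a production system would feed these numbers into dashboards and alerts rather than printing them.

```python
# Minimal sketch of quality-plus-latency monitoring. Record fields are
# illustrative assumptions; real systems would export these to dashboards/alerts.
import statistics

requests_log: list[dict] = []

def record_request(latency_ms: float, hallucinated: bool, user_rating: int) -> None:
    """Log one request's latency and quality signals."""
    requests_log.append(
        {"latency_ms": latency_ms, "hallucinated": hallucinated, "rating": user_rating}
    )

def report() -> dict:
    """Summarize approximate p95 latency plus quality metrics."""
    latencies = sorted(r["latency_ms"] for r in requests_log)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {
        "p95_latency_ms": p95,
        "hallucination_rate": sum(r["hallucinated"] for r in requests_log) / len(requests_log),
        "avg_rating": statistics.mean(r["rating"] for r in requests_log),
    }

record_request(420, hallucinated=False, user_rating=5)
record_request(1800, hallucinated=True, user_rating=2)
print(report())
```
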
What is agentic AI, and how do we keep it reliable?
Agentic AI uses tools and takes real actions (send emails, update databases, make purchases). Mistakes become costly—e.g., booking the wrong city or deleting critical records. Enforce least-privilege permissions, tightly scope tool/data access, add guardrails and escalation/human review for sensitive actions, and test workflows rigorously—especially in multi-agent setups.

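Here is a minimal sketch of least-privilege tool access for agents: each agent has an explicit allowlist, and sensitive actions are escalated for human approval before execution. The agent names, tool names, and approval rule are illustrative assumptions.

```python
# Minimal sketch of least-privilege tool access with human escalation.
# Agent names, tool names, and the approval rule are illustrative assumptions.

ALLOWED_TOOLS = {
    "support-agent": {"search_kb", "draft_reply"},
    "ops-agent": {"search_kb", "update_ticket"},
}
NEEDS_APPROVAL = {"update_ticket", "issue_refund"}

def execute_tool(agent: str, tool: str, args: dict, approved: bool = False) -> str:
    """Run a tool only if the agent is allowed to, escalating sensitive actions."""
    if tool not in ALLOWED_TOOLS.get(agent, set()):
        raise PermissionError(f"{agent} is not allowed to call {tool}")
    if tool in NEEDS_APPROVAL and not approved:
        return f"escalated: {tool}{args} awaits human review"
    return f"executed: {tool}{args}"

print(execute_tool("support-agent", "search_kb", {"query": "refund policy"}))
print(execute_tool("ops-agent", "update_ticket", {"id": 42, "status": "closed"}))
```
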
Why do these reliability challenges matter right now?
LLMs are entering high-stakes domains (healthcare, finance, law), where errors bring regulatory, legal, and reputational consequences. Regulations (e.g., EU AI Act, pending U.S. rules) demand verifiable safety, fairness, and transparency. Reliable systems earn trust, scale faster, and produce ROI; unreliable ones invite lawsuits and public backlash.

What do I need to follow along with the book’s projects?
Install Python 3+ with pip, use a code editor (e.g., VS Code or Cursor), and obtain an OpenAI API key. Most examples use API calls rather than local hosting to keep setup simple and cost low; optional cloud services and tools are introduced later, with free-tier options where possible.

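To confirm the setup works, here is a minimal sketch of an API call using the official openai Python package (v1+). It assumes the OPENAI_API_KEY environment variable is set, and the model name is a placeholder you may need to change.

```python
# Minimal setup check, assuming `pip install openai` (v1+) and an
# OPENAI_API_KEY environment variable; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.choices[0].message.content)
```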
