1 Getting started with MLOps and ML engineering
This chapter introduces the gap between building models and operating them reliably in production, positioning MLOps as the discipline that closes it. It sets expectations for a hands-on journey that turns readers into confident ML engineers by focusing on the complete life cycle, practical patterns, and the skills companies value. Rather than theory-heavy content, the chapter frames the book as an incremental, project-driven guide that emphasizes reproducibility, automation, and real-world decision-making across diverse ML use cases.
The chapter first maps the ML life cycle as an iterative process: problem formulation, data collection and preparation (often with labor-intensive labeling), data versioning for reproducibility, model training, evaluation, and stakeholder-driven validation. It then contrasts this experimentation loop with the dev/staging/production phase, where pipelines are fully automated and triggered via CI, culminating in deployable services (commonly REST APIs). Production concerns include containerization, scalability, performance testing, versioned releases with rollback strategies, and monitoring that spans system metrics, data/model drift, and business KPIs. Retraining is treated as a first-class, automatable step, scheduled or triggered by performance thresholds.
To execute all this, the chapter outlines the core skill set (solid software engineering, practical ML fluency, data engineering, and a bias toward automation) and introduces an incremental path to building an ML platform. Kubeflow and its Pipelines provide orchestration; additional components such as feature stores and model registries prevent training-serving skew and enable promotion flows; CI/CD, container registries, and Kubernetes operationalize deployment. Tool choices are pragmatic and context-driven, and the “build vs buy” discussion encourages assembling a platform at least once to understand constraints and integrations. The chapter previews three projects—an OCR system, a movie recommender, and a RAG-powered documentation assistant—showing how the same MLOps foundations extend to LLMOps with vector databases, guardrails, and cost-aware operations.
The experimentation phase of the ML life cycle
The dev/staging/production phase of the ML life cycle
MLOps is a mix of different skill sets
The mental map of an ML setup, detailing the project flow from planning to deployment and the tools typically involved in the process
Traditional MLOps (right) extended with LLMOps components (left) for production LLM systems. Chapters 12-13 explore these extensions in detail.
An automated pipeline being executed in Kubeflow.
Feature stores take transformed data (features) as input and provide facilities to store, catalog, and serve those features.
The model registry captures metadata, parameters, artifacts, and the ML model and in turn exposes a model endpoint.
Model deployment consists of the container registry, CI/CD, and automation working in concert to deploy ML services.
Summary
- The Machine Learning (ML) life cycle provides a framework for confidently taking ML projects from idea to production. Although the life cycle is iterative, understanding each phase helps you navigate the complexities of ML development.
- Building reliable ML systems requires a combination of skills spanning software engineering, MLOps, and data science. Rather than trying to master everything at once, focus on understanding how these skills work together to create robust ML systems.
- A well-designed ML Platform forms the foundation for confidently developing and deploying ML services. We'll use tools like Kubeflow Pipelines for automation, MLflow for model management, and Feast for feature management, learning how to integrate them effectively for production use (a small MLflow logging sketch follows this summary).
- We'll apply these concepts by building two different types of ML systems: an OCR system and a movie recommender. Through these projects, you'll gain hands-on experience with both image and tabular data, building confidence in handling diverse ML challenges.
- Traditional MLOps principles extend naturally to Large Language Models through LLMOps, adding components for document processing, retrieval systems, and specialized monitoring. Understanding this evolution prepares you for the modern ML landscape.
- The first step is to identify the problem the ML model is going to solve, followed by collecting and preparing the data to train and evaluate the model. Data versioning enables reproducibility, and model training is automated using a pipeline.
- The ML life cycle serves as our guide throughout the book, helping us understand not just how to build models, but how to create reliable, production-ready ML systems that deliver real business value.
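As mentioned in the summary, MLflow handles model management; here is a small, hedged sketch of logging one training run and registering the resulting model. The tracking URI, experiment name, and registered model name are placeholders, not the book's actual setup.

```python
# Minimal MLflow sketch: log one training run and register the model.
# The tracking URI, experiment, and registered model name are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://localhost:5000")  # assumed local tracking server
mlflow.set_experiment("getting-started")

X, y = make_classification(n_samples=500, random_state=42)

with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Registering under a name is what enables later promotion flows (staging -> production).
    mlflow.sklearn.log_model(model, artifact_path="model",
                             registered_model_name="demo-classifier")
```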
FAQ
Why does this book emphasize MLOps and production ML systems?
Most ML projects don’t fail because the model is “bad” but because deploying, operating, and evolving ML in production is hard. This book focuses on the practices that make ML reliable at scale: automation, orchestration, monitoring, reproducibility, and sound engineering. You’ll learn to carry a model from idea to production and keep it healthy over time.
What is the ML life cycle at a high level?
The life cycle spans iterative experimentation (problem formulation, data prep, versioning, training, evaluation, validation) and a production phase (fully automated pipelines, deployment, monitoring, and retraining). It’s inherently non-linear: findings in later steps often push you back to earlier ones, especially around data.
How does the Experimentation phase differ from Dev/Staging/Production?
- Experimentation: fast iteration, partial automation, exploring data, models, and metrics.
- Dev/Staging/Production: fully automated pipelines triggered by CI or events, versioned releases, deployment as services (often REST), and ongoing monitoring.
- The focus shifts from discovery to reliability, scale, security, and governance.
What are the core steps in the Experimentation phase?
- Problem formulation: confirm ML is the right tool and define success criteria.
- Data collection and preparation: source, annotate/label, and split into train/val/test.
- Data versioning: track datasets and changes for reproducibility.
- Model training: automate runs, capture parameters and artifacts.
- Model evaluation: measure with appropriate metrics on unseen data.
- Model validation: business/stakeholder checks before promotion. (A minimal sketch of the full flow follows this list.)
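As a rough illustration of the steps above, the scikit-learn sketch below splits the data, trains a model, evaluates it on held-out data, and applies a simple acceptance gate standing in for stakeholder validation. The dataset, model choice, and 0.85 threshold are illustrative assumptions only.

```python
# Minimal sketch of the experimentation loop: split -> train -> evaluate -> validate.
# The dataset, model, and 0.85 acceptance threshold are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out test data up front; a validation split would be carved out the same way.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluation: technical metrics on unseen data.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"test AUC: {auc:.3f}")

# Validation: a stand-in for the business/stakeholder gate before promotion.
if auc < 0.85:
    raise ValueError("Model does not meet the agreed acceptance criterion")
```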
Why is data versioning essential and why is it hard?
In ML, data changes alter behavior just like code changes do. Versioning data ensures experiments are reproducible and models can be audited. It’s challenging because data comes in varied formats and sizes, and the ecosystem lacks a “Git-equivalent” standard, so teams adopt fit-for-purpose tools and processes.
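Concrete tools differ (DVC and lakeFS are common picks), but the core idea can be sketched without committing to any of them: record a content hash for each data file in a small manifest that is versioned alongside the code. The paths and manifest format below are illustrative assumptions, not a specific tool's interface.

```python
# Illustrative-only sketch of data versioning: hash every data file and
# record the digests in a manifest committed alongside the training code.
import hashlib
import json
from pathlib import Path

def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(data_dir: str, manifest_path: str = "data_manifest.json") -> None:
    """Snapshot the digest of every file under data_dir into a JSON manifest."""
    manifest = {
        str(path): file_digest(path)
        for path in sorted(Path(data_dir).rglob("*"))
        if path.is_file()
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))

# write_manifest("data/raw")  # commit data_manifest.json together with the code
```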
What’s the difference between model evaluation and model validation?
- Evaluation: technical assessment on held-out data using metrics such as precision, recall, and AUC.
- Validation: confirms the model’s behavior aligns with business expectations and constraints; often performed by stakeholders beyond the model builders. (The sketch below contrasts the two.)
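A small, hedged way to see the contrast in code: evaluation computes the technical metrics, while validation enforces a stakeholder-agreed constraint such as a minimum recall on a critical segment. The inputs, thresholds, and segment below are made up for illustration.

```python
# Evaluation vs. validation, as a sketch. The arrays and the 0.90 recall
# constraint on "high-value" cases are illustrative assumptions.
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def evaluate(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """Technical assessment on held-out data."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
    }

def validate(y_true: np.ndarray, y_prob: np.ndarray, is_high_value: np.ndarray) -> bool:
    """Business check: the model must catch at least 90% of high-value cases."""
    y_pred = (y_prob >= 0.5).astype(int)
    segment_recall = recall_score(y_true[is_high_value], y_pred[is_high_value])
    return segment_recall >= 0.90
```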
What does a robust model deployment process look like?
- Expose inference via an API (commonly REST); a minimal serving sketch follows this list.
- Containerize (e.g., Docker) and deploy on orchestrators (e.g., Kubernetes).
- Automate via CI/CD: build, push, and apply manifests.
- Load test, enable autoscaling, and version releases with rollback strategies.
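For the first step, the FastAPI sketch below loads a pickled model and exposes a /predict endpoint. The model path and feature payload shape are placeholder assumptions; later chapters may structure the service differently.

```python
# Minimal REST inference service sketch (FastAPI). The model path and the
# shape of the feature payload are placeholder assumptions.
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:  # artifact produced by the training pipeline
    model = pickle.load(f)

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictRequest) -> dict:
    prediction = model.predict([request.features])[0]
    return {"prediction": float(prediction)}

# Run locally with: uvicorn main:app --reload
# Containerizing and deploying this service follows the remaining steps above.
```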
What should be monitored in production ML systems?
- System/performance: latency, requests per second (RPS), error rates, resource usage.
- Data/model health: data drift, concept drift, model quality regression.
- Business KPIs: domain-specific metrics (e.g., conversion, fraud catch-rate).
- Alerts/triggers: thresholds that can kick off retraining or rollback (a simple drift and latency check is sketched below).
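The sketch below illustrates two of these checks in isolation: a two-sample Kolmogorov-Smirnov test for data drift on a single feature and a p95 latency budget. The 0.05 p-value cut-off and 300 ms budget are example thresholds, not recommendations.

```python
# Sketch of two monitoring checks: data drift on one feature and a latency budget.
# The thresholds are illustrative examples, not recommendations.
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values: np.ndarray, live_values: np.ndarray,
                    p_value_threshold: float = 0.05) -> bool:
    """Two-sample KS test: a small p-value suggests the distributions differ."""
    result = ks_2samp(train_values, live_values)
    return result.pvalue < p_value_threshold

def latency_alert(latencies_ms: np.ndarray, budget_ms: float = 300.0) -> bool:
    """Alert when the 95th-percentile request latency exceeds the budget."""
    return float(np.percentile(latencies_ms, 95)) > budget_ms

# Either check returning True could trigger retraining or a rollback, as noted above.
```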
When should models be retrained, and how should it be automated?
Retrain on a schedule (e.g., monthly) or based on triggers (e.g., drift or KPI drops). Use automated pipelines to re-ingest data, train, evaluate, validate, and deploy. Full automation shortens feedback loops and keeps quality stable as data and behavior shift.
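One hedged way to picture how schedule and triggers combine is a single decision function like the one below; the 30-day cadence and AUC floor are arbitrary examples, and in practice this logic usually lives in the orchestrator rather than in ad hoc code.

```python
# Sketch of a retraining decision that combines a schedule with metric triggers.
# The 30-day cadence and AUC floor are arbitrary examples.
from datetime import datetime, timedelta

def should_retrain(last_trained: datetime,
                   drift_detected: bool,
                   current_auc: float,
                   schedule: timedelta = timedelta(days=30),
                   min_auc: float = 0.80) -> bool:
    overdue = datetime.now() - last_trained > schedule
    degraded = current_auc < min_auc
    return overdue or drift_detected or degraded

# A True result would kick off the automated pipeline: re-ingest data, train,
# evaluate, validate, and deploy.
```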
What skills and platform components does this book cover?
- Skills: solid software engineering, practical ML with common frameworks, data engineering basics, and heavy emphasis on automation/reproducibility.
- Platform: Kubeflow/Kubeflow Pipelines for orchestration, CI/CD and container registries for deployment, feature stores (to share features and prevent training-serving skew), model registries (to track runs and promote versions), and Kubernetes for scaling; a minimal pipeline sketch follows this list.
- Build vs Buy: even if you use managed platforms (e.g., SageMaker, Vertex AI), assembling a platform from scratch once builds understanding and flexibility.
- LLMOps: the same foundation extends to RAG/LLM systems with vector DBs, guardrails, and specialized monitoring later in the book.
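To ground the orchestration piece, here is a minimal Kubeflow Pipelines (v2 SDK) sketch: two placeholder components wired into a pipeline and compiled to a YAML spec that a CI job could submit. The component bodies and dataset URI are illustrative only; the book's project pipelines are richer.

```python
# Minimal Kubeflow Pipelines (v2 SDK) sketch: two placeholder components
# compiled into a pipeline definition. Component bodies are illustrative only.
from kfp import compiler, dsl

@dsl.component(base_image="python:3.11")
def prepare_data(dataset_uri: str) -> str:
    # Placeholder: download/validate data and return a URI to the prepared set.
    return dataset_uri

@dsl.component(base_image="python:3.11")
def train_model(prepared_uri: str) -> str:
    # Placeholder: train a model and return a URI to the stored artifact.
    return prepared_uri + "/model"

@dsl.pipeline(name="minimal-training-pipeline")
def training_pipeline(dataset_uri: str = "gs://example-bucket/data"):
    prepared = prepare_data(dataset_uri=dataset_uri)
    train_model(prepared_uri=prepared.output)

# Compile to a spec that CI can submit to a Kubeflow Pipelines cluster.
compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```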