Overview

1 Getting started with MLOps and ML engineering

This chapter sets the stage for building production-grade ML systems by shifting focus from model creation to the realities of deployment, reliability, and ongoing operations. It introduces MLOps as the engineering discipline that brings structure, automation, and accountability to the full ML life cycle, emphasizing hands-on, iterative learning over theory. Aimed at data scientists, software engineers, and ML engineers alike, it outlines a practical journey that builds confidence through real-world patterns, clear workflows, and progressively introduced tools—laying foundations that also extend naturally to modern LLM use cases.

The ML life cycle is presented as an iterative process that moves from problem formulation and data collection/labeling through data versioning, training, evaluation, and validation—then transitions to dev/staging/production where full automation, deployment, monitoring, and retraining take center stage. Pipelines orchestrate non-linear experimentation, enforce reproducibility, and enable CI-triggered end-to-end runs. In production, models are served as services, monitored for both system and business metrics, and watched for drift; retraining is automated on schedules or thresholds. The chapter underscores the need for disciplined engineering practices—versioning, testing, observability, and rollback plans—to keep models reliable under real-world conditions.

Success in MLOps requires a blend of strong software engineering, practical ML fluency, data engineering awareness, and a bias toward automation and reproducibility (with tools like Kubernetes and CI/CD). To ground these skills, the chapter lays out an incremental approach to building an ML platform using Kubeflow and its pipelines, then augmenting it with essential components such as a feature store, model registry, containerization, and deployment automation—while remaining pragmatic about tool choice and trade-offs. The journey is made concrete through three projects—an OCR system, a movie recommender, and a RAG-powered documentation assistant—demonstrating how the same core MLOps principles evolve from classic ML to LLMOps without abandoning the foundational platform and practices.

  • The experimentation phase of the ML life cycle
  • The dev/staging/production phase of the ML life cycle
  • MLOps as a mix of different skill sets
  • The mental map of an ML setup, detailing the project flow from planning to deployment and the tools typically involved in the process
  • Traditional MLOps (right) extended with LLMOps components (left) for production LLM systems; chapters 12 and 13 explore these extensions in detail
  • An automated pipeline being executed in Kubeflow
  • Feature stores take in transformed data (features) as input and provide facilities to store, catalog, and serve features
  • The model registry captures metadata, parameters, artifacts, and the ML model, and in turn exposes a model endpoint
  • Model deployment consists of the container registry, CI/CD, and automation working in concert to deploy ML services

Summary

  • The Machine Learning (ML) life cycle provides a framework for confidently taking ML projects from idea to production. While iterative in nature, understanding each phase helps you navigate the complexities of ML development.
  • Building reliable ML systems requires a combination of skills spanning software engineering, MLOps, and data science. Rather than trying to master everything at once, focus on understanding how these skills work together to create robust ML systems.
  • A well-designed ML platform forms the foundation for confidently developing and deploying ML services. We'll use tools like Kubeflow Pipelines for automation, MLflow for model management, and Feast for feature management, learning how to integrate them effectively for production use.
  • We'll apply these concepts by building two different types of ML systems: an OCR system and a movie recommender. Through these projects, you'll gain hands-on experience with both image and tabular data, building confidence in handling diverse ML challenges.
  • Traditional MLOps principles extend naturally to Large Language Models through LLMOps, adding components for document processing, retrieval systems, and specialized monitoring. Understanding this evolution prepares you for the modern ML landscape.
  • The first step is to identify the problem the ML model is going to solve, followed by collecting and preparing the data to train and evaluate the model. Data versioning enables reproducibility, and model training is automated using a pipeline.
  • The ML life cycle serves as our guide throughout the book, helping us understand not just how to build models, but how to create reliable, production-ready ML systems that deliver real business value.

FAQ

What is MLOps and why does it matter for real-world ML systems?
MLOps is the engineering practice of turning models into reliable, scalable services and keeping them healthy over time. Many ML projects fail not because the model is weak, but because teams lack the repeatable pipelines, deployment practices, monitoring, and automation needed to run ML in production.

How is the ML life cycle structured, and why is it highly iterative?
The chapter frames two phases: an experimentation phase (problem formulation, data preparation, data versioning, training, evaluation, validation) and a dev/staging/production phase (full automation, deployment, monitoring, and retraining). Teams loop back frequently; discoveries in training or evaluation often send you back to data work or problem framing.

How do I decide whether to use ML versus simple heuristics?
Start by validating whether a heuristic can solve the problem. Choose ML when the data is complex or high-dimensional, or when the patterns are non-linear. Align business and technical stakeholders on a clear problem statement and success criteria before committing to ML.

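The heuristic-first check above can be made concrete with a small sketch. The spam-keyword rule, the toy labeled samples, and the 0.95 target below are all illustrative assumptions, not anything prescribed by the chapter:

```python
# Minimal sketch: validate a simple heuristic before reaching for ML.
# Rule, data, and success threshold are illustrative assumptions.

def heuristic_is_spam(subject: str) -> bool:
    """Naive keyword rule standing in for a domain heuristic."""
    return any(word in subject.lower() for word in ("free", "winner", "$$$"))

def heuristic_accuracy(samples):
    """Fraction of (text, label) samples the rule gets right."""
    correct = sum(heuristic_is_spam(s) == label for s, label in samples)
    return correct / len(samples)

labeled = [
    ("FREE vacation, claim now", True),
    ("You are a WINNER $$$", True),
    ("Meeting notes for Tuesday", False),
    ("Quarterly report attached", False),
]

accuracy = heuristic_accuracy(labeled)
# If the heuristic already meets the success criteria agreed with
# stakeholders, ship it; otherwise the gap justifies an ML model.
use_ml = accuracy < 0.95
```

The point is the decision procedure, not the rule itself: only reach for ML once a baseline like this demonstrably falls short of the agreed success criteria.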
What skills and prerequisites are important for MLOps and ML engineering?
You need solid software engineering (debugging, performance, deployments), working knowledge of ML frameworks (e.g., PyTorch, TensorFlow, scikit-learn), basic data engineering for data quality and pipelines, and a strong focus on automation and reproducibility (CI/CD, environment parity). A lightweight understanding of Kubernetes helps you run and scale services.

How should I approach data collection, labeling, and dataset splits?
Identify trustworthy data sources, consider synthetic data when real data is scarce, and ensure careful labeling with clear guidelines. Organize data into training, validation, and test sets to get a realistic read on performance before release.

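A minimal sketch of the train/validation/test split described above, using only the standard library; the 70/15/15 fractions and fixed seed are illustrative defaults, not a recommendation from the chapter:

```python
import random

def split_dataset(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once with a fixed seed (reproducible), then carve out
    test and validation sets; the remainder is the training set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
```

Note the fixed seed: without it, every run produces a different split and results stop being comparable across experiments. Real projects also consider stratified or time-based splits when labels are imbalanced or data is temporal.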
Why is data versioning essential, and how do teams handle it?
Changing data changes model behavior, so versioning data (alongside code) is critical for reproducibility and audits. Track dataset snapshots, schemas, labels, and link them to experiment runs and artifacts so you can recreate results exactly.

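The core idea behind dataset snapshots can be sketched with content hashing: identical data always yields the same snapshot ID, and any change (a label, a row, the schema) yields a new one. This is a stand-in for the mechanics inside dedicated tools, not their actual implementation:

```python
import hashlib
import json

def dataset_fingerprint(records, schema_version="v1"):
    """Content-addressed snapshot ID for a dataset: serialize
    deterministically, then hash. Same data -> same ID; any
    change to records or schema -> a different ID."""
    payload = json.dumps(
        {"schema": schema_version, "records": records},
        sort_keys=True,
    ).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

v1 = dataset_fingerprint([{"text": "good", "label": 1}])
v2 = dataset_fingerprint([{"text": "good", "label": 0}])  # one label flipped
```

Storing this fingerprint alongside each experiment run is what lets you later recreate a result exactly: the run record points at an immutable snapshot rather than at a mutable "current" dataset.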
What’s the difference between model evaluation and model validation?
Evaluation measures model performance on held-out data using metrics like precision, recall, or AUC. Validation confirms the model behaves as expected for the business—often with stakeholder review, qualitative checks, and domain constraints—before promotion.

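The evaluation half is purely mechanical, as a small sketch of the two metrics named above shows (for binary labels, precision = TP/(TP+FP) and recall = TP/(TP+FN)); the validation half, by contrast, cannot be reduced to a formula:

```python
def precision_recall(y_true, y_pred):
    """Binary precision and recall from parallel 0/1 label lists."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# One false positive (4th item) and one false negative (2nd item):
p, r = precision_recall([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

A model can score well here and still fail validation, for example by violating a domain constraint or behaving unacceptably on a business-critical slice of the data.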
Why automate ML pipelines, and which tools are introduced?
Automation reduces mistakes, speeds iteration, and guarantees reproducibility across environments. The chapter introduces pipeline orchestration with Kubeflow Pipelines to connect steps end-to-end and trigger runs via CI or programmatic events.

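The end-to-end idea can be sketched with plain functions standing in for pipeline steps; an orchestrator like Kubeflow Pipelines expresses the same chain with containerized components, so treat this as an illustration of the shape, not the Kubeflow API. The toy data and the trivial majority-label "model" are invented for the example:

```python
# Illustration of a pipeline as ordered, single-purpose steps where
# each step's output feeds the next. Orchestrators run the same chain
# as isolated, containerized components.

def ingest():
    """Stand-in for a data-loading step."""
    return [{"x": 1, "y": 1}, {"x": 2, "y": 0}]

def validate(rows):
    """Stand-in for a schema/quality check gate."""
    assert all("x" in r and "y" in r for r in rows), "schema check failed"
    return rows

def train(rows):
    """Trivial 'model': predict the majority label seen in training."""
    ones = sum(r["y"] for r in rows)
    return {"majority": int(ones >= len(rows) / 2)}

def evaluate(model, rows):
    """Accuracy of the trained stand-in model on the given rows."""
    correct = sum(model["majority"] == r["y"] for r in rows)
    return correct / len(rows)

def run_pipeline():
    rows = validate(ingest())
    model = train(rows)
    return evaluate(model, rows)

score = run_pipeline()
```

Because each step is a pure function of its inputs, the whole run is repeatable, and a CI trigger is just another caller of `run_pipeline()`.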
How do I deploy and operate ML models in production safely?
Expose inference via a service (often REST), containerize with Docker, and run on Kubernetes for portability and scaling. Use CI/CD to build, version, and roll out services; load-test, autoscale for spiky demand, and keep rollback strategies ready. Model registries can help version, promote, and serve the right model.

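As a sketch of what "run on Kubernetes" looks like, here is an illustrative Deployment plus Service for a model server. The names, image tag, port, and replica count are placeholders, not values from the book's projects:

```yaml
# Illustrative Kubernetes manifests for a containerized model server.
# All names, the image reference, and the port are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ocr-model-server
spec:
  replicas: 2                      # more than one pod for availability
  selector:
    matchLabels:
      app: ocr-model-server
  template:
    metadata:
      labels:
        app: ocr-model-server
    spec:
      containers:
        - name: server
          image: registry.example.com/ocr-model:1.4.2  # pinned version for rollback
          ports:
            - containerPort: 8080
          readinessProbe:          # keep traffic off pods that are not ready
            httpGet:
              path: /healthz
              port: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: ocr-model-server
spec:
  selector:
    app: ocr-model-server
  ports:
    - port: 80
      targetPort: 8080
```

Pinning the image tag is what makes rollback trivial: redeploying the previous manifest restores the previous model service exactly.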
What should I monitor in production, and when should I retrain?
Track service health (RPS, latencies, error rates), data and model drift, and business KPIs that reflect value. Retrain on a schedule or when thresholds are breached (e.g., drift or KPI drops), and automate retraining and redeployment to close the loop.
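The threshold-breach trigger can be sketched with a deliberately crude drift signal. Real systems use richer statistics (for example PSI or a Kolmogorov-Smirnov test), but the trigger logic has the same shape; the 0.25 threshold and the toy feature values below are illustrative:

```python
def mean_shift(reference, live):
    """Crude drift signal: relative shift of the live feature mean
    versus the reference (training-time) mean."""
    ref_mean = sum(reference) / len(reference)
    live_mean = sum(live) / len(live)
    return abs(live_mean - ref_mean) / (abs(ref_mean) or 1.0)

DRIFT_THRESHOLD = 0.25  # illustrative; tuned per feature and business risk

def should_retrain(reference, live):
    """True when the drift signal breaches the configured threshold,
    which in an automated loop would kick off the training pipeline."""
    return mean_shift(reference, live) > DRIFT_THRESHOLD

stable = should_retrain([10, 11, 9, 10], [10, 10, 11, 9])    # no drift
drifted = should_retrain([10, 11, 9, 10], [16, 15, 17, 14])  # mean moved
```

In a closed loop, a `True` here would trigger the same automated pipeline used during development, followed by evaluation gates before the retrained model is redeployed.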
