Overview

1 Getting started with MLOps and ML engineering

This chapter sets the stage for building reliable, production-grade ML systems by focusing on the operational challenges where most projects stumble—data quality, automation, deployment, and maintenance—rather than on model complexity. It introduces a practical, hands-on path to becoming a confident ML engineer, emphasizing real-world patterns, reproducibility, and scalable architectures. The approach is iterative, grounded in end-to-end thinking from problem framing through deployment and monitoring, with an emphasis on learning by doing.

The ML life cycle is presented in two broad phases. During experimentation, teams iterate through problem formulation, data collection and preparation, data versioning, model training, evaluation, and stakeholder validation—ideally orchestrated as reproducible pipelines. In the dev/staging/production phase, those pipelines are fully automated and triggered via CI or programmatic events to support deployment, versioned releases, and continuous monitoring. Production concerns include containerization and scalability, performance and reliability testing, monitoring infrastructure and business metrics alongside data/model drift, and automated retraining based on schedules or thresholds.
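The staged pipeline described above can be sketched in plain Python. All names here (`prepare_data`, `train_model`, and so on) are illustrative, not from any specific framework; orchestrators like Kubeflow Pipelines wrap steps like these into containerized components that can be triggered via CI or programmatic events.

```python
# Minimal sketch of the life-cycle stages as composable pipeline steps.
# The "model" is a deliberately trivial stand-in (the training mean) so
# the flow of data between stages stays in focus.

def prepare_data(raw):
    """Data collection/preparation: drop missing records, then split."""
    cleaned = [r for r in raw if r is not None]
    split = int(len(cleaned) * 0.8)
    return cleaned[:split], cleaned[split:]

def train_model(train_set):
    """Stand-in 'training': learn the mean of the training values."""
    return {"mean": sum(train_set) / len(train_set)}

def evaluate_model(model, test_set):
    """Evaluation: mean absolute error against the held-out split."""
    return sum(abs(x - model["mean"]) for x in test_set) / len(test_set)

def run_pipeline(raw):
    """Experimentation phase: run the stages end to end, reproducibly."""
    train_set, test_set = prepare_data(raw)
    model = train_model(train_set)
    error = evaluate_model(model, test_set)
    return model, error

model, error = run_pipeline([1.0, 2.0, None, 3.0, 4.0, 5.0])
```

In the dev/staging/production phase, the same stages run unchanged; what changes is that a CI trigger, schedule, or drift threshold invokes `run_pipeline` instead of a person.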

Success in MLOps draws on strong software engineering foundations, practical ML and data engineering skills, and a bias toward automation for repeatability and auditability. The chapter advocates incrementally building an ML platform—centered on Kubernetes and Kubeflow Pipelines—then extending it with feature stores, model registries, and CI/CD-driven deployment. Tool choices are pragmatic and context-dependent, with a “build vs buy” lens informed by first-hand platform assembly. The roadmap applies these principles through three projects: an OCR service, a tabular movie recommender (highlighting feature stores, drift, and observability), and a RAG-based documentation assistant that extends the same foundation to LLMOps.

Figures in this chapter:

  • The experimentation phase of the ML life cycle
  • The dev/staging/production phase of the ML life cycle
  • MLOps as a mix of different skill sets
  • A mental map of an ML setup, detailing the project flow from planning to deployment and the tools typically involved in the process
  • Traditional MLOps (right) extended with LLMOps components (left) for production LLM systems; chapters 12-13 explore these extensions in detail
  • An automated pipeline being executed in Kubeflow
  • Feature stores take in transformed data (features) as input and provide facilities to store, catalog, and serve features
  • The model registry captures metadata, parameters, artifacts, and the ML model, and in turn exposes a model endpoint
  • Model deployment consists of the container registry, CI/CD, and automation working in concert to deploy ML services

Summary

  • The Machine Learning (ML) life cycle provides a framework for confidently taking ML projects from idea to production. While iterative in nature, understanding each phase helps you navigate the complexities of ML development.
  • Building reliable ML systems requires a combination of skills spanning software engineering, MLOps, and data science. Rather than trying to master everything at once, focus on understanding how these skills work together to create robust ML systems.
  • A well-designed ML platform forms the foundation for confidently developing and deploying ML services. We'll use tools like Kubeflow Pipelines for automation, MLflow for model management, and Feast for feature management, learning how to integrate them effectively for production use.
  • We'll apply these concepts by building two different types of ML systems: an OCR system and a movie recommender. Through these projects, you'll gain hands-on experience with both image and tabular data, building confidence in handling diverse ML challenges.
  • Traditional MLOps principles extend naturally to Large Language Models through LLMOps - adding components for document processing, retrieval systems, and specialized monitoring. Understanding this evolution prepares you for the modern ML landscape.
  • The first step is to identify the problem the ML model is going to solve, followed by collecting and preparing the data to train and evaluate the model. Data versioning enables reproducibility, and model training is automated using a pipeline.
  • The ML life cycle serves as our guide throughout the book, helping us understand not just how to build models, but how to create reliable, production-ready ML systems that deliver real business value.

FAQ

What is MLOps, and how does it differ from just building models?
MLOps is the discipline of taking ML models from experimentation to reliable production. It covers orchestration, deployment, monitoring, automation, versioning, and reproducibility. Unlike pure model building (often done in notebooks), MLOps emphasizes robust systems, collaboration, and operating models under real-world constraints such as scalability, security, and compliance.
What are the main stages of the ML life cycle introduced in this chapter?
The chapter presents a life cycle that includes: problem formulation; data collection and preparation; data versioning; model training; model evaluation; model validation with stakeholders; and, in dev/staging/production, full pipeline automation, model deployment, monitoring, and retraining.
How does the experimentation phase differ from the dev/staging/production phase?
Experimentation is highly iterative and exploratory, often using orchestrated but semi-automated pipelines. The dev/staging/production phase fully automates those pipelines (typically via CI/CD triggers), adds deployment as a service (e.g., REST), and emphasizes operational concerns like security, scalability, robustness, and real-time performance, plus continuous monitoring and retraining.
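The "continuous monitoring and retraining" part of the production phase often boils down to a metric plus a threshold. A hedged sketch: the drift metric below (shift of the live feature mean, measured in training-time standard deviations) and the threshold value are illustrative only; production systems use richer tests (e.g., PSI or KS statistics) across many features.

```python
import statistics

def mean_shift_drift(train_values, live_values):
    """How far the live mean has moved, in training standard deviations."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma

def should_retrain(train_values, live_values, threshold=2.0):
    """Threshold-based retraining trigger, as mentioned in the chapter."""
    return mean_shift_drift(train_values, live_values) > threshold

train = [10, 11, 9, 10, 12, 10, 9, 11]    # feature values seen at training time
stable = [10, 11, 10, 9]                  # live traffic, no drift
shifted = [20, 21, 19, 22]                # live traffic after a regime change
```

In an automated setup, `should_retrain` returning `True` would kick off the same training pipeline used in experimentation, rather than paging a human.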
Why is problem formulation so important before using ML?
It ensures ML is the right tool for the job. Many problems can be solved with simpler heuristics. When data is complex, high-dimensional, or non-linear, ML may add value. Clear problem statements align stakeholders, define success metrics, and guide data needs, model choice, and evaluation criteria.
Why is data versioning critical, and why is it hard?
Data versioning enables reproducibility: changing data changes model behavior. It’s hard because data comes in many forms and sizes, and there’s no universally adopted “Git for data.” Good practice includes tracking dataset snapshots, schemas, transforms, and hashes alongside code and model artifacts.
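The "hashes alongside code" idea can be illustrated in a few lines: a content hash over canonically serialized records yields a stable version identifier to log next to the commit SHA and model artifact. Tools like DVC implement this at scale; the function below is a sketch of the principle, not any tool's API.

```python
import hashlib
import json

def dataset_version(records):
    """Deterministic content hash of a list of JSON-serializable records.

    sort_keys makes the serialization canonical, so the same data always
    hashes to the same version string regardless of dict ordering.
    """
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]

v1 = dataset_version([{"user": 1, "rating": 5}])
v2 = dataset_version([{"user": 1, "rating": 4}])  # one value changed -> new version
```

Recording `dataset_version` at training time means a model can later be traced back to the exact snapshot it was trained on.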
Which skills are essential for MLOps and ML engineering success?
Core skills include solid software engineering (debugging, performance tuning, deployments), practical ML proficiency (using frameworks like PyTorch/TensorFlow/Scikit-learn), data engineering (data quality, pipelines), and automation (CI/CD, reproducibility). Familiarity with Kubernetes helps for portability and scaling. You don’t need to master everything at once—learn incrementally.
What is an ML platform, and why build one from scratch in this book?
An ML platform is the foundation for end-to-end ML work: notebooks, pipeline orchestration (e.g., Kubeflow Pipelines), model registry, deployment, monitoring, and more. Building it from scratch teaches how components fit together, how to customize and troubleshoot, and prepares you to evaluate or extend managed offerings like SageMaker or Vertex AI.
What is a Feature Store and how does it prevent training–serving skew?
A Feature Store centralizes curated features, serving them offline (training/batch) and online (real time) from a single source of truth. By reusing the same definitions and transforms across training and inference, it eliminates discrepancies that cause training–serving skew, while promoting feature reuse and consistency.
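The single-source-of-truth idea is easy to show in miniature: one feature definition is shared by both the offline (training) and online (serving) paths, so the transform cannot drift apart between them. Names below are illustrative; Feast and similar tools add storage, cataloging, and point-in-time-correct retrieval on top of this core idea.

```python
def days_since_last_purchase(event_ts, now_ts):
    """The one shared feature transform (timestamps in seconds)."""
    return (now_ts - event_ts) / 86400

def build_training_rows(purchase_history, now_ts):
    """Offline path: compute the feature over a historical batch."""
    return [days_since_last_purchase(ts, now_ts) for ts in purchase_history]

def serve_online(event_ts, now_ts):
    """Online path: the exact same definition at inference time."""
    return days_since_last_purchase(event_ts, now_ts)

now = 1_000_000
offline = build_training_rows([now - 86400, now - 172800], now)
online = serve_online(now - 86400, now)
```

Training–serving skew typically creeps in when the offline and online paths each reimplement the transform; sharing one definition removes that failure mode by construction.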
What is a Model Registry and how is it used in production?
A Model Registry tracks models, versions, metadata, parameters, and artifacts for reproducibility. It supports promotion flows (e.g., staging to production), rollbacks, and governance. Services can load models directly from the registry, ensuring traceability of which version is running and why.
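A hedged sketch of the registry's core bookkeeping: versioned entries with metadata, stage promotion (staging to production, demoting the previous production version), and lookup of whichever version is live. Real registries such as MLflow's also persist artifacts and expose this over an API; the class and method names here are illustrative only.

```python
class ModelRegistry:
    """Toy in-memory registry illustrating versions, stages, and promotion."""

    def __init__(self):
        self._versions = {}   # version number -> {"params": ..., "stage": ...}
        self._next = 1

    def register(self, params):
        """New models enter the registry in the 'staging' stage."""
        version = self._next
        self._next += 1
        self._versions[version] = {"params": params, "stage": "staging"}
        return version

    def promote(self, version):
        """Promote one version to production, archiving the previous one."""
        for entry in self._versions.values():
            if entry["stage"] == "production":
                entry["stage"] = "archived"
        self._versions[version]["stage"] = "production"

    def production_version(self):
        """Which version a serving endpoint should load right now."""
        for version, entry in self._versions.items():
            if entry["stage"] == "production":
                return version
        return None

registry = ModelRegistry()
v1 = registry.register({"lr": 0.1})
v2 = registry.register({"lr": 0.01})
registry.promote(v1)
registry.promote(v2)   # v1 is archived; v2 is now live
```

Rollback falls out of the same mechanism: promoting an archived version makes it production again, with full traceability of what ran when.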
Which projects does the chapter introduce, and what do they teach?
The book builds three projects: (1) an OCR-based ID card detector (image data, fine-tuning, deployment), (2) a movie recommender (tabular data, Feature Store, drift detection, observability), and (3) a RAG-powered documentation assistant (LLMOps: embeddings, vector DB, guardrails, cost and semantic evaluation). Together they demonstrate iterative ML development, platform-first thinking, and tool selection trade-offs.
