Effective Platform Engineering

Build self-service interfaces to boost developer experience

Ajay Chankramath, Nic Cheneweth, Bryan Oliver, Sean Alvarez
Foreword by Kief Morris

October 2025
ISBN 9781633436497
400 pages

printed in black & white

available in Simplified Chinese

table of contents

Part 1 Getting started with platform engineering

1 What is platform engineering?

1.1 Platforms are more than just DevOps

1.1.1 Why should I care about platform engineering?

1.2 When to use platform engineering principles

1.2.1 When do these principles not apply?

1.3 Foundational concepts in platform engineering

1.3.1 Product delivery model for platforms

1.3.2 Platform product domains

1.3.3 Platform engineering principles

1.4 Platform engineering enablers

1.4.1 Developer experience

1.4.2 DevOps

1.4.3 SRE

1.4.4 Impact of generative AI in the platform engineering space

1.5 Let’s get started

2 Software-defined products and architectures

2.1 Product delivery model

2.1.1 Technical product ownership

2.1.2 Customers vs. stakeholders

2.1.3 Optimize for a product

2.1.4 The importance of a minimal valuable product and early adopters

2.2 Software-defined platform

2.2.1 The platform software delivery lifecycle

2.2.2 Observability-driven development

2.3 Evolutionary platform architecture

2.3.1 Backlog management for incremental design

2.3.2 Capturing user, stakeholder, and market feedback

2.3.3 Architectural fitness functions

2.3.4 Domain-driven platform design

2.3.5 Evolutionary impact of cloud-native technologies

3 Measuring your way to platform engineering success

3.1 Organizational aspects of platform engineering success

3.1.1 Changes needed for an organization to prepare for platform engineering

3.1.2 Prerequisites for the change

3.1.3 Implementing organizational changes

3.1.4 Building platforms in organizations

3.2 Path-to-production and platform value metrics

3.2.1 How do you identify the scope of your engineering platform?

3.2.2 Platform value modeling and metrics

3.3 Cognitive load and mechanical sympathy for developers and platform engineers

3.3.1 What is cognitive load for developers, and how do we measure it?

3.3.2 Why reduce cognitive load?

3.3.3 Techniques for reducing cognitive load

3.3.4 The need for mechanical sympathy

3.4 Common platform performance metrics

3.4.1 Adoption

3.4.2 Developer sentiment

3.4.3 Effectiveness

3.5 Evolving measures

3.5.1 Cost planning

3.5.2 Risk assessment

3.5.3 Mapping measurement to core platform principles

3.5.4 Mapping measurement to platform engineering domains

Part 2 Building engineering platforms

4 Governance, compliance, and trust

4.1 Developer autonomy

4.1.1 What does it mean to make a development team autonomous?

4.2 Policy-as-code

4.2.1 Introduction to policy-as-code using Open Policy Agent

4.3 Platform-managed trust

4.3.1 Software supply chain security

4.3.2 Zero-trust networking

4.3.3 Separating platform customer identity from cloud infrastructure identity

5 Evolutionary observability

5.1 Why observability matters

5.1.1 Observability is more than metrics and alerts

5.1.2 Use cases for observability beyond basic monitoring of applications

5.1.3 What does good look like?

5.1.4 Viewing observability through a single pane of glass

5.2 Observability as a platform service

5.2.1 The end-user access experience

5.2.2 Automatic collection of customer data

5.2.3 Who needs to respond when things need attention?

5.3 Observability platform as a separate internal product

5.3.1 Architecture of an observability platform

5.3.2 Should you build or buy?

5.3.3 Cross-platform observability

5.3.4 Strategies to drive adoption

5.4 Observability of published service-level indicators, service-level objectives, and service-level agreements

5.4.1 SLOs as code

6 Building a software-defined engineering platform

6.1 Building our own example engineering platform

6.2 Prerequisites to getting started

6.2.1 Getting started with the example tools

6.2.2 Developer tools selection criteria

6.3 Infrastructure pipeline orchestration practices

6.3.1 Account-level pipelines

6.3.2 Control plane-level pipelines

6.3.3 Namespace-level pipelines

6.3.4 Choosing an IaC framework

6.3.5 Test-driven development of infrastructure code

6.3.6 Static code analysis

6.3.7 Reusable pipeline code

6.3.8 Private executors (runners)

6.4 Cloud administrative identity

6.4.1 Top-level Domain Account

6.4.2 Platform product test environments

6.4.3 Internal customer nonproduction and production environments

6.4.4 Service accounts and permissions

7 Platform control plane foundations

7.1 Cloud account baseline

7.1.1 Account baseline security scanning

7.1.2 Account baseline observability

7.1.3 Hosted zones and delegated domains

7.2 Transit network layer

7.2.1 Role-based network structure

7.3 Customer identity

7.3.1 Authentication and authorization

7.3.2 OpenID Connect device-auth-flow and team membership claims

7.4 Cloud service control plane base

7.4.1 AWS managed node groups

7.4.2 Dependencies for AWS-managed EKS services

7.4.3 AWS managed EKS add-ons

7.4.4 Integrating an OIDC provider with the control plane base

7.4.5 Post-Terraform configuration

7.4.6 Strategy for testing EKS base

8 Control plane services and extensions

8.1 Value of services and extension domains for smaller teams

8.2 Value of services and extension domains for scaled platform delivery

8.2.1 Which services or extensions should be a part of every platform from the start?

8.3 Control plane services

8.3.1 Services pipeline orchestration

8.4 Control plane extensions

8.4.1 Kubernetes storage classes

8.4.2 Service mesh

8.4.3 Using operators for persistent data platform capabilities

8.5 Platform management APIs

8.5.1 Managing teams and namespaces for early adopters

Part 3 Scaling engineering platforms

9 Architecture changes to support scale

9.1 Scaling the control plane roles

9.1.1 Dynamic release pipeline

9.2 Scaling the orchestration of control plane services and extensions

9.2.1 Scaling the creation of OpenID Connect Assumable Roles

9.3 Scaling through platform event streaming

9.3.1 release-api

9.3.2 Adapter pattern

9.3.3 Adapter pattern: Issue tracking

9.3.4 Adapter pattern: CI hooks

9.3.5 Adapter pattern: Observability hooks

9.3.6 Adapter pattern: Configuration management database and audit gathering

10 Platform product evolution

10.1 Measuring the success of your platform organization

10.1.1 The platform value model

10.1.2 Epetech’s approach to measurement

10.1.3 Implementing feedback loops

10.2 Platform-as-a-product as the differentiator

10.2.1 Defining the platform vision and mission

10.2.2 Establishing a product roadmap

10.2.3 Implementing agile practices

10.2.4 The role of the platform product manager

10.2.5 Differentiating through user experience

10.3 Cultural shift from a traditional operations world

10.3.1 Embracing DevOps cultural principles

10.3.2 Breaking down silos with team topologies

10.3.3 Effects on Epetech’s culture

10.4 Site reliability engineering strategy, models, and aligning with organizational needs

10.4.1 Understanding SRE

10.4.2 Implementing SRE at Epetech

10.4.3 Aligning SRE with organizational goals

10.4.4 Outcomes of implementing SRE

10.5 Using intelligent assistants to help enhance engineering platforms

10.5.1 The role of intelligent assistants

10.5.2 Epetech’s EpetechBot

10.5.3 Exploring advanced tools

10.6 Comparing IDPs and developer portals

10.6.1 Understanding IDPs

10.6.2 Understanding developer portals

10.6.3 Comparing and contrasting IDPs and developer portals

10.7 Outcomes from building the platform products

10.7.1 Key takeaways from Epetech’s journey

10.7.2 Benefits realized by Epetech

Appendix

Appendix A: Solutions to the exercises

Appendix B: References

Overview

2 Software-defined products and architectures

Platform engineering is positioned as the creation of a durable, self-serve internal product that removes friction for software teams while meeting quality, security, and compliance expectations. Success depends as much on organizational design, decision habits, and engineering culture as on technology. The chapter stresses clear differentiation between customers (the developers who use the platform) and stakeholders (influencers such as security, finance, and legal), urging that stakeholder needs be expressed as service interfaces or outcome standards rather than prescribed implementations. Effective domain boundaries, measured through actual internal product usage, and a relentless focus on the developer experience are the primary yardsticks for platform value.

The recommended operating model is product-led: an empowered technical product owner, end-to-end lifecycle optimization from ideation to operations, “service-interface-first” capabilities, and rapid, feedback-driven increments that reach early adopters quickly. Architectural and tooling choices are validated through real experiments before commitments. The platform must be fully software-defined—everything in version control, delivered by automated pipelines—following a disciplined SDLC (plan, design, code, build/test, release, observe, operate). Observability-driven development expands “definition of done” to include instrumentation, dashboards, alerts, and change events so teams can understand behavior, detect regressions, and verify that delivered features create measurable value.

Architecturally, the chapter advocates designing for evolution: API-first patterns, domain-driven platform design, and automated architectural fitness functions that keep coupling in check, protect nonfunctional requirements, and enable portability. Backlog management is value-centric and incremental, informed by real user behavior, stakeholder input framed as outcomes, and market learning. Cloud-native foundations are preferred to reduce undifferentiated heavy lifting; extensible control planes enable policy enforcement, automation, and governance at scale, while serverless trade-offs are weighed when extensibility is limited. Start with a single team and a unified roadmap, expand to domain-aligned teams as demand grows, and preserve self-serve boundaries and cost efficiency throughout.

A Technical Product Owner (TPO), sometimes called a Product Manager, serves as the voice of the customer: staying deeply connected to what developers need and how they experience the solutions the platform provides, while constantly assessing the measurable value being delivered.

Internal stakeholders have legitimate concerns, but the outcomes they need can be achieved without the stakeholders controlling the implementation.

A Stakeholder Map visualizes a stakeholder’s influence over the backlog against their alignment to platform objectives. High-influence stakeholders must be highly aligned for a platform to succeed.
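As a rough illustration, the map can be reduced to two scores per stakeholder and a quadrant rule. The 0.0 to 1.0 scales, the 0.5 threshold, the quadrant labels, and the sample scores below are all assumptions made for this sketch, not values from the book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stakeholder:
    name: str
    influence: float  # hypothetical 0.0-1.0 scale: sway over the backlog
    alignment: float  # hypothetical 0.0-1.0 scale: agreement with platform objectives


def quadrant(s: Stakeholder, threshold: float = 0.5) -> str:
    """Place a stakeholder in one of four map quadrants."""
    high_influence = s.influence >= threshold
    high_alignment = s.alignment >= threshold
    if high_influence and not high_alignment:
        return "at risk: invest in alignment first"
    if high_influence and high_alignment:
        return "champion: keep closely involved"
    if high_alignment:
        return "supporter: keep informed"
    return "monitor"


def alignment_priorities(stakeholders: list[Stakeholder]) -> list[str]:
    """Names of high-influence, low-alignment stakeholders, most influential first."""
    at_risk = [s for s in stakeholders if quadrant(s).startswith("at risk")]
    return [s.name for s in sorted(at_risk, key=lambda s: s.influence, reverse=True)]


# Illustrative scores only.
stakeholders = [
    Stakeholder("security", influence=0.9, alignment=0.3),
    Stakeholder("finance", influence=0.7, alignment=0.8),
    Stakeholder("legal", influence=0.6, alignment=0.4),
    Stakeholder("facilities", influence=0.1, alignment=0.2),
]
print(alignment_priorities(stakeholders))  # ['security', 'legal']
```

Sorting the at-risk quadrant by influence gives the product owner an ordered list of where alignment work is likely to matter most.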

Look at all the activities, from ideation to operations, and assess engineering quality and impacts from organizational or architectural decisions. Yrjö Engeström’s human activity model inspires this Developer Activity model[2].

The progression of an MVP delivery process. A fully functional product does not need to be delivered at the start, but every phase should deliver some incremental value to the customer.

The evolutionary path of platform development teams. An engineering platform (EP) often starts as an effort driven by a single team; over time, multiple teams spin up to handle specific aspects of platform functions, with all development efforts under a top-level product owner.

The software delivery lifecycle (“SDLC”) covers all stages from feature planning through running in production.

With every change you make to the platform, think about the kind of data you will need in order to understand how the new technology, service, or feature is behaving. Alerting from service failure is not enough. When we deploy automatic configuration behavior or control guardrails, even healthy behavior could be interpreted as failure.
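A small sketch may help show why failure alerts alone fall short. Here a guardrail rejects images from unapproved registries: the rejection is the guardrail working as intended, so it is recorded with a different outcome than a genuine error. The `EventRecorder` class, the event names, and the registry prefix are hypothetical and exist only for this illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EventRecorder:
    """In-memory stand-in for whatever backend receives platform events."""
    events: list = field(default_factory=list)

    def emit(self, name: str, outcome: str, **attrs) -> None:
        self.events.append({"event": name, "outcome": outcome, **attrs})

    def counts(self, name: str) -> Counter:
        """Tally outcomes for one event name."""
        return Counter(e["outcome"] for e in self.events if e["event"] == name)


# Hypothetical policy: only images from this registry prefix may deploy.
ALLOWED_REGISTRIES = ("registry.internal.example/",)


def admit_image(image: str, recorder: EventRecorder) -> bool:
    """Guardrail that records healthy denials separately from real errors."""
    if not image:
        recorder.emit("image_admission", "error", reason="empty image reference")
        return False
    if image.startswith(ALLOWED_REGISTRIES):
        recorder.emit("image_admission", "allowed", image=image)
        return True
    # A denial is the guardrail doing its job, not a service failure.
    recorder.emit("image_admission", "denied_by_policy", image=image)
    return False


recorder = EventRecorder()
admit_image("registry.internal.example/checkout:1.4", recorder)
admit_image("docker.io/library/nginx:latest", recorder)
admit_image("", recorder)
print(recorder.counts("image_admission"))
```

With outcomes separated this way, a dashboard can show policy denials as expected behavior while alerts fire only on the `error` outcome.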

New features or changes should go through a Value flow. There will be routine operations and refactoring going on simultaneously, and using this flow will help us stay clear on how our current work fits into the bigger picture.

When we have a DevOps or traditional operations team, developers will be responsible for submitting requests for infrastructure or changes they need. The requests go into the queue.

Ideas, improvements, and sometimes requirements will come from a variety of sources. These four are the most important. And everything needs to successfully make it through the top row of our Value Flow in Figure 2.9, except for production outages…of course.

When a particular service within the API needs to interact with the API database, it calls a Repository function that provides all the normal database methods. The repository function would call the datastore functions for the type of database the API needs to use (relational or unstructured in our case).
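The layering described here can be sketched in a few lines. The class and method names (`Datastore`, `RelationalStore`, `DocumentStore`, `TeamRepository`) are invented for this example, and both backends are in-memory stand-ins rather than real database clients.

```python
from __future__ import annotations

import json
from typing import Protocol


class Datastore(Protocol):
    """Operations every backend must provide (hypothetical interface)."""

    def get(self, key: str) -> dict | None: ...
    def put(self, key: str, record: dict) -> None: ...
    def delete(self, key: str) -> None: ...


class RelationalStore:
    """Stand-in for a relational backend: one in-memory 'table' of rows."""

    def __init__(self) -> None:
        self._rows: dict[str, dict] = {}

    def get(self, key: str) -> dict | None:
        row = self._rows.get(key)
        return dict(row) if row is not None else None

    def put(self, key: str, record: dict) -> None:
        self._rows[key] = dict(record)

    def delete(self, key: str) -> None:
        self._rows.pop(key, None)


class DocumentStore:
    """Stand-in for an unstructured backend: records kept as JSON documents."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def get(self, key: str) -> dict | None:
        raw = self._docs.get(key)
        return json.loads(raw) if raw is not None else None

    def put(self, key: str, record: dict) -> None:
        self._docs[key] = json.dumps(record)

    def delete(self, key: str) -> None:
        self._docs.pop(key, None)


class TeamRepository:
    """The only database surface that API services see."""

    def __init__(self, store: Datastore) -> None:
        self._store = store

    def add_team(self, name: str, owner: str) -> None:
        self._store.put(name, {"name": name, "owner": owner})

    def find_team(self, name: str) -> dict | None:
        return self._store.get(name)

    def remove_team(self, name: str) -> None:
        self._store.delete(name)


# The calling code is identical for either backend.
for store in (RelationalStore(), DocumentStore()):
    repo = TeamRepository(store)
    repo.add_team("payments", owner="dana")
    print(type(store).__name__, repo.find_team("payments"))
```

Because services depend only on the repository, swapping the database type means writing one new datastore class instead of touching every service.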

Primary domains in an engineering platform. (From chapter 1)

Start by defining the primary boundary experience, which is the one our platform customers have. Then figure out where the underlying changes or capabilities lie and, when more than one team is involved, whether an effective boundary experience exists between them.

The components of a Kubernetes Cluster. Courtesy: https://kubernetes.io/docs/concepts/overview/components/

An example of a pod sidecar. This container runs a process alongside the application container to enhance functionality. In this example, a sidecar continuously loads a new secret value from a vault so that the application will always get the latest updates when secrets are requested.
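The refresh loop inside such a sidecar might look roughly like the sketch below. It assumes the sidecar and the application share a mounted volume and that the application reads the secret from a file on it. The `fetch_secret` callable stands in for a real vault client, and the function names and polling interval are hypothetical.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Callable, Optional


def write_atomically(path: Path, value: str) -> None:
    """Write to a temp file, then rename, so readers never see a partial secret."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(value)
    os.replace(tmp_name, path)


def refresh_once(fetch_secret: Callable[[], str], path: Path) -> bool:
    """Fetch the current secret; rewrite the shared file only if it changed."""
    latest = fetch_secret()
    current = path.read_text() if path.exists() else None
    if latest == current:
        return False
    write_atomically(path, latest)
    return True


def run_sidecar(
    fetch_secret: Callable[[], str],
    path: Path,
    interval_seconds: float = 30.0,
    max_cycles: Optional[int] = None,
) -> int:
    """Poll forever, or for max_cycles polls; return how many writes happened."""
    writes = 0
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        if refresh_once(fetch_secret, path):
            writes += 1
        cycle += 1
        time.sleep(interval_seconds)
    return writes


# Demo with a fake vault that rotates the secret on the third poll.
versions = iter(["s3cr3t-v1", "s3cr3t-v1", "s3cr3t-v2"])
with tempfile.TemporaryDirectory() as shared_volume:
    secret_file = Path(shared_volume) / "db-password"
    writes = run_sidecar(lambda: next(versions), secret_file,
                         interval_seconds=0, max_cycles=3)
    print(writes, secret_file.read_text())  # 2 s3cr3t-v2
```

The write-then-rename step is a design choice for this sketch: the application container never observes a half-written file, even if it reads during a rotation.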

Summary

Successful delivery of a self-serve engineering platform happens through a Product Delivery Model.
Stakeholders have significantly different success criteria than do customers.
Shape the experience of using an engineering platform for customers of the platform.
Stakeholder needs can be met without destroying the customer experience.
Don’t underestimate the importance of tracking stakeholder impact on platform delivery. It can mean the difference between a platform that provides a return on investment and one that does not.
Optimize the product team's engineering practices and the organizational support for delivering a product.
Optimize the engineering process from ideation to platform operation, not around individual teams, technologies, or departments.
Features and capabilities should be delivered as service interfaces first.
Make final architecture and technology decisions only after realistic experimentation that measures the expected value.
Test experiences and implementation with alpha users early.
Find ways to introduce the fundamental aspects of new features before more complex implementations, using the feedback to influence next steps.
Try to start with a single engineering platform delivery team, with the initial squad setting the architectural foundations.
Grow the number of delivery teams within the platform product based on the most needed capabilities.
Adjacent teams within the organization can provide capabilities to the engineering platform only if they can also provide their service as-a-product and support the product roadmap of the platform.
The idea of a software-defined platform introduces more rigor into the SDLC process, using software engineering principles to increase the product's stability, scalability, extensibility, and maintainability.
Observability is an integral part of each story in the product delivery backlog, not something added on afterward.
Both the healthy and unhealthy behavior of each feature, capability, or technology of the platform should be easily observable.
Plan for evolutionary change, even in the way you manage the platform product backlog.
Team engineering standards, such as test-driven development, include architecture fitness testing to support easier evolution.

FAQ

What is an Engineering Platform, and why is a self-serve experience essential?

An Engineering Platform is an internal product that enables development teams to create, test, deploy, and operate software continuously while meeting quality, security, and compliance. Self-serve is essential because it eliminates ticket-driven delays and lets teams get what they need on demand through APIs, making developer experience the core measure of success.

How does a Product Delivery Model differ from traditional IT project approaches?

Unlike one-off, tool-by-tool projects planned as capital purchases, a product delivery model treats capabilities as long-lived, continuously evolving products measured by user experience. Work ships in small increments, is optimized for change, and is prioritized by real usage and value—never by big-bang, Gantt-style rollouts.

What is Technical Product Ownership (TPO), and why is it critical for platform success?

The TPO is the empowered, accountable product leader who represents developers, prioritizes the backlog, tracks measurable value, and adapts the roadmap based on actual usage. Without true product ownership, platforms drift into low-value, compliance-driven tools that miss developer needs.

How should we handle customers vs. stakeholders in platform engineering?

Customers are the people who use the platform (primarily developers); stakeholders influence outcomes (e.g., security, finance, legal) but shouldn’t dictate implementations. Ask stakeholders to provide either self-serve APIs to their capabilities or clear, testable outcome standards. Use a stakeholder map (influence vs. alignment) to focus alignment work where it matters most.

What does “service interface first” mean, and why does it matter?

Every capability is exposed as an API first; CLIs and UIs are layered on top. This enables automation, multi-tenant usability, and long-term flexibility. Cloud providers excel at API-first services, and when a stakeholder can’t offer an API, outcome-based standards let platform teams build one without breaking the product experience.
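A minimal sketch of the layering, under assumed names: `create_namespace` plays the role of the API operation, and `cli` is a thin command-line wrapper that adds no logic of its own. The naming rule and the returned fields are invented for this illustration.

```python
import argparse
import re

# Hypothetical naming rule: lowercase letters, digits, and inner hyphens.
_NAME = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def create_namespace(team: str, name: str) -> dict:
    """The service interface: every other entry point calls this function."""
    if not _NAME.fullmatch(name):
        raise ValueError(f"invalid namespace name: {name!r}")
    return {"namespace": f"{team}-{name}", "team": team, "status": "requested"}


def cli(argv: list) -> dict:
    """CLI layered on top of the API: parse arguments, then delegate."""
    parser = argparse.ArgumentParser(prog="platform")
    parser.add_argument("--team", required=True)
    parser.add_argument("--name", required=True)
    args = parser.parse_args(argv)
    return create_namespace(args.team, args.name)


print(cli(["--team", "payments", "--name", "checkout"]))
```

Because validation lives in the API function, a pipeline calling `create_namespace` directly and a developer using the CLI get the same behavior.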

How should we start delivering the platform—what does MVP and early adopters look like?

Ship the smallest valuable capability early, treat it as a hypothesis, and observe real usage to guide the next increment. Begin with a single, unified team and TPO to prove value, then scale to multiple teams under one product vision. Avoid “do-everything” first releases; most value comes from a few core paths.

What does it mean for a platform to be software-defined, and what SDLC practices are expected?

“Software-defined” means everything—infra, policies, pipelines, configs, observability—is versioned and deployed via automation. A healthy SDLC includes: plan (with measurable outcomes), design (experiment first), code (tests first), build→test (pipeline through consistent commits), release (preview→non-prod→prod), observe, operate, repeat. Dogfood the same tools you provide to users.

What is Observability-Driven Development (ODD), and how does it complement TDD?

ODD adds observability to acceptance criteria: define and automate metrics, logs, traces, events, and alerts that show both failures and healthy behavior. Instrument change events, make monitors part of “definition of done,” and use this data to diagnose faster and prove value—extending TDD from correctness to operability.

How do evolutionary architecture and fitness functions reduce migration and change risk?

Make technology choices only after real-world experiments and design for changeability (e.g., abstract datastore access to swap databases or clouds). Fitness functions are automated tests that assert architectural rules (layer boundaries, required tests/monitors, cloud-agnostic interfaces), continuously guarding changeability and readiness.
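As one possible shape for such a test, the sketch below checks a layer-boundary rule by parsing source code and inspecting its imports. The rule itself (service modules must not import database drivers directly) and the layer and module names are assumptions chosen to match the datastore abstraction mentioned above.

```python
import ast

# Hypothetical rule: code in the "services" layer reaches the database only
# through the repository layer, never through a driver module directly.
FORBIDDEN_IMPORTS = {"services": {"sqlite3", "pymongo"}}


def imported_modules(source: str) -> set:
    """Return the top-level module names imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def layer_violations(layer: str, source: str) -> set:
    """Forbidden modules that the given layer's source imports."""
    return imported_modules(source) & FORBIDDEN_IMPORTS.get(layer, set())


good_service = "from repository import TeamRepository\n"
bad_service = "import sqlite3\nfrom repository import TeamRepository\n"

print(layer_violations("services", good_service))  # set()
print(layer_violations("services", bad_service))   # {'sqlite3'}
```

Run in the pipeline over every file in a layer, a check like this fails the build when a change quietly couples a service to a specific database.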

Why do cloud-native standards like Kubernetes matter, and when should we build vs. buy?

Kubernetes provides standardized APIs and extensibility (admission controllers, sidecars, meshes) to enforce governance with low friction—capabilities that serverless platforms often can’t extend. Prefer managed/SaaS for non-differentiating tech to minimize total cost of ownership and opportunity cost, and focus engineering time on strategic product value.
