Overview

7 Platform Control Plane Foundations

This chapter establishes the foundation of a resilient, self-healing engineering platform by treating platform capabilities as product domains and using Kubernetes as the orchestration “control plane.” After bootstrapping initial IAM and pipelines, the focus shifts to pipeline-managed, software-defined infrastructure that delivers a consistent, low-friction, self-serve experience for developers. The approach emphasizes domain-driven boundaries, reliability through reconciliation of desired state, and an operating model that prioritizes small, frequent, well-tested changes.

First, the Cloud Account Baseline domain standardizes account-wide guardrails and enablement: recurring, idempotent security configuration and scanning; early integration of observability; and DNS hosted zones with clear, product-informed naming strategies. The chapter advocates platform-managed top-level domains that evolve toward customer-managed subdomains and, when valuable, custom domains, all automated via pipelines (for example, Route 53 hosted zones and delegations). Next, the Transit Network Layer creates a scalable, low-friction network architecture, leveraging constructs like transit gateways or AWS Cloud WAN where appropriate, while adopting a role-based, per-cluster VPC model with ample IP planning and pragmatic simplifications. A key practice is keeping platform infrastructure code dedicated to platform use to avoid coupling, slow releases, and unnecessary complexity.

The chapter then separates customer identity from cloud IAM to enable a cohesive, cross-tool experience: authenticate via enterprise SSO, authorize by team membership (e.g., GitHub Teams), and issue verifiable tokens using an OIDC device flow through a SaaS IdP (such as Auth0). This identity is integrated with Kubernetes and EKS to drive RBAC and SSO across clusters. Finally, the Cloud Service Control Plane Base sets up EKS with AWS-managed add-ons (networking, DNS, storage drivers, pod identity) and Karpenter for elastic, continuously refreshed compute, plus an EFS target for durable, multi-writer storage. Robust validation combines cloud resource checks and cluster health tests with functional workloads (PVC expansion, EFS multi-write, dynamic node provisioning). Operational guidance centers on frequent, automated updates to managed add-ons, controlled but regular upgrades of Kubernetes and charts, safe refresh of managed node groups, and a developer CLI that streamlines login and kubeconfig generation using the platform’s customer identity.

We have already seen the importance of domain-driven design and the platform product domains. The aws-iam-profiles pipeline we created was part of the Cloud Administrative Identity product domain. We now continue to the Cloud Account Baseline domain and create the account-level baseline resources.
If Epetech were using Datadog, this is the point at which we would set up a repository and pipeline to manage the account-level integration that Datadog provides for AWS. With that integration in place, observability becomes part of the natural definition of done for each additional capability or feature we implement, such as the networks we will provision in section 7.2.
Our engineering platform must provide a self-serve experience for each internal customer (development team) to configure their service to receive traffic based on our company’s "product" decision for how the DNS domain and subdomain names reflect our digital products.
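As a rough illustration of the naming decision described above, the sketch below works out, for each delegated subdomain, which parent zone would hold its NS record. The `epetech.io` domain comes from the chapter's example. The `Delegation` type, the `plan_delegations` function, and the `cluster-a` label are hypothetical names used only for illustration, and no DNS provider API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delegation:
    parent_zone: str  # zone that would hold the NS record
    child_zone: str   # zone being delegated


def plan_delegations(top_level: str, subdomains: list) -> list:
    """Pair each delegated zone with the zone directly above it."""
    plan = []
    for sub in subdomains:
        child = f"{sub}.{top_level}"
        # The parent is everything to the right of the first label.
        parent = child.split(".", 1)[1]
        plan.append(Delegation(parent_zone=parent, child_zone=child))
    return plan


# "cluster-a" is a hypothetical cluster name.
plan = plan_delegations("epetech.io", ["api", "dev", "cluster-a.dev"])
for item in plan:
    print(f"NS record for {item.child_zone} goes in {item.parent_zone}")
```

A plan like this could feed whatever pipeline step actually creates the hosted zones and NS records; that step is left out here.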
Traffic will come into services running on the platform in a couple of different ways.
This VPC structure creates a solid foundation for most starting EKS implementations.
In a real starting pipeline for our VPCs, we would provision one VPC for each EKS cluster, and we would typically have several more of each than we will create in our Epetech example.
For the SaaS tools that will be a part of our platform, the common enterprise SSO integration for authentication is a good starting point. Some tools, like CircleCI, can integrate directly with GitHub authentication and do not necessarily need an independent integration.
For Epetech, we would like our identity provider (IdP) service to have a built-in or easily configurable means of integrating with GitHub, requiring authentication through whatever means we have set up in GitHub. As a result of (1) successful authentication, we want the IdP to (2) get the list of all teams the user is a member of in our GitHub organization. Finally, we want to (3) return to the user a secure means of accessing the Platform infrastructure or custom API resources.
In this authentication and authorization flow, the IdP acts only as a secure go-between. The user must authenticate through GitHub and grant their device authorization to receive a JSON Web Token.
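To make the role-based IP planning concrete, here is a minimal sketch that carves one VPC CIDR into equally sized subnets, one per role and availability zone. The roles (nodes, database, ingress) follow the chapter's summary. The CIDR `10.80.0.0/16`, the `/20` subnet size, the zone names, and the `plan_subnets` function are assumptions for illustration, not values from the book.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, roles: list, azs: list, new_prefix: int) -> dict:
    """Assign one subnet per (role, availability zone) from a VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(vpc.subnets(new_prefix=new_prefix))
    needed = len(roles) * len(azs)
    if needed > len(blocks):
        raise ValueError(
            f"{vpc_cidr} yields {len(blocks)} /{new_prefix} subnets, need {needed}"
        )
    keys = [(role, az) for role in roles for az in azs]
    # Blocks are handed out in address order; unused blocks stay in reserve.
    return dict(zip(keys, blocks))


# Hypothetical values: one VPC for one cluster, three roles, three zones.
plan = plan_subnets(
    "10.80.0.0/16",
    roles=["nodes", "database", "ingress"],
    azs=["us-east-1a", "us-east-1b", "us-east-1c"],
    new_prefix=20,
)
for (role, az), subnet in plan.items():
    print(f"{role:<9} {az}  {subnet}")
```

With these inputs, nine of the sixteen available /20 blocks are assigned and the remainder is left free, which is one way to keep room for roles added later.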
This configuration can be done through the Auth0 UI or programmatically. In a production environment, you should always manage configuration in code.
With vendor-managed services, we effectively decide which version of the service we want to be deployed and perhaps a handful of specific settings. Nearly all of what goes into deploying and managing the service is the cloud vendor’s responsibility.

Summary

  • Establish cloud account-level security configuration early and manage within the engineering platform if the security stakeholders aren’t equipped to provide product-bound capabilities.
  • Provision account-level observability dependencies early.
  • A seamless and self-service experience for DNS and domain management is critical.
  • Decide on a platform-managed domain naming option, and evolve from there to include custom subdomains and bring-your-own-domain capabilities.
  • The left-of and right-of domain naming patterns for APIs and services are primarily a business-level product value decision.
  • An API gateway may not be necessary unless supporting third-party developers; focus on zero-trust network patterns and internal API management.
  • Set up release pipelines for DNS configurations and account-level resources to ensure consistent deployment across environments.
  • Design a cloud-vendor-managed transit network that makes adding networks a low-complexity task.
  • Zero-trust networking done right can simplify the execution of business decisions to make internal resources available to customers or third-party partners.
  • Implement a role-based network structure where each Kubernetes cluster has a dedicated VPC, named according to the cluster for easy future scalability.
  • Provision VPCs and subnets in specific regions with designated IP spaces to support different roles, such as nodes, databases, and ingress.
  • Make platform customer identity its own capability within the engineering platform architecture as a key means of providing flexibility in creating user experiences and supporting evolutionary architecture. This is one of those decisions you will wish you got right at the start.
  • Use a SaaS Identity Provider like Auth0 to provide a standards-based security protocol and act as the provider between an authoritative source of authentication and the source of authorization claims.
  • The OAuth 2.0/OIDC device authorization flow is an adequate standard for platform users to generate short-lived credentials for accessing platform infrastructure and custom services from their laptops.
  • The primary permission boundary (user claim) should be team membership. This maps well to domain-bound team topologies, and treating it as the central goal of all RBAC capabilities makes an effective first implementation more likely.
  • Create a dedicated pipeline for orchestrating the cloud provider-managed aspects of the Kubernetes control plane.
  • Technologies like Karpenter provide more efficient means of maintaining short-lived nodes and node pools composed of an efficient mix of node sizes and attributes.
  • Cloud-provided storage classes provide a vendor-managed solution for many everyday attached storage needs.
  • Integrate Kubernetes directly with your identity provider solution to provide users a direct means of interacting with the Kubernetes API.
  • Include automated collection of Kubernetes configuration details in the control plane base pipeline.
  • Integration testing of the EKS pipeline includes deploying test applications that utilize the features in a customer-like manner to confirm the actual implementation health.
  • Arm-based nodes on most cloud providers are often a more performant and cost-effective option than comparable x86 nodes.
  • A platform CLI provides an effective touchpoint for users to interact with platform APIs. Whether creating CLI or UI touchpoints, the service interface (API) always comes first.

FAQ

What does “Platform Control Plane” mean in this chapter, and why is Kubernetes a good fit?
Kubernetes acts as the platform’s control plane because it provides reconciliation (desired vs. actual state), self-healing, scheduling, and an extensible API. Beyond redeploying pods or scaling, its API lets us extend orchestration to platform capabilities, making it a solid foundation for an internal engineering platform.
What belongs in the Cloud Account Baseline domain versus Administrative Identity?
Administrative Identity focuses on IAM for admins and pipelines. Cloud Account Baseline covers account-wide, provider-level settings such as centralized logging and flow log aggregation, SIEM/SOAR hooks, account-level security scanning, DNS hosted zones and delegations, and network-as-policy controls. These are idempotently enforced and apply across all accounts.
How should baseline security scanning and drift correction be implemented?
Use a recurring, idempotent job (e.g., nightly) to enforce required settings and a separate recurring scan to detect drift before auto-remediation. Align with enterprise standards (e.g., CIS AWS Foundations) and ingest critical logs centrally. If the security team doesn’t own it, the platform team should codify and run it in the Cloud Account Baseline domain.
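The answer above pairs a drift scan with idempotent enforcement. A minimal sketch of that pairing follows, using plain dictionaries in place of real account settings. The setting names and the `detect_drift` and `remediate` functions are hypothetical, and no cloud API is called.

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Report every required setting whose actual value differs."""
    drift = {}
    for key, want in desired.items():
        have = actual.get(key)  # None when the setting is missing entirely
        if have != want:
            drift[key] = {"desired": want, "actual": have}
    return drift


def remediate(desired: dict, actual: dict) -> dict:
    """Return the state after enforcing required settings.

    Idempotent: applying it to its own output changes nothing.
    Settings outside the baseline are left untouched.
    """
    enforced = dict(actual)
    enforced.update(desired)
    return enforced


# Hypothetical baseline and observed account state.
desired = {
    "flow_logs_enabled": True,
    "block_public_buckets": True,
    "log_retention_days": 365,
}
actual = {
    "flow_logs_enabled": True,
    "block_public_buckets": False,
    "team_tag": "payments",
}

drift = detect_drift(desired, actual)
print("drift found:", sorted(drift))

fixed = remediate(desired, actual)
print("drift after remediation:", detect_drift(desired, fixed))
```

Keeping detection separate from remediation lets the scan report what changed before anything is overwritten.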
How should DNS and hosted zones be structured for a platform product experience?
Treat naming as a product decision. Start with a platform-managed top-level domain (e.g., epetech.io) and provide self-serve, consistent right-of-domain paths. Delegate subdomains (e.g., api.epetech.io, dev.epetech.io, cluster-specific zones) using Route 53 hosted zones and NS record delegation, possibly across separate prod/non-prod accounts.
When should we use an API Gateway versus relying on Kubernetes ingress/service mesh?
Use an API gateway when you need third-party developer onboarding, key issuance, monetization flows, and developer portals. If your APIs are primarily for your own product’s customers and identity is handled in-app, a mesh + ingress can be simpler, cheaper, and less operationally complex.
What is the Transit Network Layer and how is it organized?
It’s a dedicated networking domain spanning internet access, inter-service traffic, cross-AZ/region, multi-cloud, and on-prem connectivity. In AWS, use Transit Gateway or Cloud WAN for scalable connectivity and role-based VPCs (one VPC per cluster). Reserve large CIDRs, tag subnets for roles (e.g., nodes/intra), and simplify where appropriate (e.g., fewer NAT gateways for exercises).
Why separate customer identity from cloud IAM, and how is it implemented?
Separating customer identity avoids coupling developers to provider IAM and enables a cohesive cross-tool UX. Use SSO for authentication and team-based authorization (e.g., GitHub Teams). Implement OIDC with a SaaS IdP like Auth0 using the Device Authorization Flow, include team memberships as claims, and integrate the OIDC provider with EKS for RBAC.
What goes into the AWS Control Plane Base pipeline?
The pipeline provisions only vendor-managed elements: EKS, an AWS-managed node group for cluster services, core add-ons (kube-proxy, coredns, VPC CNI), storage classes (EBS/EFS via CSI drivers), eks-pod-identity-agent (the successor to IRSA), and Karpenter for elastic compute. Use well-maintained Terraform modules and keep variance minimal.
How do we validate that the control plane and add-ons are healthy?
Combine tests: AwSpec for AWS resources (EKS active, IAM roles, EFS), and Bats + kubectl for Kubernetes (nodes Ready, add-ons Running). Add functional tests: create and expand EBS PVCs; verify EFS RWX with multi-pod writes; deploy pods targeting Karpenter node pools and confirm node provisioning.
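The chapter names Bats and kubectl for the cluster checks. Purely to illustrate what a "nodes Ready" assertion inspects, here is a Python sketch (a swapped-in language, not the book's tooling) that reads the JSON shape produced by `kubectl get nodes -o json` and lists nodes whose `Ready` condition is not `"True"`. The node names and the inline sample document are hypothetical.

```python
import json


def not_ready_nodes(node_list: dict) -> list:
    """Return names of nodes lacking a Ready condition with status "True"."""
    failing = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            failing.append(node["metadata"]["name"])
    return failing


# Trimmed, hypothetical stand-in for `kubectl get nodes -o json` output.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [
       {"type": "MemoryPressure", "status": "False"},
       {"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [
       {"type": "Ready", "status": "False"}]}}
  ]
}
""")

failing = not_ready_nodes(sample)
print("not ready:", failing)
```

In a pipeline, a non-empty result would fail the health stage before the functional workloads are deployed.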
What is the recommended upgrade and versioning strategy for the base?
Run the base pipeline frequently. Use “most recent” for AWS-managed EKS add-ons in non-prod/preview, review plans, and promote with explicit versions to prod. Pin the Karpenter chart via variables, and refresh AWS-managed node groups with terraform apply -replace to roll to the latest patched AMIs. Track Terraform module updates with tools like Renovate/Dependabot.
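The promotion rule in this answer (float to the most recent add-on version outside production, pin explicitly in production) can be sketched as a small decision function. Everything named here is an assumption for illustration: the environment names, the `resolve_addon` function, the shape of the returned settings, and the placeholder version strings.

```python
# Hypothetical environment names that are allowed to float.
NON_PROD = {"preview", "nonprod"}


def resolve_addon(addon: str, environment: str, pinned_versions: dict) -> dict:
    """Decide whether an add-on floats to latest or uses a pinned version."""
    if environment in NON_PROD:
        # Non-prod tracks the newest release so problems surface early.
        return {"name": addon, "most_recent": True}
    if addon not in pinned_versions:
        raise ValueError(f"no pinned version for {addon!r} in {environment!r}")
    # Production only receives versions that were explicitly promoted.
    return {
        "name": addon,
        "most_recent": False,
        "version": pinned_versions[addon],
    }


# Placeholder version strings, not real releases.
pinned = {"coredns": "v0.0.0-example", "kube-proxy": "v0.0.0-example"}

print(resolve_addon("coredns", "preview", pinned))
print(resolve_addon("coredns", "prod", pinned))
```

Raising an error for an unpinned production add-on is a deliberate choice in this sketch, so that a missing promotion step fails loudly rather than silently floating.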
