7 Platform Control Plane Foundations
This chapter establishes the foundation of a resilient, self-healing engineering platform by treating platform capabilities as product domains and using Kubernetes as the orchestration “control plane.” After bootstrapping initial IAM and pipelines, the focus shifts to pipeline-managed, software-defined infrastructure that delivers a consistent, low-friction, self-serve experience for developers. The approach emphasizes domain-driven boundaries, reliability through reconciliation of desired state, and an operating model that prioritizes small, frequent, well-tested changes.
First, the Cloud Account Baseline domain standardizes account-wide guardrails and enablement: recurring, idempotent security configuration and scanning; early integration of observability; and DNS hosted zones with clear, product-informed naming strategies. The chapter advocates platform-managed top-level domains that evolve toward customer-managed subdomains and, when valuable, custom domains, all automated via pipelines (for example, Route53 hosted zones and delegations). Next, the Transit Network Layer creates a scalable, low-friction network architecture—leveraging constructs like transit gateways or CloudWAN where appropriate—while adopting a role-based, per-cluster VPC model with ample IP planning and pragmatic simplifications. A key practice is keeping platform infrastructure code dedicated to platform use to avoid coupling, slow releases, and unnecessary complexity.
The chapter then separates customer identity from cloud IAM to enable a cohesive, cross-tool experience: authenticate via enterprise SSO, authorize by team membership (e.g., GitHub Teams), and issue verifiable tokens using an OIDC device flow through a SaaS IdP (such as Auth0). This identity is integrated with Kubernetes and EKS to drive RBAC and SSO across clusters. Finally, the Cloud Service Control Plane Base sets up EKS with AWS-managed add-ons (networking, DNS, storage drivers, pod identity) and Karpenter for elastic, continuously refreshed compute, plus an EFS target for durable, multi-writer storage. Robust validation combines cloud resource checks and cluster health tests with functional workloads (PVC expansion, EFS multi-write, dynamic node provisioning). Operational guidance centers on frequent, automated updates to managed add-ons, controlled but regular upgrades of Kubernetes and charts, safe refresh of managed node groups, and a developer CLI that streamlines login and kubeconfig generation using the platform’s customer identity.
Recall the importance of domain-driven design and the platform product domains. The aws-iam-profiles pipeline we created was part of the Cloud Administrative Identity product domain. We now move on to the Cloud Account Baseline domain and create the account-level baseline resources.
If Epetech were using Datadog, this is the point at which we would set up a repository and pipeline to manage the account-level integration that Datadog provides for AWS. With that integration in place, observability becomes part of the natural definition of done for each additional capability or feature we implement, such as the networks we will provision in section 7.2.
Our engineering platform must provide a self-serve experience for each internal customer (development team) to configure their service to receive traffic based on our company’s "product" decision for how the DNS domain and subdomain names reflect our digital products.
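As a minimal sketch of what a platform-managed naming convention could look like in code, the following builds a service hostname under a platform-owned domain and validates each label against standard hostname rules. The `<service>.<environment>.<platform domain>` layout, the function name, and the example domain are all hypothetical illustrations, not the naming decision made for Epetech.

```python
import re

# Hostname label rules: 1-63 characters, letters, digits, and hyphens,
# with no leading or trailing hyphen.
_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")


def service_hostname(service: str, environment: str, platform_domain: str) -> str:
    """Build <service>.<environment>.<platform_domain> (a hypothetical convention)."""
    labels = [service.lower(), environment.lower()]
    for label in labels:
        if not _LABEL.match(label):
            raise ValueError(f"invalid DNS label: {label!r}")
    hostname = ".".join(labels + [platform_domain.lower().rstrip(".")])
    if len(hostname) > 253:
        raise ValueError("hostname exceeds 253 characters")
    return hostname


# Hypothetical service, environment, and domain names.
print(service_hostname("orders", "dev", "platform.example.com"))
```

A helper like this would sit behind the self-serve interface, so teams supply only a service name and the platform applies the agreed naming decision consistently.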
Traffic will come into services running on the platform in a couple of different ways.
This VPC structure creates a solid foundation for most starting EKS implementations.
In a real starting pipeline for our VPCs, we would provision one VPC per EKS cluster, and we would typically have several more of each than we will create in our Epetech example.
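The IP planning behind a per-cluster, role-based VPC model can be sketched with Python's standard `ipaddress` module. This is an illustration under stated assumptions only: the `10.0.0.0/8` supernet, the `/16` and `/18` sizes, and the cluster names are hypothetical, and a real layout would further split each role block across availability zones.

```python
import ipaddress

# Subnet roles; the names and sizes here are assumptions for illustration.
ROLES = ("nodes", "database", "ingress")


def plan_vpcs(supernet: str, cluster_names: list[str], vpc_prefix: int = 16) -> dict:
    """Assign one non-overlapping VPC CIDR block to each cluster name."""
    blocks = list(ipaddress.ip_network(supernet).subnets(new_prefix=vpc_prefix))
    if len(cluster_names) > len(blocks):
        raise ValueError("not enough address space for the requested clusters")
    return dict(zip(cluster_names, blocks))


def plan_role_subnets(vpc, roles=ROLES, role_prefix: int = 18) -> dict:
    """Carve one block per role out of a single VPC CIDR."""
    blocks = list(vpc.subnets(new_prefix=role_prefix))
    if len(roles) > len(blocks):
        raise ValueError("VPC is too small for the requested roles")
    return dict(zip(roles, blocks))


# Hypothetical cluster names; each VPC is named after its cluster.
vpcs = plan_vpcs("10.0.0.0/8", ["sandbox", "prod"])
for name, cidr in vpcs.items():
    print(name, cidr, plan_role_subnets(cidr))
```

Reserving a generous block per cluster up front keeps later additions simple, because a new cluster takes the next free block without renumbering existing networks.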
For the SaaS tools that will be a part of our platform, the common enterprise SSO integration for authentication is a good starting point. Some tools, like CircleCI, can integrate directly with GitHub authentication and do not necessarily need an independent integration.
For Epetech, we would like our identity provider service to have a built-in or easily configurable means of integrating with GitHub, requiring authentication through whatever means we have set up in GitHub. As a result of (1) successful authentication, we want the IdP to (2) get the list of all teams the user is a member of in our GitHub organization. Finally, we want to (3) return to the user a secure means of accessing the Platform infrastructure or custom API resources.
In this authentication and authorization flow, the IdP acts only as a secure go-between. The user must authenticate through GitHub and authorize their device to receive a JSON Web Token.
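To make the outcome of this flow concrete, here is a minimal sketch of reading team membership out of a JSON Web Token and using it as the authorization check. It deliberately skips signature verification, which any real consumer must perform against the IdP's published keys before trusting a claim. The claim name, team names, and demo token are hypothetical and are built locally so the example runs on its own.

```python
import base64
import json

# Hypothetical custom claim that carries the user's GitHub team list.
TEAMS_CLAIM = "https://example.com/teams"


def decode_claims(token: str) -> dict:
    """Return the JWT payload WITHOUT verifying the signature (illustration only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated JWT segments")
    payload = parts[1]
    padded = payload + "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def is_member(claims: dict, team: str) -> bool:
    """Authorize by team membership, the primary permission boundary."""
    return team in claims.get(TEAMS_CLAIM, [])


def _b64url(data: dict) -> str:
    """Encode a dict as an unpadded base64url JSON segment."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# A locally built, unsigned stand-in for the token the IdP would return.
demo_token = ".".join([
    _b64url({"alg": "RS256", "typ": "JWT"}),
    _b64url({"sub": "github|12345", TEAMS_CLAIM: ["payments", "platform"]}),
    "not-a-real-signature",
])

claims = decode_claims(demo_token)
print(is_member(claims, "payments"))
```

Because the token carries the team list, downstream tools such as Kubernetes RBAC can map permissions to teams without calling GitHub themselves.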
This configuration can be done through the Auth0 UI or programmatically. In a production environment, you should always manage configuration in code.
With vendor-managed services, we effectively choose which version of the service we want deployed and perhaps a handful of specific settings. Nearly all of what goes into deploying and managing the service is the cloud vendor’s responsibility.
Summary
- Establish cloud account-level security configuration early, and manage it within the engineering platform if the security stakeholders aren’t equipped to provide product-bound capabilities.
- Provision account-level observability dependencies early.
- A seamless and self-service experience for DNS and domain management is critical.
- Decide on a platform-managed domain naming option, and evolve from there to include custom subdomains and bring-your-own-domain capabilities.
- The left-of and right-of domain naming patterns for APIs and services are primarily a business-level product value decision.
- An API gateway may not be necessary unless supporting third-party developers; focus on zero-trust network patterns and internal API management.
- Set up release pipelines for DNS configurations and account-level resources to ensure consistent deployment across environments.
- Design a cloud-vendor-managed transit network that makes adding networks a low-complexity task.
- Zero-trust networking done right can simplify the execution of business decisions to make internal resources available to customers or third-party partners.
- Implement a role-based network structure where each Kubernetes cluster has a dedicated VPC, named according to the cluster for easy future scalability.
- Provision VPCs and subnets in specific regions with designated IP spaces to support different roles, such as nodes, databases, and ingress.
- Make platform customer identity its own capability within the engineering platform architecture as a key means of providing flexibility in creating user experiences and supporting evolutionary architecture. This is one of those decisions you will wish you had gotten right at the start.
- Use a SaaS Identity Provider like Auth0 to provide a standards-based security protocol and act as the provider between an authoritative source of authentication and the source of authorization claims.
- The OAuth2 OIDC device-auth-flow is an adequate standard for users of the platform to generate short-lived credentials for accessing platform infrastructure and custom services from their laptops.
- The primary permission boundary (user claim) should be team membership. This maps well to domain-bound team topologies, and when treated as the central goal of every RBAC capability, it is more likely to yield an effective implementation the first time.
- Create a dedicated pipeline for orchestrating the cloud provider-managed aspects of the Kubernetes control plane.
- Technologies like Karpenter provide a more efficient means of maintaining short-lived nodes and node pools composed of an efficient mix of node sizes and attributes.
- Cloud-provided storage classes provide a vendor-managed solution for many everyday attached storage needs.
- Integrate Kubernetes directly with your identity provider solution to give users a direct means of interacting with the Kubernetes API.
- Include automated collection of Kubernetes configuration details in the control plane base pipeline.
- Integration testing of the EKS pipeline includes deploying test applications that utilize the features in a customer-like manner to confirm the actual implementation health.
- Arm nodes on most cloud providers often offer a more performant and cost-effective option, depending on the workload.
- A platform CLI provides an effective touchpoint for users to interact with platform APIs. Whether creating CLI or UI touchpoints, the service interface (API) always comes first.