Technology
Cloud Native Maturity Model - Technology
Navigation
The Cloud Native Maturity Model is composed of six separate documents - the Prologue and the five key reference documents:
Position on Included Technologies
The Cloud Native Maturity Model includes references to only CNCF graduated or incubating projects. The Maturity Model’s default position on CNCF sandbox projects will be to exclude unless referenced in later stages of maturity (i.e. users that have achieved level 4 or 5). It does not and will not include any reference to commercial software.
Introduction
The Cloud Native Maturity Model covers five major dimensions - People, Process, Policy, Technology and Business Outcomes. This paper addresses technology - the practical tooling that makes up cloud native applications, platforms, and infrastructure. As well as referring to specific technologies, this paper aims to show the stages you may go through as you move from starting out all the way through to cloud native excellence.
This paper illustrates just one path, but all journeys differ. This is absolutely as it should be, as organizations all start out at different points and have different destinations (business outcomes). Different locations, sizes, starting points (greenfield or long established), regulatory environments, and of course people, all influence the cloud native journey.
The technology section of the Cloud Native Maturity Model is not exhaustive. We would love contributions to ensure the model is robust and useful for all users. All readers are encouraged to submit GitHub PRs with comments and suggestions.
The Technology Overview
We anticipate you’ll have a reasonable understanding of why you want to adopt cloud native technology i.e. your expected business outcomes. Clarity on why you wish to undertake this journey to achieving full cloud native adoption is your largest asset. At a high level, the key steps in your cloud native journey will look something like this:
Level 1: You’ll have your initial experimentation and adoption of Kubernetes. You’ll start with relatively basic tools and technology. You’ll assess your existing toolset to see how they fit within the new landscape (what plays well with cloud native, and what doesn’t?). You’ll have limited automation, but don’t worry, it’s coming! Your focus is on getting the baseline technology implemented, and you won’t be in production yet.
Level 2: This marks your first step into production. You’ve worked hard to build your foundation in Level 1, and now you are moving to production. You might have started with something relatively small and simple, but this leap to production has certainly required you to address some significant steps. You’ll probably have had to incorporate monitoring and observability into your workloads. You’ll have brought key observability tooling in and started monitoring your clusters for standard metrics such as RAM, CPU etc. While you might be starting to evaluate application tracing, don’t worry about it too much if you have started to gather core metrics. Your focus here is on getting an application running in production and having enough platform resource, observability and operational capability to support it within your organization.
Level 3: Here you start to scale. Your suite of tools is more standardized. You’re getting your release tooling, secrets management and policy tooling in place. You’re also starting to get a level of buy-in across your organization, which is helping to propel you forward. This is where you will be running the largest number of tools as you will be in the thick of evaluating, implementing, and running in production.
Level 4: You’ve got full control over your environment, and you’ve built your confidence, with rapid adoption of cloud native patterns for new applications and platforms. You’ve also gained organizational commitment to cloud native and this is adding to your momenting. You’re starting to feel like you’ve “crossed the chasm.”
Level 5: Your investment is now focused on automation in functional and non-functional areas such as scanning, policy, security and testing. You’ve got operators doing your operations for you and you’re fully automated.
Please note, technology is changing rapidly with AI developments. Within each section, you should consider how AI can help you improve or streamline actions. There is a huge opportunity with AI and while we want the Cloud Native Maturity Model to cover AI, we defer to the AI community within the CNCF for its expertise. You can read more in the CNCF AI whitepaper.
Architecture and Solution Design
This section outlines how architecture and solution design evolve from ad hoc development and basic planning to fully automated, policy-driven platforms with codified standards. As maturity increases, organizations align architecture with business goals, optimize for scale and efficiency, and empower developers through self-service and well-defined patterns.
- Level 1: The focus is on getting workloads into development, without paying much attention to the specific availability requirements of components. The goal is to prepare for production and begin building cloud native capabilities. Capacity planning, such as network address space, should be considered early to avoid significant rework later. It’s tempting to start small when prototyping, but small designs often make their way into production. Invest effort in upfront planning for areas that are difficult to adjust later.
In cloud native environments, traditionally distinct architecture domains are now tightly integrated. Architects must work more closely with core service teams such as networking, security, and storage. From a solution design perspective, begin evaluating how your application might need to change to better support cloud native architecture (e.g., a monolith may not be suitable). Authentication and authorization can be challenging for legacy applications—consider replacing LDAP with OAuth or OIDC 2.0.
Begin building up the supporting services for your cloud native workloads. Consider the logical and technical interfaces that make up your cloud perimeter, as these heavily influence how workloads are architected and how they integrate with upstream and downstream systems. Avoid boxing yourself into a corner early on to prevent common issues later. When planning your production architecture, consult reference architectures such as those provided by the CNCF reference architectures.
Level 2: You’re now in production. As an architect, you’re thinking about the non-functional business requirements of applications, including performance, capacity, availability, disaster recovery, and security. As you implement platforms, consider how they can meet these needs and how they might fail—not only through infrastructure failure but also through logical corruption or misconfiguration. You may conduct exercises to identify failure scenarios and the risks they pose, which will guide engineering decisions.
It’s easier to meet requirements at higher levels of abstraction (e.g., instead of replicating a stateful VM, run multiple pods and load balance across them; use tools like Rook or Longhorn instead of replicating a SAN). Levels 1 and 2 are valuable stages for exploring and learning about the applications transitioning to cloud native and their service requirements, helping reduce operational risk later.
You are starting to establish production patterns that will be reused going forward. Common Terraform modules may be approved for specific workload types, and patterns for tools like secret management are being established. You are standardizing how platforms, services, and applications are built and maintained—using tools like Kustomize at scale, mandating Helm charts, or taking an operator-first approach. You may set standards for container integration (e.g., buildpacks or Containerfiles for everything).
A catalog of basic patterns is beginning to emerge. The focus is on raising the standard across applications, platforms, and infrastructure to ensure high-quality production services. This sets the foundation for scaling in Level 3. Security classification becomes a key non-functional requirement, alongside availability and recovery. For each classification, you should know which applications fall under it and have defined patterns for supporting security components—e.g., encryption policies, artifact placement, Kubernetes secret handling, and key rotation.
You are also starting to develop placement patterns (on-prem, cloud, edge) for infrastructure and platform services such as Kubernetes clusters, and to select right-fit solutions based on these. Even if the goal is full cloud rehosting, technical or policy constraints may require keeping some workloads or services on-prem. - Level 3: At this level, architectural planning becomes more intentional. You are operating in production and discovering new ways cloud native can deliver business value—such as supporting multiple standards (e.g., object storage), increasing platform agility, and improving cost and availability. Costs may initially rise before falling. New capabilities emerge—for example, handling bursty workloads by using spot nodes, which in turn improves availability.
A subtle shift occurs: tools previously outside the runtime platform are now integrated (e.g., GitOps and its dependency on source control repositories), which raises availability requirements for those tools. It’s essential to understand your dependencies within the cloud native ecosystem.
You may find preferred tools or standards don’t work for all use cases (e.g., the 80/20 rule), especially with specialized workloads that scale horizontally or vertically. These edge cases need their own standards. Rather than letting those teams create them in isolation, extend your portfolio of services and standards to include them. Having multiple teams consume a pattern will help with driving up quality over time as feedback is incorporated about what works and what doesn’t.
The growing cloud native ecosystem offers many new capabilities. Emphasis now falls on defining what is acceptable for use and under what conditions—based on non-functional requirements. For example, you might offer multiple placement region options, with tiered requirements for regional versus zonal clusters, depending on application criticality. Creating a capability catalog helps prevent overprovisioning and reduce costs.
Security rules and standards are codified as policies and checked regularly—including within the software supply chain. You have patterns in place for data sovereignty, data residency, and complying with regulations across jurisdictions (e.g., some countries require financial or medical data to stay within their borders). You must understand your provider’s residency guarantees.
At this stage, you’re expected to trial new technologies, vendors, and tools. Learn and decide quickly, and clearly document what is and isn’t supported. Think in terms of full technology lifecycles and retire tools to pay down technical debt as soon as possible.
You also begin standardizing workload migration processes and architectural requirements. A triage process may be needed per application to handle networking, IAM, and analysis components. Define a clear policy set for all application workloads during migration. Migrating workloads requires time and architecture guidance. For example: do we refactor? How do we replace authentication solutions previously available on-prem? Reference architectures and implementations help development teams accelerate migration and refactoring. - Level 4: The capability catalog becomes more refined, narrowing down to a short list of highly expressive options. You can deploy infrastructure, platforms, and applications rapidly—and decommission just as easily. Lifecycle management across infrastructure, platform, and application is in place, with a strong emphasis on automation.
You are far more efficient in resource allocation and take advantage of placement options (e.g., deploying in Iowa instead of the West Coast to save on cost). Interfaces are clean and consistent, and applications scale effectively within your platform. Interfaces are well-defined and come in multiple forms to suit different needs, all with strong observability.
Developer experience is excellent, with clearly defined stages, easy-to-use interfaces, and built-in guidance. The platform leads developers toward secure, compliant, and efficient designs. Policy requirements are encoded into the platform and surfaced through portals, helping developers build according to organizational standards.
Solutions design becomes less manual. Policy is baked into the platform, and architectural consulting focuses more on refining and reviewing designs for alignment with business goals. You can spin up entire environments to test new capabilities without impacting production.
As tribal knowledge becomes codified in the platform, it must meet the same security classification standards as the workloads it supports (e.g., usingapplying HSMs with OpenBao when supporting medical applications with patient data). - Level 5: Everything is self-service and fully automated. Environments are right-sized, secured, cost-effective, and provisioned or decommissioned on demand.
Developers have access to a mature portfolio of capabilities and services that meet all functional and non-functional requirements. Automation includes AI/ML for detecting issues, suggesting optimizations, and generating pull requests for rapid review. It is easy and safe for developers to get the capabilities they need while being guided toward the right solutions.
Architectural decisions are made with full awareness of their implications and impacts. Design reviews take the form of pull requests, with the entire application stack—including IaC and Kubernetes manifests—rendered in code, accelerating delivery.
Platforms and Infrastructure
For the sake of clarity, we regard a platform or infrastructure that is used by application developers as production.
Level 1: You’re beginning to build your cloud infrastructure, whether on-premises or in the cloud. It’s important to consider foundational technologies early—networking, firewalls, IAM, access controls, and policies—and whether any of these need to change. As you experiment with Kubernetes, keep track of emerging needs and decisions; these will serve as breadcrumbs guiding your journey toward cloud native.
Expect to address RBAC policies, load balancer or ingress configurations, cluster dashboards, privileged access (or the removal of it), and container logging. Your goal is to move from managing servers as ‘pets’ to treating them as ‘livestock’ by investing in declarative infrastructure with Infrastructure as Code (IaC) tools.
Platform teams need dedicated engineering clusters to validate and iterate on their infrastructure work. These clusters provide the flexibility to build, test, and tear down resources manually as needed. However, any cluster intended for application developers—including those labeled as “development” or “integration”—must be treated as production-grade. These environments should be provisioned using your strategic IaC tooling (e.g., OpenTofu), with versioned modules that ensure consistency across dev, test, and production. For example, if deploying GKE with specific NIC/subnet settings, the module should be versioned and reused across all application environments to ensure predictability and compliance.
If a consolidated DevOps practice isn’t yet in place, involve your future operations team now to build familiarity and alignment. For cloud service providers, you’ll also need to consider regions, encryption key management, and integration with your corporate network.
Regardless of environment, version-control everything and adopt IaC to manage configurations. You’ll be deploying frequently and must be able to tear down environments just as easily without leaving behind configuration drift or cruft (‘crumbs’ of left over technical detail, such as files or configuration).
You’re also beginning to shape your deployment and operating models. Key questions will emerge: Does the same team own both the app and the cluster? How are clusters provisioned? How are applications onboarded? Are you adopting models like namespace-as-a-service or cluster-as-a-service? Will there be a shared responsibility model for the clusters?
Level 2: At this stage, you’re focused on managing configuration consistently across your environment. You’re moving away from imperative practices, such as shell scripts or “ClickOps” (manual configuration through web UIs), and any remaining use of these approaches is tightly controlled, documented, and versioned in source control (e.g., Git).
You now have solid solution architecture in place to define key requirements. The Kubernetes cluster is central to your operations, and you’re successfully meeting non-functional requirements such as disaster recovery (DR), high availability (HA), and secure access through properly configured IAM and RBAC. Networking is aligned with application needs, and infrastructure capacity planning is actively balancing cost and performance. For stateful applications, selecting appropriate storage classes and volume types is critical.
This is a good time to consider adopting GitOps tools or managed services to initialize and maintain Kubernetes clusters, especially for setting up core components like ingress controllers or cluster-scoped operators. While these investments may carry upfront costs, they pay long-term dividends in consistency, security, and scalability.
You’ll also need to revisit decisions made in Level 1—some of which were valid at the time but may not scale. Paying down infrastructure technical debt is essential at this point to avoid bottlenecks later. Consistency is key to supporting broader scale as you mature.
Wherever possible, leverage the operator pattern for deploying and managing third-party tools. Operators ensure consistency and lifecycle management in contrast to ad hoc, manual installations.
Your deployment and operating models are now established and delivering value to the business. However, they may be overly tailored to your initial workloads, requiring either refinement or reevaluation to accommodate new teams or applications. Your chosen operating model—whether centralized or federated—will significantly influence technical implementation. For example, centralization may point toward models such as namespace-as-a-service.
At Level 2, given the production status of the workload, it is worth considering your overall platform strategy. There are architecture aspects to this, as well as engineering. Review the Platforms White Paper from the CNCF as well as the Platform Maturity Model. As the white paper explains, “a platform is an integrated collection of capabilities defined and presented according to the needs of the platform’s users.” The paper also explains that there are consistent user experiences for managing the platform’s capabilities such as web portals, project templates and self-service APIs. As such, consider that your organization does already have a set of capabilities that together provide developers the ability to get business functionality into production; however given the likely unintegrated nature of them, with potentially different user experiences with different tooling, you’ll want to consider creating an internal developer platform (IDP). At level 2, the key aim will be the thinnest viable platform,
Level 3: At this stage, you’re building confidence in your infrastructure by gaining deep visibility into how it’s operating. Monitoring, alerting, and resource usage tracking become top priorities—not just at the node level (CPU, memory, etc.) but also across the entire cluster. You’re evolving from simply running infrastructure to managing it proactively.
Whereas in earlier stages you may have remediated failing components manually, now you’re replacing and redeploying them automatically. You’re beginning to manage infrastructure like software, leveraging Kubernetes as the control plane for elasticity and self-healing behavior. This means offloading more responsibility to the cluster itself through mechanisms like horizontal and vertical pod autoscalers, as well as application-focused autoscalers such as KEDA.
Advanced Kubernetes scheduling practices come into play, including the use of priority classes, quality of service tiers, affinity rules, and TopologySpreadConstraints. These capabilities help manage resource allocation and eviction behavior, which are critical in elastic, multi-tenant environments where infrastructure directly impacts user experience.
Your infrastructure must now support a more sophisticated software delivery lifecycle. This includes provisioning sandbox environments for application developers, platform engineers, and infrastructure teams to experiment and validate changes safely. Your deployment and operating models become more flexible to support a wider range of workload types and team needs.
You’re beginning to collect and act on platform usage data to inform architectural decisions. For example, if 95% of teams operate effectively within a namespace-as-a-service model, you may double down on optimizing that offering. If others require dedicated clusters, you’ll need to decide whether to offer cluster-as-a-service or guide them toward scalable alternatives. These decisions are costly to reverse, so it’s essential to balance strategic vision with measured data from real usage.
Infrastructure becomes harder to change as it scales—so flexibility must increasingly come from the platform and application layers rather than the infrastructure itself. To support this, you’ll need to clearly communicate platform capabilities, limitations, and upcoming evolutions. You’ll also begin to prioritize investments based on both customer needs and cost-awareness, recognizing that in a cloud environment, every resource consumed has a price.
Level 4: By this stage, Kubernetes and its API are second nature. You’ve matured your infrastructure practices and Infrastructure as Code (IaC) tooling, and you’re likely exploring ClusterAPI to automate the deployment and full lifecycle management of your clusters.
As platform maturity increases, so does the need for refined control and governance. You’re now implementing policies across the infrastructure control plane and related controllers to ensure consistent behavior, security, and compliance at scale.
Your deployment and operating models have been further refined, recognizing that no single approach fits every team. Earlier solutions—like namespace-as-a-service—were effective for rapid onboarding in lower-risk environments. Now, with more sophisticated needs across development and product teams, you’re supporting a broader range of workloads and operator-based services such as storage, databases, messaging, and logging.
This introduces more complexity and risk with each change to your operating model. Development teams are making more demands, and your platform must be flexible without sacrificing stability. To meet this need, you’re investing in centrally maintained Infrastructure as Code templates (e.g., using OpenTofu), while still allowing for team-specific customization such as networking or other infrastructure components.
The challenge at this level is to plan and support a diverse set of workloads across the enterprise—from lightweight APIs to complex, large-scale systems with demanding functional and non-functional requirements. Flexibility, consistency, and strong governance become the pillars of your infrastructure strategy as you scale to meet the needs of a growing and diverse engineering organization.
Level 5: At this stage, your entire infrastructure lifecycle—provisioning, upgrades, and decommissioning—is fully managed through software and codified processes. Every infrastructure change is made via code, enabling repeatability, auditability, and rapid iteration.
You are maximizing efficiency and minimizing technical debt. Infrastructure is highly flexible, cost-optimized, and aligned with business needs, supported by robust self-service interfaces that empower teams without compromising control. FinOps practices are well-integrated, ensuring that both resource utilization and cost are continuously optimized.
Tight GitOps-based control loops are in place across the entire infrastructure, eliminating configuration drift and enforcing consistency. Everything is version-controlled, declaratively defined, and managed as part of a cohesive system.
Shared or pooled costs are minimized or fully allocated. You have clear, effective mechanisms for cost management and chargeback, ensuring that infrastructure expenses are transparent and aligned with team or business unit usage.
Application Patterns and Refactoring
As you begin your cloud native journey, start with a small, manageable application—ideally a stateless, greenfield microservice. This will help you validate the fundamentals: Kubernetes access (kubectl), networking, platform capabilities, CI/CD processes, and initial security patterns. It also provides an opportunity to define application architecture standards, deployment templates, and policy guardrails that can scale across future projects.
While microservices are a common starting point, Kubernetes is increasingly used as a general-purpose runtime—meaning monoliths may still exist or even be newly developed, depending on business needs. It’s important to define the differences between microservices and monoliths early, and understand how microservices typically align better with Kubernetes’ strengths around scalability, independent deployment, and resilience. Regardless of architecture, focus on patterns that enable gradual modernization—such as the strangler pattern—which can guide future refactoring efforts.
Early cloud native adoption should focus on simple use cases that expose platform and process gaps quickly. Application and platform teams should work closely together at the start, ideally with cross-functional teams (e.g., developers paired with cloud specialists), to accelerate learning and reduce duplication of effort. Over time, these functions will split as platforms mature and can support broader reuse.
You’ll likely be dealing with technical debt, and some applications will depend on third-party services with their own roadmaps. These realities should inform your prioritization. Application teams must also become familiar with the cloud native services and capabilities offered by the platform team and may need to collaborate with platform product managers to shape future needs.
These early applications will serve as blueprints for broader adoption.
Here is a working model for the microservices path. You may adapt this to your model.
- Level 1: Begin by reviewing microservice patterns and architecture in the context of your specific applications. Non-functional requirements—such as latency, resilience, scalability, and third-party integrations—must be carefully considered. For example, splitting logic across pods can introduce latency (particularly across datacenters or regions), so it’s critical to evaluate architectural patterns early.
If you’re refactoring a monolith, expect significant redesign. Existing architectures may not have the technical scaffolding to support cloud native patterns. State management is a key concern—refactoring may require substantial changes here. This process should reinforce the understanding that moving to cloud native is a long-term commitment.
Cloud native platforms offer abstractions and capabilities that were previously hard to implement. With this flexibility comes the need to understand the quality and trade-offs of various infrastructure components—for example, object vs. block storage, or the selection of container network interfaces ( CNIs) or networking resources within Kubernetes itself: ingress controllers, and service meshes like the Gateway API. Gaining a comprehensive view of available options is essential as you refactor for Kubernetes.
A shift to declarative models introduces new non-functional requirements for application teams. One fundamental change is the ephemeral nature of infrastructure—developers must now design applications assuming that no single instance will persist. Availability responsibilities move from infrastructure to application, requiring developers to build resilience into the code.
Instead of depending on individual pods, applications must be managed via higher-level Kubernetes resources like Deployments or StatefulSets. This lowers infrastructure costs but increases developer accountability for reliability and availability.
Pods, by nature, are ephemeral. This affects caching strategies—developers must either implement persistent caching or use persistent volumes where necessary. Because applications will sit behind ingress controllers or load balancers, readiness and liveness probes become critical. End-to-end readiness checks—such as backend transaction validation—ensure only functional services are exposed to users.
Developers must also understand pod-to-container relationships and models like sidecars, which help separate concerns. IP addresses are dynamic, so service discovery must rely on DNS, the Kubernetes API or other appropriate means. - Level 2: You’re in production now, and the focus shifts to scale, availability, observability, and alignment between your platform and applications. You may be introducing service meshes and more advanced monitoring.
If you’re adopting GitOps, developers need a clear understanding of its key principles and how to get started. Irrespective of this decision, they should begin using Kubernetes-native configuration management tools. This includes externalizing configuration using ConfigMaps, Secrets, or other runtime mechanisms—rather than embedding configuration in the image at build time. This approach improves validation and reduces drift, making practices likegit diff
effective for tracking changes.
Because the platform is software, it requires regular maintenance. Kubernetes releases approximately three times per year, so establishing a proactive cluster lifecycle and maintenance process is critical. Regular updates should be scheduled as part of ongoing operations—not treated as exceptional events.
From the start, developers must understand that pods are ephemeral. They should account for this by implementing node and pod affinities, PodDisruptionBudgets, and TopologySpreadConstraints to ensure service continuity during cluster upgrades and disruptions. - Level 3: At Level 2, application patterns are well defined and there’s a strong push for consistency in foundational practices. At Level 3, you begin expanding beyond the basics and may encounter the limitations of your existing tooling.
Applications are refactored to better align with platform-native resource types—such as using object storage instead of persistent volumes—and to adopt operator-first patterns for lifecycle management. Where developers previously worked within a namespace-as-a-service model, they may now explore cluster-as-a-service options to gain greater isolation or flexibility.
You may introduce abstraction layers like Dapr to decouple infrastructure services (e.g., messaging, storage) from application code, simplifying development and improving portability. Kubernetes is no longer just an infrastructure platform—it’s evolving into a true application hosting platform, a foundation for your internal PaaS.
New application patterns are emerging, while older, less scalable ones are phased out. These shifts are guided by your evolving non-functional requirements and the capabilities of the underlying platform. - Level 4: At this stage, cloud native
design patterns are formalized and shared across the organization—often documented in Git repositories or collaboration tools like Confluence. Consistency in implementation is not only visible but may now be actively enforced.
The organization strikes a balance between standardization and flexibility. While standardization is ideal for scale and maintainability, some variation remains to support the “right tool for the job” approach. This balance directly influences application architecture and development patterns.
By this point, the organization is converging on a well-defined set of tools and practices that align with both platform capabilities and business needs. - Level 5: At this level, all new greenfield applications are developed with a cloud native-first approach—unless specific requirements (such as ultra-low latency) dictate otherwise. You’re actively onboarding your existing application portfolio to the platform using proven, repeatable processes.
Applications are now fully aligned with platform strengths and capabilities. You’ve arrived at the “right tool for the job” through an organic, Darwinian evolution—where scalable, resilient services have emerged as the standard. Infrastructure providers no longer constrain your design decisions.
Mature, well-matched application patterns are in place, supported by robust self-service APIs across both the platform and cloud native tooling. These APIs offer a wide range of capabilities that enable teams to operate efficiently at scale.
At this level, you’re harnessing the full power of cloud native and the elasticity of cloud infrastructure. For many organizations, this ability to scale seamlessly can have a direct and significant impact on cash flow, operational efficiency, and long-term viability.
Container and Runtime Management
- Level 1: At this stage, the focus is on learning to build and run containers. Teams must upskill in writing container files, building images, running them in clusters, and understanding how containers differ from virtual machines. Developers also need to work comfortably with containers on their local machines.
A platform team is established to build and manage Kubernetes clusters. Developers begin using development and integration tooling (e.g., Tekton), while infrastructure teams introduce Infrastructure as Code (e.g., OpenTofu) to provision cloud environments, including projects, VPCs, IAM, and Kubernetes infrastructure.
To prepare for production, container image builds are integrated into CI pipelines, and a container registry is adopted with clear versioning and tagging practices. Kubernetes is now the application runtime platform, and you are gathering all necessary deployment artifacts such as YAML files for Deployments, StatefulSets, Services, Ingress, LoadBalancers, PersistentVolumes, and more.
You likely begin using Helm charts (e.g., for Ingress-NGINX) and deploy your first operators for core functionality such as secrets management. Understanding the Kubernetes operator model and custom resources (CRDs) becomes highly valuable. - Level 2: Now running in production, you begin augmenting the basics with tools for security, policy enforcement, and workload configuration. You establish practices around container hygiene and begin defining policies for base images and dependency management. This may include:
Using standardized base images (e.g., UBI or internal Ubuntu mirrors)
Allowing upstream images from sources like Docker Hub or Quay.io with SBOM validation
Maintaining a catalog of hardened, source-built images
Security practices include automated scanning, runtime observability, and policy controls. CNCF projects become strong candidates to support observability and governance requirements.
Level 3: As your workloads grow and you scale operations, consistent tooling across clusters becomes essential for maintaining visibility into your Kubernetes environments. Tools like KArmada enable multi-cluster configuration management, helping ensure consistency and reducing configuration drift—even across regions or continents.
Namespace as a Service models are common, but maintaining efficiency becomes more challenging. Cost management becomes critical—when infrastructure scales by a factor of 10, even small inefficiencies (e.g., 80% vs. 90% utilization) have a large impact.
Release management complexity increases. You need to answer questions like: How do we support multiple clusters with a tool like Argo CD? Network planning becomes more important, particularly around IP address management and API server quotas. Even small clusters can place heavy load on Kubernetes API servers, so it’s vital to monitor quota limits with your cloud provider should you not be hosting on premise clusters.
Managing custom resources and operators also becomes a significant operational concern. You must define and document shared responsibilities across infrastructure, platform, and application teams. Operational processes such as cluster upgrades must be well-thought-out and follow best practices—like managing all configuration in source control.
Observability expands dramatically. The volume of metrics, logs, and traces increases with scale. You need to archive logs effectively and ask whether you should audit system calls if you are doing so. Certificate management and encryption (including key rotation) become critical, often necessitating service meshes for automatic certificate issuance and mTLS. Manual processes won’t scale—automation becomes mandatory.
You will likely be exploring different techniques and operators and discovering their limitations depending on your deployment model. Centralized tooling can become a bottleneck—for example, pulling artifacts from a single repository or funneling logs to a single destination. Distributed control may be necessary to avoid these constraints, such as using multiple Fluent Bit instances to aggregate logs instead of overwhelming a central log ingestion endpoint.
You begin to weigh tradeoffs between centralized and distributed models, each offering different benefits and challenges. Questions around maintaining application artifacts and dependencies at scale also emerge—you may be managing tens or hundreds of thousands of containers that require updates and patching.
Readiness and liveness probes take on new importance at scale. Readiness probes should validate end-to-end functionality, including downstream dependencies, to ensure the pod is truly ready for traffic. Liveness probes help maintain workload availability. As always, workloads must be treated as cattle, not pets.
Level 4: By Level 4, the platform vision and architecture are clearly defined. Experiments and ad hoc tooling from Level 3 are rationalized and replaced with strategic, standardized solutions. You may still have overlapping tools in use, but now you have a better understanding of their tradeoffs and suitability. Some of this clarity results from hard lessons and suboptimal solutions adopted under pressure at earlier levels.
Technical debt from Level 3 is acknowledged, and although paying it down is difficult, it now becomes a focused effort.
Level 5: At the highest level of maturity, your platform responds to events automatically. All security and operational data is centralized, allowing for coordinated and efficient action. The system is no longer reactive but proactive. Automated event responses, centralized observability, and full lifecycle automation make the container runtime environment robust, scalable, and secure.
Application Release and Operations
Managing a cluster with Infrastructure as Code (IaC) differs from managing application release and deployment, though many of the same techniques and tools apply to both.
Level 1: When starting with Kubernetes, it is important to gain as much hands-on experience as possible. You will become familiar with the Kubernetes API, write YAML manifests, and begin exploring configuration management and templating tools. Early on, you’ll be supporting dev, test, and prod environments, so it’s critical to evaluate tooling that can help you manage your manifests effectively from the start.
Version control systems—traditionally used by developers—become essential for release and operations in the cloud native world. Since releases are defined in code, they must be tracked, forming the basis for GitOps. This requires careful evaluation of branching models that align with your organization’s release policies. Ensure you understand the capabilities of your GitOps tooling and your release process. Remember the developer principle of “Don’t Repeat Yourself.”
Level 2: You are now using GitOps operators for rapid, consistent deployment across all environments. Controlling access to configuration repositories and ensuring they reflect what is deployed is critical. You are consuming Helm charts and upstream package artifacts to configure third-party tools.
Supply chain security techniques are being incorporated into your releases (see the Security and Policy section). Observability is now vital to operating cloud native applications. With increased flexibility in environment creation, new release paradigms—such as maintaining two production environments to allow upgrades at any time—can offer significant benefits.
You are adopting a “roll forward” approach to issue remediation, applying the last known good configuration to the cluster. Using resources as feature flags (e.g., with Kustomize) is an effective strategy for testing upgrades. Application teams are expected to maintain all components and third-party dependencies in their solutions.
If development and operations remain separate, there must be an approval process for promoting changes to production, with operations reviewing each release. SBOMs are required for any third-party applications, possibly as a contractual or licensing requirement. It is essential to follow strong security practices for both container images and Kubernetes deployments, and to document and share these practices. Robust configuration management accelerates testing and security patching.Level 3: Developers are now responsible for their own releases, including developing their own continuous deployment pipelines. They use the same deployment process for dev, test, and production environments. You are placing greater emphasis on provenance and controlling what enters your clusters (see Security and Policy section).
Instrumentation is robust and includes tracing, observability, service meshes, and mutual TLS. Awareness of cloud provider offerings increases, and performance becomes a key concern. You must now balance performance and cost.
Sharing learning across the organization is essential to avoid perpetuating inefficient practices and technical debt. Upstream third-party tools often release updates as quickly as your internal platform, making it important to stay current with new versions, features, and best practices. As you scale, capacity limits in your initial tooling may surface, making capacity planning and deeper tooling utilization critical.Level 4: You are actively securing the supply chain, and policies now govern both the release pipeline and runtime state. The Kubernetes API is used not only for container orchestration but also to manage other data center components having likely been extended with Crossplane and other technologies.
Organizations at this level can create and destroy production-ready clusters on demand and take advantage of beta and alpha APIs. Releases are automated, reliable, consistent, measurable, auditable, revertable, and quick. Artifacts are standardized and predictable.
Automation—including admission controllers—is used to validate workloads before production release. Security and policy controllers enforce defense-in-depth strategies. Kubernetes becomes a foundational platform component, and developers begin coding against infrastructure capabilities. Tools like Crossplane illustrate this evolution by integrating cloud and infrastructure lifecycles directly into applications.Level 5: Code is released at the highest level of abstraction and with idempotence from the underlying infrastructure, enabling maximum velocity and minimal vendor lock-in. You are effectively managing controlling artifacts and addressing technical debt.
Continuous deployment to production is now in place, supported by a fast, controlled, and automated release pipeline.
Testing and Issue Detection
Testing and issue detection evolve significantly as organizations adopt cloud native practices. This section outlines how testing, observability, and operational readiness mature across each level, from manual validation to automated recovery and continuous validation.
Level 1: When just starting out, most testing is manual, focused on your initial production candidate application. With Kubernetes, your attention is on basic network connectivity and confirming that applications can be successfully deployed. You will perform smoke tests and user acceptance testing (UAT).
In Levels 1 and 2, the emphasis is on consistency in container image builds and the continuous delivery process. You’ll rely on existing tools for unit testing and static code analysis. Both functional requirements (e.g., application logic) and non-functional requirements (e.g., performance, capacity, and availability) must be considered.
You’ll begin implementing liveness and readiness probes and incorporating observability tools. It’s important to define clear service level agreements (SLAs) based on business and customer expectations.Level 2: Now that you’re in production, you’ll begin experimenting with tools that support security, policy enforcement, misconfiguration detection, resource management, and observability—starting in staging or development environments.
Development teams are supported by platform and infrastructure teams for environment management. Tooling decisions should prioritize customer-impacting functionality. For example, don’t focus on low-priority policy controls if customers are experiencing latency issues that could be addressed by monitoring through a service mesh.
You’re actively prioritizing based on business needs and customer satisfaction. Production feedback becomes a valuable source of insight. Metrics should be tracked and visualized from both platform and application sources. While logging may be challenging at this stage, it is essential for effective troubleshooting. You may also start evaluating tracing tools.
This phase can be both exciting and challenging, as team members gain production experience at different speeds. Consistent deployments make testing easier, and a strong release pipeline improves issue remediation.
Level 3: Building on your tool and process experimentation, you now implement these practices in production. You establish robust alerting and dashboards, expanding your observability capabilities. Consistency in builds and deployments supports reproducible testing.
Development and test environments become shorter-lived and are spun up or torn down based on business requirements, following consistent specifications. These environments may be entire clusters or, for cost efficiency, isolated namespaces within a single cluster. You’ll begin investing in user interfaces that allow developers to create and destroy environments easily.
To manage the overhead of dynamic environments, automation is key. You are scaling your build pipelines, leveraging platform capabilities such as cost-efficient regional placement, resource management, and parallelization to increase build concurrency and support more comprehensive testing.
Vulnerability scanning at the container and cluster level, along with SBOMs, enables effective identification of dependency-related security issues. You’re also becoming more aware of the tradeoff between developer velocity and test coverage, including how frequently and which types of tests are run.
Level 4: Issues may now span multiple applications, requiring you to aggregate data across systems to identify trends. You’ve established consistent patterns for validating all environments—including production—and can create and destroy them with ease.
This includes platform considerations like load balancing and secret management to ensure environments feel seamless and plug-and-play. Data consistency across environment instances becomes increasingly important.
Recovery processes are now integrated into standard operations, including chaos engineering to simulate failures and validate system resilience. This helps ensure non-functional requirements like availability and fault tolerance are continuously met in practice.Level 5: At this level, both platforms and applications recover automatically and immediately. Restoration to a known good state is always possible and predictable. You have strong test coverage for both functionality and quality of service, and testing occurs as early in the development lifecycle as possible. Immutability and idempotency principles ensure systems can consistently return to a reliable state.
Security and Policy
This section outlines how security and policy practices mature alongside cloud native adoption, starting with basic IAM and secret management and evolving toward automated, policy-driven platforms. As organizations progress, they implement defense-in-depth strategies, enforce compliance through policy as code, and continuously optimize security in response to changing threats.
- Level 1: Begin building your secured CI/CD pipeline if you haven’t already, and remember that your current practices with VMs will evolve significantly. You have developed an Identity Provider and Identity and Access Management (IAM) infrastructure, integrating it into your clusters using tools such as RBAC and service accounts.
Following the 12-Factor principles, configuration is stored in the environment—including secrets, which are base64-encoded (not encrypted)—to allow stage-specific configurations (e.g., dev, test, prod). Much more can be stored in the environment, enabling immutable images and a strong separation between application and configuration. Avoid embedding environment-specific information, such as credentials, directly in container images.
You are becoming familiar with the Kubernetes API and are aware of its users. You also understand Kubernetes’ flat networking model, where all pods can connect to each other by default, with no inherent workload isolation. - Level 2: Ensure that development and operations teams follow best practices for container, secrets, and security management. In production, you must address encryption, authentication, and authorization. This includes certificate management and a functioning CA infrastructure that can issue certificates to running pods.
You are implementing secret management tools and automation. TLS or mutual TLS is being deployed at the cluster level and between pods—especially for sensitive workloads. A service mesh may be considered to enhance traffic visibility and manage network security features.
Your systems are auditable, with logs and events captured and retained. Generic accounts that are not traceable to individuals (e.g., “administrator” or “kubeadmin”) are not used, in contrast to service accounts used by software to access resources.
You may limit service exposure to load balancers and restrict network access to the production cluster to prevent unnecessary or unexpected exposure. These measures are especially important in multi-tenant clusters (e.g., Namespace as a Service).
Access policies are expanding to include source control, automation components, and dependencies used for managing clusters. You are extending GitOps to the platform layer, ensuring consistency and convergence to a known state, and reverting any drift—whether malicious or accidental.
You are beginning to restrict API access, and validating incoming requests using admission controllers, possibly including mutating admission controllers. - Level 3: Now is the time to automate deployment guardrails and platform components like certificate management while implementing security best practices through policy as code. Define your enforcement strategy and begin adopting relevant third-party benchmarks and standards. Consider incorporating anomaly and threat detection technologies.
As production environments grow more complex, some issue remediation may require changes to your policy-as-code, Infrastructure as Code, or application code.
You are evaluating SPIFFE/SPIRE as you move towards a zero trust model for security, and are well underway with certificate and trust store automation, and service mesh integration. Admission controllers now read from your policy platform, enforcing organization-wide or application-specific rules.
You are scanning container images, identifying and addressing CVEs, and maintaining SBOMs to provide provenance. You aim to meet SLSA Build Level 1 requirements. Machine learning may also be introduced to enhance threat detection practices. - Level 4: Apply your security policies to production, if you haven’t already, and continue tuning them. You have implemented measures to reduce attack surfaces, such as preventing manual pod access (e.g., no shell inside pods), while providing safer alternatives with full audit trails (e.g., Falco).
You’re improving security posture by removing the need for insecure workarounds or legacy practices. At this stage, you are working toward SLSA Build Level 2 compliance. - Level 5: Security policies are continuously optimized in response to evolving threats and business requirements. Exceptions are minimized and formally controlled. You are working toward compliance with SLSA Build Level 3.
Cost Efficiency, Resource Usage and Sustainability
This section outlines how efficiency and sustainability practices evolve from basic resource tuning and cost awareness to full optimization of workloads across architecture, geography, and carbon impact. As organizations mature, they incorporate FinOps, sustainability reporting, and developer accountability to achieve peak efficiency and minimize wasted resources.
Level 1: A common temptation at this level is to focus solely on containerizing and deploying workloads, without tuning resource requests and limits. Addressing these early yields long-term benefits. Developers will be involved, particularly in sizing memory allocations (e.g., Java heap sizes).
Level 2: The primary focus is reaching production. This often results in cluster sprawl across environments, leading to increased costs and operational complexity. You begin limiting resource consumption and may explore multi-tenancy (e.g., Namespace as a Service) or consolidating environments (e.g., dev and test in a single cluster with separate namespaces).
You might evaluate CNCF projects like Capsule to reduce the overhead of running multi-tenant clusters. Chargeback and FinOps capabilities are introduced in a basic form, such as tracking CPU and pod requests or applying quotas at the namespace level. These practices will become more granular as you mature. You also begin experimenting with vertical and horizontal pod autoscaling—initially based on platform metrics, with early exploration into application-level metrics to inform scaling decisions using KEDA.
Level 3: Sticker shock may occur here (or even earlier). You begin consolidating workloads and resources more aggressively, potentially using spot VMs and deploying in lower-cost regions, while balancing availability and performance. Idle development and test workloads are scaled down outside business hours where possible.
You become more aware of underlying infrastructure—such as machine types, chipsets like ARM or RISC-V, and specialized hardware like GPUs. FinOps becomes essential as multiple teams run production workloads. Cluster efficiency improves through approaches like Namespace as a Service, and you adopt tools to report and manage resource requests effectively.
Level 4: You begin reporting on carbon emissions and shift focus from pure cost efficiency to include sustainability goals. This includes carbon reporting and machine right-sizing, considering both size and architecture. Cost-benefit analysis becomes more rigorous—for example, evaluating whether a workload justifies GPU usage.
Developers now share responsibility for efficiency and are expected to optimize code based on the capabilities of the underlying platform and infrastructure.
Level 5: You’ve reached peak efficiency. Resource usage is optimized, FinOps reporting is mature, and your systems are highly sustainable—leveraging efficient chip architectures, immediate responsiveness, and minimal overhead. Wasted resources are minimized, and workloads are right-sized and deployed on the most appropriate platforms in the most efficient regions.
AI
The CNCF AI White Paper describes “Cloud Native Artificial Intelligence [as] an evolving extension of Cloud Native” that “…refers to approaches and patterns for building and deploying AI applications and workloads using the principles of Cloud Native. Enabling repeatable and scalable AI-focused workflows allows AI practitioners to focus on their domain”. In this context, cloud native works to solve with its scalability, resilience, observability and manageability many of the challenges that AI suffers.
Level 1: Starting out with initial development and experimentation, organizations explore basic AI concepts and conduct small-scale experiments, typically where the outcome is known using discriminative AI such as the classification of email. Developers working with AI will be working mostly on rapid prototyping and gaining access to resources such as storage, networking and processing for training (the process of building an AI model from data) and inference (computing results from AI models). Kubernetes facilitates resource access and model dependencies can be effectively managed through containerization and as OCI artifacts, models can be stored in registries and caching can be enabled.
Level 2: With the first production deployment of AI models, the emphasis shifts to ensuring the AI workload’s stability, basic scalability and service resiliency. Initial observability is important, and model drift needs to be tracked. Tools such as OpenTelemetry and Prometheus can also assist with monitoring load, number of accesses and response latency. Security is also a concern in production, with model serving instances requiring firewall production, access control, penetration testing, and compliance checks. Reaching production is a major step, but it’s not the last in increasing maturity.
Level 3: As the organization begins to scale, more complex AI applications like Generative AI and Large Language Models (LLMs) are introduced, and the focus shifts to standardizing MLOps processes and addressing inconsistencies between development and production environments. As such, if not already employed, then you may find yourself investigating solutions like Python SDKs for KubeFlow or general-purpose distributed computing engines like Ray (and KubeRay in Kubernetes). For more advanced Kubernetes scheduling needs, Kueue, Volcano and other non-CNCF tools may help. Data volumes and locality will also drive technical solution architecture. Rising processing demands from LLMs will lead to increased demand for accelerators such as GPUs, TPUs and other technologies. Given the cost implications, technologies such as that allow for sharing of resources such as vGPU and multi instance GPU may be investigated. Kubernetes development around Dynamic Resource Allocation may also help. In order to promote sustainability as well as reduce costs, models may be autoscaled for serving and placed into geographical regionals that are powered by cleaner energy.
Level 4: At this stage the organisation has achieved significant control over its AI operations, with mature MLOps practices and has formalized its cloud native design patterns for AI workloads. There’s a clear understanding of the AI supply chain, and policies are actively governing its security and runtime state. We see at this level the beginning of a symbiotic relationship where AI is used to improve cloud native systems themselves. Projects like K8sGPT can enter the hands of operators to enhance their productivity using LLMs for processing logs. This allows less technical users to operate complex systems. AI can also be used to identify workload patterns and anticipate load, and optimize resource scheduling for things like power conservation, resource utilization, latency and priorities. Best practices for the software supply chain should also be followed here. This will also include taking advantage of capabilities such as hardware-supported Trusted Execution Environments to protect sensitive data and valuable ML models. Security may also be enhanced by using AI itself as a ‘red team’ member for the identification of security gaps.
Level 5: In line with the overall drive towards efficiency, the AI lifecycle is fully automated and highly optimized through cloud native technologies, and is also deeply integrated into the operational fabric of the cloud native environment, with the organization achieving peak efficiency and sustainability for its AI workloads. Cloud native supports this through full infrastructure automation and continuous deployment for AI. AI itself is used for operational tasks, and there is optimised resource usage with full FinOps capabilities. Promoting environmental sustainability and developing energy-efficient AI models is crucial at this level.
Feedback
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.