
Cluster Autoscaler to Karpenter Migration 2026 - Zero Perceived Downtime, Cost Optimized Transition

TL;DR

Complete EKS migration guide from Cluster Autoscaler to Karpenter for 2026. Migrate safely with zero downtime using production-tested patterns. Karpenter provides 30-60% cost reduction through automatic consolidation and flexible instance selection. This guide covers step-by-step migration, YAML configurations, rollback procedures, and cost optimization strategies validated across 100+ EKS cluster migrations.

Karpenter vs Cluster Autoscaler: Karpenter provisions nodes in 30-60 seconds (vs 3-5 minutes), uses event-driven scaling (vs polling), and automatically consolidates nodes for 30-60% cost savings. Cluster Autoscaler remains viable but Karpenter offers superior cost optimization and operational efficiency for most EKS clusters.

  • Typical compute cost reduction range: 30 to 60 percent depending on workload shape
  • Pilot to production path: 4 to 8 weeks with phased approach
  • Zero perceived downtime: Use PodDisruptionBudgets, phased migration, canary approach

Why This Guide and Who Should Read It

Karpenter v1.0+ reached maturity in 2025 with stable APIs and comprehensive EKS support. As of 2026, Karpenter v1.8 is the latest stable version. This migration guide is designed for SREs, Platform Engineers, Heads of Platform, and CTOs planning to migrate from Cluster Autoscaler to Karpenter in 2026.

This guide delivers production-tested migration patterns, YAML configurations, and rollback procedures that enable zero perceived downtime transitions while achieving 30-60% cost reduction. Every recommendation is based on real-world production migrations observed across 100+ EKS cluster migrations. Patterns are validated against current AWS EKS and Karpenter v1.8 behavior as of 2026.

Executive Summary

This migration guide delivers production-tested patterns, YAML configurations, and rollback procedures that enable zero perceived downtime transitions from Cluster Autoscaler to Karpenter while achieving 30-60% cost reduction. Every recommendation includes real metrics, implementation code, and cost estimates based on ScaleWeaver production observations from 100+ EKS cluster migrations.

Key Metrics at a Glance

  • Average savings: 30-60% node cost reduction when migrating from Cluster Autoscaler to Karpenter (based on ScaleWeaver production observations)
  • Biggest wins: Faster provisioning (30-60 seconds vs. 3-5 minutes), automatic consolidation (30-60% node cost reduction), flexible instance selection (20-40% better price-performance), spot instance integration (60-90% compute savings)
  • 2026 updates: Karpenter v1.8 (latest as of 2026, v1.0+ supported) with stable APIs (NodePool + EC2NodeClass), comprehensive EKS support, production-ready migration patterns
  • Migration timeline: Pilot phase 2 weeks, validation 2 weeks, incremental rollout 2-6 weeks. Total: 4-8 weeks for complete migration

Quick Wins Checklist (Do These First)

  • Complete pre-migration readiness audit → Identify blockers and requirements
  • Create conservative initial Karpenter NodePool → Match current NodeGroup configuration
  • Migrate non-critical workloads first → Validate with minimal risk
  • Enable consolidation gradually → Start with WhenEmpty, move to WhenEmptyOrUnderutilized after validation
  • Monitor migration metrics closely → Track cost savings, provisioning time, and stability
  • Plan rollback procedures → Ensure safety net before full migration

Should You Migrate From Cluster Autoscaler to Karpenter?

Migrate from Cluster Autoscaler to Karpenter if you need 30-60% cost reduction through automatic consolidation and faster node provisioning (30-60 seconds vs 3-5 minutes). Keep Cluster Autoscaler if you have strict compliance requirements, limited engineering resources, or workloads that don't benefit from flexible instance selection. This decision guide helps you evaluate migration suitability for your EKS cluster.

Migrate to Karpenter if:

  • Cost optimization priority: You want to reduce node costs by 30-60% through consolidation and flexible instance selection
  • Faster scaling needs: Your workloads require faster node provisioning (30-60 seconds vs. 3-5 minutes)
  • Mixed workload types: You run diverse workloads that benefit from flexible instance selection
  • Spot instance adoption: You want native spot instance integration with automatic fallback
  • Multi-architecture support: You need to run workloads on x86, ARM (Graviton), or both
  • Consolidation automation: You want automatic node consolidation without manual intervention

Keep Cluster Autoscaler if:

  • Strict compliance requirements: Your organization requires extensive validation before adopting new autoscaling solutions
  • Limited engineering resources: Your team lacks capacity to manage migration and learn Karpenter patterns
  • Simple, stable workloads: Your workloads are uniform and don't benefit from flexible instance selection
  • Existing CA expertise: Your team has deep Cluster Autoscaler expertise and optimization already in place
  • Minimal cost pressure: Node costs are not a significant concern and current setup meets requirements
  • Short-term cluster lifecycle: The cluster will be decommissioned within 6-12 months

What Happens If You Do Nothing?

Cost impact: Continuing with Cluster Autoscaler typically results in 30-50% wasted spend on idle nodes that don't consolidate automatically. Over 12 months, a 50-node cluster can waste $96K-$180K on underutilized resources.

Operational impact: Slower scaling (3-5 minutes vs 30-60 seconds) means longer wait times during traffic spikes, potentially affecting user experience. Manual consolidation requires ongoing engineering effort.

Competitive disadvantage: Teams using Karpenter achieve better cost efficiency and faster scaling, allowing more budget for feature development. Delaying migration increases the cost gap over time.

Why Move from Cluster Autoscaler to Karpenter in 2026

Karpenter delivers 30-60% cost reduction through automatic consolidation and flexible instance selection, with faster provisioning (30-60 seconds vs 3-5 minutes) compared to Cluster Autoscaler. This section compares both autoscalers and explains when Karpenter migration makes sense for EKS clusters.

Cost Impact: HIGH SAVINGS (Foundation for all node cost optimizations)

Karpenter provides faster node provisioning (30-60 seconds vs. 3-5 minutes), better cost optimization through consolidation (30-60% node cost reduction), and more flexible instance type selection. Cluster Autoscaler remains viable but Karpenter offers superior cost savings and operational efficiency. For comprehensive Karpenter optimization strategies, see our Karpenter Best Practices 2026 guide.

| Feature | Karpenter 1.0+ | Cluster Autoscaler |
|---|---|---|
| Provision latency | 30-60 seconds | 3-5 minutes |
| Event model | Event-driven, immediate response | Polling-based, delayed response |
| Consolidation | Automatic with disruption budgets (see consolidation guide) | Manual or limited |
| Spot handling | Native integration, automatic fallback (see spot balancing guide) | Requires additional configuration |
| API maturity | Stable v1.0+ APIs (NodePool + EC2NodeClass), v1.8 latest as of 2026 | Mature but less flexible |

For official AWS guidance on Karpenter, see AWS EKS Karpenter best practices.

[Diagram: Cluster Autoscaler vs Karpenter node provisioning mechanisms]
Cluster Autoscaler uses polling-based scaling with fixed node groups, while Karpenter uses event-driven provisioning with flexible instance selection for faster response times.

When to choose Karpenter: You want 30-60% cost reduction, faster scaling, automatic consolidation, and flexible instance selection. Ideal for cost-optimized clusters with mixed workloads. See our Karpenter Best Practices 2026 guide for detailed optimization strategies. For EKS-specific guidance, refer to AWS EKS autoscaling documentation.

When to keep Cluster Autoscaler: You have strict compliance requirements, existing CA expertise, or workloads that don't benefit from consolidation. CA remains a solid choice for simpler use cases. See Cluster Autoscaler GitHub repository for official documentation.

Based on ScaleWeaver observations from 100+ EKS production clusters: Teams using Cluster Autoscaler often waste 30-50% on idle nodes. In our production migrations from Cluster Autoscaler to Karpenter, we typically see 30-60% reduction in node cost, mainly from consolidation and broader instance/Spot usage. One client reduced monthly node costs from $18K to $10K (44% savings) within 30 days of Karpenter migration, with zero downtime. Results vary by workload characteristics and cluster configuration.

Animated Migration Flow

[Animation: nodes being replaced during migration, color-coded Cluster Autoscaler (blue) vs Karpenter (purple). Karpenter nodes provision first, then workloads migrate, demonstrating zero-downtime migration.]

Business Outcomes and ROI Model

Conservative example: savings range 30-60 percent depending on bin packing and spot adoption. Typical outcomes for CPU-bound microservices with spot adoption show 30-60% compute cost reduction.

KPI targets to measure success:

  • Cost per namespace: 30-60% reduction
  • Node utilization: Increase from 40-50% to 70-80%
  • Pod bin packing ratio: Target 85%+
  • LCP improvement for user flows: Maintain or improve baseline

Quick ROI calculator inputs: Download our Migration Readiness Checklist which includes an XLS estimation template for calculating expected savings based on your cluster configuration.
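
For a quick back-of-the-envelope estimate before the spreadsheet, a shell sketch like the following works (the monthly cost and the 30-60 percent reduction range are inputs you supply; the numbers shown are illustrative):

# Rough savings estimate - plug in your own Cost Explorer numbers
CURRENT_MONTHLY_NODE_COST=18000   # USD per month, node compute only
LOW=30; HIGH=60                   # expected reduction range in percent

echo "Estimated monthly savings: \$$((CURRENT_MONTHLY_NODE_COST * LOW / 100)) - \$$((CURRENT_MONTHLY_NODE_COST * HIGH / 100))"
echo "Estimated annual savings:  \$$((CURRENT_MONTHLY_NODE_COST * LOW * 12 / 100)) - \$$((CURRENT_MONTHLY_NODE_COST * HIGH * 12 / 100))"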

[Chart: monthly node costs, Cluster Autoscaler vs Karpenter]
Cost comparison showing typical 30-60% reduction in monthly node costs when migrating from Cluster Autoscaler to Karpenter, based on consolidation and flexible instance selection.

Migration Strategy Overview

This section outlines the phased migration approach from Cluster Autoscaler to Karpenter, including zero-downtime principles, timeline expectations, and readiness requirements. The strategy uses PodDisruptionBudgets and gradual workload migration to ensure safe transitions.

[Diagram: before/after architecture - Cluster Autoscaler with fixed node groups vs Karpenter with flexible node provisioning]
Architecture comparison: Cluster Autoscaler requires fixed node groups, while Karpenter enables flexible node provisioning with automatic consolidation.

Migration Principles

Zero perceived downtime explained: Use PodDisruptionBudgets (PDBs), phased migration, and canary approach. PDBs ensure minimum pod availability during node transitions. Phased migration allows gradual workload movement. Canary approach tests with small subsets before full rollout.

Safety rules:

  • Respect PDBs - never force drain nodes with PDB-protected pods
  • Avoid force drains - use graceful eviction with proper grace periods
  • Verify PodDisruptionBudget compliance before each migration phase

High level timeline:

  • Pilot: 2 weeks (non-critical workloads)
  • Pilot validation: 2 weeks (monitoring and tuning)
  • Rollout: 2-6 weeks (incremental migration of critical workloads)

Migration Phases (Visual Timeline)

Phase 0: Readiness audit - Validate cluster configuration, tool versions, IAM permissions, and workload characteristics.

Phase 1: NodePool design + preflight tests - Create initial Karpenter NodePool configurations, test in isolated environment, validate provisioning behavior.

Phase 2: Pilot migration of non-critical workloads - Migrate batch jobs, background workers, and low-priority services. Monitor cost savings and stability.

Phase 3: Incremental migration of critical workloads - Migrate stateless API services, then stateful workloads with conservative policies.

Phase 4: Post migration optimization and hardening - Tune consolidation parameters, optimize instance selection, implement governance policies.

Pre-Migration Readiness Checklist

Validate cluster configuration, tool versions, IAM permissions, and workload characteristics before starting migration. This checklist ensures your EKS cluster meets minimum requirements for safe Karpenter migration.

Migration Readiness Checklist

Get a comprehensive checklist with version requirements, IAM audit steps, NodeGroup inventory template, and cost estimation worksheet. Use this to validate your cluster before starting migration.


Readiness Checklist - Quick Pass/Fail Items

Minimum Kubernetes and tool versions matrix:

| Component | Minimum Tested | Recommended |
|---|---|---|
| Kubernetes | >= 1.26 | >= 1.29 |
| Karpenter | >= v1.0 | v1.8 (latest as of 2026) |
| AWS CLI | >= 2.15.0 | Latest |
| eksctl | >= 0.150.0 | Latest |
| kubectl | >= 1.26 | Latest |

IAM & IRSA requirements: Required roles and trust policy patterns for Karpenter controller. Ensure proper IAM permissions for EC2 instance creation, VPC configuration, and node management. See AWS EKS IRSA documentation for IAM role setup.

NodeGroup audit: Document taints, daemonsets, instance types, and launch templates. This information is needed to replicate behavior in Karpenter NodePools.
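
A few safe, read-only commands for this audit (cluster and NodeGroup names are placeholders):

# List managed NodeGroups and capture their key settings
aws eks list-nodegroups --cluster-name my-cluster
aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name my-nodegroup \
  --query 'nodegroup.{instanceTypes:instanceTypes,capacityType:capacityType,labels:labels,taints:taints,launchTemplate:launchTemplate}'

# Capture node labels and taints as they exist today
kubectl get nodes --show-labels
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints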

PDB audit and default PDB recommendations: Review existing PodDisruptionBudgets. Create default PDBs for namespaces without protection to ensure safe migration. See Kubernetes PodDisruptionBudget documentation for PDB configuration patterns.
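
A minimal default PDB sketch for namespaces that lack one (name, namespace, and selector are placeholders; size minAvailable to your replica counts):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: default-pdb
  namespace: my-namespace        # placeholder - create one PDB per workload selector
spec:
  minAvailable: 1                # keep at least one replica up during node drains
  selector:
    matchLabels:
      app: my-app                # placeholder selector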

Inventory commands (safe commands, no jq required):

kubectl get nodes -o wide
kubectl get pods -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName

Add-on compatibility: Check kube-proxy and CNI plugin versions, plus any custom kubelet flags baked into launch templates, and confirm all components are compatible with Karpenter-provisioned nodes.

Designing Karpenter NodePools for Migration

NodePool Design Patterns

When designing Karpenter NodePools for migration, use patterns that minimize risk while enabling cost optimization. The key is starting conservative and gradually optimizing after migration validation. For comprehensive NodePool configuration strategies, see our NodePool configuration guide.

Single NodePool pattern vs. multiple NodePools pattern:

  • Single NodePool pattern: Use one NodePool for all workloads initially. Simplifies migration and reduces configuration complexity. Best for small to medium clusters.
  • Multiple NodePool pattern: Create separate NodePools for different workload classes (e.g., critical vs. non-critical, stateful vs. stateless). Enables fine-grained control but increases complexity. Best for large clusters with diverse workload requirements.

Spot + On-demand strategy pattern examples (a NodePool sketch follows this list):

  • Conservative migration pattern: Start with 100% on-demand for all workloads. After validation, gradually introduce spot instances for fault-tolerant workloads.
  • Balanced pattern: Use 70% on-demand, 30% spot for stateless workloads. 100% on-demand for stateful workloads. See spot balancing strategies for details.
  • Aggressive pattern: Use 80-90% spot for fault-tolerant workloads, 20-30% spot for critical workloads. Only after full migration validation.
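
A sketch of the balanced pattern (names are illustrative, and the "default" EC2NodeClass is defined later in this guide; Karpenter does not enforce exact percentage ratios - the split comes from which workloads are allowed to land on spot-capable nodes):

# Balanced pattern sketch: spot-capable NodePool for fault-tolerant, stateless workloads
# and an on-demand-only NodePool for workloads that must not be interrupted.
# Workloads choose a pool via a karpenter.sh/nodepool nodeSelector or taints/tolerations.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stateless-spot
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # Karpenter favors spot when available, falls back to on-demand
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stateful-on-demand
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  disruption:
    consolidationPolicy: WhenEmpty        # conservative disruption for stateful nodes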

Instance family considerations including Graviton options:

  • Start with instance families matching your current NodeGroups (e.g., m5, m6i) for compatibility
  • After validation, expand to include newer families (m7i, m7i-flex, c7i, c7g) for cost optimization
  • Enable Graviton (arm64) only after validating workload compatibility in non-production. See multi-architecture guide for details.
  • Use flexible instance family requirements to allow Karpenter optimal selection

Example YAML 1: Minimal safe provisioner

# Minimal safe NodePool for initial migration
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: migration-safe
spec:
  template:
    metadata:
      labels:
        migration-phase: pilot
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        # Match current NodeGroup instance types
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m5", "m6i"]  # Match your current setup
        # On-demand only for safety
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: 100  # Conservative limit
  disruption:
    consolidationPolicy: WhenEmpty  # Most conservative
    consolidateAfter: 300s

Example YAML 2: Production provisioner with taints and labels

# Production NodePool for migrated workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: production-migrated
spec:
  template:
    metadata:
      labels:
        workload-type: production
        migration-phase: complete
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      # Taints for workload isolation
      taints:
        - key: workload-type
          value: production
          effect: NoSchedule
      requirements:
        # Flexible instance families for cost optimization
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m7i", "m7i-flex", "m6i", "c7i", "c7g"]
        # Mix spot and on-demand
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Enable both architectures after validation
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
  limits:
    cpu: 1000
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest  # AL2023 is the recommended AMI family for current Kubernetes versions
  role: "KarpenterNodeRole-my-cluster"  # Replace with your Karpenter node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: optional
    httpPutResponseHopLimit: 2
    httpTokens: required

Manifests tested on Karpenter v1.0.0+ (v1.8 latest as of 2026)

Notes on NodeClass or NodeTemplate differences in 2026:

  • Karpenter v1.0+ uses EC2NodeClass (karpenter.k8s.aws/v1) together with NodePool (karpenter.sh/v1), replacing the legacy Provisioner and AWSNodeTemplate resources
  • EC2NodeClass provides better separation of concerns: node configuration (EC2NodeClass) vs. scheduling policy (NodePool)
  • Multiple NodePools can reference the same EC2NodeClass, reducing duplication
  • See Karpenter EC2NodeClass documentation for migration from AWSNodeTemplate, or our NodePool configuration guide for best practices
Migration NodePool Design: A client started with a single conservative NodePool matching their existing NodeGroup configuration. After 2 weeks of validation, they created separate NodePools for critical vs. non-critical workloads, enabling different consolidation and spot policies. This gradual approach minimized risk while enabling optimization.

Anti-Patterns to Avoid During Migration

  • Starting with aggressive consolidation policies (use WhenEmpty initially)
  • Enabling spot instances before validating on-demand behavior
  • Using instance families not tested in your environment
  • Creating too many NodePools initially (start simple, add complexity later)

Migration Execution: Step by Step Framework

Execute zero-downtime migration from Cluster Autoscaler to Karpenter using PodDisruptionBudgets, phased workload migration, and canary deployments. Karpenter nodes provision first, then workloads migrate gradually with validation at each step. This production-tested framework ensures safe transitions with rollback procedures at every phase, based on real-world EKS cluster migrations.

Step 0 - Validate Current CA Behavior

Before starting migration, establish a baseline by validating current Cluster Autoscaler behavior. This provides metrics to compare against after migration and helps identify any existing issues.

Commands to gather current CA scale metrics and logs:

# Check Cluster Autoscaler deployment status
kubectl get deployment cluster-autoscaler -n kube-system

# View Cluster Autoscaler logs
kubectl logs -n kube-system deployment/cluster-autoscaler --tail=100

# Check current node count and utilization
kubectl get nodes -o wide
kubectl top nodes

# Check pending pods (indicating scaling needs)
kubectl get pods --all-namespaces --field-selector status.phase=Pending

# Check NodeGroup status
aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name my-nodegroup

How to detect scaling slowdowns and failure cases:

  • Monitor pending pod duration - pods pending > 5 minutes may indicate CA scaling delays (see the command sketch after this list)
  • Check CA logs for errors or warnings about scaling decisions
  • Compare actual node count vs. desired capacity in NodeGroups
  • Review CA events: kubectl get events -n kube-system --field-selector involvedObject.name=cluster-autoscaler
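
To quantify pending-pod duration from the first item above, these read-only commands are enough (pod and namespace names are placeholders):

# Surface pending pods, oldest first - long-pending pods under CA are the baseline data point
kubectl get pods -A --field-selector=status.phase=Pending --sort-by=.metadata.creationTimestamp

# Inspect why a specific pod is pending
kubectl describe pod <pending-pod> -n <namespace> | grep -A 5 Events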
Baseline Validation: A client discovered during baseline validation that Cluster Autoscaler was taking 8-12 minutes to scale up during traffic spikes, causing pod scheduling delays. After migration to Karpenter, scaling time dropped to 30-60 seconds, eliminating the scheduling delays.

Step 1 - Create Conservative Initial Karpenter Provisioner

Start with a conservative NodePool configuration that mirrors your current NodeGroup setup. This minimizes risk during initial migration phases.

# Conservative initial NodePool for migration
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: migration-default
spec:
  template:
    metadata:
      labels:
        # karpenter.sh/nodepool is applied automatically by Karpenter - no need to set it here
        migration-phase: pilot
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        # Start with on-demand only for safety
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        # Match current NodeGroup instance types initially
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m5", "m6i"]  # Match your current NodeGroups
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: 100  # Conservative limit to start
    memory: 200Gi
  disruption:
    consolidationPolicy: WhenEmpty  # Conservative - only consolidate empty nodes
    consolidateAfter: 300s  # Wait 5 minutes before consolidating
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest  # AL2023 is the recommended AMI family for current Kubernetes versions
  role: "KarpenterNodeRole-my-cluster"  # Replace with your Karpenter node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: optional
    httpPutResponseHopLimit: 2
    httpTokens: required

Validation commands:

# Verify NodePool created successfully
kubectl get nodepool migration-default

# Check NodePool status
kubectl describe nodepool migration-default

# Verify EC2NodeClass
kubectl get ec2nodeclass default

# Check Karpenter controller is running
kubectl get deployment karpenter -n karpenter
kubectl logs -n karpenter deployment/karpenter --tail=50

Step 2 - Controlled Scale Tests

Test Karpenter provisioning behavior in a controlled environment before migrating production workloads.

Load scenario, scale up and down checks:

# Create test deployment to trigger scaling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-test
  namespace: default
spec:
  replicas: 0  # Start with 0
  selector:
    matchLabels:
      app: scale-test
  template:
    metadata:
      labels:
        app: scale-test
    spec:
      containers:
      - name: test
        image: nginx:latest
        resources:
          requests:
            cpu: 1000m
            memory: 1Gi
          limits:
            cpu: 2000m
            memory: 2Gi

Use kubectl top, metrics-server queries:

# Scale up test
kubectl scale deployment scale-test --replicas=10

# Monitor node provisioning
watch kubectl get nodes -l karpenter.sh/nodepool=migration-default

# Check node utilization
kubectl top nodes

# Scale down test
kubectl scale deployment scale-test --replicas=0

# Monitor node consolidation
watch kubectl get nodes -l karpenter.sh/nodepool=migration-default

Expected telemetry to watch:

  • Provisioning latency: Should be 30-60 seconds (vs. 3-5 minutes with CA)
  • Node readiness: Nodes should become Ready within 60-90 seconds
  • Cloud provider API errors: Monitor Karpenter logs for any AWS API errors
  • Pod scheduling: All pods should schedule successfully

Step 3 - Migrate a Small Workload Group

Start migration with a small, non-critical workload group using a canary approach.

Canary migration pattern with labels and namespace-based approach:

# Add node selector to target Karpenter nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: canary-workload
  namespace: pilot-migration
spec:
  replicas: 3
  selector:
    matchLabels:
      app: canary-workload
  template:
    metadata:
      labels:
        app: canary-workload
    spec:
      # Target Karpenter nodes
      nodeSelector:
        karpenter.sh/nodepool: migration-default
      # Prevent scheduling on CA-managed nodes
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: karpenter.sh/nodepool
                operator: Exists
      containers:
      - name: app
        image: my-app:latest
        resources:
          requests:
            cpu: 500m
            memory: 512Mi

How to route traffic gradually:

  • Start with 10% of traffic to Karpenter-managed pods
  • Monitor for 24-48 hours
  • Gradually increase to 25%, 50%, 100%
  • Use a service mesh or ingress controller for traffic splitting (see the ingress sketch after this list)
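
One possible traffic-splitting sketch, assuming the ingress-nginx controller (hostname, service name, and the canary weight are illustrative; adapt to your mesh or ingress of choice):

# Canary Ingress sending ~10% of traffic to the Karpenter-backed deployment
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: canary-workload
  namespace: pilot-migration
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # raise gradually: 10 -> 25 -> 50 -> 100
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: canary-workload
                port:
                  number: 80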

Validation checklist after migration:

  • All pods running and healthy: kubectl get pods -n pilot-migration
  • No pod disruption: Verify PDBs are respected
  • Service latency within baseline: Monitor application metrics
  • Cost metrics: Compare node costs vs. baseline
  • Node utilization: Should be higher than CA-managed nodes

Step 4 - Disable CA for Migrated Namespaces Safely

Once workloads are running successfully on Karpenter, reduce Cluster Autoscaler's influence over those namespaces so the two autoscalers do not fight over the same capacity.

Namespace-scoped strategy to reduce CA involvement: Cluster Autoscaler has no namespace-level on/off switch; it scales the Auto Scaling groups behind the NodeGroups it manages and evicts pods during scale-down. In practice, you reduce its influence over migrated workloads by (a) pinning those workloads to Karpenter nodes with the nodeSelector/affinity from Step 3, (b) lowering the min/desired size of the corresponding NodeGroups, and (c) annotating pods that must not be evicted with cluster-autoscaler.kubernetes.io/safe-to-evict: "false". Label migrated namespaces for tracking:

# Track which autoscaler owns the namespace's workloads
apiVersion: v1
kind: Namespace
metadata:
  name: pilot-migration
  labels:
    autoscaling: karpenter  # tracking label only - CA does not read namespace labels or annotations

Safe commands to reduce CA influence (do not delete CA immediately):

# Check which namespaces have been handed over to Karpenter (tracking label)
kubectl get namespaces -L autoscaling

# Verify CA is not scaling for migrated namespace
kubectl logs -n kube-system deployment/cluster-autoscaler | grep pilot-migration

# Monitor CA activity
kubectl get events -n kube-system --field-selector involvedObject.name=cluster-autoscaler
Important: Keep Cluster Autoscaler running during the entire migration period. Only shrink its NodeGroups for workloads that have been fully migrated and validated. This provides a safety net for rollback if needed.

Step 5 - Expand Migration Batch Sizes

After successful pilot migration, expand to larger batches of workloads.

How to pick next batches:

  • Group workloads by risk level (low, medium, high)
  • Migrate similar workload types together (e.g., all stateless APIs)
  • Consider dependencies - migrate dependent services together
  • Start with workloads that benefit most from Karpenter (variable traffic, batch jobs)

Monitoring per-batch metrics:

  • Cost per namespace: Track before/after costs
  • Node utilization: Should increase with Karpenter
  • Pod scheduling time: Should decrease
  • Service availability: Should maintain or improve
  • Error rates: Should not increase
# Monitor migration progress
# Cost tracking
aws ce get-cost-and-usage \
  --time-period Start=2026-01-01,End=2026-01-31 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=USAGE_TYPE

# Node utilization
kubectl top nodes -l karpenter.sh/nodepool=migration-default

# Pod scheduling metrics
kubectl get pods --all-namespaces --field-selector status.phase=Pending

Step 6 - Decommission Old NodeGroups

After all workloads are migrated and validated, safely decommission Cluster Autoscaler-managed NodeGroups.

Safe termination pattern, drain with respect to PDB, ensure eviction windows:

# Step 1: List remaining CA-managed nodes (no karpenter.sh/nodepool label) and check what still runs on them
kubectl get nodes -l '!karpenter.sh/nodepool'
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<ca-node-name>

# Step 2: Cordon the node (prevent new pods)
kubectl cordon <node-name>

# Step 3: Drain the node with respect to PDBs
kubectl drain <node-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300 \
  --timeout=600s \
  --force=false  # Never use --force for stateful workloads

# Step 4: Verify node is empty
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

# Step 5: Scale down the managed node group (EKS requires maxSize >= 1)
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --scaling-config minSize=0,maxSize=1,desiredSize=0

Critical: Never Use --force Flag

Using kubectl drain --force bypasses PodDisruptionBudgets and can cause data loss or service disruption. Always respect PDBs and use proper grace periods.

Recommended kubectl drain command with flags (conservative safe defaults; keep comments off the continuation lines or the command breaks):

# --ignore-daemonsets: required, DaemonSet pods are not evicted
# --delete-emptydir-data: allow eviction of pods using emptyDir (that data is lost - use with caution)
# --grace-period=300: 5 minute grace period
# --timeout=600s: 10 minute timeout
# --force=false: never force-delete unmanaged pods in production
kubectl drain <node> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300 \
  --timeout=600s \
  --force=false
Safe Decommissioning: A client successfully decommissioned 50 CA-managed nodes over 2 weeks by draining 2-3 nodes per day during low-traffic windows. Zero pod disruptions occurred because PDBs were properly configured and respected throughout the process.

Testing, Observability and Validation

Observability Checks

Comprehensive monitoring is critical during migration to ensure zero downtime and validate cost savings.

Required metrics and alerts (Prometheus queries):

  • Node packing ratio: Requested vs. allocatable CPU/memory per node (target: 85%+ utilization)
  • Pod density: Pods scheduled per node (monitor for over/under-provisioning)
  • Node churn: Rate of node creation and termination (should be stable)
  • Provisioning latency: Time from pod pending to node ready (target: < 60 seconds)
  • Spot interruption rate: Percentage of spot nodes interrupted (monitor for availability impact)
# Sample PromQL queries for Karpenter monitoring
# (metric names vary slightly between Karpenter releases - verify against the controller's /metrics endpoint)

# Pods per node (requires kube-state-metrics; join with kube_node_labels to restrict to Karpenter nodes)
count(kube_pod_info) by (node)

# Provisioning latency (p95) - example using a Karpenter latency histogram
histogram_quantile(0.95,
  rate(karpenter_nodeclaims_created_seconds_bucket[5m])
)

# Node CPU utilization (requires node_exporter)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Spot interruption rate relative to nodes created
rate(karpenter_interruptions_total{action="terminate"}[1h]) /
rate(karpenter_nodes_created_total[1h])

# Pod scheduling failures
rate(karpenter_podscheduling_errors_total[5m])

# Node churn (nodes created per hour)
rate(karpenter_nodes_created_total[1h])

Grafana dashboard wireframe: Create dashboards for:

  • Migration progress (workloads migrated, nodes decommissioned)
  • Cost comparison (CA vs Karpenter node costs)
  • Performance metrics (provisioning time, node utilization)
  • Availability metrics (pod disruptions, service uptime)
[Dashboard: Karpenter observability - node metrics, provisioning latency, cost savings]
Karpenter observability dashboard showing node provisioning metrics, consolidation events, and cost savings tracking for migration validation.

Failure Simulations and Validation Scripts

Test failure scenarios to validate Karpenter's resilience and your rollback procedures.

Simulate spot interruption and graceful handling commands (AWS CLI safe snippets):

# Simulate spot interruption (terminate a spot node)
# First, identify a spot node
kubectl get nodes -l karpenter.sh/capacity-type=spot -o name | head -1

# Get the instance ID
INSTANCE_ID=$(kubectl get node <node-name> -o jsonpath='{.spec.providerID}' | cut -d'/' -f5)

# Option A: Use AWS Fault Injection Service to send a real spot interruption notice
# (action aws:ec2:send-spot-instance-interruptions) - closest to a production interruption

# Option B: Terminate the spot instance directly (use with caution in production);
# note that cancel-spot-instance-requests alone does not terminate a running instance
# aws ec2 terminate-instances --instance-ids $INSTANCE_ID

Simulate node termination, validate rescheduling:

# Cordon and drain a node to simulate termination
NODE_NAME=$(kubectl get nodes -l karpenter.sh/nodepool=migration-default -o jsonpath='{.items[0].metadata.name}')

# Cordon the node
kubectl cordon $NODE_NAME

# Drain with respect to PDBs
kubectl drain $NODE_NAME --ignore-daemonsets --delete-emptydir-data --grace-period=300

# Monitor pod rescheduling
watch "kubectl get pods --all-namespaces -o wide | grep -v $NODE_NAME"

# Verify new node provisioned
kubectl get nodes -l karpenter.sh/nodepool=migration-default

Smoke tests for app traffic:

# Test application endpoints
curl -f https://api.example.com/health || echo "Health check failed"

# Load test to trigger scaling
kubectl run load-test --image=busybox --rm -it --restart=Never -- \
  sh -c "while true; do wget -q -O- http://app-service/health; sleep 1; done"

# Monitor scaling behavior
watch kubectl get nodes -l karpenter.sh/nodepool=migration-default

Acceptance Criteria Checklist

Quantitative success criteria and pass thresholds for migration validation:

  • Node packing ratio: >= 85% of allocatable CPU/memory requested on Karpenter nodes
  • Service latency: Within baseline +/- 5% (no degradation)
  • Provisioning time: < 60 seconds (vs. 3-5 minutes with CA)
  • Pod disruption: Zero unplanned disruptions during migration
  • Cost reduction: 30-60% node cost reduction achieved
  • Availability: 99.9%+ uptime maintained throughout migration
#!/bin/bash
# Validation script example (pods-per-node is a simple density heuristic;
# true packing ratio is requested vs. allocatable resources - see the PromQL section)

MIN_PODS_PER_NODE=10   # tune to your workload profile

TOTAL_PODS=$(kubectl get pods --all-namespaces --no-headers | wc -l)
KARPENTER_NODES=$(kubectl get nodes -l karpenter.sh/nodepool=migration-default --no-headers | wc -l)

if [ "$KARPENTER_NODES" -gt 0 ]; then
  PODS_PER_NODE=$(echo "scale=2; $TOTAL_PODS / $KARPENTER_NODES" | bc)
  if (( $(echo "$PODS_PER_NODE >= $MIN_PODS_PER_NODE" | bc -l) )); then
    echo "✓ Average pods per Karpenter node: $PODS_PER_NODE (PASS)"
  else
    echo "✗ Average pods per Karpenter node: $PODS_PER_NODE (FAIL - target: $MIN_PODS_PER_NODE+)"
  fi
else
  echo "✗ No nodes found for NodePool migration-default"
fi

# Check for pending pods
PENDING_PODS=$(kubectl get pods --all-namespaces --field-selector status.phase=Pending --no-headers | wc -l)
if [ "$PENDING_PODS" -eq 0 ]; then
  echo "✓ No pending pods (PASS)"
else
  echo "✗ $PENDING_PODS pending pods (FAIL)"
fi

# Check node provisioning time (requires metrics)
# This would query Prometheus metrics in production

Cost, Performance, and Benchmarking

Cost Benchmark Methodology

Accurate cost comparison requires normalizing workloads and traffic patterns between CA and Karpenter periods.

How to compare before and after (normalized workloads, same traffic):

  • Run comparison during similar traffic periods (same day of week, time of day)
  • Ensure workload characteristics are identical (same pod counts, resource requests)
  • Compare over at least 1 week to account for daily variations
  • Exclude one-time migration costs from comparison

Cost model fields:

  • Compute cost: EC2 instance costs (on-demand + spot)
  • EBS cost: Storage costs for node volumes
  • Networking cost: Data transfer and NAT gateway costs (typically minimal change)
# Cost comparison script
# Get CA-managed node costs (before migration)
aws ce get-cost-and-usage \
  --time-period Start=2026-01-01,End=2026-01-07 \
  --granularity DAILY \
  --metrics BlendedCost \
  --filter file://ca-nodes-filter.json

# Get Karpenter-managed node costs (after migration)
aws ce get-cost-and-usage \
  --time-period Start=2026-01-15,End=2026-01-21 \
  --granularity DAILY \
  --metrics BlendedCost \
  --filter file://karpenter-nodes-filter.json

# Calculate savings percentage
# (CA_cost - Karpenter_cost) / CA_cost * 100

Downloadable savings estimator: our Migration Readiness Checklist includes a cost estimation template (XLSX) for calculating expected savings based on your cluster configuration.

Typical Outcomes and Realistic Ranges

As a conservative example, a 30 to 60 percent compute cost reduction is typical for CPU-bound microservices with spot adoption.

Real-World Migration Results: Based on ScaleWeaver production migrations from Cluster Autoscaler to Karpenter, we typically observe 30-60% reduction in node cost, mainly from consolidation and broader instance/Spot usage. Real-world savings: $8K-$15K monthly on 50-node clusters. Actual savings vary based on workload characteristics, utilization patterns, and cluster configuration.

Cost reduction factors:

  • Consolidation: 20-40% savings from better node utilization
  • Spot instances: 50-70% additional savings on fault-tolerant workloads
  • Instance flexibility: 10-20% savings from optimal instance selection
  • Multi-architecture (Graviton): 15-30% additional savings when compatible

Risk Mitigation and Rollback

Understand common migration failure modes and how to safely rollback to Cluster Autoscaler if needed. This section covers failure scenarios, remediation steps, and validation procedures observed across production migrations.

Common Migration Failure Modes and Remediation

Understanding common failure scenarios helps prevent issues and enables faster resolution.

DaemonSet Incompatibilities

Symptoms: DaemonSets fail to schedule on Karpenter nodes, or nodes fail to join cluster.

Root Cause: DaemonSets may have node selectors or tolerations that don't match Karpenter node labels/taints.

Remediation:

# Check DaemonSet node selectors
kubectl get daemonset -A -o yaml | grep -A 5 nodeSelector

# Update DaemonSet to tolerate Karpenter nodes
kubectl patch daemonset my-daemonset -n kube-system --type='json' -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/tolerations/-",
    "value": {
      "key": "karpenter.sh/nodepool",
      "operator": "Exists",
      "effect": "NoSchedule"
    }
  }
]'

Stateful Workload Eviction

Symptoms: StatefulSet pods evicted during consolidation, causing data loss or service disruption.

Root Cause: Insufficient PodDisruptionBudgets or aggressive consolidation policies on stateful workloads.

Remediation:

  • Create PDBs for all stateful workloads before migration
  • Use WhenEmpty consolidation mode for stateful workloads
  • Exclude stateful workloads from aggressive consolidation NodePools
# Create PDB for stateful workload
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stateful-workload-pdb
  namespace: production
spec:
  minAvailable: 2  # Ensure at least 2 pods always available
  selector:
    matchLabels:
      app: stateful-app

Node Affinity and Topology Spread Constraints

Symptoms: Pods fail to schedule on Karpenter nodes despite available capacity.

Root Cause: Pod node affinity or topology spread constraints don't match Karpenter node labels or topology.

Remediation:

# Check pod scheduling constraints
kubectl get pod <pod-name> -o yaml | grep -A 10 affinity

# Update pod to match Karpenter node labels
# Add node selector or update affinity rules
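
An illustrative pod-spec fragment that constrains scheduling with labels Karpenter actually sets, rather than NodeGroup-specific labels such as eks.amazonaws.com/nodegroup (the NodePool name references the production-migrated example earlier in this guide; adjust to your own):

# Pod-spec fragment (illustrative)
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: karpenter.sh/nodepool      # label applied by Karpenter to its nodes
                operator: In
                values: ["production-migrated"]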

Safe Rollback Plan

Step by step rollback commands to re-enable CA, ensure NodeGroups are recreated, avoid data loss.

#!/bin/bash
# Safe rollback script

set -e

echo "Starting rollback procedure..."

# Step 1: Make sure Cluster Autoscaler is running (scale it back up if it was scaled down)
echo "Step 1: Verifying Cluster Autoscaler is running..."
kubectl -n kube-system scale deployment cluster-autoscaler --replicas=1
kubectl -n kube-system rollout status deployment/cluster-autoscaler

# Step 2: Scale up NodeGroups if needed
echo "Step 2: Scaling up NodeGroups..."
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --scaling-config minSize=2,maxSize=10,desiredSize=5

# Step 3: Remove node selectors from workloads to allow CA scheduling
echo "Step 3: Removing Karpenter node selectors..."
kubectl patch deployment <workload-name> -n <namespace> --type='json' -p='[
  {"op": "remove", "path": "/spec/template/spec/nodeSelector/karpenter.sh~1nodepool"}
]'

# Step 4: Verify workloads reschedule onto CA-managed nodes (nodes without the karpenter.sh/nodepool label)
echo "Step 4: Verifying workloads on CA nodes..."
kubectl get nodes -l '!karpenter.sh/nodepool'
kubectl get pods -n <namespace> -o wide

# Step 5: Remove Karpenter provisioners for failed workloads
echo "Step 5: Removing Karpenter NodePools..."
kubectl delete nodepool migration-default

# Step 6: Monitor for stability
echo "Step 6: Monitoring cluster stability..."
watch kubectl get nodes
Rollback Safety: Keep Cluster Autoscaler running during the entire migration period. Never delete CA until all workloads are successfully migrated and validated for at least 1 week. This provides a safety net for immediate rollback if needed.

Post Rollback Validation Checks

Quick smoke tests and timelines after rollback:

  • Verify all pods are running: kubectl get pods --all-namespaces | grep -v Running
  • Check service endpoints are responding
  • Monitor for 1 hour to ensure stability
  • Review logs for any errors or warnings
  • Validate NodeGroups are scaling properly

Post Migration Optimization

After successful migration, optimize Karpenter NodePools for maximum cost savings. This section covers consolidation tuning, instance selection optimization, and ongoing cost operations based on production observations.

Consolidation and Bin Packing

After successful migration, tune consolidation parameters to maximize cost savings while maintaining availability. For comprehensive consolidation strategies and best practices, see our consolidation guide.

How to tune consolidation parameters in Karpenter:

  • Start with conservative settings (WhenEmpty, 5-minute consolidateAfter)
  • Gradually move to aggressive settings (WhenEmptyOrUnderutilized, 30-second consolidateAfter)
  • Monitor pod disruption rates and adjust based on PDB compliance
  • Use different consolidation policies per NodePool based on workload criticality

Recommended default values and monitoring:

  • Fault-tolerant workloads: WhenEmptyOrUnderutilized, consolidateAfter: 30s
  • Stateless APIs: WhenEmptyOrUnderutilized, consolidateAfter: 2m
  • Critical services: WhenEmpty, consolidateAfter: 5m
  • Stateful workloads: WhenEmpty only, consolidateAfter: 10m
# Post-migration optimized NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: optimized-production
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
      expireAfter: 720h  # 30 days to prevent drift (lives under template.spec in v1)
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # Aggressive for cost
    consolidateAfter: 2m  # Balanced for stateless APIs

Drift Control and Governance

Drift detection methods and reconciler patterns to maintain configuration consistency.

Drift detection methods:

  • Monitor node age and enforce expireAfter policies (see the audit commands after this list)
  • Compare actual node configuration vs. NodePool spec
  • Alert on nodes that don't match current NodePool requirements
  • Regular audits of NodePool configurations
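
A minimal audit sketch using labels and resources Karpenter exposes (the NodePool name and the drift grep are illustrative):

# Node age and owning NodePool, oldest first
kubectl get nodes -L karpenter.sh/nodepool --sort-by=.metadata.creationTimestamp

# NodeClaims and their status; drifted NodeClaims are surfaced via status conditions
kubectl get nodeclaims -o wide
kubectl describe nodeclaim <nodeclaim-name> | grep -i -A 2 drift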

Automation examples: GitOps policy to keep provisioner spec in repo:

# GitOps workflow for NodePool management
# 1. Store NodePool YAML in Git repository
# 2. Use ArgoCD or Flux to sync to cluster
# 3. Any manual changes are automatically reverted
# 4. All changes go through PR review process

# Example ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: karpenter-nodepools
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/k8s-configs
    path: karpenter/nodepools
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: karpenter
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Ongoing Cost Ops Playbook

Weekly sprint items, monitoring cadence, governance checkpoints for continuous cost optimization.

Weekly sprint items:

  • Review node utilization metrics and identify underutilized nodes
  • Analyze spot interruption rates and adjust spot/on-demand mix
  • Review cost reports and identify optimization opportunities
  • Update NodePool configurations based on workload changes

Monitoring cadence:

  • Daily: Check for pending pods and provisioning delays
  • Weekly: Review cost metrics and node utilization
  • Monthly: Comprehensive cost optimization review
  • Quarterly: NodePool configuration audit and optimization

Governance checkpoints:

  • All NodePool changes require PR review
  • Cost impact analysis required for configuration changes
  • Staging validation before production changes
  • Documentation updates for any configuration changes

For comprehensive FinOps strategies, see our EKS Best Practices 2026 guide.

Need a Production Readiness Audit for Migration?

Our team specializes in Cluster Autoscaler to Karpenter migrations with zero downtime. We provide comprehensive migration readiness assessments, step-by-step execution support, and post-migration optimization to help you achieve 30-60% cost reduction.

Advanced Topics

Advanced migration scenarios including multi-AZ deployments, GPU workloads, large cluster patterns, and mixed autoscaler setups. These patterns are based on complex production environments with diverse workload requirements.

Multi AZ and Multi Region Considerations

Load balancing, control plane latency, failover patterns for multi-AZ and multi-region deployments.

Load balancing: Karpenter automatically distributes nodes across availability zones when multiple subnets are configured. Ensure your EC2NodeClass subnet selector includes subnets from all AZs.

# Multi-AZ EC2NodeClass
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
        # This will match subnets in all AZs with the discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"

Control plane latency: Multi-region deployments require careful consideration of control plane latency. Karpenter controller should be deployed in the same region as the cluster for optimal performance.

Failover patterns: Configure topology spread constraints to ensure pods are distributed across AZs, enabling automatic failover during AZ outages.
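
A minimal topology-spread sketch for zone-level failover (the app label is a placeholder; relax whenUnsatisfiable to ScheduleAnyway if hard spreading blocks scheduling during migration):

# Deployment pod-template fragment (illustrative)
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app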

GPU and Specialized Node Migrations

How to migrate GPU workloads, node labels and scheduling for specialized hardware.

GPU workload migration: Create dedicated NodePools with GPU instance requirements and taints to isolate GPU workloads.

# GPU NodePool for specialized workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-workloads
spec:
  template:
    metadata:
      labels:
        accelerator: nvidia-tesla-t4
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g4dn.xlarge", "g4dn.2xlarge", "g5.xlarge"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]  # GPU instances typically on-demand
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  limits:
    cpu: 100

Node labels and scheduling: Add tolerations to GPU workloads to allow scheduling on GPU nodes. Use node affinity to prefer GPU nodes for GPU workloads.
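
An illustrative pod-spec fragment for a GPU workload targeting the gpu-workloads NodePool above (the image name is a placeholder; the nvidia.com/gpu resource assumes the NVIDIA device plugin is installed on GPU nodes):

# Pod-spec fragment (illustrative): tolerate the GPU taint, pin to the GPU NodePool, request a GPU
spec:
  nodeSelector:
    karpenter.sh/nodepool: gpu-workloads
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: trainer
      image: my-gpu-app:latest          # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1             # requires the NVIDIA device plugin DaemonSet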

Large Cluster Patterns and Performance Limits

Cluster partitioning patterns, scale domains for large-scale deployments.

Cluster partitioning patterns: For clusters with 1000+ nodes, consider partitioning workloads across multiple NodePools to improve scheduling performance and reduce Karpenter controller load.

Scale domains: Karpenter can handle clusters up to several thousand nodes, but performance may degrade. Monitor Karpenter controller metrics and consider horizontal scaling if needed.

Performance limits: Based on AWS guidance, Karpenter can efficiently manage clusters with 1000-2000 nodes. For larger clusters, consider multiple clusters or cluster partitioning strategies.

Mixed Autoscaler Setups

Running Karpenter and CA side by side during transition. When to remove CA entirely.

Running both side-by-side: Cluster Autoscaler keeps managing its NodeGroups while Karpenter manages its own nodes, so the two can coexist safely. Pin migrated workloads to Karpenter nodes (nodeSelector/affinity), shrink the corresponding NodeGroups, and label migrated namespaces for tracking. This allows gradual migration with zero downtime.

# Track which namespaces have moved to Karpenter
apiVersion: v1
kind: Namespace
metadata:
  name: migrated-workloads
  labels:
    autoscaling: karpenter  # tracking label only - CA does not read namespace labels

When to remove CA entirely: Remove Cluster Autoscaler only after the following are true (a removal sketch follows the list):

  • All workloads are migrated and validated
  • All CA-managed NodeGroups are decommissioned
  • Monitoring shows stable Karpenter operation for at least 1 week
  • Rollback procedures are tested and documented
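
A minimal removal sketch once every item above holds (this assumes CA was installed as a plain Deployment in kube-system; adjust if it was installed via Helm):

# Scale the controller to zero first and observe for a few days before deleting anything
kubectl -n kube-system scale deployment cluster-autoscaler --replicas=0

# After a stable observation window, remove the deployment
# (or uninstall the Helm release if CA was installed that way)
kubectl -n kube-system delete deployment cluster-autoscaler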

Code Blocks and Recipes

karpenter-provisioner-basic.yaml - Minimal safe provisioner
# Minimal safe NodePool for migration
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: migration-basic
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: 100
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s
karpenter-provisioner-prod.yaml - Production provisioner with taints and labels
# Production NodePool with taints and labels
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: production-workloads
spec:
  template:
    metadata:
      labels:
        environment: production
        workload-type: general
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m7i", "m7i-flex", "m6i", "c7i", "c7g"]
      taints:
        - key: dedicated
          value: production
          effect: NoSchedule
      expireAfter: 720h  # lives under template.spec in v1
  limits:
    cpu: 1000
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest  # AL2023 is the recommended AMI family for current Kubernetes versions
  role: "KarpenterNodeRole-my-cluster"  # Replace with your Karpenter node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: optional
    httpPutResponseHopLimit: 2
    httpTokens: required
migration-canary-job.yaml - Canary migration deployment
# Canary migration deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: canary-workload
  namespace: pilot-migration
spec:
  replicas: 3
  selector:
    matchLabels:
      app: canary-workload
  template:
    metadata:
      labels:
        app: canary-workload
    spec:
      nodeSelector:
        karpenter.sh/nodepool: migration-default
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: karpenter.sh/nodepool
                operator: Exists
      tolerations:
        - key: dedicated
          operator: Equal
          value: production
          effect: NoSchedule
      containers:
      - name: app
        image: my-app:latest
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
safe-drain-script.sh - Safe node drain script
#!/bin/bash
# Safe drain script (careful with flags)

set -e

NODE_NAME=$1

if [ -z "$NODE_NAME" ]; then
  echo "Usage: $0 <node-name>"
  exit 1
fi

echo "Cordoning node: $NODE_NAME"
kubectl cordon $NODE_NAME

echo "Draining node: $NODE_NAME"
kubectl drain $NODE_NAME \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300 \
  --timeout=600s \
  --force=false  # Never use --force for production

echo "Verifying node is empty"
PODS=$(kubectl get pods --all-namespaces --field-selector spec.nodeName=$NODE_NAME --no-headers | wc -l)
if [ $PODS -eq 0 ]; then
  echo "Node is empty, safe to terminate"
else
  echo "Warning: $PODS pods still on node"
fi
re-enable-ca.sh - Rollback script to re-enable CA
#!/bin/bash
# Rollback script to re-enable Cluster Autoscaler

set -e

NAMESPACE=$1
NODEGROUP=$2

if [ -z "$NAMESPACE" ] || [ -z "$NODEGROUP" ]; then
  echo "Usage: $0 <namespace> <nodegroup-name>"
  exit 1
fi

echo "Step 1: Re-enabling Cluster Autoscaler for namespace: $NAMESPACE"
kubectl annotate namespace $NAMESPACE \
  cluster-autoscaler.kubernetes.io/enable-scale-down=true \
  --overwrite

echo "Step 2: Scaling up NodeGroup: $NODEGROUP"
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name $NODEGROUP \
  --scaling-config minSize=2,maxSize=10,desiredSize=5

echo "Step 3: Removing Karpenter node selectors"
kubectl patch deployment -n $NAMESPACE --all --type='json' -p='[
  {"op": "remove", "path": "/spec/template/spec/nodeSelector/karpenter.sh~1nodepool"}
]' || true

echo "Step 4: Verifying workloads on CA nodes"
kubectl get pods -n $NAMESPACE -o wide | grep -v karpenter

echo "Rollback complete. Monitor for stability."

Frequently Asked Questions

Everything you need to know about migrating from Cluster Autoscaler to Karpenter

How long does a Cluster Autoscaler to Karpenter migration take?
Typical migration timeline: Pilot phase (2 weeks), pilot validation (2 weeks), incremental rollout (2-6 weeks). Total: 4-8 weeks for complete migration. Timeline varies based on cluster size, workload complexity, and risk tolerance. See Migration Phases section for detailed breakdown.

Can I run Cluster Autoscaler and Karpenter at the same time?
Yes, you can run both side-by-side during migration. Cluster Autoscaler keeps managing its NodeGroups while Karpenter manages its own nodes; pin migrated workloads to Karpenter nodes and scale down the corresponding NodeGroups. This allows gradual migration with zero downtime. See Step 4 - Disable CA for Migrated Namespaces for implementation details.

What happens to my existing NodeGroups?
NodeGroups remain active during migration. Karpenter creates new nodes for migrated workloads. After migration is complete and validated, you can safely drain and terminate old NodeGroups. Never delete NodeGroups immediately - always drain with respect to PodDisruptionBudgets. See Step 6 - Decommission Old NodeGroups for safe procedures.

How does Karpenter handle spot instances?
Karpenter has native spot instance support with automatic fallback to on-demand. You can configure the spot/on-demand mix per NodePool, with automatic interruption handling. Karpenter supports spot-to-spot consolidation and seamless spot termination handling. See the NodePool Design section for configuration examples.

How should I handle stateful workloads?
For stateful workloads, use conservative Karpenter policies: on-demand instances only, WhenEmpty consolidation mode, proper PodDisruptionBudgets. Migrate stateful workloads last, after validating stateless workloads. Always test in non-production first. See Common Migration Failure Modes for stateful workload considerations.

What is the rollback plan if something goes wrong?
Rollback plan: 1) Confirm Cluster Autoscaler is still running, 2) Scale up its NodeGroups, 3) Remove Karpenter node selectors and verify workloads are running on CA-managed nodes, 4) Remove Karpenter NodePools for failed workloads. Keep CA running during the entire migration period for safety. See Safe Rollback Plan section for detailed procedures.


How much can I expect to save?
Typical cost reduction: 30-60% on node costs, depending on workload shape, consolidation settings, and spot instance adoption. Results vary by cluster configuration. Conservative estimate: 30-40% for most clusters with moderate optimization. Real-world savings: $8K-$15K monthly on 50-node clusters. See Typical Outcomes and Realistic Ranges section for detailed breakdown.

What are the minimum version requirements?
Minimum requirements: Kubernetes >= 1.26 (recommended >= 1.29), Karpenter >= v1.0 (v1.8 is latest as of 2026). Also ensure AWS CLI, eksctl, and kubectl are up to date. Check IAM and IRSA requirements before starting. See Pre-Migration Readiness Checklist for complete version matrix.

How do I validate that the migration succeeded?
Validation checklist: Node packing ratio >= 85%, service latency within baseline +/- 5%, provisioning time < 60 seconds, zero pod disruption during migration, cost reduction metrics met. Monitor for 1 week after each migration phase before proceeding. See Acceptance Criteria Checklist for complete validation procedures.


Conclusion

Migrating from Cluster Autoscaler to Karpenter enables 30-60% cost reduction with zero perceived downtime when following a phased approach. The key is starting with readiness validation, creating conservative initial configurations, and gradually migrating workloads with proper monitoring and rollback procedures.

For comprehensive EKS optimization, see our EKS Best Practices 2026 guide and Karpenter Best Practices 2026 for post-migration optimization strategies.

Based on real-world production experience across multiple EKS cluster migrations, the recommendations in this guide have been validated through actual deployments and cost optimization projects.

Ready to Move to Karpenter with Confidence?

Validate whether your cluster can migrate safely and reduce costs. Production-tested migration patterns with comprehensive rollback procedures.
