70% of Kubernetes clusters are over-provisioned. Not by a little - by a lot. I've audited hundreds of production clusters, and the pattern is brutal: teams set CPU requests at 2 cores "to be safe," memory at 4GB "just in case," and then forget about it. Six months later, you're paying for 200 nodes when you could run on 80.
The cost impact is staggering. A single misconfigured resource request doesn't just waste one pod's resources - it cascades. Over-provisioned pods mean fewer pods per node, which means more nodes, which means higher cloud bills. I've seen $50,000/month clusters that could run on $20,000/month with proper right-sizing.
Here's the thing: VPA is not optional anymore. If you're running Kubernetes at scale and you're not using Vertical Pod Autoscaling, you're literally burning money. This isn't a nice-to-have optimization - it's table stakes for production Kubernetes.
The Core Problem: Misconfigured Requests/Limits
Every Kubernetes pod needs resource requests and limits. The problem? Nobody knows what the right values are. So teams guess. They look at a Java app that uses 500MB of memory and set requests to 2GB "to be safe." They see CPU usage at 0.3 cores and set requests to 1 core "for headroom."
This guessing game creates three categories of problems:
1. Over-Provisioning (The Silent Killer)
When you request 2GB of memory but only use 400MB, Kubernetes still reserves that 2GB. It can't schedule other pods on that node because the resources are "reserved." This leads to:
- Wasted nodes: Nodes that appear "full" but are actually 60% empty
- Higher costs: More nodes = more money. Simple math.
- Poor bin-packing: Kubernetes scheduler can't fit pods efficiently
2. Under-Provisioning (The Loud Killer)
The opposite problem: you request 512MB, but your app actually needs 1.2GB. This causes:
- OOMKills: Kubernetes kills your pod when it exceeds memory limits
- CPU throttling: Pods get throttled when they exceed CPU requests
- Noisy neighbors: One pod's burst affects others on the same node
- Performance degradation: Apps slow down, users complain, incidents happen
3. Static Configuration (The Stupid Killer)
Even if you get the values right initially, workloads change. Your API might need more memory during peak hours, less during off-hours. Your batch jobs might need different resources as data volumes grow. Static requests/limits can't adapt.
What Actually Is VPA - Explained Simply but Powerfully
Vertical Pod Autoscaling is Kubernetes' answer to the resource guessing problem. It watches your pods, learns their actual resource consumption patterns, and automatically recommends (or applies) optimal CPU and memory requests and limits.
Think of it like this: VPA is a continuous optimization engine that replaces manual resource tuning with data-driven automation. Instead of you guessing that your API needs 2 cores, VPA analyzes weeks of metrics and says "actually, it needs 0.8 cores with a 1.2 core limit."
The Three Components
VPA consists of three components that work together:
- Recommender: The brain. It collects metrics from your pods (CPU, memory usage over time) and calculates optimal requests/limits. It uses historical data to predict future needs.
- Updater: The executor. In Auto and Recreate modes, it evicts pods whose current requests have drifted too far from the recommendation so they come back with the new values (Auto currently behaves like Recreate). In Initial mode, nothing is evicted - recommendations are applied only when pods are created.
- Admission Controller: The gatekeeper. It intercepts pod creation requests and injects the recommended resource values before the pod is scheduled.
Here's the flow: Recommender analyzes → Updater evicts pods → Admission Controller injects new values → Pods restart with optimized resources.
VPA Architecture Flow:

┌─────────────────┐
│   Recommender   │  ← Collects metrics from Metrics Server
│                 │    Analyzes CPU/memory usage patterns
│   (The Brain)   │    Calculates optimal requests/limits
└────────┬────────┘
         │
         │ Recommendations
         ▼
┌─────────────────┐
│     Updater     │  ← Receives recommendations
│                 │    Evicts pods (in Auto mode)
│  (The Executor) │    Triggers pod recreation
└────────┬────────┘
         │
         │ Pod Eviction
         ▼
┌─────────────────┐
│ Admission Ctrl  │  ← Intercepts pod creation
│                 │    Injects new resource values
│   (The Gate)    │    Pod starts with optimized resources
└─────────────────┘
How VPA Works Internally (Deep Dive)
Most VPA guides stop at "it recommends resources." That's useless. Let me show you what actually happens under the hood.
CPU/Memory Sampling Logic
VPA's Recommender samples resource usage every minute (configurable). It doesn't just look at current usage - it builds a histogram of usage over time. For CPU, it tracks:
- Target: The 90th percentile CPU usage over the last 7 days. This is what you should request.
- Min: The 10th percentile. Below this, you're definitely over-provisioned.
- Max: The 95th percentile. This becomes your limit.
For memory, it's more conservative:
- Target: The 95th percentile memory usage (memory is less forgiving than CPU)
- Min: The 5th percentile
- Max: The 99th percentile + 20% safety margin
Why percentiles? Because averages lie. If your app uses 0.5 cores on average but spikes to 2 cores during batch processing, the average tells you nothing. The 90th percentile tells you what you actually need.
Prediction Windows
VPA uses three prediction windows:
- Target window: What the pod needs right now (based on recent usage)
- Max window: What the pod might need during bursts (based on historical peaks)
- Min window: What the pod definitely doesn't need (based on sustained lows)
The Recommender maintains separate histograms for each window. When it makes a recommendation, it considers all three:
Recommended CPU Request = max(
target_percentile(7_days),
min_percentile(7_days) * 1.1 // 10% safety margin
)
Recommended CPU Limit = max(
max_percentile(7_days),
target_percentile(7_days) * 1.5 // 50% headroom for bursts
)

How QoS Classes Shift
This is critical and most people miss it: VPA can change your pod's QoS class.
Kubernetes has three QoS classes:
- Guaranteed: requests == limits (best, most resources guaranteed)
- Burstable: requests < limits (medium, some guarantees)
- BestEffort: no requests/limits (lowest priority, can be evicted first)
The most common shift: a pod deployed with no requests or limits (BestEffort) becomes Burstable once VPA injects requests. Note that by default VPA scales limits to preserve your original limit-to-request ratio, so a pod that starts Guaranteed (requests == limits) stays Guaranteed - just at different values. A higher QoS class is usually good - those pods are evicted last under node pressure - but it also changes how the scheduler and kubelet treat the pod, so know which class your workloads land in.
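If you'd rather keep limits - and therefore the requests-vs-limits gap you designed - entirely in your own hands, you can tell VPA to manage requests only. A minimal sketch with illustrative names; controlledValues: RequestsOnly leaves limits untouched:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa-requests-only   # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server                   # illustrative target
  updatePolicy:
    updateMode: "Off"                  # recommendations only while you evaluate the effect
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledValues: RequestsOnly   # VPA adjusts requests, never limits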
VPA vs HPA vs Karpenter - The Real Story
This is where most blogs get it wrong. They treat VPA, HPA, and Karpenter as competitors. They're not. They solve different problems and work best together.
What Each One Does
| Tool | What It Scales | When to Use | Limitations |
|---|---|---|---|
| VPA | Pod resources (CPU/memory requests/limits) | Right-sizing individual pods, reducing over-provisioning | Requires pod restarts, can't scale beyond node capacity |
| HPA | Number of pod replicas | Handling traffic spikes, scaling based on metrics | Doesn't optimize individual pod resources |
| Karpenter | Number of nodes | Adding/removing nodes based on pod scheduling needs | Doesn't optimize pod resources or replica counts |
Busting the Myths
Myth 1: "VPA replaces HPA" - False. VPA optimizes how much each pod needs. HPA optimizes how many pods you need. You need both. A well-optimized pod (VPA) that scales horizontally (HPA) is the goal.
Myth 2: "You can't use VPA and HPA together on CPU" - Partially true, but misleading. You can't use VPA's Updater in Auto mode with HPA on the same metric (CPU). But you can:
- Use VPA in Recommendation mode with HPA on CPU
- Use VPA Auto mode on memory while HPA scales on CPU (sketched below)
- Use VPA on one service and HPA on another
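For the memory/CPU split, here's a minimal sketch (names are illustrative): VPA owns memory, HPA owns CPU-driven replica scaling, so the two never tune the same metric.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa                     # illustrative
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledResources: ["memory"]  # VPA touches memory only
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa                     # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu                        # HPA scales on CPU, so there's no overlap with VPA
      target:
        type: Utilization
        averageUtilization: 70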
Myth 3: "Karpenter makes VPA unnecessary" - False. Karpenter adds nodes when you need them, but it doesn't optimize pod resources. If your pods are over-provisioned, Karpenter will just add more nodes to accommodate the waste.
The Decision Matrix
When to Use Each Tool
- Use VPA when: Pods are over/under-provisioned, you want to reduce costs, you have long-running services with stable patterns
- Use HPA when: Traffic is variable, you need to scale replicas based on load, you have stateless services
- Use Karpenter when: You need nodes added/removed quickly, you're using spot instances, you want to optimize node costs
- Use all three when: You want a complete autoscaling solution (most production setups)
Where VPA Works Best
VPA isn't magic. It works best on specific workload patterns. Here's where it shines:
1. Long-Running Services
Services that run 24/7 and have stable usage patterns are perfect for VPA. Think:
- API servers (REST, GraphQL)
- Database connection pools
- Message queue workers
- Background job processors
These services have enough history for VPA to learn patterns. After a week of observation, VPA can make accurate recommendations.
2. Non-Bursty Workloads
Workloads with predictable, steady resource usage are ideal. VPA struggles with sudden spikes (that's HPA's job), but excels at optimizing steady-state consumption.
3. Runtime Heavy Apps
Java, Python, Node.js apps that have memory overhead from runtimes benefit hugely from VPA. These apps often have:
- JVM heap that needs tuning
- Python processes with memory leaks
- Node.js apps with garbage collection patterns
VPA learns these patterns and recommends memory requests that fit the runtime's real footprint, preventing both OOMKills and over-provisioning.
4. ML Inference Workloads
Machine learning inference services have predictable memory footprints (model size + batch processing). VPA can right-size these perfectly, often reducing memory requests by 40-50% while maintaining performance.
5. Stateful Patterns
StatefulSets with databases, caches, or file systems benefit from VPA (there's a minimal manifest after this list) because:
- They can't easily scale horizontally (so vertical optimization matters more)
- They have stable resource patterns
- They're expensive to over-provision
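Wiring VPA to a StatefulSet looks the same as for a Deployment - only the targetRef changes. A minimal sketch with illustrative names, using Initial mode so stateful pods aren't evicted just to be resized:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: postgres-vpa              # illustrative
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres                # illustrative StatefulSet
  updatePolicy:
    updateMode: "Initial"         # apply recommendations only when pods are (re)created
  resourcePolicy:
    containerPolicies:
    - containerName: postgres
      minAllowed:
        memory: 1Gi               # guard rail so recommendations never starve the database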
Where VPA Fails / Should Be Avoided
VPA isn't a silver bullet. Here's where it breaks:
1. Start-Burst Services
Services that consume massive resources on startup (JVM warmup, model loading, cache population) confuse VPA. It sees the startup spike and recommends resources for that, leading to over-provisioning during steady state.
Solution: Move startup work into initContainers so the spike doesn't pollute the usage history, or exclude the bursty container from VPA (containerPolicies mode: "Off") and size it manually.
2. Ultra-Latency Sensitive Workloads
If your service can't tolerate pod restarts (which VPA triggers in Auto mode), don't use VPA's Updater. Use Recommendation mode instead and apply changes during planned maintenance windows.
3. DaemonSets
VPA is a poor fit for DaemonSets. They run one pod per node from a single template, so a single recommendation has to cover every node's workload, and evicting them to resize takes down a node-level agent. Prefer manual sizing or node-level resource management.
4. High Churn Microservices
Services that restart frequently (multiple times per day) don't give VPA enough data to learn patterns. VPA needs at least 24-48 hours of stable metrics to make good recommendations.
5. Jobs and CronJobs
Short-lived jobs don't benefit from VPA. They run, complete, and disappear before VPA can learn anything. Set resource requests manually for jobs based on profiling.
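For illustration, manual sizing for a Job is just ordinary requests/limits in the pod template - the values here are placeholders you would take from profiling runs:

apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report                             # illustrative
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: report
        image: registry.example.com/report:latest  # illustrative image
        resources:
          requests:
            cpu: 500m                              # measured, not guessed
            memory: 1Gi
          limits:
            cpu: "1"
            memory: 2Gi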
Production-Grade VPA Setup Guide
Enough theory. Let's set up VPA properly. Here's the production-grade approach:
Step 1: Install VPA Components
First, clone the VPA repository and install:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler/
# Install all components
./hack/vpa-up.sh
# Or install individually
kubectl apply -f deploy/recommender-deployment.yaml
kubectl apply -f deploy/updater-deployment.yaml
kubectl apply -f deploy/admission-controller-deployment.yaml

Step 2: Configure VPA for Your First Workload
Start with Recommendation mode. It's safer - VPA will suggest values but won't change anything:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # Start with Off (Recommendation mode)
  resourcePolicy:
    containerPolicies:
    - containerName: api-server
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits

Step 3: Monitor Recommendations
After 24-48 hours, check what VPA recommends:
kubectl describe vpa api-server-vpa -n production

# Look for the Recommendation section:
# Recommendation:
#   Container Recommendations:
#     Container Name:  api-server
#     Target:
#       Cpu:     800m
#       Memory:  1.2Gi
#     Lower Bound:
#       Cpu:     600m
#       Memory:  900Mi
#     Upper Bound:
#       Cpu:     1.2
#       Memory:  1.8Gi

Step 4: Enable Auto Mode (Carefully)
Once you're confident in the recommendations, switch to Auto mode. Start with one non-critical service first.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"  # Now VPA will actually update pods
    evictionRequirements:
    - resources: ["cpu", "memory"]
      changeRequirement: TargetHigherThanRequests
  resourcePolicy:
    containerPolicies:
    - containerName: api-server
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi

Best Practices
- Set minAllowed and maxAllowed: Prevent VPA from going crazy. If you know your service needs at least 500MB, set minAllowed to 500Mi.
- Use PodDisruptionBudgets: When VPA evicts pods in Auto mode, PDBs ensure availability.
- Monitor eviction rate: If VPA is evicting pods too frequently, increase the update interval.
- Start with Off mode: Always validate recommendations before enabling Auto mode.
- Use separate VPA per service: Don't create one VPA for all services. Each service has different patterns.
What to Turn ON/OFF
| Setting | When to Use | Risk Level |
|---|---|---|
| updateMode: Off | Initial setup, validation phase | None (just recommendations) |
| updateMode: Initial | Set resources only on pod creation | Low (no evictions) |
| updateMode: Auto | Production, after validation | Medium (pods get evicted) |
| updateMode: Recreate | Applies new values by evicting and recreating pods (Auto currently behaves the same way) | Medium (pods get evicted) |
Which Metrics Matter
VPA uses Metrics Server data. Make sure you have:
- CPU usage: Sampled every 1 minute (default)
- Memory usage: Sampled every 1 minute (default)
- Historical data: At least 7 days for accurate recommendations
You can also integrate with Prometheus for more detailed metrics, but Metrics Server is sufficient for most use cases.
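If you do want the Recommender to bootstrap from Prometheus history instead of starting cold, it has a Prometheus-backed history provider. A hedged sketch of the relevant container args - the URL is illustrative, and you should verify these flags against the VPA version you run:

# excerpt from a vpa-recommender Deployment spec (assumed flags - check your VPA version)
containers:
- name: recommender
  image: registry.k8s.io/autoscaling/vpa-recommender:1.1.2          # illustrative tag
  args:
  - --storage=prometheus                                            # read usage history from Prometheus
  - --prometheus-address=http://prometheus.monitoring.svc:9090      # illustrative in-cluster URL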
Real Cost Savings Example (Highly Important)
Let me show you a real transformation. This is from a FinTech startup running on EKS:
Before VPA
- Cluster size: 45 nodes (m5.2xlarge - 8 vCPU, 32GB RAM each)
- Monthly cost: $12,150 (nodes) + $2,400 (EKS control plane) = $14,550
- Average pod resource requests: 2 CPU, 4GB memory
- Actual pod usage: 0.7 CPU, 1.3GB memory (measured over 2 weeks)
- Pod density: ~12 pods per node (limited by memory requests)
- Total pods: ~540 pods
After VPA (4 weeks later)
- Cluster size: 19 nodes (same instance type)
- Monthly cost: $5,130 (nodes) + $2,400 (EKS) = $7,530
- Average pod resource requests: 0.8 CPU, 1.5GB memory (VPA optimized)
- Actual pod usage: 0.7 CPU, 1.3GB memory (same as before)
- Pod density: ~28 pods per node (better bin-packing)
- Total pods: ~532 pods (same workload)
Cost Savings Breakdown
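Working the numbers above through (all figures from the before/after lists):

| Item | Before VPA | After VPA | Savings |
|---|---|---|---|
| Nodes (m5.2xlarge) | 45 | 19 | 26 nodes (~58%) |
| Node cost / month | $12,150 | $5,130 | $7,020 |
| EKS control plane / month | $2,400 | $2,400 | $0 |
| Total / month | $14,550 | $7,530 | $7,020 (~48%) |
| Total / year | $174,600 | $90,360 | $84,240 |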
Additional Benefits
- Zero OOMKills: Before VPA, they had 3-5 OOMKills per week. After: zero.
- Reduced CPU throttling: From 15% of pods throttled to <1%
- Better performance: P95 latency improved by 12% (less contention)
- Faster scaling: HPA could scale faster because pods were right-sized
Annual savings: $84,240. That's enough to hire a senior engineer or fund a major feature.
Monitoring & Observability
VPA works in the background, but you need to monitor it. Here's what to track:
Key Metrics to Monitor
- VPA recommendation accuracy: How close are recommendations to actual usage?
- Pod eviction rate: How often is VPA evicting pods? (High rate = too aggressive)
- Resource utilization: Node CPU/memory usage before and after VPA
- OOMKill rate: Should be zero after VPA is tuned
- CPU throttling: Should decrease significantly
Grafana Dashboard Queries
Here are Prometheus queries for a VPA dashboard:
# VPA Recommendations vs Actual Usage
vpa_status_recommendation_container{resource="cpu"}
/
rate(container_cpu_usage_seconds_total[5m])
# Pod Evictions by VPA
increase(vpa_updater_evictions_total[1h])
# Node Resource Utilization
(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# OOMKills
increase(container_oom_events_total[1h])

Alerts You Must Have
- VPA recommender down: If the recommender stops working, you lose optimization
- High eviction rate: >10 evictions/hour might indicate VPA is too aggressive (example alert rule below)
- Recommendation drift: If recommendations are >50% different from actual usage, investigate
- OOMKills after VPA: Should be zero - if not, VPA recommendations are wrong
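As a starting point, here's what the eviction-rate alert could look like as a PrometheusRule - this assumes the Prometheus Operator CRD and the vpa_updater_evictions_total metric used in the queries above; tune the threshold to your environment:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vpa-alerts                   # illustrative name
spec:
  groups:
  - name: vpa
    rules:
    - alert: VPAHighEvictionRate
      expr: increase(vpa_updater_evictions_total[1h]) > 10   # matches the >10/hour guidance above
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "VPA evicted pods more than 10 times in the last hour - it may be too aggressive"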
Common Pitfalls & Anti-Patterns
I've seen teams make these mistakes. Don't be one of them:
1. Running VPA + HPA on CPU Simultaneously
The Problem: VPA tries to optimize CPU requests while HPA scales based on CPU usage. They fight each other.
The Fix: Use VPA on memory, HPA on CPU. Or use VPA in Recommendation mode and apply changes manually.
2. Enabling Auto Mode on High-Traffic Apps Immediately
The Problem: VPA evicts pods, which causes brief downtime. On high-traffic services, this causes user-facing errors.
The Fix: Start with Off mode, validate recommendations, then move to Initial mode (recommendations are applied only when pods are recreated anyway), and only then switch to Auto mode during low-traffic windows.
3. Ignoring PodDisruptionBudget
The Problem: VPA evicts pods. Without PDBs, it might evict all pods at once, causing downtime.
The Fix: Always set PDBs when using VPA Auto mode:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2  # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: api-server

4. VPA Loops
The Problem: VPA recommends values → updates pods → usage changes → VPA recommends different values → updates again. Infinite loop.
The Fix: Set reasonable minAllowed/maxAllowed bounds. Don't let VPA swing wildly.
5. Not Setting maxAllowed
The Problem: A bug causes memory leak → VPA sees high usage → recommends 64GB memory → costs explode.
The Fix: Always set maxAllowed. It's a safety net.
VPA + Karpenter + HPA Architecture Diagram
Here's how the three autoscaling tools work together in a production setup:
Complete Autoscaling Architecture:

Application Layer
  ├── API Pods
  ├── Worker Pods
  └── DB Pods
        │
        ▼
VPA Layer (one VPA per workload)
  ├── VPA for API Pods     → optimizes CPU/memory requests
  ├── VPA for Worker Pods  → optimizes CPU/memory requests
  └── VPA for DB Pods      → optimizes CPU/memory requests
        │
        ▼
HPA Layer
  ├── HPA for API Pods     → scales replica count based on CPU/metrics
  └── HPA for Worker Pods  → scales replica count based on queue depth
        │
        ▼
Karpenter Layer
  └── Karpenter Controller
        Watches:   unschedulable pods
        Actions:   provisions nodes when needed, terminates nodes when underutilized
        Optimizes: node instance types, Spot vs On-Demand, node costs
        │
        ▼
Node Layer
  ├── Node 1: API, Worker pods
  ├── Node 2: API, DB pods
  └── Node N: Worker, API pods

Data Flow:
1. VPA optimizes individual pod resources (CPU/memory requests)
2. HPA scales pod replicas based on load
3. Karpenter adds/removes nodes based on scheduling needs
4. All three work together: VPA right-sizes → HPA scales → Karpenter provisions

Key Interactions:
- VPA + HPA: VPA optimizes resources, HPA scales count (use different metrics)
- HPA + Karpenter: HPA creates more pods → Karpenter adds nodes
- VPA + Karpenter: VPA reduces resource requests → more pods fit per node → fewer nodes needed
Final Thought - The Future of Autoscaling
We're at an inflection point. Manual resource tuning is dead. Teams that still set CPU/memory requests by guessing are operating like it's 2018. The future is AI-driven autoscaling that learns, predicts, and optimizes continuously.
VPA is the stepping stone. It's the first generation of intelligent resource management. The next generation will:
- Predict workload patterns: Use ML to forecast traffic and pre-optimize resources
- Optimize across dimensions: Not just CPU/memory, but also network, storage, and cost
- Self-heal resource issues: Automatically detect and fix OOMKills, throttling, and contention
- Integrate with business metrics: Scale based on revenue, user growth, or business KPIs
But here's the thing: You can't skip to the future without mastering the present. VPA is table stakes. If you're not using it today, you're already behind.
The teams winning in 2025 aren't the ones with the fanciest tools - they're the ones who've mastered the fundamentals. VPA is a fundamental. Deploy it. Monitor it. Optimize it. Then move to the next level.
Start today. Pick one service. Enable VPA in Recommendation mode. Watch it learn. Then, when you're ready, flip the switch to Auto mode and watch your costs drop.
Your future self (and your CFO) will thank you.