How to Right-Size Kubernetes Resource Requests (and Stop Overpaying)
A practical, step-by-step guide to right-sizing Kubernetes resource requests — the single biggest cause of cloud waste in K8s clusters.
Why Resource Requests Matter More Than You Think
Every pod in your cluster declares two numbers per container: resource requests and resource limits. Most engineers understand that limits cap how much CPU and memory a container can consume. Fewer realize that requests are what actually drive your cloud bill.
The Kubernetes scheduler uses requests — not limits, not actual usage — to decide which node a pod lands on. If your pod requests 2 CPU cores, the scheduler reserves 2 cores on that node regardless of whether the pod ever uses more than 200 millicores. Those reserved resources cannot be allocated to other pods. You pay for the node capacity to back those reservations, even when the capacity sits idle.
The problem is systemic. Teams set resource requests during initial deployment — often by copying values from a blog post or Helm chart defaults — and never revisit them. The Datadog State of Kubernetes report found that in the median Kubernetes cluster, requests claim 65% of allocatable CPU while actual usage sits at only 18%. That gap between "requested" and "used" is pure waste, and it compounds across every pod in every namespace.
The Real Cost of Over-Provisioning
Let's make this concrete. Consider a typical web application pod:
| Resource | Requested | Actual p95 Usage |
|---|---|---|
| CPU | 2000m (2 cores) | 480m (0.48 cores) |
| Memory | 4 GiB | 1.2 GiB |
On AWS, an m7i.xlarge instance (4 vCPU, 16 GiB) in us-east-1 costs approximately $0.2016/hour on-demand. Attributing half the price to CPU and half to memory, that works out to roughly $0.0252/hour per vCPU and $0.0063/hour per GiB of memory.
For a single pod, the wasted resources cost:
- CPU waste: (2.0 - 0.48) cores * $0.0252/hr = $0.0383/hr
- Memory waste: (4.0 - 1.2) GiB * $0.0063/hr = $0.0176/hr
- Total waste per pod: ~$0.056/hr = $40.32/month
Now scale that. If you have 50 instances of this deployment:
- Monthly waste: 50 * $40.32 = $2,016/month
- Annual waste: $24,192/year — from a single over-provisioned deployment
And this is just one workload. Most clusters have dozens of deployments, many of them over-provisioned by similar margins.
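The arithmetic above can be double-checked with a one-liner. This is just a sketch reproducing the section's illustrative figures; the `pod_waste` helper and its rates are taken from the example, not from any tool:

```shell
# Sketch: per-pod waste = (requested - actual) priced at the per-vCPU
# and per-GiB rates derived earlier in this article.
pod_waste() {   # args: cpu_req cpu_used mem_req mem_used (cores / GiB)
  awk -v cr="$1" -v cu="$2" -v mr="$3" -v mu="$4" 'BEGIN {
    hourly = (cr - cu) * 0.0252 + (mr - mu) * 0.0063
    printf "hourly=$%.4f monthly=$%.2f\n", hourly, hourly * 720
  }'
}
pod_waste 2.0 0.48 4.0 1.2
```

This prints a monthly figure of $40.28; the $40.32 above comes from rounding the hourly cost to $0.056 before multiplying.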
Step 1: Measure Actual Usage
Before you can right-size anything, you need usage data. Kubernetes provides this through the Metrics API, which requires metrics-server to be installed in your cluster (it ships by default with most managed Kubernetes services).
Check current resource usage:
# CPU and memory usage for all pods in production
kubectl top pods -n production --sort-by=cpu
# Example output:
# NAME CPU(cores) MEMORY(bytes)
# api-server-7d9f8b6c5-x2k4p 480m 1180Mi
# api-server-7d9f8b6c5-r8m2n 455m 1150Mi
# worker-5c7d8e9f1-k3j7p 120m 890Mi
Compare against current requests:
# Show requests alongside pod names
kubectl get pods -n production -o custom-columns=\
NAME:.metadata.name,\
CPU_REQ:.spec.containers[0].resources.requests.cpu,\
MEM_REQ:.spec.containers[0].resources.requests.memory,\
CPU_LIM:.spec.containers[0].resources.limits.cpu,\
MEM_LIM:.spec.containers[0].resources.limits.memory
Get a cluster-wide view of allocation vs. usage:
# Node-level allocation (what the scheduler sees)
kubectl describe nodes | grep -A 5 "Allocated resources"
# Per-namespace resource consumption (awk coerces "480m"/"1180Mi" to bare
# numbers, so this assumes kubectl top reports uniform units per column)
kubectl top pods --all-namespaces --no-headers | \
awk '{ns=$1; cpu=$2; mem=$3; a[ns]+=cpu; b[ns]+=mem} END {for (n in a) print n, a[n], b[n]}'
The key metric is the ratio of actual usage to requested resources. If that ratio is consistently below 0.5, you're almost certainly over-provisioned.
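As a sketch, that rule of thumb can be encoded in a tiny helper (the 0.5 threshold and the sample numbers are this article's, not any tool's defaults):

```shell
# Sketch: flag a workload as over-provisioned when usage/request < 0.5.
check_ratio() {   # args: usage_millicores request_millicores
  awk -v u="$1" -v r="$2" 'BEGIN {
    ratio = u / r
    printf "ratio=%.2f %s\n", ratio, (ratio < 0.5 ? "over-provisioned" : "ok")
  }'
}
check_ratio 480 2000   # the api-server example: 480m used, 2000m requested
```

For the example pod the ratio is 0.24 — well under 0.5, so it is flagged.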
For more granular historical data, query Prometheus (if you have it running) or use kubectl top output collected over several days. Point-in-time snapshots are useful but can miss periodic spikes. You need at least a week of data — ideally covering your peak traffic period — before making right-sizing decisions.
Step 2: Calculate the Right Size
The formula for right-sizing is straightforward, but the reasoning behind it matters.
For CPU requests:
new_cpu_request = p95_cpu_usage * 1.20
Use the 95th percentile of CPU usage over the measurement window, then add a 20% buffer. Why p95 and not average? Because the average masks spikes. If your pod averages 200m CPU but spikes to 800m during request bursts, setting the request to 240m (average + 20%) means the scheduler might place your pod on a node without enough headroom for those spikes. The result: CPU throttling, increased latency, and possibly failed health checks.
The 20% buffer accounts for organic traffic growth and variance above p95.
For memory requests:
new_memory_request = peak_memory_usage * 1.15
Use peak (not p95) memory usage and add only a 15% buffer. Memory works differently from CPU in Kubernetes:
- CPU is compressible: when a container exceeds its CPU limit, it gets throttled. The process slows down but keeps running.
- Memory is incompressible: when a container exceeds its memory limit, the kernel OOM-kills it. The pod restarts. Users see errors.
Because the penalty for under-provisioning memory is so much worse (process termination vs. slowdown), you should use the absolute peak rather than a percentile. But you can use a smaller buffer (15% vs. 20%) precisely because you're already measuring from the peak — there's less variance above it.
Worked example:
If Prometheus shows your pod's CPU usage over the last 7 days as:
- Average: 320m
- p95: 480m
- Max: 620m
And memory as:
- Average: 0.9 GiB
- p95: 1.1 GiB
- Peak: 1.2 GiB
Your right-sized requests would be:
- CPU: 480m * 1.20 = 576m (round to 600m)
- Memory: 1.2 GiB * 1.15 = 1.38 GiB (round to 1.5 GiB)
Compared to the original 2000m CPU / 4 GiB memory, that's a 70% reduction in CPU and 62.5% reduction in memory reservation.
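Wrapped as a helper, the two formulas reproduce the worked numbers exactly (a sketch; rounding up to scheduler-friendly values like 600m and 1.5Gi is left as a manual step):

```shell
# Sketch: the right-sizing formulas from this section.
rightsize() {   # args: p95_cpu_millicores peak_mem_gib
  awk -v cpu="$1" -v mem="$2" 'BEGIN {
    printf "cpu_request=%dm mem_request=%.2fGi\n", cpu * 1.20, mem * 1.15
  }'
}
rightsize 480 1.2
```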
Step 3: Understand QoS Classes Before Changing
Kubernetes assigns a Quality of Service (QoS) class to each pod based on its resource configuration. This class determines eviction priority when a node is under pressure. Before you change requests, you need to understand the implications.
| QoS Class | Condition | Eviction Priority |
|---|---|---|
| Guaranteed | Every container has requests = limits for both CPU and memory | Last to be evicted |
| Burstable | At least one container has requests < limits | Evicted after BestEffort |
| BestEffort | No requests or limits set at all | First to be evicted |
A common but expensive pattern is setting requests = limits everywhere to get Guaranteed QoS. While this gives your pods maximum scheduling priority, it means every byte of requested resource is locked and cannot be shared with other pods on the same node. For stateless web services that handle variable traffic, Burstable is usually the right choice.
A practical policy for most workloads:
- Set CPU limits to 2x the CPU request. This lets the pod burst during traffic spikes without reserving resources it rarely uses.
- Set memory limits to 1.5x the memory request. This provides headroom for memory allocation spikes while still protecting the node from runaway processes.
- For critical stateful workloads (databases, message brokers): keep Guaranteed QoS with requests = limits.
- For batch jobs and CronJobs: Burstable is almost always appropriate since these workloads have highly variable resource needs.
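Applying that policy to the worked example's right-sized requests looks like this (a sketch with the article's illustrative numbers):

```shell
# Sketch: derive Burstable limits from requests (2x CPU, 1.5x memory).
limits_for() {   # args: cpu_req_millicores mem_req_gib
  awk -v cpu="$1" -v mem="$2" 'BEGIN {
    printf "cpu_limit=%dm mem_limit=%.2fGi\n", cpu * 2, mem * 1.5
  }'
}
limits_for 600 1.5
```

These are the limit values the Step 4 manifest uses.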
Step 4: Apply Changes Safely
Never right-size in production without a staged rollout. Here's a safe process.
Update the deployment manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  template:
    spec:
      containers:
        - name: api-server
          resources:
            requests:
              cpu: "600m"      # was: 2000m (-70%)
              memory: "1.5Gi"  # was: 4Gi (-62.5%)
            limits:
              cpu: "1200m"     # was: 2000m
              memory: "2.25Gi" # was: 4Gi
Deploy to staging first and monitor for 48 hours. Watch for three specific signals:
1. CPU throttling increase:
# Check if pods are being throttled (cgroup v1 path shown; on cgroup v2
# the file is /sys/fs/cgroup/cpu.stat inside the container)
kubectl exec -n production <pod> -- cat /sys/fs/cgroup/cpu/cpu.stat
# Look for: nr_throttled and throttled_time (throttled_usec on cgroup v2)
# Or via Prometheus:
# rate(container_cpu_cfs_throttled_periods_total[5m]) /
# rate(container_cpu_cfs_periods_total[5m]) > 0.25
If throttle ratio exceeds 25%, increase the CPU request.
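Turning the cpu.stat counters into that ratio is straightforward; here is a sketch with invented sample counters:

```shell
# Sketch: compute the throttle ratio from cpu.stat-style output.
throttle_ratio() {   # reads "nr_periods N" / "nr_throttled N" lines on stdin
  awk '$1 == "nr_periods" {p=$2} $1 == "nr_throttled" {t=$2} END {
    printf "throttle_ratio=%.2f\n", t / p
  }'
}
printf 'nr_periods 1000\nnr_throttled 300\nthrottled_time 45000000\n' | throttle_ratio
```

Here the ratio is 0.30, above the 25% threshold, so this pod's CPU request should be raised.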
2. OOM kills:
# Check for OOM-killed containers
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.containerStatuses[*]}{.lastState.terminated.reason}{"\n"}{end}{end}' | grep OOMKilled
# Check restart count (OOM kills cause restarts)
kubectl get pods -n production --sort-by='.status.containerStatuses[0].restartCount'
Any OOM kills mean the memory request (or limit) is too low. Increase immediately.
3. Application latency:
Check your APM tool (Datadog, New Relic, Grafana) for p99 latency increases after deployment. CPU throttling often manifests as tail latency spikes rather than average latency increases.
Only after 48 hours of stable operation in staging should you promote to production. Use the same monitoring checklist in production for another 48 hours.
Step 5: Automate the Process
Manual right-sizing works but doesn't scale. For clusters with hundreds of deployments, you need automation.
Vertical Pod Autoscaler (VPA)
The Kubernetes VPA watches actual resource usage and automatically adjusts requests.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"  # or "Off" for recommendation-only
  resourcePolicy:
    containerPolicies:
      - containerName: api-server
        minAllowed:
          cpu: "100m"
          memory: "256Mi"
        maxAllowed:
          cpu: "4"
          memory: "8Gi"
Caveats: VPA applies new requests by evicting and restarting pods (in-place resizing of running pods is still an evolving Kubernetes feature and not yet the default path). This makes it unsuitable for workloads that cannot tolerate restarts. Also, VPA and HPA should not target the same metric (e.g., both targeting CPU) — they will fight each other. Use VPA for memory and HPA for CPU, or use Multidimensional Pod Autoscaling on GKE.
Monthly Manual Review
If full automation feels too aggressive, establish a monthly review cadence:
- Pull the top 20 most over-provisioned pods (by ratio of request to usage).
- Verify with a week of Prometheus data.
- Update manifests and deploy through your normal CI/CD pipeline.
- Track cumulative savings month-over-month.
Even this lightweight process typically captures 80% of the savings of full automation, because waste follows a power law — a handful of deployments account for most of the over-provisioning.
Cost Optimization Platforms
Dedicated tools continuously analyze every pod in your cluster and generate right-sizing recommendations with estimated dollar savings. They factor in workload classification (stateless vs. stateful), burst patterns, and cloud provider pricing to produce accurate recommendations.
Tools like K8Cost analyze your entire cluster against 70+ optimization rules and generate right-sizing recommendations with GitOps-ready manifests you can apply directly through your existing deployment pipeline — as JSON Patch, Strategic Merge Patch, Helm values, or Kustomize overlays.
Common Mistakes to Avoid
1. Setting requests = limits for everything. Guaranteed QoS sounds safe, but it locks resources and prevents any sharing on the node. Reserve Guaranteed QoS for stateful, latency-sensitive workloads. For the majority of stateless services, Burstable with a sensible limit-to-request ratio gives you both safety and efficiency.
2. Right-sizing based on average usage. Averages hide spikes. A pod that averages 200m CPU but regularly spikes to 800m will be CPU-throttled if you set the request to 240m. Always use p95 or p99 for CPU and absolute peak for memory.
3. Reducing memory too aggressively. OOM kills are disruptive — the process terminates immediately with no graceful shutdown. A throttled pod is slow; an OOM-killed pod is down. When in doubt, err on the side of more memory headroom, not less.
4. Forgetting init containers and sidecars. If your pod has an init container that downloads a 2 GiB model file on startup, or a sidecar proxy (Istio/Envoy) that consumes 128Mi at baseline, those need their own resource requests. The pod's total request is the sum of all containers. Right-sizing just the main container while ignoring sidecars leaves money on the table — or worse, causes OOM kills during initialization.
5. Ignoring HPA interactions. The Horizontal Pod Autoscaler scales replicas based on a target utilization percentage that is relative to the pod's requests. If you halve the CPU request without adjusting the HPA target, the HPA sees double the utilization and may spin up twice as many replicas — negating your savings entirely. Always review HPA configuration after changing requests.
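The interaction is easy to see numerically. A sketch using this article's example pod (the HPA target itself is hypothetical):

```shell
# Sketch: HPA utilization = usage / request; shrinking the request inflates it.
hpa_util() {   # args: usage_millicores request_millicores
  awk -v u="$1" -v r="$2" 'BEGIN { printf "utilization=%.0f%%\n", u / r * 100 }'
}
hpa_util 480 2000   # before right-sizing
hpa_util 480 600    # after: identical usage now reads much hotter
```

The same 480m of usage reads as 24% utilization before and 80% after. An HPA with, say, a 70% target that never fired before would now scale out — so revisit the target alongside the request.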
How Much Can You Actually Save?
Based on industry benchmarks and real-world data across thousands of clusters:
- CPU over-provisioning: 40-65% is typical. Most clusters request 2-3x the CPU they actually use.
- Memory over-provisioning: 30-50% is typical. Memory tends to be closer to actual usage because engineers fear OOM kills.
- Node-level savings: After right-sizing pods, you often need fewer nodes. A cluster that was running 10 m7i.xlarge nodes may only need 6 or 7 after right-sizing, because the bin-packing improves dramatically.
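A back-of-the-envelope version of that bin-packing effect, using this article's example deployment (CPU only; a real scheduler also weighs memory, daemonsets, and system reservations, so treat this as a rough lower bound):

```shell
# Sketch: nodes needed = ceil(total CPU requests / vCPU per node).
nodes_needed() {   # args: pod_count cpu_request_cores node_vcpu
  awk -v n="$1" -v req="$2" -v cap="$3" 'BEGIN {
    total = n * req
    printf "nodes=%d\n", (total % cap ? int(total / cap) + 1 : total / cap)
  }'
}
nodes_needed 50 2.0 4    # before right-sizing: 2000m requests
nodes_needed 50 0.6 4    # after: 600m requests
```

Dropping the request from 2000m to 600m shrinks the CPU footprint for 50 replicas from 25 nodes' worth to 8.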
Realistic savings for a mid-size cluster:
| Cluster Size | Monthly Spend (Before) | Typical Savings | Monthly Savings |
|---|---|---|---|
| 50 pods, 5 nodes | $2,500 | 25-35% | $625 - $875 |
| 200 pods, 20 nodes | $10,000 | 30-40% | $3,000 - $4,000 |
| 500 pods, 50 nodes | $25,000 | 35-50% | $8,750 - $12,500 |
These numbers reflect right-sizing alone — before you consider spot instances, reserved instances, or cluster autoscaler tuning. Right-sizing is often the highest-ROI optimization because it requires no infrastructure changes, just manifest updates.
Next Steps
Start with your biggest namespaces. In most clusters, the top 5 namespaces account for 80% of resource consumption. Right-sizing those alone will capture the majority of savings.
Set a monthly cadence for right-sizing reviews. Put it on the team calendar. Even 30 minutes per month reviewing the most over-provisioned workloads compounds into significant savings over a year.
If you want to automate the process, VPA handles individual workloads well. For cluster-wide visibility with dollar-amount savings estimates, K8Cost's free tier (1 cluster, up to 3 nodes) identifies right-sizing opportunities automatically and generates the manifests to fix them. You can try it at app.k8cost.com.
The bottom line: most Kubernetes clusters waste 30-50% of their compute spend on resources that are requested but never used. Right-sizing is the fastest way to reclaim that budget — and you can start today with nothing more than kubectl top and a text editor.