Troubleshooting
This guide covers common issues you might encounter with OptiPod and how to resolve them.
General Troubleshooting Steps
Before diving into specific issues, try these general steps:
- Check operator logs

```shell
kubectl logs -n optipod-system -l app=optipod-operator --tail=100
```

- Verify policy status

```shell
kubectl get optimizationpolicy -A
kubectl describe optimizationpolicy <policy-name> -n <namespace>
```

- Check workload labels

```shell
kubectl get deployment <name> -o jsonpath='{.metadata.labels}'
```

- Verify metrics provider

```shell
# For Prometheus
kubectl get svc -n monitoring prometheus-server

# For metrics-server
kubectl get apiservice v1beta1.metrics.k8s.io
```

Installation Issues
Operator Pod Not Starting
Symptoms:
- OptiPod operator pod in CrashLoopBackOff
- Pod fails to start
Diagnosis:
```shell
kubectl get pods -n optipod-system
kubectl describe pod <pod-name> -n optipod-system
kubectl logs <pod-name> -n optipod-system
```

Common Causes:
- Insufficient RBAC permissions
```shell
# Check RBAC
kubectl get clusterrole optipod-operator
kubectl get clusterrolebinding optipod-operator
```

Solution: Reinstall with correct RBAC:
```shell
kubectl apply -f https://raw.githubusercontent.com/Sagart-cactus/optipod/main/deploy/rbac.yaml
```

- Missing CRDs
```shell
# Check CRDs
kubectl get crd optimizationpolicies.optipod.optipod.io
```

Solution: Install CRDs:
```shell
kubectl apply -f https://raw.githubusercontent.com/Sagart-cactus/optipod/main/config/crd/bases/
```

- Resource constraints
Solution: Increase operator resources:
```yaml
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
```

Webhook Configuration Issues
Symptoms:
- Pods fail to start
- Webhook timeout errors
Diagnosis:
```shell
kubectl get mutatingwebhookconfiguration optipod-webhook
kubectl describe mutatingwebhookconfiguration optipod-webhook
```

Solution:
- Check webhook deployment
```shell
kubectl get pods -n optipod-system -l app=optipod-webhook
kubectl logs -n optipod-system -l app=optipod-webhook
```

- Check webhook service
```shell
kubectl get svc -n optipod-system optipod-webhook
```

- Verify certificates
```shell
kubectl get secret -n optipod-system optipod-webhook-cert

# Or check cert-manager certificate
kubectl get certificate -n optipod-system
```

Policy Issues
No Recommendations Generated
Symptoms:
- Workloads labeled but no recommendations appear
- Recommendation annotations missing
Diagnosis:
```shell
# Check if workload is targeted
kubectl get deployment <name> -o yaml | grep -A 5 labels

# Check policy
kubectl get optimizationpolicy -n <namespace>
kubectl describe optimizationpolicy <policy-name> -n <namespace>

# Check controller logs
kubectl logs -n optipod-system -l app=optipod-controller | grep <workload-name>
```

Common Causes:
- Insufficient metrics data
Solution: Wait for more data to accumulate (typically 24-48 hours with a 7-day rolling window).
- Workload not matching policy selector
Solution: Verify labels match:
```shell
# Check workload labels
kubectl get deployment <name> -o jsonpath='{.metadata.labels}'

# Check policy selector
kubectl get optimizationpolicy <policy-name> -o jsonpath='{.spec.selector.workloadSelector}'
```

- Policy in Disabled mode
Solution:
```shell
kubectl patch optimizationpolicy <policy-name> \
  --type merge \
  -p '{"spec":{"mode":"Recommend"}}'
```

- Metrics provider unreachable
Solution: Check connectivity:
```shell
# Test Prometheus
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://prometheus-server.monitoring.svc:9090/-/healthy

# Check metrics-server
kubectl top nodes
```

Recommendations Not Updating
Symptoms:
- Old recommendations not refreshing
- Timestamp not changing
Diagnosis:
```shell
# Check last update time
kubectl get deployment <name> \
  -o jsonpath='{.metadata.annotations.optipod\.io/last-recommendation}'

# Check policy status
kubectl get optimizationpolicy <policy-name> -o yaml
```

Common Causes:
- Reconciliation interval not reached
Solution: Wait for reconciliation interval or adjust:
```yaml
spec:
  reconciliationInterval: 1m # Shorter interval (default: 5m)
```

- Controller not running
Solution:
```shell
kubectl get pods -n optipod-system
kubectl rollout restart deployment optipod-controller -n optipod-system
```

- Metrics haven’t changed significantly
OptiPod may not update recommendations if metrics are stable and within acceptable range.
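This stability behavior can be pictured with a small sketch. It is illustrative only — the function name and the 10% threshold are assumptions, not OptiPod's actual code — but it shows why a recommendation that stays within a small relative band of the current request may simply be skipped:

```python
# Illustrative sketch only: needs_update and the 10% threshold are
# hypothetical, not OptiPod's actual implementation.
def needs_update(current: float, recommended: float, threshold: float = 0.10) -> bool:
    """Return True when the recommendation differs from the current
    request by more than the relative threshold (assumed 10% here)."""
    if current == 0:
        return True  # no baseline: always worth writing a recommendation
    return abs(recommended - current) / current > threshold

needs_update(500, 520)  # 4% drift: skipped
needs_update(500, 300)  # 40% drift: updated
```

If recommendations look frozen, compare the current requests against the values in the recommendation annotations before assuming the controller is stuck.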
Recommendations Seem Incorrect
Symptoms:
- Recommendations too high or too low
- Recommendations don’t match actual usage
Diagnosis:
```shell
# Check actual usage
kubectl top pod -l app=<name>

# Check recommendation annotations
kubectl get deployment <name> -o yaml | grep -A 10 "optipod.io/"

# Check policy configuration
kubectl get optimizationpolicy <policy-name> -o yaml
```

Common Causes:
- Rolling window too short
Solution:
```yaml
metricsConfig:
  rollingWindow: 14d # Increase window
```

- Resource bounds too restrictive
Solution:
```yaml
resourceBounds:
  cpu:
    min: 10m # Lower minimum
    max: 8000m # Raise maximum
  memory:
    min: 64Mi
    max: 16Gi
```

- Safety factor too high or too low
Solution:
```yaml
metricsConfig:
  safetyFactor: 1.2 # Adjust multiplier (default: 1.2)
```

- Wrong percentile
Solution:
```yaml
metricsConfig:
  percentile: P90 # Try P99 for more headroom, P50 for more aggressive
```

- Unstable workload
Solution: Some workloads aren’t good candidates for optimization. Consider:
- Increasing rolling window
- Using higher percentile (P99)
- Increasing safety factor
- Using manual tuning instead
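These knobs interact in a predictable way. As a rough mental model — an assumed formula, not OptiPod's published algorithm — the recommendation is approximately percentile(usage) × safetyFactor, clamped to resourceBounds:

```python
import math

# Hypothetical sizing sketch: the nearest-rank percentile and the
# combination below are assumptions, not OptiPod's actual code.
def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile, e.g. p=0.90 for P90."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

def recommend(samples: list, p: float, safety: float, lo: float, hi: float) -> float:
    """percentile * safetyFactor, clamped to [lo, hi] resource bounds."""
    return min(max(percentile(samples, p) * safety, lo), hi)

# 100 CPU samples (millicores): steady 100m with occasional 400m spikes
usage = [100.0] * 95 + [400.0] * 5
recommend(usage, 0.90, 1.2, 10.0, 8000.0)  # P90 ignores the spikes: 120.0
recommend(usage, 0.99, 1.2, 10.0, 8000.0)  # P99 covers them: 480.0
```

The example also shows why a higher percentile or a longer rolling window stabilizes recommendations for bursty workloads: brief spikes only register at the tail of the distribution.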
Auto Mode Issues
Changes Not Being Applied
Symptoms:
- Policy in Auto mode but resources not updating
- No optipod.io/last-applied annotation
Diagnosis:
```shell
# Check policy mode
kubectl get optimizationpolicy <policy-name> -o jsonpath='{.spec.mode}'

# Check policy status
kubectl get optimizationpolicy <policy-name> -o yaml | grep -A 20 status

# Check controller logs
kubectl logs -n optipod-system -l app=optipod-controller | grep "Applied"
```

Common Causes:
- Reconciliation interval not reached
Solution: Wait for next reconciliation or reduce interval:
```yaml
spec:
  reconciliationInterval: 5m # Default is 5m
```

- Update strategy requires pod restart
If using webhook strategy with onNextRestart, changes won’t apply until pods restart naturally.
Solution: Use immediate rollout or trigger restart:
```yaml
updateStrategy:
  strategy: webhook
  rolloutStrategy: immediate # Force rolling restart
```

Or manually restart:
```shell
kubectl rollout restart deployment <name>
```

- Resource bounds preventing update
Recommendations may be outside configured bounds.
Solution: Check and adjust bounds:
```yaml
resourceBounds:
  cpu:
    min: 10m
    max: 8000m
  memory:
    min: 64Mi
    max: 16Gi
```

Memory Safety Issues
Symptoms:
- OptiPod not applying memory decreases
- Memory recommendations being blocked
Diagnosis:
```shell
# Check controller logs for safety blocks
kubectl logs -n optipod-system -l app=optipod-controller | grep -i "memory.*blocked"

# Check current memory usage
kubectl top pod -l app=<name>

# Check recommendations
kubectl get deployment <name> -o yaml | grep "optipod.io/memory"
```

Common Causes:
- Memory safety check blocking decrease
By default, OptiPod blocks memory decreases that could cause OOM kills.
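The principle behind this gate can be sketched as follows. This is a hypothetical illustration — the function name, the peak-usage input, and the 15% margin are assumptions; the real check lives in the controller:

```python
# Hypothetical sketch of a memory-decrease safety gate; names and the
# 15% margin are assumed, not OptiPod's actual implementation.
def is_safe_memory_decrease(current_mib: float, proposed_mib: float,
                            peak_usage_mib: float, margin: float = 1.15) -> bool:
    """Allow a decrease only if the proposed request still exceeds
    observed peak usage by the safety margin."""
    if proposed_mib >= current_mib:
        return True  # not a decrease, nothing to gate
    return proposed_mib >= peak_usage_mib * margin

is_safe_memory_decrease(512, 256, 200)  # 256 >= 230: allowed
is_safe_memory_decrease(512, 200, 200)  # 200 < 230: blocked
```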
Solution: If you’re confident the decrease is safe:
```yaml
updateStrategy:
  allowUnsafeMemoryDecrease: true # Use with caution!
```

- Gradual decrease not yet implemented
⚠️ Note: The gradualDecreaseConfig feature is not yet implemented. Memory decreases are applied immediately in full, subject to safety checks.
If you need to slow down memory optimization, use more conservative settings:
```yaml
metricsConfig:
  safetyFactor: 1.5 # Higher safety factor
  percentile: P95 # Higher percentile
```

OOM Kills After OptiPod Changes
Symptoms:
- Pods being OOM killed
- Started after OptiPod applied changes
Immediate Action:
```shell
# Disable Auto mode
kubectl patch optimizationpolicy <policy-name> \
  --type merge \
  -p '{"spec":{"mode":"Disabled"}}'

# Revert memory
kubectl set resources deployment <name> \
  --requests=memory=<previous-value> \
  --limits=memory=<previous-value>
```

Investigation:
```shell
# Check memory usage
kubectl top pod -l app=<name>

# Check what was applied
kubectl get deployment <name> -o yaml | grep -A 5 "optipod.io/"

# Check for memory spikes in Prometheus
# Query: container_memory_usage_bytes{pod=~"<name>.*"}
```

Prevention:
```yaml
metricsConfig:
  percentile: P99 # Use higher percentile
  safetyFactor: 1.5 # More headroom

resourceBounds:
  memory:
    min: 256Mi # Higher floor

# Note: gradualDecreaseConfig not yet implemented
# Use higher safety factors and percentiles instead
```

Metrics Issues
Prometheus Connection Failed
Symptoms:
- Controller logs show Prometheus errors
- No recommendations generated
Diagnosis:
```shell
# Check controller logs
kubectl logs -n optipod-system -l app=optipod-controller | grep -i prometheus

# Test connectivity from controller pod
kubectl exec -n optipod-system <controller-pod> -- \
  curl -v http://prometheus-server.monitoring.svc:9090/-/healthy
```

Solution:
- Verify Prometheus is accessible
Prometheus configuration is external to OptiPod (not in the CRD). Configure via:
- Helm values
- Environment variables
- ConfigMap
Check Helm values:
```yaml
prometheus:
  address: http://prometheus-server.monitoring.svc:9090
```

- Check network policies
```shell
kubectl get networkpolicy -n optipod-system
kubectl get networkpolicy -n monitoring
```

- Verify Prometheus is running
```shell
kubectl get pods -n monitoring -l app=prometheus
```

metrics-server Not Available
Symptoms:
- kubectl top commands fail
- OptiPod can’t get metrics
Diagnosis:
```shell
# Check metrics-server
kubectl get apiservice v1beta1.metrics.k8s.io
kubectl get pods -n kube-system -l k8s-app=metrics-server
```

Solution:
- Install metrics-server
```shell
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```

- Fix metrics-server issues
```shell
# Common fix for self-signed certs
kubectl patch deployment metrics-server -n kube-system --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'
```

GitOps Integration Issues
ArgoCD/Flux Detecting Drift
Symptoms:
- GitOps tool shows resources as out-of-sync
- Constant sync attempts
Solution:
- Use webhook strategy with onNextRestart
This keeps workload specs unchanged in Git:
```yaml
updateStrategy:
  strategy: webhook
  rolloutStrategy: onNextRestart
```

OptiPod stores recommendations as annotations; the webhook applies them at pod creation.
- Configure GitOps to ignore OptiPod annotations
For ArgoCD:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /metadata/annotations/optipod.io~1cpu-request.*
        - /metadata/annotations/optipod.io~1memory-request.*
        - /metadata/annotations/optipod.io~1cpu-limit.*
        - /metadata/annotations/optipod.io~1memory-limit.*
```

- Use SSA strategy with ArgoCD SSA support
If using ArgoCD 2.5+:
```yaml
syncPolicy:
  syncOptions:
    - ServerSideApply=true
```

Then configure OptiPod:
```yaml
updateStrategy:
  strategy: ssa
  useServerSideApply: true
```

- Use Recommend mode
Most GitOps-friendly approach:
```yaml
mode: Recommend # Manual application via GitOps
```

Performance Issues
Controller Using Too Much CPU/Memory
Section titled “Controller Using Too Much CPU/Memory”Diagnosis:
```shell
kubectl top pod -n optipod-system
```

Solution:
- Increase controller resources
```yaml
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
```

- Reduce workload count
- Use more specific label selectors
- Split into multiple policies with different reconciliation intervals
- Increase reconciliation interval
```yaml
spec:
  reconciliationInterval: 15m # Less frequent reconciliation
```

Slow Recommendation Generation
Solution:
- Reduce rolling window
```yaml
metricsConfig:
  rollingWindow: 3d # Shorter window (less data to process)
```

- Use metrics-server instead of Prometheus
For simpler deployments:
```yaml
metricsConfig:
  provider: metrics-server
```

Note: metrics-server provides less historical data but is faster.
Getting Help
If you can’t resolve your issue:
- Collect diagnostic information
```shell
# Controller logs
kubectl logs -n optipod-system -l app=optipod-controller > controller-logs.txt

# Policy status
kubectl get optimizationpolicy -A -o yaml > policies.yaml

# Workload status
kubectl get deployment <name> -o yaml > workload.yaml

# Events
kubectl get events -A --sort-by='.lastTimestamp' > events.txt
```

- Check GitHub Issues
- Search existing issues: https://github.com/Sagart-cactus/optipod/issues
- Create new issue with diagnostic info
- Community Support
- Join discussions: https://github.com/Sagart-cactus/optipod/discussions
- Ask questions with full context