Troubleshooting Guide
This guide helps diagnose and resolve common issues with OptiPod.
Quick Diagnostics
Check OptiPod Status
# Check controller deployment
kubectl get deployment -n optipod-system optipod-controller
# Check controller logs
kubectl logs -n optipod-system deployment/optipod-controller --tail=100
# Check webhook deployment (if enabled)
kubectl get deployment -n optipod-system optipod-webhook
# Check webhook logs
kubectl logs -n optipod-system deployment/optipod-webhook --tail=100
Check Policy Status
# List all policies
kubectl get optimizationpolicy -A
# Get policy details
kubectl get optimizationpolicy <policy-name> -n optipod-system -o yaml
# Check policy conditions
kubectl get optimizationpolicy <policy-name> -n optipod-system \
  -o jsonpath='{.status.conditions}' | jq
Check Events
# All OptiPod events
kubectl get events -A --field-selector source=optipod
# Policy-specific events
kubectl get events -n optipod-system --field-selector involvedObject.name=<policy-name>
# Workload-specific events
kubectl get events -n <namespace> --field-selector involvedObject.name=<workload-name>
Common Issues
Issue 1: Policy Validation Failed
Symptoms:
- Policy status shows ValidationFailed
- Event: Policy validation failed
- Policy not processing workloads
Possible Causes:
- Missing required fields
- Invalid resource bounds
- Invalid selector configuration
- Invalid metrics configuration
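Of the causes above, inverted resource bounds are easy to catch before a policy is applied. The following is a minimal sketch and not part of OptiPod: `cpu_to_millicores` and `cpu_bounds_valid` are hypothetical helper names, and the sketch assumes CPU bounds are written either in millicores (`100m`) or in whole or fractional cores (`2`, `1.5`).

```shell
# Hypothetical helpers (not part of OptiPod) for checking CPU bounds locally.

# Convert a Kubernetes CPU quantity to millicores.
# Accepts millicores ("250m") or cores ("2", "1.5").
cpu_to_millicores() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;
    *)  awk -v cores="$1" 'BEGIN { printf "%d\n", cores * 1000 + 0.5 }' ;;
  esac
}

# Succeed (exit status 0) when min <= max, fail otherwise.
cpu_bounds_valid() {
  [ "$(cpu_to_millicores "$1")" -le "$(cpu_to_millicores "$2")" ]
}

# Bounds taken from the invalid and valid examples in this guide:
cpu_bounds_valid "2000m" "1000m" || echo "invalid: min > max"
cpu_bounds_valid "100m" "2000m" && echo "valid: min <= max"
```

Memory quantities use a different set of suffixes (`Mi`, `Gi`, and so on) and would need their own conversion.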
Diagnosis:
# Check policy validation error
kubectl get optimizationpolicy <policy-name> -n optipod-system \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
# View policy spec
kubectl get optimizationpolicy <policy-name> -n optipod-system -o yaml
Solutions:
Missing Selector:
# ❌ Invalid - no selector
spec:
  mode: Recommend
  metricsConfig:
    provider: prometheus
# ✅ Valid - has selector
spec:
  mode: Recommend
  selector:
    namespaceSelector:
      matchLabels:
        environment: production
  metricsConfig:
    provider: prometheus
Invalid Resource Bounds:
# ❌ Invalid - min > max
spec:
  resourceBounds:
    cpu:
      min: "2000m"
      max: "1000m"  # Error: max < min
# ✅ Valid - min < max
spec:
  resourceBounds:
    cpu:
      min: "100m"
      max: "2000m"
Invalid Metrics Config:
# ❌ Invalid - missing provider
spec:
  metricsConfig:
    rollingWindow: 24h
# ✅ Valid - has provider
spec:
  metricsConfig:
    provider: prometheus
    rollingWindow: 24h
Issue 2: No Workloads Discovered
Symptoms:
- Policy status shows workloadsDiscovered: 0
- No recommendations generated
- No events for workload processing
Possible Causes:
- Selector doesn’t match any workloads
- Workloads in excluded namespaces
- Workload types not included
- RBAC permissions missing
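To rule out the RBAC cause above in one pass, the individual `kubectl auth can-i` checks can be wrapped in a loop. This is a minimal sketch, assuming the controller runs as the service account `system:serviceaccount:optipod-system:optipod-controller` used elsewhere in this guide; `check_workload_rbac` is a hypothetical helper name.

```shell
# Hypothetical helper: report whether a service account may perform each
# given verb on the workload kinds OptiPod manages.
# Usage: check_workload_rbac <service-account> <verb>...
check_workload_rbac() (
  sa="$1"
  shift
  for verb in "$@"; do
    for resource in deployments statefulsets daemonsets; do
      # can-i prints "yes" or "no"; it checks the current namespace by default
      answer=$(kubectl auth can-i "$verb" "$resource" --as="$sa" 2>/dev/null)
      printf '%s %s: %s\n' "$verb" "$resource" "${answer:-unknown}"
    done
  done
)
```

For example, `check_workload_rbac system:serviceaccount:optipod-system:optipod-controller get list watch` prints one line per verb and workload kind; any `no` points to a missing rule.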
Diagnosis:
# Check policy selector
kubectl get optimizationpolicy <policy-name> -n optipod-system \
  -o jsonpath='{.spec.selector}' | jq
# List workloads in target namespace
kubectl get deployments,statefulsets,daemonsets -n <namespace> --show-labels
# Check RBAC permissions
kubectl auth can-i list deployments \
  --as=system:serviceaccount:optipod-system:optipod-controller
kubectl auth can-i list statefulsets \
  --as=system:serviceaccount:optipod-system:optipod-controller
kubectl auth can-i list daemonsets \
  --as=system:serviceaccount:optipod-system:optipod-controller
Solutions:
Fix Selector Mismatch:
# Check workload labels
kubectl get deployment <name> -n <namespace> --show-labels
# Update policy selector to match
kubectl patch optimizationpolicy <policy-name> -n optipod-system \
  --type merge \
  --patch '{"spec":{"selector":{"workloadSelector":{"matchLabels":{"app":"<label-value>"}}}}}'
Include Workload Types:
# Ensure workload types are included
spec:
  selector:
    workloadTypes:
      include:
        - Deployment
        - StatefulSet
        - DaemonSet
Fix RBAC Permissions:
# Check if RBAC role exists
kubectl get clusterrole optipod-controller-role
# Verify role binding
kubectl get clusterrolebinding optipod-controller-rolebinding
# If missing, reinstall OptiPod or apply RBAC manifests
kubectl apply -f config/rbac/
Issue 3: Recommendations Not Applied (Auto Mode)
Symptoms:
- Policy in Auto mode
- Recommendations exist in annotations
- No actual resource changes
- Event: UpdateFailed, or no events
Possible Causes:
- Global dry-run enabled
- Update strategy misconfigured
- RBAC permissions missing
- Workload doesn’t support updates
- SSA conflicts
Diagnosis:
# Check if dry-run is enabled
kubectl get deployment optipod-controller -n optipod-system \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="DRY_RUN")].value}'
# Check update strategy
kubectl get optimizationpolicy <policy-name> -n optipod-system \
  -o jsonpath='{.spec.updateStrategy}' | jq
# Check RBAC for updates
kubectl auth can-i update deployments \
  --as=system:serviceaccount:optipod-system:optipod-controller
kubectl auth can-i patch deployments \
  --as=system:serviceaccount:optipod-system:optipod-controller
# Check for SSA conflicts (recent kubectl versions hide managedFields in YAML output unless asked)
kubectl get deployment <name> -n <namespace> -o yaml --show-managed-fields | grep managedFields -A 20
Solutions:
Disable Dry-Run:
# Remove DRY_RUN environment variable
kubectl set env deployment/optipod-controller -n optipod-system DRY_RUN-
# Or set to false
kubectl set env deployment/optipod-controller -n optipod-system DRY_RUN=false
Fix Update Strategy:
# Ensure update strategy is configured
spec:
  updateStrategy:
    strategy: ssa  # or webhook
    allowInPlaceResize: true
    allowRecreate: false
    updateRequestsOnly: true
Fix RBAC for Updates:
# Ensure ClusterRole has update/patch permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: optipod-controller-role
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "daemonsets"]
    verbs: ["get", "list", "watch", "update", "patch"]
Resolve SSA Conflicts:
# Check field managers
kubectl get deployment <name> -n <namespace> \
  -o jsonpath='{.metadata.managedFields[*].manager}' | tr ' ' '\n'
# If another manager owns fields, enable force in policy
kubectl patch optimizationpolicy <policy-name> -n optipod-system \
  --type merge \
  --patch '{"spec":{"updateStrategy":{"forceOwnership":true}}}'
Issue 4: Metrics Collection Failed
Symptoms:
- Event: MetricsCollectionFailed
- No recommendations generated
- Controller logs show metrics errors
Possible Causes:
- Prometheus not accessible
- Authentication failed
- Workload has no metrics
- Query timeout
Diagnosis:
# Check Prometheus connectivity
kubectl get svc -n monitoring prometheus-operated
# Test Prometheus query
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -s "http://prometheus-operated.monitoring:9090/api/v1/query?query=up"
# Check OptiPod metrics config
kubectl get optimizationpolicy <policy-name> -n optipod-system \
  -o jsonpath='{.spec.metricsConfig}' | jq
# Check controller logs for metrics errors
kubectl logs -n optipod-system deployment/optipod-controller | grep -i "metrics"
Solutions:
Fix Prometheus URL:
# Update metrics config with correct URL
spec:
  metricsConfig:
    provider: prometheus
    prometheusURL: "http://prometheus-operated.monitoring:9090"
Configure Authentication:
# For basic auth
spec:
  metricsConfig:
    provider: prometheus
    prometheusURL: "http://prometheus-operated.monitoring:9090"
    prometheusAuth:
      type: basic
      secretRef:
        name: prometheus-auth
        namespace: optipod-system
# Create secret
kubectl create secret generic prometheus-auth \
  -n optipod-system \
  --from-literal=username=admin \
  --from-literal=password=<password>
Wait for Metrics:
# Workloads need runtime data before metrics are available
# Check if workload has been running long enough
kubectl get pods -n <namespace> -l app=<workload>
# Verify metrics exist in Prometheus
# (-G with --data-urlencode keeps curl from interpreting the braces in the query as URL globbing)
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -sG "http://prometheus-operated.monitoring:9090/api/v1/query" \
  --data-urlencode 'query=container_cpu_usage_seconds_total{pod=~"<pod-name>.*"}'
Issue 5: Webhook Not Mutating Pods
Symptoms:
- Webhook deployed
- Pods created without OptiPod resources
- No webhook events
Possible Causes:
- Webhook not registered
- Certificate issues
- Webhook service not accessible
- Namespace not labeled
- No matching policies
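The "namespace not labeled" cause above can be checked directly. This is a minimal sketch that assumes the `optipod.io/webhook=enabled` label described in this guide; `webhook_enabled_for_namespace` is a hypothetical helper name, not an OptiPod command.

```shell
# Hypothetical helper: succeed when the namespace carries the
# optipod.io/webhook=enabled label that this guide says the webhook requires.
webhook_enabled_for_namespace() (
  # Dots inside the label key must be escaped in a kubectl jsonpath expression
  value=$(kubectl get namespace "$1" \
    -o jsonpath='{.metadata.labels.optipod\.io/webhook}' 2>/dev/null)
  [ "$value" = "enabled" ]
)
```

For example, `webhook_enabled_for_namespace <namespace> || echo "namespace is not labeled for the webhook"`.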
Diagnosis:
# Check webhook configuration
kubectl get mutatingwebhookconfiguration optipod-webhook
# Check webhook service
kubectl get svc -n optipod-system optipod-webhook
# Check webhook endpoints
kubectl get endpoints -n optipod-system optipod-webhook
# Check webhook logs
kubectl logs -n optipod-system deployment/optipod-webhook
# Check certificate
kubectl get secret -n optipod-system optipod-webhook-cert
# Test webhook health
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -k https://optipod-webhook.optipod-system.svc:443/health
Solutions:
Fix Webhook Registration:
# Verify webhook configuration exists
kubectl get mutatingwebhookconfiguration optipod-webhook -o yaml
# If missing, reinstall webhook
helm upgrade optipod optipod/optipod \
  --set webhook.enabled=true \
  --reuse-values
Fix Certificate Issues:
# Check cert-manager is installed
kubectl get pods -n cert-manager
# Check certificate status
kubectl get certificate -n optipod-system optipod-webhook-cert
# If certificate not ready, check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
# Force certificate renewal by deleting the certificate's Secret
kubectl delete secret -n optipod-system optipod-webhook-cert
# cert-manager reissues the certificate when its Secret is missing
Label Namespace for Webhook:
# Webhook only processes labeled namespaces
kubectl label namespace <namespace> optipod.io/webhook=enabled
# Or enable for all namespaces (not recommended)
kubectl patch mutatingwebhookconfiguration optipod-webhook \
  --type json \
  -p '[{"op":"remove","path":"/webhooks/0/namespaceSelector"}]'
Verify Policy Matches:
# Check if policy selector matches workload
kubectl get optimizationpolicy -A
# Ensure workload has recommendations
kubectl get deployment <name> -n <namespace> \
  -o jsonpath='{.metadata.annotations}' | jq | grep optipod.io
Issue 6: Performance Degradation After Optimization
Symptoms:
- Increased latency
- CPU throttling
- OOMKills
- Pod restarts
Possible Causes:
- Safety factor too low
- Percentile too low
- Metrics window too short
- Workload behavior changed
- Memory decreased too aggressively
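A worked example shows why the first two causes above matter. It assumes, as is common for right-sizing tools, that the recommended request is the selected usage percentile multiplied by the safety factor. This guide does not state OptiPod's exact formula, so treat the numbers as illustrative; `recommend_millicores` is a hypothetical helper name.

```shell
# Illustrative arithmetic only.
# Assumption: recommendation = percentile usage (millicores) * safety factor.
# recommend_millicores is a hypothetical helper, not an OptiPod command.
recommend_millicores() {
  awk -v usage="$1" -v factor="$2" 'BEGIN { printf "%d\n", usage * factor + 0.5 }'
}

# A workload whose chosen-percentile CPU usage is 400m:
recommend_millicores 400 1.2   # prints 480
recommend_millicores 400 1.5   # prints 600
```

Under that assumption, a low safety factor leaves little headroom above observed usage, so a short traffic spike can reach the limit and cause throttling or an OOMKill.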
Diagnosis:
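# Check for OOMKills (normally recorded in container status, so check both events and pod status)
kubectl get events -n <namespace> --field-selector reason=OOMKilled
kubectl get pods -n <namespace> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
# Check CPU usage (usage pinned at the CPU limit suggests throttling)
kubectl top pods -n <namespace>
# Check pod restarts
kubectl get pods -n <namespace> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
# Check current resources vs recommendations
kubectl get deployment <name> -n <namespace> -o yaml | grep -A 10 resources
# Check policy settings
kubectl get optimizationpolicy <policy-name> -n optipod-system \
  -o jsonpath='{.spec.metricsConfig}' | jq
Solutions: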
Emergency: Revert to Recommend Mode:
kubectl patch optimizationpolicy <policy-name> -n optipod-system \
  --type merge \
  --patch '{"spec":{"mode":"Recommend"}}'
Increase Safety Factor:
kubectl patch optimizationpolicy <policy-name> -n optipod-system \
  --type merge \
  --patch '{"spec":{"metricsConfig":{"safetyFactor":1.5}}}'
Use Higher Percentile:
kubectl patch optimizationpolicy <policy-name> -n optipod-system \
  --type merge \
  --patch '{"spec":{"metricsConfig":{"percentile":"P95"}}}'
Increase Rolling Window:
kubectl patch optimizationpolicy <policy-name> -n optipod-system \
  --type merge \
  --patch '{"spec":{"metricsConfig":{"rollingWindow":"48h"}}}'
Enable Gradual Memory Decrease (Not Yet Implemented):
⚠️ Note: This configuration is accepted but not yet implemented.
kubectl patch optimizationpolicy <policy-name> -n optipod-system \
  --type merge \
  --patch '{"spec":{"updateStrategy":{"gradualDecreaseConfig":{"enabled":true,"memoryDecreasePercentage":10}}}}'
Manually Restore Resources:
# Restore previous resource values
kubectl patch deployment <name> -n <namespace> \
  --type strategic \
  --patch '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"requests":{"cpu":"1000m","memory":"2Gi"}}}]}}}}'
Issue 7: Controller Crash Loop
Symptoms:
- Controller pod restarting
- CrashLoopBackOff status
- Controller logs show errors
Possible Causes:
- Invalid configuration
- RBAC permissions missing
- Resource limits too low
- Metrics provider unreachable
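The container's last termination reason helps separate these causes: `OOMKilled` points at resource limits, while `Error` means the process exited on its own. This is a minimal sketch, assuming the `app=optipod-controller` label used throughout this guide; both function names are hypothetical.

```shell
# Hypothetical helper: map a container's last termination reason to the
# likely cause from the list above.
explain_termination() {
  case "$1" in
    OOMKilled) echo "OOMKilled: memory limit too low for the controller" ;;
    Error)     echo "Error: process exited non-zero; check configuration, RBAC, and metrics provider in the previous logs" ;;
    "")        echo "no previous termination recorded" ;;
    *)         echo "$1: see pod events for details" ;;
  esac
}

# Read the last termination reason of the first controller pod and explain it.
controller_crash_reason() (
  reason=$(kubectl get pods -n optipod-system -l app=optipod-controller \
    -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.reason}' 2>/dev/null)
  explain_termination "$reason"
)
```

Running `controller_crash_reason` against a crash-looping controller gives a first hint about which of the solutions in this section to try.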
Diagnosis:
# Check pod status
kubectl get pods -n optipod-system -l app=optipod-controller
# Check pod events
kubectl describe pod -n optipod-system -l app=optipod-controller
# Check logs from previous crash
kubectl logs -n optipod-system -l app=optipod-controller --previous
# Check resource usage
kubectl top pod -n optipod-system -l app=optipod-controller
Solutions:
Check Configuration:
# View controller configuration
kubectl get deployment optipod-controller -n optipod-system -o yaml
# Check for invalid environment variables
kubectl get deployment optipod-controller -n optipod-system \
  -o jsonpath='{.spec.template.spec.containers[0].env}' | jq
Increase Resource Limits:
kubectl patch deployment optipod-controller -n optipod-system \
  --type strategic \
  --patch '{"spec":{"template":{"spec":{"containers":[{"name":"controller","resources":{"limits":{"cpu":"500m","memory":"512Mi"},"requests":{"cpu":"100m","memory":"128Mi"}}}]}}}}'
Fix RBAC:
# Reinstall RBAC from a local checkout of the OptiPod repository
# (kubectl cannot apply a whole directory from a raw.githubusercontent.com URL)
kubectl apply -f config/rbac/
Issue 8: High Controller CPU/Memory Usage
Symptoms:
- Controller using excessive resources
- Slow reconciliation
- Cluster performance impact
Possible Causes:
- Too many workloads
- Reconciliation interval too short
- Metrics queries too frequent
- Memory leak
Diagnosis:
# Check resource usage
kubectl top pod -n optipod-system -l app=optipod-controller
# Check number of policies and workloads (--no-headers keeps header lines out of the count)
kubectl get optimizationpolicy -A
kubectl get deployments,statefulsets,daemonsets -A --no-headers | wc -l
# Check reconciliation frequency
kubectl logs -n optipod-system deployment/optipod-controller | grep "reconciliation"
Solutions:
Increase Reconciliation Interval:
# Update all policies to reconcile less frequently
# (-o name omits the namespace, so pass it to the patch command as well)
kubectl get optimizationpolicy -n optipod-system -o name | xargs -I {} \
  kubectl patch {} -n optipod-system --type merge \
  --patch '{"spec":{"reconciliationInterval":"10m"}}'
Increase Controller Resources:
kubectl patch deployment optipod-controller -n optipod-system \
  --type strategic \
  --patch '{"spec":{"template":{"spec":{"containers":[{"name":"controller","resources":{"limits":{"cpu":"1000m","memory":"1Gi"},"requests":{"cpu":"200m","memory":"256Mi"}}}]}}}}'
Reduce Workload Scope:
# Use more specific selectors to reduce workload count
kubectl patch optimizationpolicy <policy-name> -n optipod-system \
  --type merge \
  --patch '{"spec":{"selector":{"namespaceSelector":{"matchLabels":{"optimize":"true"}}}}}'
Diagnostic Commands Reference
Controller Diagnostics
# Controller status
kubectl get deployment -n optipod-system optipod-controller
kubectl get pods -n optipod-system -l app=optipod-controller
# Controller logs (last 100 lines)
kubectl logs -n optipod-system deployment/optipod-controller --tail=100
# Controller logs (follow)
kubectl logs -n optipod-system deployment/optipod-controller -f
# Controller logs (previous crash)
kubectl logs -n optipod-system -l app=optipod-controller --previous
# Controller resource usage
kubectl top pod -n optipod-system -l app=optipod-controller
# Controller configuration
kubectl get deployment optipod-controller -n optipod-system -o yaml
Policy Diagnostics
# List all policies
kubectl get optimizationpolicy -A
# Policy details
kubectl get optimizationpolicy <name> -n optipod-system -o yaml
# Policy status
kubectl get optimizationpolicy <name> -n optipod-system \
  -o jsonpath='{.status}' | jq
# Policy conditions
kubectl get optimizationpolicy <name> -n optipod-system \
  -o jsonpath='{.status.conditions}' | jq
# Policy events
kubectl get events -n optipod-system --field-selector involvedObject.name=<name>
Workload Diagnostics
# Workload annotations
kubectl get deployment <name> -n <namespace> \
  -o jsonpath='{.metadata.annotations}' | jq | grep optipod.io
# Workload resources
kubectl get deployment <name> -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[*].resources}' | jq
# Workload events
kubectl get events -n <namespace> --field-selector involvedObject.name=<name>
# Pod status
kubectl get pods -n <namespace> -l app=<name>
# Pod resource usage
kubectl top pods -n <namespace> -l app=<name>
Webhook Diagnostics
# Webhook status
kubectl get deployment -n optipod-system optipod-webhook
kubectl get pods -n optipod-system -l app=optipod-webhook
# Webhook logs
kubectl logs -n optipod-system deployment/optipod-webhook --tail=100
# Webhook configuration
kubectl get mutatingwebhookconfiguration optipod-webhook -o yaml
# Webhook service
kubectl get svc -n optipod-system optipod-webhook
# Webhook certificate
kubectl get certificate -n optipod-system optipod-webhook-cert
kubectl get secret -n optipod-system optipod-webhook-cert
# Test webhook health
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -k https://optipod-webhook.optipod-system.svc:443/health
Metrics Diagnostics
# Test Prometheus connectivity
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -s "http://prometheus-operated.monitoring:9090/api/v1/query?query=up"
# Query workload metrics
# (-G with --data-urlencode keeps curl from interpreting the braces in the query as URL globbing)
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -sG "http://prometheus-operated.monitoring:9090/api/v1/query" \
  --data-urlencode 'query=container_cpu_usage_seconds_total{pod=~"<pod-name>.*"}'
# Check metrics-server (if using)
kubectl top nodes
kubectl top pods -A
Getting Help
If you’re still experiencing issues:
- Check GitHub Issues: github.com/optipod/optipod/issues
- Review Documentation: optipod.io/docs
- Join Community: Slack/Discord links in README
- File a Bug Report: Include:
  - OptiPod version
  - Kubernetes version
  - Policy configuration
  - Controller logs
  - Relevant events
  - Steps to reproduce
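Most of the items listed above can be gathered in one step before filing a report. This is a minimal sketch, assuming the `optipod-system` namespace and resource names used in this guide; `collect_optipod_diagnostics` is a hypothetical helper name, and the controller image tag stands in for the OptiPod version.

```shell
# Hypothetical helper: gather bug-report material into one directory.
# Usage: collect_optipod_diagnostics [output-directory]
collect_optipod_diagnostics() (
  out="${1:-optipod-diagnostics}"
  ns=optipod-system
  mkdir -p "$out" || exit 1

  # Kubernetes client and server versions
  kubectl version > "$out/kubernetes-version.txt" 2>&1
  # Controller image, used here as a stand-in for the OptiPod version
  kubectl get deployment optipod-controller -n "$ns" \
    -o jsonpath='{.spec.template.spec.containers[0].image}' > "$out/optipod-image.txt" 2>&1
  # Policy configuration
  kubectl get optimizationpolicy -A -o yaml > "$out/policies.yaml" 2>&1
  # Controller logs
  kubectl logs -n "$ns" deployment/optipod-controller --tail=1000 > "$out/controller-logs.txt" 2>&1
  # Relevant events
  kubectl get events -n "$ns" > "$out/events.txt" 2>&1

  echo "Diagnostics written to $out"
)
```

Review the collected files for credentials or other sensitive values before attaching them to a public issue, and add the steps to reproduce by hand.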
Next Steps
- Creating Policies - Policy configuration guide
- Reviewing Recommendations - Understanding recommendations
- Switching to Auto Mode - Enabling automatic optimization
- Safety Model - Understanding safety guarantees
- Operations - Operational procedures