
Troubleshooting

This guide covers common issues you might encounter with OptiPod and how to resolve them.

General Debugging Steps

Before diving into specific issues, try these general steps:

  1. Check operator logs

kubectl logs -n optipod-system -l app=optipod-operator --tail=100

  2. Verify policy status

kubectl get optimizationpolicy -A
kubectl describe optimizationpolicy <policy-name> -n <namespace>

  3. Check workload labels

kubectl get deployment <name> -o jsonpath='{.metadata.labels}'

  4. Verify the metrics provider

# For Prometheus
kubectl get svc -n monitoring prometheus-server

# For metrics-server
kubectl get apiservice v1beta1.metrics.k8s.io
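If you run these checks often, they can be collapsed into one pass/fail sweep. A sketch, assuming the default install (namespace optipod-system, operator label app=optipod-operator); adjust the names to your cluster:

```shell
#!/bin/sh
# optipod-precheck.sh -- run the general checks above in one pass.
# Only tests that each object is reachable, not that it is healthy.

fail=0

check() {
  # $1: description; remaining args: command to run quietly
  desc=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $desc"
  else
    echo "FAIL: $desc"
    fail=1
  fi
}

if [ "${1:-}" = "--run" ]; then
  check "operator pods present" kubectl get pods -n optipod-system -l app=optipod-operator
  check "policies readable"     kubectl get optimizationpolicy -A
  check "metrics-server API"    kubectl get apiservice v1beta1.metrics.k8s.io
  check "prometheus service"    kubectl get svc -n monitoring prometheus-server
  exit "$fail"
fi
```

Run it with `./optipod-precheck.sh --run`; a FAIL line tells you which of the sections below to read first.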

Operator Pod Crashing

Symptoms:

  • OptiPod operator pod in CrashLoopBackOff
  • Pod fails to start

Diagnosis:

kubectl get pods -n optipod-system
kubectl describe pod <pod-name> -n optipod-system
kubectl logs <pod-name> -n optipod-system

Common Causes:

  1. Insufficient RBAC permissions

# Check RBAC
kubectl get clusterrole optipod-operator
kubectl get clusterrolebinding optipod-operator
Solution: Reinstall with the correct RBAC:

kubectl apply -f https://raw.githubusercontent.com/Sagart-cactus/optipod/main/deploy/rbac.yaml

  2. Missing CRDs

# Check CRDs
kubectl get crd optimizationpolicies.optipod.optipod.io

Solution: Install the CRDs:

kubectl apply -f https://raw.githubusercontent.com/Sagart-cactus/optipod/main/config/crd/bases/
  3. Resource constraints

Solution: Increase operator resources:

resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Webhook Failures

Symptoms:

  • Pods fail to start
  • Webhook timeout errors

Diagnosis:

kubectl get mutatingwebhookconfiguration optipod-webhook
kubectl describe mutatingwebhookconfiguration optipod-webhook

Solution:

  1. Check the webhook deployment

kubectl get pods -n optipod-system -l app=optipod-webhook
kubectl logs -n optipod-system -l app=optipod-webhook

  2. Check the webhook service

kubectl get svc -n optipod-system optipod-webhook

  3. Verify certificates

kubectl get secret -n optipod-system optipod-webhook-cert

# Or check the cert-manager certificate
kubectl get certificate -n optipod-system
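A common cause of webhook timeouts is an expired serving certificate. As a quick check, the cert can be pulled from the secret and decoded. A sketch; the secret name and the tls.crt key assume the default install:

```shell
#!/bin/sh
# check-webhook-cert.sh -- decode the webhook serving certificate from the
# secret and print its expiry date. Adjust names if your install differs.

cert_expiry() {
  # $1: namespace, $2: secret name
  kubectl get secret -n "$1" "$2" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | openssl x509 -noout -enddate
}

if [ "${1:-}" = "--run" ]; then
  cert_expiry optipod-system optipod-webhook-cert
fi
```

If the printed notAfter date is in the past, rotate the certificate (or let cert-manager reissue it) and restart the webhook pods.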

No Recommendations Generated

Symptoms:

  • Workloads labeled but no recommendations appear
  • Recommendation annotations missing

Diagnosis:

# Check if the workload is targeted
kubectl get deployment <name> -o yaml | grep -A 5 labels

# Check the policy
kubectl get optimizationpolicy -n <namespace>
kubectl describe optimizationpolicy <policy-name> -n <namespace>

# Check controller logs
kubectl logs -n optipod-system -l app=optipod-controller | grep <workload-name>

Common Causes:

  1. Insufficient metrics data

Solution: Wait for more data to accumulate (typically 24-48 hours with a 7-day rolling window).

  2. Workload not matching the policy selector

Solution: Verify that the labels match:

# Check workload labels
kubectl get deployment <name> -o jsonpath='{.metadata.labels}'

# Check the policy selector
kubectl get optimizationpolicy <policy-name> -o jsonpath='{.spec.selector.workloadSelector}'
  3. Policy in Disabled mode

Solution:

kubectl patch optimizationpolicy <policy-name> \
  --type merge \
  -p '{"spec":{"mode":"Recommend"}}'
  4. Metrics provider unreachable

Solution: Check connectivity:

# Test Prometheus
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://prometheus-server.monitoring.svc:9090/-/healthy

# Check metrics-server
kubectl top nodes
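Comparing the two label/selector outputs from cause 2 by eye is error-prone. A small helper can confirm that every selector pair appears in the workload's labels. This assumes the compact JSON form that recent kubectl versions print for jsonpath map queries; labels_match is a hypothetical helper, not part of OptiPod:

```shell
#!/bin/sh
# labels-match.sh -- check that every "key":"value" pair in a selector JSON
# object also appears in a labels JSON object. Simple substring matching on
# compact JSON, not a full JSON parser.

labels_match() {
  sel=$1; lbl=$2
  echo "$sel" | tr -d '{}' | tr ',' '\n' | while read -r pair; do
    case "$lbl" in
      *"$pair"*) ;;                        # pair present in labels
      *) echo "missing: $pair"; exit 1 ;;  # pair absent -> mismatch
    esac
  done
}

labels_match '{"app":"web","optipod.io/enabled":"true"}' \
             '{"app":"web","team":"payments","optipod.io/enabled":"true"}' \
  && echo "selector matches"
```

Paste the two jsonpath outputs in as the arguments; any "missing:" line names the selector pair the workload lacks.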

Recommendations Not Updating

Symptoms:

  • Old recommendations not refreshing
  • Timestamp not changing

Diagnosis:

# Check the last update time
kubectl get deployment <name> \
  -o jsonpath='{.metadata.annotations.optipod\.io/last-recommendation}'

# Check policy status
kubectl get optimizationpolicy <policy-name> -o yaml

Common Causes:

  1. Reconciliation interval not reached

Solution: Wait for the reconciliation interval to elapse, or shorten it:

spec:
  reconciliationInterval: 1m # Shorten the interval (default: 5m)
  2. Controller not running

Solution:

kubectl get pods -n optipod-system
kubectl rollout restart deployment optipod-controller -n optipod-system
  3. Metrics haven’t changed significantly

OptiPod may not update recommendations if metrics are stable and within acceptable range.

Incorrect Recommendations

Symptoms:

  • Recommendations too high or too low
  • Recommendations don’t match actual usage

Diagnosis:

# Check actual usage
kubectl top pod -l app=<name>

# Check recommendation annotations
kubectl get deployment <name> -o yaml | grep -A 10 "optipod.io/"

# Check the policy configuration
kubectl get optimizationpolicy <policy-name> -o yaml

Common Causes:

  1. Rolling window too short

Solution:

metricsConfig:
  rollingWindow: 14d # Increase the window

  2. Resource bounds too restrictive

Solution:

resourceBounds:
  cpu:
    min: 10m # Lower the minimum
    max: 8000m # Raise the maximum
  memory:
    min: 64Mi
    max: 16Gi
  3. Safety factor too high or too low

Solution:

metricsConfig:
  safetyFactor: 1.2 # Raise for more headroom, lower for more savings (default: 1.2)

  4. Wrong percentile

Solution:

metricsConfig:
  percentile: P90 # Try P99 for more headroom, P50 for more aggressive sizing
  5. Unstable workload

Solution: Some workloads aren’t good candidates for optimization. Consider:

  • Increasing rolling window
  • Using higher percentile (P99)
  • Increasing safety factor
  • Using manual tuning instead
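To build intuition for how these knobs interact, here is the arithmetic as an assumed model — percentile usage times safetyFactor, clamped to resourceBounds — not OptiPod's published formula:

```shell
#!/bin/sh
# recommend.sh -- sketch of a CPU request recommendation in millicores:
# percentile usage x safety factor, clamped to [min, max].
# Assumed model for illustration; OptiPod's actual algorithm may differ.

recommend_millicores() {
  # $1: observed percentile usage (m), $2: safetyFactor, $3: min (m), $4: max (m)
  awk -v u="$1" -v f="$2" -v lo="$3" -v hi="$4" 'BEGIN {
    r = u * f
    if (r < lo) r = lo   # floor at resourceBounds.cpu.min
    if (r > hi) r = hi   # cap at resourceBounds.cpu.max
    printf "%d\n", r
  }'
}

recommend_millicores 400 1.2 10 8000   # P95 usage of 400m -> 480m request
```

Under this model, a higher percentile or safety factor moves the raw value up, and the bounds silently override anything outside them — which is why overly tight bounds can make recommendations look "stuck".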

Auto Mode Not Applying Changes

Symptoms:

  • Policy in Auto mode but resources not updating
  • No optipod.io/last-applied annotation

Diagnosis:

# Check the policy mode
kubectl get optimizationpolicy <policy-name> -o jsonpath='{.spec.mode}'

# Check policy status
kubectl get optimizationpolicy <policy-name> -o yaml | grep -A 20 status

# Check controller logs
kubectl logs -n optipod-system -l app=optipod-controller | grep "Applied"

Common Causes:

  1. Reconciliation interval not reached

Solution: Wait for the next reconciliation or reduce the interval:

spec:
  reconciliationInterval: 5m # Default is 5m
  2. Update strategy requires a pod restart

If using the webhook strategy with onNextRestart, changes won’t apply until pods restart naturally.

Solution: Use an immediate rollout or trigger a restart:

updateStrategy:
  strategy: webhook
  rolloutStrategy: immediate # Force a rolling restart

Or manually restart:

kubectl rollout restart deployment <name>
  3. Resource bounds preventing the update

Recommendations may be outside the configured bounds.

Solution: Check and adjust the bounds:

resourceBounds:
  cpu:
    min: 10m
    max: 8000m
  memory:
    min: 64Mi
    max: 16Gi
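When checking by hand whether a recommendation sits inside these bounds, note that CPU quantities mix units (8000m versus 8 cores). A hypothetical helper to normalize both sides to millicores before comparing:

```shell
#!/bin/sh
# cpu-millicores.sh -- normalize a Kubernetes CPU quantity to millicores so
# recommendations and bounds can be compared numerically.
# Handles the common forms only: "500m" and plain cores like "2" or "1.5".

cpu_to_millicores() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;                            # already millicores
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 }' ;; # cores -> millicores
  esac
}

cpu_to_millicores 500m   # -> 500
cpu_to_millicores 1.5    # -> 1500
```

With both values in millicores, a recommendation below min or above max explains why no update was applied.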

Memory Decreases Not Applied

Symptoms:

  • OptiPod not applying memory decreases
  • Memory recommendations being blocked

Diagnosis:

# Check controller logs for safety blocks
kubectl logs -n optipod-system -l app=optipod-controller | grep -i "memory.*blocked"

# Check current memory usage
kubectl top pod -l app=<name>

# Check recommendations
kubectl get deployment <name> -o yaml | grep "optipod.io/memory"

Common Causes:

  1. Memory safety check blocking the decrease

By default, OptiPod blocks memory decreases that could cause OOM kills.

Solution: If you’re confident the decrease is safe:

updateStrategy:
  allowUnsafeMemoryDecrease: true # Use with caution!
  2. Gradual decrease not yet implemented

⚠️ Note: The gradualDecreaseConfig feature is not yet implemented. Memory decreases are applied immediately in full, subject to safety checks.

If you need to slow down memory optimization, use more conservative settings:

metricsConfig:
  safetyFactor: 1.5 # Higher safety factor
  percentile: P95 # Higher percentile

OOM Kills After Optimization

Symptoms:

  • Pods being OOM killed
  • Problems started after OptiPod applied changes

Immediate Action:

# Disable Auto mode
kubectl patch optimizationpolicy <policy-name> \
  --type merge \
  -p '{"spec":{"mode":"Disabled"}}'

# Revert memory
kubectl set resources deployment <name> \
  --requests=memory=<previous-value> \
  --limits=memory=<previous-value>

# Or roll back to the previous revision
kubectl rollout undo deployment <name>

Investigation:

# Check memory usage
kubectl top pod -l app=<name>
# Check what was applied
kubectl get deployment <name> -o yaml | grep -A 5 "optipod.io/"
# Check for memory spikes in Prometheus
# Query: container_memory_usage_bytes{pod=~"<name>.*"}

Prevention:

metricsConfig:
  percentile: P99 # Use a higher percentile
  safetyFactor: 1.5 # More headroom
resourceBounds:
  memory:
    min: 256Mi # Higher floor
# Note: gradualDecreaseConfig is not yet implemented;
# use higher safety factors and percentiles instead.

Prometheus Connection Errors

Symptoms:

  • Controller logs show Prometheus errors
  • No recommendations generated

Diagnosis:

# Check controller logs
kubectl logs -n optipod-system -l app=optipod-controller | grep -i prometheus

# Test connectivity from the controller pod
kubectl exec -n optipod-system <controller-pod> -- \
  curl -v http://prometheus-server.monitoring.svc:9090/-/healthy

Solution:

  1. Verify Prometheus is accessible

Prometheus configuration is external to OptiPod (not in the CRD). Configure it via:

  • Helm values
  • Environment variables
  • ConfigMap

Check the Helm values:

prometheus:
  address: http://prometheus-server.monitoring.svc:9090

  2. Check network policies

kubectl get networkpolicy -n optipod-system
kubectl get networkpolicy -n monitoring

  3. Verify Prometheus is running

kubectl get pods -n monitoring -l app=prometheus

metrics-server Issues

Symptoms:

  • kubectl top commands fail
  • OptiPod can’t get metrics

Diagnosis:

# Check metrics-server
kubectl get apiservice v1beta1.metrics.k8s.io
kubectl get pods -n kube-system -l k8s-app=metrics-server

Solution:

  1. Install metrics-server

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

  2. Fix metrics-server issues

# Common fix for self-signed kubelet certs
kubectl patch deployment metrics-server -n kube-system --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'

GitOps Sync Conflicts

Symptoms:

  • GitOps tool shows resources as out-of-sync
  • Constant sync attempts

Solution:

  1. Use the webhook strategy with onNextRestart

This keeps workload specs unchanged in Git:

updateStrategy:
  strategy: webhook
  rolloutStrategy: onNextRestart

OptiPod stores recommendations as annotations; the webhook applies them at pod creation.

  2. Configure your GitOps tool to ignore OptiPod annotations

For ArgoCD (jsonPointers are exact paths, not patterns, so list each annotation key explicitly and adjust the keys to match the annotations OptiPod sets on your workloads):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /metadata/annotations/optipod.io~1cpu-request
        - /metadata/annotations/optipod.io~1memory-request
        - /metadata/annotations/optipod.io~1cpu-limit
        - /metadata/annotations/optipod.io~1memory-limit
  3. Use the SSA strategy with ArgoCD server-side apply

If using ArgoCD 2.5+:

syncPolicy:
  syncOptions:
    - ServerSideApply=true

Then configure OptiPod:

updateStrategy:
  strategy: ssa
  useServerSideApply: true

  4. Use Recommend mode

This is the most GitOps-friendly approach:

mode: Recommend # Manual application via GitOps

High Controller Resource Usage

Diagnosis:

kubectl top pod -n optipod-system

Solution:

  1. Increase controller resources

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

  2. Reduce the workload count

  • Use more specific label selectors
  • Split into multiple policies with different reconciliation intervals

  3. Increase the reconciliation interval

spec:
  reconciliationInterval: 15m # Less frequent reconciliation

Slow Recommendation Generation

Solution:

  1. Reduce the rolling window

metricsConfig:
  rollingWindow: 3d # Shorter window (less data to process)

  2. Use metrics-server instead of Prometheus

For simpler deployments:

metricsConfig:
  provider: metrics-server

Note: metrics-server provides less historical data but is faster.

Getting Help

If you can’t resolve your issue:

  1. Collect diagnostic information

# Controller logs
kubectl logs -n optipod-system -l app=optipod-controller > controller-logs.txt

# Policy status
kubectl get optimizationpolicy -A -o yaml > policies.yaml

# Workload status
kubectl get deployment <name> -o yaml > workload.yaml

# Events
kubectl get events -A --sort-by='.lastTimestamp' > events.txt
  2. Check GitHub Issues for existing reports of your problem

  3. Ask in the community support channels
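The diagnostic commands in step 1 can be wrapped into one script that bundles everything into a tarball for the issue report. A sketch; the names assume the default install:

```shell
#!/bin/sh
# collect-diagnostics.sh -- gather controller logs, policies, and events
# into a single tarball to attach to a GitHub issue.

collect() {
  dir=${1:-optipod-diagnostics}
  mkdir -p "$dir"
  kubectl logs -n optipod-system -l app=optipod-controller > "$dir/controller-logs.txt" 2>&1
  kubectl get optimizationpolicy -A -o yaml                > "$dir/policies.yaml"       2>&1
  kubectl get events -A --sort-by='.lastTimestamp'         > "$dir/events.txt"          2>&1
  kubectl version                                          > "$dir/versions.txt"        2>&1
  tar -czf "$dir.tar.gz" "$dir"
  echo "wrote $dir.tar.gz"
}

if [ "${1:-}" = "--run" ]; then
  collect "${2:-}"
fi
```

Run `./collect-diagnostics.sh --run` and attach the resulting tarball; review it for sensitive values (annotations, namespaces) before sharing.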