Troubleshooting
This guide covers common issues you might encounter with OptiPod and how to resolve them.
General Troubleshooting Steps
Before diving into specific issues, try these general steps:
- Check operator logs

```shell
kubectl logs -n optipod-system -l app=optipod-operator --tail=100
```

- Verify policy status

```shell
kubectl get optimizationpolicy -A
kubectl describe optimizationpolicy <policy-name> -n <namespace>
```

- Check workload labels

```shell
kubectl get deployment <name> -o jsonpath='{.metadata.labels}'
```

- Verify metrics provider

```shell
# For Prometheus
kubectl get svc -n monitoring prometheus-server

# For metrics-server
kubectl get apiservice v1beta1.metrics.k8s.io
```

Installation Issues
Operator Pod Not Starting
Symptoms:
- OptiPod operator pod in CrashLoopBackOff
- Pod fails to start
Diagnosis:
```shell
kubectl get pods -n optipod-system
kubectl describe pod <pod-name> -n optipod-system
kubectl logs <pod-name> -n optipod-system
```

Common Causes:
- Insufficient RBAC permissions
```shell
# Check RBAC
kubectl get clusterrole optipod-operator
kubectl get clusterrolebinding optipod-operator
```

Solution: Reinstall with correct RBAC:
```shell
kubectl apply -f https://raw.githubusercontent.com/Sagart-cactus/optipod/main/deploy/rbac.yaml
```

- Missing CRDs
```shell
# Check CRDs
kubectl get crd optimizationpolicies.optipod.optipod.io
```

Solution: Install CRDs:
```shell
kubectl apply -f https://raw.githubusercontent.com/Sagart-cactus/optipod/main/config/crd/bases/
```

- Resource constraints
Solution: Increase operator resources:
```yaml
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
```

Webhook Configuration Issues
Symptoms:
- Pods fail to start
- Webhook timeout errors
Diagnosis:
```shell
kubectl get mutatingwebhookconfiguration optipod-webhook
kubectl describe mutatingwebhookconfiguration optipod-webhook
```

Solution:
- Check webhook deployment
```shell
kubectl get pods -n optipod-system -l app=optipod-webhook
kubectl logs -n optipod-system -l app=optipod-webhook
```

- Check webhook service
```shell
kubectl get svc -n optipod-system optipod-webhook
```

- Verify certificates
```shell
kubectl get secret -n optipod-system optipod-webhook-cert

# Or check cert-manager certificate
kubectl get certificate -n optipod-system
```

Policy Issues
No Recommendations Generated
Symptoms:
- Workloads labeled but no recommendations appear
- Recommendation annotations missing
Diagnosis:
```shell
# Check if workload is targeted
kubectl get deployment <name> -o yaml | grep -A 5 labels

# Check policy
kubectl get optimizationpolicy -n <namespace>
kubectl describe optimizationpolicy <policy-name> -n <namespace>

# Check controller logs
kubectl logs -n optipod-system -l app=optipod-controller | grep <workload-name>
```

Common Causes:
- Insufficient metrics data
Solution: Wait for more data to accumulate (typically 24-48 hours with a 7-day rolling window).
- Workload not matching policy selector
Solution: Verify labels match:
```shell
# Check workload labels
kubectl get deployment <name> -o jsonpath='{.metadata.labels}'

# Check policy selector
kubectl get optimizationpolicy <policy-name> -o jsonpath='{.spec.selector.workloadSelector}'
```

- Policy in Disabled mode
Solution:
```shell
kubectl patch optimizationpolicy <policy-name> \
  --type merge \
  -p '{"spec":{"mode":"Recommend"}}'
```

- Metrics provider unreachable
Solution: Check connectivity:
```shell
# Test Prometheus
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://prometheus-server.monitoring.svc:9090/-/healthy

# Check metrics-server
kubectl top nodes
```

Recommendations Not Updating
Symptoms:
- Old recommendations not refreshing
- Timestamp not changing
Diagnosis:
```shell
# Check last update time
kubectl get deployment <name> \
  -o jsonpath='{.metadata.annotations.optipod\.io/last-recommendation}'

# Check policy status
kubectl get optimizationpolicy <policy-name> -o yaml
```

Common Causes:
- Reconciliation interval not reached
Solution: Wait for reconciliation interval or adjust:
```yaml
spec:
  reconciliationInterval: 1m # Shorter interval (default: 5m)
```

- Controller not running
Solution:
```shell
kubectl get pods -n optipod-system
kubectl rollout restart deployment optipod-controller -n optipod-system
```

- Metrics haven’t changed significantly
OptiPod may not update recommendations if metrics are stable and within acceptable range.
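This stability behavior can be pictured with a small sketch. It is illustrative only — the function name and the 10% threshold are assumptions, not OptiPod's actual code — but it shows why a recommendation that stays within a small relative band of the current request may simply be skipped:

```python
# Illustrative sketch only: needs_update and the 10% threshold are
# hypothetical, not OptiPod's actual implementation.
def needs_update(current: float, recommended: float, threshold: float = 0.10) -> bool:
    """Return True when the recommendation differs from the current
    request by more than the relative threshold (assumed 10% here)."""
    if current == 0:
        return True  # no baseline: always worth writing a recommendation
    return abs(recommended - current) / current > threshold

needs_update(500, 520)  # 4% drift: skipped
needs_update(500, 300)  # 40% drift: updated
```

If recommendations look frozen, compare the current requests against the values in the recommendation annotations before assuming the controller is stuck.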
Recommendations Seem Incorrect
Symptoms:
- Recommendations too high or too low
- Recommendations don’t match actual usage
Diagnosis:
```shell
# Check actual usage
kubectl top pod -l app=<name>

# Check recommendation annotations
kubectl get deployment <name> -o yaml | grep -A 10 "optipod.io/"

# Check policy configuration
kubectl get optimizationpolicy <policy-name> -o yaml
```

Common Causes:
- Rolling window too short
Solution:
```yaml
metricsConfig:
  rollingWindow: 14d # Increase window
```

- Resource bounds too restrictive
Solution:
```yaml
resourceBounds:
  cpu:
    min: 10m # Lower minimum
    max: 8000m # Raise maximum
  memory:
    min: 64Mi
    max: 16Gi
```

- Safety factor too high or too low
Solution:
```yaml
metricsConfig:
  safetyFactor: 1.2 # Adjust multiplier (default: 1.2)
```

- Wrong percentile
Solution:
```yaml
metricsConfig:
  percentile: P90 # Try P99 for more headroom, P50 for more aggressive
```

- Unstable workload
Solution: Some workloads aren’t good candidates for optimization. Consider:
- Increasing rolling window
- Using higher percentile (P99)
- Increasing safety factor
- Using manual tuning instead
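These knobs interact in a predictable way. As a rough mental model — an assumed formula, not OptiPod's published algorithm — the recommendation is approximately percentile(usage) × safetyFactor, clamped to resourceBounds:

```python
import math

# Hypothetical sizing sketch: the nearest-rank percentile and the
# combination below are assumptions, not OptiPod's actual code.
def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile, e.g. p=0.90 for P90."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

def recommend(samples: list, p: float, safety: float, lo: float, hi: float) -> float:
    """percentile * safetyFactor, clamped to [lo, hi] resource bounds."""
    return min(max(percentile(samples, p) * safety, lo), hi)

# 100 CPU samples (millicores): steady 100m with occasional 400m spikes
usage = [100.0] * 95 + [400.0] * 5
recommend(usage, 0.90, 1.2, 10.0, 8000.0)  # P90 ignores the spikes: 120.0
recommend(usage, 0.99, 1.2, 10.0, 8000.0)  # P99 covers them: 480.0
```

The example also shows why a higher percentile or a longer rolling window stabilizes recommendations for bursty workloads: brief spikes only register at the tail of the distribution.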
Auto Mode Issues
Changes Not Being Applied
Symptoms:
- Policy in Auto mode but resources not updating
- No optipod.io/last-applied annotation
Diagnosis:
```shell
# Check policy mode
kubectl get optimizationpolicy <policy-name> -o jsonpath='{.spec.mode}'

# Check policy status
kubectl get optimizationpolicy <policy-name> -o yaml | grep -A 20 status

# Check controller logs
kubectl logs -n optipod-system -l app=optipod-controller | grep "Applied"
```

Common Causes:
- Reconciliation interval not reached
Solution: Wait for next reconciliation or reduce interval:
```yaml
spec:
  reconciliationInterval: 5m # Default is 5m
```

- Update strategy requires pod restart
If using webhook strategy with onNextRestart, changes won’t apply until pods restart naturally.
Solution: Use immediate rollout or trigger restart:
```yaml
updateStrategy:
  strategy: webhook
  rolloutStrategy: immediate # Force rolling restart
```

Or manually restart:
```shell
kubectl rollout restart deployment <name>
```

- Resource bounds preventing update
Recommendations may be outside configured bounds.
Solution: Check and adjust bounds:
```yaml
resourceBounds:
  cpu:
    min: 10m
    max: 8000m
  memory:
    min: 64Mi
    max: 16Gi
```

Memory Safety Issues
Symptoms:
- OptiPod not applying memory decreases
- Memory recommendations being blocked
Diagnosis:
```shell
# Check controller logs for safety blocks
kubectl logs -n optipod-system -l app=optipod-controller | grep -i "memory.*blocked"

# Check current memory usage
kubectl top pod -l app=<name>

# Check recommendations
kubectl get deployment <name> -o yaml | grep "optipod.io/memory"
```

Common Causes:
- Memory safety check blocking decrease
By default, OptiPod blocks memory decreases that could cause OOM kills.
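The principle behind this gate can be sketched as follows. This is a hypothetical illustration — the function name, the peak-usage input, and the 15% margin are assumptions; the real check lives in the controller:

```python
# Hypothetical sketch of a memory-decrease safety gate; names and the
# 15% margin are assumed, not OptiPod's actual implementation.
def is_safe_memory_decrease(current_mib: float, proposed_mib: float,
                            peak_usage_mib: float, margin: float = 1.15) -> bool:
    """Allow a decrease only if the proposed request still exceeds
    observed peak usage by the safety margin."""
    if proposed_mib >= current_mib:
        return True  # not a decrease, nothing to gate
    return proposed_mib >= peak_usage_mib * margin

is_safe_memory_decrease(512, 256, 200)  # 256 >= 230: allowed
is_safe_memory_decrease(512, 200, 200)  # 200 < 230: blocked
```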
Solution: If you’re confident the decrease is safe:
```yaml
updateStrategy:
  allowUnsafeMemoryDecrease: true # Use with caution!
```

- Gradual decrease not yet implemented
⚠️ Note: The gradualDecreaseConfig feature is not yet implemented. Memory decreases are applied immediately in full, subject to safety checks.
If you need to slow down memory optimization, use more conservative settings:
```yaml
metricsConfig:
  safetyFactor: 1.5 # Higher safety factor
  percentile: P95 # Higher percentile
```

OOM Kills After OptiPod Changes
Symptoms:
- Pods being OOM killed
- Started after OptiPod applied changes
Immediate Action:
```shell
# Disable Auto mode
kubectl patch optimizationpolicy <policy-name> \
  --type merge \
  -p '{"spec":{"mode":"Disabled"}}'

# Revert memory
kubectl set resources deployment <name> \
  --requests=memory=<previous-value> \
  --limits=memory=<previous-value>
```

Investigation:
```shell
# Check memory usage
kubectl top pod -l app=<name>

# Check what was applied
kubectl get deployment <name> -o yaml | grep -A 5 "optipod.io/"

# Check for memory spikes in Prometheus
# Query: container_memory_usage_bytes{pod=~"<name>.*"}
```

Prevention:
```yaml
metricsConfig:
  percentile: P99 # Use higher percentile
  safetyFactor: 1.5 # More headroom

resourceBounds:
  memory:
    min: 256Mi # Higher floor

# Note: gradualDecreaseConfig not yet implemented
# Use higher safety factors and percentiles instead
```

Metrics Issues
Prometheus Connection Failed
Symptoms:
- Controller logs show Prometheus errors
- No recommendations generated
Diagnosis:
```shell
# Check controller logs
kubectl logs -n optipod-system -l app=optipod-controller | grep -i prometheus

# Test connectivity from controller pod
kubectl exec -n optipod-system <controller-pod> -- \
  curl -v http://prometheus-server.monitoring.svc:9090/-/healthy
```

Solution:
- Verify Prometheus is accessible
Prometheus configuration is external to OptiPod (not in the CRD). Configure via:
- Helm values
- Environment variables
- ConfigMap
Check Helm values:
```yaml
prometheus:
  address: http://prometheus-server.monitoring.svc:9090
```

- Check network policies
```shell
kubectl get networkpolicy -n optipod-system
kubectl get networkpolicy -n monitoring
```

- Verify Prometheus is running
```shell
kubectl get pods -n monitoring -l app=prometheus
```

metrics-server Not Available
Symptoms:
- kubectl top commands fail
- OptiPod can’t get metrics
Diagnosis:
```shell
# Check metrics-server
kubectl get apiservice v1beta1.metrics.k8s.io
kubectl get pods -n kube-system -l k8s-app=metrics-server
```

Solution:
- Install metrics-server
```shell
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```

- Fix metrics-server issues
```shell
# Common fix for self-signed certs
kubectl patch deployment metrics-server -n kube-system --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'
```

GitOps Integration Issues
ArgoCD/Flux Detecting Drift
Symptoms:
- GitOps tool shows resources as out-of-sync
- Constant sync attempts
Solution:
- Use webhook strategy with onNextRestart
This keeps workload specs unchanged in Git:
```yaml
updateStrategy:
  strategy: webhook
  rolloutStrategy: onNextRestart
```

OptiPod stores recommendations as annotations; the webhook applies them at pod creation.
- Configure GitOps to ignore OptiPod annotations
For ArgoCD:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /metadata/annotations/optipod.io~1cpu-request.*
        - /metadata/annotations/optipod.io~1memory-request.*
        - /metadata/annotations/optipod.io~1cpu-limit.*
        - /metadata/annotations/optipod.io~1memory-limit.*
```

- Use SSA strategy with ArgoCD SSA support
If using ArgoCD 2.5+:
```yaml
syncPolicy:
  syncOptions:
    - ServerSideApply=true
```

Then configure OptiPod:
```yaml
updateStrategy:
  strategy: ssa
  useServerSideApply: true
```

- Use Recommend mode
Most GitOps-friendly approach:
```yaml
mode: Recommend # Manual application via GitOps
```

Performance Issues
Controller Using Too Much CPU/Memory
Section titled “Controller Using Too Much CPU/Memory”Diagnosis:
```shell
kubectl top pod -n optipod-system
```

Solution:
- Increase controller resources
```yaml
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
```

- Reduce workload count
- Use more specific label selectors
- Split into multiple policies with different reconciliation intervals
- Increase reconciliation interval
```yaml
spec:
  reconciliationInterval: 15m # Less frequent reconciliation
```

Slow Recommendation Generation
Solution:
- Reduce rolling window
```yaml
metricsConfig:
  rollingWindow: 3d # Shorter window (less data to process)
```

- Use metrics-server instead of Prometheus
For simpler deployments:
```yaml
metricsConfig:
  provider: metrics-server
```

Note: metrics-server provides less historical data but is faster.
Getting Help
If you can’t resolve your issue:
- Collect diagnostic information
```shell
# Controller logs
kubectl logs -n optipod-system -l app=optipod-controller > controller-logs.txt

# Policy status
kubectl get optimizationpolicy -A -o yaml > policies.yaml

# Workload status
kubectl get deployment <name> -o yaml > workload.yaml

# Events
kubectl get events -A --sort-by='.lastTimestamp' > events.txt
```

- Check GitHub Issues
- Search existing issues: https://github.com/Sagart-cactus/optipod/issues
- Create new issue with diagnostic info
- Community Support
- Join discussions: https://github.com/Sagart-cactus/optipod/discussions
- Ask questions with full context