Switching to Auto Mode

This guide explains how to safely transition from Recommend mode to Auto mode, enabling OptiPod to automatically apply resource optimizations.

Overview

OptiPod supports three operational modes:

Recommend: Compute recommendations without applying them (safe default)
Auto: Automatically apply recommendations
Disabled: Stop processing workloads

Switching to Auto mode enables automatic optimization but requires careful planning and validation.

Prerequisites

Before switching to Auto mode:

Validate recommendations in Recommend mode
Test in non-production environment
Review policy configuration for safety settings
Set up monitoring for optimization activity
Document rollback procedures
Verify RBAC permissions for updates

Safety Checklist

Complete this checklist before enabling Auto mode:

Gradual Adoption Strategy

Phase 1: Single Non-Critical Workload

Start with one low-risk workload:

apiVersion: optipod.io/v1alpha1
kind: OptimizationPolicy
metadata:
  name: pilot-workload
spec:
  mode: Auto  # Enable Auto mode

  selector:
    workloadSelector:
      matchLabels:
        app: pilot-app  # Single workload

  metricsConfig:
    provider: prometheus
    rollingWindow: 24h
    percentile: P90
    safetyFactor: 1.5  # Conservative

  resourceBounds:
    cpu:
      min: "100m"
      max: "2000m"
    memory:
      min: "256Mi"
      max: "4Gi"

  updateStrategy:
    strategy: webhook
    rolloutStrategy: onNextRestart  # Safer
    allowInPlaceResize: true
    allowRecreate: false
    updateRequestsOnly: true
    # Note: gradualDecreaseConfig not yet implemented

Monitor for 1-2 weeks:

Check for performance issues
Monitor resource usage
Review optimization events
Confirm there are no OOMKills or CPU throttling
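To see how the Phase 1 safety factor and resource bounds interact, it can help to work the arithmetic by hand. The sketch below assumes a request is derived as observed usage multiplied by safetyFactor and then clamped to the policy bounds; that formula is an assumption made for illustration, not OptiPod's confirmed algorithm, and estimate_cpu_request is a hypothetical helper.

```shell
# Hypothetical helper: estimate a CPU request from observed usage.
# Assumption (not confirmed by OptiPod): request = usage * safetyFactor,
# then clamped to the policy's min/max bounds. All values are millicores.
estimate_cpu_request() {
  awk -v usage="$1" -v factor="$2" -v lo="$3" -v hi="$4" 'BEGIN {
    r = usage * factor
    if (r < lo) r = lo
    if (r > hi) r = hi
    printf "%d\n", r
  }'
}

# Phase 1 values: safetyFactor 1.5, CPU bounds 100m to 2000m
estimate_cpu_request 400 1.5 100 2000    # prints 600
estimate_cpu_request 40 1.5 100 2000     # prints 100 (raised to min)
estimate_cpu_request 1800 1.5 100 2000   # prints 2000 (capped at max)
```

Under this assumption, a pilot workload that runs close to its bounds will be pinned at min or max, which is worth checking before widening the policy in later phases.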

Phase 2: Non-Critical Workloads

Expand to all non-critical workloads:

apiVersion: optipod.io/v1alpha1
kind: OptimizationPolicy
metadata:
  name: non-critical-workloads
spec:
  mode: Auto

  selector:
    namespaceSelector:
      matchLabels:
        criticality: low
    workloadTypes:
      include:
        - Deployment  # Start with Deployments only

  metricsConfig:
    provider: prometheus
    rollingWindow: 24h
    percentile: P90
    safetyFactor: 1.3

  resourceBounds:
    cpu:
      min: "100m"
      max: "4000m"
    memory:
      min: "256Mi"
      max: "8Gi"

  updateStrategy:
    strategy: webhook
    rolloutStrategy: onNextRestart
    allowInPlaceResize: true
    allowRecreate: false
    updateRequestsOnly: true
    # Note: gradualDecreaseConfig not yet implemented

Monitor for 2-4 weeks:

Validate optimization success rate
Check resource savings
Monitor for any issues
Adjust policies based on observations

Phase 3: Production Workloads

Carefully expand to production:

apiVersion: optipod.io/v1alpha1
kind: OptimizationPolicy
metadata:
  name: production-workloads
spec:
  mode: Auto

  selector:
    namespaceSelector:
      matchLabels:
        environment: production
    workloadSelector:
      matchLabels:
        optimize: "true"  # Explicit opt-in

  metricsConfig:
    provider: prometheus
    rollingWindow: 48h  # Longer window for production
    percentile: P90
    safetyFactor: 1.2

  resourceBounds:
    cpu:
      min: "100m"
      max: "4000m"
    memory:
      min: "256Mi"
      max: "8Gi"

  updateStrategy:
    strategy: webhook
    rolloutStrategy: onNextRestart
    allowInPlaceResize: true
    allowRecreate: false
    updateRequestsOnly: true
    # Note: gradualDecreaseConfig not yet implemented

Monitor continuously:

Set up alerts for optimization failures
Review weekly optimization reports
Adjust policies based on workload behavior
Document lessons learned

Phase 4: Stateful Workloads (Optional)

Only after extensive validation:

apiVersion: optipod.io/v1alpha1
kind: OptimizationPolicy
metadata:
  name: stateful-workloads
spec:
  mode: Recommend  # Keep in Recommend mode longer
  # Or use Auto with very conservative settings

  selector:
    workloadTypes:
      include:
        - StatefulSet

  metricsConfig:
    provider: prometheus
    rollingWindow: 72h  # Very long window
    percentile: P95     # Higher percentile
    safetyFactor: 1.5   # Large buffer

  resourceBounds:
    cpu:
      min: "500m"
      max: "8000m"
    memory:
      min: "1Gi"
      max: "16Gi"

  updateStrategy:
    strategy: webhook
    rolloutStrategy: onNextRestart
    allowInPlaceResize: true
    allowRecreate: false  # Never recreate StatefulSet pods
    updateRequestsOnly: true
    # Note: gradualDecreaseConfig not yet implemented

Switching a Policy to Auto Mode

Method 1: kubectl patch

kubectl patch optimizationpolicy my-policy \
  --type merge \
  --patch '{"spec":{"mode":"Auto"}}'
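When the patch is scripted, a mistyped mode name is an easy mistake to make. The sketch below builds the patch body only for the three modes listed in the Overview. mode_patch is a hypothetical helper, not part of OptiPod or kubectl.

```shell
# Hypothetical helper: print the merge-patch body for a given mode.
# Accepts only the three modes named in this guide; anything else fails.
mode_patch() {
  case "$1" in
    Recommend|Auto|Disabled)
      printf '{"spec":{"mode":"%s"}}\n' "$1"
      ;;
    *)
      echo "unknown mode: $1" >&2
      return 1
      ;;
  esac
}

mode_patch Auto   # prints {"spec":{"mode":"Auto"}}

# Possible usage against a cluster (not run here):
#   kubectl patch optimizationpolicy my-policy --type merge --patch "$(mode_patch Auto)"
```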

Method 2: kubectl edit

kubectl edit optimizationpolicy my-policy

Change:

spec:
  mode: Recommend

To:

spec:
  mode: Auto

Method 3: Update YAML and Apply

Edit policy file:

spec:
  mode: Auto  # Changed from Recommend

Apply:

kubectl apply -f policy.yaml

Method 4: GitOps (Recommended)

Update policy in Git repository:

# Edit policy file
vim policies/production-policy.yaml

# Commit and push
git add policies/production-policy.yaml
git commit -m "Enable Auto mode for production-policy"
git push

# ArgoCD/Flux will sync automatically

Verification

Check Policy Mode

kubectl get optimizationpolicy my-policy \
  -o jsonpath='{.spec.mode}'

Expected output: Auto

Monitor First Optimization

Watch for optimization events:

kubectl get events -w --field-selector reason=OptimizationApplied

Check workload annotations:

kubectl get deployment my-app -o yaml | grep "optipod.io/last-applied"

Verify Resource Changes

Compare before and after:

# Before (from recommendations)
kubectl get deployment my-app -o yaml | grep "optipod.io/.*-request"

# After (actual resources)
kubectl get deployment my-app \
  -o jsonpath='{.spec.template.spec.containers[0].resources}' | jq
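The annotation value and the applied resource may express the same CPU amount differently (for example 0.5 and 500m), so a plain string comparison can mislead. The sketch below normalizes both to millicores before comparing. The helper names are hypothetical, and only the m-suffixed and plain decimal forms are handled.

```shell
# Hypothetical helper: convert a CPU quantity to whole millicores.
# Handles the "500m" and plain decimal ("1", "0.5") forms only.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

# Succeeds when two CPU quantities describe the same amount.
same_cpu() {
  [ "$(cpu_to_millicores "$1")" -eq "$(cpu_to_millicores "$2")" ]
}

cpu_to_millicores 500m             # prints 500
cpu_to_millicores 0.5              # prints 500
same_cpu 1 1000m && echo "match"   # prints match
```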

Monitoring After Switch

Kubernetes Events

Monitor optimization activity:

# All optimization events
kubectl get events -A --field-selector source=optipod

# Successful optimizations
kubectl get events --field-selector reason=OptimizationApplied

# Failed optimizations
kubectl get events --field-selector reason=OptimizationFailed

# Memory safety blocks
kubectl get events --field-selector reason=UnsafeMemoryDecrease
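Running three separate queries makes it hard to see the overall picture. As a rough sketch, the event reasons above can be tallied in one pass. summarize_reasons is a hypothetical helper that expects one reason per line on stdin.

```shell
# Hypothetical helper: count events per reason from lines on stdin
# (one reason per line) and print the three reasons this guide uses.
summarize_reasons() {
  awk '
    { count[$1]++ }
    END {
      printf "applied=%d failed=%d blocked=%d\n",
        count["OptimizationApplied"],
        count["OptimizationFailed"],
        count["UnsafeMemoryDecrease"]
    }'
}

printf 'OptimizationApplied\nOptimizationApplied\nOptimizationFailed\n' | summarize_reasons
# prints applied=2 failed=1 blocked=0

# Possible usage against a cluster (not run here):
#   kubectl get events -A -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}' | summarize_reasons
```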

Policy Status

Check policy status:

kubectl get optimizationpolicy my-policy -o yaml

Review status:

status:
  workloadsDiscovered: 10
  workloadsProcessed: 8
  lastReconciliation: "2025-01-28T10:30:00Z"
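A gap between workloadsDiscovered and workloadsProcessed may indicate workloads that are still pending or were skipped. The sketch below turns the two counters into a single coverage line. policy_coverage is a hypothetical helper, and it assumes the fields appear as plain key: value lines, as in the sample status above.

```shell
# Hypothetical helper: read policy YAML on stdin and report how many
# discovered workloads have been processed.
policy_coverage() {
  awk '
    $1 == "workloadsDiscovered:" { d = $2 }
    $1 == "workloadsProcessed:"  { p = $2 }
    END {
      if (d + 0 == 0) { print "no workloads discovered"; exit }
      printf "%d/%d processed (%d%%)\n", p, d, p * 100 / d
    }'
}

printf 'status:\n  workloadsDiscovered: 10\n  workloadsProcessed: 8\n' | policy_coverage
# prints 8/10 processed (80%)

# Possible usage against a cluster (not run here):
#   kubectl get optimizationpolicy my-policy -o yaml | policy_coverage
```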

Workload Health

Monitor workload health after optimization:

# Check pod status
kubectl get pods -l app=my-app

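# Check for OOMKills (reported as the last termination reason in container status)
kubectl get pods -l app=my-app \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'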

# Check for restarts
kubectl get pods -l app=my-app \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'

# Check resource usage
kubectl top pods -l app=my-app
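Restart counts are easier to act on when compared against a baseline. As a minimal sketch, the tab-separated output of the restart-count command above can be filtered to show only pods over a chosen limit. pods_over_restart_limit is a hypothetical helper, and the limit is whatever you recorded before the switch.

```shell
# Hypothetical helper: print pods whose restart count exceeds a limit.
# Expects "name<TAB>restartCount" lines on stdin.
pods_over_restart_limit() {
  awk -F '\t' -v limit="$1" '$2 + 0 > limit + 0 { print $1 }'
}

printf 'my-app-abc\t0\nmy-app-def\t4\n' | pods_over_restart_limit 2
# prints my-app-def
```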

Prometheus Metrics

Query optimization metrics:

# Rate of applied optimizations (per second, over 5m)
rate(optipod_optimizations_applied_total[5m])

# Rate of failed optimizations (per second, over 5m)
rate(optipod_optimization_failure_total[5m])

# Resource change magnitude
optipod_resource_change_magnitude_percent

Common Issues

Issue 1: Optimizations Not Applied

Symptoms:

Policy in Auto mode
Recommendations exist
No changes applied

Possible causes:

Global dry-run enabled
Update strategy misconfigured
RBAC permissions missing
Workload doesn’t support updates

Check:

# Check controller flags
kubectl get deployment optipod-controller -n optipod-system \
  -o jsonpath='{.spec.template.spec.containers[0].args}' | grep dry-run

# Check update strategy
kubectl get optimizationpolicy my-policy \
  -o jsonpath='{.spec.updateStrategy}' | jq

# Check RBAC
kubectl auth can-i update deployments --as=system:serviceaccount:optipod-system:optipod-controller

# Check controller logs
kubectl logs -n optipod-system deployment/optipod-controller

Issue 2: Optimization Failures

Symptoms:

OptimizationFailed events
Error messages in logs

Possible causes:

Recommendations exceed limits
SSA conflicts
Validation errors
RBAC errors

Check:

# View failure events
kubectl get events --field-selector reason=OptimizationFailed

# Check controller logs
kubectl logs -n optipod-system deployment/optipod-controller | grep ERROR

# Describe workload
kubectl describe deployment my-app

Issue 3: Performance Degradation

Symptoms:

Increased latency
CPU throttling
OOMKills

Possible causes:

Safety factor too low
Percentile too low
Metrics window too short
Workload behavior changed

Actions:

Revert to Recommend mode immediately
Increase safety factor
Use higher percentile (P95 or P99)
Increase rolling window
Adjust resource bounds

# Emergency revert to Recommend mode
kubectl patch optimizationpolicy my-policy \
  --type merge \
  --patch '{"spec":{"mode":"Recommend"}}'

Rollback Procedures

Emergency: Stop All Optimizations

Method 1: Switch to Recommend mode

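# Assumes OptimizationPolicy is namespaced; patch each policy in its own namespace
kubectl get optimizationpolicy --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' | \
  while read -r ns name; do
    kubectl patch optimizationpolicy "$name" -n "$ns" \
      --type merge --patch '{"spec":{"mode":"Recommend"}}'
  done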

Method 2: Enable global dry-run

kubectl set env deployment/optipod-controller \
  -n optipod-system \
  DRY_RUN=true

Method 3: Disable policy

kubectl patch optimizationpolicy my-policy \
  --type merge \
  --patch '{"spec":{"mode":"Disabled"}}'

Rollback Specific Workload

If using GitOps:

git revert <commit>
git push

If using SSA:

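# Strategic merge matches containers by name; a plain merge patch would replace the whole containers list
kubectl patch deployment my-app \
  --type strategic \
  --patch '{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"1000m","memory":"2Gi"}}}]}}}}'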

If using webhook:

# Remove OptiPod annotations
kubectl annotate deployment my-app \
  optipod.io/managed- \
  optipod.io/policy- \
  optipod.io/cpu-request.app- \
  optipod.io/memory-request.app-

# Trigger rolling restart to apply original resources
kubectl rollout restart deployment my-app

Best Practices

Start small - Begin with a single workload
Test thoroughly - Validate in non-production first
Use conservative settings - Higher safety factors initially
Skip gradual decrease for now - Not yet implemented
Use webhook strategy - Safer for GitOps environments
Set onNextRestart - Avoid forced disruptions
Monitor closely - Watch for issues after switch
Document everything - Record decisions and observations
Have rollback ready - Know how to revert quickly
Communicate changes - Inform team before switching

Optimization Settings

Conservative (Recommended for Initial Auto Mode)

spec:
  mode: Auto
  metricsConfig:
    rollingWindow: 48h
    percentile: P95
    safetyFactor: 1.5
  updateStrategy:
    strategy: webhook
    rolloutStrategy: onNextRestart
    allowInPlaceResize: true
    allowRecreate: false
    updateRequestsOnly: true
    # Note: gradualDecreaseConfig not yet implemented

Balanced (After Validation)

spec:
  mode: Auto
  metricsConfig:
    rollingWindow: 24h
    percentile: P90
    safetyFactor: 1.2
  updateStrategy:
    strategy: webhook
    rolloutStrategy: onNextRestart
    allowInPlaceResize: true
    allowRecreate: false
    updateRequestsOnly: true
    # Note: gradualDecreaseConfig not yet implemented

Aggressive (Development Only)

spec:
  mode: Auto
  metricsConfig:
    rollingWindow: 12h
    percentile: P90
    safetyFactor: 1.1
  updateStrategy:
    strategy: ssa
    allowInPlaceResize: true
    allowRecreate: true
    updateRequestsOnly: false

Success Criteria

Measure success after switching to Auto mode:

Resource Efficiency:

CPU utilization improved
Memory utilization improved
Cost savings achieved

Reliability:

No increase in OOMKills
No increase in CPU throttling
No performance degradation
Stable pod restart counts

Operational:

Optimization success rate > 95%
No manual interventions needed
Team confidence in automation
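The 95% success-rate target can be checked from two counts: optimizations applied and optimizations failed over the review period (for example, read from the optipod_optimizations_applied_total and optipod_optimization_failure_total counters). The sketch below uses integer arithmetic so that exactly 95% does not pass a strictly-greater-than check. meets_success_target is a hypothetical helper, and treating the rate as applied / (applied + failed) is an assumption.

```shell
# Hypothetical helper: succeed when applied / (applied + failed) is
# strictly above the target percentage (default 95).
# Arguments are whole numbers: applied count, failed count, optional target.
meets_success_target() {
  applied="$1"; failed="$2"; target="${3:-95}"
  total=$((applied + failed))
  [ "$total" -gt 0 ] && [ $((applied * 100)) -gt $((target * total)) ]
}

meets_success_target 97 3 && echo "target met"        # prints target met
meets_success_target 190 10 || echo "target not met"  # prints target not met (exactly 95%)
```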

Next Steps

Reviewing Recommendations - Monitor ongoing recommendations
Troubleshooting - Common issues and solutions
Safety Model - Understanding safety guarantees
Modes - Operational modes explained
GitOps Integration - Using OptiPod with ArgoCD/Flux