Switching to Auto Mode
This guide explains how to safely transition from Recommend mode to Auto mode, enabling OptiPod to automatically apply resource optimizations.
Overview
OptiPod supports three operational modes:
- Recommend: Compute recommendations, don’t apply (safe default)
- Auto: Automatically apply recommendations
- Disabled: Stop processing workloads
Switching to Auto mode enables automatic optimization but requires careful planning and validation.
Prerequisites
Before switching to Auto mode:
- Validate recommendations in Recommend mode
- Test in non-production environment
- Review policy configuration for safety settings
- Set up monitoring for optimization activity
- Document rollback procedures
- Verify RBAC permissions for updates
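The RBAC prerequisite can be checked for several workload kinds in one pass. The sketch below assumes the controller runs as the `optipod-controller` service account in the `optipod-system` namespace (the identity used in the troubleshooting commands later in this guide). `check_update_rbac` is a hypothetical helper name, not part of OptiPod.

```shell
#!/usr/bin/env bash
# Assumed controller identity; adjust if your installation differs.
SA="system:serviceaccount:optipod-system:optipod-controller"

# Print one line per resource kind: "<kind> <yes|no>".
check_update_rbac() {
  local kind answer
  for kind in "$@"; do
    # `kubectl auth can-i` prints "yes" or "no" and exits non-zero on "no".
    # Any other failure (for example, no cluster access) is reported as "no".
    answer="$(kubectl auth can-i update "$kind" --as="$SA" 2>/dev/null)" || answer="no"
    echo "$kind $answer"
  done
}
```

For example, `check_update_rbac deployments statefulsets` reports the result for the two workload kinds used in this guide. Any kind reported as "no" should be fixed before switching to Auto mode.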
Safety Checklist
Complete this checklist before enabling Auto mode:
- Recommendations reviewed and validated
- Policy tested in staging/development
- Resource bounds are conservative
- Safety factor is appropriate (1.2-1.5)
- Update strategy configured (webhook recommended)
- Rollout strategy set (onNextRestart recommended)
- Gradual memory decrease enabled (not yet implemented)
- Monitoring and alerts configured
- Rollback procedure documented
- Team notified of changes
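To get a feel for what a safety factor in the 1.2-1.5 range means in practice, the sketch below works through the arithmetic. It assumes, for illustration only, that a recommendation is the observed percentile usage multiplied by the safety factor and then clamped to the policy's resource bounds. This guide does not spell out OptiPod's exact formula, and `estimate_request` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate a CPU request in millicores.
# Assumption: request = usage * safetyFactor, clamped to [min, max].
estimate_request() {
  local usage="$1" factor="$2" min="$3" max="$4"
  awk -v u="$usage" -v f="$factor" -v lo="$min" -v hi="$max" 'BEGIN {
    r = u * f
    if (r < lo) r = lo
    if (r > hi) r = hi
    printf "%d\n", r + 0.5   # round to the nearest millicore
  }'
}
```

With a P90 usage of 800m, a safety factor of 1.5, and bounds of 100m to 2000m, `estimate_request 800 1.5 100 2000` prints `1200`. Under the same assumption, a usage of 1800m would hit the upper bound and print `2000`, which is a sign that the bounds, not the metrics, are driving the result.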
Gradual Adoption Strategy
Phase 1: Single Non-Critical Workload
Start with one low-risk workload:

    apiVersion: optipod.io/v1alpha1
    kind: OptimizationPolicy
    metadata:
      name: pilot-workload
    spec:
      mode: Auto  # Enable Auto mode

      selector:
        workloadSelector:
          matchLabels:
            app: pilot-app  # Single workload

      metricsConfig:
        provider: prometheus
        rollingWindow: 24h
        percentile: P90
        safetyFactor: 1.5  # Conservative

      resourceBounds:
        cpu:
          min: "100m"
          max: "2000m"
        memory:
          min: "256Mi"
          max: "4Gi"

      updateStrategy:
        strategy: webhook
        rolloutStrategy: onNextRestart  # Safer
        allowInPlaceResize: true
        allowRecreate: false
        updateRequestsOnly: true
        # Note: gradualDecreaseConfig not yet implemented

Monitor for 1-2 weeks:
- Check for performance issues
- Monitor resource usage
- Review optimization events
- Validate no OOMKills or throttling
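One way to validate the OOMKill item is to read each container's last termination reason from the pod status. The sketch below relies on the standard `lastState.terminated.reason` field and assumes the pods carry the same `app=pilot-app` label as the workload. `oomkilled_pods` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# List pods (by label selector) with a container last terminated as OOMKilled.
oomkilled_pods() {
  local selector="$1"
  # Emit "<pod-name><TAB><space-separated termination reasons>" per pod,
  # then keep only the pods whose reasons include OOMKilled.
  kubectl get pods -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | awk -F'\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

Running `oomkilled_pods app=pilot-app` during the pilot should print nothing. Any pod name in the output is a reason to pause and revisit the memory bounds or safety factor.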
Phase 2: Non-Critical Workloads
Expand to all non-critical workloads:

    apiVersion: optipod.io/v1alpha1
    kind: OptimizationPolicy
    metadata:
      name: non-critical-workloads
    spec:
      mode: Auto

      selector:
        namespaceSelector:
          matchLabels:
            criticality: low
        workloadTypes:
          include:
            - Deployment  # Start with Deployments only

      metricsConfig:
        provider: prometheus
        rollingWindow: 24h
        percentile: P90
        safetyFactor: 1.3

      resourceBounds:
        cpu:
          min: "100m"
          max: "4000m"
        memory:
          min: "256Mi"
          max: "8Gi"

      updateStrategy:
        strategy: webhook
        rolloutStrategy: onNextRestart
        allowInPlaceResize: true
        allowRecreate: false
        updateRequestsOnly: true
        # Note: gradualDecreaseConfig not yet implemented

Monitor for 2-4 weeks:
- Validate optimization success rate
- Check resource savings
- Monitor for any issues
- Adjust policies based on observations
Phase 3: Production Workloads
Carefully expand to production:

    apiVersion: optipod.io/v1alpha1
    kind: OptimizationPolicy
    metadata:
      name: production-workloads
    spec:
      mode: Auto

      selector:
        namespaceSelector:
          matchLabels:
            environment: production
        workloadSelector:
          matchLabels:
            optimize: "true"  # Explicit opt-in

      metricsConfig:
        provider: prometheus
        rollingWindow: 48h  # Longer window for production
        percentile: P90
        safetyFactor: 1.2

      resourceBounds:
        cpu:
          min: "100m"
          max: "4000m"
        memory:
          min: "256Mi"
          max: "8Gi"

      updateStrategy:
        strategy: webhook
        rolloutStrategy: onNextRestart
        allowInPlaceResize: true
        allowRecreate: false
        updateRequestsOnly: true
        # Note: gradualDecreaseConfig not yet implemented

Monitor continuously:
- Set up alerts for optimization failures
- Review weekly optimization reports
- Adjust policies based on workload behavior
- Document lessons learned
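Because the production policy selects only workloads labeled `optimize: "true"`, opting a workload in comes down to applying that label. The sketch below assumes the selector matches labels on the Deployment object itself, which this guide does not state explicitly. `opt_in_workloads` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Label the given Deployments in one namespace so the production policy selects them.
# Usage: opt_in_workloads <namespace> <deployment>...
opt_in_workloads() {
  local namespace="$1" deployment
  shift
  for deployment in "$@"; do
    # --overwrite keeps the command idempotent if the label already exists.
    kubectl label deployment "$deployment" -n "$namespace" optimize=true --overwrite
  done
}
```

Opting workloads in one at a time, for example `opt_in_workloads prod api`, keeps the set of affected workloads small and easy to attribute if something regresses. Removing the label with `kubectl label deployment api -n prod optimize-` should take the workload back out of scope under the same assumption.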
Phase 4: Stateful Workloads (Optional)
Only after extensive validation:

    apiVersion: optipod.io/v1alpha1
    kind: OptimizationPolicy
    metadata:
      name: stateful-workloads
    spec:
      mode: Recommend  # Keep in Recommend mode longer
      # Or use Auto with very conservative settings

      selector:
        workloadTypes:
          include:
            - StatefulSet

      metricsConfig:
        provider: prometheus
        rollingWindow: 72h  # Very long window
        percentile: P95  # Higher percentile
        safetyFactor: 1.5  # Large buffer

      resourceBounds:
        cpu:
          min: "500m"
          max: "8000m"
        memory:
          min: "1Gi"
          max: "16Gi"

      updateStrategy:
        strategy: webhook
        rolloutStrategy: onNextRestart
        allowInPlaceResize: true
        allowRecreate: false  # Never recreate StatefulSet pods
        updateRequestsOnly: true
        # Note: gradualDecreaseConfig not yet implemented

Switching a Policy to Auto Mode
Method 1: kubectl patch

    kubectl patch optimizationpolicy my-policy \
      --type merge \
      --patch '{"spec":{"mode":"Auto"}}'

Method 2: kubectl edit

    kubectl edit optimizationpolicy my-policy

Change:

    spec:
      mode: Recommend

To:

    spec:
      mode: Auto

Method 3: Update YAML and Apply
Edit policy file:

    spec:
      mode: Auto  # Changed from Recommend

Apply:

    kubectl apply -f policy.yaml

Method 4: GitOps (Recommended)
Update policy in Git repository:

    # Edit policy file
    vim policies/production-policy.yaml

    # Commit and push
    git add policies/production-policy.yaml
    git commit -m "Enable Auto mode for production-policy"
    git push

    # ArgoCD/Flux will sync automatically

Verification
Check Policy Mode

    kubectl get optimizationpolicy my-policy \
      -o jsonpath='{.spec.mode}'

Expected output: Auto
Monitor First Optimization
Watch for optimization events:

    kubectl get events -w --field-selector reason=OptimizationApplied

Check workload annotations:

    kubectl get deployment my-app -o yaml | grep "optipod.io/last-applied"

Verify Resource Changes
Compare before and after:

    # Before (from recommendations)
    kubectl get deployment my-app -o yaml | grep "optipod.io/.*-request"

    # After (actual resources)
    kubectl get deployment my-app \
      -o jsonpath='{.spec.template.spec.containers[0].resources}' | jq

Monitoring After Switch
Kubernetes Events
Monitor optimization activity:

    # All optimization events
    kubectl get events -A --field-selector source=optipod

    # Successful optimizations
    kubectl get events --field-selector reason=OptimizationApplied

    # Failed optimizations
    kubectl get events --field-selector reason=OptimizationFailed

    # Memory safety blocks
    kubectl get events --field-selector reason=UnsafeMemoryDecrease

Policy Status
Check policy metrics:

    kubectl get optimizationpolicy my-policy -o yaml

Review status:

    status:
      workloadsDiscovered: 10
      workloadsProcessed: 8
      lastReconciliation: "2025-01-28T10:30:00Z"

Workload Health
Monitor workload health after optimization:

    # Check pod status
    kubectl get pods -l app=my-app

    # Check for OOMKills
    kubectl get events --field-selector reason=OOMKilled

    # Check for restarts
    kubectl get pods -l app=my-app \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'

    # Check resource usage
    kubectl top pods -l app=my-app

Prometheus Metrics
Query optimization metrics:

    # Optimization success rate
    rate(optipod_optimizations_applied_total[5m])

    # Optimization failures
    rate(optipod_optimization_failure_total[5m])

    # Resource change magnitude
    optipod_resource_change_magnitude_percent

Common Issues
Issue 1: Optimizations Not Applied
Symptoms:
- Policy in Auto mode
- Recommendations exist
- No changes applied
Possible causes:
- Global dry-run enabled
- Update strategy misconfigured
- RBAC permissions missing
- Workload doesn’t support updates
Check:

    # Check controller flags
    kubectl get deployment optipod-controller -n optipod-system \
      -o jsonpath='{.spec.template.spec.containers[0].args}' | grep dry-run

    # Check update strategy
    kubectl get optimizationpolicy my-policy \
      -o jsonpath='{.spec.updateStrategy}' | jq

    # Check RBAC
    kubectl auth can-i update deployments --as=system:serviceaccount:optipod-system:optipod-controller

    # Check controller logs
    kubectl logs -n optipod-system deployment/optipod-controller

Issue 2: Optimization Failures
Symptoms:
- OptimizationFailed events
- Error messages in logs
Possible causes:
- Recommendations exceed limits
- Server-Side Apply (SSA) conflicts
- Validation errors
- RBAC errors
Check:

    # View failure events
    kubectl get events --field-selector reason=OptimizationFailed

    # Check controller logs
    kubectl logs -n optipod-system deployment/optipod-controller | grep ERROR

    # Describe workload
    kubectl describe deployment my-app

Issue 3: Performance Degradation
Symptoms:
- Increased latency
- CPU throttling
- OOMKills
Possible causes:
- Safety factor too low
- Percentile too low
- Metrics window too short
- Workload behavior changed
Actions:
- Revert to Recommend mode immediately
- Increase safety factor
- Use higher percentile (P95 or P99)
- Increase rolling window
- Adjust resource bounds

    # Emergency revert to Recommend mode
    kubectl patch optimizationpolicy my-policy \
      --type merge \
      --patch '{"spec":{"mode":"Recommend"}}'

Rollback Procedures
Emergency: Stop All Optimizations
Method 1: Switch to Recommend mode

    # Note: -o name output omits namespaces; if your policies are namespaced,
    # run this per namespace with -n instead of --all-namespaces.
    kubectl get optimizationpolicy --all-namespaces -o name | \
      xargs -I {} kubectl patch {} \
      --type merge \
      --patch '{"spec":{"mode":"Recommend"}}'

Method 2: Enable global dry-run

    kubectl set env deployment/optipod-controller \
      -n optipod-system \
      DRY_RUN=true

Method 3: Disable policy

    kubectl patch optimizationpolicy my-policy \
      --type merge \
      --patch '{"spec":{"mode":"Disabled"}}'

Rollback Specific Workload
If using GitOps:

    git revert <commit>
    git push

If using SSA:

    # Use a strategic merge patch so the container is matched by name;
    # a JSON merge patch (--type merge) would replace the whole containers list.
    kubectl patch deployment my-app \
      --type strategic \
      --patch '{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"1000m","memory":"2Gi"}}}]}}}}'

If using webhook:

    # Remove OptiPod annotations
    kubectl annotate deployment my-app \
      optipod.io/managed- \
      optipod.io/policy- \
      optipod.io/cpu-request.app- \
      optipod.io/memory-request.app-

    # Trigger rolling restart to apply original resources
    kubectl rollout restart deployment my-app

Best Practices
- Start small - Begin with single workload
- Test thoroughly - Validate in non-production first
- Use conservative settings - Higher safety factors initially
- Enable gradual decrease - Not yet implemented
- Use webhook strategy - Safer for GitOps environments
- Set onNextRestart - Avoid forced disruptions
- Monitor closely - Watch for issues after switch
- Document everything - Record decisions and observations
- Have rollback ready - Know how to revert quickly
- Communicate changes - Inform team before switching
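Having a rollback ready is easier if the current resource settings are captured before the switch. The sketch below writes each container's name and `resources` block for a Deployment to a file, so the original values are at hand for a manual revert. `snapshot_resources` is a hypothetical helper, and the output format (one tab-separated line per container) is an illustrative choice rather than anything OptiPod defines.

```shell
#!/usr/bin/env bash
# Save "<container-name><TAB><resources as JSON>" for each container of a Deployment.
# Usage: snapshot_resources <deployment> <output-file>
snapshot_resources() {
  local deployment="$1" outfile="$2"
  kubectl get deployment "$deployment" \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}' \
    > "$outfile"
}
```

Running `snapshot_resources my-app my-app-resources.tsv` before enabling Auto mode gives a record to consult during a rollback. In a GitOps setup the manifest in Git already serves this purpose, so the snapshot mainly helps for workloads managed by hand.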
Optimization Settings
Conservative (Recommended for Initial Auto Mode)

    spec:
      mode: Auto
      metricsConfig:
        rollingWindow: 48h
        percentile: P95
        safetyFactor: 1.5
      updateStrategy:
        strategy: webhook
        rolloutStrategy: onNextRestart
        allowInPlaceResize: true
        allowRecreate: false
        updateRequestsOnly: true
        # Note: gradualDecreaseConfig not yet implemented

Balanced (After Validation)

    spec:
      mode: Auto
      metricsConfig:
        rollingWindow: 24h
        percentile: P90
        safetyFactor: 1.2
      updateStrategy:
        strategy: webhook
        rolloutStrategy: onNextRestart
        allowInPlaceResize: true
        allowRecreate: false
        updateRequestsOnly: true
        # Note: gradualDecreaseConfig not yet implemented

Aggressive (Development Only)

    spec:
      mode: Auto
      metricsConfig:
        rollingWindow: 12h
        percentile: P90
        safetyFactor: 1.1
      updateStrategy:
        strategy: ssa
        allowInPlaceResize: true
        allowRecreate: true
        updateRequestsOnly: false

Success Criteria
Measure success after switching to Auto mode:
Resource Efficiency:
- CPU utilization improved
- Memory utilization improved
- Cost savings achieved
Reliability:
- No increase in OOMKills
- No increase in CPU throttling
- No performance degradation
- Stable pod restart counts
Operational:
- Optimization success rate > 95%
- No manual interventions needed
- Team confidence in automation
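The 95% threshold can be checked from the counts of applied and failed optimizations. The sketch below defines success rate as applied / (applied + failed), which is an assumption about how the figure is meant to be measured. `success_rate` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Print the optimization success rate as a percentage with one decimal place.
# Usage: success_rate <applied-count> <failed-count>
success_rate() {
  local applied="$1" failed="$2"
  awk -v a="$applied" -v f="$failed" 'BEGIN {
    total = a + f
    if (total == 0) { print "n/a"; exit }   # nothing attempted yet
    printf "%.1f\n", 100 * a / total
  }'
}
```

For example, `success_rate 97 3` prints `97.0`. The counts could come from the `OptimizationApplied` and `OptimizationFailed` events shown earlier, but Kubernetes events expire after a short retention period, so the Prometheus counters are likely the better source for a weekly figure.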
Next Steps
- Reviewing Recommendations - Monitor ongoing recommendations
- Troubleshooting - Common issues and solutions
- Safety Model - Understanding safety guarantees
- Modes - Operational modes explained
- GitOps Integration - Using OptiPod with ArgoCD/Flux