Creating Optimization Policies
This guide explains how to create and configure OptimizationPolicy resources to optimize your Kubernetes workloads.
Overview
An OptimizationPolicy defines:
- Which workloads to optimize (selector)
- How to collect metrics (metricsConfig)
- Resource constraints (resourceBounds)
- How to apply changes (updateStrategy)
- When to apply changes (mode)
Quick Start
Here’s a minimal policy to get started:
    apiVersion: optipod.io/v1alpha1
    kind: OptimizationPolicy
    metadata:
      name: my-first-policy
      namespace: default
    spec:
      mode: Recommend  # Start in safe mode

      selector:
        namespaceSelector:
          matchLabels:
            optimize: "true"

      metricsConfig:
        provider: prometheus
        rollingWindow: 24h
        percentile: P90
        safetyFactor: 1.2

      resourceBounds:
        cpu:
          min: "100m"
          max: "4000m"
        memory:
          min: "128Mi"
          max: "8Gi"

      updateStrategy:
        strategy: webhook
        rolloutStrategy: onNextRestart
        allowInPlaceResize: true
        allowRecreate: false
        updateRequestsOnly: true
Required Fields
Every policy must specify these fields:
1. Mode
Defines the operational behavior:
    spec:
      mode: Recommend  # or Auto, Disabled
Options:
- Recommend: Compute recommendations, don’t apply (safe default)
- Auto: Automatically apply recommendations
- Disabled: Stop processing workloads
Best practice: Start with Recommend mode to validate recommendations before enabling Auto.
2. Selector
Defines which workloads the policy applies to. At least one selector type is required:
    spec:
      selector:
        # Option 1: Select by namespace labels
        namespaceSelector:
          matchLabels:
            environment: production

        # Option 2: Select by workload labels
        workloadSelector:
          matchLabels:
            optimize: "true"

        # Option 3: Select by namespace names
        namespaces:
          allow:
            - production
            - staging
          deny:
            - kube-system
3. Metrics Configuration
Defines how metrics are collected:
    spec:
      metricsConfig:
        provider: prometheus  # or metrics-server
        rollingWindow: 24h    # Time window for metrics
        percentile: P90       # P50, P90, or P99
        safetyFactor: 1.2     # Multiplier >= 1.0
4. Resource Bounds
Defines min/max constraints for recommendations:
    spec:
      resourceBounds:
        cpu:
          min: "100m"
          max: "4000m"
        memory:
          min: "128Mi"
          max: "8Gi"
5. Update Strategy
Defines how changes are applied:
    spec:
      updateStrategy:
        strategy: webhook               # or ssa
        rolloutStrategy: onNextRestart  # or immediate
        allowInPlaceResize: true
        allowRecreate: false
        updateRequestsOnly: true
Selector Configuration
Namespace Selector
Select namespaces by labels:
    selector:
      namespaceSelector:
        matchLabels:
          environment: production
          team: platform
With match expressions:
    selector:
      namespaceSelector:
        matchExpressions:
          - key: environment
            operator: In
            values:
              - production
              - staging
          - key: team
            operator: NotIn
            values:
              - experimental
Workload Selector
Select workloads by labels:
    selector:
      workloadSelector:
        matchLabels:
          optimize: "true"
          tier: backend
Namespace Allow/Deny Lists
Explicitly allow or deny namespaces:
    selector:
      namespaces:
        allow:
          - production
          - staging
        deny:
          - kube-system
          - kube-public
Note: Deny takes precedence over allow.
Workload Type Filtering
Target specific workload types:
    selector:
      # Include only specific types
      workloadTypes:
        include:
          - Deployment
          - StatefulSet
Or exclude specific types:
    selector:
      # Exclude specific types
      workloadTypes:
        exclude:
          - DaemonSet
Note: Exclude takes precedence over include.
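The two precedence notes (deny over allow, exclude over include) describe the same rule. The sketch below models it in Python for illustration only. The function name `is_selected` is hypothetical and not part of OptiPod, and treating an empty include/allow list as "no restriction" is an assumption this guide does not confirm.

```python
def is_selected(item, include=(), exclude=()):
    """Return True when `item` passes an include/exclude (or allow/deny) pair."""
    if item in exclude:
        # The excluding list always wins, even if the item is also included.
        return False
    # Assumption: an empty include list places no restriction.
    return not include or item in include


# Namespace allow/deny lists: kube-system is both allowed and denied, so it is rejected.
print(is_selected("kube-system", include=["production", "kube-system"], exclude=["kube-system"]))
print(is_selected("production", include=["production", "staging"], exclude=["kube-system"]))

# Workload type include/exclude lists.
print(is_selected("DaemonSet", exclude=["DaemonSet"]))
print(is_selected("Deployment", exclude=["DaemonSet"]))
```

The four calls cover the conflict case, a plain allow, a plain exclude, and an item that no list mentions.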
Combining Selectors
You can combine multiple selector types:
    selector:
      # Namespaces with label environment=production
      namespaceSelector:
        matchLabels:
          environment: production

      # Workloads with label optimize=true
      workloadSelector:
        matchLabels:
          optimize: "true"

      # But not in kube-system
      namespaces:
        deny:
          - kube-system

      # Only Deployments
      workloadTypes:
        include:
          - Deployment
Metrics Configuration
Provider Selection
OptiPod supports two metrics providers:
Prometheus (Recommended):
    metricsConfig:
      provider: prometheus
      rollingWindow: 24h
      percentile: P90
      safetyFactor: 1.2
Metrics-Server:
    metricsConfig:
      provider: metrics-server
      rollingWindow: 12h
      percentile: P90
      safetyFactor: 1.2
      metricsServer:
        minSamplesRequired: 10
Rolling Window
Time period for metrics aggregation:
    metricsConfig:
      rollingWindow: 24h  # 24 hours of data
Recommendations:
- Production: 24h or 48h
- Development: 12h
- Stateful workloads: 48h or longer
Percentile
Which percentile to use for recommendations:
    metricsConfig:
      percentile: P90  # P50, P90, or P99
Recommendations:
- Stable workloads: P90
- Variable traffic: P95 or P99
- Development: P50 or P90
Safety Factor
Multiplier applied to percentile value:
    metricsConfig:
      safetyFactor: 1.2  # 20% buffer
Recommendations:
- Stable workloads: 1.1-1.2
- Variable traffic: 1.2-1.5
- Bursty workloads: 1.5-2.0
- Critical services: 1.5-2.0
Calculation:
    Recommendation = Percentile(Usage) × SafetyFactor
Resource Bounds
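As a rough illustration of how the calculation above feeds into resource bounds, the sketch below computes a recommendation from usage samples and then clamps it to a min/max pair. It is not OptiPod's implementation. The helper names are hypothetical, the sample values are made up, and the nearest-rank percentile method is an assumption, since this guide does not say how percentiles are computed.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer `pct` (assumed method, not confirmed)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer form of ceil(pct * n / 100), which avoids floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def recommend(samples, pct, safety_factor, lower, upper):
    """Percentile(Usage) x SafetyFactor, clamped to [lower, upper]."""
    raw = percentile(samples, pct) * safety_factor
    return min(max(raw, lower), upper)


# Hypothetical CPU usage samples in millicores.
cpu_millicores = [120, 150, 180, 200, 210, 220, 250, 300, 320, 900]

# P90 with a 1.2 safety factor and bounds of 100m to 4000m.
print(recommend(cpu_millicores, 90, 1.2, lower=100, upper=4000))
```

With these samples the single 900m spike sits above the P90 rank, so it does not drive the result. A P99 policy would pick it up instead, which is why spiky workloads call for a higher percentile or a larger safety factor.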
CPU Bounds
    resourceBounds:
      cpu:
        min: "100m"   # Minimum CPU request
        max: "4000m"  # Maximum CPU request
Recommendations:
- Start with wide bounds for assessment
- Tighten based on observed usage
- Consider workload characteristics
Examples:
Conservative (initial assessment):
    cpu:
      min: "50m"
      max: "8000m"
Production (after assessment):
    cpu:
      min: "100m"
      max: "2000m"
High-resource workloads:
    cpu:
      min: "1000m"
      max: "8000m"
Memory Bounds
    resourceBounds:
      memory:
        min: "128Mi"  # Minimum memory request
        max: "8Gi"    # Maximum memory request
Recommendations:
- Be conservative with memory (OOM kills pods)
- Start with higher minimums
- Monitor for OOMKills
Examples:
Conservative:
    memory:
      min: "256Mi"
      max: "4Gi"
High-memory workloads:
    memory:
      min: "2Gi"
      max: "16Gi"
Enforcement
OptiPod clamps recommendations to bounds:
    If recommendation < min → Use min
    If recommendation > max → Use max
    Otherwise → Use recommendation
Update Strategy
Strategy Selection
Choose between SSA and Webhook strategies:
Webhook (Default - GitOps Compatible):
    updateStrategy:
      strategy: webhook
      rolloutStrategy: onNextRestart
SSA (Simpler Infrastructure):
    updateStrategy:
      strategy: ssa
      useServerSideApply: true
See Update Strategies for a detailed comparison.
Rollout Strategy
Control when changes take effect (webhook strategy only):
onNextRestart (Safer):
    updateStrategy:
      rolloutStrategy: onNextRestart
- Waits for natural pod restart
- No forced disruption
- Gradual rollout
immediate (Faster):
    updateStrategy:
      rolloutStrategy: immediate
- Triggers rolling restart
- Faster optimization
- Controlled disruption
In-Place Resize
Enable in-place pod resize (Kubernetes 1.27+):
    updateStrategy:
      allowInPlaceResize: true  # Use if available
      allowRecreate: false      # Block pod recreation
Update Scope
Control what gets updated:
Requests Only (Safer):
    updateStrategy:
      updateRequestsOnly: true
Requests and Limits:
    updateStrategy:
      updateRequestsOnly: false
      limitConfig:
        cpuLimitMultiplier: 1.5     # Limit = Request × 1.5
        memoryLimitMultiplier: 1.1  # Limit = Request × 1.1
Memory Safety
Configure memory safety features:
Gradual Memory Decrease (Not Yet Implemented):
⚠️ Note: This configuration is accepted but not yet implemented. It has no effect on actual behavior.
    updateStrategy:
      gradualDecreaseConfig:
        enabled: true
        memoryDecreasePercentage: 10     # Max 10% per cycle
        minimumDecreaseThreshold: 100Mi  # Threshold to trigger
        maximumTotalDecrease: 70         # Max 70% total decrease
Unsafe Memory Decrease (Use with Caution):
    updateStrategy:
      allowUnsafeMemoryDecrease: true  # Disable safety checks
Optional Fields
Weight
Priority when multiple policies match the same workload:
    spec:
      weight: 200  # Default: 100, Range: 1-1000
Higher weight = higher priority.
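To make the weight rule concrete, here is a small sketch of picking the winning policy when several match one workload. The function `effective_policy` and the dictionary shape are hypothetical, chosen only for illustration. How OptiPod breaks ties between equal weights is not described in this guide, so the sketch simply keeps the first policy it sees.

```python
DEFAULT_WEIGHT = 100  # Documented default when `weight` is omitted.


def effective_policy(matching_policies):
    """Return the matching policy with the highest weight, or None if none match."""
    if not matching_policies:
        return None
    # max() keeps the first of several equal-weight entries; the real
    # tie-breaking behaviour is an open assumption here.
    return max(matching_policies, key=lambda p: p.get("weight", DEFAULT_WEIGHT))


policies = [
    {"name": "cluster-default"},                 # weight omitted, so 100
    {"name": "team-override", "weight": 200},
    {"name": "low-priority", "weight": 50},
]
print(effective_policy(policies)["name"])
```

Here `team-override` is selected because 200 exceeds both the default of 100 and the explicit 50.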
Reconciliation Interval
How often the policy is evaluated:
    spec:
      reconciliationInterval: 5m  # Default: 5m
Common Patterns
Pattern 1: Production Workloads (GitOps)
    apiVersion: optipod.io/v1alpha1
    kind: OptimizationPolicy
    metadata:
      name: production-gitops
    spec:
      mode: Auto

      selector:
        namespaceSelector:
          matchLabels:
            environment: production
        workloadSelector:
          matchLabels:
            optimize: "true"

      metricsConfig:
        provider: prometheus
        rollingWindow: 24h
        percentile: P90
        safetyFactor: 1.2

      resourceBounds:
        cpu:
          min: "100m"
          max: "4000m"
        memory:
          min: "256Mi"
          max: "8Gi"

      updateStrategy:
        strategy: webhook
        rolloutStrategy: onNextRestart
        allowInPlaceResize: true
        allowRecreate: false
        updateRequestsOnly: true
Pattern 2: Development Workloads (Aggressive)
    apiVersion: optipod.io/v1alpha1
    kind: OptimizationPolicy
    metadata:
      name: development-aggressive
    spec:
      mode: Auto

      selector:
        namespaceSelector:
          matchLabels:
            environment: development

      metricsConfig:
        provider: prometheus
        rollingWindow: 12h
        percentile: P90
        safetyFactor: 1.1

      resourceBounds:
        cpu:
          min: "50m"
          max: "2000m"
        memory:
          min: "64Mi"
          max: "4Gi"

      updateStrategy:
        strategy: ssa
        allowInPlaceResize: true
        allowRecreate: true
        updateRequestsOnly: false
        limitConfig:
          cpuLimitMultiplier: 1.5
          memoryLimitMultiplier: 1.1
Pattern 3: Stateful Workloads (Conservative)
    apiVersion: optipod.io/v1alpha1
    kind: OptimizationPolicy
    metadata:
      name: stateful-conservative
    spec:
      mode: Recommend  # Only recommend for stateful

      selector:
        workloadTypes:
          include:
            - StatefulSet

      metricsConfig:
        provider: prometheus
        rollingWindow: 48h  # Longer window
        percentile: P95     # Higher percentile
        safetyFactor: 1.5   # Larger buffer

      resourceBounds:
        cpu:
          min: "100m"
          max: "8000m"
        memory:
          min: "512Mi"
          max: "16Gi"

      updateStrategy:
        strategy: webhook
        rolloutStrategy: onNextRestart
        allowInPlaceResize: true
        allowRecreate: false
        updateRequestsOnly: true
        # Note: gradualDecreaseConfig not yet implemented
Pattern 4: Multiple Policies by Workload Type
    # Policy for Deployments
    apiVersion: optipod.io/v1alpha1
    kind: OptimizationPolicy
    metadata:
      name: deployments-policy
    spec:
      mode: Auto
      weight: 100

      selector:
        namespaceSelector:
          matchLabels:
            environment: production
        workloadTypes:
          include:
            - Deployment

      metricsConfig:
        provider: prometheus
        rollingWindow: 24h
        percentile: P90
        safetyFactor: 1.1

      resourceBounds:
        cpu:
          min: "50m"
          max: "4000m"
        memory:
          min: "64Mi"
          max: "8Gi"

      updateStrategy:
        strategy: webhook
        rolloutStrategy: onNextRestart
        allowInPlaceResize: true
        allowRecreate: true
        updateRequestsOnly: true

    ---
    # Policy for StatefulSets
    apiVersion: optipod.io/v1alpha1
    kind: OptimizationPolicy
    metadata:
      name: statefulsets-policy
    spec:
      mode: Recommend
      weight: 100

      selector:
        namespaceSelector:
          matchLabels:
            environment: production
        workloadTypes:
          include:
            - StatefulSet

      metricsConfig:
        provider: prometheus
        rollingWindow: 48h
        percentile: P95
        safetyFactor: 1.3

      resourceBounds:
        cpu:
          min: "100m"
          max: "8000m"
        memory:
          min: "256Mi"
          max: "16Gi"

      updateStrategy:
        strategy: webhook
        rolloutStrategy: onNextRestart
        allowInPlaceResize: true
        allowRecreate: false
        updateRequestsOnly: true
Validation
OptiPod validates policies before processing. Common validation errors:
Invalid Mode
    # ❌ Invalid
    spec:
      mode: AutoApply

    # ✅ Valid
    spec:
      mode: Auto
Missing Selector
    # ❌ Invalid - no selector
    spec:
      mode: Auto
      metricsConfig: ...

    # ✅ Valid - at least one selector
    spec:
      mode: Auto
      selector:
        namespaceSelector:
          matchLabels:
            optimize: "true"
Invalid Bounds
    # ❌ Invalid - min > max
    resourceBounds:
      cpu:
        min: "2000m"
        max: "100m"

    # ✅ Valid - min <= max
    resourceBounds:
      cpu:
        min: "100m"
        max: "2000m"
Invalid Safety Factor
    # ❌ Invalid - must be >= 1.0
    metricsConfig:
      safetyFactor: 0.9

    # ✅ Valid
    metricsConfig:
      safetyFactor: 1.2
Troubleshooting
Policy Not Processing Workloads
Check policy status:
    kubectl describe optimizationpolicy my-policy
Common issues:
- Selector doesn’t match any workloads
- Policy validation failed
- Metrics provider not configured
- RBAC permissions missing
Recommendations Not Applied
Check mode and update strategy:
    kubectl get optimizationpolicy my-policy -o yaml
Verify:
- Mode is Auto (not Recommend)
- Update strategy is configured
- Global dry-run is not enabled
- Workload matches selector
Validation Errors
View events:
    kubectl get events --field-selector reason=ValidationFailed
Fix the validation errors and reapply the policy.
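Before reapplying, a local pre-check can catch the four errors listed under Validation. The sketch below is a hedged approximation, not OptiPod's validator. It assumes the policy `spec` is already loaded as a Python dictionary, that `workloadTypes` counts as a selector (Pattern 3 uses it on its own), and that quantities use only the `m`, `Ki`, `Mi`, and `Gi` suffixes seen in this guide. All function names are hypothetical.

```python
VALID_MODES = {"Auto", "Recommend", "Disabled"}
# Assumption: any one of these keys satisfies the selector requirement.
SELECTOR_KEYS = ("namespaceSelector", "workloadSelector", "namespaces", "workloadTypes")
# Only the suffixes used in this guide; real Kubernetes quantities allow more.
UNITS = {"m": 0.001, "Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_quantity(text):
    """Convert a quantity such as "100m" or "128Mi" to a comparable float."""
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    return float(text)


def validate(spec):
    """Return a list of problems that mirror the documented validation errors."""
    errors = []
    if spec.get("mode") not in VALID_MODES:
        errors.append("mode must be Auto, Recommend, or Disabled")
    selector = spec.get("selector") or {}
    if not any(selector.get(key) for key in SELECTOR_KEYS):
        errors.append("at least one selector is required")
    for resource, bounds in (spec.get("resourceBounds") or {}).items():
        if parse_quantity(bounds["min"]) > parse_quantity(bounds["max"]):
            errors.append(f"{resource}: min must be <= max")
    factor = (spec.get("metricsConfig") or {}).get("safetyFactor")
    if factor is not None and factor < 1.0:
        errors.append("safetyFactor must be >= 1.0")
    return errors


bad_spec = {
    "mode": "AutoApply",
    "metricsConfig": {"safetyFactor": 0.9},
    "resourceBounds": {"cpu": {"min": "2000m", "max": "100m"}},
}
for problem in validate(bad_spec):
    print(problem)
```

The example spec combines all four invalid snippets from the Validation section, so each check reports once. The cluster-side validation remains the source of truth.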
Best Practices
- Start with Recommend mode to validate recommendations
- Use conservative bounds initially
- Test in non-production before enabling Auto mode
- Use webhook strategy for GitOps environments
- Enable gradual memory decrease for safety (Not yet implemented)
- Monitor metrics for optimization success/failure
- Document your policies for team awareness
- Use workload type filtering for gradual adoption
- Set appropriate safety factors based on workload characteristics
- Review recommendations regularly before switching to Auto mode
Next Steps
- Reviewing Recommendations - How to review recommendations
- Switching to Auto Mode - Safely enable automatic optimization
- Update Strategies - SSA vs Webhook strategies
- Safety Model - Understanding safety guarantees
- CRD Reference - Complete field documentation