# Prometheus Metrics Reference

Complete reference for all Prometheus metrics exposed by OptiPod.
## Overview

OptiPod exposes Prometheus metrics for monitoring controller and webhook operations. Metrics are available on:

- Controller: `:8080/metrics`
- Webhook: `:8443/metrics`

All metrics use the `optipod_` prefix.
## Metric Types

OptiPod uses four Prometheus metric types:

- **Counter** - Monotonically increasing value (e.g., total requests)
- **Gauge** - Value that can go up or down (e.g., current workload count)
- **Histogram** - Distribution of values across configured buckets (e.g., duration)
- **Summary** - Similar to a histogram, but exposes precomputed quantiles
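To make the types concrete, scraping one of these endpoints returns the Prometheus text exposition format. The sample below is illustrative (the metric names are from this reference, but the values and label combinations are invented):

```text
# HELP optipod_workloads_monitored Number of workloads currently monitored
# TYPE optipod_workloads_monitored gauge
optipod_workloads_monitored{namespace="production",policy="prod-policy"} 12

# TYPE optipod_reconciliation_errors_total counter
optipod_reconciliation_errors_total{policy="prod-policy",error_type="metrics_fetch"} 3

# TYPE optipod_reconciliation_duration_seconds histogram
optipod_reconciliation_duration_seconds_bucket{policy="prod-policy",le="0.5"} 42
optipod_reconciliation_duration_seconds_sum{policy="prod-policy"} 18.7
optipod_reconciliation_duration_seconds_count{policy="prod-policy"} 45
```

Note that a histogram is exposed as `_bucket`, `_sum`, and `_count` series rather than a single series; the PromQL examples below rely on this.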
## Controller Metrics

### Workload Monitoring

#### optipod_workloads_monitored

Number of workloads currently monitored by OptiPod.

**Type:** Gauge

**Labels:**
- `namespace` - Workload namespace
- `policy` - Policy name managing the workload

**Example:**

```promql
optipod_workloads_monitored{namespace="production",policy="prod-policy"}
```

**Use cases:**
- Monitor workload discovery
- Track policy coverage
- Alert on unexpected changes

#### optipod_workloads_updated

Number of workloads updated in the last reconciliation cycle.

**Type:** Gauge

**Labels:**
- `namespace` - Workload namespace
- `policy` - Policy name

**Example:**

```promql
optipod_workloads_updated{namespace="production",policy="prod-policy"}
```

**Use cases:**
- Track optimization activity
- Monitor Auto mode effectiveness
- Identify high-churn workloads

#### optipod_workloads_skipped

Number of workloads skipped in the last reconciliation cycle.

**Type:** Gauge

**Labels:**
- `namespace` - Workload namespace
- `policy` - Policy name
- `reason` - Skip reason (e.g., "no_metrics", "validation_failed")

**Example:**

```promql
optipod_workloads_skipped{namespace="production",policy="prod-policy",reason="no_metrics"}
```

**Use cases:**
- Identify workloads without metrics
- Debug policy issues
- Monitor validation failures
### Reconciliation

#### optipod_reconciliation_duration_seconds

Duration of reconciliation cycles in seconds.

**Type:** Histogram

**Labels:**
- `policy` - Policy name

**Buckets:** Default Prometheus buckets (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10)

**Example:**

```promql
# Average reconciliation duration
rate(optipod_reconciliation_duration_seconds_sum[5m]) / rate(optipod_reconciliation_duration_seconds_count[5m])

# 95th percentile
histogram_quantile(0.95, rate(optipod_reconciliation_duration_seconds_bucket[5m]))
```

**Use cases:**
- Monitor controller performance
- Identify slow reconciliations
- Set SLO targets

#### optipod_reconciliation_errors_total

Total number of reconciliation errors.

**Type:** Counter

**Labels:**
- `policy` - Policy name
- `error_type` - Error category (e.g., "metrics_fetch", "validation", "apply")

**Example:**

```promql
# Error rate
rate(optipod_reconciliation_errors_total[5m])

# Errors by type
sum by (error_type) (rate(optipod_reconciliation_errors_total[5m]))
```

**Use cases:**
- Alert on error spikes
- Identify error patterns
- Debug controller issues
### Metrics Collection

#### optipod_metrics_collection_duration_seconds

Duration of metrics collection operations in seconds.

**Type:** Histogram

**Labels:**
- `provider` - Metrics provider (e.g., "prometheus", "metrics-server")

**Buckets:** Default Prometheus buckets

**Example:**

```promql
# Average collection duration by provider
rate(optipod_metrics_collection_duration_seconds_sum[5m]) / rate(optipod_metrics_collection_duration_seconds_count[5m])
```

**Use cases:**
- Monitor metrics provider performance
- Identify slow queries
- Compare provider efficiency
### Recommendations

#### optipod_recommendations_total

Total number of recommendations generated.

**Type:** Counter

**Labels:**
- `policy` - Policy name

**Example:**

```promql
# Recommendation rate
rate(optipod_recommendations_total[5m])

# Total recommendations by policy
sum by (policy) (optipod_recommendations_total)
```

**Use cases:**
- Track recommendation generation
- Monitor policy activity
- Measure optimization coverage
### Applications

#### optipod_applications_total

Total number of resource updates applied.

**Type:** Counter

**Labels:**
- `policy` - Policy name
- `method` - Application method ("ssa" or "webhook")

**Example:**

```promql
# Application rate
rate(optipod_applications_total[5m])

# Applications by method
sum by (method) (rate(optipod_applications_total[5m]))
```

**Use cases:**
- Monitor Auto mode activity
- Compare SSA vs webhook usage
- Track optimization velocity

#### optipod_ssa_patch_total

Total number of Server-Side Apply patch operations.

**Type:** Counter

**Labels:**
- `policy` - Policy name
- `namespace` - Workload namespace
- `workload` - Workload name
- `kind` - Workload kind (Deployment, StatefulSet, DaemonSet)
- `status` - Patch status ("success" or "failure")
- `patch_type` - Type of patch ("strategic" or "merge")

**Example:**

```promql
# SSA success rate
sum(rate(optipod_ssa_patch_total{status="success"}[5m])) / sum(rate(optipod_ssa_patch_total[5m]))

# Failed patches
rate(optipod_ssa_patch_total{status="failure"}[5m])
```

**Use cases:**
- Monitor SSA strategy effectiveness
- Debug patch failures
- Track per-workload updates
### Optimization

#### optipod_optimization_success_total

Total number of successful optimizations.

**Type:** Counter

**Labels:**
- `policy` - Policy name
- `namespace` - Workload namespace
- `workload` - Workload name
- `kind` - Workload kind
- `method` - Optimization method ("ssa" or "webhook")

**Example:**

```promql
# Success rate
rate(optipod_optimization_success_total[5m])

# Success by workload kind
sum by (kind) (rate(optipod_optimization_success_total[5m]))
```

**Use cases:**
- Monitor optimization success
- Track per-workload optimization
- Compare methods

#### optipod_optimization_failure_total

Total number of failed optimizations.

**Type:** Counter

**Labels:**
- `policy` - Policy name
- `namespace` - Workload namespace
- `workload` - Workload name
- `kind` - Workload kind
- `method` - Optimization method
- `reason` - Failure reason

**Example:**

```promql
# Failure rate
rate(optipod_optimization_failure_total[5m])

# Failures by reason
sum by (reason) (rate(optipod_optimization_failure_total[5m]))
```

**Use cases:**
- Alert on optimization failures
- Identify failure patterns
- Debug specific workloads
#### optipod_resource_changes_magnitude

Magnitude of resource changes, in percent.

**Type:** Histogram

**Labels:**
- `policy` - Policy name
- `namespace` - Workload namespace
- `workload` - Workload name
- `resource_type` - Resource type ("cpu" or "memory")

**Buckets:** -90, -75, -50, -25, -10, -5, 0, 5, 10, 25, 50, 75, 100, 200, 500

**Example:**

```promql
# Average change magnitude
rate(optipod_resource_changes_magnitude_sum[5m]) / rate(optipod_resource_changes_magnitude_count[5m])

# Distribution of changes
sum by (le) (rate(optipod_resource_changes_magnitude_bucket[5m]))
```

**Use cases:**
- Monitor optimization impact
- Identify aggressive changes
- Track savings

#### optipod_default_multiplier_usage_total

Total number of times default multipliers were used.

**Type:** Counter

**Labels:**
- `policy` - Policy name
- `resource_type` - Resource type ("cpu" or "memory")
- `multiplier_value` - Multiplier value used

**Example:**

```promql
# Default multiplier usage rate
rate(optipod_default_multiplier_usage_total[5m])
```

**Use cases:**
- Track default configuration usage
- Identify policies needing tuning

#### optipod_optimization_decision_duration_seconds

Duration of optimization decision making in seconds.

**Type:** Histogram

**Labels:**
- `policy` - Policy name
- `workload_kind` - Workload kind

**Buckets:** Default Prometheus buckets

**Example:**

```promql
# Average decision duration
rate(optipod_optimization_decision_duration_seconds_sum[5m]) / rate(optipod_optimization_decision_duration_seconds_count[5m])
```

**Use cases:**
- Monitor decision performance
- Identify slow policies
## Webhook Metrics

### Admission Requests

#### optipod_webhook_admission_requests_total

Total number of webhook admission requests received.

**Type:** Counter

**Labels:**
- `namespace` - Pod namespace
- `pod_name` - Pod name
- `dry_run` - Dry run flag ("true" or "false")

**Example:**

```promql
# Request rate
rate(optipod_webhook_admission_requests_total[5m])

# Dry run vs real requests
sum by (dry_run) (rate(optipod_webhook_admission_requests_total[5m]))
```

**Use cases:**
- Monitor webhook traffic
- Track dry run usage
- Capacity planning

#### optipod_webhook_admission_success_total

Total number of successful webhook admissions.

**Type:** Counter

**Labels:**
- `namespace` - Pod namespace
- `policy` - Policy name
- `patches_applied` - Whether patches were applied ("0" or "1+")

**Example:**

```promql
# Success rate
sum(rate(optipod_webhook_admission_success_total[5m])) / sum(rate(optipod_webhook_admission_requests_total[5m]))
```

**Use cases:**
- Monitor webhook success rate
- Track patch application

#### optipod_webhook_admission_failures_total

Total number of failed webhook admissions.

**Type:** Counter

**Labels:**
- `namespace` - Pod namespace
- `failure_reason` - Failure reason

**Example:**

```promql
# Failure rate
rate(optipod_webhook_admission_failures_total[5m])

# Failures by reason
sum by (failure_reason) (rate(optipod_webhook_admission_failures_total[5m]))
```

**Use cases:**
- Alert on webhook failures
- Debug admission issues
- Identify failure patterns
### Mutation Operations

#### optipod_webhook_mutation_duration_seconds

Duration of webhook mutation operations in seconds.

**Type:** Histogram

**Labels:**
- `namespace` - Pod namespace
- `policy` - Policy name

**Buckets:** 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0

**Example:**

```promql
# Average mutation duration
rate(optipod_webhook_mutation_duration_seconds_sum[5m]) / rate(optipod_webhook_mutation_duration_seconds_count[5m])

# 99th percentile
histogram_quantile(0.99, rate(optipod_webhook_mutation_duration_seconds_bucket[5m]))
```

**Use cases:**
- Monitor webhook latency
- Set SLO targets
- Identify slow mutations

#### optipod_webhook_patches_applied_total

Total number of patches applied by the webhook.

**Type:** Counter

**Labels:**
- `namespace` - Pod namespace
- `policy` - Policy name
- `container_name` - Container name
- `resource_type` - Resource type ("cpu" or "memory")

**Example:**

```promql
# Patch rate
rate(optipod_webhook_patches_applied_total[5m])

# Patches by resource type
sum by (resource_type) (rate(optipod_webhook_patches_applied_total[5m]))
```

**Use cases:**
- Track webhook activity
- Monitor per-container patches
- Analyze resource type distribution
### Errors and Health

#### optipod_webhook_annotation_parsing_errors_total

Total number of annotation parsing errors in the webhook.

**Type:** Counter

**Labels:**
- `namespace` - Pod namespace
- `pod_name` - Pod name
- `annotation_key` - Annotation key that failed to parse
- `error_type` - Error type

**Example:**

```promql
# Parsing error rate
rate(optipod_webhook_annotation_parsing_errors_total[5m])

# Errors by annotation
sum by (annotation_key) (rate(optipod_webhook_annotation_parsing_errors_total[5m]))
```

**Use cases:**
- Debug annotation format issues
- Identify problematic annotations
- Alert on parsing errors

#### optipod_webhook_server_health_status

Webhook server health status.

**Type:** Gauge

**Labels:**
- `endpoint` - Health endpoint ("/healthz" or "/readyz")

**Values:**
- `1` - Healthy
- `0` - Unhealthy

**Example:**

```promql
# Check health status
optipod_webhook_server_health_status{endpoint="/healthz"}
```

**Use cases:**
- Monitor webhook health
- Alert on unhealthy status
- Track availability

#### optipod_webhook_certificate_expiry_timestamp_seconds

Webhook certificate expiry time as a Unix timestamp.

**Type:** Gauge

**Labels:**
- `cert_type` - Certificate type ("server" or "ca")

**Example:**

```promql
# Days until expiry
(optipod_webhook_certificate_expiry_timestamp_seconds - time()) / 86400

# Alert if expiring soon (604800 seconds = 7 days)
(optipod_webhook_certificate_expiry_timestamp_seconds - time()) < 604800
```

**Use cases:**
- Monitor certificate expiry
- Alert before expiration
- Track certificate rotation
### Policy Matching

#### optipod_webhook_policy_matching_duration_seconds

Duration of policy matching operations in the webhook.

**Type:** Histogram

**Labels:**
- `namespace` - Pod namespace
- `policies_found` - Number of policies found

**Buckets:** 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5

**Example:**

```promql
# Average matching duration
rate(optipod_webhook_policy_matching_duration_seconds_sum[5m]) / rate(optipod_webhook_policy_matching_duration_seconds_count[5m])
```

**Use cases:**
- Monitor policy matching performance
- Optimize policy selectors
## Common PromQL Queries

### Controller Health

**Reconciliation success rate:**

```promql
1 - (
  rate(optipod_reconciliation_errors_total[5m])
  /
  rate(optipod_reconciliation_duration_seconds_count[5m])
)
```

**Workloads per policy:**

```promql
sum by (policy) (optipod_workloads_monitored)
```

**Recommendation generation rate:**

```promql
rate(optipod_recommendations_total[5m])
```

### Optimization Impact

**Total net resource change (CPU, negative values indicate savings).** Because `optipod_resource_changes_magnitude` is a histogram, queries must use its `_sum` and `_count` series rather than the bare metric name:

```promql
sum(optipod_resource_changes_magnitude_sum{resource_type="cpu"})
```

**Average optimization magnitude:**

```promql
sum(optipod_resource_changes_magnitude_sum) / sum(optipod_resource_changes_magnitude_count)
```

**Optimization success rate:**

```promql
sum(rate(optipod_optimization_success_total[5m]))
/
(sum(rate(optipod_optimization_success_total[5m])) + sum(rate(optipod_optimization_failure_total[5m])))
```

### Webhook Performance

**Webhook latency (p95):**

```promql
histogram_quantile(0.95, rate(optipod_webhook_mutation_duration_seconds_bucket[5m]))
```

**Webhook success rate:**

```promql
sum(rate(optipod_webhook_admission_success_total[5m]))
/
sum(rate(optipod_webhook_admission_requests_total[5m]))
```

**Patches per second:**

```promql
sum(rate(optipod_webhook_patches_applied_total[5m]))
```

### Error Monitoring

**Error rate by type:**

```promql
sum by (error_type) (rate(optipod_reconciliation_errors_total[5m]))
```

**Webhook failure rate:**

```promql
sum by (failure_reason) (rate(optipod_webhook_admission_failures_total[5m]))
```

**Annotation parsing errors:**

```promql
sum by (annotation_key) (rate(optipod_webhook_annotation_parsing_errors_total[5m]))
```

## Grafana Dashboard
### Recommended Panels

**Overview Dashboard:**
- Workloads monitored (gauge)
- Recommendation rate (graph)
- Optimization success rate (gauge)
- Reconciliation duration p95 (graph)
- Error rate (graph)

**Controller Dashboard:**
- Reconciliation duration histogram
- Metrics collection duration
- Workloads by policy (table)
- Errors by type (graph)
- Resource change magnitude distribution

**Webhook Dashboard:**
- Admission request rate (graph)
- Mutation latency p50/p95/p99 (graph)
- Success rate (gauge)
- Patches applied by resource type (graph)
- Certificate expiry countdown (gauge)

### Example Panel Queries

**Workloads Monitored (Gauge):**

```promql
sum(optipod_workloads_monitored)
```

**Reconciliation Duration (Graph):**

```promql
histogram_quantile(0.50, rate(optipod_reconciliation_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(optipod_reconciliation_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(optipod_reconciliation_duration_seconds_bucket[5m]))
```

**Optimization Success Rate (Gauge):**

```promql
sum(rate(optipod_optimization_success_total[5m]))
/
(sum(rate(optipod_optimization_success_total[5m])) + sum(rate(optipod_optimization_failure_total[5m])))
```

## Alerting Rules
### Critical Alerts

**Controller Down:**

```yaml
- alert: OptiPodControllerDown
  expr: up{job="optipod-controller"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "OptiPod controller is down"
    description: "Controller has been down for more than 5 minutes"
```

**High Error Rate:**

```yaml
- alert: OptiPodHighErrorRate
  expr: rate(optipod_reconciliation_errors_total[5m]) > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High reconciliation error rate"
    description: "Error rate is {{ $value }} errors/sec"
```

**Webhook Failures:**

```yaml
- alert: OptiPodWebhookFailures
  expr: rate(optipod_webhook_admission_failures_total[5m]) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Webhook admission failures detected"
    description: "Failure rate is {{ $value }} failures/sec"
```

### Warning Alerts

**Certificate Expiring Soon:**

```yaml
- alert: OptiPodCertificateExpiringSoon
  expr: (optipod_webhook_certificate_expiry_timestamp_seconds - time()) < 604800
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Webhook certificate expiring soon"
    description: "Certificate expires in {{ $value | humanizeDuration }}"
```

**High Reconciliation Duration:**

```yaml
- alert: OptiPodSlowReconciliation
  expr: histogram_quantile(0.95, rate(optipod_reconciliation_duration_seconds_bucket[5m])) > 30
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Slow reconciliation detected"
    description: "P95 reconciliation duration is {{ $value }}s"
```

**No Recommendations Generated:**

```yaml
- alert: OptiPodNoRecommendations
  expr: rate(optipod_recommendations_total[30m]) == 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "No recommendations generated"
    description: "No recommendations in the last hour"
```

## ServiceMonitor Configuration
OptiPod includes a ServiceMonitor for the Prometheus Operator:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: optipod-controller
  namespace: optipod-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: optipod
      app.kubernetes.io/component: controller
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
```

## Metrics Endpoint Security
### Authentication

Metrics endpoints support authentication via:

- **Token-based auth** - Bearer token in the `Authorization` header
- **mTLS** - Mutual TLS authentication
- **RBAC** - Kubernetes RBAC for a metrics reader role
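As an illustration of the token-based option, a plain Prometheus scrape config (i.e., not using the Operator's ServiceMonitor) might look like the following sketch. The job name, target address, and ServiceAccount token path are assumptions for this example, not OptiPod defaults:

```yaml
scrape_configs:
  - job_name: optipod-controller            # illustrative job name
    metrics_path: /metrics
    authorization:
      type: Bearer
      # In-cluster ServiceAccount token path (assumed setup)
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    static_configs:
      - targets: ["optipod-controller.optipod-system.svc:8080"]
```

The ServiceAccount running this scrape must also be bound to a role granting `get` on `/metrics` (see the RBAC configuration below).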
### RBAC Configuration

Grant metrics access:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: optipod-metrics-reader
rules:
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
```

## Troubleshooting
### Metrics Not Appearing

Check the metrics endpoint:

```shell
kubectl port-forward -n optipod-system deployment/optipod-controller 8080:8080
curl http://localhost:8080/metrics
```

Verify the ServiceMonitor:

```shell
kubectl get servicemonitor -n optipod-system
kubectl describe servicemonitor optipod-controller -n optipod-system
```

Check Prometheus targets in the Prometheus UI: Status → Targets → optipod-system/optipod-controller

### Missing Labels
If labels are missing:

- Check metric registration in code
- Verify label values are set correctly
- Check for label cardinality limits

### High Cardinality

If metrics have too many unique label combinations:

- Review label usage (especially `workload` and `namespace`)
- Consider aggregating at query time
- Use recording rules for common queries
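A recording rule precomputes an expensive query and stores the result under a new series name, which also lets you aggregate high-cardinality labels away once at scrape time. The sketch below uses metrics from this reference, but the group and rule names are illustrative, not shipped with OptiPod:

```yaml
groups:
  - name: optipod.rules
    interval: 30s
    rules:
      # Precompute the p95 reconciliation duration used in dashboards
      - record: optipod:reconciliation_duration_seconds:p95
        expr: histogram_quantile(0.95, rate(optipod_reconciliation_duration_seconds_bucket[5m]))
      # Aggregate the patch rate away from per-container labels
      - record: optipod:webhook_patches_applied:rate5m
        expr: sum by (resource_type) (rate(optipod_webhook_patches_applied_total[5m]))
```

Dashboards and alerts can then query `optipod:reconciliation_duration_seconds:p95` directly instead of re-evaluating the `histogram_quantile` on every panel refresh.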
## Best Practices

- Use recording rules for frequently queried metrics
- Set appropriate scrape intervals (30s recommended)
- Monitor metric cardinality to avoid performance issues
- Use label matchers in queries to reduce data volume
- Create dashboards for common monitoring scenarios
- Set up alerts for critical conditions
- Document custom queries for team reference
- Review metrics regularly to identify unused metrics
## Related Documentation

- Observability Guide - Complete observability setup
- Operations Guide - Operational procedures
- Troubleshooting - Common issues and solutions
- Architecture - System architecture and components