Cost-Aware Observability on K8s: Balancing Scrape Intervals, Retention and Cardinality
Cost-aware observability in Kubernetes is essential for ensuring effective monitoring without incurring high costs or overloading the system with data. This involves carefully balancing three key factors — scrape intervals, metric retention and cardinality management. This blog walks through real-world examples and code snippets for each of these areas and explains how to optimize your Kubernetes observability stack for performance and cost-effectiveness.
Balancing Scrape Intervals for Cost Efficiency
Scrape intervals determine how often a monitoring system (such as Prometheus) collects metrics from Kubernetes components or applications. A shorter interval provides greater accuracy but increases data volume, processing costs and storage requirements.
Real-World Example
A production cluster with hundreds of pods that scrapes metrics every 10 seconds quickly accumulates data, driving up storage costs and slowing query performance. Adjusting scrape intervals based on how important each metric is reduces costs while maintaining the visibility you need.
Code Snippet (Prometheus scrape_configs Example)
scrape_configs:
  - job_name: 'kube-apiserver'
    scrape_interval: 10s  # Important, rapid metrics
    static_configs:
      - targets: ['apiserver.k8s.local:6443']
  - job_name: 'kube-node'
    scrape_interval: 30s  # Stable node-level metrics
    static_configs:
      - targets: ['node-exporter.k8s.local:9100']
  - job_name: 'application-metrics'
    scrape_interval: 60s  # Less critical application metrics
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
In this setup, kube-apiserver metrics are scraped frequently at 10 seconds for timely insights, while stable node metrics are collected every 30 seconds and application metrics are collected less often, at 60 seconds. This saves considerable storage and computing costs while capturing meaningful data where needed.
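To confirm that the savings actually materialize, Prometheus's built-in per-target metric scrape_samples_scraped reports how many samples each scrape returns. Below is a minimal recording-rule sketch that tracks ingestion volume per job; the group name and recorded metric name are illustrative assumptions, not fixed conventions.

groups:
  - name: ingestion-volume                       # assumed group name
    rules:
      - record: job:scrape_samples_scraped:sum   # assumed recorded metric name
        expr: sum by (job) (scrape_samples_scraped)  # samples returned per scrape, summed per job

Comparing this series before and after changing scrape_interval makes the ingestion reduction for each job directly visible.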
Smart Retention Policies to Manage Storage Costs
Raw, full-fidelity metrics consume significant storage, especially in large clusters. Implementing tiered retention policies based on metric purpose and analysis needs ensures cost-effective long-term observability.
Industry Practice
- High-Value Metrics (e.g., Cluster Health, Service Latency): Retain full resolution for 30 days.
- Medium-Value Metrics (e.g., Pod Resource Usage): Retain full resolution for seven days, then downsample for up to 90 days.
- Debug or Ephemeral Metrics: Retain only for 24 hours.
Example Prometheus Retention Flags
--storage.tsdb.retention.time=30d   # Full retention for 30 days
--storage.tsdb.retention.size=50GB  # Optional size-based retention cap
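In a Kubernetes cluster, these flags are typically supplied as container arguments. A minimal sketch of the relevant fragment of a Prometheus Deployment spec (the image tag and filesystem paths are illustrative assumptions):

containers:
  - name: prometheus
    image: prom/prometheus:v2.53.0             # illustrative image tag
    args:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=30d      # full-resolution retention window
      - --storage.tsdb.retention.size=50GB     # size cap; whichever limit is hit first applies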
For long-term storage with downsampling, integrating Prometheus with tools such as Thanos or Cortex lets you move older metrics to cheaper object storage at reduced resolution.
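With Thanos, for example, the Compactor exposes per-resolution retention flags that map onto the tiers above. A sketch of its container arguments (the image tag, paths and exact retention windows are assumptions):

containers:
  - name: thanos-compact
    image: quay.io/thanos/thanos:v0.35.0            # illustrative image tag
    args:
      - compact
      - --wait                                      # run continuously rather than once
      - --data-dir=/var/thanos/compact
      - --objstore.config-file=/etc/thanos/objstore.yaml
      - --retention.resolution-raw=30d              # keep raw samples for 30 days
      - --retention.resolution-5m=90d               # keep 5-minute downsamples for 90 days
      - --retention.resolution-1h=180d              # keep 1-hour downsamples for 180 days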
Controlling Metric Cardinality — The Biggest Cost Driver
Cardinality refers to the number of unique time series generated by all metric labels combined. Because every distinct combination of label values creates a new series, the series count grows multiplicatively with each label you add, driving up ingestion rates, storage and hardware or cloud costs.
Real-World Scenario
An application that labels metrics by pod_id, container_id, user_id and request_id generates millions of unique series per day, inflating costs and reducing query performance.
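Before dropping anything, it helps to watch overall series growth. One option is an alerting rule on Prometheus's built-in prometheus_tsdb_head_series metric; the group name, alert name and threshold below are assumptions rather than recommendations.

groups:
  - name: cardinality-guard                        # assumed group name
    rules:
      - alert: HighSeriesCardinality
        expr: prometheus_tsdb_head_series > 2e6    # assumed series budget for this Prometheus instance
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Active time series exceed the configured cardinality budget"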
Strategies and Code-Snippet Example to Reduce Cardinality in Prometheus
- Drop unnecessary labels early with metric_relabel_configs:
metric_relabel_configs:
  - action: labeldrop
    regex: 'user_id|request_id'  # Remove high-cardinality user/request labels from every scraped series
- Aggregate metrics to a higher level (e.g., per namespace or deployment):
relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_label_app]  # assumes pods carry an `app` label naming their Deployment
    target_label: deployment
metric_relabel_configs:
  - action: labeldrop
    regex: 'pod_id|container_id'  # Drop pod-specific labels so series aggregate at the deployment level
Reducing cardinality helps manage storage costs and improves query speed without sacrificing critical monitoring insights.
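Relabeling removes labels at ingestion time; when teams still need an aggregate view of the dropped dimensions, Prometheus recording rules can pre-compute namespace-level series from the standard cAdvisor metrics. A sketch (the recorded metric names follow common conventions but are otherwise assumptions):

groups:
  - name: namespace-aggregation
    rules:
      # Pre-aggregate container CPU usage per namespace so dashboards query
      # one series per namespace instead of one per pod.
      - record: namespace:container_cpu_usage_seconds:rate5m
        expr: sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
      # Same idea for memory working set.
      - record: namespace:container_memory_working_set_bytes:sum
        expr: sum by (namespace) (container_memory_working_set_bytes)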
Combining Observability Layers and Cost Optimization
In a robust Kubernetes observability strategy:
- Metrics provide health and performance signals.
- Logs capture detailed context for troubleshooting.
- Traces map distributed workflows.
Cost-aware observability balances how frequently each layer collects data, how long it is retained and at what level of detail. For example, metrics can be scraped less frequently, while logs might temporarily retain more granular data during incident investigations.
To achieve the best observability results, you can use full-stack platforms like Middleware, which help reduce cloud observability costs while providing the best insights for your applications.
Real-World Example
A platform engineering company optimizes Kubernetes observability by:
- Setting different scrape intervals as per workload criticality.
- Implementing retention policies with Thanos to keep detailed metrics for 30 days and aggregated metrics for 90 days.
- Dropping high-cardinality labels dynamically through relabeling rules to control ingestion rates.
- Correlating logs and traces selectively for anomaly detection while minimizing bulk data storage.
This approach reduces monitoring costs by over 40% while maintaining actionable visibility for teams.
Key Takeaways
- Adjust scrape intervals as per metric criticality to capture essential data efficiently.
- Implement tiered retention policies and use downsampling where appropriate.
- Control cardinality by dropping or aggregating high-cardinality labels early in the pipeline.
- Use multi-layer observability to balance cost and insights across metrics, logs and traces.
- Leverage tools such as Prometheus, Thanos and Cortex along with dynamic relabeling for automation and cost control.
With these practices, Kubernetes teams can responsibly scale observability, ensuring that monitoring costs are predictable and visibility remains usable and reliable.
This detailed, cost-aware observability guide leverages industry best practices and proven real-world approaches to help Kubernetes operators optimize monitoring spend without compromising reliability or performance.


