How to Avoid Drowning In Cloud-Native Observability Data

May 4, 2022 / May 3, 2022 | Bill Doerrfeld | Tags: alert logs, cloud-native data retention, cloud-native storage, data storage, distributed traces, metrics, observability

by Bill Doerrfeld

Cloud-native monitoring is becoming more essential as companies seek to improve how their infrastructure operates. Monitoring data can be leveraged for root cause analysis, faster incident response and maintaining available, highly performant user experiences.

But traditional application performance monitoring (APM) doesn’t always cut it in this new cloud-native stack. Traditional and cloud-native environments differ fundamentally in scale and in the volume of data they produce. Furthermore, when everything runs in containers, you must design and optimize monitoring around the ephemerality of data.

Having a window into cloud-native performance can better equip SREs and platform engineers with real-time insights, helping them respond quickly when issues come up. Thus, in recent years we’ve seen a swell of interest in cloud-native full-stack observability that involves metrics, logs and tracing to expose the root cause of incidents.

A key goal of observability is to decrease the mean time to recovery (MTTR). Surprisingly, however, this metric is actually increasing at many companies, says Martin Mao, co-founder and CEO of Chronosphere. He suggests that engineers may be experiencing data fatigue, since picking out the alerts that matter is tricky amid a barrage of notifications. I recently met with Mao to gather perspectives on managing cloud-native observability data. Below, we’ll cover some tips that could help you keep your head above water in this sea of observability data.

Trends Around Observability

First, it’s clear that many open source tools are emerging to support the cloud-native observability mission, which is agnostic of cloud provider or computing environment. A recent CNCF micro-study found engineers are actively using projects like OpenTelemetry, Fluentd, Jaeger, OpenTracing, Cortex and OpenMetrics.

Mao, who previously led the observability team at Uber, has an intimate perspective on the needs of today’s platform operations. At Uber, developers realized that APM was insufficient and sought to craft their own tools, which gave birth to projects like M3, the open source metrics platform, and Jaeger, the open source distributed tracing system.

But all the investment into flashy cloud-native technology does have a downside. According to Mao, there is mounting concern about the sheer amount of data these tools produce. And the growth in observability data is far outpacing the business and infrastructure growth, meaning it’s difficult to keep up. Not only is it difficult to parse through, but an overabundance of observability data could create new data lakes, bringing new data storage and integration concerns.

“Since more data is being produced, there are more and more alerts to sift through, which starts to encumber your ability to find out how to resolve these issues,” said Mao.

Solution: Optimize Retention and Resolution

According to Mao, organizations can navigate these issues by setting limits around data retention and resolution. Let’s dig into what these concepts mean.

Data Retention

With many tools producing escalating data dimensions, your observability data can quickly accumulate. The first method to stem the tide is to place limits on when data is collected and the length of time it’s stored.

For example, is it necessary to hold all the data collected during a single deployment process indefinitely? With today’s iterative development cycles, it’s probably not prudent to store these data points forever. This could mean cutting your default retention window down from 12 months to something much shorter.

Also, failing to set limits on when data is collected contributes to surging observability data. For example, recording a debug endpoint in real-time only makes sense when actively debugging, says Mao. Otherwise, you’re needlessly collecting data.
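To make the retention idea concrete, here is a minimal sketch in Python. The category names, the retention windows and the `prune` helper are all hypothetical and chosen only for illustration; they are not taken from Mao or from any specific observability product.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention windows per data category (illustrative values only).
RETENTION = {
    "deployment": timedelta(days=30),
    "capacity": timedelta(days=365),
}

def prune(points, now, retention=RETENTION):
    """Keep only points still inside the retention window for their category."""
    kept = []
    for point in points:
        window = retention.get(point["category"])
        # Categories without a configured window are kept untouched.
        if window is None or now - point["timestamp"] <= window:
            kept.append(point)
    return kept

now = datetime(2022, 5, 4, tzinfo=timezone.utc)
points = [
    {"category": "deployment", "timestamp": now - timedelta(days=90), "value": 1.0},
    {"category": "deployment", "timestamp": now - timedelta(days=2), "value": 2.0},
    {"category": "capacity", "timestamp": now - timedelta(days=90), "value": 3.0},
]
kept = prune(points, now)
print([p["value"] for p in kept])  # the 90-day-old deployment point is dropped
```

The point of the sketch is that retention is decided per category rather than with one blanket default: short-lived deployment data expires quickly, while long-horizon capacity data survives.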

Data Resolution

Data resolution refers to the granularity of recorded time-series data. As Mao points out, recording data every second versus every hour is essentially a 3600X difference. Thus, optimizing the resolution of data collection is extremely important to reduce costly storage.
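The arithmetic behind that figure is easy to check. The short Python sketch below counts the data points a single time series produces per day at a given collection interval; the `points_per_day` helper is a hypothetical name used only for this illustration.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

def points_per_day(interval_seconds):
    """Number of samples one time series produces per day at this interval."""
    return SECONDS_PER_DAY // interval_seconds

per_second = points_per_day(1)               # one sample every second
per_hour = points_per_day(SECONDS_PER_HOUR)  # one sample every hour
ratio = per_second // per_hour

print(per_second, per_hour, ratio)  # 86400 24 3600
```

Multiply that ratio by the number of series and the retention period, and the storage impact of the collection interval becomes clear.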

Adjusting data resolution for observability data will largely depend on the use case at hand. To return to our CI/CD example, if you’re collecting deployment figures when rolling something back, you want high, per-second resolution since it’s a pivotal moment, says Mao. On the other hand, if you’re conducting capacity planning for an entire year, you probably don’t need to retain historical capacity information by the second, as that would be far too granular.
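One common way to lower the resolution of data that has already been collected is downsampling: grouping fine-grained samples into coarser time buckets and keeping one aggregate per bucket. The following Python sketch assumes samples are `(unix_timestamp, value)` pairs and uses the mean as the aggregate; both choices, and the `downsample` function itself, are assumptions made for illustration rather than a description of any particular tool.

```python
from collections import defaultdict
from statistics import mean

def downsample(samples, bucket_seconds):
    """Average (unix_ts, value) samples into buckets of bucket_seconds.

    Returns a list of (bucket_start, mean_value), sorted by bucket start.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]

# Two hours of per-second samples; the value is constant within each hour.
raw = [(ts, float(ts // 3600)) for ts in range(7200)]
hourly = downsample(raw, 3600)

print(len(raw), len(hourly))  # 7200 2
print(hourly)                 # [(0, 0.0), (3600, 1.0)]
```

A mean hides short spikes, so a real pipeline might keep the maximum or a percentile alongside it. Per-second data can stay available for the pivotal rollback window and be downsampled afterward for long-term capacity planning.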

Other Tips

Optimizing data retention and resolution can limit the amount of recorded data. This helps keep a smaller footprint and produces fewer data points to sift through. Adjusting resolution is often a far better tradeoff than other monitoring approaches, such as only recording 10% of the entire production fleet, which could leave many user pain points in the dark.

Having a way to dynamically opt in and opt out of the data collection process can alleviate some of this upfront work. This can be thought of as automatically applying more intelligent defaults. Once you know what works, you could set common patterns around observability data collection and storage processes, which could be shared throughout an organization.
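As a rough illustration of dynamic opt-in and opt-out, the Python sketch below gates recording on a set of enabled sources that can be changed at runtime. The `Collector` class and the source names are hypothetical; a real system would likely drive these toggles from configuration or a control plane.

```python
class Collector:
    """Records data only for sources that are currently opted in."""

    def __init__(self, default_enabled):
        self._enabled = set(default_enabled)
        self.recorded = []

    def opt_in(self, source):
        self._enabled.add(source)

    def opt_out(self, source):
        self._enabled.discard(source)

    def record(self, source, value):
        # Drop the data point unless its source is opted in.
        if source not in self._enabled:
            return False
        self.recorded.append((source, value))
        return True

collector = Collector(default_enabled={"requests"})
collector.record("debug", "heap dump 1")  # dropped: debug is off by default
collector.opt_in("debug")                 # an engineer starts debugging
collector.record("debug", "heap dump 2")  # kept
collector.opt_out("debug")                # debugging session is over
collector.record("debug", "heap dump 3")  # dropped again
collector.record("requests", 200)         # always-on source is kept

print(collector.recorded)  # [('debug', 'heap dump 2'), ('requests', 200)]
```

Here the debug source is collected only while someone is actively debugging, which mirrors the earlier point about debug endpoints.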

But this will only go so far. To better act upon data, teams will require tools to modify and visualize the data they’re collecting, says Mao. Plus, since operators likely won’t need every dimension of every data point when debugging, they might benefit from mechanisms that pre-compute the answers they need.
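One way to picture pre-computing answers is to aggregate away the dimensions an operator rarely queries, so the common question is answered from a small rollup instead of the raw data. The Python sketch below is a hypothetical example: the label names, the sample values and the `pre_aggregate` function are all invented for illustration.

```python
from collections import defaultdict

def pre_aggregate(samples, keep_labels):
    """Sum sample values, grouping only by the labels in keep_labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        # Build the group key from the retained labels; all others are dropped.
        key = tuple((name, labels[name]) for name in keep_labels)
        totals[key] += value
    return dict(totals)

# Hypothetical error counts, one entry per service instance.
samples = [
    ({"service": "checkout", "instance": "a", "status": "500"}, 3.0),
    ({"service": "checkout", "instance": "b", "status": "500"}, 2.0),
    ({"service": "search", "instance": "c", "status": "500"}, 1.0),
]
errors_by_service = pre_aggregate(samples, keep_labels=("service",))

print(errors_by_service)
# {(('service', 'checkout'),): 5.0, (('service', 'search'),): 1.0}
```

The rollup has one entry per service rather than one per instance, which is usually enough to answer "which service is failing?" without scanning every raw series.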

Final Thoughts

The observability trend can bring great benefits to help digital platforms optimize their operations. Observability can help reduce the time to respond to issues and improve end user experiences. “Observability plays a key part in that—it provides you visibility into whether these practices are useful at all.”

However, the acceleration toward cloud-native architecture creates a storm of new alerts and signals. If unaddressed, this data can quickly pile up, requiring greater visibility into the data itself. “The whole value observability brings needs to be put in light,” said Mao.

The amount of data the world is producing is extreme. Data takes up space. It accumulates and is costly to store at scale. Yet, people still think data is free and usually don’t plan for the data life cycle. “The mindset of those in charge of an observability backend shouldn’t be to create a data lake,” says Mao. “At a certain point, you have to do something about it.”

To counteract this trend, operators can’t treat every piece of data in the same way. To avoid drowning in lakes of observability data, limit unnecessary data collection and make smarter choices about when data is collected, its granularity, how it’s visualized and how long it’s stored.