Telemetry Data: The Puzzle Pieces of Observability
The Increasing Importance of Telemetry Data
The growing shift to digital business is causing a surge in telemetry data emitted from infrastructure and custom applications. According to some estimates, the data growth is well over 20% year-over-year. Telemetry data such as logs, metrics, traces and events provide insights into application performance, infrastructure reliability, user experience and security threats. However, as businesses continue to scale and move their operations to the cloud, telemetry data volume and complexity increase disproportionately, posing a critical problem for maintaining effective observability.
The dynamic nature of telemetry data makes it difficult for enterprises to confidently deliver the right data, in the right format, to the right teams at the right time, all while ensuring sensitive PII is handled properly. Without a clear understanding of telemetry data and with poor control over volumes, organizations make uninformed choices about what data to keep and what to discard, given the high cost of retaining it all. That erodes trust in the collected data, degrading observability and increasing resolution times. The impact on the business is a subpar customer experience and heightened compliance risk. To address this, organizations must change how they manage their telemetry data.
The Challenge with Telemetry Data
Telemetry data is inherently unstructured, complex and voluminous. Take logs, for example: teams may be overwhelmed by system logs, application logs, security logs and transaction logs. These logs may be in text, JSON or XML and generated by servers, networks, databases and firewalls. Some are merely informational, some are warnings and others are errors or fatal events. Logs come in all shapes and sizes, and understanding how to differentiate them from a value perspective is critical to ensure proper use and disposition.
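To make that differentiation concrete, the sketch below normalizes plain-text and JSON log lines into a single structure with a source and severity. It is purely illustrative: the field names, regex and severity levels are assumptions for the example, not any particular platform's schema.

```python
import json
import re
from dataclasses import dataclass

# Illustrative sketch: normalize heterogeneous log lines (JSON or plain text)
# into one structure so they can be compared by source and severity.

@dataclass
class LogRecord:
    source: str    # hypothetical origin label, e.g. "app", "database", "firewall"
    severity: str  # e.g. "INFO", "WARN", "ERROR", "FATAL"
    message: str

# Assumed plain-text layout: "<SEVERITY> <message>"
TEXT_PATTERN = re.compile(r"^(?P<severity>INFO|WARN|ERROR|FATAL)\s+(?P<message>.*)$")

def normalize(raw: str, source: str) -> LogRecord:
    """Parse a raw log line, whether JSON or plain text, into a LogRecord."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        payload = None
    if isinstance(payload, dict):
        return LogRecord(source=source,
                         severity=str(payload.get("level", "INFO")).upper(),
                         message=str(payload.get("msg", raw)))
    match = TEXT_PATTERN.match(raw)
    if match:
        return LogRecord(source=source, **match.groupdict())
    return LogRecord(source=source, severity="INFO", message=raw)

print(normalize('{"level": "error", "msg": "connection refused"}', "app"))
print(normalize("WARN disk usage at 91%", "database"))
```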
A lack of understanding of logs, combined with the challenge of managing them, creates friction across teams. Given the high volume of data and the pressure to reduce costs, SRE teams are forced to decide what to keep and what to drop without much context. If they retain everything, they may keep MTTR down during an incident, but they risk exceeding their budgets, and the accompanying noise makes it harder for developers to quickly find what they need to debug. If, on the other hand, SREs drop certain data without context, MTTR goes up and developers and business owners trying to solve a customer problem grow frustrated. In some cases, SREs urge development teams to log less in their custom applications, a short-term cost-saving measure that can backfire, resulting in more time wasted on debugging and troubleshooting when something goes wrong. Without the relevant logs, developers struggle to diagnose problems efficiently, ultimately leading to increased downtime and resource expenditure.
An elegant solution to the problem is to use an intelligent telemetry pipeline that profiles the ingested data, automatically identifies the log patterns and provides context for what to keep and what to discard.
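As a rough illustration of what such pattern profiling involves, the sketch below masks variable tokens (numbers, IPs, hex IDs) so that log lines sharing the same template are counted together. Production pipelines typically use more sophisticated template-mining algorithms; the masking rules here are simplifying assumptions.

```python
import re
from collections import Counter

# Illustrative sketch: mask variable tokens so that log lines sharing the same
# template are counted together, surfacing the patterns that drive volume.

MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def template_of(line: str) -> str:
    """Reduce a raw log line to its pattern by masking variable fields."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

def profile(lines: list[str], top: int = 5) -> list[tuple[str, int]]:
    """Return the most frequent log templates and their counts."""
    return Counter(template_of(line) for line in lines).most_common(top)

logs = [
    "user 1041 logged in from 10.2.3.4",
    "user 2203 logged in from 10.2.9.8",
    "payment 881 failed: timeout after 3000 ms",
]
for template, count in profile(logs):
    print(count, template)
```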
The Way Forward
Telemetry data is a critical enterprise asset that requires effective management to yield cost-effective insights for both technical troubleshooting and business performance monitoring. A well-defined system grounded in data engineering principles is essential for understanding and optimizing telemetry data. This system must be inherently adaptive and capable of responding to evolving business contexts. Modern telemetry pipelines are designed to collect, understand, transform and route data to streamline operational processes while controlling data volumes and unlocking new avenues for data-driven decision-making across the organization.
A telemetry pipeline provides context, separates signal from noise, automatically identifies offending applications and alerts SREs and developers when data aberrations are detected. It can take automatic actions such as reducing data volume in case of unexpected spikes or enabling full-fidelity data flow to observability platforms in case of incident detection.
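A minimal sketch of that adaptive behavior might look like the following: a controller that cuts the sampling rate when volume spikes well past a baseline and restores full-fidelity flow while an incident is open. The thresholds, sampling rates and field names are illustrative assumptions, not any vendor's configuration.

```python
from dataclasses import dataclass

# Illustrative sketch: pick a sampling rate from the current event rate and
# incident status, mimicking the automatic actions described above.

@dataclass
class PipelineState:
    baseline_events_per_min: float
    spike_factor: float = 3.0   # assumed multiple of baseline that counts as a spike
    incident_open: bool = False

def choose_sample_rate(state: PipelineState, current_events_per_min: float) -> float:
    """Return the fraction of events to forward to the observability platform."""
    if state.incident_open:
        return 1.0   # full fidelity while responders need complete context
    if current_events_per_min > state.spike_factor * state.baseline_events_per_min:
        return 0.1   # aggressive sampling during an unexpected volume spike
    return 1.0       # normal conditions: forward everything

state = PipelineState(baseline_events_per_min=50_000)
print(choose_sample_rate(state, 40_000))   # 1.0: normal traffic
print(choose_sample_rate(state, 200_000))  # 0.1: spike detected, reduce volume
state.incident_open = True
print(choose_sample_rate(state, 200_000))  # 1.0: incident open, full fidelity
```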
To unlock telemetry data’s value, enterprises must first understand the data by profiling and analyzing it to identify log patterns, detect aberrations and validate data quality. The optimization techniques available in pipelines can then separate signal from noise, reducing costly noisy data through sampling, deduplication and other filtering techniques; telemetry pipelines can cut data volume by up to 70% this way. Pipelines can also format the data as required by observability platforms, transform it to add context or obfuscate PII, and route it to the desired observability or analysis platforms. Intelligent routing rules can direct critical data to key analytical systems and less critical data to low-cost storage, helping manage costs while making sure that all required data is available when needed. Pipelines can also adapt to changing conditions, detecting data aberrations and triggering alerts for corrective action, for example by adjusting data flow between routes based on incidents.
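The sketch below illustrates three of those optimization stages in simplified form: deduplication of repeated messages, PII obfuscation and rule-based routing. The PII pattern, severity-based routing rule and destination names are assumptions made for the example.

```python
import hashlib
import re

# Illustrative sketch: deduplicate repeated messages, obfuscate PII and route
# events by severity. Patterns, rules and destinations are assumptions.

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
seen_hashes: set[str] = set()

def is_new(message: str) -> bool:
    """Return True only the first time an identical message is seen."""
    digest = hashlib.sha256(message.encode()).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

def obfuscate_pii(message: str) -> str:
    """Mask email addresses before the message leaves the pipeline."""
    return EMAIL.sub("<EMAIL>", message)

def route(severity: str) -> str:
    """Send high-value events to the analytics platform, the rest to cheap storage."""
    return "observability-platform" if severity in {"ERROR", "FATAL"} else "low-cost-archive"

events = [
    {"severity": "ERROR", "message": "login failed for jane.doe@example.com"},
    {"severity": "ERROR", "message": "login failed for jane.doe@example.com"},  # duplicate, dropped
    {"severity": "INFO", "message": "heartbeat ok"},
]
for event in events:
    if is_new(event["message"]):
        print(route(event["severity"]), obfuscate_pii(event["message"]))
```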
To enable such capabilities, telemetry pipelines make use of data engineering principles and AI capabilities that help users understand the data, its contents and its quality, and trigger alerts and responses when the context changes. However, a skills gap may prevent organizations from applying such data engineering and AI approaches to their pipelines. These capabilities must therefore be built into the telemetry pipeline itself, rather than requiring DevOps, ITOps or development teams to train in data engineering. Telemetry pipelines should support developers’ natural workflows, ensuring they have the data required to debug application issues quickly and can devote most of their time to creating and innovating rather than worrying about what and how much to log. Likewise, they should support SREs’ natural workflows so they can detect performance issues and spikes in data volumes in time, collaborate with developer teams to solve the issues and meet their SLOs.
The Opportunity
Telemetry data is set to surge in both volume and variety. At the same time, more business teams will ask for access to the data, and compliance regulations will require longer retention periods. The volume challenge will be exacerbated by new AI teams with an insatiable appetite for data. Forward-thinking organizations will implement robust governance and policies that do not restrict data purely through a cost lens but evaluate it for its quality and the value it delivers. They will put mechanisms in place to build trust in the data, control unnecessary volumes and find ways to unearth and extract more value from it. These leaders will leverage technology to make user workflows more efficient and responsive. Access to the right data will continue to be the foundation for successful customer experience, security and AI strategies, and it will be the competitive advantage in the brave new digital world.
To learn more about Kubernetes and the cloud native ecosystem, join us at KubeCon + CloudNativeCon North America, in Salt Lake City, Utah, on November 12–15, 2024.