Architecting Observability in a Cloud-Native World With eBPF
Bill Mulligan, a Cilium committer working at Isovalent to grow both the Cilium and eBPF communities, co-authored this article.
Cloud-native applications and platforms, built from microservices and containers and running on platforms like Kubernetes, generate exponentially more observability data because of their distributed nature. Excessive data can overwhelm engineers, decrease the value of the data collected and create ongoing storage costs. Traditional observability tools carry many visible and hidden costs in trying to keep up with this data deluge, including difficult troubleshooting, missing information and wasted resources. Because they rely on code instrumentation, they affect response times and resource profiles, significantly impacting performance and increasing the total cost of ownership.
Extended Berkeley Packet Filter (eBPF) is changing how platform observability can be architected because it operates at the Linux kernel level, enabling comprehensive data collection with minimal performance impact. eBPF’s flexible implementation, low overhead and efficient resource usage make it an attractive option for observability.
Many new open-source observability projects like Caretta, Hubble, Pixie and Tetragon are leveraging eBPF. These projects use eBPF’s capabilities to monitor network traffic, gain insights into system behavior, provide telemetry data and enable better troubleshooting in cloud-native environments.
Observability in a Cloud(y) World
Almost every engineering team can relate to the following scenario: It begins with a bug surfacing in production that proves to be a challenging puzzle to investigate and resolve. Unfortunately, the bug slipped through the cracks during CI and evaded detection by monitoring. The existing logs proved insufficient to shed light on the issue, and none of the metrics gave any indication of an emerging problem. Hours were spent troubleshooting the incident until it became clear that an important step was needed to prevent such a painful process in the future: bridging the gap with additional logs and custom metrics that could capture relevant indicators of this exact issue. The goal was set: to ensure we never encounter such an ordeal again, at least not for the same reason. It’s like the old saying goes: “Fool me once, shame on you; fool me twice, shame on me.”
However, just a week later, a team member noticed sharp growth in the associated observability data. Deep down, they knew what had gone wrong: The log line that was added generated a much larger volume of data than anticipated, or the new metric exhibited significantly higher cardinality than originally planned.
Instead of getting a picture of how the system is (and isn’t) working, users are left with another 1,000 data points to sift through, regardless of how valuable the data will eventually be. In the cloud-native world, the number of interactions to observe between services keeps increasing while, simultaneously, each additional data point takes a toll on applications, infrastructure and people. However, trying to forgo observability in a cloud-native environment also comes at a high cost: a drop in operational efficiency, an increase in unplanned outages and a reduction in application developer efficiency. Being caught in the middle is not where any organization wants to be. To keep a clear vision of what is happening, our observability solutions need to be architected for the new cloud-native world. By examining the visible and invisible costs of this data explosion, we can start to understand how to architect an observability solution that is truly cloud native and avoid the price of no observability.
Cloud-native platforms are already undergoing a silent revolution where eBPF is changing projects “under the hood” by enabling enrichment with cloud-native context. By dealing with the data deluge, eBPF is also making observability ready for distributed computing, leaving a minimal footprint on application and infrastructure performance, lowering the levy of integration and ownership and providing more accurate information from within the kernel. For cloud-native observability, eBPF may be part of a silent revolution, but it is finally helping us see.
Cloud Native Makes Observability Data Size Cloudy
Many observability solutions that exist today were built for a different world. Engineering teams were mostly working on single-tiered monolith applications that combined user-facing interfaces and database access into a single platform.
Today, cloud-native computing brings with it great promise for faster software development life cycles by decoupling different pieces of the application into microservices. Instead of one system to watch, each user-facing traffic event triggers numerous API calls between various microservices. Gaining visibility into these smaller and more distributed, interdependent pieces becomes a much harder task at scale. There are many moving parts, hundreds of interactions and thousands of places where the root cause of the next problem could originate.
Because of these differences in architecture, cloud-native environments broadcast massive amounts of data—somewhere between 10 and 100 times more than traditional monolithic environments. Tracking all of this communication is what makes observability data volumes rise so rapidly, leaving companies to pick up the tab while still searching for a needle in an ever-larger haystack when things go wrong. Companies are collecting more data, but they are not getting more value; the value they get is often decreasing because of information overload. On the other hand, with all of these interactions, not having observability is even worse, like trying to find a needle in an entire hay field, with potentially devastating consequences for the business when something goes wrong.
As organizations find themselves dealing with uncontrollable data volumes accumulating across the entire organization, without a clear decision maker addressing the fundamental question of “Is this data truly necessary?”, doubts start to emerge about how practical and mission-critical their observability data really is.
Predicting the Unpredictable: The (In)Visible Cost of Observability in Cloud Native
From the platform team’s point of view, it makes sense to always opt for more observability data. They know the pain and business impact of not having observability, and the associated toll of observability data is someone else’s problem. Yet studies show that less than 1% of observability data is ever explored by users. Growing data volumes eventually hurt the end user, too: A developer logging into their observability tool is often overwhelmed by vast amounts of information, and reaching the data they need, that 1%, is nearly impossible. This architecture comes with many visible and hidden costs.
The size of observability data isn’t the only problem; it is also extremely unpredictable. Predicting how many logs and metrics dynamic applications will generate from one day to the next is difficult because data volume is tied to inherently uncertain application demand. This can lead to wild fluctuations in observability spending without any added benefit to the application teams.
And the costs don’t end there. One of the hardest challenges for any modern observability tool is keeping up with the ever-growing scale of data—delivering a comprehensive, accurate picture of a system while adding minimal resource overhead. Most observability solutions available today rely on code instrumentation to collect data about an application’s performance. This method can impact the application’s response time and its CPU and memory profiles in ways that are hard to estimate or measure. With these resource impacts occurring across a complex microservices-based environment, the total cost of ownership of the observability stack can rise dramatically.
Without a performance-first mindset, the once-hidden overhead required to observe a system can suddenly become very visible and painful at scale. When the observability solution dramatically impacts the resource consumption of the very applications it is in charge of monitoring, eventually limiting their performance, engineers become painfully aware of the overhead inflicted by their observability stack.
The visible cost of these uncontrollable data volumes is becoming crystal clear: Over 70% of teams do not have a full APM tier in place (DevOps Pulse 2022). They turn to lower tiers like logs and custom metrics, avoiding the pricier, high-cardinality tiers.
What is the alternative hidden cost, you ask? Human resources: expensive engineers and infrastructure teams maintaining complex stacks. Not having application monitoring in your cloud-native architecture means longer troubleshooting times, a harder path to solving complex issues and sometimes not even knowing a logical bug is occurring. Growing data volumes also inflict pain on the end user: Data queries become a real bottleneck, manifesting in slow-loading dashboards and over-resourced log management solutions. Finally, ever-expanding observability data has a profound impact on how teams respond to burning issues. In this vast sea of complex data, junior team members often find themselves lost, escalating problems to power users who hold hard-earned knowledge of where to seek answers.
All of these (in)visible costs add up if you don’t account for observability in your cloud-native architecture. They become very visible when a prolonged outage happens, when application teams can’t ship a critical new feature because of concerns over the platform’s stability, or when engineers spend expensive hours troubleshooting basic issues. If the costs of both having and not having observability in cloud-native environments are high, businesses find themselves between a rock and a hard place. So what’s the path forward?
Making the Invisible Visible With eBPF-Based Observability
One promising solution to overcome the observability cost paradox is the Linux kernel technology eBPF. Unlike traditional monitoring tools that require heavy instrumentation of code or the installation of additional agents, eBPF operates at the kernel level and can see everything happening within the system. This enables seamless data collection with minimal performance impact for the monitored applications. By being in the kernel, eBPF offers complete observability with low overhead.
However, as outlined above, observability costs don’t come from application overhead alone. The actual implementation and maintenance of the observability stack and its storage is where the majority of the costs lie. eBPF offers a significantly different approach here. For implementation, rather than requiring code to be reinstrumented or an agent to be added, eBPF programs can be added and updated at runtime, completely out-of-band to the application. This allows fine-grained visibility into the system’s internals, offering deep insights into network traffic and application performance with zero code changes or engineering effort. It promises full monitoring coverage with immediate time to value, relieving much of the need for expert human teams to constantly maintain a complex and fragile observability stack.
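To make that out-of-band attachment concrete, here is a minimal sketch using bpftrace, one of the simplest eBPF front ends (assuming it is installed on the host; it is not part of the article’s stack): it attaches to a kernel tracepoint on a live system and counts openat() calls per process, with no application restart or code change.

```
# bpftrace compiles this one-liner into an eBPF program and attaches it
# to a kernel tracepoint at runtime; Ctrl-C detaches it and prints the
# per-process counts held in the @opens map.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @opens[comm] = count(); }'
```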
Being able to collect granular observability data much more easily is one side of the solution. eBPF also brings to the table a way to keep observability solutions cost-effective, even at high scale. It lets code access low-level kernel resources that would otherwise be complicated and costly (in terms of resource overhead) to reach from user space. This allows the observability solution to run very efficiently and out-of-band to the applications it monitors, reducing the total cost of ownership and eliminating unexpected effects of the observability stack on time-critical application code.
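As a rough illustration of that efficiency (again a hand-rolled bpftrace sketch, not any particular product’s implementation), the script below builds a latency histogram for the kernel’s vfs_read() entirely inside the kernel; only the finished summary ever crosses into user space, instead of one record per read.

```
# Time each vfs_read() call and aggregate the latencies into an
# in-kernel eBPF histogram map; user space only receives the summary
# when tracing stops.
sudo bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
  @latency_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
```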
Let’s take a look at DoorDash to see how eBPF works in the real world:
“As DoorDash experienced rapid growth over the last few years, we began to see the limits of our traditional methods of monitoring. Metrics, logs and traces provide vital information about our service ecosystem. However, these signals almost entirely rely on application-level instrumentation, which can leave gaps or conflicting semantics across different systems. We decided to seek potential solutions that could provide a more complete and unified picture of our networking topology.
One of these solutions has been monitoring with eBPF, which allows developers to write programs that are injected directly into the kernel and can trace kernel operations. By building at the kernel level, we can monitor network traffic at the infrastructure level, which gives us new insights into DoorDash’s backend ecosystem that’s independent of the service workflow.”
eBPF-based observability enables zero-instrumentation visibility into applications at low overhead while still seeing everything happening in the system. The last—and crucial—part, filtering the data and presenting it to the end user, isn’t done by eBPF itself but by higher-level projects that leverage eBPF as their underlying engine.
A View of the Open Source eBPF Observability Ecosystem
Let’s examine some of the open source projects leveraging eBPF for observability to better understand how they change the architecture of observability.
Caretta
Caretta helps teams instantly create a visual network map of the services running in their Kubernetes cluster. Caretta leverages eBPF to collect data efficiently and is equipped with a Grafana Node Graph dashboard to quickly display a dynamic map of the cluster.
The main idea behind Caretta is gaining a clear understanding of the interdependencies between the different workloads running in the cluster. Caretta maps all service interactions and their traffic rates, leveraging Kubernetes APIs to create a clean and informative map of any Kubernetes cluster. That map can be used for on-demand granular observability, cost optimization and security insights, allowing teams to quickly identify central points of failure or pinpoint security anomalies.
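Getting started is a short Helm install. The commands below follow Caretta’s README at the time of writing; the chart URL and the Grafana pod placeholder are assumptions to verify against the project repo.

```
# Install Caretta into its own namespace (chart location per the README;
# check the repo for current instructions).
helm repo add caretta https://groundcover-com.github.io/caretta
helm install caretta --namespace caretta --create-namespace caretta/caretta
# Port-forward the bundled Grafana and open the Node Graph dashboard:
kubectl port-forward -n caretta <caretta-grafana-pod> 3000:3000
```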
Hubble
Hubble is a network observability and troubleshooting component within Cilium (which is also based on eBPF for networking). Hubble uses eBPF to gain deep visibility into network traffic and to collect fine-grained telemetry data within the Linux kernel. By attaching eBPF programs to specific network events, Hubble can capture data such as packet headers, network flows, latency metrics and more. It provides a powerful and low-overhead mechanism for monitoring and analyzing network behavior in real time.
With the help of eBPF, Hubble can perform advanced network visibility tasks, including flow-level monitoring, service dependency mapping, network security analysis and troubleshooting. It then takes this data and aggregates it to present it to the user through the CLI or UI. Hubble enables platform teams to gain insights into network communications within their cloud-native environments and gives developers the ability to understand how their applications communicate without first becoming networking experts.
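In day-to-day use, that typically happens through the Hubble CLI. A few illustrative queries are sketched below; the flag names follow the Hubble documentation but should be verified against your installed version, and L7 filters like HTTP require protocol visibility to be enabled in Cilium.

```
# Stream flows from the cluster as they happen:
hubble observe --follow
# Narrow to HTTP traffic touching a specific pod:
hubble observe --protocol http --pod default/frontend
# Show only dropped traffic when debugging a connectivity issue:
hubble observe --verdict DROPPED
```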
Pixie
Pixie is an open source observability tool for Kubernetes applications. Pixie uses eBPF to automatically capture telemetry data without the need for manual instrumentation. Developers can use Pixie to view the high-level state of their cluster (service maps, cluster resources, application traffic) and also drill down into more detailed views (pod state, flame graphs, individual full-body application requests).
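A rough sketch of that workflow with the px CLI is shown below; the script names follow Pixie’s public docs, and `px scripts list` shows what your version actually bundles.

```
# Deploy Pixie into the current Kubernetes cluster:
px deploy
# High-level view of the cluster's namespaces, services and pods:
px run px/cluster
# Drill down into full-body HTTP requests captured via eBPF:
px live px/http_data
```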
Tetragon
Tetragon, also a part of Cilium, leverages eBPF to instrument the Linux kernel and monitor system calls, network activity, and other low-level events in real time for security observability. Using eBPF programs, Tetragon can capture and analyze system and application behavior at runtime, allowing it to detect potential security threats and policy violations.
With eBPF, Tetragon can inspect and analyze the system’s execution path, network connections and other relevant system events without requiring modifications to the applications themselves. Tetragon combines eBPF-based observability with customizable policies that let users define their own detection rules. When Tetragon detects suspicious or malicious activity based on these policies, it generates alerts or triggers additional actions, such as sending notifications or blocking network connections, translating low-level observability data into actionable information based on company policy.
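As a sketch of what such a policy looks like, the TracingPolicy below is modeled on the examples in the Tetragon docs; the field names and the fd_install hook should be checked against the CRD version you run.

```
# Trace the kernel's fd_install function (called when a file descriptor
# is installed) and report its arguments:
kubectl apply -f - <<'EOF'
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: fd-install
spec:
  kprobes:
  - call: "fd_install"
    syscall: false
    args:
    - index: 0
      type: "int"
    - index: 1
      type: "file"
EOF
# Stream the resulting events in compact, human-readable form:
kubectl exec -n kube-system ds/tetragon -c tetragon -- tetra getevents -o compact
```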
The Visible Observability Revolution With eBPF
In the cloud-native world, limited observability carries a high price, and having a solution in place saves companies money and labor by painting a picture of their infrastructure to debug issues and track resource consumption. However, this value can be lost with the related (in)visible costs when the observability solution is not architected for the cloud-native world. eBPF-based observability presents a compelling proposition for organizations seeking to gain deep insights into their systems’ behavior while minimizing overhead and maximizing data quality.
By leveraging eBPF, businesses can collect data without intrusive instrumentation, benefit from granular visibility and filtering, and optimize resource consumption. As eBPF continues to evolve and gain adoption, its role in revolutionizing observability of application behavior and driving informed decision-making without spiraling costs is set to visibly impact the cloud-native world.