Why Your Kubernetes Network is Still a Black Box — And How to Fix It
It’s Friday mid-afternoon when a microservice in your production Kubernetes cluster starts dropping connections and the incident bridge spins up. Your team scrambles, but nobody knows where to look. The application team blames the infrastructure. The infrastructure team blames the app. And somewhere in the middle, packets are silently disappearing.
This is the reality for most Platform Engineering teams today. Kubernetes has transformed how we deploy software, but it has also made the network layer significantly harder to reason about. The good news: a new generation of tooling, built on a technology called eBPF, is finally opening up that black box.
Monitoring vs. Observability: Why the Distinction Matters
These two terms are often used interchangeably, but they represent fundamentally different capabilities, and confusing them is expensive.
Monitoring tells you that something is wrong. It answers predefined questions: Is CPU above 80%? Is latency above 200ms? These are useful signals, but they are reactive by nature. You have to know what to look for before you can monitor it.
Observability, on the other hand, lets you ask questions you didn’t think to ask in advance. It gives you enough raw signal—metrics, traces, flow data—to reconstruct what actually happened inside your system, even for failure modes you’ve never seen before. In a distributed Kubernetes environment, where a single user request might touch dozens of microservices across multiple nodes, observability isn’t a nice-to-have. It’s the difference between a 15-minute resolution and a 4-hour war room.
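Monitoring’s “predefined question” is easy to picture as a Prometheus alerting rule. The metric name and threshold below are illustrative placeholders, not tied to any particular exporter:

```yaml
# Illustrative alerting rule: a question you decided to ask in advance.
groups:
  - name: predefined-questions
    rules:
      - alert: HighRequestLatency
        # Metric name is a placeholder for whatever your services expose.
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.2
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "p99 latency above 200ms for 5 minutes"
```

Observability is what you lean on when that alert fires and the dashboard you prepared in advance does not explain why.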
Nowhere is this gap more painful than at the network layer.
Enter eBPF: Kernel-Level Visibility Without the Performance Tax
For years, Kubernetes networking relied on iptables, a legacy Linux subsystem that evaluates each packet against long, sequential chains of rules. At the scale of a modern production cluster, where pods are constantly spinning up and down and IP addresses churn by the second, those rule chains grow enormous and every change forces an expensive rewrite, making iptables a serious performance bottleneck.
Extended Berkeley Packet Filter (eBPF) takes a completely different approach. Rather than routing traffic through userspace proxies, eBPF lets you embed lightweight, sandboxed programs directly into the Linux kernel. These programs hook into specific network events (packet sends, drops, connection state changes) and record telemetry in real time with very low overhead.
Think of it as installing a high-resolution camera directly at the kernel level. Every packet that moves through your cluster gets observed, classified, and recorded with negligible performance impact. This is the foundation that makes true Kubernetes network observability possible.
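To make that concrete, here is a heavily simplified, libbpf-style sketch of what such a program can look like. The map name and context fields are illustrative pseudocode, not a buildable module:

```c
/* Illustrative sketch only: an eBPF program attached to the kernel's
 * packet-drop tracepoint, counting drops per reason in an eBPF map.
 * Names and context fields are simplified for readability. */
SEC("tracepoint/skb/kfree_skb")
int count_drops(struct trace_event_raw_kfree_skb *ctx)
{
    u32 reason = ctx->reason;           /* kernel's drop-reason code    */
    u64 *count = bpf_map_lookup_elem(&drop_counts, &reason);
    if (count)
        __sync_fetch_and_add(count, 1); /* update the shared eBPF map   */
    return 0;                           /* never blocks the packet path */
}
```

A userspace agent then reads the map periodically; the packet itself never leaves the kernel’s fast path.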
Microsoft Retina: eBPF-Powered Observability for Every Cluster
Retina is an open-source Kubernetes network observability platform built by the Microsoft Azure Container Networking team. What makes it stand out is a property most enterprise teams care deeply about: it is completely CNI and cloud-agnostic.
Whether your cluster runs on Amazon EKS with the AWS VPC CNI, Azure Kubernetes Service, or Google Kubernetes Engine, Retina works without modification. You don’t need to replace or reconfigure your existing network layer; Retina simply layers on top of it.
How It Works: A DaemonSet That Watches Everything
Retina deploys as a DaemonSet, meaning one agent pod runs on every node in your cluster. Each agent loads eBPF programs into the Linux kernel of its host node, where they silently intercept and inspect every packet flowing through that machine.
Crucially, this happens at the kernel level, not inside your application containers. There are no sidecars to inject, no application code to modify, and no significant CPU overhead. The eBPF programs write telemetry data into kernel-level data structures called eBPF maps, which the Retina agent then reads and exports in standard Prometheus format.
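As a rough sketch of that last step, the exposition-format text such an agent exports can be consumed with a few lines of Python. The metric and label names below are illustrative assumptions, not Retina’s exact schema:

```python
# Minimal sketch: parsing Prometheus exposition-format text of the kind
# a node agent exports. Metric and label names here are illustrative
# assumptions, not Retina's exact schema.
sample = """\
networkobservability_drop_count{reason="POLICY_DENIED",pod="checkout-7d4f"} 12
networkobservability_drop_count{reason="CONNTRACK",pod="cart-66b9"} 3
networkobservability_forward_count{pod="checkout-7d4f"} 9834
"""

def parse_metrics(text: str) -> dict:
    """Return {metric_name: [(labels, value), ...]} for simple samples.

    Note: this toy parser ignores commas or escapes inside label values;
    real scrapers use a proper exposition-format parser.
    """
    out: dict = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_labels, value = line.rsplit(" ", 1)
        if "{" in name_labels:
            name, raw = name_labels.split("{", 1)
            labels = {
                k: v.strip('"')
                for k, v in (pair.split("=", 1)
                             for pair in raw.rstrip("}").split(","))
            }
        else:
            name, labels = name_labels, {}
        out.setdefault(name, []).append((labels, float(value)))
    return out

metrics = parse_metrics(sample)
total_drops = sum(v for _, v in metrics["networkobservability_drop_count"])
print(total_drops)  # 15.0
```

In practice you would never hand-roll this; Prometheus scrapes the agent’s endpoint directly. The point is that the output is plain, labeled text that any existing metrics pipeline already understands.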
The result is a continuous, cluster-wide view of network activity enriched with Kubernetes context like pod names, namespaces, and labels available directly in your existing Grafana dashboards.
Two Modes for Two Different Problems
Retina offers two primary operating modes, each suited to a different use case:
Legacy Mode (Metrics-Only): In this mode, Retina functions as a highly efficient metrics collector. It gathers network telemetry directly from the kernel and exposes it in standard Prometheus format. For teams that already have a Prometheus and Grafana stack, this is a zero-friction path to pod-level network visibility with no additional tooling required. It’s ideal for long-term trend analysis, alerting, and capacity planning.
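A trend query in that setup might look like the following PromQL. The metric name is an assumption about what the agent exposes, so check the metric list of your deployed version:

```promql
# Per-namespace packet-drop rate over 5m windows (metric name illustrative).
sum by (namespace) (rate(networkobservability_drop_count[5m]))
```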
Hubble Mode (Live Traffic Tracing): This is where Retina becomes genuinely powerful for active incident response. By integrating with the Hubble API, the observability layer from the Cilium project, Retina provides real-time traffic traces and a live verdict on every network flow. For instance, was the packet Forwarded, or was it Dropped? For on-call engineers, that single data point can eliminate hours of guesswork. Instead of manually correlating logs across nodes, you get an immediate, actionable answer.
Hubble mode also exposes the specific kernel-level drop reason via the skb_drop_reason field, telling you not just that a packet was dropped, but exactly why. Was it a NetworkPolicy denial? A connection tracking timeout? An invalid TCP checksum? The kernel knows, and with Retina in Hubble mode, so do you.
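During an incident, that question can be asked straight from the Hubble command line. This invocation is illustrative and assumes the hubble CLI is installed and pointed at your cluster:

```shell
# Show the 20 most recent flows the kernel dropped, with their verdicts.
hubble observe --verdict DROPPED --last 20
```

Each returned flow carries the source and destination pod, the verdict, and the drop reason, so the answer to “who is dropping whose packets” is one command away rather than a cross-node log hunt.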
This is Just the Starting Point
Retina solves a real and immediate problem for Platform Engineering teams: getting meaningful network visibility without overhauling your existing infrastructure. But it’s one piece of a larger puzzle.
In a follow-up piece, we’ll go deeper: comparing Retina against running Cilium in Chaining Mode for teams that need Layer-7 protocol visibility (HTTP, gRPC), and examining the cost case for open-source eBPF tooling versus the native cloud observability products that vendors are eager to sell you. Spoiler: the difference at scale is significant.
For now, if you’re running production Kubernetes and still can’t answer the question “why did that packet drop?”, Retina is the fastest path to an answer.