Understanding Kubernetes Networking Architecture

Container technology has evolved rapidly from Docker to Kubernetes. Most developers are familiar with Docker but have only a partial grasp of Kubernetes, and the hardest part of Kubernetes to understand is its network architecture. In this article, I will walk through the Kubernetes networking architecture in detail, from the underlying Linux building blocks to the Kubernetes abstractions built on top of them.

What is Kubernetes?

Kubernetes, originally designed by Google and now maintained by the Cloud Native Computing Foundation, has been significantly influential in the world of container orchestration. Drawing on Google’s internal Borg system, Kubernetes enhances the management of containerized applications and is part of a larger suite of technologies that bolster Google’s infrastructure. This includes supporting millions of application and search engine servers worldwide with minimal downtime.

Kubernetes services also provide crucial load balancing and simplify container management across multiple hosts, thereby improving the deployment and scaling of applications.

In very simple terms, Kubernetes is to container orchestration what OpenStack is to virtual machine orchestration.

Kubernetes is an open source platform that’s both portable and extensible, designed for managing containerized workloads and services. It supports declarative configuration and automation, enhancing its usability. The platform boasts a large and rapidly expanding ecosystem, with a wide range of services, support and tools readily available.

While Kubernetes is a key player in the containerization technology landscape, it’s important to note its predecessor in orchestration: Docker, and specifically Docker Swarm.

Docker Vs. Docker Swarm

Docker

Docker is an open source platform that enables developers to build, deploy, run, update and manage containers. These containers are standardized, executable components combining application source code with the necessary operating system (OS) libraries and dependencies, ensuring that code runs seamlessly in any environment.

Docker is tailored to accelerate the deployment of your applications. With Docker, you can detach your applications from your infrastructure, treating the latter as a manageable application. At its heart, Docker enables the safe isolation of almost any application within a container. This secure isolation permits the concurrent running of numerous containers on the same server. Thanks to the lightweight design of the container—which operates without the extra burden of a hypervisor—you can achieve greater efficiency from your hardware.

Docker aids in creating Docker images and containers from an application’s binary and configuration files. It utilizes containerization principles like namespaces and Cgroups, provided by the Linux Kernel, to effectively run Docker containers.

Docker Swarm

Docker Swarm, an open source tool from Docker, is used to manage a group of containers running in a Docker cluster. It helps create and manage the cluster nodes and schedules container jobs to run on these nodes.

In swarm mode, Docker allows containers to run across multiple hosts by combining those hosts into a shared pool of compute resources. Docker monitors these containers, managing their allocation and starting them as needed. Swarm mode is disabled by default.

Architecture: Docker Vs. Kubernetes

Kubernetes, often called “Linux for the cloud,” is the most popular container orchestration platform for a reason. Kubernetes supports a range of container runtime environments, including Docker, containerd, CRI-O and all Kubernetes Container Runtime Interface (CRI) implementations.

Docker is extremely valuable for modern application development, effectively solving the age-old problem of code that runs perfectly on one machine but fails elsewhere.

When comparing these platforms, it’s more appropriate to compare Kubernetes with Docker Swarm. Docker Swarm, also known as swarm mode, is a container orchestration tool similar to Kubernetes. As an advanced feature, it manages clusters of Docker daemons. The Manager/Worker architecture of Docker Swarm is comparable to the control plane/node mode in Kubernetes, much like the Master/Worker mode in Ali Cloud.

Kubernetes vs Docker Swarm: Key Parameters

Note that Kubernetes is not a containerization platform competing with the Docker container engine (Docker CE/EE); the meaningful comparison is between Kubernetes and Docker Swarm.

Docker uses a client-server architecture. The Docker client communicates with the Docker daemon, responsible for essential tasks such as container creation, execution and distribution. You have the option to run both the client and daemon on the same system or establish connectivity between the client and a remote Docker daemon. They communicate seamlessly through a RESTful API, often over UNIX sockets or a network interface.
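As a quick illustration (a sketch that assumes Docker is installed and the daemon is listening on its default UNIX socket), you can talk to that REST API directly:

curl --unix-socket /var/run/docker.sock http://localhost/version   # query the daemon version over the local socket
docker -H tcp://remote-host:2375 ps                                # talk to a remote daemon instead (remote-host is a placeholder; use TLS in practice)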

I’ve created a diagram to illustrate this concept. Please refer to the diagram of Docker Architecture below.

Docker Architecture

In exploring the architecture of Kubernetes, it’s important to recognize that a Kubernetes cluster is a set of node machines for running containerized applications. A cluster has at least one worker node and a control plane.

The worker nodes in a Kubernetes cluster are responsible for hosting the Pods, which are the components of the application workload. Each worker node, functioning as either a physical or a virtual machine, is equipped with a Kubernetes node agent (Kubelet), a network proxy (kube-proxy) and a container runtime. The container runtime is essential for running the containers within the Pods, providing an environment for the containers to execute and interact with the operating system and underlying hardware.

Pods are scheduled onto the nodes by the control plane, which consists of the kube-apiserver, kube-scheduler, kube-controller-manager and etcd (a consistent and highly available key-value store).

See below for a diagram illustrating Kubernetes architecture details.

Kubernetes Architecture

However, the key difference between Docker and Kubernetes is in the architecture of their network.

Basic Kubernetes Concepts and Terms

Master. The control plane in Kubernetes, often referred to as the master node, is responsible for managing and controlling the cluster. It runs key components like the kube-apiserver, kube-controller-manager, kube-scheduler and etcd. The control plane serves as the central management point for the entire cluster, similar in concept to the leader node in Docker Swarm.

  • kube-apiserver: Validates and configures data for the API objects, which include pods, services, replication controllers and others. The API server services REST operations and provides the front end to the shared state of the cluster, through which all other components interact.
  • kube-controller-manager: Operates multiple Kubernetes controllers in a single process. This component manages essential background tasks in the cluster, such as node, replication, and endpoint controllers, centralizing key operational functions.
  • kube-scheduler: Is a control plane process that assigns Pods to Nodes. The scheduler determines which Nodes are valid placements for each Pod in the scheduling queue according to constraints and available resources.
  • etcd: Consistent and highly available key-value store that serves as Kubernetes’ backing store for all cluster data.
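On a typical kubeadm-style cluster, where these control plane components run as static pods, you can see them with a couple of kubectl commands (a quick sketch, assuming you have cluster-admin access):

kubectl get pods -n kube-system -o wide   # kube-apiserver, kube-scheduler, kube-controller-manager, etcd, kube-proxy, CoreDNS, ...
kubectl get nodes -o wide                 # control plane and worker nodes with their internal IPs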

Non-Master (named Worker in Ali Cloud). A worker node is where containerized applications run. It hosts the actual workloads in Kubernetes. Worker nodes run the container runtime (e.g., Docker), the Kubernetes agent (kubelet) and an optional networking plugin (e.g., Calico).

  • kubelet: The primary “node agent” that runs on each node. It can register the node with the apiserver using one of: the hostname, a flag to override the hostname, or specific logic for a cloud provider. The kubelet manages the container networking implementation (via the CNI) and the container runtime (via the CRI).
  • kube-proxy: Is a network proxy daemon that runs on each node in a Kubernetes cluster, mirroring Kubernetes networking services on these nodes. Similar to the Kubelet, which is another per-node daemon, kube-proxy plays a specific role in handling network communication to and from the Pods.

Kube-proxy usually runs on worker nodes in a Kubernetes cluster and, in some configurations, it may also run on master nodes. Its primary role is to manage network proxying for pod-to-pod communication across all nodes, be they worker or master. Kube-proxy is responsible for maintaining network rules and facilitating packet forwarding for pods, ensuring efficient and reliable network connectivity within the cluster.
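To check how kube-proxy is deployed and which proxy mode it uses (a sketch that assumes a kubeadm-managed cluster, where kube-proxy runs as a DaemonSet configured via a ConfigMap):

kubectl get daemonset kube-proxy -n kube-system                       # one kube-proxy pod per node
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode   # empty or "iptables" means iptables mode; "ipvs" means IPVS mode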

In the following Kubernetes architecture diagram, we will see all the components mentioned above and illustrate how each component interacts with the others.

Details of Kubernetes Architecture Components

Kubernetes Objects are categorized into three main types: Workload Objects, Service Objects and Config & Storage Objects. While there are numerous objects within each category, this overview will focus only on the most commonly used ones.

Workload Objects:

  • Pod: The basic building block, representing one or more containers with shared storage and network.
  • ReplicaSet: Ensures a specified number of pod replicas are running at all times, providing high availability.
  • Deployment: Manages the deployment and scaling of a set of pods, offering declarative updates. Deployments use ReplicaSets to ensure that the desired number of pods is always running and available. This object provides key features like version management, scalable deployment and rollback capabilities, enabling controlled updates to the application.
  • StatefulSet: Manages the deployment and scaling of pods with stable hostnames and persistent storage, suitable for stateful applications.
  • DaemonSet: Ensures that a copy of a pod runs on every node in a cluster, typically used for node-level system services like log collectors or monitoring agents.
  • Job: Runs a single task to completion, making it suitable for batch processing and one-time tasks.
  • CronJob: Creates Jobs on a regular schedule, similar to cron tasks in traditional Unix-like systems.

Service Objects:

  • Service: Exposes a set of pods as a network service with a stable IP address and DNS name for load balancing and service discovery.

externalTrafficPolicy is a configuration field that can be set for a Kubernetes Service object. It determines the routing of incoming traffic from external sources (outside the Kubernetes cluster) to the pods associated with the Service. This setting is particularly important for controlling how external traffic is handled, whether it’s distributed evenly across all pods or sent to the nearest pod in terms of network topology, affecting performance and source IP preservation.

There are two available options for externalTrafficPolicy:

→ Cluster: When externalTrafficPolicy is set to “Cluster” (the default), the traffic from external clients is distributed to any healthy pod within the Service, regardless of the node on which the pod is running. This option is efficient for distributing traffic but may not always preserve the source IP address of the incoming requests.

→ Local: When externalTrafficPolicy is set to “Local,” traffic from external clients is forwarded only to pods that are running on the same node where the traffic entered the cluster. This option preserves the source IP address of the incoming requests but may lead to uneven distribution of traffic if the nodes have different numbers of pods for the Service.
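Here is a minimal sketch of a Service manifest that sets this field (names and ports are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: my-web-service           # illustrative name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserve the client source IP; only nodes running backend pods receive traffic
  selector:
    app: my-web                  # must match the pod labels
  ports:
    - port: 80                   # port exposed by the Service
      targetPort: 8080           # port the pods listen on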

For example, in Ali cloud, when there are changes to the backend endpoints or cluster nodes corresponding to a Service, the Cloud Controller Manager (CCM) will automatically update the Server Load Balancer’s (SLB’s) backend virtual server group. The update strategy for the backend server group varies depending on the Service mode.

  • In the Cluster mode (where spec.externalTrafficPolicy = Cluster): By default, CCM will attach all nodes to the SLB’s backend — except those configured with BackendLabel labels.
  • In Local mode (where spec.externalTrafficPolicy = Local): By default, CCM will only add the node where the Service’s Pods are located to the SLB’s backend. That approach helps reduce the rate at which SLB quotas are consumed and supports source IP preservation at the Layer 4 level.
  • Ingress: Manages external access to services within a cluster, typically HTTP.
  • NetworkPolicy: Defines how pods can communicate with each other and other network endpoints, enhancing network security.

Config and Storage Objects:

  • ConfigMap: Stores configuration data separately from application code, allowing for dynamic configuration updates.
  • Secret: Stores sensitive information, such as passwords or tokens, securely and separately from application code.
  • PersistentVolume (PV) and PersistentVolumeClaim (PVC): Manage storage resources and volume claims for persistent data storage.
  • Namespace: Provides logical isolation and organization within a cluster.
  • Annotation: A mechanism for attaching arbitrary non-identifying metadata to Kubernetes objects. Annotations can be used to store additional information that may be used by scripts, tools, or utilities interacting with the cluster without affecting the runtime behavior of the object itself.

This is a brief history and an overview of some key concepts in Kubernetes. Now, let’s understand how Kubernetes’ networking architecture is organized — this understanding will help us work with the solution effectively, even if we start with fundamental knowledge.

Kubernetes Networking Architecture

Kubernetes runs on the Linux operating system. There are five key Linux networking concepts that will help us understand how the Kubernetes network was designed on top of these building blocks to meet its requirements.

Network Namespace (netns)

Linux Network     | Kubernetes Network
------------------|--------------------------------------------------------------
Network Namespace | Node in Root Network Namespace; Pod in Pod Network Namespace

A network namespace is a feature in Linux that allows you to create isolated network environments within a single Linux system. Each network namespace has its own network stack including network interfaces, routing tables, firewall rules and other network-related resources. This isolation allows you to run multiple independent network environments on the same physical or virtual machine, keeping them separate from each other.

Container technologies like Docker and Kubernetes use network namespaces to provide an isolated networking environment for each container. This isolation ensures that containers cannot interfere with each other’s network configuration.

Network Namespace Experiment

How do we create a network namespace and achieve communication between different network namespaces? Here’s an example.

Step One: Create two network namespaces:

ip netns add netns1
ip netns add netns2

Step Two: Create a veth pair:

ip link add veth1 type veth peer name veth2

Assign veth1 to the network namespace netns1 and veth2 to the network namespace netns2:

ip link set veth1 netns netns1
ip link set veth2 netns netns2

Assign IP addresses to the interfaces within the namespaces:

ip netns exec netns1 ip addr add 10.1.1.1/24 dev veth1
ip netns exec netns2 ip addr add 10.1.1.2/24 dev veth2

At this point, veth1 cannot ping veth2 yet, because the interfaces are still down:

ip netns exec netns2 ip link show veth2  # Check the status of veth2, which is down.

Activate veth1 within the netns1 network namespace and veth2 within netns2:

ip netns exec netns1 ip link set dev veth1 up
ip netns exec netns2 ip link set dev veth2 up

Try to ping veth2 from netns1:

ip netns exec netns1 ping 10.1.1.2

So far, we’ve been successful; let’s continue.

Brainstorming

As we’ve learned from the above experiment, namespaces isolate the network, and we’ve used a veth pair to connect two virtual hosts. Now, if we configure the IP addresses with different subnets, what will happen?

Without a router or subnet reconfiguration, the two hosts in different subnets will not be able to communicate over a simple network cable connection.

This is also a good moment to recall the loopback interface, a special interface for same-host communication: packets sent to it never leave the host, and processes listening on 127.0.0.1 are reachable only from other processes on the same host.

Veth (virtual ethernet) pairs

Linux Network | Kubernetes Network
--------------|--------------------
Veth Pair     | Veth Pair (on Pods)

Veth pairs facilitate communication between different network namespaces or containers. Traffic sent into one end of the veth pair is received at the other end, allowing data to flow between isolated environments. Without such a link, separate network namespaces cannot reach each other by default.

In simple terms, the veth pair acts like a virtual network cable that connects virtual hosts in two different namespaces.

Packets transmitted on one device of the pair are immediately received on the other device. When either end is down, the link state of the pair is down.

Linux itself features the concept of network interfaces, which can be either physical or virtual. For instance, when you use the ‘ifconfig’ command, you can view a list of all network interfaces and their configurations, including IP addresses. Veth is one such virtual network interface in Linux.

Types of Virtual Network Interfaces in Linux

Virtual Ethernet Interfaces (veth)

These are often used in containerization technologies like Docker. They create a pair of virtual Ethernet interfaces, and packets sent to one end of the pair are received on the other end.

ip link add veth0 type veth peer name veth1
ip link set dev veth0 up
ip link set dev veth1 up

Loopback Interface (lo)

The loopback interface is used for local network communication, allowing a system to communicate with itself. It is typically assigned the IP address 127.0.0.1.

ifconfig lo

ip addr show lo

Tunnel Interfaces (TUN, tap)

Tunnel interfaces are used to create point-to-point or point-to-multipoint network tunnels, such as VPNs. tun interfaces are used for routing, while tap interfaces are used for Ethernet bridging.

To create a TUN interface for a simple tunnel:

ip tuntap add dev tun0 mode tun
ip link set dev tun0 up

To create a TAP interface for an Ethernet bridge:

ip tuntap add dev tap0 mode tap
ip link set dev tap0 up

Virtual LAN Interfaces (vlan)

VLAN interfaces are used to partition a physical network into multiple virtual LANs. They allow multiple VLANs to exist on a single physical network interface.

ip link add link eth0 name eth0.10 type vlan id 10
ip link set dev eth0.10 up

WireGuard Interfaces (wg)

WireGuard is a modern VPN protocol, and wg interfaces are used to configure and manage WireGuard tunnels.

Create a WireGuard interface:

ip link add dev wg0 type wireguard

Generate private and public keys

wg genkey | sudo tee /etc/wireguard/privatekey | wg pubkey | sudo tee /etc/wireguard/publickey

Configure the WireGuard interface

ip address add 10.0.0.1/24 dev wg0
wg set wg0 private-key /etc/wireguard/privatekey
wg set wg0 listen-port 51820

Dummy Interfaces (dummy)

Dummy interfaces are used for various purposes, such as simulating network interfaces or providing a placeholder for network configuration.

modprobe dummy
ip link add dummy0 type dummy
ip link set dev dummy0 up

Bonded Interfaces (bond)

Bonding interfaces are used for network link aggregation and fault tolerance. They combine multiple physical network interfaces into a single logical interface.

modprobe bonding
ip link add bond0 type bond
ip link set dev eth0 master bond0
ip link set dev eth1 master bond0

Virtual TTY Interfaces (PTY – ‘Pseudo-Terminal’)

These are virtual terminal interfaces used for terminal emulation, often in the context of SSH or terminal multiplexers like tmux.
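For example, you can see the pseudo-terminals currently in use on a host:

tty              # prints the pseudo-terminal attached to the current shell, e.g., /dev/pts/0
ls -l /dev/pts/  # lists all allocated pseudo-terminals (one per SSH session, tmux pane, etc.)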

 

Bridge Interfaces (br)

Bridge interfaces bridge network traffic between two or more physical or virtual network interfaces. They are created when setting up software bridges using tools like brctl or ip.

ip link add name br0 type bridge
ip link set dev eth0 master br0
# eth0 is a physical interface; a veth end can be attached instead, e.g., 'ip link set dev veth1 master br0'.
ip link set dev eth1 master br0
ip link set dev br0 up

Now that you’re familiar with the various virtual network interfaces, let’s shift our focus to the Linux Network Bridge, which can be managed and created using ‘brctl.’ In Kubernetes networking, we use CNI to manage Linux network interfaces for container networking, as illustrated in the following table.

Linux Network                           | Kubernetes Network
----------------------------------------|----------------------------------------------------------------
Network Interface (Physical + Virtual)  | CNI (Container Network Interface) plugins: Cilium, Flannel, Calico, Weave Net

Linux network interfaces are the underlying foundation, while CNI provides a standardized way to configure container network connectivity.

Network Bridge Interface

A veth pair is primarily used for connecting two network namespaces or containers and provides network isolation between them. A bridge, on the other hand, is used to connect multiple network interfaces and create a single network segment without inherent network isolation, although additional configurations can be applied for segmentation. Bridges and veth pairs can be used together to create more complex network setups, such as connecting multiple containers within a single bridge.
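As a rough sketch of that idea (essentially what Docker and the CNI bridge plugin do for you), the following commands connect two network namespaces to one bridge so they can reach each other; all names and addresses are illustrative:

ip link add name br0 type bridge                 # the shared layer-2 segment
ip link set dev br0 up
ip netns add ns1
ip netns add ns2
ip link add veth-ns1 type veth peer name br-ns1  # one veth pair per namespace
ip link add veth-ns2 type veth peer name br-ns2
ip link set veth-ns1 netns ns1                   # container-side ends go into the namespaces
ip link set veth-ns2 netns ns2
ip link set br-ns1 master br0                    # bridge-side ends are attached to br0
ip link set br-ns2 master br0
ip link set dev br-ns1 up
ip link set dev br-ns2 up
ip netns exec ns1 ip addr add 10.10.0.1/24 dev veth-ns1
ip netns exec ns2 ip addr add 10.10.0.2/24 dev veth-ns2
ip netns exec ns1 ip link set dev veth-ns1 up
ip netns exec ns2 ip link set dev veth-ns2 up
ip netns exec ns1 ping -c 3 10.10.0.2            # ns1 can now reach ns2 through the bridge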

Linux Network    | Docker Network | Kubernetes Network
-----------------|----------------|-------------------------
Bridge Interface | docker0        | cni0 (on Flannel plugin)

A network bridge is similar to a layer-2 switch.

For example, the “docker0” bridge is a default bridge network created by Docker when you install Docker on a Linux system. It is a virtual Ethernet bridge that enables communication between Docker containers and between containers and the host system.

When discussing bridge mode in both Docker and Kubernetes, it’s essential to understand the concept of container runtime because the container runtime is responsible for the actual low-level management of containers, including networking.

Container Runtime

Kubernetes is the most widely used container orchestration engine. It has an entire foundation and ecosystem that surrounds it with additional functionality to extend its capabilities through its defined APIs. One of these APIs is the Container Runtime Interface (CRI), which defines what Kubernetes wants to do with a container and how.

Several common container runtimes are used with Kubernetes:

  • Docker: Historically, Docker was the first container runtime that Kubernetes supported directly. However, Kubernetes uses a Container Runtime Interface (CRI) to interact with container runtimes, which Docker did not originally support. To integrate Docker with Kubernetes, a component called Dockershim was used. Dockershim acted as an adapter layer that allowed Kubernetes to communicate with Docker’s daemon using the CRI.

In more recent developments, Kubernetes announced the deprecation of Dockershim as an intermediary. The primary reason for this was to streamline the Kubernetes architecture and use container runtimes that directly implement the CRI. However, Docker containers and images remain fully compatible with Kubernetes because Docker produces OCI-compliant containers. This means that the containers built with Docker can be run by other CRI-compatible runtimes like containerd and CRI-O, which Kubernetes supports natively.

  • containerd: An industry-standard container runtime focused on simplicity and robustness, containerd is used by Docker and supports CRI, making it compatible with Kubernetes.
  • CRI-O: A lightweight container runtime specifically designed for Kubernetes, fully conforming to the Kubernetes CRI, which allows Kubernetes to use any Open Container Initiative (OCI)-compliant runtime.
  • Mirantis Container Runtime (formerly Docker Engine — Enterprise):  This is the enterprise-supported version of Docker runtimes and can integrate with Kubernetes via the CRI.

The vast majority of solutions (not just Kubernetes) rely on containerd, including developer-friendly tooling like Docker Desktop. Exceptions include Red Hat OpenShift, which uses CRI-O, and Podman, which interacts directly with low-level container runtimes like runC. Ali Cloud has replaced Docker with containerd in the latest versions of ACK.

Container Network Mode

For a long time, the industry treated Docker as the de facto container runtime standard. Thus, we’ll dive into the Docker networking model and explain how the CNI bridge (cni0) differs from the Docker network model (docker0).

Docker container supports various network modes:

  • Bridge Mode (default): Containers are attached to a bridge network, providing network isolation; this is the default mode for new containers.
  • Host Mode (--net=host): Containers share the host’s network, offering maximum performance but no network isolation.
  • Container Mode (--net=container:<name|id>): Containers share the network stack of another container, useful for direct communication.
  • None Mode (--net=none): Containers have no network access, isolating them from the host and others.

Docker containers also offer advanced network modes for specific use cases:

  • Macvlan: allows Docker containers to have direct, unique, and routable IP addresses on the same physical network as the host.
  • Overlay: allows for the extension of the same network across hosts in a container cluster. The overlay network virtually sits on top of the underlay/physical networks.
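As a quick, hedged illustration of the basic modes (the image names are just examples):

docker run -d --name web1 nginx                          # bridge mode (default): attached to docker0
docker run -d --network host --name web2 nginx           # host mode: shares the host's network stack
docker run -it --network container:web1 busybox ip addr  # container mode: shares web1's network namespace
docker run -it --network none busybox ip addr            # none mode: only a loopback interface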

Let’s focus on Docker bridge mode first.

Docker Bridge Mode

In bridge mode, Docker creates a bridge network interface (commonly named “docker0”) and assigns it a subnet from a private IP address space (typically an RFC 1918 range). For each Docker container, a pair of virtual Ethernet (veth) devices is generated.

One end of the veth pair is connected to the docker0 bridge, and the other end is mapped to the container’s eth0 network interface using network namespace technology. Finally, an IP address is allocated to the container’s eth0 interface from the docker0 bridge’s address range, allowing the container to communicate with other containers and the host system.
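You can observe this wiring on a Docker host (a sketch assuming a default Docker installation):

docker run -d --name demo nginx                            # start a container on the default bridge network
ip addr show docker0                                       # the bridge and its private subnet (e.g., 172.17.0.1/16)
ip link show type veth                                     # host-side veth ends attached to docker0
docker inspect -f '{{.NetworkSettings.IPAddress}}' demo    # the container's IP from the docker0 range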

So far, we have solved the problem of a single-node container network through veth pair + bridge, which is also the principle behind the Docker network. However, docker0 on one host has nothing to do with docker0 on other hosts. Kubernetes, even when it uses Docker as the container runtime, handles networking differently; this is where the Container Network Interface (CNI) comes in.

Kubernetes CNI

Kubernetes imposes its own network model on top of Docker’s, which ensures that every pod gets a unique IP address. This model is typically implemented with the help of CNI plugins. Docker’s network modes are still relevant for local container management, but when it comes to Kubernetes clusters, the networking is largely managed by Kubernetes services and CNI-compliant plugins.

Common CNI Plugins

Flannel: Provides a simple and easy way to configure a Layer 3 network fabric designed for Kubernetes.

Calico: Offers networking and network policy, typically for larger or more dynamic networks with advanced security requirements.

Weave Net: Creates a network bridge on each node and connects each container to the bridge via a virtual network interface, allowing for more complex networking scenarios, including network partitioning and secure cross-host communication.

Cilium: Uses Berkeley Packet Filter (BPF) to provide highly scalable networking and security policies.

AWS VPC CNI, Azure CNI, GCP, and Ali Cloud CNI: Cloud provider-specific plugins that integrate Kubernetes clusters with the network services of their respective cloud platforms, enabling seamless operation and management of network resources.
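Which plugin a cluster uses can usually be determined from a node and the kube-system namespace (a sketch; most CNI plugins install an agent DaemonSet and drop their configuration under /etc/cni/net.d):

ls /etc/cni/net.d/                      # CNI configuration files on a node, e.g., 10-flannel.conflist
kubectl get daemonsets -n kube-system   # look for flannel, calico-node, cilium, weave-net, etc.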

Netfilter and iptables

Linux Network        | Kubernetes Network
---------------------|----------------------------------------------------------
Netfilter (security) | Netfilter/iptables
iptables             | iptables/IPVS mode (load balancer created by kube-proxy)

Netfilter is the underlying framework in the Linux kernel responsible for packet filtering, NAT and connection tracking, while iptables is a user-space command-line tool that leverages the Netfilter framework to define and manage firewall rules. Netfilter is the underlying technology that iptables relies on to perform these functions:

  • Connection Tracking: Netfilter maintains a connection tracking table that keeps track of the state of network connections. This is essential for implementing NAT and ensuring that packets are correctly routed to the appropriate destination.
  • Network Address Translation (NAT): Netfilter allows for NAT operations, which are often used for masquerading outgoing traffic from pods to appear as if it’s coming from the node’s IP address when communicating with external networks.
  • Load Balancing: In Kubernetes, Netfilter also plays a role in load-balancing traffic to different pods within a Service.

Netfilter vs. iptables in Linux

In Kubernetes, there are several components and concepts that serve similar roles and functions as Netfilter/iptables in managing network traffic. These include:

iptables: In Kubernetes, it is used to implement various network policies and rules for managing traffic between pods and between pods and external networks.

iptables running over Netfilter enable Kubernetes to enforce network policies, perform load balancing for Services and handle NAT operations. They are critical components of Kubernetes networking, ensuring that traffic flows correctly within the cluster and between pods and external networks.

kube-proxy: kube-proxy is a Kubernetes component responsible for managing network rules and routing traffic to services and pods within the cluster. It uses iptables to set up the necessary rules for traffic routing.
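On a node you can inspect the rules kube-proxy has programmed (a sketch; the KUBE-SERVICES chain exists in iptables mode, and ipvsadm applies only if kube-proxy runs in IPVS mode):

sudo iptables -t nat -L KUBE-SERVICES -n | head   # Service ClusterIP/NodePort dispatch rules (iptables mode)
sudo ipvsadm -Ln                                  # virtual servers and their backends (IPVS mode, requires ipvsadm)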

CNI plugins: CNI plugins, such as Flannel, Calico, Weave Net and others, provide networking and network policy functionalities within Kubernetes. They often interact with the underlying networking components, including iptables and the Linux kernel’s network stack.

Network policies: Kubernetes Network Policies allow you to define rules that control the traffic flow to and from pods. These policies are enforced using iptables rules on the nodes. Network Policies provide fine-grained control over which pods can communicate with each other.
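A minimal sketch of such a policy (labels and ports are illustrative): it allows ingress to pods labeled app=backend only from pods labeled app=frontend on TCP port 8080, and implicitly denies all other ingress to those pods.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend             # pods this policy applies to
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend    # only these pods may connect
      ports:
        - protocol: TCP
          port: 8080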

Name      | NetworkPolicy Support | Data storage              | Network setup
----------|-----------------------|---------------------------|------------------------------
Cilium    | Yes                   | etcd or consul            | IPvlan (beta), veth, L7-aware
Flannel   | No                    | etcd                      | Layer 3 IPv4 overlay network
Calico    | Yes                   | etcd or Kubernetes API    | Layer 3 network using BGP
Weave Net | Yes                   | No external cluster store | Mesh overlay network

NetworkPolicy Support Status in CNI Plugins

To use Network Policy, Kubernetes introduces a new resource object called “NetworkPolicy,” which allows users to define policies for network access between pods. However, merely defining a network policy is insufficient to achieve actual network isolation. It also requires a policy controller (PolicyController) to implement the policies.

The policy controller is provided by third-party networking components. Currently, open source projects such as Calico, Cilium, Kube-router, Romana, Weave Net and others support the implementation of network policies (as shown in the table above). The working principle of Network Policy is depicted in the diagram below.

Network Policy Working Principle

Routing

Routing in Linux is quite simple, so let’s delve directly into Kubernetes routing. First, you should be aware that in Kubernetes, there are several IP address blocks.

  • Pod Network CIDR: Assigns IP addresses to pods.
  • Service Cluster IP Range: Defines IP addresses for services.
  • Node Network CIDR: Assigns IP addresses to nodes.
  • Service External IP Range: Specifies external IP addresses for services.
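Where these ranges come from depends on how the cluster was built; on a kubeadm-based cluster (an illustrative sketch, values are examples) they are set at init time and remain visible in the control plane manifests:

kubeadm init --pod-network-cidr=10.244.0.0/16 --service-cidr=10.96.0.0/12   # pod and Service CIDRs at cluster creation
grep -- --service-cluster-ip-range /etc/kubernetes/manifests/kube-apiserver.yaml
grep -- --cluster-cidr /etc/kubernetes/manifests/kube-controller-manager.yaml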

In Kubernetes, routing refers to the process of directing network traffic between pods, nodes and external networks within the cluster. Routing plays a crucial role in ensuring that communication between different components of a Kubernetes cluster functions correctly.

Here are key aspects of routing in Kubernetes:

Pod-to-Pod Across Nodes Routing

Linux Network                   | Kubernetes Network
--------------------------------|------------------------------------------------------------------------------------------------
VXLAN/GRE/IP-in-IP/Open vSwitch | CNI plugins: Flannel network types (udp/vxlan/host-gw/Cloud Provider VPC/alloc); Cilium/Calico/Weave Net

Kubernetes assigns a unique IP address to each pod within a cluster. Pods can communicate directly with each other using these IP addresses, regardless of the node they are running on.

The Kubernetes networking model, often implemented using overlay networks or network plugins, ensures that pod-to-pod traffic is efficiently routed within the cluster. Let’s elaborate on overlay networks and their relevant common network plugins.

Overlay Networks

Overlay networks provide a means to connect containers across different hosts. Both Docker Swarm and Kubernetes support overlay networks. In Docker Swarm, you can create overlay networks to enable container communication across multiple nodes. In Kubernetes, overlay network plugins like Flannel or Calico are used for pod-to-pod communication across nodes. It’s worth noting that while Docker container-to-container communication is similar to pod-to-pod communication in Kubernetes, the latter provides additional orchestration and management features.

This is how the overlay network (using flannel plugins) works in Kubernetes.

  • Flannel offers simple overlay networking.
  • Calico provides advanced networking with BGP routing and rich network policies.
  • Weave Net focuses on simplicity and includes DNS-based service discovery.

Kubernetes’ CNI plugins, such as Flannel, have specific networking characteristics. The traditional Flannel plugin, when used across nodes, relies on vxlan (flannel.1) and UDP (flannel0) for communication. This results in an additional layer of packet encapsulation when crossing nodes. However, Ali Cloud’s Flannel mode, known as ‘alloc,’ takes a different approach by directly utilizing the VPC’s routing table. This reduces packet encapsulation and leads to faster and more stable packet transmission. Cross-node communication is just one aspect of the Flannel plugin; it also performs functions like assigning IP addresses to pods and configuring routing.

Actually, the requirement for communication across nodes can be addressed using Linux’s overlay network technologies. While these technologies may not provide IP assignment to pods like Kubernetes’ CNI plugins, they share many similarities. It’s valuable to understand how Linux originally provided relevant solutions. These Linux L2/L3 routing technologies are integrated into the Linux kernel and are accessible without the need for additional software or drivers.

Linux Overlay Network Technology

Virtual Extensible LAN (VXLAN)

VXLAN is a tunneling technology used for overlay networking, often used in cloud and virtualized environments. It is supported in the Linux kernel.
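A minimal point-to-point VXLAN tunnel between two hosts looks roughly like this (addresses, VNI and device names are illustrative; run the mirror-image command on the peer):

ip link add vxlan0 type vxlan id 42 dev eth0 local 192.168.1.10 remote 192.168.1.20 dstport 4789
ip addr add 10.200.0.1/24 dev vxlan0   # overlay address; the peer would use 10.200.0.2/24
ip link set dev vxlan0 up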

Generic Routing Encapsulation (GRE)

GRE is a tunneling protocol that encapsulates a wide variety of network layer protocols inside virtual point-to-point links.
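A basic GRE tunnel sketch (the public endpoint addresses are illustrative):

ip tunnel add gre1 mode gre local 203.0.113.1 remote 203.0.113.2 ttl 255
ip addr add 10.210.0.1/30 dev gre1     # inner point-to-point addressing
ip link set dev gre1 up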

IP-in-IP

A simple tunneling protocol that encapsulates an IP packet within another IP packet.
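An IP-in-IP tunnel is configured almost identically (Calico’s IPIP mode uses this encapsulation; addresses are illustrative):

ip tunnel add ipip1 mode ipip local 203.0.113.1 remote 203.0.113.2
ip addr add 10.220.0.1/30 dev ipip1
ip link set dev ipip1 up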

Open vSwitch (OVS)

A multilayer virtual switch that supports standard management interfaces and protocols, including NetFlow, sFlow, SPAN, RSPAN, CLI, LACP, 802.1ag, and it can also operate both as a software-based network switch and as a network overlay for virtual machines.
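A minimal sketch with Open vSwitch installed (bridge and port names are illustrative):

ovs-vsctl add-br ovs-br0        # create an OVS bridge
ovs-vsctl add-port ovs-br0 eth1 # attach a physical (or virtual) interface to it
ovs-vsctl show                  # inspect the resulting switch topology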

Node-to-Node Routing

Linux Network | Kubernetes Network
--------------|--------------------------------
L3 Routing    | L3 Routing (underlying network)

Kubernetes nodes may need to communicate with each other for various reasons, such as control plane coordination or network traffic routing.

Routing between nodes is typically managed by the underlying network infrastructure and is necessary for features like LoadBalancer-type Services, which route traffic to different nodes hosting pods.

Service Routing

Linux Network | Kubernetes Network
--------------|----------------------------------------------------------
Netfilter     | iptables/IPVS mode (load balancer created by kube-proxy)

This is the part that is most challenging to comprehend and that differs most from Docker Swarm. Kubernetes Services provide a stable, abstracted way to access pods. They rely on routing to distribute incoming traffic to the appropriate pods, regardless of the node they are on.

Services use kube-proxy and iptables rules to perform load balancing and route traffic to the correct endpoints (pods) based on labels and selectors.

Network Policies

Network Policies allow you to define rules that control pod-to-pod communication. These policies act as routing rules enforced by the underlying networking infrastructure (often implemented using iptables). They provide fine-grained control over which pods can communicate with each other, based on labels and selectors.

Ingress Controller

Ingress controllers, like Nginx Ingress or Traefik, manage external access to services within the cluster. They handle routing external traffic based on rules defined in ingress resources.

External Routing

Kubernetes clusters often require communication with external networks, such as the public internet or on-premises data centers. External routing is essential for ingress and egress traffic.

External routing is typically managed by the cluster’s network configuration and cloud provider integration.

In summary, routing in Kubernetes is a complex and critical aspect of cluster networking. It involves directing traffic between pods, nodes and external networks to ensure that applications running within the cluster can communicate effectively and securely. Kubernetes provides various networking components and abstractions to manage routing and connectivity within the cluster.

Classic Networking Scenarios in Kubernetes

In this section, we’ll cover a number of common network organization scenarios based on the theory of Kubernetes.

Between Container and Container

Let’s see how container one communicates with container two in the same pod, as illustrated in the following diagram.

Communication between Container one and Container two

In each pod, every Docker container and the pod itself share a network namespace. This means that network configurations such as IP addresses and ports are identical for both the pod and its individual containers.

This is primarily achieved through a mechanism called the “pause container,” where newly created Docker containers do not create their own network interfaces or configure their own IP addresses; instead, they share the same IP address and port range with the pause container.

The pause container is the parent container for all running containers in the pod. It holds and shares all the namespaces for the pod.

Linux network namespaces provide network isolation at the OS level, while Kubernetes namespaces offer logical resource isolation within a Kubernetes cluster, serving different purposes in networking and resource management.

In summary, two containers in the same pod share one network namespace, much like two processes on the same Linux host: they can communicate with each other directly over localhost and share the same network configuration within the pod’s network namespace.
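A minimal sketch of a pod that relies on this shared network namespace (image and names are illustrative): the sidecar reaches the web container over localhost, with no Service involved.

apiVersion: v1
kind: Pod
metadata:
  name: shared-netns-demo
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - containerPort: 80
    - name: sidecar
      image: busybox
      command: ["sh", "-c", "while true; do wget -qO- http://127.0.0.1:80 >/dev/null; sleep 10; done"]   # same IP and ports as the web container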

Between Pod and Pod

Pods in the same node

This is how communication between Pods on the same Node works:

Pod1, Pod2 and cni0 are in the same subnet

Pods across the nodes

This is how communication between pod1 and pod2 across nodes works:

Pods Communication across nodes

For communication between pods on different nodes, note that a pod’s network segment and its node’s bridge (cni0) are in the same subnet, while the bridge’s subnet and the node’s own IP belong to different network segments. To achieve communication across nodes, a method is needed to route pod traffic based on the nodes’ IP addresses.

On the other hand, these dynamically assigned pod IPs are stored as endpoints for services in the etcd database of the Kubernetes cluster. These endpoints are crucial for enabling communication between pods.

Two approaches together ensure seamless pod connectivity across nodes:

  1. Unique pod IP allocation: use overlay network technologies such as the Flannel plugin, or other network plugins, so that every pod in the cluster gets a non-conflicting IP address.
  2. Mapping pod IPs to host IPs: establish a correlation between each pod’s IP and its node’s IP, so that traffic between pods can be forwarded via the nodes.

Between Pod and Service

When we deploy a pod, it generally includes at least two containers: the pause container and the app container.

The IP address of a pod is not persistent; when the number of pods in the cluster is reduced, or if a pod or node fails and restarts, the new pod may have a different IP address than before. That’s why Kubernetes introduced the concept of Services to address this issue.

When accessing a Service, whether through Cluster IP + TargetPort or through node IP + NodePort, the traffic is redirected by the node’s iptables rules, which kube-proxy programs and keeps up to date, to the Service’s backend.

Once a request is matched, it is forwarded to one of the backend pods based on the load-balancing strategy. Kube-proxy not only maintains the iptables rules but also watches the Service’s ports and endpoints and performs load balancing.
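You can see the mapping kube-proxy works from by comparing a Service with its endpoints (my-service is an illustrative name):

kubectl get service my-service        # stable ClusterIP (and NodePort, if defined)
kubectl get endpoints my-service      # the current pod IP:port pairs behind the Service
kubectl describe service my-service   # selector, ports, session affinity and endpoints in one view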

Pod to Service

This is how communication from pod1 to service2 running on pod2 in different nodes works:

Pod to service

Service to Pod

This is how communication from service2 to pod1 works:

Service to pod

In summary, service discovery is achieved through Service resources that provide stable IP addresses and DNS names for accessing pods. These services manage load balancing and traffic routing to destination pods. Kube-proxy generates and maintains the mapping table between services and pod:port pairs, with options for iptables or IPVS load balancing modes to suit performance and efficiency requirements.

Between Internet and Service

In Kubernetes, communication between the internet and a Service is typically achieved through the use of an Ingress resource. An Ingress is an API object that manages external access to the services in a cluster, typically for HTTP traffic. It acts as a layer of abstraction for handling external traffic and provides features like load balancing, SSL termination, and name-based virtual hosting.

Here’s how Kubernetes achieves communication between the internet and a Service using an Ingress:

  • Ingress Resource: First, you create an Ingress resource that defines the rules for routing external traffic to your services. This includes specifying the hostnames, paths, and backend services to route the traffic to.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  rules:
    - host: example.com  # Hostname for incoming traffic
      http:
        paths:
          - path: /app    # Path-based routing
            pathType: Prefix
            backend:
              service:
                name: my-service   # Name of your Kubernetes Service
                port:
                  number: 80       # Port of your Service

  • Ingress Controller: To fulfill the Ingress resource, you need an Ingress controller. The Ingress controller is a component responsible for implementing the rules defined in the Ingress resource. There are various Ingress controllers available, such as Nginx Ingress Controller, Traefik, and others. You deploy the Ingress controller in your cluster.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress-controller
  namespace: nginx-ingress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-ingress-controller
  template:
    metadata:
      labels:
        app: nginx-ingress-controller
    spec:
      containers:
        - name: nginx-ingress-controller
          image: k8s.gcr.io/ingress-nginx/controller:v1.0.0  # Use a specific version
          args:
            - /nginx-ingress-controller
            - --configmap=$(POD_NAMESPACE)/nginx-configuration
            - --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
            - --udp-services-configmap=$(POD_NAMESPACE)/udp-services
          ports:
            - name: http
              containerPort: 80
            - name: https
              containerPort: 443

  • External Access: Once the Ingress controller is running, it watches for changes in the Ingress resources and configures itself accordingly. When external traffic arrives at the cluster, it is first intercepted by the Ingress controller, which then uses the rules from the Ingress resource to route the traffic to the appropriate backend service.
  • Load Balancing: If load balancing is required, the Ingress controller can distribute incoming traffic to multiple pods of the backend service, ensuring scalability and high availability.
  • SSL Termination: If SSL termination is specified in the Ingress resource, the Ingress controller can handle SSL decryption and encryption, allowing secure communication with the backend service.
  • Name-Based Virtual Hosting: Ingress can also support name-based virtual hosting, allowing you to host multiple websites or services on the same IP address and port, differentiating them based on the hostname specified in the Ingress rules.

Conclusion

Kubernetes infrastructure is designed to be highly compartmentalized. A highly structured plan for communication is important, as namespaces, containers, and pods are meant to keep components separate from each other.

There are three parallel network layers in Kubernetes:

  • Node-to-node communication: Physical or virtual host routing (underlay routing).
  • Pod-to-pod communication: overlay network or other network plugins to achieve it.
  • Service-to-service communication: use kube-proxy and iptables rules to perform load balancing and route traffic to the correct endpoints (pods) based on labels and selectors.

Service Loadbalancer over Pod Network over Node Network = Kubernetes Network

In more detail, it can be described as follows: ‘Service Loadbalancer over (Container Shared Namespace Network(Localhost) on Pod Network) over Node Network = Kubernetes Network.’

Whether it’s Docker or Kubernetes, they both run on Linux. Linux can be considered the fundamental foundation. Before Kubernetes existed, developers used Linux networking architecture to fulfill advanced requirements. Kubernetes, when introduced, was built upon Linux networking principles.

Hence, gaining an overview of the counterparts of Linux networking components in the Kubernetes networking architecture is valuable. Many engineers with years of Linux experience will find this format convenient for understanding how Kubernetes implements complex networking. Please refer to the table below for details.

Linux Networking                                                | Kubernetes Networking
----------------------------------------------------------------|------------------------------------------------------------------------------------
Network Namespace                                               | Node in Root Network Namespace; Pod in Pod Network Namespace
Network Interface (Physical + Virtual)                          | CNI (Container Network Interface)
Veth Pair                                                       | Veth Pair on pods
Bridge Interface                                                | cni0 (on Flannel plugin)
NULL                                                            | CRI (Container Runtime Interface): Docker/containerd/CRI-O/Mirantis Container Runtime
NULL                                                            | kube-apiserver
NULL                                                            | kube-controller-manager
NULL                                                            | etcd
systemd/Container Runtime/Process Supervisor/cgroup/namespace   | kubelet
iptables mode/IPVS mode/routing and network policy/proxy server | kube-proxy
IPv4 and IPv6 Protocol Stack                                    | SingleStack/PreferDualStack/RequireDualStack/ipFamilies
DNS                                                             | KubeDNS (old), CoreDNS (k8s version > 1.13)
Load Balancer (nginx, haproxy)                                  | Service LoadBalancer/Service ClusterIP (L4)/Ingress (L7)
Security (Netfilter/iptables)                                   | NetworkPolicy
Logs                                                            | Fluentd/Logtail

Susie Su

Susie Su is a DevOps Manager and Global Software Operations architect at Signify (Philips Lighting). She brings a decade of diverse internet industry experience, from networking and DevOps to cutting-edge software architecture. A trailblazing woman in tech, she shares her insights through articles, speeches and events, championing women’s leadership.
