Zero-Trust Kubernetes and the Service Mesh

If you’re doing security in Kubernetes, chances are good that you’ve at least heard of zero-trust. You’ve probably been asked to deploy it or at least to think about how to deploy it—after all, even the White House is talking about it!

So, let’s talk a little bit about how to actually get that done.

Zero-Trust Summarized

A quick recap: Zero-trust is a model for deciding whether to trust a given entity to interact with another, centered on the ideas that perimeter security is not sufficient and that we need fine-grained checks of every access, every time, from everyone. This is a great fit for the cloud-native world, since we no longer control the hardware that used to provide the security perimeter.

The easiest way to cleanly handle that need at the workload level, by far, is to install a service mesh.

Service Mesh and Zero-Trust

A service mesh, to quickly summarize, is a layer of software under your application that adds security, observability and reliability features at the platform level, freeing up your application developers to focus on the business needs of your application. As usual in the cloud-native world, there are multiple service meshes to choose from, both open source and commercial, including Linkerd, Istio, Open Service Mesh and others, but they all tackle this same set of functions (to varying degrees, of course).

Service meshes work by inserting themselves into the cluster network stack so that they can mediate and monitor communications in your cluster. In most cases, they do this by inserting proxy sidecars next to your application containers. This might sound excessive, but Kubernetes makes it straightforward and fairly easy to manage. An important added benefit is that since the sidecars run inside the application workload's Kubernetes Pod, they inherit the same permissions and access rights as the application itself, providing a clear security boundary and keeping operations simple.
Once the proxies are in place, the mesh reconfigures Kubernetes networking so that all communication to and from the application pods is routed through the proxies, letting the proxies control and measure that communication without any changes to the application. Meshes can typically do things like automatic mTLS, robust authentication and authorization policy enforcement, automatic retries and quite a bit more. They are extremely powerful tools precisely because they have such low-level, broad access to the communications happening in your application.
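As a minimal sketch of how this looks in practice with Linkerd: opting a workload into the mesh is typically just an annotation on the pod template, which tells Linkerd's admission webhook to inject the proxy sidecar when pods are created. The workload name and image here are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp                        # hypothetical workload
  namespace: booksapp
spec:
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled    # ask Linkerd to inject the proxy sidecar
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: example.com/webapp:1.0 # hypothetical application image
```

No application changes are needed; the injected proxy transparently handles the pod's traffic from then on.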

Identity in the Mesh

A key point to remember about zero-trust in Kubernetes is that we must not base identity on the network; we don't control the network. Instead, we need a workload identity tied to the workload itself, with no association with the network at all. There are several different approaches, but a common thread is to base workload identity on the Kubernetes ServiceAccount used by the workload. Kubernetes practitioners are already familiar with the ServiceAccount concept, so it's easy to give each workload a unique ServiceAccount, and each ServiceAccount has an associated unique ServiceAccount token that the workload can access.
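As a sketch (the names here are illustrative), giving a workload its own identity can be as simple as creating a dedicated ServiceAccount and referencing it from the workload's pod spec:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: webapp                     # one ServiceAccount per workload
  namespace: booksapp
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: booksapp
spec:
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      serviceAccountName: webapp   # ties these pods to the webapp identity
      containers:
      - name: webapp
        image: example.com/webapp:1.0   # hypothetical image
```

Pods that omit serviceAccountName fall back to the namespace's default ServiceAccount, which defeats the purpose of per-workload identity, so it's worth being explicit.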

I’ll use Linkerd, the open source CNCF-graduated service mesh, as a vehicle to provide a specific example: For each workload, Linkerd uses that workload’s ServiceAccount token to cryptographically bootstrap a TLS certificate specific to the workload. These certificates then form the basis for Linkerd to do industry-standard mTLS for all communication between workloads. This lets Linkerd verify the identity of the workload on both ends of the connection, as well as providing encryption and integrity checks for all communications between workloads.

As soon as we talk about mTLS certificates, of course, we need to talk about the trust chain for the certificate. This is another area that varies per mesh, but all of them provide a way to hook the trust chain to an external CA, so that you can tie the mesh into your organization’s existing PKI—and all of them require paying attention to the certificates in the trust chain! A common technique is to use tools like CNCF Incubating project cert-manager both to allow the root CA for the mesh to come from your corporate PKI, and to automate certificate rotation for the mesh.

To continue with our Linkerd example: Linkerd has a two-level trust chain, where a trust anchor certificate signs an identity issuer certificate, which in turn signs workload certificates. Linkerd is commonly set up using cert-manager to pull the trust anchor from a system like Vault and with cert-manager directly managing rotation of the identity issuer.
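As a sketch of that setup, following the pattern in Linkerd's documentation: a cert-manager Issuer wraps the trust anchor (whose key material might come from Vault or your corporate PKI), and a Certificate resource keeps a short-lived identity issuer rotated automatically. The durations and key algorithm here are illustrative, not prescriptive:

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: linkerd-trust-anchor
  namespace: linkerd
spec:
  ca:
    secretName: linkerd-trust-anchor   # holds the trust anchor key material
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
spec:
  secretName: linkerd-identity-issuer
  duration: 48h          # short-lived identity issuer...
  renewBefore: 25h       # ...renewed well before it expires
  issuerRef:
    name: linkerd-trust-anchor
    kind: Issuer
  commonName: identity.linkerd.cluster.local
  isCA: true
  privateKey:
    algorithm: ECDSA
  usages:
  - cert sign
  - crl sign
  - server auth
  - client auth
```

With this in place, cert-manager re-issues the identity issuer on schedule, and the trust anchor's private key never needs to live in the cluster at all.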

Linkerd itself needs access to both the public and private keys for the identity issuer, but only the public key for the trust anchor. Limiting access in this way takes a bit of care when setting up cert-manager, but it’s worth it; it makes it safer to have fairly long expiry times for the trust anchor while still making it straightforward to automate frequent rotations of the identity issuer for better security. It also provides some independence from network topology; for example, secure communication across two clusters is easy if they share the same trust anchor.

Remember—we’re talking about workload identity here. This is separate from the identity of an end user of your application, but it’s a critical first step. If you can’t know that you’re really talking to the user-authentication microservice, how can you trust what it tells you about your end user?

Policy

Once we have workload identity sorted out, we can turn to policy to enforce authentication and authorization. For proper zero-trust, we need to pay particular attention to the principle of least privilege: Each workload should have only the access it needs and no more. Paradoxically, least privilege can require very complex descriptions of policy—things like, “The API gateway workload is allowed to request the list of users from the user-management workload but it is not allowed to attempt to update the list of users.”

It’s definitely possible to write checks like these into the application workloads. However, it’s expensive and fragile; all your application developers need to get it perfectly right every time, and you need to update the workload images whenever you want to update policy. Allowing the service mesh to do it instead allows you to separate the concerns of writing the application and keeping the policy descriptions up-to-date.

Again, different meshes approach this differently, but the most common mechanism here is mesh-specific policy CRDs. Continuing with Linkerd as the example, here’s a set of Linkerd policy resources. We start with an HTTPRoute that states that an HTTP GET of either /authors.json or any path starting with /authors/ will be handled by the authors workload:


apiVersion: policy.linkerd.io/v1beta1
kind: HTTPRoute
metadata:
  name: authors-get-route
  namespace: booksapp
spec:
  parentRefs:
  - name: authors
    kind: Server
    group: policy.linkerd.io
  rules:
  - matches:
    - path:
        value: "/authors.json"
      method: GET
    - path:
        value: "/authors/"
        type: "PathPrefix"
      method: GET
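Note that the parentRefs in the HTTPRoute points at a Linkerd Server resource named authors, which binds the route to a specific port of the authors workload. A sketch of what that Server might look like (the pod label and port name here are assumptions):

```yaml
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: authors
  namespace: booksapp
spec:
  podSelector:
    matchLabels:
      app: authors      # assumed label on the authors pods
  port: service         # assumed name of the authors container port
  proxyProtocol: HTTP/1
```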

Next, the pair of Linkerd resources that allow that route to be used only by workloads with two specific workload identities:


apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: authors-get-policy
  namespace: booksapp
spec:
  targetRef:
    group: policy.linkerd.io
    kind: HTTPRoute
    name: authors-get-route
  requiredAuthenticationRefs:
  - name: authors-get-authn
    kind: MeshTLSAuthentication
    group: policy.linkerd.io

apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: authors-get-authn
  namespace: booksapp
spec:
  identities:
  - "books.booksapp.serviceaccount.identity.linkerd.cluster.local"
  - "webapp.booksapp.serviceaccount.identity.linkerd.cluster.local"

The overall effect is that the books and webapp identities in the booksapp namespace will be able to read information from the authors workload, but they won’t be able to interact with authors in any other way. Workloads with other identities won’t be able to interact with the authors workload at all, because Linkerd takes the approach that once you link any AuthorizationPolicy to a route, then any traffic matching that route must match an AuthorizationPolicy to be allowed. Unauthorized traffic is summarily dropped.

Again, the various meshes approach policy in a few different ways, but this functionality is common across all of them: There’s always a way to define exactly which actors are allowed to use a route (or an entire workload) and to make sure that the rules are followed.

And remember that we’re talking about workload authentication and authorization. This is a necessary part of being able to secure your application: You have to have control over how the workloads can interact with each other to have confidence in application-level security. Trying to do all of it at the application level is too costly and fragile; better to let the mesh handle it.

Mesh Limitations

After talking about what service meshes can do, it’s worth talking about what they can’t do: No service mesh is a silver bullet for security, after all.

The biggest thing to be aware of is that meshes are all about security over the wire. They don’t help at all with security for data at rest, for example; the mesh can make sure that requests to your PII database will be encrypted, but you’ll still need to make sure that the data are encrypted when the database writes them to disk.
Additionally, for the mesh to be most useful, it needs to know what protocol your workloads are using to communicate. This is somewhat less about security and somewhat more about reliability, but most meshes are at their best when you’re using standard protocols like HTTP or gRPC to communicate.

Finally, as we’ve mentioned before, identity in the mesh is not the same thing as identity in your application. This is a feature—even if your logged-in user is allowed to, say, transfer funds between bank accounts, a weather-applet workload in your cluster shouldn’t be able to move money around! Effective zero-trust security requires checks at both levels: The request to move money should come from a workload that makes sense and be happening on behalf of a logged-in user with proper permissions.

The service mesh can tackle the workload part of this on its own, but application-level policy is a separate thing. In many cases, an API gateway atop the service mesh can be a great way to extend the security of the mesh up into the application.

Zero-Trust Kubernetes and the Service Mesh

Rethinking security for a cloud-native world is a tall order. We’re talking about changing how we manage identity, looking at policy separate from any application and managing it all at the platform level so that the application developers don’t have to worry about it. This might be happening under deadline (U.S. federal agencies, for example, have to get this done by 2024), and it will always be happening in a world where it’s critical to keep costs down and not interrupt critical services.

The good news is that Kubernetes users have a leg up: Sliding a service mesh under your application can meet a lot of your zero-trust requirements easily at very low cost. The existing Kubernetes mechanisms for injecting sidecars and reconfiguring container networking on the fly are extremely powerful: Together, they provide an elegant way to add functionality to your application without rewriting it.

Flynn

Flynn is a technology evangelist at Buoyant, working on spreading the good word about Linkerd, the graduated CNCF service mesh that makes the fundamental tools for software security and reliability freely available to every engineer, and about Kubernetes and cloud-native development in general. Flynn is also the original author and a maintainer of the Emissary-ingress API gateway, also a CNCF project.
