Container Orchestration: Avoiding Errors and Misconfigurations

June 2, 2020 | Robert Brennan | Tags: CI/CD, configuration, container images, container orchestration, containers, devops, kubernetes

by Robert Brennan

There’s no denying that containers are the future of application development or that container orchestration platforms such as Kubernetes are accelerating adoption. It’s a logical move: Containers neatly package up everything needed to ensure applications run smoothly and reliably across different environments. They also can help accelerate an organization’s digital transformation. So, it’s not surprising that teams everywhere are adopting containers.

But because both containers and Kubernetes bring so much functionality to the table, there are many variables in play—and that means many ways to make mistakes. This is especially true for organizations new to this technology that may not have a big enough DevOps team to effectively manage a Kubernetes cluster. If you’re familiar with cluster configurations, you know there are plenty of “unknown unknowns” and many ways to proceed, and it’s that multitude of options that can lead teams astray.

But let’s be clear: Errors can and do happen to anyone. Even experienced teams who are committed to correcting their most critical problems will always find new issues that have fallen by the wayside. The right preventive measures can help; so can adopting community-driven best practices. Yet, these can be tough to stay on top of in a rapidly evolving space.

Let’s look at the most common problem areas in container orchestration and their solutions.

Cluster Configuration

The first thing you’ll want to focus on is making sure the Kubernetes cluster itself is set up securely. This includes not only the baseline Kubernetes version but the add-ons and optional APIs you use as well.

One way to stay safe is to keep up with the latest releases, at least at the patch level. Updating add-ons such as cert-manager and nginx-ingress can be a chore, especially when there are breaking changes, but running the latest stable version puts you in the best position to pick up fixes as vulnerabilities are announced. Updating Kubernetes itself can be scary, but the same principle applies, and you should at least have a process and timeline for managing updates.
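As a rough illustration of tracking updates "at least at the patch level," the sketch below classifies how far a component lags behind its latest stable release. The helper names and the version numbers in the inventory are hypothetical placeholders, not real release data; in practice the inventory would come from your cluster and from each project's release feed.

```python
import re

def parse_version(tag):
    """Parse a tag such as 'v1.18.3' into a (major, minor, patch) tuple."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"not a plain semantic version: {tag!r}")
    return tuple(int(part) for part in match.groups())

def update_status(running, latest):
    """Classify how far `running` lags behind `latest`."""
    run, new = parse_version(running), parse_version(latest)
    if run >= new:
        return "current"
    if run[:2] == new[:2]:
        return "patch behind"       # same minor line: candidate for a routine update
    return "minor or major behind"  # may include breaking changes: plan the upgrade

# Hypothetical inventory: component -> (running version, latest stable version).
inventory = {
    "kubernetes": ("v1.18.2", "v1.18.3"),
    "cert-manager": ("v0.14.3", "v0.15.1"),
    "nginx-ingress": ("0.32.0", "0.32.0"),
}

for name, (running, latest) in inventory.items():
    print(f"{name}: {update_status(running, latest)}")
```

Separating "patch behind" from larger gaps mirrors the distinction drawn above: patch updates can usually follow a routine schedule, while minor and major jumps are where breaking changes tend to appear and where a process and timeline matter most.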

You should also make sure you’re using built-in management features such as role-based access control (RBAC), network policy and resource quotas. These mechanisms ensure that both applications and engineers can access only the APIs, workloads and compute resources they need. Without them, an attacker who compromises a single workload can more easily escalate to admin-level privileges, and well-meaning engineers can make fatal mistakes.
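To make these three mechanisms concrete, here is a minimal sketch that builds one manifest of each kind as a plain Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The namespace "team-a", the object names and the quota values are assumptions chosen for illustration; appropriate rules and limits depend on your workloads.

```python
import json

def read_only_role(namespace, name="pod-reader"):
    """RBAC Role that allows reading pods and their logs, and nothing else."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"namespace": namespace, "name": name},
        "rules": [{
            "apiGroups": [""],  # "" selects the core API group
            "resources": ["pods", "pods/log"],
            "verbs": ["get", "list", "watch"],
        }],
    }

def namespace_quota(namespace, cpu, memory, name="team-quota"):
    """ResourceQuota capping the total CPU and memory limits in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"namespace": namespace, "name": name},
        "spec": {"hard": {"limits.cpu": cpu, "limits.memory": memory}},
    }

def default_deny_ingress(namespace, name="default-deny-ingress"):
    """NetworkPolicy that selects every pod and allows no inbound traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"namespace": namespace, "name": name},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }

# Hypothetical namespace and quota values.
manifests = [
    read_only_role("team-a"),
    namespace_quota("team-a", cpu="8", memory="16Gi"),
    default_deny_ingress("team-a"),
]
for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

The Role still needs a RoleBinding to attach it to a user or service account, and a default-deny policy only takes effect when the cluster's network plugin enforces NetworkPolicy.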

Vulnerable Container Images

Container vulnerabilities are ubiquitous. Teams often use outdated base images or install older versions of components on top of them. Even if an image initially has no known vulnerabilities, new CVEs are announced every day, and it’s all but guaranteed that any given image will eventually have a vulnerability announced.

Here’s one insidious example of a common scanning gap: Many teams scan container images during the CI/CD process and will break the build if a vulnerability is found. This is great! But say CI/CD passes and the container image makes it into your Kubernetes cluster. What happens if a vulnerability is announced the next day? Since you’re only scanning in CI/CD, you won’t catch it until someone else makes a change. And until then, your cluster will be vulnerable.

Instead, teams should make sure they’re continuously scanning every image they’re using. This could be done using a regularly scheduled CI job, an in-cluster CronJob or a third-party validation platform.
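The scheduled-scan idea can be sketched as two steps: collect every image actually running in the cluster, then re-scan each one regardless of whether anything has changed in CI/CD. The sketch assumes the pod list has the shape produced by `kubectl get pods --all-namespaces -o json`; `scan_image` is a hypothetical stand-in for whatever scanner you use, and the image names and finding identifier below are placeholders.

```python
def images_in_use(pod_list):
    """Return the sorted, de-duplicated images referenced by a pod list."""
    images = set()
    for pod in pod_list.get("items", []):
        spec = pod.get("spec", {})
        # Init containers run real code too, so include them in the scan.
        for key in ("initContainers", "containers"):
            for container in spec.get(key) or []:
                images.add(container["image"])
    return sorted(images)

def rescan(images, scan_image):
    """Scan each image; return only the images that have findings.

    `scan_image` is a hypothetical callable: image name -> list of finding IDs.
    """
    findings = {}
    for image in images:
        result = scan_image(image)
        if result:
            findings[image] = list(result)
    return findings

# Placeholder data standing in for the cluster and the scanner.
pods = {"items": [
    {"spec": {"containers": [{"name": "web", "image": "example/web:1.4.2"}]}},
    {"spec": {"initContainers": [{"name": "init", "image": "example/init:0.9"}],
              "containers": [{"name": "web", "image": "example/web:1.4.2"}]}},
]}

def fake_scanner(image):
    # Pretend a new advisory was published for one image after it shipped.
    return ["EXAMPLE-FINDING-1"] if image == "example/init:0.9" else []

print(rescan(images_in_use(pods), fake_scanner))
```

Run on a schedule, a job like this would flag an image the day an advisory is published rather than waiting for the next code change to trigger the pipeline.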

Deployment Configuration Gaps

This is where most errors occur. The reason is simple: The Dev team is responsible for the container, the Ops team is responsible for the cluster and the deployment configuration lies somewhere in a murky gray area between them. Cultural differences, miscommunication and mismatched expectations in container orchestration can lead to serious issues here.

Talk to a Dev team and an Ops team and you’ll find they have very different goals. The Dev team wants to ship features as quickly as possible and pursue innovative changes. But the Ops team aims for stability, scalability and predictability.

This cultural difference often comes to light when building a deployment configuration. The Dev team’s priority is to ensure the application functions; they may request large amounts of memory and CPU, grant their containers excessive security capabilities or neglect to build “optional” liveness and readiness probes. But when there’s a security breach, the cluster starts running out of resources or an application fails to scale appropriately, it’s the Ops team that’s on the hook, so they’ll advocate for a much tighter and more comprehensive deployment configuration.

When developers own the deployment configuration (which is often the case, given the amount of application context necessary), it can be difficult for Ops to convince them of the need for things such as resource limits and health probes. After all, Kubernetes treats these fields as optional and doing the work to set them appropriately can feel like a distraction from user-facing features. Third-party configuration validation tools, either open source or commercial, can help communicate the importance of strong configuration and encourage development teams to stay up-to-date with best practices.
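A minimal sketch of what such a validation check might look like is shown below. It inspects a Deployment manifest, loaded as a Python dictionary, for the gaps described above: missing resource requests and limits, missing health probes, and elevated security settings. The function names and the sample manifest are hypothetical, and real validation tools cover far more cases.

```python
def audit_container(container):
    """Return a list of configuration gaps for one container spec."""
    problems = []
    resources = container.get("resources", {})
    for kind in ("requests", "limits"):
        for resource in ("cpu", "memory"):
            if resource not in resources.get(kind, {}):
                problems.append(f"missing resources.{kind}.{resource}")
    for probe in ("livenessProbe", "readinessProbe"):
        if probe not in container:
            problems.append(f"missing {probe}")
    security = container.get("securityContext", {})
    if security.get("privileged"):
        problems.append("runs privileged")
    added = security.get("capabilities", {}).get("add", [])
    if added:
        problems.append("adds capabilities: " + ", ".join(added))
    return problems

def audit_deployment(deployment):
    """Map each container name in a Deployment to its list of gaps."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    return {c["name"]: audit_container(c) for c in containers}

# Hypothetical Deployment: requests are set, but limits and a liveness probe
# are missing, and the container asks for an extra capability.
deployment = {"spec": {"template": {"spec": {"containers": [{
    "name": "web",
    "image": "example/web:1.4.2",
    "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}},
    "readinessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
    "securityContext": {"capabilities": {"add": ["NET_ADMIN"]}},
}]}}}}

for name, problems in audit_deployment(deployment).items():
    for problem in problems:
        print(f"{name}: {problem}")
```

A check like this can run in CI against the manifests developers already own, which turns the Ops team’s expectations into concrete, reviewable feedback rather than a negotiation.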

Getting deployment configuration right is difficult, but critical. Without the right measures in place, you’re likely to run into problems down the line—anything from a security breach to cloud cost overruns to full-blown application outages.

Cultivating Collaboration

The solution here is more human than technological: intelligent collaboration. While the aforementioned cultural differences aren’t going away anytime soon, team leads can facilitate a more effective approach to container orchestration with a few practices.

Establish a culture of trust. The Dev and Ops teams need to feel they are partners in the same goal, rather than pursuing separate missions under the same company mantle. Any tension will hinder the ability to work together effectively, so make sure these teams recognize each other’s value.
Set up strong lines of communication. These teams need to talk to each other and easily share documents, updates and ideas. Rather than keeping them in silos, create pathways so they can check in with each other at a moment’s notice, collect feedback and access the same information.
Standardize processes and tooling. Shared, neutral ground is critical for positive collaboration. If your Ops team uses Jira, but your Dev team uses GitHub issues, there’s going to be miscommunication. Engineers will hesitate before opening communication inside another team’s tooling, so make sure everyone feels at home in the same place.

Shaping the Future of Software

The benefits you reap from a new technology are commensurate with the strength of the process you build around it. Teams that bring strong container orchestration practices to the table, whether their own or a partner’s, are going to enjoy faster deployments, a more efficient infrastructure and a positive culture of collaboration. From streamlined operations to a more secure environment to stronger productivity, the advantages are too valuable to ignore—the right practices can unlock a new world of cloud-native development.