Using Kubernetes for Mission-Critical Databases

Successfully using Kubernetes with production databases requires the right level of management and monitoring

It is remarkable how quickly Kubernetes has moved through the hype cycle to become an integral part of the discussion around agile enterprise IT environments, thanks to its ability to orchestrate containers. Kubernetes ensures containers are automatically reconfigured, scaled, upgraded, updated and migrated without disrupting applications and services. It monitors the health of the running container deployment and provides a level of high availability and continuity in failover or disaster scenarios. It can also load balance and automatically scale services up and down in response to changes in demand, and it facilitates networking among containers and with external environments. In addition, Kubernetes handles data storage in general, and statefulness in particular, by managing attached persistent or ephemeral volumes, which can be local or in the cloud.

What Kubernetes Can’t Do So Well

However, there are some additional technical requirements to consider when running relational database containers on Kubernetes. Quite rightly, Kubernetes treats pods running the same service as identical. This means that if you are running a master and multiple database replicas, all the pods in the Kubernetes cluster are treated like cattle, not pets. Unfortunately, because databases manage mission-critical information, this approach is not ideal. Traditionally, a database would be seen as a pet: indispensable, unique and a system that cannot afford to go down. We must also consider issues such as master/standby relationships and replication lag when repairing a cluster.
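To make the replication-lag concern concrete, the sketch below (a minimal illustration, assuming PostgreSQL 10 or later and placeholder host names and credentials) queries pg_stat_replication on the primary to see how far each standby has fallen behind; a cluster manager needs this kind of information before it can decide whether a standby is safe to promote.

```python
# Minimal sketch: measure standby replication lag from the primary.
# Host name and credentials are placeholders; column names assume PostgreSQL 10+.
import psycopg2

conn = psycopg2.connect(host="postgres-primary", dbname="postgres",
                        user="monitor", password="secret")
with conn.cursor() as cur:
    # pg_stat_replication has one row per connected standby.
    cur.execute("""
        SELECT application_name,
               state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication
    """)
    for name, state, lag_bytes in cur.fetchall():
        print(f"standby={name} state={state} replay_lag={lag_bytes} bytes")
conn.close()
```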

Cattle are disposable servers built with automated tools, so if they fail or are deleted they can be replaced with a clone immediately. Kubernetes treats a database as cattle, so although it offers high availability for the cluster, that is not enough reassurance for tier 1 enterprise applications that require four to five nines of availability; this allows roughly 4.3 minutes of unplanned downtime a month at 99.99% and 26.3 seconds a month at 99.999%. If a database goes down, connecting to the storage and performing crash recovery takes time, and that is time enterprise-ready databases cannot afford. They must be operational as close to immediately as possible, starting from the same point in time and without losing any data.
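The downtime budgets quoted above follow from simple arithmetic; the quick check below (assuming an average month of roughly 30.4 days) shows how they are derived.

```python
# Quick check of the downtime budgets implied by "four nines" and "five nines".
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60  # average month, roughly 43,830 minutes

for availability in (0.9999, 0.99999):
    allowed = MINUTES_PER_MONTH * (1 - availability)
    print(f"{availability:.3%} availability allows "
          f"{allowed:.2f} min ({allowed * 60:.1f} s) of downtime per month")
```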

Key Considerations for Kubernetes and Database High Availability

This is not to say that Kubernetes cannot provide some level of high availability, as mentioned earlier. So the key question has to be: how much downtime can your business cope with?

While this appears obvious, it should be foundational for deciding how you approach high availability for containerized databases. Clearly, there are different categories of database environment:

  • Development: This is where you develop and test applications and do not require completely robust performance. Kubernetes can offer sufficient high availability functionality in these environments.
  • Production: A database requiring a limited level of high availability support.
  • Mission-critical: A full production environment for enterprise customers that must ensure the highest possible availability.

Beyond understanding your tolerance for downtime, it is also important to consider how you manage your high-availability cluster. Clustered containers running Postgres (either PostgreSQL or EDB Postgres Advanced Server) require a controller to monitor and manage the cluster. Open source tools such as Patroni and Stolon provide some of this functionality, including monitoring of the cluster and management of the Postgres instances. As a business, you should evaluate your priorities and the total cost of ownership of using open source tools versus tools from companies that combine support with the tool to provide the levels of assurance needed in such mission-critical environments. You should create a checklist of functional requirements, assessing whether your monitoring and cluster management tool can:

  • Support both failover and switchover (a controlled failover).
  • Discover the nodes that are participating in the cluster and then coordinate operations within that cluster.
  • Perform network connection routing, because incoming connections must be routed from a fixed endpoint to the appropriate node.
  • Provide flexible load balancing across the cluster.
  • Ensure efficient use of database server resources with connection pooling (a minimal pooling sketch follows this list).
  • Select the most suitable node when a failover occurs, in particular by tracking the WAL receive and replay locations on each standby and allowing a node to be excluded entirely from becoming the new master (a failover-candidate sketch also follows this list).
  • Offer both synchronous and asynchronous replication modes.
  • Initialize new replicas.
  • Ascertain the nature of any failure, not just whether a failure occurred. Detecting a failure, and more importantly understanding the detected failure scenario, is essential to deciding how to respond when managing a cluster of servers.
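For the connection-pooling item in the checklist, pooling is often handled by a proxy such as PgBouncer sitting in front of Postgres; purely as an illustration of the resource-reuse idea, here is a minimal application-side sketch using psycopg2's built-in pool (host name and credentials are placeholders).

```python
# Minimal sketch of application-side connection pooling with psycopg2.
# Connection details are placeholders; an external pooler such as PgBouncer
# is the more common choice in front of a clustered Postgres deployment.
from psycopg2.pool import SimpleConnectionPool

pool = SimpleConnectionPool(minconn=1, maxconn=10,
                            host="postgres-primary", dbname="appdb",
                            user="app", password="secret")

conn = pool.getconn()       # borrow an existing connection instead of opening a new one
try:
    with conn.cursor() as cur:
        cur.execute("SELECT now()")
        print(cur.fetchone())
finally:
    pool.putconn(conn)      # return the connection to the pool for reuse
pool.closeall()
```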
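For the failover-candidate item, the sketch below shows the basic idea of selecting a node to promote: query each standby for how much WAL it has replayed and prefer the most advanced one, while honouring a list of nodes that must never become the master. The host names and credentials are placeholders, and real cluster managers such as Patroni layer fencing and failure analysis on top of this.

```python
# Minimal sketch: choose a promotion candidate by comparing standby WAL positions.
# Host names and credentials are placeholders; assumes PostgreSQL 10+ function names.
import psycopg2

STANDBYS = ["postgres-replica-1", "postgres-replica-2"]
NEVER_PROMOTE = {"postgres-replica-2"}   # nodes excluded from becoming the new master

def replayed_wal_bytes(host):
    """Return (is_standby, WAL bytes replayed) for one node."""
    conn = psycopg2.connect(host=host, dbname="postgres",
                            user="monitor", password="secret")
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT pg_is_in_recovery(),
                       COALESCE(pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '0/0'), 0)
            """)
            return cur.fetchone()
    finally:
        conn.close()

candidates = []
for host in STANDBYS:
    is_standby, replayed = replayed_wal_bytes(host)
    if is_standby and host not in NEVER_PROMOTE:
        candidates.append((replayed, host))

# Prefer the standby that has replayed the most WAL: it loses the least data on promotion.
if candidates:
    _, best = max(candidates)
    print("promotion candidate:", best)
else:
    print("no eligible standby found")
```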

Conclusion: Designing for Your Reliability Requirements

Ultimately, if you are considering containerization of your database infrastructure and looking to use Kubernetes for orchestration, the process is no different from building and deploying any other mission-critical database.

Kubernetes and containers are still maturing technologies, which means you must carefully consider the implications of moving mission-critical data to such a platform. It is reasonable to adopt the combination for test environments, but if you want to employ Kubernetes and containers for production databases, it is essential that you put the right level of management and monitoring in place.

Databases expect to receive VIP treatment when it comes to ensuring high availability and minimizing downtime, so you must weigh your organization’s tolerance of downtime versus the potential agility and flexibility that containers may offer when deploying database instances. Taking this approach will enable you to understand the best way to ensure high availability for your business while benefiting from the agility of containers and Kubernetes.

Dave Page

Dave Page has been actively involved in the PostgreSQL Project since 1998, as the lead developer of pgAdmin, maintainer of the PostgreSQL installers and occasional feature hacker. He also serves on the project's web and sysadmin teams and is a member of the PostgreSQL Core Team. He joined EnterpriseDB in 2007 and has been influential in the company’s direction and development of critical database management tools.
