Making Sure Your Cloud-Native Applications Can Fail

April 25, 2023 | Lee Atchison | Tags: cloud native, cloud-native applications, containers, observability, SRE

Make sure your applications can fail. Sounds weird, doesn’t it? But nothing is more critical to creating a highly reliable, cloud-native application than ensuring you can fail successfully.

The key is planning.

In a previous article, I talked about accepting more failures by reducing the amount of QA you put your applications through during the release cycle. The idea behind this strategy is simple: Less QA time means a shorter cycle time. A shorter cycle time means more frequent releases and cheaper deployments. When deployments are cheap, you can fix the issues you discover more quickly and at lower cost.

In essence, reducing the amount of QA ends up reducing your risk, which ultimately improves your quality. Your entire deployment process is faster and, surprisingly, more reliable. Accepting some amount of failure lets you build a more reliable system.

But there is more to accepting failure than reducing cycle time. Using architectural patterns that allow systems to self-repair after failures is critical to maintaining a highly available application. Rather than investing heavily in making each piece of infrastructure reliable, you can use cheaper infrastructure and invest in greater redundancy, which improves your overall availability.
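The arithmetic behind the redundancy claim can be sketched in a few lines. This is a simplified illustration, not a capacity-planning tool: the availability figures are hypothetical, the function name `combined_availability` is invented for this example, and the formula assumes replicas fail independently and that one surviving replica is enough to serve traffic. Real failures are often correlated, so treat the result as an upper bound.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent components when one survivor suffices.

    Assumption: failures are independent, which is optimistic in practice.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The group is down only when every replica is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


# Hypothetical numbers: one premium server vs. three cheaper, less reliable ones.
premium = combined_availability(0.999, 1)
cheap_x3 = combined_availability(0.99, 3)
print(f"premium: {premium:.6f}  three cheap replicas: {cheap_x3:.6f}")
```

Under these assumptions, three servers that are each available 99% of the time come out ahead of a single server that is available 99.9% of the time.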

For this model to succeed, you must architect your system to accept and respond to failures effectively and automatically. If you build systems, processes and procedures for handling application failures, you can accept a higher failure rate because you are better prepared to deal with it.

Let’s consider an example: a cloud server instance. A common best practice for operating a cloud server instance is to maintain it as a stateless server. You can do this by, among other things, using storage external to the instance and not storing critical data on the server itself. Then, construct the server instance so that, on startup, it automatically sets itself up and takes on its assigned job. This step usually involves some automated setup scripts and procedures.
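Here is a minimal sketch of what such a startup script might look like. Everything in it is an assumption made for illustration: the environment variable names `INSTANCE_ROLE` and `STATE_STORE_URL`, the two roles, and the example URL are all hypothetical. The point is only that the instance learns its job and the location of its state from outside itself, so nothing critical lives on the server.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping


@dataclass(frozen=True)
class InstanceConfig:
    role: str             # which job this instance should take on
    state_store_url: str  # where state lives, outside the instance


def load_config(env: Mapping[str, str]) -> InstanceConfig:
    """Read everything the instance needs from its environment, not local disk."""
    return InstanceConfig(
        role=env["INSTANCE_ROLE"],              # hypothetical variable name
        state_store_url=env["STATE_STORE_URL"],  # hypothetical variable name
    )


def start_web(cfg: InstanceConfig) -> str:
    return f"web: serving requests, state kept at {cfg.state_store_url}"


def start_worker(cfg: InstanceConfig) -> str:
    return f"worker: processing jobs, state kept at {cfg.state_store_url}"


# Hypothetical roles an instance can be assigned at launch.
ROLES: Dict[str, Callable[[InstanceConfig], str]] = {
    "web": start_web,
    "worker": start_worker,
}


def bootstrap(env: Mapping[str, str]) -> str:
    """Configure the instance and start its assigned job with no manual steps."""
    cfg = load_config(env)
    if cfg.role not in ROLES:
        raise ValueError(f"unknown role: {cfg.role!r}")
    return ROLES[cfg.role](cfg)


# A real script would likely pass os.environ; a fixed mapping keeps the demo runnable.
demo_env = {"INSTANCE_ROLE": "web", "STATE_STORE_URL": "https://state.example.internal"}
print(bootstrap(demo_env))
```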

Once you’ve done these two things, server management becomes substantially more straightforward. For example, if a process goes crazy on one of your servers and is mucking with the system’s operations, your response is simple. Terminate the server and start a new one. The new server will start up and replace the bad instance. It will automatically set itself up and begin operating precisely as it should.
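The terminate-and-replace response can be sketched as a small reconciliation step. This is an in-memory simulation under stated assumptions: `Instance`, `launch_instance` and `reconcile` are hypothetical names, and `launch_instance` stands in for whatever cloud API call would start a fresh, self-configuring server. It does not handle scaling down when more healthy instances exist than desired.

```python
import itertools
from dataclasses import dataclass
from typing import List

_ids = itertools.count(1)


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True


def launch_instance() -> Instance:
    """Stand-in for a cloud API call that starts a fresh, self-configuring server."""
    return Instance(instance_id=f"i-{next(_ids):04d}")


def reconcile(fleet: List[Instance], desired: int) -> List[Instance]:
    """Drop unhealthy instances and launch replacements until `desired` are running."""
    # "Terminating" is modeled by simply not carrying the bad instance forward.
    survivors = [inst for inst in fleet if inst.healthy]
    while len(survivors) < desired:
        survivors.append(launch_instance())
    return survivors


fleet = [launch_instance() for _ in range(3)]
fleet[1].healthy = False            # one server has a runaway process
fleet = reconcile(fleet, desired=3)
print([inst.instance_id for inst in fleet])  # ['i-0001', 'i-0003', 'i-0004']
```

Nobody diagnoses or repairs the bad server here. It is discarded, and a new one takes its place.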

You’ve created a design pattern that lets your application manage an infrastructure failure gracefully. As a result, a simple infrastructure failure (in the form of a server failure) doesn’t result in a catastrophic application failure.

Implementing a few simple but essential steps such as these gives your entire application higher availability because it can tolerate and respond to failures more quickly and easily. By understanding and accepting the types of failures that can routinely happen, you’ve made your application resilient to those types of failures.

The entire cloud is built on this basic premise. The cloud is not built from components made robust enough to stay operational at all times. Instead, it is constructed from ordinary, lower-reliability components. Then, it is architected using redundancy and self-healing protocols so that when (not if) a failure occurs, the system can quickly and easily recover without causing higher-level failures.

The highly dynamic nature of cloud resources enables this design-for-failure ability. Being able to quickly replace resources with fresh resources, such as server instances, is one aspect. Being able to add needed resources rapidly to respond to desired or undesired traffic spikes is another aspect that improves your ability to survive potential failure scenarios. Networking problems, denial-of-service (DoS) attacks, resource failures, data center failures—any of these types of failures can be managed defensively with your cloud-native, cloud-enabled application architecture.
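As a rough illustration of adding resources in response to a traffic spike, the sketch below computes how many instances a fleet should run for a given request rate. The function name, the per-instance capacity figure and the minimum and maximum bounds are all hypothetical. Real autoscalers typically also consider cooldown periods and multiple metrics. The upper bound is included because an unbounded response to undesired traffic, such as a DoS attack, can itself become a cost problem.

```python
import math


def desired_instances(
    requests_per_sec: float,
    capacity_per_instance: float,
    minimum: int = 2,    # keep some redundancy even when idle (assumed policy)
    maximum: int = 20,   # cap growth so a flood cannot scale you without limit
) -> int:
    """Return how many instances to run for the observed request rate."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    return max(minimum, min(maximum, needed))


# Hypothetical capacity: each instance handles about 100 requests per second.
for load in (50, 950, 10_000):
    print(load, "->", desired_instances(load, capacity_per_instance=100))
```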

This philosophy was best expressed by Werner Vogels, CTO of Amazon, when he said, “Everything fails all the time.” What matters isn’t how you keep something from failing. What matters is what you do and how you respond when something fails. By architecting your application to be tolerant of failures, you dramatically improve the availability of your application as a whole.

Cloud-native applications are built using this premise. Multiple instances of each microservice are set up to operate in a highly distributed infrastructure designed so that if parts of the system fail, the rest of the system can respond, repair and continue operating normally. Designing for failure is a critical aspect of any cloud-native application architecture.
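One way a caller can take advantage of multiple instances is to fail over from one replica to the next. The sketch below is a simplified illustration, not a production client: `call_with_failover` and `AllReplicasFailed` are hypothetical names, each replica is modeled as a plain Python callable, and real systems would usually add timeouts, backoff and load balancing on top.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica of the service could answer."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in turn and return the first successful response."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as exc:
            # This instance is down; remember why and move on to the next one.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed")


def broken_instance() -> str:
    raise ConnectionError("instance down")


def working_instance() -> str:
    return "ok"


# The first instance has failed, but the caller still gets an answer.
print(call_with_failover([broken_instance, working_instance]))  # ok
```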