Striving for Higher Quality Cloud-Native Apps by Leveraging Chaos

In a recent article, I discussed why bugs and mistakes should not be feared in the software development process. In a related article, I explored ‘replace, don’t repair’ as a strategy for resolving production issues.

The fact is, building high-quality modern applications is quite different than it used to be. No longer do we need—or want—long test suites written by large QA teams to evaluate our applications before we deploy them. No longer do we need system administrators to hand-tune our servers and keep them operating. We terminate servers when they no longer perform the way we want them to.

Skipping long QA cycles improves cycle time, and terminating failed infrastructure rather than repairing it reduces mean-time-to-repair. Together, these strategies improve the quality of our applications.

These strategies are easier to implement with a cloud-based, cloud-native application. The dynamic infrastructure of the cloud, combined with the continuous integration and continuous deployment (CI/CD) methodologies inherent in cloud-native architectures, enables rapid innovation.

A related technique that can improve long-term application quality and reliability is chaos-driven infrastructure.

Chaos Means Quality

A chaos-driven infrastructure intentionally generates chaos by inserting errors into the system to encourage a higher-quality solution overall. Netflix was one of the early adopters of this methodology, using a tool they called “Chaos Monkey.”

The idea behind chaos-driven infrastructure is to constantly introduce problems into an operating network and design the network to recover from these problems automatically.
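To make the idea concrete, here is a minimal sketch of what a chaos injector might look like: a script that randomly terminates one cloud instance that has opted in to experiments. The boto3 calls and the `chaos-eligible` tag are assumptions chosen for this illustration; this is not how Netflix’s Chaos Monkey itself is implemented.

```python
"""Minimal chaos-injection sketch (illustrative, not Netflix's Chaos Monkey).

Assumes instances opted in to chaos experiments carry a `chaos-eligible=true`
tag; the tag name is a hypothetical choice for this example.
"""
import random

import boto3

ec2 = boto3.client("ec2")


def eligible_instance_ids():
    """Return running instances that have opted in to chaos experiments."""
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-eligible", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


def inject_chaos():
    """Terminate one randomly chosen eligible instance."""
    candidates = eligible_instance_ids()
    if not candidates:
        return None
    victim = random.choice(candidates)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim


if __name__ == "__main__":
    terminated = inject_chaos()
    print(f"Terminated: {terminated}" if terminated else "No eligible instances.")
```

Run regularly, a script like this keeps steady pressure on the system, which is exactly what forces the recovery automation to stay sharp.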

Done correctly, the system reaches a balance: the chaos injected on one side is offset by automated problem resolution on the other, and your application continues to operate normally.

If the chaos introduces a problem that cannot be corrected automatically, support teams are immediately engaged to resolve the issue and to update the automated mechanisms so they can fix that class of problem on their own in the future. You increase the resiliency of your software by improving its automated repair capabilities.
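One way to picture that feedback loop is a remediation handler that applies a known automated fix when it recognizes a failure and pages the on-call engineer when it does not; each incident the engineers resolve by hand becomes a candidate for a new automated handler. The failure classes, handler functions and `page_on_call` hook below are hypothetical placeholders, not a specific product’s API.

```python
"""Sketch of an automated-remediation loop with an escalation path.

The failure classes, handler functions and paging hook are hypothetical;
the point is the pattern: known failures are repaired automatically,
unknown failures are escalated, and each manual fix becomes a new handler.
"""


def restart_service(incident):
    print(f"Restarting service for {incident['resource']}")


def replace_instance(incident):
    print(f"Replacing instance {incident['resource']}")


def page_on_call(incident):
    # Placeholder for a real paging integration.
    print(f"Escalating to on-call: {incident}")


# Known failure classes and the automated fix for each one.
# When on-call resolves a new class of failure by hand, it gets added here.
HANDLERS = {
    "process-crashed": restart_service,
    "instance-unhealthy": replace_instance,
}


def remediate(incident):
    """Apply an automated fix if one is known; otherwise escalate."""
    handler = HANDLERS.get(incident["failure_class"])
    if handler is None:
        page_on_call(incident)
        return False
    handler(incident)
    return True


if __name__ == "__main__":
    remediate({"failure_class": "process-crashed", "resource": "web-42"})
    remediate({"failure_class": "disk-corruption", "resource": "db-07"})
```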

The result is a system that stays operational and can resolve problems quickly and, in many cases, automatically. Additionally, you have a staff that is used to fixing these types of issues and knows what it takes to bring a system back online quickly. Your system has built a natural immunity to the random chaotic problems that occur.

If a real, random problem occurs, chances are this chaos-immune infrastructure will be able to handle it, and the problem won’t bring your application down. It works much like a vaccine, which primes your body’s immune system to combat a disease if it ever appears.

By actively managing when and how chaos is injected into the system, you can make sure the “chaos-based learning” occurs only when support engineers are available to resolve the issues that crop up. This reduces the annoying support escalations that always seem to happen in the middle of the night, and it shortens the longer mean-times-to-resolution that come with problems that surface when you aren’t physically and emotionally available to deal with them.
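A simple way to enforce that kind of scheduling is to gate every chaos run behind a staffed-hours check, so experiments only fire when engineers are around to respond. The weekday window and UTC timezone in this sketch are illustrative assumptions; a real schedule would follow your team’s on-call calendar.

```python
"""Sketch of gating chaos injection to staffed hours.

The 10:00-16:00 weekday window and the UTC timezone are illustrative
assumptions, not a recommendation for any particular team.
"""
from datetime import datetime, time, timezone

CHAOS_WINDOW_START = time(10, 0)  # experiments may start at 10:00
CHAOS_WINDOW_END = time(16, 0)    # and must stop by 16:00
WEEKDAYS = range(0, 5)            # Monday (0) through Friday (4)


def chaos_allowed(now=None):
    """Return True only during staffed weekday hours."""
    now = now or datetime.now(timezone.utc)
    return (
        now.weekday() in WEEKDAYS
        and CHAOS_WINDOW_START <= now.time() < CHAOS_WINDOW_END
    )


def maybe_run_experiment(run_experiment):
    """Run the experiment only when engineers are available to respond."""
    if chaos_allowed():
        run_experiment()
    else:
        print("Outside the chaos window; skipping this run.")


if __name__ == "__main__":
    maybe_run_experiment(lambda: print("Injecting failure..."))
```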

Adding chaos to your infrastructure improves your application quality.

Quality Through Chaos

Modern applications require high availability and scalability to meet the demands of our customers. Achieving quality in these applications is essential but not always straightforward. Traditional methods like using a QA team to develop and execute test suites are no longer effective in our fast-paced, innovation-driven world. Chaos-driven architectures give an unexpected yet highly valuable route to high-quality applications.

From fast cycle times to ‘replace, don’t repair’ to chaos-driven infrastructure, quality comes in unexpected ways in modern cloud-native applications.

Lee Atchison

Lee Atchison is an author and recognized thought leader in cloud computing and application modernization with more than three decades of experience, working at modern application organizations such as Amazon, AWS, and New Relic. Lee is widely quoted in many publications and has been a featured speaker across the globe. Lee’s most recent book is Architecting for Scale (O’Reilly Media). https://leeatchison.com
