Replace, Don’t Repair, for Higher Quality Cloud-Native Apps

In a recent article, I discussed an odd strategy for improving the quality of an application: Not testing it. The strategy is based on the philosophy that bugs and mistakes should not be feared in the software development process.

As odd as it may sound, skipping QA can improve the overall quality of your application by shortening cycle time and improving your ability to respond when something does need to be changed.

This rapid cycle time is a core characteristic of DevOps and is made substantially easier when the application is a cloud-based, cloud-native application. The dynamic infrastructure of the cloud—combined with the continuous integration and continuous deployment (CI/CD) methodologies inherent in cloud-native architectures—enables the rapid innovation leveraged by this technique.

In fact, there are other techniques that, on the surface, appear to lower quality. In reality, they assist the long-term quality improvement inherent in modern application architectures. One of those strategies is replace, don’t repair.

Replace, Don’t Repair

When you build out your own data centers, you need expertise in operating the servers in those data centers. System administrators are needed to fix problems that occur within the servers. Simple problems such as processes getting stuck, log files not rotating correctly, and configurations getting corrupted occur on servers all the time. If the server is used in production, sometimes you must take it offline to fix it. Either way, your application suffers while you are repairing the failed server.

With cloud-native applications, this problem goes away. Each cloud component—especially servers—can easily be replaced. If a server begins to act up—for instance, if a process isn’t working right on that server—it’s a simple matter to terminate that server instance and replace it with a completely new one. It’s often that easy to resolve an internal server issue. Rather than spending hours and hours of a valuable sysadmin’s time keeping an ill-performing application operating on a malfunctioning server, a simple terminate-and-restart can get your system back up and running in minutes. And, assuming you are handling redundancy and availability correctly in your cloud-native application, the action can be completely transparent to your running application, and customers may never even notice that you brought a server down.

So, is the server low on memory? Terminate and restart. A process is hung or stuck? Terminate and restart. Is an unknown problem causing the application on a server instance to act up? Terminate and restart. The vast majority of the time, that will completely resolve the issue.
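The terminate-and-replace loop can be sketched in a few lines of Python. This is a minimal simulation, not a real cloud provider's API: the `Instance` class, the `launch_instance` function, and the 256 MB memory threshold are all hypothetical stand-ins chosen for illustration. The point it shows is that the code never tries to diagnose or repair a misbehaving instance; it only checks health and swaps in a fresh one.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import count

_ids = count(1)  # hypothetical ID generator for simulated instances


@dataclass
class Instance:
    """Hypothetical stand-in for a cloud server instance."""
    instance_id: str
    free_memory_mb: int = 2048
    process_responding: bool = True

    def is_healthy(self, min_free_memory_mb: int = 256) -> bool:
        # Healthy means the process answers and memory is above the threshold.
        return self.process_responding and self.free_memory_mb >= min_free_memory_mb


def launch_instance() -> Instance:
    """Simulate launching a fresh instance from a known-good image."""
    return Instance(instance_id=f"i-{next(_ids):04d}")


def replace_unhealthy(fleet: list[Instance]) -> tuple[list[Instance], list[str]]:
    """Return a new fleet with every unhealthy instance swapped for a fresh
    one, plus the IDs of the instances that were terminated."""
    new_fleet: list[Instance] = []
    terminated: list[str] = []
    for inst in fleet:
        if inst.is_healthy():
            new_fleet.append(inst)
        else:
            # Launch the replacement, then discard the old instance
            # without attempting to diagnose or repair it.
            new_fleet.append(launch_instance())
            terminated.append(inst.instance_id)
    return new_fleet, terminated


fleet = [launch_instance() for _ in range(3)]
fleet[1].process_responding = False  # a hung process
fleet[2].free_memory_mb = 64         # low on memory

fleet, terminated = replace_unhealthy(fleet)
print(terminated)                       # ['i-0002', 'i-0003']
print([i.instance_id for i in fleet])   # ['i-0001', 'i-0004', 'i-0005']
```

In a real deployment, this role is usually played by the platform itself (for example, an auto-scaling group or orchestrator with health checks) rather than by hand-written code, but the logic follows the same shape.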

This capability is due to the dynamic nature of the infrastructure used by cloud-native applications. New resources are easy to come by and old resources are recycled and reused in other applications.

This terminate-to-repair strategy may seem counterintuitive. How can you fix a problem by seemingly making it worse? If a single process needs to restart, why destroy the entire server? The strategy works because replacing a server is easier, quicker, and less disruptive to the application than repairing it in place.

In short, a willingness to “fail” a server ultimately makes your application higher quality.

Quality is the Goal

Higher-quality applications deliver the availability and scalability that our customers demand. This is an essential characteristic of all modern applications. Yet how you achieve quality is not necessarily straightforward. Traditional methods of achieving quality, such as using a QA team to develop and execute a test suite, are no longer appropriate in our fast-moving, innovation-driven modern world. Instead, we must use our innovation to drive higher quality, and cloud-native applications have great mechanisms for doing just that.

From fast cycle time to replace, not repair, quality comes in unexpected ways for modern cloud-native applications.

Lee Atchison

Lee Atchison is an author and recognized thought leader in cloud computing and application modernization with more than three decades of experience, working at modern application organizations such as Amazon, AWS, and New Relic. Lee is widely quoted in many publications and has been a featured speaker across the globe. Lee’s most recent book is Architecting for Scale (O’Reilly Media). https://leeatchison.com
