Why Maintenance Windows Don’t Work for Cloud-Native Apps

Several years ago, I purchased a digital “smart” thermostat for my home. I wanted to be able to set the temperature remotely from my iPhone. I wanted to know how it was working while I was traveling. I set it up and connected it to the manufacturer’s cloud backend. It was working fine at first, but that would soon change.

A couple of weeks later, I received an email from the manufacturer about an upcoming upgrade to its backend cloud service. It was a major upgrade, and the vendor wanted to inform us what was going to happen. According to the email, the company would bring down its application “for many hours at a time” and would do so at “various times of the day.” Of course, the email did not say when those times would be or how many hours each outage would last.

But that wasn’t all. They went on to say that this maintenance window of up and down availability would last for “several months.”

Okay, wait. Let me see if I understand this. At seemingly random times during the day, my thermostat was going to stop working for many hours at a time, and this would go on for months. And this was all planned. I don’t think so. The next day, I replaced the thermostat with one from another company. There was no way I was going to deal with that level of bad service.

This is an extreme example of a common problem in many online applications: The companies operating the applications create “maintenance windows,” periods of time when they regularly take the application offline to perform routine maintenance and upgrades. The idea is that, by announcing the maintenance windows in advance, companies give their customers a chance to work around the downtime. The downtime isn’t a “failure.” It is planned, so, the reasoning goes, it isn’t an availability problem.

The problem is that companies with these maintenance windows treat them as “free downtime.” They feel free to bring their applications down to work on them without it “counting” as “real” downtime. Since the downtime is planned, the argument goes, it doesn’t count.

Nothing could be further from the truth. Downtime is downtime. Whether it is planned, expected, or unplanned and unexpected, if your customers want to use your application and the application is unavailable for any reason, it is downtime.

You cannot operate a modern online application without maintaining a high level of availability. And this availability has to be measured from the customer’s point of view, not your own internal point of view. If your customers want to use your application and it is down, it doesn’t matter whether or not the downtime was planned. They do not care about your maintenance schedules. They want to use your application when it’s convenient for them, not when it’s convenient for you.

With the tools, services, and processes available for application development today, there is no reason why an online application should require downtime for maintenance or upgrades. It is unnecessary. From a customer’s point of view, it is also unacceptable.

Almost any upgrade can be performed live without taking the application down for maintenance. Even upgrades that require database schema changes and other data migration tasks can be implemented without downtime. It might take more planning and effort to perform the upgrade this way, but it can be done. There is no longer any valid reason to plan on bringing your modern application down.
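To make the schema-change claim concrete, here is a minimal sketch of one common approach, often called "expand and contract" or "parallel change." The article does not name a specific technique, so treat this as one illustrative option rather than the method it has in mind. The `users` table, the `name` and `full_name` columns, and the helper functions are all hypothetical, and an in-memory SQLite database stands in for a production database.

```python
import sqlite3

# Hypothetical starting point: a live table whose "name" column we want
# to replace with "full_name" without taking the application offline.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("Ada",), ("Grace",), ("Linus",)])
conn.commit()

# Step 1 (expand): add the new column as nullable, so application code
# that only knows about "name" keeps working unchanged.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


# Step 2: deploy application code that writes both columns.
def create_user(db, name):
    db.execute("INSERT INTO users (name, full_name) VALUES (?, ?)", (name, name))
    db.commit()


# Step 3: backfill existing rows in small batches so that no single
# statement holds locks for long while traffic continues.
def backfill(db, batch_size=2):
    total = 0
    while True:
        cur = db.execute(
            "UPDATE users SET full_name = name WHERE id IN ("
            "SELECT id FROM users WHERE full_name IS NULL LIMIT ?)",
            (batch_size,),
        )
        db.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


create_user(conn, "Margaret")
migrated = backfill(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE full_name IS NULL"
).fetchone()[0]
print(f"backfilled {migrated} rows, {remaining} still unmigrated")

# Step 4 (contract), not shown: once all readers use "full_name" and
# nothing writes "name" anymore, drop the old column in a later release.
```

Each step is a separate, reversible deployment, which is where the extra planning and effort goes. The application stays up throughout because old and new code can both run against the table at every stage.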

The High Cost of Maintenance Windows

A previous client of mine scheduled a two-hour maintenance window every week so they could perform upgrades and still operate normally the rest of the time. By confining downtime to the window, their argument went, they could claim the application was operating 100% of the time. If a change required downtime, they routinely held it until the next maintenance window, when it presumably wouldn’t affect their measured availability.

The problem is that the maintenance window is, by itself, a major hit to availability. A week has 168 hours, so a two-hour weekly maintenance window means that the greatest availability you can offer your customers is 98.8%. By definition, the application cannot be available more than 98.8% of the time.
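The 98.8% figure follows directly from the length of the window. As a quick check, this short sketch computes the availability ceiling for any weekly window; the function name `max_availability` is just an illustrative choice, and it assumes the planned window is the only downtime.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def max_availability(window_hours_per_week: float) -> float:
    """Best-case availability, in percent, if the only downtime is the planned window."""
    return 100.0 * (1 - window_hours_per_week / HOURS_PER_WEEK)


ceiling = max_availability(2)
print(f"{ceiling:.1f}%")  # 98.8%
```

Any unplanned outage comes on top of this, so real availability can only be lower than the ceiling.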

Compared to other online applications, 98.8% uptime is a horrible statistic. For example, the Amazon S3 service guarantees 99.99% service availability (and has an even higher data integrity SLA). This guarantee amounts to a maximum of 61 seconds of downtime each week. For Amazon S3 to make this SLA consistently, Amazon can never plan to have any downtime for any maintenance, ever. Any outage at all will cause them to fail their contracted SLA.

And they back up this SLA policy with money. If Amazon S3 is down a mere 4.3 minutes in any given month, AWS will refund 10% of everyone’s storage costs for that entire month. As you can imagine, this would be a significant loss of revenue.
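The downtime figures quoted above can be derived from the 99.99% target itself. The sketch below does the arithmetic, assuming a 30-day month; the helper name `downtime_budget` is illustrative. The weekly budget works out to roughly 60.5 seconds, which the article rounds up to 61.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60      # 604,800
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 (assumes a 30-day month)


def downtime_budget(availability_percent: float, period: float) -> float:
    """Downtime allowed in a period, in the same unit as the period."""
    return period * (1 - availability_percent / 100)


weekly_seconds = downtime_budget(99.99, SECONDS_PER_WEEK)
monthly_minutes = downtime_budget(99.99, MINUTES_PER_30_DAY_MONTH)

print(f"{weekly_seconds:.1f} seconds per week")    # 60.5 seconds per week
print(f"{monthly_minutes:.1f} minutes per month")  # 4.3 minutes per month
```

For comparison, a two-hour weekly maintenance window is 7,200 seconds, more than a hundred times the weekly budget at 99.99%.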

And it’s not just S3. It’s a mindset across AWS and across all of Amazon. This commitment is ingrained in the minds of every engineer at Amazon. You build everything so that no downtime is ever needed, no matter what the change to the system involves. No downtime, ever.

Not all companies can achieve this level of availability. Sixty-one seconds of downtime a week is an extremely aggressive number. But if you create routine maintenance windows, you automatically start at a reduced level of availability, whether you need to or not.

Today, all applications can strive to stay operational at all times. For most applications, this is absolutely essential.

For modern, cloud-native applications, there is no excuse not to try. Don’t use regular maintenance windows.

Lee Atchison

Lee Atchison is an author and recognized thought leader in cloud computing and application modernization with more than three decades of experience, working at modern application organizations such as Amazon, AWS, and New Relic. Lee is widely quoted in many publications and has been a featured speaker across the globe. Lee’s most recent book is Architecting for Scale (O’Reilly Media). https://leeatchison.com
