Privilege Escalation in Cloud-Native App Production Environments

Managing access permissions during operational events is crucial to the safety and security of an organization’s production applications and infrastructure. A common and essential security principle, the principle of least privilege, states that developers and support engineers should have only as much access to the production environment and the data it contains as their work requires. This limits the risk of unauthorized access or data breaches.

However, limiting access to production can become a hindrance during operational emergencies, when the on-call engineer needs additional permissions to resolve production problems. This is a particular problem for modern cloud-native applications. These applications tend to have a distributed responsibility model, such as the Single Team Oriented Service Architecture (STOSA), which tends to increase the number of engineers who may need emergency access during production issues.

How do you grant on-call engineers the necessary permissions to tackle production emergency issues while ensuring application security and data integrity by limiting access? The answer lies in privilege escalation.

Privilege escalation is the process of giving on-call engineers temporary additional access permissions to production resources so they can resolve production emergencies. There are several ways of doing this.

One of the best models is to create limited-scope support tooling. These privileged tools perform the production-impacting actions that are sometimes necessary to resolve production problems, but they do so safely and securely, without giving up complete control of your production environments. The easiest way to understand limited-scope tooling is to consider an example.

A typical action on-call engineers must take to resolve an ongoing production emergency is a simple server reboot. Often, by simply rebooting a particular server or set of servers, the current emergent problem will disappear. Of course, this may not resolve long-term issues. Still, it can be an excellent short-term remedy for an emergent issue impacting a customer’s ability to use some capability of your application.

But what does it take to reboot a server? Typically, an on-call engineer does not have physical access to the box, and it usually isn’t even located geographically close to the on-call engineer. So, a software reboot is required. For most Linux servers, this means logging in as root (the superuser) on the server and issuing a reboot command. However, this means your on-call engineers need to have superuser permissions on your servers. Yet, with superuser permissions, the on-call engineers can take all sorts of other actions on the servers, intentionally or accidentally, including deleting valuable data. And a disgruntled employee can use the access to destroy the server or steal sensitive data. Therefore, giving your on-call engineers superuser access to all your servers is not a reasonable production practice and violates the principle of least privilege.

So, an alternative is used. Instead of giving all on-call engineers superuser access, a tool (or script) is created that an engineer can run without holding superuser access. The tool validates that the user has permission to perform a simple server reboot. If so, the tool itself performs the reboot. Finally, the tool records who requested the reboot and when, creating a trail for auditing and follow-up analysis.

This reboot tool is a simple example of a limited-scope support tool. By creating tools such as this for commonly performed support tasks, you can reduce the number of cases that require special production permissions.
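
As a rough illustration, the reboot tool described above might be sketched as follows. This is a minimal sketch, not a production implementation: the `RebootTool` class, the `AuditEntry` record, the allow-list of user names, and the injected `reboot_fn` callable are all hypothetical names chosen for this example. In a real deployment, the permission check would likely go through your identity provider, and `reboot_fn` would call whatever privileged mechanism actually restarts the server.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Set


@dataclass
class AuditEntry:
    """One record in the audit trail: who asked for what, and when."""
    user: str
    server: str
    allowed: bool
    timestamp: datetime


class RebootTool:
    """Hypothetical limited-scope tool: it can reboot a server and nothing else."""

    def __init__(self, authorized_users: Set[str], reboot_fn: Callable[[str], None]):
        # Assumption: permissions are a simple allow-list of user names.
        self._authorized_users = set(authorized_users)
        # The privileged operation is held by the tool, never by the caller.
        self._reboot_fn = reboot_fn
        self.audit_log: List[AuditEntry] = []

    def request_reboot(self, user: str, server: str) -> bool:
        """Reboot `server` if `user` is permitted; return whether it was allowed."""
        allowed = user in self._authorized_users
        # Design choice in this sketch: record every request, including
        # denied ones, before acting, so the trail survives a failed reboot.
        self.audit_log.append(
            AuditEntry(user, server, allowed, datetime.now(timezone.utc))
        )
        if allowed:
            self._reboot_fn(server)
        return allowed
```

The engineer calls `request_reboot("alice", "web-01")` and never holds superuser credentials. The only privileged action available through the tool is the one it was built to perform.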

While this can go a long way toward removing the need to give extra permissions to your on-call engineers, it probably won’t eliminate that need entirely. Even if you’ve invested heavily in tooling for every support scenario you’ve seen in the past, you can’t know what unforeseen issues will come up during an on-call rotation. You still need to allow your on-call engineers to perform operations beyond their normal permission level. How can you do that safely and securely?

A common model is to use two-person requests. With this model, whenever anyone needs to use elevated permissions, such as superuser permissions, two separate engineers must work on the problem, and both must agree to the request before the permissions are granted. Then, by policy, every action taken in the elevated state must be reviewed and approved by both engineers before it can be executed. Processes such as this provide checks and balances, ensuring that bad actors can’t use the privilege escalation mechanism to gain unauthorized access to your applications. Yet when a real crisis is ongoing, legitimate engineers can still get their jobs done with the escalated permissions they require.
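
The two-person rule can be made concrete with a small sketch. The code below is an assumption-laden illustration rather than a description of any particular product: the `TwoPersonSession` class and its method names are hypothetical, engineers are identified by plain strings, and an "action" is just a text description passed to a caller-supplied `run` function.

```python
from typing import Callable, Dict, List, Set


class TwoPersonSession:
    """Hypothetical elevated session that needs two distinct engineers to agree."""

    def __init__(self, engineer_a: str, engineer_b: str):
        if engineer_a == engineer_b:
            raise ValueError("two distinct engineers are required")
        self._engineers = frozenset({engineer_a, engineer_b})
        self._grant_approvals: Set[str] = set()
        self._action_approvals: Dict[str, Set[str]] = {}
        self.executed: List[str] = []

    def _check_member(self, engineer: str) -> None:
        if engineer not in self._engineers:
            raise PermissionError(f"{engineer} is not part of this session")

    def approve_grant(self, engineer: str) -> None:
        """Record one engineer's agreement to elevate permissions."""
        self._check_member(engineer)
        self._grant_approvals.add(engineer)

    @property
    def elevated(self) -> bool:
        # Elevation is granted only once both engineers have agreed.
        return self._grant_approvals == self._engineers

    def approve_action(self, engineer: str, action: str) -> None:
        """Record one engineer's review and approval of a specific action."""
        self._check_member(engineer)
        self._action_approvals.setdefault(action, set()).add(engineer)

    def execute(self, action: str, run: Callable[[str], None]) -> None:
        """Run `action` only if elevated and approved by both engineers."""
        if not self.elevated:
            raise PermissionError("elevation has not been approved by both engineers")
        if self._action_approvals.get(action, set()) != self._engineers:
            raise PermissionError("action has not been approved by both engineers")
        run(action)
        self.executed.append(action)
        # Consume the approvals so a repeat of the action needs a fresh review.
        del self._action_approvals[action]
```

In this sketch, a single engineer can neither obtain elevated permissions nor run a command alone. Both the grant and each individual action fail with `PermissionError` until the second engineer has signed off.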

By combining specialized tools with two-person permission requests, you can give your on-call engineers access to the production underpinnings of any cloud-native application without endangering the security and governance policies and requirements that your business and customers demand.

Lee Atchison

Lee Atchison is an author and recognized thought leader in cloud computing and application modernization with more than three decades of experience, working at modern application organizations such as Amazon, AWS, and New Relic. Lee is widely quoted in many publications and has been a featured speaker across the globe. Lee’s most recent book is Architecting for Scale (O’Reilly Media). https://leeatchison.com
