Rethinking Anomaly Detection in Cloud-Native Applications

From microservices to multi-cloud, application architectures have evolved significantly, creating new challenges that drown engineers and DevOps teams in data and add to the number of tools they must manage. Modern cloud applications are complex, and the IT infrastructure supporting them has become dispersed and multilayered. Add the enormous amount of telemetry data these systems generate each day and the result is a flood of alerts that has outpaced the human ability to assess and manage it. Detecting anomalies now requires a simpler, better approach that contextualizes application workloads with their environments and the large quantities of operational data they produce. Approaches to anomaly detection have existed for decades, but they predate the cloud-native era and fall short when applied to more complex microservice applications.

In this piece, we will explore three areas of anomaly detection that keep DevOps teams awake at night: the failure to account for cloud application workloads and their impact on application behavior, the limitations of relying on a single metric outlier and the way attempting root cause analysis without context creates remediation challenges. To address these challenges, we will also discuss a new application-aware approach to anomaly detection that can handle the scale and complexity of microservices architectures.

Failing to Account for Cloud-Native Application Workloads

Isolating the cause of an anomaly is difficult without proper context. Too often, the lack of contextual information becomes an excuse to ‘kick the can down the road’ and put off resolution until later. Without specifics about the application workloads, alerts cannot be attached to an appropriate runbook for remediation, which leaves engineers to manually analyze the cause and resolve the problem.

To reduce this manual effort, detecting anomalies in cloud-native applications requires more context to streamline the process and avoid a flood of false positives. Gathering this level of insight means embedding application knowledge into the anomaly detection process so it accounts for application workloads and their impact on application behavior.
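
To make that concrete, here is a hypothetical sketch (the workload labels, symptom names and runbook paths are all invented for illustration) of how an alert carrying workload context can be routed straight to a runbook, while the same signal without context falls back to manual triage:

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping from (workload, symptom) context to a remediation runbook.
RUNBOOKS: Dict[Tuple[str, str], str] = {
    ("checkout", "queue_backlog"): "runbooks/scale-consumers.md",
    ("checkout", "high_latency"): "runbooks/rollback-last-deploy.md",
}

def route_alert(alert: Dict[str, str]) -> Optional[str]:
    """Pick a runbook based on the workload and symptom attached to the alert."""
    key = (alert.get("workload", ""), alert.get("symptom", ""))
    return RUNBOOKS.get(key)  # None means a human still has to triage

# With workload context attached, the alert maps straight to a remediation.
print(route_alert({"workload": "checkout", "symptom": "high_latency"}))
# Without that context, the same signal lands back on an engineer's queue.
print(route_alert({"symptom": "high_latency"}))
```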

The Limitations of Relying on a Single Metric Outlier

Rare or infrequent anomalies within systems often entail debugging at a very granular level, which makes relying on a single metric outlier a challenge. When users manually choose which metric, such as response time, to set a threshold on, they can easily miss the true problem indicators. Beyond the burden of tuning thresholds to avoid false alerts, the manual process is also prone to overlooking indicators that can often only be derived from environmental conditions such as the current workload.
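
As a point of reference, a conventional rule of this kind often amounts to little more than the following sketch (the 500 ms cutoff is an arbitrary illustrative value): a static threshold on one metric that knows nothing about the workload driving it.

```python
RESPONSE_TIME_THRESHOLD_MS = 500  # hand-tuned, workload-agnostic cutoff

def check_response_time(sample_ms: float) -> bool:
    """Fire an alert whenever a single metric crosses a static threshold.

    Because the rule ignores the surrounding workload, it can flag healthy
    behavior during a legitimate traffic spike and stay silent when latency
    quietly degrades under light load.
    """
    return sample_ms > RESPONSE_TIME_THRESHOLD_MS

for latency_ms in (120.0, 480.0, 650.0):
    print(latency_ms, "anomalous" if check_response_time(latency_ms) else "ok")
```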

What’s often overlooked in existing approaches is building metrics that reflect a system’s overall behavior, including environmental conditions such as workloads. By using a machine learning approach to relate workload metrics such as system uptime, response time, requests per second and how much processing power or memory an application consumes, the necessary thresholds can be learned rather than hand-tuned. If an abrupt spike in traffic occurs, this correlation of metrics provides the visibility and insight to understand whether the spike stems from an incorrect service configuration, malicious behavior or issues in other parts of the system. Beyond visibility, the same information can be used to detect issues and determine their severity.
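
As a minimal sketch of that idea (the data is synthetic and the simple regression-residual model stands in for whatever a production system would actually learn), the example below learns how response time normally tracks request rate and only flags latency that is high relative to the current workload:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic baseline window: p95 latency grows roughly linearly with load here.
requests_per_sec = rng.normal(200, 30, 1000)
latency_ms = 50 + 0.2 * requests_per_sec + rng.normal(0, 5, 1000)

# Learn the expected latency for a given load and the typical residual spread.
slope, intercept = np.polyfit(requests_per_sec, latency_ms, deg=1)
residual_std = (latency_ms - (slope * requests_per_sec + intercept)).std()

def latency_anomalous(req_per_sec: float, observed_ms: float, k: float = 4.0) -> bool:
    """Flag latency that deviates from what the current workload predicts."""
    expected_ms = slope * req_per_sec + intercept
    return abs(observed_ms - expected_ms) > k * residual_std

# High latency during a genuine traffic spike fits the learned correlation;
# the same latency at normal load does not, and is the real problem indicator.
print(latency_anomalous(400, 132))  # False: consistent with the spike in traffic
print(latency_anomalous(200, 132))  # True: latency is high relative to workload
```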

Isolating Root Cause Analysis Without Context

Root cause analysis (RCA) traces its origins to Sakichi Toyoda’s ‘5 Whys’ technique, which became a core part of Toyota’s manufacturing process. Since it was introduced, it has been adopted by almost every industry, from publishing to engineering. Identifying an anomaly and then using the RCA process to triage and resolve the resulting performance issues has traditionally been the accepted approach. The challenge in modern environments is that by the time a war room is assembled to investigate, some other change has already occurred that complicates the exercise and ties up ongoing cycles from developer and infrastructure teams.

Prioritizing and remediating actions based on what impacts the business requires context. By using data on how individual components of a system work together and tracing those correlations to find the root cause, connections between seemingly unrelated events can be identified. Then, it’s easier to understand why a change in performance occurred and how to avoid it in the future. Today’s complex dependencies in cloud-native applications mean that technologists often understand less about the application than they think.
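
One way to picture this (a deliberately simplified sketch with an invented service topology) is to combine the dependency graph with the set of currently anomalous services and follow the correlations down to the deepest component whose anomaly is not explained by anything it depends on:

```python
from typing import Dict, List, Set

# Invented call graph: each service lists the services it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "frontend": ["checkout"],
    "checkout": ["payments", "inventory"],
    "payments": ["postgres"],
    "inventory": ["postgres"],
    "postgres": [],
}

def likely_root_causes(anomalous: Set[str]) -> Set[str]:
    """Keep anomalous services whose anomaly no anomalous dependency explains."""
    return {
        service
        for service in anomalous
        if not any(dep in anomalous for dep in DEPENDS_ON.get(service, []))
    }

# Latency alerts on frontend, checkout and payments all trace back to the database.
print(likely_root_causes({"frontend", "checkout", "payments", "postgres"}))
# -> {'postgres'}
```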

While cloud-native technology adoption is mainstream, a recent survey revealed that security (24%), management (23%), high availability (22%) and observability (16%) remain the top concerns. Addressing these priorities requires rethinking traditional anomaly detection approaches that rely on setting thresholds for outliers on specific metrics, independent of application context. Instead, modern cloud application environments need a detection system that learns correct behavior from data collected at runtime over a configurable window (for example, 24 hours or less). With this kind of learned model deployed at runtime to check for deviations from the expected state, DevOps teams and engineers can identify, isolate and fix anomalies quickly enough to protect the customer experience.
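
A minimal sketch of this loop, learning correct behavior over a configurable window and then checking runtime samples against it, might look like the following (the window length, sampling rate and three-sigma rule are illustrative assumptions rather than a prescription):

```python
from collections import deque
from statistics import mean, stdev

class BaselineDetector:
    """Learn a rolling baseline from runtime telemetry and flag deviations."""

    def __init__(self, window_size: int = 24 * 60) -> None:
        # e.g., one sample per minute over a configurable 24-hour learning window
        self.window = deque(maxlen=window_size)

    def observe(self, value: float) -> None:
        """Feed a runtime sample into the rolling baseline."""
        self.window.append(value)

    def deviates(self, value: float, k: float = 3.0) -> bool:
        """Report whether a new sample deviates from the learned expected state."""
        if len(self.window) < 2:
            return False  # not enough history to judge yet
        mu, sigma = mean(self.window), stdev(self.window)
        return abs(value - mu) > k * max(sigma, 1e-9)

detector = BaselineDetector()
for sample in (101.0, 99.0, 102.0, 100.0, 98.0, 103.0):  # learned normal behavior
    detector.observe(sample)
print(detector.deviates(100.0))  # False: within the learned baseline
print(detector.deviates(250.0))  # True: deviation from the expected state
```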

Scott Fulton

Scott is the Founder and CEO of OpsCruise. He brings several decades of executive management, product and sales experience. Previously, Scott was the EVP of Products & Corporate Development at Infoblox, where he led the expansion into cybersecurity and the transformation to SaaS. Prior to Infoblox, he spent more than a decade as a GM/P&L leader of several infrastructure application businesses at HP Software and BMC. He also has early-career startup experience and on-the-ground roles in Europe and India. He holds a BA from the University of Michigan and an MBA from UC Davis.
