Chaos Engineering Testing Correlates with Kubernetes Adoption

A survey of 400 IT professionals finds a high correlation between organizations that have adopted Kubernetes and those that have implemented chaos engineering techniques to ensure availability and decrease recovery times in the event of a failure.

The State of Chaos Engineering survey, conducted by Gremlin, a provider of a platform for launching tests based on chaos engineering techniques, finds more than 68% of respondents have conducted a chaos engineering attack, in which components of an environment are randomly made unavailable, against their Kubernetes cluster.

Aileen Horgan, vice president of marketing for Gremlin, says testing based on chaos engineering techniques is gaining traction in Kubernetes environments because of the dependencies that exist between microservices deployed on Kubernetes clusters. Randomly making services unavailable allows IT teams to determine how robust a complex Kubernetes environment really is, Horgan says.
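To illustrate why dependencies matter in this kind of test, here is a minimal sketch in Python. It is not Gremlin's method or API. The service names and the dependency map are hypothetical, and the model assumes every listed dependency is a hard one (a service fails outright if anything it depends on is down). The sketch picks one service at random, marks it unavailable and reports which other services would fail as a result.

```python
import random

# Hypothetical service graph (an assumption for illustration): each service
# maps to the services it cannot run without.
HARD_DEPENDENCIES = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "catalog": ["inventory"],
    "payments": [],
    "inventory": [],
}


def impacted_services(failed, dependencies):
    """Return every service that goes down, directly or transitively,
    when the service named `failed` becomes unavailable."""
    down = {failed}
    changed = True
    # Keep sweeping the graph until no new failures are discovered.
    while changed:
        changed = False
        for service, needs in dependencies.items():
            if service not in down and any(dep in down for dep in needs):
                down.add(service)
                changed = True
    return down


def run_experiment(dependencies, rng):
    """Pick one service at random and report the resulting blast radius."""
    target = rng.choice(sorted(dependencies))
    return target, impacted_services(target, dependencies)


target, down = run_experiment(HARD_DEPENDENCIES, random.Random())
print(f"Took down {target}; services affected: {sorted(down)}")
```

In this toy graph, losing a leaf service such as inventory takes several others down with it, while losing frontend affects nothing else. A real chaos experiment exists to surface that kind of asymmetry in a live environment, where the dependency map is rarely this well known.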

The primary benefits of conducting tests using chaos engineering techniques are increased availability and decreased mean time to resolution (MTTR). The top 20% of respondents had services with an availability of more than four nines, while 23% of teams had an MTTR of under an hour, with 60% having an MTTR of under 12 hours.
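For readers unfamiliar with the "nines" shorthand, the arithmetic can be sketched in a few lines of Python. The function names are illustrative and not taken from the survey. The second function uses the common steady-state approximation, availability = MTBF / (MTBF + MTTR), which assumes a mean time between failures (MTBF) figure that the survey does not report.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(nines, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget for an availability of `nines` nines (4 -> 99.99%)."""
    unavailability = 10 ** -nines
    return period_minutes * unavailability


def availability_from_mtbf(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and
    mean time to resolution (a common approximation, assumed here)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Four nines leaves a budget of roughly 52.6 minutes of downtime per year.
print(round(allowed_downtime_minutes(4), 2))
```

Under this approximation, shortening MTTR raises availability even when failures occur just as often, which is why the two metrics are reported together.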

Chaos engineering as a best-practice form of IT environment testing is still in its infancy. However, 60% of respondents report they have conducted at least one chaos engineering attack. The most common form these tests take is network attacks (64%), followed by resource attacks (38%). A full 70% of chaos engineering attacks are aimed at hosts, followed by containers (29%).

The study finds the primary reason more organizations are not using chaos engineering techniques is, not surprisingly, a lack of awareness (80%). Gremlin is making its entire library of attacks and experiments available for free to help organizations gain more experience with chaos engineering. Despite the lack of deep expertise, however, the survey notes more than a third of respondents (34%) have launched a chaos engineering test against a platform in a production environment.

In general, the survey finds 19% of respondents are experiencing 10 to 20 high-severity incidents a month, with 81% coping with somewhere between one and 10. The most frequently cited root causes of those issues were bad code (41%) followed closely by internal dependencies (39%).

The most widely employed tools for monitoring IT environments are health checks/synthetic testing (64%), followed by server-side responses (50%) and real user monitoring (37%).

Thus far, Gremlin reports close to half a million attacks have been launched using its platform. As more organizations embrace chaos engineering to test increasingly complex IT environments, the frequency of tests will increase. The challenge now is getting IT teams that are naturally inclined to ensure stability to deliberately disrupt systems and software to test application resilience.

Regardless of their approach to monitoring and testing, a major reason IT organizations embrace microservices is to make sure applications degrade gracefully rather than fail outright when one or more components suddenly become unavailable. That approach sounds good in theory, but in practice there are usually one or more dependencies that cause a microservices application to fail. The challenge is identifying potential sources of disruption before they can impact production environments.

Mike Vizard

Mike Vizard is a seasoned IT journalist with over 25 years of experience. He also contributed to IT Business Edge, Channel Insider, Baseline and a variety of other IT titles. Previously, Vizard was the editorial director for Ziff-Davis Enterprise as well as Editor-in-Chief for CRN and InfoWorld.
