Resilience. Today’s distributed software systems need more of it. Unstable applications and fragile architectures lead to outages – and outages cost companies big bucks. As John Allspaw, Co-Founder of Adaptive Capacity Labs, puts it: “‘Resilience’ is the story of the outage that DIDN’T happen.” Chaos engineering, which involves conducting thoughtfully planned failure injection experiments to teach engineers how systems behave in the face of failure, can build that resilience and ensure our stories have a happy ending. But it can be a hard concept to sell to corporate executives.
What Is Chaos Engineering?
Chaos engineering can be thought of as a kind of vaccine or flu shot. Like such shots – where we deliberately inject ourselves with something harmful – the principles of chaos can seem counterintuitive on the surface. However, vaccines work because they act as a training course for our immune systems: they make us produce antibodies to fight a disease without actually infecting us with the disease itself. Then, when a vaccinated person does come into contact with the disease, their immune system recognizes it and immediately produces the antibodies needed to fight it.
Chaos engineering applies the same principle to complex software systems in the cloud. By carefully injecting a small amount of harm into a system, we can test its ability to respond – and, in doing so, build an effective, resilient “immune system” that fights off failures.
This is chaos engineering: the practice of introducing controlled disruptions into a distributed system, carefully studying the resulting behavior, identifying weak areas, and then improving resilience based on what we’ve learned.
The Hard Sell
Naturally, trying to explain to the C-suite that you want to start deliberately creating problems for the system can be a bit of a hard sell. Admittedly, chaos engineers flirt with danger. After all, the practice involves proactively forcing applications and services to fail – the fact that the purpose is to learn more about how to keep them running perhaps doesn’t sound particularly persuasive if they’re all running fine at the moment.
As a team of senior software and chaos engineers at Netflix puts it in The Business Case for Chaos Engineering, published in IEEE: “Chaos Engineering sounds dangerous (‘you want to intentionally cause problems in the production system?’), so you’ll need to make the case to your organization that a chaos program is worthwhile.”
So – where should you start?
Illustrate the True Costs of Downtime and Failure
Well, to begin with, it’s important to communicate just how high the stakes can be. All companies must do everything they can to mitigate the cost of downtime – which Gartner puts at $5,600 per minute, or $336,000 an hour. Real-world examples will help illustrate your case. For example, in 2012, US trading firm the Knight Capital Group lost $440 million in 30 minutes due to a software configuration problem. More recently, in 2017, a technological failure which stranded tens of thousands of British Airways passengers cost the company around £80 million ($102.9 million).
In both cases – and many more like them – if chaos engineers had tested the respective systems for failures ahead of time, the outages and the huge monetary losses might have been avoided.
Take the story of Netflix, for example, and how it managed to avoid any serious consequences as a result of the 2015 outage that ripped through Amazon Web Services. The outage affected many well-established and tech-savvy companies that rely on AWS, costing them several hours of operation. But Netflix came out practically unscathed. How? Because Netflix is one of the pioneers of chaos engineering, and was running a set of testing tools – the Simian Army – that simulate outages and system failures before real ones strike.
“Chaos Kong is one part of a suite of tools Netflix has built to test their software’s operation in the event of outages of the infrastructure that it runs on,” writes Cornelia Davis, Vice President of Technology at Pivotal, and author of ‘Realizing Software Reliability in the Face of Infrastructure Instability’, which appears in IEEE. “Different testing tools that make up the Netflix Simian Army are cast against different failure domains. Chaos Kong simulates region outages. Chaos Gorilla simulates availability zone failures. Chaos Monkey actually kills servers.”
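The core mechanic behind a Chaos Monkey-style tool – pick a random target from a group and kill it, while honoring opt-outs – can be sketched in a few lines. This is a hypothetical illustration of the pattern, not Netflix’s actual implementation; the instance IDs, the `terminate` callback, and the opt-out set are all made up for the example:

```python
import random

def chaos_monkey_step(instances, terminate, opted_out=frozenset()):
    """One round of a Chaos Monkey-style experiment: choose a random
    eligible instance and hand it to the terminate callback.
    Instances in opted_out are never touched."""
    eligible = [i for i in instances if i not in opted_out]
    if not eligible:
        return None  # nothing eligible to kill today
    victim = random.choice(eligible)
    terminate(victim)
    return victim

# Usage: a stub terminator that just records the "kill" instead of
# calling a real cloud API.
killed = []
victim = chaos_monkey_step(["i-001", "i-002", "i-003"],
                           terminate=killed.append,
                           opted_out={"i-003"})
```

In a real deployment the `terminate` callback would call the cloud provider’s instance-termination API; keeping it as an injected function makes the experiment logic easy to test without touching production.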
Companies that use cloud services should expect outages and service failures – chaos engineering ensures that organizations are prepared for these inherent inevitabilities.
Minimizing the Blast Radius
In forest management, controlled burns are carried out during the wet, cold winter months to prevent devastating forest fires in the hotter, drier summer. The same principle applies to small, controlled chaos experiments, which are designed to uncover vulnerabilities that would otherwise cause far more significant damage.
However, when building your case, you must be sure to convince the board that the fires you want to start will indeed be small, contained, and can be stamped out before they burn down the entire system.
“Regardless of the level of impact, it’s important to have safety features with chaos experiments that minimize blast radius,” write the Netflix team in The Business Case for Chaos Engineering. “Experimenting in production has the potential to cause unnecessary customer pain. While there must be an allowance for some short-term negative impact, it is the responsibility and obligation of the chaos engineering team to ensure the fallout from experiments is minimized and contained.”
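One common safety feature of this kind is an automatic abort: the experiment harness watches a health signal and stops injecting faults the moment it crosses a threshold. The sketch below is a minimal, hypothetical illustration of that guard rail – the fault injector, error-rate probe, and threshold values are all assumptions for the example, not any particular tool’s API:

```python
def run_experiment(inject_fault, error_rate, abort_threshold=0.05, max_steps=10):
    """Run a chaos experiment step by step, aborting as soon as the
    observed error rate exceeds the threshold (the blast-radius guard)."""
    steps_run = 0
    for _ in range(max_steps):
        inject_fault()
        steps_run += 1
        if error_rate() > abort_threshold:
            return ("aborted", steps_run)
    return ("completed", steps_run)

# Usage: a fake system whose error rate climbs with each injected fault.
state = {"errors": 0, "requests": 100}

def inject():
    state["errors"] += 2  # each fault causes 2 failed requests

def rate():
    return state["errors"] / state["requests"]

result = run_experiment(inject, rate, abort_threshold=0.05)
```

Here the third injection pushes the error rate to 6%, past the 5% threshold, so the harness aborts rather than letting the “fire” spread – exactly the kind of containment the Netflix team describes.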
The Cost-Benefit Analysis
There are essentially two first-order costs to a chaos program. The first is the chaos engineering team that you will need to employ; the second is the cost involved with any harm that is caused to the system, plus the cost of mitigating that harm.
The Netflix team explains it like this: “Some induced harm will be intentional, and some will be accidental. Intentionally inducing harm may sound odd, but that’s part of what chaos engineering is about. Every outage preventable by chaos engineering will have corresponding chaos-induced harm that is uncovered by an experiment. The difference is that the chaos-induced harm (if done correctly) should be orders of magnitude smaller than the outage it prevented.”
In the report, the authors devise a return on investment (ROI) equation which can be used to calculate the total costs of running a chaos program.
(Image source: computer.org)
To illustrate how the equation works, the team created their own ROI example:
(Image source: computer.org)
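The shape of such a calculation can be sketched with a generic cost–benefit ROI – note this is a simplified illustration, not the exact equation from the IEEE report, and the dollar figures below are hypothetical (the prevented-outage cost reuses Gartner’s $336,000-per-hour estimate from earlier):

```python
def chaos_roi(prevented_outage_cost, induced_harm_cost, team_cost):
    """Generic ROI: (benefit - cost) / cost.
    Benefit: the cost of outages the chaos program prevented.
    Cost: the (much smaller) chaos-induced harm plus the team itself."""
    cost = induced_harm_cost + team_cost
    return (prevented_outage_cost - cost) / cost

# Usage: one prevented hour-long outage ($336,000), against $10,000 of
# chaos-induced harm and a $200,000 annual team cost.
roi = chaos_roi(336_000, 10_000, 200_000)  # 0.6, i.e. a 60% return
```

Even with these deliberately conservative numbers the program pays for itself – which is exactly the quantitative argument the Netflix authors recommend making.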
Chaos engineering will require both a technical shift and a cultural shift within the organization. From a technical perspective, since applications may not have been originally designed to support chaos experiments, new tools – such as fault injection and guard rails to minimize blast radius – must be incorporated. From the cultural side of things, a successful program will fundamentally change the way software engineers build systems by focusing on and creating incentives for resilient design.
Though the idea of chaos might sound crazy to some, the payoffs can be huge. When pitching, emphasize that chaos experiments are designed to be controlled and to minimize the blast radius, show that the benefits outweigh the costs, and build a compelling narrative that illustrates just how high the stakes are.
The last word goes to the Netflix team: “By telling a story around the benefits of chaos realized at organizations such as Netflix, and by making quantitative arguments based on your historical data, you should be able to make a strong case that introducing chaos engineering will yield real, tangible benefits to the business.”