If a monkey hits keys at random on a typewriter for an infinite amount of time, at some point it will almost surely write the complete works of William Shakespeare. So, why wouldn’t we do something like that to find flaws in software? That seems like an odd question. Yet since 2011, it has made sense, especially for software testers.
Around that time, Netflix was moving from a physical infrastructure to the cloud to improve its streaming services. The move forced the company to test the new system’s reliability. How did the Netflix Engineering Tools team solve that? With an approach that became known as chaos engineering and the help of a tool called “Chaos Monkey.”
What is Chaos Engineering?
You could trace the history of chaos engineering back to the time when large-scale distributed systems were growing in popularity. When those systems were deployed, engineers faced a huge challenge: testing their resilience. How could you test a system’s ability to overcome failures while ensuring maximum quality?
The answer came from Netflix’s team and its Chaos Monkey. As you can infer from its name, Chaos Monkey is a tool that was born out of the idea of wreaking havoc. As the streaming giant puts it, it’s “a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact.”
In other words, using Chaos Monkey (or any similar tool) means introducing intentional failures to see how the system reacts to specific issues. Any team can do this by following a four-step procedure:
- Define the ideal system behavior
- Create a control group and an experimental group
- Introduce failures in the experimental group, such as a code error made by the JavaScript development team or a server outage
- Identify the difference between the control group and the experimental group
Such a seemingly simple process is at the heart of chaos engineering. As such, you could see the practice as a way to detect vulnerabilities in systems. It’s a great way to check a system’s reliability, stability, and capability in the face of unexpected failures.
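To make the four steps concrete, here is a minimal sketch in Python. It is a toy simulation, not how Chaos Monkey itself works: the `Cluster` class, the `failover` flag, the 99% steady-state threshold, and the `run_experiment` function are all hypothetical names and values chosen for illustration. The injected failure is a single randomly disabled instance, in the spirit of the Netflix quote above.

```python
import random


class Cluster:
    """Toy model of a service backed by several instances (hypothetical)."""

    def __init__(self, size, failover, seed):
        self.healthy = [True] * size      # one flag per instance
        self.failover = failover          # can other instances take over?
        self.rng = random.Random(seed)    # seeded so runs are repeatable

    def disable_random_instance(self):
        """Inject the failure: take one randomly chosen instance offline."""
        victim = self.rng.randrange(len(self.healthy))
        self.healthy[victim] = False
        return victim

    def handle_request(self):
        """Return True if the request is served, False otherwise."""
        target = self.rng.randrange(len(self.healthy))
        if self.healthy[target]:
            return True
        # The chosen instance is down. With failover enabled, any
        # remaining healthy instance can serve the request instead.
        return self.failover and any(self.healthy)

    def success_rate(self, requests=1000):
        """Fraction of simulated requests that were served."""
        served = sum(self.handle_request() for _ in range(requests))
        return served / requests


def run_experiment(failover, steady_state=0.99, size=4, seed=7):
    # Step 1: define the ideal behavior, here "at least 99% of requests succeed".
    # Step 2: create a control group and an experimental group.
    control = Cluster(size, failover, seed)
    experimental = Cluster(size, failover, seed)
    # Step 3: introduce a failure in the experimental group only.
    experimental.disable_random_instance()
    # Step 4: identify the difference between the two groups.
    control_rate = control.success_rate()
    experimental_rate = experimental.success_rate()
    return {
        "control": control_rate,
        "experimental": experimental_rate,
        "steady_state_held": experimental_rate >= steady_state,
    }


if __name__ == "__main__":
    for failover in (True, False):
        print("failover =", failover, run_experiment(failover))
```

In this sketch, the cluster with failover keeps serving every request after losing an instance, while the one without failover drops roughly the share of traffic that was routed to the disabled instance. A real experiment would measure an actual business or service metric rather than a simulated success rate.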
It’s important to note that you can’t apply chaos engineering just anywhere. You need to apply its principles in the production environment (that is, where real people are using your software). The reason is simple – you need to know how your system truly reacts to these failures. Since that depends on the environment and your traffic patterns, you’ll have to do it in production, as you can’t reproduce those patterns faithfully anywhere else.
Why Use Chaos Engineering
At this point, you might be thinking that using chaos engineering is like letting a monkey loose in a china shop. However, don’t misunderstand the technique’s goal. It’s true that the word “chaos” might have you believe that the process is, well, chaotic. But in reality, you’ll be controlling every failure you inject, even if you can’t anticipate the consequences.
Chaos engineering is all about reliability and stability. So, instead of a simian army breaking everything in its path, you’d have something closer to a Shakespearean monkey. In other words, instead of creating chaos just for the sake of it, you’d be introducing failures to better understand your system. Ultimately, you’d end up with the IT equivalent of the complete works of Shakespeare: a system prepared for failures of all kinds.
Of course, that’s an obvious utopia, as systems will inevitably fail at some point. The value of chaos engineering is there, nonetheless. Using this technique will help you in several ways, including:
- Increasing your system’s resilience
- Detecting weaknesses across your system
- Implementing a proactive approach to bug detection
- Exposing hidden vulnerabilities
- Limiting the risks
The best thing about chaos engineering is that it takes a deep look at the complexity of the entire system. This means that it doesn’t just check the software but also the people using it. How so? Chaos tools can detect code errors and poorly implemented algorithms, but they can also estimate potential human actions that can bring unexpected (and sometimes ugly) effects.
In that way, chaos engineering can manage the complexity of systems and people. And since both are prone to failure, the process can steer you toward a more flexible and adaptable architecture, and toward a team that better understands its weaknesses and is prepared to offer a better service to your customers.
As it stands, chaos engineering is a fitting practice in today’s software development. That’s mainly because of approaches like DevOps and microservices architectures, which take development down the route of continuous improvement. In other words, if you’re building distributed, ever-changing, complex systems, you’d be better off using chaos engineering.
A Proactive Approach Towards More Reliability
For chaos engineering to work at its best, you’ll need to shift your mindset. Rather than waiting for a failure to happen and then figuring out how to solve it, you can take an active stance. Injecting controlled failures of all kinds will let you anticipate how your system will behave and introduce changes when the results aren’t optimal.
There’s a reason why Netflix, Facebook, Google, Amazon, Microsoft, and many other big enterprises are using chaos engineering – it just works. With so many tools available to implement this approach and an already-tested way to increase the system’s resilience, it’d be unwise to ignore what these companies already know: that chaos engineering can increase your confidence in your system’s capabilities.
Sure, the “monkey tools” you’ll use to do so won’t end up writing Hamlet or giving you a flawless system. But they will bring you so close to it that your customers will love you for it. Be sure to check out chaos engineering and see for yourself, once and for all, why you need it in your software development strategy.