What Are the Benefits of Chaos Engineering in DevOps?

If your team is already working according to DevOps principles and you want to test how good your workflows are, then you should take a look at chaos engineering.

This is an approach to testing a system to see how it will react to problems. Basically, the aim is for software to always run correctly, reliably, and securely, even under difficult conditions. But what you often see instead is the motto, “It will work somehow. After all, we’ve survived this far.”

When teams work by this motto, it means they do nothing more than hope everything is working as it should, rather than actually ensuring it is.

Making Systems Fail

The usual steps of the development process are all good and necessary. Nevertheless, you can go one step further, because many systems are so complex that it is difficult to understand all the components and everything that can go wrong.

In particular, you want to know whether the service will essentially continue to function if a subsystem fails. Systems can become quite complex, especially when microservices are used, so it is important to verify that the system still works when one or more subsystems fail, for whatever reason.

Many companies set up their systems with high availability but hardly ever test their setups. When failures do occur, you always hear, “That shouldn’t have happened!”

Chaos engineering involves deliberately creating “chaos” in order to bring down parts of the system, for example by generating a high load and a lot of stress on production systems to see how the system behaves. These experiments can reveal opportunities to make the entire system more robust so that there are fewer problems in the event of actual chaos. Your system should be built in such a way that it still works when parts of it fail.

Netflix adopted this approach early on and developed the Chaos Monkey tool for it. One of its core functions is to randomly terminate virtual machines and containers in the production environment, as if someone were pulling cables and switching off servers at random.

In addition to Chaos Monkey, there is also the Chaos Toolkit and, in the cloud-native area, Chaos Mesh and LitmusChaos.

Figure: Recurring experiments in the Chaos Mesh UI on DNS from specific domains

Although the term chaos engineering contains the word chaos, the engineering part is anything but chaotic. It has to be planned properly so that the whole process is carried out in a controlled manner. This is the only way to learn anything new from these experiments. Chaos engineering can be used to test a system at various levels: the infrastructure, the network, and the application. Tests are carried out at each level; during an infrastructure test, for example, servers are randomly switched off, or a high load is generated to see how the system reacts.
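To make the infrastructure level more concrete, here is a minimal sketch of such an experiment. It does not touch real servers: the Server and Cluster classes, the min_healthy capacity value, and all function names are hypothetical stand-ins invented for illustration, and a real experiment would use a tool such as Chaos Monkey or Chaos Mesh instead.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    running: bool = True


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for a pool of redundant servers."""
    servers: list = field(default_factory=list)
    min_healthy: int = 2  # assumed capacity the service needs to stay usable

    def healthy(self):
        return [s for s in self.servers if s.running]

    def is_serving(self):
        return len(self.healthy()) >= self.min_healthy


def kill_random_servers(cluster, count, rng):
    """Switch off up to `count` randomly chosen running servers and return them."""
    healthy = cluster.healthy()
    victims = rng.sample(healthy, k=min(count, len(healthy)))
    for server in victims:
        server.running = False
    return victims


def run_infrastructure_experiment(seed=0):
    rng = random.Random(seed)  # a fixed seed keeps the experiment repeatable
    cluster = Cluster([Server(f"node-{i}") for i in range(4)])
    victims = kill_random_servers(cluster, count=1, rng=rng)
    return [v.name for v in victims], cluster.is_serving()
```

Losing one of four servers leaves the simulated service up, while losing three would not. The point of the experiment is to find out where that boundary lies before a real outage does.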

During a network test, the network bandwidth between various systems is throttled, or random packets are dropped. Both of these cases can occur frequently, so you should test them in a controlled manner in the production environment. In extended chaos experiments, DNS and HTTP responses can be overwritten in order to evaluate how the application behaves in the event of an error.
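A network experiment can be approximated in a few lines as well. The following sketch assumes a hypothetical FlakyNetwork wrapper that drops a configurable fraction of calls, and a client that retries a bounded number of times. None of these names come from a real library, and real tools inject such faults at the network layer rather than in application code.

```python
import random


class FlakyNetwork:
    """Hypothetical wrapper that drops a fraction of calls to simulate packet loss."""

    def __init__(self, drop_rate, rng):
        self.drop_rate = drop_rate
        self.rng = rng

    def send(self, handler, payload):
        if self.rng.random() < self.drop_rate:
            raise ConnectionError("packet dropped by chaos experiment")
        return handler(payload)


def call_with_retries(network, handler, payload, attempts):
    """The client-side resilience under test: retry a bounded number of times."""
    last_error = None
    for _ in range(attempts):
        try:
            return network.send(handler, payload)
        except ConnectionError as error:
            last_error = error
    raise last_error


def success_rate(drop_rate, attempts, requests=1000, seed=1):
    """Fraction of requests that succeed for a given drop rate and retry budget."""
    network = FlakyNetwork(drop_rate, random.Random(seed))
    succeeded = 0
    for i in range(requests):
        try:
            call_with_retries(network, lambda p: p.upper(), f"req-{i}", attempts)
            succeeded += 1
        except ConnectionError:
            pass  # the request failed even after all retries
    return succeeded / requests
```

Comparing success_rate(0.3, attempts=1) with success_rate(0.3, attempts=5) shows how much a simple retry policy helps when 30 percent of packets are lost, which is exactly the kind of question a network experiment is meant to answer.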

Finally, during an application test, chaos experiments test how the application behaves under a heavy load or what happens when nonsensical requests are sent to it. If no errors occur and the application remains usable for the end user, you know that you and your team are doing a lot of things right.
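For the application level, a rough sketch of sending nonsensical requests might look like the following. The handle_request function, its quantity field, and the status codes it returns are assumptions made up for this example. The experiment only checks that every malformed input gets a controlled response instead of an unhandled exception.

```python
import json
import random
import string


def handle_request(raw):
    """Hypothetical handler: returns (status, message) and should never raise."""
    try:
        data = json.loads(raw)
    except (ValueError, TypeError):
        return 400, "malformed JSON"
    quantity = data.get("quantity") if isinstance(data, dict) else None
    if type(quantity) is not int:  # rejects strings, floats, booleans, missing values
        return 400, "quantity must be an integer"
    if not 1 <= quantity <= 1000:
        return 422, "quantity out of range"
    return 200, f"ordered {quantity}"


def nonsense_requests(count, rng):
    """Yield a few hand-picked bad inputs followed by `count` random strings."""
    yield from ["", "{", "null", "[]", '{"quantity": "ten"}',
                '{"quantity": -5}', '{"quantity": 1e999}', '{"quantity": true}']
    for _ in range(count):
        length = rng.randint(0, 40)
        yield "".join(rng.choice(string.printable) for _ in range(length))


def run_application_experiment(count=500, seed=2):
    """Send nonsense to the handler; an uncaught exception would fail the experiment."""
    rng = random.Random(seed)
    return [handle_request(raw)[0] for raw in nonsense_requests(count, rng)]
```

If the run completes and the returned list contains only expected status codes, the handler degraded gracefully for every input it was given.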

Chaos Engineering without Chaos, but with a Plan

In order to avoid actual chaos when carrying out chaos engineering, you should work with a concrete plan so that you can proceed in a targeted manner and can measure the results. The Principles of Chaos website specifies four steps for practical applications (see figure below):

Definition of the steady state: Define the stable state of the system as a measurable output that indicates normal behavior.
Definition of the hypothesis: Formulate the hypothesis that this steady state will continue to exist in both the control group and the experimental group.
Adaptation of variables: Introduce real-world events that upset the state of the system. Inject errors such as failing servers and hard disks, network connections that no longer function properly, and completely overloaded subsystems.
Disproving the hypothesis: Try to disprove the hypothesis by comparing the control group with the experimental group, and evaluate the system’s behavior using metrics.

Figure: Chaos engineering step by step
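The four steps above can be sketched as a small simulation. Everything in it is an assumption made for illustration: the service is modeled as three replicas behind a load balancer, the steady-state metric is the error rate, and the tolerance of one percentage point is arbitrary. It is not taken from the Principles of Chaos website.

```python
import random


def error_rate(replicas, failover, requests, rng):
    """Simulate requests against replicas (True = up) and return the error rate."""
    failures = 0
    for _ in range(requests):
        up = rng.choice(replicas)  # the load balancer picks a replica at random
        if not up and failover:
            up = any(replicas)  # a retry lands on a healthy replica if one exists
        if not up:
            failures += 1
    return failures / requests


def run_experiment(failover, tolerance=0.01, requests=2000, seed=3):
    rng = random.Random(seed)
    # Step 1: steady state, measured as the error rate of the control group.
    control = error_rate([True, True, True], failover, requests, rng)
    # Step 2: hypothesis: the experimental group stays within `tolerance` of it.
    # Step 3: vary a real-world event: one of the three replicas fails.
    experiment = error_rate([True, False, True], failover, requests, rng)
    # Step 4: try to disprove the hypothesis by comparing the two groups.
    hypothesis_holds = abs(experiment - control) <= tolerance
    return control, experiment, hypothesis_holds
```

With failover enabled, the hypothesis survives the experiment. Without it, roughly a third of the requests fail and the hypothesis is disproved, which tells you where the system needs work.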

After evaluating the system’s behavior, you can determine the next steps. The principles of chaos engineering assume that the experiments are carried out in production environments. You do not have to start there, however; you can run them in staging environments first.

Before you carry out chaos testing, you should fully understand the entire system. You can then divide problems into these four categories:

Known knowns: These are problems that are known and understood in the organization and among employees.
Known unknowns: These are problems that the organization is aware of but does not fully understand.
Unknown knowns: These are parts of the system that the organization understands but that contain problems the organization isn’t aware of.
Unknown unknowns: These are problems that the organization is neither aware of nor fully understands.

Regardless of chaos engineering, as much as possible should fall under “known knowns.” It is important to minimize the other three categories as much as possible. Chaos engineering can help with this.

One advantage of chaos engineering is that it can increase the robustness of the entire system. However, you also need to have the confidence to run it on production systems. Even with the right monitoring and the right metrics, you should consider running chaos experiments at a time when there is little load on the systems anyway in order to minimize the risk of failure. What is often ignored is that controlled failures, as in chaos engineering, can be corrected much better, faster, and more cheaply than errors that first occur under high load with real users.
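One way to keep the risk in check is to wrap every experiment in a guard. The sketch below assumes four hypothetical callables (inject, rollback, read_load, read_error_rate) that you would connect to your own tooling and monitoring. The thresholds are placeholder values, not recommendations.

```python
def run_guarded(inject, rollback, read_load, read_error_rate,
                max_load=0.3, max_error_rate=0.05, checks=5):
    """Run a fault injection only under low load and stop early if errors spike."""
    if read_load() > max_load:
        return "skipped"  # too much traffic right now; try again in a quieter window
    inject()
    try:
        for _ in range(checks):
            if read_error_rate() > max_error_rate:
                return "aborted"  # real users would start to notice, so stop here
        return "completed"
    finally:
        rollback()  # always restore the system, whatever the outcome
```

The finally clause is the important design choice here: the injected fault is removed whether the experiment completes, aborts, or raises an unexpected error.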

Chaos engineering gives teams a much better understanding of how the entire system works, where there may be problems, and how they can be corrected. This helps teams find problems in good time, especially problems that are not quite so obvious.

After you have carried out these tests one or more times, you can also automate the whole process and run it regularly to constantly increase reliability.
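As a rough idea of what such automation could look like, the following sketch runs a set of named experiments repeatedly and keeps a history of the results. The function names and the convention that an experiment returns True when its hypothesis held are assumptions for this example. In practice, a scheduler such as cron or a CI pipeline would typically take the place of the loop.

```python
import time


def run_suite(experiments):
    """Run each named experiment once; True means its hypothesis held."""
    results = {}
    for name, experiment in experiments.items():
        try:
            results[name] = bool(experiment())
        except Exception:
            results[name] = False  # a crashing experiment counts as a failed one
    return results


def run_regularly(experiments, rounds, interval_seconds, sleep=time.sleep):
    """Run the suite `rounds` times, pausing between rounds, and return the history."""
    history = []
    for round_number in range(rounds):
        history.append(run_suite(experiments))
        if round_number < rounds - 1:
            sleep(interval_seconds)
    return history
```

A call such as run_regularly({"kill-one-server": my_experiment}, rounds=3, interval_seconds=3600) would run the hypothetical my_experiment three times, one hour apart, and a result that flips from True to False between rounds points to a regression in resilience.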

As is so often the case, chaos engineering only works if managers allow development teams to conduct experiments. If managers are too afraid to allow these experiments, it shows that they generally do not trust the production environment. But a lack of trust is a good reason to run these tests: if you can cause the failures you’re afraid of, you can address their root causes and avoid major problems down the line!

Reflection

Chaos engineering is similar to blue-green deployments and feature flags in the operation of an application and its infrastructure: these are ways of working that you can adopt only if you have full confidence in your team and your application.

As a first step, you should experiment with chaos engineering methods on staging environments before moving on to production environments. This alone will provide you with many insights, even if your organization still considers running such testing on production environments to be too dangerous.

Editor’s note: This post has been adapted from a section of the book DevOps: Frameworks, Techniques, and Tools by Sujeevan Vijayakumaran. Sujeevan is a senior solutions engineer at Grafana Labs. Previously, he worked at GitLab, where he helped large corporations from Germany, Austria, and Switzerland transition to a DevOps culture. He cohosts the German technology podcast, TILpod, and enjoys giving talks at open-source conferences—not only on technical topics, but also on good teamwork, efficient communication, and everything else that is part of the DevOps culture.

This post was originally published in 3/2025.
