Chaos Engineering for Traditional Applications

Not all on-prem applications have a future in the cloud, but can those same on-prem applications leverage cloud-like capabilities to help make them more reliable?

In 2011, Netflix introduced a tool called Chaos Monkey to inject random failures into its cloud architecture as a strategy for identifying design weaknesses. Fast forward to today: resiliency engineering has evolved into a discipline, complete with job titles like “Chaos Engineer.” Companies such as Twilio, Facebook, Google, Microsoft, Amazon, Netflix, and LinkedIn use chaos as a way to understand their distributed systems and architectures.

But all of these companies are based on cloud-native architectures, and so the question is:

Can Chaos Engineering be applied to traditional applications that run in the data center and will probably never be moved to the cloud?

What would it take to practice chaos engineering on your traditional J2EE, WebSphere, MQSeries, client-server applications that have been around for years? These apps might have been adequately maintained, but they may still suffer outages and have quality-control problems. The snap judgment might be to “rewrite” using the Refactor/Re-architect strategy from Amazon’s 6Rs. But most traditional organizations don’t have the budget or deep talent pool of the mega-techs, so what do you do? And if your core applications are partially or wholly based on AIX or IBM i, the likelihood of recreating them in a completely cloud-native fashion seems remote.

If there were some magical way to perform chaos engineering on a traditional application, what types of tests would you execute? The obvious ones would be resource-based, like the following (a couple of these are sketched in code just after the list):

  • Low memory
  • Not enough CPU
  • Full disk volumes
  • Low network bandwidth, high latency
  • Hardware failures like a failed disk drive, failed server, disconnected network
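
To make the list concrete, here is a minimal sketch of two resource-style injections (memory pressure and a full disk volume) in plain Python. It is a hand-rolled illustration rather than any particular chaos tool, the sizes and file path are assumptions you would tune, and it should only ever be pointed at the disposable cloud “twin” described later in this article, never at production.

```python
import os
import time

def consume_memory(gibibytes=4, hold_seconds=600):
    """Allocate roughly `gibibytes` of RAM and hold it to create memory pressure."""
    chunks = [bytearray(1024 * 1024 * 1024) for _ in range(gibibytes)]
    time.sleep(hold_seconds)  # keep the allocation alive while you observe the application
    del chunks

def fill_disk(path="/var/tmp/chaos_fill.dat", gibibytes=50):
    """Write a large junk file to exhaust free space on the target volume."""
    one_mib = b"\0" * (1024 * 1024)
    with open(path, "wb") as f:
        for _ in range(gibibytes * 1024):
            f.write(one_mib)

if __name__ == "__main__":
    consume_memory()  # watch for paging, GC thrash, or OOM kills in the app under test
    fill_disk()       # watch for failed writes, lost logs, or crashes
    os.remove("/var/tmp/chaos_fill.dat")  # clean up so the next observation starts clean
```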

And not so obvious ones could be (a certificate-expiry check is sketched after this list):

  • Database/server process down
  • Microservice down
  • Application code failure
  • Expired Certificate(s)
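
For the less obvious failures, a small script can double as both the experiment and the detection check. The sketch below measures how close a server’s TLS certificate is to expiry using only the Python standard library; the hostname is a placeholder assumption for your application’s endpoint.

```python
import socket
import ssl
import time

def days_until_cert_expiry(host, port=443):
    """Connect to host:port, fetch the server certificate, and return days until it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_epoch = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires_epoch - time.time()) // 86400)

if __name__ == "__main__":
    host = "app.example.internal"  # placeholder: your application's HTTPS endpoint
    print(f"{host}: certificate expires in {days_until_cert_expiry(host)} days")
```

Running the same check against every endpoint in a cloned environment also tells you whether an “expired certificate” experiment would even be noticed by your monitoring.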

And even less obvious:

  • Is there sufficient monitoring and have alarms been validated?
  • Understanding the repair time to correct different types of problems once identified. If a database server goes down, how long does it take to bring it back up along with all the associated application components that talk to the database?

It doesn’t take much imagination to come up with an extensive list of possible test scenarios that can be applied to traditional application architectures. You could take historical failures as a starting point.

But if you had a way to perform aggressive testing against your on-prem applications, what would that mean? Could you extend the life of that system if you could make it more reliable? Could you put off the “rewrite” decision?

And so now the question is: How do you do it? You probably don’t want to inject chaos into your production application. You also might not have a test system that fully represents production, so the results of testing might not carry over. And finally, if you had a test system that looked like production and you could run “destructive” tests against it, how long would it take to repair it so you could run the next experiment? The answer to “How do you do it?” is to use the cloud.

With the cloud, there is potential to create a production-like environment that has all the same application components as the original system of record. In this model, you would not convert anything to “cloud-native.” You would do a simple lift-and-shift, change no lines of application code, and run all the same servers as the original application; they would just be virtual machines in the cloud. You would use the same IP addresses, same hostnames, same network topology, the same amount of memory, disk, etc. You would recreate the original application’s “twin” in the cloud. And the twin would be where you would do various chaos engineering tests to observe the behavior of individual components as well as the overall system in general.

To be fair, infrastructure in the cloud won’t exactly replicate what you have on-prem. For example, the model and capacity of your enterprise SAN (storage array) won’t be replicated in the cloud, so you wouldn’t be able to test “failing the SAN.” What you can easily do in the cloud is disconnect or manipulate a disk attached to a VM to simulate a failure. The same goes for network components: you can disconnect the virtual network from a cloud-based VM, which is similar to what would happen on-prem if a physical or virtual network segment failed.
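
The exact mechanism depends on the cloud hosting your twin. As one concrete illustration, if the twin happened to run on a cloud with an EC2-style API (Skytap exposes its own REST API for equivalent operations), force-detaching and later re-attaching a data volume might look like the sketch below; the instance ID, volume ID, and device name are placeholders.

```python
import time
import boto3  # pip install boto3; assumes credentials are configured for the account hosting the twin

ec2 = boto3.client("ec2", region_name="us-east-1")

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder: the VM in the cloud "twin"
VOLUME_ID = "vol-0123456789abcdef0"  # placeholder: the data disk to "fail"

def fail_disk(hold_seconds=300):
    """Simulate a failed drive by force-detaching the volume, then re-attaching it."""
    ec2.detach_volume(VolumeId=VOLUME_ID, InstanceId=INSTANCE_ID, Force=True)
    time.sleep(hold_seconds)  # observe how the application behaves while the disk is gone
    ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=INSTANCE_ID, Device="/dev/sdf")

if __name__ == "__main__":
    fail_disk()
```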

By using the “lift and shift” approach, you could re-create a representation of your traditional on-prem application in the cloud, have it work “the same as it does now” with no redesign, and then run aggressive tests against it to determine how to make it better. And by making it better, you are extending its life.

And lastly, if you run a bunch of tests that eventually destroy the application clone running in the cloud, how do you reset it for the next round of tests? You don’t want to rebuild/fix things by hand, which could take days, weeks, or longer.

If you are already doing “infrastructure as code,” then you might already have scripts and tooling to re-create the system from scratch. Different clouds have different approaches to this. Since I’ve previously mentioned that I work for a company called Skytap, I can tell you how Skytap does it, but there are many approaches. The goal is to be able to “quickly” re-create a ready-to-use, running set of infrastructure and application components that represents the entire working original application: all the servers (VMs), storage, networks, installed software, the configuration of the OS, everything. And to be able to do that in minutes or hours, not days or weeks.

In Skytap, an “Environment” is a running image of an entire application that is managed as a single object. All the VMs, networks, and storage, along with the data on the volumes, are considered part of the environment. You can have multiple environments running simultaneously, such as PRODUCTION, PRE-PROD, QA#1, QA#2, and others called “CHAOS#1”, maybe “CHAOS#2”, and so on.

All of the environments might simultaneously use the exact same hostnames and IP addresses. Each environment runs in a sandbox, so even though I might have five copies of the same application running, they don’t see each other and don’t create network conflicts. If you have to change IP addresses on servers in the cloud, then the cloud system doesn’t match production, and you are introducing the potential for false positives in your testing. In the cloud, you must be able to re-create the same RFC 1918 address spaces that you are using on-prem. All of the major cloud services provide some form of NAT if a cloud environment has to talk back to on-prem.

Then, in Skytap, you can take an Environment and create a Template from it. The Template is not “running,” but is a point-in-time snapshot of the entire environment: all the VMs, networks, disks, and data.

From the Template, you can create a new “clone” of the original environment. No scripting is needed, just a few button clicks. Once the new Environment is created from the Template, you “run” it. All the servers start up, and you have essentially “reset” your test environment. Everything is back to the state at the moment the servers were saved to the Template. In many cases, this whole process can be completed in minutes.
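
Skytap also exposes Environments and Templates through a REST API, so the same reset can be scripted instead of clicked. The sketch below is an approximation only: the endpoint paths, JSON fields, credentials, and IDs are assumptions to verify against the current Skytap API documentation before relying on them.

```python
import requests  # pip install requests

BASE = "https://cloud.skytap.com"
AUTH = ("user@example.com", "api-security-token")  # placeholder credentials
HEADERS = {"Accept": "application/json", "Content-Type": "application/json"}
TEMPLATE_ID = "123456"                             # placeholder: the saved Template's ID

def clone_environment(template_id):
    """Create a fresh Environment (VMs, networks, disks, data) from a Template."""
    resp = requests.post(f"{BASE}/configurations.json",          # assumed endpoint
                         json={"template_id": template_id},
                         auth=AUTH, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()["id"]

def run_environment(env_id):
    """Power on every VM in the cloned Environment."""
    resp = requests.put(f"{BASE}/configurations/{env_id}.json",  # assumed endpoint
                        json={"runstate": "running"},
                        auth=AUTH, headers=HEADERS)
    resp.raise_for_status()

if __name__ == "__main__":
    env_id = clone_environment(TEMPLATE_ID)
    run_environment(env_id)
    print(f"Chaos environment {env_id} is starting")
```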

And so your final Chaos workflow is:

  • Import your on-prem environment into the cloud. This will be the longest part of the initial process. Different clouds have different capabilities for bringing content in from on-prem. In Skytap, you can “import” VMware or AIX images or restore an IBM i system from a backup.
  • Once you have a working application, you need to “save it” somehow so you can re-create on-demand clones in a short time.

Then your actual test workflow is (a sketch in code follows the list):

  1. Deploy a clone of the application from your Template (or scripts).
  2. Run your “chaos” tests and collect your results.
  3. Once your tests are done, completely delete your test environment.
  4. When you are ready for the next test, go to step #1.
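
Strung together, the workflow is just a loop. In the hypothetical sketch below, deploy_clone, run_chaos_experiment, and delete_environment are placeholders for whatever your cloud’s API or your own tooling provides (for example, the Skytap calls sketched earlier); only the shape of the cycle is the point.

```python
# Hypothetical orchestration of the chaos test cycle. The three helpers are
# placeholders you would implement against your cloud's API or tooling.

EXPERIMENTS = ["fill_disk", "kill_db_process", "expired_certificate"]  # example experiment names

def deploy_clone(template_id):                 # placeholder: step 1, clone from the Template/scripts
    raise NotImplementedError

def run_chaos_experiment(env_id, experiment):  # placeholder: step 2, inject failure, collect results
    raise NotImplementedError

def delete_environment(env_id):                # placeholder: step 3, throw the broken clone away
    raise NotImplementedError

def chaos_cycle(template_id):
    results = {}
    for experiment in EXPERIMENTS:             # step 4: repeat from a clean clone each time
        env_id = deploy_clone(template_id)
        results[experiment] = run_chaos_experiment(env_id, experiment)
        delete_environment(env_id)
    return results
```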

Summary

The concept described in this article is that you can leverage the cloud to improve the quality and reliability of on-prem applications that will never run in production in the cloud. You use the cloud as an on-demand sandbox where you can create and destroy things and then quickly recover. This concept works for original application systems of record, disaster recovery systems, and even software development pipelines.

Using a lift-and-shift model, make the cloud “look like” your traditional on-prem applications, then hack on the cloud-based environments without any risk to your original systems. Take what you’ve learned from the chaos testing and bake the knowledge back into your on-prem processes. Use the cloud as a way to make your traditional applications “better.”