Working with the Chaos Monkey

In the mid-70s, Dick Morse (Mrs. Morse’s son Helmut, as he was dubbed by Hugh Rundell) and I talked through the idea of having software fire drills built into systems.

The idea, at the time, was that there were points in protocols where errors could be injected to ensure that the recovery procedures worked, and also so that operators saw failures often enough (but always recoverable ones) to know what they looked like (and avoid the Maytag-repairman syndrome).
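Something like this minimal sketch captures the shape of a fire-drill point (the rate constant, device interface, and error type are all made up for illustration):

```typescript
// Hypothetical sketch of a fire-drill point: at a step where a recoverable
// error can occur naturally, occasionally inject one on purpose so the
// recovery path (and the operators watching it) stay exercised.
const FIRE_DRILL_RATE = 0.001; // assumed tuning knob: visible, but rare

class RecoverableIoError extends Error {}

interface BlockDevice {
  read(): Uint8Array;
}

function readBlock(device: BlockDevice): Uint8Array {
  if (Math.random() < FIRE_DRILL_RATE) {
    // Injected failure, indistinguishable from a real transient fault
    // and always recoverable by the retry logic below.
    throw new RecoverableIoError("fire drill: simulated read failure");
  }
  return device.read();
}

function readBlockWithRecovery(device: BlockDevice, retries = 3): Uint8Array {
  for (let attempt = 0; ; attempt++) {
    try {
      return readBlock(device);
    } catch (err) {
      if (!(err instanceof RecoverableIoError) || attempt >= retries) throw err;
      // The recovery path runs for real faults and for drills alike.
    }
  }
}
```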

I actually designed a real-time subsystem for operating multiple terminals off of a Xerox 530 minicomputer in which there were fire-drill points.

It was a valuable design exercise but I never needed to pull a fire drill.

It happened that the heuristics for estimating the size of data blocks needed to satisfy a terminal request or response guessed wrong often enough that the recovery code for that case was exercised regularly; it was visible (to those who knew what was happening) and it recovered properly. Meanwhile, dropped responses from the controller, a situation that could have been injected, happened often enough that we never had to inject them. We did expose a problem in the hardware architecture, however. The terminal controller was on the other side of a cheap adapter that provided no way for the minicomputer to force a reset of the controller. So if the controller (or the adapter) became unresponsive, all we knew was that all of our requests were timing out, and all we could do was slowly shut down all of the sessions as if the terminal operators had simply walked away without logging off.

My interest in this kind of fire drill was inspired by an earlier experience in the late 60s when Sperry Univac was building a System/360 semi-clone. (It was not plug-compatible; it could use some of the same devices but not the operating system.) In the test center, where early production machines were being used to develop the operating system (including all of the device drivers), IBM disk drives were used until we had delivery of our own. Everything was going along great until newly manufactured competitive drives were installed. These drives were not so reliable, and the OS started crashing because the error-recovery paths in the drivers had never been exercised and they failed.

This reminds me of an article I read a while ago about a custom JVM with a high-speed garbage collector.

http://www.artima.com/lejava/articles/azul_pauseless_gc.html

We didn’t take the typical approach where you try and optimize for the common fast case, but remain stuck with some things that are really hard to do, which you push into the future. Then you tune and tune to make those events rare, maybe once every ten minutes or every hour—but they are going to happen. We took the opposite approach. We figured that to have a smooth, wide operating range and high scalability we pretty much have to solve the hardest problem all the time. If we do that well, then the rest doesn’t matter. Our collector really does the only hard thing in garbage collection, but it does it all the time. It compacts the heap all the time and moves objects all the time, but it does it concurrently without stopping the application. That’s the unique trick in it, I’d say, a trick that current commercial collectors in Java SE just don’t do.

Pretty much every collector out there today will take the approach of trying to find all the efficient things to do without moving objects around, and delaying the moving of objects around—or at least the old objects around—as much as possible. If you eventually end up having to move the objects around because you’ve fragmented the heap and you have to compact memory, then you pause to do that. That’s the big, bad pause everybody sees when you see a full GC pause even on a mostly concurrent collector. They’re mostly concurrent because eventually they have to compact the heap. It’s unavoidable.

Our collector is different. The only way it ever collects is to compact the heap. It’s the only thing we ever do. As a result, we basically never have a rare event. We will compact the young generation concurrently all the time. We will compact the old generation concurrently all the time. It’s the only thing we know how to do. And we do it well.

So if you have a hard case that you can delay but never completely avoid, try spiting sanity and making it more common – ubiquitous in fact – rather than less.
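As a toy illustration of that principle (the data structure and names are invented here, and have nothing to do with Azul’s actual collector): a log that compacts on every write, so the “hard” step is the common path rather than a rare, painful cleanup.

```typescript
// Toy illustration only: the expensive "compaction" step runs on every write
// instead of being deferred until the structure degrades, so the hard path
// is exercised constantly and never becomes a rare stop-the-world event.
interface Entry { key: string; value: string }

class AlwaysCompactingLog {
  private entries: Entry[] = [];

  set(key: string, value: string): void {
    this.entries.push({ key, value });
    this.compact(); // the hard step, every single time
  }

  get(key: string): string | undefined {
    // After compaction there is at most one entry per key.
    return this.entries.find((e) => e.key === key)?.value;
  }

  private compact(): void {
    // Keep only the newest entry for each key.
    const latest = new Map<string, string>();
    for (const e of this.entries) latest.set(e.key, e.value);
    this.entries = Array.from(latest, ([key, value]) => ({ key, value }));
  }
}
```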

I’m a programmer, but for a while all I did was web design, which I left because most web designers aren’t programmers and the code they write is ugly, slow, and overall the worst of worst practices.
Anyway, the first rule was to degrade gracefully. This, however, is only from the user’s side. So we built a basic HTML site that had all the elements we needed. Then we added basic CSS: just things that are supported across pretty much all browsers, no hacks included. Then we started adding more advanced CSS, plus hacks to fix IE and pretty much only IE. Then we added JavaScript on top of that. Then we added more and more features, each more complicated and more likely to fail, built onto the previous layer. This way, from the top level down, whenever something fails it doesn’t really affect the site. If someone has JS completely disabled, they just get none of that magic. If someone visits with a pure text browser (or a screen reader like JAWS), everything is still there; it just doesn’t look nice. Even if a few functions failed because the browser had a quirky JS engine issue we didn’t know about, it didn’t matter: a simpler version existed.
We did use PHP, and I know the back-end developers built in redundancy in case a database couldn’t be accessed, along with whatever else they do.
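A minimal sketch of that layering in practice (the element IDs and endpoint are invented): the plain HTML form works on its own, and script only takes over when the features it needs are actually there, falling back to a normal page load if anything goes wrong.

```typescript
// Sketch of progressive enhancement as described above. The plain HTML form
// submits normally on its own; script only upgrades it when the required
// features exist, and any failure falls back to the simple version.
function enhanceSearchForm(): void {
  const form = document.querySelector<HTMLFormElement>("#search");
  const input = document.querySelector<HTMLInputElement>("#search-query");
  const results = document.querySelector<HTMLElement>("#search-results");
  if (!form || !input || !results || typeof fetch !== "function") return;

  form.addEventListener("submit", async (event) => {
    event.preventDefault(); // only once we know we can do better than a full reload
    try {
      const url = form.action + "?q=" + encodeURIComponent(input.value);
      const response = await fetch(url, { headers: { Accept: "text/html" } });
      if (!response.ok) throw new Error("HTTP " + response.status);
      results.innerHTML = await response.text();
    } catch {
      form.submit(); // anything goes wrong: fall back to the ordinary page load
    }
  });
}

enhanceSearchForm();
```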

Now that I’m back to being a programmer, I adhere to this quite well, and I think it has made me a more thoughtful programmer. I consider things like a function being called with the wrong types, which happens more often than you’d expect. What if a module cannot be accessed… all the stuff that really shouldn’t go wrong, but can, especially when the app is accessing internal server farms to retrieve information.
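In that spirit, a rough sketch of the kind of defensive checks involved (the record shape, URL, and timeout are made up): validate what comes back, and treat an unreachable service as an expected outcome rather than a crash.

```typescript
// Sketch of defensive checks around an internal service: validate the shape
// of what comes back, bound the wait, and treat failure as a normal case.
interface UserRecord {
  id: number;
  name: string;
}

function isUserRecord(value: unknown): value is UserRecord {
  // Don't trust the caller (or the server farm) to hand back the right shape.
  const v = value as { id?: unknown; name?: unknown } | null;
  return typeof v === "object" && v !== null
    && typeof v.id === "number" && typeof v.name === "string";
}

async function fetchUser(id: number): Promise<UserRecord | null> {
  try {
    const response = await fetch(`https://internal.example/users/${id}`, {
      signal: AbortSignal.timeout(2000), // the farm may simply never answer
    });
    if (!response.ok) return null;
    const body: unknown = await response.json();
    return isUserRecord(body) ? body : null; // wrong shape: treat as missing
  } catch {
    return null; // network or module failure is handled, not exceptional
  }
}
```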

But…just a month ago, Netflix was down. A “rare technical issue”.

I think the problem here is that entropy constantly increases. How many processes does the Chaos Monkey kill? Does it run more often over time? You have to fail more over time to compensate for the law of entropy.

Following up from @Kevin Reid: http://dslab.epfl.ch/pubs/crashonly/ is the core paper. “There is only one way to stop such software – by crashing it – and only one way to bring it up – by initiating recovery.”
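A rough sketch of the crash-only idea from that paper (the file name and record format here are invented): there is no graceful shutdown path at all, so recovery is the only startup code and gets exercised on every single run.

```typescript
// Crash-only sketch: state is rebuilt from an append-only journal on every
// start, and "shutdown" is simply killing the process.
import * as fs from "node:fs";

const JOURNAL = "journal.log"; // append-only record of every accepted write

function recover(): Map<string, string> {
  const state = new Map<string, string>();
  if (!fs.existsSync(JOURNAL)) return state;
  for (const line of fs.readFileSync(JOURNAL, "utf8").split("\n")) {
    if (!line) continue;
    try {
      const { key, value } = JSON.parse(line);
      state.set(key, value);
    } catch {
      // A torn final line is the expected footprint of a crash; skip it.
    }
  }
  return state;
}

function apply(state: Map<string, string>, key: string, value: string): void {
  // Journal first, then update memory: a crash at any point leaves a
  // recoverable file behind.
  fs.appendFileSync(JOURNAL, JSON.stringify({ key, value }) + "\n");
  state.set(key, value);
}

// Startup *is* recovery.
const state = recover();
apply(state, "example-key", "example-value");
```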

There is a downside to all of this, though. While reliability is a good thing, it’s not free. Chaos Monkey may make a system very robust, but the time and expense it imposes may be more than the occasional downtime it prevents. Obviously this will vary from system to system, but for every additional 9 added to uptime, something else must be forgone.

That makes sense. It reminds me of graceful degradation as far as web features go, or assuming users will put something vile in contact forms… always prep your system for failure. I’ll keep that in mind!

We’ve been designing software with this idea in mind for over 10 years. When you design with the idea of “everything fails, deal with it”, your software becomes much more robust. Not only does it become more robust, but the maintenance becomes easier as well. Need to take down a server? Who cares, just do it, the system will respond properly. Database died? Who cares, just go back to sleep and deal with it at a reasonable hour.

There is a downside, however. You must make sure you have proper monitoring in place so that you will know when something has failed unexpectedly. Otherwise your system could be running along in a degraded state without you being aware of it.

It definitely takes all the stress out of managing a large system.
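Something like this sketch shows both halves of that approach (the replica interface and alert hook are stand-ins): route around a dead replica automatically, but record every failure so the degraded state is visible instead of silent.

```typescript
// Failover plus visibility: a dead replica doesn't fail the request, but
// every failure is recorded so nobody discovers the degradation by accident.
interface Replica {
  name: string;
  query(sql: string): Promise<unknown[]>;
}

const alerts: string[] = []; // stand-in for a real monitoring/alerting system

async function queryWithFailover(replicas: Replica[], sql: string): Promise<unknown[]> {
  for (const replica of replicas) {
    try {
      return await replica.query(sql);
    } catch (err) {
      // Don't page anyone at 3 a.m., but don't stay silently degraded either.
      alerts.push(`replica ${replica.name} failed: ${String(err)}`);
    }
  }
  throw new Error("all replicas failed: " + alerts.join("; "));
}
```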

“The best way to avoid failure is to fail constantly”

I definitely agree with Corey. What was the issue, Jeff?

The suspense is killing me!

Talk about a cliffhanger.

Wondering what went on at Amazon this weekend? Here’s a scene from the bunker at HQ.

Please tell us what caused the server problems. It gives us a headache when you only tell us half the story.

This is great advice not just for the technical field but for life in general. Practice skills until perfected. Practice in rough conditions, while handicapped, or both. When required to perform (during concerts, sporting events, survival situations, or in this case on the internet) you will be adequately prepared.

Incidentally, the Android developer kit includes a program called Monkey which generates a random stream of user events:
http://developer.android.com/guide/developing/tools/monkey.html

I like this idea of using a Monkey for user facing apps :slight_smile:

But for background processes, I would say it depends. Since calamities are rare, the cost-to-benefit ratio will be different for each company and application.

Go double redundancy!

I agree with Edward Chick and am also very curious.

Wow, the Chaos Monkey keeps everyone honest. It is the ultimate environment that many enterprises are striving for, although many still struggle to understand the basics of high availability and where single points of failure still fester. I do sympathise with some of the folks hit by the Amazon EC2 outage, despite the fact that in an ideal world they could have avoided their fate. Sometimes the best learning happens by failing, aka “failing fast”, the management fashion du jour.

I’d think of the Chaos Monkey as the architecture that everyone should be building towards or aiming for. Only in the cloud could you even discuss an architecture like that.

-Sean Hull
My discussion of the Amazon outage: http://www.iheavy.com/2011/04/26/amazon-ec2-outage-failures-lessons-and-cloud-deployments/

Interesting, but let’s remember it was also a chaos monkey of sorts that caused the Chernobyl disaster. The operators were simulating a power outage to test the stability of their systems.

So I guess as with everything, we always have to balance out things to find a good middle ground.

(I work at Netflix; and this should not be considered in any way official)

One point worth making about the need to enhance Chaos Monkey is that killing instances isn’t enough. Some of the most interesting (in a bad way) issues we’ve seen involved instances that got into a weird state (e.g. instance is still up as far as an ASG is concerned but not up as far as an ELB is concerned). As you note, Chaos Monkey by itself is important, but the next step is to find ways to mess with your environment that are more complex than simply a clean death for your instances.
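Purely as an illustration (these health-view interfaces are invented here, not real AWS SDK calls): an instance that two views of the system disagree about is exactly the kind of weird state worth hunting, as opposed to one that died cleanly.

```typescript
// Cross-check two independent views of instance health; a disagreement marks
// the "weird state" instances that a simple clean-kill Chaos Monkey misses.
interface HealthView {
  isHealthy(instanceId: string): Promise<boolean>;
}

async function findZombies(
  instanceIds: string[],
  scalingGroupView: HealthView, // e.g. "still in service as far as the ASG knows"
  loadBalancerView: HealthView, // e.g. "out of service as far as the ELB knows"
): Promise<string[]> {
  const zombies: string[] = [];
  for (const id of instanceIds) {
    const [asgSaysUp, elbSaysUp] = await Promise.all([
      scalingGroupView.isHealthy(id),
      loadBalancerView.isHealthy(id),
    ]);
    if (asgSaysUp && !elbSaysUp) zombies.push(id); // disagreement = weird state
  }
  return zombies;
}
```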

Of course, sometimes we get to have our extreme scenario testing done for us, for free. Like when Amazon messes up their EBS environment …

Boeing started using a technique called FTA (Fault Tree Analysis) circa 1966 when designing civil aircraft. It’s basically a formal method to ensure no system is left without backup. While the “Chaos Monkey” is a neat idea, I’d never design a critical system without FTA.

@uala Thank you… I remembered working with a tool similar to MonkeyLives back in my Palm days but couldn’t remember the name. Gremlins, it was.