Working with the Chaos Monkey

codinghorror · April 25, 2011, 12:00am

Late last year, the Netflix Tech Blog wrote about five lessons they learned moving to Amazon Web Services. AWS is, of course, the preeminent provider of so-called "cloud computing", so this can essentially be read as key advice for any website considering a move to the cloud. And it's great advice, too. Here's the one bit that struck me as most essential:

This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html

MorganT · April 25, 2011, 12:00am

Sounds like a great model. I think Google had the same idea for their distributed file system, right? Assume failure.

Tinkertim · April 25, 2011, 12:00am

Incidentally, the people who used distributed file systems hardly noticed the outage. If you market something as indestructible (or close to it), you’d be amazed at how infrastructure is designed around that assumption.

We have something very similar (we call it drunken monkey) that randomly shuts down OVS VLANs and does ‘bad’ things to our API.

As far as storage goes, and the people who plan around it … united we fail, but sometimes in a good way.

JamesB · April 25, 2011, 12:00am

Jeff, is this a tacit admission that you were the cause of those outages?
“Raise your hand if where you work, your boss deployed a daemon or service that randomly kills servers and processes in your server farm.
…
Who in their right mind would willingly choose to work for a Chaos Monkey?”

I’m surprised that you didn’t link to this story from the Macintosh development team:
http://www.folklore.org/StoryView.py?project=Macintosh&story=Monkey_Lives.txt&sortOrder=Sort%20by%20Date&detail=medium&search=monkey
I imagine that the Netflix guys knew about it when they named their tester.

JamesB · April 25, 2011, 12:00am

Proper link, sorry:
http://www.folklore.org/StoryView.py?project=Macintosh&story=Monkey_Lives.txt

No comment editing?

MichaelB · April 25, 2011, 12:00am

“If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.”

Great idea for a distributed system, but when you have the whole kitten kaboodle deployed to AWS and AWS has issues, chances are all your systems will have issues too!
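For anyone who wants to see the quoted fallback in code, here is a minimal sketch. Everything in it is an assumption made up for illustration (the `POPULAR_TITLES` list, the `RecommendationsDown` error, the `healthy` flag that simulates the dependency); it is not Netflix's implementation.

```python
# Hypothetical static fallback list, assumed to be available locally.
POPULAR_TITLES = ["Title A", "Title B", "Title C"]


class RecommendationsDown(Exception):
    """Raised when the (hypothetical) personalization service is unavailable."""


def personalized_picks(user_id, healthy=True):
    # Stand-in for a remote call; `healthy` simulates the dependency's state.
    if not healthy:
        raise RecommendationsDown("recommendations service unreachable")
    return [f"Pick for {user_id} #{n}" for n in range(1, 4)]


def homepage_titles(user_id, healthy=True):
    """Return (titles, degraded); never raise because the dependency is down."""
    try:
        return personalized_picks(user_id, healthy), False
    except RecommendationsDown:
        # Degrade the quality of the response, but still respond.
        return list(POPULAR_TITLES), True
```

The fallback only helps if it lives somewhere that did not fail along with the dependency, which is the objection raised above about putting everything on one provider.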

wwwm · April 25, 2011, 12:00am

Very good advice Jeff, and a very interesting read. Luckily I don’t need to deploy anything… yet. But the advice will come in handy some day, I’m sure.

Cheers,

Ruben Berenguel

NathanP · April 25, 2011, 12:00am

I feel the same way about bugs. They suck to find, but in the end you almost always come out with a better system (besides obviously/hopefully fixing the bug), whether that be a better architecture, logging, personal understanding, whatever.

Nick_Berg · April 25, 2011, 12:00am

BTW Michael, the phrase is “kit and caboodle”, though “kitten” is a hilarious homonym for that.

Aczarnowski · April 25, 2011, 12:00am

I’m reminded that what seems like a failure in Windows, the “reboot to fix it” mindset, is a similar advantage. If you pull the plug on more ostensibly stable OSes, you have a much higher chance of ending up with an even bigger mess on your hands.

Kpreid · April 25, 2011, 12:00am

This reminds me of the “crash-only software” concept — that is, avoid writing a “shutdown” mechanism and instead ensure the system can restart when terminated at any point. The idea being that then your recovery system isn’t a rarely-invoked special case, so it is more likely to work when you need it (and it has pressure to be efficient), and also that you don’t have the cost of performing the shutdown when you need to.
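A rough sketch of the idea, under my own assumptions: a toy counter whose only persistence is an append-only log, with no shutdown method at all. The class name, the log format, and the `add` operation are hypothetical. The point is only that the normal startup path is also the recovery path.

```python
import json
import os


class CrashOnlyCounter:
    """Running total whose state is rebuilt from an append-only log on every start."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.total = 0
        self._recover()  # ordinary startup and crash recovery are the same code
        self._log = open(log_path, "ab")

    def _recover(self):
        if not os.path.exists(self.log_path):
            return
        good_end = 0
        with open(self.log_path, "rb") as f:
            for raw in f:
                try:
                    if not raw.endswith(b"\n"):
                        raise ValueError("torn record")  # killed mid-write
                    self.total += json.loads(raw)["add"]
                except (ValueError, KeyError, TypeError):
                    break
                good_end += len(raw)
        # Drop anything after the last complete record so new appends stay parseable.
        with open(self.log_path, "r+b") as f:
            f.truncate(good_end)

    def add(self, amount):
        self._log.write(json.dumps({"add": amount}).encode("utf-8") + b"\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # make it durable before acknowledging
        self.total += amount
```

Killing the process at any point loses at most the record being written, and the next start cleans that up as a matter of routine.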

Aslemos2009 · April 25, 2011, 12:00am

Or, failure and function continually engender each other. Code is Poetry, says WordPress, and I agree.

Jpotisch · April 25, 2011, 12:00am

Yes, yes, a thousand times yes. Distributed systems that rely on all the pieces being up all the time are simply at odds with reality. Every interaction with someone else can result in a success, a failure, a rejection, or your request simply getting lost. Your design doesn’t have to accept that fact, but failing to design for it doesn’t make it go away. I wrote about this two years ago:

When you build internet-scale distributed systems, you should always assume you are in flaky connection mode. Maybe the tubes are down today. Maybe your vendor’s server went down. Even with all the contracts and SLAs and angry phone calls in the world, you fundamentally don’t have any control over that box staying up and reachable when you need it.
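To make the four outcomes concrete, here is a small sketch that forces a caller to see every one of them. The `Outcome` enum, the `Rejected` exception, and `call_remote` are names invented for this example; a real client would wrap an actual network call instead of a plain function.

```python
import concurrent.futures
import enum
import time


class Outcome(enum.Enum):
    SUCCESS = "success"
    FAILURE = "failure"    # the other side tried and broke
    REJECTED = "rejected"  # the other side refused the request
    LOST = "lost"          # no answer at all within the deadline


class Rejected(Exception):
    """Hypothetical error a remote peer raises when it refuses a request."""


def call_remote(fn, timeout_s):
    """Run fn() and classify what happened; never wait longer than timeout_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return Outcome.SUCCESS, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return Outcome.LOST, None
    except Rejected as exc:
        return Outcome.REJECTED, exc
    except Exception as exc:
        return Outcome.FAILURE, exc
    finally:
        # Do not block on a call that may never finish.
        pool.shutdown(wait=False)
```

Returning the outcome as a value rather than letting exceptions escape is one possible design choice here; it makes "lost" as visible to the caller as "success".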

Brooksmoses · April 25, 2011, 12:00am

Like Morgan Tiley in the first comment, I was also reminded of Google’s approach to distributed systems. There, the collections of systems are big enough that you will have a chaos monkey just from hardware failures, so you have to build the system to deal with that – at which point, they famously asked why use expensive high-reliability hardware when the cheap stuff is vastly cheaper and only less-than-vastly less reliable?

Seems to work well for them. And, on the face of it, hardware failures are much less friendly than an artificial Chaos Monkey that you can simply reboot from.

It’s definitely an interesting approach to include a mild one voluntarily, though!

Ws1 · April 25, 2011, 12:00am

Good advice. Regarding the server that was causing trouble: I’ve dealt with a server that had very similar symptoms. Weeks of troubleshooting that led nowhere resulted in me throwing my hands up and assuming the motherboard itself was just bad and swapping hardware with an available hot spare. Problem solved. Months later after trying to redeploy the original server, I realized the DRAC card was both faulty and misconfigured, causing the aforementioned nightmare. Card removed, problem solved.

CoreyH · April 25, 2011, 12:00am

@JeffAtwood Don’t keep us in suspense – what was the root cause of the dropped server problem? (The Broadcom NIC thing you’ve mentioned before?)

Uala · April 25, 2011, 12:00am

Good story. As a side note, it reminds me of my old PalmOS developer days, and one of the greatest tools I’ve ever worked with: PalmOS Emulator with Gremlins!

http://www.accessdevnet.com/docs/emulator/Emulator_Testing.html#975120

Chaos Monkey is a kitten compared with those gremlin hordes…

BradG · April 25, 2011, 12:00am

Actually, AWS appears to still be partially down. metabase.cpantesters.org is still down, which has crippled the ENTIRE Perl testing infrastructure.

No one can upload new test reports.
CPAN authors aren't getting up-to-date reports of failure.
Reports from people testing distributions against the latest version of Perl are not getting through, which means that we can only hope that there aren't any new failures that aren't getting sent to the mailing list. (Perl v5.14.0 may come out on April 28th.)
CPAN authors may be leery of putting out new versions of their modules while this black-hole exists. (I know I am)

This is not the first time that there was a problem with sending reports. Last time there was a simple work-around. This time, the only work-around is to set up a relay server, and put it into offline mode until further notice.

Mintz · April 25, 2011, 12:00am

I was gonna post this link to reddit, an AWS site, but it was down. Now it’s back up. It’s like a Katy Perry song or something.

Tyseo · April 25, 2011, 12:00am

Voluntarily killing services and shutting down servers is hardcore testing, but it seems like a great way to stress web apps, software, and computers.
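As a toy illustration of that kind of test, the sketch below runs an in-memory "monkey" against a pool of fake replicas and checks that requests are still served. It is purely an assumption-laden simulation (the `Replica` class, `chaos_monkey`, and `serve` are all hypothetical names), not Netflix's tool and not something that touches real servers.

```python
import random


class Replica:
    """Fake service instance that can be switched off."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def chaos_monkey(replicas, rng):
    """Kill one randomly chosen live replica, but never the last one standing."""
    live = [r for r in replicas if r.alive]
    if len(live) > 1:
        victim = rng.choice(live)
        victim.alive = False
        return victim
    return None


def serve(replicas, request):
    """Try replicas in order; fail only if every one of them is down."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("no replica available")


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so a run can be repeated
    pool = [Replica(f"r{i}") for i in range(3)]
    for i in range(5):
        killed = chaos_monkey(pool, rng)
        print("killed:", killed.name if killed else None, "|", serve(pool, f"req{i}"))
```

If `serve` ever raised while a replica was still alive, the routing logic would be the bug, which is exactly the kind of thing this style of testing is meant to flush out early.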