Working with the Chaos Monkey

Tartley · May 12, 2011, 12:00am

How about a similar thing on the source-code control side?

A mutation testing tool will mutate your codebase, e.g. changing a ‘<’ operator into a ‘<=’. It does this in order to check that one (and ideally just one) test fails as a result.

To this conventional mix, add the idea that if we find a code mutation which does not cause any tests to fail, then the mutated code is automatically committed back into our repo.
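As a rough illustration of that idea, here is a minimal Python sketch (3.9+ for `ast.unparse`) that flips every ‘<’ into ‘<=’ and reports whether a test suite notices. The names `SOURCE`, `is_minor`, `weak_tests`, `thorough_tests` and `survives` are all made up for the example, and the "commit it back" step is deliberately left out; a surviving mutant is just what such a script would be committing.

```python
import ast

# Hypothetical code under test, held as a string so it can be mutated.
SOURCE = """
def is_minor(age):
    return age < 18
"""


class LtToLtE(ast.NodeTransformer):
    """Replace every '<' comparison with '<='."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op for op in node.ops]
        return node


def mutate(source):
    """Return the source text with the '<' -> '<=' mutation applied."""
    tree = LtToLtE().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def load(source):
    """Execute source text and return the namespace it defines."""
    namespace = {}
    exec(compile(source, "<mutant>", "exec"), namespace)
    return namespace


def weak_tests(ns):
    # Never probes the boundary, so the mutation goes unnoticed.
    assert ns["is_minor"](10) is True
    assert ns["is_minor"](30) is False


def thorough_tests(ns):
    weak_tests(ns)
    # Boundary check: this is the test that catches '<' vs '<='.
    assert ns["is_minor"](18) is False


def survives(source, tests):
    """True if the mutated code passes every test, i.e. nobody noticed."""
    try:
        tests(load(mutate(source)))
    except AssertionError:
        return False
    return True
```

With the weak suite, `survives(SOURCE, weak_tests)` is true, so under the proposal the mutated `is_minor` would land in the repo. Adding the boundary test in `thorough_tests` kills the mutant.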

This script should be run over production code unpredictably, several times per week.

It will teach your development team to write thorough tests, with great coverage! Or else!!!

Damien_Dube2 · July 26, 2011, 12:00am

After watching this TED talk, http://www.ted.com/talks/kevin_slavin_how_algorithms_shape_our_world.html, I was thinking that the financial market is in great need of a chaos monkey!

saintneko · March 13, 2012, 12:00am

“requiring the server to bluescreen before it would reboot.”

I think I already found your problem and the solution.

TysonB · April 23, 2012, 12:00am

Any tips for this kind of redundancy implementation and testing with limited memory and throughput (specifically, embedded systems)? I mean, if I had an infinite amount of memory and hardware, it would be easy to add much more error handling and checking, but in an embedded system you’re limited by both storage and speed. Any tips for improving these kinds of systems, not just massive server-type systems?

nearyd · July 3, 2014, 7:45pm

So, did you ever find out what the problem was with the servers dropping off the network? I’m dying to find out.

codinghorror · July 4, 2014, 10:21am

Yes, related to this:

http://blog.serverfault.com/2011/03/04/broadcom-die-mutha/

TL;DR “we tried some Intel NICs and the problem went away”