Working with the Chaos Monkey

How about a similar thing on the source-code control side?

A mutation testing tool will mutate your codebase, e.g. changing a ‘<’ operator into a ‘<=’. It does this to check that at least one test (and ideally just one) fails as a result.

To this conventional mix, add the idea that if we find a code mutation which does not cause any tests to fail, then the mutated code is automatically committed back into our repo.

This script should be run over production code unpredictably, several times per week.

It will teach your development team to write thorough tests, with great coverage! Or else!!!
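
A rough sketch of the mutation step described above, in Python. Everything here is an assumption made for illustration: the names `mutants`, `surviving_mutants`, `SOURCE`, `weak_tests` and `thorough_tests` are hypothetical, the mutation is done with a plain regex (so it would also touch a ‘<’ inside a string or comment), and the "commit the survivor back to the repo" step is deliberately left out.

```python
import re
from typing import Callable, Dict, Iterator, List, Tuple


def mutants(source: str) -> Iterator[Tuple[int, str]]:
    """Yield (offset, mutated_source), turning one bare '<' into '<=' each time."""
    # Skip '<=', '<<', '<>' and anything already preceded by '<' or '='.
    for match in re.finditer(r"(?<![<=])<(?![<=>])", source):
        yield match.start(), source[:match.start()] + "<=" + source[match.end():]


def surviving_mutants(
    source: str, tests: Callable[[Dict[str, object]], None]
) -> List[Tuple[int, str]]:
    """Return the mutants that the given test function fails to catch."""
    survivors = []
    for offset, mutated in mutants(source):
        namespace: Dict[str, object] = {}
        try:
            exec(mutated, namespace)   # load the mutated code
            tests(namespace)           # run the tests against it
        except Exception:
            continue                   # a test failed: the mutant was killed
        # No test noticed the change. This is where the proposal above
        # would commit the mutated code back into the repo.
        survivors.append((offset, mutated))
    return survivors


# Hypothetical code under test.
SOURCE = "def is_minor(age):\n    return age < 18\n"


def weak_tests(ns):
    # Never checks the boundary value, so the '<=' mutant slips through.
    assert ns["is_minor"](5)
    assert not ns["is_minor"](30)


def thorough_tests(ns):
    weak_tests(ns)
    assert not ns["is_minor"](18)  # boundary check kills the mutant


if __name__ == "__main__":
    print(len(surviving_mutants(SOURCE, weak_tests)), "survivor(s) with weak tests")
    print(len(surviving_mutants(SOURCE, thorough_tests)), "survivor(s) with thorough tests")
```

A real tool would mutate the syntax tree, not the text, and would run the project's actual test suite, but the shape is the same: mutate, run the tests, and treat any mutant that survives as a gap in the tests.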

After watching this TED talk, http://www.ted.com/talks/kevin_slavin_how_algorithms_shape_our_world.html, I was thinking that the financial markets are in great need of a chaos monkey!

“requiring the server to bluescreen before it would reboot.”

I think I already found your problem and the solution. :wink:

Any tips for implementing and testing this kind of redundancy with limited memory and throughput (specifically, in embedded systems)? If I had an infinite amount of memory and hardware, it would be easier to add much more error handling and checking, but in an embedded system you’re limited by both storage and speed. Any tips for improving systems like that, and not just massive server-type systems?

So, did you ever find out what the problem was with the servers dropping off the network? I’m dying to find out.

Yes, related to this:

http://blog.serverfault.com/2011/03/04/broadcom-die-mutha/

TL;DR “we tried some Intel NICs and the problem went away”