What's Worse Than Crashing?

It seems that some people misunderstand the notion of fail fast.
The fail fast idiom is not a substitute for careful, robust design, thorough testing, and disciplined coding.
If you can recover from the error, of course you should. If you can design the system so that such errors cannot occur - even better.

The idea is to catch the “impossible” errors, from which you cannot possibly recover:

  • You just checked the free buffer size and still buf_insert() fails. How can that be!?
  • The device driver function table pointer is NULL. What can you do?
  • Stack magic cookie is incorrect. You have a stack overrun on the interrupt stack.
  • You disabled the interrupt and still somehow got it. WTF?

The point is that these are things that cannot possibly fail, so when they do fail, you cannot recover from them safely. If the function table pointer is NULL, you can bet other important system data is corrupt too. No recovery is possible at that stage. But if you fail fast, and your system is designed to recover from such failures, your software is more robust.
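As a rough illustration, here is a minimal sketch in C of what such checks might look like. Every name in it (fail_fast, FAIL_FAST_UNLESS, driver_ops_t, STACK_COOKIE, stub_write and so on) is made up for this example and does not come from any real SDK. It also assumes a hosted build where abort() is available; on a real target the handler would more likely save the failure reason and force a reset.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical names throughout; nothing here comes from a real SDK. */
#define STACK_COOKIE 0xDEADBEEFu

/* Illustrative driver function table with a single entry. */
typedef struct {
    int (*write)(const void *data, size_t len);
} driver_ops_t;

/* Last resort: report where and why, then stop. Never returns.
 * Assumption: abort() stands in for "save the reason and reset". */
static void fail_fast(const char *file, int line, const char *why)
{
    fprintf(stderr, "FATAL %s:%d: %s\n", file, line, why);
    abort();
}

/* Unlike assert(), this check is not compiled out by NDEBUG. */
#define FAIL_FAST_UNLESS(cond, why) \
    do { if (!(cond)) fail_fast(__FILE__, __LINE__, (why)); } while (0)

/* Pure check: is the magic value at the end of the stack still intact? */
int stack_cookie_ok(const uint32_t *stack_limit)
{
    return *stack_limit == STACK_COOKIE;
}

/* An overwritten cookie means the stack overran; nothing can be trusted. */
void check_interrupt_stack(const uint32_t *stack_limit)
{
    FAIL_FAST_UNLESS(stack_cookie_ok(stack_limit), "interrupt stack overrun");
}

/* A NULL table or entry is "impossible", so treat it as corruption
 * instead of returning an error code the caller cannot act on. */
int driver_write(const driver_ops_t *ops, const void *data, size_t len)
{
    FAIL_FAST_UNLESS(ops != NULL, "driver function table is NULL");
    FAIL_FAST_UNLESS(ops->write != NULL, "driver write entry is NULL");
    return ops->write(data, len);
}

/* Stand-in driver used only to exercise the happy path. */
int stub_write(const void *data, size_t len)
{
    (void)data;
    return (int)len;
}
```

Note that ordinary, recoverable errors (a full buffer, a timeout) would still be reported through return codes as usual. Only the conditions that indicate corrupted state go through fail_fast.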

Again, fail fast is for those extreme conditions that should NEVER happen. But they sometimes do, and you had better be ready.

Some systems should never ever reboot, but such systems often have other fail-safe mechanisms not available to most (hardware redundancy, special watchdog timers, etc.), and go to extreme lengths and expense to ensure reliability and availability.