What's Worse Than Crashing?

As always, clarity is king. Silent failure is the killer. Fail fast, and transparently.

I’m a big fan of the Pragmatic boyz and Ron Jeffries in this camp:

Patrol your borders well. Include exception handling where there are no consequences. Treat errors deep in your code like the end of the world. Unit test to back up your assumptions. (Particularly boundary conditions.)

Your app should definitely fail immediately if the database suddenly disappears, or if a packet of XML gets lost in the ether… unless you can retry it… If the user picks a bad date range, though, ask them to fix it, or correct it behind the scenes if possible.
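One way to read "unless you can retry it" is a small retry wrapper for transient faults like a dropped connection, one that still fails fast once the retries are exhausted. A minimal sketch (the function and parameter names are mine, not from the post):

```python
import time

def with_retries(operation, attempts=3, delay=0.01):
    """Retry a transient operation a few times; re-raise if it keeps failing."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # retries exhausted: now fail fast and loudly
            time.sleep(delay)  # brief pause before the next try
```

The key design point: only the *transient* error type is retried; anything else propagates immediately, and even the transient one eventually does too.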

Personally I think the push to make computers the perfect user experience is overblown. The commute to work is fraught with flaws: potholes, detours, wrecks, late trains, broken escalators, and faulty air-conditioning. There’s a price/performance scale that people have for quality. (A friend of mine calls it the “5, 500, F*** IT! Rule”.) Look at how hard NASA has to work to guarantee slim margins of error. As long as they have benign failure, the perfect world is something most people don’t care to finance.

“Giving your manager a 90% bug-free app and being overdue will get you a better performance review than giving your manager a 50% bug-free app on-time.”

Roger Farley, what planet are you from? Please say. I want to move there. I want to move there yesterday if not earlier.

One thing that you need to take into account when you talk about “failing fast” or “failing slowly” is the nature of your application – how is the state of the application coupled to the state of your data?

In most web apps, when someone gets an error, they know they can usually hit the back button and retry. With a Windows-based app, that's (of course) not the case – but does it have to be that way?

For business applications, my WinForms-based apps tend to use stateless middle tiers. This makes it darned tough for any front-end problem to seriously corrupt the state of my precious data, whether that be a database, queue, or something on the server’s filesystem.

That does make it a bit safer to write slower-failing clients (albeit ones which notify the user and log their error before allowing them to retry).

Some comments:

This point really can’t be overstated. Having your code talk to you is how you catch and fix odd customer errors.

  • You should understand completely how your components can fail, and should wrap the cases where there are complex failure conditions. Take the registry as an example: even when the function doesn’t fail with an error code, it can have really weird postconditions.

    1. Wide char string of 21 bytes, when the byte count should be divisible by 2
    2. Wide char string with more than one null-terminator
    3. Wide char string without any null terminator
    4. QWORD value with 16 bytes (documented to have 8 bytes – 64 bit number)
    5. Value of type REG_NONE
    6. Strings with zero size, for which the query does not set a null terminator
    7. There are more, too.
  • Test your error cases.
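Those registry postconditions are exactly the kind of thing a defensive wrapper can check. A sketch of such a validator (a hypothetical helper, not a real API) that catches the odd-length, missing-terminator, extra-terminator, and zero-size cases from the list above, treating the raw value as a UTF-16 buffer:

```python
def looks_like_valid_wsz(data: bytes) -> bool:
    """Check the postconditions a wide-char string query is *supposed* to
    satisfy: non-zero even byte count (UTF-16 code units are 2 bytes) and
    exactly one null terminator, at the end."""
    if len(data) == 0 or len(data) % 2 != 0:
        return False  # zero-size value, or odd byte count like 21
    chars = [data[i:i + 2] for i in range(0, len(data), 2)]
    # last code unit must be the terminator; no embedded terminators allowed
    return chars[-1] == b"\x00\x00" and b"\x00\x00" not in chars[:-1]
```

Calling this on every buffer the registry hands back is cheap insurance against the "function succeeded but the data is garbage" cases.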

I would like to add a plug for logging errors if you try to fix up. This is actually what automotive systems do (contrary to the comment above). That is what the check engine light is about. As I understand it, there are several processors that can pick up if another fails. “Fail fast” is the opposite of what the company wants.
“Fixup” may mean restarting or abandoning a thread or transaction (with an error returned back to the client).

Jeff, what about logging errors when they happen as a compromise to merely “failing fast”? This would be transparent to the user. After logging, the app can provide the typical “just in case” error handling we all know and love…

This would give the development staff metrics to go by as they attempt to gauge how severe an issue is, how frequently it crops up, etc., while still allowing the user’s app to hobble along in the meantime by attempting to recover from its errors until the developers can fix the bug.

In our eCommerce apps, I’ve had good success with an approach of, when faced with an exception condition:

  1. Email an alert message to our small development team with the relevant information to identify the issue: httprequest params, stacktrace, userkeys, etc.

  2. Do whatever you can (e.g. use default values for the missing data that caused the exception) to return the user a valid page along the lines of what they were expecting. It may not have all the information that would have been returned had the exception not occurred, but the user continues with a sense that the application is still working.

As a developer, you are highly motivated to fix an issue that is filling your inbox. It’s much more visible than an exception in a log file.
(Note: It’s also a good idea to implement such developer alert sending methods with some safety valves that stop sending emails after a certain threshold. I’ve returned to work once (once!) the following morning after a new build to discover several thousand emails in my inbox. During that period, not a single user perceived an error or logged a support call.)
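That safety valve is easy to sketch: a throttle that sits in front of the alert sender and goes quiet after a threshold, so one bad build can't flood the inbox. The class name, `send` callback, and default limit here are all illustrative, not from the comment:

```python
class AlertThrottle:
    """Safety valve for developer alerts: after `limit` messages, suppress
    further sends (but keep counting, so the total is still known)."""

    def __init__(self, send, limit=100):
        self.send = send    # e.g. a function that emails the dev team
        self.limit = limit
        self.count = 0

    def alert(self, message):
        self.count += 1
        if self.count <= self.limit:
            self.send(message)
        elif self.count == self.limit + 1:
            # one final notice so the team knows suppression kicked in
            self.send("alert limit reached; suppressing further messages")
```

A per-hour or per-build reset of `count` would be the obvious next refinement.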

The truth, as usual, is somewhere in between ‘fail fast’ and ‘corrupt data.’ I have personal experience of a program that goes through a list of items and exits on the first failure, leaving many items waiting that could have been completed. Another program continually fails with null object errors, and the logic of the application precludes fixing the problem without a rewrite. It is important to distinguish between real failures and ephemeral conditions. Never lie, never ignore, never throw an exception as an alternative to control flow and validation, never believe that a failure can’t happen, never check to see that the computer has power!

Never lost any data? You didn’t just say that, did you? Your data is going missing and soon. The software gods are listening.
I do embedded. Always fail fast, but save a trail. We added “Asserts” to our product. The assert saves the state of things, and stops the applications. It covered the “This could never happen” and default cases. They show up all the time. Many hard to find and unexplainable bugs were rooted out.
Yes, marketing and the customers do not want to see an assert. You have to remind them that an assert is better than a weird problem that never gets fixed.
Now, two years later, an assert is so rare, when one happens people don’t know what it is. The quality of the product is better.
Admitting there is a problem is the first step.
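The embedded pattern described above, an assert that "saves the state of things, and stops the application", can be sketched as a small helper. This is only an illustration of the idea (a real firmware assert would dump to flash or a serial port, and the names here are mine):

```python
import json

def assert_or_die(condition, state, log_path):
    """Embedded-style assert: on a "can't happen" condition, save a trail
    (the relevant state) for the developers, then stop instead of limping on."""
    if not condition:
        with open(log_path, "w") as f:
            json.dump(state, f)  # leave evidence behind before halting
        raise SystemExit("assert failed - state saved to " + log_path)
```

The point of the saved state is exactly what the commenter says: "unexplainable" bugs stop being unexplainable when the assert hands you a snapshot from the moment the impossible happened.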

But what does “fail” mean in context of fail fast?

With a generic application exception handler, “failing” could mean absolutely nothing in terms of UI. Yes, this is bad for the end user and is probably never implemented this way, but it shows you can fail fast internally as well as later, at the surface.

Your logging framework, or whatever you have in place to deal with the unknown, can catch the failure immediately. It can log details about the state of the machine at that point. Yet the application can continue on. The end user need not see anything more than a simple “oops” message on the screen.

If your application is structured properly, the impact to the end user will be minimal. Your “oops” message could even include a suggestion to restart the application, if you don’t trust yourself (and why should you, you just released software with this bug!)

I’ve seen this practice used in very popular commercial software.
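A minimal version of that structure, a last-resort wrapper that logs full details immediately (failing fast for the developers) while the user only sees "oops", might look like this. The function names are mine, and this stands in for whatever real framework hook an application would use:

```python
import logging
import traceback

def run_protected(action, log=logging.getLogger("app")):
    """Last-resort handler: record everything, show the user almost nothing,
    and let the application carry on."""
    try:
        return action()
    except Exception:
        # the full trail goes to the log the moment the failure happens
        log.error("unhandled failure:\n%s", traceback.format_exc())
        # the user gets the simple message (and maybe a restart suggestion)
        return "Oops - something went wrong. You may want to restart."
```

Whether returning a placeholder result is acceptable depends entirely on how coupled the application state is to the data, which is the earlier commenter's point.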

Steve:

Point taken! You’re absolutely right. It’s truly scary what end users “work around” when they should simply ask for a fix.

The important thing is to realize that what’s good for the development process is not necessarily good for the users. That’s why we have debug builds. I think it was in Steve Maguire’s excellent (if dated) “Writing Solid Code” that I first came across the “fail immediately” idea, and it’s presented squarely as a development philosophy. In release builds you try to recover as best you can; in debug builds you want everything to blow up the moment one of your assumptions isn’t met.

Maguire’s argument was against defensive programming. If there’s a bug but the program doesn’t crash then there’s a temptation to keep going and fix it later. This is pretty much always a bad idea. In development you always want to know immediately if your assumptions about what your code is doing are wrong. That means you don’t understand what’s actually happening, so it’s just plain luck if anything seems to work. (And your luck is guaranteed to run out at just the wrong time…)

Agreed wholeheartedly with “fail fast”. Code which does something like

Thing* t = createThing ();
if (t == NULL) return; // grrr!
t->doSomething ();

is not wise defensive programming. Let it crash, even in production – at least you’ll get a nice diagnosable crash dump that way.
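The same contrast in Python terms (with a hypothetical `create_thing` factory): the swallowing version makes the bug vanish silently, while asserting with a message crashes loudly at the exact point where the assumption broke.

```python
def use_thing_swallowing(create_thing):
    t = create_thing()
    if t is None:
        return            # grrr! the bug vanishes silently
    t.do_something()

def use_thing_fail_fast(create_thing):
    t = create_thing()
    assert t is not None, "create_thing() returned None"  # crash loudly, here
    t.do_something()
```

The fail-fast version hands you a stack trace pointing at the real culprit instead of a mysterious no-op three modules away.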

There’s also one annoying problem with crashes, slightly tangential to data loss: losing recent configuration changes, when the application naively believes it’s best to write out its settings on exit.

Namely, it’s a pain with Firefox: proxy settings that don’t get saved, new accounts (like for S3 Organizer) that get lost if the main executable terminates via a crash or forced kill.

Having to remember to restart the application after configuration changes, just in case, so that they really get saved, is not the greatest thing.

I have to agree with the “fail fast” philosophy, especially for a lot of embedded software.

I worked at a place where we had our own in-house kernel, about 50KLOC of kernel space code. The system was an embedded system, which could not be serviced at the site at all.

The guiding principle was: if the error points to memory corruption in the kernel, or a really bad hardware failure (bad parity on PCI for example): kernel panic immediately.
The system is already designed to recover well from an unexpected boot (power failures, kernel panics) and it will boot into a special, minimal kernel mode after such a failure.

This is FAR better than continuing to work on an unstable kernel, which can cause data corruption or even destroy your hardware (yes, it can!).

The result? Once the system was deployed, there were no kernel panics at all. In fact, a kernel assert was so rare even during development that if one happened, every kernel developer would drop everything and come look at it.

On the other hand: your software really needs to be robust. Asserts should NOT happen in embedded, safety-critical, or value-critical systems. You really do not want your X-ray machine software to assert before turning off the radiation. Avionics software should not assert during flight if it cannot recover from the failure.

Something which I think is worse than any of those (aside from data corruption) is a program which refuses to die. Regardless of how rarely the bug occurs, it’s completely unacceptable. This post is quite timely, as earlier today I was burning a disc with Nero, and during the disc-reading process Nero froze. I couldn’t shut it down, not even through Task Manager. I literally had to restart Windows to close the application.

It seems that some people misunderstand the notion of fail fast.
The fail fast idiom does not come instead of very careful, robust design, thorough testing and coding with discipline.
If you can recover from the error, of course you should. If you can design the system so that such errors cannot occur - even better.

The idea is to catch the “impossible” errors, from which you cannot possibly recover:

  • You just checked the free buffer size and still buf_insert() fails. How can that be!?
  • The device driver function table pointer is NULL. What can you do?
  • Stack magic cookie is incorrect. You have a stack overrun on the interrupt stack.
  • You disabled the interrupt and still somehow got it. WTF?

The idea is that the “impossible” failures are precisely the ones you cannot recover from safely. If the function table pointer is NULL, you can bet other important system data is corrupt. No recovery is possible at that stage. But if you fail fast, and your system is designed to recover from such failures, your software is more robust.

Again, fail fast is for those extreme conditions that should NEVER happen. But they sometimes do and you better be ready.

Some systems should never ever reboot, but such systems often have other fail safe mechanisms not available to most (hw redundancy, special watchdog timers, etc.), and go to extreme lengths and expense to assure reliability and availability.
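The "stack magic cookie" bullet above is a concrete instance of this: a sentinel value is planted next to a vulnerable region, and if it has been overwritten, adjacent memory is corrupt and no safe recovery exists. A toy sketch of the check (the constant and function are illustrative, not from the comment):

```python
MAGIC = 0xDEADBEEF  # sentinel planted at the end of the vulnerable region

def check_canary(observed):
    """Stack-cookie style invariant check: a smashed sentinel means the
    memory next to it is corrupt, so the only safe move is to stop now."""
    if observed != MAGIC:
        raise SystemExit("canary smashed: 0x%08X" % observed)
```

Paired with a system designed to recover from an unexpected reboot, as in the kernel story above, this turns silent corruption into a clean, diagnosable restart.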

@Roger Farley:
“I am surprised how I have to constantly make the statement “If a condition that would cause an exception can be caught and dealt with, then it should be caught and dealt with before it throws the exception.” I find this very CS101. Exceptions should be left for the unknown, not the known.”

That depends very much on the language, actually. In many languages, exception handling is expensive, which leads to the idiom you describe. However, in Python, exception handling is cheap, which leads to the opposite idiom, using exceptions for conditions that are special but not unexpected (like reaching the end of a file while reading). It’s preferred for functions to raise exceptions rather than return error codes, as well.
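The two idioms side by side, using a made-up config lookup as the example (LBYL is "look before you leap", EAFP is Python's "easier to ask forgiveness than permission"):

```python
# LBYL - the idiom common in languages where exceptions are costly:
def get_port_lbyl(config):
    if "port" in config:
        return config["port"]
    return 8080  # fall back to a default

# EAFP - the Python idiom: use the exception for a condition that is
# special but fully expected, like a missing key or end-of-file.
def get_port_eafp(config):
    try:
        return config["port"]
    except KeyError:
        return 8080
```

Both return the same results; the difference is purely idiomatic, and in Python the EAFP form also avoids the race between the check and the access.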

Agree with M

Typing ‘a program should be able to recover from an(y) error’ is itself saying that you already know what the error is, and thus how to recover from it. Not recovering from said error is therefore not an exception, just a lazy programmer. E.g. writing to a file that is set read-only, without first checking.

An exception is truly that. You didn’t expect it would happen when the method was written.

  • You do expect network links to go down.
  • You do expect users to enter in bad dates.
  • Despite all those meetings, your users provide a ‘perfectly valid’ (in their minds) data file that fails the preconditions you’ve already written. That’s an exception.

Fail fast appears to be the best we can do, because it tries to catch as many errors as possible up front, in development and testing. It’s not perfect, but it’s better than exception-gobbling software that continues to run despite unknown levels of corruption. This isn’t, however, license for software to just go poof and disappear. There is absolutely a need for proper logging and tracing to nail down what happened, especially for shrink-wrapped software. Also, different software environments dictate different approaches, and that’s life.

I write a lot of code that throws exceptions, and barely any that handles them. Usually there’s just a top-level handler that does the logging.

PS Analogies are lame.