Crash Responsibly

As programmers, it is our responsibility to ensure that when something goes horribly wrong with our software, the user has a reasonable escape plan. It's an issue of fundamental safety in software error handling that I liken to those ubiquitous airline safety cards.


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2008/05/crash-responsibly.html

“That might be easy in the nice and friendly world of managed languages, where every error is an exception, and the OS doesn’t pull the rug out from under you when you dereference a null pointer.”

Well, let me see, what could you do to solve that problem…
Hmmm…
Every time you dereference a pointer you should check if it’s null. Every Time You Dereference A Pointer You Should Check If It’s Null. EVERY TIME YOU DEREFERENCE A POINTER YOU SHOULD CHECK IF IT’S NULL.

If it’s null, and that’s a major problem (i.e. you don’t know why it’s null), up pops the screen saying “Null pointer dereferencing at line squiddlybeep. I’m a lazy sonofab***h, and didn’t consider this possibility. Press ‘Oi, Loser!’ to inform me of my screwup.” Because the user can’t feed NULL pointers into your program. That’s all you, baby.

And while we’re at it, Initialize Your Damn Pointers, because random garbage often offends.

"I think Visual Studio has the best variety of crash symptoms - my favourite is when it just vanishes usually when you get to the critical point in a debug session."
To be fair, VS encounters more than its fair share of the 80% of problems that affect 20% of users.
For example, the critical point in a debug session is generally the point where you’ve decided to do some crazy crap that nobody’s ever done before.

Sorry Jeff, but I think I must disagree with you. You say: “If users have to tell you when your app crashes, and why, you have utterly failed your users.”

Yeah, it’s true. But how can I know when and why my software crashes? Do you mean that I must use automatic error reporting? But what about privacy? I cannot do it without asking the users, so what do I do? I mean, thinking of MS: not all crashes are caused by the software, nor are they all caused by the OS. So if a crash is due to the OS, and it sends the report without asking the user, MS is a Big Brother that sends personal data without the user’s agreement. If they ask for notification, then MS has failed its users. So how can you resolve it?

I find it interestingly sad that global error handling in .NET Winforms is rather a chore.

great post jeff

You should always be able to see what the application is doing, or at the very least see that it is doing something. I hate it when an application stalls; thank god Windows can show the “not responding” message.

Good point. I particularly enjoy putting complete core dumps in my automatically generated error reports so I can go through financial reports from companies all over the world.

Privacy? Letting users opt-in to sending me information on what they’re doing with my precious intellectual property? Yeah, like that’ll ever happen.

Blackstorm writes:
But what about privacy? I cannot do it without asking the users, so what do I do?

Good points. The solution is a basic opt-in or partially opt-out choice at install. During install, or upon first launch, ask the user if it is okay to automatically send anonymous error reports in the event of a catastrophic error. Then give them the options of “always” and “ask before sending each time”. (Worded a little better of course.) As long as it is a rare occurrence for this dialog to appear, a “never send” option shouldn’t be needed. You could also present the opt-in/opt-out screen the first time a major error ever occurs. But the risk here is that you are asking at a point in time when the user is frustrated, and perhaps very upset with the product.

When the error occurs, if they opted in, you just send it. If they partially opted out, you present a dialog asking if it is OK to send the report. Include a button that displays the full detail of the report. IMHO, that is the best way to deal with this.
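The flow above could look roughly like the following. This is only a sketch: `Consent`, `CrashReport`, `handle_crash`, and the `ask_user` / `send` callbacks are names made up for illustration, not part of any real reporting library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Consent(Enum):
    ALWAYS = "always"      # user opted in: send without prompting
    ASK_EACH_TIME = "ask"  # user partially opted out: prompt on every crash


@dataclass
class CrashReport:
    summary: str
    details: str  # full text shown when the user clicks "show details"


def handle_crash(
    report: CrashReport,
    consent: Consent,
    ask_user: Callable[[CrashReport], bool],
    send: Callable[[CrashReport], None],
) -> bool:
    """Send the report if the stored choice allows it; return True if sent."""
    # ask_user is only called when the user did not choose "always".
    if consent is Consent.ALWAYS or ask_user(report):
        send(report)
        return True
    return False
```

In a real app `ask_user` would show the dialog (with the button that displays `report.details`) and `send` would do the upload. Both are passed in so the decision logic stays easy to test.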

I agree with Jeff’s points. We as consumers would never accept problems in other products that we as programmers sometimes expect users to accept. I unfortunately have worked on far too many teams where error handling is given little if any consideration. To a user, an application that crashes without providing any feedback or apologies is beyond annoying.

For C++ and native app developers, it is worth mentioning the excellent blackbox utility by Jim Crafton:
http://www.codeproject.com/KB/applications/blackbox.aspx

That might be easy in the nice and friendly world of managed languages, where every error is an exception, and the OS doesn’t pull the rug out from under you when you dereference a null pointer. Not so much in languages like C++. Nevertheless, I am implementing an error reporting system for my software. On every platform it runs on.

In my opinion it isn’t the handling of fatal errors that’s difficult; doing it cross-platform is what’s tough.

For an open source project I’m working on ( http://wz2100.net ) I’ve created a fatal error handler, aka exceptionhandler, that works quite well cross-platform.

On Windows it uses the unhandled exception handler framework as exposed by the Windows API to catch fatal errors; on Unix systems it uses POSIX signal handling for that purpose. For the production of stack traces it uses gdb when available on Unix, or glibc’s backtrace facility as a fallback. On Windows it uses the Windows API to retrieve the function addresses from the stack and then uses a demangling library (libbfd) to retrieve function names. Some other info describing the user’s system is also dumped.

You can find this error handler here. It might give you some clues for how to implement something similar yourself, or you can use this implementation if the license is fine with you (GPLv2+, and I’m planning on releasing some parts of it as LGPL):
http://trac.wz2100.net/browser/trunk/lib/exceptionhandler (Subversion URL: http://svn.gna.org/svn/warzone/trunk/lib/exceptionhandler)
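For anyone who wants the general shape without reading the C code, here is a minimal sketch of the same idea in Python rather than C. It is not the Warzone handler: it swaps the Windows exception filter and POSIX signal handling for Python's `sys.excepthook`, and `build_crash_report` and `install_crash_handler` are made-up names.

```python
import platform
import sys
import traceback


def build_crash_report(exc_type, exc_value, exc_tb) -> str:
    """Format a stack trace plus a few facts about the user's system."""
    trace = "".join(traceback.format_exception(exc_type, exc_value, exc_tb))
    system_info = (
        f"platform: {platform.platform()}\n"
        f"python: {platform.python_version()}\n"
    )
    return f"Fatal error\n{system_info}\n{trace}"


def install_crash_handler(write=sys.stderr.write) -> None:
    """Route every otherwise-unhandled exception through build_crash_report."""
    def hook(exc_type, exc_value, exc_tb):
        write(build_crash_report(exc_type, exc_value, exc_tb))

    # One global hook, the same on every platform Python runs on.
    sys.excepthook = hook
```

This only covers ordinary unhandled exceptions. For hard crashes such as segmentation faults, the standard library's `faulthandler.enable()` dumps the Python traceback on fatal signals, which is closer in spirit to the signal-handler approach described above.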

In addition to the privacy concerns, this solution (logging all information necessary to debug the error) does not scale.

The type of errors we’re talking about are because “a previously unknown bug in your code causes the application to crash and burn in spectacular fashion”. So there’s no way we can trust the application to write a neat, comprehensive error log when that happens: we can’t trust it to be sane at all. The only way to meet this goal is to log everything, in huge detail — enough to debug the application in the event of a catastrophic error.

That sort of logging information, in comprehensive, verbose detail, takes lots of space. We’re talking hundreds or thousands of log entries for even simple operations, that might take a minute or even a few seconds. Multiply that by months and years of countless operations like that and the log files are huge, even if they’re rotated out regularly.

Then, if the advice of this article is to be followed, every program on your operating system — hundreds of them — should be keeping such logs all the time they’re in operation. It simply can’t scale.

So a compromise must be made: programs run with little or no logging, unless the person responsible for the disk space makes a decision to start chewing it up with verbose log output from a particular program.

If you log every program verbosely, all the time, that’s just as much a failure as the failure this article talks about. But if you don’t, then you can’t get the debugging information without the user being inconvenienced further.
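The compromise described above (quiet by default, verbose only when someone decides to spend the disk space) might be sketched like this. The logger name `"app"`, the function `build_logger`, and the size limits are illustrative assumptions, not recommendations.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(log_path: str, verbose: bool = False) -> logging.Logger:
    """Quiet by default; verbose only when whoever owns the disk opts in."""
    logger = logging.getLogger("app")  # "app" is a placeholder name
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)

    # Drop handlers from earlier calls so messages are not written twice.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()

    # Cap disk usage: at most 1 MiB per file, with 3 rotated backups kept.
    handler = RotatingFileHandler(log_path, maxBytes=1_048_576, backupCount=3)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Rotation bounds the space used, but it does not remove the trade-off: with `verbose=False` the detail needed to debug an unknown crash is simply never written.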

Provided it’s genuinely opt-in, even a stack trace can often be enough to help identify the place where some work is needed. It’s better than nothing.

Also, if you license the patented high-grade intellectual property from Microsoft you’re allowed to make a hash of the stack trace and use that to determine if the crash has already been reported (which can even save on bandwidth) and even immediately offer work-arounds and pointers to updates. But Microsoft invented and owns that idea, so don’t do it.
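For the curious, the general idea of hashing a stack trace to spot duplicates can be sketched as follows. This says nothing about how Microsoft's system actually works: `crash_signature` and `is_new_crash` are made-up names, and hashing file and function names with SHA-256 is just one assumed choice.

```python
import hashlib
import os
import traceback


def crash_signature(exc: BaseException) -> str:
    """Hash the call path (file and function names) that led to the error."""
    frames = traceback.extract_tb(exc.__traceback__)
    # Line numbers are left out so small unrelated edits do not change the hash.
    path = "|".join(f"{os.path.basename(f.filename)}:{f.name}" for f in frames)
    key = f"{type(exc).__name__}|{path}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def is_new_crash(exc: BaseException, seen: set) -> bool:
    """Return True the first time a signature is seen, then remember it."""
    sig = crash_signature(exc)
    if sig in seen:
        return False
    seen.add(sig)
    return True
```

A client could send only the short signature first and upload the full report when the server has not seen it before, which is where the bandwidth saving would come from.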

Still, I am glad to learn one thing: I used to believe that users didn’t read installer dialog boxes, but now it turns out that as long as you ask the question once when the application is installed you can be certain that 100% of your users will remember their choice and never be bothered about your gathering of allegedly anonymous information.

For all the Microsoft bashing (really just bashing patents) I’d be just as happy to sign up for access to all the data they collect and be done with it. Sure, it really is opt-in, but at least it’s there, implemented, and gets the job done. The problem is developers who don’t roll their own solution (a valid option) and ignore the existing solution. They have the data available but never use it.

It would be nice if every time your app crashed it was your fault.

In my field, people sometimes try to manually alter query strings / URLs. It’s my fault if the application gave them a broken URL, but it’s their fault if they fat-fingered something. We still must provide a nice error page, mind you.
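A rough sketch of treating a hand-edited URL as the user's mistake while still answering politely. The `id` parameter, the `render_page` function, and the message text are all hypothetical.

```python
from typing import Tuple
from urllib.parse import parse_qs


def render_page(query_string: str) -> Tuple[int, str]:
    """Return (status, body); malformed input gets a friendly 400 page."""
    params = parse_qs(query_string)
    raw_id = params.get("id", [""])[0]  # "id" is a hypothetical parameter
    if not (raw_id.isascii() and raw_id.isdigit()):
        # Likely a hand-edited or mistyped URL: explain, do not crash.
        return 400, "Sorry, that link doesn't look right. Please go back and try again."
    return 200, f"Showing item {int(raw_id)}"
```

Answering with a 400 instead of letting the bad value blow up further down also keeps these cases out of the error reports for genuine application bugs.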

There need to be different scenarios for different types of applications. ASP.NET web applications are particularly easy, so there’s no need to ask what the user was doing. But older desktop apps, and apps that don’t or can’t provide a meaningful stack trace, may need to ask what the user was doing when it crashed and/or instrument their code in a less transparent way. I really hate instrumenting an entire application with logging, and more often than not it’s completely useless information.

Another issue you must consider is who makes up the installed user base: corporate intranet, small office app, customised commercial app, shrinkwrap, www?

Ha! I still get a kick out of those ‘EOF and BOF are true’ errors that still happen from time to time with really old crappy ‘classic’ asp websites that fail to check if a recordset contains any data.

That’s the great thing about Open Source software, it’s OK if it crashes because hey, I’m not getting paid for it. :)

I’m just kidding.

Jeff,

I have been reading your blog for about six months now and have listened to all of the StackOverflow podcasts.

I must say that the ELMAH tip has been by far the most valuable piece of information I have picked up from you.

Thank you for doing your thing,

Yonah

That might be easy in the nice and friendly world of managed languages, where every error is an exception, and the OS doesn’t pull the rug out from under you when you dereference a null pointer. Not so much in languages like C++. Nevertheless, I am implementing an error reporting system for my software. On every platform it runs on.

(And it’s not that easy to do!)

Rails has Exception Notifier, which emails you a bug report, request headers, and full stack trace. This has changed my applications dramatically. By the time a client informs me of a problem I usually have already patched and updated the application.

http://svn.rubyonrails.org/rails/plugins/exception_notification/README
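The same idea, sketched in Python rather than Ruby for illustration. This is not Exception Notifier's code or API: `build_error_email`, the subject format, and the `example.invalid` addresses are made up.

```python
import traceback
from email.message import EmailMessage


def build_error_email(exc: BaseException, headers: dict, to_addr: str) -> EmailMessage:
    """Bundle the exception, request headers, and stack trace into one message."""
    msg = EmailMessage()
    # Collapse whitespace so a multi-line error message cannot break the header.
    summary = " ".join(str(exc).split())
    msg["Subject"] = f"[app error] {type(exc).__name__}: {summary}"
    msg["From"] = "errors@example.invalid"  # placeholder sender address
    msg["To"] = to_addr
    header_lines = "\n".join(f"{k}: {v}" for k, v in sorted(headers.items()))
    trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    msg.set_content(f"Request headers:\n{header_lines}\n\nStack trace:\n{trace}")
    return msg
```

The message is only built here, not sent. Handing it to `smtplib.SMTP(...).send_message(msg)` from the web framework's error hook would be one way to deliver it.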

I’ve moved away from emails to RSS subscriptions. My favorite for ASP.NET apps is definitely ELMAH

http://code.google.com/p/elmah/

Terrible name, but great implementation!

“…the first thing I do on any new project is set up an error handling framework.”

I guess it depends on what you mean by “new project” but this set off red flags for me. Ideally this is a once-per-company task. Well, ideally, there would be something simple enough in the .NET framework. Last I checked, the everything-to-everybody Exception Handling Application Block just had too much going on.

How do you handle privacy concerns when automagically sending error reports? How do you handle apps like ZoneAlarm, which not only block such communications but also pop up a blaring siren accusing your app of being bad?

Etcetera.

Kevin