Crash Responsibly

“Every time you dereference a pointer you should check if it’s null. Every Time You Dereference A Pointer You Should Check If It’s Null. EVERY TIME YOU DEREFERENCE A POINTER YOU SHOULD CHECK IF IT’S NULL.”

There are so many problems with this statement.

First, suppose the pointer is null and it's not supposed to be: what do you do? The correct strategy is to dereference the pointer and let the system throw an access violation exception, which you then handle appropriately. Page faults and the other system-wide exceptions are there for a reason; use them.
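On Windows, that can look something like this minimal MSVC-specific sketch using structured exception handling (the filter logic and the error message are just placeholders):

```cpp
// MSVC-specific sketch: let the hardware fault surface as an SEH
// exception and handle it at a well-defined boundary.
#include <windows.h>
#include <cstdio>

static int Filter(DWORD code) {
    // Only claim the exceptions we actually know how to report on.
    return code == EXCEPTION_ACCESS_VIOLATION ? EXCEPTION_EXECUTE_HANDLER
                                              : EXCEPTION_CONTINUE_SEARCH;
}

int main() {
    int* p = nullptr;  // pretend a buggy code path produced this
    __try {
        std::printf("%d\n", *p);  // faults here
    }
    __except (Filter(GetExceptionCode())) {
        // Report and bail out; don't limp along with corrupt state.
        std::fputs("access violation caught - logging and exiting\n", stderr);
        return 1;
    }
    return 0;
}
```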

Second, checking for a null pointer does not catch bad pointers that are not null. What’s the point of expending all the energy on something that can’t catch a simple uninitialized pointer?

Third, as someone has pointed out, this strategy is simply dumb and inefficient. Use a walled-garden approach instead: define unsafe and safe interfaces. The unsafe interfaces can handle bad data, including bad pointers; the safe interfaces handle only good data.

Thanks for that info, Mike. I didn't know it.

To catch a crash on Unix, just set a signal handler for signals like SIGSEGV, SIGILL, SIGFPE, SIGABRT, and SIGBUS. Lots of Linux systems also happen to have gdb (the debugger) installed, so you can also try running gdb -batch -x SCRIPT, where SCRIPT is a file containing the gdb commands you want to run to get a backtrace or whatever, then quit. ("set pagination off" and "set width 0" are also helpful there.)
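Something like this rough sketch (the handler body is an illustration only; a real handler should stick to async-signal-safe calls like write()):

```cpp
// Rough Unix sketch: hook the fatal signals, log a note, then re-raise
// so the default action (core dump) still happens.
#include <csignal>
#include <cstring>
#include <unistd.h>

extern "C" void crash_handler(int sig) {
    // Only async-signal-safe calls in here.
    const char msg[] = "fatal signal caught, aborting\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
    std::signal(sig, SIG_DFL);  // restore the default disposition
    std::raise(sig);            // re-raise to get the core dump
}

int main() {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_handler = crash_handler;
    for (int sig : {SIGSEGV, SIGILL, SIGFPE, SIGABRT, SIGBUS})
        sigaction(sig, &sa, nullptr);

    volatile int* p = nullptr;
    return *p;  // deliberately crash to exercise the handler
}
```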

• Let users know that it's our fault, not theirs.

This is a really good point, but soooooo many software shops are too arrogant to heed it. For instance, Visual Studio's debugger has a weird bug where it will only report back 'Expression cannot be evaluated at this time' whenever you try to evaluate any variable in the Immediate window. I searched around for a fix for this VS bug and all I found was this page: http://msdn.microsoft.com/en-us/library/f221hs8y(VS.80).aspx

The only advice was to change the syntax of the expression. Oh boy, so how was it my fault that the debugger couldn't evaluate a Records.Length() statement? This was a HUGE headache on my last project!!

Great post, Jeff! It really motivates me to create an error handling package that my entire team can use. I’ve been very, very lax in this in the past.

Thanks for improving my work, again.

Oops, I meant Record.Count (mixed it up with Java), but it was definitely a problem.

I agree with O.S.: there are so many software shops that either don't care or are too arrogant to admit fault. It's a great idea. The last paragraph of your post was the most important, I think. Prioritizing the problems is how we'll be able to nip them in the bud before they come home to roost.

An important consideration: how do you handle include files that don't exist? Does your application crash responsibly?

If you develop for the web, you probably notice that you upload files often. No matter what method you use for uploading, there will be brief periods where files are not available. Does your website disappear or explode when somebody visits a page while the include files don't exist?

It's something most of us don't think about, since we don't visit pages while we are uploading changes. But if you make many edits a day, realize that you might be alienating users who happen to visit while you are making changes.

Actually, a call to SetUnhandledExceptionFilter will catch unhandled .NET exceptions… after all, what do you think .NET exceptions are implemented with?

Also, in your exception handler, put a call to MiniDumpWriteDump (it's in dbghelp.dll - google it) and you can write a minidump to send back to the developers. Load this up in WinDbg (or Visual Studio) and you'll get a stack trace, registers, parameters, and more. There should be more than enough info there to debug the crash and make sure it never happens again.
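A bare-bones sketch of how those two pieces fit together (the dump filename and the minimal error handling are just placeholders; link with dbghelp.lib):

```cpp
// Sketch: write a minidump from the unhandled-exception filter.
// Error handling is kept deliberately minimal.
#include <windows.h>
#include <dbghelp.h>

static LONG WINAPI WriteCrashDump(EXCEPTION_POINTERS* info) {
    HANDLE file = CreateFileW(L"crash.dmp", GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file != INVALID_HANDLE_VALUE) {
        MINIDUMP_EXCEPTION_INFORMATION mei = {};
        mei.ThreadId = GetCurrentThreadId();
        mei.ExceptionPointers = info;
        mei.ClientPointers = FALSE;
        MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                          MiniDumpNormal, &mei, nullptr, nullptr);
        CloseHandle(file);
    }
    return EXCEPTION_EXECUTE_HANDLER;  // terminate after writing the dump
}

int main() {
    SetUnhandledExceptionFilter(WriteCrashDump);
    volatile int* p = nullptr;
    *p = 42;  // trigger the crash so the filter runs
    return 0;
}
```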

This is getting so off-topic w.r.t pointers.

Sure @Max, but equally, you can't guarantee that all code inside the walled garden always "does the right thing".

How do you define safe interfaces? Is that a new construct?
Safest == private
Marginally safer == internal
Unsafe == protected
Unsafe == public

  • note that I don’t say any are safe, but a private method should be the safest

@mikeb - no, not all null pointers are bad, but you still can't dereference a null pointer.

  • null values passed to methods can be completely legitimate.
  • what happens when null is passed is completely implementation specific.

However, if null is a valid state, the Null Object pattern can get rid of a lot of redundant null checks.
http://en.wikipedia.org/wiki/Null_Object_pattern
http://www.cs.oberlin.edu/~jwalker/nullObjPattern/
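A quick C++ illustration of the idea (the Logger classes are invented for the example):

```cpp
#include <iostream>
#include <string>

// Null Object pattern: instead of handing out nullptr and forcing every
// caller to check, hand out a do-nothing implementation of the interface.
struct Logger {
    virtual ~Logger() = default;
    virtual void log(const std::string& msg) = 0;
};

struct ConsoleLogger : Logger {
    void log(const std::string& msg) override { std::cout << msg << '\n'; }
};

struct NullLogger : Logger {
    void log(const std::string&) override {}  // deliberately does nothing
};

Logger& logger_for(bool verbose) {
    static ConsoleLogger console;
    static NullLogger silent;
    return verbose ? static_cast<Logger&>(console)
                   : static_cast<Logger&>(silent);
}

int main() {
    Logger& log = logger_for(false);
    log.log("no null check needed anywhere");  // safe either way
}
```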

I feel that if our software were designed correctly in the first place, we would have far fewer errors: http://softwareindustrialization.com/TheHumbleSoftwareEngineer.aspx

Hmm… a lot of talk about .NET exceptions.

The first comment is gold considering that C++ has exceptions.

As for:

“Every time you dereference a pointer you should check if it’s null. Every Time You Dereference A Pointer You Should Check If It’s Null. EVERY TIME YOU DEREFERENCE A POINTER YOU SHOULD CHECK IF IT’S NULL.”

Erm… NO NO NO. This attitude may make your code work, but it's not very intelligent, efficient, or good for debugging.

Every time you dereference a pointer it should have a valid value. Constantly checking to see if it's null is mindless… you shouldn't be doing things to break pointers once they are allocated, and there are other invalid values for a pointer than just NULL. Check for null on allocation only… in that situation NULL almost always means "I couldn't allocate the memory", and, as this is a specific error code being returned, it is more than acceptable to check for it.

It's safe, I'll agree, but it's terrible practice, especially in a language like C++ where you should know, or be able to work out, exactly what a pointer is pointing to based on where it is in the code. Besides that, nine times out of ten it's a bad number of loops or faulty pointer arithmetic that causes the exception to be thrown, and neither will make the pointer NULL.
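In C++ terms, plain new throws std::bad_alloc rather than returning NULL, so the check-at-allocation style only applies with new(std::nothrow) or malloc; a sketch:

```cpp
#include <cstdio>
#include <new>

int main() {
    // Check for failure exactly once, at the point of allocation...
    int* buffer = new (std::nothrow) int[1024];
    if (buffer == nullptr) {
        std::fputs("allocation failed\n", stderr);
        return 1;
    }
    // ...then trust the pointer: nothing on this path can legitimately
    // set it back to null, so further checks would just be noise.
    for (int i = 0; i < 1024; ++i)
        buffer[i] = i;
    std::printf("last element: %d\n", buffer[1023]);
    delete[] buffer;
    return 0;
}
```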

But yeah, exceptions, use them! Correct exception handling code will handle almost everything. (I’m sure there is something out there I don’t know about yet!)

This sounds a lot like my project,
but JS doesn't surface any issue there.

Hi Jeff,

maybe you should apply this to your blog: users type long replies and then lose the whole thing if there is an error and we have to hit 'back'.

:wink:

This comes from a web app angle, but I'm looking to implement this in some of my WinForms apps soon…

I got hooked on log4net a long time ago. (http://logging.apache.org/log4net) It provides amazing flexibility in a small package. I essentially have logging set up throughout my code, plus a general catch-all log for anything that bubbles to the top.

From there I wrote a web app (http://www.codeplex.com/hacksaw) that allows me to view these log files. That way the end user can simply notify me if they are having issues that I haven’t discovered yet and the log can give me the details. The logger is initially set to dump just the exception messages, but if I need to go in and get details on a particular method, I make one simple change to the config file (no recompile necessary) and I can get a full dump of the variables and such for each method.

Besides, how many users actually remember to take a screenshot of the exception message being displayed to them, let alone know what to tell you other than "my program isn't working"? 8^D

MiniDumpWriteDump shouldn’t really be called from the crashing process, much less from the crashing thread (think about “stack” and guard pages). Also, in the managed world, to be able to use sos on the dump, the dump is not really going to be “mini” any more.

In addition to the AppDomain.UnhandledException event, System.Windows.Forms.Application.ThreadException should be subscribed to as well, to hook unhandled exceptions from the message loop.

There is no good Windows Error Reporting story for .NET. Also, WER/OCA is useful only for signed binaries; everything else just goes down the big drain in Redmond. (VeriSign-signed, BTW. I can't believe all the places these guys manage to hold their hands out.)

Crash dump and logfile privacy is a huge issue. Think "compliance": SOX or FDA, or whatnot. Also, not all of your clients will have a persistent, cheap internet connection ready for you to send out e-mail or tons of data.

There are many good logging frameworks, beyond or in connection with EnterpriseLibrary, mostly adding value through great aggregation and analysis tools, e.g. SmartInspect.

Lastly, not too many people know Microsoft's kernel-based, high-performance "Event Tracing for Windows" (ETW) for drivers and applications, although it's a hot runner-up for "Greatest Thing Since Sliced Bread": it decouples trace source from trace sink and is claimed to process a log entry in "only 1500-2000 cycles, depending on settings". So it is long finished when EntLib hasn't even figured out which LogSink factory to invoke.

There is a managed ETW wrapper in .NET 3.5, but only for Vista, and it didn't get a lot of love from its developers, it feels.

Hmm. I got off topic.

We have, for our Windows apps, a dialog that reads "We apologize for the inconvenience." and lets the user save some error information (all exception info, a dump, a screenshot) to the local application data folder for support personnel. It's the least we can do.

But in the end, users will dismiss the dialog as fast as anything else they deem usual software noise.

One can find more info about Windows Error Reporting here: https://winqual.microsoft.com/default.aspx

And I think that you can get a discounted cert for $100.

I didn’t say anything about dereferencing a NULL pointer being OK - what I said is that “not all bad pointers are NULL pointers”, meaning you can have a pointer that has non-NULL garbage in it. So even if you check for NULL pointers before each and every dereference you would still want the error handling infrastructure that Jeff is advocating.

However, my main point is that even if your policy is to check for bad pointers (NULL or otherwise) at every opportunity, that policy still does not help the user or your support process for the situation where you make a mistake in implementing that policy. Bugs happen - having an error handler in place will help to identify and mitigate the problems when your program crashes.

Good post. I recommend sending error reports in XML!

This, however, is not a good recommendation:
“Leverage the 80/20 rule”

Joel Spolsky once said about the 80/20 feature rule that everyone needs a different 20% subset of the features, which is why "light" editions rarely work unless they're really not "light" at all, like Photoshop Elements.

In the case of error reporting, 80% of your users might well encounter the same 20% of bugs but many will also encounter some other (perhaps minor) bugs – just different ones for each user.

The 80/20 rule becomes problematic if you take the Microsoft stance of never fixing bugs unless they're widely reported, even when the fix is obviously trivial. That means a lot of your users will keep encountering a zoo of annoying little bugs, and the general impression becomes that your software is shoddy, even though it may not have major issues. That doesn't bother our monopolist, but it should bother anyone who is not a monopolist. The bugs in the lower 50% of frequency still contribute to the subjective impression of how polished your software is.

Also, Microsoft’s attitude had the effect on me that I stopped reporting issues to Microsoft Connect which I had once done quite frequently – back when they actually fixed them. Now I know that anything I report won’t get fixed unless many others report the same thing, so why should I bother? Of course from Microsoft’ perspective this may look like their software is miraculously bug-free now, assuming others no longer bother reporting issues either…

@bignose:
“The type of errors we’re talking about are because “a previously unknown bug in your code causes the application to crash and burn in spectacular fashion”. So there’s no way we can trust the application to write a neat, comprehensive error log when that happens: we can’t trust it to be sane at all.”

Here’s what I don’t get: you have an exception handler in place. It gets triggered. You check the exception and it’s certainly not something you can recover from. What is the problem in the following scenario? You a) check that your exception handler code hasn’t been modified (a quick checksum will do, just to make sure there were no mem overwrites), b) try to create an output for logging and if that works, c) write out the details of the exception, then you d) check the validity of the memory of the stack and if valid write that out, e) check the validity of your data pointers and if valid f) start validating data and if valid g) try to save it down. Then h) quit/restart.
Where does the assumption that you cannot trust anything at all come from? The fact that your exception handler executes means that there is at least one thing you can trust - so start working from that.
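As a rough sketch of that spirit (the pre-opened log file and the handler are inventions for illustration; the point is that the crash path only consumes resources acquired up front, while things were still sane):

```cpp
// Sketch: make the crash path depend on as little live state as possible
// by acquiring its resources at startup.
#include <cstdio>
#include <cstdlib>
#include <exception>

static std::FILE* g_crash_log = nullptr;  // pre-opened while things were sane

[[noreturn]] static void on_terminate() {
    // (b)/(c) We already have somewhere to write; record what we can.
    if (g_crash_log) {
        std::fputs("unhandled exception, shutting down\n", g_crash_log);
        std::fflush(g_crash_log);
    }
    // (h) Quit; "recovering" here would mean trusting corrupt state.
    std::abort();
}

int main() {
    g_crash_log = std::fopen("crash.log", "a");  // hypothetical path
    std::set_terminate(on_terminate);

    throw 42;  // simulate the crash-and-burn failure
}
```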

Regards
Fake