Crash Responsibly

Given this (as well as the referenced “what’s worse than crashing” entry), what are your thoughts on this post from an MS blogger?

http://blogs.msdn.com/eric_brechner/archive/2008/05/01/crash-dummies-resilience.aspx

TH, I talked about “Fail Fast” here:

http://www.codinghorror.com/blog/archives/000924.html

@mikeb - my apologies. I hadn’t grokked the extent of your comment. As you say, checking pointers is just one small part of a good error handling strategy.

@Liron Levi: Please exclude personal data from your logging and encrypt the files reliably before they go over the wire into your inbox and delete them as soon as possible. Thank you for handling data responsibly.
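
Henry's first point, keeping personal data out of the log in the first place, can be partly automated. Here is a minimal sketch that assumes Python's standard `logging` module. The hypothetical `RedactEmailFilter` masks anything shaped like an e-mail address before the record is written. The regex is an illustrative assumption and catches only one kind of personal data. Encrypting the files before upload would need a separate crypto library and is not shown.

```python
import logging
import re

# Illustrative pattern only: common e-mail shapes, not every kind of personal data.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")


class RedactEmailFilter(logging.Filter):
    """Hypothetical filter that masks e-mail addresses in log messages."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then scrub the result.
        message = record.getMessage()
        record.msg = EMAIL_RE.sub("<redacted>", message)
        record.args = ()  # the message is already fully formatted
        return True  # keep the record, just scrubbed


if __name__ == "__main__":
    logger = logging.getLogger("app")
    handler = logging.StreamHandler()
    handler.addFilter(RedactEmailFilter())
    logger.addHandler(handler)
    logger.error("Login failed for %s", "jane.doe@example.com")
```

The filter is attached to the handler, so everything that handler writes is scrubbed. Attaching it to the logger instead would scrub records for every handler on that logger.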

From my experience, I’d have to say that rule #4 worked the best for us (not that the other rules are meaningless…they definitely aren’t). I hate it when a bug, small or not, reaches the customers. However, it seems like nine times out of ten, fixing that one bug resolves all of the customer’s problems.

Also, that first picture is awesome! I love Fight Club!!!

“Which doesn’t help you catch null pointer dereferences since they don’t throw C++ exceptions.”

Good point. This is what I get for being a smarty pants.

Still… nobody should be struggling with null pointer dereferences anyway… :)

I am a programmer, and most of the time I make sure my program logs its errors and actions to a text file. After reading this article, I remembered a few applications I use that urge you to send the dump file they create to a listed e-mail address. A few times I did send it, though I was never sure whether the company did anything with it. The results have been mixed: in some applications it did help, while in others the error persisted (maybe they fixed it, maybe not, I’m not sure). One application, which I won’t name, used to generate an error report and ask the user to fill in the details and submit it, but the whole process was very slow. Come to think of it, Dr. Watson’s error log actually makes sense to read: it is not easy to interpret, but at least it makes some sense. Anyway, I am thinking of making my application automatically send an e-mail to a particular address whenever it crashes.
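
The plan in that last sentence can be sketched as follows. This assumes a Python application, since the commenter’s language isn’t stated. The idea is to replace `sys.excepthook` so that an unhandled exception is written to a crash file first and is sent only if the user agrees. `install_crash_hook`, `ask_user` and `send_report` are hypothetical names. The actual e-mailing is left to whatever `send_report` the application supplies.

```python
import os
import sys
import tempfile
import time
import traceback


def write_crash_report(exc_type, exc_value, exc_tb, directory=None):
    """Write the traceback of an unhandled exception to a uniquely named text file."""
    if directory is None:
        directory = os.path.join(tempfile.gettempdir(), "crash-reports")
    os.makedirs(directory, exist_ok=True)
    # mkstemp guarantees a unique name even if two crashes share a timestamp.
    fd, path = tempfile.mkstemp(
        prefix=time.strftime("crash-%Y%m%d-%H%M%S-"), suffix=".txt", dir=directory
    )
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.writelines(traceback.format_exception(exc_type, exc_value, exc_tb))
    return path


def install_crash_hook(ask_user, send_report, directory=None):
    """Route unhandled exceptions through a crash report before the default handler.

    ask_user(path) -> bool and send_report(path) are hypothetical callables the
    application supplies: one asks for consent, the other does the e-mailing.
    """

    def hook(exc_type, exc_value, exc_tb):
        try:
            path = write_crash_report(exc_type, exc_value, exc_tb, directory)
            if ask_user(path):  # never send anything without the user's approval
                send_report(path)
        finally:
            # Keep the default behaviour too: print the traceback to stderr.
            sys.__excepthook__(exc_type, exc_value, exc_tb)

    sys.excepthook = hook
    return hook
```

You would call `install_crash_hook(...)` once at startup. This hook only sees Python-level exceptions. A crash inside native code (the SEH case raised further down) can take the process down before any hook runs, which is one reason crash reporters often run out of process.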

There is a nice passage about this in Mostly Harmless, from The Hitchhiker’s Guide to the Galaxy series:

(It was, of course, as a result of the Great Ventilation and Telephone Riots of SrDt 3454, that all mechanical or electrical or quantum-mechanical or hydraulic or even wind, steam or piston-driven devices, are now required to have a certain legend emblazoned on them somewhere. It doesn’t matter how small the object is, the designers of the object have got to find a way of squeezing the legend in somewhere, because it is their attention which is being drawn to it rather than necessarily that of the user’s.

The legend is this:

“The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair.”)

Awesome images! Which airlines/aircraft did these come from?

@Daniel Serodio: The airline image on the right comes from the movie Fight Club.

In the case of an SEH exception, you can’t trust your application to log the crash itself, because the process’s memory may be corrupted, the heap may be destroyed, I/O libraries may be in a bad state, etc.

And, in general, we should not send any data without the user’s approval.

@Chris

“The 80/20 rule now becomes problematic if you take the Microsoft stance of never fixing bugs unless they’re widely reported, even when the fix is obviously trivial. That means a lot of your users will keep encountering a zoo of annoying little bugs, and the general impression becomes that your software is shoddy, even though it may not have major issues. Doesn’t bother our monopolist, but should bother anyone who is not a monopolist. Those bugs in the lower 50% of frequency still contribute to the subjective impression of how polished your software is.
Also, Microsoft’s attitude had the effect on me that I stopped reporting issues to Microsoft Connect, which I had once done quite frequently – back when they actually fixed them. Now I know that anything I report won’t get fixed unless many others report the same thing, so why should I bother? Of course, from Microsoft’s perspective this may look like their software is miraculously bug-free now, assuming others no longer bother reporting issues either…”

Others in my company and I have all had the exact opposite experience. I submitted a couple of bugs for Visual Studio that literally no one else had ever reported. One was fixed for VS2008, and another was marked no-repro. However, it was reopened soon after and slated for the next version. Throughout the whole process Microsoft was very attentive and helpful in researching the issues.

I don’t think I did anything special. I submitted every piece of information I could possibly gather, all the stuff I would ask for if I were fixing a bug: state information, repro steps, machine information, etc. Sometimes as developers we forget that sort of thing when submitting problems with someone else’s software, and just fall back to “it’s broken”.

The same is true of others in the company. They’ve all gotten bugs looked at and fixed. We’ve even gotten tweaks (not bugs, mind you, but tweaks) considered and approved during testing of a Release Candidate. We’ve also had one issue escalated up to Scott Guthrie. He said that nobody had ever had that issue before, but he was still willing to work with us to get it addressed.

Maybe lots of people do have issues getting their concerns addressed with Microsoft, but I’ve never had anything but success and good experiences.

Too bad Microsoft never learned this lesson:
http://dotmad.blogspot.com/2008/05/throw-exceptions-responsibly.html

What about this kind of bug:
http://folklore.org/StoryView.py?project=Macintosh&story=Disk_Swappers_Elbow.txt

Part of “crashing responsibly” is not actively punishing your users. The size and complexity of PC applications means that it isn’t directly comparable, but the way the iPhone handles problems is brilliant.

In short: as a user you’re never really informed of a fatal error. This kind of sounds bad (especially as a developer) but it works well because you rarely lose any data and re-launching the application is almost instant.
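
The iPhone approach works because little is lost when an app dies. One way to get a similar effect in your own application, without any claim about how the iPhone itself does it, is to save state often and atomically. Here is a minimal sketch that assumes JSON-serializable state and Python’s standard library. The `save_state`/`load_state` names are hypothetical. Writing to a temporary file and then renaming it over the old one means a crash mid-save never leaves a half-written state file behind.

```python
import json
import os
import tempfile


def save_state(state, path):
    """Persist state atomically: write a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic rename on POSIX file systems
    except BaseException:
        os.unlink(tmp_path)  # don't leave stray temp files after a failed save
        raise


def load_state(path, default=None):
    """Restore the last saved state, or fall back to a default on first launch."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

Call `save_state` after every meaningful change and `load_state` at launch. A crash then costs at most the change that was in flight, and relaunching simply picks up where the user left off.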

More here: http://www.zx81.org.uk/computing/opinion/error-mishandling.html

hahahahahahahahahahah, what’s with that picture? Nobody does this.

Maybe I’m not abusing it enough, but I think I’ve only seen VS 2005 crash once.

I don’t use FF3b5 anywhere near as much as Opera 9.5b2 on my Mac, but Opera is a lot more crash-prone for me. However, I still can’t live without searching from the address bar (CMD+T, “g search terms”; I’ve also set up i for google image search, yt for youtube, etc), or paste and go. Paste and go is probably Opera’s killer feature at this point.

On privacy: if you ask the user if they want to send a crash report they will probably drag the window as far off the screen as possible and keep working. However, I make a point of sending crash reports! Except if it’s my own software.

I’m developing a medical call center application (client/server).
I have the following rules when writing the software:

  1. I place detailed error logging code everywhere in the code where execution should never reach (but actually does, due to a bug). I use try/catch clauses a lot (most of the time only for logging purposes). I use other severity levels as well, but the error logging is important because:

  2. Whenever an error log is created - my log4net appender will automatically prepare a zipped version of the log file and send it to a special email account. I can do this because I have a responsibility to make sure that the software runs 24x7 with minimum downtime. My customers even appreciate this because many times I know about problems before they do and am able to prepare quick fixes.

This simple scheme gives me near real-time status for all my installations. When a problem does occur, I have a detailed log file with enough information (stack traces are invaluable in this respect) to fix the problem.
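
The appender Liron describes is log4net-specific. As a rough analogue only, assuming Python’s standard `logging` module rather than log4net, a hypothetical `ZipOnErrorHandler` can zip the current log file whenever an ERROR record arrives. It then passes the bytes to an application-supplied `send` callable, which might e-mail them. This relies on the file handler being attached first, so the triggering record is already on disk when the zip is made.

```python
import io
import logging
import os
import tempfile
import zipfile


class ZipOnErrorHandler(logging.Handler):
    """Hypothetical handler: on ERROR or worse, zip the log file and hand it off.

    `send(data, record)` is an application-supplied callable (e.g. one that
    e-mails the zipped bytes); it is an assumption, not part of `logging`.
    """

    def __init__(self, log_path, send, level=logging.ERROR):
        super().__init__(level)
        self.log_path = log_path
        self.send = send

    def emit(self, record):
        try:
            buffer = io.BytesIO()
            with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
                zf.write(self.log_path, arcname=os.path.basename(self.log_path))
            self.send(buffer.getvalue(), record)
        except Exception:
            self.handleError(record)  # reporting must never crash the application


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "app.log")
    logger = logging.getLogger("callcenter")
    # Added first, so the FileHandler writes (and flushes) the record before zipping.
    logger.addHandler(logging.FileHandler(log_path))
    logger.addHandler(ZipOnErrorHandler(
        log_path, send=lambda data, rec: print(f"would send {len(data)} bytes: {rec.getMessage()}")))
    logger.error("Dispatch queue stalled")
```

Zipping in memory keeps the original log untouched. A production version would likely also rate-limit sends so a burst of errors doesn’t flood the inbox.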

In my opinion logging is vastly underrated in our industry. After 14 years of active development I came to the conclusion that no amount of logging is too much. Now, I know the arguments against logging (it slows the software, takes too much disk space, produces useless information, etc.).

This is rubbish, pure rubbish. Today, most of my time is spent on actually solving a bug and only a fraction is spent in an attempt to decipher what caused it in the first place. This would not be possible without the level of details I get from my logs.

Now, I’m not saying this scheme is perfect. I have tons of ideas on how to improve it further, but this way I have 80% of the problem solved.

Hope this helps
Liron

@Henry Boehlert: my logs don’t contain personal information. Encrypting the logs may be part of a more comprehensive system I’m thinking of (as I alluded to in the previous post).

One of the problems with the current log system I’m using (log4net plus special customizations) is that it is not well suited to offline analysis with tools. For example, I want to use the logs as a means of collecting usability information like window usage statistics, the number of steps needed to accomplish a specific task, etc. Many times I want to log the state of complete objects. These things are difficult to do with log4net.

What I’m thinking of is a complete DB based log (something simple like SQLite) with support for different types of logging events (e.g., special event records for user actions, another for performance counters etc). This will come with a generic log viewer that is able to display all events on a time line, complete with serialized logged objects, special logging event types etc.
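
A minimal sketch of that idea, assuming Python’s built-in `sqlite3` module, uses one `events` table with a `kind` column for the different event types and a JSON `payload` column for serialized objects. It adds a time-ordered query that a viewer could build its timeline from, and a grouped count for usage statistics. The `EventLog` class, the schema and the event kinds are hypothetical, not an existing tool.

```python
import json
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    id      INTEGER PRIMARY KEY,
    ts      REAL NOT NULL,   -- seconds since the epoch
    kind    TEXT NOT NULL,   -- e.g. 'user_action', 'perf_counter', 'error'
    name    TEXT NOT NULL,
    payload TEXT             -- JSON-serialized object state, if any
)
"""


class EventLog:
    """Hypothetical SQLite-backed event log with typed records and JSON payloads."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(SCHEMA)

    def log(self, kind, name, obj=None):
        # default=str keeps logging from failing on objects json can't encode.
        payload = None if obj is None else json.dumps(obj, default=str)
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO events (ts, kind, name, payload) VALUES (?, ?, ?, ?)",
                (time.time(), kind, name, payload),
            )

    def timeline(self, kind=None):
        """Return events in time order, optionally filtered by kind, payloads decoded."""
        sql = "SELECT ts, kind, name, payload FROM events"
        params = ()
        if kind is not None:
            sql += " WHERE kind = ?"
            params = (kind,)
        rows = self.conn.execute(sql + " ORDER BY ts, id", params).fetchall()
        return [(ts, k, n, None if p is None else json.loads(p)) for ts, k, n, p in rows]

    def counts(self, kind):
        """Count events of one kind by name, e.g. how often each window was opened."""
        rows = self.conn.execute(
            "SELECT name, COUNT(*) FROM events WHERE kind = ? GROUP BY name", (kind,)
        )
        return dict(rows)
```

Keeping every event type in one table makes the timeline a single ordered query. A separate table per type would be the alternative if the payloads needed real columns for indexing.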

Only problem is I have more urgent priorities at the moment :(

Liron

I’ve actually seen the Windows error dialog come up for a .NET application with a global exception handler already installed. I can only surmise that it was out of memory or was in some weird hardware state that actually fried the .NET framework. The thing is, the application would start, but every time it actually tried to access any data or web services it would crash.

Not surprisingly, all it took was a reboot to solve it. I’m not really sure if it would even have been possible to do anything proactive, given that the crash was obviously happening deep within the bowels of the framework and therefore totally out of my control.

I agree with the approach, more or less, but sometimes technology just fails. You’re likening the situation to an airplane crash, but it’s a false analogy because you’ll know about that kind of “crash” well in advance. Sometimes with software, by the time you know that there -is- a problem, it’s already too late to do any damage control. Not always, but sometimes.