What's Worse Than Crashing?

Here's an interesting thought question from Mike Stall: what's worse than crashing?


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2007/08/whats-worse-than-crashing.html

Network BDE/Paradox apps corrupt data all the time - it’s our number one cause of support calls. I’ll be glad to be rid of it.

Jeff,

It seems to me the fail-fast article you linked to is not referring to all software development. Rather it is referring to the development of in-house software for corporations. In such an environment you can be informed of - and respond quickly to - “fail-fast” crashes. In this situation it is a really good tool for flushing out bugs that made it into production.

However as a general principle of software development, it may not be quite so good. Shrink-wrap software probably needs to have more effort made to catch errors and intelligently recover from them.

I work on an embedded system that handles money (no, it's not an ATM). We design software before writing it, we test software before giving it to QA, QA tests it before giving it to the regulators, regulators test it before giving it to the client, and clients test it before rolling the software out.

A crash at any of these steps sends us back to square one. Buggy software is not allowed to escape.

Problems with software bugs are never technical; they are financial. If money can be made by releasing buggy software, then that is exactly what will happen. If it costs more money to release buggy software than to spend the time to actually do it right, then good-quality software gets created.

At the end of the day, the choice is up to the customer. Don't settle for second best.

I disagree with the fail-fast approach… you could isolate a copy of the data the user is working on and still let them work their way out by themselves.

still waiting for your “Why Developers hate Software” post :wink:

smallstepforman - the problem is that there are rarely first best options. It’s hard to vote with your wallet when there’s nothing worth voting for.

Anyway, I think one issue is the relative crappiness of most exception systems. What is really needed is a system which enables higher up code to catch exceptions and delegate to a lower level handler. In other words, the higher up code specifies the policy with regards to the exception, while the lower code specifies the specifics. I believe something like this exists in Common Lisp.
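The Common Lisp condition system the comment alludes to separates policy from mechanism: low-level code offers named recovery strategies ("restarts"), and high-level code picks which one to apply. A loose Python sketch of that idea (not actual Lisp; all names here are hypothetical):

```python
# Hypothetical sketch of condition-system-style handling: the low-level
# code knows HOW to recover, the high-level code decides WHICH recovery
# to use. None of these names come from a real library.

def parse_entry(text, on_bad_entry):
    """Low-level code: offers the available restarts."""
    try:
        return int(text)
    except ValueError:
        restarts = {
            "use_default": lambda: 0,     # substitute a safe value
            "skip": lambda: None,         # drop the bad entry
        }
        return restarts[on_bad_entry]()

def load_entries(lines):
    """High-level code: sets the policy for bad data."""
    return [parse_entry(line, on_bad_entry="use_default") for line in lines]

print(load_entries(["1", "oops", "3"]))  # [1, 0, 3]
```

The point is that the caller chooses the strategy without needing to know how parsing works internally, which is roughly what the GUI dialog described below the comment would expose to the user.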

The way this might manifest itself, GUI-wise, would be a dialog notifying the user of the pertinent info, with a button allowing selection of a strategy to deal with it. A checkbox with "backup project to new file" might be a good idea :slight_smile:

Sure - it's hard to write a bug-free application, but you can - by design - make sure that errors don't cause data corruption or failure. And testing, QA and unit testing will help get you to a point where the two or three little errors that remain at least do no harm.

I am - like most others who commented here - a software developer for embedded systems. Crashes or degraded behaviour are not an option. My life, and others', might one day depend on code I've written.

One trick to getting stable software is to handle errors as they occur and propagate them up to the caller. At a higher level of abstraction you will have the chance to handle hard errors such as out of memory. Had bad luck displaying an incoming message? Maybe next time you'll have better luck because some background process has finished and freed some memory. If not, give up after x tries but keep your data intact.
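That retry-then-give-up pattern might look like this sketch (Python for illustration; `display_message` and `MAX_TRIES` are made-up names, not a real API):

```python
# Hedged sketch: retry a fallible operation a few times, and if it keeps
# failing, give up cleanly without losing the data being processed.

MAX_TRIES = 3

def display_message(message, render):
    """Try to render a message; return None after MAX_TRIES failures."""
    for attempt in range(MAX_TRIES):
        try:
            return render(message)
        except MemoryError:
            continue  # maybe a background task frees memory next time
    # Give up, but the message itself stays intact for a later attempt.
    return None

# Usage: a render function that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_render(msg):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise MemoryError
    return f"shown: {msg}"

print(display_message("hello", flaky_render))  # shown: hello
```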

What does the alternative look like? You got a null pointer and didn't handle it. Some code calling you expected that everything went well and writes some data into nirvana. This could cause funny behaviour or crash the entire system.

All vital subsystems shouldn't use dynamic memory or write to files anyway. They are more or less autonomous as long as no one passes garbage to them or corrupts their data via bad pointers. What could go wrong at this level is data corruption due to writes from the outside (use asserts here during development) and floating-point NaNs creeping in and propagating themselves into the guts of your database.

You have to check for those! errno and low-level exception handling are your friends here.
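The NaN check the comment calls for can be sketched like this (Python for illustration; the function and database names are hypothetical):

```python
# Sketch: reject non-finite values at the boundary instead of letting
# them propagate into stored data. NaN compares unequal to everything,
# including itself, so it spreads silently unless explicitly checked.

import math

def store_reading(db, key, value):
    """Refuse to store NaN or infinity; fail loudly at the boundary."""
    if not math.isfinite(value):
        raise ValueError(f"refusing to store non-finite value for {key!r}")
    db[key] = value

db = {}
store_reading(db, "temp", 21.5)
try:
    store_reading(db, "temp", float("nan"))
except ValueError as e:
    print(e)
print(db)  # {'temp': 21.5}
```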

It looks like a difficult and time consuming process to follow these rules, but it is not. If you do it from the start it’s quite easy, and after a month or two it becomes second nature.

Uh, smallstepforman got there first, but I’ll reiterate his point anyway. A program that crashes should never make it through testing. From a testing standpoint, a program that crashes soon at an obvious place is much easier to diagnose and fix than one that tries to recover, messes up and then crashes, obscuring the real cause of the bug. You’re right that the end-user should never see a crash: a program that buggy shouldn’t even reach production!

Crashing is good. It is an essential reminder that despite our self-made ideas of supremacy over technology, it is indeed a minor miracle that most any of this crap works much of the time.

One benefit of a web app is that it makes error reporting so much easier, as you are pretty much assured they have a way of communicating with you (the Internet).

I think, in light of this, that turning the previously suggested error logging into a silent error report could be a very powerful tool for the future (indeed it's already used in some areas of serving websites). Where you would normally fail fast in a debug build, you could now feasibly, in production builds, send these as minor error reports and then try, if it makes sense, to recover. Somewhere close to the best of both worlds, I feel.
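That report-then-recover idea might look like this sketch (Python, with an in-memory list standing in for the real reporting channel; all names are hypothetical):

```python
# Sketch: a debug build fails fast, a production build records the
# error for the developers and then attempts a safe recovery. A real
# app would POST the report to an error service instead of a list.

import traceback

DEBUG = False
error_reports = []  # stand-in for a network reporting channel

def report_or_raise(exc):
    if DEBUG:
        raise exc  # fail fast while developing
    error_reports.append(
        "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    )

def load_preferences(read):
    try:
        return read()
    except OSError as exc:
        report_or_raise(exc)
        return {}  # safe, recoverable default

def broken_read():
    raise OSError("disk error")

prefs = load_preferences(broken_read)
print(prefs, len(error_reports))  # {} 1
```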

I would agree that people probably try to recover too often when they shouldn’t. If you don’t know exactly how to recover, don’t try.

I don’t think the article ever talked about releasing programs that fail as soon as you start them; of course you wouldn’t release that.

However, bugs will eventually happen in all programs. Do you then try to hide them, or crash so that it’s easier to find the actual bug? That’s what this article is about.

Bugs will happen no matter how much QA you have; it might be something totally unrelated to your program, such as a hardware failure on the client’s computer. But if you try to hide a problem by naively thinking you can fix everything, then you will have numerous other problems and a really hard time finding the original problem.

Fail fast is the only methodology that makes sense. Compare it to any other system in the world: if a car engine somehow attempted to fix a serious issue with itself, you would surely make the whole car life-threatening. Why would you think software should attempt to fix serious issues? Fail fast, then diagnose and fix the real issue.

I read that article a few months ago and I have to agree. Catching exceptions can be problematic and cause other issues to crop up later. A well-debugged program shouldn’t need to patch or hide its errors. They should be shown in their full glory; that way everyone knows the deal and it can be dealt with quickly.

Here’s one I’ve never been able to recover from or decipher. I have a web application that I’m developing. After a few days of running I get random errors, and only on my machine. The error is too many elements in an array. The only way to fix it is to log out and back in. Thankfully it hasn’t been reproduced in production, but I think of the hours I lost tracking this ghost bug. A bad memory sector, perhaps?

There’s a mnemonic for “fail fast”: the code must live by the samurai code of honor. If life isn’t perfect, it’s time for seppuku :-).

I’ve switched two relatively large projects to fail fast. In both cases there was resistance from users, QA and developers at first; more like disbelief: “What?! Are you actually telling us that crashing is good?” But it quickly diminished when they saw the quick turnaround on bug fixes and the disappearance of tricky bugs and data corruption.

It is very easy to explain to users/QA; in my experience they understand it well. Just tell them that the alternative is corrupted data and tricky bugs taking forever to fix, versus one crash that they will never see again.

My comment is not superbly relevant, but this reminds me of the philosophy of “degrading with grace” when it comes to CSS design for websites. I stumbled upon that many moons ago when I read a tutorial on CSS by one of the guys at Webmonkey (I wonder if that’s still up).

I would just add a number 7 on the list:
7. Application crash/error causes harm or physical damage (for instance through the machinery it controls, or in medical devices, or for instance a power outage)

As a few have pointed out, failing fast and visibly is very good during testing, whereas it is maybe not as good during production use (depending on what “production use” actually is). Anyway, assertions, available in most programming languages, are a good tool for that job. They fail fast and visibly during debugging/testing, but are usually not even activated in production use.
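Python’s `assert` statement behaves exactly this way: it fails loudly during development, but is stripped entirely when the interpreter runs with `-O`, so production pays no cost. A small example (the function is illustrative):

```python
# assert fires during normal runs, but `python -O script.py` compiles
# the statement away entirely, matching the debug/production split
# described above.

def apply_discount(price, fraction):
    # Active during testing; removed under `python -O`.
    assert 0.0 <= fraction <= 1.0, f"bad discount fraction: {fraction}"
    return price * (1.0 - fraction)

print(apply_discount(100.0, 0.25))  # 75.0
```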

What the?

Hasn’t anyone heard of “log files”? When something goes wrong, you immediately log the error message (one meaningful to the developers) and then try your damnedest to recover from the problem. Then, when/if the application does crash, instead of crashing somewhere obscure, you have a detailed trace of the problem.

I think the more sensible solution is:

  1. When in doubt (i.e., something larger is at stake, such as corrupted data), fail fast and loudly. This might mean switching the app to a “read-only” state if possible, though (it’s often useful to have a hobbled-but-not-dead system to poke at to see what precisely caused the error).
  2. When you can fix it, and are sure you can fix it, fix it. But, first, log it.
  3. In development, QA, and beta testing, all logs should be open and perused constantly while the application is running. After beta, in production, these logs can be ignored until the customer calls tech support wondering why XYZ is acting funny.

In general, this fixes all errors. And no, I would NOT want my car to “fail fast”! The timing gets jostled a bit on a country road and it just stops? No thank you! I want my car to (as it does) “overcome” any issue it can, AND let me know about it both effectively and non-disruptively (throwing itself into Park while hurtling down the freeway at 70 mph is very effective communication, but slightly disruptive). That’s why it makes funky noises when something’s wrong, lights come on, and little diagnostics get recorded to the onboard computer for a technician to decode with his horrendously overpriced code reader.

Log. Log. Log.
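The “log it first, then fix it” rule above might look like this sketch (Python’s standard `logging` module; the config-reading function and fallback value are hypothetical):

```python
# Sketch: record a trace meaningful to developers at the moment of
# failure, then attempt recovery. If the app crashes later anyway, the
# log shows where the trouble actually started.

import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("app")

def read_config_value(config, key, fallback):
    try:
        return config[key]
    except KeyError:
        # Log with a full stack trace for the developers, then recover.
        log.exception("missing config key %r, using fallback %r", key, fallback)
        return fallback

print(read_config_value({"host": "db1"}, "port", 5432))  # 5432
```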

A theme I see more and more is using exceptions to control program flow, and then sometimes failing to properly deal with the exception.

I am surprised that I have to constantly make the statement: “If a condition that would cause an exception can be caught and dealt with, then it should be caught and dealt with before it throws the exception.” I find this very CS101. Exceptions should be left for the unknown, not the known.
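That statement can be sketched with a small contrast (Python for illustration; both functions are made up):

```python
# Sketch: a condition you can anticipate should be checked before it
# ever raises, leaving exceptions for the genuinely unknown.

def average_checked(values):
    # The empty-list case is a known condition: handle it, don't catch it.
    if not values:
        return 0.0
    return sum(values) / len(values)

def average_exceptional(values):
    # Anti-pattern: using an exception to control an expected flow.
    try:
        return sum(values) / len(values)
    except ZeroDivisionError:
        return 0.0

print(average_checked([]))      # 0.0
print(average_checked([2, 4]))  # 3.0
```

Both return the same results, but the first makes the known case explicit, while the second buries it in exception machinery that was meant for surprises.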

I also think developers sometimes fall into the trap of thinking that they do not need to validate data at deeper levels because a higher-level method would have “found it already”. Every piece of data has to be validated at every level. This will slow down the program a bit, but it’s much better than relegating your end-users to sitting around and twiddling their thumbs while you go out on bug patrol.
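Validate-at-every-level might look like this sketch (Python; the layer and field names are illustrative, not from any real codebase):

```python
# Sketch of defense in depth: each layer re-validates what it receives
# instead of trusting its caller to have done so.

def save_user(record):
    # Deep layer: validates again, even though the caller already did.
    if not isinstance(record.get("email"), str) or "@" not in record["email"]:
        raise ValueError("save_user: invalid email")
    return dict(record)  # stand-in for a database write

def register(form):
    # Higher layer: validates the raw user input first.
    email = form.get("email", "").strip()
    if "@" not in email:
        raise ValueError("register: invalid email")
    return save_user({"email": email})

print(register({"email": " alice@example.com "}))
```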

And one more thing…

The excuse of “we weren’t given the time to do it right” doesn’t wash.

Giving your manager a 90% bug-free app and being overdue will get you a better performance review than giving your manager a 50% bug-free app on-time.

I would rather be labeled a slow programmer than a bad (or worse, a half-a$$ed) programmer.