Just Try Again

It's funny because it's true:

"I know", said the Departmental Manager, "Let's have a meeting, propose a Vision, formulate a Mission Statement, define some Goals, and by a process of Continuous Improvement find a solution to the Critical Problems, and we can be on our way."

This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2005/07/just-try-again.html

Sure, that’s what they said about Skynet, too… :wink:

The problem with a lot of software is that it's a sealed box operating in the wild.

It's sealed, so the user (or sometimes the developer) can't just peer in and say, "Oh, I see what's going wrong there."

Also, when a program goes bang, unless I'm there to see and work with the problem, it's a lot harder to work out what's going wrong.

For a new website I've been developing for the last year, one of the key components is that every single error that occurs on the website is logged with as much detail as possible.
In addition, it will text message/page/call pre-selected person(s) for certain critical problems.
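A minimal sketch of that approach using Python's standard `logging` module, assuming a hypothetical `page_on_call_staff` function standing in for a real SMS/paging gateway:

```python
import logging

# Hypothetical notifier; in practice this would call an SMS or paging gateway.
def page_on_call_staff(message: str) -> None:
    print(f"PAGE: {message}")

class PagingHandler(logging.Handler):
    """Escalates CRITICAL records to the pre-selected person(s)."""
    def emit(self, record: logging.LogRecord) -> None:
        if record.levelno >= logging.CRITICAL:
            page_on_call_staff(self.format(record))

logger = logging.getLogger("site")
logger.setLevel(logging.DEBUG)

# Log every error with as much detail as possible.
file_handler = logging.FileHandler("site_errors.log")
file_handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s %(name)s %(message)s"))
logger.addHandler(file_handler)
logger.addHandler(PagingHandler())

try:
    1 / 0
except ZeroDivisionError:
    # exc_info=True captures the full stack trace in the log entry.
    logger.critical("checkout failed", exc_info=True)
```

Every record lands in the log file with a timestamp and stack trace, while only critical ones trigger the page.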

The site isn’t launched yet, but I’m hoping this will result in a much better experience for all concerned!

While it may sound like an unreasonable and funny approach to working on a car, software isn’t like a car. Take any analogy too far and it falls apart.

Re-running the software is a perfectly logical approach to troubleshooting. What was the cause of the problem? Does it happen every time? If so, why? If not, why not? Could it be some outside interference that only affected the program that one time, or is it something inherent to the program itself that will happen every time?

Once you do narrow down the cause, you can address it.

@Peter, one extension we’ve made to that “log every error” approach is to create customizable RSS feeds. All apps on a server log to a central reporter which sends out feeds. The feeds have minimal detail for security reasons, but the link takes you to a suitably secured page that displays the relevant info. Saves you from checking religiously and also reminds you to go look when needed.
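A rough sketch of what that central reporter's feed generation might look like, using Python's standard `xml.etree` module; the error records, base URL, and field names are all invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical error records collected by a central reporter.
errors = [
    {"id": "4815", "app": "orders", "summary": "NullReferenceException"},
    {"id": "1623", "app": "billing", "summary": "Timeout talking to gateway"},
]

def build_error_feed(errors, base_url="https://reports.example.com/error"):
    """Build a minimal RSS 2.0 feed. Items carry only an app name and an
    error id; full detail lives behind the (secured) linked page."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Application errors"
    ET.SubElement(channel, "link").text = base_url
    ET.SubElement(channel, "description").text = "Central error reporter"
    for err in errors:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"{err['app']}: error {err['id']}"
        # The link resolves to a secured page showing the relevant info.
        ET.SubElement(item, "link").text = f"{base_url}/{err['id']}"
    return ET.tostring(rss, encoding="unicode")

feed = build_error_feed(errors)
```

Keeping the feed items this sparse is what makes it safe to poll from an ordinary feed reader.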

While it may sound like an unreasonable and funny approach to working on a car, software isn’t like a car. Take any analogy too far and it falls apart.

Right; there are no physical consequences to trying software again, which is why the joke is funny.

I do think we’re (or at least I am) occasionally guilty of blindly trying again without doing any kind of postmortem…

I hope the software you run is never as dangerous as a car!

Once you do narrow down the cause, you can address it.

And if you can’t reliably reproduce the problem, how can you be sure you’ve actually fixed it?

While I agree with the posts here, I think we're missing out on a key issue. Another reason software developers like to see an error repeated is to make sure their users are actually reporting what they're seeing. I've been in countless situations where well-meaning users call/email to report an issue, only for said issue to turn out to be a non-issue. I'm sure most level 1 support folks can attest to trigger-happy users calling up when the slightest "out of the norm" thing happens.

I'll admit that my first attempt is often to reproduce the error in a controlled environment (my own). The more complex the problem, though, the less chance of this succeeding.

I have no problem saying that I do this more out of laziness, when it's simpler to reproduce the error than to analyze the relevant code. Then again, there are probably more moving parts in your average enterprise app than there are in any car. A car might have only one or two engineers with an understanding of the whole system, but that's rarely the case for those of us in software.

The 'log everything' approach can work if you spend enough time refactoring (I know I never log everything while designing the code - the 'should never fail' case always will fail), but it has the obvious problem that the log analyzer becomes a critical piece of software in itself, needed to wade through the mountains of information spewed from any long-lived app.

Since the systems I work on tend to be distributed workflows, I switched over to a multicast socket scenario. That way, I can (if I want) have a listener that records to a database, another that jumps in mid-stream to display current system activity on screen, and another that escalates conditions to email/pager notification.
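A sketch of that multicast scenario using Python's standard `socket` module; the group address, port, and JSON event shape are assumptions for illustration, not the commenter's actual system:

```python
import json
import socket
import struct

# Assumed values: an administratively-scoped multicast group and an
# arbitrary port that all listeners agree on.
MCAST_GROUP = "239.255.42.99"
MCAST_PORT = 9999

def encode_event(event: dict) -> bytes:
    """Serialize one log event for the wire."""
    return json.dumps(event).encode("utf-8")

def decode_event(payload: bytes) -> dict:
    """Inverse of encode_event, used by every listener."""
    return json.loads(payload.decode("utf-8"))

def send_log_event(event: dict) -> int:
    """Multicast one event; the sender neither knows nor cares how many
    listeners (database recorder, live display, pager escalator) exist."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # TTL of 1 keeps datagrams on the local network segment.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    try:
        return sock.sendto(encode_event(event), (MCAST_GROUP, MCAST_PORT))
    finally:
        sock.close()

def make_listener() -> socket.socket:
    """Each listener binds the shared port, joins the group, and then
    decides for itself what to do with the events it receives."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock
```

The appeal of the design is exactly what the comment describes: recording, live display, and escalation are just independent listeners calling `decode_event` on whatever datagrams arrive, and any of them can be added or dropped without touching the sender.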

But then of course, that logging system needs to be thoroughly tested…