The Problem With Logging

For me, simple runtime control of loggers is the killer feature in frameworks like log4j/log4net. Changing log levels becomes more like opening/closing a tap.

I use debug-level log entries during development to verify process flow. In production our log entries are terse, but the ops guys can drop to debug for a given package and get a clear rundown of the order of events.
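For the curious, that tap can also be opened in code. A minimal log4net sketch, assuming log4net and a hypothetical "MyApp.Payments" logger name (normally you’d flip this in the XML config instead):

using log4net;
using log4net.Core;
using log4net.Repository.Hierarchy;

static class LogTap
{
    // Drop one package/namespace to DEBUG at runtime; everything else stays terse.
    // "MyApp.Payments" is a hypothetical logger name.
    public static void Open()
    {
        var logger = (Logger)LogManager.GetLogger("MyApp.Payments").Logger;
        logger.Level = Level.Debug;
    }

    // Close the tap again once the ops guys have their rundown.
    public static void Close()
    {
        var logger = (Logger)LogManager.GetLogger("MyApp.Payments").Logger;
        logger.Level = Level.Warn;
    }
}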

Assertions would serve the same purpose as logging but with the added benefit of helping to document the assumptions you’ve made rather than obscuring them. Heavy logging makes the original intent of the code very difficult to detect. I have yet to see a code base where it worked well.

I’m with Jeff on this. Sure, logging CAN help, but how often do you really open up a log file and look through it?

When I was working on a site where we had a lot of users doing things like transactions between accounts, we needed good logging. But the logs quickly became over-logged and didn’t help at all; we had to try forcing the site into the same scenario to get the same log output, even though the logs were pretty clear and logged all the data it was possible to log.

So, logging can be good, but it is often excessive! I’d go with exception-only logging, and maybe store a trace.

Cue NP rant: Anyway, we only say a program is deadlocked because we can’t prove it isn’t. Nobody really understands deadlock, and sometimes it’s just unavoidable.

Jeff, YOU ARE DOIN IT WRONG.

Some people do understand the basic CS-undergrad topics well enough to avoid coding up deadlocks, or livelocks, or exponential algorithms, or XSS exploits, or faulty regexes to weed out <script> tags, or whatever newbie mistake you’re whining about today. If you don’t understand something, say so; don’t try to hold down the other newbies by feeding them your misinformation! (Didn’t somebody nearby blog about CrabMentality a while back?)

this resulted in – you guessed it – a classic out-of-order deadlock scenario

Wait – does this log component stupidly acquire a long-lived lock, or did you use it in a stupid way?

If your logging is deadlocking you, there is stupid going on. A logger only needs a lock to serialise information going into the log, so …

Option 1: use one log file per thread. No locking required.

Option 2: dedicate one thread to logging. Other threads enter a critical section just long enough to write to a buffer; the log thread commits the buffer to disk when it has the time (see the sketch after this list).

Option 3: sacrifice performance and enter+leave the critical section on every log call.

Option 4: do something stupid and get deadlocked.
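Here is Option 2 as a minimal sketch, assuming .NET 4’s BlockingCollection (any producer/consumer queue will do). Producer threads touch only the queue, never the file, so there is no file lock to deadlock on:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

class AsyncLogger
{
    // Producers enqueue and return immediately; only this queue is shared.
    private readonly BlockingCollection<string> _queue = new BlockingCollection<string>();
    private readonly Thread _writer;

    public AsyncLogger(string path)
    {
        _writer = new Thread(() =>
        {
            using (var file = new StreamWriter(path, append: true))
            {
                // GetConsumingEnumerable blocks until items arrive
                // and completes once CompleteAdding has been called.
                foreach (var line in _queue.GetConsumingEnumerable())
                    file.WriteLine(line);
            }
        });
        _writer.IsBackground = true;
        _writer.Start();
    }

    public void Log(string message)
    {
        _queue.Add(DateTime.UtcNow.ToString("o") + " " + message);
    }

    public void Shutdown()
    {
        _queue.CompleteAdding();
        _writer.Join();
    }
}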

Logging means more code

Then use a language or add-on (like AOP) that can inject logging automatically.
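In .NET that usually means a tool like PostSharp or Castle DynamicProxy. Stripped to its essence, it is the decorator below, which such a tool would generate for you (the IAccountService interface is hypothetical):

using System;

// Hypothetical service interface, for illustration only.
interface IAccountService
{
    void Transfer(int fromId, int toId, decimal amount);
}

// Wraps any IAccountService and injects entry/exit/exception logging,
// so the real implementation stays free of logging noise.
class LoggingAccountService : IAccountService
{
    private readonly IAccountService _inner;
    public LoggingAccountService(IAccountService inner) { _inner = inner; }

    public void Transfer(int fromId, int toId, decimal amount)
    {
        Console.WriteLine("Entering Transfer({0}, {1}, {2})", fromId, toId, amount);
        try
        {
            _inner.Transfer(fromId, toId, amount);
            Console.WriteLine("Exiting Transfer");
        }
        catch (Exception ex)
        {
            Console.WriteLine("Transfer threw: " + ex);
            throw;
        }
    }
}

An AOP tool emits exactly this kind of wrapper at compile or proxy time, which is the whole point: the logging exists, but nobody has to read past it.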

it actively obscures the code that’s doing the real work

You’re assuming logging doesn’t do real work. 8 years of maintaining applications that run at customer sites in other cities suggest otherwise.

If it’s worth saving to a logfile, it’s worth showing in the user interface.

No. The user interface is there to interact with the user. Event logging is there for the administrator to get a summary of what’s going on (alerts, errors, warnings, changes in configuration, access to sensitive data, etc). These all relate to the operational requirements of the application.

What you are describing - primarily - is debug logging: dumping enough stuff in enough detail to help a developer figure out what went wrong, and why.

Logging isn’t free.

No, it’s not. Which is why most sensible logging subsystems will have a runtime option to change the log detail level, and will start with a quick check to determine if the log statement in question needs processing or not. In C/C++ one would use macros for this, to avoid a function call if the particular log statement isn’t going to be executed.
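In C# the same trick needs no macros; log4net exposes guard properties for exactly this. A sketch (OrderProcessor is a made-up class):

using log4net;

class OrderProcessor
{
    private static readonly ILog Log = LogManager.GetLogger(typeof(OrderProcessor));

    void Process(object order)
    {
        // The cheap boolean check runs first; the expensive
        // string formatting and ToString calls run only if needed.
        if (Log.IsDebugEnabled)
        {
            Log.Debug(string.Format("Processing order: {0}", order));
        }
    }
}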

The more you log, the less you can find.

YMMV. This depends substantially on how your log is presented (does it have a way to show stack depth, for example), what you are logging, how consistently you are logging, and whether you have tools to parse the logs.

it’s the brainpower needed to make sense of these giant logs.

Or the brainpower needed to log sensibly. Here are some approaches:

  1. Don’t allow a single concern or transaction to cross a thread boundary. Use one log file per thread, or include the thread id in the log, and it is easy to reconstruct the transaction.

  2. Tag each transaction with a system-wide unique ID, and make sure it gets logged. It is now easy to reconstruct the transaction.
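A sketch of approach 2 using log4net’s ThreadContext (the TxId property name is arbitrary); add %property{TxId} to the pattern layout and every line from the transaction carries the ID:

using System;
using log4net;

class RequestHandler
{
    private static readonly ILog Log = LogManager.GetLogger(typeof(RequestHandler));

    void HandleRequest()
    {
        // Tag everything logged on this thread with a unique transaction ID.
        ThreadContext.Properties["TxId"] = Guid.NewGuid().ToString("N");
        try
        {
            Log.Info("transaction started");
            // ... do the work; every log line carries the same TxId ...
            Log.Info("transaction finished");
        }
        finally
        {
            ThreadContext.Properties.Remove("TxId");
        }
    }
}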

Good luck getting everyone on your team to agree on the exact definitions of FATAL, ERROR, DEBUG, INFO, and whatever other logging levels you have defined.

It doesn’t matter. So long as there is agreement on what goes to the event log (read by the administrator, not the developers) and what goes to the development log (development only) … and that’s a customer requirements problem.

The whole problem here is that you see logging as some optional extra. That’s kind of like a civil engineer dismissing drainage as an optional extra - sure, it’s not the core of the building, but you’re going to end up knee-deep in shit if you leave it out! Design to accommodate it from the outset and it will perform as expected.

I was also not a big fan of logging, but two years of work at a company that produces software which works with unstable custom hardware proved me wrong. It’s better to log everything I can than to waste time on some obscure bug.

I also support the guy who recommended using AOP to reduce the amount of code written: method entry, exit, and exception logging can easily be done this way.

On the one hand:
Logging is great for multithreaded scenarios, where debugging can’t help you because of the context switching.

On the other hand:
Seeing a method that looks like this:
public void DoSomething() {
    m_Logger.Trace( "Entering DoSomething()" );
    object result = CallDatabase();
    if( result == null )
    {
        m_Logger.Trace( "result is null" );
    }
    else
    {
        m_Logger.Trace( "result is not null" );
    }
    m_Logger.Trace( "Exiting DoSomething()" );
}

This is just awful, for so many reasons, and that’s before exception handling even enters the picture.

Regarding sending messages to the GUI, this only is practical if there is also a way to save that data out to a file or to print it. Asking a user who has called support to start reading error messages, no matter how well formatted, is a recipe for missing something important.

Anyone work with SharePoint and try to dig through the logs to find out what went wrong? It is an absolute nightmare. Even when you use the screen that limits the event categories, you’re presented with a choice of about 150 items…
Anyways…
I’ve used logging in applications where I am not in control of the infrastructure yet still need to debug problems. My favorite approach is to make logging configurable, so that if you’re encountering a problem you can turn logging on, reproduce the problem, find out what went wrong, and then turn it back off.
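log4net supports this style directly: point it at the config file once and it re-reads on change, so the level can be raised and lowered again without a restart (the file name here is hypothetical):

using System.IO;
using log4net.Config;

class Program
{
    static void Main()
    {
        // Re-read log4net.config whenever it changes on disk,
        // so ops can flip the log level while the app is running.
        XmlConfigurator.ConfigureAndWatch(new FileInfo("log4net.config"));
        // ... run the application ...
    }
}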

Log the exception stack trace.
Log any critical database changes according to the business requirements.
Log access to protected or critical systems.

I generally only log information that would help me reproduce user actions (such as button clicks and information entered). I think logging the start and end of every logical block of code is extremely excessive. Now granted you may need to do that in some rare debugging instances, but in general you shouldn’t have to do something like that.

We have been using a combination of logs and traces for a long time.

  1. Logs are the very important exception events that you want/need to keep. They are stored locally on an embedded device, so they had better be small and clear. The conventions for ERROR, WARNING and INFO must be crystal clear, and in some cases it takes an approval or a review to earn the right to log something. The other problem with logs is recurring conditions. You do not want to end up with your whole log filled with timeout exceptions ("could not connect to xxx"). Thus, it takes extra attention to log only when the condition first occurs, not when it repeats, and also when it clears (sketched below).

  2. The second level is the traces. They behave similarly to the OutputDebugString function: run-time trace information about the internal workings of the system. If you are not there to see them, you lose them. Each component that traces registers itself with the system and can create as many trace levels as it wishes. We end up with a tree of checkboxes for all the tracers in the system, and you check only those you wish to see. Because we know that executing all that code takes extra time and mangles the multi-threading timing a bit, trace code is conditionally called only if someone has checked the relevant checkbox in the tree. Thus, no CPU (or very, very little) is spent executing those. We have some nice mechanisms for doing this in all kinds of situations (even decoding complex objects can be conditionally run) without obfuscating the relevant source code.

We found that this second level of traces was VERY helpful on deployed systems for identifying problems and fixing issues that sometimes were configuration related, or inter-compatibility issues with external systems. It was also essential in finding complex bugs/problems that can be reproduced on live systems (but may be difficult to reproduce in house) while having minimal impact on the live system (as opposed to attaching a debugger and tracing step by step).
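The edge-triggered logging from point 1, reduced to a sketch: log a condition when it is raised and when it clears, and stay silent while it merely persists (Console stands in for the real log sink):

using System;
using System.Collections.Generic;

// Logs a recurring condition only on its rising and falling edges,
// so a flapping "could not connect" never floods the log.
class EdgeLogger
{
    private readonly HashSet<string> _active = new HashSet<string>();

    public void Report(string condition, bool occurring)
    {
        if (occurring && _active.Add(condition))
            Console.WriteLine("RAISED: " + condition);
        else if (!occurring && _active.Remove(condition))
            Console.WriteLine("CLEARED: " + condition);
        // Repeats while the condition persists are silently dropped.
    }
}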

I’m into embedded and real-time things and into applications that need to run for very long times without being interrupted. I usually reserve a buffer somewhere in memory and log copious details to it. When the buffer gets full, I write out anything above a certain level to permanent storage (if available) and purge the rest. If the application crashes, I can dump out everything that’s in the buffer and get all the gory details.

Logging to permanent storage (note I’m not using the word disk) or through syslog tends to be unacceptably slow on the systems I work on, while more and more memory keeps becoming available.
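A sketch of that scheme, with made-up sizes and levels: a bounded in-memory buffer; on overflow, persist only the entries at or above a threshold and purge the rest; after a crash, dump it all (single-threaded for brevity):

using System;
using System.Collections.Generic;
using System.IO;

class RingLog
{
    private readonly Queue<Tuple<int, string>> _buffer = new Queue<Tuple<int, string>>();
    private const int Capacity = 4096;   // entries, not bytes; made-up size
    private const int KeepLevel = 2;     // e.g. 0=trace, 1=info, 2=warn and up

    public void Log(int level, string message)
    {
        _buffer.Enqueue(Tuple.Create(level, message));
        if (_buffer.Count >= Capacity)
            Flush();
    }

    // On overflow: write the important entries to permanent storage, purge the rest.
    private void Flush()
    {
        using (var file = new StreamWriter("events.log", append: true))
            foreach (var entry in _buffer)
                if (entry.Item1 >= KeepLevel)
                    file.WriteLine(entry.Item2);
        _buffer.Clear();
    }

    // On crash: everything still in memory, gory details included.
    public void DumpAll(TextWriter sink)
    {
        foreach (var entry in _buffer)
            sink.WriteLine(entry.Item2);
    }
}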

Logging is important for the following reasons:

  1. Users lie. They don’t mean to, but they report what they think happened and what steps they remember taking. Logs are unambiguous and don’t distort reality to match their preconceptions.

  2. Some applications need to log events for regulatory reasons. Application logs serve as proof that certain events happened.

  3. Applications don’t always behave the same in the debugger as they do in a production-like environment. The additional overhead of a debugger can change the behavior and/or timing of control and data flow in a way that makes the problem disappear.

My own informal approach to logging is to map the following to the log4net levels:

FATAL - An unrecoverable error just occurred and this is the last-gasp attempt to document what happened before everything went boom.

ERROR - A recoverable hard error (network loss, database constraint violated, etc.) just occurred and this is an attempt to identify where it happened and what caused it for later investigation.

WARNING - A recoverable semantic error (invalid login credentials, invalid application operation attempted, etc.) occurred and was corrected.

INFO - Something interesting happened and it needs to be documented. Interesting depends on the type of application and the customer requirements.

DEBUG - This is the data I need when I get the call at 2 a.m. and need to diagnose the problem immediately. It’s typically state information that will allow me to identify exactly where the data/flow stopped being good and transitioned into crappy.

I also completely misuse the NDC stack in log4net by pushing the routine name onto it when I enter the routine and popping it off just before I exit. Sue me. I know that’s not what it was intended for, but it allows me to see at a glance where I am in the call stack and has occasionally identified events recursively calling each other.
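For anyone who wants to misuse it the same way: NDC.Push returns an IDisposable, so a using block pops the name even when an exception flies past. (NDC is deprecated in newer log4net in favor of ThreadContext.Stacks, but the pattern is identical.)

using log4net;

class BillingJob
{
    private static readonly ILog Log = LogManager.GetLogger(typeof(BillingJob));

    void Run()
    {
        // Push the routine name on entry; Dispose pops it on exit.
        // With %ndc in the pattern layout, nested calls show as Run > Step1 > ...
        using (NDC.Push("Run"))
        {
            Log.Debug("starting");
            // ... call other routines that push their own names ...
        }
    }
}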

The whole point of non-regulatory logging is to capture enough information to shorten the time and mental distance between a problem appearing and knowing enough about it to start resolving it. It’s just as much of an art as debugging, learned largely by trial and error (anybody else ever run out of disk space because their log files got too large, or am I the only one?), and it resists fully automated approaches. For me, it all goes back to predicting what information I will need when I have to resolve an issue rapidly and can’t depend on eyewitness accounts of what happened.

If a tree falls in the forest and it wasn’t logged, can you prove it?

I have a logging class that is called as if I’m going to do far too much logging, but only actually writes entries above a level set as a constant. Most of the time it logs only fatal errors, but at the flip of a constant it becomes really anal and logs everything it gets.

There’s obviously some wasted processing going on most of the time, and I can’t get any great detail until after the problem has been reported, but I find the pros easily outweigh the cons. Most of the benefits of light/no logging, and a simple tweak of a constant to get all the benefits of heavy logging.
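In skeleton form that class presumably looks something like this (levels and the threshold value are made up):

using System;

static class Log
{
    // Flip this constant and rebuild to go from fatal-only to everything.
    private const int Threshold = 4;   // 0=trace ... 4=fatal

    public static void Write(int level, string message)
    {
        if (level < Threshold)
            return;   // the wasted call, but it is cheap
        Console.Error.WriteLine("[{0}] {1}", level, message);
    }
}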

Our software doesn’t use logging to log errors - if there is an error, the user wants to know without looking into a log file, so it displays the error to the user. Our apps log for the case where a user comes and says "I did this and that and then bang, error", and we try it and it works flawlessly. Then we need the log. We don’t need it to see what error the user got; it was on the user’s screen in bold letters, so we know the error. We also don’t need it to find out what the user did; he just told us. But since we don’t get the same error doing the same thing, we need to find out what circumstances led to this error. And if we don’t log everything that goes on in the system, we will never have a chance to find those circumstances. Exactly the one little piece of information we fail to log may be exactly the rare, obscure detail that caused the error for the user. So log as much as possible.

Regarding the log becoming too big to find anything in, just write a tool that filters it. Just because everything is in the log doesn’t mean your log viewer needs to display everything! Make it filter by log level, by which part logged the info, by time stamp, by string, whatever. With a good set of filters you can extract exactly the info you need from the log, hiding all the stuff you don’t care about at the moment. But what is not in the log is information you don’t have. And we can’t always modify hundreds of lines of code and release an update for our software products whenever users get a certain error and we are not logging enough to find out what’s going on.
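Such a filter tool can be tiny. A sketch assuming one entry per line with a bracketed level (the format is hypothetical):

using System;
using System.IO;
using System.Linq;

class LogFilter
{
    static void Main(string[] args)
    {
        // Usage (hypothetical): LogFilter app.log ERROR "connection"
        string path = args[0], level = args[1], needle = args[2];

        var matches = File.ReadLines(path)
            .Where(line => line.Contains("[" + level + "]"))
            .Where(line => line.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0);

        foreach (var line in matches)
            Console.WriteLine(line);
    }
}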

Good stuff.

Logging can be addictive at first, but as you point out there is no free lunch. With all our fancy abstractions in code, eventually it still usually boils down to a physical database read/write.

Having the ability to turn logging on/off at different levels is of course highly valuable.

Regarding deadlock, yes it can happen with Log4Net and I ran into the exact same problem Jeff described, but almost 3 years ago.

You can argue back and forth about what the real problem is, but Log4Net treats the logger as a singleton and serializes (locks) access to the output file so that each record is written atomically. Otherwise you could get half a log statement written, and then another one written in the middle of it.

If your DB layer also serializes access to a shared resource (i.e. a connection), then the scenario Jeff discussed will cause a deadlock. But not in the DB, where it would be detected and interrupted; it happens in your code, where it will sit idly doing nothing.
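The shape of that deadlock, reduced to two locks with made-up names: one thread takes the log lock and then wants the connection, the other holds the connection and then logs. Neither can proceed, and no database deadlock detector is watching:

using System.Threading;

class DeadlockShape
{
    static readonly object LogLock = new object();   // held while writing a log record
    static readonly object DbLock  = new object();   // held while using the shared connection

    // Thread A: a logging appender that needs the DB (e.g. writes records to a table).
    static void ThreadA()
    {
        lock (LogLock)
        {
            lock (DbLock) { /* write the log record to the database */ }
        }
    }

    // Thread B: a data layer that logs while holding the connection.
    static void ThreadB()
    {
        lock (DbLock)
        {
            lock (LogLock) { /* serialize a log record */ }
        }
    }
    // A holds LogLock and waits for DbLock; B holds DbLock and waits for LogLock.
}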

I’d say the problem is not understanding what’s going on in the details of the APIs you are using (which is why it happened in my case).

Let me put on my SysAdmin hat…

Code [for logging] is the enemy? Wow, does that sound like developer speak.

While sorting through a gig of log data on an application you wrote sounds like a heinous job to you, having no log data other than exceptions sounds even worse to me. While I dread the task of being handed a volume full of WebSphere application logs and being told "Find the place where the hacker broke in," I dread not having those logs even more.

As a former boss of mine, one of the few who did his job very well, once said: "Sometimes the message is unimportant, but how many times it appears is very important."

How hard is it to write a decent logging object so your code doesn’t get crowded with catch bracketology? [Answer: not hard]

Sounds like a poor design in the logging software. I have written logging code for decades and never had to hold a lock to get it done. With routines that can be intrusive in other code you have to work hard to keep things local.

I love the way people are comparing debugging to logging… when the example that Jeff points out would never be caught while debugging. It’s these highly multi-user systems where sometimes the only way to figure out what’s wrong is to have good logging. Excessive logging is bad… just as bad as no logging… but I’ve got to believe there’s a happy medium in there somewhere. The world isn’t black and white.