Hey Now Jeff,
Algorithms are so fun, another post I enjoyed reading learned from.
Coding Horror Fan,
Hey Now Jeff,
I am a big believer in returning to the basics on a regular basis. Thanks Jeff - this is a good reminder. Gordon’s suggestion that he would not hire a would-be senior developer who cannot demonstrate some knowledge of Big-O notation got me thinking.
There is a popular interview question to the effect of “What’s your favorite sort?”, sometimes followed by a request to code it up. Some years ago I was talking with a very good, mathematically-inclined developer who said that he was tempted to answer “Ascending”. He explained that almost never coded his own sort algorithms, instead preferring to use the functionality of the well-tuned libraries and controls available to us.
In my world of general business applications, knowing how to find the right sort algorithm (including a basic understanding of Big-O) is far more important than being able to code it from memory. The problem just does not come up that often, and I can easily find several sort algorithms for any given language (from dependable sources, usually with the Big-O plainly stated for me, complete with explanations of the best case and worst case scenarios).
Damn it! I wish someone told me this 4.3 million years ago!
J Stoever, an Alpha running the TRS-80 algorithm (48n) would have overtaken the Alpha running the Alpha algorithm (n^2) in seconds at most. The TRS-80 is only given as a dramatic illustration of how faster hardware does not help fix bad algorithms.
After some thinking, I think this article touches a much broader issue that it didn’t quite do justice. For one thing, testing with a huge amount of data is only a subsection of what should generally be a central part of any testing: assuming the worst case scenario. Not testing with big datasets is really no different than not testing with internet latency, or not confirming that user input is in the correct format.
On the other hand, care should be taken to avoid optimizing for the worst case scenario - just like (mentioned above) quicksort isn’t all that awesome if you get the (normal, in regards to computer programs) case of presorted data, there are other giant holes you can fall into if you confuse “testing for extreme cases” with “optimizing for extreme cases”.
I don’t think you have to be a optimization freak. You can always replace to the slower components with faster alternatives as the number of users increase.
Incidentally, an Alpha is a superfast mid-90s processor that DEC used to put out and NT used to run on.
A TRS-80 is an early 80s hobbyist computer. It’s about 10,000 times slower than an Alpha.
Re: M’s little toy-tossing rant
Before posting a comment in future, you may want to just review it and measure it against the standard of, “How useful is this to anyone who might read it?”
Against that standard, your post was completely useless apart from the last paragraph, where you made a helpful and constructive suggestion.
Please don’t waste my bandwidth telling the world that you have done some computer science courses. Anyone reading this blog, because of it’s highly technical level, is just as smart as you - we just don’t talk about it.
I think there is a corollary to the big-O notation problem, which is that it can be just as easy to hide performance problems because it is an analysis of complexity, not of runtime speed. It treats O(100n) == O(n) since it is assumed that the size of n will eventually swamp any constant terms. The problems is, I’ve seen too many systems where performance on some piece of the system is abysmal and the developer says “But I used quicksort!!!” or “But I used a B-tree!!!” That’s all fine and good, but then I look at the code and see that the system is fetching things from the database that could be cached in memory or is using strings as comparison keys when ints could be used. Quicksort won’t help you if your comparison algorithm is painfully slow. Using the best algorithm/data structure for the job is the least you can do. Next you have to actually think about your application data and how you can simply make your program do less.
I’d sad non-scalable software is the norm, not the exception. There are lots of excellent programmers who know to avoid exponential algorithms, but they have good jobs as far away from marketing as possible. There are far more AWFUL programmers out there peddling their shoddy wares for a quick buck. It’s simple: if there are more than 4 or 5 competing software packages that do the same thing, at least 75% of those are going to be absolute garbage; such is the result of today’s market forces and management overload, the people doing the purchasing either don’t have the eye, or the time to properly examine their options and the result is that the incompetent coders make a sale.
To most experienced developers, bottlenecks are immediately apparent during the design phase. We just know, before the requirements meeting is over, exactly where and how the app will fail and that’s where we focus the bulk of our effort. If you know that the key feature of an app is a nonsensical 14-table SQL join or a write-heavy backend, no amount of brilliant code-level optimization will make the app faster… the key is to break it into manageable chunks or sometimes completely rework the functionality.
If it weren’t for these true engineering problems, programming would not be a career.
Maybe I’m getting old, but it seems that this degree of subtlety is getting forgotten these days. Many developers are so used to calling the “sort” method on a collection that they’ve forgotten – never even knew in some cases – the characteristics of common algorithms.
I remember working with one developer who “enhanced” a program we were working on by changing a call to a heapsort (I think) to quicksort, basically because the latter had the word “quick” in it so it must be faster. Except the list was usually sorted…
But yes, even if you don’t understand you should at least test!
@DirtyDavey: “If you’re treating SSNs as character strings for sort/search, you can create fake sensitive data that cannot be mistaken for the real thing by using letters A through J instead of digits, e.g. SSN = ABC-DE-FGHI. Those should perform identically to real SSNs.”
…except that they will never compare equal to any real SSN, won’t pass input validation (which should be checking that the user enters digits, not letters), will always sort to the end of a list of real SSNs instead of somewhere in the middle, etc. The right answer is to replace or ignore the legal team.
Jeff, the “TRS-80 algorithm” is clearly “go out and buy n TRS-80s”, but what is the “DEC Alpha algorithm”?
I was just going to congratulate you on the use of your sorting animations: I’ve known about the Big O differences for ages, but I’d never SEEN the difference put so plainly. Then, I saw that you linked to a source for those animations and followed the link. Amazingly, it’s a site advertising Cormen’s “Introduction to Algorithms.” That’s the very book I’d been thinking of while reading your article. Heck, I even wrote a review of it on Amazon: “Academic Masterpiece, Practical White Elephant”
Anyway, good job.
I’d hope that anyone with an actual degree from a CS institute would know about Big O.
I agree wholeheartedly with testing with large datasets - but always think really hard what sort of numbers you’re actually going to encounter.
Everything is fast for small N - therefore, if small N is all you will ever see, everything is fast I was doing work on optimizing our draw ordering sort for transparencies. We were just calling sort every frame, which is much slower than doing a proper octree traversal. After a little testing as to whether I should even be attempting this, I found that there were never more than 17 items on our largest reasonable datasets that needed to be sorted at any give time. We were taking .1 ms on the sort and 6 ms to draw them… I decided to work on other problems
Quicksort’s worst-case behavior is O(nn), not O(nlog(n))!
This is not a niggling detail!
Many apps have been brought to their knees because foolish programmers didn’t understand Quicksort’s behavior.
Fix your app first, then fix your incorrect blog post!
You should use Heapsort instead of Quicksort. Heapsort is O(n*log(n)) worst-case.
Even if one doesn’t generally implement or even USE a sorting algorithm, I’d expect all programmer to know the basics. You may call me an elitist or whatever, but there has to be a limit to who can call themselves programmers.
Sorting algorithms are, after all, one of the basics that are used to demonstrate programming in the first case. Quick sort is usually one of the first non-trivial recursive algorithms you are taught.
You don’t really need a CS degree to know the basics of sorting and complexity. You don’t have to be a genius to understand it either.
This stuff is taught in high schools. It is taught in many beginner programming books (but should be in ALL of them).
This is why it’s so unbelievable. How can anyone actually manage to get a job as a programmer and not know the basic facts of sorting? It’s like getting a pizza delivery job without knowing traffic laws.
Talking about testing with large data, we did a live data test in parallel with our existing legacy over the last two weeks with our new .NET web service that handles raw medical device data. Our net admin had kittens when the new service filled up his brand new server’s disk array in 3 days time. We now have a 2TB disk array on order and we’ll go live (maybe) when it’s installed.
This is one reason that the C++ STL explicitly describes the performance/complexity characteristics of things like insertions, searches, comparisons, including specifying the general implementation (though it of course leaves many details unspecified as well). It’s worth checking that each time you are about to decide which algorithm or container template or method to use from the STL (I keep the O’Reilly STL Pocket Reference handy at all times).
“With generic implementations it will happen when you try to sort data that is already sorted - and that is a pretty likely occurrence.”
Only if you pick the first or last item as the pivot. I doubt many library implementers do that these days (glibc uses median-of-three).
This was all too mathematical for me but it has inspired me to find sample code for the various types of sort in C# and VB.NET.
I think programmers tend to be too socially isolated. We need a really good social networking site for developers so we can get the benefit of the “wisdom of the crowd”. More social networking by programmers would apply peer pressure to use best practices. I’m currently working on a project created by mainframe programmers who have loaded the entire database structure into arrays (every table name, column property, etc) has to be referenced using jagged arrays with obscure names and it is driving me crazy!