There Ain't No Such Thing as the Fastest Code

I am so glad there are those of you that love this stuff. I love to write programs, but once they are written I’m done with them. I guess it’s the ADHD, but I don’t care to look back on what I did to see if I can make it better/faster/more efficient. When I finish, I am just glad that the project is over so I can start something new.

But that is why I’m thankful I have programmers that work for me to take my ideas and prototypes and improve the code.

I have a lot of turnover on my team (we are a place where new grads cut their teeth before moving on) so I am more concerned with readable/understandable code than optimization. It’s a sacrifice I am willing to make since our average programmer only stays about 18-24 months. I need them to open the apps and begin working on day 1… not after 2 months of figuring out what previous programmers thought was a great optimization trick.

I’d like to echo Richard C Haven’s point in the second post there, and this is an excellent example with which to do it.

See, the most common reason to want a word count (as far as I know) is for submitting some written work to an editor who wants an 800-word article or a 120,000-word novel. And, if you’re doing that, what you really want is the same count that they would get if they counted the words.

Which is, as one can probably realize from a few moments of thinking about how this might have been done before computers, not an exact count of contiguous sequences of letters and apostrophes.

Nope. If you submit a book manuscript to a science-fiction editor (a genre in which I happen to know a good bit about how this works, though I think it’s probably the same in most other genres), the word count that they expect is the one that would be computed by typing out the manuscript double-spaced on a standard 10-character-per-inch typewriter on paper with standard margins, counting the pages, and multiplying by a factor that assumes about six characters per word.

(And there are good reasons for wanting that, rather than the “true” word count, too. Spaces don’t cost money to produce, but pages do, and they want a number that they can multiply by other factors to get about how many pages of standard hardcover or mass-market paperback this manuscript will take to print.)

So, yeah. I suspect I could write a sufficiently accurate approximation of that algorithm (for a flat text file) in about four lines of C and have it be a lot faster than the highly-optimized assembly code mentioned here. :slight_smile:
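Something like this, say (a minimal sketch: the six-characters-per-word figure is the convention mentioned above, and treating every byte of the file as costing page space is my own simplification, so it ends up a few lines past four once you count the boilerplate):

```c
/* Rough manuscript-style word count for a flat text file: count every
 * character and divide by an assumed six characters per word.  The divisor
 * is the editors' convention, not something this code derives. */
#include <stdio.h>

int main(int argc, char **argv)
{
    FILE *f = (argc > 1) ? fopen(argv[1], "rb") : stdin;
    long chars = 0;
    int c;

    if (!f) { perror("fopen"); return 1; }
    while ((c = getc(f)) != EOF)
        chars++;
    printf("approx. %ld words\n", (chars + 5) / 6);
    return 0;
}
```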

I had that book a number of years ago. I hope the Canadian I sold it to on eBay got his money’s worth. I believe I sold it for $50US, which was, IIRC, about how much I paid for it new.

that book was a MONSTER too, like NYC phonebook big.

it’s a good book.

I’m a mainframe programmer. The cost of a single transaction or task can be small, but if you have millions of them, it adds up.

Here’s an example. We had a CICS region bogging down. At first we thought it was a transaction that was taking longer than a second. Come to find out it was a sub-second transaction performed thousands of times a day. It was unnecessarily loading a DB2 table just to see if the customer had a particular product. We took that out, and saved ourselves hundreds of thousands of dollars a year on MIPS cost.

By all means, the program should work as required. But after that, you can save your company millions with performance tuning.

While I’m at it: I write in both assembler and COBOL. I have found that my assembler knowledge adds to my understanding of how the computer actually works. When we get further and further away from that understanding, I feel we risk designing software that not only runs slowly but may have to be rewritten.

Man - I guess I am a new-school junior programmer who did not get his start writing x86 assembler, but these comments are boggling my mind. I expected everybody to recoil at all this talk of optimization.

I guess bit-twiddling at this level doesn’t interest me as much as the process of creation. I can tell that this is a quality book, but I’m scanning it for the more general pearls of wisdom. My eyes glaze over when I hit the program listings and the discussion of the microcode implementations of x86 instructions.

To clarify: It’s important to understand performance, and algorithmic complexity, and behavior of hardware and OSes if you’re going to write quality code, but the subject matter of the examples, and the minute level of optimization being striven for, is killing me here. =)

I have that book. It’s one of the most enjoyable books I’ve ever read.

It’s a shame that the book seems out of print; I’d like a copy to look at for fun but $85 is a bit steep. Maybe someone will convert it as well?

Wow, I used to rip Basic code from Byte magazine back in the 80s (on Apple //c BTW). Nice to see they’re still around.

I agree code optimization is important. Even in high-level languages, increasing concurrency and knowing whether your code is the bottleneck matters.

I have most of his articles printed out in a binder on my shelf. It’s one of the few collections of technical material that has survived the yearly gleaning process…

The complexity of our development stacks now has technical writers covering primarily the intricacies of a particular technology. Abrash is a great writer who excels at exposing the gestalt of a problem rather than getting tripped up in the technology.

It will be nice to have a digital copy of some of the articles. Thanks for the link!

The Abrash book is on a topic that’s far beyond what the vast majority of software developers, dealing with dialog boxes and web dev, could dream of tackling.

A definite classic.

In 2002 I took part in a similar competition about generating all primes up to 1,000,000,000. There were tons of entries, all using the Sieve of Eratosthenes. The second-best contestant had spent God knows how many hours on an optimized assembly version. It rocked all the other programs by a wide margin.

I, however, spent just 5 minutes searching on Google for the best algorithm out there … and found one by Atkin and Bernstein (http://citeseer.ist.psu.edu/452169.html). I ported it to Windows (a silly requirement they had) and submitted the program with references to the paper and Bernstein’s sample implementation (http://cr.yp.to/primegen.html).

Naturally it was the fastest submission. The moral of the story is that you should search for a better algorithm before getting into low-level machine optimizations. Unfortunately most programmers prefer the machine-level optimizations, as they do not have the inclination to improve their theoretical background. Sad.
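For anyone curious what the baseline looked like, here is a minimal, unoptimized Sieve of Eratosthenes sketch. One byte per number keeps it short at the cost of roughly 1 GB of memory at this limit; a serious entry would at least pack bits, and the Atkin-Bernstein sieve does asymptotically less work than this.

```c
/* Plain Sieve of Eratosthenes up to 1,000,000,000, the algorithm most
 * entries used.  Marks composites, then counts what is left unmarked. */
#include <stdio.h>
#include <stdlib.h>

#define LIMIT 1000000000UL

int main(void)
{
    unsigned char *composite = calloc(LIMIT + 1, 1);
    unsigned long i, j, count = 0;

    if (!composite) {
        fprintf(stderr, "not enough memory\n");
        return 1;
    }
    for (i = 2; i * i <= LIMIT; i++)
        if (!composite[i])
            for (j = i * i; j <= LIMIT; j += i)
                composite[j] = 1;
    for (i = 2; i <= LIMIT; i++)
        count += !composite[i];
    printf("%lu primes up to %lu\n", count, LIMIT);
    free(composite);
    return 0;
}
```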

Everything should be done in assembly language.

Ah, drool… this is one of the few areas of software work I find fascinating. Assembly, fast graphics. I started on the 8-bit 6800, 6502 and such, LEDs and hex keypads. Abrash is a hero.

I love working at that low level. But it is so hard to find work - one can’t just breeze into any ol’ town and find good well-paying steady assembly-level work there.

And yes, as others have explained, understanding how the computer works at that level does help with writing efficient code and with troubleshooting, even if not in a directly applicable way.

Another example of how knowing the low-level software and hardware helps: where I work, we process a lot of images. These need to be calibrated. One of our key calibration programs has a 4-deep nested loop to synthesize some image-sized calibration data. It runs slowly, of course. I rearranged it, taking into account caching, row/column access order, etc., and it ran about twice as fast, at least on a good day. Now users can’t so easily get up for a fresh cup of coffee while calibrating images. I guess that makes me an anti-hero :wink:
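The general idea, for anyone who hasn’t run into it, looks something like this toy sketch (made-up array sizes, not our actual calibration code):

```c
/* Toy illustration of the loop-ordering fix.  C stores 2-D arrays row-major,
 * so the inner loop should walk along a row: same arithmetic, same result,
 * but consecutive memory accesses instead of large strides that miss cache. */
#include <stddef.h>

#define ROWS 1024
#define COLS 1024

static float image[ROWS][COLS];
static float gain[ROWS][COLS];

/* Cache-hostile: each inner-loop step jumps COLS * sizeof(float) bytes. */
void calibrate_column_major(void)
{
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            image[r][c] *= gain[r][c];
}

/* Cache-friendly: the inner loop touches consecutive addresses. */
void calibrate_row_major(void)
{
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            image[r][c] *= gain[r][c];
}
```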

Understanding the underlying machine language is useful even for high-level programmers. It doesn’t necessarily mean writing code in assembly language, but rather avoiding the pitfalls of the architecture. An excellent example is developing code for Windows Mobile devices. The majority of WM devices have CPUs that use the ARMv5 instruction set, which does not include a divide instruction. When a divide operation is compiled into the code, it gets turned into a software divide routine that takes 50 times as long as a multiply.
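To make that concrete, here is the kind of workaround you end up writing (a toy sketch; the batch size and function names are my own, not from any particular codebase):

```c
/* Toy sketch of avoiding integer division on ARMv5-class CPUs, which have
 * no hardware divide: the '/' below becomes a call into the compiler's
 * software-divide helper, while a power-of-two divisor becomes one shift. */
#include <stdint.h>

/* Average over a run-time count (assumes n > 0): the division here
 * compiles to a software divide routine on ARMv5. */
uint32_t average_any(const uint32_t *samples, uint32_t n)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / n;
}

/* If the design lets you fix the batch size at a power of two, the divide
 * disappears entirely and becomes a single shift. */
#define BATCH 64u                   /* power of two, chosen for this sketch */

uint32_t average_batch(const uint32_t *samples)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < BATCH; i++)
        sum += samples[i];
    return sum >> 6;                /* divide by 64 */
}
```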

Great book references.

People (coders) are limited by a couple of things:

  1. you cannot code an algorithm you don’t know (yet)
  2. you cannot employ the power of a language technique you don’t know (yet)

And by picking the right technique for the right problem!

I agree with the sentiment, but to be pedantic, there is such a thing as the fastest algorithm for solving a problem. Binary search trees are mathematically proven optimal for finding an element by comparisons of keys: no comparison-based search can beat O(log n). But in most practical situations you can tweak the problem so that other solutions are possible (who says you should be limited to comparing keys? Sometimes you can just use a hash table).

For some problems a lower bound has been proven (“the fastest algorithm can be no better than this: …”) but the best known algorithm is slower. Last I looked, matrix multiplication was in this category.
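A tiny sketch of what “not limited to comparing keys” can mean in practice: the degenerate hash table, where a small integer key indexes directly into an array (the 16-bit key space here is just an assumption for the example).

```c
/* Membership test with zero key comparisons: the key itself is the index.
 * This only works because we stepped outside the comparison-based model
 * that the O(log n) lower bound applies to. */
#include <stdbool.h>
#include <stdint.h>

#define KEY_SPACE 65536              /* assume keys fit in 16 bits */

static bool present[KEY_SPACE];

void insert(uint16_t key)   { present[key] = true; }
bool contains(uint16_t key) { return present[key]; }
```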

“know what you do and how it goes to intermediate language (IL) is a big thing to learn first for speeding up”

Optimising to the IL level is largely pointless unless you know about the JITer, which most people don’t. I’ve seen far too many discussions about which technique is better come down to “but it comes out to 4 less IL instructions”, only for that argument to be trashed in proper tests because of the JITer’s optimisations. Even then you can spend ages poring over the assembly produced by the JITer, optimising it, and it will be completely different on someone else’s system. Don’t get me wrong, knowing how your code gets translated into IL is incredibly useful, but people rely on it too much to promote various ultimately futile micro-optimisations.

Joseph Garvin:

Not quite true; the math involves a theoretical machine. And these days the real hardware only simulates the theoretical machine. In some cases, you can run your application on two recent processors from the same vendor and get radically different results.

Case in point: CS majors learn to mathematically prove that a binary search on an array is faster than a linear search. That assumes that all instructions take equal and constant time. But on a pipelined processor, the cost of a branch depends on whether it is predicted correctly. If the branch prediction works, the branch costs essentially nothing; if not, the processor needs to cancel every instruction in the pipeline and start over.

For a binary search, each branch is taken half the time, so a branch predictor has no pattern to work on, and half the time it’s wrong. That means that for small arrays, binary search is much slower than linear search. I remember reading in Dr. Dobb’s that “small” on one particular Pentium processor was about 36 items.

To address your example: BSTs are fastest only if you assume that memory access and branching are free. Most modern computers use heavy pipelining and caching, so neither assumption is safe. In fact, an n-way search tree with a linear search at each branch point is likely to be faster.
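For anyone who wants to see the two searches side by side, here is a toy sketch (the real crossover point has to be measured on the machine you care about; the ~36-item figure above is just what Dr. Dobb’s reported for one Pentium):

```c
/* Two searches over a sorted int array.  The linear scan's loop branch is
 * almost always taken, so the predictor gets it right; the binary search's
 * "go left or right?" branch is essentially a coin flip on random queries. */
#include <stddef.h>

/* Returns the index of key, or -1 if absent.  Stops early since a is sorted. */
int linear_search(const int *a, size_t n, int key)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] == key)
            return (int)i;
        if (a[i] > key)
            return -1;
    }
    return -1;
}

int binary_search(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] == key)
            return (int)mid;
        else if (a[mid] < key)   /* taken about half the time: hard to predict */
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```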