There Ain't No Such Thing as the Fastest Code

I tell ya, a lot of modern big firm game programmers should read this book;)

Back around 1998 when I was at Metrowerks as their junior engineer on the x86 C/C++ compiler for CodeWarrior, I read everything I could from Abrash, and used a lot of that knowledge to tweak the peephole optimizer in the compiler. I really appreciated it, and for a little while, the CodeWarrior tools did a little better than Microsoft’s VC++ on quite a few benchmarks.

Talking about caching, Herb Sutter made a great presentation to the Northwest C++ Users Group about how the hardware of a modern PC works, the differentials in bandwidth and latency between different components, the efforts hardware makers have gone to in order to hide that latency, and how that can affect your program.

The presentation was made available on Google Video and is linked from the meeting page. You should probably read the slides along with the video because they're not that clear on screen - a PDF is downloadable from the same page.

Micro-optimising your code by rewriting in assembly may well not do any good if you’re memory or I/O bound. Find out if you are first.

I think there’s a big point you’ve missed in your post. You’re right that there’s no such thing as “the fastest code”. But there definitely is such a thing as “fast ENOUGH code”.

The geek in me loves optimising code and making it faster. Then faster again. Then a little bit faster still. But at some point, making the code faster starts to trade off against time and readability.

Let’s take as a simple example a solution to that prime numbers problem posted above:

#include <stdio.h>

int prime( int n ) {
    for( int x = 2; x < n; x++ ) {
        if ( ! ( n % x ) ) {
            return 0;
        }
    }
    return 1;
}

int main() {
    for( int n = 2; n < 100; n++ ) {
        if ( prime( n ) ) printf( "%d\n", n );
    }
    return 0;
}

Well that runs pretty fast. It does primes from 2 to 100 in no time at all. It scales to 10,000 without too much hassle, and does primes to 100,000 in about seven seconds on my PC. But it’s by no means the fastest code.

For starters, any number apart from 2 that’s got bit 0 reset can’t be prime (since it’s even), so we could test for that. We could cache each prime number in a lookup file so that when the program’s run again, it remembers earlier results and produces faster output. This could be preloaded with some values first. We could unroll the main loop, saving some time putting things on the stack. And what’s with that time-consuming printf - why not write the output to a binary file? We could rewrite this in assembler and save even more time.

But you can’t deny all of these things will take longer to implement. The lookup improvement - that’s full of gotchas (What happens if the file’s not there? What happens if it gets corrupted or deleted? Does the routine need to be thread safe? Has the running user got permission to see it?) that might need to be handled.

Is it worth it? Maybe, if you’re going to be calculating prime numbers to a million on a regular basis. But it’ll take you longer and you won’t get more readable code out of it.

I’d say there is a time and a place for optimization.

If you’re programming in C for a piece of hardware with minimal resources, it’s REALLY important.

If you’re trying to query a data source with a few million rows and join it with multiple tables with millions of rows, it’s REALLY important.

If you’re making a web site to display someone’s timesheet for a company of 500 people, it’s not so important. In that case the limited resource is the cost of the employee doing the development, and the faster the development the better.

It’s a judgement call that has made some people rich and others poor over the years. MS Excel killed Lotus 123 with 10 programmers using C++ and concentrating on a GUI instead of Lotus with 100 programmers optimizing assembly code.

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.” (Knuth, Donald. Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p.268.)

If you’re reading this and thinking “it doesn’t apply to me, I program in .NET”, think again. Knowing what you write and how it compiles down to intermediate language (IL) is the first big thing to learn for speeding things up; then learn where your objects get created, aka boxing and unboxing; then look at the other algorithm optimisations… then start to look at database layer speed… then network operations, aka web service calls and ASP architecture… just in case they are watching…

To no-fun:

Optimizing code using low-level assembly is not as vital in most development areas. I even remember reading an article titled something like “Is C the new assembly?” - I don’t remember where I read it, but it basically said that everybody who programs in some high-level language (C#, Java…) turns to C when they require high performance, much like everybody who programmed in C used to turn to assembly for performance. So the point is: whoever still programs in C might still need assembly.

This guy needs to update his material. 20 years ago I worked with programmers who had more of a challenge trying to fit their operating system into 256K RAM. THAT was the real challenge. The days of 2-4G RAM have nurtured a generation of lazy, sloppy programmers writing inefficient, bloated code. Vista is a classic example. We should be getting better at this, not worse.

This was one of my favorite books when I was about 19. It inspired me to learn Pentium U/V pipeline optimisation. I downloaded all the optimisation docs from Intel, and spent a couple of days writing one of those old-school flame effects. I was able to time the number of cycles it took to calculate each pixel, and it matched what I expected from my code. I was very pleased with myself when I managed to get it down to something like 5 cycles per pixel. I knew I had the fastest code.

Just to feel even more smug, I decided to write it in C to see how much faster my code was than some compiled code.

The C was faster. I never wrote any intel assembler again after that.

This is great. I recall voraciously reading Michael Abrash’s work; I used to snuggle up in the corner to read about the BSP trees behind the simulated world of Doom.

Well, many years ago, I had to write a real-time system. I was interested in the benefits of coding it in PL/M, but concerned about performance.

So, I wrote some code in PL/M and looked at the assembly it generated - just about identical to what I would have done!

I haven’t used assembly language since - but 20 years further on, I still believe that having at least a basic understanding of how things work down at that level is key to being a good developer.

And in fact, once you’d been round the loop and coded a few basic libraries, assembly development wasn’t actually as slow as its reputation would lead you to believe.

Similarly, COBOL, which is well out of fashion these days, is probably still the best batch-processing language there is. I worked on a C++ job a few years back, which took about 4 months. I could have written that in COBOL in two weeks!

Bottom line, don’t let fashion get in the way of using the right tool for the job - you can write crap in any language!

Unfortunately, it doesn’t matter if you optimise your code nowadays - you’re gaining next to nothing for all your efforts.

I mean, I had a discussion with a chap about singleton performance in C#, he wrote 2 little test apps that showed the differences between the 2 techniques we discussed and I ran it - one took 6 seconds on my laptop, the other 1.5 seconds. This showed a difference but not a big one considering the iterations required to show a count in seconds. However - I then ran it on the big superfast dual-proc server… and both apps took 20 seconds to run. Stupid C# decided it knew best (even when I ran it with CPU affinity) and “un-optimised” it for the worst case on 2 cores, even though it had been designed for multi-threading anyway!

Another case in point, a colleague decided he could optimise a module in our server app to make everything go lots faster. He spent ages doing it, and it speeded up his tests on his PC. Then we ran it overnight on the test environment… and the whole application ran slower than before. The cause was threading issues - easy to make it go fast on 1 thread, but on a n-tier server farm the locks and memory usage make for a different story.

So - optimise by all means, but make sure you’re optimising the right bits, in the right way, or it’s just a waste of time.

Great post. I enjoy most of your posts but I particularly like how this one goes back to your original “coding horror” theme.

I had already read a few pages of The Graphics Programming Black Book (and enjoyed it very much) before I read the following paragraph:

“But you know what you’re going to do? You’re going to click through anyway and read some of it.”

I bought this book at a bookstore around 4 years ago, when I was in college. I lent it out to a friend and got it back the next year. This book is amazing.
I almost feel sad that a lot of the work today is done on higher level languages like .NET or Java (I work in a .NET place). This book speaks to the inner geek that got me first interested in programming. This book has given me lots of insight and a huge dose of humility when it comes to programming.
For people who think something like this isn’t relevant, let me reintroduce you to the iPhone and its SDK that is coming out soon. People are going to want to create some amazing apps on limited hardware. It’ll take performance minded people in order to pull off sophisticated applications on the limited devices we’re going to see in the future.

Wow! I wish I had known about this book about a year ago when I was coding an implementation of the RTCP protocol in assembler. Very limited CPU and memory but high demands for speed.

Great tip, thank you! -Downloading the pdf as I’m writing this.

Ha, that book has been sitting on my bookshelf ever since I found it in a clearance bin sometime early this century. You’ve inspired me to open it!

As an example of prime speed… here’s just a faster prime:

#include <stdio.h>
#include <math.h>

int is_prime(int c) {
    if (c < 2) return 0;
    if (c == 2 || c == 3) return 1;
    if (c % 2 == 0 || c % 3 == 0) {
        return 0;
    }
    int i = 5;
    int limit = (int) sqrt((float)c) + 1;
    while (i <= limit) {
        if (c % i == 0) {
            return 0;
        }
        i += 2;
    }
    return 1;
}

int main() {
    int i;
    for (i = 0; i < 10000000; i++) {
        if (is_prime(i)) {
            printf("Is prime %d \n", i);
        }
    }
    return 0;
}
// compile with -lm
// it checks every integer (by intention, just to show that it works …)

#shell# time ./fastprime > /dev/null

real 0m10.022s
user 0m9.985s
sys 0m0.000s

This is pretty fast and yet readable - just pure C and a bit of math logic.

Always remember that code heaviness must be in balance with the task and the goal; this code is simple and yet finds all primes up to 10 million in about 10 seconds.

wc is all I need to count words!