On Managed Code Performance

My personal turning point on the importance of managed code was in September 2001, when the NIMDA worm absolutely crushed our organization. It felt like a natural disaster without the "natural" part-- the first notable port 80 IIS buffer overrun exploit. We got literally zero work done that day, and the next day wasn't much better. After surveying the carnage first hand, I immediately saw the benefit of languages where buffer overruns weren't even possible.

This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2005/03/on-managed-code-performance.html

Ok, so yeah, Abrash is all about Pixomatic (fast x86 software 3D rendering) into late 2004. You have to sign up for a free account, but his 3 part DDJ series on Pixomatic is really interesting reading:

In this three-part article, I discuss the process of optimizing Pixomatic, an x86 3D software rasterizer for Windows and Linux written by Mike Sartain and myself for RAD Game Tools (http://www.radgametools.com/). Pixomatic was perhaps the greatest performance challenge I’ve ever encountered, certainly right up there with Quake. When we started on Pixomatic, we weren’t even sure we’d be able to get DirectX 6 (DX6) features and performance, the minimum for a viable rasterizer. (DirectX is a set of low-level Windows multimedia APIs that provide access to graphics and audio cards.) I’m pleased to report that we succeeded. On a 3-GHz Pentium 4, Pixomatic can run Unreal Tournament 2004 at 640×480, with bilinear filtering enabled. On slower processors, performance is of course lower, but by rendering at 320×240 and stretching up to 640×480, then drawing the heads-up display (HUD) at full resolution, Unreal Tournament 2004 runs adequately well, even on a 733-MHz Pentium III.

The difference between today’s low-level Pentium 4 optimizations and the older optimization techniques he used on ye olde Pentium 1 are… uh, profound. Sort of a case study in what’s possible, even if it doesn’t ultimately make much sense IMO. It is amusing to try the software renderer in UT2004, though… download the free UT2004 demo and give it a shot! :wink:

From Part II:

I mention this in the context of the bilinear filter because that was where that lesson was driven home. You see, I came up with a way to remove a multiply from the filter code—and the filter got slower. Given that multiplication is slower than other MMX instructions, especially in a long dependency chain such as the bilinear filter, and that I had flat-out reduced the instruction count by one multiply, I was completely baffled. In desperation, I contacted Dean Macri at Intel, and he ran processor-level traces on Intel’s simulator and sent them to me.

I can’t show you those traces, which contain NDA information, but I wish I could because their complexity beautifully illustrates exactly how difficult it is to fully understand the performance of Pentium 4 code under the best of circumstances. Basically, the answer turned out to be that the sequence in which instructions got processed in the reduced multiply case caused a longer critical dependency path—but there’s no way you could have known that without having a processor-level simulator, which you can’t get unless you work at Intel. Regardless, the simulator wouldn’t usually help you anyway because this level of performance is very sensitive to the exact sequence in which instructions are assigned to execution units and executed, and that’s highly dependent on the initial state (including caching and memory access) in which the code is entered, which can easily be altered by preceding code and usually varies over time.

Back in the days of the Pentium, you could pretty much know exactly how your code would execute, down to the cycle. Nowadays, all you can do is try to reduce the instruction count, try to use MMX and SSE, use the cache wisely and try to minimize the effects of memory latency, then throw stuff at the wall and see what sticks.

great, great stuff!

Mike worked at MS on Xbox up until some time in 2001, it appears.

Well, here’s one thing he has worked on somewhat recently-- RAD Game Tools’ Pixomatic software renderer, circa 2002, last updated 1-2005 (!)


And yes, UT 2004 DOES use the Pixomatic renderer if you switch to software rendering. Be sure to turn the resolution way, way down before doing this, or you’ll be sorry… like I was :wink:

"the first notable port 80 IIS buffer overrun exploit."

The problem isn’t the language allowing buffer overruns; the problem is using a closed source web server. Security through transparency is much better than security through obscurity. Hell, if it were open source and your own developers had looked at it, one of them might have fixed the bug before you were affected by it.

Given enough eyeballs, all bugs are shallow.

This may or may not be a cheap way to plug your old company’s quake.net project, but damn if that isn’t a cool project.

From what I hear, Abrash has been working on Larrabee. In fact, he’s giving a talk at GDC 2009:

To me it’s not the GC itself that is the biggest issue with Microsoft’s managed languages, but rather the IDisposable interface. The usage pattern of a class that implements that interface is so different from that of classes that do not (and imposes so much on the code that uses such a class) that I find myself implementing IDisposable JUST SO I WON’T HAVE TO REDESIGN ENTIRE SECTIONS OF MY APPLICATION if I find out later that my class needs it. This tends to cause a cascade whereby most of the classes in my project implement IDisposable just to be on the safe side.

If I knew up front which classes would need to implement IDisposable, this would not be such a problem. But because I develop iteratively, I do not always have this information a priori. Failure to implement or use IDisposable where necessary may result in bugs ranging from the minor to the major, and locating and changing every use of a class that has newly become IDisposable seems a perfect opportunity to introduce them.

Although I do not take the memory usage hit from the destructor where the use of IDisposable is unnecessary, the simple need to implement it almost everywhere is highly annoying to me. I get all of this in trade off for GC and bounds checking? I’d rather just try to write good code, honestly. I also lose the ability to execute code when leaving a lexical scope (unless I implement IDisposable, and then it depends on the good graces of the code using my class). VB.NET and C# may be fine for scripting a few components together, but for anything bigger (or that might need to grow bigger) than that, please give me a real language.
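For anyone who hasn’t used it, the "execute code when leaving a lexical scope" behavior mentioned above looks roughly like this in C++, where destructors run deterministically at the end of a scope. This is only a minimal sketch to illustrate the idea; `ScopedResource`, `events`, and `do_work` are hypothetical names made up for the example, not anything from a real library or from the original post.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical event log, used only to make the order of operations visible.
static std::vector<std::string> events;

// Hypothetical wrapper: acquires in the constructor, releases in the destructor.
class ScopedResource {
public:
    explicit ScopedResource(std::string name) : name_(std::move(name)) {
        events.push_back("acquire " + name_);
    }
    ~ScopedResource() {
        events.push_back("release " + name_);
    }
    // Non-copyable, so the release happens exactly once per acquisition.
    ScopedResource(const ScopedResource&) = delete;
    ScopedResource& operator=(const ScopedResource&) = delete;

private:
    std::string name_;
};

// The calling code contains no explicit cleanup at all.
void do_work() {
    ScopedResource file("file");
    ScopedResource lock("lock");
    events.push_back("work");
}   // Leaving the scope runs the destructors in reverse order: lock, then file.
```

After one call to `do_work()`, `events` holds "acquire file", "acquire lock", "work", "release lock", "release file". The point is that the cleanup is attached to the class itself, so the caller doesn’t have to remember a `using` block or a `Dispose()` call, and it also runs if the scope is left by an exception.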