CPU vs. GPU

Intel's latest quad-core CPU, the Core 2 Extreme QX6700, consists of 582 million transistors. That's a lot. But it pales in comparison to the 680 million transistors of nVidia's latest video card, the 8800 GTX. Here's a small chart of transistor counts for recent CPUs and GPUs:


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2006/11/cpu-vs-gpu.html

We can only hope that we see more non-gaming apps offloaded to the GPU. I need a good reason to buy one of those 600 dollar monster cards, after all.

Your analysis helps to explain how/why videogame consoles are usually at an advantage over PCs for at least the first few years they’re out.

Game systems are a lot like IBM’s Deep Blue in the sense that they’re designed solely to excel at a specific application: games. To this end, they come with a custom hardware bus and whatever coprocessors are required to squeeze every last ounce of performance from each cycle. You’re essentially seeing the same thing happen with these advanced video cards: though they might run at lower clock rates and with fewer resources, that entire circuit board is one highly optimized piece of hardware, which in the end can outperform the entire host system - at certain kinds of tasks.

It’s not about proving that a gaming PC can match or beat a PS3 or 360 - of course it can; that’s generally the case when you throw enough money, memory and megahertz at a problem. It’s more about giving the consoles their due and recognizing that they’re designed differently from the ground up.

A good example is a sorting competition sponsored by Microsoft Research. A single Nvidia card won against standard microprocessor cores. See http://research.microsoft.com/barc/SortBenchmark/

It’s just a little nit, but Deep Blue had ~500 specialized chess processors, so each specialized chip could do less than half a million moves a second.

“But I also expect quite a few computing problems to make the jump from CPU to GPU in the next 5 years”

It’s at this point we start wondering why AMD bought ATI. Probably not so they could sell graphics cards.

Yep, those GPUs are great. However, let’s not get carried away. Modern CPUs excel at scalar, branch-heavy code with random memory access patterns. Any one of those would make an 8800 GTX crawl.

  1. CPUs aren’t scaling very well right now. GPUs are scaling well beyond Moore’s Law speed.

Moore’s Law applies to transistor count and has nothing to do with performance.

Just like the FPU was merged into the CPU, I expect we’ll see the GPU merged into the CPU as well. In fact, Intel already produces chipsets with (somewhat primitive) integrated graphics. This move, if carried to fruition, could edge NVidia out of the market, were it not that NVidia designs much faster graphics hardware than Intel does. ATI could create the same chipset for AMD.

The last I heard, OpenGL might also be on the chopping block. There was one, and only one, reason that OpenGL drivers were included in the last generation of NVidia processors: the guys who write Doom said that they’d not consider a DirectX implementation.

I did read that article yesterday… together with this one… Supposedly DirectX 10 will require roughly ten times fewer calls to the graphics card to get the same job done.

http://tomshardware.co.uk/2006/11/08/what_direct3d_10_is_all_about_uk/page6.html

GPUs are getting amazingly powerful! If I were Intel I would be worried… AMD has already bought ATI. That combination will enrich both makers’ base processor development and should yield even more impressive performance. (Warning: wishful thinking going on.)

Nice to see the @home projects using the spare power of these home monster processors.

Moore’s Law applies to transistor count and has nothing to do with performance.

I get what you’re driving at here, but to imply that # of transistors per CPU has no correlation with performance is absurd.

http://en.wikipedia.org/wiki/Moore%27s_law

The most popular formulation is of the doubling of the number of transistors on integrated circuits (a rough measure of computer processing power) every 18 months. At the end of the 1970s, Moore’s Law became known as the limit for the number of transistors on the most complex chips. However, it is also common to cite Moore’s Law to refer to the rapidly continuing advance in computing power per unit cost.

“But I also expect quite a few computing problems to make the jump from CPU to GPU in the next 5 years”

Why? With the number of cores increasing, why wouldn’t you just throw those computing problems at another core of the CPU?

Incidentally, it’s not entirely correct to say 3DNow!, MMX, etc. are like stream processing. Those are single-instruction-multiple-data (SIMD) operations, like adding two matrices to one another.

Stream processing involves multiple-instruction-multiple-data processing. Each processor is fed a (necessarily) small group of instructions called a “kernel”. So each processor is in effect executing a loop consisting of a small group of SIMD instructions. The NVidia 8800 has 128 processing elements, so that’s a lot of loops running at once!
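To make the two models concrete, here’s a minimal C++ sketch (just ordinary host code, not real GPU code; the function names are made up for illustration). The first function is the kind of tight loop SIMD instructions accelerate on a single core; the second is a tiny per-element “kernel” of the sort a stream processor would run in many copies at once, one per processing element.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// SIMD-style: one instruction stream over packed data. A compiler can
// vectorize this loop with MMX/SSE-style instructions, but a single
// CPU core still walks the whole array.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// Stream-style: a tiny "kernel" that sees one element and knows nothing
// about its neighbours. A GPU runs many copies of a kernel like this at
// once, each processing element fed a different item from the stream.
float example_kernel(float x) {
    return x * 0.5f + 1.0f;   // arbitrary per-element work
}

int main() {
    std::vector<float> a(8, 1.0f), b(8, 2.0f), c(8);
    add_arrays(a.data(), b.data(), c.data(), c.size());

    // On a CPU this is just a serial loop; on a stream processor every
    // iteration could run on a different processing element in parallel.
    for (float& x : c)
        x = example_kernel(x);

    std::printf("c[0] = %f\n", c[0]);   // (1 + 2) * 0.5 + 1 = 2.5
    return 0;
}
```

The point is that the kernel only ever sees its own element, which is what makes it trivially parallel across 128 processing elements.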

The only fly in the ointment is that - the last time I heard - graphics cards didn’t use IEEE floating point so you have to be very careful about round-off errors and so on.
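To show why that precision caveat matters, here’s a small hedged C++ example (plain CPU code, nothing GPU-specific) of two classic single-precision pitfalls: integers above 2^24 that can no longer absorb an increment, and a long accumulation that drifts away from the exact answer.

```cpp
#include <cstdio>

int main() {
    // Above 2^24, single-precision floats can't represent every integer,
    // so adding 1.0f can have no effect at all.
    float big = 16777216.0f;   // 2^24
    std::printf("%.1f + 1.0f = %.1f\n", big, big + 1.0f);

    // Accumulating a long stream in single precision drifts: the exact
    // answer is 1,000,000, but the float total ends up visibly off,
    // while the double total stays far closer.
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < 10000000; ++i) {
        fsum += 0.1f;
        dsum += 0.1;
    }
    std::printf("float  sum: %f\n", fsum);
    std::printf("double sum: %f\n", dsum);
    return 0;
}
```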

With the number of cores increasing, why wouldn’t you just throw those computing problems at another core of the CPU?

  1. CPUs aren’t good at these kinds of problems. As mentioned in the post, SIMD instructions are quite slow relative to what a GPU can do. CPUs are somewhat parallel, whereas GPUs are massively parallel, to the tune of 48 or 96 “processors” on today’s cards. Plus a GPU has many times the memory bandwidth of any CPU.

  2. CPUs aren’t scaling very well right now. GPUs are scaling well beyond Moore’s Law speed. Right off the top of my head, I can tell you that each of the last three nVidia cards I owned was truly 2x faster than its predecessor [in games], and they were all released less than a year apart.

When was the last time you bought a CPU that doubled the speed of your applications? Probably never, unless your last CPU is of 2001 vintage.

The Ars Technica article is more (get it??) thorough.

http://arstechnica.com/articles/paedia/cpu/moore.ars/1

Basically, everything you think is true, isn’t. Using some of that transistor budget for a full BCD adder/multiplier would finally put a stake in the heart of the mainframe, and we could all go back to writing COBOL G/L programs.

To answer the question “when was the last time you bought a CPU that doubled the speed of your applications?”, check out this Sysmark 2004 graph:

http://www.tomshardware.com/2004/03/18/spring_speed_leap/page25.html

The highest value in the “Office Productivity” chart is 204, which means we need a SysMark 2004 score of 102 to prove a true doubling of speed across all applications in SysMark 2004. The slowest processor on the list, the Athlon XP 2600+, has a score of 140. Nobody has benchmarks that go back far enough, but I’d presume a system around the level of a Pentium 4 1.8 GHz or so would dip down to 102.

That review was posted in March 2004, and the P4 1.8 GHz was introduced in July 2001. So it took about three years of CPU speed improvements to double performance in typical office applications.

You might be interested in (yet another) Microsoft Research project, Accelerator:
http://research.microsoft.com/research/downloads/Details/25e1bea3-142e-4694-bde5-f0d44f9d8709/Details.aspx

My bad: the SysMark 2004 scores are calibrated to a reference system, a P4 2.0 GHz. See page 14 of this PDF:

http://www.bapco.com/techdocs/SYSmark2004WhitePaper.pdf

Thus, a system which scores 200 on the SysMark 2004 office benchmark will be twice as fast as that system. Duh! The Pentium 4 “Extreme Edition” 3.2 GHz scores 197 on the Tom’s Hardware page.

http://www.tomshardware.com/2004/03/18/spring_speed_leap/page25.html

Pentium 4 2.0 GHz - August 27th, 2001
Pentium 4EE 3.2 GHz - November 3rd, 2003

Thus, it took 26 months-- over two years-- for CPU speeds to double actual real world performance in typical office-style applications. At least according to SysMark 2004, which is a fairly solid real-world benchmark.

It’s absurd that a conventional microprocessor uses about 100 million transistors to execute a single stream of instructions – 30 years ago, microprocessors were able to execute a stream of instructions with fewer than ten thousand transistors!

For the past 15 years we’ve seen processors use pipelined and superscalar architectures to discover “hidden parallelism” in a single stream of instructions, but that’s a losing game.

People saw that Moore’s Law was going in this direction back around 1980: the Japanese government predicted that a “Fifth Generation” computer architecture would involve massive parallelism on a single chip. They launched a ten year effort to develop a programming language, hardware architecture and software environment for parallel programming.

Many people think the Fifth Generation project was a failure. Those people are wrong. The Fifth Generation project delivered working hardware and software, and achieved good parallelism for some tasks. There were two reasons why the world didn’t care: (1) the world was losing interest in the “Artificial Intelligence” and Logic Programming paradigm it was based on, and (2) commodity hardware was improving in performance so rapidly that it left them behind.

Today’s multi-core processors are the beginning of the real fifth generation. Parallel chips power GPUs and the PS3, as well as advanced radios and network routers. Soon we’ll be putting billions of transistors on a die, and software people like us will be struggling to keep up!

Isn’t this the reason the PS3 used a multiple-core system, with specific tasks assigned to different cores, such as video rendering and 3D functions? So it looks like Sony already paid IBM to combine them.

Of course, I don’t remember where I saw it, but the multi-core issue boils down to two problems: how to exploit parallelism for inherently single-task problems (most business software) without running afoul of thread stomping; and its flip side, which is that the trend in multi-core designs is to run at lower clock rates. The MIPS of such a machine still goes up, but only if the code can effectively exploit multi-threading. That’s going to be the trick.
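As a hypothetical illustration of that “thread stomping” problem (the counter and function names here are invented for the example), two threads incrementing a shared plain int will lose updates, while the atomic version stays correct at the cost of serializing exactly the operation you hoped to parallelize. Compile with -pthread on gcc/clang.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

int unsafe_counter = 0;               // plain int: increments can stomp on each other
std::atomic<int> safe_counter{0};     // atomic: correct, but the increment is serialized

void work() {
    for (int i = 0; i < 1000000; ++i) {
        ++unsafe_counter;   // unsynchronized read-modify-write: a data race
        ++safe_counter;     // always ends up at exactly 2,000,000
    }
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // The unsafe total usually falls well short of 2,000,000.
    std::printf("unsafe: %d\n", unsafe_counter);
    std::printf("safe  : %d\n", safe_counter.load());
    return 0;
}
```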

As to the fifth generation: Prolog turned out not to be a general-purpose language, although the Amzi! folks still keep chugging along.

The irony is that Codd invented the RDBMS just before Prolog was created, and Prolog implemented what amounted to a database: a row in a table is a rule, and a rule in Prolog is a row.