CPU vs. GPU

Your chess benchmark example is not valid. IBM’s hardware had a very simple position evaluator, which allowed it to evaluate an incredible number of positions per second. Fritz (the software you linked to) uses a better evaluator, which is slower but produces better results. That’s why Fritz with its 8 million positions per second is better than IBM’s machine with its 200 million positions per second.

buggyfunbunny – You can write imperative programs in prolog just fine; the question is “would you want to?”

Imperative programs can be confusing in prolog; often you need to use a logical failure to implement the flow of control when an imperative operation succeeds, or use logical success when an imperative operation fails. Your head gets twisted into knots quickly.

Prolog is more powerful than a relational database, because it can do reasoning to chain multiple rules together. As a result, it’s less scalable than a relational database – like all AI systems, it starts to crumble when there are more than about 10,000 rules.

Warren discovered that Prolog could be executed quicker than most people would expect, but it’s never going to be as fast as Fortran for numerical work.

Early on people had hope that Prolog could be parallelized, but it turned out that Prolog’s semantics are too powerful for parallelization. The Japanese invented a language, KL1, with weaker semantics. They built a KL1 runtime that got excellent parallelization for some workloads, and moderate parallelization for others. It was never compelling enough to catch on in the real world.

I see the multi-core transition going in two directions: running today’s applications faster and with less power, and enabling tomorrow’s applications.

I don’t think there’s a lot of pressure to speed up word processors. Spreadsheets can certainly be parallelized, as can databases. Today’s business apps are increasingly database-driven web sites, and these take to parallel computers like ducks to water. (Sun’s Niagara processor wipes the floor with the competition when it comes to web apps.) Games and other multimedia applications benefit marvelously from multi-core systems, as do scientific applications.

Multi-core systems (and other parallel processors) will open up all kinds of new applications:

  • software radio – digital radio systems that do d/a at RF rates and are fully programmable
  • network processors – programmable network routers, intrusion detection systems, SAN routers, TCP offload engines
  • “perception engines” – parallel processing will enable new applications in machine vision, machine learning, speech recognition, pattern recognition, et al. Many of the aims of the 5th generation project *will* be realized this time around, but by different means. Rule-based programming is dead, replaced by machine learning techniques that are structured more like scientific codes (matrix math for the support vector machine) or like database systems (multidimensional search for k-nearest neighbors); see the sketch after this list.
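To make the “structured more like scientific codes” point concrete, here is a minimal CUDA sketch of the data-parallel core of k-nearest-neighbor search: every thread computes the distance from the query to one reference point. The sizes, names, and data below are hypothetical, an illustration of the shape of the workload rather than anyone’s production code.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes the squared Euclidean distance from the single
// query point to one reference point -- the same few instructions
// applied to a big pile of data, which is what a GPU is built for.
__global__ void knnDistances(const float *refs, const float *query,
                             float *dists, int numRefs, int dims)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numRefs) return;

    float sum = 0.0f;
    for (int d = 0; d < dims; ++d) {
        float diff = refs[i * dims + d] - query[d];
        sum += diff * diff;
    }
    dists[i] = sum;
}

int main()
{
    const int numRefs = 1 << 20;   // hypothetical: ~1 million reference points
    const int dims = 8;            // hypothetical: 8-dimensional feature vectors

    size_t refBytes  = (size_t)numRefs * dims * sizeof(float);
    size_t qryBytes  = dims * sizeof(float);
    size_t distBytes = numRefs * sizeof(float);

    float *hRefs  = (float *)malloc(refBytes);
    float *hQuery = (float *)malloc(qryBytes);
    float *hDists = (float *)malloc(distBytes);
    for (int i = 0; i < numRefs * dims; ++i) hRefs[i] = (float)(i % 97) / 97.0f;
    for (int d = 0; d < dims; ++d) hQuery[d] = 0.5f;

    float *dRefs, *dQuery, *dDists;
    cudaMalloc(&dRefs, refBytes);
    cudaMalloc(&dQuery, qryBytes);
    cudaMalloc(&dDists, distBytes);
    cudaMemcpy(dRefs, hRefs, refBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dQuery, hQuery, qryBytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (numRefs + threads - 1) / threads;
    knnDistances<<<blocks, threads>>>(dRefs, dQuery, dDists, numRefs, dims);
    cudaMemcpy(hDists, dDists, distBytes, cudaMemcpyDeviceToHost);

    // Picking the k smallest distances is cheap by comparison and can stay
    // on the CPU; the distance pass above is the data-parallel bulk.
    printf("distance to reference 0: %f\n", hDists[0]);

    cudaFree(dRefs); cudaFree(dQuery); cudaFree(dDists);
    free(hRefs); free(hQuery); free(hDists);
    return 0;
}
```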

Mr. Houle – You can write imperative programs in prolog just fine; the question is “would you want to?”

No. But I’ve had to work on an application made with a Prolog mutant (not Amzi!), which was built by COBOL coders. Let’s just say it was the worst of both worlds.

I found the multi-core discussion again. Trip over to Artima. A number of threads running at the moment; Java-centric, so you MicroSofties be warned. The issue applies to any function-based (as opposed to OO-based) threading semantics. Holub discussed this years ago. He got generally roasted for being “too negative”, but that’s still the core issue.

I hope you’re right about seeing that jump to doing more on GPUs. I hope that for a very specific reason: I work for Peakstream (www.peakstreaminc.com) and we specifically write software that schedules large matrix calculations on GPUs :slight_smile:

Our trick to good performance in these operations is pretty simple: do large, SIMD-type operations, and then have what amounts to a JIT compiler to get everything scheduled on one or more GPUs and/or CPUs. Well, okay, it’s simple to say. Writing it takes a bit more work.
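I don’t know PeakStream’s actual API, so what follows is only a generic CUDA sketch of the kind of “large, SIMD-type operation” described above: a single elementwise pass (SAXPY) over millions of independent values, exactly the sort of thing a runtime is free to spread across however many GPU or CPU execution units it finds.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// y = a*x + y over a large array. Every element is independent, so the
// work can be split across thousands of GPU threads with no coordination.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 24;              // hypothetical: ~16 million elements
    size_t bytes = n * sizeof(float);

    float *hx = (float *)malloc(bytes);
    float *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 3.0f, dx, dy);
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %f (expect 5.0)\n", hy[0]);

    cudaFree(dx); cudaFree(dy);
    free(hx); free(hy);
    return 0;
}
```

The JIT-compiler scheduling is the hard part, of course; the point of the sketch is only that the operations being scheduled are embarrassingly parallel.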

For people wanting to do more general purpose programming on the GPU: I am the originator of an open-source shader meta-programming framework that generates shader code from C# at runtime. This eliminates the need for a shading language. There’s an alpha release and a getting started guide over at my website.

Do let me know what you think.

Yeah, I had the idea to do this as soon as GPUs started getting powerful, but it seemed too tough to get the GPU to execute arbitrary code, so I abandoned the idea.

You are wrong in stating that “Ten year old custom hardware is still 25 times faster than the best general purpose CPUs”. You are comparing a single processor made today with a massively parallel machine that had thirty CPUs and 480 specialized chess chips. The fact of the matter is that today’s CPUs are orders of magnitude faster than 10-year-old hardware, not the opposite.

Neural circuit simulations are an unbelievable fit with the GPU computation architecture. Our research group, Evolved Machines, applies large-scale neural circuits to sensory problems and is working with the G8800 now; we’ll be announcing something soon.

Two of ATI’s upcoming R600 video cards in a CrossFire configuration deliver 1 teraflop of performance:

http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,51_104_543~116238,00.html

A teraflop is one trillion floating point operations per second. In 2002, it took 50 of the world’s fastest computers to build a 1 teraflop machine:

http://www.zdnet.com.au/news/business/soa/Australian_astronomers_get_1_Teraflop_supercomputer/0,139023166,120267896,00.htm

My dollar is on the GPU!

Jeff,
You can’t fairly compare ATI’s R600 GPU to earlier supercomputers when talking about operations per second, for two reasons.

  1. When measuring FLOPS of supercomputers, they are almost always referring to 64-bit double-precision floating point operations, NOT the single-precision operations that graphics card makers (and Sony, regarding their Cell BE) love to throw around. (See the back-of-the-envelope sketch after this list.)

  2. Just as importantly, the architectures are completely different… The supercomputers achieving a teraflop are using general purpose processors executing complex code, whereas the massively parallel GPU is executing massively parallel graphics operations. Can these really be compared?
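To illustrate point 1, here’s a back-of-the-envelope sketch of how a theoretical peak-FLOPS figure is typically constructed. The numbers below are hypothetical placeholders, not ATI’s actual R600 specifications; the point is that the marketing number is a single-precision theoretical peak, while a supercomputer’s Linpack teraflop is a sustained, 64-bit measurement.

```cpp
#include <cstdio>

int main()
{
    // Hypothetical GPU -- the point is how a "peak FLOPS" number gets built,
    // not what any particular card actually delivers.
    double streamProcessors = 320;     // parallel ALUs
    double flopsPerClock    = 2;       // one multiply-add counted as 2 ops
    double clockHz          = 750e6;   // 750 MHz

    double peakFlops = streamProcessors * flopsPerClock * clockHz;
    printf("theoretical peak, single precision: %.0f GFLOPS\n", peakFlops / 1e9);
    return 0;
}
```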

Fortran?? Sorry pal, the 1970s ended in approximately 1982 more or less, when Roland released the TB-303. Think a bit and C for yourself…

If you are talking about numerical processing, let’s talk about machine-language libraries running on specialized hardware. This is not a language issue, unlike producing specialized administrative and business software, where you need to deal with large databases and model your client’s (often contradictory and incomplete) needs in less time than you would like.

Just waiting for an Nvidia CUDA/ATI driver for MSSQL, so I’ll upgrade my web DB server with a 3D graphics card, or use an Xbox for parallel processing :slight_smile:

Basically, we have a small number of commands that deal with a bunch of data. It’s as if it were made for SIMD processors like GPUs, where the whole graphics hardware engine can be programmed.
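As a purely hypothetical illustration (not any real MSSQL-on-GPU driver), a simple WHERE-plus-aggregate really is a few commands applied to a bunch of rows. In CUDA it comes out as one data-parallel pass; the table, column, and threshold below are made up.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Roughly: SELECT COUNT(*), SUM(amount) FROM orders WHERE amount > 100
// expressed as one data-parallel pass -- every thread tests one row.
// Atomics keep the sketch short; a real implementation would use
// per-block reductions instead.
__global__ void filterAndSum(const float *amount, int numRows, float threshold,
                             unsigned int *count, float *sum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numRows) return;

    if (amount[i] > threshold) {
        atomicAdd(count, 1u);
        atomicAdd(sum, amount[i]);
    }
}

int main()
{
    const int numRows = 1 << 20;            // hypothetical: ~1 million rows
    size_t bytes = numRows * sizeof(float);

    float *hAmount = (float *)malloc(bytes);
    for (int i = 0; i < numRows; ++i) hAmount[i] = (float)(i % 200);

    float *dAmount, *dSum;
    unsigned int *dCount;
    cudaMalloc(&dAmount, bytes);
    cudaMalloc(&dCount, sizeof(unsigned int));
    cudaMalloc(&dSum, sizeof(float));
    cudaMemcpy(dAmount, hAmount, bytes, cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(unsigned int));
    cudaMemset(dSum, 0, sizeof(float));

    int threads = 256;
    int blocks = (numRows + threads - 1) / threads;
    filterAndSum<<<blocks, threads>>>(dAmount, numRows, 100.0f, dCount, dSum);

    unsigned int hCount;
    float hSum;
    cudaMemcpy(&hCount, dCount, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&hSum, dSum, sizeof(float), cudaMemcpyDeviceToHost);
    printf("rows matching: %u, sum: %.1f\n", hCount, hSum);

    cudaFree(dAmount); cudaFree(dCount); cudaFree(dSum);
    free(hAmount);
    return 0;
}
```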

Didn’t I read some time ago that some folks at MIT or another university adapted graphics hardware for DB tasks? I’ll have to find the link…

The bulk of the transistor count on modern CPUs is cache memory, which gives something of a false picture of CPU complexity. I’m not sure what the bulk of the transistor count is on GPUs.

“The last I heard, openGL might also be on the cutting block. There was one, and only one, reason that openGL drivers were included in the last generation of NVidia processors: the guys who write Doom said that they’d not consider a DirectX implementation.”

That’s not credible, given the existence of NVidia and ATI graphics chips on Macs.

Isn’t this the reason the PS3 used a multiple-core system, with specific tasks assigned to different cores such as video rendering and 3D functions? So it looks like Sony already paid IBM to combine them.

No, PS3 has a relatively conventional NVidia 3D chipset in addition to the cell processors. The purpose of the 3D hardware is to draw the pretty pictures, and the purpose of the cell processors is to let them claim the machine runs at 2+ TFLOPS.

It’s absurd that a conventional microprocessor uses about 100 million transistors to execute a single stream of instructions – 30 years ago, microprocessors were able to execute a stream of instructions with fewer than ten thousand transistors!

It’s absurd that you need a multiply operation to multiply two numbers – early microprocessors were able to use the add instruction in a loop to do multiplication!
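For the record, here’s roughly what “use the add instruction in a loop to do multiplication” looks like, as a small C-style sketch rather than any particular early microprocessor’s code:

```cpp
#include <cstdio>

// Multiply two unsigned integers using only addition, the way early
// microprocessors without a hardware multiplier had to. This naive loop
// costs b additions; real implementations used shift-and-add to get it
// down to one add per bit.
unsigned int multiplyByAdding(unsigned int a, unsigned int b)
{
    unsigned int product = 0;
    for (unsigned int i = 0; i < b; ++i) {
        product += a;       // accumulate a, b times over
    }
    return product;
}

int main()
{
    printf("6 x 7 = %u\n", multiplyByAdding(6, 7));
    return 0;
}
```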

As previously noted, a modern CPU core is ~15 million transistors exclusive of L2 cache. The thousand-fold increase in transistor count from the good old days includes making all the registers 4x wider, making instructions that took 4 to 8 cycles execute in a single cycle, adding a floating point unit, pipelining the system, adding branch predictors, and in general making the system over 10,000 times faster.