Sometimes It's a Hardware Problem

One of our best servers at work was inherited from a previous engagement for x64 testing: it's a dual Opteron 250 with 8 gigabytes of RAM. Even after a year of service, those are still decent specs. And it has a nice upgrade path, too: the Tyan Thunder K8W motherboard it's based on supports up to 16 gigabytes of memory, and the latest dual core Opterons.


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2006/08/sometimes-its-a-hardware-problem.html

Funny coda to this story: this server was shipped to us as-is directly from the CPU manufacturer. Although this is clearly a motherboard (Tyan) problem, it’s still funny!

In 4 1/2 years of building computers in a store many years ago, I only once saw a bad Intel CPU; the most common causes of hardware crashes were motherboards, followed by RAM. However, during that same time, I did see quite a few dead (never malfunctioning) AMD chips.

So the link to gamepc is just to showcase the specs? They seem to have decent machines. Have you had any dealings with them?

I once got a phone call that a server I was responsible for had stopped responding. I VPN’ed to it, and it was processing transactions just fine. I immediately suspected a router problem, because we’d seen the same thing before with the routers. (We used Tibco Rendezvous for messaging, and it uses broadcast UDP packets. Our routers understood this and intelligently forwarded packets appropriately.) Infrastructure checked the router in question and it was fine. I then logged into my test server, which was on the same segment, and it thought that the production server was dead, too.
I then logged back into the production server, and checked the event log. One of the CPUs had died. Windows continued chugging along merrily, although it would no longer send out UDP packets. Rebooting fixed the network issue, and we replaced the dead CPU over the weekend. It still amazes me that a CPU dying would cause such an odd error.
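A symptom like this (the process is alive, but no UDP leaves the box) is the kind of thing a small probe can surface before the phone rings. Here is a minimal sketch, assuming plain UDP sockets rather than Tibco Rendezvous; the function name `udp_round_trip` and its parameters are hypothetical. Run against loopback, as shown, it only exercises the local network stack. To check a real segment you would bind the receiver on a second host instead.

```python
import socket


def udp_round_trip(host="127.0.0.1", timeout=1.0, payload=b"ping"):
    """Send one UDP datagram to a listener on `host` and report whether it arrived."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Port 0 lets the OS pick a free port; getsockname() tells us which one.
        receiver.bind((host, 0))
        receiver.settimeout(timeout)
        sender.sendto(payload, receiver.getsockname())
        try:
            data, _ = receiver.recvfrom(len(payload) + 16)
        except socket.timeout:
            # Nothing arrived within the timeout: treat as a failed send path.
            return False
        return data == payload
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print("UDP ok" if udp_round_trip() else "UDP datagram never arrived")
```

A check like this would not have explained the dead CPU, but it would have pointed at the server rather than the router much sooner.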

Nice work Inspector Gadget!

Very interesting. Now if only there was a way to test hard disks as thoroughly (although we all know how long that would take).

Josh:
SpinRite for hard drives!
http://www.grc.com/spinrite.htm

Those are the hard-to-find problems. Hardware is the very last place you look when something goes wrong in an app.

Any good tools to test network cards?

I forgot to mention that we used CPU-Z to identify the mainboard and the SPD/memory timings:

http://www.cpuid.com/cpuz.php

Eber, as for testing network cards, I use pcattcp:

http://www.codinghorror.com/blog/archives/000339.html

I’ll concede that Steve Gibson stirs up hysteria at times, but the basic premise of SpinRite is sound:
If you lose data on your hard drive to a bad sector, and you want to recover that data, SpinRite does the trick. Is that worth $90? It depends on your valuation of the lost data.

The second assertion, that it can predict hard drive failure, is plausible as well. In NAND flash memory, you can certainly tell when a block is beginning to fail, as its number of error corrections begins to increase disproportionately to the rest of the device. It seems reasonable to assume the same predictive conclusions can be drawn for a hard drive.
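To make the flash example concrete, here is a rough sketch of the "disproportionate error corrections" idea. It assumes you already have a per-block count of corrected errors from somewhere; the function name, the `ratio` and `min_errors` thresholds, and the sample numbers are all made up for illustration and do not come from any real controller or from SpinRite.

```python
from statistics import median


def flag_suspect_blocks(corrected_errors, ratio=5.0, min_errors=10):
    """Return ids of blocks whose corrected-error count is far above the device median.

    corrected_errors: dict mapping block id -> number of corrected errors (assumed input).
    ratio, min_errors: illustrative thresholds, not values from any real device.
    """
    if not corrected_errors:
        return []
    baseline = median(corrected_errors.values())
    # A block is suspect only if it is both well above the median and
    # above an absolute floor, so a quiet device does not raise false alarms.
    threshold = max(min_errors, ratio * baseline)
    return sorted(
        block for block, count in corrected_errors.items() if count >= threshold
    )


# Hypothetical counts: block 3 is correcting far more errors than its neighbours.
counts = {0: 2, 1: 3, 2: 1, 3: 48, 4: 2, 5: 4}
print(flag_suspect_blocks(counts))
```

Whether a hard drive exposes counts that behave this way is exactly the open question here; the sketch only shows what the comparison would look like if it does.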

I’d forgotten all about Prime95. Thanks for the reminder!

As for SpinRite, it’s all junk science. I think it was Steve Gibson’s laughable assertions about raw sockets that first tipped people off about his motivations, but in any event, his claims about SpinRite make about as much sense as ouija boards and autointoxication remedies. Marketing rhetoric couched in pseudoscience and buzzwords, with lots of testimonials but no hard evidence.

It’s like acupuncture for your computer. At best it’s nothing but a placebo effect; at worst it could do serious damage by virtue of re-exposing bad sectors.