To ECC or Not To ECC

Dan Luu doesn’t think that blindly following Google’s examples is an automatic winner:

Note that these are all things Google tried and then changed. Making mistakes and then fixing them is common in every successful engineering organization. If you’re going to cargo cult an engineering practice, you should at least cargo cult current engineering practices, not something that was done in 1999.

Meanwhile, Joe Chang has a sick burn on Windows 95:

I recall that the pathetic (but valid?) excuses given to justify abandoning parity memory protection was that DOS and Windows were so unreliable so as to be responsible for more system crashes than an unprotected memory system.

It is a very good piece from someone who was there, and well worth reading, but a bit… hand-wavy for my tastes. He does not address all three studies, and does not acknowledge that we are not exactly copying Google from 2000, we are using commodity parts that reflect the state of 2016 computing – which is considerably more advanced than 2000, mostly due to massive integration (hey where did those dual CPU slots and network cards go?) and everything going solid state.

The Puget Systems data was also ignored. But it jibes with what I have experienced. For example, in 2011 I knew of so many consumer SSD failures that I wrote a whole article about it. Today, we have run 24 consumer SSDs in our servers for the last 3 years and exactly zero have failed.

Also, this is interesting.

Introducing "Yosemite": the first open source modular chassis for high-powered microservers - Engineering at Meta

But that solution didn’t work well because the single-thread performance was too low, resulting in higher latency for our web platform

Where else have I heard this… oh yes :wink:

Throwing umptazillion CPU cores at Ruby doesn’t buy you a whole lot other than being able to handle more requests at the same time. Which is nice, but doesn’t get you speed per se.

Also, this is interesting. IEEE publishes study by University of Toronto on RAM corruption issues - far more common than previously estimated.

DRAM’s Damning Defects—and How They Cripple Computers - IEEE Spectrum

After reading this blog post and looking at the amount of discussion about this issue, one could easily argue that the cost of deciding between ECC and non-ECC is greater than the cost of just buying something and living with the consequences.

“As a whole, hardware appears to be continuing the trend of becoming more and more reliable.” https://www.pugetsystems.com/labs/articles/Most-Reliable-Hardware-of-2015-749/

Should I now wait 2 years for enterprise tech bargains? After 15 years, has enterprise tech as a sector become moot? Or should I jump with both feet into 2016 Skylake?

DEF CON 19 - Artem Dinaburg - Bit-squatting: DNS Hijacking Without Exploitation
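
For context, the idea behind that talk is that a single flipped bit in a domain name held in memory can turn it into a different, still valid name. Here is a minimal sketch of enumerating those one-bit neighbours for a single DNS label. The function name and the simplified character rules are assumptions for illustration, not code from the talk.

```python
import string

# Simplified hostname alphabet. Uppercase letters are left out on purpose:
# DNS is case-insensitive, so a flip to uppercase still reaches the same name.
ALLOWED = set(string.ascii_lowercase + string.digits + "-")

def one_bit_variants(label: str) -> list[str]:
    """Return every label that differs from `label` by exactly one flipped bit."""
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped not in ALLOWED:
                continue
            candidate = label[:i] + flipped + label[i + 1:]
            # A label may not start or end with a hyphen.
            if candidate.startswith("-") or candidate.endswith("-"):
                continue
            variants.add(candidate)
    return sorted(variants)

if __name__ == "__main__":
    # "example" is a placeholder label, not a domain discussed in the thread.
    for name in one_bit_variants("example"):
        print(name)
```

Each printed name is one memory bit error away from the original, which is why the talk is relevant to a thread about unprotected RAM.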

Great stuff. It seems that the 2007 study you linked to is at odds with the findings of several of the other papers. Specifically, it (a) assumes that there is no correlation between soft and hard errors, and (b) uses the Poisson distribution to compute the upper bound of soft errors. Given that the field study with the largest pool of machines finds that the occurrence of a single soft error greatly increases the likelihood of further soft errors, and also often is the precursor to a hard error, both of these assumptions are suspect.
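
To make assumption (b) concrete, here is a minimal sketch of what a Poisson model implies. The rate used is a made-up illustration value, not a figure from any of the studies, and the function name is hypothetical. Under a Poisson process with a constant rate, errors are independent, so having just seen one error does not change the chance of seeing another in the next interval. That is exactly the property the field data appears to contradict.

```python
import math

def p_at_least(k: int, rate_per_hour: float, hours: float) -> float:
    """P(N >= k) when N ~ Poisson(rate_per_hour * hours)."""
    lam = rate_per_hour * hours
    # Sum the probabilities of seeing 0, 1, ..., k-1 errors, then take the complement.
    p_fewer = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - p_fewer

# ASSUMPTION: illustrative rate only, not taken from any study.
RATE = 1e-4        # soft errors per machine-hour (hypothetical)
YEAR = 24 * 365    # hours in a year

p_one = p_at_least(1, RATE, YEAR)
# Independence: under this model the chance of an error next year is the same
# whether or not the machine already had one this year.
p_next_given_previous = p_at_least(1, RATE, YEAR)

print(f"P(at least one error in a year)      = {p_one:.3f}")
print(f"P(error next year | error this year) = {p_next_given_previous:.3f}")
```

If one soft error really does make further errors more likely, the second number should be higher than the first, and a bound computed from the Poisson model would be too optimistic for machines that have already shown an error.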

Without meaning to muddy the waters, I am also struck by the fact that ECC seems to be discussed in isolation, rather than as part of a greater effort to preserve data integrity. Surely a system that is important enough to have its data protected by ECC should also be using an atomic COW file system such as ZFS or BTRFS? Of course, I am approaching this as someone looking to put together a single workstation PC, rather than a server farm where it may well be that everything fits into memory…

I would love to have ECC RAM if Intel was not so greedy and stopped artificially crippling its sanely priced chips (disabling ECC support, disabling HT on i5s, locking multipliers, randomly disabling virtualization extensions, …), included support for the now-enabled functions in all chipsets, and made it mandatory for motherboard manufacturers to support them too (with buggy implementations not counting). Giving us a few more PCIe lanes/SATA ports/USB 3 ports, support for more RAM (128 GB would be nice), higher-TDP desktop chips, a few more cores (so an i3 would now have 4c/8t, an i5 6c/12t and an i7 8c/16t), and a desktop CPU without the iGPU and with more cores/cache instead (let’s call it an i8 and give it 12c/24t and double the L3 cache). Finally ending the scamming with mobile parts and calling glorified i3s i5s/i7s.

And moving towards an extra memory channel or two for the architecture after Skylake (this would also mean doubling the max memory to 256 GB, moving to 6 to 8 memory channels for the expensive CPUs that have quad-channel memory now, and increasing their core count by the same percentage).

Oh and reversing the price creep by a 30% price cut across the board.

And then I wake up.

Hey, you’ve gotta have your product differentiation, or you might not capture all that economic surplus…

Also, just because no post is complete without a SwiftOnSecurity tweet:

And that’s proven how, exactly? Lots of problems can manifest as bad data.

If there is a run of memtest that fails years later after the initial build, I’ll accept that as a valid answer, otherwise… voodoo computing.

Hi, this article is still very interesting in 2016 :slight_smile:
I am by no means a pro or even a programmer, and I’m aware that personal experience may not have much value compared to wide-scale tests.

But here is what happened to me last year.
I own an Acer consumer laptop with 2x8 GB Kingston modules and an i7-4702MQ. Pretty crappy on the power supply side by the way, but that’s not the point.
I had never bothered to think about what benefit ECC could bring.
One day, I did a full system wipe and copied all my files back to the internal HDD (getting it partitioned more conveniently in the process).
All went fine… apparently. I soon noticed corruption on some frames in many of my videos.

The culprit was one of the RAM modules. But the only affected files were bigger than 1 GB. I reproduced the problem with TeraCopy: copying a big film to my external drive and back with the integrity check enabled would fail about 1/3 of the time.

The memory modules worked fine one at a time, and even together in swapped slots. Was the motherboard failing? No.
Back in their original slots, the problem returned. A 24-hour memtest86 run came back all clear, but files were still being damaged while copying.
I then cleaned the slots with a brush and compressed air, and voilà! Even though I am very careful with my laptop, dust and moisture had made me lose two days to troubleshooting.

The sfc scan reported large corruption as well.

One year later my computer still works fine, but I am considering buying an ECC-capable rig to avoid damaging part of my backup (including many family films that I consider valuable). I also run periodic integrity checks on my external backup drive now.

Thankfully the damage is limited to a few frames and the videos are still watchable, but many of them are affected. So even a careful noob with standard consumer use may run into such issues.

Sorry for the tl;dr effect xD

Hi Jeff, Have you made any price comparisons lately (even better, TCO comparisons) between your custom build servers and an equivalent cloud subscription? (you made one a few years ago, I’m curious to know if any data has changed in favor of cloud) Also, another question I wanted to ask, do you run your web sites in VMs or on the physical server itself? You can technically run 4 VMs on a server with 64 GB RAM (or so I’m told). Thanks for your informative posts.

This post is misleading!
ECC RAM and Checksum filesystems (like btrfs or zfs) exist for a reason.
Concerning RAM, I’ve seen defective RAM (Kingston non-ECC) pass through memtest86 with no errors but fail a heavy build with gcc.
I’ve seen corrupted files resulting from a simple filesystem copy.
In summary… hardware fails, and checksums are the way to go.
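
As a rough illustration of the checksum approach, here is a minimal sketch of a copy that verifies itself, using only the Python standard library. The function names are hypothetical, and this is not how any particular copy tool or file system implements its checks.

```python
import hashlib
import shutil

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_verified(src: str, dst: str) -> str:
    """Copy src to dst, then re-read both and compare their SHA-256 digests."""
    expected = sha256_of(src)
    shutil.copyfile(src, dst)
    actual = sha256_of(dst)
    if actual != expected:
        raise OSError(f"checksum mismatch copying {src!r} to {dst!r}")
    return actual
```

Two caveats apply. Re-reading the destination right after the copy may be served from the operating system's cache rather than from the disk, and the hashes themselves are computed in the same RAM that may be faulty. A mismatch tells you something went wrong, but a match on a machine with bad memory is weaker evidence than it looks, which is part of why this complements ECC rather than replacing it.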

Unless everything is “ECC”, you run risks across the whole computer system. It’s only really valuable where math is being done for long periods. I agree that checksums are the real answer to the errors ECC would protect against.

I’m annoyed by how indecisive this whole discussion is. How about listing all of the ways data can become corrupted: in memory, over the network, on disk, on SSD, in the CPU? Then list the protections required to prevent it in every known case… People don’t seem to know much about this kind of stuff - they just reference Google’s massive HDD reliability study and make assumptions - and we’re left without many good ECC products anyway. No wonder no one feels like they’re necessary. The question is, if you have all of your other bases covered, then how much extra are you willing to pay for the last base: ECC… However, maybe it is not the last base. Can we have the benefits of fast unbuffered memory with the same reliability as ECC using some unpopular method or strategy?

Take a look at the 2018 data from Puget

It’s quite thorough.

@codinghorror Not very thorough… It doesn’t really cover AMD, and that is of interest for ECC - as another commenter above said, “I would love to have ECC ram if Intel was not so greedy and stopped artificially crippling its sanely priced chips.” Ryzen + AMD may be the best bet for many of us, but I haven’t seen them support registered ECC. Furthermore, in terms of being thorough, I’m underwhelmed by the analyses of all experts these days. I realize few have the knowledge to put together a truthful analysis of data corruption, bit rot, and the like across all the layers of complexity that can potentially distort data. This would be of interest, for example, to anyone who is concerned about geomagnetic reversals or large coronal events. During such events, we would find out who really understands how to maintain data integrity long-term.

I think you either misread or didn’t read what I linked; that’s not “analysis of experts”, that’s actual return and failure rate data based on live shipped systems.
