To ECC or Not To ECC

jb24 · March 13, 2016, 4:08am

Should I now wait 2 years for enterprise tech bargains? After 15 years, has enterprise tech as a sector become moot? Or should I jump with both feet into 2016 Skylake?

darix · March 14, 2016, 6:11pm

DEF CON 19 - Artem Dinaburg - Bit-squatting: DNS Hijacking Without Exploitation

PlusOneCharisma · June 29, 2016, 3:45am

Great stuff. It seems that the 2007 study you linked to is at odds with the findings of several of the other papers. Specifically, it (a) assumes that there is no correlation between soft and hard errors, and (b) uses the Poisson distribution to compute the upper bound of soft errors. Given that the field study with the largest pool of machines finds that the occurrence of a single soft error greatly increases the likelihood of further soft errors, and also often is the precursor to a hard error, both of these assumptions are suspect.
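To make assumption (b) concrete, here is a minimal sketch of what a Poisson model implies. The rate below is a made-up placeholder, not a figure from any of the linked studies. Under Poisson the errors are independent, so the chance of a further error on a machine that has already had one stays tied to the same small base rate, which is the opposite of the clustering the large field study reports.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(exactly k events) when events are independent with mean count lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def prob_at_least(k: int, lam: float) -> float:
    """P(at least k events) = 1 - P(fewer than k events)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))

# Hypothetical rate: 0.1 expected soft errors per machine per year.
# This number is an assumption chosen for illustration only.
lam = 0.1

p_one = prob_at_least(1, lam)   # machine sees at least one error in a year
p_two = prob_at_least(2, lam)   # machine sees at least two errors in a year

# Chance of a further error given that one has already been seen,
# as predicted by the independence assumption baked into the model.
p_more_given_one = p_two / p_one

print(f"P(>=1) = {p_one:.4f}, P(>=2) = {p_two:.4f}, "
      f"P(>=2 | >=1) = {p_more_given_one:.4f}")
```

If measured data shows that a machine with one soft error is far more likely than this to see another, the Poisson bound understates the risk for exactly the machines that matter.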

Without meaning to muddy the waters, I am also struck by the fact that ECC seems to be discussed in isolation, rather than as part of a greater effort to preserve data integrity. Surely a system that is important enough to have its data protected by ECC should also be using an atomic COW file system such as ZFS or BTRFS? Of course, I am approaching this as someone looking to put together a single workstation PC, rather than a server farm where it may well be that everything fits into memory…

camilus · August 23, 2016, 11:57am

I would love to have ECC RAM if Intel was not so greedy and stopped artificially crippling its sanely priced chips (disabling ECC support, disabling HT on i5s, locking multipliers, randomly disabling virtualization extensions, …), included support for the now-enabled functions in all chipsets, and made it mandatory for motherboard manufacturers to support them too (with buggy implementations not counting). It could also give us a few more PCIe lanes/SATA ports/USB 3 ports, support for more RAM (128 GB would be nice), higher-TDP desktop chips, and a few more cores (so an i3 would now have 4c/8t, an i5 6c/12t and an i7 8c/16t), and offer a desktop CPU without the iGPU and with more cores/cache instead (let’s call it an i8 and give it 12c/24t and double the L3 cache). And finally end the scamming with mobile parts, where glorified i3s are called i5s/i7s.

And moving towards an extra memory channel or two for the architecture after Skylake (this would also mean doubling the max memory to 256 GB, moving to 6 to 8 memory channels for the expensive CPUs that have quad-channel memory now, and increasing their core count by the same percentage).

Oh, and reversing the price creep with a 30% price cut across the board.

And then I wake up.

womble · September 6, 2016, 10:56pm

Hey, you’ve gotta have your product differentiation, or you might not capture all that economic surplus…

Also, just because no post is complete without a SwiftOnSecurity tweet:

codinghorror · September 6, 2016, 11:01pm

And that’s proven how, exactly? Lots of problems can manifest as bad data.

If there is a run of memtest that fails years later after the initial build, I’ll accept that as a valid answer, otherwise… voodoo computing.

Jhkh · October 15, 2016, 10:40pm

Hi, this article is still very interesting in 2016.
I am by no means a pro or even a programmer, and I’m aware that personal experience may not have much value compared to wide-scale tests.

But here is what happened to me last year.
I own an Acer consumer laptop with 2x8 GB Kingston modules and an i7-4702MQ. Pretty crappy on the power supply side, by the way, but that’s not the point.
I had never bothered to think of what benefit ECC could bring.
One day, I did a full system wipe and copied all my files back to the internal HDD (partitioning it more conveniently in the process).
All went fine… Apparently. I soon noticed corruption on some frames in many of my videos.

The culprit was one of the RAM modules. But the only affected files were bigger than 1 GB. I reproduced the problem with TeraCopy: copying a big film to my external drive and back with integrity checking enabled, the corruption showed up about one time in three.

The memory modules worked fine one at a time, and even both together in swapped slots. Was the motherboard failing? No.
Back in their original slots, the problem came back. A 24-hour memtest86 run came back all clear, but files were still damaged while copying.
I then cleaned the slots with a brush and compressed air, and voilà! Even though I am very careful with my laptop, dust and moisture had made me lose two days to troubleshooting.

The sfc scan reported extensive corruption as well.

Since that day my computer has worked fine, and still does one year later, but I am considering buying an ECC-capable rig to avoid damaging part of my backup (including many family films that I consider valuable). I also run periodic integrity checks on my external backup drive now.
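For anyone wanting to do the same, a periodic integrity check can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names (`sha256_of`, `build_manifest`, `verify_manifest`) are made up for this example, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large videos never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose hash has changed."""
    damaged = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel)
    return damaged
```

Build the manifest once from a copy you trust, store it next to the backup (for example as JSON), and rerun `verify_manifest` every few months. Note the limitation: this only detects changes that happen after the manifest is built. If the data was already corrupted in RAM during the original copy, the manifest will faithfully record the hash of the corrupted file.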

Thankfully the damage is limited to a few frames per video and they are still watchable, but many videos are affected. So even a careful noob with standard consumer use may run into such issues.

Sorry for the tl;dr effect xD

shankarab · January 23, 2017, 7:41pm

Hi Jeff, have you made any price comparisons lately (even better, TCO comparisons) between your custom-built servers and an equivalent cloud subscription? (You made one a few years ago; I’m curious to know if any data has changed in favor of cloud.) Another question I wanted to ask: do you run your web sites in VMs or on the physical server itself? You can technically run 4 VMs on a server with 64 GB RAM (or so I’m told). Thanks for your informative posts.

redowk · March 16, 2017, 4:37pm

This post is misleading!
ECC RAM and checksumming filesystems (like btrfs or zfs) exist for a reason.
Concerning RAM, I’ve seen defective RAM (Kingston non-ECC) pass through memtest86 with no errors, but fail a heavy build with gcc.
I’ve seen corrupted files resulting from a simple filesystem copy.
In summary… hardware fails, checksums are the way to go.

Tom_Dee · May 5, 2017, 8:32pm

Unless everything is “ECC” you run risks across the whole computer system. It’s only really good where math is being done for long periods. I agree that checksums are the real answer to the kind of errors ECC would protect against.
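One difference worth spelling out: a checksum only detects that something changed, while an error-correcting code can also repair it. A toy Hamming(7,4) code shows the idea. This is a sketch for illustration, not how ECC DIMMs are actually built; real modules typically protect 64-bit words with 8 extra check bits and can also detect double-bit errors, which this example does not do.

```python
def hamming74_encode(data: list[int]) -> list[int]:
    """Encode 4 data bits into 7 bits; parity bits sit at positions 1, 2 and 4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(code: list[int]) -> tuple[list[int], int]:
    """Return (data bits, 1-based position of the repaired bit, or 0 if clean)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # recheck positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # recheck positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]   # recheck positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4  # equals the position of a single flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1         # flip it back
    return [c[2], c[4], c[5], c[6]], syndrome

word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1                       # simulate a single bit flip at position 5
recovered, fixed_at = hamming74_decode(stored)
```

Here `recovered` equals the original `word` and `fixed_at` names the flipped position. A plain checksum over the same data could only report that the stored value no longer matches.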

ascetic_tweeter · February 8, 2019, 8:13pm

I’m annoyed by how indecisive this whole discussion is. How about listing all of the ways data can become corrupted: in memory, over the network, on disk, on SSD, in the CPU? Then list the protections required for preventing it in every known case… People don’t seem to know much about this kind of stuff - they just reference Google’s massive HDD reliability study and make assumptions - and we’re left without many good ECC products anyway. No wonder no one feels like they’re necessary. The question is, if you have all of your other bases covered, then how much extra are you willing to pay for the last base: ECC… However, maybe it is not the last base. Can we have the benefits of fast unbuffered memory with the same reliability as ECC using some unpopular method or strategy?

codinghorror · February 17, 2019, 11:20am

Take a look at the 2018 data from Puget

It’s quite thorough.

ascetic_tweeter · February 18, 2019, 1:01am

@codinghorror Not very thorough… It doesn’t really cover AMD, and that is of interest for ECC - as another commenter above said, “I would love to have ECC RAM if Intel was not so greedy and stopped artificially crippling its sanely priced chips.” Ryzen + AMD may be the best bet for many of us, but I haven’t seen them support registered ECC. Furthermore, in terms of being thorough, I’m underwhelmed by the analyses of all experts these days. I realize few have the knowledge to put together a truthful analysis of data corruption, bit rot, and the like throughout the layers of complexity that can potentially distort it. This would be of interest, for example, to anyone who is concerned about geomagnetic reversals or large coronal events happening. During such events, we would find out who really understands how to maintain data integrity long-term.

codinghorror · February 18, 2019, 2:59am

I think you either misread or didn’t read what I linked; that’s not “analysis of experts”, that’s actual return and failure rate data based on live shipped systems.

jamesdh · December 13, 2019, 7:36pm

@codinghorror curious to know why you made the exception for your database servers? Concern for data corruption?

codinghorror · December 16, 2019, 11:40pm

Compromise with sysadmins. I agree that if the additional incremental cost is small (10-15%) then it’s cheap insurance. There is value to ECC, particularly on very large memory systems like database servers.

codinghorror · June 1, 2020, 6:07am

Updated for 2019

Motherboards are your biggest risk item. Solid state drives, at least Samsung, are incredibly reliable. As is memory of both ECC and non-ECC types, though ECC does indeed have an edge.

Jurgen_Riederer · August 26, 2020, 12:43pm

Great Post!! Thanks a lot!

feelthhis · December 15, 2020, 2:53am

“A recent academic study [1] of 1.5 million HDDs in the NetApp database over a 32 month period found that 8.5% of SATA disks develop silent corruption.”

“Another very large academic study [2] looked at failure characteristics for entire storage systems, not just the disks. In the 39,000 storage systems analyzed, the protocol stack (firmware) accounted for between 5% and 10% of storage failures.”

Source: NEC
https://web.archive.org/web/20131029210013/http://www.necam.com/docs/?id=54157ff5-5de8-4966-a99d-341cf2cb27d3

Not entirely sure if this is relevant to the discussion; if it’s not, feel free to remove.

codinghorror · December 21, 2020, 9:59pm

I’ve definitely had a change of heart on this topic; over time we have seen RAM fail, both ECC and non-ECC. So I’d advise ECC whenever possible on servers.

It’s basically a game of statistics. Though the mini-PCs I have colocated for 5+ years have never had issues, the more servers you have, and the more time goes on, the more likely a failure becomes. Cheap insurance is the right call: if you’re colocating just a server box or two, I wouldn’t agonize over it, but if you plan to colocate dozens, go ECC all the way.
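The arithmetic behind that can be sketched quickly. The 1% annual failure chance per server below is a placeholder assumption, not measured data, and the model treats servers and years as independent, which is a simplification.

```python
def prob_any_failure(servers: int, years: float, annual_rate: float) -> float:
    """P(at least one RAM failure), assuming independent servers and a constant annual rate."""
    # Each server-year survives with probability (1 - annual_rate);
    # the fleet is failure-free only if every server-year survives.
    return 1.0 - (1.0 - annual_rate) ** (servers * years)

# Hypothetical per-server annual failure chance (an assumption for illustration).
rate = 0.01

two_boxes = prob_any_failure(servers=2, years=5, annual_rate=rate)
dozens = prob_any_failure(servers=36, years=5, annual_rate=rate)

print(f"2 servers over 5 years:  {two_boxes:.1%}")
print(f"36 servers over 5 years: {dozens:.1%}")
```

With these made-up inputs, two boxes come out a bit under 10% while three dozen come out above 80%. The exact figures depend entirely on the assumed rate, but the shape of the curve is the point: the same small per-machine risk becomes near-certain at fleet scale.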