To ECC or Not To ECC

On one of my visits to the Computer History Museum – and by the way, this is an absolute must-visit place if you are ever in the San Francisco bay area – I saw an early Google server rack circa 1999 in the exhibits.


This is a companion discussion topic for the original entry at http://blog.codinghorror.com/to-ecc-or-not-to-ecc/
2 Likes

For the record, a full MemTest86 run on a 64 GB Skylake system takes a little over 3 hours:

That’s the one that was whirring away as I was writing this blog entry. Tiefighter26 to be precise.

At least those 40mm fans spin pretty slowly through the memtest. Not so much for mprime / prime95 when massive CPU load kicks in and they spin up to 10k RPM… my poor poor ears.

I agree with the primary thrust of the debate over ECC vs non-ECC memory. I like ECC memory because it removes the potential for memory errors to leak in. My desktop, for instance, has ECC memory (with an AMD consumer processor – I wasn’t going too crazy back then) so that any work I do with software won’t have random failures. Compare that to my current laptop, which has nice Corsair memory but has a tendency to cause GCC to segfault randomly. I suspect either the memory or the CPU is bad, but I haven’t been able to run MemTest86 yet. I also know that passing isn’t a guarantee that my memory is fine, so I’d prefer to have ECC just to avoid the uncertainty.

That being said, I generally recommend systems without ECC, as most people are cost-sensitive and even a minimal bump in price is enough to put them off. I have a feeling that’s exactly why we don’t see ECC everywhere. The failure rate is small enough that most people don’t care, even if it could drastically hurt them in the future.

Oh, and I do think a lot of systems actually have ECC memory, just not for main RAM. My Phenom II desktop reports that all three levels of cache are ECC protected. So it isn’t so clear that ECC isn’t useful either.

The prevalence of soft (bit flip) errors is likely significantly higher than it was in 2009, or even 2012. In Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (covering the “row hammer” issue), which tested modules from a few different years, the error rate picked up significantly starting in 2013, likely due to increasing density in memory manufacturing.

The paper uses pathological access patterns (still, only reads) to trigger errors on purpose, but there are legitimate access patterns software might produce that trigger the same errors. I can’t be too specific, but I’ve seen these errors in production, and it’s only because we had ECC RAM that we were able to tell the strange behavior was due to a memory error – I’d also argue that being able to tell that memory errors are happening is much more valuable than any of the reliability benefits.
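For the curious, here’s a minimal sketch (in C, x86-only) of the kind of access pattern the paper describes – not a working exploit, just the shape of it: alternately read two addresses that map to different rows of the same DRAM bank, flushing them from the cache each iteration so every read actually reaches DRAM. The buffer, offsets, and iteration count below are placeholders; a real test has to pick addresses known to share a bank.

```c
/*
 * Sketch of a row-hammer-style access pattern (illustrative only).
 * On vulnerable modules, repeatedly activating two rows in the same
 * bank can flip bits in neighbouring rows. x86-only (uses clflush);
 * build with gcc or clang.
 */
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#include <stdint.h>
#include <stdlib.h>

static void hammer(const uint8_t *x, const uint8_t *y, long iterations)
{
    for (long i = 0; i < iterations; i++) {
        (void)*(volatile const uint8_t *)x;   /* activate row X */
        (void)*(volatile const uint8_t *)y;   /* activate row Y */
        _mm_clflush(x);                       /* evict both lines so the */
        _mm_clflush(y);                       /* next reads go to DRAM   */
        _mm_mfence();
    }
}

int main(void)
{
    /* Placeholder addresses; the offsets here are arbitrary, a real
     * test picks addresses known to land in the same bank. */
    uint8_t *buf = malloc(1 << 20);
    if (!buf)
        return 1;
    hammer(buf, buf + (1 << 19), 1000000);
    free(buf);
    return 0;
}
```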

Also turns out to be a security issue: http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html

1 Like

When I read the title, I thought it was going to be a nostalgia post… Such an old topic! :stuck_out_tongue:

Bad news though, ECC may not protect you from rowhammer:

Correct errors. Server-grade systems employ ECC modules with extra DRAM chips, incurring a 12.5% capacity overhead. However, even such modules cannot correct multibit disturbance errors (Section 6.3). Due to their high cost, ECC modules are rarely used in consumer-grade systems.

Section 6.3 says:

While most words have just a single victim, there are also some words with multiple victims. This has an important consequence for error-correction codes (ECC). For example, SECDED (single error-correction, double error-detection) can correct only a single-bit error within a 64-bit word. If a word contains two victims, however, SECDED cannot correct the resulting double-bit error. And for three or more victims, SECDED cannot even detect the multi-bit error, leading to silent data corruption. Therefore, we conclude that SECDED is not failsafe against disturbance errors.

The x=2 case is still pretty large :frowning: but point taken, it does improve the odds a lot in your favor.
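To make those SECDED failure modes concrete, here’s a toy sketch using an extended Hamming(8,4) code – real DIMM ECC is SECDED over 64-bit words (a (72,64) code), but the behavior is the same: one flipped bit gets corrected, two get detected but not corrected, and three or more can be silently mis-corrected.

```c
/* Toy SECDED demo: extended Hamming(8,4) on a 4-bit value. */
#include <stdint.h>
#include <stdio.h>

static int parity8(uint8_t v)          /* 1 if v has an odd number of set bits */
{
    v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
    return v & 1;
}

/* Encode 4 data bits into a Hamming(7,4) codeword in bits 1..7,
 * plus an overall parity bit in bit 0. */
static uint8_t encode(uint8_t data)
{
    uint8_t d1 = data & 1, d2 = (data >> 1) & 1,
            d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;         /* covers positions 3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;         /* covers positions 3,6,7 */
    uint8_t p4 = d2 ^ d3 ^ d4;         /* covers positions 5,6,7 */
    uint8_t cw = (p1 << 1) | (p2 << 2) | (d1 << 3) |
                 (p4 << 4) | (d2 << 5) | (d3 << 6) | (d4 << 7);
    return cw | parity8(cw);           /* overall parity in bit 0 */
}

static void decode(uint8_t cw)
{
    uint8_t syndrome = 0;
    for (int pos = 1; pos <= 7; pos++)
        if ((cw >> pos) & 1)
            syndrome ^= pos;           /* XOR of set-bit positions */
    int overall = parity8(cw);         /* 0 if overall parity still holds */

    if (syndrome == 0 && overall == 0)
        printf("  no error detected\n");
    else if (overall == 1)
        printf("  single-bit error at position %d: corrected\n", syndrome);
    else
        printf("  double-bit error: detected but NOT correctable\n");
}

int main(void)
{
    uint8_t cw = encode(0xB);

    printf("1 flipped bit:\n");
    decode(cw ^ 0x20);                 /* corrected */
    printf("2 flipped bits:\n");
    decode(cw ^ 0x28);                 /* detected, not correctable */
    printf("3 flipped bits:\n");
    decode(cw ^ 0x2C);                 /* mis-"corrected" -> silent corruption */
    return 0;
}
```

Note what happens in the three-bit case: the decoder reports a correctable single-bit error and “fixes” a bit that was never flipped, which is exactly the silent corruption the paper warns about.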

Nice post.

I would consider single-threaded benchmarks vs clockspeed though – Xeon CPUs generally perform better per-clock because of larger caches – and the single-socket models are right there with the i7-6700k in terms of single-threaded performance for a similar price:

Data from: http://www.cpubenchmark.net/singleThread.html

Of course benchmarks are usually crap, yada yada, definitely YMMV.

But the hitch, if you believe these numbers, is that the E3 CPUs only support 32 GB of RAM, so you can’t entirely have your cake and eat it too: single-threaded performance, plus ECC, plus 64 GB of RAM.

And regarding memory errors – in many cases a bit flipped here or there will probably go largely unnoticed, because a lot of what’s in RAM is just that: data. You’re not going to notice a few bits flipped in a JPG, or maybe even in most data. There’s some interesting reading on this topic coming from the ZFS community, since they store very rigorous checksums on disk for data blocks – that’s mostly about hard drives flipping bits, I think, but they do talk about memory and ECC. Perhaps loosely applicable to the discussion here.

I love shots of Google’s home grown servers, and TPB’s makeshift ones, since I did something like that myself.

Photo gallery HERE.

Back in 2005 I built this rolling box to hold a cluster of bare-bones PCs, mostly just running Seti@Home. I was hoping to save on setup time after my next move, and the dual dryer-hose connections would be connected to the window so it would pull in cold air in the winter, then I’d switch to blowing hot air out in the summer. The motherboards are screwed onto little frames that slide and hang like file folders. Vertical orientation still seems like the best option for ventilation. I couldn’t put 6 PCs in the box as originally intended because I’d made it slightly too small.

Yep, these are what we had – uncorrectable errors with ECC memory, caused by row hammer. Luckily there are mitigations. Sandy Bridge allows you to double the memory refresh rate (though, per the paper, this helps a lot but can’t entirely save you, has a significant perf hit, and makes things run noticeably hotter). Ivy Bridge and newer support a feature called pTRR, which selectively refreshes possible victim rows; it helps a lot more and is less expensive, as long as you’re using DIMMs that support it (many manufactured since 2013 do, even though the issue only became public knowledge when this paper was published in Aug '14). And if you’re using DDR4, a similar feature exists in the memory hardware itself.

So row hammer might not be the thing that bites you, since it’s pretty well mitigated now; the larger point is that RAM today has very different characteristics from RAM a few years ago.

1 Like

Not true as of the Skylake Xeon E3 era as I noted in my blog post; this restriction was relaxed at last. Another big reason to be a Skylake fan like me.

Cool, any good links citing statistics or data or research beyond personal anecdotes? ZFS is great.

Also I should have mentioned, compare the i7-6700k and e3-1280v5 … notice how eerily similar they are, except one has 10% higher clock rate and costs half as much. And cannot use ECC memory…

http://ark.intel.com/compare/88195,88171

Why, if I did not know any better, I would say someone is trying to artificially segment the market :laughing:

Very interesting, so DDR4 is immune to the row hammer? It was good to read the paper closely and learn more about row hammer; it definitely demonstrates a more plausible bit flip scenario than random alpha particles.

While I agree that single-bit soft errors in memory are quite uncommon, they do happen often enough in hostile environments like mobile devices. Here’s a DefCon talk illustrating a viable attack that leverages soft bit errors for DNS-oriented exploits.

On a server in a datacenter, though, there isn’t a huge penalty or chance for bit errors. Provided all external communications are authenticated, the worst possible effect would be garbage in the output, or a segfault in the server software (due to return address or code corruption). However, the common case is… nothing. And, like you said in the post, hard errors and multi-bit errors dwarf single-bit errors in a datacenter or home environment by a massive margin, and either of those defeats ECC memory.

1 Like

Well, looking at that cobbled-together Google 1999 server rack, which also utterly lacked any form of ECC RAM…

However, from:

The Datacenter as a Computer
An Introduction to the Design of Warehouse-Scale Machines
Luiz André Barroso and Urs Hölzle
2009

Page 79:

At one early point in its history, Google had to deal with servers that had DRAM lacking even parity checking. Producing a Web search index consists essentially of a very large shuffle/merge sort operation, using several machines over a long period. In 2000, one of the then monthly updates to Google’s Web index failed prerelease checks when a subset of tested queries was found to return seemingly random documents. After some investigation a pattern was found in the new index files that corresponded to a bit being stuck at zero at a consistent place in the data structures; a bad side effect of streaming a lot of data through a faulty DRAM chip. Consistency checks were added to the index data structures to minimize the likelihood of this problem recurring and no further problems of this nature were reported. Note, however, that this workaround did not guarantee 100% error detection in the indexing pass because not all memory positions were being checked – instructions, for example, were not. It worked because index data structures were so much larger than all other data involved in the computation, that having those self-checking data structures made it very likely that machines with defective DRAM would be identified and excluded from the cluster. The following machine generation at Google did include memory parity detection, and once the price of memory with ECC dropped to competitive levels, all subsequent generations have used ECC DRAM.

1 Like

Very interesting, so DDR4 is immune to the row hammer?

It’s a lot less likely to be observed. The root problem – reads (or more specifically, the recharging of the cells in the read row) affecting neighboring cells in the memory chips – still happens. pTRR and TRR both require knowing which rows could potentially be “victim” rows of each other row, and, after some number of reads of a row, refreshing its potential victim rows out of cycle. Their effectiveness will depend on the quality of that mapping and whether it’s tuned to be aggressive enough – which can be a bit at odds with increasing cell density and read latency.
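Purely as an illustration of that counting idea (no real memory controller publishes its implementation, and the assumption below that a row’s victims are simply the rows at index ±1 is a simplification), a sketch might look like:

```c
/*
 * Illustrative-only sketch of the pTRR/TRR idea: count activations per
 * row and, once a row passes some threshold, refresh its potential
 * victim rows out of cycle. Real controllers do this in hardware, and
 * both the adjacency model and the threshold here are made up.
 */
#include <stdio.h>

#define NUM_ROWS  65536
#define THRESHOLD 100000   /* hypothetical maximum activation count */

static unsigned activation_count[NUM_ROWS];

static void refresh_row(int row)
{
    if (row < 0 || row >= NUM_ROWS)
        return;
    printf("targeted refresh of row %d\n", row);
}

/* Called (conceptually) on every row activation; counters would also
 * be cleared by the normal refresh cycle. */
static void on_row_activate(int row)
{
    if (++activation_count[row] >= THRESHOLD) {
        refresh_row(row - 1);          /* potential victim rows */
        refresh_row(row + 1);
        activation_count[row] = 0;
    }
}

int main(void)
{
    /* Hammering row 1000 eventually triggers refreshes of rows 999/1001. */
    for (long i = 0; i < 250000; i++)
        on_row_activate(1000);
    return 0;
}
```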

So – definitely not impossible. As for the likelihood, it’s hard to say – MemTest86 has a rowhammer test now, so you can at least check for it, but rowhammer also managed to exist in the wild for a long time before it became public knowledge; there could be some completely different manner of gremlin lurking in everyone’s hardware now :scream:

1 Like

Ah, good data – yeah, I think you’re on it – I would be surprised if these weren’t the same die binned on a bad cache bank and speed. So, basically, for a bit more money, you get ECC in an E3 cpu.

Re ZFS, I don’t have good data outside of a first level google search – and those articles read like yours, but arguing the opposite for ECC.

I’m a pragmatist – I think you’ll be served fine by the i7 based systems you’ve built, and I salute you for not blindly following the ec2 cloud cult and building your own servers to begin with – great stuff.

I’ve also been evaluating Skylake, and I’m a little skeptical about those rosy performance numbers.

You linked to two JavaScript benchmarks on AnandTech: Kraken and Octane. Those post considerable performance gains, but there are lots of warning signs indicating that something fishy is going on.

If you look at a full comparison between the 6700k and the 4790k, you’ll note that most benchmarks show little to no change. There are exceptions: integrated graphics has definitely improved (but you won’t care), and there are new instructions that improve targeted workloads (e.g. OpenSSL) – but again, that’s not going to help the Ruby interpreter. Also, exceptionally, the JS benchmarks have significantly improved: Kraken, for instance, achieves a score of 735ms vs. the 4790k’s 938ms. A whopping 28% faster at roughly the same clockspeed!

But… my 4770k currently achieves 819ms in Kraken at 4.0GHz (turbo off), despite lots of processes running – considerably better than AnandTech’s 4790k score of 938ms (which nominally runs at 4.0 but turbos to 4.4), even though that processor should probably be turbo-ing through the whole benchmark and is otherwise identical. It should be up to 10% faster, not 15% slower – it’s quite possible the JavaScript engine was updated in the meantime. Similarly, the 1091ms score for the 4770k in the image is again slower than you’d expect based on clockrate alone, even compared to the 4790k. That fits the theory that JS engine improvements play a role, since the 4790k was tested later (it’s much newer).

The reason I focus on Kraken is that Octane v2 is an interesting benchmark in that it touches a broad spectrum of JavaScript use cases, including compiler latency. That makes it great for looking at a broad spectrum of issues a real browser needs to handle, but it unfortunately also makes it very sensitive to browser engine details. There are large differences between browser engines here (even between versions of the same engine), particularly when you look in more detail at the subscores. Also, the run-to-run variation is much higher than in the simpler, mostly number-crunching Kraken. Kraken scores are generally comparable across modern JS engines (they tend to differ by much less than 50%), and that’s probably a better bet for comparing CPUs (on a tangent, the iPad Pro is quite interesting there). Regardless, the Octane v2 data is less clear, but the trend is the same as for Kraken: the differences in AnandTech’s scores cannot be easily explained by any other processor performance benchmark, nor can I replicate them at home or at work on similar machines.

Now, if you disregard the anomalous JS results for a moment, there are other benchmarks that look at least vaguely like a Ruby workload. For example, the Dolphin emulator benchmark hopefully has some similarity – I think I’d pick that as my best bet barring an actual Ruby benchmark. And you’ll note that there the Skylake advantage is much smaller; it’s just 5% faster. Office productivity may also be roughly similar; there we observe a 4% advantage. 7-Zip compression shows a full 11% advantage, but that’s highly memory-system sensitive, and I’m not so sure it’s representative of your use case. Decompression is just 6% faster. Redis is 4-17% faster (1x 4%, 10x 17%, and 100x somewhere in between). Agisoft PhotoScan’s CPU mapping speed is another plausible best case, with a 17% performance advantage.

I really don’t think 33% is realistic. I’d expect no more than a 5 to maybe 10% advantage at identical clockspeeds – of course, that’s not quite the case for you, so the clockspeed difference might add another 10%. In your case, you might hope for a 20% improvement – but that’s a little misleading, because you’re then comparing non-top-of-the-line Haswells with top-of-the-line Skylake.

I’m really curious what the difference actually turns out to be. Can you post some actual ruby benchmark numbers?

One (sort of obvious) thing I’d like to point out about memory errors is that a large part of memory, especially in GUI environments, holds non-critical data, like images where being one color bit off isn’t likely to produce a bad outcome. I think many computers have memory with a few bad cells, but if those cells always hold the Windows logo, it won’t result in an error.

Bad news though, ECC may not protect you from rowhammer:

Then I don’t want either. The fact that they’re selling memory with this problem is an example of everything that’s wrong with the computing industry. Give me memory that does its job; I don’t care how slow it is or how little of it there is. I’d much rather work around memory limitations than check an entire codebase to see which parts are vulnerable to something like rowhammer.

End-to-end Data Integrity for File Systems: A ZFS Case Study, Y. Zhang et al., 2010

  1. In-memory data integrity in ZFS

In the last section we showed the robustness of ZFS to disk corruptions. Although ZFS was not specifically designed to tolerate memory corruptions, we still would like to know how ZFS reacts to memory corruptions, i.e., whether ZFS can detect and recover from a single bit flip in data and metadata blocks.

Our fault injection experiments indicate that ZFS has no precautions for memory corruptions: bad data blocks are returned to the user or written to disk, file system operations fail, and many times the whole system crashes.

And at the risk of “appeal to authority”, here’s ZFS co-founder and current ZFS developer at Delphix, Matthew Ahrens, from a thread on Hardforum, on the ability of ZFS to mitigate in-memory corruption with a specific debug flag (at a performance cost), as well as the bottom line re: ECC:

There’s nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM. Actually, ZFS can mitigate this risk to some degree if you enable the unsupported ZFS_DEBUG_MODIFY flag (zfs_flags=0x10). This will checksum the data while at rest in memory, and verify it before writing to disk, thus reducing the window of vulnerability from a memory error.

I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS.
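As a rough sketch of the general idea behind that ZFS_DEBUG_MODIFY behavior – not the actual ZFS code – you checksum a buffer when it goes “at rest” in memory and verify the checksum again right before writing, so a bit flipped in between is caught rather than silently persisted. FNV-1a is used here purely for illustration; ZFS uses its own checksums (fletcher4, SHA-256, etc.).

```c
/* Minimal checksum-at-rest sketch: detect in-memory corruption before
 * a buffer is written out (illustrative only, not ZFS internals). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static uint64_t fnv1a(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

struct guarded_buf {
    uint8_t  data[4096];
    uint64_t checksum;       /* taken when the buffer was last modified */
};

static void seal(struct guarded_buf *b)
{
    b->checksum = fnv1a(b->data, sizeof b->data);
}

static int write_if_intact(const struct guarded_buf *b)
{
    if (fnv1a(b->data, sizeof b->data) != b->checksum) {
        fprintf(stderr, "in-memory corruption detected, refusing to write\n");
        return -1;
    }
    /* ... issue the real write here ... */
    return 0;
}

int main(void)
{
    struct guarded_buf b = {0};
    memcpy(b.data, "hello", 5);
    seal(&b);

    b.data[100] ^= 0x04;            /* simulate a single flipped bit */
    /* Detects the flip and refuses the write, so the demo "fails". */
    return write_if_intact(&b) == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```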

2 Likes

It’s not a matter of what happens when things are going well; it’s a matter of servers running in less than ideal conditions.
Google has redundancy at almost every level. Even if one server went down, there would be a couple of others taking over.
In normal cases, a blink – a short, barely noticeable power interruption – is enough to affect you badly.
Laptops won’t take advantage of ECC in this case; they have their own, quite reliable power supply. For other computers, that’s frequently enough to cause enough damage to reboot the system.
I understand that in most cases this would affect the CPU too, but that’s not always the case. And it hurts when a blink erases a portion of the data that is happily being written to your drive – especially when that very write causes filesystem corruption you won’t find until much later.

For most people ECC doesn’t matter. For some it still should. It matters in the enterprise space, where single-device reliability is a must.