To ECC or Not To ECC

Well, looking at that cobbled-together Google 1999 server rack, which also utterly lacked any form of ECC RAM…

However, from:

The Datacenter as a Computer
An Introduction to the Design of Warehouse-Scale Machines
Luiz André Barroso and Urs Hölzle
2009

Page 79:

At one early point in its history, Google had to deal with servers that had DRAM lacking even parity checking. Producing a Web search index consists essentially of a very large shuffle/merge sort operation, using several machines over a long period. In 2000, one of the then monthly updates to Google’s Web index failed prerelease checks when a subset of tested queries was found to return seemingly random documents. After some investigation a pattern was found in the new index files that corresponded to a bit being stuck at zero at a consistent place in the data structures; a bad side effect of streaming a lot of data through a faulty DRAM chip. Consistency checks were added to the index data structures to minimize the likelihood of this problem recurring and no further problems of this nature were reported. Note, however, that this workaround did not guarantee 100% error detection in the indexing pass because not all memory positions were being checked – instructions, for example, were not. It worked because index data structures were so much larger than all other data involved in the computation, that having those self-checking data structures made it very likely that machines with defective DRAM would be identified and excluded from the cluster. The following machine generation at Google did include memory parity detection, and once the price of memory with ECC dropped to competitive levels, all subsequent generations have used ECC DRAM.
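To make “self-checking data structures” a bit more concrete, here’s roughly what the pattern looks like as a toy Ruby sketch (purely illustrative; Google’s actual index code was obviously nothing like this): attach a checksum to each record when it’s produced and verify it when it’s consumed, so a machine whose DRAM keeps sticking a bit at zero gets flagged instead of silently poisoning the merge.

```ruby
require 'zlib'

# Illustrative only: wrap each index record with a CRC32 so that a stuck bit
# introduced between write and read is detected rather than silently merged
# into the output.
class CheckedRecord
  attr_reader :payload

  def initialize(payload)
    @payload  = payload
    @checksum = Zlib.crc32(payload)
  end

  # True if the payload still matches the checksum computed at creation time.
  # A machine that keeps failing this check is a candidate for eviction from
  # the cluster.
  def intact?
    Zlib.crc32(@payload) == @checksum
  end
end

record = CheckedRecord.new("doc:42 terms:ecc,dram,parity")
warn "possible DRAM fault on this machine" unless record.intact?
```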


Very interesting, so DDR4 is immune to rowhammer?

It’s a lot less likely to be observed. The root problem, reads (or more specifically, the recharging of the cells in the activated row) disturbing neighboring cells in the memory chips, still happens. pTRR and TRR both require knowing which rows could potentially be “victim” rows of each other row, and, after some number of reads of a row, refreshing its potential victim rows out of cycle. Their effectiveness will depend on the quality of that mapping and on whether they’re tuned to be aggressive enough – which can be a bit at odds with increasing cell density and read latency.
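As a rough mental model (this is a toy sketch, not how a memory controller is actually implemented; the threshold and the aggressor-to-victim mapping are made up), TRR amounts to counting activations per row and refreshing that row’s mapped victims early once a threshold is crossed:

```ruby
# Toy model of targeted row refresh: count activations per row and, once a row
# looks like a rowhammer aggressor, refresh its mapped victim rows out of cycle.
# The threshold and the victim map are made-up placeholders.
class TargetedRefresh
  MAX_ACTIVATIONS = 50_000 # hypothetical threshold per refresh interval

  def initialize(victim_map)
    @victim_map  = victim_map   # row -> list of physically adjacent rows
    @activations = Hash.new(0)
  end

  def on_activate(row)
    @activations[row] += 1
    return if @activations[row] < MAX_ACTIVATIONS

    (@victim_map[row] || []).each { |victim| refresh(victim) }
    @activations[row] = 0
  end

  def refresh(row)
    # In real hardware this issues an extra refresh to the victim row;
    # here it is just a stand-in.
    puts "out-of-cycle refresh of row #{row}"
  end
end
```

Which is exactly why the quality of that victim mapping and the aggressiveness of the threshold matter so much in practice.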

So – definitely not impossible. As for the likelihood, it’s hard to say – MemTest86 has a rowhammer test now, so you can at least check for it, but rowhammer also managed to exist in the wild for a long time before it became public knowledge, so there could be some completely different manner of gremlin lurking in everyone’s hardware now :scream:


Ah, good data – yeah, I think you’re on it – I would be surprised if these weren’t the same die, binned on a bad cache bank and speed. So, basically, for a bit more money, you get ECC in an E3 CPU.

Re ZFS, I don’t have good data outside of a first-level Google search – and those articles read like yours, but arguing the opposite for ECC.

I’m a pragmatist – I think you’ll be served fine by the i7-based systems you’ve built, and I salute you for not blindly following the EC2 cloud cult and for building your own servers to begin with – great stuff.

I’ve also been evaluating Skylake, and I’m a little skeptical about those rosy performance numbers.

You linked to two JavaScript benchmarks on AnandTech: Kraken and Octane. Those show considerable performance gains, but there are lots of warning signs indicating that there’s something fishy going on.

If you look at a full comparison between the 6700k and the 4790k, you’ll note that most benchmarks show little to no change. There are exceptions: integrated graphics has definitely improved (but you won’t care), and there are new instructions that improve targeted workloads (e.g. OpenSSL) - but again, that’s not going to help the Ruby interpreter. Also, exceptionally, the JS benchmarks have significantly improved: Krakenjs for instance achieves a score of 735ms vs. the 4790k’s 938ms. A whopping 28% faster at roughly the same clockspeed!

But… my 4770k currently achieves 819ms in Krakenjs running at 4.0GHz (turbo off), despite lots of processes running, which is considerably better than AnandTech’s 4790k score of 938ms (the 4790k nominally runs at 4.0 but turbos to 4.4), even though that processor should probably be turbo-ing through the whole benchmark and is otherwise identical. It should be up to 10% faster, not 15% slower - it’s quite possible the JavaScript engine was updated in the meantime. Similarly, the 1091ms score for the 4770k in the image is again slower than you’d expect based on clockrate alone, even compared to the 4790k. That fits the theory that JS engine improvements play a role, since the 4790k was tested later (it’s much newer).
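To spell that clock-speed argument out (assuming, as a simplification, that Kraken time scales inversely with clock):

```ruby
# Back-of-the-envelope: if Kraken time scaled purely with clock speed, what
# should AnandTech's 4790k (turbo up to 4.4GHz) score relative to my 4770k
# locked at 4.0GHz?
my_4770k_ms     = 819.0
expected_4790k  = my_4770k_ms * 4.0 / 4.4   # ~745ms if turbo holds
anandtech_4790k = 938.0

puts "expected ~#{expected_4790k.round}ms, AnandTech measured #{anandtech_4790k.round}ms"
puts "measured is #{((anandtech_4790k / my_4770k_ms - 1) * 100).round}% slower than my 4770k"
# The AnandTech figure comes out ~15% slower instead of ~10% faster, which is
# why a JS engine difference looks like the more plausible explanation.
```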

The reason I focus on Kraken is that Octane v2 is an interesting benchmark in that it touches a broad spectrum of JavaScript use cases, including compiler latency. That makes it great for looking at a broad spectrum of issues that a real browser needs to handle, but it unfortunately also makes it very sensitive to browser engine details. There are large differences between browser engines here (even between versions of the same engine), particularly when you look in more detail at the subscores. Also, the run-to-run variation is much higher than in the simpler, mostly number-crunching Krakenjs. Krakenjs scores are generally comparable across modern JS engines (they tend to differ by much less than 50%), and that’s probably a better bet for comparing CPUs (on a tangent, the iPad Pro is quite interesting there). Regardless, the Octane v2 data is less clear, but the trend is the same as for Kraken: the differences in AnandTech’s scores cannot be easily explained by any other processor performance benchmark, nor can I replicate them at home or at work on similar machines.

Now, if you disregard the anomalous JS results for a moment, there are other benchmarks that look at least vaguely like a Ruby workload. For example, the Dolphin emulator benchmark hopefully has some similarity - I think I’d pick that as my best bet barring an actual Ruby benchmark. And you’ll note that there the Skylake advantage is much smaller; it’s just 5% faster. Office productivity may also be roughly similar; there we observe a 4% advantage. 7-Zip compression shows a full 11% advantage, but that’s highly memory-system sensitive, and I’m not so sure it’s representative of your use case. Decompression is just 6% faster. Redis is 4-17% faster (1x 4%, 10x 17%, and 100x somewhere in between). Agisoft PhotoScan’s CPU mapping speed is another plausible best case, with a 17% performance advantage.

I really don’t think 33% is realistic. I’d expect no more than a 5 to maybe 10% advantage at identical clockspeeds - of course, that’s not quite the case for you, so the clock difference might add another 10% for you. In your case, you might hope for a 20% improvement - but that’s a little misleading because you’re then comparing non-top-of-the-line Haswells with top-of-the-line Skylake.

I’m really curious what the difference actually turns out to be. Can you post some actual ruby benchmark numbers?
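For what it’s worth, even something as crude as this (an illustrative micro-benchmark I just made up, not a substitute for the real Discourse test suite) run on both boxes would already tell us something:

```ruby
require 'benchmark'

# Crude single-threaded micro-benchmark: hash, string, and array churn that at
# least vaguely resembles what a Ruby web app spends its time on. Run the same
# script on both machines and compare wall-clock times.
elapsed = Benchmark.realtime do
  200_000.times do |i|
    h = { id: i, title: "post #{i}", tags: %w[ecc skylake ruby] }
    h[:slug] = h[:title].downcase.tr(" ", "-")
    h[:tags].map(&:upcase).join(",")
  end
end

puts format("%.2f seconds", elapsed)
```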

One (sort of obvious) thing I’d like to point out about memory errors is that a large portion of memory, especially in GUI environments, holds non-critical data, like images, where being one color bit off isn’t likely to produce a bad outcome. I think many computers have memory with a few bad cells, but if those cells always hold the Windows logo, it won’t result in an error.

Bad news though, ECC may not protect you from rowhammer:

Then I don’t want either. The fact that they’re selling memory with this problem is an example of everything that’s wrong with the computing industry. Give me memory that does its job; I don’t care how slow it is or how little of it there is. I’d much rather work around memory limitations than check the entire codebase to see which parts are vulnerable to something like rowhammer.

End-to-end Data Integrity for File Systems: A ZFS Case Study, Y. Zhang et al., 2010

  1. In-memory data integrity in ZFS

In the last section we showed the robustness of ZFS to disk corruptions. Although ZFS was not specifically designed to tolerate memory corruptions, we still would like to know how ZFS reacts to memory corruptions, i.e., whether ZFS can detect and recover from a single bit flip in data and metadata blocks.

Our fault injection experiments indicate that ZFS has no precautions for memory corruptions: bad data blocks are returned to the user or written to disk, file system operations fail, and many times the whole system crashes.

And at the risk of “appeal to authority,” here’s ZFS co-founder and current ZFS dev at Delphix, Matthew Ahrens, from a thread on HardForum; see the quoted passages below re: the ability of ZFS to mitigate in-memory corruption with a specific debug flag (at a performance cost), as well as the bottom line re: ECC

There’s nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM. Actually, ZFS can mitigate this risk to some degree if you enable the unsupported ZFS_DEBUG_MODIFY flag (zfs_flags=0x10). This will checksum the data while at rest in memory, and verify it before writing to disk, thus reducing the window of vulnerability from a memory error.

I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS.
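The pattern Ahrens describes, checksum a buffer while it sits in memory and re-verify it just before it goes to disk, is easy to sketch. This is just the shape of the idea in Ruby; the real mechanism lives in the ZFS I/O pipeline and looks nothing like this:

```ruby
require 'zlib'

# Sketch of the "checksum at rest, verify before write" idea behind the
# ZFS_DEBUG_MODIFY flag: if the buffer was flipped by bad RAM between being
# filled and being flushed, refuse the write instead of persisting silent
# corruption.
class GuardedBuffer
  def initialize(data)
    @data     = data.dup.freeze
    @checksum = Zlib.crc32(@data)
  end

  def write_to(io)
    unless Zlib.crc32(@data) == @checksum
      raise "in-memory corruption detected, refusing to write"
    end
    io.write(@data)
  end
end

File.open("block.bin", "wb") { |f| GuardedBuffer.new("important bytes").write_to(f) }
```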


It’s not a matter of what happens when things are going well. It’s a matter of servers running in less-than-ideal conditions.
Google has redundancy at almost every level. Even if one server went down, a couple of others would take over.
In normal cases, a blink, a short, barely noticeable power interruption, is enough to affect you in a bad way.
Laptops won’t take advantage of ECC in this case; they have their own, quite reliable power supply. For other computers, that’s frequently enough to cause enough damage to reboot the system.
I understand that in most cases this would affect the CPU too, but that’s not always the case. And it hurts when a blink corrupts a portion of the data that is happily being written to your drive, especially when that very write causes filesystem corruption that you won’t find until much later.

For most people ECC doesn’t matter. For some it still should. It matters in the enterprise space, where single-device reliability is a must.

You might want to update your memory test: Memtest86+ v5.01 supports more CPUs, and can run as a multi-threaded test, which should stress the memory even harder.

One thing I noticed: letting the CPU directly control clock speed switching, rather than the OS, helps some benchmarks a bit. This “Speed Shift” is new to Skylake.

Compared to Speed Step / P-state transitions, Intel’s new Speed Shift terminology changes the game by having the operating system relinquish some or all control of the P-States, handing that control off to the processor. This has a couple of noticeable benefits. First, it is much faster for the processor to control the ramp up and down in frequency, compared to OS control. Second, the processor has much finer control over its states, allowing it to choose the most optimum performance level for a given task, and therefore use less energy as a result.

Results

The time to complete the Kraken 1.1 test is the least affected, with just a 2.6% performance gain, but Octane’s score shows over a 4% increase. The big win here though is WebXPRT. WebXPRT includes subtests, and in particular the Photo Enhancement subtest can see up to a 50% improvement in performance.

This requires OS support. Supposedly the latest version of Win10 supports it.

In general I trust AnandTech a lot for benchmarks. They used Chrome 35 for each of the JS benchmark tests. With my mildly overclocked i7-6700k and Chrome 36 x64 on Windows I get 45,013 on Octane and 655 on Kraken, which is consistent with what AnandTech saw.

Hmm, I am using the Ubuntu 14.04 LTS boot media, which has a memory test menu option at startup. I’ll see if I can locate a newer version to use in the future.

I think this depends on the kind of computation you’re doing. Statistics I’ve heard on soft errors of this kind are on the order of 1 bit error per terabit-year. If you’re building a cluster that’s going to crunch on a single problem for months at a time, where one bit error during the calculation will destroy months of work, and where multiple terabytes of RAM are in use on that calculation, ECC is absolutely worth it. That’s why getting ECC memory support into NVIDIA’s GPUs was such a big deal for projects like the Titan supercomputer, which has 693.5TiB of RAM running 24/7.
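Taking that 1 error per terabit-year figure at face value (it’s a rough, much-debated number), the scale argument is easy to put in numbers:

```ruby
# Back-of-the-envelope: expected soft errors per year for a machine with a
# given amount of RAM, assuming ~1 bit error per terabit-year (a rough figure).
def expected_errors_per_year(ram_bytes, rate_per_terabit_year: 1.0)
  terabits = ram_bytes * 8.0 / 1e12
  terabits * rate_per_terabit_year
end

tib = 2.0**40
puts expected_errors_per_year(8 * 2**30).round(3)  # 8 GiB desktop: ~0.07 per year
puts expected_errors_per_year(693.5 * tib).round   # Titan-scale: thousands per year
```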

For web server applications like this where the worst thing that happens is one page loads incorrectly, I wouldn’t worry about it so much.

Intel® Product Specification Comparison
Why, if I didn’t know any better, I would say someone is trying to artificially segment the market.

http://www.cpu-world.com/Compare/492/Intel_Core_i7_i7-6700K_vs_Intel_Xeon_E3-1280_v5.html
As always with Intel, you pay extra for Xeons because they come from a better production process. They can run at the same specs but with lower voltage, thus generating less heat. That’s noticeable at huge scale in datacenters, where you’ve got thousands of chips.
Also, desktop processors rarely run 24/7 at 100% CPU load :wink:

Lots of people doing Folding@home, SETI@home, or Bitcoin mining run their CPUs at 100% all the time.

Also, as you can see from Intel’s own documentation, and as I already posted above, and you quoted… the chips are virtually identical.

One is a 91W chip that runs at 4.0GHz; the other is an 80W chip that runs at 3.7GHz. Pretty easy to see where the extra watts come from: higher clock speed. And one costs about half as much, for more speed. But no ECC.

With respect to memory testing, in my experience running memtest86 (BTW, you mistakenly call it memtestx86 in the article, which disturbs my inner perfectionist) alone is not good enough to validate that your RAM is fine. In addition to that, I usually also run memtester and cpuburn together — and in some cases that reveals memory errors despite memtest86 results being fine. Just my $0.02

Please note that the Xeon doesn’t have an iGPU.

EDIT:

Also, it’s interesting to normalize single-threaded performance by frequency:

  • the 6700K has a +7% frequency advantage over the 3770K
  • in the Cinebench single-threaded benchmark it’s ahead of the 3770K by +28.5%
  • so at the same frequency, the 6700K leads the 3770K, a CPU released 3 years ago, by roughly 19.5%

I’m really interested in what the differences are in real life, i.e. Rails benchmarks on both platforms - something tells me it won’t scale to the full +28.5% …
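Spelling the normalization out (treating the gain as frequency times IPC, which is an approximation):

```ruby
# Split the observed single-threaded gain into a frequency part and an
# architecture (IPC) part, assuming the two multiply.
freq_gain  = 1.07    # 6700K clock advantage over the 3770K
total_gain = 1.285   # observed Cinebench single-threaded advantage

ipc_gain = total_gain / freq_gain
puts format("~%.1f%% per-clock improvement over three generations", (ipc_gain - 1) * 100)
# => ~20.1%, roughly in line with the ~19.5% figure above
```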

As you yourself note in this quote, the advantage in Kraken is just 2.6%, i.e. a tiny fraction of that 28% gain. Also, since AnandTech has had 6700k scores up for a while and Speed Shift isn’t yet in mainstream Windows, it’s likely not included in those scores. That’s further corroborated by the fact that the most affected benchmark - WebXPRT - is virtually identical in AnandTech’s 4790k and 6700k benchmarks.

Additionally, if Speed Shift were the explanation for the gain, you’d expect my personal Kraken and Octane benchmarks to corroborate AnandTech’s scores - instead, my slower 4770k scores considerably higher than AnandTech’s 4790k.

I’m not sure if it really is due to JS engine differences, but that does fit the facts. In any case, AnandTech’s scores for the 4790k are much, much too low, which makes Skylake look better than it really is by comparison. I’m betting on JS engine improvements partially because modern browsers try to make it difficult to avoid updating, so it’s plausible they got a faster JS engine without ever intending to.

Regardless: even if JS performance is much faster, almost all other benchmarks don’t mirror this. You might get lucky and find that Ruby perf is like JS, but that’s a long shot. It’s much more likely that Ruby perf will be like the vast majority of other benchmarked workloads: largely unchanged.


Curious if you’ve compared prime95 to something like linpack. From my limited experience testing hundreds of various laptop machines for stability, battery refreshing, and runtime statistics, linpack (intelBurnTest?) seems to kill them a lot faster. They just released a new version: https://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download

IMHO, prime95 etc are good for benchmarking - having a consistent test across systems is a good idea - but for burn in and stability testing, I much prefer the one made by the chip manufacturer.

Recently I located the source of an overheat fault on my little brother’s rig using FurMark and Linpack simultaneously, a fault I couldn’t recreate with prime95/mprime running all night. The only other way to trigger it was a random 2 hours of gaming. FurMark+IntelBurnTest got to the failure state in under 10 minutes. It may be a chip-dependent issue, but I remember reading that Linpack exercises proportionally more of the on-board circuitry in the CPU due to the way it does its floating-point calculations - hence it’s a more robust heat generation tool.

I would love to be proven wrong about it, and I have no idea if it makes a difference for servers :smiley:

Whenever I have built and overclocked systems, prime95 has been 100% reliable in detecting whether they are stable for me. If prime95 fails overnight, the system is not stable; if it passes… no CPU stability issues at all. I have no idea whether it is the “ultimate” tool, but for CPU overclocks it has been incredibly reliable at detecting CPU instability for about a decade now…

Here are some numbers @sam_saffron recorded for Discourse:

build master docker image

build01 16:00
tf21     5:33

This is not a great test since our build was running in a VM on an eight-core Ivy Bridge Xeon, which has a lower clock speed, with a RAID array of traditional hard drives. Nowhere near an apples-to-apples comparison. But, 3x faster!

running Discourse (Ruby) project unit tests

tf9   8:48
tf21  4:54

This is running the Discourse project unit tests in Ruby. It’s a perfect benchmark scenario as tiefighter9 is exactly the 2013 build described in this blog post and tiefighter21 is exactly the 2016 build described in this blog post. And everything runs on bare metal, Ubuntu 14.04 x64 LTS.

As you can see here, tiefighter21 is almost 2x faster: 528s for the 2013 Ivy Bridge server build, and 294s for the 2016 Skylake server build. Our new Skylake based Discourse servers are 1.8x faster at running the Ruby unit tests in the Discourse project, to be exact.
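For completeness, converting those times to seconds and taking the ratios:

```ruby
# Convert the wall-clock times above to seconds and compute the speedups.
def to_seconds(mm_ss)
  m, s = mm_ss.split(":").map(&:to_i)
  m * 60 + s
end

docker_build = to_seconds("16:00") / to_seconds("5:33").to_f  # ~2.9x
unit_tests   = to_seconds("8:48")  / to_seconds("4:54").to_f  # ~1.8x
puts format("docker image build: %.1fx, unit tests: %.1fx", docker_build, unit_tests)
```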

I hope that data answers your question definitively since you both kept asking over and over and not believing me :wink:


@codinghorror wow, that’s just like wow. I wonder if this has any relevance:

EDIT:

I don’t want to be a nitpicker; it’s just really hard to grasp that switching to Skylake would yield a 1.8x increase solely because of the CPU, when each tick/tock iteration on the Intel side barely added +5% in performance.