Whenever I have built and overclocked systems, prime95 has been 100% reliable in detecting whether they are stable for me. If prime95 fails overnight, the system is not stable; if it passes, I see no CPU stability issues at all. I have no idea whether it is the "ultimate" tool, but for CPU overclocks it has been incredibly reliable at detecting CPU instability for about a decade now.
Here are some numbers @sam_saffron recorded for Discourse:
build master docker image
This is not a great test, since our build was running in a VM on an eight-core Ivy Bridge Xeon, which has a lower clock speed, and on a RAID array of traditional hard drives. Nowhere near an apples-to-apples comparison. But, 3x faster!
running Discourse (Ruby) project unit tests
This is running the Discourse project unit tests in Ruby. It's a perfect benchmark scenario as tiefighter9 is exactly the 2013 build described in this blog post and tiefighter21 is exactly the 2016 build described in this blog post. And everything runs on bare metal, Ubuntu 14.04 x64 LTS.
As you can see here, tiefighter21 is almost 2x faster: 528s for the 2013 Ivy Bridge server build, and 294s for the 2016 Skylake server build. To be exact, our new Skylake-based Discourse servers are 1.8x faster at running the Ruby unit tests in the Discourse project.
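If you want to check the arithmetic yourself, here is a minimal sketch in Python using only the two timings quoted above. The variable names are my own, purely for illustration; nothing here comes from the Discourse test suite itself.

```python
# Wall-clock times for the Discourse unit tests, in seconds, as quoted above.
ivy_bridge_2013_s = 528  # tiefighter9, the 2013 build
skylake_2016_s = 294     # tiefighter21, the 2016 build

# Speedup is old time divided by new time.
speedup = ivy_bridge_2013_s / skylake_2016_s

# Fraction of the original run time that the newer server saves.
time_saved = 1 - skylake_2016_s / ivy_bridge_2013_s

print(f"speedup: {speedup:.1f}x")       # speedup: 1.8x
print(f"time saved: {time_saved:.0%}")  # time saved: 44%
```

Note that "1.8x faster" and "44% less time" describe the same result; the first is a ratio of run times, the second is the share of the old run time that goes away.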
I hope that data answers your question definitively, since you both kept asking over and over and not believing me.