Is Your Computer Stable?

ajoy39 · February 16, 2016, 4:35am

I really don’t recommend people run Prime95 anymore. The program is a power virus: it draws more current through your CPU than any other program ever will, and sometimes more than the CPU is actually rated to draw. It likely won’t kill your CPU, but failing Prime95 doesn’t automatically mean your CPU isn’t perfectly fine. Asus’s RealBench works well if you are running Windows; if you’re stress testing on Linux, the stress and stress-ng packages are probably your best bet: http://www.cyberciti.biz/faq/stress-test-linux-unix-server-with-stress-ng/
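For anyone who wants to script the Linux route, here is a minimal Python sketch that builds a stress-ng command line loading CPU and memory together. The function names, worker counts, memory size and duration are my own illustrative assumptions, not settings recommended anywhere in this thread; only the stress-ng options themselves (`--cpu`, `--vm`, `--vm-bytes`, `--timeout`, `--metrics-brief`) are real.

```python
import shutil
import subprocess


def build_stress_ng_command(cpu_workers=0, vm_workers=1, vm_bytes="512M", timeout="10m"):
    """Build a stress-ng command line that loads CPU and memory together.

    cpu_workers=0 asks stress-ng to start one CPU worker per online processor.
    All default values here are illustrative assumptions.
    """
    return [
        "stress-ng",
        "--cpu", str(cpu_workers),
        "--vm", str(vm_workers),
        "--vm-bytes", vm_bytes,
        "--timeout", timeout,
        "--metrics-brief",
    ]


def run_stress_test(command):
    """Run the command and return True if stress-ng exited with status 0."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} is not installed or not on PATH")
    result = subprocess.run(command)
    return result.returncode == 0
```

Calling `run_stress_test(build_stress_ng_command())` would start the run; a non-zero exit status is a hint to look closer, not proof of bad hardware.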

VPhantom · February 16, 2016, 12:25pm

Here’s one for you, unfortunately my desktop since 2014: memtest86+ is fine and I’ve excluded the use of the RAM area that some bad BIOSes sometimes play with. Still, once every month or so, sleeping corrupts at least one tiny portion of RAM somewhere. Could be in the kernel (panic ensues) or user programs, or cached FS data. Upon waking that fateful day, Xorg will crash, or Xterm won’t start, or the screen will be bright red, startx crashing on new attempts, though SSH is fine.

Is it the power supply, the motherboard, the memory or any slightly incompatible combination thereof?

We’ll never know: I now power it off for the night like an 80’s Amiga.

ajkerrigan · February 16, 2016, 4:36pm

This is all solid advice, and over time we all develop pet tools and preferences. From my own experience the most relevant addition is SpinRite for hard drive testing/maintenance/recovery. I use it when I get a new drive to get a good initial feel for it, and periodically from then on (though not often enough) to make sure things are still looking good. Fortunately I’ve rarely needed it for recovery.

I find that SpinRite identifies and fixes issues before SMART reports anything on its own, and before I see I/O errors or performance issues at the OS level.

It was not immediately apparent to me that SpinRite would continue to be useful as I switched from spinning rust to SSDs. I use it differently on SSDs (running read-only “Level 2” tests instead of write-heavy tests) but it still helps those drives help themselves, and gives SMART a kick in the pants.

LexBarringer · February 16, 2016, 7:19pm

I’m coming from an electrical, electronic and software engineering background. There is something I need to tell people about their power supplies, aka PSUs. Just because the label says 650 Watts doesn’t mean it delivers 650 Watts all the time; that number is peak power. How long it can hold that depends on the manufacturer’s definition of the duty cycle: how long it runs versus when it shuts down and starts up again. This has to do with switching power supplies, which is the type PC PSUs are.

Now, average power is significantly lower than peak power. There is something else most people don’t realize, when you’re using a PSU you have to look at the rails (power supply outlets / plugs). Are they shared or independent?

Shared means that everything on, say, a +12 Volt rail of, say, 22 Amps is drawn on by the entire system. Should you have a short circuit, or a device that uses more power than it’s rated for, it will put a strain on your power supply. Most modern power supplies have short circuit and brown out (losing a phase from the main power line, which also drops the main voltage) protection.

Independent means that each +12 Volt rail path is its own circuit with a stated amount of current. This is the preferred kind of power supply for graphics cards.

Now, what’s the difference between peak and average power?

Simple, the Peak power is 1.414 times larger than the average power is.

So if you have an average (continuous) power draw of 450 Watts, you’ll need a 450W * 1.414 = 636.3 W peak power supply.

However, this is too simplistic.

What you need to find out is how much current of each device you’re going to use and what voltages are used, match the current up with each voltage. Then add up all the current in Amps per voltage rail.

Then multiply it by 1.414 to get the peak figure for that specific rail. Do this for every voltage rail, then add the per-rail peaks together to get the final number.

However, there is more to this story. Couple this with high efficiency power supplies (a bad idea, getting much worse) and the general age of the power supply. In time all items fail, whether they’re used or not. Some passive components can last for over 40 years mind you, if sealed correctly.

Now, the higher the efficiency of a power supply, the more the duty cycle is dropped, meaning it uses higher capacitor values to offset the dropped duty cycle. When the duty cycle is off, it’s not drawing power from the power line, which is ideal for saving power, but the higher the efficiency, the more stress you put on the passive components. Capacitors are notorious for going bad when you need them most (look up capacitor plague). Most high efficiency power supplies need high micro-farad capacitors at a high voltage. You don’t really find this in solid capacitors, which are highly stable even when stressed, including in very hot and humid environments. They use the old radial polarized aluminum electrolytic type capacitors. By default they also use the industry standard 20% tolerance capacitors, when they should be using a bare minimum of 5% tolerance. These capacitors are usually the failure point, at which time they become “leaky”, not in a physical sense but in an electrical sense, meaning they don’t hold on to their charge until it’s ready to be discharged by the circuit. Most high efficiency power supplies don’t sense that something is wrong; they have a specific duty cycle and stick to that. When you have leaky capacitors in the PSU, the peak power drops considerably, closer to that of average power. This is more than likely the cause of your PSU problem.

Here’s a trick I can impart: instead of calculating the power outright, calculate the amps you need at each rail with all the components.

For a hypothetical example, let’s only calculate the +12 Volt rails, since graphics cards with the 6 or 8 pin power connector need them.

Let’s say your card needs 250W of power on the +12 Volt rail and you have one such card.

250W / 12V = 20.833333333A, round it to 21A (Amps)

You have a bunch of drives, including a Blu-ray burner, that use a total of 50W @ 12 Volts.

50W / 12V = 4.166666667A, round it up to 5A

Your motherboard uses 10W @ 12 Volts

10W / 12V = 0.83 Amps, budget 2A to be safe.

You have 6 case fans using 36W @ 12 Volts

36W /12V = 3 Amp, keep it at 3A.

Add all of these together for the +12 Volt current load of average power.

21+5+2+3 = 31 Amp at 12 Volts.

Now take this 31 Amps and multiply it by 1.414, which is 43.834; round it to 44 Amps at +12 Volts. This is the bare minimum you should have to power the cards in your system. Look for power supplies that have this available on their +12V power rails, and make sure it’s independent current, not shared.

Now, I didn’t do the rest of the voltage and current for all the items but you get the idea.

Now, to factor in age and other problems, like high efficiency power supplies (I recommend nothing above a gold certification, they burn out too fast):

Take that number you have from peak current, the 44 Amps, and multiply it by 1.5, which will land you at 66 Amps. Now this sounds insane but it really isn’t: just because you have 66 Amps total available to you doesn’t mean the system will draw that amount. It also means that the power supply’s capacitors will have an easier time and not be under a lot of stress, and the rest of the power retention circuitry won’t be stressed, because you have more than enough current to go around. As the PSU ages, many of the components are no longer in specification, which leads to the power supply losing the ability to perform as designed at said output voltages and current (what your components and motherboard use).
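The +12 Volt walkthrough above can be sketched in a few lines of Python. This only reproduces the rule of thumb from this post (the 1.414 peak multiplier and the 1.5 ageing margin are the post’s own factors, not figures from any PSU specification), and the function names are made up for illustration. Because the sketch rounds every device up to the next whole amp, the 10 W motherboard load counts as 1 A, so the totals come out at 30 A average, 43 A peak and 64.5 A with headroom, slightly under the 31, 44 and 66 A quoted above.

```python
import math

RAIL_VOLTS = 12.0
PEAK_FACTOR = 1.414     # the post's peak-to-average multiplier (assumption, not a PSU spec)
HEADROOM_FACTOR = 1.5   # the post's extra margin for ageing components


def amps_for(watts, volts=RAIL_VOLTS):
    """Convert a device's power draw to whole amps, rounding up."""
    return math.ceil(watts / volts)


def rail_budget(loads_watts, volts=RAIL_VOLTS):
    """Return (average, peak, with_headroom) amps for one rail."""
    average = sum(amps_for(w, volts) for w in loads_watts.values())
    peak = math.ceil(average * PEAK_FACTOR)
    return average, peak, peak * HEADROOM_FACTOR


# Example loads in watts, taken from the walkthrough above.
loads = {"graphics card": 250, "drives": 50, "motherboard": 10, "case fans": 36}
average, peak, padded = rail_budget(loads)
print(f"average {average} A, peak {peak} A, with headroom {padded} A")
```

Swapping in your own wattages per rail gives a quick figure to compare against the per-rail amps printed on a PSU label.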

The bigger, less power-efficient PSUs age more gracefully and stay in specification when they’re stressed less by your components.

So, your decision to go to 1200 Watts may have been well warranted, but do pay attention to the current on the rails vs. that of the peak 1200 Watts next time. The more you understand about design theory vs. practice, what you have in your system and what amount of power it needs (on average), the better.

When you read the current rating on a PSU label, that’s not average power; it is given as peak power and current.

If all +12 Volt rails say 25 Amp and there are 6 of them, that’s 150 Amps peak current, and its average (constant) available current is 106.05, round down to 106A. So, if your system needs more than that, you’ve got a problem with that power supply; you need not necessarily a bigger one but a better one, with different amounts of current per independent node on the +12V rail.

In good practice, it’s good to keep each individual node at a given voltage separate when determining if it can power your graphics cards, then check against the total amperage of the PSU to make sure it can and will power your other items in addition to that without a problem.

I know this is a long explanation but it does help a lot of people in the end. Not all power supplies are created equal: it’s not just quality, the current on the individual rails is different, too.

LexBarringer · February 16, 2016, 7:29pm

Is this full sleep mode (suspend to RAM) or just hibernation (low power mode), where you can sometimes hear the fans pulsing?

Is this an Intel or AMD based computer?

Are you overclocking at all?

If overclocking, save all the settings to a file on the HD or write them down. Then reset the BIOS to standard / optimized settings. Try again and see if it works. If it does, I know what the problem is.

LexBarringer · February 16, 2016, 11:48pm

As I was saying on Twitter, I don’t use memtest on Linux, simply because you can have an incompatibility with a RAM stick or a problem on the board itself, meaning not enough drive current for said stick of RAM. You may need to increase the voltage inside the CPU for driving the memory at a certain speed (overclocked) to be stable. If this isn’t taken into consideration, you may wrongly conclude that one or more RAM sticks are bad. Many people don’t look at increasing it.

For example, on the AMD platforms you might assume that NB VID voltage means Northbridge video voltage, which it doesn’t. It’s the voltage for the memory drivers inside your CPU. Say it’s at stock and your CPU only recognizes 1333 MHz memory under normal circumstances. You want to overclock and be stable?
You have cooling and everything else tweaked but are still unstable on the memory: bump the voltage up a bit but keep the same settings elsewhere. If the system seems a bit more stable but crashes a little while later, go into the BIOS and bump that voltage up a bit again. Test again. If it fails, keep doing it until you’re stable.

Depends on the CPU, I see the default driver core to be the same as the operational voltage of the CPU. If you have over-volted the CPU to say, 1.4 volts, do the same with the NB VID voltage.

Then run the prime numbers test, in this case mprime after you boot into Linux.

You should be a lot more stable and get a slight bump in performance in bandwidth.

Believe it or not, an internal parameter of the CPUs is that the memory drivers can only put out so much current based on the specific voltage they’re being driven at. It’s called current limiting. The reason this is done: the higher the amount of current being used with a voltage that is lower than the current being used (as a ratio), the less resistance is available to act as a buffer from damage on the die.

When you increase the voltage in the CPUs, it’s not the voltage that creates the heat, it’s the current and the activity within the chip that creates all that extra heat. Also, it’s the type of substrates they use, currently the wiring they use is high purity copper. However, there are better metals that conduct electricity better which leads to a cooler operational temperature inside the chips themselves. Those are known as silver and platinum.

When you increase the voltage, you are also allowing the current limit from the previous stock setting to rise a bit, so you have more current to drive the memory. That’s how it works.

On another note:

The way I test memory is to use a hardware tester, I use the following and I also have a DDR4 tester from another company.

http://www.memorytesters.com/

Read the entire page and scroll down, you will see what I mean.

senz · February 17, 2016, 10:29am

Hey, great summary. But most non-storage stability tests can be achieved just by running a modern game for some time (it loads the CPU when there is no GPU, or both, and it’s actively using memory).
But about HDD tests: SMART is useful only for aging disks; it is of no use for “fresh” drives because of the lack of statistics. I recommend using the WHDD util (it can address the disk with the ATA or syscall API). It does the same thing as badblocks, but much faster, and also shows you the latency of each block read (and write), which can indicate upcoming problems. It also detects and shows HBA, which is useful for analysis of used drives.
Prime calc utils are more of a thermal compound and cooling test, to see that it’s not overheating and is keeping the temperature stable, imo.
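To illustrate the block-read latency idea mentioned above, here is a rough Python sketch that reads a file or block device sequentially and times every block. This is not how WHDD works internally; it is a generic approximation, the function names and the 0.5 second threshold are my own assumptions, and the OS page cache can hide slow sectors on a second pass.

```python
import time


def read_latencies(path, block_size=1024 * 1024):
    """Read a file or block device sequentially and time each block read.

    Returns a list of (offset, seconds) tuples, one per block read.
    """
    timings = []
    offset = 0
    # buffering=0 avoids Python's own read-ahead; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            start = time.perf_counter()
            data = f.read(block_size)
            elapsed = time.perf_counter() - start
            if not data:
                break
            timings.append((offset, elapsed))
            offset += len(data)
    return timings


def slow_blocks(timings, threshold_s=0.5):
    """Pick out blocks whose read took longer than the (arbitrary) threshold."""
    return [(offset, t) for offset, t in timings if t > threshold_s]
```

Reading a raw device such as a whole disk usually needs root, and a few slow blocks are a reason to look closer rather than a verdict on the drive.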

P.S.: another thing about HDD/SSD, if you’re readying one for utilisation or sale, I highly recommend wiping it (not just formatting). There is a fast and secure *nix utility called shred that can wipe files, partitions and disks.

backpackhasjetz · February 18, 2016, 12:19am

You may want to look at something like StableBit Scanner, which is an online (no, not web) drive scanner that is meant to catch issues before SMART does as well. I’ve only had luck in the past with SpinRite on very old spinning disks that were still spinning/accessible. Otherwise, it’s worthless (and you’ll have to take it to professionals anyway…)

ajkerrigan · February 18, 2016, 1:13am

Interesting, I’ll have a chance to check that out soon. I just learned of a couple wonky drives in my family, and SpinRite is generally my go-to for drives that are on the way out. I’ve tried a number of other tools over the years, but so many came up short or were too situational to see much use (such as only working for specific drive makes or filesystems). SpinRite earned my confidence, but it’s always nice to have multiple tools available. I’m looking forward to giving StableBit a shot. Thanks for the recommendation!

codinghorror · February 19, 2016, 6:40am

Wattage is almost irrelevant these days, unless you are running multiple GPUs. It would be difficult to build a system that would really use over 300w even playing a game. (Remember, most games barely use more than dual cores, much less four.) You should measure power use like I do using a kill-a-watt device, they’re about $20. The real issue is quality of that power supply not the wattage number printed on the box.

Gets hairy, requires multiple motherboards. It does happen, of course, but how would you know unless you had another motherboard of the exact same model to test against?

Definitely can be issues, but 100% driver related in my experience. Not a pure stability test, but a driver quality test (and interaction between the drivers and the OS quality test).

Good thing we run both in these tests, then! It is true that the prime95/mprime test that exercises some memory is a better one than the one that fits entirely in the CPU cache, and that’s the one I recommended in the blog entry.

Remember that prime95/mprime isn’t intended as a CPU torture test. The goal is to find Mersenne Primes. The torture test part of it was a happy accident, sort of like how the drug we know as Viagra was never originally intended to deal with ED issues…

I’d also be very careful recommending an old tool that was clearly designed for spinning rust hard drives, not SSDs. They are very different beasts and have radically different failure modes.

VPhantom · February 25, 2016, 7:06pm

(Sorry I didn’t notice your response sooner!)

This is suspend to RAM, with good old “pm-suspend” (Linux), fanless/SSD system (no moving parts) and no overclocking:

Motherboard: Gigabyte Z97MX-Gaming
CPU: Intel Core i7-4790S CPU @ 3.2GHz
RAM: 16GB of Crucial BLS2KIT8G3D1609DS1
PSU: Seasonic 400FL2

The motherboard+CPU+RAM combo was featured at a fanless computing site I trusted, short of knowing any incompatibilities myself (there are always some) these days. No video card (stock Haswell is fine with me). I chose the Seasonic PSU as it was half the price of the NoFan (a.k.a. NoFen) and didn’t need importing to Canada.

I suspect some incompatibility between the PSU, RAM and/or motherboard, but can’t prove it. I boot with x86_reserve_low=640 so I can rule out the BIOS exceeding its memory limits. It only happens (noticeably) once every 30 sleeps or so, making it even more difficult to track. My guess is whatever power the RAM is given during sleep is either slightly dirty or just on the threshold of its needs, and once in a blue moon a few bits are underpowered just long enough to be lost. I use a decent “pure sine” UPS so I wouldn’t blame utility power.
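One way to put a number on the “once every 30 sleeps” guess would be a memory canary: fill a buffer with random bytes, record its hash, suspend, and re-check after waking. The Python sketch below is only an assumption about how such a check could look. It covers just the memory this one process holds, so corruption anywhere else goes unnoticed, and if the pages get swapped out they are not sitting in RAM during the sleep at all.

```python
import hashlib
import os


def make_canary(size_bytes):
    """Allocate a buffer of random bytes and return it with its SHA-256 digest."""
    buf = bytearray(os.urandom(size_bytes))
    return buf, hashlib.sha256(buf).hexdigest()


def canary_intact(buf, expected_digest):
    """Return True if the buffer still hashes to the digest taken earlier."""
    return hashlib.sha256(buf).hexdigest() == expected_digest
```

The idea would be to call `make_canary` with a few hundred megabytes in a long-running process before suspending, then call `canary_intact` after each wake and log the result.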

LexBarringer · February 27, 2016, 7:09am

Okay, now that I know your motherboard uses the Intel Z97 Express chipset: what I do know is that this is very common with motherboards that have this chipset, regardless of the manufacturer. It’s not a driver issue per se, it’s a BIOS glitch.

Contact Gigabyte, tell them what’s happening with your motherboard and wait for them to reply. They may have you download and reflash your BIOS or may ask you to do an RMA of your motherboard.

What you can try for the time being is this:

Get into the BIOS and set everything for optimized but don’t overclock your memory or the CPU, let it sit at standard. Save it and reboot, then get back into the BIOS and turn off “Wake-On-LAN” or whatever it’s called in your BIOS. This can cause all kinds of havoc if it is enabled. It happens on Microsoft Windows and on Linux.

If everything works fine for the next ten times you do this, you can overclock your CPU (if you do it) to the speed you had it at. Then test it ten more times.

If it passes, then slowly overclock your RAM. Don’t put it at maximum; start from default and bump it in small increments.

If it survives one suspend, get in and bump it up a little more, doing the same thing again. If it fails when the memory is overclocked and your memory is “rated” for a certain speed, you may have one or more faulty modules that can’t handle the stress.

Bump it down to a stable speed where you didn’t crash and stay there until you can figure out if you can RMA that memory to wherever you bought it.

As a rule of thumb, although Gigabyte does have some really nice features on their motherboards, they do tend to make crappy hardware (manufacturing wise, I’ve had to fix a lot of them in the past).

If you get another computer, I recommend MSI, EVGA or Asus (with the solid capacitors).

Also, if you’re putting something into suspend mode, don’t remove anything from the USB ports, there is a known hardware glitch there, too.

Hope this helps!

LexBarringer · February 29, 2016, 11:52am

I just wanted to let you know, that from the information you’ve given me, I decided to do a virtual build of your computer on pcpartpicker.com and check the size of the PSU versus the average power being used.

You’re fine!

http://pcpartpicker.com/user/Lex_B/saved/6wWgXL

There are only two scenarios I can think of that would cause this problem.

The first is that the ROM BIOS needs an update and a fix for Suspend to RAM (S3 mode).

This is what is supposed to happen:

You ask to suspend to RAM.
The main tasks stop doing stuff in threads.
The kernel activates the S3 mode.
Your computer reduces the frequency of said RAM and CPU down to a manageable frequency (I’ve seen different frequencies: 8, 25, 33, 66, 100, 133, 200 MHz). Then it drops the RAM and CPU voltage down from the overclock to the base clock and lets it settle. Then it drops the voltage down to suspension voltage and powers down into full S3.

Now, when you ask the computer to wake up, it does it in reverse (or at least in practice it’s supposed to do it that way). Each manufacturer tends to do it a little differently; although they get the BIOS from AMI or Award, they can still customize it.

Something else that will freak out the S3 mode is if you overclock the Southbridge past 100 MHz (aka Auto), that’s a guarantee that it will freak out some motherboards if you try to use S3 while you do this.

Try to suspend to disk instead of to RAM and see what happens. If you’re okay, then it’s more than likely a clocking problem and that’s a BIOS bug.

If it still screws up, the second scenario comes into play: you have bad capacitors on your motherboard or your power management circuitry is bad, maybe both.

Let me know what it is that you find if you use suspend to disk and if it works for you.

LexBarringer · February 29, 2016, 11:54am

Were you using pcpartpicker.com to figure out how much something would cost for your ideal configuration?

VPhantom · February 29, 2016, 1:12pm

That part picker compatibility check is awesome, thank you for bringing it to my attention! Picking models will feel less random next time I build my system from scratch.

I will keep the BIOS update and suspend-to-disk possibilities in mind, but neither is trivial enough for me to attempt (Linux, no swap partition), so I won’t investigate further for now. I will eventually replace the motherboard when an upgrade becomes worthwhile (Skylake?), and hopefully sidestep the issue then (Asus, solid capacitors duly noted). The build dates back to August 2014 so an RMA doesn’t seem plausible even if I could pin this on the motherboard.

LexBarringer · March 1, 2016, 6:06pm

Yes, Skylake is an okay design. However, don’t bother going above 2666 MHz of DDR4, you’re just wasting energy and not getting much more bandwidth the higher you go.

2666 MHz is the break even point for decent CAS timing and bandwidth. You might be able to get a lower stable CAS at that speed, too. I would recommend it if you can afford such luxuries.

Actually, suspend to disk (S4) has nothing to do with a swap partition. It’s based on how you’ve set up your /local and/or /home directories, to save the state information in a file that is by default not encrypted but is compressed.

Swap partitions are only used by the operating system’s kernel during normal use, not when it suspends to disk. The reason for this is simple: S4 mode is for the computer itself and will write a file based on the specifics of your motherboard and file states in memory.

Basically, what is saved is the timings for the components, the stuff loaded in memory (RAM) and what is in the GP-GPU (if you have one) and being used.

The suspend-to-disk or hibernate modes are supposed to be transparent to any operating system and file system implemented, so what types of partitions are present wouldn’t make a difference.

LexBarringer · March 1, 2016, 6:27pm

The Z170 series Intel chipset is the best for your money because of the capabilities it affords you, the runner up is the H then the B series. The Z series is designed for gamers and overclocking enthusiasts.

There is one thing I will tell you though: many people want to use the Micro ITX boards, but it’s exceedingly rare at this time to find a decent PSU for a small shoe box computer or an ITX tower computer (cube shaped). It’s a different size than the standard ATX PSU and it costs a lot more.

The best thing you can do is get a full E-ATX or ATX sized case, then get an ATX PSU and ATX motherboard. The reason why you go for the full sized motherboard is that it handles thermal events much better: more surface area to heat up and fewer of the hotspots that can cause premature component failure. Despite having water cooling or great air cooling, the underside of the CPU in the socket still gets damn hot, never mind the North or South bridge. The microATXs are hit and miss when it comes to thermal dissipation.

Note: The hotter a component gets, the more energy (current in Amps) it takes to keep the status quo and stay working. When components are pushed closer together at a higher operational speed, they create more heat, usually the passive components like capacitors are the first to blow out or fall out of specification, which causes hell on the motherboard.

Also, just because a motherboard is expensive doesn’t make it better than a less expensive one.

The following motherboard manufacturers with solid capacitors are a good buy:

ASRock
Asus
EVGA
MSI
Zotac

In terms of expense, ASRock and Zotac are the less expensive but still good quality. MSI, EVGA and Asus (Republic of Gamers / ROG) are the more expensive motherboards out there. You may or may not need that feature set.

I’m currently using a Z170-WS from Asus, it’s not a ROG motherboard but a workstation that runs the PCI-E v3.00 slots all at x16 electrical and data at the same time.

It’s interesting that there are motherboards that have x16 slots and if you run one or two slots you have x16 for those two slots but if you have more than 3 slots filled, it will drop to x8 speeds but still give the x16 electrical support to the cards.

One of the reasons why that happens is that the actual CPUs may only have a certain number of PCI-E lanes available. I use a special network switch chip for the PCI-E on the motherboard; it introduces a slight increase in latency but it doesn’t kill my bandwidth. In practice I don’t notice the latency, even when playing games on it. Gamers are just chasing ghosts if they think that the increase in latency will affect them terribly.

In case you’re curious, I’m constantly making builds on the pcpartpicker.com website. From the most eco and cheap Intel based gaming computer to the most top heavy power processing workstation. If you want some of the links for my builds, let me know.

LexBarringer · March 1, 2016, 8:17pm

Here is an example of a system I was just working on. It’s under 2k USD but it’s a little powerhouse, good for game development, rendering environments or videos, or gaming, too.

http://pcpartpicker.com/user/Lex_B/saved/NPRrxr

VPhantom · March 2, 2016, 1:08pm

I do know that pm-hibernate can use a regular file instead of a swap partition, however my entire / partition is under software RAID and dm-crypt (getting GRUB to get to /boot/ was a fun project). The bulk of my boot time is spent entering my passphrase twice (once for GRUB, once for Linux itself).

If I had more time I’d experiment with the kernel’s “resume” argument, to see if it is clever enough to mount the RAID and ask for my LUKS passphrase to get to its save file, but all the examples I found use encrypted swap partitions, not files. With my quick start time and the trouble of swapping out the motherboard (if it’s proven guilty), I’m not going to investigate this one any further for now. Maybe when I’m tempted by a new CPU, but this one’s already quite the beast for my needs.

I learned quite a lot towards my next build though, thanks again!

LexBarringer · March 3, 2016, 10:40pm

You’re welcome. Well, had I known you were using software based RAID, I would have suggested something else, but that’s fine too.