Beyond RAID

Soon there will be a Linux equivalent of the ZFS file system:

BTRFS
http://btrfs.wiki.kernel.org/index.php/Main_Page

Due to ZFS’s license they couldn’t use ZFS in the Linux kernel.
For the time being there is a FUSE-based ZFS implementation for Linux to experiment with…

Oh, and to folks who think that a hardware RAID controller eliminates the write performance problem of RAID 5… nope, sorry, it doesn’t really work like that.

The big problem with all the parity/ECC RAID schemes (apart from RAID 2 or RAID 3 - which are rare these days) isn’t calculating the parity/ECC bits at all, not with the controllers and CPUs we have today.

The real problem is that typical writes seldom lay down a full, perfectly-aligned stripe covering all the blocks in a single stripe-width with new data, overwriting everything that was there before.

Instead, typical writes change only a subset of the data blocks in a given stripe. But you still need to know what the other blocks in that stripe contain in order to calculate the new parity/ECC. At the very least you need to read the old parity and the current contents of any block you’re replacing (presuming XOR parity), and possibly you need to read all the untouched blocks from all the disks and redo the whole stripe calculation from scratch.

That incurs seek and read delays from the spinning rust, and causes more data to pass over the IO channels, clogging them up relative to the ideal scenario.
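To make that concrete, here’s a toy sketch of the XOR read-modify-write path (plain Python, a hypothetical 4-data-disk stripe; real controllers obviously work on raw disk blocks, this is just to show where the extra reads come from):

```python
# Toy model of a RAID 5 partial-stripe ("read-modify-write") update.
# Hypothetical stripe: 4 data blocks plus 1 parity block.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(data_blocks, parity, index, new_block):
    """Overwrite one data block and keep parity consistent.

    Costs two reads (old data block, old parity) and two writes
    (new data block, new parity): the extra I/O complained about above.
    """
    old_block = data_blocks[index]            # read #1
    old_parity = parity                       # read #2
    new_parity = xor_blocks(xor_blocks(old_parity, old_block), new_block)
    data_blocks[index] = new_block            # write #1
    return new_parity                         # write #2

# Example: 4 data blocks of 4 bytes each.
blocks = [bytes([i] * 4) for i in range(1, 5)]
parity = blocks[0]
for b in blocks[1:]:
    parity = xor_blocks(parity, b)

parity = small_write(blocks, parity, 2, b"\xff\xff\xff\xff")

# Sanity check: parity still equals the XOR of all data blocks.
check = blocks[0]
for b in blocks[1:]:
    check = xor_blocks(check, b)
assert check == parity
```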

This read-before-write problem is why people aren’t keen, even now, on parity/ECC RAID for OLTP-type scenarios, or even for heavily-used filesystems where you choose to allow the access times of files to be updated. For decision support databases, and other read-predominant (including metadata) scenarios, RAID 5/6/7/whatever is just a more resilient type of stripe and is a clear win.

BTW, apart from WAFL and ZFS, most storage subsystems, all the way up to the user level, still happily believe that a block from a disk is what the disk says it is, without cross-checking the other disks and/or recalculating the parity bits, even in a RAID.

They do this because disks are meant to be reliable and to either fail fast or at least be honest and up-front about unrecoverable errors. By not being uber-paranoid you get more actual throughput and lower latency from the IO subsystem, which most of the time seems like a good trade.

The bad news is that sometimes disks, controllers, and IO busses can and do suffer from Byzantine faults, and you end up with unacknowledged garbage reaching your programs and users from supposedly protected storage. Sometimes you even resilver bad data over good in a mirror. And I have to tell you that it’s a pain in the arse to figure out what’s happening, how to fix it, and how to contain and recover from the business/science/whatever impact of that. Pick up your cane and Vicodin, stick on the deerstalker and disguise, and have at it.

And while ZFS RAID protects you against more hardware failures than ever before, and ZFS and WAFL snapshots help you recover more gracefully from user thinkos (“what do you mean, you didn’t want to delete your whole working directory?”), you still need those remote off-site backups to protect against fire/flood/hackers/mad-axeman/police-confiscation/etc…

Blah, sorry, I typed too much. Storage, it matters y’know?

Yeah, ZFS is awesome - I run RAID-Z2 on my storage box (6x1TB, 6x500GB), never had a problem. Incredibly resilient to corruption too; you can even overwrite the “boot sector” of the disks in question and ZFS recognizes it and doesn’t skip a beat.

Of course, I’d be even happier if my storage box wasn’t slow as hell (40-50 MB/s peak throughput), but having tested the arrays on another computer, it’s not really ZFS’ fault.

I’m not going back to hardware RAID anytime soon, at least.

It’s very important to note that one should be wary of hardware RAID, while software RAID is often useful. Details at the bottom of:
http://www.pixelbeat.org/docs/hard_disk_reliability/#RAID

I think the benefits of RAID and/or ZFS are fairly obvious. What are your opinions on ECC memory?

I’m probably preaching to the choir on this, but for home use the Windows Home Server software is pretty great. RAID-like redundancy without having to match drive sizes.

Try to compare RAID 6 with RAID 1+0 for a fairer comparison:

  • Both use 4 drives as a base configuration
  • Both deliver capacity 2*N for drives of N size.
  • Both handle a failure in a single disk flawlessly

But:
RAID 6 has higher overhead but tolerates any second disk failure.
RAID 1+0 has lower overhead and gives you only about a 67% chance that a second disk failure is also tolerable (it must not hit the mirror partner of the first failed disk).

When you go to 6 disks (as 5 disks on RAID 1+0 is pointless) it becomes even clearer:
RAID 6 still tolerates any two disks failing, and has storage capacity 4N.
RAID 1+0 still only guarantees tolerating 1 failed disk; a 2nd failure is survivable with 80% probability, and a 3rd failure (given that two have already failed and the array survived) with only a 50% chance of being OK. The capacity remains at 3N, which is less.

In other words, RAID 1+0 gives you capacity N*(diskcount/2) and guaranteed redundancy of only one disk (if you’re lucky, up to diskcount/2 disks), whereas RAID 6 gives you capacity N*(diskcount-2) and guaranteed redundancy of two disks.

Taking into account that the RAID 6 overhead is about 5-10% of the CPU power on a modern software RAID 6 with 4 disks, RAID 1+0 seems like the dumbest thing you can do IMO.
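If you want to sanity-check those odds, here’s a quick brute-force sketch (plain Python, assuming a hypothetical 6-disk RAID 1+0 built from three mirrored pairs):

```python
from itertools import combinations

# Hypothetical 6-disk RAID 1+0: three mirrored pairs, striped together.
# The array survives as long as no pair has lost both of its members.
pairs = [(0, 1), (2, 3), (4, 5)]
disks = range(6)

def survives(failed):
    return not any(a in failed and b in failed for a, b in pairs)

for n in (1, 2, 3):
    combos = [set(c) for c in combinations(disks, n)]
    ok = sum(survives(c) for c in combos)
    print(f"{n} random failed disk(s): {ok}/{len(combos)} survive "
          f"({100 * ok / len(combos):.0f}%)")

# Prints 100%, 80%, 40%. The 50% mentioned above is the conditional
# probability of surviving a 3rd failure *given* the first two were
# survivable: 40% / 80% = 50%. RAID 6 on the same six disks survives
# any 1- or 2-disk failure by construction.
```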

Peter, RAID 10 has significantly better write performance than RAID 5 or RAID 6, especially when it comes to small random writes. You are correct that the CPU usage isn’t the problem. These slow random writes can bring a database server to its knees.

To perform a small random write on RAID 5/6 you have to:

  1. Read the old copy of the data block being overwritten, plus the old parity block(s) (or else the rest of the stripe)
  2. Compute the new parity
  3. Write the new data block and the new parity block(s)

ZFS avoids this (and the related RAID 5 “write hole”) with RAID-Z and RAID-Z2: copy-on-write and full-stripe writes mean it basically just writes the new data somewhere fresh and marks the old blocks as unused.
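A toy way to picture the copy-on-write part (plain Python; nothing to do with the real ZFS on-disk format, just the idea that updates land in freshly allocated space instead of overwriting a live stripe):

```python
# Toy copy-on-write block store: writes never overwrite in place; they
# append a new copy and repoint, so a full, freshly allocated stripe can
# be written without first reading whatever used to be there.
class CowStore:
    def __init__(self):
        self.blocks = []    # append-only "disk"
        self.ptr = {}       # logical block id -> physical slot

    def write(self, block_id, data):
        self.blocks.append(data)              # new copy goes to fresh space
        old = self.ptr.get(block_id)          # old copy becomes garbage
        self.ptr[block_id] = len(self.blocks) - 1
        return old                            # a real system frees this later

    def read(self, block_id):
        return self.blocks[self.ptr[block_id]]

store = CowStore()
store.write("A", b"version 1")
store.write("A", b"version 2")    # no read-modify-write of "version 1" needed
assert store.read("A") == b"version 2"
```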

You might also be interested in something I only recently learned about. Linux MD RAID allows for odd numbers of drives in a RAID 10:

http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10

I don’t understand the aversion to having RAID on your workstation. I’ve used a 2-disk striped RAID (10,000 rpm VelociRaptors) on my desktop for a couple of years now and just wouldn’t go back to a single-disk setup.

Nowadays the reliability of a hard disk is excellent, and in 10 years I’ve only had one failure. Also, if your machine is built on a single disk and that fails, you’re still up the same creek; striping in effect only doubles what is a very small risk. I keep regular backups to a third disk, but I would do that even on a single-disk setup.
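Quick back-of-the-envelope on that “doubling a small risk” point (plain Python; the 3% annual failure rate is a made-up number for illustration):

```python
p = 0.03                        # assumed annual failure rate per drive
single = p                      # chance the lone disk dies this year
striped = 1 - (1 - p) ** 2      # chance at least one of two striped disks dies
print(f"single disk: {single:.1%}, 2-disk stripe: {striped:.1%}")
# -> 3.0% vs 5.9%: roughly double, but still small either way
# (and either way, only backups save you from it).
```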

The extra pep you get in your machine on everyday tasks is truly beyond what you’d get from faster memory or overclocking (apart from maybe gaming).

Try it, it’s totally worth it!

Check out this vid of ZFS in action using simple USB thumb drives and you get an idea of how cool ZFS can be…

http://video.google.com/videoplay?docid=8100808442979626078

“If you take four hard drives, stripe the two pairs, then mirror the two striped arrays – why, you just created yourself a magical RAID 10 concoction!”

Create X mirrored pairs, then stripe across the sets. If you stripe first, a single drive loss takes a whole stripe set down, and a second drive failure loses the array unless it happens to hit the set that’s already offline. If you mirror and then stripe, you can survive a second drive loss provided it is not the partner of the first failed drive. You also don’t lose as much performance, since every stripe column still has a working disk. With modern arrays a hot spare will automatically cover failed drives, so don’t forget that aspect of things.
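Here’s a tiny brute-force check of that ordering difference (plain Python, a hypothetical 6-drive array: three mirrored pairs striped together vs. two 3-drive stripe sets mirrored):

```python
from itertools import combinations

# RAID 1+0: mirror pairs first, then stripe across the pairs.
# RAID 0+1: two 3-drive stripe sets first, then mirror the sets.
mirror_pairs = [{0, 1}, {2, 3}, {4, 5}]
stripe_sets  = [{0, 1, 2}, {3, 4, 5}]

def survives_10(failed):
    # Dies only if some mirrored pair loses both members.
    return not any(pair <= failed for pair in mirror_pairs)

def survives_01(failed):
    # Dies once both stripe sets have lost at least one member.
    return not all(s & failed for s in stripe_sets)

combos = [set(c) for c in combinations(range(6), 2)]
print("two-drive failures survived:")
print("  RAID 1+0:", sum(map(survives_10, combos)), "of", len(combos))   # 12 of 15
print("  RAID 0+1:", sum(map(survives_01, combos)), "of", len(combos))   #  6 of 15
```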

I’ve been using Raid-5 in my laptop for some time (Yes, it’s a rather big laptop that I use as my mobile desktop). It’s fast, it’s pretty secure and I wouldn’t go back to JBOD quickly.

I remember the times when Oracle specialists insisted on not using Raid-5 on database servers. Times are changing.

I’d like to emphasize that RAID 6 has a much bigger performance impact than RAID 5: the XOR for RAID 5 can be done very easily, even in hardware, but RAID 6 needs a second, Reed-Solomon-style parity over GF(2^8), which is still considerably harder to implement efficiently.
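For the curious, that second (“Q”) parity is just arithmetic over GF(2^8). A minimal, unoptimized sketch of the per-byte math (plain Python; real implementations such as the Linux md raid6 code use lookup tables or SIMD rather than anything this naive):

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) using the 0x11d polynomial
    commonly used for RAID 6 Q parity."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
    return p

def raid6_pq(data_bytes):
    """P and Q parity for one byte position across the data disks:
    P is the plain XOR, Q weights each disk by a power of the generator g=2."""
    p_par, q_par, coeff = 0, 0, 1
    for d in data_bytes:
        p_par ^= d
        q_par ^= gf_mul(coeff, d)
        coeff = gf_mul(coeff, 2)    # next power of g
    return p_par, q_par

# Example: one byte from each of four data disks.
print(raid6_pq([0x11, 0x22, 0x33, 0x44]))
```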

You forgot to mention that a RAID 1 can have superior read performance (if the controller/driver do it right) since two different read requests can be executed at the same time (2 disks with identical data, after all). In that regard, RAID 1 can even be faster than RAID 0 for some special cases.
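A toy illustration of why that works (plain Python; a hypothetical shortest-queue policy, whereas real drivers also consider head position and request merging):

```python
# Two disks hold identical data in RAID 1, so independent reads can be
# handed to whichever disk currently has the shorter queue.
class MirrorReadScheduler:
    def __init__(self):
        self.queues = {0: [], 1: []}    # pending reads per disk

    def submit_read(self, block):
        disk = min(self.queues, key=lambda d: len(self.queues[d]))
        self.queues[disk].append(block)
        return disk

sched = MirrorReadScheduler()
print([sched.submit_read(b) for b in range(6)])   # [0, 1, 0, 1, 0, 1]
```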

Backups are important, but sometimes it can be impractical to back up 100s of GBs multiple times a day.

That’s what Rsync is for. Unless you’re generating 100s of GB of new data every day…

Having a RAID on the desktop is a very good thing, provided you don’t do it using one of those shitty built-in desktop motherboard chipsets. You need a decent RAID controller for it to be worthwhile. I’ve seen bad desktop RAID setups run at less than half the speed of a 5400rpm hard drive plugged into USB on the same PC. Also, for anybody who wants to jump on the SSD (SLC please!) bandwagon, having a decent RAID controller is the only way you’re going to get awesome throughput and capacity.

I see the DROBO advertised here and there and that looks interesting, though it definitely isn’t cheap. I wonder if anyone here has any experience with them.

Doesn’t RAID 1 theoretically also allow more disk seeks per second, and double the read data rate? But even without that, IMO RAID 1 is useful even on the desktop: when one drive breaks, you can keep working until the replacement drive arrives, and you don’t have the hassle of restoring from backup.

Btw. “many huge data centres each with hundreds of racks containing dozens of Sun Sunfires filled exclusively with IBM Deathstar 75GXP drives and plastic explosive”: nice comparison :smiley:

You should look at 3PAR LUNs.

For the MTV generation in us all, maybe start with your point… a new RAID product that you think is pretty cool.

The intro-to-RAID primer seems a bit textbook for your post, as there appeared to be little in the way of lessons learned until you mention RAID-Z and ZFS.