Before we go into specifics, consider your use case. Are you storing photos, MP3s and DVD rips? If so, you might not care whether you permanently lose a single block from the array. On the other hand, if it's important data, this might be a disaster.
The reason people say RAIDZ-1 is "not good enough for real world failures" is that you are likely to have a latent media error on one of your surviving disks when reconstruction time comes. The same logic applies to RAID5.
ZFS mitigates this failure to some extent. If a RAID5 device can't be reconstructed, you are pretty much out of luck; copy your (remaining) data off and rebuild from scratch. With ZFS, on the other hand, it will reconstruct all but the bad chunk, and let the administrator "clear" the errors. You'll lose a file/portion of a file, but you won't lose the entire array. And, of course, ZFS's parity checking means that you will be reliably informed that there's an error. Otherwise, I believe it's possible (although unlikely) that multiple errors will result in a rebuild apparently succeeding, but giving you back bad data.
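As a hedged sketch of what that workflow looks like (the pool name `tank` is just a placeholder): after a scrub or resilver, `zpool status -v` lists any files with permanent errors, and once you've restored or deleted them you can reset the counters.

```sh
# Show pool health, including a list of files with permanent errors
zpool status -v tank

# After restoring or deleting the affected files, reset the error counters
zpool clear tank
```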
Since ZFS is a "Rampant Layering Violation," it also knows which areas don't have data on them, and can skip them in the rebuild. So if your array is half empty you're half as likely to have a rebuild error.
You can reduce the likelihood of these kinds of rebuild errors on any RAID level by doing regular "zpool scrubs" or "mdadm checks" of your array. There are similar commands/processes for other RAIDs; e.g., LSI/Dell PERC RAID cards call this "patrol read." These read everything, which may help the disk drives find failing sectors and reassign them before they become permanent. If they are permanent, the RAID system (ZFS/md/RAID card/whatever) can rebuild the data from parity.
Even if you use RAIDZ2 or RAID6, regular scrubs are important.
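For what it's worth, a minimal sketch of what "regular scrubs" can look like in practice (the pool name `tank`, the md device and the schedule are placeholders; many distros already ship an equivalent cron job or systemd timer):

```sh
# ZFS: start a scrub by hand and watch its progress
zpool scrub tank
zpool status tank

# Example cron entry: scrub at 02:00 on the 1st of every month
# 0 2 1 * *  root  /sbin/zpool scrub tank

# Linux md: trigger a check (the "mdadm check" mentioned above)
echo check > /sys/block/md0/md/sync_action
```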
One final note - RAID of any sort is not a substitute for backups - it won't protect you against accidental deletion, ransomware, etc. Regular ZFS snapshots can, however, be part of a backup strategy.
Answer from Dan Pritts on serverfault.com
There is a little bit of a misconception at work here. A lot of the advice you're seeing is based on an assumption which may not be true. Specifically, the unrecoverable bit error rate of your drive.
A cheap 'home user' disk has an unrecoverable error rate of 1 per 10^14 bits read.
http://www.seagate.com/gb/en/internal-hard-drives/desktop-hard-drives/desktop-hdd/#specs
This is at a level where you're talking about a significant likelihood of an unrecoverable error during a RAID rebuild, and so you shouldn't do it. (A quick and dirty calculation suggests that a 5x 2TB RAID-5 set will actually have around a 60% chance of this.)
However, this isn't true for more expensive drives: http://www.seagate.com/gb/en/internal-hard-drives/enterprise-hard-drives/hdd/enterprise-performance-15k-hdd/#specs
1 per 10^16 is 100x better - meaning a 5x 2TB set has a <1% chance of a failed rebuild. (Probably less, because for enterprise usage, 600GB spindles are generally more useful.)
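For reference, the quick and dirty calculation behind those figures looks roughly like this (assuming independent bit errors and that a rebuild reads the four surviving disks end to end; whether you land nearer 50% or 60% depends on exactly what you count):

```latex
P_{\text{fail}} = 1-(1-p)^{b} \approx 1-e^{-pb}, \qquad b \approx 4 \times 2\ \text{TB} \times 8 = 6.4\times10^{13}\ \text{bits}

p = 10^{-14}:\quad P \approx 1-e^{-0.64} \approx 47\%\ \ \text{(same ballpark as the \~{}60\% above)}
p = 10^{-16}:\quad P \approx 1-e^{-0.0064} \approx 0.6\%
```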
So personally - I think both RAID-5 and RAID-4 are still eminently usable, for all the reasons RAID-0 is still fairly common. Don't forget - the problem with RAID-6 is its hefty write penalty. You can partially mitigate this with lots of caching, but you've still got some pain built in, especially when you're working with slow drives in the first place.
And more fundamentally - NEVER EVER trust your RAID to give you full resilience. You'll lose data more often to an 'oops' than to a drive failure, so you NEED a decent backup strategy if you care about your data anyway.
Even given what one of the other answers here laid out, namely that ZFS only works with actual used blocks and not empty space, yes, it is still dangerous to make a large RAIDZ1 vdev. Most pools end up at least 30-50% utilized, and many go right up to the recommended maximum of 80% (some go past it; I highly recommend you do not do that at all, for performance reasons), so the fact that ZFS deals only with used blocks is not a huge win. Also, some of the other answers make it sound like a bad read is what causes the problem. This is not so. Bit rot inside a block is usually not what's going to screw you here; it's another disk flat out going bad while the resilver for the first failed disk is still running that'll kill you - and on 3 TB disks in a large raidz1 it can take days, even weeks, to resilver onto a new disk, so your chance of that happening is not insignificant.
My personal recommendation to customers is to never use RAIDZ1 (RAID5 equivalent) at all with > 750 GB disks, ever, just to avoid a lot of potential unpleasantness. I've been OK with them breaking this rule because of other reasons (the system has a backup somewhere else, the data isn't that important, etc), but usually I do my best to push for RAIDZ2 as a minimum option with large disks.
Also, for a number of reasons, I usually recommend not going more than 8-12 disks in a raidz2 stripe or 11-15 disks in a raidz3 stripe. You should be on the low-end of those ranges with 3 TB disks, and could maybe be OK on the high-end of those ranges on 1 TB disks. That this helps reduce the chance that more disks fail while a resilver is going on is only one of those reasons, but it's a big one.
If you're looking for some sane rules of thumb (edit 04/10/15 - I wrote these rules with only spinning disks in mind; because they're also logical [why would you do fewer than 3 disks in a raidz1?] they make some sense even for SSD pools, but all-SSD pools were not a thing in my head when I wrote these down):
- Do not use raidz1 at all on > 750 GB disks.
- Do not use less than 3 or more than 7 disks on a raidz1.
- If thinking of using 3-disk raidz1 vdevs, seriously consider 3-way mirror vdevs instead.
- Do not use less than 6 or more than 12 disks on a raidz2.
- Do not use less than 7 or more than 15 disks on a raidz3.
- Always remember that unlike traditional RAID arrays, where the # of disks increases IOPS, in ZFS it is the # of vdevs, so going with shorter-stripe vdevs improves pool IOPS potential (see the sketch below).
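To make that last point concrete, here's a hedged sketch with 12 hypothetical disks (sda..sdl): one wide raidz2 gives you a single vdev's worth of IOPS, while two narrower raidz2 vdevs in the same pool give you roughly twice the IOPS potential at the cost of two more disks going to parity.

```sh
# One 12-wide raidz2 vdev: usable capacity of ~10 disks, IOPS of ~1 vdev
zpool create tank raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl

# Two 6-wide raidz2 vdevs: usable capacity of ~8 disks, IOPS of ~2 vdevs
zpool create tank \
    raidz2 sda sdb sdc sdd sde sdf \
    raidz2 sdg sdh sdi sdj sdk sdl
```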
Is RAID-Z as bad as R5? No. Is it as good as R1 or R10? Usually no.
RAID-Z is aware of blank spots on the drives, where R5 is not. So RAID-Z only has to read the areas with data to recover the missing disk. Also, data isn't necessarily striped across all the disks. A very small file might reside on just a single disk, with the parity on another disk. Because of this, RAID-Z only has to read as much data as the space used on the array (if 1 MB is used on a 5 TB array, then a rebuild only needs to read 1 MB).
Going the other way, if most of a large array is full, then most of the data will need to be read off all the disks. Compare that to R1 or R10, where the data only needs to be pulled off exactly one disk per failed disk (assuming, of course, the failures leave the array recoverable at all).
What you're worrying about is the fact that with every sector read operation there's a chance you'll find a sector that wasn't written correctly or is no longer readable. For a typical drive these days that's a rate of around 1x10^-16 per bit read (not all drives are equal, so look up the specs on your drives to figure out their rating). This is incredibly infrequent, but it comes out to roughly one error per petabyte read; for a 10TB array there's about a 1% chance your array is toast and you don't know it until you try to recover it.
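Back-of-the-envelope, assuming independent errors at that rate:

```latex
10^{16}\ \text{bits} \approx 1.25\ \text{PB read per expected error}
\text{Reading a full 10 TB array: } 10^{13}\ \text{B} \times 8 = 8\times10^{13}\ \text{bits}
\Rightarrow\ P \approx 8\times10^{13} \times 10^{-16} = 0.8\%,\ \text{roughly the 1\% quoted}
```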
ZFS also helps mitigate this chance, since most unreadable sectors are noticeable before you start trying to rebuild your array. If you scrub your ZFS array on a regular basis, the scrub operation will pick up these errors and work around them (or alert you so you can replace the disk, if that's how you roll). The recommendation is to scrub enterprise-grade disks about one to four times a month, and consumer-grade drives at least once a week.
I'm planning to get 2 or 3 Samsung PM9A1 drives. They're supposed to be extremely fast (7000 MB/s read, 5200 MB/s write, with 1M IOPS read and 850k IOPS write).
On those I'd be holding mostly running VM disks, the root operating system and possibly source code / compilation artifacts for a container.
Is RAIDZ1 going to somehow tank performance as opposed to mirroring or is ZFS fast enough to saturate the disks in either config?
The way I understand it, you'd get the performance of a single disk in either configuration, no?
I'm hoping to set up 3 separate EFI partitions, one on each drive, so that any one of them dying won't mean the machine can't boot anymore.
https://www.ibm.com/support/pages/re-evaluating-raid-5-and-raid-6-slower-larger-drives
Not specific to ZFS, but it presents the advantages of RAID6 over RAID5 in terms of the probability of hitting a URE during a rebuild.
As a noob, I keep hearing and reading about the dangers of running a RAIDz1 and losing all your data during the resilver process when a second drive fails. Here's my question: who has actually experienced this? How often has it happened? I believe most people on this reddit are data-driven folks and can weigh the real against the hype. I'd like to know: how often does a second drive fail?
I think it was a misunderstanding in the old thread. I was comparing the chance of failure for two disks in a row when using either Z1 parity raid or no RAID (as you stated in the comments in the other thread). In my eyes it was never about Z1 vs. striped pool of basic vdevs, because that game is essentially over after the first fault anyway, so Z1 is of course better.
But if you just compare multiple independent pools against a single pool with a single Z1 vdev, then the problem of increased load while recalculating the parity information persists.
On the comparison of Z1 vs Z2, which the answer by Michael was mainly about, the other two points apply. I should have been clearer in the comments, but they are unfortunately limited in space. I hope this answer clears some of this up.
I thought the same thing, but I didn't realize that a URE isn't just a bit flip - it spoils the entire pool.
If we simplify the whole thing, you have your disk with its controller chip on the bottom, and your hardware (RAID controller) or software (e.g. ZFS) on top.
If any error happens in the hardware and a sector cannot be read, the chip first tries to correct it on its own if possible (for example by reading the problem sector multiple times). If it still can't make anything out of it, it gives up. On normal disks this can take minutes and stalls the complete system, which is waiting for a "success" or "failure" message for the pending IO operation.
Some disks have a feature called TLER (time limited error recovery), which is a hard timeout that limits this error correction time to 6-9 seconds, because traditionally most hardware RAID controllers dropped the whole disk after about 9 seconds. The idea is that a single bad sector should not make the whole disk unavailable, but instead be corrected from the redundant data on the other disks (something a single disk in a desktop system cannot rely on, so there a long timeout is preferable).
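If you're curious whether your drives support this, smartmontools can query and set SCT ERC (the values are in tenths of a second, so 70 means 7.0 s). Treat this as a sketch - support and persistence across power cycles vary from drive to drive:

```sh
# Read the current SCT Error Recovery Control timeouts
smartctl -l scterc /dev/sda

# Set read/write recovery timeouts to 7.0 seconds each
smartctl -l scterc,70,70 /dev/sda
```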
Now, let's look at the software side: if you configure your raid controller or ZFS file system with redundancy, for example by using mirrored disks or a mirror vdev as basis for your pool, your URE can be corrected. If you do not use redundancy, the data on this sector will be gone, which may be data you care about or just random old temp data or nothing, depending on your luck. The same applies to bit flips, although the chance of them happening seems to be more dependent on outside effects (like cosmic radiation).
Since RAID0 is not subject to UREs, the question is "what is more likely, a URE in RAIDZ or a disk failure in RAID0?"
I haven't accepted this answer because I don't think it adequately explains the relevant points, but I was planning on creating my own answer once I understand why UREs destroy the whole pool, if no one else gets to it first.
I suggest you read a basic explanation of ZFS pool layout. To summarize the most important bits:
- You can create virtual devices (vdevs) from disks, partitions or files. Each vdev can be created with different redundancy: basic (no redundancy), mirrored (up to N-1 of its N disks can fail), or parity raid Z1/Z2/Z3 (1/2/3 disks can fail). All redundancy works on the vdev level.
- You create storage pools from one or more vdevs. They are always striped, therefore the loss of a single vdev means the loss of the whole pool.
- You can have any number of pools, which are independent. If one pool is lost, the other pools continue to function.
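A minimal sketch of those three points, using hypothetical disk names:

```sh
# poolA: one mirror vdev - either of its two disks can fail
zpool create poolA mirror sda sdb

# poolB: one raidz1 vdev - any single one of its three disks can fail
zpool create poolB raidz1 sdc sdd sde

# poolC: two vdevs, which are always striped - lose either whole vdev
# (e.g. both sdf and sdg) and all of poolC is gone
zpool create poolC mirror sdf sdg mirror sdh sdi

# The pools are independent: losing poolC does not affect poolA or poolB
```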
Therefore you can reason the following:
- If possible, prefer Z2 over Z1 because of the increased load and big window of (negative) opportunity when rebuilding large drives (large being anything over 1 TB approximately)
- If having to choose between Z1 and multiple basic vdevs, prefer Z1 because of bit error correction which is not possible with basic vdevs
- If you can accept partial pool loss, segment your pool into multiple smaller pools backed by a single vdev each, so that you get checksum information and faster rebuild times on fatal faults
In any of the above cases, you need to have a backup. If you cannot or don't want to afford any backup, it is about what you are more comfortable losing - some parts of the pool with higher probability, or everything with lower probability. I personally would choose the first option, but you may decide otherwise.
What is implied in the answer you quoted is that with increasing storage capacity the chance of failure increases accordingly, not only for the rebuild operation but for normal activity as well. So, statistically speaking, RAIDZ1 is no more fault tolerant than RAID 0 when talking about modern 4 TB drives, even though prima facie the case is made that it is.
So some argue that RAIDZ1 is, in fact, not an increase in protection against data loss for large-capacity hard disk drives. This has less to do with mechanical failure of the drive(s), or at least not with critical failure. A URE is, to put it simply (and very simplistically), a failure to read. Whether it's due to a prolonged read from a bad sector, the disk running out of spare sectors, or any other cause doesn't really matter - it will happen, like it or not. Take the bad sector example: normally this is handled by the drive internally, but if there are enough of them, or the drive takes its sweet while to fix one, the RAIDZ layer might interpret the delay as a drive failure and eject the drive. Now imagine it's the SECOND drive in the pool, and it happened while rebuilding... The only viable mitigation is to scrub the array for those errors - if caught early, the error will be just a burp and the pool will recover the data easily. But this means putting quite a big load on the drives, which in turn drastically increases the statistical chance of a URE (remember: age, writes and the volume of data already increase it a lot, before you increase the reads by an order of magnitude over normal operations; and all of this applies to each drive separately).
Thus the answer to your question (is RAIDZ1 an incremental improvement over no fault tolerance?) is: not really. If we use the logic of the quote, you face a 50% chance (I think) of enough disk failures for the data to be unrecoverable within the first two years of the disks' operation.
That is why, when our company was faced with the dilemma of server availability versus storage capacity, we bit the bullet and went for RAID6 on SSDs. That should be enough for a couple of years, and then we'll probably upgrade if needed.
I have 5x 16 TB Exos drives and am looking for opinions on whether I should set them up as 4 in a raidz1 with a hot spare, or with all 5 in a raidz2.
If I am understanding correctly, my total usable capacity would be the same either way?
I am leaning toward the raidz2, so if I am resilvering to replace a dead drive, my data will be safe if another dies during the process.
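For reference, ignoring metadata, padding and allocation overhead, the raw usable capacity does work out the same either way:

```latex
\text{raidz1 of 4 disks (+1 idle hot spare): } (4-1)\times 16\ \text{TB} = 48\ \text{TB usable}
\text{raidz2 of 5 disks: } (5-2)\times 16\ \text{TB} = 48\ \text{TB usable}
```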
btw, any word yet on when the ability to add a disk to an existing raidz vdev will be released? I heard vague rumors of "soon" when I was first researching ZFS last year.