Before we go into specifics, consider your use case. Are you storing photos, MP3s and DVD rips? If so, you might not care whether you permanently lose a single block from the array. On the other hand, if it's important data, this might be a disaster.

RAIDZ-1 is said to be "not good enough for real world failures" because you are likely to hit a latent media error on one of your surviving disks when reconstruction time comes. The same logic applies to RAID5.

ZFS mitigates this failure to some extent. If a RAID5 device can't be reconstructed, you are pretty much out of luck; copy your (remaining) data off and rebuild from scratch. ZFS, on the other hand, will reconstruct all but the bad chunk, and let the administrator "clear" the errors. You'll lose a file or a portion of a file, but you won't lose the entire array. And, of course, ZFS's block checksums mean that you will be reliably informed that there's an error. Otherwise, I believe it's possible (although unlikely) that multiple errors will result in a rebuild apparently succeeding, but giving you back bad data.
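As a rough illustration of that workflow (the pool name "tank" is just a placeholder), the commands involved look something like this:

    # After a scrub or resilver reports unrecoverable errors, list the affected files
    zpool status -v tank
    # Restore or delete the damaged files, then acknowledge the error counters
    zpool clear tank
    # Confirm the pool is back to ONLINE with no new errors
    zpool status tank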

Since ZFS is a "Rampant Layering Violation," it also knows which areas don't have data on them, and can skip them in the rebuild. So if your array is half empty you're half as likely to have a rebuild error.

You can reduce the likelihood of these kinds of rebuild errors on any RAID level by doing regular "zpool scrub" or "mdadm check" passes over your array. There are similar commands/processes for other RAIDs; e.g., LSI/Dell PERC RAID cards call this "patrol read." These go and read everything, which may help the disk drives find failing sectors and reassign them before they become permanent. If they are permanent, the RAID system (ZFS/md/RAID card/whatever) can rebuild the data from parity.
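For reference, the periodic checks mentioned above look roughly like this on ZFS and on Linux md (pool and device names are placeholders):

    # ZFS: read and verify every allocated block in the pool
    zpool scrub tank
    zpool status tank                     # shows scrub progress and repaired errors

    # Linux md (mdadm): kick off a background consistency check of /dev/md0
    echo check > /sys/block/md0/md/sync_action
    cat /proc/mdstat                      # shows check progress
    cat /sys/block/md0/md/mismatch_cnt    # non-zero means inconsistencies were found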

Even if you use RAIDZ2 or RAID6, regular scrubs are important.

One final note - RAID of any sort is not a substitute for backups; it won't protect you against accidental deletion, ransomware, etc., although regular ZFS snapshots can be part of a backup strategy.
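As a sketch of how snapshots can feed into a backup strategy (dataset, date and host names are made up):

    # Take a recursive snapshot of everything in the pool
    zfs snapshot -r tank@2024-06-01
    zfs list -t snapshot                  # list existing snapshots
    # Replicate the snapshot to another machine, which is what makes it a backup
    zfs send -R tank@2024-06-01 | ssh backuphost zfs receive -F backuppool/tank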

Answer from Dan Pritts on serverfault.com

TrueNAS Community: What's the stigma behind using RAIDZ1?
July 9, 2023 - In a 6-drive RAIDZ1 you will lose your pool as soon as a second drive goes bad; a 6-drive pool of striped mirrors can suffer up to 3 drive failures, so long as the failed drives aren't in the same vdev.

Answer 2 of 3:

There is a little bit of a misconception at work here. A lot of the advice you're seeing is based on an assumption which may not be true: specifically, the unrecoverable bit error rate of your drive.

A cheap 'home user' disk has an unrecoverable error rate of 1 per 10^14 bits read.

http://www.seagate.com/gb/en/internal-hard-drives/desktop-hard-drives/desktop-hdd/#specs

This is at a level where you're talking about a significant likelihood of an unrecoverable error during a RAID rebuild, and so you shouldn't do it. (A quick and dirty calculation suggests that a RAID-5 set of 5x 2 TB disks will actually have around a 60% chance of this.)

However this isn't true for more expensive drives: http://www.seagate.com/gb/en/internal-hard-drives/enterprise-hard-drives/hdd/enterprise-performance-15k-hdd/#specs

1 per 10^16 is 100x better, meaning a 5x 2 TB set has a <1% chance of a failed rebuild. (Probably less, because for enterprise usage, 600 GB spindles are generally more useful.)
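A back-of-the-envelope version of that calculation, treating the quoted rate as "one error per N bits read" and ignoring correlated failures, can be run with awk (the disk counts and sizes below just mirror the example above):

    # P(at least one URE) ~= 1 - exp(-bits_read * error_rate)
    # Rebuilding a 5x 2 TB RAID-5 set means reading the 4 surviving disks, ~8 TB
    awk 'BEGIN { bits = 4 * 2e12 * 8
                 printf "consumer   1e-14: %.0f%%\n", 100 * (1 - exp(-bits * 1e-14))
                 printf "enterprise 1e-16: %.2f%%\n", 100 * (1 - exp(-bits * 1e-16)) }'
    # Prints roughly 47% and 0.64% - the same ballpark as the figures above;
    # the exact number depends on whether you count all five disks or only the survivors.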

So personally, I think both RAID-5 and RAID-4 are still eminently usable, for all the reasons RAID-0 is still fairly common. Don't forget, the problem with RAID-6 is its hefty write penalty. You can partially mitigate this with lots of caching, but you've still got some pain built in, especially when you're working with slow drives in the first place.

And more fundamentally: NEVER EVER trust your RAID to give you full resilience. You'll lose data more often to an 'oops' than to a drive failure, so you NEED a decent backup strategy if you care about your data anyway.

Top answer (1 of 3) from another thread:

Even given what one of the other answers here laid out, namely that ZFS only works with actual used blocks and not empty space, yes, it is still dangerous to make a large RAIDZ1 vdev. Most pools end up at least 30-50% utilized, and many go right up to the recommended maximum of 80% (some go past it; I highly recommend you do not do that at all, for performance reasons), so the fact that ZFS deals only with used blocks is not a huge win. Also, some of the other answers make it sound like a bad read is what causes the problem. This is not so. Bit rot inside a block is usually not what's going to screw you here; it's another disk just flat-out going bad while the resilver from the first failed disk is still going on that'll kill you. And on 3 TB disks in a large raidz1, it can take days, even weeks, to resilver onto a new disk, so your chance of that happening is not insignificant.

My personal recommendation to customers is to never use RAIDZ1 (the RAID5 equivalent) at all with > 750 GB disks, ever, just to avoid a lot of potential unpleasantness. I've been OK with them breaking this rule for other reasons (the system has a backup somewhere else, the data isn't that important, etc.), but usually I do my best to push for RAIDZ2 as the minimum option with large disks.

Also, for a number of reasons, I usually recommend not going more than 8-12 disks in a raidz2 stripe or 11-15 disks in a raidz3 stripe. You should be on the low end of those ranges with 3 TB disks, and could maybe be OK on the high end of those ranges with 1 TB disks. Keeping down the likelihood that more disks fail while a resilver is going on is only one of those reasons, but it's a big one.

If you're looking for some sane rules of thumb (edit 04/10/15: I wrote these rules with only spinning disks in mind; because they're also logical [why would you do fewer than 3 disks in a raidz1?], they make some sense even for SSD pools, but all-SSD pools were not a thing in my head when I wrote them down):

  • Do not use raidz1 at all on > 750 GB disks.
  • Do not use less than 3 or more than 7 disks on a raidz1.
  • If thinking of using 3-disk raidz1 vdevs, seriously consider 3-way mirror vdevs instead.
  • Do not use less than 6 or more than 12 disks on a raidz2.
  • Do not use less than 7 or more than 15 disks on a raidz3.
  • Always remember that unlike traditional RAID arrays, where the # of disks increases IOPS, in ZFS it is the # of vdevs, so going with shorter stripe vdevs improves pool IOPS potential (see the layout sketch below).
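To make that last point concrete, here is a sketch (device names are placeholders) of one wide vdev versus two narrower vdevs built from the same twelve disks:

    # One wide 12-disk raidz2 vdev: capacity of 10 disks, IOPS of roughly one vdev
    zpool create tank raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl

    # Two 6-disk raidz2 vdevs: capacity of 8 disks, but twice the vdev count,
    # so roughly double the pool IOPS potential and shorter resilvers per vdev
    zpool create tank \
        raidz2 sda sdb sdc sdd sde sdf \
        raidz2 sdg sdh sdi sdj sdk sdl
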
Answer 2 of 3:

Is RAID-Z as bad as R5? No. Is it as good as R1 or R10? Usually not.

RAID-Z is aware of blank spots on the drives, where R5 is not. So RAID-Z only has to read the areas with data to recover the missing disk. Also, data isn't necessarily striped across all the disks; a very small file might reside on just a single disk, with its parity on another disk. Because of this, RAID-Z has to read only as much data as the space actually used on the array (if 1 MB is used on a 5 TB array, then a rebuild only needs to read 1 MB).

Going the other way, if most of a large array is full, then most of the data will need to be read off all the disks. Compare that to R1 or R10, where the data only needs to be pulled off exactly one disk per failed disk (and multiple disks can fail only in situations where the array is still recoverable anyway).

What you're worrying about is the fact that with every sector read operation there's a chance you'll hit a sector that wasn't written correctly or is no longer readable. For a typical drive these days that rate is around 1x10^-16 per bit read (not all drives are equal, so look up the specs on your drives to figure out their rating). This is incredibly infrequent, but it comes out to about one error per 1 PB read; for a 10 TB array, that's roughly a 1% chance your array is toast and you don't know it until you try to recover it.

ZFS also helps mitigate this chance, since most unreadable sectors are noticeable before you start trying to rebuild your array. If you scrub your ZFS array on a regular basis, the scrub operation will pick up these errors and work around them (or alert you so you can replace the disk, if that's how you roll). The usual recommendation is to scrub enterprise-grade disks about one to four times a month, and consumer-grade drives at least once a week, or more.
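A minimal way to put that advice on a schedule is a cron entry like the one below (the pool name is a placeholder, and many distributions already ship a scrub timer, so check before adding your own):

    # /etc/cron.d/zfs-scrub - weekly scrub, Sundays at 03:00; adjust path and schedule to taste
    0 3 * * 0  root  /usr/sbin/zpool scrub tank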

Reddit, r/zfs: Do I want RAIDz1?
February 27, 2022 - ... If your HDDs are 2TB or less in capacity, then Z1 is safe. But with capacities above 2TB, Z1 incurs a risk during the rebuild window, before a possible second disk failure. Best not to use Z1 with disks larger than 2TB; Z2 is best for that scenario (larger than 2TB).
Reddit, r/zfs: RAIDZ1 vs mirror on NVMe
September 4, 2020 -

I'm planning to get 2 or 3 Samsung PM9A1 drives. They're supposed to be extremely fast (7000 MB/s read, 5200 MB/s write, 1M IOPS read, 850k IOPS write).

On those I'd be holding mostly running VM disks, the root operating system and possibly source code / compilation artifacts for a container.

Is RAIDZ1 going to somehow tank performance as opposed to mirroring or is ZFS fast enough to saturate the disks in either config?

The way I understand it, you'd get the performance of a single disk in either configuration, no?

I'm hoping to set up 3 separate EFI partitions, one on each drive, so that any one of them dying won't mean the machine can't boot anymore.

Top answer (1 of 5):
"Is RAIDZ1 going to somehow tank performance as opposed to mirroring or is ZFS fast enough to saturate the disks in either config?"

You shouldn't expect to get the maximum rated speed out of those NVMe disks with ZFS either way. At SATA SSD speeds, the extra housekeeping ZFS does isn't a scale problem, even with itty bitty (x86) CPUs. At NVMe speeds, it starts being a factor.

Beyond that... yes, you will see performance differences between RAIDz1 and mirrors, even on NVMe. Particularly when you're running VMs, which usually means small-blocksize operations. That's just not what striped parity RAID is good at.

Keep in mind that the real killer stat here isn't throughput anyway; it's latency. When you want to write a 64K record to mirrors, it goes in 64K chunks, one on each disk in the vdev. When you want to write the same record to a 3-disk RAIDz1, it gets split up into two data chunks and one parity chunk. That's actually not too bad on a 3-disk RAIDz1, because you're still talking about 32K chunks, and you don't need any padding since 32K is evenly divisible by either 4K or 8K sector sizes. Still, now you're doing 32K ops instead of 64K ops, and you need to complete those ops on three separate devices instead of two before the op is complete.

You're also tying up two of those three devices with every 64K read, where the mirror would only need to query one; and, again, you're waiting for both devices to return data before the read is complete, where you were only waiting on one device with the mirror.

It's entirely possible that the NVMe will be fast enough to satisfy you either way. But there will absolutely be a difference in performance.
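If you want to measure rather than guess, you can stand up throwaway pools and compare the two layouts yourself. A rough sketch, using file-backed vdevs purely for illustration (paths and sizes are made up; real tests should run on the actual devices):

    # Scratch files standing in for disks
    truncate -s 8G /tmp/d1 /tmp/d2 /tmp/d3
    zpool create -f testz raidz1 /tmp/d1 /tmp/d2 /tmp/d3
    # ...run your VM-like workload here (e.g. fio with small random writes)...
    zpool destroy testz

    zpool create -f testm mirror /tmp/d1 /tmp/d2
    # ...repeat the same workload and compare latency, not just throughput...
    zpool destroy testm
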
Answer 2 of 5:
I think the principal values of ZFS are reliability, fault tolerance, and easy disaster recovery. If you're worried about speed, there are file systems that emphasize that more than others.

I have two NVMe drives in a mirror, and read/write performance has never been a sticking point. The biggest advantage of SSDs is the massive improvement in random I/O latency and throughput relative to spinning disks. I also have three spinning disks in a raidz1 configuration. Performance is about as expected for spinning disks, and does not appear to be hindered by the ZFS arrangement.

Regarding the ESP: if you're running Linux, you can configure the ESPs in an mdraid mirror. Use metadata version 0.9 or 1.0, because those versions put the metadata at the end of the devices; newer versions put it at the beginning. With the metadata at the end, your firmware will see all three partitions as independent, valid file systems. Your operating system, on the other hand, will treat them as a mirror and keep the contents in sync.
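A sketch of that ESP arrangement (partition names are examples; double-check them against your own layout before running anything destructive):

    # Mirror the three EFI system partitions with the md metadata at the END
    # of the device (metadata 1.0), so the firmware still sees plain FAT volumes
    mdadm --create /dev/md/esp --level=1 --raid-devices=3 --metadata=1.0 \
          /dev/nvme0n1p1 /dev/nvme1n1p1 /dev/nvme2n1p1
    mkfs.vfat -F 32 /dev/md/esp
    # Mount /dev/md/esp at /boot/efi and record it in /etc/fstab and mdadm.conf
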
Proxmox Support Forum: Avoid IO Delay: RAIDz1 vs RAID5
December 21, 2024 - For all we know, you could be using SMR 5400rpm drives. Or lightweight shite like WD Blue spinners, those are desktop-class drives and tend to fail early. ZFS configuration also makes a difference, are you using ashift=12, do you have compression enabled (e.g. gzip-9 is going to absolutely kill your performance), do you have dedup on, what recordsize are you using per-dataset, etc? RAIDZ1 is mostly deprecated, any disks over ~2TB you should be using RAIDZ2.
Proxmox Support Forum: About ZFS, RAIDZ1 and disk space
June 4, 2022 - And this padding overhead is indirect and only affects zvols. It basically means that everything you store in a zvol will consume 166% space on the pool. To minimize that padding overhead you will need to increase your volblocksize, but this comes with other problems, like really bad performance for IO that reads/writes blocks smaller than the volblocksize. When using a 6-disk raidz1 with ashift=12, you would need to increase the volblocksize to at least 32K to lose only 20% instead of 50% of your raw capacity.
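For what it's worth, volblocksize is fixed when a zvol is created, so the tuning described there looks something like this (pool, zvol name and size are placeholders):

    # Create a sparse 100G zvol with a 32K volume block size
    zfs create -s -V 100G -o volblocksize=32k tank/vm-100-disk-0
    # Verify the property; it cannot be changed after creation
    zfs get volblocksize tank/vm-100-disk-0
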
Reddit, r/zfs: RAIDz1 Queston
February 14, 2023 -

As a noob, I keep hearing and reading about the dangers of running a RAIDz1 and losing all your data during the resilver process when a second drive fails. Here's my question: who has actually experienced this? How often has it happened? I believe most people on this reddit are data-driven folks and can weigh the real against the hype. I'd like to know: how often does a second drive fail?

Proxmox Support Forum: [TUTORIAL] - FabU: Can I use ZFS RaidZ for my VMs?
January 1, 2025 - No, you could create a raidz1 or mirror with three disks and install Proxmox and VMs onto it. That's not the best idea for IOPS/VM performance (it really should be split), but it gives "some redundancy". General advice: get more disks. In the end, losing money is not as bad as losing data.
Proxmox Support Forum: 3 x 4TB Samsung SSD in ZFS raidz1 => poor performance
April 20, 2021 - In that case it would most likely be a "system"-related file, which should not be that critical anyway, as you could always rebuild the VM or LXC container as long as the data is intact on the storage pool. ... I use ZFS raidz1 also for rpool and it is fine, no performance issue.
Top answer (1 of 2):

I think it was a misunderstanding in the old thread. I was comparing the chance of failure for two disks in a row when using either Z1 parity raid or no RAID (as you stated in the comments in the other thread). In my eyes it was never about Z1 vs. a striped pool of basic vdevs, because that game is essentially over after the first fault anyway, so Z1 is of course better.

But if you just compare multiple independent pools against a single pool with a single Z1 vdev, then the problem of increased load while recalculating the parity information persists.

On the comparison of Z1 vs Z2, which the answer by Michael was mainly about, the other two points apply. I should have been clearer in the comments, but they are limited in space, unfortunately. I hope this answer clears some of this up.

I thought the same thing, but I didn't realize that a URE isn't just a bit flip; it spoils the entire pool.

If we simplify the whole thing, you have your disk with its controller chip on the bottom, and your hardware (RAID controller) or software (e.g. ZFS) on top.

If an error happens in the hardware and a sector cannot be read, the chip first tries to correct it on its own if possible (for example, by reading the problem sector multiple times). If it still can't make anything out of it, it gives up; on normal disks this can take minutes and stalls the complete system, which is waiting for a "success" or "failure" message for the pending IO operation.

Some disks have a feature called TLER (time-limited error recovery), which is a hard timeout that limits this error-correction time to 6-9 seconds. Traditionally, most hardware RAID controllers dropped the whole disk after 9 seconds, so a single bad sector should not make the whole disk unavailable but should instead be corrected from a "good" sector on the other disks (something a single disk in a desktop system cannot rely on, which is why a long timeout would be preferable there).
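As an aside, on Linux you can inspect or set this timeout on drives that support it with smartctl (the device name is a placeholder, and not every drive accepts the command):

    # Show the current SCT error recovery control (TLER/ERC) read/write timeouts
    smartctl -l scterc /dev/sda
    # Set both timeouts to 7.0 seconds (the values are in units of 0.1 s)
    smartctl -l scterc,70,70 /dev/sda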

Now, let's look at the software side: if you configure your RAID controller or ZFS file system with redundancy, for example by using mirrored disks or a mirror vdev as the basis for your pool, your URE can be corrected. If you do not use redundancy, the data in that sector will be gone; it may be data you care about, or just random old temp data, or nothing, depending on your luck. The same applies to bit flips, although the chance of those happening seems to depend more on outside effects (like cosmic radiation).

Since RAID0 is not subject to UREs, the question is "what is more likely, a URE in RAIDZ or a disk failure in RAID0?"

I haven't accepted this answer because I don't think it adequately explains the relevant points, but I was planning on creating my own answer once I understand why UREs destroy the whole pool, if no one else gets to it first.

I suggest you read a basic explanation of ZFS pool layout. To summarize the most important bits:

  • You can create virtual devices (vdevs) from disks, partitions or files. Each vdev can be created with different redundancy: basic (no redundancy), mirrored (all but one disk in the vdev can fail), or parity raid Z1/Z2/Z3 (1/2/3 disks can fail). All redundancy works at the vdev level.
  • You create storage pools from one or more vdevs. Vdevs in a pool are always striped; therefore the loss of a single vdev means the loss of the whole pool.
  • You can have any number of pools, and they are independent: if one pool is lost, the other pools continue to function. (A command-level sketch follows below.)
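Put in command form, those three layers look roughly like this (disk names are placeholders):

    # One pool on a single mirror vdev: either disk can fail
    zpool create tank1 mirror sda sdb

    # One pool striped across two raidz1 vdevs:
    # one disk per vdev can fail, but losing a whole vdev loses the pool
    zpool create tank2 raidz1 sdc sdd sde raidz1 sdf sdg sdh

    # A separate, independent pool: if tank2 dies, tank3 is unaffected
    zpool create tank3 raidz2 sdi sdj sdk sdl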

Therefore you can reason the following:

  • If possible, prefer Z2 over Z1 because of the increased load and big window of (negative) opportunity when rebuilding large drives (large being anything over 1 TB approximately)
  • If having to choose between Z1 and multiple basic vdevs, prefer Z1 because of bit error correction which is not possible with basic vdevs
  • If you can accept partial pool loss, segment your pool into multiple smaller pools backed by a single vdev each, so that you get checksum information and faster rebuild times on fatal faults

In any of the above cases, you need to have a backup. If you cannot or don't want to afford one, it comes down to what you are more comfortable losing: some parts of the pool, with higher probability, or everything, with lower probability. I personally would choose the first option, but you may decide otherwise.

Answer 2 of 2:

What is implied in the answer you quoted is that with increasing storage capacity the chance of failure increases accordingly, not only for the rebuild operation but for normal activity as well. So, statistically speaking, RAIDZ1 is no more fault tolerant than RAID 0 when talking about modern 4 TB drives, even though the case is made prima facie that it is.

So some argue that RAIDZ1 is, in fact, not an increase in protection against data loss for large-capacity hard disk drives. This has less to do with mechanical failure of the drive(s), or at least not with critical failure. A URE, to put it simply (and very simplistically), is a failure to read. Whether it is due to a prolonged read from a bad sector, the disk running out of spare sectors, or any other cause is not really the issue: it will happen, like it or not.

Take the bad-sector example. Normally this is handled by the drive internally, but if there are enough of them, or the drive takes its sweet time to fix one, the RAID(Z) layer might interpret the delay as a drive failure and eject the drive. Now imagine it's the SECOND hard drive in the pool, and it happens while rebuilding...

The only viable solution is to scrub the array for those errors. If caught early, the error will be just a burp and the pool will recover the data easily. But this means putting quite a big load on the drives, which in turn drastically increases the statistical chance of a URE (remember: age, writes, and volume of data already increase it a lot, before you raise the reads by an order of magnitude over normal operations; and that applies to each drive separately).

Thus the answer to your question (is RAIDZ1 an incremental improvement over no fault tolerance?) is: not really. If we use the logic of the quote, you face (I think) a 50% chance of enough disk failures for the data to be unrecoverable within the first two years of said disks' operation.

That is why, when our company was faced with the dilemma of server availability versus storage capacity, we bit the bullet and went for RAID6 on SSDs. It should be enough for a couple of years, and then we'll probably upgrade if needed.

Proxmox Support Forum: RaidZ1 performance ZFS on host vs VM
February 11, 2024 - It is a COW system. If you don't care about compression, encryption, snapshots, or data integrity, then use old file systems. Use atop to see CPU and disk usage; maybe it will show something interesting. ... If your VM writes in blocks of 12K, then the raidz1 could write 16K in parallel to the drives (assuming ZFS is that simple).
RAIDZ Calculator: RAIDZ Types Reference
RAIDZ levels reference covers various aspects and tradeoffs of the different RAIDZ levels.
Reddit, r/zfs: Which is better Raidz1 with a hotspare, or Raidz2?
February 15, 2022 -

I have 5x 16 TB Exos drives and am looking for opinions on whether I should set them up as 4 in a raidz1 with a hot spare, or with all 5 in a raidz2.

If I am understanding correctly, my total usable capacity would be the same either way?

I am leaning toward the raidz2, so if I am resilvering to replace a dead drive, my data will be safe if another dies during the process.

btw, any word yet on when the ability to add a disk to a pool will be released? I heard vague rumors of "soon" when I was first researching zfs last year

Level1Techs Forums: Stripe or RaidZ1 - Build a PC
November 15, 2022 - So… I may have gone too far this time… Plan is to have 1 as OS + images, etc… 5 on zfs, and the SSD is for backups. Here is the question (I'm new to ZFS): I am optimizing for IO, as this is meant for a devops/ml workload and for me IO is super important. Given that I plan to continuously back up the VMs and the raw data gets backed up externally, should I go RaidZ1 or stripe?
Proxmox Support Forum: [SOLVED] - Installation raidz-1
July 8, 2024 - Please be aware that RAIDz1 is terrible for running VMs on because of the low IOPS, massive padding (when volblocksize is small) and huge write amplification (which will wear your consumer drives quickly without PLP).
Proxmox Support Forum: The problem with RAIDZ or why you probably won't get the storage efficiency you think you will get
February 26, 2024 - That leads beginners to believe, that because they don’t need the performance, they will get away with the increased efficiency. They would say stuff like „I only run 5 VMs“ or „only Plex files“ but don’t understand how big of a role geometry plays or what advantages a dataset could offer to them.
Proxmox Support Forum: [SOLVED] - Hardware Raid or ZFS
January 2, 2025 - Hi Daz. The answer is, as always, "it depends"... When your host is built with some extra RAM (ECC?) and CPU, I'd prefer ZFS over a HW RAID. The snapshot and replication possibilities of ZFS are useful, and even if your RAID controller were to fail, you could easily export/import the ZFS pool on another machine later on. With four 3.84 TB disks I'd go for a RAIDZ1 (simplified: 1 disk of parity).
TrueNAS Community Forums: The problem with RAIDZ - Resources
April 16, 2024 - This resource was originally created by user Jamberry on the TrueNAS Community Forums Archive. The problem with RAIDZ, or why you probably won't get the storage efficiency you think you will get. Work in progress, probably contains ...