When I started my career in IT as a data center technician / low-level systems admin, I saw it more than once: a disk was replaced in a RAID-5 array, and while the array was rebuilding, another drive gave up the ghost. Guess what happens when you lose two drives concurrently in a RAID-5? *makes explosion sound* So yeah, same logic applies here. Answer from DeputyCartman on reddit.com
What's the stigma behind using RAIDZ1? | TrueNAS Community
July 9, 2023 - In a 6-drive RAIDZ1, you will lose your pool as soon as a second drive goes bad; a 6-drive pool of striped mirrors can survive up to 3 drive failures, so long as no two failed drives are in the same vdev.
r/zfs on Reddit: RAIDZ1 vs mirror on NVMe
February 2, 2021 -

I'm planning to get 2 or 3 Samsung PM9A1 drives. They're supposed to be extremely fast (7000MB read, 5200 write at 1M IOPS read, 850k IOPS write).

On those I'd be holding mostly running VM disks, the root operating system and possibly source code / compilation artifacts for a container.

Is RAIDZ1 going to somehow tank performance as opposed to mirroring or is ZFS fast enough to saturate the disks in either config?

The way I understand, you'd get perf of a single disk in either configuration, no?

I'm hoping to set up 3 separate EFI partitions, one on each drive, so that any one of them dying won't mean the machine can't boot anymore.

Top answer
1 of 5
25
Is RAIDZ1 going to somehow tank performance as opposed to mirroring or is ZFS fast enough to saturate the disks in either config?

You shouldn't expect to get the maximum rated speed out of those NVMe disks with ZFS either way. At SATA SSD speeds, the extra housekeeping ZFS does isn't a scale problem, even with itty bitty (x86) CPUs. At NVMe speeds, it starts being a factor.

Beyond that... yes, you will see performance differences between RAIDz1 and mirrors, even on NVMe. Particularly when you're running VMs, which usually means small-blocksize operations. That's just not what striped parity RAID is good at.

Keep in mind that the real killer stat here isn't throughput, anyway, it's latency. When you want to write a 64K record to mirrors, it goes in 64K chunks, on each disk in the vdev. When you want to write the same record to a 3-disk RAIDz1, it gets split up into two data chunks and one parity chunk. That's actually not too bad on a 3-disk RAIDz1, because you're still talking about 32K chunks, and you don't have any padding needed since 32K is evenly divisible by either 4K or 8K sector sizes.

Still, now you're doing 32K ops instead of 64K ops, and you need to complete those ops on three separate devices instead of two before the op is complete. You're also tying up two of those three devices with every 64K read, where the mirror would only need to query one. And, again, you're waiting for both devices to return data before the read is complete, where you were only waiting on one device with the mirror.

It's entirely possible that the NVMe will be fast enough to satisfy you either way. But there will absolutely be a difference in performance.
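
The per-device arithmetic in this answer can be sketched in a few lines of Python. This is only an illustration of the 64K example above, not how ZFS is implemented: the function names are made up for this sketch, and it assumes the record divides evenly across the data disks and ignores padding, compression, and metadata.

```python
def mirror_io(record_bytes, disks):
    """Per-device I/O for one record on an n-way mirror (illustrative model)."""
    return {
        "write_devices": disks,        # every disk receives a full copy
        "write_chunk": record_bytes,
        "read_devices": 1,             # any single copy can satisfy a read
        "read_chunk": record_bytes,
    }


def raidz1_io(record_bytes, disks):
    """Per-device I/O for one record on a RAIDz1 vdev (illustrative model).

    Assumes the record splits evenly over the data disks and needs no padding.
    """
    data_disks = disks - 1             # one disk's worth of space goes to parity
    chunk = record_bytes // data_disks
    return {
        "write_devices": disks,        # data chunks plus one parity chunk
        "write_chunk": chunk,
        "read_devices": data_disks,    # all data chunks are needed for a read
        "read_chunk": chunk,
    }


if __name__ == "__main__":
    record = 64 * 1024
    print("2-way mirror:  ", mirror_io(record, 2))
    print("3-disk RAIDz1: ", raidz1_io(record, 3))
```

For a 64K record this gives 32K chunks on three devices for a RAIDz1 write and two devices per read, against 64K on two devices and one device per read for the mirror, which is the latency point the answer is making.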
2 of 5
8
I think the principal values of ZFS are reliability, fault tolerance, and easy disaster recovery. If you're worried about speed, there are file systems that emphasize that more than others.

I have two NVMe drives in a mirror and read/write performance has never been a sticking point. The biggest advantage of SSDs is a massive reduction in random I/O latency, and a corresponding increase in throughput, relative to spinning disks. I also have three spinning disks in a raidz1 configuration. Performance is about as expected for spinning disks, and does not appear to be hindered by the ZFS arrangement.

Regarding the ESP: if you're running Linux, you can configure the ESPs in an mdraid mirror. Use metadata version 0.90 or 1.0 because those versions put metadata at the end of the devices; newer versions put it at the beginning. With metadata at the end, your firmware will see all three partitions as independent, valid file systems. Your operating system, on the other hand, will treat them as a mirror and keep the contents in sync.
3 x 4TB Samsung SSD in ZFS raidz1 => poor performance | Proxmox Support Forum
April 20, 2021 - In that case it would most likely be a "system" related file which should not be that critical anyway as you could always rebuild the VM or LXC container as long as the data is intact on the storage pool. ... I use ZFS raidz1 also for rpool and there is fine, no performance issue.
[TUTORIAL] - FabU: Can I use ZFS RaidZ for my VMs? | Proxmox Support Forum
January 1, 2025 - Beginners often confuse hardware RAID5/6 with BBU (which can cache sync writes) with ZFS RaidZ1/2 (with unfortunate block size alignment on consumer drives) just because both can deal with one/two missing drive(s). The performance behavior is indeed completely different (as well as the supported feature set) and RaidZ, as you already explained, is mostly unsuitable for VMs.
Top answer
1 of 3
35

Before we go into specifics, consider your use case. Are you storing photos, MP3's and DVD rips? If so, you might not care whether you permanently lose a single block from the array. On the other hand, if it's important data, this might be a disaster.

The statement that RAIDZ-1 is "not good enough for real world failures" is because you are likely to have a latent media error on one of your surviving disks when reconstruction time comes. The same logic applies to RAID5.

ZFS mitigates this failure to some extent. If a RAID5 device can't be reconstructed, you are pretty much out of luck; copy your (remaining) data off and rebuild from scratch. With ZFS, on the other hand, it will reconstruct all but the bad chunk, and let the administrator "clear" the errors. You'll lose a file/portion of a file, but you won't lose the entire array. And, of course, ZFS's checksumming means that you will be reliably informed that there's an error. Without checksums, I believe it's possible (although unlikely) that multiple errors will result in a rebuild apparently succeeding, but giving you back bad data.

Since ZFS is a "Rampant Layering Violation," it also knows which areas don't have data on them, and can skip them in the rebuild. So if your array is half empty you're half as likely to have a rebuild error.

You can reduce the likelihood of these kinds of rebuild errors on any RAID level by doing regular "zpool scrubs" or "mdadm checks" of your array. There are similar commands/processes for other RAIDs; e.g., LSI/Dell PERC RAID cards call this "patrol read." These go read everything, which may help the disk drives find failing sectors, and reassign them, before they become permanent. If they are permanent, the RAID system (ZFS/md/RAID card/whatever) can rebuild the data from parity.

Even if you use RAIDZ2 or RAID6, regular scrubs are important.

One final note - RAID of any sort is not a substitute for backups - it won't protect you against accidental deletion, ransomware, etc. Although regular ZFS snapshots can be part of a backup strategy.

2 of 3
5

There is a little bit of a misconception at work here. A lot of the advice you're seeing is based on an assumption which may not be true. Specifically, the unrecoverable bit error rate of your drive.

A cheap 'home user' disk has 1 per 10^14 unrecoverable error rate.

http://www.seagate.com/gb/en/internal-hard-drives/desktop-hard-drives/desktop-hdd/#specs

This is at a level where you're talking about a significant likelihood of an unrecoverable error during a RAID rebuild, and so you shouldn't do it. (A quick and dirty calculation suggests that a RAID-5 set of 5x 2TB disks will actually have around a 60% chance of this.)

However this isn't true for more expensive drives: http://www.seagate.com/gb/en/internal-hard-drives/enterprise-hard-drives/hdd/enterprise-performance-15k-hdd/#specs

1 per 10^16 is 100x better - meaning 5x 2TB is <1% chance of failed rebuild. (Probably less, because for enterprise usage, 600GB spindles are generally more useful).
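
A quick and dirty calculation of this kind can be reproduced as below. This is a sketch under stated assumptions, not the answerer's own calculation: it treats the quoted "1 per 10^14" and "1 per 10^16" figures as independent per-bit error probabilities, uses decimal terabytes, and uses a Poisson approximation. The function name is made up for illustration.

```python
import math

TB = 10**12  # decimal terabyte, as used on drive spec sheets


def p_unrecoverable_read(bytes_read, errors_per_bit):
    """Probability of at least one unrecoverable read error.

    Assumes errors are independent per bit, so the expected number of
    errors is bits * rate and P(at least one) = 1 - exp(-expected).
    """
    expected_errors = bytes_read * 8 * errors_per_bit
    return -math.expm1(-expected_errors)


if __name__ == "__main__":
    whole_array = 5 * 2 * TB   # every bit of a 5x 2TB set
    survivors = 4 * 2 * TB     # only the four disks actually read in a rebuild

    for label, rate in (("consumer, 1 per 10^14", 1e-14),
                        ("enterprise, 1 per 10^16", 1e-16)):
        print(label)
        print(f"  reading 10TB: {p_unrecoverable_read(whole_array, rate):.1%}")
        print(f"  reading  8TB: {p_unrecoverable_read(survivors, rate):.1%}")
```

Under these assumptions the consumer-class rate gives roughly 55% when all 10TB are read and roughly 47% when only the four surviving disks are read, which is in the same ballpark as the figure quoted above; the enterprise-class rate comes out under 1% either way.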

So personally - I think both RAID-5 and RAID-4 are still eminently usable, for all the reasons RAID-0 is still fairly common. Don't forget - the problem with RAID-6 is its hefty write penalty. You can partially mitigate this with lots of caching, but you've still got some pain built in, especially when you're working with slow drives in the first place.

And more fundamentally - NEVER EVER trust your RAID to give you full resilience. You'll lose data more often to an 'oops' than to a drive failure, so you NEED a decent backup strategy if you care about your data anyway.

r/zfs on Reddit: RAIDz1 Queston
April 3, 2023 -

As a noob, I keep hearing and reading about the dangers of running a RAIDz1 and losing all your data during the resilver process when a second drive fails. Here's my question. Who has actually experienced this? How often has it happened? I believe most people on this reddit are data driven folks and can weigh the real against the hype. I'd like to know how often a second drive actually fails.

The problem with RAIDZ | TrueNAS Community
December 13, 2023 - Storage efficiency is very bad, we get extreme IO amplification and fragmentation. We skip the other sizes and use a 1024k file. mirror: 1024k data blocks and 1024k parity blocks. 2048k total write to store 1024k. 50% storage efficiency (expected 50%). RAIDZ1 3-wide: Each stripe has two 64k ...
About ZFS, RAIDZ1 and disk space | Proxmox Support Forum
June 4, 2022 - And this padding overhead is indirect and only affects zvols. It basically means that everything you store in a zvol will consume 166% space on the pool. To minimize that padding overhead you will need to increase your volblocksize, but this comes with other problems, like really bad performance for IO that reads/writes blocks smaller than the volblocksize. When using a 6-disk raidz1 with ashift=12, you would need to increase the volblocksize to at least 32K to lose only 20% instead of 50% of your raw capacity.
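
The 50% versus 20% figures for a 6-disk raidz1 can be checked with a short sketch. It assumes the allocation rule commonly described for RAIDZ: data sectors, plus parity sectors for each stripe row, with the total rounded up to a multiple of (parity + 1). The function names are made up for illustration, and compression and metadata are ignored.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def raidz_sectors(block_bytes, disks, parity, ashift):
    """Return (data_sectors, allocated_sectors) for one block on a RAIDZ vdev.

    Assumed rule: one set of parity sectors per stripe row of data, then
    round the total up to a multiple of (parity + 1) as padding.
    """
    sector = 1 << ashift                     # ashift=12 means 4K sectors
    data = ceil_div(block_bytes, sector)
    rows = ceil_div(data, disks - parity)    # stripe rows needed for the data
    total = data + parity * rows
    multiple = parity + 1
    return data, ceil_div(total, multiple) * multiple


def efficiency(block_bytes, disks, parity, ashift):
    """Fraction of the allocated raw space that holds actual data."""
    data, allocated = raidz_sectors(block_bytes, disks, parity, ashift)
    return data / allocated


if __name__ == "__main__":
    for kib in (8, 16, 32, 64, 128):
        eff = efficiency(kib * 1024, disks=6, parity=1, ashift=12)
        print(f"volblocksize {kib:>3}K: {eff:.0%} of raw space is data")
```

With 6 disks, raidz1 and ashift=12, an 8K block allocates 4 sectors for 2 sectors of data (50% lost), while a 32K block allocates 10 sectors for 8 (20% lost), matching the numbers in the post.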
Avoid IO Delay: RAIDz1 vs RAID5 | Proxmox Support Forum
December 21, 2024 - For all we know, you could be using SMR 5400rpm drives. Or lightweight shite like WD Blue spinners, those are desktop-class drives and tend to fail early. ZFS configuration also makes a difference, are you using ashift=12, do you have compression enabled (e.g. gzip-9 is going to absolutely kill your performance), do you have dedup on, what recordsize are you using per-dataset, etc? RAIDZ1 is mostly deprecated, any disks over ~2TB you should be using RAIDZ2.
RaidZ1 performance ZFS on host vs VM | Proxmox Support Forum
February 11, 2024 - It is a COW system. If you don't care about compression, encryption, snapshots, or data integrity, then use older file systems. Use #atop to see CPU and disk usage. Maybe it will show something interesting. ... If your VM writes in blocks of 12K then the raidz1 could write 16K in parallel to the drives (assuming ZFS is that simple).
[SOLVED] - Installation raidz-1 | Proxmox Support Forum
July 8, 2024 - Please be aware that RAIDz1 is terrible for running VMs on because of the low IOPS, massive padding (when volblocksize is small) and huge write amplification (which will wear your consumer drives quickly without PLP).
The problem with RAIDZ or why you probably won't get the storage efficiency you think you will get | Proxmox Support Forum
February 26, 2024 - That leads beginners to believe that, because they don't need the performance, they will get away with the increased efficiency. They would say stuff like "I only run 5 VMs" or "only Plex files" but don't understand how big of a role geometry plays or what advantages a dataset could offer to them.
Top answer
1 of 3
12

Even given what one of the other answers here laid out, namely that ZFS only works with actual used blocks and not empty space, yes, it is still dangerous to make a large RAIDZ1 vdev. Most pools end up at least 30-50% utilized, and many go right up to the recommended maximum of 80% (some go past it; I highly recommend you do not do that at all, for performance reasons), so the fact that ZFS deals only with used blocks is not a huge win.

Also, some of the other answers make it sound like a bad read is what causes the problem. This is not so. Bit rot inside a block is usually not what's going to screw you here; it's another disk just flat out going bad while the resilver from the first failure is still going on that'll kill you. On 3 TB disks in a large raidz1 it can take days, even weeks, to resilver onto a new disk, so your chance of that happening is not insignificant.

My personal recommendation to customers is to never use RAIDZ1 (RAID5 equivalent) at all with > 750 GB disks, ever, just to avoid a lot of potential unpleasantness. I've been OK with them breaking this rule because of other reasons (the system has a backup somewhere else, the data isn't that important, etc), but usually I do my best to push for RAIDZ2 as a minimum option with large disks.

Also, for a number of reasons, I usually recommend not going more than 8-12 disks in a raidz2 stripe or 11-15 disks in a raidz3 stripe. You should be on the low-end of those ranges with 3 TB disks, and could maybe be OK on the high-end of those ranges on 1 TB disks. That this will help keep you away from the idea that more disks will fail while a resilver is going on is only one of those reasons, but a big one.

If you're looking for some sane rules of thumb (edit 04/10/15 - I wrote these rules with only spinning disks in mind - because they're also logical [why would you do less than 3 disks in a raidz1] they make some sense even for SSD pools but all-SSD pools was not a thing in my head when I wrote these down):

  • Do not use raidz1 at all on > 750 GB disks.
  • Do not use less than 3 or more than 7 disks on a raidz1.
  • If thinking of using 3-disk raidz1 vdevs, seriously consider 3-way mirror vdevs instead.
  • Do not use less than 6 or more than 12 disks on a raidz2.
  • Do not use less than 7 or more than 15 disks on a raidz3.
  • Always remember that unlike traditional RAID arrays where # of disks increase IOPS, in ZFS it is # of VDEVS, so going with shorter stripe vdevs improves pool IOPS potential.
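
The last rule of thumb can be illustrated with a deliberately crude model. It assumes each vdev delivers roughly the random IOPS of a single member disk, which is a simplification (mirrors, for example, can serve reads from either side), and the per-disk figure of 150 IOPS is a hypothetical value for a spinning disk, not a measurement.

```python
def pool_random_iops(vdev_count, per_disk_iops):
    """Rough pool IOPS estimate: scales with the number of vdevs, not disks.

    Assumption for this sketch: every vdev behaves like one disk for
    small random I/O, regardless of how many disks it contains.
    """
    return vdev_count * per_disk_iops


if __name__ == "__main__":
    PER_DISK_IOPS = 150  # hypothetical figure for one spinning disk

    # Three ways to arrange the same 12 disks (layout name -> vdev count).
    layouts = {
        "1 x 12-disk raidz2": 1,
        "2 x 6-disk raidz2": 2,
        "6 x 2-disk mirror": 6,
    }
    for name, vdevs in layouts.items():
        print(f"{name:<20} ~{pool_random_iops(vdevs, PER_DISK_IOPS)} IOPS")
```

The same twelve disks give very different estimates depending on how many vdevs they form, which is why shorter stripes improve a pool's IOPS potential.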
2 of 3
11

Is RAID-Z as bad as R5, no. Is it as good as R1 or R10, usually no.

RAID-Z is aware of blank spots on the drives, where R5 is not. So RAID-Z only has to read the areas with data to recover the missing disk. Also, data isn't necessarily striped across all the disks. A very small file might reside on just a single disk, with the parity on another disk. Because of this, RAID-Z only has to read as much data as the space used on the array (if 1MB is used on a 5TB array, then a rebuild only needs to read 1MB).

Going the other way, if most of a large array is full, then most of the data will need to be read off all the disks. Compare that to R1 or R10, where the data only needs to be pulled off exactly one disk per failed disk (assuming that, when multiple disks fail, the array is still recoverable at all).

What you're worrying about is the fact that with every sector read operation there's a chance you'll find a sector that wasn't written correctly or is no longer readable. For a typical drive these days that's around 1x10^-16 per bit read (not all drives are equal, so look up the specs on your drives to figure out their rating). This is incredibly infrequent, and comes out to about once every 1PB read; for a 10TB array there's about a 1% chance your array is toast and you don't know it until you try to recover it.

ZFS also helps mitigate this chance, since most unreadable sectors are noticeable before you start trying to rebuild your array. If you scrub your ZFS array on a regular basis, the scrub operation will pick up these errors and work around them (or alert you so you can replace the disk if that's how you roll). They recommend you scrub enterprise-grade disks about one to four times a month, and consumer-grade drives at least once a week, or more.

raidz1 | Proxmox Support Forum
I have one primary disk for running VMs (3.8TB Samsung pm9a3 u.2) and two 1tb nvme SSDs (Samsung 980 pro). I want to create a Truenas core/scale VM using 1TB virtual disk from primary disk and 2x 1TB nvme in Raidz1 to get 2TB usable space from my NVME drives. My usecase is just to use them as a... ... Hello I have an installation where the OS runs on a raidz1 and has one bad disk: pool: rpool state: ONLINE status: One or more devices has experienced an unrecoverable error.
Stripe or RaidZ1 - Build a PC - Level1Techs Forums
November 15, 2022 - So… I may have gone too far this time… Plan is to have 1 as OS + images, etc… 5 on zfs, and the SSD is for backups. Here is the question (I’m new to ZFS): I am optimizing for IO, as this is meant for a devops/ml workload and for me IO is super important. given that I plan to continuously back up the VMs and the raw data gets backed up externally, should i go Raidz1 or Stripe ?
r/Proxmox on Reddit: Reasons not to use 2 NVMe SSDs in RAIDZ1 for both Proxmox and VMs/app data in a homelab environment
November 10, 2024 -

Hi,

Context (feel free to skip):

I am currently in the process of picking hardware for my first ever all-in-one NAS + homelab box.

It won't be used for anything super critical and the idea is to host TrueNAS Scale (direct PCI passthough of the HDD controller) and a couple of VMs to run Portainer + docker apps like Jellyfin, Nextcloud, Photoprism, HomeAssistant, and possibly many, many more.

Some of the important factors driving my decisions on what hardware to get were electricity cost (low idle current consumption) and space constraints (small apartment). Also, I wanted to get a CPU with very good iGPU for Jellyfin transcoding, and to have a powerful enough CPU to experiment with many different self-hosted services in the future running virtualized on the same box.

I went for an Intel i5-12500 + new cwwk q670 motherboard (sadly, no ECC support, but Intel vPro seems nice). I will probably go for 3x 12TB NAS HDDs in RAID5 configuration.

QUESTION:

I was reading online that people like to separate Proxmox installation from the VM installs (putting them on separate SSD drives), and sometimes also app data (docker volumes/VM data) from VM installs. I understand the logic behind these decisions. However in my case, I would like to keep things simple by only using 2 consumer-grade RAIDZ1 NVMe SSDs (cca 1TB) for hosting both Proxmox and VM installs, as well as VM data/docker volume data (with some logical separation by partitioning the memory).

I am even considering partitioning part of this SSD array to use it as a virtual "scratch disk" to host (some) temporary data, before they are ready to be persisted to my HDD disks that are part of my NAS array (mainly to minimize disk fragmentation on my HDDs). One use-case I have in mind here is something like a download destination folder for my torrents.

I know that this will put more pressure on my boot disks in terms of TBW. However, I would be fine with this risk, provided it is still possible to have an easy backup strategy in this setup (to restore everything should both SSDs fail at the same time). This would still be worth it for me if it means I don't need to purchase a couple of additional SSDs just to separate it all out physically.

Is there something major, a compelling reason not to go with this approach? Am I missing something here? Something that will come to bite me in the future and that I just can't see right now due to the lack of experience with Proxmox/self-hosting in general?

Top answer
1 of 2
2

I think it was a misunderstanding in the old thread. I was comparing the chance of failure for two disks in a row when using either Z1 parity raid or no RAID (as you stated in the comments in the other thread). In my eyes it was never about Z1 vs. striped pool of basic vdevs, because that game is essentially over after the first fault anyway, so Z1 is of course better.

But if you just compare multiple independent pools against a single pool with a single Z1 vdev, then the problem of increased load while recalculating the parity information persists.

On the comparison of Z1 vs Z2, which the answer by Michael was mainly about, the other two points apply. I should have been clearer in the comments, but they are limited in space unfortunately. I hope this answer clears some of this up.

I thought the same thing, but I didn't realize that a URE isn't just a bit flip, it spoils the entire pool.

If we simplify the whole thing, you have your disk with its controller chip on the bottom and your hardware (RAID controller) or software (e. g. ZFS) on the top.

If any error happens in the hardware and a sector cannot be read, the chip first tries to correct it on its own if possible (for example by reading the problem sector multiple times). If it still can't make anything out of it, it gives up. On normal disks, this can take minutes and stalls the complete system, which waits for a "successful" or "failure" message regarding the pending IO operation.

Some disks have a feature called TLER (time limited error recovery), which is a hard timeout that limits this error correction time to 6-9 seconds, because traditionally, most hardware RAID controllers dropped the whole disk after 9 seconds, so a single bad sector should not make the whole disk unavailable, but be corrected by a "good" sector on the other disks (a feature that a single disk on a desktop system could not rely on, so a long timeout would be preferable).

Now, let's look at the software side: if you configure your raid controller or ZFS file system with redundancy, for example by using mirrored disks or a mirror vdev as basis for your pool, your URE can be corrected. If you do not use redundancy, the data on this sector will be gone, which may be data you care about or just random old temp data or nothing, depending on your luck. The same applies to bit flips, although the chance of them happening seems to be more dependent on outside effects (like cosmic radiation).

Since RAID0 is not subject to UREs, the question is "what is more likely, a URE in RAIDZ or a disk failure in RAID0?"

I haven't accepted this answer because I don't think it adequately explains the relevant points, but I was planning on creating my own answer once I understand why UREs destroy the whole pool, if no one else gets to it first.

I suggest you read a basic explanation of ZFS pool layout. To summarize the most important bits:

  • You can create virtual devices (vdevs) from disks, partitions or files. Each vdev can be created with different redundancy: basic (no redundancy), mirrored (all but one disk of an N-way mirror can fail), parity raid Z1/Z2/Z3 (1/2/3 disks can fail). All redundancy works on the vdev level.
  • You create storage pools from one or more vdevs. They are always striped, therefore the loss of a single vdev means the loss of the whole pool.
  • You can have any number of pools, which are independent. If one pool is lost, the other pools continue to function.
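
The three points above boil down to one rule: redundancy is evaluated per vdev, and a pool survives only if every one of its vdevs survives. A minimal sketch of that rule follows. The data layout (tuples of kind, disk count, failed disks) and the function names are made up for illustration, and a mirror is assumed to survive as long as at least one of its disks is left.

```python
# Number of failed disks each non-mirror vdev type can tolerate.
TOLERANCE = {"basic": 0, "raidz1": 1, "raidz2": 2, "raidz3": 3}


def vdev_survives(kind, disks, failed):
    """True if a vdev with `failed` dead disks can still serve its data."""
    if kind == "mirror":
        return failed < disks          # at least one copy must remain
    return failed <= TOLERANCE[kind]


def pool_survives(vdevs):
    """A pool stripes across its vdevs, so losing any one vdev loses the pool."""
    return all(vdev_survives(kind, disks, failed)
               for kind, disks, failed in vdevs)


if __name__ == "__main__":
    # Six disks as three 2-way mirrors: one failure in each vdev is fine...
    print(pool_survives([("mirror", 2, 1), ("mirror", 2, 1), ("mirror", 2, 1)]))
    # ...but two failures in the same mirror lose the whole pool.
    print(pool_survives([("mirror", 2, 2), ("mirror", 2, 0), ("mirror", 2, 0)]))
    # Six disks as a single raidz1: the second failure loses the pool.
    print(pool_survives([("raidz1", 6, 2)]))
```

This is also why independent pools limit the blast radius: each pool is checked on its own, so a lost vdev only takes down the pool it belongs to.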

Therefore you can reason the following:

  • If possible, prefer Z2 over Z1 because of the increased load and big window of (negative) opportunity when rebuilding large drives (large being anything over 1 TB approximately)
  • If having to choose between Z1 and multiple basic vdevs, prefer Z1 because of bit error correction which is not possible with basic vdevs
  • If you can accept partial pool loss, segment your pool into multiple smaller pools backed by a single vdev each, so that you get checksum information and faster rebuild times on fatal faults

In any of the above cases, you need to have a backup. If you cannot or don't want to afford any backup, it is about what you are more comfortable to lose - some parts of the pool with higher probability or everything with lower probability. I personally would choose the first option, but you may decide otherwise.

2 of 2
1

What is implied in the answer you quoted is that with increasing storage capacity, the chance of failure increases accordingly, not only for the rebuild operation but for normal activity as well. So, statistically speaking, RAIDZ1 is no more fault tolerant than RAID 0 when talking about modern 4TB drives, even though the case is made prima facie that it is.

So some argue that RAIDZ1 is, in fact, not an increase in protection against data loss for large-capacity hard disk drives. This has less to do with mechanical failure of the drive(s), or at least not with critical failure. A URE is, to put it simply (and very simplistically), a failure to read. Whether it is due to a prolonged read from a bad sector, the disk running out of spare sectors, or any other cause does not really matter. It will happen, like it or not. Let's take the bad sector example. Normally this is handled by the drive internally, but if there are enough of them, or the drive takes its sweet time to fix one, the RAIDZ controller might interpret the delay as a drive failure and eject the drive. Now, let's imagine it's the SECOND hard drive in the pool, and it happened while rebuilding... The only viable solution is to scrub the array for those errors: if caught early, the error will be just a burp, and the pool will recover the data easily. But this means putting quite a big load on the drives, which then drastically increases the statistical chance of a URE (remember: age, writes, and volume of data all increase it a lot already, without increasing the reads by an order of magnitude over normal operations; all for each drive separately).

Thus the answer to your question (is a RAIDZ1 an incremental improvement on no fault tolerance) is: not really. If we use the logic of the quote, you face a 50% chance (I think) of enough disk failures for data to be unrecoverable within the first two years of said disks' operation.

That is why when in our company we were faced with the dilemma of server availability or storage capacity we bit the bullet and went for RAID6 on SSDs. Should be enough for couple of years and then probably upgrade, if needed.

SOLVED - RAIDZ1 not safe? | TrueNAS Community
August 11, 2016 - My primary need is data integrity and the best uptime I can get ... 3 1TB should be relatively ok for RAIDZ1. Usually the issue is with larger and higher quantity drives.
ZFS RAIDZ Pool tied with VM disks acts strange | Proxmox Support Forum
January 29, 2025 - (d)RAIDz1/2/3 is also often disappointing for running VMs on as people expect hardware RAID5/6 performance but due to the padding, check-sums and additional features of ZFS it gives much less IOPS.