There are two types of "data" in the pool: the actual data of whatever it is you're storing, and the metadata, which is all the tables, properties, indexes and other "stuff" that defines the pool structure and the datasets, plus the pointers that tell ZFS where on disk to find the actual data.

Normally this is all mixed in together on the regular pool vdevs (mirror, raidz, etc.). If you add a special vdev to your pool, ZFS will prefer to store the metadata there and send the data proper to the regular vdevs. The main reason for doing this is if you have "slow" data vdevs: adding a special vdev made of an SSD mirror can speed up access times, because ZFS can look to the SSDs to learn where on the data vdevs its data lives and go directly there, rather than loading the metadata off the slow vdevs and then needing another access to get the real data.
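
For example, a minimal sketch of adding one to an existing pool (the pool name "tank" and the device names are placeholders; in practice you'd use stable /dev/disk/by-id paths and match the pool's ashift):

# Add a mirrored pair of SSDs as a special vdev ("tank" is hypothetical).
zpool add -o ashift=12 tank special mirror /dev/sdX /dev/sdY

# The new vdev shows up under its own "special" heading.
zpool status tank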

There's another possible advantage: ZFS can store "small" files on the special vdev, leaving larger ones for the regular data vdevs.
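
Small-file placement is opt-in and set per dataset via the special_small_blocks property (see the manual excerpt further down). A minimal sketch, with "tank/media" as a hypothetical dataset:

# Send file blocks of 64K or smaller to the special vdev;
# the default of 0 means no small file blocks are stored there.
zfs set special_small_blocks=64K tank/media

# Verify the setting.
zfs get special_small_blocks tank/media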

One important thing to remember is that a special vdev is a proper part of the pool, not an add-on: just like the regular data vdevs, if the special vdev fails, the pool is lost, so it needs redundancy of its own. An SSD mirror is typical for this vdev.

This is only a rough explanation; see also:

  • zpoolconcepts(7): https://openzfs.github.io/openzfs-docs/man/master/7/zpoolconcepts.7.html
  • Level1Techs writeup: https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954
From r/zfs: "Guidance on how the special vdev performs" (Reddit, September 12, 2021):

Racking my brain trying to figure out its actual behavior. No documentation anywhere that I can find actually comments on this.

The special vdev is used to store the pool's metadata, so operations like directory listing run at the speed of an NVMe drive, which also improves the performance of spinning rust by reducing the small-IO load required to look up the pool metadata. This makes sense.

What confuses me is the behavior of the small-block allocation class when the special vdev is at capacity. There seem to be a few scenarios that aren't talked about anywhere in the documentation but that would be important for performance.

From how I've seen it talked about, my understanding is that it acts as a write-back cache for small IO based on the dataset's special_small_blocks value, and that when the special vdev is at capacity or needs more space for metadata, small-block allocation is offloaded back to the data vdevs.

However, I've never actually seen this stated anywhere. The closest thing to a confirmation is the TrueNAS documentation on fusion pools, which says:

"If the special class becomes full, then allocations spill back into the normal class."

By "spill", does that mean blocks are removed from the special class so that new incoming small-block writes can be added? Or does it mean that when the special class is full, small IO is sent directly to the normal class, bypassing the special vdev?

Top answer (1 of 2):

I have answered my own question eventually, partially from some old forum posts and partially from practical testing. Once the special vdev is full, it stays full. The standard small-block allocation limit is 75% of the drive space (the default, but adjustable); after that, further blocks are sent directly to the backing storage.

The special vdev can be expanded after the fact, and also striped with another vdev if space becomes constrained, but there is no way to rebalance. Furthermore, if the special vdev hits its allocation limit and is later expanded, the small IO will have to be re-written to move it back to its allocation class.

As long as you're using all striped mirrors in your pool with the same ashift (I am), you can "flush" the metadata and small IO back to your main pool by removing the special vdev, i.e. zpool remove pool mirror-x, with x being the mirrored SSDs' vdev number. This will write all metadata and small blocks back to the main pool. But it also prevents the special vdev from being useful again if re-added without re-writing all your data, as new metadata is only added on write.
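
As a concrete sketch of that removal (pool and vdev names hypothetical; the evacuation runs in the background):

# Identify the special mirror's name (e.g. mirror-1) in the pool layout.
zpool status pool

# Evacuate it: metadata and small blocks are copied back to the
# remaining vdevs before the device is detached from the pool.
zpool remove pool mirror-1

# zpool status reports evacuation/removal progress while it runs.
zpool status pool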

Second answer (2 of 2):

Quoting: "What confuses me is the behavior of the small block allocation class when the special vdev is at capacity." When it reaches 75% full, all further small-block writes go to the regular vdevs like normal. Current blocks stay where they are. The remaining space is reserved for metadata only.

Adjust the percentage in /etc/modprobe.d/zfs.conf by adding, for example, options zfs zfs_special_class_metadata_reserve_pct=10 and rebooting. There might be a way to do it live (lasting only until reboot) I think, but I forget right now. You can search the source code for "zfs_special_class_metadata_reserve_pct" and find what it touches; sometimes there are comments you might find helpful. Searching the GitHub issues and pull requests can also be helpful. You're probably thinking that all of this is horribly documented and scattered around like marbles. You're right.

Quoting: "As long as you're using all striped mirrors in your pool with the same ashift (I am), you can 'flush' the metadata and small IO back to your main pool by removing the special vdev." Make absolutely certain you have things backed up and confirmed good (scrubbed) before you try this. It has rarely resulted in problems, but I believe mercenary_sysadmin saw corruption when he did some testing when special vdevs first came out; the issue was never resolved, and I don't know if he's revisited it. I personally won't trust vdev removal for a long time.

Also, I highly recommend a triple mirror as the smallest you consider, preferably old enterprise SSDs with real PLP (visible capacitors), which are cheap enough to find on eBay.

Here are some other references for those curious about special vdevs:

  • Various findings: https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954
  • Generate a histogram of block sizes
  • Calculate data from ZDB output
  • Clarifies behavior of what goes where when setting small block size: https://github.com/openzfs/zfs/issues/9131#issuecomment-528562601
  • Small blocks can now be set to anything you can set recordsize to (512B-1M), and even bigger if you enable larger record sizes (up to 16M): https://github.com/openzfs/zfs/pull/9355
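
For the "live" tweak the answer half-remembers: on Linux, ZFS module parameters are exposed under /sys, so a hedged sketch (assuming your ZFS version allows changing this parameter at runtime; it reverts at reboot) would be:

# Read the current reserve percentage (default 25).
cat /sys/module/zfs/parameters/zfs_special_class_metadata_reserve_pct

# Lower it to 10% until the next reboot (as root).
echo 10 > /sys/module/zfs/parameters/zfs_special_class_metadata_reserve_pct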

From r/zfs: "Advice on the special VDEV in my ZFS setup" (Reddit, January 6, 2023):

I am planning out my ZFS setup as I'm moving from SnapRAID. I bought a few sticks of 118GB Optane to play around with and am considering using them, or some high-end SSDs, mirrored in a special vdev. I'm considering some 2TB SN850Xs instead of Optane so I can also store small blocks on the special vdev. I store mostly video and photos on my server and plan on having a 5x20TB raidz2 and a 3x8TB mirror vdev in my pool. I have 64GB of non-ECC RAM and my server is on a 1Gbit NIC. The performance improvement I want to see is faster loading of my folders, as it currently takes 10-20 seconds to load the file structure and thumbnails in the worst case. Would a special vdev suit my needs, or would ARC and L2ARC be fine enough? I would appreciate any advice on my setup.

From the TrueNAS Community forums: "Redundancy necessary for special Metadata vdev?" (June 16, 2020). On the question of whether you can set a different ashift for each vdev: TrueNAS defaults to ashift=12 for all devices, including the special mirror, and you shouldn't use anything less than that with how common 512e drives are. Per-vdev ashift can be passed when each vdev is added, as in the thread's example:

zpool create library raidz2 /zfs/disk[1-8] -o ashift=12 # platter disks
zpool add library special mirror /zfs/meta[1-2] -o ashift=13 -f # mirrored SSDs
zpool add library cache /zfs/cache1 -o ashift=13 # NVMe
zpool add library log /zfs/slog11 -o ashift=13 # NVMe
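
One hedged way to double-check per-vdev ashift after the fact is to dump the pool's cached configuration with zdb (pool name as in the example above):

# Each vdev in the printed config includes its ashift value.
zdb -C library | grep ashift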

From Klara Systems: "OpenZFS: Understanding ZFS vdev Types", an in-depth guide that breaks down the building blocks of a zpool, explaining vdev types like mirror, RAIDZ, and dRAID, and support classes such as LOG, CACHE, and SPECIAL.

From the Proxmox support forum, "ZFS Metadata Special Device" (June 16, 2023): adding one is more or less not much more than

zpool add POOLNAME special mirror /dev/sdX /dev/sdY

and, for the block size,

zfs set special_small_blocks=1M POOLNAME

From the Proxmox support forum, "PBS and ZFS Special Allocation Class VDEV ... aka Fusion Drive" (June 14, 2024), an example setup. Add a special vdev:

zpool add rpool -f -o ashift=12 special mirror scsi-<>-part3 scsi-<>-part3 scsi-<>-part3

Then configure it:

zfs set recordsize=1M rpool
zfs set special_small_blocks=512K rpool

From r/zfs: "How do I make a metadata special device vdev?" (Reddit, October 24, 2022):

I have a 100 TB raidz1 pool that I am about to create, and I want to store its metadata on mirrored SSDs.

How do I create the metadata storage for only the 100TB pool (not the OS)?

zpool create -f -o ashift=12 -m /media storage \
    -O recordsize=1M \
    -O primarycache=metadata -O secondarycache=none \
    raidz \
      ata-ST3000DM001-9YN166_HWID \
      ata-ST3000DM001-9YN166_HWID \
      ata-ST3000DM001-9YN166_HWID \
      ata-ST3000DM001-9YN166_HWID
zpool add -o ashift=12 storage special mirror /dev/ssd0n1 /dev/ssd1n1
zfs set special_small_blocks=128K storage
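
To watch how the special vdev fills up relative to the data vdevs, per-vdev space usage is visible with zpool list (same hypothetical pool name as above):

# -v breaks out capacity and allocation per vdev, including the special mirror.
zpool list -v storage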

From Level1Techs: "ZFS Metadata Special Device: Z" (https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954), on ZFS allocation classes: "It isn't storage tiers or caching, but gosh darn it, you can really REALLY speed up your zfs pool."

From Techno Tim: "Boost ZFS Performance with a Special VDEV in TrueNAS": real-world benchmarks comparing pools with and without a special vdev. The test script builds two pools from three drives (two HDDs plus an NVMe special device), where only the second pool gets the special vdev, and writes on the order of 100,000 test files to get consistent results.

From the Proxmox support forum, "ZFS Special VDEV" (January 1, 2025): a ZFS-and-Samba storage server with a "special" vdev in the pool reports really good performance when its 40TB of data (around 50 million files) is backed up, because the file metadata can be analysed very quickly.

From r/zfs: "How to determine the required drive size for special VDEV (allocation classes)?" (Reddit, June 11, 2019):

ZoL 0.8 introduced the allocation classes feature, which allows you to place metadata and optionally small blocks on a separate "special" VDEV consisting of faster drives.

I'm now trying to figure out what the required drive size would be, given an existing pool with data. I need a way to get statistics showing how much space is consumed by metadata in a pool, as well as the total allocated space by block size. I want to be able to determine that "the metadata in this pool plus all blocks of up to 8K size require X amount of disk space".

I've looked around the zdb man page but I've only found the histogram for free space, not allocated data.

Any ideas?
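
One commonly cited approach (it's what the Level1Techs writeup linked above uses) is zdb's block statistics, which can take a long time on a large pool; a hedged sketch:

# Walk every block in the pool and print statistics, including
# per-type totals and a block-size histogram; -L skips the slow
# leak detection.
zdb -Lbbbs tank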

From r/zfs: "Why is nobody talking about the newly introduced Allocation Class VDEVs? This could significantly boost small random I/O workloads for a fraction of the price of full SSD pools." (Reddit):

The new 0.8 version of ZFS included something called allocation classes. I glanced over it a couple of times but still didn't really understand what it was, or why it was worth mentioning as a key feature, until I read the man pages. And after reading them, it seems like this could be a significant performance boost for small random I/O if you're using fast SSDs. This isn't getting the attention here that it deserves. Let's dive in.

Here is what it does (from the manual):

Special Allocation Class: The allocations in the special class are dedicated to specific block types. By default this includes all metadata, the indirect blocks of user data, and any deduplication tables. The class can also be provisioned to accept small file blocks.

A pool must always have at least one normal (non-dedup/special) vdev before other devices can be assigned to the special class. If the special class becomes full, then allocations intended for it will spill back into the normal class.

Inclusion of small file blocks in the special class is opt-in. Each dataset can control the size of small file blocks allowed in the special class by setting the special_small_blocks dataset property. It defaults to zero, so you must opt-in by setting it to a non-zero value.

ZFS dataset property special_small_blocks=size - This value represents the threshold block size for including small file blocks into the special allocation class. Blocks smaller than or equal to this value will be assigned to the special allocation class while greater blocks will be assigned to the regular class. Valid values are zero or a power of two from 512B up to 128K. The default size is 0 which means no small file blocks will be allocated in the special class. Before setting this property, a special class vdev must be added to the pool.

VDEV type special - A device dedicated solely for allocating various kinds of internal metadata, and optionally small file blocks. The redundancy of this device should match the redundancy of the other normal devices in the pool. If more than one special device is specified, then allocations are load-balanced between those devices.

------------------------------------------------------------------------------------------------------------------------------------------------

In other words: if you use SSDs and have ZFS store super small files on there, this looks like a really good solution for slow random I/O on hard drives. You could think of putting 4 SSDs in striped mirrors and using just that as a special device, and then, depending on your datasets, determine what threshold of small files to store on there. Seems like an amazingly efficient way to boost hard-drive pools overall!

So I tested whether you can make a striped mirror as a special vdev (all in a VM; I'm unable to actually test this for real atm) and sure enough:

zpool add rpool special mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf

This added the special vdev, striped and mirrored, just like you'd imagine. Now again, this is only in a VM, so I have zero performance benchmarks or implications. But with the combination of metadata, indirect blocks + per-dataset determined small files, this seems promising.

Now come my questions:

  1. Let's say, with a 50TB dataset, how much would it use for metadata and indirect blocks of user data? I've seen the following calculation online for metadata: size / blocksize * (blocksize + checksum). That would mean larger record sizes produce much less metadata, and so potentially more benefit from having smaller files go to the special VDEV. (A rough worked estimate follows this list.)

  2. Are there other things I'm missing here? Or is this a no-brainer for people who need to extract way more random I/O performance out of disk pools? Most people seem to think that SLOGs are what will make a pool much faster, but I feel this is actually something that could make much more of an impact for most ZFS users. Sequential reads and writes are already great on spinning disks pooled together; it's the random I/O that's always lacking. This would solve that problem to a big extent.

  3. If the special vdev gets full, it will automatically start using the regular pool for metadata, so effectively it's not the end of the world if your special vdev is getting full. This begs the question: can you later replace the special VDEV with a bigger one without any issues?

  4. Does it actually compress data on the special VDEV too? Probably won't matter with the small block sizes anyway, but still.

  5. This sentence from the manual: "The redundancy of this device should match the redundancy of the other normal devices in the pool. If more than one special device is specified, then allocations are load-balanced between those devices." This doesn't make sense to me: if you run a few RAIDZ2 vdevs, why would you have to use RAIDZ2 for the special vdev? Is there any reason you can't just use striped mirrors for the special vdev?

  6. Is there anybody who is actually using it yet? What are your experiences?
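
On question 1, a rough worked estimate (hedged; it assumes the often-quoted 128-byte ZFS block pointer and ignores dnodes and other per-file metadata): a 50TB dataset at 1M recordsize holds about 5 × 10^7 blocks, and at 128 bytes of pointer overhead per block that is roughly 6.4GB of indirect-block metadata; at 128K recordsize it would be eight times as many blocks, so around 51GB. Either way it's a small fraction of the pool, which is why comparatively small special vdevs are viable.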