There are two types of "data" in the pool: the actual data of whatever it is you're storing, and the metadata, which is all the tables, properties, indexes and other "stuff" that defines the pool structure, the datasets, and the pointers that tell ZFS where on disk to find the actual data.
Normally this is all mixed together on the regular pool vdevs (mirror, raidz, etc.). If you add a special vdev to your pool, ZFS will prefer to store the metadata there and send the data proper to the regular vdevs. The main reason for doing this is if you have "slow" data vdevs; adding a special vdev made of an SSD mirror can speed up access times, as ZFS can look to the SSDs to learn where on the data vdevs its data lives and go directly there, rather than loading the metadata off the slow vdevs and then needing another access to get the real data.
There's another possible advantage: ZFS can store "small" files on the special vdev, leaving larger ones for the regular data vdevs.
One important thing to remember is that special vdevs are a proper part of the pool, not an add-on - just like the regular data vdevs, if the special vdev fails, the pool is lost. An SSD mirror is typical for this vdev.
This is only a rough explanation; see also:
- zpoolconcepts(7)
- Level1Techs writeup
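As a rough sketch of the commands involved (the pool name and device paths below are hypothetical placeholders, and the block only echoes a notice if the ZFS tools aren't installed):

```shell
# Sketch: attach a mirrored SSD special vdev to an existing pool.
# "tank" and the device paths are hypothetical placeholders.
POOL=tank
SSD_A=/dev/disk/by-id/nvme-SSD_A
SSD_B=/dev/disk/by-id/nvme-SSD_B

if command -v zpool >/dev/null 2>&1; then
    # Metadata will prefer the special vdev from this point on;
    # existing metadata is not migrated automatically.
    zpool add "$POOL" special mirror "$SSD_A" "$SSD_B"
else
    echo "zpool not found: commands shown for illustration only"
fi
```

Note that only newly written blocks land on the special vdev; metadata already on the data vdevs stays where it is until it is rewritten.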
Racking my brain trying to figure out its actual behavior. No documentation I can find actually comments on this.
The special vdev is used to store the pool's metadata, so operations like directory listing run at the speed of an NVMe drive; this also improves the performance of the spinning rust by reducing the small-IO load required to look up pool metadata. This makes sense.
What confuses me is the behavior of the small-block allocation class when the special vdev is at capacity. There seem to be a few scenarios, not covered anywhere in the documentation, that would matter for performance.
From how I've seen it discussed, my understanding is that it acts as a write-back cache for small IO based on the special_small_blocks value of the dataset, and that when the special vdev is at capacity or needs more space for metadata, small-block allocation is offloaded back to the data vdevs.
However, I've never actually seen this confirmed anywhere. The closest thing to a confirmation is the TrueNAS documentation on fusion pools, which says:
If the special class becomes full, then allocations spill back into the normal class.
By "spill", do they mean small blocks are removed from the special class so new incoming small-block writes can be added, or that when the special class is full, small IO is sent directly to the normal class, bypassing the special vdev?
zpool list -v -H -P
-v verbose
-P show full paths, not just the last component
-H script mode - no headings, fields separated by tab character
That will get you a lot closer.
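To illustrate what you can do with that script-mode output, here's a sketch that pulls just the leaf device paths out of the tab-separated fields (the sample capture below is hypothetical and abridged):

```shell
# Hypothetical, abridged capture of `zpool list -v -H -P` output.
# Fields are tab-separated; child vdev rows begin with tab(s).
sample=$(printf 'tank\t1.81T\t500G\t1.32T\n\tmirror-0\t1.81T\t500G\t1.32T\n\t\t/dev/disk/by-id/ata-DISK_A-part1\t-\t-\t-\n\t\t/dev/disk/by-id/ata-DISK_B-part1\t-\t-\t-\n')

# Print any tab-separated field that looks like an absolute device path.
disks=$(printf '%s' "$sample" | awk -F'\t' '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) print $i }')
echo "$disks"
```

Because -H guarantees one tab between fields, this kind of parsing stays stable across column-width changes, which is exactly what script mode is for.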
The zpool-status(8) command supports JSON output in more recent OpenZFS versions.
Here's an example:
$ zpool status --json | jq -r '.. | select(.vdev_type? == "disk").name'
usb-QEMU_QEMU_HARDDISK_1-0000:00:04.0-4.2-0:0
usb-QEMU_QEMU_HARDDISK_1-0000:00:04.0-4.5-0:0
I am planning out my ZFS setup as I'm moving from SnapRAID. I bought a few sticks of 118 GB Optane to play around with and am considering using them, or some high-end SSDs, mirrored in a special vdev. I'm considering using some 2 TB SN850Xs instead of Optane to be able to store small blocks on the special vdev. I store mostly video and photos on my server and plan on having a 5x20 TB raidz2 and a 3x8 TB mirror vdev in my pool. I have 64 GB of non-ECC RAM and my server is on a 1 Gbit NIC. The performance improvement I want to see is faster loading of my folders, as it currently takes 10-20 seconds to load the file structure and thumbnails in the worst case. Would a special vdev suit my needs, or would ARC and L2ARC be fine enough? I would appreciate any advice on my setup.
I have 100 TB raidz1 pool that I am about to create, and I want to store the metadata for it on mirrored SSDs.
How do I create the metadata storage for only the 100TB pool (not the OS)?
zpool create -f -o ashift=12 -m /media storage \
-O recordsize=1M \
-O primarycache=metadata -O secondarycache=none \
raidz \
ata-ST3000DM001-9YN166_HWID \
ata-ST3000DM001-9YN166_HWID \
ata-ST3000DM001-9YN166_HWID \
ata-ST3000DM001-9YN166_HWID
zpool add storage -o ashift=12 special mirror /dev/ssd0n1 /dev/ssd1n1
zfs set special_small_blocks=128K storage
ZoL 0.8 introduced the allocation classes feature, which allows you to place metadata and optionally small blocks on a separate "special" VDEV consisting of faster drives.
Now I'm currently trying to figure out what the required drive size would be, given an existing pool with data. I need a way to get statistics showing me how much space is consumed by metadata in a pool as well as total sum of allocated space by block size. I want to be able to determine "the metadata in this pool plus all blocks of up to 8K size require X amount of disk space".
I've looked around the zdb man page but I've only found the histogram for free space, not allocated data.
Any ideas?
The new 0.8 version of ZFS included something called Allocation Classes. I glanced over it a couple of times but still didn't really understand what it was, or why it was worthy of mentioning as a key feature, until I read the man pages. And after reading it, it seems like this could be a significant performance boost for small random I/O if you're using fast SSDs. This isn't getting the attention on here it deserves. Let's dive in.
Here is what it does (from the manual):
Special Allocation Class
The allocations in the special class are dedicated to specific block types. By default this includes all metadata, the indirect blocks of user data, and any deduplication tables. The class can also be provisioned to accept small file blocks.
A pool must always have at least one normal (non-dedup/special) vdev before other devices can be assigned to the special class. If the special class becomes full, then allocations intended for it will spill back into the normal class.
Inclusion of small file blocks in the special class is opt-in. Each dataset can control the size of small file blocks allowed in the special class by setting the special_small_blocks dataset property. It defaults to zero, so you must opt-in by setting it to a non-zero value.
ZFS dataset property special_small_blocks=size - This value represents the threshold block size for including small file blocks into the special allocation class. Blocks smaller than or equal to this value will be assigned to the special allocation class while greater blocks will be assigned to the regular class. Valid values are zero or a power of two from 512B up to 128K. The default size is 0 which means no small file blocks will be allocated in the special class. Before setting this property, a special class vdev must be added to the pool.
VDEV type special - A device dedicated solely for allocating various kinds of internal metadata, and optionally small file blocks. The redundancy of this device should match the redundancy of the other normal devices in the pool. If more than one special device is specified, then allocations are load-balanced between those devices.
------------------------------------------------------------------------------------------------------------------------------------------------
In other words - if you use SSDs and have ZFS store super small files on there, this looks like a really good solution for slow random I/O on hard drives. You could put 4 SSDs in striped mirrors, use only that as a special device, and then, depending on the datasets you have, decide what threshold of small files to store there. Seems like an amazingly efficient way to boost hard drive pools overall!
So I tested whether you can make a striped mirror of a special vdev (all in a VM; I'm unable to actually test this for real atm) and sure enough:
zpool add rpool special mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf
This added the special vdev, striped and mirrored, just like you'd imagine. Now again, this is only in a VM, so I have zero performance benchmarks or implications. But with the combination of metadata, indirect blocks + per-dataset determined small files, this seems promising.
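The per-dataset opt-in mentioned above might look something like this (dataset names are hypothetical, and the block only echoes a notice if the ZFS tools aren't present):

```shell
# Sketch: opt datasets in to small-block allocation individually.
# "tank/vms" and "tank/media" are hypothetical dataset names.
THRESHOLD=64K

if command -v zfs >/dev/null 2>&1; then
    zfs set special_small_blocks=$THRESHOLD tank/vms  # small random IO benefits
    zfs set special_small_blocks=0 tank/media         # big files stay on data vdevs
    zfs get -r special_small_blocks tank              # verify the settings
else
    echo "zfs not found: commands shown for illustration only"
fi
```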
Now come my questions:
- Let's say, with a 50TB dataset, how much would it use for metadata and indirect blocks of user data? I've seen the following calculation online to calculate metadata: size / blocksize * (blocksize + checksum). Which would mean that larger record sizes would have much less metadata, and such pools could potentially benefit from having smaller files go to the special VDEV.
- Are there other things I'm missing here? Or is this a no-brainer for people who need to extract way more random I/O performance out of disk pools? Most people seem to think that SLOGs are what will make a pool much faster, but I feel this is actually something that could make much more of an impact for most ZFS users. Sequential reads and writes are already great on spinning disks pooled together; it's the random I/O that's always lacking. This would solve that problem to a big extent.
- If the special vdev gets full, it will automatically start using the regular pool for metadata, so effectively it's not the end of the world if your special vdev fills up. This begs the question: can you later replace the special VDEV with a bigger one without any issues?
- Does it actually compress data on the special VDEV too? It probably won't matter with the small block sizes anyway, but still.
- This sentence from the manual: *The redundancy of this device should match the redundancy of the other normal devices in the pool. If more than one special device is specified, then allocations are load-balanced between those devices.* doesn't make sense to me. If you run a few RAIDZ2 vdevs, why would you have to use RAIDZ2 for the special vdev? Is there any reason you can't just use striped mirrors for the special vdev?
- Is there anybody who is actually using it yet? What are your experiences?
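For the 50TB metadata question above, a back-of-envelope estimate can be sketched with shell arithmetic. The assumption here (mine, not from the manual) is roughly one 128-byte block pointer per data block; this ignores dnodes and indirect-block overhead, so treat it as a floor rather than a budget:

```shell
# Back-of-envelope metadata estimate: one 128-byte block pointer per
# data block. Real usage will be somewhat higher (dnodes, indirects).
DATA_BYTES=$((50 * 1024 * 1024 * 1024 * 1024))   # 50 TiB of user data
BP_BYTES=128                                     # size of one block pointer

estimate_mib() {
    recordsize=$1
    blocks=$((DATA_BYTES / recordsize))
    echo $((blocks * BP_BYTES / 1024 / 1024))
}

echo "128K records: $(estimate_mib $((128 * 1024))) MiB of block pointers"
echo "1M records:   $(estimate_mib $((1024 * 1024))) MiB of block pointers"
```

At 128K records this works out to roughly 0.1% of the data size, and a larger recordsize shrinks it proportionally, which matches the intuition that bigger records mean less metadata.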
Don Brady presented this work at the Second (edit: First?) annual ZFS user conference in Norwalk, CT, sponsored by Datto. It was required by the dRAID work Intel is doing for Fermilab, and perhaps other national labs across the U.S. Very interesting work; I haven't had a need to implement it yet but am interested in hearing about deployments. One of the ZFS mailing lists had a complaint from a user who added a special allocation VDEV, played around with it, and then found they couldn't remove it easily like a SLOG or l2arc device. N.B. For now it is permanent once added; the pool must be copied, destroyed, and rewritten if you change your mind.
Here is the link to the presentation document: https://zfs.datto.com/2017_slides/brady.pdf
Presumably the redundancy advice is to stop people from treating the special device like l2arc and potentially losing the entire pool after a single drive failure.
This looks very interesting even without the small-file option; presumably it could be used as a metadata-only l2arc with no overhead in ARC. I'd be curious to see some benchmarks of this vs. l2arc vs. normal, with NVMe drives or even Optane AICs.