Comment by mustache_kimono 5 days ago

> It is possible with windows storage space (remove drive from a pool) and mdadm/lvm (remove disk from a RAID array, remove volume from lvm), which to me are the two major alternatives. Don't know about unraid.

Perhaps I am misunderstanding you, but you can offline and remove drives from a ZFS pool.

Do you mean WSS and mdadm/lvm will allow an automatic live rebalance and then reconfigure the drive topology?

cm2187 5 days ago

So for instance I have a ZFS pool with 3 HDD data vdevs and 2 SSD special vdevs. I want to convert the two SSD vdevs into a single one (or possibly remove one of them). From what I read, the only way to do that is to destroy the entire pool and recreate it (it's in a server in a datacentre, and I don't want to re-upload that much data).

In Windows, you can mark a disk for removal, and as long as the other disks have enough space and are compatible with the virtual disks (e.g. you need at least 5 disks if you have parity with number of columns = 5), it will rebalance the blocks onto the other disks until you can safely remove the disk. If you use thin provisioning, you can also change your mind about the settings of a virtual disk, create a new one on the same pool, and move the data from one to the other.

mdadm/lvm will do the same, albeit with more of a pain in the arse, since RAID requires resilvering not just the occupied space but also the free space, so it takes a lot more time and IO than it should.

It's one of my beefs with ZFS: there are lots of no-return decisions. That, and I ran into some race conditions loading a ZFS array on boot with NVMe drives on Ubuntu: the drives seem not to be ready yet, resulting in randomly degraded arrays. Fixed by loading the pool with a delay.

  • formerly_proven 5 days ago

    My understanding is that ZFS does virtual <-> physical translation in the vdev layer, i.e. all block references in ZFS contain a (vdev, vblock) tuple, and the vdev knows how to translate that virtual block offset into actual on-disk block offset(s).

    This kinda implies that you can't actually remove data vdevs, because in practice you can't rewrite all references. You also can't do offline deduplication without rewriting references (i.e. actually touching the files in the filesystem). And that's why ZFS can't deduplicate snapshots after the fact.

    On the other hand, reshaping a vdev is possible, because that "just" requires shuffling the vblock -> physical block associations inside the vdev.
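
    A rough sketch of that layering (a toy illustration with made-up field names, not the actual ZFS block pointer structures): block references carry a (vdev, virtual offset) pair, and only the vdev implementation knows how to turn the virtual offset into physical locations.

    ```c
    #include <stdint.h>

    /* Illustrative only -- not the real ZFS blkptr_t/DVA layout. */
    struct block_ref {
        uint64_t vdev_id;   /* which top-level vdev holds the block */
        uint64_t voffset;   /* virtual offset within that vdev */
        uint64_t asize;     /* allocated size */
    };

    /* The vdev implementation (mirror, raidz, ...) maps the virtual
     * offset onto one or more physical (child disk, offset) pairs. */
    struct phys_loc {
        int      child;     /* child disk index inside the vdev */
        uint64_t poffset;   /* byte offset on that child disk */
    };
    ```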

    • ryao 5 days ago

      There is a clever trick used to make top-level vdev removal work. The code makes the vdev read-only, then copies its contents into free space on the other vdevs (essentially, the contents are stored behind the scenes in a file). Finally, it redirects reads of the removed vdev to that stored copy. This indirection allows you to remove the vdev. It is not implemented for raidz at present, though.
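
      A rough sketch of that read indirection, assuming a simple lookup table (the real OpenZFS indirect-vdev code is far more involved; all names here are illustrative):

      ```c
      #include <stddef.h>
      #include <stdint.h>

      /* One remap entry built during removal: data that used to live at
       * [src_offset, src_offset + length) on the removed vdev now lives
       * at dst_offset on dst_vdev. */
      struct remap_entry {
          uint64_t src_offset;
          uint64_t length;
          uint64_t dst_vdev;
          uint64_t dst_offset;
      };

      /* Redirect a read that still references the removed vdev.
       * Returns 1 and fills in the new location on success, 0 otherwise. */
      static int remap_read(const struct remap_entry *map, size_t n,
                            uint64_t offset,
                            uint64_t *out_vdev, uint64_t *out_offset)
      {
          for (size_t i = 0; i < n; i++) {
              if (offset >= map[i].src_offset &&
                  offset < map[i].src_offset + map[i].length) {
                  *out_vdev   = map[i].dst_vdev;
                  *out_offset = map[i].dst_offset + (offset - map[i].src_offset);
                  return 1;
              }
          }
          return 0; /* not covered by the mapping */
      }
      ```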

      • formerly_proven 5 days ago

        Though the vdev itself still exists after doing that? It just happens to be backed by, essentially, a "file" in the pool, instead of the original physical block devices, right?

Sesse__ 5 days ago

> Do you mean WSS and mdadm/lvm will allow an automatic live rebalance and then reconfigure the drive topology?

mdadm can convert RAID-5 to a larger or smaller RAID-5, RAID-6 to a larger or smaller RAID-6, RAID-5 to RAID-6 or the other way around, RAID-0 to a degraded RAID-5, and perform many other fairly reasonable operations, all while the array is online and resistant to power loss and the like.

I wrote the first version of this md code in 2005 (against kernel 2.6.13), and Neil Brown rewrote and mainlined it at some point in 2006. ZFS is… a bit late to the party.

  • ryao 5 days ago

    Doing this with the on-disk data in a Merkle tree is much harder than doing it on more conventional forms of storage.

    By the way, what does MD do when there is corrupt data on disk that makes it impossible to know what the correct reconstruction is during a reshape operation? ZFS will know what file was damaged and proceed with the undamaged parts. ZFS might even be able to repair the damaged data from ditto blocks. I don’t know what the MD behavior is, but its options for handling this are likely far more limited.
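
    To make the Merkle-tree point concrete, here is a toy illustration (not ZFS's real structures): every block pointer embeds the checksum of the block it points to, so relocating any block during a reshape would mean rewriting its parent, whose checksum then changes, and so on all the way up to the root.

    ```c
    #include <stdint.h>

    /* Toy structures, not ZFS's real ones.  Because each pointer carries
     * the child's checksum, data cannot be moved without rewriting the
     * entire chain of ancestors.  Classic MD RAID only tracks stripe
     * geometry, so it can relocate data without updating references --
     * but it also has no checksums to say which reconstruction is the
     * correct one when disks disagree. */
    struct child_ptr {
        uint64_t vdev_id;      /* where the child block lives */
        uint64_t voffset;
        uint8_t  checksum[32]; /* checksum of the child's contents */
    };

    struct indirect_block {
        struct child_ptr children[128];
    };
    ```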

    • Sesse__ 5 days ago

      Well, then they made a design choice in their RAID implementation that made fairly reasonable things hard.

      I don't know what md does if the parity doesn't match up, no. (I've never ever had that happen, in more than 25 years of pretty heavy md use on various disks.)

      • ryao 5 days ago

        I am not sure if reshaping is a reasonable thing. It is not so reasonable in other fields. In architecture, if you build a bridge and then want more lanes, you usually build a new bridge, rather than reshape the bridge. The idea of reshaping a bridge while cars are using it would sound insane there, yet that is what people want from storage stacks.

        Reshaping traditional storage stacks does not consider all of the ways things can go wrong. Handling all of them well is hard, if not impossible to do in traditional RAID. There is a long history of hardware analogs to MD RAID killing parity arrays when they encounter silent corruption that makes it impossible to know what is supposed to be stored there. There is also the case where things are corrupted such that there is a valid reconstruction, but the reconstruction produces something wrong silently.

        Reshaping certainly is easier to do with MD RAID, but the feature has the trade-off that edge cases are not handled well. For most people, I imagine that risk is fine until it bites them. Then it is not fine anymore. ZFS made an effort to handle all of the edge cases so that they do not bite people, and doing that took time.

      • amluto 5 days ago

        I’ve experienced bit rot on md. It was not fun, and the tooling was of approximately no help recovering.

TiredOfLife 5 days ago

Storage Spaces doesn't dedicate a drive to a single purpose. It operates in chunks (256 MB, I think), so one drive can, at the same time, be part of a mirror, a RAID-5, and a RAID-0. This allows fully using drives of various sizes. And choosing to remove a drive will cause it to redistribute the chunks to other available drives, without going offline.

  • cm2187 5 days ago

    And as a user it seems to me to be the most elegant design. The quality of the implementation (parity write performance in particular) is another matter.