Comment by uniqueuid 5 days ago

18 replies

It's good to see that they were pretty conservative about the expansion.

Not only is expansion completely transparent and resumable, it also maintains redundancy throughout the process.

That said, there is one tiny caveat people should be aware of:

> After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity).

chungy 5 days ago

I'm not sure that's really a caveat; it just means old data might be in a suboptimal layout. Even with that, you still get the full benefits of raidzN, where up to N disks can completely fail and the pool will remain functional.

  • crote 5 days ago

    I think it's a huge caveat, because it makes upgrades a lot less efficient than you'd expect.

    For example, home users generally don't want to buy all of their storage up front. They want to add additional disks as the array fills up. Being able to start with a 2-disk raidz1 and later upgrade that to a 3-disk and eventually 4-disk array is amazing. It's a lot less amazing if you end up with 55% storage efficiency rather than the 66% you'd ideally get from a 2-disk to 3-disk upgrade. That's 11% of your total disk capacity wasted, without any benefit whatsoever.
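
    (A back-of-the-envelope check of where that ~55% figure comes from, assuming hypothetical 1 TB disks and that the 2-disk raidz1 was full when the third disk was added:)

        # Old data keeps its 1 data : 1 parity layout; only the space freed up by
        # the new disk gets written at the new 2 data : 1 parity ratio.
        awk 'BEGIN {
          old_usable = 1          # 2-disk raidz1 filled before expansion: 1 TB of data
          new_usable = 1 * 2/3    # 1 TB freed by the new disk, written 2 data : 1 parity
          raw        = 3          # three 1 TB disks
          printf "%.1f%% of raw capacity usable (vs %.1f%% if rebalanced)\n",
                 100 * (old_usable + new_usable) / raw, 100 * 2/3
        }'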

    • ryao 5 days ago

      You have a couple options:

      1. Delete the snapshots and rewrite the files in place, as people do when they want to rebalance a pool.

      2. Use send/receive inside the pool.

      Either one will make the data use the new layout. They both carry the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.

      • pdimitar 2 days ago

        Can you give sample commands on how to achieve both options that you gave?
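
        For illustration, a rough sketch of what those two options could look like ("tank" and "tank/data" are hypothetical pool/dataset names; review snapshots and verify copies before destroying anything):

            # Option 1: drop snapshots, then rewrite each file in place so the new
            # copy is allocated with the post-expansion data:parity ratio.
            zfs list -r -t snapshot tank/data         # review what exists first
            zfs destroy tank/data@oldsnap             # repeat per snapshot you can part with
            cp -a /tank/data/somefile /tank/data/somefile.tmp &&
                mv /tank/data/somefile.tmp /tank/data/somefile

            # Option 2: send/receive into a new dataset on the same pool, then swap names.
            zfs snapshot -r tank/data@rebalance
            zfs send -R tank/data@rebalance | zfs receive tank/data.new
            zfs rename tank/data tank/data.old        # keep until the copy is verified
            zfs rename tank/data.new tank/data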

    • bmicraft 5 days ago

      Well, when you start a raidz with 2 devices you've already done goofed. Start with a mirror or at least 3 devices.

      Also, if you don't wait to upgrade until the disks are at 100% utilization (which you should never do anyway; beyond roughly 85% utilization you create massive fragmentation), real-world efficiency will be better.

    • chungy 5 days ago

      It still seems pretty minor. If you want extreme optimization, feel free to destroy the pool and recreate it, or create it with the ideal layout from the beginning.

      Old data still works fine, the same guarantees RAID-Z provides still hold. New data will be written with the new data layout.

  • stavros 5 days ago

    Is that the case? What if I expand a 3-1 array to 3-2? Won't the old blocks remain 3-1?

    • Timshel 5 days ago

      I don't believe it supports adding parity drives, only data drives.

      • stavros 5 days ago

        Ahh interesting, thanks.

        • bmicraft 5 days ago

          Since preexisting blocks are kept at their current parity ratio and not modified (only redistributed among all devices), increasing the parity level of new blocks won't really be useful in practice anyway.

wjdp 5 days ago

The caveat is very much expected; you should expect ZFS features not to rewrite existing blocks. Changes to settings only apply to newly written data, for example.

rekoil 5 days ago

Yeah, it's a pretty huge caveat to be honest.

    Da1 Db1 Dc1 Pa1 Pb1
    Da2 Db2 Dc2 Pa2 Pb2
    Da3 Db3 Dc3 Pa3 Pb3
    ___ ___ ___ Pa4 Pb4
___ represents free space. After expansion by one disk you would logically expect something like:

    Da1 Db1 Dc1 Da2 Pa1 Pb1
    Db2 Dc2 Da3 Db3 Pa2 Pb2
    Dc3 ___ ___ ___ Pa3 Pb3
    ___ ___ ___ ___ Pa4 Pb4
But as I understand it, it would actually expand to:

    Da1 Db1 Dc1 Dd1 Pa1 Pb1
    Da2 Db2 Dc2 Dd2 Pa2 Pb2
    Da3 Db3 Dc3 Dd3 Pa3 Pb3
    ___ ___ ___ ___ Pa4 Pb4
Where the Dd1-3 blocks are just wasted. Meaning that by adding a new disk to the array you're only expanding free storage by 25%... So say you have 8TB disks for a total of 24TB of storage originally, and you have 4TB free before expansion; you would have 5TB free after expansion.

Please tell me I've misunderstood this, because to me it is a pretty useless implementation if I haven't.

  • ryao 5 days ago

    ZFS RAID-Z does not have parity disks. Parity and data are interleaved to allow data reads to be done from all disks rather than just the data disks.

    The slides here explain how it works:

    https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf

    Anyway, you are not entirely wrong. The old data will have the old parity:data ratio while new data will have the new parity:data ratio. As old data is freed from the vdev, new writes will use the new parity:data ratio. You can speed this up by doing send/receive, or by deleting all snapshots and then rewriting the files in place. This has the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.

    • chungy 5 days ago

      To be fair, RAID5/6 don't have parity disks either. RAID2, RAID3, and RAID4 do, but they're all effectively dead technology for good reason.

      I think it's easy for a lot of people to conceptualize RAID5/6 and RAID-Zn as having "data disks" and "parity disks" to wrap their heads around the complicated topic of how it works, but all of them truly interleave and compute parity data across all disks, allowing any single disk to die.

      I've been of two minds on the persistent myth of "parity disks", but I usually ignore it because it's a convenient fiction that at least helps people understand their data is safe. It's a bit like the way raidz1 and raidz2 are sometimes talked about as "RAID5" and "RAID6": the effective benefits are the same, but the implementation is totally different.

  • magicalhippo 5 days ago

    Unless I misunderstood you, you're describing more how classical RAID would work. RAID-Z expansion works the way you note you'd logically expect: you added a drive with four blocks of free space, and you end up with four more blocks of free space afterwards.

    You can see this in the presentation[1] slides[2].

    The reason this is sub-optimal post-expansion is that, in your example, the old maximal stripe width is lower than the post-expansion maximal stripe width.

    Your example is a bit unfortunate in terms of allocated blocks vs layout, but if we tweak it slightly, then

        Da1 Db1 Dc1 Pa1 Pb1
        Da2 Db2 Dc2 Pa2 Pb2
        Da3 Db3 Pa3 Pb3 ___
    
    would after RAID-Z expansion become

        Da1 Db1 Dc1 Pa1 Pb1 Da2
        Db2 Dc2 Pa2 Pb2 Da3 Db3 
        Pa3 Pb3 ___ ___ ___ ___
    
    I.e. you added a disk with 3 new blocks, so total free space afterwards is 1+3 = 4 blocks.

    However, if the same data had been written in the post-expansion vdev configuration, it would have become

        Da1 Db1 Dc1 Dd1 Pa1 Pb1
        Da2 Db2 Dc2 Dd2 Pa2 Pb2
        ___ ___ ___ ___ ___ ___
    
    I.e. you'd have 6 free blocks, not just 4.

    Of course, this doesn't apply to writes which end up taking less than the maximal stripe width.
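
    The stripe accounting can also be sanity-checked with a quick calculation (a sketch only: 8 data blocks, raidz2's 2 parity blocks per stripe, 18 slots across the 6 disks):

        # Compare the expanded-but-not-rewritten layout (3 data columns) with the
        # same 8 data blocks written fresh at 4 data columns.
        awk 'BEGIN {
          data = 8; slots = 3 * 6                # 3 rows across 6 disks
          old = data + 2 * ceil(data / 3)        # 3 stripes -> 6 parity blocks
          new = data + 2 * ceil(data / 4)        # 2 stripes -> 4 parity blocks
          printf "expanded, not rewritten: %d of %d blocks free\n", slots - old, slots
          printf "written fresh as 4+2:    %d of %d blocks free\n", slots - new, slots
        }
        function ceil(x) { return x == int(x) ? x : int(x) + 1 }'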

    [1]: https://www.youtube.com/watch?v=tqyNHyq0LYM

    [2]: https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf

    • ryao 5 days ago

      Your diagrams have some flaws too. ZFS has a variable stripe size. Let’s say you have a 10-disk raid-z2 vdev with ashift=12, i.e. 4K columns. If you have a 4K file, 1 data block and 2 parity blocks will be written. Even if you expand the raid-z vdev, there are no savings to be had from the new data:parity ratio. Now, let’s assume that you have a 72K file. Here, you have 18 data blocks and 6 parity blocks. You would benefit from rewriting this to use the new data:parity ratio. In this case, you would only need 4 parity blocks. ZFS does not rewrite it as part of the expansion, however.

      There are already good diagrams in your links, so I will refrain from drawing my own with ASCII. Also, ZFS will vary which columns get parity, which is why the slides you linked have the parity at pseudo-random locations. It was not a quirk of the slide’s author. The data is really laid out that way.
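
      For anyone who wants to check those numbers, a rough sketch (it assumes the vdev was expanded by a single disk, from 10-wide to 11-wide, which is what the 4-parity-block figure implies):

          # Parity needed for a 72K file (18 x 4K data blocks) on raidz2, which
          # always costs 2 parity blocks per stripe.
          awk 'BEGIN {
            data = 72 / 4
            for (width = 10; width <= 11; width++) {
              dcols   = width - 2                        # data columns per stripe
              stripes = int((data + dcols - 1) / dcols)  # ceil(18 / dcols)
              printf "%d-wide: %d stripes, %d parity blocks\n", width, stripes, 2 * stripes
            }
            # a lone 4K block is always 1 data + 2 parity, so small files gain nothing
          }'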

      • magicalhippo 5 days ago

        What are the errors? I tried to show exactly what you're talking about.

        edit: OK, I didn't consider the exact locations of the parity; I was only concerned with space usage.

        The 8 data blocks need three stripes in a 3+2 RAID-Z2 setup both pre- and post-expansion, the last being a partial stripe, but when written in the 4+2 setup they only need 2 full stripes, leading to more total free space.
