Comment by uniqueuid 5 days ago

18 replies

It's good to see that they were pretty conservative about the expansion.

Not only is expansion completely transparent and resumable, it also maintains redundancy throughout the process.

That said, there is one tiny caveat people should be aware of:

> After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity).

chungy 5 days ago

I'm not sure that's really a caveat; it just means old data might be in a suboptimal layout. Even with that, you still get the full benefits of raidzN, where up to N disks can completely fail and the pool will remain functional.

  • crote 5 days ago

    I think it's a huge caveat, because it makes upgrades a lot less efficient than you'd expect.

    For example, home users generally don't want to buy all of their storage up front. They want to add additional disks as the array fills up. Being able to start with a 2-disk raidz1 and later upgrade that to a 3-disk and eventually 4-disk array is amazing. It's a lot less amazing if you end up with 55% storage efficiency rather than the 66% you'd ideally get from a 2-disk to 3-disk upgrade. That's 11% of your total disk capacity wasted, without any benefit whatsoever.
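
    (A back-of-the-envelope check of where that ~55% figure comes from, assuming hypothetical 1 TB disks and that the 2-disk raidz1 was full when the third disk was added:)

        # Old data keeps its 1 data : 1 parity layout; only the space freed up by
        # the new disk gets written at the new 2 data : 1 parity ratio.
        awk 'BEGIN {
          old_usable = 1          # 2-disk raidz1 filled before expansion: 1 TB of data
          new_usable = 1 * 2/3    # 1 TB freed by the new disk, written 2 data : 1 parity
          raw        = 3          # three 1 TB disks
          printf "%.1f%% of raw capacity usable (vs %.1f%% if rebalanced)\n",
                 100 * (old_usable + new_usable) / raw, 100 * 2/3
        }'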

    • ryao 5 days ago

      You have a couple options:

      1. Delete the snapshots and rewrite the files in place, as people do when they want to rebalance a pool.

      2. Use send/receive inside the pool.

      Either one will make the data use the new layout. They both carry the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.

      • pdimitar 2 days ago

        Can you give sample commands on how to achieve both options that you gave?
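
        For illustration, a rough sketch of what those two options could look like ("tank" and "tank/data" are hypothetical pool/dataset names; review snapshots and verify copies before destroying anything):

            # Option 1: drop snapshots, then rewrite each file in place so the new
            # copy is allocated with the post-expansion data:parity ratio.
            zfs list -r -t snapshot tank/data         # review what exists first
            zfs destroy tank/data@oldsnap             # repeat per snapshot you can part with
            cp -a /tank/data/somefile /tank/data/somefile.tmp &&
                mv /tank/data/somefile.tmp /tank/data/somefile

            # Option 2: send/receive into a new dataset on the same pool, then swap names.
            zfs snapshot -r tank/data@rebalance
            zfs send -R tank/data@rebalance | zfs receive tank/data.new
            zfs rename tank/data tank/data.old        # keep until the copy is verified
            zfs rename tank/data.new tank/data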

    • bmicraft 5 days ago

      Well, when you start a raidz with 2 devices you've already done goofed. Start with a mirror or at least 3 devices.

      Also, if you don't wait to upgrade until the disks are at 100% utilization (which you should never do anyway; beyond roughly 85% utilization you create massive fragmentation), real-world efficiency will be better.

    • chungy 5 days ago

      It still seems pretty minor. If you want extreme optimization, feel free to destroy the pool and recreate it, or create it with the ideal layout from the beginning.

      Old data still works fine, the same guarantees RAID-Z provides still hold. New data will be written with the new data layout.

  • stavros 5 days ago

    Is that the case? What if I expand a 3-1 array to 3-2? Won't the old blocks remain 3-1?

    • Timshel 5 days ago

      I don't believe it supports adding parity drives, only data drives.

      • stavros 5 days ago

        Ahh interesting, thanks.

        • bmicraft 5 days ago

          Since preexisting blocks are kept at their current parity ratio and not modified (only redistributed among all devices), increasing the parity level of new blocks won't really be useful in practice anyway.

wjdp 5 days ago

The caveat is very much expected; you should expect ZFS features not to rewrite existing blocks. Changes to settings only apply to newly written data, for example.

rekoil 5 days ago

Yeah, it's a pretty huge caveat to be honest.

    Da1 Db1 Dc1 Pa1 Pb1
    Da2 Db2 Dc2 Pa2 Pb2
    Da3 Db3 Dc3 Pa3 Pb3
    ___ ___ ___ Pa4 Pb4
___ represents free space. After expansion by one disk you would logically expect something like:

    Da1 Db1 Dc1 Da2 Pa1 Pb1
    Db2 Dc2 Da3 Db3 Pa2 Pb2
    Dc3 ___ ___ ___ Pa3 Pb3
    ___ ___ ___ ___ Pa4 Pb4
But as I understand it, it would actually expand to:

    Da1 Db1 Dc1 Dd1 Pa1 Pb1
    Da2 Db2 Dc2 Dd2 Pa2 Pb2
    Da3 Db3 Dc3 Dd3 Pa3 Pb3
    ___ ___ ___ ___ Pa4 Pb4
Where the Dd1-3 blocks are just wasted. Meaning that by adding a new disk to the array you're only expanding free storage by 25%... So say you have 8TB disks for a total of 24TB of storage originally, and you have 4TB free before expansion; you would have 5TB free after expansion.

Please tell me I've misunderstood this, because to me it is a pretty useless implementation if I haven't.

  • ryao 5 days ago

    ZFS RAID-Z does not have parity disks. Parity and data are interleaved to allow data reads to be done from all disks rather than just the data disks.

    The slides here explain how it works:

    https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf

    Anyway, you are not entirely wrong. The old data will have the old parity:data ratio while new data will have the new parity:data ratio. As old data is freed from the vdev, new writes will use the new parity:data ratio. You can speed this up by doing send/receive, or by deleting all snapshots and then rewriting the files in place. This has the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.

    • chungy 5 days ago

      To be fair, RAID5/6 don't have parity disks either. RAID2, RAID3, and RAID4 do, but they're all effectively dead technology for good reason.

      I think it's easy for a lot of people to conceptualize RAID5/6 and RAID-Zn as having "data disks" and "parity disks" to wrap their heads around the complicated topic of how it works, but all of them truly interleave and compute parity data across all disks, allowing any single disk to die.

      I've been of two minds on the persistent myth of "parity disks", but I usually ignore it because it's a convenient fiction that at least helps people understand their data is safe. It's a bit like the way raidz1 and raidz2 are sometimes talked about as "RAID5" and "RAID6": the effective benefits are the same, but the implementation is totally different.

  • magicalhippo 5 days ago

    Unless I misunderstood you, you're describing more how classical RAID would work. RAID-Z expansion works the way you note you'd logically expect: you added a drive with four blocks of free space, and you end up with four more blocks of free space afterwards.

    You can see this in the presentation[1] slides[2].

    The reason this is sub-optimal post-expansion is that, in your example, the old maximal stripe width is lower than the post-expansion maximal stripe width.

    Your example is a bit unfortunate in terms of allocated blocks vs layout, but if we tweak it slightly, then

        Da1 Db1 Dc1 Pa1 Pb1
        Da2 Db2 Dc2 Pa2 Pb2
        Da3 Db3 Pa3 Pb3 ___
    
    would after RAID-Z expansion become

        Da1 Db1 Dc1 Pa1 Pb1 Da2
        Db2 Dc2 Pa2 Pb2 Da3 Db3 
        Pa3 Pb3 ___ ___ ___ ___
    
    I.e. you added a disk with 3 new blocks, so total free space afterwards is 1+3 = 4 blocks.

    However, if the same data had been written in the post-expansion vdev configuration, it would have become

        Da1 Db1 Dc1 Dd1 Pa1 Pb1
        Da2 Db2 Dc2 Dd2 Pa2 Pb2
        ___ ___ ___ ___ ___ ___
    
    I.e. you'd have 6 free blocks, not just 4.

    Of course, this doesn't apply to writes which end up taking less than the maximal stripe width.
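
    The stripe accounting can also be sanity-checked with a quick calculation (a sketch only: 8 data blocks, raidz2's 2 parity blocks per stripe, 18 slots across the 6 disks):

        # Compare the expanded-but-not-rewritten layout (3 data columns) with the
        # same 8 data blocks written fresh at 4 data columns.
        awk 'BEGIN {
          data = 8; slots = 3 * 6                # 3 rows across 6 disks
          old = data + 2 * ceil(data / 3)        # 3 stripes -> 6 parity blocks
          new = data + 2 * ceil(data / 4)        # 2 stripes -> 4 parity blocks
          printf "expanded, not rewritten: %d of %d blocks free\n", slots - old, slots
          printf "written fresh as 4+2:    %d of %d blocks free\n", slots - new, slots
        }
        function ceil(x) { return x == int(x) ? x : int(x) + 1 }'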

    [1]: https://www.youtube.com/watch?v=tqyNHyq0LYM

    [2]: https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf

    • ryao 5 days ago

      Your diagrams have some flaws too. ZFS has a variable stripe size. Let’s say you have a 10-disk raid-z2 vdev with ashift=12, i.e. 4K columns. If you have a 4K file, 1 data block and 2 parity blocks will be written. Even if you expand the raid-z vdev, there are no savings to be had from the new data:parity ratio. Now, let’s assume that you have a 72K file. Here, you have 18 data blocks and 6 parity blocks. You would benefit from rewriting this to use the new data:parity ratio. In this case, you would only need 4 parity blocks. ZFS does not rewrite it as part of the expansion, however.

      There are already good diagrams in your links, so I will refrain from drawing my own with ASCII. Also, ZFS will vary which columns get parity, which is why the slides you linked have the parity at pseudo-random locations. It was not a quirk of the slide’s author. The data is really laid out that way.
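
      For anyone who wants to check those numbers, a rough sketch (it assumes the vdev was expanded by a single disk, from 10-wide to 11-wide, which is what the 4-parity-block figure implies):

          # Parity needed for a 72K file (18 x 4K data blocks) on raidz2, which
          # always costs 2 parity blocks per stripe.
          awk 'BEGIN {
            data = 72 / 4
            for (width = 10; width <= 11; width++) {
              dcols   = width - 2                        # data columns per stripe
              stripes = int((data + dcols - 1) / dcols)  # ceil(18 / dcols)
              printf "%d-wide: %d stripes, %d parity blocks\n", width, stripes, 2 * stripes
            }
            # a lone 4K block is always 1 data + 2 parity, so small files gain nothing
          }'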

      • magicalhippo 5 days ago

        What are the errors? I tried to show exactly what you're talking about.

        edit: OK, I didn't consider the exact locations of the parity; I was only concerned with space usage.

        The 8 data blocks need three stripes in a 3+2 RAID-Z2 setup both pre- and post-expansion, the last being a partial stripe, but when written in the 4+2 setup they only need 2 full stripes, leading to more total free space.
