azalemeth 5 days ago

My understanding is that single-disk btrfs is good, but RAID is decidedly dodgy; https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid5... states that:

> The RAID56 feature provides striping and parity over several devices, same as the traditional RAID5/6.

> There are some implementation and design deficiencies that make it unreliable for some corner cases and *the feature should not be used in production, only for evaluation or testing*.

> The power failure safety for metadata with RAID56 is not 100%.

I have personally been bitten once (about 10 years ago) by btrfs just failing horribly on a single desktop drive. I've used either mdadm + ext4 (for /) or ZFS (for large /data mounts) ever since. ZFS is fantastic and I genuinely don't understand why it's not used more widely.

crest 5 days ago

One problem with your setup is that ZFS by design can't use the traditional *nix filesystem buffer cache. Instead it has to use its own ARC (adaptive replacement cache) with end-to-end checksumming, transparent compression, and copy-on-write semantics. This can lead to annoying performance problems when the two types of file system caches contend for available memory. There is a back-pressure mechanism, but it effectively pauses other writes while evicting dirty cache entries to release memory.
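
If you want to watch the two caches side by side, both are exposed on Linux. A rough sketch, assuming the standard OpenZFS kstats (field names can shift between releases):

    # current ARC size, target, and ceiling, in bytes
    awk '$1 == "size" || $1 == "c" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats

    # page cache and dirty data for comparison
    grep -E '^(Cached|Dirty|Writeback):' /proc/meminfo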

  • ryao 5 days ago

    Traditionally, you have the page cache on top of the FS and the buffer cache below the FS, with the two being unified such that double caching is avoided in traditional UNIX filesystems.

    ZFS goes out of its way to avoid the buffer cache, although Linux does not give it the option to fully opt out of it since the block layer will buffer reads done by userland to disks underneath ZFS. That is why ZFS began to purge the buffer cache on every flush 11 years ago:

    https://github.com/openzfs/zfs/commit/cecb7487fc8eea3508c3b6...

    That is how it still works today:

    https://github.com/openzfs/zfs/blob/fe44c5ae27993a8ff53f4cef...

    If I recall correctly, the page cache is also still above ZFS when mmap() is used. There was talk about fixing it by having mmap() work out of ARC instead, but I don’t believe it was ever done, so there is technically double caching done there.

    • taskforcegemini 5 days ago

      What's the best way to deal with this, then? Disable the Linux file cache? I've tried disabling/minimizing the ARC in the past to avoid the OOM reaper, but the ARC was stubborn and its RAM usage stayed where it was.

      • ssl-3 5 days ago

        I didn't have any trouble limiting zfs_arc_max to 3GB on one system where I felt that it was important. I ran it that way for a fair number of years and it always stayed close to that bound (if it was ever exceeded, it wasn't by a noteworthy amount at any time when I was looking).
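
        For reference, this is roughly what that looks like with OpenZFS on Linux; the 3 GiB value (3221225472 bytes) is just the cap I happened to pick:

          # cap the ARC at 3 GiB immediately (runtime tunable, needs root)
          echo 3221225472 > /sys/module/zfs/parameters/zfs_arc_max

          # and make it persist across reboots
          echo "options zfs zfs_arc_max=3221225472" >> /etc/modprobe.d/zfs.conf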

        At the time, I ran it this way because I was afraid that OOM events would cause [at least] unexpected weirdness.

        A few months ago I discovered weird issues with a fairly big, persistent L2ARC being ignored at boot due to insufficient ARC. So I stopped arbitrarily limiting zfs_arc_max and just let it do its default self-managed thing.

        So far, no issues. For me. With my workload.

        Are you having issues with this, or is it a theoretical problem?

lousken 5 days ago

I was assuming the OP wants to highlight filesystem use on a workstation/desktop, not on a file server/NAS. I had a similar experience a decade ago, but these days single drives just work, and the same goes for mirroring. For such setups btrfs should be stable; I've never seen a workstation with a RAID5/6 setup. Secondly, filesystems and volume managers are different things, even if btrfs and ZFS are essentially both.

For a NAS setup I would still prefer ZFS with TrueNAS Scale (or Proxmox if virtualization is needed), just because all these scenarios are supported there as well. And as far as ZFS goes, encryption is still something I am not sure about, especially since I want to use snapshots and send them as backups to a remote machine.
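
From what I have read, raw sends are the intended answer there: the stream stays encrypted in transit and on the receiving pool, and the target never needs the key. I have not battle-tested this myself, and the dataset/host names below are made up:

    # snapshot, then replicate it raw (-w) so it is never decrypted on the way out
    zfs snapshot tank/data@nightly
    zfs send -w tank/data@nightly | ssh backuphost zfs receive -u backup/data

    # later runs only send the delta between two snapshots
    zfs send -w -i tank/data@nightly tank/data@nightly2 | ssh backuphost zfs receive -u backup/data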

hooli_gan 5 days ago

RAID5/6 is not needed with btrfs. One should use RAID1, which stores the same data redundantly across multiple drives.

  • johnmaguire 5 days ago

    How can you achieve 2-disk fault tolerance using btrfs and RAID 1?

    • Dalewyn 5 days ago

      By using three drives.

      RAID1 is just making literal copies, so each additional drive in a RAID1 is a self-sufficient copy. You want two drives of fault tolerance? Use three drives, so if you lose two copies you still have one left.

      This is of course hideously inefficient as you scale larger, but that is not the question posed.

      • johnmaguire 5 days ago

        > This is of course hideously inefficient as you scale larger, but that is not the question posed.

        It's not just inefficient, you literally can't scale larger. Mirroring is all that RAID 1 allows for. To scale, you'd have to switch to RAID 10, which doesn't guarantee two-disk fault tolerance (you can get lucky if the two failures land in different mirror pairs, but that isn't fault tolerance).

        But you're right - RAID 1 also scales terribly compared to RAID 6, even before introducing striping. Imagine you have 6 x 16 TB disks:

        With RAID 6, usable space of 64 TB, two-drive fault tolerance.

        With RAID 1, usable space of 16 TB, five-drive fault tolerance.

        With RAID 10, usable space of 48 TB, one-drive fault tolerance.

      • ryao 5 days ago

        Btrfs did not support that until Linux 5.5, which added RAID1c3. Its RAID1 profile does not mirror across every member device; it just stores 2 copies of each block, no matter how many mirror members you have.
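
        For completeness, the profile is chosen at mkfs time (or converted later with btrfs balance); something like this, with placeholder device names:

          # classic btrfs raid1: exactly 2 copies, no matter how many devices
          mkfs.btrfs -m raid1 -d raid1 /dev/sdX /dev/sdY /dev/sdZ

          # raid1c3 (Linux 5.5+, btrfs-progs 5.5+): 3 copies, survives losing any 2 devices
          mkfs.btrfs -m raid1c3 -d raid1c3 /dev/sdX /dev/sdY /dev/sdZ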

brian_cunnie 5 days ago

> I have personally been bitten once (about 10 years ago) by btrfs just failing horribly on a single desktop drive.

Me, too. The drive was unrecoverable. I had to reinstall from scratch.