Comment by kevin_nisbet 9 hours ago
I'm with you. I think most people believe they don't need this reliability, until they do. I'm sure there is some subset of clusters where the claim is correct.
But the article suggests turning off fsync and expecting to lose only a few ms of updates, and that doesn't match my experience. I've tried to recover etcd on volumes that lied about fsync and then experienced a power outage, and I don't think we managed to recover it. There might be more options now to recover and ignore corrupted WAL entries, but at that time it was very difficult and I think we ended up just reinstalling from scratch. For clusters where this doesn't matter or the SLOs for recovery account for this, I'm totally on board, but only if you know what you're doing.
Similarly, the point from the article that "full control plane data loss isn’t catastrophic in some environments" is correct, depending on what the author means by "some environments". I don't think it's limited to those managed by GitOps, as suggested, but extends to any environment where there is enough resiliency and time to redeploy and do all the cleanup.
Anyways, like much advice on the internet, it's not good or bad, just highly situational, and some of the suggestions should only be applied if the implications are fully understood.