morgante 13 hours ago

The post-mortem is honest, but the infrastructure is well below what I'd expect from commercial services.

If a commercial provider told me they're dependent on a single physical server, with no real path or plans to fail over to another server if they need to, I would consider it extremely negligent.

It's fine to not use big cloud providers, but frankly it's pretty incompetent to not have the ability to quickly deploy to a new server.

AyyEye 3 hours ago

Poe's law. Lichess is 14 years old and their longest outage is less than 12 hours. Google and AWS have both had ~6-hour outages, and that's with billions of dollars depending on them and thousands of engineers. Simpler is working just fine.

lukhas 10 hours ago

We're an understaffed charity.

  • justinclift 4 hours ago

    As a general thought, any idea if people have looked at running something like Proxmox on the physical hardware, so the services can be put on VMs which can be migrated between hosts if there are problems?

  • morgante 10 hours ago

    Yeah I'm not criticizing it as a charity, just pointing out this definitely isn't "superior to most commercial services."

    That being said, removing dependence on single hardware nodes isn't something you need a big team for. I've done failover at 1-person startups.

KolmogorovComp 10 hours ago

And yet even Meta recently had a multi-hour downtime, despite a budget thousands if not millions of times higher. Would you call them negligent too?

By increasing the complexity, you multiply the failure points and increase ongoing maintenance, which is the bottleneck (even more than money) for volunteer-driven projects.

  • morgante 10 hours ago

    To be clear, you don't need to make it more complex / failure-prone. I didn't say failover needs to be automated.

    Kubernetes or complex cloud services are not required to have some basic deployment automation.

    You can do it with a simple bash script if you need to. It's just pretty surprising to see the reaction to a hardware failure being to wait around for it to be repaired instead of simply spinning up a new host.
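
As a rough illustration of the kind of manual, scripted failover described above, here is a minimal sketch in Python rather than bash. It only builds the list of commands an operator would run against a replacement host and executes nothing. The host name, backup path, staging path, `/srv` layout, and service names are all hypothetical and are not taken from Lichess's actual setup.

```python
import shlex


def build_failover_plan(new_host, backup_file, services):
    """Return the ordered shell commands for a manual failover (dry run only).

    All paths and service names are illustrative assumptions.
    """
    target = shlex.quote(new_host)
    remote_backup = "/tmp/restore.tar.gz"  # hypothetical staging path on the new host

    plan = [
        # 1. Copy the most recent backup to the replacement host.
        f"scp {shlex.quote(backup_file)} {target}:{remote_backup}",
        # 2. Unpack it into the (assumed) service directory.
        f"ssh {target} {shlex.quote(f'tar -xzf {remote_backup} -C /srv')}",
    ]
    # 3. Restart each service, in the order given.
    for service in services:
        remote_cmd = f"systemctl restart {shlex.quote(service)}"
        plan.append(f"ssh {target} {shlex.quote(remote_cmd)}")
    return plan


if __name__ == "__main__":
    for command in build_failover_plan(
        "new-host.example.org", "/backups/latest.tar.gz", ["app", "nginx"]
    ):
        print(command)
```

Printing the plan instead of running it keeps a human in the loop, which matches the point that failover does not need to be automated to be fast.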