Comment by morgante 10 months ago

The post-mortem is honest, but the infrastructure is well below what I'd expect from commercial services.

If a commercial provider told me they're dependent on a single physical server, with no real path or plans to fail over to another server if they need to, I would consider it extremely negligent.

It's fine to not use big cloud providers, but frankly it's pretty incompetent to not have the ability to quickly deploy to a new server.

AyyEye 10 months ago

Poe's law. Lichess is 14 years old and their longest outage is less than 12 hours. Google and AWS have both had ~6 hour outages, and that's with billions of dollars depending on them and thousands of engineers. Simpler is working just fine.

lukhas 10 months ago

We're an understaffed charity.

justinclift 10 months ago

As a general thought, any idea if people have looked at something like (for example) using Proxmox on the physical hardware so the services can be put on VMs which can be migrated between hosts if there are problems?

morgante 10 months ago

Yeah I'm not criticizing it as a charity, just pointing out this definitely isn't "superior to most commercial services."
That being said, removing dependence on single hardware nodes isn't something you need a big team for. I've done failover at 1-person startups.

KolmogorovComp 10 months ago

And yet even Meta recently had a multi-hour outage, despite a budget thousands if not millions of times higher. Would you call them negligent too?

By increasing the complexity you multiply the failure points and increase ongoing maintenance, which is the bottleneck (even more than money) for volunteer-driven projects.

morgante 10 months ago

To be clear, you don't need to make it more complex / failure-prone. I didn't say failover needs to be automated.
You don't need Kubernetes or complex cloud services to have some basic deployment automation.
You can do it with a simple bash script if you need to. It's just pretty surprising to see the reaction to a hardware failure being to wait around for it to be repaired instead of simply spinning up a new host.
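
As a rough illustration of how small such a script can be, here is a minimal sketch, written in Python rather than bash. Everything specific in it is an assumption made up for the example: the host name, the directories, and the `app.service` unit are hypothetical and say nothing about how Lichess is actually deployed. It only assumes the new host is reachable over SSH and runs the service under systemd.

```python
import shlex
import subprocess
from typing import List


def build_deploy_plan(host: str, release_dir: str, remote_dir: str,
                      service: str) -> List[List[str]]:
    """Return the commands that would bring `service` up on a fresh `host`.

    All arguments are hypothetical placeholders for this sketch.
    """
    source = release_dir.rstrip("/") + "/"  # trailing slash: copy contents
    return [
        # Copy the release onto the new host, removing stale files there.
        ["rsync", "-az", "--delete", source, f"{host}:{remote_dir}/"],
        # (Re)start the service so it picks up the copied release.
        ["ssh", host, "sudo", "systemctl", "restart", service],
        # Exits non-zero if the service did not come up.
        ["ssh", host, "systemctl", "is-active", service],
    ]


def deploy(plan: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; run it too unless dry_run is set."""
    for cmd in plan:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True aborts the deployment at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run against a made-up host: only prints what would be executed.
    plan = build_deploy_plan("new-host.example", "./release", "/srv/app",
                             "app.service")
    deploy(plan, dry_run=True)
```

Failover here stays manual: someone decides to switch hosts, runs the script against the replacement machine, and then repoints DNS or the load balancer by hand.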
