Comment by gschier
How do you deal with drive failures? How often does a Railway team member need to visit a DC? What's it like inside?
We built some internal tooling to help manage the hosts. Once a host is onboarded onto it, provisioning a QEMU VM takes a few button clicks on an internal dashboard. We wrote a custom Ansible inventory plugin so we can manage these VMs the same way we manage machines on GCP.
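To make the inventory idea concrete, here is a minimal sketch of turning VM records into the JSON shape that Ansible accepts from a dynamic inventory source. This is not the plugin described above: the record fields (`name`, `group`, `address`), the group name `metal_vms`, and the example data are all hypothetical, and a real inventory plugin would be a class loaded by Ansible rather than a standalone function.

```python
import json
from collections import defaultdict


def build_inventory(vms):
    """Return a dict in Ansible's dynamic-inventory `--list` JSON shape.

    `vms` is a list of dicts with hypothetical keys: name, group, address.
    """
    groups = defaultdict(lambda: {"hosts": []})
    hostvars = {}
    for vm in vms:
        groups[vm["group"]]["hosts"].append(vm["name"])
        # ansible_host tells Ansible which address to connect to.
        hostvars[vm["name"]] = {"ansible_host": vm["address"]}
    inventory = dict(groups)
    # _meta.hostvars lets Ansible read all host variables in one pass.
    inventory["_meta"] = {"hostvars": hostvars}
    return inventory


# Hypothetical records, standing in for whatever the internal tooling returns.
EXAMPLE_VMS = [
    {"name": "vm-a", "group": "metal_vms", "address": "10.0.0.5"},
    {"name": "vm-b", "group": "metal_vms", "address": "10.0.0.6"},
]

if __name__ == "__main__":
    print(json.dumps(build_inventory(EXAMPLE_VMS), indent=2))
```

Because the output has the same shape regardless of where a machine lives, playbooks written for cloud hosts can target these VMs without changes, which is presumably the point of the plugin.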
The host runs a custom daemon that programs FRR (an OSS routing stack) so that it advertises the addresses assigned to a VM to the rest of the cluster via BGP. That means no configuration of network switches or anything else is required after the initial setup.
We'll blog about this system at some point in the coming months.
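Until that post exists, here is a rough sketch of the BGP advertisement step, assuming the daemon drives FRR through `vtysh` and announces each VM address as a host route with a `network` statement. Both are assumptions: the real daemon may use a different FRR interface or route redistribution instead. The function names and the ASN 65000 (from the private range) are hypothetical. The code only builds the commands and does not execute them.

```python
import ipaddress


def frr_advertise_commands(local_asn, vm_address):
    """Build FRR CLI commands that announce one VM address as a host route."""
    addr = ipaddress.ip_address(vm_address)
    family = "ipv4" if addr.version == 4 else "ipv6"
    # A host route: /32 for IPv4, /128 for IPv6.
    prefix = f"{addr}/{addr.max_prefixlen}"
    return [
        "configure terminal",
        f"router bgp {local_asn}",
        f"address-family {family} unicast",
        f"network {prefix}",
        "exit-address-family",
    ]


def as_vtysh_argv(commands):
    """Turn a command list into an argv for `vtysh -c ... -c ...`."""
    argv = ["vtysh"]
    for command in commands:
        argv += ["-c", command]
    return argv


if __name__ == "__main__":
    # Hypothetical ASN and VM address, for illustration only.
    print(as_vtysh_argv(frr_advertise_commands(65000, "10.0.0.5")))
```

A real daemon would also need to withdraw the route when a VM is removed, and to make sure the prefix is present in the host's routing table if FRR is checking for that before advertising.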
Everything is dual-redundant. We run RAID, so a single drive failure is fine: alerting pages on-call, who then dispatch remote hands onsite, and we keep spares for everything in each datacenter.