Comment by gschier a year ago

How do you deal with drive failures? How often does a Railway team member need to visit a DC? What's it like inside?

justjake a year ago

Everything is dual-redundant. We run RAID, so a single drive failure is fine; alerting pages on-call, who then call in remote hands onsite. We keep spares for everything in each datacenter.
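To make the "drive fails, alert fires" step concrete, here is a minimal sketch. The thread does not say whether the RAID is hardware or software, or what the alerting stack is, so everything here is an assumption: it supposes Linux software RAID (md) and a hypothetical helper, `degraded_arrays`, that scans the text of `/proc/mdstat` for arrays with a missing member.

```python
import re


def degraded_arrays(mdstat_text: str) -> list[str]:
    """Return names of md arrays whose member map shows a missing disk.

    Assumes the usual /proc/mdstat layout: an "mdN : ..." header line
    followed by a status line ending in a map such as [UU] or [U_],
    where "_" marks a failed or absent member.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if current and status:
            if "_" in status.group(1):
                degraded.append(current)
            current = None
    return degraded


# Hypothetical sample input, not taken from the thread.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sdb2[1](F) sda2[0]
      104320 blocks [2/1] [U_]
"""

print(degraded_arrays(SAMPLE))
```

On a real host the input would come from reading `/proc/mdstat`, and a non-empty result would be what pages on-call. The array keeps serving from the surviving disk in the meantime.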


gschier a year ago

How much additional overhead is there in managing bare metal vs. cloud? Is it mostly fine after the big initial setup effort?

- ca508 a year ago
  
  We built some internal tooling to help manage the hosts. Once a host is onboarded onto it, provisioning a QEMU VM takes a few button clicks on an internal dashboard. We wrote a custom Ansible inventory plugin so we can manage these VMs the same way we manage machines on GCP.
  Each host runs a custom daemon that programs FRR (an OSS routing stack) so that it advertises the addresses assigned to a VM to the rest of the cluster via BGP. That means no configuration of network switches, etc. is required after initial setup.
  We'll blog about this system at some point in the coming months.
  
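As a rough illustration of the BGP part of ca508's reply: one way a host daemon could tell FRR which VM addresses to advertise is to render `network` statements for each address as a host route (/32 for IPv4, /128 for IPv6). This is only a sketch under assumptions. The thread does not describe how Railway's daemon actually programs FRR, and the function name `render_frr_bgp`, the ASN 65001, and the sample addresses are all hypothetical.

```python
import ipaddress


def render_frr_bgp(local_asn: int, vm_addresses: list[str]) -> str:
    """Render an FRR-style BGP stanza announcing each VM address as a host route.

    Hypothetical helper: neighbor/peering configuration is deliberately
    left out, so this fragment alone does not establish any BGP session.
    """
    v4, v6 = [], []
    for addr in vm_addresses:
        ip = ipaddress.ip_address(addr)
        # max_prefixlen is 32 for IPv4 and 128 for IPv6, giving a host route.
        net = ipaddress.ip_network(f"{ip}/{ip.max_prefixlen}")
        (v4 if ip.version == 4 else v6).append(net)

    lines = [f"router bgp {local_asn}"]
    for family, nets in (("ipv4", v4), ("ipv6", v6)):
        if not nets:
            continue
        lines.append(f" address-family {family} unicast")
        for net in sorted(nets):
            lines.append(f"  network {net}")
        lines.append(" exit-address-family")
    return "\n".join(lines) + "\n"


print(render_frr_bgp(65001, ["10.0.0.5", "10.0.0.4", "2001:db8::1"]))
```

Because the routes originate on the host and propagate over BGP, adding or moving a VM changes only the host's own routing config. That fits the "no switch configuration after initial setup" point in the reply. Whether the prefixes are actually announced depends on the rest of the FRR configuration (peers, policy), which is not covered here.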