Comment by 8cvor6j844qw_d6
Comment by 8cvor6j844qw_d6 5 days ago
I'll be interested in the incident writeup since DNS is mentioned. It would be interesting if it turns out to be similar to what happened at AWS.
That RCA was fun. A distributed system with members that don't know about each other, don't bother with leader elections, and basically stomp all over each other updating the records. It "worked fine" until one of the members had slightly increased latency and everything cascade-failed from there. I'm sure there was missing (internal) context but it did not sound like a well-architected system at all.
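To make the failure mode concrete, here is a rough sketch of uncoordinated writers where one is delayed. This is a toy model, not AWS's actual design: `RecordStore`, `cleanup`, `simulate`, the version numbers, and the `db.example.internal` name are all made up for illustration, and the cleanup step is an assumption about how a stale write could turn into an empty record set.

```python
from dataclasses import dataclass, field


@dataclass
class RecordStore:
    """Toy DNS record store; `version` is the generation of the applied plan."""
    version: int = 0
    records: dict = field(default_factory=dict)

    def apply_blind(self, version, records):
        # Last writer wins: nothing checks whether the plan is newer than what is live.
        self.version = version
        self.records = dict(records)
        return True

    def apply_checked(self, version, records):
        # Compare-and-set style guard: refuse any plan that is not newer than the live one.
        if version <= self.version:
            return False
        self.version = version
        self.records = dict(records)
        return True


def cleanup(store, newest_version):
    # Hypothetical janitor: deletes records that belong to a plan older than the newest known one.
    if store.version < newest_version:
        store.records = {}


def simulate(checked):
    store = RecordStore()
    old_plan = (1, {"db.example.internal": ["10.0.0.1"]})
    new_plan = (2, {"db.example.internal": ["10.0.0.2"]})
    apply = store.apply_checked if checked else store.apply_blind

    apply(*new_plan)  # the fast member applies the newer plan first
    apply(*old_plan)  # the slow member wakes up late and applies its stale plan
    cleanup(store, newest_version=2)
    return store.records


print(simulate(checked=False))  # blind writes: the stale plan wins, then cleanup empties it
print(simulate(checked=True))   # guarded writes: the stale plan is rejected
```

With blind writes the delayed member overwrites the newer plan and the cleanup pass then removes everything. With the version guard the stale write is simply dropped. A leader election or a single serialized writer would be another way to get the same property.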
It's pretty unlikely. AWS published a public 'RCA' (https://aws.amazon.com/message/101925/). It describes a race condition in a DNS 'record allocator' that caused all DNS records for DDB to be wiped out.
I'm simplifying a bit, but I don't think it's likely that Azure has a similar race condition wiping out DNS records on _one_ system that then propagates to all others. The similarity might just end at "it was DNS".