Comment by bjourne 2 days ago
I don't understand why the author doesn't consider load balancing and failover legitimate use cases for a low TTL. Cause it wrecks their argument?
You are misunderstanding how HA works with DNS TTLs.
Now there are multiple kinds of HA, so we'll go over a bunch of them here.
Case 1: You have one host (host A) on the internet and it dies, and you have another server somewhere (host B) that's a mirror but with a different IP. When host A dies you update DNS so clients connect to host B instead. In that case a client will not connect to the new IP until its DNS resolver gets the new record. This was "failover" back in the day. It is dependent on the DNS TTL (and on the resolver, because many resolvers and caches ignore the TTL and use their own).
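A minimal sketch of that case 1 flow (assuming a hypothetical update_dns_record() hook standing in for whatever DNS provider API you actually use, and example addresses): the point is that even after the record is updated, clients that already cached the old answer keep hitting host A for up to TTL more seconds.

    import socket
    import time

    PRIMARY = "203.0.113.10"    # host A (example address)
    STANDBY = "198.51.100.20"   # host B (example address)
    TTL = 300                   # TTL on the A record

    def is_alive(ip, port=443, timeout=3):
        # Crude health check: can we open a TCP connection at all?
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return True
        except OSError:
            return False

    def update_dns_record(name, ip):
        # Placeholder for your DNS provider's API call (hypothetical).
        print(f"would point {name} at {ip}")

    while True:
        if not is_alive(PRIMARY):
            update_dns_record("www.example.com", STANDBY)
            # Clients with the old answer cached keep trying host A for up
            # to TTL more seconds; with TTL=300 that's a 5-minute outage
            # window even after this "failover" has completed.
            break
        time.sleep(10)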
In this case a high TTL is bad, because the user won't be able to connect to your site for TTL seconds + some other amount of time. This is how everyone learned it worked, because this is the way it worked when the inter webs were new.
Case 2: Instead of one DNS record with one host you have a DNS record with both hosts. The clients will theoretically choose one host or the other (round robin). In reality it's unclear whether they actually do that. Anecdotal evidence shows that it worked until it didn't, usually during a demo to the CEO. But even if it did, that means 50% of your requests will hit an X-second timeout as the clients try to connect to a dead host. That's bad, which is why nobody in their right mind did it. And some clients always picked the first host, because that's how DNS clients are sometimes.
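Roughly what a well-behaved client has to do in that case 2 setup (a sketch; the name and port are placeholders): resolve all the records and walk the list, eating a connect timeout whenever it lands on the dead host first. A naive client that only looks at the first address never gets past it.

    import socket

    def connect_any(host, port, timeout=5):
        # getaddrinfo returns every A/AAAA record for the name; a naive
        # client just takes the first entry, which is why "round robin"
        # failover often doesn't behave the way people expect.
        last_err = None
        for family, type_, proto, _, addr in socket.getaddrinfo(
                host, port, type=socket.SOCK_STREAM):
            try:
                return socket.create_connection(addr[:2], timeout=timeout)
            except OSError as e:
                last_err = e   # dead host: we just burned `timeout` seconds
        if last_err is not None:
            raise last_err
        raise OSError(f"no addresses for {host}")

    # conn = connect_any("www.example.com", 443)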
Putting a load balancer in front of your hosts solves this. Do load balancers die? Yeah, they do. So you need two load balancers...which brings you back to case 1.
These are the basic scenarios that a low DNS TTL fixes. There are other, more complicated solutions, but they're really specialized and require more control of the network infrastructure...which most people don't have.
This isn't an "urban legend" as the author states. These are hard-won lessons from the early days of the internet. You can also not have high availability, which is totally fine.
Being specific: AWS load balancers use a 60 second DNS TTL. I think the burden of proof is on TFA to explain why AWS is following an "urban legend" (to use TFA's words). I'm not convinced by what is written here. This seems like a reasonable use case by AWS.
You don't want to add too many A/AAAA records, or your response gets too big and you run into fun times. IIRC, you can do about 8 of each before you get to the magic 512-byte length (yeah, you're supposed to be able to do more, 1232 bytes as of 2020-10-01, but if you can fit in 512 bytes, you might have better results on a few networks that never got the memo).
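Rough back-of-the-envelope math for that 512-byte limit (a sketch, assuming a short query name and a 2-byte compression pointer as the owner name of each answer; these are idealized numbers, not a real packet): with ideal compression you can squeeze in around 10 of each, and longer names, CNAME chains, and authority/additional records pull that down toward the 8-ish figure.

    # Classic DNS-over-UDP payload limit without EDNS.
    LIMIT = 512

    HEADER = 12                                      # fixed DNS header
    QNAME = len(b"\x03www\x07example\x03com\x00")    # 17 bytes encoded
    QUESTION = QNAME + 4                             # + qtype/qclass

    # Each answer RR, assuming the owner name is a compression pointer
    # back to the question: ptr(2) + type(2) + class(2) + ttl(4)
    # + rdlength(2) + rdata.
    A_RR = 2 + 2 + 2 + 4 + 2 + 4      # 16 bytes per A record
    AAAA_RR = 2 + 2 + 2 + 4 + 2 + 16  # 28 bytes per AAAA record

    for n in range(1, 16):
        size = HEADER + QUESTION + n * (A_RR + AAAA_RR)
        print(n, size, "fits" if size <= LIMIT else "truncated")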
And then if you're dealing with browsers, they're not the best at trying everything, or they may wait a long time before trying another host if the first is non-responsive. For browsers and rotations that really do change, I like a 60-second TTL. If the rotation is pretty stable, I go with 15 minutes most of the time and crank it down before intentional changes.
If you've got a smart client that will get all the answers, and reasonably try them, then 5-60 minutes seems reasonable, depending on how often you make big changes.
All that said, some caches will keep your records basically forever, and there's not much you can do about that. Just gotta live with it.
It's not good as a first line of defense for failover, but with some client software and/or failure mechanisms there aren't any better approaches I'm aware of. Some of the software I administer doesn't understand multiple A/AAAA records.
And a BGP failure is a good example too. It doesn't matter how resilient the failover mechanisms for one IP are if the routing tables are wrong.
Agreed about some providers enforcing a larger one, though. DNS propagation is wildly inconsistent.
Perhaps, as most these days are using Anycast [1] to do failovers. It's faster and not subject to all the oddities that come with every application having its own interpretation of the DNS RFCs (most notably Java and all its workarounds that people may or may not be using), plus all the assorted recursive cache servers that have their own quirks, which makes Anycast a more reliable and predictable choice.
Agreed; I have no idea how you'd implement that across multiple ASNs, which is definitely a requirement for multi-cloud or geo-redundant architectures.
Seems like you'd be trying to work against the basic design principles of Internet routing at that point.
Because unless your TTL is exceptionally long you will almost always have a sufficient supply of new users to balance. Basically you almost never need to move old users to a new target for balancing reasons. The natural churn of users over time is sufficient to deal with that.
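A toy way to see that (a sketch with made-up numbers): if clients simply re-resolve whenever their cached answer expires, the share of traffic on a newly added target converges to its fair share within roughly one TTL, without forcibly moving anyone.

    import random

    TTL = 300          # seconds
    CLIENTS = 10_000
    TARGETS = ["old-1", "old-2", "new-3"]   # new-3 just joined the rotation

    # Each client cached an answer at some point in the past TTL window,
    # pointing at one of the old targets.
    cache = [("old-1" if random.random() < 0.5 else "old-2",
              random.uniform(0, TTL)) for _ in range(CLIENTS)]

    for t in range(0, TTL + 1, 60):
        refreshed = []
        for target, expiry in cache:
            if t >= expiry:
                # Cache expired: re-resolve and pick a target at random.
                target, expiry = random.choice(TARGETS), t + TTL
            refreshed.append((target, expiry))
        cache = refreshed
        share = sum(1 for tgt, _ in cache if tgt == "new-3") / CLIENTS
        print(f"t={t:4d}s  share on new target: {share:.0%}")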
Failover is different and more of a concern, especially if the client doesn't respect multiple returned IPs.