Comment by Bender 5 days ago

In my opinion, if there are no overlapping networks, and the Infrastructure as Code understands pods, k8s and such, then /etc/hosts can speed up resolution inside the data-center while everything outside the data-center continues to use DNS. It can make sense, but it requires some critical thinking about how all the inter-dependencies in the data-center play together and how fail-overs are handled.

Why aren't cloud providers and FAANGs doing this already?

This probably requires that everyone touching the Infrastructure as Code is a critical thinker and fully understands the implications of mapping applications to hosts, including but not limited to applications having their own load-balancing mechanisms, fail-over IP addresses, application state and ARP timeouts, and broadcast and multicast discovery. It can be done, but I would expect large companies to avoid this potential complexity trap. It might work fine in smaller companies that have only senior/principal engineers. Using /etc/hosts for boot-strapping critical infrastructure nodes required for dynamic DNS updates could still make sense in some cases. Point being, this gets really complex, and whatever is managing the Infrastructure as Code would have to be fully aware of every level of abstraction: NATs, SNATs, hair-pin routes, load-balanced virtual servers and origin nodes. Some companies are so big and complex that no one human can know the whole thing, so everyone's siloed knowledge has to be merged into this Infrastructure as Code beast. Recursive DNS, on the other hand, only has to know the correct upstream resolvers to use, or whether it is supposed to talk directly to the root DNS servers. That simplifies the layers upon layers of abstraction that manage their own application mapping and DNS.

Another trap people get lured into is split views (split-horizon DNS), which should be avoided because they grow into their own complexity trap over time and break sites when one dependency starts to interfere with another. Everyone has to learn this the hard way for themselves.

My preference would instead be to make DNS more resilient. Running Unbound [1] on every node, pointed at a group of edge DNS resolvers for external addresses, with settings tuned to retry and keep state on the fastest upstream resolvers, cache infrastructure addresses and their state, and set realistic min/max DNS TTLs, is a small step in the right direction. Dev/QA environments should also enable query logging to a tmpfs mount to help debug application misconfigurations and spot less-than-optimal uses of DNS in infrastructure and application settings before anything reaches staging or production. Grab statistical data from Unbound on every node and ingest it into some form of big-data/AI web interface so questions about resolution, timing and errors can be analyzed.
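
For what it's worth, a minimal sketch of what that per-node Unbound config could look like; the forwarder addresses, TTL bounds and the pinned record below are made-up placeholders, not a recommendation:

  server:
      # keep popular names warm: answer from cache, refresh in the background
      prefetch: yes
      # clamp TTLs to realistic bounds (placeholder values)
      cache-min-ttl: 60
      cache-max-ttl: 86400
      # dev/QA only: query logging to a tmpfs mount
      # use-syslog: no
      # log-queries: yes
      # logfile: "/run/unbound/queries.log"
      # pin a critical infrastructure name locally (example record)
      local-data: "lb.corp.net. A 10.0.0.1"

  forward-zone:
      name: "."
      # edge resolver group (placeholder addresses)
      forward-addr: 10.0.0.53
      forward-addr: 10.0.1.53

On the statistics side, unbound-control stats_noreset dumps the per-node counters that whatever big-data/AI interface you land on could ingest.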

This is just my two cents based on my experience. If it seems like I was just spewing words, it's because I was watching Shane Gillis and did not want to turn it off.

[1] - https://unbound.docs.nlnetlabs.nl/en/latest/manpages/unbound...

notepad0x90 5 days ago

Thanks for the well-thought-out response, friend :)

You made some really good points. But here is my follow-up: with /etc/hosts, there is no need to complicate things. For example:

10.0.0.1 sql.app.local storage.local lb.corp.net

This line could be present on every host on every network, everywhere. The only thing that should matter, in my opinion, is that the name portion needs to be very specific. Even if you have NAT, SNAT, etc., /etc/hosts is only relevant to the host attempting to resolve a name, and that host already knows what name to use.

So long as you have one big-and-flat /etc/hosts everywhere, you just have to make sure that whenever you change an IP for a service, the global /etc/hosts reflects that change. And of course the whole DevOps pipeline of tests, reviews, etc. ensures you don't screw that up.

Back in the day, this was a really bad idea because the problem of managing /etc/hosts at scale wasn't solved. But it is just a configuration file, which is exactly what IaC is best suited for.
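
As a sketch of what "just a configuration file managed by IaC" could look like, here is roughly how a single Ansible task could stamp a reviewed service map onto every host (the module is real; the marker text and the record are placeholders I made up):

  - name: render the global service map into /etc/hosts
    ansible.builtin.blockinfile:
      path: /etc/hosts
      marker: "# {mark} MANAGED SERVICE MAP"
      block: |
        10.0.0.1 sql.app.local storage.local lb.corp.net

The source of truth is then one file in git, and the devops tests and reviews I mentioned are just the normal pull-request flow on that file.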

DNS, on the other hand, is a complex system with hierarchies, zones, different record types, aliases, TTLs, caches, and more. In a closed private network, is DNS really worth it when you have already invested in IaC?

  • Bender 5 days ago

    So long as you have one big-and-flat /etc/hosts everywhere

    I get where you are coming from, and in a small to almost-medium company that might work, but at some point networks and environments managed by many different teams will start to conflict, or fail to resolve things, until someone opens a ticket to update the other department's Infrastructure as Code. In my experience teams and orgs want to have control over their own thing, and while they could logically all share commit access to one big flat file, that will start to introduce artificial problems.

    I could be wrong; perhaps in your company it will work out just fine. Since nobody here on HN knows the logical and physical structure of your company, maybe pull together a meeting of leaders from each team/org that currently influences DNS records and ask them to pick apart your idea, after documenting it in a way everyone can visually understand: how the code repositories and multi-department git permissions would be laid out, how each team would be able to independently add, change and delete records whenever they need to, and how they would review audit logs both in the repositories and possibly on each node. My views could be skewed by all the politics that naturally occur as organizations grow. For what it's worth, I was at a company that had a multi-data-center-wide /etc/hosts, and it was just dandy when the company was small. We outgrew it by the second iteration of our data-centers.

    • notepad0x90 5 days ago

      You make a good point. I'm still a bit stuck on the conflict part, since you can have multiple names, but I can envision multiple teams wanting to use db.local or something, and if you're providing services internally, that could be hard to scale for sure. I'd like to think that the people avoiding pesky tickets and all that would end up causing outages by just moving their conflict into DNS instead, but what do I know.

      In the end, I trust your experience over my opinion. Thank you.

      • Bender 4 days ago

        but I can envision multiple teams wanting to use db.local or something

        They could just use service1.region1.db.local, but the trick is to get all the teams to agree to this, or to have a top-down decision from leaders in a new greenfield data-center design. Only you and your coworkers can really decide whether this works. I hope it works out.
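
        In /etc/hosts terms that convention is just another column on the managed lines, e.g. (both addresses below are made-up placeholders):

          10.0.12.4   service1.region1.db.local
          10.64.12.4  service1.region2.db.local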