Hacker News | mjs33's comments

Their DNS system failed? How?! Unless DNS stands for “Do Not Sell”


This happened to us at Hustle years ago. Basically, if you run on AWS, there's a DNS server provided inside each VPC that usually works fine but has no observable load metrics, so you don't really know you are slamming it and are about to have a problem unless you audit your entire codebase.

Why? Well, that tiny DNS server has certain capacity constraints, and if you don't cache DNS lookups, for example by using an HTTP/HTTPS agent (in Node.js), you wind up looking up the same DNS info over and over and churning sockets like it's going out of style. If you run really, really hot, the poor thing falls over (rightly so).

The limits are high and DNS is fast, so you usually don't notice, but when you are under load, bugs like this come out of the woodwork. When it falls down, you look up the AWS docs, lean back in your chair upon finding this isn't an "elastic" part of AWS, and say "FUUUUUUUUCK" so loud it can be heard from outer space.

If you are Robinhood though don’t you have some former Netflix SRE/DevOps beast on staff that knows this and so you run your own DNS and monitor it?


I read this and thought, “surely there’s an OS-level DNS cache?”

Apparently not on Linux! https://stackoverflow.com/questions/11020027/dns-caching-in-...


Well, there is https://www.freedesktop.org/software/systemd/man/systemd-res... but you may or may not think that's part of the "OS".


That's misleading. The way that this has worked for decades on Linux-based operating systems and on Unices is that one installs a local caching DNS proxy, choosing one of the many available: ISC's BIND, Bernstein's dnscache, unbound, dnsmasq, PowerDNS, MaraDNS, and so forth.

Every Unix system having a local caching DNS proxy was and is as much a norm as every Unix system having a local MTA. A quarter of a century ago, these would have been BIND and Sendmail. Things are more variable now.

To illustrate that this was considered the norm, here is a random book from the 1990s. Smoot Carl-Mitchell's _Practical Internetworking with TCP/IP and UNIX_ says, quite unequivocally:

> You must run a DNS server if you have Internet connectivity. The most common UNIX DNS server is the Berkeley Internet Name Daemon (BIND), which is part of most UNIX systems.

People sometimes think that this is not the case nowadays, and the fact that a computer is a personal computer magically means that a Unix or Linux-based operating system should offload this task and not perform it locally. They are wrong, and that is DOS Think. Ironically, they don't even get to play the resource allocation card nowadays. The amount of memory and network bandwidth that needs to be devoted to caching proxy DNS service on a personal computer is dwarfed by the amounts nowadays consumed by WWW browsers and HTTP(S).

There's no similar argument for a node in a datacentre.

Ideally, not only should every machine have a (forwarding/resolving) caching proxy DNS server, every organization (or LAN, or even machine) should have a local root content DNS server. A lot of (quite valid) DNS lookups stop at the root with fixed or negative answers. Stopping that from leaving the site/LAN/machine is beneficial.

Ironically, putting a forwarding caching proxy DNS service on the local end of any congested, slow, expensive, or otherwise limited link is advice that I and others have been handing out for over 20 years. It's exactly what one should be doing with things like Amazon's non-local proxy DNS server limited to 1024 packets/second/interface.

* http://jdebp.uk./FGA/dns-server-roles.html#ChoosingProxy

So the question is not whether a local DNS cache mechanism exists. It's whether it's set up by the company dishing out the VMs, and if not, why not. Amazon provides instructions on how to add dnsmasq and clearly labels this as a way to reduce DNS outages. So it's not even the case that Amazon is wrongly discouraging local caching proxy DNS servers.

* https://aws.amazon.com/premiumsupport/knowledge-center/dns-r...
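Such a forwarding cache is typically only a few lines of configuration; a sketch of a dnsmasq setup along these lines (the addresses and cache size are illustrative, not taken from the AWS article):

```
# /etc/dnsmasq.conf — local forwarding cache (illustrative values)
listen-address=127.0.0.1   # serve DNS only to this host
cache-size=1000            # number of records to cache locally
server=10.0.0.2            # forward misses to the VPC resolver (.2 of the VPC CIDR)

# Then point /etc/resolv.conf at the local cache:
#   nameserver 127.0.0.1
```

Repeated lookups are then answered from the local cache, and only cache misses count against the per-interface packet limit on the VPC resolver.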


The point of my comment wasn't to say "don't cache" but rather, don't expect that the OS is going to automatically do it for you (as would be the case on Windows and Mac).


I didn’t say they discourage usage of a dns cache at all.


Wait, what?? There's an invisible DNS server running inside your VPC? I get what you're saying wrt cached DNS lookups but this seems wild.


It's a DNS resolver that runs on the hypervisor hosting every instance.


Yes and they limit you to throwing 1024 packets per second per network interface at it.

Of course you could run your own dns cache per host/pod whatever.


you've got me so curious, could you please point me to the aws docs?


It’s the first thing on google when you google “aws dns vpc limits” but sure:

https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.htm...


Your VPC has a DNS server at .2 of your VPC CIDR block that is mounted via loopback on the dom0 and exposed to your VPC to let you do lookups via their DNS infra.

https://aws.amazon.com/premiumsupport/knowledge-center/vpc-e...


This allows them to hand out private network addresses (IIRC they use 172.x.x.x) when the DNS query happens from within AWS.


"Invisible?" I mean, everyone who builds AWS infra, even just single ec2 instances, is aware of it. It's definitely possible that application engineers aren't aware, though.


AWS should simply provide monitoring and alerting by default on these footnote service limits.


What scenarios cause this many DNS lookups, though? Connections should be kept alive after the IP translation, so if it's really new connections being set up constantly, wouldn't that show up as a major bottleneck first?


Running on Kubernetes this is easy, it's one of the first issues you hit.

Every DNS request for an external domain turns into 10 if you don't explicitly configure FQDNs (with a dot at the end). This is because, in the default configuration, the resolver runs with ndots:5 to search all the possible internal Kubernetes and cloud-provider names. Then you have lookups for IPv4 and IPv6 in parallel. So for every external name you look up, you storm the upstream DNS with 10 requests, most of them for non-existent domains.
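The expansion works roughly like this; a sketch of a typical in-pod resolv.conf (the search domains and nameserver address are illustrative):

```
# /etc/resolv.conf inside a pod (illustrative values)
search default.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
nameserver 10.96.0.10
options ndots:5
```

With ndots:5, a name like example.com has fewer than five dots, so the resolver first appends every search domain in turn (each tried as both an A and an AAAA query) before trying the bare name. Writing "example.com." with a trailing dot marks it as fully qualified and skips the search list entirely.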

Furthermore, the current default DNS service in Kubernetes doesn't have any kind of caching for these kinds of lookups (especially not NXDOMAIN) enabled.

But like I said, this is one of the first issues you hit running Kubernetes on Amazon. It is widely known and can easily be fixed by scaling up some more instances, changing ndots settings, using FQDNs, or configuring caching. There is no way this was the issue: it is plastered all over the internet, the logs are clear, and the fixes can be implemented in minutes.
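For reference, the ndots change mentioned above is a per-pod setting in Kubernetes; a sketch of the relevant pod-spec fragment (the value shown is illustrative):

```yaml
# Pod spec fragment: lower ndots so external names are tried as-is first
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "1"
```

With ndots:1, any name containing a dot (i.e. nearly every external domain) bypasses the search-domain expansion, collapsing the 10-query storm to the two A/AAAA lookups for the real name.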

It also doesn't go down completely, the rate-limiter is packets/s on the interface.


It’s easy to have tens of thousands of DNS lookups per second if you don’t know what you’re doing or didn’t pay attention. Connections wouldn’t be the bottleneck if they are outbound.


Finally a real, cheap solution for climate change!!!


Some problems with that:

- You can only offset a certain amount of warming with it. If you put too much aerosol into the stratosphere, the particles merge, grow larger, and precipitate out quite fast. The exact possible offset can only be estimated, but it is below the warming we're already committed to.

- You have to keep doing it. As soon as you stop you run into trouble very fast.

- In the models we see drastic circulation changes. For example the jet stream collapses. Do you want to test it in real life?

- The issue of ocean acidification still remains. The additional sulphuric acid in the environment won't help either.

- Ah, and of course it's not cheap. We do not yet have the tech to do it at the necessary scale.

The currently easiest, cheapest and safest way to fight climate change remains to stop burning fossil fuels.

(edit: And of course I get you're not being serious.)


As always, also worth noting that this is literally acid rain. Putting it in the stratosphere is intended to lessen the amount of acid rain per warming averted, but it's still the exact same chemicals that caused acid rain. There are very good reasons we removed sulfur from gasoline.

Acid rain is probably better than global warming, assuming it doesn't literally kill the ocean. It's good that we have a potential backstop: we could definitely halt warming in a very short time, and it might not kill all life. We can even produce the necessary sulfur for a short while. Or, like, we could stop warming without dumping incredible amounts of acid into the air we breathe.


> we could definitely halt warming in a very short time

The limit estimated (from models) here [0] is a forcing of at most -2 W/m^2.

The IPCC scenarios correspond to a range of plausible energy imbalances by the end of the century, from 2.6 W/m^2 (RCP 2.6) in the best case to 8.5 W/m^2 (RCP 8.5) in the worst. [1]

So even in the best case it might very possibly not be enough to halt everything. Maybe it can buy us some time, but I think the only viable path forward is to stop as soon as possible and then carbon-capture it back.

(See also this graphic [2] for possible pathways to stay below 1.5°C warming, all of which include carbon capture, up to half of current emissions starting in 10 years (in the best case by land-use changes, in all others by actively removing the carbon).)

[0] https://d-nb.info/1160958696/34

[1] https://www.ipcc.ch/sr15/chapter/chapter-1/

[2] https://www.ipcc.ch/site/assets/uploads/sites/2/2019/02/SPM3...


Stratospheric Aerosol Injection is exactly about artificially recreating the relevant aspects of this natural phenomenon.

It's not a fix, but might be an effective mitigation.

[1] https://en.wikipedia.org/wiki/Stratospheric_aerosol_injectio...


You know, I was thinking about that. Is there any way we could use this phenomenon to lower temps?

Like having a large, parasol-like cloud moving over the globe, reflecting sunlight.



It would require a very large eruption to make a worthwhile dent in the rate of warming; small-ish ones happen all the time with no strong effect. Even if such an eruption didn't have a lot of highly undesirable side effects, we simply don't have any way to trigger something like this. At any rate, the short-term consequences would be absolutely devastating and in no way cheap.

