
Have they implemented IP failover already? I haven't heard anything. Having your own LB without being able to fail over the IP is not HA. If the LB goes down, so does your business.



What?

You use DNS failover and multiple load balancers.

FOO.COM A record -> 1.2.3.4, 1.2.3.5, 1.2.3.6

Then at 1.2.3.4, 1.2.3.5, and 1.2.3.6 you put a load balancer that splits the load across all of your backend servers.

Any LB goes down, and DNS client retries will deal with it. If any backend server goes down, your LB will deal with it.
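
Roughly what a retrying client ends up doing, as a stdlib-only Python sketch (example.com and port 80 here are just stand-ins for your own domain and service, not anything from my actual setup):

    import socket

    def connect_any(host, port=80, timeout=5):
        # getaddrinfo hands back every A record the resolver returned
        addrs = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        last_err = None
        for family, socktype, proto, _, sockaddr in addrs:
            try:
                # Try each address in turn; a dead LB only costs one timeout
                with socket.socket(family, socktype, proto) as s:
                    s.settimeout(timeout)
                    s.connect(sockaddr)
                    return sockaddr[0]  # this LB answered
            except OSError as err:
                last_err = err  # dead or unreachable, move on to the next record
        raise last_err

    print("connected to", connect_any("example.com"))

Real browsers do their own (more aggressive) version of this, but the shape is the same: resolve once, then walk the address list until something answers.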

Using this pretty successfully at Digital Ocean right now. What is the downside? I guess client DNS retries take a few seconds, but for the rare case of a load balancer dying, that doesn't seem like a deal breaker.


That is not how things work. Once your system resolves the DNS record it will keep using that record for a while (depending on the TTL of the record and other factors).

Your browser will also cache the result of the DNS lookup, and if that server goes down it will not do another lookup for a different host, so your service will be unavailable.

It will also be unavailable for any new customer that gets the "faulty" IP address.

Specifying multiple DNS records will just cause your DNS server to use one of those, usually in a round robin fashion.


TTL does not matter because I am not adding or removing systems from my DNS record. Even during an outage, a request to my domain name will return both the broken and the working load balancers.

I am simply giving a list of servers that can answer a request; clients know to keep trying until one works. (Which they all do. Try it!)


Basecamp/SignalvsNoise/37signals had an article up on how they used Dyn.com's DNS service to achieve something like this, but I can't seem to find it. They had some nice graphs for when they tested it out.

Edit: My Google skills are poor, but I found it here: https://signalvnoise.com/posts/3857-when-disaster-strikes


Thanks for the -1 on a true statement about my own hosting setup!


"Why is DNS failover not recommended?" http://serverfault.com/questions/60553/why-is-dns-failover-n...

Among other things, IE7 will pin IPs for 30 minutes, non-browser clients may have serious issues, etc.


Yah for sure, it is not perfect but it is pretty good.

In my use case, I don't support IE7 (it won't work at all on my SaaS app), and I only support browser clients.

I have simulated LB failures by killing nginx, and watched traffic flow over to the other LB without a big delay (in 30 seconds everyone was over).

Fancier IP failover is nice for sure, and would let some more enterprisey people in, but for a lot of apps out there, DNS failover works great. I'm surprised by how many people above don't realize it exists or works so well (for so little effort).


Killing nginx is good for testing "load balancer application crashed", but it's insufficient for testing "load balancer host mysteriously vanished"; for that, I would set your firewall to drop incoming SYNs on the load-balanced port. You'll see a much bigger client-side delay when there's no response at all than when there's a quick "port closed" response.
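
If you want to see that difference from the client side, here's a rough Python timing sketch (it assumes nothing is listening on 127.0.0.1:9 and uses 203.0.113.1 from TEST-NET-3 as a stand-in for a host that silently drops SYNs; your network may treat that address differently):

    import socket, time

    def timed_connect(addr, timeout=10):
        start = time.monotonic()
        try:
            socket.create_connection(addr, timeout=timeout).close()
            outcome = "connected"
        except OSError as err:
            outcome = "failed (%s)" % err
        return outcome, time.monotonic() - start

    # Closed port -> quick refusal; blackholed host -> you wait out the timeout
    for addr in [("127.0.0.1", 9), ("203.0.113.1", 80)]:
        outcome, elapsed = timed_connect(addr)
        print(addr, outcome, "after %.1fs" % elapsed)

The refused connection comes back almost instantly, while the dropped-SYN case eats the whole timeout, which is the extra client-side delay you're describing.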


For sure, thanks for the tip!

Like I said in previous posts, I don't think it is the be-all, end-all rock-solid load balancer answer. But I like to sleep through the night, and if the cost is a short pause on the one night a year that a load balancer crash happens, my uptime is still way higher than most of the internet's.


> Among other things, IE7 will pin IPs for 30 minutes

What, regardless of TTL? That's gross.


> I guess client DNS retries take a few seconds

Have you tested this in all the browsers? According to this ServerFault post[1], it could take minutes for an IP address to be considered "down" in Chromium before it cycles to the next one; Firefox apparently waits 20 seconds[2]. Those posts are dated 2011 but I can't imagine the behavior would've changed a whole lot since then. A user is not going to wait multiple minutes or even 20 seconds for a web page to render - it's effectively down.

IP failover with heartbeat or keepalived seems like a much better solution to me when feasible.

[1]: http://serverfault.com/a/328321/85897

[2]: https://bugzilla.mozilla.org/show_bug.cgi?id=641937


I have tested it in production, and seen traffic move! I am honestly not sure I ever had a customer access my site in Chromium, so that is not a deal breaker either way (assuming it wasn't also a Chrome bug).

Hacker News takes > 20 seconds to load all the time. You mash reload and go on with your life.

I think people get so hung up on "I must have the most optimal HA setup in the history of the world" that they end up having no HA, or spending thousands upon thousands of dollars to build some elaborate AWS Rube Goldberg device that lets them check off a bunch of HA boxes. I know a lot of people who did that, and their fancy AWS HA contraption totally failed in the real world because the entire US-EAST region went down and their operation depended on at least one availability zone there working to stay up. Look how much effort Netflix puts into HA, and how many hours a year they are totally broken.

For each application, you have many competing desires. You can have an HA website or web application without using IP failover. IP failover is cool, but not without its own problems. Every solution has pros and cons. DNS round robin is not a bad solution for many classes of apps that want dead-simple failover.


> Any LB goes down, and DNS client retries will deal with it.

How? How does the DNS client know that the IP no longer works? Do browsers today have this mechanism?

I'm not a network guy, so perhaps I'm wrong, but it's my understanding that the problem with DNS load balancing is that you cannot invalidate the TTL on the client.


It is up to the client. But all of the clients (browsers) out there do more or less the same thing: they try the first DNS record... if no response in ~30 seconds, try the second, and so on down the list.

TTL does not matter here because I am not yanking records from or adding records to my DNS entry. I am simply saying "Here are 3 servers; try them in order until you find one that works".

In practice, a helpful pair of behaviors is:

a) Most clients try the records in order, from top to bottom.

b) Most DNS servers (including Digital Ocean's) randomize the return order.

So if you do two DNS requests, the first will return 1.2.3.4, 1.2.3.5, 1.2.3.6, and the second will return 1.2.3.5, 1.2.3.6, 1.2.3.4.

This has the double benefit of splitting traffic more or less evenly between my load balancers, and dealing with the case where one or more of them is dead.
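
If you want to peek at what your resolver actually hands back, here's a stdlib-only Python sketch (example.com stands in for a domain with several A records; note your OS may re-sort the list and your stub resolver may cache, so a dig straight at the authoritative server shows the rotation more directly):

    import socket

    HOST = "example.com"  # stand-in for your multi-A-record domain

    for i in range(3):
        addrs = [info[4][0] for info in
                 socket.getaddrinfo(HOST, 80, proto=socket.IPPROTO_TCP)]
        print("lookup %d: %s" % (i + 1, addrs))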


I'm not sure all clients will behave as you are experiencing. But in any case:

> if no response in ~30 seconds, try the second

That is not HA. Most people will not wait 30 seconds for a page to load. If your business loses money with every minute of downtime, this is certainly not adequate. It's also not a recommended setup: https://en.wikipedia.org/wiki/Round-robin_DNS#Drawbacks


Name a business that has not had a 30 second outage in the past year?

How about services you use a lot? How many hours has Hacker News been down in 2015 (yet you are still on it right now)? How many hours has Netflix been down in 2015? How many hours have entire chunks of AWS been down in 2015?

Every business is on a spectrum. An HFT trading shop may decide that 1 second of downtime per day is their max outage. A webpage advertising a pet adoption event may decide that 6 hours of downtime per day is the most they can tolerate. You have to make this decision for each product, and even better, for each part of each product.

The entire point of this thread was the idea that "I cannot use DO for serious stuff until they implement load balancing"... which is silly for most businesses. And even for those businesses that need high uptime, I offered (and still believe) that DNS round robin is a decent way to get HA for almost no money.

You link to an article about it, but miss the boat. What other solution can I implement in a few minutes that provides highly available load balancing between any two servers in the world (same or different hosting provider, same or different datacenter, same or different continent)?

Sometimes the relatively simple solution is "good enough". Sure, you can find a Wikipedia page describing where it is not perfect. I would not use DNS round robin for an HFT trading app. I have no problem using it for 99.99% of the web, though. So much of the web has NO failover of any kind that stupid-simple DNS round robin would be a vast improvement for most websites.


> Name a business that has not had a 30 second outage in the past year?

It's not a 30-second outage! Your domain will keep resolving to the bad IP. Even with an extremely low TTL (also not recommended), ISPs' DNS resolvers will cache it, and some will even ignore your TTL. A big portion of all new users will keep hitting the bad IP.

Anyway, I won't try to convince you to change your setup if you are happy with it, but it's obvious from the comments that I'm not the only one thinking it's a suboptimal solution, so at least some of us won't be considering DO for HA systems given the circumstances.


> It's not a 30-second outage! Your domain will keep resolving to the bad IP. Even with an extremely low TTL (also not recommended), ISPs' DNS resolvers will cache it, and some will even ignore your TTL. A big portion of all new users will keep hitting the bad IP.

So with 5 load balancers, 1/5 of customers see a one time hit of 30 seconds (after which they return to full speed).

What better solution for the same price do you propose to get HA on a budget cloud provider?


> What better solution for the same price do you propose to get HA on a budget cloud provider?

Nothing; your solution is obviously better than having none, and it's enough for your needs. But the original discussion was about what's needed for DO to become a competitor for big business, not for low-budget setups, where they are already king.

Normally you want at least IP failover, meaning that you get an IP that can be rerouted to a different server with a simple API call. At work we use Hetzner, which is not exactly a high-end provider but offers it: http://wiki.hetzner.de/index.php/Failover/en

It can be even better if the provider offers an HA load balancer as a service, so you don't have to set up anything.

You might still need DNS failover to recover from a full datacenter going offline.


Your view is accurate. It takes the end user -- be it some sort of client, browser, or a manual user retry -- to hit the other, alive IP(s). There's also the TTL of a bad record being dropped to consider.

You can simulate IP failover with something like Elastic Network Interfaces / Elastic IPs in AWS... it's just not going to be on the same level of speed as doing it in, say, your own rack in a datacenter. It's also subject to weirdness where you could get some sort of split brain, with nodes trying to take over interfaces in a loop. The health-checked "multiple load balancers behind a single DNS record" approach has flaws, but it also simplifies a lot of things.
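
For what it's worth, the Elastic IP shuffle itself is a single API call; here's a hedged boto3 sketch (the allocation and instance IDs are placeholders, the region is arbitrary, and it assumes credentials are configured and that something else is doing the health checking):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def fail_over(allocation_id, standby_instance_id):
        # Re-point the Elastic IP at the standby box; AllowReassociation lets
        # us take it from the dead primary without releasing it first.
        ec2.associate_address(
            AllocationId=allocation_id,      # e.g. "eipalloc-..." (placeholder)
            InstanceId=standby_instance_id,  # e.g. "i-..." (placeholder)
            AllowReassociation=True,
        )

    fail_over("eipalloc-REPLACE_ME", "i-REPLACE_ME")

Even then the move isn't instant, which is the "not the same level of speed" part, and you still need something watching for the split-brain loops mentioned above.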



