Load Balancing without Load Balancers (cloudflare.com)
101 points by jgrahamc on March 6, 2013 | 30 comments



If a particular server starts to become overloaded, and it appears there is sufficient capacity elsewhere, then just some of the BGP routes can be withdrawn to take some traffic away from the overloaded server

I'd be interested to hear about the mechanism for determining if there is sufficient capacity elsewhere, and how you avoid a cascading failure.
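To make the question concrete, here's the sort of naive decision loop I'd imagine, with a guard against the cascade. All the names, thresholds, and the capacity model are invented for illustration; nothing here is from CloudFlare:

```python
# Hypothetical sketch: withdraw anycast routes from an overloaded server,
# but only if the remaining servers can absorb the shifted traffic.
# Thresholds and server names are invented for illustration.

OVERLOAD = 0.85   # utilization above which we try to shed load
HEADROOM = 0.70   # peers must be below this to count as spare capacity

def plan_withdrawals(utilization):
    """utilization: dict of server -> load fraction (0.0-1.0).
    Returns the servers whose routes we would withdraw."""
    overloaded = [s for s, u in utilization.items() if u > OVERLOAD]
    to_withdraw = []
    for server in overloaded:
        shed = utilization[server] - OVERLOAD
        # Spare capacity on servers comfortably below the headroom line.
        spare = sum(HEADROOM - u for s, u in utilization.items()
                    if s not in overloaded and u < HEADROOM)
        # Cascade guard: only withdraw if peers can absorb the shed load.
        if spare >= shed:
            to_withdraw.append(server)
    return to_withdraw

print(plan_withdrawals({"sjc-1": 0.95, "sjc-2": 0.40, "sjc-3": 0.45}))
```

The cascade guard is the interesting part: if everything is hot, the function withdraws nothing, since moving traffic would just overload the next server.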


I'm interested in finding out what happens to TCP sessions that were established when routes are withdrawn.


This is the only thing I've found on this:

http://news.ycombinator.com/item?id=2484047


I'd be interested to see a blog post that details what "adjusting the way we handle [TCP] protocol negotiation itself [for Anycast]" entails.


Something that isn't quite clicking for me from this article, and my other knowledge of cloudflare...

So you have Nd DNS IPs, Nc Cache IPs, Np Proxy IPs and so on, plus some failovers. It seems to me you can only have Nd + Nc + Np + … + M servers in any given PoP. Which is all good, but I presume that the load on proxy and cache servers would be such that you'd need quite a few instances of each. Further, given the nature of CloudFlare's services, it would seem that some CDNed sites would be heavier than others.

So how do you assign various sites to IPs? Is this via some dynamic DNS magic? Is there a lot of communication between proxy and cache instances at each PoP (DHT or similar)?

Basically, what I'm saying is, using BGP to do most of the load balancing is awesome, but it seems there has to be more to it than that, otherwise you'd experience a lot of flapping between servers handling heavy sites.

That or I'm missing something, but what?


You anycast the DNS servers and also anycast the IPs those DNS servers serve up. This is kinda confusing, but all it really is is announcing the same IP block at every datacenter and relying on BGP AS path selection to get traffic to the closest datacenter.
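The effect is that each remote network's routers independently pick the announcement with the shortest AS path. A toy model of that selection (the AS paths are invented; real BGP best-path selection also weighs local-pref, MED, and more):

```python
# Toy model of BGP best-path selection by AS-path length for one anycast
# prefix announced from several PoPs. Paths here are invented examples;
# real BGP considers many more attributes than path length.

announcements = {
    # PoP -> AS path as seen from some client's network
    "sjc": [13335],                  # direct peering: one hop
    "ams": [3356, 13335],            # via a transit provider
    "sin": [1299, 2914, 13335],      # longer transit path
}

def best_pop(paths):
    """Pick the PoP with the shortest AS path, as BGP generally would."""
    return min(paths, key=lambda pop: len(paths[pop]))

print(best_pop(announcements))  # the directly-peered PoP wins
```

Since every client network runs this selection against its own view of the paths, traffic naturally splits across PoPs with no central balancer involved.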


Yes, we control DNS as well so we can use that to further spread load and move traffic as needed.


Why not also use software load balancers? I don't see the advantage of going to so much effort to avoid software load balancing. It's neat that you guys got it to work, but I would think that a hybrid system would have more functionality and could better handle degraded performance situations.


Exactly, as it's pretty much standard with other CDNs, and has been for a very long time. Without any application-level intelligence, you end up with duplicate copies of content, as every node in every city ends up caching every bit of content. You're then limited to the amount of cache each node has, resulting in much less available cache overall, more content expiring sooner, and even more requests back to the origin server. A tiered approach, with some intermediate nodes between the edge caches and the origin server, could mitigate this somewhat, but it would still be quite suboptimal.

The more standard method is a two-layer approach, with the front-end first layer intelligently hashing the full URL across the pool of the back-end second layer. The trickier part is when a single object requires more than one back-end node: then you need to hash the same content to more than one node. This could be done by just having multiple pools of servers on the back end, if the scale requires it.
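A minimal sketch of that URL-hashing layer, including the hot-object case where you spread one URL over several nodes (the node names and the choice of hash are mine, purely illustrative):

```python
# Sketch of the two-layer approach: front-end nodes hash the full URL
# onto a pool of back-end caches, so each object lives on one node
# instead of every node. Node names are invented for illustration.
import hashlib

BACKENDS = ["cache-1", "cache-2", "cache-3", "cache-4"]

def backends_for(url, replicas=1):
    """Map a URL to `replicas` back-end cache nodes. With replicas > 1,
    a hot object can be served from more than one node."""
    digest = int(hashlib.md5(url.encode()).hexdigest(), 16)
    start = digest % len(BACKENDS)
    return [BACKENDS[(start + i) % len(BACKENDS)] for i in range(replicas)]
```

One caveat: plain modulo hashing like this reshuffles nearly every URL when the pool size changes; a production setup would use consistent hashing so that adding or removing a cache node only moves a small fraction of the objects.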


What do you mean? Isn't the reason to push it off to the router that routers scale up like crazy? Whereas doing even 40Gbps at line rate on a server is nearly impossible without specialized hardware. Even 10Gbps line rate could be a lot of work as of a year ago.

And if the routers LB to a bunch of software nodes, then aren't you sort of in the same position?

Can you help me understand?


A software load balancer is coming soon. Don't worry. The hybrid approach (network + app level intelligence) still has some advantages as well.


> 2. Router: at the edge of each of our PoPs is a router. This router announces the paths packets take to CloudFlare's network from the rest of the Internet.

Umm, just curious, but I thought you used a set of redundant routers at each PoP?

Or is the single router used highly redundant on its own?


Since this is all for an anycast system, it might be better to have more PoPs in more locations instead of trying to make a single PoP bulletproof by doubling up on what could be the single largest capital expense in building one.


This is generally our philosophy.


Typically a PoP could be defined as either an individual provider being brought into a facility or the location where a group of providers enters the facility (meet-me room). In this instance I think they mean an individual provider (for example Level3) bringing a feed into a data center. You would then get multiple feeds from distinct providers. Each feed would require a router for termination, and this router typically runs BGP. Since you have multiple providers, having redundant routers on each feed isn't always necessary. But there are cases where I've seen a single feed served by redundant routers using something like VRRP. After that you would take feeds from the border router into your inner/core switch fabric to be distributed throughout your network. It just depends on the level of redundancy you want at each layer.
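For anyone unfamiliar with VRRP, the core idea is small: the router pair shares a virtual IP, and the highest-priority live router owns it. A toy model of the election (priorities and names invented; real VRRP breaks priority ties by IP address and uses multicast advertisements to detect failure):

```python
# Toy illustration of VRRP-style failover for a border-router pair:
# routers share a virtual IP; the highest-priority live router owns it.
# Router names and priorities are invented for illustration.

def vrrp_master(routers):
    """routers: dict of name -> (priority, alive). Returns the name of
    the current master, or None if no router is alive."""
    live = {n: p for n, (p, alive) in routers.items() if alive}
    # Highest priority wins; real VRRP breaks ties by highest IP,
    # here we break ties by name just to keep the model deterministic.
    return max(live, key=lambda n: (live[n], n)) if live else None

pair = {"edge-a": (200, True), "edge-b": (100, True)}
print(vrrp_master(pair))       # edge-a holds the virtual IP
pair["edge-a"] = (200, False)  # primary fails its health check
print(vrrp_master(pair))       # edge-b takes over the virtual IP
```

The downstream switch fabric never notices the failover, because the virtual IP (and, in real VRRP, a virtual MAC) just moves to the surviving box.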


Thank you. I remember VRRP being a bit of a mess from its inception due to the Cisco patent claims against the original IETF draft (too similar to their patented HSRP). Are there still a lot of compatibility problems across vendors? At home I used to use CARP (Common Address Redundancy Protocol) on OpenBSD, but I haven't done it for a while, and I know it's improved a lot since then (e.g. pfsync(4), ifstated(8), ...).


What, exactly, does the picture of a woman in a swimsuit have to do with load balancing?


She's...balancing?


That's not a swimsuit, it's yoga attire.


Sometimes a cigar is just a cigar.


You mean the woman "balancing?"


Yeah, I found it incongruous and strange to have that picture on top.


One thing I've wondered is how homogeneous your PoPs are -- how big is the multiple (in router size/number and server count) between, say, the San Jose PoP and the Sydney PoP. I assume you'd want to scale each based on the traffic expected to consider that PoP nearest, but that changes based on factors outside your control, and could change fairly fast. Plus DoS might come from areas which don't see a lot of regular traffic.


Yes, our PoPs are different sizes depending on the load. San Jose and Los Angeles, for example, are larger than Seattle. We've designed the system to scale relatively linearly by adding additional servers. As a PoP gets more traffic we can handle it in two ways: 1) adding more equipment to the PoP; or 2) adding another PoP to offload a portion of the traffic.


One thing I always wondered is how session persistence is maintained for things like webapps when trying to host services through anycast. The only idea I can dream up is cookie-based and involves each site having internal proxy mappings to each PoP in the event a different PoP becomes the favored route (paths change all the time, and routers on the internet don't care that you had an established connection).


CloudFlare doesn't do sessions; they're just* a proxy cache to their customers' origins.

* A nice proxy cache, and they do have some features to muck about with the content on the way through if you desire, but I don't see any application-aware routing options.

What you're asking about isn't really an anycast problem either; you can have the same situation with any load-balancing setup. If you need the client to come back to the same server, you need the client to bring you back something that says which server to route to (for example a cookie or a hostname), and you need something in your stack that handles that and routes it (could be DNS entries, a hardware or software load balancer, application logic, etc.). Avoiding sessions is better, of course, if you can (or if you can have the client keep the state information, possibly encrypted and signed).
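The cookie variant is only a few lines. A sketch, with server names, cookie key, and fallback hashing all invented for illustration:

```python
# Sketch of cookie-based server affinity: the client returns a cookie
# naming its server; the balancer honors it if that server is still in
# the pool, else falls back to hashing the client address. All names
# here are invented for illustration.
import hashlib

SERVERS = ["app-1", "app-2", "app-3"]

def route(cookies, client_ip):
    """Pick a backend: honor an existing affinity cookie if it names a
    live server, otherwise hash the client onto the pool (and the
    response would set the cookie to the chosen server)."""
    sticky = cookies.get("srv")
    if sticky in SERVERS:
        return sticky
    digest = int(hashlib.sha1(client_ip.encode()).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]
```

Note this is exactly the piece that has to live behind the anycast IP at every PoP for the persistence story to work when routes shift, which is the mapping the parent comment was imagining.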


Great writeup! If you offer a global service you are expected to use anycast on the WAN. BGP AS path selection is just as effective whether you use unicast or anycast.


+1 for remembering Las Vegas (and Seattle) go first.


:-)


This article would be more enjoyable without the extraneous clip art.



