TL;DR: Fastly needs full routing tables on all CDN nodes to determine which transit path is best for pushing content out. To save money, they used a programmable Arista switch instead of a traditional router. Their solution is to reflect BGP routes via the switch to the nodes and fake direct connectivity between the nodes and the transit providers, so that nodes can push content directly to whichever transit provider they determine is best on a per-packet basis.
Please correct me if I'm wrong.
Maybe I'm just obtuse, but I found the blog post confusing about what it is really about, and long-winded, taking a very long time to come to the point. There is a severe lack of context in the beginning; it would have benefited tremendously from stating the what and the why of what they are trying to do.
For me the TL;DR would be: Our edge nodes originally had full BGP tables from our upstreams. This was good, as our nodes care about path selection. When we needed to scale, we didn't want to use a pair of (expensive) routers, because we didn't actually need routing, but our upstreams were not happy about having to peer with more than a pair of devices. We bought some switches, reflected BGP routes to the nodes, and had to hack L2 connectivity between the switch and the nodes.
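If it helps, here is a minimal sketch of what the host side of that ends up being, with made-up prefixes, next hops, and providers (an illustration of a node holding the full table and doing a longest-prefix match toward a next hop the L2 hack makes appear directly connected, not Fastly's actual code):

```python
import ipaddress

# Hypothetical slice of a full table as a node might receive it over iBGP
# from the switch acting as a route reflector. The next hops are transit
# routers that the L2 trick makes look directly connected to the node.
routes = [
    (ipaddress.ip_network("0.0.0.0/0"),       "192.0.2.1"),   # default via transit A
    (ipaddress.ip_network("198.51.100.0/24"), "192.0.2.9"),   # better via transit B
    (ipaddress.ip_network("198.51.100.0/25"), "192.0.2.17"),  # more specific, transit C
]

def next_hop(dst: str) -> str:
    """Longest-prefix match: the most specific covering route wins."""
    addr = ipaddress.ip_address(dst)
    best = max(
        (net for net, _ in routes if addr in net),
        key=lambda net: net.prefixlen,
    )
    return dict(routes)[best]

print(next_hop("198.51.100.10"))  # -> 192.0.2.17 (most specific route)
print(next_hop("203.0.113.5"))    # -> 192.0.2.1  (falls back to the default)
```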
I'm also really confused as to why they insisted on taking full tables at their edge if they also built their own route optimization technology. It seems like it would be much simpler and faster if they just took a default route from each provider and then did route probing on their top 30% prefixes in each PoP. You don't need full edge routes for that.
You need a single route reflector (or two for redundancy) to tell you what routes are in the table. Then probe the busiest prefixes (as determined by flow data) via each provider and take the best route. Localpref the best route higher in the route reflector and send it to the Linux hosts. No need for a huge FIB on the edge. This is basically what Noction IRP/Internap FCP platforms already do.
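A hedged sketch of that probe-and-prefer loop, assuming per-provider source addresses steered by policy routing plus invented prefixes and probe targets (none of this comes from the article):

```python
import re
import subprocess

# Hypothetical setup: each transit provider gets a local source address, and
# policy routing ("ip rule ... lookup <table>") sends traffic from that source
# out the matching provider. The busiest prefixes come from flow data.
providers = {"transit_a": "192.0.2.10", "transit_b": "198.51.100.10"}
busy_prefix_targets = ["203.0.113.1", "203.0.113.129"]  # one probe target per prefix

def avg_rtt_ms(target: str, source: str) -> float:
    """Ping a target from a provider-specific source address, return avg RTT."""
    out = subprocess.run(
        ["ping", "-c", "5", "-I", source, target],
        capture_output=True, text=True, check=False,
    ).stdout
    m = re.search(r"= [\d.]+/([\d.]+)/", out)  # the min/avg/max/mdev summary line
    return float(m.group(1)) if m else float("inf")

for target in busy_prefix_targets:
    best = min(providers, key=lambda p: avg_rtt_ms(target, providers[p]))
    # In the real setup you would now raise local-pref for this prefix via the
    # winning provider on the route reflector (e.g. through ExaBGP or the
    # reflector's API) so the Linux hosts pick it up.
    print(f"{target}: prefer {best}")
```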
This reminds me of the scene in Primer where they ask if they're doing it for a reason or just showing off.
FWIW, the Netflix networks just take a default route from a pair of transit networks in each PoP. At their scale, they found it better to negotiate better worldwide transit deals than optimize networking at the PoP level. Different game than Fastly, but another data point.
I remember those FCP boxes, it's been a while. I think it's maybe what they call Miro now, do you know? The Avaya RouteScience boxes did the same thing. It seems both of these have fallen out of usage. I always thought it was odd that there wasn't an open-source project for this. I know Internap originally did it in Perl.
Yep, FCP's successor is Miro. I haven't had a chance to look into it, but their sales people claim it's all new.
Noction has pretty much taken over the entire market. Overall their product is good, but the Internet game has changed a lot since the original FCP days (lots more direct peers, lots more exchanges, lots more route servers, lots more paid peering), and I don't think that things like FCP/Miro/IRP do well once you move past paid peering. Things just become too complex. IRP exposes some of this to you, but the implementation is kludgey (routing instance per peer, DSCP bits to identify peers on exchanges, and more).
Agreed about how the Internet game has changed. If I remember correctly, Internap didn't peer with anybody; they simply bought tons of transit and connected the PNAPs with local fiber.
I didn't understand your comment about "... once you move past paid peering."
Did you mean "once you move to paid peering"? I would think that moving past it would be actual settlement-free peering. But I'm curious what you meant.
> It seems like it would be much simpler and faster if they just took a default route from each provider and then did route probing on their top 30% prefixes in each PoP.
Probing just 30% of the top prefixes is likely not enough for their use case. Probing is also an after-the-fact approach. Fastly needs a solution that gives the best possible path from the first packet.
> FWIW, the Netflix networks just take a default route from a pair of transit networks in each PoP. At their scale, they found it better to negotiate better worldwide transit deals than optimize networking at the PoP level. Different game than Fastly, but another data point.
Different game indeed. Netflix just cares about throughput; Fastly also needs low latency, hence the need for more optimization.
"Probing just 30% of the top prefixes is likely not enough for their use case. Probing is also an after the fact approach. Fastly needs a solution which will give best possible path from first packet."
That's not how BGP works. There's no such thing as getting the best path from the first packet in terms of latency. There's nothing in BGP that tells you about the path in terms of latency or performance. The only "best" that BGP gives you is AS path length between you and the destination. It tells you nothing about congestion or packet loss upstream. The only way to get that is out-of-band probing, like taking latency measurements and pref'ing routes. There is no approach beyond "after the fact".
There is nothing in this article that talks about optimization of routes.
Of course not. I only stated what Fastly needs; I did not make a statement about how BGP works.
By having full BGP feeds from all transit providers, the nodes can use the AS paths as part of the heuristics to determine the best possible path (with the available information) from the first packet. Obviously the heuristics will also use after-the-fact probe information to tune the model.
The point is that the solution proposed by the grandparent is not better than the one described in the article.
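As a purely illustrative sketch (the weights and numbers are invented, and nothing like this appears in the article), "AS path as a prior, tuned by after-the-fact probes" can be as simple as:

```python
# Invented numbers: per-provider AS-path length to some prefix (known from the
# full BGP feed before any traffic is sent) and measured RTT in ms (from
# after-the-fact probing, absent when the first packet goes out).
candidates = {
    "transit_a": {"as_path_len": 3, "rtt_ms": 42.0},
    "transit_b": {"as_path_len": 2, "rtt_ms": None},  # not probed yet
}

def score(c: dict) -> float:
    """Lower is better: fall back to the AS-path prior when there is no probe data."""
    prior = c["as_path_len"] * 10.0  # BGP-only estimate for packet one
    return c["rtt_ms"] if c["rtt_ms"] is not None else prior

best = min(candidates, key=lambda name: score(candidates[name]))
print(best)  # transit_b wins on AS-path length until probe data says otherwise
```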
> There is nothing in this article that talks about optimization of routes.
What you're talking about works OK in a single colo or datacenter model, but it doesn't map to the PoP model. It's the PoP model that's driving them to do this. In their PoPs they've got local peers and exchanges that will handle the bulk of their traffic. The leftover stuff is only local and probably minor. They can definitely find that with flow data and optimize across a pair of default routes (basically building their own routing table). It's a pretty common practice when all you deploy is PoPs. Why do you care about a full BGP feed when you're only handling 4-10% of the Internet in a given PoP?
AS path length doesn't tell you anything about a route's performance. It's a hint and only a hint.
> What you're talking about works OK in a single colo or datacenter model, but it doesn't map to the PoP model. It's the PoP model that's driving them to do this
Could you please explain how the PoP model in this case differs from the colo/DC model?
Outside the US, Fastly's PoPs cover multiple countries or whole continents.
> Probing just 30% of the top prefixes is likely not enough for their use case. Probing is also an after-the-fact approach. Fastly needs a solution that gives the best possible path from the first packet. <
Considering that Fastly deploys PoPs and not colos or datacenters, I think that the top 30% of their transit routes would be the lion's share of non-peer activity. Fastly's PoPs are generally at (or have reach to) the largest Internet exchanges for peering purposes. Peering routes are generally the best, cheapest, most concise routes you can get. You don't need a big, expensive router when you're in an exchange's peering mesh. Considering that you're only handling local traffic (by design of having multiple PoPs), peering plus two default transits will probably optimize for the traffic that matters most.
Also as mentioned, you can't really determine "best possible path" without probing. You can see what BGP tells you, but that's not always the best possible path. People that run networks know this. That's why things like FCP/IRP/RouteScience exist (for better or worse).
> Different game indeed. Netflix just cares about throughput; Fastly also needs low latency, hence the need for more optimization. <
Sure, but you won't get that optimization from BGP. In a Fastly PoP, you're mainly dealing with in-country or in-region eyeball networks. If you buy transit from 2-3 major networks, BGP probably won't help you with those eyeballs. The path length will probably be the same (the eyeballs either peer with or buy transit from large tier 1s). You generally can't trust MED to do anything useful, so that's out. Those are the only two things BGP can use to impact route selection. If path length and MED are the same, it's a coin flip as to which path to take (not really, as the first installed route is the winner).
If you just took default routes and peered locally, you'd be much better off in a PoP scenario than trying to re-invent route reflectors and taking full BGP feeds.
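To make the coin-flip point concrete, here's a stripped-down sketch of that comparison (real BGP best-path selection has more steps, e.g. local-pref, origin, and eBGP vs. iBGP, and MED is normally only compared between routes from the same neighboring AS, but the point stands: none of the inputs is a latency or loss measurement):

```python
from dataclasses import dataclass, field
from itertools import count

_arrival = count()  # install order, used as the final tie-breaker here

@dataclass
class Route:
    provider: str
    as_path_len: int
    med: int = 0
    installed: int = field(default_factory=lambda: next(_arrival))

def better(a: Route, b: Route) -> Route:
    """Shorter AS path, then lower MED, then the first-installed route wins."""
    key_a = (a.as_path_len, a.med, a.installed)
    key_b = (b.as_path_len, b.med, b.installed)
    return a if key_a <= key_b else b

r1 = Route("transit_a", as_path_len=2)
r2 = Route("transit_b", as_path_len=2)
print(better(r1, r2).provider)  # transit_a: same length and MED, so the
                                # earlier-installed route is the "winner"
```

Whatever probing layer you bolt on top of this is what actually supplies the latency signal.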
I'd bet the best transit path isn't the primary reason. Having the routes in the application means it can choose a path based on cost, customer preference, or any number of other business rules.
But it is: you can probe and route around brownouts in certain ASes in the path. I would say that a CDN's ability to get the content to the eyeballs is the most important thing and trumps all else.
I think cost is most important. I know other CDNs use routes for managing costs too. Being able to modify traffic flows means you control the flow of money between peers and/or customer=>transit relationships.
Cost is unlikely to be most important. If it were, the CDN could just buy the cheapest transit, accept a default route, and call it a day.
In the article, Fastly has 13 transit links at one site and the ability to ping each route to determine the best path. They would hardly go to all this trouble if cost control were the main objective.
Cost is important to any business, but CDNs are all about speed; the biggest metric is "time to first byte." This is always going to take precedence over cost for a CDN. Differences in transit costs for a large CDN with high commits are negligible.
There are more costs involved than just those of the CDN.
Let's say I'm a network operator. I have some number of PoPs/facilities, peering relationships, transit customers, maybe even some relationships where I pay for transit. If I decide to partner with or become a customer of a CDN who can control the path traffic takes to reach the CDN on my network, I/they/we can use that to negate, or eliminate, my transit costs AND drive revenue to me.
(Very simple example) Let's say one of my transit customers also has a non-paid peering relationship with one of my upstreams. The customer may choose to take the longer path to reach the CDN on my network because it's free to them, rather than paying me. Now, not only are they not paying me for transit, but I may be incurring additional cost to pay my own transit. If this routing data is in the CDN application, a logical step is to report on the costs of the traffic. I would see this traffic taking a more costly path and be able to act on it. Maybe that means prepending my ASN, maybe that means modifying the DNS answers for requests made from my customer's IP range(s) to use a block not routed across the other ASN. Maybe it means I simply don't want that traffic to my "nodes" anymore.
Maybe all anyone needs is "fast" today. I agree it's not the only factor, but being able to manage cost was a discussion point in the past.
Interesting approach. So Fastly is offloading the full routing table to their carriers' router(s). That's because routers that can hold full BGP tables are expensive to purchase and maintain. But to retain some form of control, they're terminating eBGP at the switch and using iBGP to disseminate (inject) the providers' routes (next hops).
I feel (I don't have direct experience with this setup) like they're just offloading some compute power (therefore cost) to the hosts. So the cost is automatically spread out across their relatively massive fleet of edge nodes. A line item showing a router plus support at $100,000 looks bad in expenditures versus a server with integrated routing at $2500.
I'm curious about how this impacts Varnish, considering how table lookups can be bus-expensive during odd route changes/flaps (storms). %sys must go through the roof as a result.
> So Fastly is offloading the full routing table to their carriers' router(s).
The carriers' routers will have full routing tables in any case, so Fastly is not really offloading anything. They just aren't downloading full routes to a big central router and doing routing decisions there, but rather at a host level.
> I feel (I don't have direct experience with this setup) like they're just offloading some compute power (therefore cost) to the hosts.
This appears to be the case. The routing part isn't very computationally expensive; the biggest problem at larger bitrates is moving the packets in software. But then again, they are already limited by what their CDN nodes can push out, so the routing isn't really much more of a burden.
I didn't see him in the room, which doesn't mean he wasn't there, but it sounds like Artur was paying attention to Dave Temkin's (from Netflix) talk at NANOG.
Titled: "Help! My Big Expensive Router Is Really
Expensive!"
Netflix goes even cheaper/simpler and just uses default routes to a pair of transit providers. It may come as a shock to most of you, but no, Netflix is not 100% in AWS (compute, yes; network, oh hell no).
> they're just offloading some compute power (therefore cost) to the hosts
I don't know enough about networking to follow the tech, but they seem to say in the post that's exactly what they intended to do:
> The idea of dropping several millions of dollars on overly expensive networking hardware wasn’t particularly appealing to us. As systems engineers we’d much rather invest the money in commodity server hardware, which directly impacts how efficiently we can deliver content.
Great read, love the "hey, what does it need?" approach rather than the "how is this done?" approach. Tut Systems had bought one of the first "hotel internet" companies back in the '90s, which used a similar approach by subverting the ARP protocol: when you connected, anything you tried to ARP for would answer "Yup, that's me! Send me your packets," and you would end up at the "Give us your credit card" signup.
The nice thing is that at this level networking is really simple. And if you can get access to the internals of switches to craft behaviors at that level, it is a pretty good way to go.
Proxy ARP is a thing, and it is exactly what you've described :)
And if by proxy DNS you mean you'll subvert ARP to reach a DNS server... that's proxy ARP. DNS is a few layers above :)
Definition:
Proxy ARP is the technique in which one host, usually a router, answers ARP requests intended for another machine. By "faking" its identity, the router accepts responsibility for routing packets to the "real" destination.
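On Linux this is just a per-interface kernel knob; a tiny sketch, assuming an example interface name and the generic Linux feature rather than whatever Tut's gateway actually ran:

```python
from pathlib import Path

def enable_proxy_arp(iface: str = "eth0") -> None:
    """Tell the kernel to answer ARP requests for addresses it can route,
    i.e. claim "yup, that's me" on behalf of the real destination."""
    Path(f"/proc/sys/net/ipv4/conf/{iface}/proxy_arp").write_text("1\n")

if __name__ == "__main__":
    enable_proxy_arp("eth0")  # needs root; equivalent to
                              # sysctl net.ipv4.conf.eth0.proxy_arp=1
```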
Amazon, Google, etc.: all these companies are building their own custom network devices, and so much of it comes back to both "It does too much, most of which we don't need" and "We want to do our own thing at that layer."
As James Hamilton noted at the AWS re:Invent conference, not only is there the overhead and development expense of these unneeded components, but the sheer complexity of the software running on them inevitably leads to bugs and unexpected behaviour. By simplifying the device to do just the few things you actually need it to do, you end up more performant and more reliable.
I wonder if the entrenched network appliance providers will wake up?
They'll want to look at using MPLS and EPE techniques now that there's support for them on their Arista platforms. This L2 technique is arcane, and it's going to be painful to scale and to reapply generally to other areas.
Huh? What would MPLS do? They don't operate a backbone. They operate only at the edge, like all CDNs. There's nothing arcane about this, just basic networking. Where is the scalability issue?
That'd be great if they could. We started using Arista 5 or 6 years ago at a small startup I was working for because Arista was the only vendor we could find who was comfortable giving us such an amazing amount of access to the inner workings of their high speed switches. Really enjoyed working with their gear and, when the occasional question came up, their engineers (top notch folks).
I've run Arista for 10 years now (since Arastra days), but I'm a lot more interested in whitebox switches running some sort of open networking software. The silicon designs are all coming from Broadcom/Intel/etc.
This is where stuff like Cumulus comes into play, and it is a lot more exciting than vendor lock-in on "open" technologies provided by Arista.