Reimagining the future of routers

hueving · on June 8, 2016

"What modern IP routers do is exactly this. Every forwarding entry has to be fast. Analysis of real backbone traffic data, says it does not have to be. For all practical purposes, today, forwarding tables are oversized by a factor of 10x today."

What he is suggesting will drastically impact the latency of packets as well as the throughput. Following his analogy of cache hierarchies in a regular computer, some prefix lookups are going to be 'downgraded' to the main memory and will take maybe 100x the time.

If a router is forwarding at 10gbps it has ~51 nanoseconds in worst case (64 byte packets). So this is the overhead you can probably expect at each router as your traffic traverses the net. I'm 17 hops from my old university's network. That's just under a microsecond of lookup times. Not bad thanks to TCAM or other specialized lookup chips.

If all of these routers adopted this, my lowly traffic would have to read from main memory on each route lookup because I would never be in the top tier for bandwidth usage even if I was connected all day long. A full routing table is >600k addresses now[1] so each lookup may have to reference main memory several times as it walks a trie. Let's assume about 10 times to be generous which (at 100ns a reference) comes to about 1 microsecond. This is just for 1 packet.

As you start to pile on the thousands of other small connections (we're talking about provider network routers) that would be put into this steerage class, there is going to be contention and queuing that could easily push it into 1ms-10ms depending on bursts.

So if my whole path adopted these routers, I could experience jitter of ~170ms or worse. Gross.

This is essentially turning into a crappy QoS system where the biggest bandwidth hogs get the good service and everything else gets garbage.

Thumbs down from me.

1. http://www.cidr-report.org/as2.0/

catwell · on June 8, 2016

I have heard some companies have already started doing that a slightly different way, without the need for different hardware design. They buy several inexpensive routers that can only hold the top 10% of the routes (that represent 90% of the traffic) in memory, and a single expensive router which holds the whole table. Then they just set the expensive router as the default gateway on the unexpensive routers.

> I would never be in the top tier for bandwidth usage even if I was connected all day long

Maybe, but it probably doesn't depend on you unless you are your own AS. Routing tables do not contain individual IP addresses so if you are a "normal" ISP consumer chances are you will be in a block that is in the top 10%.

iheartmemcache · on June 8, 2016

Yep. The author of the article seems to ignore the entire line of (relatively inexpensive) Cisco ASR's designed specifically for this. Juniper I'm sure has a line to do something similar. Residential Comcast in the early 2000's could get me from Boston to the Bay Area in ~35ms (granted, I'm sure they were using millions in 7206VXRs and whatnot). Either way I'm sure that 90/10 split-type methodology has been going on for ages. There are plenty of clever ways the CCIE's work in as much of the routing table as possible within their fixed budget.

I remember the days when ARIN would give out AS's to basically anyone. Before all the cable companies merged, I spent my teenage years developing close relationships to my ISP. One of the high level techs told me to throw up BGP on a FreeBSD box and got on the ARIN mailing list with me to get a /30. That's the type of AS that'd fall into the 90 percent category (along with small businesses that who, for whatever reason, have their own AS). Even on my own AS, I'd rarely see latency above 50 ms domestically via BGP or OSPF.

hueving · on June 8, 2016

>Routing tables do not contain individual IP addresses so if you are a "normal" ISP consumer chances are you will be in a block that is in the top 10%.

My home IP falls into a slash /22 block I can see in the global routing table. It's highly unlikely that this particular 1000ish IP addresses are responsible for enough traffic on the internet to make top 10%. Top 10% is going to basically be a bunch of hosting providers and CDN prefixes.

pas · on June 8, 2016

Maybe, maybe not.

If routers become cheaper and smarter the network will grow, you might get more direct routes (less hops), or simply better load balancing (based on latency and-or throughput, maybe even indicated by the size of the flow, so your SSH session will be low latency low jitter but also if you start to download a file via a subchannel it might get shifted to different routers), and with better balancing flows will get distributed not just considering packet drops, but maybe cache hits too (latency basically).

Anyhow, upkeep costs are very important, because they limit the size and efficiency of the network. (High barriers to entry limit growth, high upkeep encourages centralization, which is bad from a reliability and fault tolerance aspect.)

hannesgredler · on June 8, 2016

your reasoning is correct (the path latency gets higer) its just that you got the final numbers wrong ;-)

on modern software forwarding core (fd.io/VPP and DPDK) you can forward with a sub 100us latency. so your total latency ends up being roughly 1.7ms "slower".

have a look at this: https://www.youtube.com/watch?v=T66BTHnENY8

hueving · on June 8, 2016

That video is pretty light on details. I would be interested in seeing how much the routes were summarized and if it was enough that they all fit into the processor's cache. The thing that kills me with these network performance benchmark videos is that there are so many things that drastically impact performance and you never get any details about them. Just the marketing pitch.

Cerium · on June 8, 2016

I've done some work with DPDK about a year ago. At that time I was getting about 12us latency for small packets. Interestingly the power consumption is maximum at minimum load. DPDK uses poll mode drivers to minimize latency, when load is light it uses more power constantly polling than the cpu uses doing computations.

Hikikomori · on June 8, 2016

Performance within one numa space seems to be great. What can you get between numa spaces? That would be a common forwarding path for any router with enough interfaces.

feld · on June 8, 2016

If you want good results you don't build a server with more than one CPU socket. NUMA is too expensive for high performance networking.

Hikikomori · on June 8, 2016

So you're basically stuck with the number of cores of a single socket per 1U, seems like a waste of space and possibly power usage when 2 cores are utilized per 10GE port (in this video). Doesn't seem to scale well now that 100GE is getting more common, so 2 100GE ports per 1U? Depending on where this device would be used the recent 32port 100Ge 1U switches seems to be more interesting, small fib, but with the correct protocol support it could fit a lot of use cases, especially with something like sir[1].

1. https://github.com/dbarrosop/sir

ra1n85 · on June 8, 2016

Except that DPDK has a hard ceiling in terms of total PPS.

signa11 · on June 8, 2016

> If all of these routers adopted this, my lowly traffic would have to read from main memory on each route lookup...

yes, but current COTS hardware, via userspace networking can very easily saturate a 10g interface at 64b packet-size / core.

imho, it is the convergence of a large number of advances in COTS hardware over the last couple of years, that has changed the field to mostly a software problem rather than a combination of hardware + software one. when you have proprietary hardware, software tends to be more or less proprietary, which is where most of the networking vendors are currently.

a large number of projects are actively attacking this space e.g. snabb(https://github.com/lukego/snabb), routebricks (http://routebricks.org/) etc.

Cerium · on June 8, 2016

I wouldn't say that they easily reach 10g on small packets. I got to about 8g before I started to have trouble. It certainly takes some optimization. If I remember correctly Intel's DPDK documents warn that you only have about 80 cycles between packets.

signa11 · on June 8, 2016

Hmm, for our case I 'program' flows which then end up being used by per-core tasks for forwarding decisions, and with couple of tweaks etc was able to handle line rate across 4 10g interfaces for 64b packets, with one physical core per port.

This is on bare metal though, so might be different experience if you are running stuff on vm's where without sr-iov you would have multiple layers of indirection + context switches before someone can actually do something meaningful to the pdu...

edit: slight clarification added

netten · on June 9, 2016

can you tell me how you did this, i would really like to try this out.

amazon_not · on June 8, 2016

> yes, but current COTS hardware, via userspace networking can very easily saturate a 10g interface at 64b packet-size / core.

Citation needed.

feld · on June 8, 2016

Heard of Netmap or DPDK?

amazon_not · on June 8, 2016

Yes. As far as I know neither of them can push 15Mpps per core. Please please prove me wrong. I'd love to be.

tyingq · on June 8, 2016

It's also comparing basically an assembled automobile with a pile of parts. I suspect there will be a well vetted router-centric distribution that leverages DPDK soon, but I don't think that exists yet.

bogomipz · on June 9, 2016

Huh? All 10 Gig NICs can and do push 14 Mpps using minimum MTU:

http://www.ieee802.org/3/10G_study/public/speed_adhoc/email/...

What are you talking about?

signa11 · on June 8, 2016

> Citation needed.

http://lmgtfy.com/?q=dpdk+15mpps

reitanqild · on June 8, 2016

I hate lmgtfy used for this. Possibly more than Citation needed.

iyx0c7rj3tsm · on June 10, 2016

> "where the biggest bandwidth hogs get the good service and everything else gets garbage."

I think the worst-case, steady-state, scenario is actually equal time sharing, and the whole value is that while you aren't talking, other packets flow faster. That's not a doomsday scenario. Noone is out anything. (If this equal-sharing isn't fast enough, your upstream links/peers are over-subscribed.)

Also, your cache/memory analogy disrgards the possibility of simplifying and/or distributing the routing tables.

Route consolidation is one solution. Having no table at all is ideal. My little underpowered home router has 1 entry for it's upstream gateway. There's no reason to keep billions of records in a table in any central router either, given there are only going to be so many uplinks.

The more scalable solution is to distribute the table(s) in parallel to other peers. If the routing tables really are too big and can't be consolidated in core routers, the answer is simply more routers, geographically distributed to best meet demand, not fatter ones in the core, with higher latency, higher costs, and all the downsides of centralization like survaillance.

QoS is a solution in search of a problem; networks scale fine when we scale them out, not up.

Hikikomori · on June 8, 2016

How would the forwarding path with a higher and possibly variable delay work with the small buffers that this article also suggest?

hueving · on June 8, 2016

"Every added feature will make future feature additions harder. If you just make the number of supported software features large enough, you can extrapolate, that at some point, this will become unmaintainable. External measure of such a condition, is too hard it is to get functionality into a particular main line release. If you are already using software which can never get “de-featured”, I have bad news — you are doomed to spend your life in the “eternal bug hell.” Availability goes down, operational cost goes up, and your vendor cannot possibly fix it. Time to change vendors is the only way out."

Only if you assume a terrible code base. It's very easy to make routing protocols as modules because they just maintain pokey old slow routing information bases that can live in main memory that don't have to react on the nanosecond scale.

I've worked on modules for OSPF on a vendor router and if the customer isn't using OSPF, that daemon and its code are never even executed. No "eternal bug hell".

This whole blog is basically just pitching major feature-gaps as a feature to prepare us for some MVP I expect to see from him in the coming months that only supports BGP and ethernet or something like that.

hannesgredler · on June 8, 2016

good guess :-) - our MLP (minimum lovable product) is BGP and IS-IS along with a VPP based software forwarding module.

sargun · on June 8, 2016

Since when have you guys been leveraging VPP? How do you find it? I tried to write some code for it, and I found it kinda difficult in comparison to Click / Snabb, but I realize that they're two totally different systems.

hannesgredler · on June 9, 2016

we started integrating it in March. Arguable the compile chain is a bit heavy weight, but the code is very structured, easy to extend and a clean architecture. - Dave Barach and the VPP crew were very quick answering questions we have.

BTW Snabb is cool, but VPP is more feature complete.

makomk · on June 8, 2016

The obvious reason not to encode assumptions about the statistical distribution of internet traffic in your router hardware is that if the assumptions ever fail (say, because some new P2P service takes off, or video viewing becomes less centralized) your routers will fall over. He's essentially proposing to build centralization and the end of the peer-to-peer internet into routers at the hardware level. Not only that, but anyone who can generate traffic that breaks that assumption can launch a denial-of-service attack against your routers.

hannesgredler · on June 8, 2016

history traffic patterns show clearly a rising in-equality/ power-law distribution. so unless that multi-year (decade) long trend reverses, forwarding lookup hierarchies are going to work even better.

hueving · on June 8, 2016

That's for bandwidth. You're still screwing low volume connections.

Consider 1 million people watching netflix from the perspective of a transit provider. If you're just looking at bandwidth you can obviously prioritize lookups to netflix servers. But then you have 1 million streams to different client IP addresses throughout the Internet. Each on its own will be a small fraction of the bandwidth, so are you going to punish them all? Not much gain from lookup hierarchies there.

hannesgredler · on June 9, 2016

netflix caches are serving millions of subscribers using a thousands to ten-thousands of prefixes, well below 100K. their caches are highly regionalized so its no practical problem.

ghshephard · on June 8, 2016

The title is clickbait, of course, (L3 routing is obviously not going away) but the essay is insightful.

Given that he mentioned Amazon, I'm surprised to see that there wasn't more in this essay regarding Amazon (and Googles) efforts to build their own routers. Also, a number of networking companies have started off by discarding all the legacy networking functions, and starting afresh (Juniper). It would be interesting to review the field and see who else is doing this, particularly in the last 5 years, and what their success has been.

Also - surprised that SDN only gets a brief mention in the conclusion - I thought, reading the essay, that's the direction he was going, and then it was over.

ealexhudson · on June 8, 2016

Presumably part of the solution he envisages is the rtbrick thing he's working on, which doesn't unstealth until next month. Feels like the start of a marketing trail to me.

danellis · on June 8, 2016

Yeah, I thought SDN was the link to the title: "replace routers with L3 switches that populate their forwarding tables using OpenFlow queries", or something.

dang · on June 8, 2016

What would be a better (i.e. accurate and neutral) title?

ghshephard · on June 9, 2016

From the conclusion:

The router, and the dynamic control-plane, as basic forwarding paradigm of the Internet, remains undisputed. However, it gets challenged using new concepts like SDN and NFV, which promise much faster network adoption, automated control, reduced time-to-revenue, which all are good business solutions. In order for router designs to be competitive to those challenges, requires to re-imagine how router hardware and software get engineered.

So, "Re-imagining the future of routers"

dang · on June 9, 2016

Good idea. Thanks!

(All: suggesting a good title is the best way to complain about a bad one.)

hueving · on June 8, 2016

"I am proposing to fundamentally rethink the router, adopt modern software architecture and paradigms, and urge the industry to catch up after 10 years of stagnation."

I think the author may have been living in a hole. This is not a new idea. Datacenter routing in many of the big companies is already being done with SDN ala OpenFlow or some other custom protocol.

Right now you can buy a whitebox 'switch' with 40gbps interfaces and load various operating systems that enable different management styles (e.g. OpenFlow control like Google http://opennetsummit.org/archives/apr12/hoelzle-tue-openflow...).

The router has already been 're-thought', it's currently just hiding under the term 'whitebox datacenter switch'.

CodingInfra · on June 8, 2016

This is not mainstream. I forget which OFC this was first presented at, but the Google presenter said they had ~120 employees dedicated to this project. Most companies don't have Google's talent or resources to implement something equivalent. Everyone is quick to reference Google, but can you tell me 2 other commercial customers (non-research) who have deployed OpenFlow at scale - probably not. ;-) Operators who are doing this today are the exception.

Going back to Linux, fact is using Linux on a bare metal switch today is like using Slackware in 1995. The need for a more turn-key Linux distro, with better apps (better routing, telemetry, ...etc) is needed to compliment white box hardware today. For a long time the HW has been the limiting factor since most merchant PFE platforms supported <50k FIB entries. Now with 1M+ (on chip) around the corner, SW quality and scale will need to improve accordingly.

hueving · on June 8, 2016

>who have deployed OpenFlow at scale

I didn't say it was OpenFlow everywhere. I meant custom software in some way (i.e. not an off the shelf Juniper/Cisco/whatever). OpenFlow has pretty narrow use cases once you dump reactive flows so it's definitely rare in the wild.

For other customers using whitebox switches with different OS's booted, look at whoever is paying companies like Cumulus/Big Switch.

pas · on June 8, 2016

Probably every OpenStack hosting company uses something like OpenFlow to provide the virtual/tenant/overlay network, so Rackspace et al.

Also see Project Calico for containers, they use the linux kernel as FIB, but obviously they could just as well use OpenFlow to program switches for bare metal machines (which then might run containers).

pquerna · on June 8, 2016

As deployed, most OpenStack SDNs will use OVS or similar software on top of a (xen,kvm) Hypervisor, connected to some kind of central controller, which just maybe would look kinda like OpenFlow. But all of this is running as an overlay on top of a much more traditional network.

This is very different than a TOR switch or core-router running OpenFlow. Only a few places are running more than a couple thousand VMs in a single SDN, and its nothing like what is being discussed above.

pas · on June 10, 2016

Sure, and OVS (especially with OVN) is pretty capable. (And we haven't even touched OVS DPDK.)

But if you need the big guns there are a ton of OpenStack Neutron plugins, OpenDaylight being one of them (networking-odl) and there are the classical/traditional vendor ones (Cisco, Arista, and so on).

I think the future is very much means this jungle of APIs integrating in whatever tangled ways, and eventually the reliable, robust and sane ones will prevail.

hannesgredler · on June 8, 2016

where theory hits reality is when those whiteboxes and their routing-stacks encounter the 600K routes coming from 40 different exits. still no viable solution out there today- quagga, bird, any ?

hueving · on June 8, 2016

>where theory hits reality is when those whiteboxes and their routing-stacks encounter the 600K routes coming from 40 different exits. still no viable solution out there today- quagga, bird, any ?

Well handling the routes themselves can be offloaded to regular servers. The issue is indeed managing the forwarding tables. "Something something aggregation, non contiguous bitmasks, NDA" runs wildly for the door. :)

bogomipz · on June 8, 2016

The IPv4 table today is about 610K routes. There shouldn't be a problem fitting the RIB, associated BGP communities and AS Path info in 32 Gigs of DDR 4 RAM.

There are plenty of boards that will hold 750Mbs of RAM from the usual folks - DELL, HP and Supermicro. So why is there no viable option? Also this is a hardware concern. How does choice of open source solution - Vyatta, Qugga or Bird matter?

What is the issue?

wmf · on June 8, 2016

The issue is that Tomahawk has far less than 610K entries of TCAM and a dozen different teams are exploring various types of RIB caching to accommodate that.

scurvy · on June 9, 2016

Tomahawk is old and busted. Jericho is the new hotness.

Full tables in FIB.

https://www.broadcom.com/press/release.php?id=s902223

wmf · on June 9, 2016

Shh! You're ruining it.

bogomipz · on June 9, 2016

Sorry, what is Tomahawk?

hannesgredler · on June 9, 2016

Its a 3TBits/s chipset from Broadcom https://www.broadcom.com/products/ethernet-communication-and...

bogomipz · on June 9, 2016

Thanks for the link, but how does this relate to my comment asking comment asking why you can't build your own router with commodity hardware? Isn't the Tommahawk and Jericho intended for ODMs who are building traditional switches?

wmf · on June 9, 2016

It's pointless to use anything other than these ASICs because their price/performance is so far ahead. Software routing is especially pointless.

scurvy · on June 9, 2016

A new whitebox based on Jericho is your solution.

walrus01 · on June 9, 2016

> Right now you can buy a whitebox 'switch' with 40gbps interfaces and load various operating systems that enable different management styles (e.g. OpenFlow control like Google http://opennetsummit.org/archives/apr12/hoelzle-tue-openflow...).

This is all very useful for in-building traffic until you get the traffic to the edge of your datacenter/colo/hosting environment network and need to exchange traffic with other ISPs... At which point you need a serious chassis based router with redundancy and full layer 3 capabilities (example: Juniper MX960, Cisco ASR9006 or 9010).

whitebox datacenter switches are a great way to get a lot of 10, 40 and 100 Gbps layer 2 ethernet switch capacity within a datacenter cheaply, they're NOT routers and should not be mistaken with them. They are things that you connect your hypervisor hardware platforms to (example: a whole shitload of 1U servers or facebook/OCP type servers each with dual socket, 16-core xeons in them), is a switch.

wmf · on June 9, 2016

It's only a matter of time until whitebox routers exist; Accton recently submitted one to OCP. Then it will take a few years for the software stack to be built.

hackney · on June 8, 2016

Personally, I think what needs to be rethought is the fact they are a separate device from the actual pc. Imo, they should exist inside the computer, cable modem included. As far as switches are concerned, obviously those will take more time to absorb into the general infrastructure of things. Simply put from a user perspective, it would be nice to have the 'router' inside the box and not separated from it.

forgottenpass · on June 8, 2016

they should exist inside the computer, cable modem included

Not _that_ kind of router. The type of router connected to you cable modem is a box that performs NAT, not IP routing.

hackney · on June 8, 2016

Yes, that kind. I have a cable modem router that has no buttons or controls on it whatsoever, separate from the pc, obviously. Then, I have the actual wireless router attached to that. I could have bought a single cable modem/wireless router. Regardless, depending upon the application there are essentially all one and the same in intent.

icebraining · on June 8, 2016

There are many such devices, e.g. the Netgear N600. In any case, you're not talking about the kind of routers being discussed in this thread.

hackney · on June 8, 2016

If the 'thread' applies to rethinking routers then I have every right to have posted what I have. Seems you are the one that wants to make an argument while I was making my own personal suggestion. Instead of assuming that you are smart while I am not, I suggest you let it go.

icebraining · on June 8, 2016

All I'm saying is that you probably won't have much response here; no personal attack intended.

hackney · on June 20, 2016

I am nowhere near skilled enough to even understand the majority. Shoots, I just installed my very first personal home router. Also, I might add, this reminds me of the Intel management engine, which to my eye is like a router built-in in some ways.

moonshinefe · on June 8, 2016

I can appreciate there are better systems out there now, but yeah, it won't be the end of the router any time soon in my opinion. If the transition from IPv4 -> IPv6 is any indication it's going to be a very, very long time before any of these new technologies gain traction, if at all.

Remember, IPv6 is a necessity in the future if we want to continue to allocate addresses without running out, while better routing methods are simply an upgrade. So I'm not sure there will be as strong as a push either.

Sarki · on June 8, 2016

Also think about sub networks: your entry point being in IPv6 while the rest remains in IPv4.

Even though the use of hostnames is convenient (and encouraged in IPv6) it's already a headache for knowledgeable people to set up, so imagine non tech guys (you're father or your grandma).

Unless we come up with a simple and easy process for this we're still miles away from a full IPv6 world...

soneil · on June 8, 2016

Honestly, it's probably easier for your grandparents. It'll just show up one day. it'll just work, and they'll never know it exists. (this isn't just hypothetical - it's already happening)

chrishacken · on June 8, 2016

I wouldn't really say that. Our whole network is IPv6 using NAT64 to translate for IPv4 only networks.

tetha · on June 8, 2016

Thinking about it, I think the death of a router is impossible as long as there is a network. In the very very end, routers and firewalls are responsible for decoupling responsibilities between network segments - this part of the network belongs to company A, this part is the network is part of company B. This part of the network is trusted, this one is not. How are you going to get rid of that notion?

al2o3cr · on June 8, 2016

"Lack of micro-services architecture renders technical debt possible."

[citation needed]

Also, the statement seems designed to encourage the reader to accept the converse: that somehow a microservice architecture will render technical debt impossible. High-grade bovine excrement, that...

pconner · on June 8, 2016

Yup. I like microservices as an architecture paradigm, but they are a tool, and can be used well or misused just like any other tool. And sometimes they are not the right tool for a job.

TickleSteve · on June 8, 2016

Absolutely.... microservices are good... but they are nothing new.

Microservices are just "low coupling, high cohesion" reinvented for the web-heads.

https://en.wikipedia.org/wiki/Coupling_(computer_programming...

NKCSS · on June 8, 2016

This is so true for many software projects.

The premise is asking to remove a piece of functionality from software:

"Not possible? — Reason it is not possible is because things have been constructed as a monolithic system, mostly by just compiling a new feature. Most often, a given feature is intimately linked to the underlying infrastructure (like an in-memory database, or some event queue processor), and, removing it out of the code base, may get to an effort as large as originally developing the feature. In most cases there is no dedication on how to clean things up later. Every added feature will make future feature additions harder. If you just make the number of supported software features large enough, you can extrapolate, that at some point, this will become unmaintainable. External measure of such a condition, is too hard it is to get functionality into a particular main line release. If you are already using software which can never get “de-featured”, I have bad news — you are doomed to spend your life in the “eternal bug hell.” Availability goes down, operational cost goes up, and your vendor cannot possibly fix it. Time to change vendors is the only way out."

walrus01 · on June 9, 2016

I was not expecting to see a photo of a 12 year old kid holding a Juniper T640 FPC with PICs in it. How often do you let a 12 year old hold something worth possibly $50,000?

mcguire · on June 8, 2016

From About the author:

"Therefore i co-founded rtbrick.com where those Hyper-scale design principles are followed, to build the next generation distributed routing and forwarding platform with unbounded scale on your choice of open hardware."

And the comments:

"We are building a routing/system stack which both runs on vanilla ubuntu 14.04 as well as open network linux. The nice thing about our system is that it does not make any locality assumptions. — You can run the BGP control-plane distributed over several compute nodes and the IS-IS control running on different nodes. Yet the whole thing acts as a coherent system and can drive a set of bare-metal switches (e.g. A Dell Z9100)."

chaz6 · on June 8, 2016

It seems every year there is a new layer of complexity added to networks. If we kept it simple and put the intelligence in the application layer where it belongs, routers would be more efficient than they are.

jewel · on June 8, 2016

Is it possible to simplify the routing table based on the local topology? In other words, if I have a core router that has 100 local peers, can I take the full routing table and find multiple entries that have the same common prefix and the same next hop and combine them, reducing the number of entries that need to be kept in memory?

I imagine if this provided potential gains that it'd already be a known technique, but I can't seem to find any information about it one way or the other.

amazon_not · on June 8, 2016

It's called route aggregation or summarization. The problem is that the routing table is not static, but changes constantly. You can't just aggregate once, you'll have to deaggregate and update the routing table with each route update. You also have to be careful that you don't lose information when you aggregate routes, otherwise you'll end up with routing problems.

bogomipz · on June 8, 2016

The author states:

"Yet, most routers still support a 100ms+ buffer depth for 100GB/s circuits. Just do the math. You need 1.25 GB DDR4 RAM for each 100GB/s port in a given router."

What is the math? It's not clear at all how he arrived at that calculation. That seems like quite an important detail to omit in your first supporting paragraph. Just saying "just do the math" when its not clear what that is is a bit ridiculous.

mpitt · on June 8, 2016

(Assuming GB/s is really Gb/s, gigaBIT vs gigaBYTE.)

  buffer size = throughput * latency
  1.25 GB = (100 Gb/s / 8) * 0.1 s

cataflam · on June 8, 2016

100 Gigabit/s * 100 ms = 10 Gigabit = 1.25 Gigabyte ?

hannesgredler · on June 8, 2016

thanks, i have taken the liberty to C&P your explanation into the blog.

Animats · on June 8, 2016

It's an ad for the guy's startup.

And why is he capitalizing like it's 1390 AD?

oofabz · on June 8, 2016

Based on his name I'm guessing his first language is German. He's accustomed to capitalizing all nouns instead of only proper nouns as we do in English.

scurvy · on June 9, 2016

While the author has an pointed view of the world and how to solve all of these "problems", his answer is basically an Arista switch/router based on the Jericho chipset. Full routes, wirespeed, big FIB, basic software.

But yeah, it's still a router. It's not "carrier grade" big expensive router in the words of Dave Temkin, but it's still a router and will probably smoke the market for lower-end Juniper MX's and Cisco ASR's.

caseymarquis · on June 8, 2016

Genuinely curious, is there an alternative to spanning tree? He'd mentioned not implementing it, but that feature is a life saver the 1% of the time that you need it.

wkz · on June 8, 2016

Shortest Path Bridging (802.1aq)

Uses IS-IS to distribute the link-state db. And allows you to utilize an entire mesh, not just a sub-tree.

technimad · on June 8, 2016

TRILL. Basically is-is applied at OSI Layer 2

pas · on June 8, 2016

Manual configuration, or using iBGP (so you do L3 switching basically), see Project Calico, see what Facebook does (IP-IP encapsulation where the outer header is basically their L2 the switched fabric).

hueving · on June 8, 2016

>so you do L3 switching basically

Ugh this term bothers me. Just call it routing! That hails from a day when routers sucked so much wind the marketing team at Cisco had to invent a new term for the stuff that did it fast.

pas · on June 8, 2016

I usually call it routing, but I hoped OP would recognize the term.

faded242 · on June 8, 2016

iBGP happens at L3, how is that going to stop a L2 loop exactly? There are plenty of alternatives to using spanning tree in a L2 network these days, you just have to have the right equipment to support it. There's TRILL, cisco's FabricPath (L2MP for everyone else).

pas · on June 8, 2016

You disable L2 forwarding except to your iBGP link-local peers.

Yes, I'd very much like to see TRILL gaining more widespread usage and attention, but fundamentally it's equivalent to IP-IP encapsulation plus IS-IS for link-state and iBGP for the outer IP layer, just with fancy terminology (RBridges and so on) and standardized administration/operational semantics. (Which is a good thing of course.)

scurvy · on June 9, 2016

Layer 3 routing everywhere. No need for spanning tree. You've got things like VXLAN that can emulate layer 2 inside of layer 4 where necessary (mostly things like vmotion).

I run several spanning-tree free networks and will never go back.

walrus01 · on June 9, 2016

an alternative to spanning-tree? don't use it and don't try to extend layer 2 switch fabrics beyond a reasonable distance. things at unique geographical locations should not be able to ARP each other.

the8472 · on June 8, 2016

How much more hardware would be needed if we turned on IP-multicast (at least the source-scoped kind) for everyone?

bgilroy26 · on June 8, 2016

https://news.ycombinator.com/item?id=10936132

digi_owl · on June 8, 2016

I must admit this is one dense article.

pottspotts · on June 8, 2016

[flagged]

hannesgredler · on June 8, 2016

apologies ! - get them constantly wrong in german and so do i in english.