What caused today's Internet hiccup (bgpmon.net)
188 points by jvdh on Aug 13, 2014 | 27 comments



>The 512,000 route limitation can be increased to a higher number, for details see this Cisco doc

and that doc goes on to explain how to increase the limit at the cost of space for IPv6. Worse: the sample code (which everybody is going to paste) doubles the space for IPv4 at the cost of nearly all the IPv6 space, even though we should soon cross the threshold where IPv6 growth outpaces IPv4 growth.


There are only 18k IPv6 routes at the moment so the 256k in the default CAM allocation is way too much. In fact you could argue that this whole incident was made worse by Cisco believing people when they said v6 was going to take off, and wasting a bunch of memory on it.
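
To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch in Python. The shared-pool model and the two-TCAM-slots-per-IPv6-route assumption are mine, back-derived from the default 512k/256k carve rather than taken from the Cisco doc:

    # Rough FIB TCAM budget. Assumption (not vendor-confirmed): the default
    # 512k IPv4 + 256k IPv6 carve comes out of one shared pool, with an IPv6
    # entry costing two slots where an IPv4 entry costs one.
    TOTAL_SLOTS = 512_000 + 2 * 256_000   # back-derived from the default carve

    def ipv6_room(ipv4_routes_reserved: int) -> int:
        """IPv6 routes that still fit after reserving TCAM slots for IPv4."""
        return (TOTAL_SLOTS - ipv4_routes_reserved) // 2

    print(ipv6_room(512_000))    # 256000 - the default carve
    print(ipv6_room(600_000))    # 212000 - a modest bump, still far above today's ~18k v6 routes
    print(ipv6_room(1_000_000))  # 12000  - roughly the "double the IPv4 space" carve

Under those assumptions, a modest bump buys plenty of IPv4 headroom without touching meaningful IPv6 capacity, while the doubled carve leaves less room than the IPv6 table already needs.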


Is there a better workaround that doesn't hinder IPv6?

I agree that it would be better not to harm IPv6 growth. But if the alternative is to break the IPv4 internet then the choice seems obvious.


The default sample code (which everybody is going to copy without reading about the consequences) could increase the v4 space only slightly instead of doubling it; we'll hopefully never need that much, as v6 is going to grow much more quickly now (again: hopefully).


It does not harm IPv6 growth. Even years from now the IPv4 table will still be larger than the IPv6 table. IPv6 routes take far more TCAM space per entry, so even reassigning a small chunk of that allocated space to IPv4 will help.


I'd presume that the best workaround would be to begin replacing the older Cisco kit with newer models.


That's not a workaround - but yes, these things need to be upgraded. They're probably unmaintained anyway since good networking departments don't want to be working with super old gear.


Yes it's the solution rather than a workaround - my very British, dry sense of humour, at work there.

Although it could be argued that "workaround" is a sub-set of "solution" ;)


So, we go back to Verizon not upgrading and maintaining their network? (a la net neutrality debate vs. Level3)[1]

[1] http://blog.level3.com/global-connectivity/verizons-accident...


Who said it's Verizon running the old Cisco routers?


The OP's article (Verizon probably wasn't alone):

> So whatever happened internally at Verizon caused aggregation for these prefixes to fail which resulted in the introduction of thousands of new /24 routes into the global routing table. This caused the routing table to temporarily reach 515,000 prefixes and that caused issues for older Cisco routers.


Verizon might very well be running current hardware. They announced too many routes, which let the global table grow too big, which in turn caused problems with routers all over the place.

The connectivity issue was caused by the routing table growing too big for old routers in use all over the place.

The routing table grew too big because of a mistake by Verizon, which might or might not have been running old routers.


It might not even have been Verizon's mistake. When you peer with an ISP, it is the customer's router that announces the routes. Announcing all of your networks individually is literally a one-line change in most router configs, so it's an easy mistake to make. If the ISP is being defensive, and good ones are, they filter incoming routes to make sure they belong to the customer. But they may not have a filter that mandates that the routes are aggregated, since that usually isn't a problem.
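
As a rough illustration of how quickly deaggregation inflates the table, here is a plain-Python sketch using the standard ipaddress module (the prefix is a made-up example, not one of the routes involved):

    import ipaddress

    # A hypothetical aggregate that a customer would normally announce as one route.
    aggregate = ipaddress.ip_network("10.0.0.0/16")

    # If aggregation fails or is switched off, the same space shows up as /24s.
    deaggregated = list(aggregate.subnets(new_prefix=24))

    print(len(deaggregated))                  # 256 routes instead of 1
    print(deaggregated[0], deaggregated[-1])  # 10.0.0.0/24 10.0.255.0/24

A handful of large blocks leaking like that is enough to add the thousands of extra /24s that pushed the global table over the 512k mark.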


And where does that say Verizon is running old Cisco routers?


"which everybody is going to paste"

I question the assertion that many people in charge of Cisco routers doing BGP are in the habit of pasting in configuration changes without thinking about them.


That's a nice site with some interesting graphs, but they're a bit higher-level than the simplest kind of surveillance, so I wouldn't start there just to see whether there even is a problem. One lower-level, simple technique to determine or isolate whether a problem exists is to monitor the TCP port 179 (a.k.a. BGP) traffic rate between your BGP speakers and your peers / customers. If the routers have nothing to talk about between each other, then there IS nothing to talk about, at least WRT routing problems. And if one of "my" routers was having an intense discussion with another router, I knew something was up in that general direction. It can also be basically completely passive and completely isolated from the routing systems, which is cool. Just sniff-n-graph TCP 179 bandwidth over time. You'd like to see a nice horizontal low line of keepalives; reboots or restarts make a nasty spike. I never got much agreement on this, but a log y-axis is probably for the best.
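
A minimal version of that sniff-n-graph idea, sketched in Python with scapy. The one-second buckets and printing instead of graphing are arbitrary choices for the sketch; a real setup would feed the counts into whatever grapher you already run:

    from collections import defaultdict
    import time

    from scapy.all import sniff, IP, TCP

    # Passive BGP-chatter monitor: count bytes seen on TCP port 179 per second.
    # This watches only the control-plane conversation; it says nothing about
    # whether the data plane is actually forwarding traffic.
    bytes_per_second = defaultdict(int)

    def count(pkt):
        if IP in pkt and TCP in pkt:
            bytes_per_second[int(pkt.time)] += len(pkt)   # one-second buckets

    # The BPF filter keeps this cheap; run it on a mirror/span port, not on the router.
    sniff(filter="tcp port 179", prn=count, store=False, timeout=60)

    for bucket in sorted(bytes_per_second):
        print(time.strftime("%H:%M:%S", time.localtime(bucket)), bytes_per_second[bucket])

On a quiet network that prints a flat trickle of keepalives; a route flap or session reset shows up as an obvious spike.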

Obviously this only finds routing-level problems. We can send a /17 to you just fine, but if you're having an IGP problem and sending every byte of it to null, well, from the BGP perspective that's just fine. Similarly, if you insist on sending us RFC 1918 traffic, we'll drop that route and the traffic for you just fine, just like we had to eat the 0/0 route you were trying to get us to advertise to the entire internet. I think my head still has a flat spot from hitting it on the desk arguing with people.

It's been a decade since I did that stuff professionally at a regional ISP and I really don't miss it. Not much, anyway.


I like Renesys's take [1] on the subject as well:

Note that there’s no good exact opinion about the One True Size of the Internet — every provider we talk to has a slightly different guess. The peak of the distribution today (the consensus) is actually only about 502,000 routes, but recognizably valid answers can range from 497,000 to 511,000, and a few have straggled across the 512,000 line already.

[1] http://www.renesys.com/2014/08/internet-512k-global-routes/

It's interesting how they explain that, since there's no true consensus on the actual size of the routing table, the "event" of crossing the 512k barrier has frankly already begun ... and, so far, it hasn't been catastrophic, nor does it look likely to be.


It doesn't go on to say what exactly happens on the routers in question, but I guess they simply close the session and log an error?


No, the router still forwards the traffic, but in software rather than hardware. Read the section entitled "Background Information" inside the Cisco document linked at the bottom of the article.

In particular, the telling error message is:

%MLSCEF-DFC4-7-FIB_EXCEPTION: FIB TCAM exception, Some entries will be software switched
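
If you wanted to catch that message yourself rather than read about it afterwards, a trivial syslog watcher is enough. This is only a sketch, assuming the routers already log to a syslog host; the log path is an assumption, and the print stands in for whatever alerting you actually use:

    import re
    import time

    # Watch the syslog stream for the TCAM-exception message quoted above.
    LOGFILE = "/var/log/network.log"   # assumption: wherever your syslog host writes router messages
    PATTERN = re.compile(r"MLSCEF.*FIB TCAM exception")

    def follow(path):
        """Yield new lines appended to the file, tail -f style."""
        with open(path) as fh:
            fh.seek(0, 2)              # start at the end of the file
            while True:
                line = fh.readline()
                if not line:
                    time.sleep(1)
                    continue
                yield line

    for line in follow(LOGFILE):
        if PATTERN.search(line):
            print("ALERT: TCAM full, router is software-switching:", line.strip())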


On something like a Sup720 that's a 600MHz MIPS processor, so it's not going to break any speed records. And forwarding traffic is considered lower priority than essential things like routing protocols so once you start hitting the CPU you'll see packet loss and high latency.


A few things...

1) There is a 1-gig inband channel to the CPU. The traffic prioritization referred to (selective packet discard) really doesn't matter once the inband channel is saturated. When you start punting large amounts of traffic to the CPU, it will take down your routing protocols, kill ARP, etc., even with selective packet discard, because of that saturation. The only way to prevent this is CoPP in the fast path.

2) Inband traffic is interrupt-driven on this platform. High inband traffic, by itself, will cause the CPU to spike and drop protocols (missed hellos, etc.).

The result will be, without a doubt, an outage. Packet loss and high latency would be the best-case scenario, but only on a box that doesn't carry much traffic (typically not the case for anything taking full tables).

As a side note, these boxes should have started alarming well before overrunning the TCAM (IIRC, it begins at 97% utilization), so operators should have had notice to implement the necessary TCAM carving changes.


3% of 512,000 is 15,360. So if the table truly spiked by something in the 15k range, that notice may have been very short.


Probably just the NSA upgrading some software.


I'm curious what Verizon's story is here.


"Netflix made us do it." Ok, I agree that is a bit too cynical.

For anyone who recalls Anonymous threatening to 'take down the internet' by DDoSing DNS servers: who knew they could have done it much more simply by dumping 100K BGP paths into the network?


It is relatively easy to trace who injected new BGP routes though, versus a DDoS from a botnet of machines that are difficult to link to an individual.


Very true, but all it takes is the 'right' compromised server and that seems to be quite achievable with APT types.



