Hunting down the stuck BGP routes (benjojo.co.uk)
222 points by bswinnerton on April 21, 2021 | 32 comments



From the article:

> With the current “default free zone” containing around 1,000,000 routes

Back in ~1998 I was tasked with building a route collector/looking glass machine for an internet exchange point (sadly defunct). I remember the day we switched the collector on and acquired "all the routes": there were ~98,000 of them, and you could've knocked me over with a feather. It was like looking into the Total Perspective Vortex. Having been out of that game for many years now, I'd no idea we were up to 1M routes... wow. At one of the RIPE conferences I attended back then, there was much concern about the rapidly increasing size of the global routing table and whether vendors could build hardware powerful enough to keep up.

For anyone interested the route collector was built on FreeBSD (3.0 I think) and Zebra[0].

And finally, what a cracking blog, especially stuff like this:

https://blog.benjojo.co.uk/post/eve-online-bgp-internet

[0]: https://en.wikipedia.org/wiki/GNU_Zebra


Just looked at my router in docklands, 833,000 IPv4 routes, from 1.0.0.0/24 to 223.255.64.0/18

32,528 of them are in the 103.0.0.0/8 range, but on the other hand 21.0.0.0/8 is advertised once, no subnets at all (same with 26, 28, 30, 33, 73). I don't have a route for 9.0.0.0/8 aside from 9.9.9.0/24.

Only 108,000 IPv6 routes.


Two very different things are going on to cause the smaller number of IPv6 routes.

One, of course, is that some Autonomous Systems don't advertise IPv6: either they have no globally routable IPv6, or they only get IPv6 via a tunnel, so it's only advertised via their tunnel provider.

But more important, many Autonomous Systems only need to advertise one prefix in IPv6 because it's big enough. Even if your needs grow, because we'd done this before and because IPv6 addresses are plentiful the allocations were deliberately sparse - so your RIR can give you the adjacent addresses, meaning you still only need one route entry for your larger space.

With IPv4 a provider may find itself advertising hundreds or even sometimes thousands of routes to the same Autonomous System since the addresses they need to advertise aren't contiguous.
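To make the aggregation point concrete, here is a minimal sketch using Python's standard `ipaddress` module. The prefixes are documentation ranges picked purely for illustration, not real allocations, and `routes_needed` is a hypothetical helper name.

```python
import ipaddress

def routes_needed(prefixes):
    """Return the smallest set of prefixes that covers exactly the given ones."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return list(ipaddress.collapse_addresses(nets))

# Sparse IPv6 allocation: the extra space comes from the adjacent block,
# so the two halves merge into one shorter prefix (one route entry).
v6 = routes_needed(["2001:db8::/33", "2001:db8:8000::/33"])

# IPv4 blocks picked up at different times are not contiguous,
# so each one still needs its own route entry.
v4 = routes_needed(["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"])

print(v6)       # [IPv6Network('2001:db8::/32')]
print(len(v4))  # 3
```

The same growth therefore costs one routing table entry in the IPv6 case and three in the IPv4 case.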


Crazy! I first started working with BGP in 1996. I remember there being under 30,000 routes... total.


I remember back in 2008 when we reached 256k prefixes, the then limit of Cisco 6500 and 7600 routers, which were kind of the workhorses for a bunch of ISPs. Lots of places were buying memory expansion for those devices to cope. 6 years later, 512k day occurred.


Interesting historical perspective, thanks. And to put that growth into more recent perspective, in the last 7 years we have had a "512 K" day and "768 K" day. See:

https://cumulusnetworks.com/blog/768k-day-importance-adaptab...


This reminds me of when YouTube was down for a lot of the world when Pakistan banned YouTube and one of the country's telecom companies forgot to switch off their BGP route (if that is the correct terminology).[0] Half as Interesting made a nice YouTube video about it.[1]

[0] https://www.cnet.com/news/how-pakistan-knocked-youtube-offli...

[1] https://www.youtube.com/watch?v=K9gnRs33NOk


This is less likely to happen these days as BGP routes are now validated through an open registry like RPKI https://www.arin.net/resources/manage/rpki/

I don't know how widely it is deployed or which companies are using it, but I doubt that YouTube is as vulnerable today as it was in 2008. BGP security has a lot of traction these days and is an interesting topic to follow.


Thinking out loud: When I read the BGP spec, I got the feeling that it was optimized for reduced churn. As the Internet routing table size increased and increase in CPU power of routers was an uncertainty, the architects of the Internet wanted to avoid extra BGP exchanges.

However, now it seems like the Internet is facing new challenges and a different trade-off might make sense. Why not add a "valid until" attribute on each route? The originating router would have to re-announce a new route every 24 hours. Failure to propagate the update at any point would automatically withdraw it. Of course, re-announcing 1M routes every day might be a lot, but at this point it feels worth considering.
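A rough sketch of what the "valid until" idea could look like inside a single router's table. Everything here is hypothetical: BGP has no such attribute today, and the class and field names (`ExpiringRib`, `valid_until`) are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Route:
    prefix: str
    next_hop: str
    valid_until: float  # hypothetical attribute, not part of BGP

class ExpiringRib:
    """Toy routing table where every route carries an expiry time."""

    def __init__(self):
        self._routes = {}

    def announce(self, prefix, next_hop, now, lifetime=24 * 3600):
        # A re-announcement simply overwrites the entry and extends its life.
        self._routes[prefix] = Route(prefix, next_hop, now + lifetime)

    def withdraw(self, prefix):
        self._routes.pop(prefix, None)

    def expire(self, now):
        """Drop routes that were not refreshed in time; return their prefixes."""
        stale = [p for p, r in self._routes.items() if r.valid_until <= now]
        for p in stale:
            del self._routes[p]
        return stale

    def lookup(self, prefix):
        return self._routes.get(prefix)

rib = ExpiringRib()
rib.announce("192.0.2.0/24", "198.51.100.1", now=0)
# Suppose the withdraw gets lost upstream: nothing removes the route...
print(rib.expire(now=3600))           # []
# ...until the lifetime passes without a re-announcement.
print(rib.expire(now=24 * 3600 + 1))  # ['192.0.2.0/24']
```

The cost is the one the comment already names: every originator has to refresh every prefix within the lifetime, or the route disappears everywhere.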


I think 24 hours would be too slim. BGP is "routing by rumor" across a lot of ASes. I think a week would be more interesting, allowing for route propagation to be a little slower. Obviously this requires a real change to the spec, which in the case of BGP is unlikely to happen beyond little tweaks.


Interesting read. It's interesting that this is a bug in the specification and not the implementation, since BGP is so old. We must not hit this case often.


It's actually becoming more frequent due to BGP speakers being increasingly multithreaded. In the olden days, if you were overloaded, that also meant no more keepalives being sent. With the power of multithreading you can now simultaneously be overloaded and still send keepalives! :D


> With the power of multithreading you can now simultaneously be overloaded and still send keepalives! :D

Now that's progress!

Reminds me of some of the naive comments in a few of the recent posts on HN about multithreaded programming. It's not as easy as just adding more threads.

Or sending in more trains :) https://www.youtube.com/watch?v=-hyttagGsz0


Ah yes; that's one of my favorites - health check returning 200s instantaneously; actual service is a black hole.


One of many problems that are only solvable with a software watchdog.


I wonder if a robust consensus algorithm might be a better investment than a timeout. I would imagine there are other bugs in BGP implementations so having a routing table that's going to trend towards eventual consistency regardless of the starting point might be a more robust solution than just focusing on this one corner case. Might be a more intrusive change though & hard to get middleware to roll out such a change?


The goal of BGP or other routing protocols isn't consensus though. Each router really just wants to find a next hop for every destination, and there are lots of reasons for differences.

In the case where there are multiple next hops to choose from, it might be nice to have some sort of quality metric to decide, but that's really tricky to measure and integrate. It's really outside the scope of BGP.

You would need to instrument packet loss or transfer speed or something like that by destination path on your application servers (or load balancers), and be able to adjust the proportion of traffic through various paths; keeping in mind that you can't really influence the path beyond your routers or the return path.

It's a lot of work, and I don't think the tools are built for it, but I would love to work on it for someone though. I did something similar for SMS routing, but that's a ton easier, fewer choices, less traffic, clearer success, etc.


There isn't actually a consensus to be formed on the Internet. Communities, local configuration, etc. cause BGP routers to make local decisions about routes to advertise and re-advertise that aren't going to be part of a consensus.


Given the size and complexity of the Internet, it might be worth considering making BGP tolerant to Byzantine failures.


There isn't a really useful metric for failure, though. Not every prefix of every AS needs to be reachable from every other. Unlike consensus problems where everyone wants to agree on the same state it's sufficient for the Internet to be in a working state where each AS has enough routes that they care about, and BGP is pretty good at achieving that.


Nice article on the basic functionality of the Internet backbone. I really like the animation explaining this article with nice pictures. In short, BGP has a bug that potentially created a huge outage in August 2020. The proposed fix is to improve the BGP protocol with a new feature. It's not easy because it's the backbone of the Internet. Let's see where this will go.


Is a protocol change necessary here? Keep alives are already sent... and they would be held up if the TCP window hit 0? At which point the BGP/TCP session can be terminated and re-established.


BGP Keepalives are not request-reply, they are simple scheduled transmissions. Which means even if one side is not reading, it may still be sending keepalives. So the other side keeps the session open, despite its own keepalives sitting in its send queue.

Also, any valid BGP message resets the hold timer, so the reading side just needs to occasionally pop something off the full queue and process it. Which, say, if you're swapping to hell and back, can still get done. (Assuming its scheduler even gets around to killing things due to holdtime expiry. It might just not be expiring anything anymore for reasons of floating face-down in the river.)
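A toy model of the situation described above, seen from the healthy side. It is only a sketch: the `Session` class and its methods are invented for illustration, and the 90 s hold time with 30 s keepalives is just a commonly used pairing, not something taken from the article.

```python
class Session:
    """One side's view of a BGP session, reduced to its timers and send queue."""

    def __init__(self, hold_time=90.0):
        self.hold_time = hold_time
        self.last_received = 0.0
        self.send_queue = []  # messages we could not hand to the peer yet

    def on_message(self, now):
        # Any valid message from the peer restarts the hold timer.
        self.last_received = now

    def try_send(self, msg, peer_is_reading):
        # If the peer has stopped reading, the message just sits in our queue.
        if not peer_is_reading:
            self.send_queue.append(msg)
            return False
        return True

    def hold_expired(self, now):
        return now - self.last_received > self.hold_time

local = Session(hold_time=90.0)
local.try_send("WITHDRAW 192.0.2.0/24", peer_is_reading=False)

for now in range(0, 600, 30):
    local.on_message(now)  # the peer's keepalive still arrives on schedule
    local.try_send("KEEPALIVE", peer_is_reading=False)  # ours never leaves

expired = local.hold_expired(now=600)
print(expired)                # False
print(len(local.send_queue))  # 21
print(local.send_queue[0])    # WITHDRAW 192.0.2.0/24
```

After ten minutes the session still looks alive from this side, yet the withdraw has never been delivered, which is the stuck-route condition.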


I think the argument is if _your_ keep-alives are held up then currently you wait on _them_ terminating the session. If they are malicious or just not working well they may not do this.


You can see the window size is zero though, so I think GP is suggesting sending a TCP reset or something similar.

Maybe this isn’t a good option because it would have too many undesirable side effects?


The RFC proposes to change the BGP finite-state machine, not the protocol.


Like you, I don't see how a change in protocol is required; it only needs an update to the RFC to say an implementation SHOULD time out the connection if the send window is zero. That said, I haven't read the specs with a toothcomb, and perhaps there's something about how you MUST NOT drop the connection if you're getting keepalives?
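The "time out if you can't send" idea can be sketched as a timer that tracks send progress rather than receive activity. This is an illustration only: the class name, its methods, and the 480-second limit are assumptions made for the example, not values taken from the thread or from any vendor implementation.

```python
class SendHoldTimer:
    """Drop the session if queued data makes no send progress for too long."""

    def __init__(self, limit, now=0.0):
        self.limit = limit          # seconds we tolerate being unable to send
        self.last_progress = now    # last time the peer accepted any bytes

    def on_bytes_sent(self, nbytes, now):
        # Only real progress counts; a zero-byte write means the window is shut.
        if nbytes > 0:
            self.last_progress = now

    def should_drop(self, now, pending_bytes):
        # Nothing queued means nothing is stuck, however long it has been.
        return pending_bytes > 0 and now - self.last_progress > self.limit

timer = SendHoldTimer(limit=480.0)
timer.on_bytes_sent(100, now=10.0)
print(timer.should_drop(now=400.0, pending_bytes=50))  # False
print(timer.should_drop(now=500.0, pending_bytes=50))  # True
```

Note that the check ignores whether keepalives are still arriving from the peer, which is exactly the case the existing hold timer cannot catch.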

Get Cisco and Juniper to implement it and that's 75% of LINX covered at least, I assume other exchanges have similar equipment makeup.

It seems reasonable behaviour to me.

It doesn't prevent the problem of the malicious BGP peer of course, but we know that already - if they choose to ignore your messages (while being happy with a high send-window) but continue to send keepalives you're equally screwed.


If you don't put it in the RFC then you'll end up with five different solutions to this problem from five different vendors, and a nice 5x5 matrix of new hilarious edge cases when these are talking to each other and something wonky is happening to the TCP session.


So I keep running into situations where I think this is the problem that's occurring (a stuck route). While I'd certainly love to be able to diagnose this, would it even matter? There's no recourse that I can take as an end user, is there?


As an end user, not really. If you can get in touch with a NOC that understands the problem and is willing to listen to you, maybe. The problem is it can be pretty hard to find someone.


there's the NANOG mailing list


... hm, how come withdraw (and announce) messages are not ACKed in-band? or maybe they are, but due to explicit demonic of certain routers (and/or ASes) they still don't take effect?



