This looks unbelievably simple, along the lines of "why hasn't it been done before?"
So you ping 10.3.4.16 and your host automatically 'knows' to just send it to 172.16.3.4, where, lying in wait, the receiving host simply forwards it on to 10.3.4.16. I like it.
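Roughly, the address arithmetic looks like this (just a sketch, assuming a 10.0.0.0/8 overlay mapped onto a 172.16.0.0/16 underlay; the addresses and the function name are illustrative, not anything from the Fan tooling itself):

    # Container 10.X.Y.Z lives on the host whose underlay address ends in .X.Y,
    # so the sender can compute the tunnel endpoint locally, with no lookup
    # table or database.
    def underlay_host(overlay_ip, underlay_prefix="172.16"):
        _, x, y, _ = overlay_ip.split(".")
        return f"{underlay_prefix}.{x}.{y}"

    print(underlay_host("10.3.4.16"))  # -> 172.16.3.4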
This is a vexing problem for containers and even VM networking. If they are behind NAT you need to create a mesh of tunnels across hosts, or else a flat network so they are all on the same subnet. But you can't do this for containers in the cloud with a single IP and limited control of the networking layer.
Current solutions include L2 overlays, L3 overlays, a big mishmash of GRE and other types of tunnels, VXLAN multicast (unavailable in most cloud networks), or proprietary unicast implementations. It's a big hassle.
Ubuntu have taken a simple approach: no per-node database to maintain state, and it uses common networking tools. More importantly, it seems fast, and it's here now. That 6 Gbps figure suggests it doesn't compromise performance the way a lot of other solutions do. It won't solve every multi-host container networking use case, but it will address many.
Though you should probably use netlink to do it programmatically. Personally I like to combine netlink + ZooKeeper or similar to trigger edge updates via watches.
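Something like this pattern (a rough sketch, not my actual setup; the znode path and the update_fdb() helper are made-up placeholders):

    from kazoo.client import KazooClient

    def update_fdb(peers):
        # Placeholder: push RTM_NEWNEIGH/RTM_DELNEIGH updates for the current
        # peer list, e.g. via pyroute2 (see the FDB sketch further down).
        pass

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # The watch fires on every membership change under the znode, assuming
    # each child node is named after a peer host's IP.
    @zk.ChildrenWatch("/overlay/peers")
    def on_peers_changed(peers):
        update_fdb(peers)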
Are you referring to the FDB tables? I tried that some months ago but it didn't seem to work. Maybe it's changed now. I will give it a shot. Any tips?
I remember seeing a patch floating around that added support for multiple default destinations in VXLAN unicast, but I think some objections were raised and it hasn't made it through. At least it's not there in 4.1-rc7. That would be quite nice to have.
Oh, multiple default destinations - that would be very cool!
Right now I am using netlink to manage FDB entries; last time I tried, the iproute2 utility worked too...
The only tricky thing about doing it with netlink is that the FDB uses the same API as the ARP table, specifically RTM_NEWNEIGH/RTM_DELNEIGH/RTM_GETNEIGH. Apart from that it's pretty simple.
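For reference, something like this with pyroute2 (the device name and remote VTEP address are placeholders), roughly the equivalent of a "bridge fdb add 00:00:00:00:00:00 dev vxlan0 dst 172.16.3.4":

    from pyroute2 import IPRoute

    ipr = IPRoute()
    idx = ipr.link_lookup(ifname="vxlan0")[0]   # "vxlan0" is a placeholder name

    # fdb() sends RTM_NEWNEIGH with family AF_BRIDGE under the hood, which is
    # why the FDB shares the neighbour-table (ARP) netlink API.
    ipr.fdb("add", ifindex=idx,
            lladdr="00:00:00:00:00:00",   # all-zero MAC = default destination
            dst="172.16.3.4")             # remote VTEP for unknown/broadcast traffic
    ipr.close()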
Yes, the only thing that's lost is live migration of IPs between hosts. Which may or may not be a big thing for containers, depending on how things are clustered.
I would like to object and say "There is IPv6 in the cloud!".
We developed re6stnet in 2012. You can use it to create an IPv6 network on top of an existing IPv4 network. It's open source and we have been using it ourselves, internally and in client deployments, ever since.
Sorry, I just gotta rant a bit... this is a really bad hack that I wouldn't trust on a production system. Instead of doubling down and working on better IPv6 support with providers and in software configuration, and defining best practices for working with IPv6, they just kinda gloss over it with a 'not supported yet' and develop a whole system that will very likely break things in random ways.
> More importantly, we can route to these addresses much more simply, with a single route to the “fan” network on each host, instead of the maze of twisty network tunnels you might have seen with other overlays.
Maybe I haven't seen the other overlays (they mention flannel), but how does this not become a series of twisty network tunnels? Except now you have to manually add the addresses (static IPv4 addresses!) of the hosts to the route table? I see this as a huge step backwards... now you have to maintain address-space routes amongst a bunch of container hosts?
Also, they mention having up to 1000s of containers on laptops, but then their solution scales only to ~250 per host before you need to set up another route + multi-homed IP? Or wipe out entire /8s?
> If you decide you don’t need to communicate with one of these network blocks, you can use it instead of the 10.0.0.0/8 block used in this document. For instance, you might be willing to give up access to Ford Motor Company (19.0.0.0/8) or Halliburton (34.0.0.0/8). The Future Use range (240.0.0.0/8 through 255.0.0.0/8) is a particularly good set of IP addresses you might use, because most routers won't route it; however, some OSes, such as Windows, won't use it. (from https://wiki.ubuntu.com/FanNetworking)
Why are they reusing IP address space marked 'not to be used'? Surely there will be some router, firewall, or switch that will drop those packets arbitrarily, resulting in very-hard-to-debug errors.
--
This problem is already solved with IPv6. Please, if you have this problem, look into using IPv6. This article has plenty of ways to solve this problem using IPv6:
>Sorry, I just gotta rant a bit... this is a really bad hack that I wouldn't trust on a production system. Instead of doubling down and working on better IPv6 support with providers and in software configuration, and defining best practices for working with IPv6, they just kinda gloss over it with a 'not supported yet'
Yeah, how pragmatic of them. Instead of waiting for pie-in-the-sky "let's all get together to pressure people to improve tons of infrastructure we don't own" action (which can always happen in parallel anyway), they solved their real problem NOW.
>and develop a whole system that will very likely break things in random ways.
Citation needed else it's just FUD. The links explained how it works well enough for that.
The overlay networks are not necessarily hacks - just a souped-up, more distributed, auto-configured VPN. Also, especially in flannel's case, you hand it IPv4 address space to use for the whole network, so there is a bit more coordination of which space gets used.
Even so, there are lots of ways to get IPv6 now. I would think that anywhere you could use this fan solution to change firewall settings and route tables on the host, you could also set up an IPv6 tunnel or address space. Even with some workarounds for not having a whole routed subnet, like using Proxy NDP.
It seems like a much more future-proof solution than working with something like this. Just my 2c...
You do not appear to be familiar with the problem domain that this addresses, and I think the fan device addresses the problem very very well compared to its competitors! It's nothing like a VPN, it's just IP encapsulation without any encryption or authentication. And it's far far far less of a hack than the distributed databases currently used for network overlays, like Calico or MidoNet or all those other guys, IMHO. For example take this sentence from the article:
> Also, IPv6 is nowhere to be seen on the clouds, so addresses are more scarce than they need to be in the first place.
There are a lot of people using AWS who are not in control of the entire network. If they were in control of the entire network, they could just assign a ton of internal IP space to each host. IPv6 is great, sure, but if it's not on the table it's not on the table.
We will be testing the fan mechanism very soon, and it will likely be used as part of any LXC/Docker deploy, if we ever get to deploying them in production.
No, I get what this addresses... It is really just a wrapper around building a network bridge and having the host do routing / packet forwarding and encapsulation. Nothing a little iptables can't handle :)
I wasn't aware that AWS still doesn't have support for IPv6; that's just amazingly bad in 2015. I'll shift my blame onto them, then, for spawning all these crazy workarounds.
IPv6 is hard. It's hard to optimize, it's hard to harden, and it's hard to protect against.
One small example: how do you implement an IPv6 firewall which keeps all of China and Russia out of your network? (My apologies to folks living in China and Russia; I've just seen a lot of viable reasons to do this in the past.)
Another small example: How do you enable "tcp_tw_recycle" or "tcp_tw_reuse" for IPv6 in Ubuntu?
Maybe we should start thinking of security in terms of 'how can we build things that are actually secure by design' instead of 'how can we use stupid IP-level hacks to block things because our stuff is swiss cheese'?
None of this really applies to VPC (which is a private virtual network for only your own hosts, where access is restricted lower down than at the IP layer). You actually can have a public IPv6 address on AWS, it just has to go through an ELB.
To be clear, I was not saying that you can give an ELB in a VPC an IPv6 address. I was saying you can give a non-VPC ELB an IPv6 address. Basically I was pointing out that, however imperfect, Amazon has chosen to prioritize public access to IPv6 over private use of it.
This is only for containers communicating with each other within the FAN. Any traffic bound to external networks would have to be NAT'd, which is fine.
I'd like to see a better explanation of how this compares to the various Flannel backends (https://github.com/coreos/flannel#backends), and also how this would be plugged into a Kubernetes cluster.
IP-IP is the first encapsulation supported, but the Fan is engineered in a way that any encapsulation scheme can be added easily. We'll be adding support for VXLAN, GRE, STP tunnels as well. My colleagues will have blog posts about how to enable Kubernetes clusters with Fan Networking.
That's been the promise for... how many years now? IIRC, it was before EC2 even existed, and we're obviously not there yet.
Also, it's worth noting that IPv6 is nowhere near as battle-hardened as IPv4; there are too many optimization and security gaps to depend on it in production. I've watched a few network gurus burn themselves out attempting to harden a corporate network against IPv6 attacks while keeping it usable.
But seriously, what is Amazon's big holdup here? Is it really that hard to turn on IPv6 support on their hardware? I can kind of see end user ISPs dragging their heels on IPv6 because they don't want to deal with grandmas wondering why they can't access their quilting forums (which have a busted IPv6 configuration) suddenly, but for something like EC2 you have to assume that the people using it have at least a little technical proficiency.
To build on epistasis's comment a bit, this creates a private network for all containers on all hosts that reside in the same /16 network. So if you have a VPC of up to 65k machines, each machine can run up to ~250 containers that can all talk directly to each other by just relying on basic network routing. This is better than your typical private NAT bridge networking because containers on different hosts can talk to each other without having to set up port forwarding or discovering what port that particular application server is running on.
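To make that concrete, here's a sketch of the carving just described (again assuming a 10.0.0.0/8 overlay on a 172.16.0.0/16 underlay; the addresses are examples only):

    import ipaddress

    def host_container_subnet(host_ip, overlay_octet="10"):
        # A host at A.B.X.Y on the /16 underlay owns the 10.X.Y.0/24 slice of
        # the overlay, i.e. roughly 250 usable container addresses per host.
        _, _, x, y = host_ip.split(".")
        return ipaddress.ip_network(f"{overlay_octet}.{x}.{y}.0/24")

    print(host_container_subnet("172.16.3.4"))   # -> 10.3.4.0/24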
I don't quite get how your solution is equivalent: how many IPs do you have available per host on, say, an EC2 instance? How do you talk to a container which isn't on the same host?
It feels similar to vxlan, only without the centralized repository of IP -> host mappings.
* cross-host container networking
* no overlay IP database to sync and maintain
* only a single IP used per host
They still use encapsulation, like other network overlay technologies, it's just that by using a specific addressing scheme they can eliminate a lot of cross-host communication and all the database lookups.
Yeah, I'm bemused. It seems like a solution to an issue that should have been designed away in the first place. But I'm speaking out of ignorance here so...
Probably doesn't matter much here, but 240.0.0.0/4 is hard coded to be unusable on Windows systems. It's in the Windows IP stack somewhere. Packets to/from that network will simply be dropped.
You can use absolutely any /8 that you want. I used 250.0.0.0/8 in my examples, but the FanNetworking wiki page uses 10.0.0.0/8. You can use any one you want. Have at it ;-)
Sure, DHCP/NAT is used by each container, to get out. But how does it route to another container on some other host elsewhere in your cloud? That's what the Fan addresses.
> Sure, DHCP/NAT is used by each container, to get out.
So...Each physical host has its software networking pass through a NAT before hitting the physical adapter? And it's already using DHCP to assign addresses to its containers?
> But how does it route to another container on some other host elsewhere in your cloud? That's what the Fan addresses.
Couldn't you use DNS to create a lookup table matching container to IP? Presumably, since this isn't being done, it must not work. What does Fan do instead?
Is there some other complicating factor to which I'm ignorant? Are we talking about having multiple Kubernetes clusters inside containers inside VMs inside a physical host?
Also, HOW does Fan address this problem? What does it use instead of one kind of database lookup or another, like a distributed database system such as DNS? Or does Fan use fancy subnet math?
Seems to be a smart solution, but it only works when you have control over the "real" /16 network, if I understand it correctly? E.g. having multiple nodes on multiple cloud providers with completely different IP addresses, not in the same /16 network, will not work, correct?
Why do people keep giving whole IP addresses to every little container? It's a terrible management paradigm compared to service discovery and using host ports for every service.
Because the alternative methods for handling intra-container communication are even bigger messes.
The Docker 1.6 solution involved a double NAT and relied on iptables, resulting in some fairly serious bottlenecks and pathological edge cases. It also required a third-party solution for handling discovery of IPs for services.
Opening up containers to access the host network interfaces breaks the encapsulation promises of containers, and is thus not available to all people. Conceptually, it also creates holes in the idempotent service model, since they have to be aware of port conflicts.
The one-IP-per-container model, as used by VXLAN, Flannel, Kubernetes, Docker 1.7, etc., is one of the more effective methods of countering the problem, at the cost of guzzling IP address space and requiring a gateway to escape the virtual network tunnel.
Because a lot of software stuff is written without intrinsic support for service discovery, which means all kinds of hacks (e.g. proxying; dynamically rewriting config files and reloading) to work around it if the ports may change. With an IP per container, things like low TTL DNS tied to a service registry (e.g. Consul, or SkyDNS with something to update Etcd) is viable and often a much easier alternative.
I agree with you from a purity point of view that proper service-discovery is better, but in terms of practicality, IP per container is often a lot simpler to implement.
Having control of your internal network does not fix missing service discovery support in all the applications you're running that like to assume port numbers are static and unchanging.
Because service discovery itself is ridiculous. I have to deploy three or more Consul/etcd nodes to run service discovery for my handful of EC2 instances and containers?
Of course not. You can maintain configuration files if you like, but anything beyond a relatively small network is going to be a pain to maintain. It's also not going to react very quickly to node failures or topology changes.
This is very neat indeed, and I'd love to try it out, but the launchpad links are broken. Anyone know where I can get the package for Ubuntu armhf? Or the source?