Introducing the Fan – simpler container networking (markshuttleworth.com)
142 points by TranceMan on June 23, 2015 | 69 comments



This looks unbelievably simple, along the lines of "why hasn't this been done before?"

So you ping 10.3.4.16 and your host automatically 'knows' to just send it to 172.16.3.4, where the receiving host, lying in wait, simply forwards it on to 10.3.4.16. I like it.
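In other words, the receiving host is derived by pure address arithmetic, with no lookup anywhere. A minimal sketch of that mapping (assuming the example ranges above: overlay 10.0.0.0/8, host underlay 172.16.0.0/16; this is not the actual fan tooling, just the idea):

  # Sketch only: container 10.X.Y.Z lives on the host whose
  # underlay address ends in .X.Y
  overlay=10.3.4.16
  x=$(echo "$overlay" | cut -d. -f2)   # 3
  y=$(echo "$overlay" | cut -d. -f3)   # 4
  echo "encapsulate and forward to host 172.16.$x.$y"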

This is a vexing problem for container and even VM networking. If they are behind NAT you need to create a mesh of tunnels across hosts, or you create a flat network so they are all on the same subnet. But you can't do that for containers in the cloud, where each host has a single IP and limited control of the networking layer.

Current solutions include L2 overlays, L3 overlays, a big mishmash of GRE and other types of tunnels, VXLAN multicast (unavailable in most cloud networks), or proprietary unicast implementations. It's a big hassle.

Ubuntu has taken a simple approach: no per-node database to maintain state, and it uses common networking tools. More importantly, it seems fast, and it's here now. That 6 Gbps figure suggests this doesn't compromise performance the way a lot of other solutions do. It won't solve every multi-host container networking use case, but it will address many.


You can use any method to program the VXLAN forwarding table; you don't need to use multicast.

This can even be done on the command line using iproute2 utilities: https://www.kernel.org/doc/Documentation/networking/vxlan.tx...

Though you should probably use netlink to do it programmatically. Personally I like to combine netlink with ZooKeeper or similar to trigger edge updates via watches.
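For example, a rough sketch of a unicast-only VXLAN with hand-programmed FDB entries (interface names, VNI and addresses are made up):

  # Create the VXLAN device with no multicast group and learning disabled.
  ip link add vxlan0 type vxlan id 42 dstport 4789 dev eth0 nolearning
  ip link set vxlan0 up
  # Flood unknown/broadcast traffic to a remote VTEP via the all-zeros entry.
  bridge fdb append 00:00:00:00:00:00 dev vxlan0 dst 192.0.2.10
  # Pin a specific container MAC to the host it lives on.
  bridge fdb add 0e:11:22:33:44:55 dev vxlan0 dst 192.0.2.11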


Are you referring to the FDB tables? I tried that some months ago but it didn't seem to work. Maybe it's changed now. I will give it a shot. Any tips?

I remember seeing a patch floating around that added support for multiple default destinations in VXLAN unicast, but I think some objections were raised and it hasn't made it through. At least it's not there in 4.1-rc7. That would be quite nice to have.

http://www.spinics.net/lists/netdev/msg238046.html#.VNs3rIdZ...


Oh, multiple default destinations - that would be very cool!

Right now I am using netlink to manage FDB entries; last time I tried, the iproute2 utility worked too...

The only tricky thing about doing it with netlink is that the FDB uses the same API as the ARP table, specifically RTM_NEWNEIGH/RTM_DELNEIGH/RTM_GETNEIGH. Apart from that it's pretty simple.
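For reference, the iproute2 equivalents map onto exactly those message types (addresses made up):

  bridge fdb add 0e:11:22:33:44:55 dev vxlan0 dst 192.0.2.11   # RTM_NEWNEIGH
  bridge fdb del 0e:11:22:33:44:55 dev vxlan0 dst 192.0.2.11   # RTM_DELNEIGH
  bridge fdb show dev vxlan0                                   # RTM_GETNEIGH dump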


Yes, the only thing that's lost is live migration of IPs between hosts, which may or may not be a big deal for containers, depending on how things are clustered.


Why does NAT require a mesh of tunnels? Are you trying to separate the containers onto securely separated networks?

Why isn't DHCP used? What does it not do that this service does?


I'd like to object and say: "There is IPv6 in the cloud!"

We developed re6stnet in 2012. You can use it to create an IPv6 network on top of an existing IPv4 network. It's open source, and we have been using it ourselves internally and in client deployments ever since.

I wrote a quick blogpost on it: http://www.nexedi.com/blog/blog-re6stnet.ipv6.since.2012

The repo is here in case anyone is interested: http://git.erp5.org/gitweb/re6stnet.git/tree/refs/heads/mast...



From the link in the parent comment:

> Well, most importantly we have stable IPv6 everywhere - including on IPv4 legacy networks.


rsync.net cloud storage has been ipv6 accessible since 2006.

I'm surprised that is still interesting/noteworthy.

HN Readers discount, since we're on the subject. Just email.


The evidence suggests otherwise: http://ip6.nl/#!rsync.net

> rsync.net does not appear to be IPv6 capable at all :-( 0 out of 5 stars.


That's our website.

We're a cloud storage provider. There's not much for you to do on our website.


Sorry, I just gotta rant a bit... this is a really bad hack that I wouldn't trust on a production system. Instead of doubling down and working on better IPv6 support with providers and in software configuration, and defining best practices for working with IPv6, they just kinda gloss over it with a 'not supported yet' and develop a whole system that will very likely break things in random ways.

> More importantly, we can route to these addresses much more simply, with a single route to the “fan” network on each host, instead of the maze of twisty network tunnels you might have seen with other overlays.

Maybe I haven't seen the other overlays (they mention flannel), but how does this not become a series of twisty network tunnels? Except now you have to manually add addresses (static IPv4 addresses!) of the hosts in the route table? I see this as a huge step backwards... now you have to maintain address space routes amongst a bunch of container hosts?

Also, they mention having up to 1000s of containers on laptops, but then their solution scales only to 250 before you need to set up another route + multi-homed IP? Or wipe out entire /8s?

> If you decide you don’t need to communicate with one of these network blocks, you can use it instead of the 10.0.0.0/8 block used in this document. For instance, you might be willing to give up access to Ford Motor Company (19.0.0.0/8) or Halliburton (34.0.0.0/8). The Future Use range (240.0.0.0/8 through 255.0.0.0/8) is a particularly good set of IP addresses you might use, because most routers won't route it; however, some OSes, such as Windows, won't use it. (from https://wiki.ubuntu.com/FanNetworking)

Why are they reusing IP address space marked 'not to be used'? Surely there will be some router, firewall, or switch that will drop those packets arbitrarily, resulting in very-hard-to-debug errors.

--

This problem is already solved with IPv6. Please, if you have this problem, look into using IPv6. This article has plenty of ways to solve this problem using IPv6:

https://docs.docker.com/articles/networking/

If your provider doesn't support IPv6, please try to use a tunnel provider to get your very own IPv6 address space.

like https://tunnelbroker.net/
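Once you have an allocation, bringing the tunnel up is only a few iproute2 commands (a sketch with placeholder documentation addresses; your broker supplies the real endpoints and prefix):

  # 6in4 (protocol 41) tunnel to the broker's endpoint.
  ip tunnel add he-ipv6 mode sit remote 203.0.113.1 local 198.51.100.2 ttl 255
  ip link set he-ipv6 up
  ip addr add 2001:db8:1::2/64 dev he-ipv6
  ip -6 route add ::/0 dev he-ipv6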

Spend the time to learn IPv6; you won't regret it 5-10 years down the road...


>Sorry, I just gotta rant a bit... this is a really bad hack that I wouldn't trust on a production system. Instead of doubling down and working on better IPv6 support with providers and in software configuration, and defining best practices for working with IPv6, they just kinda gloss over it with a 'not supported yet'

Yeah, how pragmatic of them. Instead of pie-in-the-sky "let's all get together and pressure people to improve tons of infrastructure we don't own" action (which can always happen in parallel anyway), they solved their real problem NOW.

>and develop a whole system that will very likely break things in random ways.

Citation needed, or else it's just FUD. The linked posts explain how it works well enough to judge that.


And what about those environments where only IPv4 is available, as specifically addressed in this article?

There are lots of overlay networks; are those all hacks too?

>Except now you have to manually add addresses (static IPv4 addresses!) of the hosts in the route table?

This does not appear to be true at all, based on the configuration that's posted at the bottom of the article.


The overlay networks are not necessarily hacks - just a souped-up, more distributed, auto-configured VPN. Also, especially in flannel's case, you hand it IPv4 address space to use for the whole network, so there is a bit more coordination of which space gets used.

Even so, there are lots of ways to get IPv6 now. I would think that anywhere you could use this fan solution to change firewall settings and route tables on the host, you could also set up an IPv6 tunnel or address space, even with some workarounds for not having a whole routed subnet, like using Proxy NDP.
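Proxy NDP, for example, is only a couple of sysctls and one neighbour entry per published address (a sketch; the interface name and address are made up):

  sysctl -w net.ipv6.conf.all.forwarding=1
  sysctl -w net.ipv6.conf.eth0.proxy_ndp=1
  # Answer neighbour solicitations for the container's address on the uplink.
  ip -6 neigh add proxy 2001:db8:1::100 dev eth0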

It seems like a much more future-proof solution than working with something like this. Just my 2c...


You do not appear to be familiar with the problem domain this addresses, and I think the fan device addresses the problem very, very well compared to its competitors! It's nothing like a VPN; it's just IP encapsulation without any encryption or authentication. And it's far, far less of a hack than the distributed databases currently used for network overlays, like Calico or MidoNet or all those other guys, IMHO. For example, take this sentence from the article:

> Also, IPv6 is nowhere to be seen on the clouds, so addresses are more scarce than they need to be in the first place.

There are a lot of people using AWS who are not in control of the entire network. If they were in control of the entire network, they could just assign a ton of internal IP space to each host. IPv6 is great, sure, but if it's not on the table, it's not on the table.

We will be testing the fan mechanism very soon, and it will likely be used as part of any LXC/Docker deploy, if we ever get to deploying them in production.


No, I get what this addresses... It is really just a wrapper around building a network bridge and having the host do routing / packet forwarding and encapsulation. Nothing a little iptables can't handle :)

I wasn't aware that AWS still doesn't have support for IPv6, that's just amazingly bad in 2015. I'll shift my blame onto them then for spawning all these crazy workarounds.


I don't know much about networking, but if you're on AWS and are using VPC then don't you have full control of the entire (virtual) network?


Sure, but it's only IPv4:

> Additionally, VPCs currently cannot be addressed from IPv6 IP address ranges.

http://aws.amazon.com/vpc/faqs/

And then you still have the problem of only so many IPs per host, so it doesn't help with lots of containers.


Anyone got the inside story on why in 2015 Amazon doesn't support IPv6?


IPv6 is hard. It's hard to optimize, it's hard to harden, and it's hard to protect against.

One small example: How do you implement a IPv6 firewall which keeps all of China and Russia out of your network? (My apologies to folks living in China and Russia, I've just seen a lot of viable reasons to do this in the past).

Another small example: How do you enable "tcp_tw_recycle" or "tcp_tw_reuse" for IPv6 in Ubuntu?


> How do you implement a IPv6 firewall which keeps all of China and Russia out of your network?

You do this by blocking the IP ranges that are assigned to China and Russia. Same as you would with IPv4; why would that change?
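Mechanically it's the same as v4, e.g. an ipset built from the RIR delegation files (the prefix below is just a placeholder):

  ipset create blocked6 hash:net family inet6
  ipset add blocked6 2001:db8:dead::/48      # repeat for each delegated prefix
  ip6tables -A INPUT -m set --match-set blocked6 src -j DROP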

Also, tcp_tw_recycle when set for IPv4 also applies for IPv6, despite the name...
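i.e. the same knob covers both address families:

  # Despite the net.ipv4 prefix, this applies to IPv6 TCP sockets too.
  sysctl -w net.ipv4.tcp_tw_reuse=1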


Maybe we should start thinking of security in terms of 'how can we build things that are actually secure by design' instead of 'how can we use stupid IP-level hacks to block things because our stuff is swiss cheese'?


None of this really applies to VPC (which is a private virtual network for only your own hosts, where access is restricted lower down than the IP layer). You actually can have a public IPv6 address on AWS; it just has to go through ELB.


You actually have it a bit backwards: you can only assign an IPv6 address to an ELB if it's not in a VPC.

http://docs.aws.amazon.com/ElasticLoadBalancing/latest/Devel...

Crazy, right? Especially since new customers are forced to use VPC and don't even have the option of falling back to EC2-Classic.


To be clear, I was not saying that you can give an ELB in a VPC an IPv6 address. I was saying you can give a non-VPC ELB an IPv6 address. Basically I was pointing out that, however imperfect, Amazon has chosen to prioritize public access to IPv6 over private use of it.


Ah, sorry for the misunderstanding then.


Actually, it does apply, even in a VPC. The inability to tune TCP sockets affects your ability to scale certain services.

It also makes routing within your VPC so much more entertaining to manage.


> you might be willing to give up access to ... Halliburton (34.0.0.0/8)

As someone who used to manage systems in 34.0.0.0/8 and 134.132.0.0/16, oh god I hope nobody does this...


> Surely there will be some router, firewall, or switch that will drop those packets arbitrarily, resulting in very-hard-to-debug errors.

Presumably the packets will be encapsulated so that should not be an issue.


Unless they filter encapsulated packets (protocol 4)... or are filtering ICMP...


In which case you should either pick a provider that doesn't intentionally break your network, or failing that, tunnel your traffic.


This is only for containers communicating with each other within the fan. Any traffic bound for external networks would have to be NAT'd, which is fine.
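i.e. one masquerade rule per host covers the outbound case (a sketch assuming the 10.0.0.0/8 overlay from the wiki example and eth0 as the uplink):

  iptables -t nat -A POSTROUTING -s 10.0.0.0/8 ! -d 10.0.0.0/8 -o eth0 -j MASQUERADE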


I'd like to see a better explanation of how this compares to the various Flannel backends (https://github.com/coreos/flannel#backends), and also how this would be plugged into a Kubernetes cluster.


IP-IP is the first encapsulation supported, but the Fan is engineered in a way that any encapsulation scheme can be added easily. We'll be adding support for VXLAN, GRE, and STP tunnels as well. My colleagues will have blog posts about how to enable Kubernetes clusters with Fan networking.


Or you could go somewhere with IPv6. The number of places with an IPv4-only restriction is only going to drop.


That's been the promise for... how many years now? IIRC, it was before EC2 even existed, and we're obviously not there yet.

Also, it's worth noting that IPv6 is nowhere near as battle-hardened as IPv4; there are too many optimization and security gaps to depend on it in production. I've watched a few network gurus burn themselves out attempting to harden a corporate network against IPv6 attacks while keeping it usable.


But seriously, what is Amazon's big holdup here? Is it really that hard to turn on IPv6 support on their hardware? I can kind of see end-user ISPs dragging their heels on IPv6 because they don't want to deal with grandmas wondering why they suddenly can't access their quilting forums (which have a busted IPv6 configuration), but for something like EC2 you have to assume that the people using it have at least a little technical proficiency.


> That's been the promise for... how many years now? IIRC, it was before EC2 even existed, and we're obviously not there yet.

For some values of "we". If you're stuck on EC2, yeah, you've got a problem.


But if you don't use EC2, just some bizarro provider no one really uses, then you can have IPv6...


I've heard OVH is the second biggest in the world, and they do have IPv6.


Nobody ever got fired for buying IBM, right?


Because AWS is some legacy dinosaur and not the world leader in such infrastructure, right?


"Also, IPv6 is nowehre to be seen on the clouds, so addresses are more scarce than they need to be in the first place."

We've[1] had IPv6-addressable cloud storage since 2006.

Currently our US (Denver), Hong Kong (Tseung Kwan O) and Zurich locations have working IPv6 addresses.

[1] You know who we are.


What percent of the traffic is actually flowing through the ipv6 ones compared to v4 -- <10% I'd guess (just curious)?


I don't seem to get it. How is this different from just using a non-routed IP per container?


To build on epistasis's comment a bit, this creates a private network for all containers on all hosts that reside in the same /16 network. So if you have a VPC of up to 65k machines, each machine can run up to ~250 containers that can all talk directly to each other, relying on nothing but basic network routing. This is better than your typical private NAT bridge networking because containers on different hosts can talk to each other without having to set up port forwarding or discover what port that particular application server is running on.
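A sketch of what that looks like on one host (device names are made up, and the actual fan tooling sets this up for you):

  # This host (underlay 172.16.3.4) owns the 10.3.4.0/24 slice of the overlay.
  ip addr add 10.3.4.1/24 dev lxcbr0      # local containers bridge here
  # Everything else in the overlay goes into the encapsulating device.
  ip route add 10.0.0.0/8 dev fan-tun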


I don't quite get how your solution is equivalent: how many IPs do you have available per host in, say, an EC2 instance? How do you talk to a container that isn't on the same host?

It feels similar to VXLAN, only without the centralized repository of IP -> host mappings.


Benefits:

  * cross-host container networking
  * no overlay IP database to sync and maintain
  * only a single IP used per host

They still use encapsulation, like other network overlay technologies; it's just that by using a specific addressing scheme they can eliminate a lot of cross-host communication and all the database lookups.


Yeah, I'm bemused. It seems like a solution to an issue that should have been designed away in the first place. But I'm speaking out of ignorance here so...


Probably doesn't matter much here, but 240.0.0.0/4 is hard coded to be unusable on Windows systems. It's in the Windows IP stack somewhere. Packets to/from that network will simply be dropped.


That will certainly not be the only obstacle to running your containers on Windows.


You can use absolutely any /8 that you want. I used 250.0.0.0/8 in my examples, but the FanNetworking wiki page uses 10.0.0.0/8. You can use any one you want. Have at it ;-)


On Windows?


I've read the article twice. Did they just reinvent putting DHCP behind a NAT? What does that combination of systems not do that Fan does?

  * Remap 250 addresses from one range to another.
  * Dynamically assign those addresses to servers.
  * Special Something that Fan does.

What's the benefit of using a full class A subnet when you are only using 250 addresses?


Sure, DHCP/NAT is used by each container, to get out. But how does it route to another container on some other host elsewhere in your cloud? That's what the Fan addresses.


> Sure, DHCP/NAT is used by each container, to get out.

So...Each physical host has its software networking pass through a NAT before hitting the physical adapter? And it's already using DHCP to assign addresses to its containers?

> But how does it route to another container on some other host elsewhere in your cloud? That's what the Fan addresses.

You use DNS to create a lookup table matching container to IP? Since this isn't being done, it must not work. What does Fan do instead?

Is there some other complicating factor to which I'm ignorant? Are we talking about having multiple Kubernetes clusters inside containers inside VMs inside a physical host?

Also, HOW does Fan address this problem? What does it use instead of one kind of database lookup or another, like a distributed database system such as DNS? Or does Fan use fancy subnet math?


Seems to be a smart solution, but it only works when you have control over the "real" /16 network, if I understand it correctly? E.g. having multiple nodes on multiple cloud providers, with completely different IP addresses not in the same /16 network, will not work, correct?


Why do people keep giving whole IP addresses to every little container? It's a terrible management paradigm compared to service discovery and using host ports for everything.


Because the alternative methods for handling intra-container communication are even bigger messes.

The Docker 1.6 solution involved a double NAT and relied on iptables, resulting in some fairly serious bottlenecks and pathological edge cases. It also required a third-party solution for handling discovery of IPs for services.
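For reference, that's the familiar publish-a-port flow, which lands as a DNAT rule on the host (the rule shown is approximate):

  docker run -d -p 8080:80 nginx
  # roughly becomes, in the host's nat table:
  #   iptables -t nat -A DOCKER ! -i docker0 -p tcp --dport 8080 \
  #       -j DNAT --to-destination 172.17.0.2:80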

Opening up containers to access the host network interfaces breaks the encapsulation promises of containers, and is thus not available to all people. Conceptually, it also creates holes in the idempotent service model, since services have to be aware of port conflicts.

The one-IP-per-container model (VXLAN, Flannel, Kubernetes, Docker 1.7, etc.) is one of the more effective methods of countering the problem, at the cost of guzzling IP address space and requiring a gateway to escape the virtual network tunnel.


> Opening up containers to access the host network interfaces breaks the encapsulation promises of containers

Who made that promise? It was never a feature of Linux containers to virtualize the NIC.


You would be surprised at the number of people who don't understand that containers are a form of namespacing + isolation, not a form of virtualisation.


Because a lot of software is written without intrinsic support for service discovery, which means all kinds of hacks (e.g. proxying, or dynamically rewriting config files and reloading) to work around it if the ports may change. With an IP per container, something like low-TTL DNS tied to a service registry (e.g. Consul, or SkyDNS with something to update etcd) is viable and often a much easier alternative.

I agree with you from a purity point of view that proper service-discovery is better, but in terms of practicality, IP per container is often a lot simpler to implement.
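Concretely, with one IP per container the "discovery" step can be as simple as a DNS query against the registry (a sketch assuming a local Consul agent and a service registered as "web"):

  # Consul serves DNS on port 8600; with an IP per container the A record
  # alone is enough, no port indirection needed.
  dig @127.0.0.1 -p 8600 web.service.consul +short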


This is all for your internal network though, which you should be able to control, right?


Having control of your internal network does not fix missing service-discovery support in all the applications you're running that like to assume that port numbers are static and unchanging.


Because service discovery itself is ridiculous. I have to deploy three or more Consul/etcd nodes to run service discovery for my handful of EC2 instances and containers?


Of course not. You can maintain configuration files if you like, but anything beyond a relatively small network is going to be a pain to maintain. It's also not going to react very quickly to node failures or topology changes.


This is very neat indeed, and I'd love to try it out, but the launchpad links are broken. Anyone know where I can get the package for Ubuntu armhf? Or the source?



