This looks unbelievably simple, along the lines of "why hasn't it been done before?"
So you ping 10.3.4.16 and your host automatically 'knows' to just send it to 172.16.3.4, where, lying in wait, the receiving host simply forwards it on to 10.3.4.16. I like it.
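Roughly, the address arithmetic looks like this (just a sketch, assuming a 10.0.0.0/8 overlay mapped onto a 172.16.0.0/16 underlay; the addresses and the function name are illustrative, not anything from the Fan tooling itself):

    # Container 10.X.Y.Z lives on the host whose underlay address ends in .X.Y,
    # so the sender can compute the tunnel endpoint locally, with no lookup
    # table or database.
    def underlay_host(overlay_ip, underlay_prefix="172.16"):
        _, x, y, _ = overlay_ip.split(".")
        return f"{underlay_prefix}.{x}.{y}"

    print(underlay_host("10.3.4.16"))  # -> 172.16.3.4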
This is a vexing problem for containers and even VM networking. If they are behind NAT you need to create a mesh of tunnels across hosts, or else a flat network so they are all on the same subnet. But you can't do this for containers in the cloud with a single IP and limited control of the networking layer.
Current solutions include L2 overlays, L3 overlays, a big mishmash of GRE and other types of tunnels, VXLAN multicast (unavailable in most cloud networks), or proprietary unicast implementations. It's a big hassle.
Ubuntu have taken a simple approach: no per-node database to maintain state, and it uses common networking tools. More importantly, it seems fast, and it's here now. That 6 Gbps figure suggests it doesn't compromise performance the way a lot of other solutions do. It won't solve every multi-host container networking use case, but it will address many.
Though you should probably use netlink to do it programmatically. Personally I like to combine netlink + ZooKeeper or similar to trigger edge updates via watches.
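Something like this pattern (a rough sketch, not my actual setup; the znode path and the update_fdb() helper are made-up placeholders):

    from kazoo.client import KazooClient

    def update_fdb(peers):
        # Placeholder: push RTM_NEWNEIGH/RTM_DELNEIGH updates for the current
        # peer list, e.g. via pyroute2 (see the FDB sketch further down).
        pass

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # The watch fires on every membership change under the znode, assuming
    # each child node is named after a peer host's IP.
    @zk.ChildrenWatch("/overlay/peers")
    def on_peers_changed(peers):
        update_fdb(peers)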
Are you referring to the FDB tables? I tried that some months ago but it didn't seem to work. Maybe it's changed now. I will give it a shot. Any tips?
I remember seeing a patch floating around that added support for multiple default destinations in VXLAN unicast, but I think some objections were raised and it hasn't made it through. At least it's not there in 4.1-rc7. That would be quite nice to have.
Oh, multiple default destinations - that would be very cool!
Right now I am using netlink to manage FDB entries; last time I tried, the iproute2 utility worked too...
The only tricky thing about doing it with netlink is that the FDB uses the same API as the ARP table, specifically RTM_NEWNEIGH/RTM_DELNEIGH/RTM_GETNEIGH. Apart from that it's pretty simple.
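For reference, something like this with pyroute2 (the device name and remote VTEP address are placeholders), roughly the equivalent of a "bridge fdb add 00:00:00:00:00:00 dev vxlan0 dst 172.16.3.4":

    from pyroute2 import IPRoute

    ipr = IPRoute()
    idx = ipr.link_lookup(ifname="vxlan0")[0]   # "vxlan0" is a placeholder name

    # fdb() sends RTM_NEWNEIGH with family AF_BRIDGE under the hood, which is
    # why the FDB shares the neighbour-table (ARP) netlink API.
    ipr.fdb("add", ifindex=idx,
            lladdr="00:00:00:00:00:00",   # all-zero MAC = default destination
            dst="172.16.3.4")             # remote VTEP for unknown/broadcast traffic
    ipr.close()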
Yes, the only thing that's lost is live migration of IPs between hosts. Which may or may not be a big thing for containers, depending on how things are clustered.
I would like to object and say "There is IPv6 in the cloud!".
We developed re6stnet in 2012. You can use it to create an IPv6 network on top of an existing IPv4 network. It's open source and we have been using it ourselves, internally and in client deployments, ever since.
Sorry, I just gotta rant a bit... this is a really bad hack that I wouldn't trust on a production system. Instead of doubling down and working on better IPv6 support with providers and in software configuration, and defining best practices for working with IPv6, they just kinda gloss over it with a 'not supported yet' and develop a whole system that will very likely break things in random ways.
> More importantly, we can route to these addresses much more simply, with a single route to the “fan” network on each host, instead of the maze of twisty network tunnels you might have seen with other overlays.
Maybe I haven't seen the other overlays (they mention flannel), but how does this not become a series of twisty network tunnels? Except now you have to manually add the addresses (static IPv4 addresses!) of the hosts to the route table? I see this as a huge step backwards... now you have to maintain address-space routes amongst a bunch of container hosts?
Also, they mention having up to 1000s of containers on laptops, but then their solution scales only to ~250 per host before you need to set up another route + multi-homed IP? Or wipe out entire /8s?
> If you decide you don’t need to communicate with one of these network blocks, you can use it instead of the 10.0.0.0/8 block used in this document. For instance, you might be willing to give up access to Ford Motor Company (19.0.0.0/8) or Halliburton (34.0.0.0/8). The Future Use range (240.0.0.0/8 through 255.0.0.0/8) is a particularly good set of IP addresses you might use, because most routers won't route it; however, some OSes, such as Windows, won't use it. (from https://wiki.ubuntu.com/FanNetworking)
Why are they reusing IP address space marked 'not to be used'? Surely there will be some router, firewall, or switch that will drop those packets arbitrarily, resulting in very-hard-to-debug errors.
--
This problem is already solved with IPv6. Please, if you have this problem, look into using IPv6. This article has plenty of ways to solve this problem using IPv6:
>Sorry, I just gotta rant a bit... this is a really bad hack that I wouldn't trust on a production system. Instead of doubling down and working on better IPv6 support with providers and in software configuration, and defining best practices for working with IPv6, they just kinda gloss over it with a 'not supported yet'
Yeah, how pragmatic of them. Instead of waiting for pie-in-the-sky "let's all get together to pressure people to improve tons of infrastructure we don't own" action (which can always happen in parallel anyway), they solved their real problem NOW.
>and develop a whole system that will very likely break things in random ways.
Citation needed else it's just FUD. The links explained how it works well enough for that.
The overlay networks are not necessarily hacks - just a souped-up, more distributed, auto-configured VPN. Also, especially in flannel's case, you hand it IPv4 address space to use for the whole network, so there is a bit more coordination of which space gets used.
Even so, there are lots of ways to get IPv6 now. I would think that anywhere you could use this fan solution to change firewall settings and route tables on the host, you could also set up an IPv6 tunnel or address space. Even with some workarounds for not having a whole routed subnet, like using Proxy NDP.
It seems like a much more future-proof solution than working with something like this. Just my 2c...
You do not appear to be familiar with the problem domain that this addresses, and I think the fan device addresses the problem very very well compared to its competitors! It's nothing like a VPN, it's just IP encapsulation without any encryption or authentication. And it's far far far less of a hack than the distributed databases currently used for network overlays, like Calico or MidoNet or all those other guys, IMHO. For example take this sentence from the article:
> Also, IPv6 is nowhere to be seen on the clouds, so addresses are more scarce than they need to be in the first place.
There are a lot of people using AWS who are not in control of the entire network. If they were in control of the entire network, they could just assign a ton of internal IP space to each host. IPv6 is great, sure, but if it's not on the table it's not on the table.
We will be testing the fan mechanism very soon, and it will likely be used as part of any LXC/Docker deploy, if we ever get to deploying them in production.
No, I get what this addresses... It is really just a wrapper around building a network bridge and having the host do routing / packet forwarding and encapsulation. Nothing a little iptables can't handle :)
I wasn't aware that AWS still doesn't have support for IPv6; that's just amazingly bad in 2015. I'll shift my blame onto them, then, for spawning all these crazy workarounds.
IPv6 is hard. It's hard to optimize, it's hard to harden, and it's hard to protect against.
One small example: how do you implement an IPv6 firewall which keeps all of China and Russia out of your network? (My apologies to folks living in China and Russia; I've just seen a lot of viable reasons to do this in the past.)
Another small example: How do you enable "tcp_tw_recycle" or "tcp_tw_reuse" for IPv6 in Ubuntu?
Maybe we should start thinking of security in terms of 'how can we build things that are actually secure by design' instead of 'how can we use stupid IP-level hacks to block things because our stuff is swiss cheese'?
None of this really applies to VPC (which is a private virtual network for only your own hosts, where access is restricted lower down than at the IP layer). You actually can have a public IPv6 address on AWS, it just has to go through an ELB.
To be clear, I was not saying that you can give an ELB in a VPC an IPv6 address. I was saying you can give a non-VPC ELB an IPv6 address. Basically I was pointing out that, however imperfect, Amazon has chosen to prioritize public access to IPv6 over private use of it.
This is only for containers communicating with each other within the FAN. Any traffic bound to external networks would have to be NAT'd, which is fine.
I'd like to see a better explanation of how this compares to the various Flannel backends (https://github.com/coreos/flannel#backends), and also how this would be plugged into a Kubernetes cluster.
IP-IP is the first encapsulation supported, but the Fan is engineered in a way that any encapsulation scheme can be added easily. We'll be adding support for VXLAN, GRE, STP tunnels as well. My colleagues will have blog posts about how to enable Kubernetes clusters with Fan Networking.
That's been the promise for... how many years now? IIRC, it was before EC2 even existed, and we're obviously not there yet.
Also, it's worth noting that IPv6 is nowhere near as battle-hardened as IPv4; there are too many optimization and security gaps to depend on it in production. I've watched a few network gurus burn themselves out attempting to harden a corporate network against IPv6 attacks while keeping it usable.
But seriously, what is Amazon's big holdup here? Is it really that hard to turn on IPv6 support on their hardware? I can kind of see end user ISPs dragging their heels on IPv6 because they don't want to deal with grandmas wondering why they can't access their quilting forums (which have a busted IPv6 configuration) suddenly, but for something like EC2 you have to assume that the people using it have at least a little technical proficiency.
To build on epistasis's comment a bit, this creates a private network for all containers on all hosts that reside in the same /16 network. So if you have a VPC of up to 65k machines, each machine can run up to ~250 containers that can all talk directly to each other by just relying on basic network routing. This is better than your typical private NAT bridge networking because containers on different hosts can talk to each other without having to set up port forwarding or discovering what port that particular application server is running on.
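To make that concrete, here's a sketch of the carving just described (again assuming a 10.0.0.0/8 overlay on a 172.16.0.0/16 underlay; the addresses are examples only):

    import ipaddress

    def host_container_subnet(host_ip, overlay_octet="10"):
        # A host at A.B.X.Y on the /16 underlay owns the 10.X.Y.0/24 slice of
        # the overlay, i.e. roughly 250 usable container addresses per host.
        _, _, x, y = host_ip.split(".")
        return ipaddress.ip_network(f"{overlay_octet}.{x}.{y}.0/24")

    print(host_container_subnet("172.16.3.4"))   # -> 10.3.4.0/24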
I don't quite get how your solution is equivalent: how many IPs do you have available per host on, say, an EC2 instance? How do you talk to a container which isn't on the same host?
It feels similar to vxlan, only without the centralized repository of IP -> host mappings.
* cross-host container networking
* no overlay IP database to sync and maintain
* only a single IP used per host
They still use encapsulation, like other network overlay technologies, it's just that by using a specific addressing scheme they can eliminate a lot of cross-host communication and all the database lookups.
Yeah, I'm bemused. It seems like a solution to an issue that should have been designed away in the first place. But I'm speaking out of ignorance here so...
Probably doesn't matter much here, but 240.0.0.0/4 is hard coded to be unusable on Windows systems. It's in the Windows IP stack somewhere. Packets to/from that network will simply be dropped.
You can use absolutely any /8 that you want. I used 250.0.0.0/8 in my examples, but the FanNetworking wiki page uses 10.0.0.0/8. You can use any one you want. Have at it ;-)
Sure, DHCP/NAT is used by each container, to get out. But how does it route to another container on some other host elsewhere in your cloud? That's what the Fan addresses.
> Sure, DHCP/NAT is used by each container, to get out.
So...Each physical host has its software networking pass through a NAT before hitting the physical adapter? And it's already using DHCP to assign addresses to its containers?
> But how does it route to another container on some other host elsewhere in your cloud? That's what the Fan addresses.
Couldn't you use DNS to create a lookup table matching container to IP? Presumably, since this isn't being done, it must not work. What does Fan do instead?
Is there some other complicating factor to which I'm ignorant? Are we talking about having multiple Kubernetes clusters inside containers inside VMs inside a physical host?
Also, HOW does Fan address this problem? What does it use instead of one kind of database lookup or another, like a distributed database system such as DNS? Or does Fan use fancy subnet math?
Seems to be a smart solution, but it only works when you have control over the "real" /16 network, if I understand it correctly? E.g. having multiple nodes on multiple cloud providers with completely different IP addresses, not in the same /16 network, will not work, correct?
Why do people keep giving whole IP addresses to every little container? It's a terrible management paradigm compared to service discovery and using host ports for every service.
Because the alternative methods for handling intra-container communication are even bigger messes.
The Docker 1.6 solution involved a double NAT and relied on iptables, resulting in some fairly serious bottlenecks and pathological edge cases. It also required a third-party solution for handling discovery of IPs for services.
Opening up containers to access the host network interfaces breaks the encapsulation promises of containers, and is thus not available to all people. Conceptually, it also creates holes in the idempotent service model, since they have to be aware of port conflicts.
The one-IP-per-container model, as used by VXLAN, Flannel, Kubernetes, Docker 1.7, etc., is one of the more effective methods of countering the problem, at the cost of guzzling IP address space and requiring a gateway to escape the virtual network tunnel.
Because a lot of software stuff is written without intrinsic support for service discovery, which means all kinds of hacks (e.g. proxying; dynamically rewriting config files and reloading) to work around it if the ports may change. With an IP per container, things like low TTL DNS tied to a service registry (e.g. Consul, or SkyDNS with something to update Etcd) is viable and often a much easier alternative.
I agree with you from a purity point of view that proper service-discovery is better, but in terms of practicality, IP per container is often a lot simpler to implement.
Having control of your internal network does not fix missing service discovery support in all the applications you're running that like to assume port numbers are static and unchanging.
Because service discovery itself is ridiculous. I have to deploy three or more Consul/etcd nodes to run service discovery for my handful of EC2 instances and containers?
Of course not. You can maintain configuration files if you like, but anything beyond a relatively small network is going to be a pain to maintain. It's also not going to react very quickly to node failures or topology changes.
This is very neat indeed, and I'd love to try it out, but the launchpad links are broken. Anyone know where I can get the package for Ubuntu armhf? Or the source?