The duplication is worse than that. It's a data structure problem. Docker deals in opaque disk images, a linear, order-dependent sequence of them. The data structure is built this way because Docker has no knowledge of what the dependency graph of an application really is. This greatly limits the space/bandwidth efficiency Docker can ever hope to have. Cache hits are just too infrequent.
So how do we improve? Functional package and configuration management, such as with GNU Guix. In Guix, a package describes its full dependency graph precisely, as does a full-system configuration. Because this is a graph, and because order doesn't matter (thanks to being functional and declarative), packages or systems that conceptually share branches really do share those branches on disk. The consequence of this design, in the context of containers, is that shared dependencies amongst containers running on the same host are deduplicated system-wide. This graph has the nice feature of being inspectable, unlike Docker where it is opaque, and allows for maximum cache hits.
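To make the difference concrete, here is a toy model of the two data structures. It is only a sketch: `layer_chain` and `store_closure` are made-up names, the package names are placeholders, and the hashing is not what Docker or Guix actually compute. It just shows why chaining IDs in build order defeats sharing, while keying each item on its own dependencies does not.

```python
import hashlib


def digest(*parts: str) -> str:
    """Hash a sequence of strings into one hex ID (toy scheme, an assumption)."""
    return hashlib.sha256("\0".join(parts).encode()).hexdigest()


def layer_chain(steps):
    """Linear model: each layer ID depends on its parent's ID, so order matters."""
    ids, parent = [], ""
    for step in steps:
        parent = digest(parent, step)
        ids.append(parent)
    return ids


def store_closure(package, graph, store):
    """Graph model: an item's key depends only on its name and its dependencies' keys."""
    def key(name):
        dep_keys = sorted(key(dep) for dep in graph[name])
        k = digest(name, *dep_keys)
        store[k] = name  # identical subgraphs map to the same key, stored once
        return k

    key(package)
    return store


# Two images with the same dependencies installed in a different order.
image_a = layer_chain(["install libc", "install openssl", "install app-a"])
image_b = layer_chain(["install openssl", "install libc", "install app-b"])
shared_layers = set(image_a) & set(image_b)  # empty: nothing is reused

# The same two applications described as a dependency graph.
graph = {
    "libc": [],
    "openssl": ["libc"],
    "app-a": ["openssl", "libc"],
    "app-b": ["libc", "openssl"],
}
store = {}
store_closure("app-a", graph, store)
store_closure("app-b", graph, store)
# store now holds four items: libc and openssl appear once, shared by both apps.
```

In the linear model the two images share no layers even though they contain the same libraries. In the graph model the listing order of dependencies is irrelevant, so `libc` and `openssl` each get one key and one copy.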
> The duplication is worse than that. It's a data structure problem. Docker deals in opaque disk images, a linear, order-dependent sequence of them. The data structure is built this way because Docker has no knowledge of what the dependency graph of an application really is. This greatly limits the space/bandwidth efficiency Docker can ever hope to have. Cache hits are just too infrequent.
This is only true when you're building your images. Distributing them doesn't have this problem. And the new content-addressability stuff means that you can get reproducible graphs (read: more dedup).
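Roughly what I mean by content-addressability, as a minimal sketch. `BlobStore` is a hypothetical class for illustration, not a Docker or registry API: blobs are keyed by the hash of their bytes, so two images that contain a byte-identical layer end up pointing at the same stored blob.

```python
import hashlib


class BlobStore:
    """Hypothetical content-addressed store: the key is the hash of the bytes."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        key = "sha256:" + hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)  # identical content is stored only once
        return key


registry = BlobStore()
base = b"contents of a base layer"  # placeholder bytes standing in for a layer

# Each image is just a list of blob keys.
image_a = [registry.put(base), registry.put(b"app A files")]
image_b = [registry.put(base), registry.put(b"app B files")]
```

Both images reference the same key for the base layer, and the store holds three blobs rather than four. The catch is that the layers have to be byte-identical for this to kick in, which is where reproducible builds matter.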
> So how do we improve? Functional package and configuration management, such as with GNU Guix. In Guix, a package describes its full dependency graph precisely, as does a full-system configuration. Because this is a graph, and because order doesn't matter (thanks to being functional and declarative), packages or systems that conceptually share branches really do share those branches on disk. The consequence of this design, in the context of containers, is that shared dependencies amongst containers running on the same host are deduplicated system-wide. This graph has the nice feature of being inspectable, unlike Docker where it is opaque, and allows for maximum cache hits.
For what it's worth, I would actually like to see proper dependency graph support with Docker. I don't think it'll happen with the current state of Docker, but if we made a fork it might be practical. At SUSE, we're working on doing rebuilds when images change with Portus (which is free software). But there is a more general problem of keeping libraries up to date without rebuilding all of your software when using containers. I was working on a side-project called "docker rebase" (code is on my GitHub) that would allow you to rebase these opaque layers without having to rebuild each one. I'm probably going to keep working on it at some point.