The default question I give interviewees for a dev or devops position is to tell me what a container is. Usually I get the "well it's like a VM but not a VM," and then I ask the follow-up: what are the differences between a container and a VM? A lot of hemming and hawing, but in the 6 or so years I've been interviewing people who are going to be using Docker, I have yet to hear the answer I'm looking for, which is: "a container is a process." I usually leave it at that and wait to see if I get a confused look on their face. Then I say, "with cgroups and namespaces." It's really not that complicated (to explain, anyway).
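If you want to see how little magic there is, here's a minimal sketch in Go (Go since that's what Docker itself is written in; assumes Linux and root privileges, and is roughly what unshare -pmu does, not anything Docker literally runs):

    package main

    import (
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        // Launch a shell as a "container": just a child process placed
        // in new PID, UTS, and mount namespaces at clone() time.
        cmd := exec.Command("/bin/sh")
        cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWPID | syscall.CLONE_NEWUTS | syscall.CLONE_NEWNS,
        }
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }

Inside, echo $$ prints 1 (the shell is PID 1 of its namespace), though you'd still need to remount /proc for ps to agree. On the host it's an ordinary process.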
I think it's fairly important to say that a container is a process tree, not necessarily a single process. A lot of things wouldn't behave properly if a forked child process wasn't part of the same container as its parent.
(Of course, it might not truly be a tree either, because you can use tools like nsenter to put additional processes into the PID namespace without them being children of the container's PID 1. But a tree is the common case.)
I would argue that the virtualization aspect is more important than the processiness of a container. Sure, it's a process, but it's a process whose access to resources you've carefully controlled. You've put it in its own network namespace, you've rebuilt its filesystem so it sees an entirely different / than the rest of the system, etc. If the process isn't checking, you can make it behave like it's in its own little world, with its own kernel. Because of that, it's much more predictable than just a process.
Saying it's a process is mostly correct, and saying it's a process tree is arguably even more correct, but I think they both miss the point.
That's usually not complete. Minimally, it's a process in its own PID namespace.
But assuming it's an OCI container, which will be the case if you're using any common managed container runtime and not rolling your own, it's a process in its own user namespace, mount namespace, network namespace, UTS namespace, cgroup namespace, IPC namespace, and time namespace. It's assigned its own hostname in the new UTS namespace and its own IP in the new network namespace. It runs in a chroot in the new mount namespace to a root filesystem assembled via overlay with a single writable layer on top of N read-only layers shared with any other container launched from the same image. It gets bind mounts, kernel VFS mounts, an environment, and an entrypoint command from a config file colocated with the root filesystem in a container bundle created from the image defaults, system defaults, and command-line overrides.
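A sketch of just the overlay part, with hypothetical paths (this is the shape of the mount an OCI runtime sets up, not its actual code):

    package main

    import "syscall"

    func main() {
        // Two read-only image layers, shared with every other container
        // started from the same image, plus this container's private
        // writable layer on top.
        opts := "lowerdir=/layers/l2:/layers/l1," +
            "upperdir=/containers/c1/diff," +
            "workdir=/containers/c1/work"
        if err := syscall.Mount("overlay", "/containers/c1/merged", "overlay", 0, opts); err != nil {
            panic(err)
        }
    }

The runtime then pivots the container process's root into /containers/c1/merged, so only upperdir is private to it.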
The simplest way to compare it to a VM: a container is the Linux kernel using namespaces and cgroups to scope the services and resources it presents to a single userspace process. A VM is a hypervisor presenting virtualized hardware and BIOS services to a guest kernel.
You're being a bit obtuse or maybe you just wanted to show off your knowledge a bit. I said basically the same, "it's a process... with namespaces and cgroups." If they can answer that and know what that means, I'm good and I go to my other interview questions. I'm not hiring a Docker developer or someone who needs to know the internals of Linux. I'm hiring SREs who manage and oversee Docker and Kubernetes production environments.
Technically you don’t need to know what a container “is” to oversee Docker and Kubernetes production environments (knowing that it’s not a VM is a plus, but then you already knew that by calling it a container rather than a VM). A container can just be a container, and you can leave it at that.
If a zookeeper starts talking about jackdaws and crows in an interview, who’s the one being obtuse and trying to show off their knowledge?
The important thing is that the animals are taken care of and the zoo visitors are happy.
An SRE responsible for Kubernetes and Docker production environments who doesn't understand the fundamentals of how they work wouldn't be of much use to me. Sure, they can probably follow scripted runbooks, but I want people who can deep-dive into advanced troubleshooting. TBF, there are uses for those who don't get it, but they're usually the lesser-paid folks who work the night and weekend on-call rotations.
> I want people who can deep-dive into advanced troubleshooting
In my experience, people who need to know every little thing rarely end up knowing every little thing, and are actually the absolute worst at fixing issues of high urgency - due to needing to know every little thing.
I'd much rather have people who are able to learn fast on the fly. Those are the people who actually end up knowing the little things that are actually useful, as opposed to useless trivia.
I have apps running in Docker and I've also deployed apps to VMs. I would get this answer wrong according to OP. Honestly, I don't see how this knowledge is necessary.
No. A container shares the same kernel with the host OS. It doesn't do any of its own hardware setup. At most the host kernel will create some virtual network interfaces and create a filesystem for it. The container's processes are the same as other processes on the host OS, just with some special accounting flags that govern how many resources they get and what they're allowed to see from the host.
A VM host creates a guest environment with a bunch of virtual hardware devices and starts up the guest's kernel that talks to them through its own drivers. The guest does its own hardware initialization, formats and mounts its own block storage devices, does its own bootup and process scheduling, etc.
If a container shares the same kernel with the host OS, how can Docker containers run on different OSes? Are people saying that Docker containers on those systems are really "Docker VMs"?
"Docker for Mac" and "Docker for Windows" generally spin up a local Linux VM to run containers, but all of the containers run inside a single VM. (I'm going to gloss over running Windows containers on Windows, which I believe operate more like Linux containers do on Linux).
If you mean "how can Docker containers based on Linux run on Mac/Windows hosts", they don't. There is a single VM running Linux and the containers run inside of that.
I'm not sure "a process with cgroups and namespaces" makes things much clearer, though somebody with an understanding of those might infer certain properties.
"a kind of lightweight VM" seems a much more intuitive answer, with limitations.
Not OP, but I also interview. I see cgroups and namespaces as a follow-up to "how does it work?" or "how is it usually isolated?" And I ask "why would I use it?" or "when would I choose containers over VMs?" (or vice versa) to get at actual architectural decisions.
The wrong answer usually involves a lot of hemming and hawing
But they're trying to figure out if the person they're interviewing has the level of knowledge necessary to understand what a process sandboxed from other processes in its own cgroup and namespaces means, not trying to give an intuitive description to someone who doesn't understand how Linux works.
Except "a kind of lightweight VM" is an incorrect answer. I'm trying to gauge whether the person knows what they're talking about, because running Docker and Kubernetes in production is hard. I don't want someone thinking they can go into some vSphere console to reboot their VMs when a pod is having problems.
Docker did originally amount to "a process with cgroups and namespaces", but it has evolved since then [1], although I'd agree that fundamentally "process + cgroups + namespaces" is a correct answer.
It's just that nowadays the definition of a "container" depends on the scope, and a Docker container is only one instance of the concept. But I'd expect an interviewee to reply with "what type of OS-level virtualization tool are you referring to?", in which case you would get both a proper answer and a request for clarification :)
A container as just an isolated process makes enough sense to me, but where I get confused is how it allows me to use alpine or ubuntu or some other base image that differs from the actual OS I'm running on. That's what makes it feel more like a lightweight VM to me, and which doesn't seem explained by just cgroups and namespaces. Or if it is, there's something I'm still failing to understand here.
I'd guess that the gap you're missing is that one of the namespace types containers use is a mount namespace.
During container setup, you get a new mount namespace so that any changes to mounted filesystems are only seen by processes sharing that new mount namespace. Then you mount the filesystem from the docker image, replacing the filesystems mounted in the root namespace.
I don't know if it'll help, but that process is very similar to what happens during boot, where you've initially got a root filesystem mounted from the initrd, but you replace it with the root filesystem from a disk after you've loaded the right drivers from the initrd.
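A minimal sketch of that swap in Go (hypothetical path; real runtimes use pivot_root rather than chroot, and this needs root, but the effect is the same for this discussion):

    package main

    import (
        "os"
        "syscall"
    )

    func main() {
        // Hypothetical directory holding an unpacked Alpine/Ubuntu image.
        root := "/var/lib/myruntime/rootfs"
        if err := syscall.Chroot(root); err != nil {
            panic(err)
        }
        if err := os.Chdir("/"); err != nil {
            panic(err)
        }
        // From here on, /bin/sh is the image's shell, not the host's:
        // same kernel, different userland.
        panic(syscall.Exec("/bin/sh", []string{"sh"}, os.Environ()))
    }

That's why the "distro" in a base image is really just a userland: libc, coreutils, a package manager. The kernel never changes.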
Yes, that is definitely one of the gaps I had, thank you (also the other replies to my comment -- upvoted all of you). I'm generally on a Mac (which runs Docker in a Linux VM -- the reason why is much clearer now), so my exposure to Linux is mostly in the context of ssh-ing into a server. I think the fact that OS X is Linux-like without actually being a Linux distro also subtly led me astray, as I've been thinking of different distros the same way I think of OS X as different from Windows, and not really thinking about the "userland" vs. "kernel" distinction.
The layers typically go: hardware > hypervisor instance (vm) > linux+distro > process. I think it's accurate to say that containers shuffle around those last two so it looks more like: hw > vm > linux > distro+process. The complication is that the lone linux in the middle is still deployed with a distro, it's just that the distro part is abstracted away from the perspective of the container.
That's not necessarily true though; you could very well implement containers as lightweight VMs that use their own specific kernels, though the image itself doesn't contain any. Thinking in terms of containers vs VMs is missing the mark a bit.
I love seeing posts like this. The best way to understand things is to go back to basics. When you understand the fundamentals, all the "crazy sophisticated" stuff becomes a lot easier to reason about and internalize.
Does anybody know if the functionality mentioned in the "One final note" — having all children, grandchildren, great-grandchildren, etc. of a designated "main" process reliably terminated when that "main" process terminates — is available via something more lightweight than containers?
Traditional process groups/sessions are laughably inappropriate for this purpose, since breaking out of them is trivial; in fact, about half of the applications that internally use worker (sub)processes do exactly that, in order to (re)implement their own job control.
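(The breakout in question is a single setsid() call; a Go sketch of the pattern:)

    package main

    import (
        "os/exec"
        "syscall"
    )

    func main() {
        // Start a worker in its own session -- and therefore its own
        // process group -- so killpg() on the parent's group misses it.
        cmd := exec.Command("sleep", "1000")
        cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: true}
        if err := cmd.Start(); err != nil {
            panic(err)
        }
    }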
Are these things properly "nested"? Imagine a scenario like this: A process is started with "unshare A" or "cgexec A" or whatever, then it creates children one of which does "unshare B" or "cgexec B", then the first process, "A", is killed, or its cgroup is signalled, whatever. Will the process "B", that was put into a new namespace/cgroup, be terminated too or not? With traditional pgroups/sessions the answer is "no", and I'd like it to be "yes", so no runaway processes should be possible.
I personally think the only way to make a "detached" process should be by asking some process from an entirely different "group" (over RPC, presumably) to launch something for you, so that the newly created process would technically be a child of this other "launcher" process.
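(For what it's worth, I believe cgroup v2 gives the nesting question a "yes", as long as the new cgroup is created underneath the original one: writing to a cgroup's cgroup.kill file (Linux 5.14+) kills every process in that cgroup and all of its descendant cgroups, and unsharing namespaces doesn't move a process out of its cgroup. A sketch, assuming cgroup v2 mounted at /sys/fs/cgroup, suitable permissions, and a hypothetical group name:)

    package main

    import "os"

    func main() {
        // Hypothetical cgroup that the "main" process tree was placed in.
        cg := "/sys/fs/cgroup/myjob"
        // Kill everything in myjob and in any cgroup nested below it.
        if err := os.WriteFile(cg+"/cgroup.kill", []byte("1"), 0); err != nil {
            panic(err)
        }
    }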
Technically, it could be implemented if the subreaper process were getting notified somehow of new processes being reparented to it. Alas, subreapers don't get such notifications.
Also, you need some protection against children that do "kill(-getpid(), SIGKILL)" before exiting.
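For reference, a minimal subreaper sketch in Go; it inherits orphaned descendants, but as noted above it only learns about them at wait() time, since there's no reparenting notification:

    package main

    import (
        "fmt"
        "os/exec"
        "syscall"
    )

    // PR_SET_CHILD_SUBREAPER from <linux/prctl.h> (value 36, Linux >= 3.4).
    const prSetChildSubreaper = 36

    func main() {
        // Orphaned descendants now reparent to us instead of to PID 1.
        if err := syscall.Prctl(prSetChildSubreaper, 1, 0, 0, 0); err != nil {
            panic(err)
        }
        // The child backgrounds a grandchild and exits, orphaning it.
        if err := exec.Command("/bin/sh", "-c", "sleep 1 &").Run(); err != nil {
            panic(err)
        }
        // We only discover the reparented grandchild when we wait for it.
        var ws syscall.WaitStatus
        pid, err := syscall.Wait4(-1, &ws, 0, nil)
        fmt.Println("reaped orphan:", pid, err)
    }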
EDIT: I found this link at the bottom of the OP's blog that was even better at simplifying things, https://jvns.ca/blog/2016/10/10/what-even-is-a-container/