Life of a Container (2020)

stuff4ben · on July 29, 2021

My default question I give to interviewees looking for a dev or devops position is to tell me what a container is. Usually I get the "well it's like a vm but not a vm" and then I ask the followup which is to ask what are the differences between a container and vm? A lot of hemming and hawing but in the 6 or so years I've been interviewing people who are going to be using Docker, I have yet to hear the answer that I give when I interview, which is, "a container is a process." I usually leave it at that and wait to see if I get a confused look on their face. Then I say, "with cgroups and namespaces". It's really not that complicated (to explain anyways).

EDIT: I found this link at the bottom of the OP's blog that was even better at simplifying things, https://jvns.ca/blog/2016/10/10/what-even-is-a-container/

teraflop · on July 29, 2021

I think it's fairly important to say that a container is a process tree, not necessarily a single process. A lot of things wouldn't behave properly if a forked child process wasn't part of the same container as its parent.

(Of course, it might not truly be a tree either, because you can use tools like nsenter to put additional processes into the PID namespace without them being children of the container's PID 1. But a tree is the common case.)

tinalumfoil · on July 29, 2021

I would argue that the virtualization aspect is more important than the processiness of a container. Sure it's a process but it's a process that you've carefully controlled the resources it can access. You've put it in its own network namespace, you've rebuilt it's filesystem so it sees an entirely different / than the rest of the system, etc etc. If the process isn't checking, you can make it behave like it's in its own little world, with it's own kernel. Because of that, it's much more predictable than just a process.

Saying it's a process is mostly correct, and saying it's a process tree is a guess even more correct I think they both miss the point.

nonameiguess · on July 29, 2021

That's not usually complete. Minimally, it's a process in its own PID namespace.

But assuming it's an OCI container, which will be the case if you're using any common managed container runtime and not rolling your own, it's a process in its own user namespace, mount namespace, UTS namespace, cgroup namespace, IPC namespace, and time namespace. It's assigned its own hostname and IP in the new UTS namespace. It runs in a chroot in the new mount namespace to a root filesystem assembled via overlay with a single writeable layer on top of N read-only layers shared with any other container that is launched from the same image. It gets bind mounts, kernel VFS mounts, an environment, and an entrypoint command from a config file colocated with the root filesystem in a container bundle created from the image defaults, system defaults, and command line overrides.

The simplest way to compare it to a VM is a container is the Linux kernel using namespaces and cgroups to scope the services and resources it presents to a single userspace process. A VM is a hypervisor presenting virtualized hardware and BIOS services to guest kernels.

stuff4ben · on July 29, 2021

You're being a bit obtuse or maybe you just wanted to show off your knowledge a bit. I said basically the same, "it's a process... with namespaces and cgroups." If they can answer that and know what that means, I'm good and I go to my other interview questions. I'm not hiring a Docker developer or someone who needs to know the internals of Linux. I'm hiring SREs who manage and oversee Docker and Kubernetes production environments.

sombremesa · on July 29, 2021

Technically you don’t need to know what a container “is” to oversee Docker and Kubernetes production environments (knowing that it’s not a VM is a plus, but then you already knew that by calling it a container rather than a VM). A container can just be a container, and you can leave it at that.

If a zookeeper starts talking about jackdaws and crows in an interview, who’s the one being obtuse and trying to show off their knowledge?

The important thing is that the animals are taken care of and the zoo visitors are happy.

stuff4ben · on July 29, 2021

An SRE responsible for Kubernetes and Docker production environments who doesn't understand the fundamentals of how they work wouldn't be of much use to me. Sure they can probably follow scripted runbooks but I want people who can deep dive into advanced trouble-shooting. TBF, there are uses for those that don't get it, but they're usually the lesser paid folks who work night and weekend on-call rotations.

sombremesa · on July 29, 2021

> I want people who can deep dive into advanced trouble-shooting

In my experience, people who need to know every little thing rarely end up knowing every little thing, and are actually the absolute worst at fixing issues of high urgency - due to needing to know every little thing.

I'd much rather have people who are able to learn fast on the fly. Those are the people who actually end up knowing the little things that are actually useful, as opposed to useless trivia.

avgDev · on July 29, 2021

I have apps running in docker and I also deployed apps to VMs. I would get this answer wrong according to OP. Honestly I don't see how this knowledge is necessary.

stuff4ben · on July 29, 2021

Your username is apt. Want to be better than average? Buck up and learn the technology.

avgDev · on July 29, 2021

Kinda harsh coming from an IBM employee to be honest.

stuff4ben · on July 30, 2021

This tired trope? Average dev makes average insult, par for the course I guess.

zekrioca · on July 29, 2021

So much to just say that a container is a VM without BIOS?

maccam94 · on July 29, 2021

No. A container shares the same kernel with the host OS. It doesn't do any of its own hardware setup. At most the host kernel will create some virtual network interfaces and create a filesystem for it. The container's processes are the same as other processes on the host OS, just with some special accounting flags that govern how many resources they get and what they're allowed to see from the host.

A VM host creates a guest environment with a bunch of virtual hardware devices and starts up the guest's kernel that talks to them through its own drivers. The guest does its own hardware initialization, formats and mounts its own block storage devices, does its own bootup and process scheduling, etc.

zekrioca · on July 29, 2021

If a container shares the same kernel with the host OS, how can Docker containers run in different OSes? Are people saying that what Docker containers really are "Docker VMs"?

maccam94 · on July 30, 2021

"Docker for Mac" and "Docker for Windows" generally spin up a local Linux VM to run containers, but all of the containers run inside a single VM. (I'm going to gloss over running Windows containers on Windows, which I believe operate more like Linux containers do on Linux).

remram · on July 30, 2021

If you mean "how can Docker containers based on Linux run on Mac/Windows hosts", they don't. There is a single VM running Linux and the containers run inside of that.

mellosouls · on July 29, 2021

I'm not sure a "A process with cgroups and namespaces" makes things much clearer though somebody with understanding of those might infer certain properties.

"a kind of lightweight VM" seems a much more intuitive answer, with limitations.

The link is useful tho, she's a good writer.

nomoreplease · on July 29, 2021

Not OP but I also interview. I see CGroups and namespaces is a follow up to “how does it work or how is it usually isolated?” And I ask “why would I use it?” Or “when would I choose containers over vm?” Or vice versa to get to actual architectural decisions

The wrong answer usually involves a lot of hemming and hawing

mellosouls · on July 29, 2021

Yeah tbf it was offered as a follow on question, I'm just not sure it makes things clearer in describing functional and design differences.

nonameiguess · on July 29, 2021

But they're trying to figure out if the person they're interviewing has the level of knowledge necessary to understand what a process sandboxed from other processes in its own cgroup and namespaces means, not trying to give an intuitive description to someone who doesn't understand how Linux works.

stuff4ben · on July 29, 2021

Except "a kind of lightweight VM" is an incorrect answer. I'm trying to gauge if the person knows what they're talking about because running Docker and Kubernetes in production is hard. I don't want someone thinking they can go into some vsphere console to reboot their VMs when a pod is having problems.

mellosouls · on July 29, 2021

Except "a kind of lightweight VM" is an incorrect answer

It's a pretty reasonable starting point as a description compared to the meaningless rejoinder which says nothing about what the point of them is.

zekrioca · on July 29, 2021

Usually, Docker used to be "a process with cgroups and namespaces". But it has evolved since then [1], although I'd agree that fundamentally, the "process + cgroups + namespaces" is a correct answer.

[1] From their own FAQ: https://docs.docker.com/engine/faq/#what-does-docker-technol...

stuff4ben · on July 29, 2021

Agreed, if I was asking what "Docker" was. I ask what a "container" is to skirt around that. Although I sometimes do ask what is Docker too.

zekrioca · on July 29, 2021

It is just that nowadays the definition of a "Container" depends on the scope, and presumably "Docker" is an instance of a container. But I'd expect an interviewee to reply with "what type of OS-level virtualization tool are you referring to?", in which case you would get both a proper answer and a request for clarification :)

anyonecancode · on July 29, 2021

A container as just an isolated process makes enough sense to me, but where I get confused is how it allows me to use alpine or ubuntu or some other base image that differs from the actual OS I'm running on. That's what makes it feel more like a lightweight VM to me, and which doesn't seem explained by just cgroups and namespaces. Or if it is, there's something I'm still failing to understand here.

tene · on July 29, 2021

I'd guess that the gap you're missing is that one of the namespace types containers use is a mount namespace.

During container setup, you get a new mount namespace so that any changes to mounted filesystems are only seen by processes sharing that new mount namespace. Then you mount the filesystem from the docker image, replacing the filesystems mounted in the root namespace.

I don't know if it'll help, but that process is very similar to what happens during boot, where you've initially got a root filesystem mounted from the initrd, but you replace it with the root filesystem from a disk after you've loaded the right drivers from the initrd.

anyonecancode · on July 30, 2021

Yes, that is definitely one of the gaps I had, thank you (also the other replies to my comment -- upvoted all of you). I'm generally on a Mac (which runs docker in a linux VM -- the reason why is much clearer now) and so my exposure to linux is mostly in the context of ssh-ing into a server. I think the fact that OS X is linux-like without actually being a linux distro also subtly led me astray, as I've been thinking of different distros in the same way I think of OS X as different from Windows, and not really thinking about the "userland" vs "kernel" distinction.

infogulch · on July 29, 2021

The layers typically go: hardware > hypervisor instance (vm) > linux+distro > process. I think it's accurate to say that containers shuffle around those last two so it looks more like: hw > vm > linux > distro+process. The complication is that the lone linux in the middle is still deployed with a distro, it's just that the distro part is abstracted away from the perspective of the container.

Diggsey · on July 29, 2021

All of those OSes use the same linux kernel, it's only the user-space part that differs.

When you run eg. an ubuntu docker image, the entire user-space of ubuntu is running inside the container. Only the kernel is shared with the host.

jahlove · on July 29, 2021

Aren't centos:7 and centos:8 using different kernels?

piperswe · on July 29, 2021

Nope, they're just the CentOS 7 and CentOS 8 userlands running on top of the host kernel.

chousuke · on July 30, 2021

That's not necessarily true though; you could very well implement containers as lightweight VMs that use their own specific kernels, though the image itself doesn't contain any. Thinking in terms of containers vs VMs is missing the mark a bit.

whalesalad · on July 29, 2021

I love seeing posts like this. The best way to understand things is to go back to basics. When you understand the fundamentals, all the "crazy sophisticated" stuff becomes a lot easier to reason about and internalize.

Joker_vD · on July 29, 2021

Does anybody know if the functionality mentioned in the "One final note" — having all children, grandchildren, great-grandchildren, etc. of a designated "main" process reliably terminated when that "main" process terminates — is available via something more lightweight than the containers?

Traditional process groups/sessions are laughably inappropriate for this purpose since breaking out of them is trivial and in fact, about half of applications that internally use worker (sub)processes do exactly that, in order to (re)implement their own job control.

paulfurtado · on July 29, 2021

If running as a privileged user, you can run the process in just a PID namespace with no other container features enabled using the unshare command.

Systemd can also handle it by killing the processes by cgroup i believe, and for non services you could take advantage of that with systemd-run.

Joker_vD · on July 29, 2021

Are these things properly "nested"? Imagine a scenario like this: A process is started with "unshare A" or "cgexec A" or whatever, then it creates children one of which does "unshare B" or "cgexec B", then the first process, "A", is killed, or its cgroup is signalled, whatever. Will the process "B", that was put into a new namespace/cgroup, be terminated too or not? With traditional pgroups/sessions the answer is "no", and I'd like it to be "yes", so no runaway processes should be possible.

I personally think the only way to make a "detached" process should be by asking some process from an entirely different "group" (over RPC, presumably) to launch something for you: so that the newly created process would technically be a children of this other "launcher" process.

yrro · on July 29, 2021

Interesting idea. A sort of PR_SET_CHILD_SUBREAPER + "if I die, send SIGKILL to all my children" option. That would be handy...

Joker_vD · on July 29, 2021

Technically, it could be implemented if the subreaper process were getting notified somehow of new processes being reparented to it. Alas, subreapers don't get such notifications.

Also, you need some protection against children who do "kill(-getpid(), SIGKILL)" before the exit.

0x0 · on July 29, 2021

Did the cgroup names change randomly in the middle of the examples? It changes from "child" to "demo" between

# mkdir /sys/fs/cgroup/memory/child

and

# ls -lh /sys/fs/cgroup/memory/demo/

singlow · on July 29, 2021

evidence of a spliced speedrun...

jeffbee · on July 29, 2021

Small quibble: cgroup fs may be accessible at /sys/fs/cgroup, but it may also be mounted anywhere. Use `mount -t cgroup` to find out.

arpa · on July 29, 2021

check out bocker[1] for a tl;dr version (~100 lines of bash) without any unnecessary flavouring.

1. https://github.com/p8952/bocker