CRIU, a project to implement checkpoint/restore functionality for Linux (criu.org)
208 points by JeremyNT 78 days ago | 73 comments



I built crik[1] to orchestrate CRIU operations inside a container running in Kubernetes so that you can migrate containers when a spot node gets a shutdown signal. Presented it at KubeCon Paris 2024 [2] with a deep dive for those interested in the technical details.

[1]: https://github.com/qawolf/crik

[2]: The Party Must Go On - Resume Pods After Spot Instance Shutdown, https://kccnceu2024.sched.com/event/1YeP3


My process connects to, say, Postgres. What's going to happen to that connection upon restore?

Does crik guarantee the order of events? Saving a checkpoint should be followed by killing the old process/pod, which should be followed by a restoration; the order of these three events is strict. And given that CRIU can checkpoint and restore socket state correctly, how does that work for Kubernetes? The new pod will have a different IP.


TCP connections are identified with source IP:port and target IP:port tuples. When a new pod is created, it gets a new IP, so there is no practical way to restore the TCP connections. So crik drops all TCP connections and lets the application handle the reconnection logic. There are some CNIs that can give a static IP to a pod, but that’s rather unorthodox in k8s.
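For illustration, a minimal sketch of the kind of application-side reconnection logic this relies on (not crik code; the helper and retry parameters are placeholders):

  import time

  def with_reconnect(connect, exceptions=(ConnectionError, OSError),
                     max_retries=5, base_delay=0.5):
      """Call connect(), retrying with exponential backoff when the old
      connection is gone (e.g. after a restore onto a pod with a new IP)."""
      for attempt in range(max_retries):
          try:
              return connect()
          except exceptions:
              time.sleep(base_delay * 2 ** attempt)
      raise RuntimeError("could not re-establish connection after restore")

  # e.g. conn = with_reconnect(lambda: psycopg2.connect(dsn),
  #                            exceptions=(psycopg2.OperationalError,))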


Right, and this shouldn't be a big issue for [competent] cloud-native software: it's a transient fault. If your software can't recover from transient faults then this is the wrong ecosystem to be considering.


> The new pod will have a different IP.

Usually clients would connect to a Kubernetes svc to not have the problem with changing IPs. Even for just a single pod I would do that.


The app in the pod is the client (of a DBMS server). The client's IP gets changed. A service in k8s is a network node with an address, but it is used for inbound connections; outbound connections (like from the app to a DBMS server, which may be outside the k8s cluster) usually do not use services (as they provide no benefit).


great talk! I’m curious about an approach like this combined with CUDA checkpoint for GPU workloads https://github.com/NVIDIA/cuda-checkpoint


This makes sense for checkpointing and restoring long ML training runs.

Doing this on a networked application is going to be iffy. The restored program sees a time jump. The world in which it lives sees a replay of things the restored program already did once, if the restore is from a checkpoint before a later crash.

If you just want to migrate jobs within a cluster, there's Xen.


I pulled apart the innards of CRIU because I needed to be able to checkpoint and restore a process within a few microseconds.

The project ended up being a dead end because it turned out that running my program in a QEMU whole-system VM and then fork()ing QEMU worked faster.


There is a QEMU fork used by the Nyx fuzzer that may be interesting to you: https://github.com/nyx-fuzz/QEMU-Nyx

Basically, for fuzzing purposes speed is paramount, so they made some changes to speed up snapshot restoring. I don't know the limitations, but since it is used to fuzz full operating systems, there should not be many.

I believe it should be faster than forking; why else would they patch QEMU?


could you tell me a bit more about what you're doing?


The goal was to have a web browser (Chromium) able to 'guess' stuff about what response it will get from the network (i.e. will the server return the same JavaScript blob as last time?). We start executing the JavaScript as if the guess is correct. If the guess is wrong, we revert to a snapshot.

It lets you make good use of CPU time whilst waiting for the network.

It turns out simple heuristics can get 99% accuracy on the question of 'will the server return the same result as last time for this non-cacheable response'.

However, since my machine has many CPU cores it made sense to have many 'speculative' copies of the browser going at once.

A regular fork() call would have worked, if not for the fact that Chromium is multi-threaded and multi-process, and it's next to impossible to fork multiple processes as a group.
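Not the Chromium setup, of course, but here is a toy sketch of the same fork-and-discard speculation pattern for a single process (Unix-only; `compute`, `guess`, and `fetch_actual` are hypothetical stand-ins):

  import os
  import pickle

  def speculative_run(guess, compute, fetch_actual):
      """Fork a child that computes on a guessed input while the parent waits
      for the real input; keep the child's result only if the guess was right."""
      r, w = os.pipe()
      pid = os.fork()
      if pid == 0:                         # child: speculate on the guess
          os.close(r)
          with os.fdopen(w, "wb") as out:
              pickle.dump(compute(guess), out)
          os._exit(0)
      os.close(w)
      actual = fetch_actual()              # e.g. block on the network response
      with os.fdopen(r, "rb") as inp:
          speculative_result = pickle.load(inp)
      os.waitpid(pid, 0)
      if actual == guess:
          return speculative_result        # speculation paid off
      return compute(actual)               # wrong guess: discard and redo for real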


Terrifying, I love it :) How was the performance in the end? Did you get a good speculation success rate?

It'd be cool to predict which resources are speculation-safe (i.e. the cache headers don't permit caching, but in practice the content doesn't change) and speculate on those, but not on ones for which you have repeatedly had a speculation abort (i.e. actually dynamic resources). If your predictor gets a high enough hit rate, you could probably do okay with just a single instance/no snapshot and an expensive rollback mechanism (reload the whole page non-speculatively?).


Sorry if I'm being thick, but why not just cache the response?

If you are guessing at the data anyway, what's the difference?

Why set up an entire speculative execution engine / runtime snapshot rollback framework when it sounds like adding heuristic decision caching would solve this problem?


Sounds like they were caching it since they could execute it before getting the response. The difference is that they wanted to avoid the situation where they execute stale code that the server never would've served. So they can execute the stale code while waiting for the response then either toss the result or continue on with it once they determine if the server response changed.


How else will you discover new and exciting speculative execution vulnerabilities? /s


Couldn't you just change chrome so that it forks the tabs and runs them in the background? That seems a lot easier.


I once used CRIU to implement the hacky equivalent of save-lisp-and-die to speed up the startup process of a low-powered embedded system where the main application was misguidedly implemented in Erlang and loading all the code took minutes each time the device started. It worked better than it should have (though in the end it wasn't shipped because nobody (except the customer) cared enough about the startup behavior and eventually the product got canned (for different reasons)).
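For anyone wondering what that looks like mechanically, here is a minimal sketch with the stock criu CLI wrapped in Python (PID and image directory are placeholders; real workloads usually need more flags than this):

  import subprocess

  IMG_DIR = "/var/lib/app-checkpoint"   # placeholder image directory (must exist)
  PID = 1234                            # PID of the fully-initialized application

  # Dump the warmed-up process tree to disk once, after the slow startup.
  # --shell-job because this example process was started from a shell;
  # --leave-running keeps the original process alive after the dump.
  subprocess.run(
      ["criu", "dump", "-t", str(PID), "-D", IMG_DIR,
       "--shell-job", "--leave-running"],
      check=True,
  )

  # On the next boot, restore from the images instead of starting from scratch.
  subprocess.run(["criu", "restore", "-D", IMG_DIR, "--shell-job"], check=True)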


What was misguided about using Erlang? That it was so expensive CPU-wise to start up?


It was just a bad fit for the problem at hand. It was picked mainly because there were a bunch of Elixir backend devs in the organization that were out of work and every problem looks like a web service (or any other system architecture you're familiar with) if you want it badly enough. There was a bunch of hardware interaction that ended up getting externalized into a separate C++ component that exposed a gRPC interface towards the Elixir component both because of Conway's Law and because nobody wanted to (or knew how to) write NIFs and deal with blocking operations on the Erlang side.

A two-digit number of these devices was meant to form a cluster of sorts and using Erlang clustering sounds like a nice solution for that until you realize that the base load of the full mesh that this implies is high enough to use a meaningful chunk of the device's resources.


> A two-digit number of these devices was meant to form a cluster of sorts...

That embedded constraint is fascinating. With the benefit of 20/20 hindsight, given the same substrate constraint, what direction would you have taken? I was thinking along the lines of maybe a C/C++ state machine library.


I would guess they compiled locally.


I discovered CRIU in the video below (1h), "Container Migration and CRIU Details with Adrian Reber (Red Hat)"; it has a live demo and details on how much of it really is "user space". Here it's shown with the RH podman fork of Docker.

Since everyone is treating containers as cattle, CRIU doesn't seem to get much attention, which might be why a video and not a blog post was my first introduction.

https://www.youtube.com/watch?v=-7DgNxyuz_o


> Since everyone is treating containers as cattle CRIU doesn't seem to get much attention

Nah, it's more like "I don't trust that thing to not cause weird behavior in production".

VM-level snapshots are standard practice[1] because the abstraction there is right-sized for being able to do that reliably. CRIU isn't, because it's trying to solve a much harder problem.

[1]: And even there, beware cloning running memory state, you can get weird interactions from two identical parties trying to talk to the same 3rd service, separated by time. Cloning disk snapshots is much safer, and even there you can screw up because of duplicate machine IDs, crypto keys, nonces, etc.


The thing with VMs is that there is much more overhead to booting a Linux VM, which makes checkpointing much more attractive. A container running with Linux namespaces/cgroups, by contrast, can be started in a few milliseconds.

I'm sure there are some niche applications for container checkpointing, but I don't really see the complexity being worth it. Maybe checkpointing some long-running batch jobs could save you some money, but you should just make your jobs checkpoint their state to an external store such as Ceph or S3 and make them smart enough to load any state from those stores if they are preempted.


Firecracker starts running the application in as low as 125ms. Most of the overhead in a "cloud cold start" comes from the cloud infrastructure, not from the virtualization mechanism.


Yeah, I've always held a soft spot for CRIU, but I don't see it battle-tested enough to trust big, gnarly closed-source third-party vendor enterprise Java products to run under it. And if I've reduced an open source piece of kit's execution footprint enough to trust it, I'd probably reach for Unikraft with checkpointing to a Persistent Volume before CRIU, which feels like the early days of VMware.

Hopefully though, my trepidation is wrong. What is the most complex piece of software others have run under CRIU in production, and for how long?


> Here with the RH podman fork of docker

small nit: podman is not a docker fork, it's a completely different codebase written from scratch


It is absolutely not from scratch. It is definitely reusing docker code (or at least it used to).


> Since everyone is treating containers as cattle CRIU doesn't seem to get much attention

Yeah, I guess that's probably the reason. If you're engineering your workloads with the idea that the world might "poof" out from under you at any moment you'd never wonder about / reach for something like CRIU.

It's a trick that I'd never much thought about, but now that I've learned it exists (so many years late) I find myself wondering about the path not taken here. It feels like it should be incredibly useful... but I can't figure out exactly what I'd want to do with it myself.


> ...I find myself wondering about the path not taken here.

Check out mainframes and Tandem systems for a peek at that path. Lots of support in those systems for the notion your application’s substrate might suddenly go poof, and you need it to recover from where it left off as instantaneously as possible.

It’s expensive.


I'm keeping an eye on this project as a way to give containers used with immutable distro installs (e.g. Silverblue) a kind of user-space hibernation feature. So I could hibernate different container workspaces at will. I would find this very useful for development projects where I often have a lot of state that I lose whenever I need to reboot or whatever. Last time I looked there were still too many limitations on what it could checkpoint, but maybe one day.


We do this using CRIU right now! https://github.com/cedana/cedana

In fact one of our customer's use cases is exactly what you describe, allowing users to "hibernate" container workspaces.


CRIU is 11 years old, don't expect it to be any more usable in the near future.


rr took about 9 years to get first class aarch64 support.

https://github.com/rr-debugger/rr/issues/1373


Interesting. I built a very primitive prototype for a hosting company a while back where I wanted to figure out if we could offer something close to a live migration of one Linux account on host x to host y without causing a lot of downtime. The product didn't support containers and isolation was just based on Linux user accounts, so we couldn't just use Docker.

Just a few months ago I was talking to a startup founder at KubeCon who built a product based on CRIU. Unfortunately I forgot the company's name. (And I can't find that git repo with the prototype anywhere, even in my backups. Sad.)


I'm probably the cofounder of the guy you spoke with! Here's our repo: https://github.com/cedana/cedana


Indeed, that was it. All the best with your startup!


Someone uses it to start Emacs very quickly - https://gitlab.com/blak3mill3r/emacs-ludicrous-speed


CRIU is used by LXD to save the state of an LXD container, very similar to suspending or snapshotting a virtual machine.

Unfortunately, I was disappointed to find `lxc stop --stateful` couldn't save any of my LXD containers. There was always some error or other. This is how I learned about CRIU, as it was due to limitations of CRIU when used with the sorts of things running in LXD.

  # lxc stop --stateful test
  (00.121636) Error (criu/namespaces.c:423): Can't dump nested uts namespace for 2685261
  (00.121645) Error (criu/namespaces.c:682): Can't make utsns id
  (00.150794) Error (criu/util.c:631): exited, status=1
  (00.190680) Error (criu/util.c:631): exited, status=1
  (00.191997) Error (criu/cr-dump.c:1768): Dumping FAILED.
  Error: snapshot dump failed
LXD is generally used with "distro-like" containers, like running a small Debian or Ubuntu distro, rather than single-application containers as are used with Docker.

It turns out CRIU can't save the state of those types of containers, so in practice `lxc stop --stateful` never worked for me.

I'd have to switch to VMs if I want their state saved across host reboots, but those lack the host-guest filesystem sharing behaviour I needed.

In practice this meant I had to live with never rebooting the host. Thankfully Linux just keeps on working for years without a reboot :-)


Stéphane Graber (a key Incus, née LXD, contributor) just did a video about developing placement scriptlets in the Starlark language. The interesting thing is, if I'm interpreting what I saw correctly, that his cluster was 6 beefy servers plus 3 decent-sized VMs, and the idea, I think, was that containers could get placed on the nested VMs, neatly solving the migration issue with containers. The interesting part was that the 3 VMs appeared to be hosted on the cluster's own servers while also being cluster members themselves.

I could be wrong, though. Interesting approach if true


> Linux just keeps on working for years without a reboot

Except I would strongly suggest not doing that as there have been some very nasty security issues fixed as of late.


> (00.121636) Error (criu/namespaces.c:423): Can't dump nested uts namespace for 2685261

Found a GitHub issue for this: https://github.com/checkpoint-restore/criu/issues/1430

The issue apparently is that newer systemd versions create their own UTS namespace, so running systemd in a container suddenly results in a nested UTS namespace. Containers with older versions of systemd, or which don't use systemd, shouldn't have the issue.
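A quick way to check whether you're hitting this (a diagnostic sketch, not CRIU code; it assumes you can read /proc for the container's processes from the host, and the PIDs are placeholders):

  import os

  def uts_namespaces(pids):
      """Map each PID to its UTS namespace link, e.g. 'uts:[4026531838]'."""
      return {pid: os.readlink(f"/proc/{pid}/ns/uts") for pid in pids}

  # Pass the container's init PID plus a few of its descendants. More than one
  # distinct value means nested UTS namespaces, which stock CRIU refuses to dump.
  ns = uts_namespaces([2685261, 2685300])   # placeholder PIDs
  print(set(ns.values()))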

One commenter posted in April 2021 that they had a patch to add support for nested UTS namespaces, but they don't appear to have submitted it: https://github.com/checkpoint-restore/criu/issues/1430#issue...

Comment on another issue has suggestion on how to implement nested UTS namespace support: https://github.com/checkpoint-restore/criu/issues/1011#issue...

It doesn't sound like nested UTS namespace support is impossible, just something nobody has got around to implementing.

Comment in CRIU source code says nested namespaces are only supported for mount namespaces (CLONE_NEWNS) and network namespaces (CLONE_NEWNET): https://github.com/checkpoint-restore/criu/blob/b5e2025765b9...

But if you look at the OpenVZ fork of CRIU, you see it also supports PID (CLONE_NEWPID), UTS (CLONE_NEWUTS) and IPC (CLONE_NEWIPC) namespaces: https://bitbucket.org/openvz/criu.ovz/src/d9bf55896015a27df9...

I don't know why these additional features in OpenVZ CRIU don't exist in the upstream.

I think the main blocker to supporting nesting of the other namespace types (user, cgroup, time) is simply someone getting around to writing the code. It is possible some of them pose some kind of architectural issue where a kernel enhancement might be necessary (if that's true of any, I'd say most likely of user), but I suspect for most of them it is simply a matter of nobody having gotten around to it.

The other issue is eventually someone will add another namespace type to the Linux kernel, and then CRIU will need to support that too.


How do things like this handle sockets? Is there some kind of first class event that the app can detect, or does it just "close" them all and assume the app can cleanly reconnect to reestablish them (once they detect that the socket has rudely closed on them)?


There are many ways to go about it. The standard way recommended with libsoccr (the CRIU library that handles TCP socket checkpoint/restore) is to install a firewall rule to filter packets during the checkpoint and let TCP resend whatever it needs to resync once the socket is restored.

If you want your original process to continue living after the checkpoint and not lose packets during the checkpoint, you can go a pretty long way with the 'plug' tc qdisc and IFBs. And if you're adventurous, lots of support for getsockopt/setsockopt and ioctls has been or is being merged into io_uring, so checkpointing a big-buffered TCP socket can cost under 100us, even less IIRC.
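Very roughly, the firewall part looks like this (a sketch of the general mechanism only; libsoccr/CRIU install and clean up their own rules, and the port match here is a placeholder):

  import subprocess

  PORT = 5432  # placeholder: local port of the TCP connection being checkpointed

  def block(on):
      """Insert or delete an iptables rule that silently drops inbound packets
      for the connection while its socket state is being dumped; the peer's TCP
      retransmission covers the gap once the socket is restored."""
      action = "-I" if on else "-D"
      subprocess.run(
          ["iptables", action, "INPUT", "-p", "tcp",
           "--dport", str(PORT), "-j", "DROP"],
          check=True,
      )

  block(True)     # before the dump
  # ... criu dump / restore happens here ...
  block(False)    # after the restore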


https://github.com/checkpoint-restore/criu?tab=readme-ov-fil...

> One of the CRIU features is the ability to save and restore state of a TCP socket without breaking the connection. This functionality is considered to be useful by itself, and we have it available as the libsoccr library.


We considered using something like this to cache some Python program state to speed up the startup time, as the startup time was quite long for some of our scripts (due to slow NFS, but also to importing lots of libs, like PyTorch or TensorFlow). We wanted to store the program state right after importing the modules and loading some static stuff, before executing the actual script or doing other dynamic stuff. So updating the script is still possible while keeping the same state.

Back then, CRIU turned out to not be an option for us. E.g. one of the problems was that it could not be used as non-root (https://github.com/checkpoint-restore/criu/pull/1930). I see that this PR was merged now, so maybe this works now? Not sure if there are other issues.

We also considered DMTCP (https://github.com/dmtcp/dmtcp/) as another alternative to CRIU, but that had other issues (I don't remember).

The solution I ended up with was to implement a fork server. A server process starts initially, preloads only the modules and maybe some other things, and then waits. Once I want to execute some script, I can fork from the server and use this forked process right away. I used similar logic as in reptyr (https://github.com/nelhage/reptyr) to redirect the PTY. This worked quite well.

https://github.com/albertz/python-preloaded
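A stripped-down sketch of the fork-server idea (omitting the reptyr-style PTY redirection and the real protocol; `heavy_imports` and the socket path are stand-ins):

  import os
  import runpy
  import socket

  SOCK_PATH = "/tmp/preload.sock"        # placeholder control socket

  def heavy_imports():
      import numpy                        # noqa: F401  stand-in for torch/tf etc.

  def serve():
      heavy_imports()                     # pay the slow import cost once
      srv = socket.socket(socket.AF_UNIX)
      if os.path.exists(SOCK_PATH):
          os.unlink(SOCK_PATH)
      srv.bind(SOCK_PATH)
      srv.listen(8)
      while True:
          conn, _ = srv.accept()
          script = conn.recv(4096).decode().strip()
          conn.close()
          if os.fork() == 0:
              # Child: modules are already loaded, so the (possibly updated)
              # script starts running almost immediately.
              runpy.run_path(script, run_name="__main__")
              os._exit(0)
          os.waitpid(-1, os.WNOHANG)      # best-effort reap of finished children

  if __name__ == "__main__":
      serve()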


How were you handling GPU state w/ pytorch? We added some custom code around CRIU to enable GPU checkpointing fwiw: https://docs.cedana.ai/setup/gpu-checkpointing/


Not at all. I forked before I used anything with CUDA. I didn't need it, but I guessed this could cause all kinds of weird problems.


This sounds similar to what's been done to speed up FaaS cold starts - snapshot the VM after the startup code runs, then launch functions from the snapshot. E.g., https://www.sysnet.ucsd.edu/~voelker/pubs/faasnap-eurosys22.....


For my OS class's final project last quarter, I built a way to live-migrate a process (running on a custom OS we built from scratch) from one Raspberry Pi to another, essentially using checkpoint/restore!

Getting the code cleaned up enough to post it has been on my to-do list for quite some time, and this has inspired me to do it soon!


Should mention, the coolest part is that I never sent over "all" the memory used by the process, because it was difficult to tell what is needed and what isn't. Instead, I was clever with virtual memory, and when a page of memory was needed that wasn't loaded by the recipient Pi, it would request and lazy-load just that page from the provider Pi, and with some careful bookkeeping mark that the page was owned by the recipient Pi.


> Instead, I was clever with virtual memory, and when a page of memory was needed that wasn't loaded by the recipient Pi, it would request and lazy-load just that page from the provider Pi, and with some careful bookkeeping mark that the page was owned by the recipient Pi.

I wonder if that "trick" can be extended to a full implementation of distributed shared memory, i.e. multiple nodes running separate tasks in a single address space and implementing cache coherence over the network. Probably needs quite a bit of extra compiler/runtime support so it wouldn't really apply to standard binaries, but it might still be useful nonetheless.


Partitioned Global Address Space (PGAS) compilers/runtimes do something similar to that. Unified Parallel C (UPC, https://upc.lbl.gov/) and Coarray Fortran/Coarray C++ (https://docs.nersc.gov/development/programming-models/coarra...) are good examples commonly used in HPC. Fabric Attached Memory (OpenFAM, https://openfam.github.io/) is another example.


“commonly used in HPC” is a bit of a stretch if you’re talking about production applications.


That's actually basically what I was doing! Was able to run programs compiled for a "normal" OS on a single unified distributed virtual address space!


Sounds like a cool class project! If I understand your approach correctly, this is how live virtual machine migration typically works (e.g., https://none.cs.umass.edu/~shenoy/courses/spring18/readings/...). It also sounds similar to this "remote-fork" concept: https://www.usenix.org/system/files/osdi23-wei-rdma.pdf.


Would love a low-tech version of this which simply suspends the process and puts all mapped pages in swap (no persistence across reboot ofc). I think it could be used for scheduling large memory-bound jobs whose resource usage is not known in advance.


Not sure that's needed. Sending SIGSTOP (or using a cgroup freeze) and letting the Linux memory management do its job should do most of that already.
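Something like this, for example (a sketch; the cgroup path assumes cgroup v2 and a group you already control):

  import os
  import signal

  def pause(pid):
      # Stop scheduling the process; under memory pressure its pages
      # become candidates for swap like any other cold memory.
      os.kill(pid, signal.SIGSTOP)

  def resume(pid):
      os.kill(pid, signal.SIGCONT)

  # Equivalent with the cgroup v2 freezer, which stops a whole group atomically:
  def freeze(cgroup, on):
      with open(f"/sys/fs/cgroup/{cgroup}/cgroup.freeze", "w") as f:
          f.write("1" if on else "0")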


I've used this to speed-up CLI commands which have a slow startup phase. I run the process up to the point where it finishes initializing things and starts reading input, SIGSTOP there, and resume it later. You can identify that "reading input" library call with strace, and intercept it dynamically with an LD_PRELOAD shim.

I haven't yet figured out how to (neatly) persist this to disk, so I just sort of make a mini server-loop that catches signals and dispatches a fork() for each one, and that's my fast version of the CLI command. Delightfully ugly :)

(The killer app I'm trying to apply this to is LaTeX, so that I can write math notes in Emacs, incrementally, without visible latency. Unfortunately the running LaTeX process is slightly convoluted, and needs a few more tricks to get working in this way. This trick works on the plain TeX command out-of-the-box (it's like a 50x speedup), so I think I'm on the right track...)


> The killer app I'm trying to apply this to is LaTeX, so that I can write math notes in Emacs, incrementally, without visible latency.

See texpresso [1] for one solution that does something like this with the LaTeX process.

Another, more conservative solution is the upcoming changes to Org mode's LaTeX previews [2] which can preview live as you type, with no Emacs input lag (Demos [3,4]).

[1] https://github.com/let-def/texpresso

[2] https://abode.karthinks.com/org-latex-preview/

[3] http://tinyurl.com/olp-auto-1

[4] https://tinyurl.com/ms2ksthc


I wasn't aware of either of those, so thank you very much for those referrals I shall take a close look at :)

(Are you by chance the author of org-latex-preview, or is it a coincidence of usernames?)


I am one of the authors, should have mentioned. It's not part of Org yet, but should be some time this year.


Yes I confirm it works - I’ve been using this very behavior for years


Yep, nothing fancy needed


Can this be used for something like the Steam Deck? It would be nice for when you are running a game and need to stop but want to resume gameplay later.


I'd say unlikely: games do a lot of work on the graphics card, and that state is not as easily dumpable/restorable as memory.


I've interacted with some of these features as a means of code injection into running processes. (checkpoint, patch the checkpoint data, restore)

It's useful because, by design, it's difficult for the process to even notice it's been stopped. And while it's stopped, you can apply arbitrary patches completely atomically.


This is what supports Docker's checkpoint create/restore.

And Docker is a very convenient way to do this, e.g. to work around the PID limitation.

(Though I really wish it got more attention https://github.com/docker/cli/issues/4245 )
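For reference, the flow looks roughly like this (a sketch; `docker checkpoint` is still experimental, and the container/checkpoint names are placeholders):

  import subprocess

  CONTAINER = "my-batch-job"     # placeholder container name
  CHECKPOINT = "cp1"             # placeholder checkpoint name

  # Freeze the container's processes and write their state to disk.
  subprocess.run(["docker", "checkpoint", "create", CONTAINER, CHECKPOINT],
                 check=True)

  # Later (on the same host, by default): resume the container from that state.
  subprocess.run(["docker", "start", "--checkpoint", CHECKPOINT, CONTAINER],
                 check=True)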


Great project!

For long-running containerised simulations, this saves a lot of time on failures (as long as you have a safe place to write the snapshots to) by not restarting from 0 every time.


seriously cool project, used it at a prev workplace to checkpoint http servers for absolutely dirt nasty start speeds


nginx gone wild?


How does it compare to dumping a core, or to recording what the process is doing for reverse debugging?



