
I built crik [1] to orchestrate CRIU operations inside a container running in Kubernetes, so that you can migrate containers when a spot node gets a shutdown signal. I presented it at KubeCon Paris 2024 [2] with a deep dive into the technical details for those interested. A rough sketch of the shutdown-to-checkpoint flow is below the links.

[1]: https://github.com/qawolf/crik

[2]: The Party Must Go On - Resume Pods After Spot Instance Shutdown, https://kccnceu2024.sched.com/event/1YeP3
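
Roughly, the core idea looks like this (a simplified sketch, not crik's actual code: the criu flags are real, but the PID handling and the /checkpoint path are illustrative):

    package main

    import (
        "log"
        "os"
        "os/exec"
        "os/signal"
        "syscall"
    )

    func main() {
        // Wait for the node's shutdown signal (spot reclamation
        // typically arrives as SIGTERM via the kubelet).
        sig := make(chan os.Signal, 1)
        signal.Notify(sig, syscall.SIGTERM)
        <-sig

        pid := os.Getenv("APP_PID") // illustrative: PID of the wrapped process

        // Dump the process tree to a directory that outlives the pod,
        // e.g. a persistent volume. --tcp-close drops established TCP
        // connections instead of trying to dump them.
        cmd := exec.Command("criu", "dump",
            "--tree", pid,
            "--images-dir", "/checkpoint",
            "--tcp-close",
            "--shell-job",
        )
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        if err := cmd.Run(); err != nil {
            log.Fatalf("criu dump failed: %v", err)
        }
        // A replacement pod on another node then runs `criu restore`
        // against the same images directory.
    }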




My process connects to, say, Postgres. What's going to happen to that connection upon restore?

Does crik guarantee the order of events? Saving a checkpoint must be followed by killing the old process/pod, which must be followed by restoration; the order of these three events is strict. And given that CRIU can checkpoint and restore socket state correctly, how does that work in Kubernetes? The new pod will have a different IP.


TCP connections are identified by their source IP:port and destination IP:port tuples. When a new pod is created, it gets a new IP, so there is no practical way to restore the old TCP connections. crik therefore drops all TCP connections and lets the application handle the reconnection logic. Some CNIs can give a pod a static IP, but that's rather unorthodox in k8s.
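
The reconnection side is plain retry-with-backoff in the client. A minimal sketch (illustrative, not crik code), assuming Go's database/sql with a Postgres driver:

    package main

    import (
        "database/sql"
        "log"
        "time"

        _ "github.com/lib/pq" // assumed Postgres driver
    )

    func execWithRetry(db *sql.DB, query string) error {
        backoff := 100 * time.Millisecond
        var err error
        for attempt := 1; attempt <= 5; attempt++ {
            if _, err = db.Exec(query); err == nil {
                return nil
            }
            // database/sql re-dials dead pooled connections on the
            // next attempt, so sleep-and-retry is usually enough
            // after a restore drops the old sockets.
            log.Printf("attempt %d failed: %v", attempt, err)
            time.Sleep(backoff)
            backoff *= 2
        }
        return err
    }

    func main() {
        db, err := sql.Open("postgres", "postgres://app@db:5432/app?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        if err := execWithRetry(db, "SELECT 1"); err != nil {
            log.Fatal(err)
        }
    }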


Right, and this shouldn't be a big issue for [competent] cloud-native software: it's a transient fault. If your software can't recover from transient faults then this is the wrong ecosystem to be considering.


> The new pod will have a different IP.

Usually clients connect through a Kubernetes Service so they don't have to deal with changing pod IPs. Even for just a single pod I would do that.


The app in the pod is the client (of a DBMS server), and it's the client's IP that changes. A Service in k8s is a network endpoint with a stable address, but it only helps inbound connections; outbound connections (like from the app to a DBMS server, which may even be outside the k8s cluster) usually don't go through a Service, as it gives no benefit there.


Great talk! I'm curious about an approach like this combined with cuda-checkpoint for GPU workloads: https://github.com/NVIDIA/cuda-checkpoint
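
For context, cuda-checkpoint is designed for exactly that pairing: you toggle the CUDA state out of the GPU into host memory, then let CRIU dump the now CPU-only process. A rough, untested sketch of the sequence (the --toggle/--pid flags come from the cuda-checkpoint README; everything else is illustrative):

    package main

    import (
        "log"
        "os"
        "os/exec"
    )

    func run(name string, args ...string) error {
        cmd := exec.Command(name, args...)
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        return cmd.Run()
    }

    func checkpointGPUProcess(pid string) {
        // 1. Move CUDA state (device memory, contexts, streams) into
        //    host memory and detach the process from the GPU.
        if err := run("cuda-checkpoint", "--toggle", "--pid", pid); err != nil {
            log.Fatalf("cuda-checkpoint failed: %v", err)
        }
        // 2. The process now looks like a plain CPU process, so CRIU
        //    can dump it as usual.
        if err := run("criu", "dump", "--tree", pid,
            "--images-dir", "/checkpoint", "--shell-job"); err != nil {
            log.Fatalf("criu dump failed: %v", err)
        }
        // Restore is the mirror image: criu restore, then toggle the
        // CUDA state back onto a GPU.
    }

    func main() {
        checkpointGPUProcess(os.Args[1])
    }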


This makes sense for checkpointing and restoring long ML training runs.

Doing this on a networked application is going to be iffy. The restored program sees a time jump, and if the restore is from a checkpoint taken before a later crash, the rest of the world sees a replay of actions the restored program already performed once.
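
(A restored process can at least notice the jump. A minimal heartbeat sketch in Go, assuming wall-clock comparison; the thresholds are illustrative:)

    package main

    import (
        "log"
        "time"
    )

    func main() {
        const tick = time.Second
        // Round(0) strips Go's monotonic reading so we compare wall
        // clocks, which is what jumps across a checkpoint/restore.
        last := time.Now().Round(0)
        for range time.Tick(tick) {
            now := time.Now().Round(0)
            if gap := now.Sub(last); gap > 3*tick {
                // We were likely frozen and restored: leases, tokens,
                // and caches from before the gap may be stale.
                log.Printf("wall-clock jump of %v detected", gap)
            }
            last = now
        }
    }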

If you just want to migrate jobs within a cluster, there's Xen live migration.



