
We considered using something like this to cache some Python program state to speed up startup, since startup was quite long for some of our scripts (due to slow NFS, but also to importing lots of libs, like PyTorch or TensorFlow). We wanted to snapshot the program state right after importing the modules and loading some static stuff, before executing the actual script or doing anything else dynamic. That way the script can still be updated while keeping the same preloaded state.

Back then, CRIU turned out not to be an option for us. E.g. one of the problems was that it could not be used as non-root (https://github.com/checkpoint-restore/criu/pull/1930). I see that this PR has been merged now, so maybe this works today? Not sure if there are other issues.

We also considered DMTCP (https://github.com/dmtcp/dmtcp/) as another alternative to CRIU, but that had other issues (I don't remember).

The solution I ended up with was to implement a fork server. A server process starts initially, preloads only the modules (and maybe other things), and then waits. Whenever I want to execute some script, I fork from the server and use the forked process right away. I used logic similar to reptyr (https://github.com/nelhage/reptyr) to redirect the PTY. This worked quite well.
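The core idea can be sketched in a few lines of Python. This is just a minimal illustration of the fork-server pattern (not the actual python-preloaded implementation, which also handles the PTY redirection): heavy imports happen once in the parent, and each forked child inherits the already-loaded modules; here `json` stands in for an expensive import like `torch`, and the pipe-based result passing is an assumption for the sketch.

```python
import os
import json

# Preload phase: heavy imports happen once, in the parent process.
# `json` is a stand-in for something expensive like `import torch`.

def run_task(task_id):
    # Runs inside the forked child; all preloaded modules are inherited,
    # so there is no per-task import cost.
    return {"task": task_id, "pid": os.getpid()}

def serve(tasks):
    """Fork one child per task; collect each child's result via a pipe.
    A real fork server would wait on a socket for requests instead of
    iterating over a fixed task list."""
    results = []
    for task_id in tasks:
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:            # child: inherits the preloaded state
            os.close(r)
            with os.fdopen(w, "w") as f:
                json.dump(run_task(task_id), f)
            os._exit(0)
        os.close(w)             # parent: read the child's result
        with os.fdopen(r) as f:
            results.append(json.load(f))
        os.waitpid(pid, 0)      # reap the child
    return results
```

Each child gets a fresh copy-on-write snapshot of the parent's state, which is what makes the startup essentially free.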

https://github.com/albertz/python-preloaded




How were you handling GPU state w/ pytorch? We added some custom code around CRIU to enable GPU checkpointing fwiw: https://docs.cedana.ai/setup/gpu-checkpointing/


Not at all. I forked before using anything with CUDA. I didn't need it, but I guessed it could cause all kinds of weird problems.


This sounds similar to what's been done to speed up FaaS cold starts: snapshot the VM after the startup code runs, then launch functions from the snapshot. E.g., https://www.sysnet.ucsd.edu/~voelker/pubs/faasnap-eurosys22.....



