
For another example of a similarly beautiful interface that echoes "difficult solution made stupidly simple to use", checkpointing under DragonFly BSD: http://leaf.dragonflybsd.org/cgi/web-man?command=sys_checkpo...

On an unrelated note, I've always had respect for how Theo de Raadt is both the project leader of a complete BSD system and still an active hacker. Contrast that with Linus Torvalds, who's mostly a manager nowadays.




On an unrelated note, but tangential to your link, why can I not save and restore processes? It seems like something that would be relatively easy to do - suspend all threads, save the thread control blocks, mark all pages as fault-on-write, resume threads, and start saving pages, unmarking them once they are saved. If/when a thread faults, copy (or save immediately) that page and then resume that thread. (Or potentially don't resume the threads until after you save everything, although that may cause problems...) You need to deal with file handles / etc, but that can be done too.

Not really different than hibernation.
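
To make the scheme above concrete, here is a toy model of it in Python. Nothing here touches real page tables; `ToyProcess`, its "pages", and the method names are all made up for illustration, and the threads are simply assumed to be suspended while `begin_checkpoint` runs. It only shows why saving a page on the first write fault keeps the image consistent.

```python
class ToyProcess:
    """Toy model: memory is a list of pages, each page a bytes object."""

    def __init__(self, pages):
        self.pages = list(pages)
        self.protected = set()  # pages currently marked fault-on-write
        self.saved = {}         # page index -> contents at checkpoint time

    def begin_checkpoint(self):
        # Threads are assumed suspended here; mark every page fault-on-write.
        self.protected = set(range(len(self.pages)))
        self.saved = {}

    def _save_page(self, index):
        # Record the page as it was at checkpoint time, then unmark it.
        self.saved[index] = self.pages[index]
        self.protected.discard(index)

    def write(self, index, data):
        # Simulated write fault: save the old contents before the write lands.
        if index in self.protected:
            self._save_page(index)
        self.pages[index] = data

    def save_next_page(self):
        # Background saver: save one still-marked page; False when none remain.
        if not self.protected:
            return False
        self._save_page(min(self.protected))
        return True

    def image(self):
        # The finished checkpoint image, in page order.
        return [self.saved[i] for i in range(len(self.pages))]


proc = ToyProcess([b"a", b"b", b"c"])
proc.begin_checkpoint()
proc.save_next_page()        # saver gets to page 0 first
proc.write(2, b"Z")          # "thread" writes page 2 before the saver reaches it
while proc.save_next_page():
    pass
snapshot = proc.image()
```

The image reflects memory as it was when the checkpoint began, while the live process keeps its newer write. File handles and sockets are not modelled at all.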


You can, there's a process freezing utility for Linux; I'll see if I can remember its name.

EDIT: Here it is: CryoPID (https://github.com/maaziz/cryopid)

CryoPID allows you to capture the state of a running process in Linux and save it to a file. This file can then be used to resume the process later on, either after a reboot or even on another machine.


cryopid and cryopid2 are long abandoned, possibly not working on 3.x and 4.x kernels anymore.

Modern and more sophisticated equivalents are CRIU and DMTCP.



You've touched on the problem but not quite nailed it.

> You need to deal with file handles / etc, but that can be done too.

That's actually the hard part. To get a real image of that process in time, you need to snapshot the full filesystem state, too. Or it could change out from beneath your program. Even more complicated: network state.


Why is the network more complicated? I would think the network doesn't have any of the atomic/uninterruptible states a filesystem might.


It's easy to re-open a file (assuming it's still there), but with sockets your IP may have changed, the remote IP may have changed (which you may have stored in your working memory that got checkpointed), DNS may point you to a different service entirely, you could have had to do some kind of port knocking or something to get that connection open in the first place.

I know this kind of stuff is being worked on so VMs/containers/namespaces can be moved around but it seems to be one of those things that gets really complicated when you try to do it transparently for userspace.
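
For the "easy to re-open a file" half, a rough sketch of what restoring a file handle could look like: remember the path, mode, and offset, then re-open and seek. `describe_file` and `restore_file` are hypothetical helpers, not taken from any real checkpointing tool, and the sketch assumes the file is still at the same path with the same contents.

```python
import os
import tempfile


def describe_file(f):
    # Everything needed to re-open later, assuming the path stays valid.
    return {"path": f.name, "mode": f.mode, "offset": f.tell()}


def restore_file(desc):
    # Re-opening with a "w" mode would truncate the file, so refuse it here.
    if "w" in desc["mode"]:
        raise ValueError("refusing to re-open with a truncating mode")
    f = open(desc["path"], desc["mode"])
    f.seek(desc["offset"])
    return f


# Demo: read part of a file, "checkpoint" the handle, then restore it.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"hello world")

original = open(path, "rb")
original.read(6)                 # consume b"hello "
saved = describe_file(original)
original.close()                 # the process is "frozen" here

restored = restore_file(saved)
rest = restored.read()           # continues from the saved offset
restored.close()
os.unlink(path)
```

There is no equivalent three-field description for a connected socket, since the other end holds state too.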


IPs and DNS settings change under running programs all the time; that doesn't seem as unusual as re-opening a file that's actually not there. A unix socket is an interesting mixed case :)


If a process has a stream socket open to another process, or to another system over the network, what happens to that socket when the process is "thawed"?

How about if it's listening on a TCP port -- what happens if that port is in use by another process when the original one is thawed?
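
The listening-port case is easy to reproduce. In this sketch one socket stands in for the process that grabbed the port in the meantime, and `port_is_taken` (a made-up helper) plays the thawed process trying to bind again. It assumes the default socket options, i.e. no `SO_REUSEADDR` or `SO_REUSEPORT`.

```python
import errno
import socket


def port_is_taken(port, host="127.0.0.1"):
    # Try to bind the way a restored listener would have to.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
    except OSError as exc:
        return exc.errno == errno.EADDRINUSE
    else:
        return False
    finally:
        s.close()


# Another process "took" the port while ours was frozen.
squatter = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
squatter.bind(("127.0.0.1", 0))      # let the OS pick a free port
squatter.listen(1)
port = squatter.getsockname()[1]

taken = port_is_taken(port)          # the thawed process cannot bind it
squatter.close()
```

What the restore should do at that point (fail, wait, pick another port) is a policy question the sketch does not answer.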


I understand this can't go 'right', but are those things more difficult than filehandles to files that have been deleted?


Handles to deleted files are relatively uncommon in practice. Network sockets aren't.


"Handles to deleted files are relatively uncommon in practice."

Could you please expand on your reasoning here? We're talking about restoring processes at arbitrary points in the future. That means we're not just talking about handles to files that were deliberately deleted while the process was running, but also anything that the process had open that was frozen that may have been subsequently deleted. That would seem to include any log file that gets rotated, which is not exactly rare, plus a ton more things.

I also think that treating network sockets as if they were disconnected is likely to go better than treating files that way - existing programs probably make more assumptions about disk state not changing unexpectedly than about network state not changing unexpectedly (even if both are technically not well founded).
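
The log rotation case, sketched for a POSIX system (on Windows the rename/unlink of an open file would typically fail instead). The file names are arbitrary. A live process keeps its handle across the rotation; a restore that works by re-opening the saved path has nothing to open.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "app.log")

handle = open(path, "w+")
handle.write("before rotation\n")
handle.flush()

# Rotation: the log is moved aside, and later the rotated copy is deleted.
os.rename(path, path + ".1")
os.unlink(path + ".1")

# The live process still has a working handle to the now-unlinked file.
handle.seek(0)
still_readable = handle.read()

# A restore that re-opens by path finds nothing there.
path_exists = os.path.exists(path)
try:
    open(path).close()
    reopen_failed = False
except FileNotFoundError:
    reopen_failed = True

handle.close()
os.rmdir(workdir)
```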


IIRC the CRIU developers went into some detail about this on FLOSS Weekly some time back:

https://twit.tv/shows/floss-weekly/episodes/334

I can't remember exactly where in the podcast they discussed it, but I believe it was just before the part where you could hear brains exploding in the background.


Elaborate? I don't think there's much of anything that could change out from under a suspended process that couldn't change out from under a running process.

(Case in point: you can have a system hibernate, have a supposedly locked file change, and have the system resume.)


This is a very good point. I definitely see the value in it, and making it simple to use means that programs are more likely to actually use it.



