Wouldn’t it be possible to automatically back up the data before reboot? Docker ...

bborud · on March 5, 2022

That assumes a willingness to learn enough about Docker to understand how you configure your container to do what you want. That can, in some organizations, be too much to ask of researchers. That's the core of the problem.

dharmab · on March 4, 2022

I believe the SREs asked for that (TensorFlow, for example, supports checkpointing to disk and restoring progress), but the ML training software used by the data scientists did not have this feature.
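
For anyone unfamiliar with the idea, here is a minimal sketch of what "checkpointing to disk and restoring progress" could look like in a long-running job. It is a generic stdlib illustration, not TensorFlow's API and not the software from the story; the function names, the JSON state format, and the `checkpoint_every` parameter are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a reboot mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old checkpoint; atomic when both are on one filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def train(path, num_steps, checkpoint_every=10):
    """Hypothetical training loop that resumes from the last checkpoint."""
    state = load_checkpoint(path)
    for step in range(state["step"], num_steps):
        state["total"] += step  # stand-in for one unit of real training work
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the machine reboots partway through, calling `train` again with the same path picks up from the last saved step rather than from zero, so at most `checkpoint_every` steps of work are lost.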