
> these folks often otherwise don't use git enough for their code.

Or at all. This is actually a fairly common problem among certain types of research scientists who think software engineering is "peasant work".

I remember a friend of mine working at a research organization where several researchers lost months of work due to an unscheduled server reboot. Turns out when you log into ephemeral containers and pretend they are VMs, things go poof.




I read a post on Reddit's legal advice subreddit from a data scientist whose apartment had been robbed, and part of it was that they'd lost all their code and the projects they were currently working on.

As horrible as the robbery was, I was screaming inside "WHAT ABOUT VERSION CONTROL?!??!"


The worst story I read was of someone who had worked on his PhD thesis for 2 years, then left his laptop on a bus and lost everything, because in those 2 years he had never stored the files anywhere else. I personally met a teacher who stored the only copy of her students' final exam submissions on her everyday thumb drive, without any backup.

For many people technology really is magic.


That’s a really sad story.

If he doesn’t come from a programming background, version control can seem foreign.

But like ya know… Dropbox. I’d be more worried about my hard drive getting corrupted than anything else.


I worked at a company that had a bunch of VMs for their data science teams. They replaced them with containers, then suddenly panicked when they realized the containers would lose state whenever the admins of the container hosts applied security patches and rebooted, or retired outdated nodes. (The old VMs were rarely, if ever, patched after the initial boot, which became a compliance problem.) It turned into a monstrous issue where the data scientists wanted 6 months of continuity, the data platform SREs wanted 30 days, and the container host admins wanted 15 minutes.

I think they might still be deadlocked to this day.


I love hearing these kinds of stories just to remind myself how not dysfunctional my own company actually is.


I know plenty of companies that rarely apply patches or do OS upgrades. I'm talking servers with 4 to 5 years of uptime, still running Ubuntu 16.04 or worse. They don't want to reboot because some dude who left 4 years ago set everything up, often in a non-standard way, and nobody is quite sure how it works. They certainly don't want to be blamed when production goes down. I did a contract job for a very large company that was running a 6-year-old distro on a "critical" server and was afraid to do anything beyond changing a password. It's easier to have an outside person do it, so they can blame them when it gets screwed up.


Wouldn’t it be possible to automatically back up the data before a reboot?

Docker offers persistent volumes, so my 30-second solution would be to use one of those.
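
Something like this, using the Docker SDK for Python (just a rough sketch; the image name, volume name, command, and mount path are all placeholders):

    import docker

    client = docker.from_env()

    # Mount a named volume so anything written under /workspace survives
    # container restarts and host reboots. All names here are placeholders.
    client.containers.run(
        "python:3.11",
        "python train.py",
        volumes={"scratch-data": {"bind": "/workspace", "mode": "rw"}},
        working_dir="/workspace",
        detach=True,
    )

The CLI equivalent is a plain docker run -v scratch-data:/workspace ... flag.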


That assumes a willingness to learn enough about Docker to understand how you configure your container to do what you want. That can, in some organizations, be too much to ask of researchers. That's the core of the problem.


I believe the SREs did ask for that (e.g. TensorFlow supports checkpointing to disk and restoring progress), but the ML training software used by the data scientists did not have this feature.
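
For anyone curious, the TensorFlow side of it is small. A rough sketch (the model, optimizer, and the /mnt/persistent path are made up, and it only helps if the training code actually calls it):

    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    optimizer = tf.keras.optimizers.Adam()

    # Track model and optimizer state; /mnt/persistent stands in for whatever
    # storage survives the host being patched and rebooted.
    ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
    manager = tf.train.CheckpointManager(ckpt, "/mnt/persistent/ckpts", max_to_keep=3)

    # On startup, resume from the latest checkpoint if one exists.
    if manager.latest_checkpoint:
        ckpt.restore(manager.latest_checkpoint)

    # Inside the training loop, save every so often.
    manager.save()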



