
War story time. Long ago, I worked for an interesting company that insisted on running its entire business on Linux desktops, all the way back in 1999-2002. Imagine running StarOffice/OpenOffice, Thunderbird, Netscape Navigator, etc., for your entire business back in 2000, including your executive team, marketing teams, everyone, most of whom had never even heard of Linux before.

Anyway, this being Linux, everyone's home directory was mounted on NFS. All our builds were standardized with a tool called SystemImager, which we could use to push out updates to everyone's desktop whenever we wanted. If there was a new version of KDE, we could pretty easily push that change out.

Sometimes it was convenient for me to work on updates to these images by chrooting into a directory containing the "image," which was really just an rsync tree. And sometimes, when updating these images, it was convenient to mount our NFS home directories in this chroot environment, so I could access things like an archive I had just downloaded on my own desktop.

And eventually we had lots of different images, and the old ones were using up a lot of disk space, so I decided to free up some space by removing the old ones. These were fairly large images with lots of small files, and this was before SSDs were a thing, so it made sense that deleting them was taking a while, and I stepped out to grab something to eat.

As I was eating lunch, I started getting the tech support escalations. But this wasn't that unusual, our users routinely had problems with the environment we had provided. They hated it, because it was in many ways terrible, and they made sure we knew it. So I wasn't terribly alarmed. I didn't think any major changes had been made, so I didn't hurry back.

By the time I leisurely returned from lunch, half the NFS home directories for our users were gone, along with all their documents, emails, bookmarks, or whatever else. Suddenly it hit me what had happened: at some point, perhaps months earlier, I had left our NFS home directories mounted within one of these image chroots. And now I had sudo rm -rf'd it.

We had backups, but they were on tape, and it took several days to restore, with about a day of data loss.




That sinking feeling and cold panic when you realise what you've done. God that is horrible.


My favorite version is when that UPDATE or DELETE SQL query that you expected to finish instantly takes a few seconds before giving you your cursor back.


If someone just gave me a tool to show the expected wall time of a query before actually running it, I would be quite happy. I would not even need much accuracy; anything within one order of magnitude would be useful, and even within two orders of magnitude I would use it occasionally.
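Not quite the estimator being asked for, but a common workaround (a minimal sketch, assuming PostgreSQL, psql, and a hypothetical orders table) is to run the statement inside a transaction, note the time psql reports, and roll it back. You still pay the wall time once, but you can back out of the change:

  -- in psql: \timing makes it print "Time: ... ms" after each statement
  \timing on
  BEGIN;
  -- hypothetical statement; it really executes, and holds its locks
  -- until the transaction ends
  DELETE FROM orders WHERE created_at < '2015-01-01';
  ROLLBACK;  -- undo it; EXPLAIN would show the planner's cost estimate
             -- without executing, but cost units don't map cleanly to time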



Nobody has ever been able to give me a function from query cost to wall time with any accuracy.


You probably knew this already, and there are probably better solutions if you're not in the manual sysadmin world, but after I did that on a personal machine a few decades ago (I think it was?), I got in the habit of using `--one-file-system` when doing major recursive rm operations that weren't meant to cross filesystems. Or `find -xdev … -delete` for anything more selective.
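For anyone who hasn't used those flags, a quick sketch of both forms (the paths and pattern are made up):

  # GNU rm: during a recursive delete, skip anything that lives on a
  # different filesystem than the argument (NFS mounts, bind mounts, ...)
  rm -rf --one-file-system /var/lib/images/old-image

  # GNU find: -xdev stops descending at filesystem boundaries
  find /var/lib/images/old-image -xdev -name '*.rpm' -delete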


It seems better to alias rm to "rm --one-file-system", assuming major cross-filesystem deletes aren't something you do so often that they need to be as ergonomic as possible.
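A minimal sketch of that alias (bash; it only affects interactive shells that read your rc file, not scripts):

  # in ~/.bashrc: recursive deletes stop at filesystem boundaries by default
  alias rm='rm --one-file-system'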


Similar story, except we were using an NFS appliance that took hourly snapshots. As soon as we figured out what was happening, we had the storage team save off the latest snapshot. It was 1TB of data (a lot for the time) and took a week for us to restore.


A lot of companies still work in a similar fashion to what you described, maybe with root squashed, but it's still very possible for something like that to happen nowadays!
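For reference, root squash is the NFS export option that maps a client's root user to an unprivileged user on the server, so a stray rm -rf run as root on a workstation can only remove what that unprivileged user could. A sketch of an /etc/exports line, with a hypothetical path and network:

  # on the NFS server: root_squash (the default) maps client root to the
  # anonymous user; no_root_squash disables that protection
  /home  192.168.0.0/24(rw,sync,root_squash)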

I remember someone hit a bug with docker exec --rm years ago where it started deleting some NFS files that it shouldn't...


This reminds me of a time when a colleague and I were investigating some persistent D-State processes that were occurring when container processes were being exec-ed.

Once on the box, we wanted to create a container with utilities in the fs, but didn't want to download an image tarball or dig through the rootfs layer directories for one to use, so we just bind-mounted the host root onto another directory, beside the config file we were using.

This worked like a charm. Until we rm -rf'd the config directory and deleted host root in the process.
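For anyone picturing the failure mode, a minimal sketch of that kind of setup (the paths are hypothetical): a recursive delete of the config directory happily descends into the bind mount, and therefore into the host filesystem, which is exactly what the --one-file-system flag mentioned upthread guards against:

  mkdir -p /etc/mycontainer/rootfs          # hypothetical config dir
  mount --bind / /etc/mycontainer/rootfs    # host root now visible here
  # later:
  #   rm -rf /etc/mycontainer    <- follows the bind mount into host root
  # unmount first (or use rm --one-file-system) before cleaning up:
  umount /etc/mycontainer/rootfs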

In our case, fortunately the consequences were minimal as all workloads were stateless. The container scheduler moved all the workloads to other hosts and the host scheduler noticed this VM wasn't responding any more and rolled a new one. The whole thing resolved itself in about 5 minutes with no interaction from us - so that was pretty neat.


That's a very sad war story, hope it turned out OK. Sorry you and the users had to go through that.


Oh man - this one is anxiety inducing. I feel like this would haunt me for years.



