Couldn't `mv` or `cp` from the temp file to `/etc/passwd` be interrupted as well?

AdamJacobMuller · on April 15, 2022

mv can't or, more correctly, the rename system call can't.

rename is an atomic operation from any modern filesystem's perspective. You're not writing new data, you're simply changing the name of an existing file, so it either succeeds or fails.

Keep in mind that mv (the command-line tool), as opposed to the `rename` system call, falls back to copying if the source and destination files are on different filesystems, since you cannot really rename a file across filesystems. That copy is not atomic.

In order to have truly atomic writes you need to:

open a new file on the same filesystem as your destination file

write contents

call fsync

call rename

call sync (if you care about the file rename itself never being reverted).

This is some very naive golang code (from when I barely knew golang) for doing this which has been running in production since I wrote it without a single issue: https://github.com/AdamJacobMuller/atomicxt/blob/master/file...

JJMcJ · on April 15, 2022

Not clear on need for fsync and sync.

Are those for networked filesystems like NFS, or just as protection against crashes?

Logically on a single system there would be no effect assuming error free filesystem operation. Unless I'm missing something.

AnssiH · on April 15, 2022

Without the fsync() before rename(), on system crash, you can end up with the rename having been executed but the data of the new file not yet written to stable storage, losing the data.

ext4 on Linux (since 2009) special-cases rename() when overwriting an existing file so that it works safely even without fsync() (https://lwn.net/Articles/322823/), but that is not guaranteed by all other implementations and filesystems.

The sync() at the end is indeed not needed for the atomicity, it just allows you to know that after its completion the rename will not "roll back" anymore on a crash. IIRC you can also use fsync() on the parent directory to achieve this, avoiding sync() / syncfs().

AdamJacobMuller · on April 16, 2022

> ext4 on Linux (since 2009) special-cases rename

This is interesting.

The linked git entry (https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.g...) from the LWN article says "Notice: this object is not reachable from any branch."

Did this never get merged? I definitely saw this issue in production well after 2009.

I guess it either got changed or a different patch was applied, but perhaps this https://github.com/torvalds/linux/blob/master/fs/ext4/namei.... does it?

AnssiH · on April 16, 2022

The patch just got rebased, here's the one that was actually applied in master for v2.6.30: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.g...

And yes, the code you highlighted is exactly this special-case in its current form. The mount option "noauto_da_alloc" can be used to disable these software-not-calling-fsync safety features.

rcoveson · on April 15, 2022

I'd like to know why as well. The inclusion of the fsync before the rename implies to me that the filesystem isn't expected to preserve order between write and rename. It could commit a rename before committing _past_ writes, which could leave your /etc/passwd broken after an outage at a certain time. I can't tell whether that's the case or not from cursory googling (everybody just talks about read-after-write consistency). Maybe it varies by filesystem?

The final sync is just there for durability, not atomicity, like you say.

AdamJacobMuller · on April 16, 2022

> the filesystem isn't expected to preserve order between write and rename

Correct.

The rename can succeed while the write of the file you just renamed gets rolled back.

sedatk · on April 15, 2022

You can use `/etc/passwd.new` as a temporary file to avoid the problems you mentioned. In the worst case, you'll have an orphaned passwd.new file, but /etc/passwd is guaranteed to remain intact.