Things UNIX can do atomically (rcrowley.org)
148 points by blasdel on Jan 6, 2010 | 38 comments



I have a copy of "Advanced Programming in the UNIX Environment, Second Edition" (featuring a very good description of the C/POSIX/SUS/XSI protocols/APIs and programming techniques for Linux/BSD/OS X/Solaris/other UNIX systems). Anyone who seriously wants to do thread-safe multiprocessing or other complex low-level stuff on UNIX platforms should have a copy of it. It's basically "the book" on the topic.


This is one of a very small number of books older than 5 years which I still regularly reference. Amusingly, most of the books on that list were written by W. Richard Stevens as well.


The Linux-specific /Linux Application Development/ by Johnson and Troan is very good too. http://ladweb.net/


The very first example ("mv -T" to retarget a symlink, something I didn't know about and that the manpage is silent on) is not atomic! The symlink(2) syscall will not overwrite an existing directory entry. You have to call unlink(2) first, which opens a race.

That doesn't mean this trick is useless, but calling it "atomic" is wrong, and opens up the possibility of bugs (likely security holes, even) if someone actually needs atomicity.
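If you do need atomicity, the usual approach (a rough sketch; names are illustrative) is to create the new symlink under a temporary name and then rename(2) it over the old one, since rename replaces its destination atomically:

    #include <stdio.h>
    #include <unistd.h>

    /* Sketch: atomically repoint the symlink "current" at new_target.
     * Assumes "current.tmp" is on the same filesystem as "current". */
    int repoint(const char *new_target)
    {
        unlink("current.tmp");                         /* clear any stale temp; ENOENT is fine */
        if (symlink(new_target, "current.tmp") == -1)
            return -1;
        if (rename("current.tmp", "current") == -1) {  /* atomic replacement per POSIX */
            unlink("current.tmp");
            return -1;
        }
        return 0;
    }

That way readers see either the old link or the new one, never a missing one.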


> something I didn't know about and that the manpage is silent on

It's specific to GNU Coreutils mv(1).


To change a symlink I use: ln -snf NEW_SOURCE TARGET


Which works fine, but is still not atomic. To be clear, saying something is "atomic" means that there is no way for another process to see the operation in an intermediate state (in the example in the article: a missing or incorrect symlink to a deployed web application). Under the hood, the /bin/ln program still needs to call unlink() on the path first, so there is a short period of time during which the desired link isn't present. So software written to assume that the link is always valid (for example, because its author was told that the update was atomic) will have hidden bugs.

This kind of tool merely runs really fast. The race condition is still there.
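Roughly, it boils down to this sequence (a sketch of what I described above, not coreutils' actual source):

    #include <unistd.h>

    /* Sketch: the window between the two calls is the race. Any process that
     * resolves target in that window sees ENOENT instead of a symlink. */
    void replace_link_racy(const char *new_source, const char *target)
    {
        unlink(target);                 /* old link disappears here...      */
        symlink(new_source, target);    /* ...new link only appears here    */
    }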


I never said it was atomic ;-) just that I use that method rather than mv.

I mainly use it when I'm updating apps or SDKs to newer versions and want to keep the old ones around just in case.


Beware: rename() isn't atomic on Mac OS X:

http://www.weirdnet.nl/apple/rename.html


What's weird is the timeline on that: look at when it was reported and how long it remained present. Surely it can't be that hard to fix properly?


Could be something with HFS+


Apparently not: http://www.reddit.com/r/programming/comments/amo3x/on_how_re...

That's pretty embarrassing.


Wow, this breaks a lot of UNIX.


  pipe(&pipefd);
  assert(len <= PIPE_BUF);
  write(pipefd[1], buf, len);
Also, if a file is opened O_APPEND, the seek before every write is atomic, but since writes in general aren't atomic I'm not sure what value that has.
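For what it's worth, the case where it does pay off is concurrent appenders to the same file (e.g. a shared log): the end-of-file positioning and the write happen as a single step, so no appender overwrites another's data. A sketch (path is illustrative):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Sketch: with O_APPEND, the seek-to-end and the write happen as one step,
     * so concurrent appenders never clobber each other's records. Keeping each
     * record in a single small write() is what keeps lines from interleaving
     * in practice. */
    void log_line(const char *msg)
    {
        int fd = open("/var/log/myapp.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd == -1)
            return;
        write(fd, msg, strlen(msg));   /* one record per write() */
        close(fd);
    }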


This is a great one: Under POSIX, any write() of at most PIPE_BUF bytes to a pipe or FIFO is atomic.

There's a neat trick I'd like to add that takes advantage of this: Multiple clients can use a message whose size divides evenly into PIPE_BUF without any possibility of interleave, and so they do not need to lock.

    int fds[2], i, r;
    int n = sizeof(struct message);
    struct message m[PIPE_BUF / n];              /* a full PIPE_BUF worth of messages */
    assert(sizeof(m) == PIPE_BUF);               /* message size must divide PIPE_BUF */
    if (pipe(fds) == -1) abort();
    for (i = 0; i < cpus; i++) { if (!fork()) child(fds[1]); }
    while ((r = read(fds[0], m, PIPE_BUF)) > 0) { for (i = r/n - 1; i >= 0; --i) handle(m + i); }
Each child can write entire messages, without locking and without interleaving, so long as each write is at most PIPE_BUF bytes and sizeof(struct message) divides PIPE_BUF evenly.
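The writer side might look something like this (a sketch using the same placeholders as above; produce() is hypothetical):

    #include <unistd.h>

    /* Writer sketch: each message goes out in a single write() of
     * sizeof(struct message) <= PIPE_BUF bytes, which POSIX guarantees will
     * not be interleaved with other writers' data. produce() is a placeholder. */
    void child(int wfd)
    {
        struct message msg;
        for (;;) {
            produce(&msg);                  /* fill in one whole message       */
            write(wfd, &msg, sizeof msg);   /* one message per write(): atomic */
        }
    }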

Btw, your pipe(&pipefd) should've been pipe(pipefd) if pipefd is an array of 2 int


Makes logging things easier, and in correct order!


I must brag (because hopefully it's interesting):

About 22 or 23 years ago my mentor at the time, a unix guru, beat more or less that same list into my thick skull with a heavy schtick. It was considered part of "basic competency" for unix. The list, back then, came with a list of bitter exceptions (e.g., vendor X had broken "rename(2)" and it was not reliably atomic on those systems).

Why was it important? The notion was that with a unix file system, and "tiny tools" connected by pipes, you really didn't need an SQL-based relational database for most tasks. You had shell programs like "cut(1)", "join(1)", "sort(1)", "grep(1)", and "sed(1)". In a pinch, you could reach for "awk(1)" but don't overdo it because the joke went "Some people, whenever they have a problem to solve, reach for AWK. Then they have two problems."

A serious relational database would give you transactions and isolation - the ACID properties. (Obtaining "durability" (the D in ACID) meant learning when and how to flush and sync the file system.) Back then, on some historic versions of unix, you could also count on an append to a text file being atomic if it was a single line of modest length written with a single "write(2)". But even back then, that trick wasn't portable - just sorely missed from the good ol' days.

Around about 2001 I set out to implement the GNU Arch revision control system. I was steamed at how Subversion was large, complicated, hard to get working, and used a centralized server rather than a distributed and decentralized system. I thought "I can do better than that, easily!" So I set out to prove it.

I wrote the initial self-hosting release of Arch as... wait for it ... shell scripts. Initially, there wasn't a single line of original C code in the system. It was all coded in /bin/sh, /bin/awk, /bin/find, with heavy use of sed, grep, cut, join, and sort. It was sluggish but usable and was, indeed, one of the very first truly distributed and decentralized revision control systems. It also delivered on the "smart merging" algorithms that Subversion had promised but, to this day, has not succeeded in implementing.

One of the challenges of a revision control system is the critical importance of database integrity. When you check your code in, when you make back-ups, etc -- and when multiple users may be doing the same thing at the same time on the same repository -- it's absolutely critical to keep the database consistent. You need ACID updates to the database.

So, here is a fun puzzle: Using just the atomic unix file system operations, you have to implement (as a shell script), a "write once" list-of-directories structure. I'll illustrate what I mean:

We have a directory called "the-list". Initially, it may contain arbitrary "control files/directories" that you design, but it is empty of "data directories". The data directories we add will be named as numbers. So the first one is "the-list/0", then "the-list/1", then "the-list/2" and so forth.

When a directory like "the-list/1" is installed, it should already contain data files. E.g., "the-list/1/patches.tar.gz". You aren't allowed to just "mkdir the-list/1" and then, after that, install "patches.tar.gz". If "the-list/1" is there at all, it must already contain "patches.tar.gz".

The data directories must be created in order. If two users both try to create "the-list/1" at the same time, one should succeed, and the other be forced to retry, perhaps creating "the-list/2" instead (after "the-list/1" already exists).

Finally, a program that starts to create "the-list/1" but then crashes must not harm the database. It is OK if a user has to type a command like "cleanup-after-crash" but it must be possible to write such a command (as a shell script). It is important that if someone types "cleanup-after-crash" while the program trying to create "the-list/1" is still running, either the "cleanup" program does nothing, or the program making "the-list/1" is forced to fail (and knows that it failed, and reports that to the user).

It sounds like something that should be trivially easy, right? Well, it doesn't take a huge amount of code to solve but solutions to the problem are not obvious. And many obvious attempts to solve the problem have subtle bugs that lose the ACID properties of the database.

GNU Arch source code (and perhaps the documentation, I forget) contains an answer key. It might be more fun, though, if you've nothing better to do, to try to solve it yourself, first.


Something to keep in mind if you are inclined to try solving the puzzle is that solutions ought to work on NFS file systems. Why does that matter?

Well, consider a system call like rename(2) or mkdir(2). Suppose that on a local file system, two programs both try to rename "a" to "b" at roughly the same time. At most one will succeed.

Not so, on NFS! An NFS implementation is permitted to behave this way: (1) Client x issues the rename request; (2) Client y issues the same rename request; (3) The server reports "success" to x; (4) The server reports "fail" to y, but the server's report packets get lost so y doesn't get the message. (5) Client y asks the server "Hey, whatever happened to my rename request?"; (6) Server notices that "a" doesn't exist but "b" does and replies "Success". Then both x and y think that they renamed "a" to "b".

The rename itself took place atomically. It's just unclear to each client which client actually triggered the rename. Similarly for other operations.

In other words, return values from file system calls don't reliably tell you when your process was able to perform some atomic action.


Is the package named "arch" in the ISO?

I'm downloading it right now... :)


I just recently discovered fcntl for creating file-based locks (and the Python wrapper for it, fcntl.lockf) and was pretty excited. It seems like a much safer way to do multi-process locking than creating lock files on my own.
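A minimal sketch of taking an exclusive lock this way (path is illustrative):

    #include <fcntl.h>
    #include <unistd.h>

    /* Sketch: take an exclusive fcntl() record lock on the whole file,
     * blocking until it is available. The lock lasts until it is released,
     * the process exits, or any descriptor for the file is closed. */
    int lock_file(const char *path)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd == -1)
            return -1;
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };  /* start/len 0 = whole file */
        if (fcntl(fd, F_SETLKW, &fl) == -1) {   /* F_SETLK would fail instead of blocking */
            close(fd);
            return -1;
        }
        return fd;
    }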


There is a catch, though; if you have the same file open multiple times, even with different names, then closing one file descriptor will release all locks on that file in the same process, even in different threads.

SQLite goes to great lengths to work around this. See: http://www.google.com/search?q=broken+by+design+site:sqlite....
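A minimal demonstration of the catch (path is illustrative):

    #include <fcntl.h>
    #include <unistd.h>

    /* Sketch of the pitfall: the lock taken through fd1 is silently released
     * the moment fd2 -- a separate descriptor for the same file, in the same
     * process -- is closed. */
    void lock_surprise(void)
    {
        int fd1 = open("/tmp/example.lock", O_RDWR | O_CREAT, 0644);
        int fd2 = open("/tmp/example.lock", O_RDONLY);   /* same file, second open */

        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        fcntl(fd1, F_SETLKW, &fl);   /* lock acquired via fd1                    */

        close(fd2);                  /* ...and released right here, even though  */
                                     /* fd1 is still open                        */
    }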


> There is a catch, though; if you have the same file open multiple times, even with different names, then closing one file descriptor will release all locks on that file in the same process, even in different threads.

It's even worse than that. A lock is released as soon as any of its owning processes closes any descriptor to that file. As a result, if a child forks and exits, the parent no longer holds any locks...


That's definitely worth keeping in mind! Does that apply to sibling processes as well (like two apache child processes)?


IIRC opening a file and using link() is about the only portable atomic method of file locking. Works on NFS and pretty much any UNIX. There may have been some other trick to making it work right on NFS but if you're going that far you may just need a lock manager.
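A rough sketch of that technique (names are illustrative); the st_nlink check is the usual extra step for NFS, where link() can succeed even though the success reply gets lost:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Sketch of the link() trick. Create a uniquely named temp file, then
     * link() it to the lock name: link() either creates the lock atomically or
     * fails if it already exists. The st_nlink == 2 check covers the NFS case
     * where link() succeeded but the reply was lost. */
    int take_lock(const char *lockpath)
    {
        char tmp[256];
        struct stat st;
        int fd, ok;

        snprintf(tmp, sizeof tmp, "%s.%ld", lockpath, (long)getpid());
        fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd == -1)
            return -1;
        close(fd);

        ok = (link(tmp, lockpath) == 0) ||
             (stat(tmp, &st) == 0 && st.st_nlink == 2);

        unlink(tmp);                 /* temp name is no longer needed either way */
        return ok ? 0 : -1;
    }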


As long as you are using NFSv4 with properly-configured, non-buggy clients and servers, fcntl-based locking will work correctly.


I agree. If only every client were capable of properly-configured, non-buggy NFSv4... which is why I say the "portable" method is with link(). There are a lot better solutions if you don't have to support crappy clients.


IPC semaphore ops using the semop() function are also supposed to be atomic at the kernel level.
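For example (a sketch; the key value is illustrative, and the semaphore is assumed to have already been initialized to 1 with semctl(SETVAL)):

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    /* Sketch: a System V semaphore used as a mutex. semop() applies its whole
     * sembuf array as one atomic operation, which is the property mentioned
     * above. Assumes the semaphore was initialized to 1 via semctl(SETVAL). */
    void with_semaphore(void)
    {
        int semid = semget((key_t)0x1234, 1, IPC_CREAT | 0600);
        struct sembuf down = { 0, -1, 0 };   /* P: wait for and take it */
        struct sembuf up   = { 0, +1, 0 };   /* V: give it back         */

        semop(semid, &down, 1);
        /* ... critical section ... */
        semop(semid, &up, 1);
    }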


One more: mkdir(2)


How would mkdir not be atomic? It is a logical operation that only encapsulates a single physical operation.

Rename is an interesting case, because it encapsulates two operations - creating the new name and removing the old one. Doing two operations in one step is interesting atomicity. Doing one operation in one step (mkdir) is not as interesting.


See Phrack 25: http://www.phrack.com/issues.html?issue=25&id=5 for what it was like when mkdir was not atomic.


mkdir can fail atomicity if non-local storage is involved: http://www.mail-archive.com/freebsd-hackers@freebsd.org/msg2...

I still use it for the quick-and-dirty lockfile case despite this.
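The quick-and-dirty version looks something like this (path is illustrative):

    #include <errno.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Sketch: mkdir() either creates the directory or fails with EEXIST, so
     * exactly one caller wins. Release the lock later with rmdir(). */
    int try_lock(void)
    {
        if (mkdir("/tmp/myapp.lock", 0700) == 0)
            return 1;                    /* we hold the lock      */
        if (errno == EEXIST)
            return 0;                    /* someone else holds it */
        return -1;                       /* unexpected error      */
    }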


For what it's worth, Windows can do any arbitrary sequence of file operations atomically, using transactional NTFS.


Transactional NTFS only works on Vista, Server 2003, and later. It only works on local drives. There are many other limitations. Because of these limitations, I have not yet been able to use Transactional NTFS in any real program. Somebody always wants to store all their stuff--even critical stuff--on some network share.


And on iSCSI LUNs IIRC, which is how most people would use consolidated storage (SAN vs NAS).


I mean doing things like accessing an Access database stored on a network share from multiple clients. I don't think many people are using iSCSI as a replacement for SMB yet.


> Server 2003

Do you mean Server 2008?

I agree that the fact that it's only supported on Vista and later is a problem, but XP will hopefully slide into irrelevancy in the next year or so.

> It only works on local drives.

Are the things described in the blog post atomic over NFS? I remember reading that NFS breaks plenty of the guarantees that a POSIX system provides. Guaranteeing atomicity over a network does seem to be a somewhat harder job than guaranteeing it locally.


It could very well be Server 2008 instead of Server 2003.

A perfect NFSv4 implementation, configured properly, provides a very useful set of guarantees. However, the quality of NFSv4 implementations has been an issue. In particular, Linux's NFSv4 server implementation has historically had many flaws. Plus, people often mount NFS using the NFSv3 protocol instead of NFSv4, and/or reduce the safety guarantees with the goal of improving performance.

This is why almost everybody recommends NOT to use NFS to store anything critical. For example, SQLite's documentation states very clearly that you shouldn't store SQLite databases on NFS. And, this is why Oracle supports NFS (a) only with a list of specific NFS server implementations (mostly NetApp), and (b) using its own built in NFS client implementation (not the operating system's transparent NFS client).

I think we're far off from calling XP irrelevant. People were still buying XP pre-installed PCs last year. I would say 2012 will be the earliest that mass-market software can realistically start ignoring XP.


mmap, as well.



