Appending to a file from multiple processes (nullprogram.com)
109 points by signa11 on Aug 3, 2016 | 49 comments



Why not just use a lock shared between all participating processes and call it a day? That seems a far more robust solution, and it is probably even quicker to implement than digging up all the edge cases from all the (file) system specifications and then hoping you did not miss anything and that all systems will continue to honor the least common denominator you just figured out.


I suspect shared lock performance is the sticking point.

Personally, I agree with tlb: it seems simpler to generate one file per thread, and combine them at the end.


I just tried it on Windows, 4 processes each acquiring the lock 100,000 times and doing no work while holding the lock, and it took about 2000 ms. I don't think that would become the bottleneck for a lot of applications.

EDIT: Corrected the numbers, I had a bug, previously it said 4 processes with 10 million locks per process and 750 ms. It is still pretty fast even if 250 times slower than initially claimed.
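
For reference, a minimal sketch of that kind of benchmark (the original comment doesn't say which primitive was used; this assumes a named Windows mutex with a made-up name, and timing with QueryPerformanceCounter):

    /* lock_bench.c -- hypothetical sketch of the cross-process lock benchmark */
    #include <stdio.h>
    #include <windows.h>

    #define ITERATIONS 100000

    int main(void)
    {
        /* All participating processes open the same named mutex. */
        HANDLE lock = CreateMutexA(NULL, FALSE, "append-bench-lock");
        volatile long counter = 0;

        LARGE_INTEGER freq, start, end;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);

        for (int i = 0; i < ITERATIONS; i++) {
            WaitForSingleObject(lock, INFINITE);
            counter++;                  /* token "work" so the loop is not optimized away */
            ReleaseMutex(lock);
        }

        QueryPerformanceCounter(&end);
        printf("%.0f ms\n",
               (end.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart);
        CloseHandle(lock);
        return 0;
    }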


I would be interested in seeing what it's like when the locking process sleeps for a small amount of time, and how that affects lock contention and CPU load. Maybe actually doing some work (such as writing the current timestamp 10 times) as well. Unless you actually are doing something, there could be some interesting differences that are covered up or optimized away.


I did not do absolutely nothing, I incremented a variable under the lock to hopefully avoid optimizer shenanigans, and I also tried an unoptimized debug build. Now I tried just spinning and incrementing a variable for a millisecond under the lock; that took 4125 ms for 4 processes with 1000 iterations each. A single process without the lock and 4000 iterations took 4005 ms. Those 120 ms would make it a 30 µs locking overhead; the first test without work under the lock hinted at 5 µs. This is probably one of the harder things to profile and depends on quite a few factors, but some ten thousand locks per second is probably the correct order of magnitude.


That matches my expectations. Really, if you are doing something as slow as writing logs to a disk, and the number of processes/threads is not in the tens or hundreds, I don't imagine locking overhead is your problem, given the speed of disk storage.

That said, I think the main problem with that is doing it cross-platform, which the article goes to some pains to mention quite a bit. I imagine the whole point here is to be portable, and I'm not sure which mechanisms work best for that, and which platforms they are available on. I uncovered some unsettling info about fcntl locking[1], but generally I just used flock when I had to care about it; I don't think that exists on Windows normally(?).

1: http://0pointer.de/blog/projects/locking.html


That was also my first thought: why even use a separate lock and not just the one associated with the file? But I was not sure if that would work, whether it was flexible enough to support a single exclusive writer and multiple readers. I am still not absolutely sure how the options you specify when opening a file and when locking it later interact exactly, but I am now convinced that it is possible: Windows has LockFileEx [1] and UnlockFileEx [2], which even support limiting the lock to specific byte ranges within the file.

Another idea was: why not just exclusively open the file, write to it and then close it again? Why even bother having the file open in several processes at the same time? I am not sure what the overhead would be, but I guess it would not be too terrible. And if you can afford to buffer, say, 1000 records you want to write, then you can simply cut down the overhead by a factor of 1000 by doing just that.

[1] https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...

[2] https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...


For most purposes LockFileEx and UnlockFileEx can be used in almost the same way as flock() for locking and unlocking. I recently worked on some code that needed to work on Windows, Linux and Mac, and I managed to get the file locking semantics to work equivalently on all platforms.
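
A rough sketch of how the two APIs line up (hypothetical wrapper names, error handling trimmed; not the actual code from that project):

    /* Hypothetical cross-platform whole-file lock wrapper. */
    #ifdef _WIN32
    #include <windows.h>
    #include <io.h>

    static int lock_fd(int fd)
    {
        HANDLE h = (HANDLE)_get_osfhandle(fd);
        OVERLAPPED ov = {0};
        /* Exclusive lock over the whole file (MAXDWORD bytes from offset 0). */
        return LockFileEx(h, LOCKFILE_EXCLUSIVE_LOCK, 0, MAXDWORD, MAXDWORD, &ov) ? 0 : -1;
    }

    static int unlock_fd(int fd)
    {
        HANDLE h = (HANDLE)_get_osfhandle(fd);
        OVERLAPPED ov = {0};
        return UnlockFileEx(h, 0, MAXDWORD, MAXDWORD, &ov) ? 0 : -1;
    }
    #else
    #include <sys/file.h>

    static int lock_fd(int fd)   { return flock(fd, LOCK_EX); }
    static int unlock_fd(int fd) { return flock(fd, LOCK_UN); }
    #endif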

Regarding having a buffer of records and only locking / unlocking once per batch write, I've used this technique in a program that writes logs to CSV and it works perfectly. You obviously need to tune the buffer size based on the rate and size of new records! One advantage of this is that, depending on the data you're dealing with, you can pre-sort the data in the buffer and end up with mostly sorted (less interleaved) data in the final output. If you need sorted output, then this can dramatically reduce the time taken to sort the final file.
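
A minimal sketch of that batching pattern, assuming POSIX flock() and a made-up fixed-size record format (batch size and record layout would be tuned per application):

    /* Hypothetical sketch: buffer records and lock once per flushed batch. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/file.h>
    #include <unistd.h>

    #define BATCH_SIZE 1000
    #define RECORD_MAX 256

    static char batch[BATCH_SIZE][RECORD_MAX];
    static size_t batch_len;

    static int cmp_records(const void *a, const void *b)
    {
        return strcmp((const char *)a, (const char *)b);  /* records start with a sortable timestamp */
    }

    /* One lock/unlock round trip per BATCH_SIZE records instead of per record. */
    static void flush_batch(int fd)
    {
        qsort(batch, batch_len, RECORD_MAX, cmp_records); /* pre-sort to reduce interleaving */
        flock(fd, LOCK_EX);
        for (size_t i = 0; i < batch_len; i++)
            write(fd, batch[i], strlen(batch[i]));
        flock(fd, LOCK_UN);
        batch_len = 0;
    }

    /* fd should be the shared log file, opened with O_WRONLY | O_APPEND. */
    static void append_record(int fd, const char *record)
    {
        snprintf(batch[batch_len++], RECORD_MAX, "%s", record);
        if (batch_len == BATCH_SIZE)
            flush_batch(fd);
    }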


The issue is the contention caused by all the processes waiting on file IO behind the lock, not the locking itself.


You don't actually need to hold the lock while doing I/O - just while allocating yourself a region of the file that's yours to exclusively write to (e.g. by opening without O_APPEND and instead using pwrite(2) to write at a specific offset).
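
Something like this sketch, assuming the file is opened without O_APPEND and that every writer reserves regions the same way (helper names are made up):

    /* Hypothetical sketch of "lock only to reserve a region, write outside the lock". */
    #include <sys/file.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Reserve `len` bytes at the current end of file and return their offset.
     * Only this small bookkeeping step runs under the lock. */
    static off_t reserve_region(int fd, size_t len)
    {
        flock(fd, LOCK_EX);
        off_t start = lseek(fd, 0, SEEK_END);  /* current end of file */
        ftruncate(fd, start + (off_t)len);     /* grow the file so the region exists */
        flock(fd, LOCK_UN);
        return start;
    }

    static void append_record(int fd, const char *rec, size_t len)
    {
        off_t off = reserve_region(fd, len);
        pwrite(fd, rec, len, off);             /* the slow I/O happens outside the lock */
    }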


If I/O were the bottleneck then that would be the case with or without a lock, wouldn't it?


It depends on what has to wait for the IO to complete. ETW, for example, is a Windows-wide event framework and it would be untenable for everything using it to wait on the file IO of everything else making use of it. So, it has a solution that makes use of locks but nothing ever has to wait on file IO.

File IO is always expensive; the trick is always in how you work around that.


That does seem simpler. I've run into variants of this problem where each thread/process can potentially produce GBs of data. For that use case performance would suffer if combining were done as a separate step.


Actually, since the data is sequential, I think you can write a very quick and optimized combining process.

E.g.

Open each file and buffer a small amount, such as 4k.

1) Read the first record of each buffer

2) Output the earliest of the buffers' first records

3) Read the next record from the buffer you just consumed from, and re-buffer another 4k from it if needed.

4) Go to 2.

Go ahead and change how much you buffer from each file depending on what seems to work well, and to take advantage of sequential read times if using media that benefits from that (spinning disks), but I don't see this being very slow. Worst case, you are naively sorting X records each time to find the first, but since you are always consuming them in order, you can easily keep track of the current order and do a single comparison per loop after the first.
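
A sketch of that merge, assuming each per-process file is already in timestamp order and each record is a single line starting with a sortable timestamp (stdio's buffering stands in for the "buffer 4k per file" step):

    /* Hypothetical sketch: merge per-process log files that are individually sorted. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_FILES    64
    #define LINE_MAX_LEN 4096

    int main(int argc, char **argv)
    {
        FILE *in[MAX_FILES];
        char line[MAX_FILES][LINE_MAX_LEN];
        int live[MAX_FILES];
        int n = argc - 1;
        if (n > MAX_FILES)
            n = MAX_FILES;

        for (int i = 0; i < n; i++) {
            in[i] = fopen(argv[i + 1], "r");
            live[i] = in[i] && fgets(line[i], LINE_MAX_LEN, in[i]) != NULL;
        }

        for (;;) {
            int min = -1;
            /* Linear scan for the smallest pending line; fine for a handful of
             * files, swap in a heap if you have hundreds of inputs. */
            for (int i = 0; i < n; i++)
                if (live[i] && (min < 0 || strcmp(line[i], line[min]) < 0))
                    min = i;
            if (min < 0)
                break;                  /* every input is exhausted */
            fputs(line[min], stdout);
            live[min] = fgets(line[min], LINE_MAX_LEN, in[min]) != NULL;
        }
        return 0;
    }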


Namely, MapReduce.


Or even older - an external merge sort (https://en.wikipedia.org/wiki/Merge_sort#Use_with_tape_drive...). It was designed for doing out of memory sorts backed by a much slower medium (like disk).


There may be no locking mechanism available. One example is multiple computers writing to a file on an NFS share with the nolock mount option.


> multiple computers writing to a file on an NFS share with the nolock mount option.

Yes, but that's the kind of thing everyone explicitly tells you never to do.


It's really much better for each process to write to its own file, and the consumer can merge-sort the records.


This is basically how Event Tracing for Windows (ETW) solves the problem very efficiently. It uses InterlockedIncrement() so writers can claim space quickly and with minimal contention, and then merges all the records at the end.

If you're on Windows you should use ETW instead of reinventing it when possible :)
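
To make the InterlockedIncrement() part concrete, a toy sketch of claiming record slots without a lock (not ETW's actual data structures; in a real multi-process setup the buffer and counter would live in shared memory):

    /* Toy sketch: claim record slots with an atomic increment instead of a lock. */
    #include <windows.h>

    #define SLOT_COUNT 4096
    #define SLOT_SIZE  256

    static volatile LONG g_next_slot;
    static char g_buffer[SLOT_COUNT][SLOT_SIZE];

    /* Contention is a single atomic add; no writer ever waits on another's I/O. */
    char *claim_slot(void)
    {
        LONG slot = InterlockedIncrement(&g_next_slot) - 1;
        if (slot >= SLOT_COUNT)
            return NULL;            /* buffer full; a real tracer would switch buffers */
        return g_buffer[slot];
    }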


ETW is awesome. Too bad it's been plagued by such an opaque API and all the registration junk. I hear it can be easier to use now, but I'm still not sure if an exe can emit ETW events without having to register every possible event or whatever.


The new ETW format is self-documenting at the expense of some overhead. You can set up hello world in ETW in about three lines of code [1]: just create the logging channel and log a string or property bag to it.

1. https://blogs.windows.com/buildingapps/2016/06/10/using-devi...


A separate merge step to produce a single file would triple the transferred data volume, which might or might not be an issue. But in case the ordering of the records does not matter, as suggested in the article, and if you can get the consumer to consume several files, it would indeed be the easiest solution to just have several files and then consume them one after another.


This is a great interview question. It's very easy to formulate and explain, and yet the interviewer can gauge the depth of a candidate's OS knowledge by digging deeper into any solution, with no easy 'perfect solution'.


I'd argue this has the same problems as many existing interview questions: It relies on arcane knowledge that won't be used day to day at the job in question, and rewards the candidate who is lucky to know the answer or answers.

Can we stop with the silly/trick/arbitrary interview questions? I know we cannot, but I dream for the day.


It's a good question because even supposing that you have no prior knowledge of the problem at hand you should be able to think of some obvious problems if you have any programming experience.

You have N number of writers, those writers can write X sized data to location A.

How would those writers go about writing to "A" in a performant manner without causing issues, whether those issues are interleaved data, slowing down the N writers, or having to keep outsized buffers at the OS level?

Even if you have no knowledge of POSIX filesystems you should be able to reason about this in an intelligent way in an interview.

Do we go for locks? That has its own set of problems; what are those? Do we go for fixed-size non-interleaved guarantees? What problems does that cause? Do we buffer arbitrarily sized output and merge it later, etc.?


Atomic writes are pretty easy - as this article points out, PIPE_BUF sized writes to pipes are atomic.

It seems like atomic reads are harder. You want to read a record from a shared input, but you don't know how big a record might be. Any thoughts?


If you have the ability to (minimally) parse what is coming out of your pipe, and you know that your writes are not interleaved, and you know the size of the record you're writing before you write it, you can prefix each record with the size, and use this on the reader side to fetch only up to the record length (the read(2) syscall and friends allow you to specify a maximum number of bytes to fetch) for each record. Does this answer your question?


I don't believe what you're describing works. You're suggesting reading first the size of a record (stored itself in a fixed-size field), then reading the rest of it?

But unfortunately that is not atomic. Another process could read into the data between your first and second reads.


Not just pipes - files, too (actually, all write calls). Ensuring your records are always smaller than PIPE_BUF and just using O_APPEND works pretty well.
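
A minimal sketch of that approach (the no-interleaving behaviour for regular files is historical rather than something POSIX promises, as discussed further down the thread):

    /* Hypothetical sketch of the "one write() per record, O_APPEND" approach. */
    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    static int logfd = -1;

    void log_open(const char *path)
    {
        logfd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    }

    void log_record(const char *msg)
    {
        char buf[PIPE_BUF];
        int len = snprintf(buf, sizeof buf, "%ld %s\n", (long)getpid(), msg);
        if (len < 0 || (size_t)len >= sizeof buf)
            return;                     /* record too big to emit in one write */
        /* A single write(): O_APPEND repositions to the end and writes as one
         * operation, so records from different processes tend not to interleave. */
        write(logfd, buf, (size_t)len);
    }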


Writes to files less than PIPE_BUF aren't atomic on parallel file systems. e.g. Panasas, Lustre, GPFS, etc.


Clarification: writes to files less than PIPE_BUF may be atomic, but they aren't necessarily consistent. This is another important property, as you want reads following writes to always see the written values. I think I tripped up over myself, since with atomicity you expect the 'all written' and 'nothing written' states, but there's also 'all written but nothing visible', which is a third state that feels like it's part of atomicity but I guess is really part of consistency.


We have this issue in the fish shell. There's a single shared history file, and multiple shell instances may append to it at once. We use fcntl() locking but we do in fact have users who have $HOME on nolock NFS, where locking fails.

Our strategy is to keep the history as append-only, with periodic "vacuums" that rewrite the file into a temporary and move it into place. Even the appends in principle could be split, as NFS provides no guarantees here, but in practice writing individual items seems to be atomic when done with O_APPEND.
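
Not fish's actual code, but that vacuum pattern looks roughly like this sketch (hypothetical helper, error handling abbreviated):

    /* Hypothetical sketch: rewrite the history into a temp file, then rename into place. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int vacuum_history(const char *path, void (*rewrite)(FILE *src, FILE *dst))
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.XXXXXX", path);
        int fd = mkstemp(tmp);          /* temp file in the same directory */
        if (fd < 0)
            return -1;

        FILE *src = fopen(path, "r");
        FILE *dst = fdopen(fd, "w");
        if (!src || !dst) {
            if (src) fclose(src);
            if (dst) fclose(dst); else close(fd);
            unlink(tmp);
            return -1;
        }

        rewrite(src, dst);              /* drop duplicates, trim old items, etc. */
        fclose(src);
        fclose(dst);

        /* Atomic on the same filesystem; writers appending with O_APPEND keep
         * the old inode until they reopen the file. */
        return rename(tmp, path);
    }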


POSIX seems to guarantee that writes with O_APPEND are atomic:

"If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation."

but then it goes on and says:

"This volume of IEEE Std 1003.1-2001 does not specify behavior of concurrent writes to a file from multiple processes. Applications should use some form of concurrency control."

Also, it does seem to allow a signal to interrupt file I/O.

Anyway, it seems that the historical Unix behaviour is that signals never interrupt file I/O and O_APPEND can be used for atomic appends. Of course, this is not documented anywhere, but you can find discussions on the topic. In particular, here [1] Linus goes into one of his rants when commenting on a patch to change this behaviour.

[1] http://yarchive.net/comp/linux/wakekill.html


Couldn't you use a named pipe to send data from each process to a single writer process?


How would the single writer process know that one 'atom' of writing has been received from each process? The named pipe might be empty if the sending process is filling its buffer with the second half of the message.


Presumably you would have some protocol for passing the messages that handles this, such as, at the simplest level, wrapping the message with a preceding length and a following end-of-record identifier, which in conjunction should work.

But that's the (implicit) point of all this: now you've added a bunch of logic for an accumulating writer, a protocol to pass messages between the processes, and state so that partially received messages can be continued when you get back to that process. Now the application has a non-trivial amount of logic and code to deal with logging, which might be a source of problems itself. As another commenter noted, individual files per process with an aggregation step (or specialized reader) is much simpler and easier to reason about, and mirrors the actual case in the article where the data was eventually inserted into SQLite, which essentially enforces this read ordering as needed (if there is an ordering field).
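
For concreteness, the "preceding length plus end-of-record marker" framing mentioned above might look like this rough sketch (the wire format is made up, and it assumes each producer has its own connection to the consolidating writer, since multiple write() calls per record would interleave on a shared pipe). It also shows how much extra machinery the single-writer design pulls in:

    /* Hypothetical framing: 4-byte little-endian length, payload, 1-byte terminator. */
    #include <stdint.h>
    #include <unistd.h>

    int send_record(int fd, const void *payload, uint32_t len)
    {
        unsigned char hdr[4] = {
            (unsigned char)len, (unsigned char)(len >> 8),
            (unsigned char)(len >> 16), (unsigned char)(len >> 24),
        };
        if (write(fd, hdr, 4) != 4) return -1;
        if (write(fd, payload, len) != (ssize_t)len) return -1;
        unsigned char end = 0x0A;
        return write(fd, &end, 1) == 1 ? 0 : -1;
    }

    /* Reader side must loop: short reads can split a record across read() calls. */
    ssize_t recv_record(int fd, void *buf, size_t cap)
    {
        unsigned char hdr[4];
        size_t got = 0;
        while (got < 4) {
            ssize_t n = read(fd, hdr + got, 4 - got);
            if (n <= 0) return -1;
            got += (size_t)n;
        }
        uint32_t len = hdr[0] | (uint32_t)hdr[1] << 8 |
                       (uint32_t)hdr[2] << 16 | (uint32_t)hdr[3] << 24;
        if (len >= cap) return -1;      /* need room for payload + terminator */
        got = 0;
        while (got < len + 1) {         /* payload plus the terminator byte */
            ssize_t n = read(fd, (char *)buf + got, len + 1 - got);
            if (n <= 0) return -1;
            got += (size_t)n;
        }
        return (ssize_t)len;
    }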


The classic application of appending from multiple sources, while potentially modifying a file and potentially having NFS in the way, is a mail spool using mbox.

While there are various hard-won implementations that work doing this, it's also one of the reasons maildir was invented.

It's much more portable, comprehensible, and reliable to have a single process writing to the file and all the other producers sending records to it using sockets or named pipes. The consolidator needs to be aware of record boundaries so you don't have to worry about pipe atomicity.


It's fairly simple to write a little server that accepts socket connections from multiple processes, reads a line of data from each connection whenever there is data available, and appends it to the log file. I wrote a server like that years ago: a write-only socket service which hundreds of separate processes could write high-volume statistics tracking to. The heart of it was only a few dozen lines.


Something like syslogd?


Similar, but different requirements. All of the messages were grouped by sending process, and when the sending process said its session was over its messages were fed into a post-processing queue. There was support for holding messages in memory or on disk, and data-preserving fallbacks if the post-processing halted or got too backed up. The message rate varied quite a bit; during the US business day the rate was at least doubled. We used that to give the post-processing queue time to catch up off-hours.

This was designed about 16 years ago, so well before the big-data tools, SSDs, or powerful machines we have today.


how about

    mkfifo foo

    tee -a log.txt < foo &
and then multiple processes write to pipe foo.
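
Each writer then just has to keep every record to a single write() no bigger than PIPE_BUF, e.g. (a toy sketch):

    /* Hypothetical sketch of one producer writing to the shared fifo. */
    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("foo", O_WRONLY);     /* the fifo created with mkfifo */
        char line[PIPE_BUF];
        int len = snprintf(line, sizeof line, "pid %ld: something happened\n",
                           (long)getpid());
        write(fd, line, (size_t)len);       /* one write <= PIPE_BUF stays unfragmented */
        close(fd);
        return 0;
    }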


I fully expect to be disappointed at some point, but I've always seen the "use fflush()" approach work, on both Linux and Windows.

TCP sockets to a server over localhost might also work (although the flush IOCTL makes no guarantees with sockets); then have the server serialize the output. One of the nice things about Tcl is that such a server is relatively easy to do. But you may need to mind newlines - don't write until an entire "line" of text has been read.


fflush() over networked drives?


Good question - I tended to create a TCP server and use that to write to drives local to the server machine. I never really did a "trust fall" with network mounted drives because the logger is generally not a standard part of the system - it was an add-on.


My approach would rely on database locking. Seems like a good fit.


Is it possible to call lockf on an O_APPEND file?


It should work just fine. But since the current offset will always be changing, you might want to lock the entire file, and fcntl has a more flexible interface for doing that.
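
For instance, a whole-file fcntl() lock is independent of wherever the O_APPEND offset happens to be (a minimal sketch):

    /* Hypothetical sketch: lock the whole file regardless of the current offset. */
    #include <fcntl.h>
    #include <unistd.h>

    /* With l_whence = SEEK_SET, l_start = 0 and l_len = 0, the lock covers the
     * entire file, including anything appended while the lock is held. */
    int lock_whole_file(int fd)
    {
        struct flock fl = {
            .l_type   = F_WRLCK,
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,              /* 0 means "to end of file, including growth" */
        };
        return fcntl(fd, F_SETLKW, &fl);    /* blocks until the lock is granted */
    }

    int unlock_whole_file(int fd)
    {
        struct flock fl = { .l_type = F_UNLCK, .l_whence = SEEK_SET };
        return fcntl(fd, F_SETLK, &fl);
    }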


Check how Apache implements its log http://httpd.apache.org/



