Appending to a file from multiple processes (nullprogram.com)
109 points by signa11 on Aug 3, 2016 | 49 comments



Why not just use a lock shared between all participating processes and call it a day? That seems a far more robust solution, and it is probably even quicker to implement than digging up all the edge cases from all the (file) system specifications and then hoping you did not miss anything and that all systems will continue to honor the least common denominator you just figured out.


I suspect shared lock performance is the sticking point.

Personally, I agree with tlb: it seems simpler to generate one file per thread, and combine them at the end.


I just tried it on Windows, 4 processes each acquiring the lock 100,000 times and doing no work while holding the lock, and it took about 2000 ms. I don't think that would become the bottleneck for a lot of applications.

EDIT: Corrected the numbers, I had a bug, previously it said 4 processes with 10 million locks per process and 750 ms. It is still pretty fast even if 250 times slower than initially claimed.
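
For reference, a minimal sketch of that kind of benchmark (the original comment doesn't say which primitive was used; this assumes a named Windows mutex with a made-up name, and timing with QueryPerformanceCounter):

    /* lock_bench.c -- hypothetical sketch of the cross-process lock benchmark */
    #include <stdio.h>
    #include <windows.h>

    #define ITERATIONS 100000

    int main(void)
    {
        /* All participating processes open the same named mutex. */
        HANDLE lock = CreateMutexA(NULL, FALSE, "append-bench-lock");
        volatile long counter = 0;

        LARGE_INTEGER freq, start, end;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);

        for (int i = 0; i < ITERATIONS; i++) {
            WaitForSingleObject(lock, INFINITE);
            counter++;                  /* token "work" so the loop is not optimized away */
            ReleaseMutex(lock);
        }

        QueryPerformanceCounter(&end);
        printf("%.0f ms\n",
               (end.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart);
        CloseHandle(lock);
        return 0;
    }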


I would be interested in seeing what it's like when the locking process sleeps for a small amount of time, and how that affects lock contention and CPU load. Maybe actually doing some work (such as writing the current timestamp 10 times) as well. Unless you actually are doing something, there could be some interesting differences that are covered up or optimized away.


I did not do absolutely nothing, I incremented a variable under the lock to hopefully avoid optimizer shenanigans, and I also tried an unoptimized debug build. Now I tried just spinning and incrementing a variable for a millisecond under the lock; that took 4125 ms for 4 processes with 1000 iterations each. A single process without the lock and 4000 iterations took 4005 ms. Those 120 ms would make it a 30 µs locking overhead; the first test without work under the lock hinted at 5 µs. This is probably one of the harder things to profile and depends on quite a few factors, but some ten thousand locks per second is probably the correct order of magnitude.


That matches my expectations. Really, if you are doing something as slow as writing logs to a disk, and the number of processes/threads is not in the tens or hundreds, I don't imagine locking overhead is your problem, given the speed of disk storage.

That said, I think the main problem with that is doing it cross-platform, which the article goes to some pains to mention quite a bit. I imagine the whole point here is to be portable, and I'm not sure which mechanisms work best for that, and which platforms they are available on. I uncovered some unsettling info about fcntl locking[1], but generally I just used flock when I had to care about it; I don't think that exists on Windows normally(?).

1: http://0pointer.de/blog/projects/locking.html


That was also my first thought: why even use a separate lock and not just the one associated with the file? But I was not sure if that would work, whether it was flexible enough to support a single exclusive writer and multiple readers. I am still not absolutely sure how the options you specify when opening a file and when locking it later interact exactly, but I am now convinced that it is possible: Windows has LockFileEx [1] and UnlockFileEx [2], which even support limiting the lock to specific byte ranges within the file.

Another idea was: why not just exclusively open the file, write to it and then close it again? Why even bother having the file open in several processes at the same time? I am not sure what the overhead would be, but I guess it would not be too terrible. And if you can afford to buffer, say, 1000 records you want to write, then you can simply cut down the overhead by a factor of 1000 by doing just that.

[1] https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...

[2] https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...


For most purposes LockFileEx and UnlockFileEx can be used in almost the same way as flock() for locking and unlocking. I recently worked on some code that needed to work on Windows, Linux and Mac, and I managed to get the file locking semantics to work equivalently on all platforms.
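
A rough sketch of how the two APIs line up (hypothetical wrapper names, error handling trimmed; not the actual code from that project):

    /* Hypothetical cross-platform whole-file lock wrapper. */
    #ifdef _WIN32
    #include <windows.h>
    #include <io.h>

    static int lock_fd(int fd)
    {
        HANDLE h = (HANDLE)_get_osfhandle(fd);
        OVERLAPPED ov = {0};
        /* Exclusive lock over the whole file (MAXDWORD bytes from offset 0). */
        return LockFileEx(h, LOCKFILE_EXCLUSIVE_LOCK, 0, MAXDWORD, MAXDWORD, &ov) ? 0 : -1;
    }

    static int unlock_fd(int fd)
    {
        HANDLE h = (HANDLE)_get_osfhandle(fd);
        OVERLAPPED ov = {0};
        return UnlockFileEx(h, 0, MAXDWORD, MAXDWORD, &ov) ? 0 : -1;
    }
    #else
    #include <sys/file.h>

    static int lock_fd(int fd)   { return flock(fd, LOCK_EX); }
    static int unlock_fd(int fd) { return flock(fd, LOCK_UN); }
    #endif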

Regarding having a buffer of records and only locking / unlocking once per batch write, I've used this technique in a program that writes logs to CSV and it works perfectly. You obviously need to tune the buffer size based on the rate and size of new records! One advantage of this is that, depending on the data you're dealing with, you can pre-sort the data in the buffer and end up with mostly sorted (less interleaved) data in the final output. If you need sorted output, then this can dramatically reduce the time taken to sort the final file.
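
A minimal sketch of that batching pattern, assuming POSIX flock() and a made-up fixed-size record format (batch size and record layout would be tuned per application):

    /* Hypothetical sketch: buffer records and lock once per flushed batch. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/file.h>
    #include <unistd.h>

    #define BATCH_SIZE 1000
    #define RECORD_MAX 256

    static char batch[BATCH_SIZE][RECORD_MAX];
    static size_t batch_len;

    static int cmp_records(const void *a, const void *b)
    {
        return strcmp((const char *)a, (const char *)b);  /* records start with a sortable timestamp */
    }

    /* One lock/unlock round trip per BATCH_SIZE records instead of per record. */
    static void flush_batch(int fd)
    {
        qsort(batch, batch_len, RECORD_MAX, cmp_records); /* pre-sort to reduce interleaving */
        flock(fd, LOCK_EX);
        for (size_t i = 0; i < batch_len; i++)
            write(fd, batch[i], strlen(batch[i]));
        flock(fd, LOCK_UN);
        batch_len = 0;
    }

    /* fd should be the shared log file, opened with O_WRONLY | O_APPEND. */
    static void append_record(int fd, const char *record)
    {
        snprintf(batch[batch_len++], RECORD_MAX, "%s", record);
        if (batch_len == BATCH_SIZE)
            flush_batch(fd);
    }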


The issue is the contention caused by all the processes waiting on file IO behind the lock, not the locking itself.


You don't actually need to hold the lock while doing I/O - just while allocating yourself a region of the file that's yours to exclusively write to (e.g. by opening without O_APPEND and instead using pwrite(2) to write at a specific offset).
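
Something like this sketch, assuming the file is opened without O_APPEND and that every writer reserves regions the same way (helper names are made up):

    /* Hypothetical sketch of "lock only to reserve a region, write outside the lock". */
    #include <sys/file.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Reserve `len` bytes at the current end of file and return their offset.
     * Only this small bookkeeping step runs under the lock. */
    static off_t reserve_region(int fd, size_t len)
    {
        flock(fd, LOCK_EX);
        off_t start = lseek(fd, 0, SEEK_END);  /* current end of file */
        ftruncate(fd, start + (off_t)len);     /* grow the file so the region exists */
        flock(fd, LOCK_UN);
        return start;
    }

    static void append_record(int fd, const char *rec, size_t len)
    {
        off_t off = reserve_region(fd, len);
        pwrite(fd, rec, len, off);             /* the slow I/O happens outside the lock */
    }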


If I/O were the bottleneck then that would be the case with or without a lock, wouldn't it?


It depends on what has to wait for the IO to complete. ETW, for example, is a Windows-wide event framework and it would be untenable for everything using it to wait on the file IO of everything else making use of it. So, it has a solution that makes use of locks but nothing ever has to wait on file IO.

File IO is always expensive; the trick is always in how you work around that.


That does seem simpler. I've run into variants of this problem where each thread/process can potentially produce GBs of data. For that use case performance would suffer if combining were done as a separate step.


Actually, since the data is sequential, I think you can write a very quick and optimized combining process.

E.g.

Open each file and buffer a small amount, such as 4k.

1) Read the first record of each buffer

2) Output the earliest of the buffers' first records

3) Read the next record from the buffer you just consumed from, and re-buffer another 4k from it if needed.

4) Go to 2.

Go ahead and change how much you buffer from each file depending on what seems to work well, and to take advantage of sequential read times if using media that benefits from that (spinning disks), but I don't see this being very slow. Worst case, you are naively sorting X records each time to find the first, but since you are always consuming them in order, you can easily keep track of the current order and do a single comparison per loop after the first.
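
A sketch of that merge, assuming each per-process file is already in timestamp order and each record is a single line starting with a sortable timestamp (stdio's buffering stands in for the "buffer 4k per file" step):

    /* Hypothetical sketch: merge per-process log files that are individually sorted. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_FILES    64
    #define LINE_MAX_LEN 4096

    int main(int argc, char **argv)
    {
        FILE *in[MAX_FILES];
        char line[MAX_FILES][LINE_MAX_LEN];
        int live[MAX_FILES];
        int n = argc - 1;
        if (n > MAX_FILES)
            n = MAX_FILES;

        for (int i = 0; i < n; i++) {
            in[i] = fopen(argv[i + 1], "r");
            live[i] = in[i] && fgets(line[i], LINE_MAX_LEN, in[i]) != NULL;
        }

        for (;;) {
            int min = -1;
            /* Linear scan for the smallest pending line; fine for a handful of
             * files, swap in a heap if you have hundreds of inputs. */
            for (int i = 0; i < n; i++)
                if (live[i] && (min < 0 || strcmp(line[i], line[min]) < 0))
                    min = i;
            if (min < 0)
                break;                  /* every input is exhausted */
            fputs(line[min], stdout);
            live[min] = fgets(line[min], LINE_MAX_LEN, in[min]) != NULL;
        }
        return 0;
    }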


Namely, MapReduce.


Or even older - an external merge sort (https://en.wikipedia.org/wiki/Merge_sort#Use_with_tape_drive...). It was designed for doing out of memory sorts backed by a much slower medium (like disk).


There may be no locking mechanism available. One example is multiple computers writing to a file on an NFS share with the nolock mount option.


> multiple computers writing to a file on an NFS share with the nolock mount option.

Yes, but that's the kind of thing everyone explicitly tells you never to do.


It's really much better for each process to write to its own file, and the consumer can merge-sort the records.


This is basically how Event Tracing for Windows (ETW) solves the problem very efficiently. It uses InterlockedIncrement() so writers can claim space quickly and with minimal contention, and then merges all the records at the end.

If you're on Windows you should use ETW instead of reinventing it when possible :)
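
To make the InterlockedIncrement() part concrete, a toy sketch of claiming record slots without a lock (not ETW's actual data structures; in a real multi-process setup the buffer and counter would live in shared memory):

    /* Toy sketch: claim record slots with an atomic increment instead of a lock. */
    #include <windows.h>

    #define SLOT_COUNT 4096
    #define SLOT_SIZE  256

    static volatile LONG g_next_slot;
    static char g_buffer[SLOT_COUNT][SLOT_SIZE];

    /* Contention is a single atomic add; no writer ever waits on another's I/O. */
    char *claim_slot(void)
    {
        LONG slot = InterlockedIncrement(&g_next_slot) - 1;
        if (slot >= SLOT_COUNT)
            return NULL;            /* buffer full; a real tracer would switch buffers */
        return g_buffer[slot];
    }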


ETW is awesome. Too bad it's been plagued by such an opaque API and all the registration junk. I hear it can be easier to use now, but I'm still not sure if an exe can emit ETW events without having to register every possible event or whatever.


The new ETW format is self-documenting at the expense of some overhead. You can set up hello world in ETW in about three lines of code [1]: just create the logging channel and log a string or property bag to it.

1. https://blogs.windows.com/buildingapps/2016/06/10/using-devi...


A separate merge step to produce a single file would triple the transferred data volume, which might or might not be an issue. But in case the ordering of the records does not matter, as suggested in the article, and if you can get the consumer to consume several files, it would indeed be the easiest solution to just have several files and then consume them one after another.


This is a great interview question. It's very easy to formulate and explain, and yet the interviewer can gauge the depth of a candidate's OS knowledge by digging deeper into any solution, with no easy 'perfect solution'.


I'd argue this has the same problems as many existing interview questions: It relies on arcane knowledge that won't be used day to day at the job in question, and rewards the candidate who is lucky to know the answer or answers.

Can we stop with the silly/trick/arbitrary interview questions? I know we cannot, but I dream for the day.


It's a good question because even supposing that you have no prior knowledge of the problem at hand you should be able to think of some obvious problems if you have any programming experience.

You have N number of writers, those writers can write X sized data to location A.

How would those writers go about writing to "A" in a performant manner without causing issues, whether those issues are interleaved data, slowing down the N writers, or having to keep outsized buffers at the OS level?

Even if you have no knowledge of POSIX filesystems you should be able to reason about this in an intelligent way in an interview.

Do we go for locks? That has its own set of problems; what are those? Do we go for fixed-size non-interleaved guarantees? What problems does that cause? Do we buffer arbitrarily sized output and merge it later, etc.?


Atomic writes are pretty easy - as this article points out, PIPE_BUF sized writes to pipes are atomic.

It seems like atomic reads are harder. You want to read a record from a shared input, but you don't know how big a record might be. Any thoughts?


If you have the ability to (minimally) parse what is coming out of your pipe, and you know that your writes are not interleaved, and you know the size of the record you're writing before you write it, you can prefix each record with the size, and use this on the reader side to fetch only up to the record length (the read(2) syscall and friends allow you to specify a maximum number of bytes to fetch) for each record. Does this answer your question?


I don't believe what you're describing works. You're suggesting reading first the size of a record (stored itself in a fixed-size field), then reading the rest of it?

But unfortunately that is not atomic. Another process could read into the data between your first and second reads.


Not just pipes - files, too (actually, all write calls). Ensuring your records are always smaller than PIPE_BUF and just using O_APPEND works pretty well.
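
A minimal sketch of that approach (the no-interleaving behaviour for regular files is historical rather than something POSIX promises, as discussed further down the thread):

    /* Hypothetical sketch of the "one write() per record, O_APPEND" approach. */
    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    static int logfd = -1;

    void log_open(const char *path)
    {
        logfd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    }

    void log_record(const char *msg)
    {
        char buf[PIPE_BUF];
        int len = snprintf(buf, sizeof buf, "%ld %s\n", (long)getpid(), msg);
        if (len < 0 || (size_t)len >= sizeof buf)
            return;                     /* record too big to emit in one write */
        /* A single write(): O_APPEND repositions to the end and writes as one
         * operation, so records from different processes tend not to interleave. */
        write(logfd, buf, (size_t)len);
    }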


Writes to files less than PIPE_BUF aren't atomic on parallel file systems. e.g. Panasas, Lustre, GPFS, etc.


Clarification: writes to files less than PIPE_BUF may be atomic, but they aren't necessarily consistent. This is another important property, as you want reads following writes to always see the written values. I think I tripped up over myself, since with atomicity you expect the 'all written' and 'nothing written' states, but there's also 'all written but nothing visible', which is a third state that feels like it's part of atomicity but I guess is really part of consistency.


We have this issue in the fish shell. There's a single shared history file, and multiple shell instances may append to it at once. We use fcntl() locking but we do in fact have users who have $HOME on nolock NFS, where locking fails.

Our strategy is to keep the history as append-only, with periodic "vacuums" that rewrite the file into a temporary and move it into place. Even the appends in principle could be split, as NFS provides no guarantees here, but in practice writing individual items seems to be atomic when done with O_APPEND.
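
Not fish's actual code, but that vacuum pattern looks roughly like this sketch (hypothetical helper, error handling abbreviated):

    /* Hypothetical sketch: rewrite the history into a temp file, then rename into place. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int vacuum_history(const char *path, void (*rewrite)(FILE *src, FILE *dst))
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.XXXXXX", path);
        int fd = mkstemp(tmp);          /* temp file in the same directory */
        if (fd < 0)
            return -1;

        FILE *src = fopen(path, "r");
        FILE *dst = fdopen(fd, "w");
        if (!src || !dst) {
            if (src) fclose(src);
            if (dst) fclose(dst); else close(fd);
            unlink(tmp);
            return -1;
        }

        rewrite(src, dst);              /* drop duplicates, trim old items, etc. */
        fclose(src);
        fclose(dst);

        /* Atomic on the same filesystem; writers appending with O_APPEND keep
         * the old inode until they reopen the file. */
        return rename(tmp, path);
    }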


POSIX seems to guarantee that writes with O_APPEND are atomic:

"If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation."

but then it goes on and says:

"This volume of IEEE Std 1003.1-2001 does not specify behavior of concurrent writes to a file from multiple processes. Applications should use some form of concurrency control."

Also, it does seem to allow a signal to interrupt file I/O.

Anyway, it seems that the historical Unix behaviour is that signals never interrupt file I/O and O_APPEND can be used for atomic appends. Of course, this is not documented anywhere, but you can find discussions on the topic. In particular, here [1] Linus goes into one of his rants when commenting on a patch to change this behaviour.

[1] http://yarchive.net/comp/linux/wakekill.html


Couldn't you use a named pipe to send data from each process to a single writer process?


How would the single writer process know that one 'atom' of writing has been received from each process? The named pipe might be empty if the sending process is filling its buffer with the second half of the message.


Presumably you would have some protocol for passing the messages that handles this, such as, at the simplest level, wrapping the message with a preceding length and a following end-of-record identifier, which in conjunction should work.

But that's the (implicit) point of all this: now you've added a bunch of logic for an accumulating writer, a protocol to pass messages between the processes, and state so that partially received messages can be continued when you get back to that process. Now the application has a non-trivial amount of logic and code to deal with logging, which might be a source of problems itself. As another commenter noted, individual files per process with an aggregation step (or specialized reader) is much simpler and easier to reason about, and mirrors the actual case in the article where the data was eventually inserted into SQLite, which essentially enforces this read ordering as needed (if there is an ordering field).
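
For concreteness, the "preceding length plus end-of-record marker" framing mentioned above might look like this rough sketch (the wire format is made up, and it assumes each producer has its own connection to the consolidating writer, since multiple write() calls per record would interleave on a shared pipe). It also shows how much extra machinery the single-writer design pulls in:

    /* Hypothetical framing: 4-byte little-endian length, payload, 1-byte terminator. */
    #include <stdint.h>
    #include <unistd.h>

    int send_record(int fd, const void *payload, uint32_t len)
    {
        unsigned char hdr[4] = {
            (unsigned char)len, (unsigned char)(len >> 8),
            (unsigned char)(len >> 16), (unsigned char)(len >> 24),
        };
        if (write(fd, hdr, 4) != 4) return -1;
        if (write(fd, payload, len) != (ssize_t)len) return -1;
        unsigned char end = 0x0A;
        return write(fd, &end, 1) == 1 ? 0 : -1;
    }

    /* Reader side must loop: short reads can split a record across read() calls. */
    ssize_t recv_record(int fd, void *buf, size_t cap)
    {
        unsigned char hdr[4];
        size_t got = 0;
        while (got < 4) {
            ssize_t n = read(fd, hdr + got, 4 - got);
            if (n <= 0) return -1;
            got += (size_t)n;
        }
        uint32_t len = hdr[0] | (uint32_t)hdr[1] << 8 |
                       (uint32_t)hdr[2] << 16 | (uint32_t)hdr[3] << 24;
        if (len >= cap) return -1;      /* need room for payload + terminator */
        got = 0;
        while (got < len + 1) {         /* payload plus the terminator byte */
            ssize_t n = read(fd, (char *)buf + got, len + 1 - got);
            if (n <= 0) return -1;
            got += (size_t)n;
        }
        return (ssize_t)len;
    }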


The classic application of appending from multiple sources, while potentially modifying a file and potentially having NFS in the way, is a mail spool using mbox.

While there are various hard-won implementations that work doing this, it's also one of the reasons maildir was invented.

It's much more portable, comprehensible, and reliable to have a single process writing to the file and all the other producers sending records to it using sockets or named pipes. The consolidator needs to be aware of record boundaries so you don't have to worry about pipe atomicity.


It's fairly simple to write a little server that accepts socket connections from multiple processes, reads a line of data from each connection whenever there is data available, and appends it to the log file. I wrote a server like that years ago: a write-only socket service which hundreds of separate processes could write high-volume statistics tracking to. The heart of it was only a few dozen lines.


Something like syslogd?


Similar, but different requirements. All of the messages were grouped by sending process, and when the sending process said its session was over its messages were fed into a post-processing queue. There was support for holding messages in memory or on disk, and data-preserving fallbacks if the post-processing halted or got too backed up. The message rate varied quite a bit; during the US business day the rate was at least doubled. We used that to give the post-processing queue time to catch up off-hours.

This was designed about 16 years ago, so well before the big-data tools, SSDs, or powerful machines we have today.


how about

    mkfifo foo

    tee -a log.txt < foo &
and then multiple processes write to pipe foo.
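
Each writer then just has to keep every record to a single write() no bigger than PIPE_BUF, e.g. (a toy sketch):

    /* Hypothetical sketch of one producer writing to the shared fifo. */
    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("foo", O_WRONLY);     /* the fifo created with mkfifo */
        char line[PIPE_BUF];
        int len = snprintf(line, sizeof line, "pid %ld: something happened\n",
                           (long)getpid());
        write(fd, line, (size_t)len);       /* one write <= PIPE_BUF stays unfragmented */
        close(fd);
        return 0;
    }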


I fully expect to be disappointed at some point, but I've always seen the "use fflush()" approach work, on both Linux and Windows.

TCP sockets to a server over localhost might also work (although the flush IOCTL makes no guarantees with sockets); then have the server serialize the output. One of the nice things about Tcl is that such a server is relatively easy to do. But you may need to mind newlines - don't write until an entire "line" of text has been read.


fflush() over networked drives?


Good question - I tended to create a TCP server and use that to write to drives local to the server machine. I never really did a "trust fall" with network mounted drives because the logger is generally not a standard part of the system - it was an add-on.


My approach would rely on database locking. Seems like a good fit.


Is it possible to call lockf on an O_APPEND file?


It should work just fine. But since the current offset will always be changing, you might want to lock the entire file, and fcntl has a more flexible interface for doing that.
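
For instance, a whole-file fcntl() lock is independent of wherever the O_APPEND offset happens to be (a minimal sketch):

    /* Hypothetical sketch: lock the whole file regardless of the current offset. */
    #include <fcntl.h>
    #include <unistd.h>

    /* With l_whence = SEEK_SET, l_start = 0 and l_len = 0, the lock covers the
     * entire file, including anything appended while the lock is held. */
    int lock_whole_file(int fd)
    {
        struct flock fl = {
            .l_type   = F_WRLCK,
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,              /* 0 means "to end of file, including growth" */
        };
        return fcntl(fd, F_SETLKW, &fl);    /* blocks until the lock is granted */
    }

    int unlock_whole_file(int fd)
    {
        struct flock fl = { .l_type = F_UNLCK, .l_whence = SEEK_SET };
        return fcntl(fd, F_SETLK, &fl);
    }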


Check how Apache implements its log http://httpd.apache.org/



