Hacker News
Fsync() on a different thread: apparently a useless trick (antirez.com)
55 points by antirez on May 3, 2010 | 22 comments



I guess I'm not grokking why the behavior of write() blocking while fsync() is being called is considered a problem.

write() is documented and well known to be a potentially blocking function, potentially for a long time. If you are writing a small-number-of-threads process that does disk and network IO in the same threads, you can definitely starve out the network IO when the disk system starts thrashing, but generally that is fine, because you can't keep up with new requests at that point anyway.

For that matter, read() can block too; people are just used to their kernels being smart and the most important data being in cache.

If you want your event-style threads servicing network clients to be able to use write() or read() without 'blocking' up a whole thread for apparently random times, consider using the aio_* style write/fsync, although you would definitely want to benchmark their impact on general performance.

If you don't want to use aio_*, which is likely to be dangerous ground considering how few people use them, you end up adding threads to provide more isolation and parallelism for IO.
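
For readers who haven't used them, a rough sketch of what the aio_* route looks like (hypothetical helper names, completion handling via aio_error()/aio_return() omitted; link with -lrt on Linux):

    /* Sketch: queue a write and a flush without blocking the caller.
     * Illustrative only; real code must poll or be notified of completion. */
    #include <aio.h>
    #include <fcntl.h>
    #include <string.h>

    /* Queue an asynchronous write; the caller is not blocked. */
    int submit_async_write(int fd, void *buf, size_t len, off_t off,
                           struct aiocb *cb)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_buf    = buf;
        cb->aio_nbytes = len;
        cb->aio_offset = off;
        return aio_write(cb);          /* queued, not yet durable */
    }

    /* Queue an asynchronous flush of the operations already queued on fd. */
    int submit_async_sync(int fd, struct aiocb *cb)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        return aio_fsync(O_DSYNC, cb); /* like fdatasync(), but asynchronous */
    }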


Hello. write() on non-real-time Linux cannot guarantee that it will return within a given amount of time, but all the system calls still have more or less predictable timing behavior when the disk and the CPU are not busy. What I mean is that if you remove the fsync() call from the other thread, what you get is a constant stream of "13 microseconds" delays.

So when fsync() is not in the mix, the kernel will do the right thing: it will use buffers and make the write calls very cheap. This is important for many applications. But when there are stricter durability requirements this is no longer true and care must be taken.

Non-blocking I/O (aio_*) is an interesting alternative for some applications, but in the case of Redis it is important to return "OK" only after we got an acknowledgement from write(2). Doing this by suspending the client and resuming it once the write has been performed would turn Redis from a 140k operations/second database into a 10k operations/second database, so this is not going to be the solution.

Real world software is written not just by reading manual pages, but by checking how the underlying OS actually works, IMHO. For instance, Redis persistence relies on the fork() copy-on-write semantics of modern operating systems. Also, the fact that write(2) is allowed to block by its semantics does not mean you'll be happy to learn that your kernel is blocking a process for seconds at a time as a result of a write(2) call.
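
For context, the test being discussed is roughly of this shape (a simplified reconstruction, not the exact code from the post): time every write(2) to an append-only file while a second thread calls fsync(2) on the same descriptor once per second.

    /* Sketch of the test: the main thread appends and times each write(2),
     * a background thread calls fsync(2) once per second on the same fd.
     * Simplified; real code would check every return value. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    static int fd;

    static void *fsync_thread(void *arg) {
        (void)arg;
        for (;;) {
            sleep(1);
            fsync(fd);          /* this is what ends up stalling write(2) */
        }
        return NULL;
    }

    int main(void) {
        char buf[4096];
        memset(buf, 'x', sizeof(buf));
        fd = open("appendonly.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

        pthread_t tid;
        pthread_create(&tid, NULL, fsync_thread, NULL);

        for (;;) {
            struct timeval t0, t1;
            gettimeofday(&t0, NULL);
            write(fd, buf, sizeof(buf));
            gettimeofday(&t1, NULL);
            long us = (t1.tv_sec - t0.tv_sec) * 1000000L
                      + (t1.tv_usec - t0.tv_usec);
            printf("Write in %ld microseconds\n", us);
            usleep(10000);
        }
    }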


Kernel developer here - honestly, I'm not surprised that fsync() stops the world, and furthermore I suspect that even if you got past the vfs layer, you would see different effects on different filesystems (i.e. you still couldn't bet that fsync() would act like you want it to). The semantics of fsync mean "Please guarantee that everything is written to disk, flush all caches now".

The kernel doesn't keep a 2nd queue for post-fsync writes that it will then swap into the "real" one - think about what happens if someone else calls fsync(); does it spin up a 3rd queue for that one? Does the fsync block? I think it would quickly descend into Crazyville.


Hello xpaulbettsx, thanks for your comment!

Yes, I guess the implementation may get more complex. I'm not sure about the actual implementation, but if it's just a linked list of operations to flush, as it appears from a few source code fragments, then it would be possible to put a "sentinel" in the list, so that the first fsync() blocks until the first sentinel is reached, and so forth.

I mean, I'm against complexity myself in the code I write, so I can't really question this behavior, and the "fsync every second" policy is not a huge use case indeed, but it's still important to know this. Googling around a bit, there are tons of people who appear pretty confident that moving fsync() into another thread is the way to go, while things actually work in a different way.


A pair of queues should be enough. The first queue is the one being fsynced, the second queue is open for writes. The first fsync call starts the first queue flushing. The second fsync call marks the second queue as ready to flush once the first one is done. The third has no additional effect. Writes can still proceed against the second queue until it actually starts flushing, at which point the queues swap and you're back where you started.
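
A userspace sketch of that two-queue idea (purely illustrative; the discussion above is about what the kernel would have to do, this just shows the shape of the scheme):

    /* Sketch of a two-buffer group commit: writers fill the 'open' buffer,
     * the flusher swaps buffers under the lock, then writes and fsyncs the
     * frozen one outside the lock. Bounds checking omitted. */
    #include <pthread.h>
    #include <string.h>
    #include <unistd.h>

    #define BUFSZ (1 << 20)

    static char bufs[2][BUFSZ];
    static size_t used[2];
    static int open_idx = 0;                 /* buffer currently accepting writes */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void queue_write(const void *data, size_t len) {
        pthread_mutex_lock(&lock);
        memcpy(bufs[open_idx] + used[open_idx], data, len);
        used[open_idx] += len;
        pthread_mutex_unlock(&lock);
    }

    void flush_once(int fd) {
        pthread_mutex_lock(&lock);
        int frozen = open_idx;
        open_idx = 1 - open_idx;             /* writers now go to the other buffer */
        used[open_idx] = 0;
        pthread_mutex_unlock(&lock);

        if (used[frozen] > 0) {
            write(fd, bufs[frozen], used[frozen]);
            fsync(fd);                       /* only the frozen batch waits on this */
        }
    }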


Isn't this just good old group commit? I'm surprised the kernel wouldn't do this optimization for you.


The interesting thing about your timing is that I would expect you to still see similar slowdowns on ext3, although perhaps harder to spot, because of how the most common configuration, data=ordered, behaves: it effectively causes any fsync to flush everything to disk, and even then the kernel will trigger an ext3 journal commit every ~5 seconds by default.

I didn't see it mentioned in the blog post, but what filesystem were you running this test on?

I don't know the details of Redis' threading and event model, but please feel free to share them.

Most event-y systems I've worked on end up falling back to a smallish number of threads to provide the IO concurrency, because you just don't have fine-grained enough control of the kernel to provide a perfect level of service with a tiny number of threads.

This is the model the Event MPM in httpd uses, it's the model that Traffic Server uses, and it's the model that Lighttpd tried not to use at first, but they added threads later because of blocking stat() calls, etc.

Kernels like the liberty that the POSIX standards give them: things will block, and even if you want to do 140,000 operations a second, if you want any durability at all there will have to be a tradeoff. The most likely ones are things like multiple commit logs and more threading.
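
A minimal sketch of that model (hypothetical names): hand the blocking calls to a small pool of worker threads so the event loop thread never issues them directly.

    /* Sketch: hand blocking operations to worker threads so the event loop
     * never issues them itself. Hypothetical illustration of the
     * "few extra threads for IO" model. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    typedef struct job { int fd; struct job *next; } job;

    static job *head;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

    void submit_fsync(int fd) {                 /* called from the event loop */
        job *j = malloc(sizeof(*j));
        j->fd = fd;
        pthread_mutex_lock(&m);
        j->next = head; head = j;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
    }

    void *io_worker(void *arg) {                /* one of N worker threads */
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&m);
            while (!head) pthread_cond_wait(&cv, &m);
            job *j = head; head = j->next;
            pthread_mutex_unlock(&m);
            fsync(j->fd);                       /* blocking, but off the event loop */
            free(j);
        }
        return NULL;
    }

Of course, as the article shows, write(2) calls against the same file can still end up waiting behind that fsync(2); the extra threads keep the event loop responsive, they don't remove the tradeoff.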


Sorry for not mentioning this, but I'm using ext4.

About using more threads, it seems very unlikely to me that this is going to help in a persistence model where an Append Only File is used to persist. All the clients will eventually write to the same file object and will automatically get serialized.

Btw, as proof of how the same filesystem can behave in very different ways, check the O_SYNC test above: same ext4 filesystem, but an order of magnitude faster doing things one way instead of another.

Unfortunately, POSIX or not, real world kernel implementations are full of small implementation details that a programmer doing systems programming needs to know to write fast code :(


> If you don't want to use aio_*, which is likely to be dangerous ground considering how few people use them

Oracle uses AIO (you can strace it and see for yourself) and if it's safe for them, it's probably safe for you.


Using a complex generic filesystem with too much metadata is a big part of the problem. If your app is write oriented, try a logfs-style filesystem. Or perhaps it's time for a new minimal filesystem.

Direct I/O is another option but too crass, IMHO.


Is there a reason to call fsync() all the time instead of using O_SYNC?


This is what I'm considering for "fsync always", but for "fsync everysec" this is not ideal unfortunately. Still, O_SYNC may be able to fix at least one problem, that's cool :) Thanks for the comment.

Edit: worth noting that O_DIRECT, on the other hand, can't be used for an append only file, because of alignment requirements.

Edit 2: yep O_SYNC helps a lot. Just updated the post with the results.
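
For reference, the two variants being compared look roughly like this (illustrative file name and helper names, error handling omitted):

    #include <fcntl.h>
    #include <unistd.h>

    /* Variant A: O_SYNC descriptor, one system call per append;
     * write(2) returns only once the data is on stable storage. */
    int open_osync(const char *path) {
        return open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    }

    /* Variant B: plain descriptor plus an explicit flush,
     * two system calls per append. */
    ssize_t append_then_fsync(int fd, const void *buf, size_t len) {
        ssize_t n = write(fd, buf, len);
        fsync(fd);
        return n;
    }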


Ah, yeah, that's what I meant. I probably should have specified that.


Correct me if I'm wrong, but O_SYNC is the equivalent of calling fsync() after every single write, so even if you manage to omit one fsync() you're going to come out ahead.


Semantically it's the same, but maybe it is somewhat optimized? Testing right now...

Edit: indeed it's different, this is the result:

    ...
    Write in 219 microseconds
    Write in 253 microseconds
    Write in 264 microseconds
    Write in 271 microseconds
    Write in 246 microseconds
    ...
250 us instead of 13, but no big stops. So much better indeed... Even summing the times: every 10 writes this takes 2500 us, compared to the 40000 us with fsync().

Definitely an improvement.

p.s. all this is on an ext4 FS.


It's not the same on filesystems like ext3 with ordered data mode (the default).

fsync on those kinds of filesystems will flush everything, not just a single file, though I believe there has been some more recent work in 2.6 to improve this behavior.


> recent work in 2.6

So, since 2003?


This has nothing to do with fsync().

If you lseek() back to the beginning after each write(), this behavior should never occur.

Growing a file is complicated, and may require extra unexpected disk accesses while the directory entry is modified or the free blocks are reassigned. These are what are blocking the write(), and write() will do them eventually, on a busy enough system, whether you call fsync() or not.
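
A sketch of the experiment being suggested (illustrative file name): rewrite the same region instead of appending, so write(2) never needs to allocate new blocks.

    /* Sketch: rewrite the same 4 KB region instead of growing the file,
     * so write(2) never has to allocate new blocks or touch metadata.
     * Illustrative experiment, not the Redis persistence model. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        char buf[4096];
        memset(buf, 'x', sizeof(buf));
        int fd = open("fixed.dat", O_WRONLY | O_CREAT, 0644);

        for (int i = 0; i < 1000; i++) {
            lseek(fd, 0, SEEK_SET);        /* back to the start every time */
            write(fd, buf, sizeof(buf));   /* overwrites already-allocated blocks */
        }
        close(fd);
        return 0;
    }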


Seems like a userspace write queue would solve the write-delay problem. On the other hand, it's very dangerous to report success to the user before the changes actually hit the disk, because they might make some other change to the world that depends on the contents of that file being on disk, leading to an inconsistent state.


One thing the author of this post is doing here is calling gettimeofday() in every iteration through the loops. That's an awful way to benchmark something. gettimeofday() issues a serializing instruction, which can potentially flush the entire pipeline. It can also cause cores to sync their clocks. Either way, the net result of calling gettimeofday() too often is awful multithreaded performance. Calling it much less often or using a profiler would give very different results than reported.
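
A sketch of "calling it much less often": time a whole batch with two gettimeofday() calls and report the average (helper name is hypothetical).

    /* Sketch: two gettimeofday() calls per batch instead of two per write,
     * reporting the average. Illustrative only. */
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    void timed_batch(int fd, const void *buf, size_t len, int n) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < n; i++)
            write(fd, buf, len);
        gettimeofday(&t1, NULL);
        long us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
        printf("%d writes, %ld us total, %ld us average\n", n, us, us / n);
    }

Note that the per-write timing in the post is what exposes the individual stalls; batching averages them away, but it does remove the measurement overhead the comment is about.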


Have you considered timing the average wait time for fsync() and continuously adjusting a timer to fsync() every N milliseconds based on this data? You could queue all of your response-to-client messages to wait on this fsync(), and do tens of these per second, even on a slow drive. This technique can potentially get client responses down to fractions of a second, even in high throughput scenarios.
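
A rough sketch of the shape of that idea (all names hypothetical): batch the pending client replies, fsync once, release the whole batch, and derive the next interval from the measured fsync cost.

    /* Sketch: group commit with an interval adapted from the measured
     * fsync cost. Names and structure are hypothetical. */
    #include <sys/time.h>
    #include <unistd.h>

    static long long now_us(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (long long)tv.tv_sec * 1000000 + tv.tv_usec;
    }

    /* Called from the event loop once per interval. */
    long long flush_and_release(int fd, void (*release_pending_replies)(void)) {
        long long t0 = now_us();
        fsync(fd);                         /* everything queued so far is now durable */
        long long cost = now_us() - t0;

        release_pending_replies();         /* only now answer the batched clients */

        /* Next interval: leave room for several fsyncs per second even if slow. */
        long long next_interval = cost * 2 > 10000 ? cost * 2 : 10000;
        return next_interval;              /* microseconds until the next flush */
    }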


[deleted]


Hello, I already use fdatasync() in the actual Redis source code. Unfortunately it has the same problem of blocking write(), but it's still much faster most of the time.



