Linux's fsync() woes are getting some attention (rhaas.blogspot.com)
135 points by r4um on March 10, 2014 | 67 comments



I think part of the problem is that fsync() is an insufficient interface. Most of the time, you want two things:

* write ordering ("write B must hit disk after write A")

* notification ("let me know when write A is on disk")

In particular, you often don't want to actually force I/O to happen immediately, since for performance reasons it's better for the kernel to buffer as much as it wants. In other words, what you want should be nearly free, but instead you have to do a very expensive operation.

As an example of the notification case: suppose I have a temporary file with data that is being journaled into a data store. The operation I want to do is:

1. Apply the changes to the store

2. Wait until all of those writes hit disk

3. Delete the temporary file

I don't care if step 2 takes 5 minutes, nor do I want the kernel to schedule my writes in any particular way. If you implement step 2 as an fsync() (or fdatasync()), you have a potentially huge impact on I/O throughput. I've seen these frequent fsync()s cause 50x performance drops!
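For concreteness, here is roughly what that sequence has to look like today in C with the blocking call. This is only a minimal sketch; the paths and the single-pwrite "apply" step are made up for illustration.

  #include <fcntl.h>
  #include <sys/types.h>
  #include <unistd.h>

  int apply_journal(const char *store_path, const char *journal_path,
                    const void *buf, size_t len, off_t off)
  {
      int fd = open(store_path, O_WRONLY);
      if (fd < 0)
          return -1;

      /* 1. Apply the changes to the store. */
      if (pwrite(fd, buf, len, off) != (ssize_t)len) {
          close(fd);
          return -1;
      }

      /* 2. Wait until those writes are on disk.  This is the expensive
       *    step: fdatasync() forces the I/O to happen now, instead of
       *    merely telling us when the kernel would have written the
       *    data anyway. */
      if (fdatasync(fd) != 0) {
          close(fd);
          return -1;
      }
      close(fd);

      /* 3. Only now is it safe to delete the temporary journal file. */
      return unlink(journal_path);
  }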


> In particular, you often don't want to actually force I/O to happen immediately, since for performance reasons it's better for the kernel to buffer as much as it wants.

If you don't issue an fsync(), the kernel never has to write anything. If you had a different function that did what you suggest, returning when the kernel had decided to write the data (without instructing the kernel to do so soonish), you could literally be waiting forever. If the function ever returned, it would only be because something else on the system asked for a sync (and it was easier to write everything), or because you're operating on a filesystem that chooses to write data sooner than required. Both of these are basically working by accident (i.e., may well not work on other POSIX systems).

I think what you really want is to instruct the kernel that this data should be written out, and you want to block until that happens. That's exactly what fsync() is supposed to do. The fact that some kernels hammer the I/O subsystem while doing so is a bug in those kernels, not the fsync() interface.


Coming from a graphics background, this reminds me of a similar problem in graphics.

In the OpenGL API, you have two functions (hard core graphics pedants, please forgive my simplification): There's glFinish() which instructs the graphics system to process all graphics commands previously sent and to block until everything is done. Then there's glFlush() which returns immediately, and basically says, make sure all commands previously sent will finish "in a finite amount of time" (as opposed to "maybe never").

From the discussion here, it seems that fsync() is like glFinish(). Maybe what's needed would be something similar to glFlush()? Something that says, "kernel, now would be a good time to start writing out data, but I'm not going to wait."
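For what it's worth, Linux does have a non-portable call that comes close to those glFlush() semantics: sync_file_range() with SYNC_FILE_RANGE_WRITE starts writeback of a byte range without waiting for it, and without any durability guarantee (it flushes neither metadata nor the disk's write cache). A rough sketch, not a replacement for fsync():

  /* Start writeback of [off, off+len) without blocking until it reaches
   * the platter.  This is only a hint -- no durability, no ordering, no
   * metadata, no disk-cache flush.  Linux-specific. */
  #define _GNU_SOURCE
  #include <fcntl.h>

  int start_writeback(int fd, off_t off, off_t len)
  {
      return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
  }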


That's potentially interesting, but the question is: if you're not going to do anything that relies on the data being on stable storage, why bother asking the system to write it out?

Most of the use cases I think of involve a server of some kind servicing client requests. Most of the time, the only sane semantics are that if the request completes successfully, then the change will survive a system crash. In that case, you have to fsync() (or equivalent). Conversely, if the client doesn't need that guarantee (e.g., this is a cache that can be reconstructed from elsewhere, as in the case of a CDN), then there's no reason to make sure it's on stable storage at all. You're basically using the filesystem as a large, slow extension of DRAM, and if everything you ever write fits in DRAM, you wouldn't care if the kernel ever wrote it out.

Is there a use case you had in mind? I can't think of a middle ground.


Anything where you want an image to be guaranteed consistent, even if not complete, could use an ordering guarantee without a particular "this has been written now" guarantee. A log-structured data store where you don't mind a bit of data loss if there's a power outage is a particularly clear example of that, but it's a useful property in general.

In fact filesystems in general attempt to implement this for themselves, because a filesystem should ideally always be in a consistent state... it may not be the right state, per se, but it should not be actively inconsistent, leaving you (or fsck) to basically guess what the correct state is.

The problem appears to be that today there's only "write this all out now and DO ABSOLUTELY NOTHING ELSE until that happens", and "yeah, whatever, write it whatever order you like and I sure hope it all works out."

Is that correct? There really isn't anything like a write barrier? All my reading of the links here seems to indicate that, but I find it hard to believe that really is an accurate summary of the current state of affairs on Linux. (Though I concede that I can see how hard it would be to propagate such a guarantee all the way from the disk hardware, through the drivers, through a large and varied number of file systems, all the way out to user space, without bugs, bugs, bugs.)


> Anything where you want an image to be guaranteed consistent, even if not complete, could use an ordering guarantee without a particular "this has been written now" guarantee. A log-structured data store where you don't mind a bit of data loss if there's a power outage is a particularly clear example of that, but it's a useful property in general.

So a little bit of data loss is okay, but a lot isn't? How does a program or operator determine how much is okay and how much isn't? How does the application ensure that that limit isn't exceeded? Without answers to these questions, it feels like "you'll probably be fine, but I can't be sure of anything", which feels pretty lame. But if a little data loss really is okay, then forget about both ordering and fsync and truncate the log after the last consecutive valid record.

> The problem appears to be that today there's only "write this all out now and DO ABSOLUTELY NOTHING ELSE until that happens", and "yeah, whatever, write it whatever order you like and I sure hope it all works out."

> Is that correct?

I don't think so, but it depends on what you mean by "absolutely nothing else". You can always use other threads, and on most systems, you can do lots of useful I/O with reasonable performance while an fsync() is going on.

> There really isn't anything like a write barrier?

Other than fsync() and equivalents, I don't know of one. Non-blocking write barriers would represent a much more complicated abstraction for both applications and the filesystem, and (as you can tell from my comments on this thread) I'm not convinced it's worth the complexity for any rigorous program.


> You can always use other threads, and on most systems, you can do lots of useful I/O with reasonable performance while an fsync() is going on.

No, you can't. At least not without knowing what you're doing and careful planning.

My day job is PostgreSQL DBA, and I've been doing that for most of a decade now. As the kids on the Reddits would say, "I've seen some shit." I have some rather large servers, with some rather powerful IO subsystems — my production environment has SAS SLC SSDs under hardware RAID with a ginormous cache. I still see the behavior described in TFA far more often than I'd like. Linux really is pretty dumb here.

For example, because of this fsync() issue, and the fact that fsync() calls flush all outstanding writes for the entire filesystem that the file(s) being fsync()'ed reside upon, I've set up my servers such that my $PGDATA/pg_xlog directory is a symlink from the volume mounted at (well, above) $PGDATA to a separate, much smaller filesystem. (That is: transaction logs, which must be fsync()'ed often to guarantee consistency and enable crash recovery, reside on a smaller, dedicated filesystem, separate from the rest of my database's disk footprint.)

If I didn't do that, at every checkpoint, my performance would measurably fall. I learned this lesson the hard way, at an old job, where my postgres clusters lived on a SAN — it wasn't just my db instances that were being adversely affected by this IO storm. It was everything else that lived on the filer, too.

That's how bad it can be.


It's not true that fsync() calls flush all outstanding writes for the entire file system; that was true for ext3 in data=ordered mode, but it's definitely not true for ext4 or xfs. If you use fdatasync(), and there were no write commands issued against the file descriptor that required metadata updates (i.e., you didn't do any block allocations, etc.), then neither ext4 nor xfs needs to trigger a journal commit, so the only thing that has to get sent to disk is the file's dirty data blocks, followed by a SYNC CACHE command which forces the disk drive to guarantee that all writes sent to the disk will survive a power cut.

If you use fsync(), and/or you have allocated blocks or otherwise performed a write which required updating file system metadata and thus will require a journal commit, then you will need to force out all pending metadata updates to the journal as part of the file system commit, but that's still not the same as "flush all outstanding writes for the entire file system".


> or you have allocated blocks or otherwise performed a write which required updating file system metadata

What if you're appending to a file and want to checkpoint every so often? I guess you can be clever with fallocate(FALLOC_FL_KEEP_SIZE) to avoid the block allocation, but won't st_size still need to be updated?

I also assume that st_mtime doesn't count towards dirtying the metadata.
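For reference, a sketch of that idea: preallocate with FALLOC_FL_KEEP_SIZE so later appends don't need new block allocations, then fdatasync() at each checkpoint. Whether the i_size update on append still forces a journal commit is exactly the open question above; the function shape here is made up for illustration.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  int append_checkpoint(int fd, off_t reserve, const void *rec,
                        size_t len, off_t end)
  {
      /* Reserve blocks past the current end of file without changing
       * st_size, so the later append does not allocate. */
      if (fallocate(fd, FALLOC_FL_KEEP_SIZE, end, reserve) != 0)
          return -1;

      /* Append the record into the preallocated region. */
      if (pwrite(fd, rec, len, end) != (ssize_t)len)
          return -1;

      /* Checkpoint: flush the data plus whatever metadata is needed to
       * read it back (which may still include the new file size). */
      return fdatasync(fd);
  }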


You're absolutely right. The situation on Linux sounds bad, and it sounds like work's being done to improve that. Some other systems handle this case fine, though, and the problem isn't with the fsync() interface.

Aside: I worked on one of those filers, and I've seen some shit too. :) Firmware, especially disk firmware, was the worst.


> So a little bit of data loss is okay, but a lot isn't?

I don't think that's the issue being put forth here, all the way to the most-ancestor comment. It's not between a bit of data loss and a lot of data loss: if the power fails, I fully expect that some data that was in buffer might not have made it to the disk. But that's okay: I left my file in a state that's still valid (using a journal or something), and I can recover from it.

But if write B happens before write A, my file is corrupt, and it's game over. I don't need to override the kernel's I/O scheduler to say "write this write right now!", I just need to ensure that A goes before B. That's it; it can happen whenever, but it must happen in that order.

Currently, the only solution seems to be fsync and forcing everything to disk right now, and screwing over any buffering whatsoever¹. I just want to specify to the kernel, "here's my data, here's how it must be written, but you can buffer and write it when convenient, as you have a better picture of the whole system I/O."

¹It's funny that the article mentions Postgres/MySQL, as I've heard of fsync woes through browsers using SQLite.


> So a little bit of data loss is okay, but a lot isn't?

No amount of data loss is OK. However, delaying acknowledgement to the network client requesting the db update often is OK. Especially if your protocol allows many simultaneous requests to be in flight with asynchronous completion notification. (Similar to how TCQ on a hard drive works.)

Right now it's very hard to do this type of workload efficiently. That's a pity because with write-notifications you could do it with zero extra I/O cost.


I've never implemented database transactions, so this might be a little hand wavy.

A database transaction promises that all changes happen or no changes happen, i.e., that you don't get some of the changes. The easiest way to implement that is to write what you're going to do to a log, and then once you know that the log is on disk, you make the change in the actual data file. Two phase commit.

But what if I could loosen that some? It isn't that important when the log file is actually written; only that it will hit disk before the changes to the data file. If I write to the log, tell the operating system "after you get around to putting that on disk, make these changes to this other file," I can promise that the transaction will be atomic. I can't tell you if the transaction will actually be committed, but I know there won't be a partial commit. If the system crashes before the log gets updated, the transaction doesn't commit. But I don't have to wait around to find out when the file gets written out. I just need to know that if anything gets written to the data file, then all my writes to the log file are safely on disk.
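Concretely, with today's interface the log-then-data sequence has to look something like this. A sketch only; the fdatasync() in the middle is the blocking step that the ordering constraint above would replace, and the descriptors/offsets are assumed to be set up elsewhere.

  #include <sys/types.h>
  #include <unistd.h>

  int commit(int log_fd, const void *logrec, size_t loglen, off_t logoff,
             int data_fd, const void *data, size_t datalen, off_t dataoff)
  {
      /* Phase 1: record what we are about to do. */
      if (pwrite(log_fd, logrec, loglen, logoff) != (ssize_t)loglen)
          return -1;

      /* Today's only ordering tool: block until the log is on disk. */
      if (fdatasync(log_fd) != 0)
          return -1;

      /* Phase 2: only now is it safe to modify the data file. */
      if (pwrite(data_fd, data, datalen, dataoff) != (ssize_t)datalen)
          return -1;
      return 0;
  }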


Yes, a non-blocking fsync. I'd like to ask the kernel to write some data out, but I want to go on servicing other requests without blocking one OS thread entirely.

And sometimes I really don't care when data is written out, just that it happens in the right order. I may be okay with losing a few seconds to a minute of work, but not okay with blocking all computation while I'm waiting for fsync.


jerf described a similar case where one is okay with losing some data, but not "too much". I'm not sure how to make that approach rigorous. (If rigor isn't important, and it's just best effort with no guarantees at all, then don't bother with fsync at all.)

I'm also not sure in what higher-level use case it actually makes sense. Sorry if I'm being thick, but saying "the case where I want exactly that" doesn't help explain that case :) I'm looking for a higher-level description of the problem that would solve (e.g., "an ACID database", or "a CDN" -- except those are cases where you may want zero or any data loss, respectively).


In most things I do on my computer, a few seconds or a couple minutes of data loss is acceptable, but data corruption is not acceptable. Without ordering guarantees, data can become corrupt: something from weeks ago can disappear because you tried to update it. That is what is unacceptable.

Perhaps compare to the uberblocks on zfs. You can wipe out the most recent one, or the most recent fifty, and still have a consistent, slightly rewound, filesystem. Take that level of consistency and add a notification when everything is written out, and you have a nice nonblocking fsync.


The kernel might wait for a very long time before writing any data to disk (especially in laptop mode with a spinning disk drive). Telling it to start writing now seems like a useful operation for an application that can handle the loss of recent writes, but would prefer to have data start going to disk anyway. Otherwise a power failure could mean losing 30 seconds of writes or more.


There is already fflush, and it is generally not what people calling fsync want.


fflush is unrelated to the business of flushing data to disk. It's just a libc FILE* thing. It causes libc to empty the (typically tiny) stdio buffer associated with the FILE* using the write() system call.
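To make the distinction concrete, a minimal sketch: fflush() only moves data from the stdio buffer into the kernel's page cache via write(); fsync() on the underlying descriptor is still needed to push it to stable storage.

  #include <stdio.h>
  #include <unistd.h>

  int write_durably(FILE *fp, const char *line)
  {
      if (fputs(line, fp) == EOF)
          return -1;
      if (fflush(fp) != 0)          /* stdio buffer -> kernel page cache */
          return -1;
      return fsync(fileno(fp));     /* kernel page cache -> disk */
  }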


Oops, my bad.


> you could literally be waiting forever

Technically that's true, but it's not an issue in real life. Any sensible operating system will flush dirty buffers to disk eventually. Imagine if a sudden power outage lost writes you had made a month ago; it would be considered a pretty severe OS error.

So in reality, you might be waiting many seconds for the write to hit disk, but you won't be waiting hours.

> That's exactly what fsync() is supposed to do.

No, fsync is "Make sure the last write made it to disk. In fact, force it out immediately. Also, all of the other pending writes as well, even if I'm not concerned about them. Oh, and block my thread until you're finished." It's an expensive operation.


> No, fsync is "Make sure the last write made it to disk. In fact, force it out immediately. Also, all of the other pending writes as well, even if I'm not concerned about them. ..."

POSIX 1003.1 (2004) doesn't go so far as to require unrelated data to be committed (emphasis mine):

"The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes." (http://pubs.opengroup.org/onlinepubs/009695299/functions/fsy...)

I think there's a good argument that Linux and other workalike OSes go too far in flushing all dirty pages to disk when fsync() is called.

sync(), on the other hand, does require all dirty pages to be committed, though, ironically, it is permitted to return asynchronously (http://pubs.opengroup.org/onlinepubs/009695299/functions/syn...).


True, and it's good of you to clarify. However, what if the "other pending writes" are to the same file descriptor? Just because I'm eager to know when one write happens doesn't mean I care about other writes I made elsewhere.

sync() is basically a relic. Historically it only existed so that a UNIX userland process could roughly control the rate that dirty buffers went to disk. This was the "update" daemon historically, known as "fsflush" on SYSV. At least on linux that's internal to the kernel these days.


Isn't sync() still useful for making sure all file-systems are sync'ed (as by the /bin/sync utility -- which apparently basically calls sync())? For instance, before shutdown?


Occasionally helpful, yes. But in my experience, usually unnecessary. Filesystems will sync themselves prior to being unmounted, which takes care of most of the obvious sync usecases.


Could you explain what the kernels that "do it right" do differently from those that "hammer the I/O subsystem"?

Not being an expert, it would seem to me there shouldn't be much difference but obviously there is (i.e. TFA also says "linux woes").


One of the biggest problems here is that spindles usually don't handle oversubscription well. They service requests quickly up to a certain queue depth, and after that I/O times become significantly longer and more erratic. So if you've got a RAID5 system swamped by a large streaming write workload, an fsync that triggers even a single write may take hundreds to thousands of milliseconds to complete.

ZFS does a few things here:

- On fsync(), the filesystem only records intent log records to disk. It doesn't record all the filesystem updates needed to implement the change. This cuts back on I/O.

- ZFS supports using a separate physical device only for the intent log (called a "slog"). Even if a huge (non-synchronous) workload is slamming most of the spindles, writes to the slog can still complete quickly. This helps even if the slog is another spindle, but it's a huge win if it's an SSD.

- ZFS throttles writes to avoid issuing so much to disks concurrently that they start behaving poorly. For more on this (and recent work to improve it), see http://dtrace.org/blogs/ahl/2013/12/27/zfs-fundamentals-the-....

This may all apply to ZFS on Linux as well.


What bodyfour is proposing is not just getting a write-notification -- it's in tandem with being able to specify write ordering. If you need to be able to specify write ordering, you also want to be able to have a write notification.

fsync does not work because it returns when everything (in the same thread? or whatever) has been written to disk, and doesn't let you wait for a particular block to have been written.


> fsync does not work because it returns when everything (in the same thread? or whatever) has been written to disk, and doesn't let you wait for a particular block to have been written.

I believe fsync is per-file (or, really, per-file descriptor). I wasn't familiar with this specific problem, but I believe the problem is that for fsync to work, it has to issue commands to the hardware to flush hardware buffers to disk. Apparently that isn't well targeted in Linux, and has the effect that calling fsync on a file can slow down other threads and processes that happen to be writing to that same disk, even if they're writing to a different file.


Well even doing it on a file is bad.

It's per file, but also, it's (necessarily) the in-core state of that file, as a Linux man page says.

So if you do a write on one thread, wait for that write to complete, and then signal another thread to do an fsync, I don't really know what happens.


> What bodyfour is proposing is not just getting a write-notification -- it's in tandem with being able to specify write ordering. If you need to be able to specify write ordering, you also want to be able to have a write notification.

It's still not enough: with write ordering and notification but no instruction to actually write the data soon, the kernel can buffer it indefinitely.

If you want the data on disk, use fsync(). If you don't care, don't. If the problem is that you can't afford the latency imposed by the multiple fsync() calls required to ensure correct data ordering for your application, fine. But that's not the problem the OP talks about. That was about fsync() hammering the I/O subsystem. You can solve that problem by fixing the fsync() implementation.

> fsync does not work because it returns when everything (in the same thread? or whatever) has been written to disk, and doesn't let you wait for a particular block to have been written.

If you really want that, could you mmap(2) the file and use msync(2)?

For this as well as the other cases described (e.g., wanting write ordering), I don't know what the intended use case actually is, but is it possible that there's another way to organize the data that's still correct and performs well without changing the POSIX interface? It seems likely, given the number of different programs out there that manage to get by with it, and there's a rather significant cost to adding a new interface.

(One option is to write everything you need into a temporary file, fsync() it, then synchronously rename it to "commit" it. That still requires two fsync's, but never more than that. You can generalize this for multiple files using a temporary directory.)
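Roughly, that pattern looks like the sketch below. It reads the "two fsync's" as one on the file and one on its containing directory (the directory sync being the usual way to make the rename itself durable); that reading, the paths, and the trimmed error handling are assumptions for illustration.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/types.h>
  #include <unistd.h>

  int commit_file(const char *dirpath, const char *tmppath,
                  const char *finalpath, const void *buf, size_t len)
  {
      int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
      if (fd < 0)
          return -1;
      if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
          close(fd);
          return -1;
      }
      close(fd);

      /* Atomically replace the old file with the fully written one. */
      if (rename(tmppath, finalpath) != 0)
          return -1;

      /* Second fsync: make the rename itself durable via the directory. */
      int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
      if (dfd < 0)
          return -1;
      int rc = fsync(dfd);
      close(dfd);
      return rc;
  }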

FWIW, I typically work on illumos systems. On fsync(), ZFS records only an intent log record. That alone helps, since it's not stopping the world to write out everything that's been buffered. For particularly latency-sensitive applications, we use a separate intent log device on an SSD. (Regardless of write ordering and filesystem optimizations, an SSD is necessary in order to guarantee something is on stable storage with latency better than spindles can provide.) This configuration works very well.


> I don't know what the intended use case actually is,

Generally speaking, if you've got a database system of some sort and want to write data to a file.

> It's still not enough: with write ordering and notification but no instruction to actually write the data soon, the kernel can buffer it indefinitely.

That's not really the problem -- being able to say "do this write operation after this other write operation" would let you pump modifications into some file at a faster rate than if you had to wait for every fsync. Suppose you have a modification that needs to be done. Right now, you might say, write a batch of blocks, wait for them to complete, and then write another block elsewhere (a new "superblock" or whatever the terminology you prefer is). Well, you'd rather send all the blocks simultaneously and say "the superblock write should happen _after_ these other blocks'". (Another option is to checksum the new blocks referred to by the superblock, but that requires pulling them up to the CPU and checksumming them.) (And there are other options that are more complicated with other trade-offs -- it would be nice if you could just send multiple blocks to write, with a partial ordering specified.)

So, even if you had no fsync at all, you'd be able to pump modifications into a database file faster than before. Without some kind of fsync, you couldn't confirm they'd ever been written. With a fine-grained fsync or "flush and notify on a per block basis" call, you can confirm that a certain subset of changes have been written. Generally speaking it's nice to be able to send in a bunch of changes without flushing because when you have multiple noncontiguous block writes to choose from that you'd like to perform simultaneously, they can get thrown on disk with better throughput.

> If you really want that, could you mmap(2) the file and use msync(2)?

It's better to send writes using O_DIRECT. Basically because mmap is bad for various reasons. There's some decent discussion on this here, especially in the comments: http://useless-factor.blogspot.com/2011/05/why-not-mmap.html
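For anyone unfamiliar with what "sending writes using O_DIRECT" involves: the flag bypasses the page cache, but buffer address, file offset, and length generally need to be aligned to the device's logical block size, and O_DIRECT alone doesn't flush the drive's write cache. A rough sketch; the 4096-byte alignment is an assumption about the device, and len/off are assumed to be multiples of it.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int direct_write(const char *path, const void *data, size_t len, off_t off)
  {
      const size_t align = 4096;   /* assumed logical block size */
      int fd = open(path, O_WRONLY | O_DIRECT);
      if (fd < 0)
          return -1;

      /* O_DIRECT needs an aligned buffer, so copy into a bounce buffer. */
      void *buf;
      if (posix_memalign(&buf, align, len) != 0) {
          close(fd);
          return -1;
      }
      memcpy(buf, data, len);

      int rc = -1;
      if (pwrite(fd, buf, len, off) == (ssize_t)len)
          rc = fdatasync(fd);      /* still needed to flush the drive cache */

      free(buf);
      close(fd);
      return rc;
  }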


>> I don't know what the intended use case actually is,

> Generally speaking, if you've got a database system of some sort and want to write data to a file.

But database systems have been around for years without such an interface, and can't they basically saturate a storage subsystem?


You can always saturate a storage subsystem -- add more clients (assuming you don't saturate the CPU, the CPU's memory bandwidth, or the network interface -- any of which can happen if you put a high-end storage device on otherwise typical hardware). But what you get is higher than the minimal possible latency.

For example, suppose you send a bunch of write operations to the disk and then send an fsync. Well, if those write operations happen one after the other (figuratively) (because there's a bunch of them), their actual completion time would on average be half that of the actual waiting time all of them must suffer through.

Now suppose you've got the ability to do fine-grained fsyncs on particular write operations, efficiently. It would still be useful and result in improved latency if the disk or OS knew that getting block A on disk didn't matter to the process until block B was also on disk, and took advantage of that fact. And it would be extra-useful if the disk or OS knew that block B had to be written after block A, because then you would be able to save a round-trip or save on CPU bandwidth necessary for marking or checksumming blocks to a sufficient degree that you can determine upon startup whether they were completely and correctly written.


> an SSD is necessary in order to guarantee something is on stable storage with latency better than spindles can provide

Interestingly enough, if we ignore current hard drive firmware, I don't agree with this. In the context of sending arbitrary random sequences of writes to blocks, sure. But in the context of having a database or filesystem that wants low-latency writes? My guess is that you could accomplish this if you track the location of the drive head and spindle. The last time I tried anything like this, though (talking to /dev/sdb, a new 7200 RPM WD Black laptop drive, from userland), I could only get about 1.5 ms per block write + fsync (iirc -- the numbers 1.3 ms and 2.0 ms ring a bell too). I didn't try writing near the middle of the disk, though, so it could have been drifting the drive head off the track each time for some reason. There's just no hope in general, given current rotationals, when a rotational drive takes ~250 microsecs just to read a 4KB buffer from memory. They just don't care.

If you actually did take advantage of physical information to hold down write latency, garbage collection and keeping startup times low would be a pain (but hey, SSDs have GC worries too), and there'd definitely be throughput and capacity trade-offs.


Is that not the purpose of [aio_fsync](http://pubs.opengroup.org/onlinepubs/009696899/functions/aio...)?

Disclaimer: I've never used it, I just filed it in the back of my mind for if I wanted to do basically what you just described.


Doesn't exist on Linux as a kernel interface, although the API is there; if it works, it is either a thread doing fsync or it's synchronous.
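For reference, the POSIX interface in use looks roughly like this: queue a sync for everything written so far to the descriptor, then poll with aio_error()/aio_return() (or take a completion signal instead of spinning). As noted above, on Linux glibc just runs the fsync() on a helper thread, so this buys concurrency but not a cheaper operation.

  #include <aio.h>
  #include <errno.h>
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  int fsync_async(int fd)
  {
      struct aiocb cb;
      memset(&cb, 0, sizeof(cb));
      cb.aio_fildes = fd;

      /* Enqueue a sync of all operations queued so far on fd. */
      if (aio_fsync(O_SYNC, &cb) != 0)
          return -1;

      /* The caller could do useful work here instead of polling. */
      while (aio_error(&cb) == EINPROGRESS)
          usleep(1000);

      return aio_return(&cb);
  }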


Not a kernel dev, but to my knowledge "write B must hit disk after write A" would actually be rather easy to implement, if the storage device supports it: insert a write barrier. For the "let me know when write A is on disk" part, adding a callback to a write barrier would do; not sure what the kernel queue looks like, though.


In principle, yes. I believe most modern filesystems use (kernel-level) write barriers to ensure filesystem consistency, and those do depend on having hardware barriers as well.

The kernel is not set up to export this interface, however, and there doesn't appear to be any serious work being done to fix that.

EDIT: http://pl.atyp.us/2013-11-fixing-fsync.html has more details.


While a "notify me when this write completes" (but don't feel the need to rush) might be useful, I think it'd be more useful to be able to push both write A and write B to the kernel, along with the condition of "A before B". You not only get the write ordering, but the kernel gets full knowledge of what's going on. Perhaps it could use this to write A and B together (but journaled at the FS level) to guarantee the order, whereas a simple write-complete notification implies at least two writes, and the latency of doing such.


The two-write operation you propose is simpler, but you're losing a lot of generality. Specifically, think about:

  * I want to notify a network client that its write is done
  * When this write is done, I want to delete a temporary file 
etc


Good points — there's definitely a place for a write barrier. A simple write barrier, while not as efficient for this use case, might be generally more useful.

For your second point, if you take my initial "write B depends on the completion of write A" and extend it to allow arbitrary kernel commands to depend on others, then you could still do that (delete B depends on write A); however, things are probably too complex at this point, and a simple write barrier is better.


I don't understand, but I am not very knowledgeable about any of this. My understanding is that fsync is an instruction to command the physical disk to flush its write buffer to the non-volatile (platters/flash) portion of storage, so it is safe to yank the cord from the wall.

In your scenario, it seems like you could have a situation where the temp file and the data are sitting on RAM inside the HD.

I honestly don't see how multiple processes that share a single physical disk can avoid getting trashed by a process that instructs the disk to constantly flush the buffer. It seems like asking for a network interface that doesn't slow down the network due to other processes.


There are multiple levels of caching. The cache in a disk is just one. Filesystem cache is another, which happens in RAM. There may be a layer of caching between filesystem and disk. Applications routinely cache data in their own memory allocations before writing to the filesystem.

fsync lives between applications and the physical disk. Its (intended) effect is to flush dirty data that an application has written to the filesystem all the way to the physical disk.

For maximum performance and minimum fragmentation, you want to wait as long as possible before data is flushed from the filesystem's cache. In a typical storage system, the filesystem is what decides the approximate ordering of writes and the allocation of disk space. Disks have no idea what's going on; they just respond to commands to write certain data to certain locations. Their discretion is minimal: at most they will re-order a small batch of writes for optimum head movement.

Delaying a flush as long as possible allows the filesystem to coalesce multiple writes to the same location into one write, allocate disk space in the largest possible chunks in the best locations to avoid fragmentation and excessive seeking, order the writes for rough optimization and for data and metadata consistency.

Caching data in RAM is a good thing.

fsync is what you use when you've got important data that needs to be preserved immediately. Most data on a typical PC or server is just not that important. You're better off letting the filesystem do its job. And there's a whole continuum of data consistency in between, where you might want certain operations to happen in a certain order, or data to be written in a certain order if it's written, even if you don't necessarily need them to end up on the disk right now.


Featherstitch http://lwn.net/Articles/354861/ is an interface (and an implementation) for write ordering without blocking. As far as I can tell, it went nowhere.


It always amazes me that after all these years, Linux still hasn't fixed this.

In my experience, any program that overloads I/O will make the system grind to a halt on Linux. Any notion of graceful degradation is gone and your system just thrashes for a while.

My theory about this has always been that any I/O related to page faults is starved, which means that every process spends its time slice just trying to swap in its program pages (and evicting other programs from the cache, ensuring that the thrashing will continue).

I've never gotten hard data to prove this, and part of me laments that SSDs are "fast enough" that this may never actually get fixed.

Can anyone who knows more about this comment? It seems like a good rule inside Linux would be never to evict pages that are mapped executable if you can help it.

Has anyone experimented with ionice or iotop? http://www.electricmonk.nl/log/2012/07/30/setting-io-priorit...


Happens on Windows and Mac OS too. Every time I boot, Dropbox thrashes my disk for 10 minutes while the system is almost completely unresponsive.


Yes, OS X isn't very advanced.

In its heyday, Solaris was outstanding in terms of being responsive while simultaneously doing large amounts of I/O. (Or at least that's my perhaps clouded recollection; I haven't used Solaris in over 5 years.)


Well, you'll have to be more specific than 'being responsive' and 'large amounts of I/O' for anyone to make sense out of your statement :)


>Every time I boot Dropbox thrashes my disk for 10 minutes while the system is almost completely unresponsive.

I seriously doubt that any usermode program could overwhelm the OS scheduler like that. What are your use case parameters?


> I seriously doubt that any usermode program could overwhelm the OS scheduler like that. What are your use case parameters?

The OS IO scheduler is known to be shitty. That's not an exaggeration. You can start swapping because the scheduler does not free unused caches fast enough.


Just after the OS boots, Dropbox needs to index 120 GB of files, and any other program that wants to access the disk takes forever. It takes about 10 minutes for Dropbox to finish and for my mail and IDEs to open. Any other program that needs the disk is uselessly slow.


Interesting. How many files do you have? I recorded a trace of Dropbox executing on my Windows machine (mostly flat folder hierarchy, ~500 MiB, ~1000 files), and the file I/O for querying all my data took 71542.070μs (0.07s). I believe Dropbox also does some extra things (reading the NTFS journal, its own file cache-journal, updating hashes, etc.), so the total file I/O cost was around 2944815.431μs (2.9s). Note that the I/O happened sporadically, and the wall clock time is higher, as expected (it didn't block the scheduler from scheduling other processes).

I assume that since my data was synced and didn't need to be indexed all over again, I got some savings there. Maybe your Dropbox configuration data is corrupted and that's why it needs to index it all again.


How do you record a trace? That would be interesting to do.


Windows Performance Recorder.



Good to see this summit involving the kernel developers, since the past situation sounds rather bleak interaction-wise: using a kernel version from 2009 and not having tested the improvements in the (2012) 3.2 kernel.

BTW, Linux provides the direct I/O O_DIRECT interface that allows apps to bypass the kernel caching business altogether. This is also discussed in Mel Gorman's message that this blog borrows from.


Hmm, maybe that's why they had so many issues with single-instance Redis & MongoDB? As soon as fsync() ran, the whole db became unresponsive.


MongoDB's storage engine is more or less a mmap'd linked list of documents. It has a lot of issues once you actually start doing reads or writes, whether it's because your working set exceeds RAM or you actually want durability. It's a nice term-paper DB implementation but there's a good reason why most of the big single-instance RDBMS's use their own I/O algos instead of delegating to the kernel. Fundamentally the kernel doesn't know your optimal access pattern or your desired tradeoffs.

Redis turned away from mmap'd storage some time ago; you can snapshot the DB to disk, but that's done via a fork(), and the writes happen in that other process.


I think this is a very important area of improvement for Linux. While we call it "multitasking", there are a lot of situations where one might doubt it deserves that title.

I have been experimenting with very low cost computing setups that optimize for robustness, and that led me to pretty slow disk I/O. While that's not a typical scenario for desktop computing, it can and should be possible with the limited but sane resources I ended up with. In practice, however, certain loads freeze the whole system until a single, usually non-urgent, write finishes. Basically, the whole throughput is used for a big write, and then X (and others) freeze because they are waiting for the filesystem (probably just a stat and similar).

There are differences between applications. Some "behave" worse than others. Some even manage to choke themselves (ever seen GIMP take over an hour to write 4MB to an NFS RAID with 128kb/s throughput?).

I guess this is a hard problem, but I would wish for an OS to never stall on load. It is even better to slow down exponentially than to halt other tasks. Ideally the system would be smart and deprioritize long-running tasks so that small, presumably urgent, tasks are impacted as little as possible.


Re Mel Gorman's details in http://article.gmane.org/gmane.linux.kernel/1663694

I don't understand why PostgreSQL people don't want to write their own IO scheduler and buffer management. It's not that hard to implement (even a MT IO+BM is not really complicated), and there are major advantages:

- you become truly platform-independent instead of relying on particulars of some kernel [the only thing you need from the OS is some form of O_DIRECT; it exists also on Win32]

- you have total control over buffer memory allocation and IO scheduling

- whatever scheduling and buffer management policy you're using, you can more easily adapt it to SSDs and other storage types, which are still in their infancy (e.g., memristors) [thus not depending on the kernel developers' goodwill]

I mean, really: these people have implemented an RDBMS with a bunch of extensions to standard SQL, and an IO+buffer management layer is suddenly complicated, or [quote from the link]: "While some database vendors have this option, the Postgres community do not have the resources to implement something of this magnitude."

This smells more like politics than a technical issue.


Here is a mail from Mel Gorman with many more details.

http://mid.gmane.org/%3C20140310101537.GC10663%40suse.de%3E


So, what does Oracle Enterprise Linux do on an Exadata box running Linux?


ora uses O_DIRECT.


Ditch Linux and port PgSQL to run on top of raw Xen interfaces. You get to control your own buffering, worker thread scheduling, and talk directly to the (virtual) disk driver. I believe it'd be a win.


O_PONIES ride again?


Oh yes, that annoying problem (especially for MongoDB) that data should eventually be committed to a disk.)

Informix (and PostgreSQL) allows the DBA to choose "checkpoint/vacuum intervals".

The rule of thumb, unless you are a Mongo fan, is that checkpoints should be performed often enough that they don't take too long, which depends only on the actual insert/update data flow.

But any real DBA could tell you the same - sync quickly, sync often, so the server will run smoothly, though not "at web scale", and the pain of recovery will be less severe.)



