My experience with Linux and buffered IO (ext4) from multiple threads has been very positive. The only beef I have is that you can't prevent the data you write/read from polluting the page cache without resorting to madvise, which isn't available from Java. I don't usually care about the contents of the page cache, so it isn't a showstopper.
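For reference, this is roughly what it looks like from C (a minimal sketch, error handling omitted). posix_fadvise is the file-descriptor counterpart to madvise, and POSIX_FADV_DONTNEED only drops pages that are already clean, hence the flush first:

    /* Sketch: write a block, then tell the kernel we won't need those
     * pages again so they don't crowd out hotter data. */
    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <unistd.h>

    static void pwrite_uncached(int fd, const void *buf, size_t len, off_t off)
    {
        pwrite(fd, buf, len, off);   /* error handling omitted */
        fdatasync(fd);               /* DONTNEED only drops clean pages */
        posix_fadvise(fd, off, (off_t)len, POSIX_FADV_DONTNEED);
    }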
You can do hundreds of thousands of random reads a second from a single thread submitting tasks to a thread pool on an in-memory data set. For a larger-than-memory data set on an SSD you can do tens of thousands of reads a second; I was able to get the advertised 4K IOPS out of the SSD (Crucial m4) with an Intel i5 desktop CPU.
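The pattern, sketched with plain pthreads and blocking pread calls (the worker count and file here are placeholders, not the setup I benchmarked):

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NWORKERS 8              /* illustrative, tune per device */
    #define READS_PER_WORKER 100000

    struct job { int fd; long nblocks; unsigned seed; };

    static void *reader(void *arg)  /* each worker does blocking 4 KiB preads */
    {
        struct job *j = arg;
        char buf[4096];
        for (long i = 0; i < READS_PER_WORKER; i++) {
            off_t off = (off_t)(rand_r(&j->seed) % j->nblocks) * 4096;
            if (pread(j->fd, buf, sizeof buf, off) < 0)
                perror("pread");
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int fd = open(argv[1], O_RDONLY);
        long nblocks = lseek(fd, 0, SEEK_END) / 4096;
        struct job jobs[NWORKERS];
        pthread_t t[NWORKERS];
        for (int i = 0; i < NWORKERS; i++) {
            jobs[i] = (struct job){ fd, nblocks, (unsigned)i + 1 };
            pthread_create(&t[i], NULL, reader, &jobs[i]);
        }
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }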
I frequently have to multiplex data as it becomes available into a single file (to keep the IO sequential for the disk and filesystem). I always use a thread per file, and I got up to 250 megabytes/sec on a 4-disk RAID-0. I don't currently have a use case for more sequential write throughput than that, so I haven't tried attaching more disks, and SSDs weren't as fast or common at the time.
My reading of buffered IO in Linux is that it translates to a combination of page cache interactions and async IO under the hood, so technically we are always using async IO.
Yeah, I would be more interested in hearing why libtorrent feels the need to implement their own disk cache. I'm sure they can do better, but by how much?
I actually very much doubt that they can do better. Writing any sort of large LRU cache on a machine with swap turned on is a bad idea, because a lot of your cache will get swapped out and then swapped back in unnecessarily when you try to use that memory for something else. mlock can be used to mitigate this effect, but the default mlock limit is so low that it's useless in practice. Another thing to consider is that using mmap gives you a big advantage over writing your own disk cache, because the OS can take advantage of the paging hardware in the processor, which you can't do from userspace.
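To make the mlock point concrete: on many distros RLIMIT_MEMLOCK defaults to 64 KiB, so pinning even a modest cache fails out of the box. A quick sketch:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        getrlimit(RLIMIT_MEMLOCK, &rl);
        printf("RLIMIT_MEMLOCK: %llu bytes\n",
               (unsigned long long)rl.rlim_cur);

        size_t want = 16 * 1024 * 1024;  /* a modest 16 MiB cache */
        void *cache = malloc(want);
        if (mlock(cache, want) != 0)
            perror("mlock");             /* typically fails with ENOMEM */
        return 0;
    }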
One reason why the bittorrent layer could make better decisions about what to flush is that it always needs to hash pieces as they are downloaded. When blocks are downloaded out of order, it doesn't make sense to flush unhashed blocks before ones the hash cursor has already passed. If an unhashed block is flushed, it will have to be read back from disk when the piece completes, which is very expensive.
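A toy sketch of that policy; the struct and field names are invented for illustration, not libtorrent's actual code:

    #include <stdbool.h>

    struct piece {
        int hash_cursor;    /* blocks [0, hash_cursor) already hashed */
        int num_blocks;
    };

    /* Blocks the hash cursor has passed can be evicted freely; an
     * unhashed block would have to be read back from disk when the
     * piece completes, so keep it resident. */
    static bool safe_to_flush(const struct piece *p, int block)
    {
        return block < p->hash_cursor;
    }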
While it's true that the Windows API seems to be the best thought through, I was surprised to learn that the implementation may fall back to synchronous IO in unpredictable ways, which (depending on the app, but likely for something like a bittorrent implementation that's juggling a lot of work) means you need a thread pool anyway.
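One way that fallback shows up in Win32 is an overlapped call completing inline instead of returning ERROR_IO_PENDING, so the caller has to handle both outcomes (a sketch, not exhaustive):

    #include <windows.h>

    /* Issue an overlapped read; it may complete synchronously (TRUE),
     * go asynchronous (FALSE + ERROR_IO_PENDING), or fail outright. */
    BOOL start_read(HANDLE h, void *buf, DWORD len, OVERLAPPED *ov)
    {
        if (ReadFile(h, buf, len, NULL, ov))
            return TRUE;    /* completed synchronously */
        return GetLastError() == ERROR_IO_PENDING;  /* pending vs. error */
    }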
Interesting. I've personally experienced this bug multiple times and have always been surprised when it happens, since Chrome is normally extremely responsive UI-wise even when under load.
We ran into issues where lots of threads attempting disk writes were causing latency problems. We were able to get around this by dedicating a single thread to disk IO.
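The shape of that fix, sketched with pthreads (initialization and shutdown omitted; not our production code). Producers enqueue buffers, and the one writer thread is the only thing that touches the file descriptor:

    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define QCAP 64

    struct wq {                     /* bounded queue; init omitted */
        char *bufs[QCAP]; size_t lens[QCAP];
        int head, tail, count;
        pthread_mutex_t mu;
        pthread_cond_t nonempty, nonfull;
        int fd;
    };

    void wq_push(struct wq *q, char *buf, size_t len)  /* producers */
    {
        pthread_mutex_lock(&q->mu);
        while (q->count == QCAP)
            pthread_cond_wait(&q->nonfull, &q->mu);
        q->bufs[q->tail] = buf;
        q->lens[q->tail] = len;
        q->tail = (q->tail + 1) % QCAP;
        q->count++;
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->mu);
    }

    void *writer_thread(void *arg)  /* the only thread that writes */
    {
        struct wq *q = arg;
        for (;;) {
            pthread_mutex_lock(&q->mu);
            while (q->count == 0)
                pthread_cond_wait(&q->nonempty, &q->mu);
            char *buf = q->bufs[q->head];
            size_t len = q->lens[q->head];
            q->head = (q->head + 1) % QCAP;
            q->count--;
            pthread_cond_signal(&q->nonfull);
            pthread_mutex_unlock(&q->mu);
            write(q->fd, buf, len); /* all disk IO serialized here */
            free(buf);
        }
        return NULL;
    }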
When node switched to libuv it didn't degrade IO performance on Unix hosts; IO in threads has been competitive with evented IO. I don't think async IO was even considered, but let me know if you dig up anything.
The problem is that we have a chicken-and-egg situation. Very few programs (including most enterprise databases) use AIO. Why? For portability reasons; there are other ways of doing things (i.e., using thread pools) that will let a database developer get all or most of the benefits of AIO, at least before the days of super fast storage. (Now that we have really fast PCIe-attached flash, fast enough that scheduler overhead starts becoming a real problem, this may no longer be true.)
As a result, there is little incentive to improve AIO (on all systems, but especially on Linux --- a lot of the Direct I/O work was done to make the enterprise database vendors happy). And since AIO isn't good enough, very few people want to use it, and since making it better is difficult, few people are interested in working to solve the problem, and the cycle repeats again.
Nice write-up. I suspect the poor implementation of async I/O reflects how rarely it's actually used in practice. Signal handling definitely feels like the wrong design here, especially for a library author.
I'm also not surprised that Windows fared better here; with IOCP they had a chance to redo async I/O completely.
It's also the extremely limited use case. There aren't that many I/O requests you can have outstanding to a disk at a time while still getting a performance speed-up. Having to use a thread pool ends up not being such a problem -- you don't really hurt your performance benchmarks. On the other hand, a system that talks on a network to thousands or millions of clients will benefit greatly from avoiding 4-8KB of stack overhead per connection.
4-8kB? Maybe physical memory overhead, if your code isn't too deep. But userspace thread stacks are anywhere from 128kB (FreeBSD) to 8MB (Linux) of virtual memory overhead.
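For what it's worth, the reservation is tunable per thread when you know the call depth is shallow (a sketch):

    #include <pthread.h>

    int spawn_small_stack(pthread_t *t, void *(*fn)(void *), void *arg)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        /* 128 KiB instead of glibc's 8 MiB default; PTHREAD_STACK_MIN
         * (from <limits.h>) is the hard floor. */
        pthread_attr_setstacksize(&attr, 128 * 1024);
        int rc = pthread_create(t, &attr, fn, arg);
        pthread_attr_destroy(&attr);
        return rc;
    }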
Depends on what libc (or other) routines you call... going over the end of the stack is no fun. Lots of code seems to be written to rely on deep stacks in userspace.
> The aio branch has several performance improvements apart from allowing multiple disk operations outstanding at any given time. For instance:
This sounds like a bad idea. If the improvements aren't tied to async, why weigh them down with the async albatross instead of merging them into the mainline?
This is a good writeup in general, but it fails to mention the fact that under glibc, POSIX AIO is implemented with a thread pool anyway. Only native (non-portable) Linux AIO is implemented by the kernel.
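For anyone who hasn't used it, the portable interface looks like this; under glibc everything below is serviced by a pool of userspace threads, not the kernel (minimal sketch, busy-waiting for brevity; link with -lrt on older glibc):

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);  /* hypothetical file */
        char buf[4096];

        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        aio_read(&cb);                        /* returns immediately */
        while (aio_error(&cb) == EINPROGRESS)
            ;                                 /* real code uses aio_suspend */
        printf("read %zd bytes\n", aio_return(&cb));
        return 0;
    }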
In general, unless you're doing something super high-performance, you should not bother with AIO. It's kind of one of those "if you have to ask, you don't need to know" situations.