Libtorrent experience - the poor state of async disk IO

arielweisberg · on Oct 26, 2012

My experience with Linux and buffered IO (ext4) from multiple threads has been very positive. The only beef I have is that you can't prevent the data you write/read from polluting the cache without resorting to madvise, which isn't available from Java. I don't usually care about the contents of the page cache, so it isn't a showstopper.
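For runtimes that can reach the POSIX layer, here is a minimal sketch of one way to limit cache pollution for buffered file IO: `posix_fadvise` with `POSIX_FADV_DONTNEED`, the file-descriptor counterpart to the madvise hint mentioned above. It is written in Python, which exposes the call as `os.posix_fadvise` on Unix. The function name `write_without_caching` is hypothetical, and the hint is only advisory, so the kernel may keep the pages anyway.

```python
import os

def write_without_caching(path, data):
    """Write data, then ask the kernel to drop its pages from the page cache."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            view = view[os.write(fd, view):]
        # Dirty pages cannot be dropped, so flush them to disk first.
        os.fsync(fd)
        # Advisory hint; offset=0 and length=0 cover the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

The data is still fully readable afterwards; the only intended difference is that later reads are more likely to come from disk than from the cache.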

You can do hundreds of thousands of random reads a second from a single thread submitting tasks to a thread pool on an in-memory data set. You can do tens of thousands of reads for a larger-than-memory data set with an SSD, and I was able to get the advertised number of 4K IOPS out of the SSD (Crucial m4) and an Intel i5 desktop CPU.
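The pattern described above (one thread submitting random reads to a pool of workers) can be sketched roughly as follows. This is an illustrative Python version, not the commenter's Java code: `random_reads`, `make_test_file`, the worker count, and the seed are all assumptions, and it makes no claim about the throughput figures quoted.

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096  # 4K reads, matching the IOPS figure discussed above

def random_reads(path, n_reads, workers=8, seed=0):
    """Submit n_reads random 4K reads from one thread to a pool of workers."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)
    try:
        n_blocks = os.fstat(fd).st_size // BLOCK
        offsets = [rng.randrange(n_blocks) * BLOCK for _ in range(n_reads)]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # os.pread takes an explicit offset, so the workers can share
            # one descriptor without racing on the file position.
            futures = [pool.submit(os.pread, fd, BLOCK, off) for off in offsets]
            return [(off, f.result()) for off, f in zip(offsets, futures)]
    finally:
        os.close(fd)

def make_test_file(n_blocks):
    """Create a temp file whose block i is filled with the byte i % 256."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for i in range(n_blocks):
            f.write(bytes([i % 256]) * BLOCK)
    return f.name
```

Each worker blocks in an ordinary buffered read, so the submitting thread is free to keep queueing work, which is the effect the comment describes.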

I frequently have to multiplex data as it becomes available into a single file (to keep the IO sequential for the disk and filesystem). I always use a thread per file, and I got up to 250 megabytes/sec on a 4-disk RAID-0. I don't currently have a use case that needs more sequential write throughput than that, so I haven't tried attaching more disks, and SSDs weren't as fast or common at the time.

My reading of buffered IO in Linux is that it translates to a combination of page cache interactions and async IO under the hood so we are technically always using async IO.

wmf · on Oct 26, 2012

Yeah, I would be more interested in hearing why libtorrent feels the need to implement their own disk cache. I'm sure they can do better, but by how much?

jeffffff · on Oct 26, 2012

i actually very much doubt that they can do better. writing any sort of large LRU cache in a machine with swap turned on is a bad idea because a lot of your cache will get swapped out and then swapped back in unnecessarily when you try to use that memory for something else. mlock can be used to mitigate this effect but the default mlock limit is so low that it's useless in practice. another thing to consider is that using mmap gives you a big advantage over writing your own disk cache because the OS can take advantage of the paging hardware in the processor, which you can't do from userspace.

arvidn · on Oct 27, 2012

one reason why the bittorrent layer could make better decisions on what to flush is that it always needs to hash pieces that are downloaded. When blocks are downloaded out of order, it doesn't make sense to flush unhashed blocks before ones that the hash cursor has already passed. If an unhashed block is flushed, it will have to be read back from disk when the piece completes, which is very expensive.
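The flush policy described above can be illustrated with a small sketch. This is not libtorrent's code: `PieceCache` and its methods are hypothetical names, and the tie-break among unhashed blocks (evict the one farthest from the cursor first) is an assumption made for the example.

```python
import hashlib

class PieceCache:
    """In-memory blocks of one piece, with a cursor marking hashed blocks."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.blocks = {}            # block index -> bytes still in memory
        self.hash_cursor = 0        # index of the next block the hasher needs
        self.hasher = hashlib.sha1()

    def add_block(self, index, data):
        self.blocks[index] = data
        # Advance the hash as far as contiguous in-memory blocks allow.
        while self.hash_cursor in self.blocks:
            self.hasher.update(self.blocks[self.hash_cursor])
            self.hash_cursor += 1

    def complete(self):
        return self.hash_cursor == self.num_blocks

    def flush_candidates(self, count):
        # Blocks behind the cursor are already hashed, so flushing them
        # never forces a read-back. Unhashed blocks are a last resort.
        hashed = sorted(b for b in self.blocks if b < self.hash_cursor)
        unhashed = sorted((b for b in self.blocks if b >= self.hash_cursor),
                          reverse=True)
        return (hashed + unhashed)[:count]
```

A general-purpose page cache has no notion of the hash cursor, which is the information advantage the comment is pointing at.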

evmar · on Oct 26, 2012

(Copy'n'paste of reddit comment:)

While it's true that the Windows API seems to be the best thought through, I was surprised to learn that the implementation may randomly fall back to synchronous IO in unpredictable ways, which (depending on the app, but likely for something that's attempting to juggle a lot of work like a bittorrent implementation) means you need a thread pool anyway.

http://neugierig.org/software/blog/2011/12/nonblocking-disk-...

mey · on Oct 26, 2012

Interesting, I've personally experienced this bug multiple times and always been surprised when it happens since Chrome is normally extremely responsive UI wise even when under load.

derleth · on Oct 27, 2012

My experience is the opposite: When Chrome starts to hit the disk, it freezes, especially on startup.

SnorkelTan · on Oct 26, 2012

Another good read about increasing disk IO and reducing latency can be found at the Mechanical Sympathy blog about the single writer principle: http://mechanical-sympathy.blogspot.com/2011/09/single-write...

We ran into issues where lots of threads attempting a disk write were causing latency problems. We were able to get around this by having a single dedicated thread to disk IO.
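A minimal sketch of that arrangement, assuming a simple in-process queue: any thread may enqueue data, but only one dedicated thread ever touches the file. The class name `SingleWriter` and its methods are hypothetical, and error handling is omitted.

```python
import queue
import threading

class SingleWriter:
    """One dedicated thread owns the file; other threads only enqueue bytes."""

    _STOP = object()  # sentinel telling the writer thread to finish

    def __init__(self, path):
        self._queue = queue.Queue()
        self._file = open(path, "ab")
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def write(self, data):
        # Safe to call from any thread; it never touches the disk itself.
        self._queue.put(data)

    def close(self):
        self._queue.put(self._STOP)
        self._thread.join()
        self._file.close()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self._file.write(item)  # appends stay sequential on disk
        self._file.flush()
```

Producers see only the cost of a queue insert, and the disk sees a single sequential stream instead of interleaved writes from many threads.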

wizard_2 · on Oct 26, 2012

I believe even NodeJS's libuv came to the same unfortunate conclusion for non-Windows hosts.

https://github.com/joyent/libuv

http://nikhilm.github.com/uvbook/filesystem.html

Tobu · on Oct 26, 2012

When node switched to libuv it didn't degrade io performance on unix hosts; io in threads has been competitive with evented io. I don't think async io was even considered, but let me know if you dig up anything.

tedsuo · on Oct 27, 2012

yep, node has always used a threadpool for filesystem calls, even before libuv.
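The same design (an event loop that hands blocking filesystem calls to a thread pool) can be sketched in Python's asyncio as an analogy. This is not node or libuv code; `read_file` and `read_all` are hypothetical helper names.

```python
import asyncio
import os
import tempfile

def read_file(path):
    # Ordinary blocking read; it runs on a pool thread, not the event loop.
    with open(path, "rb") as f:
        return f.read()

async def read_all(paths):
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor.
    tasks = [loop.run_in_executor(None, read_file, p) for p in paths]
    # Results come back in the same order as the input paths.
    return await asyncio.gather(*tasks)
```

From the caller's point of view the reads look asynchronous, even though every one of them is a plain synchronous call underneath.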

tytso · on Oct 27, 2012

The problem is that we have a chicken and egg problem. Very few programs (including most enterprise databases) use AIO. Why? For portability reasons; there are other ways of doing things (i.e., using thread pools) that will allow a database developer to get all or most of the benefits of AIO, at least before the days of super fast storage. (Now that we have really fast PCIe-attached flash which is fast enough that scheduler overhead starts becoming a real problem, this may no longer be true.)

As a result, there is little incentive to improve AIO (on all systems, but especially on Linux --- a lot of the Direct I/O work was done to make the enterprise database vendors happy). And since AIO isn't good enough, very few people want to use it, and since making it better is difficult, few people are interested in working to solve the problem, and the cycle repeats again.

mattgreenrocks · on Oct 26, 2012

Nice write-up. I suspect the poor implementation of async I/O suggests how often it is actually used in practice. Signal handling definitely feels like the wrong design here, especially for a library author.

I'm also not surprised that Windows fared better here; with IOCP they had a chance to redo async I/O completely.

wmf · on Oct 26, 2012

Or the disuse of async disk I/O is due to the difficulty of its proper implementation.

shrughes · on Oct 26, 2012

It's also the extremely limited use case. There aren't that many I/O requests you can issue to a disk at a time while still getting a performance speed-up. Having to use a thread pool ends up not being such a problem -- you don't really hurt your performance benchmarks. On the other hand, a system that talks on a network to thousands or millions of clients will benefit greatly from avoiding 4-8KB of stack overhead per connection.

throwaway54-762 · on Oct 26, 2012

4-8kB? Maybe physical memory overhead, if your code isn't too deep. But userspace thread stacks are anywhere from 128kB (FreeBSD) to 8MB (Linux) of virtual memory overhead.

shrughes · on Oct 26, 2012

Stacks can be made to be 4KB or 8KB if you want them to be.

throwaway54-762 · on Oct 30, 2012

Depends on what libc (or other) routines you call... going over the end of the stack is no fun. Lots of code seems to be written to rely on deep stacks in userspace.

j_s · on Oct 26, 2012

Alan McGovern chose a compromise for MonoTorrent, using async io but processing all the results in a single thread.

The Evolution of MonoTorrent - FOSDEM 2010

http://www.youtube.com/watch?v=TbhKpeqIy8o&t=10m10s

Simplified Threading API

http://monotorrent.blogspot.com/2008/10/monotorrent-050-good...

VMG · on Oct 26, 2012

the line breaks make it very difficult to read for me - here's a copy of the text: https://gist.github.com/3960408

gwern · on Oct 26, 2012

> The aio branch has several performance improvements apart from allowing multiple disk operations outstanding at any given time. For instance:

This sounds like a bad idea. If the improvements aren't tied to async, why weigh them down with the async albatross instead of merging them to the mainline?

klodolph · on Oct 26, 2012

No problem, that's why we have cherry-pick.

freyrs3 · on Oct 26, 2012

libeio makes some strides in this direction.

http://software.schmorp.de/pkg/libeio.html

willvarfar · on Oct 27, 2012

by using a thread pool, right?

cmccabe · on Oct 27, 2012

This is a good writeup in general, but it fails to mention the fact that under glibc, POSIX AIO is implemented with a thread pool anyway. Only native (non-portable) Linux AIO is implemented by the kernel.

In general, unless you're doing something super high-performance, you should not bother with AIO. It's kind of one of those "if you have to ask, you don't need to know," situations.