84% of a single-threaded 1KB write in Redis is spent in the kernel (nullspace.io)
158 points by antics on Feb 10, 2015 | 48 comments



This is why high-performance commercial databases do most or all of their I/O management and scheduling in userspace. It is not a new idea; it is much more efficient. However, that means you need to reimplement most of what the kernel does in an optimal way.
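One concrete ingredient of that approach is bypassing the kernel's page cache and doing your own caching and write scheduling in the application. A minimal sketch of just the bypass part, assuming Linux and Python 3 (the file name and block size are only for illustration; real engines layer their own buffer pool and I/O scheduler on top of this):

    import mmap
    import os

    # O_DIRECT requires the user buffer, file offset, and length to be
    # aligned to the device's logical block size; a page-aligned mmap'd
    # buffer of one page satisfies that on typical hardware.
    BLOCK = 4096
    buf = mmap.mmap(-1, BLOCK)      # anonymous, page-aligned buffer
    buf.write(b"x" * BLOCK)         # the data we want to persist

    # Bypass the page cache: the kernel no longer caches or schedules
    # this write, so the application has to manage its own buffering.
    fd = os.open("datafile.bin", os.O_RDWR | os.O_CREAT | os.O_DIRECT, 0o644)
    try:
        os.write(fd, buf)           # must be exactly one aligned block
        os.fsync(fd)                # still needed as a durability barrier
    finally:
        os.close(fd)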

This is relatively common for closed source software but you almost never see these types of userspace I/O designs in open source, which means OSS designs are often leaving integer factors worth of efficiency and performance on the table. For some use cases, companies are very successful selling into this efficiency and performance gap with closed source.

Part of the lack of open source is that these designs are not portable, due to OS dependencies and sometimes hardware dependencies if a design is hardcore. I think this is a partial cop-out; a Linux-only database engine would address the vast majority of real-world deployments. A bigger reason is that the design and implementation of these kinds of userspace kernels is a very high-skill, low-level art that, frankly, is way outside the expertise of most open source software contributors. For databases in particular, more often than not even the basic design elements of the software are naively done (e.g. MongoDB), and that is lower-hanging fruit.


Indeed. Oracle basically uses the OS just to bootstrap itself and hand over the initial chunk of memory, then manages everything internally. It even talks to NFS itself rather than going through VFS: it opens a socket from userland code to the NetApp and reads and writes the raw protocol.


Totally... this is why pipelining makes Redis 10x faster: fewer syscalls.

Basically, to make Redis much faster we need to work on three different related things:

1) Less kernel friction.

2) Threaded I/O: this is the part worth threading, with a global lock around query execution so you don't go crazy with concurrency and complex data structures. Memcached did it right.

3) Pipelining: better client support for pipelining, so that it's easy to tell the client in what order I need my replies, and unrelated replies can be glued together easily (see the sketch after this list for the syscall savings involved).
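Just to illustrate the syscall arithmetic behind point 3 (a rough sketch assuming a Redis server on 127.0.0.1:6379, not anyone's production code): N commands become one write() plus a handful of read()s instead of N round trips.

    import socket

    HOST, PORT, N = "127.0.0.1", 6379, 1000

    def set_cmd(key: bytes, val: bytes) -> bytes:
        # RESP encoding for: SET <key> <val>
        return b"".join([
            b"*3\r\n",
            b"$3\r\nSET\r\n",
            b"$%d\r\n%s\r\n" % (len(key), key),
            b"$%d\r\n%s\r\n" % (len(val), val),
        ])

    # Build N commands into a single buffer and send them with one syscall.
    batch = b"".join(set_cmd(b"key:%d" % i, b"x" * 1024) for i in range(N))

    s = socket.create_connection((HOST, PORT))
    s.sendall(batch)

    # Read until we've seen N "+OK\r\n" replies -- typically a few recv()
    # calls, instead of one write and one read per command.
    data, replies = b"", 0
    while replies < N:
        chunk = s.recv(65536)
        if not chunk:
            raise ConnectionError("server closed the connection")
        data += chunk
        replies = data.count(b"+OK\r\n")
    s.close()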


(Author here) Continuing our thread from Twitter, one of the things I wanted to say here is that things like pipelining and threading are good, but there is still improvement to be made at the kernel I/O layer. If we can reduce this latency, there could be very compelling cost incentives, especially at the datacenter level, and it might simplify application logic quite a bit.

This is one of the big reasons I like the paper I mention in this post.


Point (2) makes me think of LMDB. Have you looked into it much? I wonder if it would be an interesting storage substrate for a threaded Redis.


Point "2" better applies when the storage substrate is memory and operations are O(1) or logarithmic, because in that case, the time to serve the query is comparable small compared to the time needed to process the reply and send back the response. With an on-disk storage, I would go for a classic on-disk database setup where different queries are served by different threads.


LMDB beats all other "classic" on-disk databases for read performance. It happens to beat the in-memory systems for read performance too, since its reads require no locks.

http://symas.com/mdb/memcache/ http://symas.com/mdb/inmem/ http://symas.com/mdb/ondisk/


The concept of Redis has always baffled me. A hash table is a very fast data structure. As soon as you put it in a dedicated server, the cost of the actual lookup is instantly eclipsed by the need to parse a text protocol and do network I/O to communicate with the client.

So I'd be willing to say that the problem here isn't that the kernel stack is slow per se, but that the workload is so small that the overhead looks ridiculous, when it would be perfectly acceptable if your server did more actual work.
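It's easy to put rough numbers on that. A quick sketch comparing a local dict lookup against a Redis GET over loopback (assumes the redis-py client and a local server; absolute numbers obviously vary by machine, but the gap is a couple of orders of magnitude either way):

    import time
    import redis  # assumes redis-py is installed

    r = redis.Redis(host="127.0.0.1", port=6379)
    d = {f"key:{i}": "x" * 100 for i in range(10_000)}
    for k, v in d.items():
        r.set(k, v)

    def bench(fn, n=10_000):
        start = time.perf_counter()
        for i in range(n):
            fn(f"key:{i}")
        return (time.perf_counter() - start) / n

    print("local dict get:", bench(d.get))  # sub-microsecond per lookup
    print("redis GET     :", bench(r.get))  # tens of microseconds per call over loopback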


I took the article to be more a case of using Redis as an example of a system that is reaching the limits of performance _given the current state of the kernel it's running on_, rather than a direct comment on the performance of Redis itself.

Also, saying that using Redis is slower than using a local hash table is a truism. There are myriad reasons why using a local, in-memory data structure is not viable: scalability and persistence, for example. It's like saying "I don't need a database, I can store everything in a local variable."


Having worked on an embedded system that dealt primarily with network I/O (10 Gbps) and hash tables (in the 10 GiB range), I can tell you that network I/O (done right!) and parsing account for MAYBE 1/3 of the total latency of an operation involving a hash table lookup. Memory, like all mass storage, is SLOW once you're not working in cache, and hash tables have no locality.

(By "done right!" I mean either batching requests to/from the kernel, or using a zero-copy userspace solution like DPDK. Clearly in this article network I/O was not done right. Round trips through the kernel ALWAYS will kill performance.)


I've no doubt memory is slow, taking up to 100-300 cycles in NUMA systems. But a single-threaded server accessing local memory won't pay those costs as much. Are you saying that a few random memory accesses are slower than sending and receiving a packet between two machines?

Redis has a great position as a persistent, shareable data structure server, but replacing in-memory hash tables where they work doesn't seem like one of those cases.


Sure... on a 10 Gbps link, you can transfer up to 15 million packets per second (assuming a tiny payload... which is about all you need for a hash lookup). That's about the same order of magnitude as the latency of a single memory lookup that misses the cache. In the right environment (not TCP on a stock Linux kernel!) each packet carries very little CPU overhead.
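The ~15 million figure falls straight out of the framing overhead for minimum-size Ethernet frames; a quick back-of-the-envelope check (standard Ethernet constants, nothing measured):

    # Minimum frame on the wire: 64-byte frame + 8-byte preamble/SFD
    # + 12-byte inter-frame gap = 84 bytes = 672 bits per packet.
    link_bps = 10e9
    bits_per_min_packet = (64 + 8 + 12) * 8

    max_pps = link_bps / bits_per_min_packet
    print(f"{max_pps / 1e6:.2f} Mpps")  # ~14.88 Mpps, i.e. roughly 15 million packets/s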

Of course, you can pipeline memory accesses to some degree, but not as easily as you can aggregate network requests into fewer packets.

I'm certainly not saying the network overhead is free -- it's not! I'm just saying it needn't "eclipse" the hash table lookup itself (as the GP suggested). They're on the same order of magnitude.


Alright, but in the case of Redis, low-latency networking like RDMA or whatnot isn't available, right? So, today, off the shelf, installing Redis on a remote server, you're probably gonna be off by at least an order of magnitude, no?

Also, on the target server, you still have at least one cache miss to retrieve the item from the hashtable (and potentially more for large hashtables).

It would be nice if there were a commonly supported API for doing this kind of stuff that doesn't require exotic hardware and comes with a usable abstraction, something like TCP. I know off-the-shelf Intel NICs have "Direct NIC Access", which can easily do wireline work from userspace, but afaik that's only for Intel NICs, and the API isn't as smooth as most socket users are used to. We need something like SuperSockets, but written to target NIC hardware directly, I guess.


There are two groups of people who need to squeeze all the work they can out of every cycle: embedded programmers (which, despite our massively powerful phone processors, still includes mobile due to power issues), and cloud programmers. Cloud people are totally interested in optimizing everything to within an inch of its life, so it's a valid concern for them that even if they reduce their userspace costs to zero they still have limits imposed on them by the kernel. And cloud people are willing to put a lot of cleverness and work into their core pathways, and will get surprisingly close to the minimum time necessary, so you might be surprised how much they can get done in fractions of a microsecond in their core workload.

You may not have these problems, in which case, in addition to "the concept of Redis" baffling you, this will seem absurdly performance-sensitive. From the perspective of a desktop programmer, or of all but the most complicated websites, that is also sensible. But the niche in which this discussion makes perfect sense is itself pretty large.


Game developers also belong to that group of people who squeeze work out of every cycle. Less than 18 months ago, two of the most commonly used devices contained 512MB of RAM (with 10MB of VRAM) (Xbox 360) and 256MB of RAM plus 256MB of VRAM (PS3),

alongside very aged processors. The hardware was almost 10 years old in both cases. Even the current gen isn't particularly powerful, coming in at 8GB of RAM with a 1.75GHz processor and a GPU comparable to a 3-4 year old PC for the Xbox One, and 1.6GHz processors, 8GB of RAM and a slightly beefier GPU in the case of the PS4.


And HPC guys, and HFT guys, and graphics guys... Cloud schmoud.


The value of Redis is never going to be found with a single server. It's going to be found when you use Redis to synchronize the state of multiple servers.


Redis can actually be very useful on a single server, too, as a fast, robust, convenient, potentially shared datastore.

Performance is usually a secondary concern in these use-cases.


Yep, for data stores (caches, etc), Redis is awesome and super fast, though not as time-tested as memcached.

And all of the Redis set/list operations are super valuable. You just need to be careful once you start relying on Redis at scale for things you take for granted when you start playing around with it... For example: zunionstores on lots of large sorted sets. "O(N)+O(M log(M)) with N being the sum of the sizes of the input sorted sets, and M being the number of elements in the resulting sorted set." When you start off using it, it's awesome and super fast, but before you know it, the blocking, single-threaded architecture will crash and burn if your data scales up. Luckily we have clustering now :)
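For anyone who hasn't hit this yet, the call itself looks harmless in a client, but all of that O(N)+O(M log(M)) work runs inside Redis's single event loop and blocks every other client until it finishes. A tiny sketch of the shape of the call (assumes a recent redis-py and a local server; key names are made up):

    import redis

    r = redis.Redis()

    # Two "small" sorted sets during development...
    r.zadd("scores:2015-02-09", {f"user:{i}": float(i) for i in range(1000)})
    r.zadd("scores:2015-02-10", {f"user:{i}": float(i * 2) for i in range(1000)})

    # ...but in production each set might hold millions of members, and this
    # one command does all the merge work on the single Redis thread.
    r.zunionstore("scores:last-2-days", ["scores:2015-02-09", "scores:2015-02-10"])
    print(r.zcard("scores:last-2-days"))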


It makes a great inter-process work queue as well. I don't think you have to stretch your imagination very far to find interesting single server use cases for redis.


There are better, more reliable ways to do this than Redis. NSQ, for example, in a reactive (or log-based) pattern.

Regardless, your goal should be reducing shared state as much as possible, since synchronizing it among distributed systems is a Hard Problem(TM).


I've used Redis as bounded temporal storage for "log data" -- specifically, on an occasionally crashy Solaris box without the capacity to store a day's worth of tcpdump data, I would pipe it to Redis with timeouts and keep a 15-minute sliding window of data -- so when the host bit the dust, I had what I needed, stored safely. That was valuable.


Let's say you have an 8GB hash table that you want shared between 32 app servers. You are only setting or getting a few elements each request.

It doesn't make sense for each app server to hold an 8GB hash table in its own memory, nor can those copies easily be synchronized.


But if you do it right you don't parse anything: you take a chunk of bytes off the wire and just cast them to a struct, whose layout you know because you defined the protocol in the first place. This is how market data systems do it.
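The scripting-language analogue of "cast the bytes to a struct" is a fixed-offset decode with no text parsing at all. The message layout below is entirely made up, just to show the idea (in C you would literally cast the buffer pointer to a packed struct):

    import struct

    # Hypothetical fixed-layout market-data message, little-endian:
    #   u32 sequence | u16 symbol_id | u16 msg_type | u64 price_ticks | u32 quantity
    WIRE_FMT = struct.Struct("<IHHQI")  # 20 bytes, no delimiters, nothing to parse

    def decode(packet: bytes):
        # Reads the fields straight out of the buffer at fixed offsets.
        return WIRE_FMT.unpack_from(packet)

    raw = WIRE_FMT.pack(42, 7, 1, 1_234_500, 100)  # what the sender puts on the wire
    print(decode(raw))                             # (42, 7, 1, 1234500, 100)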


I've never used a system like Redis, but clearly the trade-off is one of performance for scalability.


This is why techniques like kernel bypass are used in high-throughput or low-latency systems like finance. Things like http://www.openonload.org/

This does tie you to a specific network vendor, but I can see the argument for moving more commodity networking hardware in this direction too.


Personally, I see a lot of value in kernel modules which let you bypass the kernel entirely for simple I/O. It requires more work on the application's end, and more libraries to support the disparate hardware, but it would help with many such activities.


This article from last month is relevant:

Improving Linux networking performance

https://lwn.net/Articles/629155/

HN discussion: https://news.ycombinator.com/item?id=8931431


InfiniBand was conceived around this issue. The additional overhead of kernel I/O includes the user/kernel space switch, copying between user/kernel buffers, and waiting/polling for interrupts. That stuff doesn't get any smaller as networks get faster, so today we're at the point where it dominates the actual hardware I/O time for many network devices.

There's been periodic interest in 'virtual hardware' where the hardware presents multiple interfaces to different users. This way the driver can run in user mode, since there's no need to control/share the hardware registers in the kernel.


Rian Hunter, Dropbox's third engineer, talks about the latency incurred by OpenSSL when they were designing and implementing their extremely high-performance Dropbox notification servers in the talk below. Also, the pitch he gives at the end for joining Dropbox is one of the most genuine and heartfelt I've ever seen.

http://www.youtube.com/watch?v=FBRIeoEr8GU


To be clear, 80% of kernel-time in that 1KB write is spent in fsync(), _not_ in the network stack. Network overhead is roughly similar between read & write requests. What Arrakis seems to be able to do is avoid the overhead of write & sync, presumably because it doesn't go through the VFS + filesystem + block IO code paths.
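You can see that split on an ordinary Linux box without any special tracing: time the write(2) and the fsync(2) separately for 1KB appends and the sync dominates by a couple of orders of magnitude. A rough sketch (Python 3; absolute numbers depend heavily on the disk and filesystem, and the file name is just for illustration):

    import os
    import time

    fd = os.open("appendonly.test", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    payload = b"x" * 1024
    N = 1000

    t_write = t_fsync = 0.0
    for _ in range(N):
        t0 = time.perf_counter()
        os.write(fd, payload)   # usually just dirties the page cache: microseconds
        t1 = time.perf_counter()
        os.fsync(fd)            # forces data (and metadata) to stable storage
        t2 = time.perf_counter()
        t_write += t1 - t0
        t_fsync += t2 - t1

    os.close(fd)
    print(f"avg write: {t_write / N * 1e6:.1f} us")
    print(f"avg fsync: {t_fsync / N * 1e6:.1f} us")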


>To be clear, 80% of kernel-time in that 1KB write is spent in fsync(), _not_ in the network stack

Are you sure?

Of the total 3.36 μs (see Table 1) spent processing each packet in Linux, nearly 70% is spent in the network stack


Table 1 is looking specifically at getting a chunk of data off the wire and into the user's code. Look at Table 2 for a comparison of Redis reads and writes: the average time for a write is 163 μs, of which 137 μs is spent in fsync(2).


Right, but you can't compare the Table 2 data to the network stack; Table 2 is only timing the Redis operations, which, as stated, take significantly less time than the time spent in the network stack.


Where exactly in the networking stack is the time being spent?

If it is in the IP/TCP layers then moving that to userspace does not, by itself, necessarily reduce latency; it merely shifts it elsewhere. If the latency is due to kernel/userland memory copies, then that is a different matter.


The point of moving it to userspace is that you can offload a lot of the jobs of the TCP/IP stack directly to hardware. In some cases this dramatically speeds up your I/O.

Consider, for example, that the dominant costs are things like demultiplexing and security checks. If you choose to implement multiplexing with virtual network cards then you get true zero-copy multiplexing, which is much faster than the software equivalent. And many of the security checks can be eliminated by using some combination of packet filters and logical disks. (The security part, BTW, seems to be one big difference from RDMA, which might be an alternative, but I'm not really an expert.)

Some things can't be offloaded to the hardware, like naming and access control. But that's fine.

(NB, I'm not arguing for this paper's position necessarily, I just thought it was interesting, and the motivation was good enough to start me thinking about how I might get around the kernel.)


The life of a data frame has become pretty complicated. Along the way it may pass through one or more layers of virtualization, one or more layers of network-specific mangling like iptables doing filtering and NAT, and a combo of the two in OS-based virtualized networks connecting VMs and containers both intra- and inter-host.

Pushing more of the stack into hardware is probably a good idea for single-tenant datacenters that can deploy a lot of e.g. Redis appliances, but those of us just renting capacity in the cloud are going to suffer from Amdahl's Law if you can only accelerate the part of the system adjacent to real hardware NICs.


You can make surprisingly significant gains by writing a TCP/IP stack that is intended for one process only. OSv is a kernel that supports a single process. By running memcached (just as one specific example) on OSv in KVM, you get better performance than by running it on the host. The trick is their network stack is significantly simpler and bypasses the kernel to go straight to the hardware. If you could do the same in user space, that's where you'd get your gains. (But I do agree with you - JUST moving to user-space is rather pointless).


I'd like to know that too. E.g., even if you have just one iptables rule that doesn't match your traffic, you'll still take a rather big performance hit in the network stack for everything.


While we're looking at OSDI proceedings, check out the IX paper: https://www.usenix.org/conference/osdi14/technical-sessions/... It shows how the kernel can provide high throughput and low latency without giving up sharing and protection.


When testing https://github.com/wyager/Neks , pipelining made the server something like 30x faster for the exact same reason. Syscalls are tremendously expensive compared to the work of processing a request.


Solutions that pull the TCP stack out of the kernel perform so much better because they're bypassing all the internal bureaucracy that the kernel otherwise performs to make it as easy as possible for userspace applications to use the network without stepping on other applications' toes.

The kernel socket API is designed so that programs have to do as little thinking as possible to get their own personal slice of the shared and noisy network. It provides an easy abstraction, and that requires the kernel do a lot of messy stuff for you:

- When you're using TCP sockets, the kernel makes copies of everything your application writes and holds them in a buffer until their receipt is acknowledged, just in case it needs to resend them. If the socket's buffer fills up, your application blocks on I/O until some space is freed.

- It holds ports open in a lingering state long after they're closed, just in case it needs to re-transmit the last bytes. This can be disabled, but it's on by default (see the small setsockopt sketch after this list).

- It takes care of all the congestion control for you, but it's tuned for the general case, and as a result there are a lot of edge cases which perform very badly for the problem they're trying to solve. Redis is probably one such edge case.
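For reference, the knobs behind the second point above are plain socket options; a small sketch (Python 3 -- and note that a zero-second SO_LINGER means close() sends an RST and throws away unsent data, so it's a trade-off, not a free win):

    import socket
    import struct

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Allow rebinding a port while old connections sit in TIME_WAIT.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

    # SO_LINGER with l_onoff=1, l_linger=0: close() resets the connection
    # immediately instead of lingering to retransmit the final bytes.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))

    # And disable Nagle, so small writes go out immediately -- another place
    # where the default "easy" behaviour trades latency for convenience.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)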

Of course, all of this is fine and desirable for general applications, but it ends up being problematic if you're trying to solve a problem where performance is the chief concern.

It's tempting to say the problem is that the kernel has to do way too much to provide that easy abstraction, but really the problem is that the kernel provides no way around it. You pretty much have the option of using its cushy stream abstraction at the cost of performance, or you use a userspace TCP stack on raw sockets, which requires running as root and disabling TCP in the kernel (otherwise the kernel stomps all over your TCP negotiations[1]).

There are some other transport layer protocols (SCTP, DCCP, etc.), as well as application layer protocols built on UDP, that remove some of the abstractions TCP provides and as a result require less in-kernel bureaucracy, but those solutions don't seem to be very popular or well-supported.

It would be nice if the kernel would provide some lower level system calls that could be selectively used to move parts of TCP into the application (e.g., retaining copies of data in case of re-transmission). Alas, I don't think there's much push for that, because a) it's hard, and b) the current situation is fine for 99% of network applications.

[1] http://jvns.ca/blog/2014/08/12/what-happens-if-you-write-a-t...


maybe relevant: http://info.iet.unipi.it/~luigi/netmap/ a "framework for high speed packet I/O implemented as a kernel module for FreeBSD and Linux"


I am not very familiar with OSes and things at the kernel level; can anyone answer the question in the comments of the post?

"How did you measure the time spent in each section (HW, kernel, app)? how did you get such granularity?"


From the Arrakis paper: "To analyze the sources of overhead, we record timestamps at various stages of kernel and user-space processing." You can probably implement this with something like perf dynamic tracing: http://www.brendangregg.com/perf.html#DynamicTracing


I wonder if anyone has tried it with Solarflare?


Solarflare has a whitepaper on accelerating memcached: http://10gbe.blogspot.com/2014/12/memcached-3x-faster-than-i...

I just whipped this up in 5 minutes and didn't do much tuning (e.g. no isolcpus or interrupt changes), but here's a single-client 1024-byte SET redis-benchmark running against localhost with and without TCP loopback acceleration... Redis 2.8.4 on a dual E5-2630 @ 2.30GHz, the card is an SFN5122F, but this is all loopback. I'm not claiming anything and just doing it because somebody pondered...

   * plain jane
  /usr/bin/redis-server
  redis-benchmark -t set -q -n 1000000 -d 1024 -c 1
  SET: 21258.96 requests per second

   * unaccelerated server, unaccelerated client
  numactl --physcpubind 1,3,5 --preferred 1 /usr/bin/redis-server
  numactl --physcpubind=7,9 --preferred 1 redis-benchmark -t set -q -n 100000 -d 1024 -c 1
  SET: 14293.88 requests per second

   * TCP loopback accelerated server, unaccelerated client
  EF_NAME=hn EF_TCP_SERVER_LOOPBACK=2 EF_TCP_CLIENT_LOOPBACK=2 onload -p latency numactl --physcpubind 1,3,5 --preferred 1 /usr/bin/redis-server
  numactl --physcpubind=7,9 --preferred 1 redis-benchmark -t set -q -n 100000 -d 1024 -c 1
  SET: 25967.28 requests per second

   * TCP loopback accelerated server, accelerated client
  EF_NAME=hn EF_TCP_SERVER_LOOPBACK=2 EF_TCP_CLIENT_LOOPBACK=2 onload -p latency numactl --physcpubind 1,3,5 --preferred 1 /usr/bin/redis-server
  EF_NAME=hn onload -p latency numactl --physcpubind=7 --preferred 1 redis-benchmark -t set -q -n 1000000 -d 1024 -c 1
  oo:redis-benchmark[13454]: Sharing OpenOnload 201405-u1 Copyright 2006-2012 Solarflare Communications, 2002-2005 Level 5 Networks [4,hn]
  SET: 96098.41 requests per second
Edit: formatting fixes


Not bad considering it's "for free". Thanks a lot for the post.



