Their system works well on a RISC-V core: deeply pipelined but in-order, running at only 50 MHz, with memory that is very fast relative to its clock speed. I'd like to see results compared against a more representative CPU implementation. For example, they could have used a Xilinx Zynq, whose hard Cortex-A9 cores are out-of-order and superscalar, and run at a much higher frequency relative to the memory speed.
I think this paper vastly underestimates memory constraints of higher performance systems.
From a quick skim, the Xilinx work synthesises the entire network stack (including TCP) in hardware, unlike the above study, which only supports UDP in the HW traffic manager.
The numbers in the Xilinx paper are more attractive than what this study found, and it also includes power measurements (joules per request is one metric on which FPGAs and dedicated hardware do quite well compared to software).
That said, much of the latency gain likely comes from bypassing the OS kernel and its generalised network stack. There is plenty of existing work (unfortunately also not referenced in this paper) that does this and achieves very low latency, albeit, to be fair, on x86 hardware. (Examples: Arrakis [1], IX [2] and MICA [3].)
At first, I was impressed they reduced latency by 10x:
Our initial evaluation with a realistic workload shows a 10x improvement in latency for 40% of requests without adding significant overhead to the remaining requests.
And then I re-read this claim, did a little math, and realized that they only reduced mean latency by 36%.
"10x for 40% of requests" is a skeezy way of saying 36%.
Nothing revolutionary, but someone had to do it. (Of course you would take something expensive in software and implement the logic in hardware. Of course you can get by implementing only the most common requests: GETs to a small subset of keys.)
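The dispatch logic is essentially this (a minimal Python sketch of my own; the hot-set contents and request fields are assumptions, not details from the paper):

    # Hypothetical GET fast path: serve hot keys from a small cache on the
    # accelerator, fall back to the software path for everything else.
    HOT_SET = {"user:123": b"...", "session:abc": b"..."}  # assumed hot subset

    def handle(op, key, software_path):
        if op == "GET" and key in HOT_SET:
            return HOT_SET[key]        # fast path: the "10x" 40% of requests
        return software_path(op, key)  # slow path: latency unchanged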
I don't intend to criticize this paper in particular, but in general I don't see small performance improvements in this kind of software as very useful for society. Academia just becomes a research arm of corporations that might even be a net negative for society: eroding privacy rights (Facebook et al.) or introducing volatility into stock markets (HFT could use this paper's insight just as fruitfully).
The main source of latency will be the network. The real problem is synchronous GET requests, since then performance == latency. Better to go async than to shave latency with hardware acceleration.
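To illustrate, here is a runnable toy; the client is fake, simulating an assumed 1 ms round trip, but a real async memcached client would follow the same pattern:

    import asyncio
    import time

    class FakeClient:
        """Stand-in for a real client; each GET costs one network round trip."""
        RTT = 0.001  # assumed 1 ms round trip, just for illustration

        async def get(self, key):
            await asyncio.sleep(self.RTT)  # simulate the network round trip
            return f"value-for-{key}"

    async def fetch_sequential(client, keys):
        # Synchronous pattern: each GET waits for the previous one,
        # so performance == latency and total time ~ n * RTT.
        return [await client.get(k) for k in keys]

    async def fetch_pipelined(client, keys):
        # Async pattern: all GETs are in flight at once,
        # so total time ~ 1 RTT regardless of n.
        return await asyncio.gather(*(client.get(k) for k in keys))

    async def main():
        client, keys = FakeClient(), [f"k{i}" for i in range(100)]
        for fetch in (fetch_sequential, fetch_pipelined):
            t0 = time.perf_counter()
            await fetch(client, keys)
            print(f"{fetch.__name__}: {time.perf_counter() - t0:.3f}s")

    asyncio.run(main())

The sequential version takes roughly 100 round trips; the pipelined one takes roughly one. No hardware needed to close that gap.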
I have to agree with moru here. The latency of memory access will be negligible compared to the latency of any I/O operation, even within the same data center. In my experience, anything that involves the OS is >> 1 µs.
Also, beware of anything that declares a 10x performance improvement.
You are assuming that a high-performance, latency-tuned system uses the OS network stack. That would be a pretty naive implementation. Offerings from Solarflare (OpenOnload) and Exablaze (ExaSock) transparently offload and bypass the kernel stack. Quoted performance numbers are around 1 µs, of which 500 ns is spent getting up and down the PCIe bus. Offloading to the NIC makes a whole lot of sense in this space, except that it has already been done, and with more compelling performance. The authors completely failed to take existing work into account.