Their system works well on a RISC-V core: deeply pipelined but in-order, running at only 50 MHz, with memory that is very fast relative to its clock speed. I'd like to see results compared against a more representative CPU implementation. For example, they could have used a Xilinx Zynq, whose hard Cortex-A9 cores are out-of-order and superscalar, and run at a much higher frequency relative to the memory speed.
I think this paper vastly underestimates memory constraints of higher performance systems.
From a quick skim, the Xilinx work synthesises the entire network stack (including TCP) in hardware, unlike the above study, which only supports UDP in the HW traffic manager.
The numbers in the Xilinx paper are more attractive than what this study found, and it also includes power measurements (joules per request is one metric on which FPGAs and dedicated hardware do quite well compared to software).
That said, much of the latency gain likely comes from bypassing the OS kernel and its generalised network stack. There is plenty of existing work (unfortunately also not referenced in this paper) that does this and achieves very low latency, albeit, to be fair, on x86 hardware. (Examples: Arrakis [1], IX [2] and MICA [3].)
At first, I was impressed they reduced latency by 10x:
Our initial evaluation with a realistic workload shows a 10x improvement in latency for 40% of requests without adding significant overhead to the remaining requests.
And then I re-read this claim, did a little math, and realized that they only reduced mean latency by 36%.
"10x for 40% of requests" is a skeezy way of saying 36%.
Nothing revolutionary, but someone had to do it. (Of course you would take something expensive in software and implement the logic in hardware. Of course you can get by implementing only the most common requests: GETs to a small subset of keys.)
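The dispatch logic is essentially this (a minimal Python sketch of my own; the hot-set contents and request fields are assumptions, not details from the paper):

    # Hypothetical GET fast path: serve hot keys from a small cache on the
    # accelerator, fall back to the software path for everything else.
    HOT_SET = {"user:123": b"...", "session:abc": b"..."}  # assumed hot subset

    def handle(op, key, software_path):
        if op == "GET" and key in HOT_SET:
            return HOT_SET[key]        # fast path: the "10x" 40% of requests
        return software_path(op, key)  # slow path: latency unchanged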
I don't intend to criticize this paper in particular, but in general I don't see small performance improvements in this kind of software as very useful for society. Academia just becomes a research arm of corporations that might even be a net negative for society: eroding privacy rights (Facebook et al.) or introducing volatility into stock markets (HFT could use this paper's insight just as fruitfully).
The main source of latency will be the network. The real problem is synchronous GET requests, since then performance == latency. Better to go async than to shave latency with hardware acceleration.
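To illustrate, here is a runnable toy; the client is fake, simulating an assumed 1 ms round trip, but a real async memcached client would follow the same pattern:

    import asyncio
    import time

    class FakeClient:
        """Stand-in for a real client; each GET costs one network round trip."""
        RTT = 0.001  # assumed 1 ms round trip, just for illustration

        async def get(self, key):
            await asyncio.sleep(self.RTT)  # simulate the network round trip
            return f"value-for-{key}"

    async def fetch_sequential(client, keys):
        # Synchronous pattern: each GET waits for the previous one,
        # so performance == latency and total time ~ n * RTT.
        return [await client.get(k) for k in keys]

    async def fetch_pipelined(client, keys):
        # Async pattern: all GETs are in flight at once,
        # so total time ~ 1 RTT regardless of n.
        return await asyncio.gather(*(client.get(k) for k in keys))

    async def main():
        client, keys = FakeClient(), [f"k{i}" for i in range(100)]
        for fetch in (fetch_sequential, fetch_pipelined):
            t0 = time.perf_counter()
            await fetch(client, keys)
            print(f"{fetch.__name__}: {time.perf_counter() - t0:.3f}s")

    asyncio.run(main())

The sequential version takes roughly 100 round trips; the pipelined one takes roughly one. No hardware needed to close that gap.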
I have to agree with moru here. The latency of memory access will be negligible compared to the latency of any I/O operation, even within the same data center. In my experience, anything that involves the OS is >> 1 µs.
Also, beware of anything that declares a 10x performance improvement.
You are assuming that a high-performance, latency-tuned system uses the OS network stack. That would be a pretty naive implementation. Offerings from Solarflare (OpenOnload) and Exablaze (ExaSock) transparently offload and bypass the kernel stack. Quoted performance numbers are around 1 µs, of which 500 ns is spent getting up and down the PCIe bus. Offloading to the NIC makes a whole lot of sense in this space, except that it has already been done, and with more compelling performance. The authors completely failed to take existing work into account.