The main source of latency will be the network. The main problem is synchronous GET requests, since then performance == latency. Better to go async than to reduce latency with hardware acceleration.
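To make the sync vs. async point concrete, here is a rough sketch in Python. It is not taken from any real client: `fake_get`, `LATENCY_S`, and the 10 ms figure are hypothetical stand-ins, and the "network" is just a sleep. It only shows that issuing requests one after another costs roughly N round trips, while keeping them all in flight costs roughly one.

```python
import asyncio
import time

# Assumed round-trip latency per request (hypothetical value for illustration).
LATENCY_S = 0.01


async def fake_get(key):
    # Stand-in for a network GET: sleeps to model one round trip.
    await asyncio.sleep(LATENCY_S)
    return f"value-for-{key}"


async def sequential(keys):
    # Each request waits for the previous one to finish,
    # so total time is about len(keys) * LATENCY_S.
    return [await fake_get(k) for k in keys]


async def concurrent(keys):
    # All requests are in flight at once,
    # so total time is about one LATENCY_S.
    return await asyncio.gather(*(fake_get(k) for k in keys))


def timed(coro_fn, keys):
    # Run one of the strategies above and return (results, elapsed seconds).
    start = time.perf_counter()
    result = asyncio.run(coro_fn(keys))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    keys = list(range(20))
    _, seq_t = timed(sequential, keys)
    _, con_t = timed(concurrent, keys)
    print(f"sequential: {seq_t:.3f}s  concurrent: {con_t:.3f}s")
```

With 20 keys the sequential version should take on the order of 20 round trips and the concurrent one about a single round trip, which is the sense in which shaving microseconds off each request matters less than not blocking on it.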
I have to agree with moru here. The latency of memory access will be negligible compared with the latency of any I/O operation, even within the same data center. In my experience anything that involves the OS is >> 1us.
Also beware of anything that claims a 10x performance improvement.
You are assuming that a high-performance, latency-tuned system is using the OS network stack. That would be a pretty naive implementation. Offerings from Solarflare (OpenOnload) and Exablaze (ExaSock) transparently offload and bypass the kernel stack. Quoted performance numbers are around 1us, of which 500ns is spent getting up and down the PCIe bus. Offloading to the NIC makes a whole lot of sense in this space, except that it has already been done, and with more compelling performance. The authors completely failed to take existing work into account.