Just tested the N instances approach with Redis in practice: what I get on similar hardware is 100k SET/GET operations per second per core, so using the same box dormando used for the test we should get 400k operations per second.
Note that you'll see these numbers for both SETs and GETs, as with the single-process approach of Redis there is no contention. With memcached instead you see different numbers for SETs and GETs; I guess this is due to some kind of locking happening. It's a tradeoff: with threads you have more power against a single "object", but at the cost of more complexity and less linear scalability. Anyway, it's still possible to run one memcached process per core, and I would like to see these results as well.
Currently I don't have access to a box with 8 different cores like the one used in the test, so I can't fully replicate it. I hope dormando will update the results, adding a test that uses four Redis instances.
I don't get it. Redis isn't optimized to be a key/val store like Memcached. Redis does a hell of a lot that Memcached doesn't do, so if we added those operations to the chart, Memcached would be flatlined. If I need a fast key/val cache, Memcached is the obvious choice. If I need something different, like a fast set of in-memory, disk-backed queues, or hash tables, or sets, I'll use Redis. It's like saying: look, this hammer puts nails in the wall faster than banging them in with this multitool.
I'm not sure why memcached can't saturate all the threads with the async benchmark, but if you want to maximize everything in your test involving multiple processes, you should also run four Redis nodes at the same time and run every redis-benchmark against a different instance.
We tried, and this way you get very high numbers for Redis, but it's still wrong, as the result becomes very dependent on which core is running the benchmark and on whether it is the same core as the server. A better setup is two boxes linked with gigabit ethernet: run the N clients on one box and the N server threads (be it a single memcached process with N threads, or N Redis processes) on the other box.
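For reference, the four-client setup looks roughly like this; a quick sketch assuming four redis-server instances already listening on ports 6379-6382 and redis-benchmark in the PATH:

```c
/* Sketch: launch one redis-benchmark per Redis instance, in parallel.
 * Ports and request count are example values, adjust for your setup. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    for (int i = 0; i < 4; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            char port[16];
            snprintf(port, sizeof(port), "%d", 6379 + i);
            /* one benchmark client per instance */
            execlp("redis-benchmark", "redis-benchmark",
                   "-p", port, "-n", "100000", (char *)NULL);
            perror("execlp");
            _exit(1);
        }
    }
    /* wait for all four clients, then aggregate their output by hand */
    while (wait(NULL) > 0)
        ;
    return 0;
}
```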
Running four instances of redis is not the same as running one instance of memcached with four threads.
Four times the number of processes makes for 1/4 the effectiveness of multiget or multiset on average, since your keys are now spread across several process boundaries. In good clients, that's the difference between a single packet to a single destination and four packets to four different destinations, plus result processing.
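To make that concrete, here's a rough sketch of what a sharding client ends up doing (toy hash function and shard count, not any real client's code):

```c
/* Sketch: a multiget against N shards degrades into N smaller gets.
 * Hypothetical client-side key routing, illustrative only. */
#include <stdio.h>
#include <string.h>

#define NSHARDS 4

/* toy hash: a real client would use e.g. CRC or consistent hashing */
static unsigned shard_for(const char *key) {
    unsigned h = 5381;
    while (*key) h = h * 33 + (unsigned char)*key++;
    return h % NSHARDS;
}

int main(void) {
    const char *keys[] = {"user:1", "user:2", "user:3", "user:4"};
    int n = 4;
    /* one logical multiget becomes up to NSHARDS separate requests */
    for (unsigned s = 0; s < NSHARDS; s++) {
        printf("GET batch for shard %u:", s);
        for (int i = 0; i < n; i++)
            if (shard_for(keys[i]) == s)
                printf(" %s", keys[i]);
        printf("\n");
    }
    return 0;
}
```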
Scaling horizontally is great, but scaling vertically is also useful in practice: it lets you give the process more memory to handle more requests on fewer connections.
A bit more than a thread per core is faster. This is demonstrated both by dormando's run of your test and by comparing the throughput of your test against the mc-hammer tool (http://github.com/dustin/mc-hammer) I use for memcached. It can push well over twice as many ops on my dual-core laptop as it can with a single core (which is in the neighborhood of a single instance of your test app).
It wouldn't take a lot of work to get your benchmark tool natively supporting multiple threads: kill off the global config, split out one per thread, and then allow each connection to find its config. Use atomics for recording stats and there should be nearly no cross-thread communication (this is what I did for mc-hammer).
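Something like this, as a rough sketch (the names are made up, not your tool's actual internals):

```c
/* Sketch: per-thread config plus atomic stats, so threads never share
 * mutable state except the counters. Names are hypothetical. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4

static atomic_ulong total_ops;    /* the only cross-thread state */

struct worker_config {            /* one per thread, no global config */
    int id;
    unsigned long requests;
};

static void *worker(void *arg) {
    struct worker_config *cfg = arg;
    for (unsigned long i = 0; i < cfg->requests; i++) {
        /* ... issue one request on this thread's own connection ... */
        atomic_fetch_add_explicit(&total_ops, 1, memory_order_relaxed);
    }
    return NULL;
}

int main(void) {
    pthread_t tids[NTHREADS];
    struct worker_config cfgs[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        cfgs[i] = (struct worker_config){ .id = i, .requests = 100000 };
        pthread_create(&tids[i], NULL, worker, &cfgs[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    printf("ops: %lu\n", (unsigned long)atomic_load(&total_ops));
    return 0;
}
```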
That'd be useful for your own test cases as well as an apples-to-apples comparison with memcached.
Hello, your reasoning is, I guess, what led memcached to use a multi-threaded approach. I don't agree with this design choice for a few reasons: in Redis we saw that multi gets are anything but a heavily used primitive. You can still group things that are often requested together so that they'll end up in the same instance (we have an explicit concept for this, called hash tags). And we have hashes for objects composed of many fields: since it's a single key, every kind of hashing algorithm will still lead to the object being stored together.
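A rough sketch of the hash tag idea, assuming the convention of hashing only the part of the key between braces (illustrative, not the exact Redis code):

```c
/* Sketch: with hash tags, only the {...} part of the key is hashed,
 * so related keys land on the same instance. Illustrative only. */
#include <stdio.h>
#include <string.h>

static unsigned hash(const char *s, size_t len) {
    unsigned h = 5381;
    while (len--) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* hash only the substring between '{' and '}' if present */
static unsigned key_hash(const char *key) {
    const char *open = strchr(key, '{');
    if (open) {
        const char *close = strchr(open + 1, '}');
        if (close && close > open + 1)
            return hash(open + 1, (size_t)(close - open - 1));
    }
    return hash(key, strlen(key));
}

int main(void) {
    /* both keys hash on "user:1000", so they map to the same instance */
    printf("%u\n", key_hash("{user:1000}.profile") % 4);
    printf("%u\n", key_hash("{user:1000}.friends") % 4);
    return 0;
}
```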
In contrast, see the benefits of a single-threaded approach: zero contention, the ability to manipulate complex data structures without locking, and more stability (fewer bugs) for the same amount of features.
In my opinion it's a no-brainer... but it's also a matter of taste / vision / priorities.
About benchmarking with a multi-threaded tool, this is something I'm going to investigate today, because something is strange here.
Let's assume that a multiplexing benchmark is not enough: why was it still able to saturate Redis better than memcached? I don't have a good answer to this question, but I want to understand what's happening.
Also, running multi-threaded benchmarks against multi-threaded servers is going to give results that are much different from what you'll get in the real world, where there is a network link in between: when both the benchmark thread and the server thread are on the same core, I think the I/O system calls start to be much faster.
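One way to control for this on Linux is to pin the benchmark process to an explicit core; a minimal sketch using sched_setaffinity (the core number is just an example):

```c
/* Sketch: pin the current process to one core so the client/server
 * core placement is explicit instead of left to the scheduler. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    int core = 2;                    /* example core, adjust as needed */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core %d\n", core);
    /* ... run the benchmark loop from here ... */
    return 0;
}
```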
It's a shame I don't have the hardware to perform the two-boxes test, as it could be really cool for Redis and memcached: a much better indicator of what the current limits are with these two systems, and with TCP servers in general.