I've implemented counters exactly like this. I actually migrated away from the tls_counter to something like the cas_multi_counter.
The problem with the tls_counter is memory usage that scales with the number of threads. In this example you need 16 bytes per thread because only one counter is supported; in our case I believe we ended up with 48 bytes per thread. So with 10,000 threads you can end up with >=160KB of data for every counter. And 10,000 threads isn't that far-fetched if most are sleeping (though not ideal). Then you end up with 10,000 counters and you're spending >=1.6 GB of RAM on counters, whereas the cas_multi_counter would only be using 40.96 MB (4096 bytes x 10,000 counters). And read() scales with the number of threads, so it can get pretty slow, but that is secondary.
One optimization you can use to make the cas_multi_counter more memory efficient is to share the backing memory. You have 4096 bytes, but are only using the first 8 bytes of each 64-byte stripe. You can multiplex 8 counters into the same 4096-byte storage: the first counter gets offset 0, the second offset 8, and so on. This only adds one indirection in the hot operator++() path, and doesn't add any extra false sharing as long as the 4096-byte buffer is 64-byte aligned. Construction and destruction are more expensive, but not terribly so, and they are probably cold anyway. Now you're at 512 bytes per counter, and the size is independent of the cache-line size. If you can count the number of active CPUs you can dynamically size your buffer to only have as many cache lines as there are CPUs, further saving memory.
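Roughly, the multiplexing could look like this (the names StripedArena/MuxCounter and the stripe-selection strategy are made up for the sketch, and a relaxed fetch_add stands in for the article's CAS loop):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

// A shared, 64-byte-aligned 4096-byte arena backs up to 8 counters:
// counter k owns byte offset k*8 within every 64-byte stripe.
struct alignas(64) StripedArena {
    static constexpr std::size_t kStripes      = 64;  // 4096 / 64
    static constexpr std::size_t kSlotsPerLine = 8;   // 64 / sizeof(uint64_t)
    std::atomic<std::uint64_t> slots[kStripes * kSlotsPerLine] = {};
};

class MuxCounter {
public:
    // id in [0, 8): which 8-byte column of every stripe this counter owns.
    MuxCounter(StripedArena& arena, std::size_t id) : arena_(arena), id_(id) {}

    // One extra indirection vs. a dedicated buffer: pick a stripe for the
    // current thread, then the slot owned by this counter within that stripe.
    void operator++() {
        std::size_t s = stripe_for_this_thread();
        arena_.slots[s * StripedArena::kSlotsPerLine + id_]
            .fetch_add(1, std::memory_order_relaxed);
    }

    std::uint64_t read() const {
        std::uint64_t sum = 0;
        for (std::size_t s = 0; s < StripedArena::kStripes; ++s)
            sum += arena_.slots[s * StripedArena::kSlotsPerLine + id_]
                       .load(std::memory_order_relaxed);
        return sum;
    }

private:
    // Placeholder mapping: hash the thread id. A CPU-based mapping (see the
    // discussion further down) spreads the load better.
    static std::size_t stripe_for_this_thread() {
        return std::hash<std::thread::id>{}(std::this_thread::get_id())
               % StripedArena::kStripes;
    }

    StripedArena& arena_;
    std::size_t   id_;
};
```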
Recently I've found that uncontended atomics are less expensive than back-to-back tests would suggest, so the real-world performance of TLS vs uncontended atomics might be closer than the numbers in the article would imply (assuming you aren't actually in the back-to-back [1] scenario).
If you really want to gold plate things, you could imagine a hybrid strategy that initially uses a shared CAS-based counter for new threads, but moves a thread to a private TLS counter if it is using the counter heavily, for some definition of "heavily". Revocation, where you decide that a thread no longer deserves a dedicated counter, is trickier!
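Very roughly, the promotion half could look like this (names and threshold are made up; the sketch assumes a single global counter because the thread_local is shared across instances, and it skips revocation and slot cleanup entirely):

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

class HybridCounter {
public:
    void operator++() {
        thread_local State st;
        if (st.slot) {
            // Promoted: only the owning thread writes this slot, so the
            // relaxed add stays uncontended.
            st.slot->fetch_add(1, std::memory_order_relaxed);
        } else if (++st.bumps >= kPromoteAt) {
            st.slot = register_slot();  // cold, one-time path
            st.slot->fetch_add(1, std::memory_order_relaxed);
        } else {
            // Not yet "heavy": fall back to the shared atomic.
            shared_.fetch_add(1, std::memory_order_relaxed);
        }
    }

    std::uint64_t read() const {
        std::uint64_t sum = shared_.load(std::memory_order_relaxed);
        std::lock_guard<std::mutex> g(mu_);
        for (auto* s : slots_) sum += s->load(std::memory_order_relaxed);
        return sum;
    }

private:
    struct State { std::uint32_t bumps = 0; std::atomic<std::uint64_t>* slot = nullptr; };
    static constexpr std::uint32_t kPromoteAt = 1024;  // "heavily", picked arbitrarily

    std::atomic<std::uint64_t>* register_slot() {
        std::lock_guard<std::mutex> g(mu_);
        // Leaked and unpadded for brevity; real code needs cleanup and
        // cache-line separation between slots.
        slots_.push_back(new std::atomic<std::uint64_t>(0));
        return slots_.back();
    }

    std::atomic<std::uint64_t> shared_{0};
    mutable std::mutex mu_;
    std::vector<std::atomic<std::uint64_t>*> slots_;
};
```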
---
[1] Roughly, "not back to back" means that your atomics are spaced out by more than 20 cycles or so.
I'm fairly sure that the spacing between the counters is so that writes on separate cores don't fight over the cache line coherency logic. I'm not sure how you conclude that putting multiple counters into the same line avoids false sharing. That looks like a text-book case of false sharing to me.
If memory is an issue, is your suggestion going to behave any better than simply collapsing each counter down to 8 bytes per thread? Perhaps a bit if there's a wide disparity between counter frequencies and the independent counters tend not to run simultaneously, otherwise you now have contention between mostly unrelated tasks, which are now interacting less subtly.
I believe the parent means you can pack multiple unrelated counters intended for the same core into one cache line. The cache line is still uncontended; you're just storing more in it.
If you're building these things up from scratch this should be obvious, but if you're pulling in a library it's easy to not realize how wasteful it is being.
> I'm fairly sure that the spacing between the counters is so that writes on separate cores don't fight over the cache line coherency logic. I'm not sure how you conclude that putting multiple counters into the same line avoids false sharing. That looks like a text-book case of false sharing to me.
Let's say you have 8 independent counters and 4 CPUs. Each counter gets 4 slots in a shared 4-cache-line = 256-byte storage. Counter0 gets byte offsets 0, 64, 128, and 192. Counter1 gets offsets 8, 72, 136, and 200. And so on. Then you map CPU0 to slot 0 (the first cache line), CPU1 to slot 1, CPU2 to slot 2, and CPU3 to slot 3.
Since we share that mapping between all counters, any bump of any counter on CPU0 touches only cache line 0, so there isn't any false sharing.
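In code, the index math for that example (8 counters, 4 CPUs, 256 bytes of shared storage) is just:

```cpp
#include <cstddef>

constexpr std::size_t kLineSize  = 64;  // bytes per cache line
constexpr std::size_t kSlotWidth = 8;   // bytes per counter value

// Byte offset of counter `counter_id` within the cache line owned by `cpu`.
constexpr std::size_t slot_offset(std::size_t cpu, std::size_t counter_id) {
    return cpu * kLineSize + counter_id * kSlotWidth;
}

static_assert(slot_offset(0, 0) == 0,   "Counter0, CPU0");
static_assert(slot_offset(3, 0) == 192, "Counter0, CPU3");
static_assert(slot_offset(0, 1) == 8,   "Counter1, CPU0");
static_assert(slot_offset(3, 1) == 200, "Counter1, CPU3");
```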
In practice, mapping CPU to slot isn't quite so easy. But you can get a pretty good mapping using the strategy in this post, or something like folly::AccessSpreader::cachedCurrent() [0]. On newer kernels, I believe there is kernel support for getting this mapping via rseq: the kernel updates the current CPU in a thread-local on every context switch.
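One cheap approximation on Linux is sched_getcpu() (a glibc extension; on recent glibc/kernels it is, as far as I know, serviced from the vDSO or the rseq area rather than a real system call). A stale answer only costs a bit of extra contention, never correctness:

```cpp
#include <sched.h>    // sched_getcpu(), glibc extension (_GNU_SOURCE)
#include <cstddef>

// Map the calling thread to a stripe index in [0, num_stripes).
std::size_t stripe_for_this_thread(std::size_t num_stripes) {
    int cpu = sched_getcpu();      // may return -1 on very old kernels
    if (cpu < 0) cpu = 0;          // degrade to a single stripe
    return static_cast<std::size_t>(cpu) % num_stripes;
}
```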
My application was essentially user-space forwarding of GTP-U packets using DPDK. Approximately 80% of the available cores (32 or 64 in total, I'm not so sure) were dedicated to that task. Each forwarding thread would work independently of every other thread, incrementing tx/rx counters etc. Instead of a global shared counter, we had per-thread 16-bit counters which were 'synced' into the shared counter when they were about to overflow, which ended up reducing the overall contention by quite a large margin.
Of course, the salient point here is that it was OK to be 'eventually correct'; a minor lag was always acceptable.
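Roughly this shape (reconstructed, not the actual code; the counter name and the flush point are made up): each worker keeps a small local count and only touches the shared atomic when the local one is about to wrap, so the global total lags by at most ~65k per thread.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> g_tx_packets{0};  // shared, eventually-correct total

struct LocalTxCounter {
    std::uint16_t pending = 0;

    // Hot path: a plain 16-bit increment; the shared atomic is touched only
    // once every 65,536 increments.
    void bump() {
        if (pending == UINT16_MAX) {
            g_tx_packets.fetch_add(std::uint64_t{pending} + 1,
                                   std::memory_order_relaxed);
            pending = 0;
        } else {
            ++pending;
        }
    }

    // Call when the thread exits (or periodically) so the tail isn't lost.
    void flush() {
        g_tx_packets.fetch_add(pending, std::memory_order_relaxed);
        pending = 0;
    }
};
```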
The Linux kernel can do an interesting twist on this, because it controls when execution switches between cores. It only needs 2*num_cpus counters for kernel things (two because one is for normal execution and the second is for interrupts).
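A user-space caricature of that layout (not the kernel's actual code): the index is just cpu * 2 + in_interrupt. Inside the kernel the increment can be a plain non-atomic add because the running CPU owns its slot and can't be migrated mid-update; relaxed atomics stand in here so a concurrent read() is well-defined in portable C++.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

class PerCpuCounter {
public:
    explicit PerCpuCounter(std::size_t num_cpus) : slots_(2 * num_cpus) {}

    // cpu and in_irq come "for free" in kernel context.
    void bump(std::size_t cpu, bool in_irq) {
        slots_[cpu * 2 + (in_irq ? 1 : 0)].fetch_add(1, std::memory_order_relaxed);
    }

    std::uint64_t read() const {
        std::uint64_t sum = 0;
        for (const auto& s : slots_) sum += s.load(std::memory_order_relaxed);
        return sum;
    }

private:
    std::vector<std::atomic<std::uint64_t>> slots_;  // 2 slots per CPU
};
```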
If the difference in performance of CAS vs TLS is a concern, you're likely better off using a thread per CPU and 10,000 fibers with carefully managed stack sizes, which doubly reduces that memory concern.
The author correctly identifies the problems with sched_yield() [1], but then uses it anyway. In my experience, using nanosleep() with an exponential backoff and a threshold performs well and results in the kind of behavior people want when they use sched_yield(). I think the difference is enough to affect the results, so I’m surprised the author didn’t experiment with it.
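Concretely, the kind of thing I mean (numbers illustrative, not tuned): spin briefly, then sleep for exponentially increasing intervals up to a cap, instead of calling sched_yield() in a loop.

```cpp
#include <time.h>

// Wait strategy for the Nth failed attempt: spin first, then sleep for
// exponentially longer intervals, capped so a waiter never oversleeps by much.
void backoff_wait(unsigned attempt) {
    constexpr unsigned kSpinAttempts = 64;       // pure spinning below this
    constexpr long     kMaxSleepNs   = 1000000;  // cap at 1 ms

    if (attempt < kSpinAttempts)
        return;                                  // busy-wait (maybe with a pause instruction)

    long ns = 1000;                              // start at 1 us
    for (unsigned i = kSpinAttempts; i < attempt && ns < kMaxSleepNs; ++i)
        ns *= 2;
    if (ns > kMaxSleepNs) ns = kMaxSleepNs;

    timespec ts{0, ns};
    nanosleep(&ts, nullptr);                     // unlike sched_yield(), this actually blocks
}

// Typical use:
//   for (unsigned attempt = 0; !try_acquire(); ++attempt) backoff_wait(attempt);
```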
Yes, because the article isn't about the best approaches for concurrent counters, but rather the good, bad and downright ugly ways you can do concurrency, and the associated costs. The counter is used as an illustrative device across all the levels, so of course the really slow levels are going to involve something you don't actually want to do!
I'm not concerned with the counter itself, either. I'm concerned with the implementation of the synchronization at that level. I'm afraid the solution at that level is a strawman. There is a better solution for the "system call" level.
I also don't think it's useful to think of this hierarchy as "good, bad and ugly." Sometimes the problem you're solving will require being at one of these levels, and you can't perform the optimizations or re-architecture required to drop down a level. When that is the case, it's useful to know the best options at that level.
As the article shows, there are basically the reasonable levels (0, 1, and 2), which are all valid approaches and depend on engineering tradeoffs. Levels 3, 4 and 5 are mistakes you want to avoid: so of course you aren't going to agree with the implementation choices in those levels, because they are mistakes! Those types of mistakes occur in practice along these lines (and in other ways), so I think it is useful to show them.
Note that "system call" level doesn't mean "the best level you can implement with a system call", although admittedly perhaps the naming was confusing. Rather, it means "an approach which implies that a system call is made for every interactive" - that is, it is almost always mistake, since can generally avoid this.
The levels are named after the dominant factor in their cost.
As I said, sometimes you need to be at those levels for reasons outside of your control. When you are at those levels, it's useful to know the best thing to do at that level. In particular, sched_yield() is a known anti-pattern when using system calls, and there are known better techniques. It's better enough that I think it could have a significant impact on the performance results.
Yes, I am the author so I'm especially interested in feedback, thus my desire to get to the bottom of this. If it's a misunderstanding I can fix by clarifying something in the article, I'm open to it.
The goal of this post is to define and give a rough measurement for various concurrency performance levels, and as a pedagogical technique I pick one example (cross-thread counter) and stretch it across all the levels. Specifically, the goal is not to create a "good" counter at all levels, but to pick an example which can plausibly hit all the levels, which is going to require some bad designs at the upper levels.
Not all levels represent good engineering: some are simply dominated. For example, I'd argue that 4 and 5 are dominated pretty much always. For a specific requirement, even more levels will be dominated. E.g., for this job, the simplest counter which uses an atomic add to increment the counter is a good baseline: any slower implementation than this is dominated. You'll never need to be at a higher level for this task.
So yes, the higher levels "do something dumb" by definition, because if you are making a system call to increment a counter every time, you are doing it wrong. That sched_yield is an anti-pattern [1] is beside the point, because all of the higher levels are deep into anti-pattern territory. That's the only way I could see to make one example work across all the levels (and having a single example was important to me).
If I understand your concern, you would like (or believe the intent is) that the various levels represent a "best possible" implementation within the constraints of the level? I don't think that is possible: as above, it's always going to be deeply artificial once you start forcing an atomic counter to make a system call. For example, I could use a SysV semaphore as my concurrency mechanism: this also always makes a system call. It is not violating any "no sched_yield" rule, but I think it is equally artificial.
---
[1] Actually, I don't know that sched_yield is always an anti-pattern, but let's not even go there: for the purposes of this discussion my point stands even if we assume it is always an anti-pattern (you hint at some of the places where it might be useful with your nanosleep comment).
I feel like this discussion is slipping out from under me, as I thought we agreed that the counter itself is a contrived example to illustrate synchronization mechanisms. But your recent response seems predicated on the fact the problem you're solving is just incrementing a counter.
I looked at it as just some work that needs to be done. But in the general case, not all work can be reduced to a single, or even a few atomic instructions. And sometimes your synchronization is needed not because you have work to do, but because you are waiting for another thread to complete their work so you can proceed. In those cases, you sometimes need system calls and/or condition variables to correctly synchronize with another thread. And when you find yourself in such a situation, sched_yield() is an anti-pattern.
And your footnote confuses me: I am not proposing using nanosleep() and sched_yield() together; I am proposing nanosleep() instead of sched_yield(). What I linked to in my parent comment explains, basically, that sched_yield() does not mean what you think it means, so don't use it.
I agree the conversation seems to be going backwards and it seems like more work than expected to get to a common understanding (with no guarantee of success), so let's just drop it.