I'm fairly sure that the spacing between the counters is there so that writes on separate cores don't fight over the cache-line coherency logic. I'm not sure how you conclude that putting multiple counters into the same line avoids false sharing. That looks like a textbook case of false sharing to me.
If memory is an issue, is your suggestion going to behave any better than simply collapsing each counter down to 8 bytes per thread? Perhaps a bit, if there's a wide disparity between counter frequencies and the independent counters tend not to run simultaneously. Otherwise you now have contention between mostly unrelated tasks, and their interaction is no longer subtle.
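To be concrete about the collapsed alternative, here's a minimal sketch (kMaxThreads and the bump() signature are made up for illustration, assuming 64-byte cache lines):

    // One unpadded 8-byte slot per thread, so 8 neighbouring threads
    // share a single 64-byte cache line and their writes contend on it.
    #include <atomic>
    #include <cstdint>

    constexpr int kMaxThreads = 64;

    // No padding: slots 0..7 land in the same cache line.
    std::atomic<uint64_t> counters[kMaxThreads];

    void bump(int thread_id) {
        counters[thread_id].fetch_add(1, std::memory_order_relaxed);
    }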
I believe the parent means you can pack multiple unrelated counters intended for the same core into one line. The cache line is still uncontended; you're just storing more in it.
If you're building these things up from scratch this should be obvious, but if you're pulling in a library it's easy not to realize how wasteful it's being.
> I'm fairly sure that the spacing between the counters is there so that writes on separate cores don't fight over the cache-line coherency logic. I'm not sure how you conclude that putting multiple counters into the same line avoids false sharing. That looks like a textbook case of false sharing to me.
Let's say you have 8 independent counters and 4 CPUs. Each counter gets 4 slots (one per CPU) in 4 cache lines = 256 bytes of storage. Counter0 gets the slots at byte offsets 0, 64, 128, and 192; Counter1 gets 8, 72, 136, and 200; and so on. Then you map CPU0 to slot 0, CPU1 to slot 1, CPU2 to slot 2, and CPU3 to slot 3.
Since the mapping is shared between all counters, any bump of any counter on CPU0 touches only cache line 0. So there isn't any false sharing.
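A sketch of that layout, assuming 64-byte cache lines and 8-byte atomics (the struct and names are mine, not from the post):

    // storage[p][c] puts every slot for CPU p in cache line p, so a
    // bump on CPU p touches only that line: no false sharing.
    #include <atomic>
    #include <cstdint>

    constexpr int kCpus = 4;
    constexpr int kCounters = 8;  // 8 counters * 8 bytes = one line per CPU

    struct alignas(64) SharedCounters {
        // Counter c for CPU p lives at byte offset p*64 + c*8.
        std::atomic<uint64_t> storage[kCpus][kCounters];

        void bump(int counter, int cpu) {
            storage[cpu][counter].fetch_add(1, std::memory_order_relaxed);
        }

        uint64_t read(int counter) const {
            uint64_t sum = 0;
            for (int p = 0; p < kCpus; ++p)
                sum += storage[p][counter].load(std::memory_order_relaxed);
            return sum;
        }
    };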
In practice, mapping CPU to slot isn't quite so easy, but you can get a pretty good mapping using the strategy in this post, or something like folly::AccessSpreader::cachedCurrent() [0]. On newer Linux kernels, I believe there is direct support for getting this mapping via rseq (restartable sequences): the kernel updates the current CPU in a thread-local on every context switch.
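For example, on Linux you can approximate the mapping with sched_getcpu(), standing in here for AccessSpreader or rseq. The answer can be stale by the time it's used, but that only costs an extra contended bump, not correctness:

    #include <sched.h>

    int current_slot(int num_slots) {
        int cpu = sched_getcpu();  // may be stale after a migration
        if (cpu < 0) cpu = 0;      // fallback if the call is unsupported
        return cpu % num_slots;    // fold CPU ids into the available slots
    }

With the layout sketched above, a bump becomes counters.bump(c, current_slot(kCpus)).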