I'm fairly sure that the spacing between the counters is there so that writes on separate cores don't fight over the cache-line coherency logic. I'm not sure how you conclude that putting multiple counters into the same line avoids false sharing. That looks like a textbook case of false sharing to me.
If memory is an issue, is your suggestion going to behave any better than simply collapsing each counter down to 8 bytes per thread? Perhaps a bit, if there's a wide disparity between counter frequencies and the independent counters tend not to run simultaneously. Otherwise you now have contention between mostly unrelated tasks, and their interaction is no longer subtle.
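To be concrete about the collapsed alternative, here's a minimal sketch (kMaxThreads and the bump() signature are made up for illustration, assuming 64-byte cache lines):

    // One unpadded 8-byte slot per thread, so 8 neighbouring threads
    // share a single 64-byte cache line and their writes contend on it.
    #include <atomic>
    #include <cstdint>

    constexpr int kMaxThreads = 64;

    // No padding: slots 0..7 land in the same cache line.
    std::atomic<uint64_t> counters[kMaxThreads];

    void bump(int thread_id) {
        counters[thread_id].fetch_add(1, std::memory_order_relaxed);
    }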
I believe the parent means you can pack multiple unrelated counters intended for the same core into one line. The cache line is still uncontended; you're just storing more in it.
If you're building these things up from scratch this should be obvious, but if you're pulling in a library it's easy not to realize how wasteful it's being.
> I'm fairly sure that the spacing between the counters is there so that writes on separate cores don't fight over the cache-line coherency logic. I'm not sure how you conclude that putting multiple counters into the same line avoids false sharing. That looks like a textbook case of false sharing to me.
Let's say you have 8 independent counters and 4 CPUs. Each counter gets 4 slots (one per CPU) in 4 cache lines = 256 bytes of storage. Counter0 gets the slots at byte offsets 0, 64, 128, and 192; Counter1 gets 8, 72, 136, and 200; and so on. Then you map CPU0 to slot 0, CPU1 to slot 1, CPU2 to slot 2, and CPU3 to slot 3.
Since the mapping is shared between all counters, any bump of any counter on CPU0 touches only cache line 0. So there isn't any false sharing.
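A sketch of that layout, assuming 64-byte cache lines and 8-byte atomics (the struct and names are mine, not from the post):

    // storage[p][c] puts every slot for CPU p in cache line p, so a
    // bump on CPU p touches only that line: no false sharing.
    #include <atomic>
    #include <cstdint>

    constexpr int kCpus = 4;
    constexpr int kCounters = 8;  // 8 counters * 8 bytes = one line per CPU

    struct alignas(64) SharedCounters {
        // Counter c for CPU p lives at byte offset p*64 + c*8.
        std::atomic<uint64_t> storage[kCpus][kCounters];

        void bump(int counter, int cpu) {
            storage[cpu][counter].fetch_add(1, std::memory_order_relaxed);
        }

        uint64_t read(int counter) const {
            uint64_t sum = 0;
            for (int p = 0; p < kCpus; ++p)
                sum += storage[p][counter].load(std::memory_order_relaxed);
            return sum;
        }
    };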
In practice, mapping CPU to slot isn't quite so easy, but you can get a pretty good mapping using the strategy in this post, or something like folly::AccessSpreader::cachedCurrent() [0]. On newer Linux kernels, I believe there is direct support for getting this mapping via rseq (restartable sequences): the kernel updates the current CPU in a thread-local on every context switch.
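For example, on Linux you can approximate the mapping with sched_getcpu(), standing in here for AccessSpreader or rseq. The answer can be stale by the time it's used, but that only costs an extra contended bump, not correctness:

    #include <sched.h>

    int current_slot(int num_slots) {
        int cpu = sched_getcpu();  // may be stale after a migration
        if (cpu < 0) cpu = 0;      // fallback if the call is unsupported
        return cpu % num_slots;    // fold CPU ids into the available slots
    }

With the layout sketched above, a bump becomes counters.bump(c, current_slot(kCpus)).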