Do you mean the drop they later used to analyze cache-line coherency when adding cores? They improved this, if I understood the later results correctly.
Can anyone explain the use of 'round-robin' to describe multi-node scenarios and 'fill-first' for single-node scenarios? I initially assumed they were describing thread schedulers, but that doesn't quite make sense in the context of these tests. Thanks in advance.
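To make the question concrete, here is the scheduler-style reading I had in mind (purely my own assumption, not the article's definitions; `nodes` and `cores_per_node` are made-up illustrative numbers):

```cpp
// Which NUMA node does the i-th thread land on under each policy?
#include <cstdio>

int round_robin_node(int thread_idx, int nodes) {
    return thread_idx % nodes;                  // spread threads across nodes
}

int fill_first_node(int thread_idx, int nodes, int cores_per_node) {
    int node = thread_idx / cores_per_node;     // pack one node before the next
    return node < nodes ? node : node % nodes;  // wrap once all nodes are full
}

int main() {
    const int nodes = 2, cores_per_node = 4;
    for (int t = 0; t < 8; ++t)
        std::printf("thread %d -> round-robin node %d, fill-first node %d\n",
                    t, round_robin_node(t, nodes),
                    fill_first_node(t, nodes, cores_per_node));
}
```

Is that roughly the distinction being drawn, or does it refer to memory placement instead?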
Most of these optimizations scale with the number of cores, so the more cores (and hence sockets/NUMA nodes), the greater the benefit. Desktop-ish systems with ~4 cores don't see much gain, but we didn't introduce any performance degradation either.
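As a rough illustration of the kind of pattern involved (a generic sketch, not our actual code): per-core, cache-line-padded state avoids coherency traffic, which is exactly the sort of win that grows with core count and is barely visible on a ~4-core desktop.

```cpp
// Compare a single contended counter against per-core, cache-line-padded slots.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

struct alignas(64) PaddedCounter {               // 64 bytes: typical cache-line size
    std::atomic<std::uint64_t> value{0};
};

int main() {
    const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    std::vector<PaddedCounter> per_core(cores);  // one private slot per core
    std::atomic<std::uint64_t> shared{0};        // single contended counter

    auto run = [&](bool sharded) {
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> threads;
        for (unsigned i = 0; i < cores; ++i) {
            threads.emplace_back([&, i] {
                for (int n = 0; n < 1000000; ++n) {
                    if (sharded)
                        per_core[i].value.fetch_add(1, std::memory_order_relaxed);
                    else
                        shared.fetch_add(1, std::memory_order_relaxed);
                }
            });
        }
        for (auto& t : threads) t.join();
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - start).count();
    };

    // The shared counter bounces one cache line between every core; the sharded
    // version keeps each core in its own line, so its cost stays roughly flat
    // as the core count grows.
    std::printf("%u cores: shared %.3fs, sharded %.3fs\n",
                cores, run(false), run(true));
}
```

On a quad-core box the two variants are close; the gap widens as sockets and NUMA nodes are added.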