Spinning with pause is slower than spinning with sched_yield according to every ...

epcoa · 2024-10-03T14:48:39 1727966919

For what it is worth, it seems the library in question does both: it uses an exponential retry loop, busy-looping 2^i times on each of the first 7 attempts and only yielding after that. It seems like there must be some threshold where latency is improved by retrying before yielding, but I don’t test these things for a living.

https://github.com/google/nsync/blob/c6205171f084c0d3ce3ff51...
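A minimal sketch of the spin-then-yield backoff described in this comment, written in Rust for illustration. It is not the library's code: the name `spin_delay`, the constant `SPIN_ATTEMPTS`, and the use of `std::hint::spin_loop` as the body of the busy loop are assumptions made for this example.

```rust
use std::hint;
use std::thread;

/// Number of attempts that busy-loop before the caller starts yielding
/// (7, as described in the comment above; the constant name is hypothetical).
pub const SPIN_ATTEMPTS: u32 = 7;

/// One backoff step. Takes the number of attempts made so far and returns
/// the updated count, which the caller passes back in on the next failure.
pub fn spin_delay(attempts: u32) -> u32 {
    if attempts < SPIN_ATTEMPTS {
        // Busy-wait for 2^attempts iterations: 1, 2, 4, ... 64.
        for _ in 0..(1u32 << attempts) {
            hint::spin_loop();
        }
        attempts + 1
    } else {
        // Past the threshold: give the CPU back to the scheduler instead.
        thread::yield_now();
        attempts
    }
}
```

A caller would start with `attempts = 0` and call `attempts = spin_delay(attempts)` each time it fails to acquire the lock, so the total busy-wait before the first yield is bounded at 127 iterations.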

Animats · 2024-10-03T17:39:10 1727977150

I've tried to figure out where that goes wrong. Read the bug report linked. I've caught this situation in gdb with 20 threads in spinlocks inside realloc, performance in a game down to 1 frame every 2 seconds, and very little getting done. But I don't understand how the three levels of locking involved cause that. Nor do the Wine people.

Works fine on Microsoft Windows. Only fails on Wine.

epcoa · 2024-10-03T19:01:00 1727982060

It seems like the loop around InterlockedCompareExchange is a bad idea, since this is a bus lock in a tight loop. Rather, the inner spinning loop that yields should just read the value, with the cmpxchg in the loop that surrounds it. As for whether sched_yield should be called directly in the inner loop or a short nop/pause loop should be attempted first for microcontention reasons, the expert opinion here is: don't bother with the nop loop. However, while the nop loop might not be optimal in the real world, I doubt it would cause a catastrophic performance issue.
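A rough sketch of the shape suggested in this comment, in Rust rather than the Win32 API: the atomic compare-exchange sits in the outer loop, and the inner wait loop only does plain loads and yields. The `SpinLock` type and its methods are hypothetical names for this example; it is not the Wine or Windows heap code, and it assumes yielding (not a nop/pause loop) in the wait path.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Hypothetical test-and-test-and-set lock, for illustration only.
pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    /// Single compare-exchange attempt; returns true if the lock was taken.
    pub fn try_lock(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    pub fn lock(&self) {
        loop {
            // Outer loop: one atomic read-modify-write per attempt.
            if self.try_lock() {
                return;
            }
            // Inner loop: wait with plain loads only, yielding each time,
            // so waiters do not hammer the cache line with locked operations.
            while self.locked.load(Ordering::Relaxed) {
                thread::yield_now();
            }
        }
    }

    pub fn unlock(&self) {
        // Release ordering publishes the critical section to the next owner.
        self.locked.store(false, Ordering::Release);
    }
}
```

The compare-exchange is only retried after a load has seen the lock free, which is the point of reading inside the wait loop instead of looping on the cmpxchg itself.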

pizlonator · 2024-10-03T16:43:14 1727973794

My data says that always yielding is better than ever busy spinning, except on microbenchmarks, where depending on the microbenchmark you can get any answer your heart desires.