> just make sure you `sched_yield` before each retry
Assuming `sched_yield` does something.
There's a futex congestion problem inside Wine's memory allocator. There are several levels of locks. If you're growing a buffer, in the sense of C's "realloc", and no buffer is available, memory allocation is locked during the allocation of a bigger buffer, copying of the contents, and release of the old buffer. "Push" type operations can force this. Two orders of magnitude performance drops ensue when multi-threaded programs are contending for that lock.[1]
Inside one of the lock loops is a call to "YieldProcessor".
static void spin_lock( LONG *lock )
{
while (InterlockedCompareExchange( lock, -1, 0 ))
YieldProcessor();
}
But the actual code for YieldProcessor is a NOP on x86:[2]
Wine devs are aware of this, but the mess is bad enough that no one has tackled it.
This is down in the core of what "malloc" calls, so changes there could have unexpected effects on many programs.
Needs attention from someone really into mutexes.
Just to clarify: the spinning on CMPXCHG is not also a bad idea, the YieldProcessor is correct (a pause), but inside the CMPXCHG loop it should be spinning on a pure unlocked load. Is that correct?
No you should spin on a read. Once you see the value you want you then try the CMPXCHG. If that succeeds you exit. If it fails you go back to spinning on the read.
Read and load mean the same thing. (I think GP just missed the end of your comment.)
You care about exchange vs read/load because of cache line ownership. Every time you try to do the exchange, the attempting CPU must take exclusive ownership of the cacheline (stealing it from the lock owner). To unlock, the lock owner must take it back.
If the attempting CPU instead only reads, the line ownership stays with the lock holder and unlock is cheaper. In general you want cache line ownership to change hands as few times as possible.
For what it is worth it seems the library in question does both, uses an exponential retry loop, busy read looping 2^i times for the first 7 attempts before then yielding. It seems like there must be some threshold where latency is improved by retrying before yielding, but I don’t test these things for a living.
I've tried to figure out where that goes wrong. Read the bug report linked. I've caught this situation in gdb with 20 threads in spinlocks inside realloc, performance in a game down to 1 frame every 2 seconds, and very little getting done.
But I don't understand how the three levels of locking involved cause that. Nor do the Wine people.
Works fine on Microsoft Windows. Only fails on Wine.
It seems like the loop around InterlockedCompareExchange is a bad idea since this is a bus lock in a tight loop. Rather the inner spinning loop that is yielding should just be reading the value surrounded by the cmpxchg. As for whether sched_yield should just be called in the inner loop or a short nop/pause loop should be attempted for microcontention reasons, the expert opinion here is don't bother with the nop loop. However while the nop loop might not be real world optimal I doubt that would be causing a catastrophic performance issue.
My data says that always yielding is better than ever busy spinning, except on microbenchmarks, where depending on the microbenchmarks you can get any answer your heart desires.
Has WINE considered using mremap() for its realloc() implementation? It allows you to make realloc() of a sufficient size basically cost nothing. See this screenshot https://x.com/JustineTunney/status/1837663502619889875 On other platforms like MacOS you can achieve the same thing as mremap() by mapping to random addresses, and then using MAP_FIXED to append.
`YieldThread` is just named confusingly. It is not equivalent to `sched_yield` or `std::thread::yield()`, it's rather a macro to issue a pause instruction (which is indeed what you typically want in a spin loop). The actual Windows equivalent would be `SwitchToThread`.
It might look like a NOP, but "REP NOP" is not a real NOP, it's the PAUSE instruction (which very old processors from the 1990s treated as a normal NOP).
Assuming `sched_yield` does something.
There's a futex congestion problem inside Wine's memory allocator. There are several levels of locks. If you're growing a buffer, in the sense of C's "realloc", and no buffer is available, memory allocation is locked during the allocation of a bigger buffer, copying of the contents, and release of the old buffer. "Push" type operations can force this. Two orders of magnitude performance drops ensue when multi-threaded programs are contending for that lock.[1]
Inside one of the lock loops is a call to "YieldProcessor".
But the actual code for YieldProcessor is a NOP on x86:[2] }Wine devs are aware of this, but the mess is bad enough that no one has tackled it. This is down in the core of what "malloc" calls, so changes there could have unexpected effects on many programs. Needs attention from someone really into mutexes.
[1] https://forum.winehq.org/viewtopic.php?t=37688
[2] https://gitlab.winehq.org/wine/wine/-/blob/HEAD/include/winn...