> just make sure you `sched_yield` before each retry Assuming `sched_yield` does...

secondcoming · 2024-10-02T22:48:57 1727909337

`rep; nop;` is actually the `pause` instruction. On older CPUs it’s a standard nop, but on newer CPUs it’s a more efficient nop.

Spinning on the CMPXCHG is also a bad idea. You should spin on the read and only then attempt the CMPXCHG.

bobmcnamara · 2024-10-03T00:51:05 1727916665

Bingo. Spinning on CMPXCHG can cause livelock.

epcoa · 2024-10-03T06:54:35 1727938475

Just to clarify: the spinning on CMPXCHG is not also a bad idea, the YieldProcessor is correct (a pause), but inside the CMPXCHG loop it should be spinning on a pure unlocked load. Is that correct?

secondcoming · 2024-10-03T08:53:18 1727945598

No, you should spin on a read. Once you see the value you want, you then try the CMPXCHG. If that succeeds, you exit. If it fails, you go back to spinning on the read.
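A minimal sketch of that read-then-CMPXCHG pattern (often called test-and-test-and-set), written with C11 atomics. The names `ttas_lock`, `ttas_lock_try`, `ttas_lock_acquire`, `ttas_lock_release`, and `cpu_relax` are made up for illustration and do not come from Wine or any library mentioned in this thread.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Spin-wait hint. On x86 with GCC/Clang this is the "rep; nop" (PAUSE)
 * encoding quoted elsewhere in the thread; on other targets this sketch
 * falls back to a compiler-only fence (an assumption, not a tuned choice). */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define cpu_relax() __asm__ __volatile__("rep; nop" ::: "memory")
#else
#define cpu_relax() atomic_signal_fence(memory_order_seq_cst)
#endif

/* Hypothetical lock type: 0 = free, 1 = held. */
typedef struct { atomic_int locked; } ttas_lock;

/* One compare-exchange attempt; returns true if the lock was taken. */
static bool ttas_lock_try(ttas_lock *l) {
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

static void ttas_lock_acquire(ttas_lock *l) {
    for (;;) {
        /* Spin on a plain load: this does not request exclusive
         * ownership of the cache line, so the holder can unlock cheaply. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            cpu_relax();
        /* The lock looked free: only now attempt the compare-exchange. */
        if (ttas_lock_try(l))
            return;
        /* Lost the race: go back to spinning on the read. */
    }
}

static void ttas_lock_release(ttas_lock *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A caller would wrap its critical section in `ttas_lock_acquire(&l)` and `ttas_lock_release(&l)`, with `l` initialized to `{ 0 }`.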

epcoa · 2024-10-03T14:24:19 1727965459

What is the difference between a read and a “load” here?

loeg · 2024-10-03T16:38:38 1727973518

Read and load mean the same thing. (I think GP just missed the end of your comment.)

You care about exchange vs read/load because of cache line ownership. Every time you try to do the exchange, the attempting CPU must take exclusive ownership of the cacheline (stealing it from the lock owner). To unlock, the lock owner must take it back.

If the attempting CPU instead only reads, the line ownership stays with the lock holder and unlock is cheaper. In general you want cache line ownership to change hands as few times as possible.

pizlonator · 2024-10-03T12:58:42 1727960322

Spinning with pause is slower than spinning with sched_yield according to every test I’ve ever done

epcoa · 2024-10-03T14:48:39 1727966919

For what it is worth, it seems the library in question does both: it uses an exponential retry loop, busy-read looping 2^i times for the first 7 attempts before then yielding. It seems like there must be some threshold where latency is improved by retrying before yielding, but I don’t test these things for a living.
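A rough sketch of the scheme as described above (2^i pause iterations for the first 7 attempts, then yield). It is written from that description only and is not copied from the linked library; `spin_delay`, `backoff_acquire`, `SPIN_ROUNDS`, and `cpu_relax` are hypothetical names.

```c
#include <sched.h>
#include <stdatomic.h>

/* PAUSE on x86 with GCC/Clang; compiler-only fence elsewhere (assumption). */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define cpu_relax() __asm__ __volatile__("rep; nop" ::: "memory")
#else
#define cpu_relax() atomic_signal_fence(memory_order_seq_cst)
#endif

/* Number of attempts that busy-wait before the caller starts yielding. */
#define SPIN_ROUNDS 7u

/* Delay once. For attempt < SPIN_ROUNDS, busy-wait 2^attempt pause
 * iterations and return attempt + 1; after that, yield to the scheduler
 * and return attempt unchanged. */
static unsigned spin_delay(unsigned attempt) {
    if (attempt < SPIN_ROUNDS) {
        for (unsigned i = 0; i < (1u << attempt); i++)
            cpu_relax();
        return attempt + 1;
    }
    sched_yield();
    return attempt;
}

/* Acquire a 0/1 lock word: read first, compare-exchange only when the
 * word looks free, and delay between failed attempts. */
static void backoff_acquire(atomic_int *locked) {
    unsigned attempt = 0;
    for (;;) {
        int expected = 0;
        if (atomic_load_explicit(locked, memory_order_relaxed) == 0 &&
            atomic_compare_exchange_weak_explicit(
                locked, &expected, 1,
                memory_order_acquire, memory_order_relaxed))
            return;
        attempt = spin_delay(attempt);
    }
}
```

Whether the busy-wait phase helps at all is exactly what is disputed in this thread; setting `SPIN_ROUNDS` to 0 turns the sketch into the always-yield variant.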

https://github.com/google/nsync/blob/c6205171f084c0d3ce3ff51...

Animats · 2024-10-03T17:39:10 1727977150

I've tried to figure out where that goes wrong. Read the bug report linked. I've caught this situation in gdb with 20 threads in spinlocks inside realloc, performance in a game down to 1 frame every 2 seconds, and very little getting done. But I don't understand how the three levels of locking involved cause that. Nor do the Wine people.

Works fine on Microsoft Windows. Only fails on Wine.

epcoa · 2024-10-03T19:01:00 1727982060

It seems like the loop around InterlockedCompareExchange is a bad idea, since this is a bus lock in a tight loop. Rather, the inner spinning loop that is yielding should just be reading the value, with the cmpxchg in the loop around it. As for whether sched_yield should just be called in the inner loop, or a short nop/pause loop should be attempted first for microcontention reasons, the expert opinion here is: don't bother with the nop loop. However, while the nop loop might not be real-world optimal, I doubt it would be causing a catastrophic performance issue.

pizlonator · 2024-10-03T16:43:14 1727973794

My data says that always yielding is better than ever busy spinning, except on microbenchmarks, where depending on the microbenchmark you can get any answer your heart desires.

jart · 2024-10-03T06:56:27 1727938587

Has WINE considered using mremap() for its realloc() implementation? It allows you to make realloc() of a sufficiently large block cost basically nothing. See this screenshot: https://x.com/JustineTunney/status/1837663502619889875 On other platforms like macOS you can achieve the same thing as mremap() by mapping to random addresses, and then using MAP_FIXED to append.
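A minimal, Linux-only sketch of the mremap() idea for large blocks. The helpers `big_alloc`, `big_realloc`, and `big_free` are hypothetical and say nothing about how Wine's allocator is actually structured; the sketch assumes the block was obtained directly from mmap() and that the caller tracks its size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for mremap() and MREMAP_MAYMOVE on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Allocate a large block straight from the kernel. Returns NULL on failure. */
static void *big_alloc(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Resize a block from big_alloc(). With MREMAP_MAYMOVE the kernel may
 * relocate the mapping by remapping its pages rather than copying their
 * contents. Returns NULL on failure, in which case the old block is
 * still valid. */
static void *big_realloc(void *p, size_t old_size, size_t new_size) {
    void *q = mremap(p, old_size, new_size, MREMAP_MAYMOVE);
    return q == MAP_FAILED ? NULL : q;
}

static void big_free(void *p, size_t size) {
    munmap(p, size);
}
```

A real allocator would presumably use this path only above some size threshold and keep a conventional heap for small blocks.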

spacechild1 · 2024-10-03T04:34:50 1727930090

`YieldProcessor` is just named confusingly. It is not equivalent to `sched_yield` or `std::this_thread::yield()`; it's rather a macro that issues a pause instruction (which is indeed what you typically want in a spin loop). The actual Windows equivalent of `sched_yield` would be `SwitchToThread`.
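To make the distinction concrete, here is a small sketch of two portability wrappers. `cpu_pause` and `thread_yield` are hypothetical names; the Windows branch relies on the `YieldProcessor` macro and `SwitchToThread` function from `<windows.h>`, and the other branch assumes a POSIX system.

```c
#if defined(_WIN32)
#include <windows.h>

/* CPU-level hint: stays on the same thread, no scheduler involvement. */
static void cpu_pause(void) { YieldProcessor(); }

/* Scheduler-level yield: offers the rest of the time slice to another
 * ready thread. SwitchToThread's result (whether a switch happened) is
 * ignored here. */
static int thread_yield(void) { (void)SwitchToThread(); return 0; }

#else
#include <sched.h>

/* CPU-level hint: PAUSE on x86 with GCC/Clang, nothing elsewhere
 * (an assumption made for this sketch). */
static void cpu_pause(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ __volatile__("rep; nop" ::: "memory");
#endif
}

/* Scheduler-level yield; returns 0 on success like sched_yield(). */
static int thread_yield(void) { return sched_yield(); }
#endif
```

`cpu_pause()` belongs inside a short spin loop, while `thread_yield()` goes through the operating system's scheduler and is correspondingly more expensive per call.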

cesarb · 2024-10-03T15:54:25 1727970865

> But the actual code for YieldProcessor is a NOP on x86:[2]

> __asm__ __volatile__( "rep; nop" : : : "memory" );

It might look like a NOP, but "REP NOP" is not a real NOP, it's the PAUSE instruction (which very old processors from the 1990s treated as a normal NOP).

exikyut · 2024-10-04T23:14:29 1728083669

I know NQP here isn't Not Quite Perl, but I'm not sure what it *is*. Seeking enlightenment!

pizlonator · 2024-10-02T22:44:00 1727909040

sched_yield isn’t a nop

xxs · 2024-10-03T06:54:26 1727938466

If YieldProcessor() actually did a context switch, it would be awfully expensive. As mentioned, "rep; nop" is not just a "nop".

OTOH, the usual way to lock via a busy loop/spin does require some attempts, followed by a backoff of random duration.
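A sketch of that "a few attempts, then back off for a random duration" shape, assuming a POSIX system for `nanosleep`. The attempt count, the sleep bounds, and the doubling of the cap are arbitrary choices made for illustration, and all names (`xorshift32`, `random_backoff_ns`, `backoff_lock_acquire`) are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#define SPIN_ATTEMPTS  64        /* tries per round before sleeping */
#define MIN_BACKOFF_NS 1000u     /* initial cap on the random sleep */
#define MAX_BACKOFF_NS 1000000u  /* final cap on the random sleep */

/* Tiny per-thread PRNG (xorshift32). The state must be nonzero. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Pseudo-random duration in the range [1, cap_ns] nanoseconds. */
static uint32_t random_backoff_ns(uint32_t *rng, uint32_t cap_ns) {
    return 1u + xorshift32(rng) % cap_ns;
}

/* Acquire a 0/1 lock word: try a bounded number of times, then sleep for
 * a random duration so that waiters do not all retry at the same moment. */
static void backoff_lock_acquire(atomic_int *locked, uint32_t *rng) {
    uint32_t cap_ns = MIN_BACKOFF_NS;
    for (;;) {
        for (int i = 0; i < SPIN_ATTEMPTS; i++) {
            int expected = 0;
            if (atomic_load_explicit(locked, memory_order_relaxed) == 0 &&
                atomic_compare_exchange_weak_explicit(
                    locked, &expected, 1,
                    memory_order_acquire, memory_order_relaxed))
                return;
        }
        struct timespec ts = { 0, (long)random_backoff_ns(rng, cap_ns) };
        nanosleep(&ts, NULL);
        /* Double the cap after each round, up to MAX_BACKOFF_NS. */
        cap_ns = (cap_ns > MAX_BACKOFF_NS / 2) ? MAX_BACKOFF_NS : cap_ns * 2;
    }
}
```

Each thread would keep its own nonzero `rng` seed so that different waiters draw different sleep durations.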