
Breaking memory ordering will break software - if a program requires it (which is already hard to know), how would you know which memory is accessed by multiple threads?



It's not just a question of "is this memory accessed by multiple threads" and mandating full TSO support if so. It's a question of "does the way this memory is accessed by multiple threads actually depend on memory barriers for correctness, and if so, how tight do those barriers need to be?" For most apps the answer is "it doesn't matter at all". For the ones where it does matter, heuristics and loose barriers are usually good enough. Only in the worst case, where strict barriers are needed, does the performance impact show up, and even then it's not the end of the world in terms of emulation performance.
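
To make "how tight do the barriers need to be" concrete, here's a minimal C++ sketch of my own (purely illustrative, not from the docs linked below): the flag/payload handoff is only correct on a weakly ordered CPU because of the release/acquire pair, while the statistics counter needs atomicity but no ordering at all.

    #include <atomic>
    #include <thread>

    std::atomic<bool> ready{false};
    int payload = 0;              // published via `ready`
    std::atomic<long> hits{0};    // plain statistics counter

    void producer() {
        payload = 42;                                  // plain store
        ready.store(true, std::memory_order_release);  // barrier needed: publishes `payload`
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) // barrier needed: pairs with the release
            ;
        // Seeing payload == 42 here is only guaranteed by the acquire/release pair.
        hits.fetch_add(1, std::memory_order_relaxed);  // no ordering needed, just atomicity
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }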

As far as applying it goes, the default assumption is that apps don't need it, and heuristics try to catch the ones that do. For well-known apps that do need TSO, raising the barriers to the level needed for reliable operation is part of the compatibility profile. For unknown apps that do need TSO you'll get a crash and a recommendation to try running under a stricter emulation compatibility setting, but this is exceedingly rare given that the above two things have to fail first.

Details here https://docs.microsoft.com/en-us/windows/uwp/porting/apps-on...


> For unknown apps that do need TSO you'll get a crash

Sure about that? Couldn't it lead to silent data corruption?


Yes, it absolutely can. Shameless but super relevant plug. I'm (slowly) writing a series of blog posts where I simulate the implications of memory models by fuzzing timing and ordering: https://www.reitzen.com/

I think the main reason it hasn't been disastrous is that most programs rely on locks, and those get translated to the equivalent ARM instructions with a full memory barrier.

Not too many consumer apps are doing lockless algorithms, but where they are used, all bets are off. You can easily imagine a queue from which two threads grab the same item, for instance.
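
To make that concrete, here's a toy single-producer/single-consumer ring buffer of my own (purely illustrative). The relaxed orderings model what you'd get if the translation added no barriers; a TSO-preserving translation effectively upgrades the publish/consume pair to release/acquire, which is the difference between working and reading stale data:

    #include <atomic>
    #include <cstddef>

    // Toy SPSC ring buffer. memory_order_relaxed models a translation that
    // emits no extra barriers; release on the head store and acquire on the
    // head load is what a TSO-preserving translation effectively gives you.
    constexpr size_t N = 1024;
    int slots[N];
    std::atomic<size_t> head{0}, tail{0};

    bool push(int v) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_relaxed) == N) return false;  // full
        slots[h % N] = v;                              // write payload...
        head.store(h + 1, std::memory_order_relaxed);  // ...then publish it.
        // On a weakly ordered CPU these two writes can become visible in the
        // wrong order; memory_order_release on the head store prevents that.
        return true;
    }

    bool pop(int* out) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_relaxed)) return false;      // empty
        // Without an acquire load of `head`, this read can observe the slot
        // before the producer's payload write is visible: stale data.
        *out = slots[t % N];
        tail.store(t + 1, std::memory_order_relaxed);
        return true;
    }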


Heuristics are used. For example, memory accesses relative to the stack pointer will be assumed to be thread-local, as the stack isn’t shared between threads. And that’s just one of the tricks in the toolbox. :-)

The result of those heuristics is that the expensive ordered accesses don't have to be applied to every memory access on hardware that doesn't expose a TSO memory model.
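
Purely as an illustration of that heuristic (hypothetical types and names, not any real emulator's code), the per-access decision might look something like this:

    #include <cstdint>

    // Sketch of the stack-pointer heuristic described above: addresses formed
    // off the stack (or frame) pointer are assumed thread-local and get plain
    // loads/stores; everything else conservatively gets load-acquire /
    // store-release so x86 TSO ordering is preserved where it could matter.
    enum class BaseReg { RSP, RBP, Other };

    struct GuestMemAccess {
        BaseReg base;   // register the effective address is based on
        int32_t disp;   // displacement
    };

    bool needsOrderedEmit(const GuestMemAccess& a) {
        bool stackRelative = (a.base == BaseReg::RSP || a.base == BaseReg::RBP);
        return !stackRelative;  // only non-stack accesses pay for ordering
    }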



