It's not supposed to be a particularly valuable model; it's just to show how it could exist.
> the cost is that your release barriers, instead of being a cheap, purely core-local effect, now have to cross the coherence fabric. This can move the cost from a dozen cycles to hundreds.
That's not the intent.
The barrier pushes the lines in this new state to I, but under MESI they already would have been I.
There is no reason for any performance loss compared to MESI. If no lines are in this state, there is no extra work. And if it would be helpful, you can preemptively convert those lines to S in the background.
Edit: Oh wait, you said release barrier. No, if you're doing the simplest plan based on MESI, then there is no need for anything extra when releasing. Those considerations only apply if you want to look at MOSI, where it's already doing write broadcasts.
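For concreteness, here is a toy sketch of what I mean: MESI plus one extra hypothetical "Stale" state. All names are made up, and it ignores tags, data transfer, and the actual coherence fabric; it's only meant to show the transitions, not a real implementation.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical: MESI plus a Stale state (illustration only, not a real protocol).
enum class LineState { Modified, Exclusive, Shared, Stale, Invalid };

struct CacheLine {
    LineState state = LineState::Invalid;
    uint64_t  data  = 0;   // stand-in for the cached payload
};

// Loads may hit a Stale line and return possibly old data.
bool load_hit(const CacheLine& line) {
    return line.state != LineState::Invalid;   // M/E/S/Stale all hit
}

// A barrier on this core drops its own Stale lines to Invalid. Under plain
// MESI those lines would already have been Invalid, so relative to MESI this
// is free, and it is purely core-local: nothing crosses the fabric.
void on_barrier(std::vector<CacheLine>& lines) {
    for (auto& line : lines)
        if (line.state == LineState::Stale)
            line.state = LineState::Invalid;
}

// Optional background work: re-fetch a Stale line and promote it to Shared,
// so a later barrier finds nothing left to invalidate.
void background_refresh(CacheLine& line) {
    if (line.state == LineState::Stale) {
        // ...issue an ordinary read over the fabric; once fresh data arrives:
        line.state = LineState::Shared;
    }
}
```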
Sorry, I'm losing track of what you are proposing. I guess you mean MOESI plus an additional Stale state? Can you describe all the transitions and when they are performed?
It seems to be a state that can send stale data to loads, but will get invalidated if the CPU performs a barrier.
It doesn't work, of course, most obviously because there is no forward progress: CPU1 can order all its previous stores, then later store some flag or lock variable, and that store will never propagate to CPU2 spinning on it, waiting for the value.
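Concretely (standard C++ atomics are used only to name the accesses; the problem is in the proposed hardware model, not in the C++ memory model):

```cpp
#include <atomic>

std::atomic<int> data{0};
std::atomic<int> flag{0};

void cpu1() {
    data.store(42, std::memory_order_relaxed);
    flag.store(1, std::memory_order_release);   // ordered after the data store
}

void cpu2() {
    // No barrier in the loop. With the proposed Stale state, CPU2's cache can
    // keep satisfying this load from its old copy of `flag` forever, because
    // nothing ever forces the Stale line out. Under MESI, CPU1's write
    // invalidates the line and the loop terminates.
    while (flag.load(std::memory_order_relaxed) == 0) { /* spin */ }
    int v = data.load(std::memory_order_relaxed);
    (void)v;
}
```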
But also because CPU2 and CPU3 can see different values depending on the state of their caches. If one has no such line and the other has a valid-stale line, then they will end up seeing different values. And the writeback out of CPU1's cache needs to write back and invalidate all possible older such written-to lines. By the time you make it work, it's not a cache, it's a store queue, but worse, because it has to do all these coherency operations, and barriers have to walk it and flash-invalidate or flush it, etc.
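A litmus-style sketch of that second problem (again, C++ just names the accesses; the outcome shown is what the proposed model would permit at a single instant, not what coherent hardware does):

```cpp
#include <atomic>

std::atomic<int> x{0};   // assume CPU1 has since stored x = 1

// CPU2 still holds x in the hypothetical Stale state: the load is satisfied
// locally and can return the old value, 0.
int cpu2() { return x.load(std::memory_order_relaxed); }

// CPU3 holds no copy of x: the load misses, fetches the current value from
// CPU1 (or memory), and returns 1 -- at the same moment CPU2 reads 0.
int cpu3() { return x.load(std::memory_order_relaxed); }
```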
Can't I just say that spinning without a barrier is doing things wrong?
I guess if I need an emergency fix I can make those lines invalidate themselves every thousand cycles.
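To be concrete about the first point: the spin I'd call correct has a barrier in the loop, roughly like the sketch below, where the standard acquire fence just stands in for whatever barrier instruction this hypothetical hardware would expose.

```cpp
#include <atomic>

std::atomic<int> flag{0};

void waiter() {
    // With a barrier in every iteration, each pass drops the hypothetical
    // Stale line back to Invalid, so the next load has to re-fetch and
    // forward progress is restored. Without the fence you get the stuck
    // spin described above.
    while (flag.load(std::memory_order_relaxed) == 0) {
        std::atomic_thread_fence(std::memory_order_acquire);
    }
}
```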
> But also because CPU2 and CPU3 can see different values depending on the state of their caches. If one has no such line and the other has a valid-stale line, then they will end up seeing different values.
But only if there are no memory barriers, so it shouldn't be a big deal. And the valid-stale lines can't be used as the basis for new writes.
> And the writeback out of CPU1's cache needs to write back and invalidate all possible older such written-to lines.
The only such lines are in this new state that's halfway between S and I. They don't need to be marked any more invalid. Zero bus traffic there.
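To spell out the write side: for stores, the new state behaves exactly like I, so a write never merges into possibly-stale data and needs no invalidation beyond what MESI already does. Rough sketch with made-up names, not a real implementation:

```cpp
enum class LineState { Modified, Exclusive, Shared, Stale, Invalid };

struct CacheLine { LineState state = LineState::Invalid; /* tag, data... */ };

// Store path: Stale is treated exactly like Invalid.
void on_store(CacheLine& line) {
    switch (line.state) {
    case LineState::Modified:
    case LineState::Exclusive:
        line.state = LineState::Modified;   // write in place, as in MESI
        break;
    case LineState::Shared:
        // ...upgrade: invalidate other copies, keep the local data (as in MESI)
        line.state = LineState::Modified;
        break;
    case LineState::Stale:
    case LineState::Invalid:
        // ...full read-for-ownership, the same path as an ordinary miss
        line.state = LineState::Modified;
        break;
    }
}
```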
Yes, I saw that flaw, but I think in their model, barriers are always associated with stores (so atomic_thread_fence is not implementable but store_release is), which fixes your first example. I agree that in any case you end up doing more work than in the typical model to make other scenarios work.
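In C++ terms, the distinction I mean is roughly the following (a sketch; I'm guessing at how their barrier maps onto these two forms):

```cpp
#include <atomic>

std::atomic<int> data{0};
std::atomic<int> flag{0};

// Implementable in that model: the release barrier is attached to one
// specific store, so propagating that write can carry the "invalidate stale
// copies" work across the fabric -- which, as I understand it, is what fixes
// the spinning example.
void publish_with_store_release() {
    data.store(1, std::memory_order_relaxed);
    flag.store(1, std::memory_order_release);
}

// Not implementable in that model: a standalone fence orders everything
// before it against everything after it without naming any particular store,
// so there is no single write to hang the cross-fabric invalidation on.
void publish_with_standalone_fence() {
    data.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(1, std::memory_order_relaxed);
}
```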