> It isn't, because it's relative. "Most recent" is according to the observer, a...

gpderetta · on June 16, 2023

What if the write hasn't happened at all?

Writer:

    0:    mov $0 $random_cell // zero-initialize $random_cell
    1:    mov $0 $random_cell_ready // zero-initialize $random_cell_ready
    3:    rng r1 // generates a non-zero random number in r1; takes about 300 cycles
    4:    mov r1 $random_cell // write the generated value to memory location $random_cell
    5:    mov 1 $random_cell_ready // set a flag to signal that the value has been written

Reader:

    0:    test $random_cell_ready
    1:    jz 0  // spin-wait for cell ready
    2:    mov $random_cell r1

Consider a 1-wide OoO[0] machine with a maximally relaxed memory model. There are no caches, just magic memory with 1 clock cycle of latency, so there are no coherence issues. At time t the writer starts computing a random number (writer:3); rng is going to take a few hundred cycles before writing the result to r1. On the next clock cycle, t+1, it can't execute writer:4 because r1 is not ready. Instead it executes writer:5, which has no dependencies. writer:4 is only executed at t+300.

At time t, the reader reads a zero from the flag, so at t+1 it loops back. At t+2 it sees the flag set to 1. So at t+3 the jump falls through, and at t+4 we read 0 from $random_cell. That's obviously wrong, as the data is not there yet. It is certainly not reading stale data, as that's literally the last value that was written to this memory location: a newer value hasn't even been computed yet.

To fix this you need a fence #StoreStore between writer:4 and writer:5 and a corresponding fence #LoadLoad between reader:1 and reader:2 to implement release and acquire semantics [1]. In particular the fence on the writer side will stall the pipeline [2] until previous stores in program order have committed.
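For anyone who wants to poke at this on a real machine, here is a rough C++ sketch of the fixed version. It is not the toy ISA above: it swaps the standalone #StoreStore / #LoadLoad fences for a C++ release store and acquire load on the flag, which are somewhat stronger but play the same role here. The names make_nonzero_random, publish_and_read and count_zero_reads are made up for illustration, and the comments map each step back to writer:N / reader:N.

```cpp
#include <atomic>
#include <cassert>
#include <random>
#include <thread>

// Hypothetical stand-in for the slow `rng` instruction: always returns a non-zero value.
inline unsigned make_nonzero_random() {
    static thread_local std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<unsigned> dist(1u, 1000000u);
    return dist(gen);
}

// Runs one writer/reader pair and returns the value the reader loaded from the cell.
inline unsigned publish_and_read() {
    unsigned random_cell = 0;               // writer:0 (plain, non-atomic data)
    std::atomic<int> random_cell_ready{0};  // writer:1 (the flag)
    unsigned seen = 0;

    std::thread writer([&] {
        unsigned r1 = make_nonzero_random();  // writer:3
        assert(r1 != 0);
        random_cell = r1;                     // writer:4
        // Release store: the store to random_cell may not be reordered after it.
        random_cell_ready.store(1, std::memory_order_release);  // writer:5
    });

    std::thread reader([&] {
        // Acquire load: the load of random_cell may not be reordered before it.
        while (random_cell_ready.load(std::memory_order_acquire) == 0) {  // reader:0-1
            std::this_thread::yield();
        }
        seen = random_cell;                   // reader:2
    });

    writer.join();
    reader.join();
    return seen;
}

// Repeats the experiment and counts how often the reader saw the initial 0.
inline int count_zero_reads(int trials) {
    int zeros = 0;
    for (int i = 0; i < trials; ++i) {
        if (publish_and_read() == 0) {
            ++zeros;
        }
    }
    return zeros;
}
```

With the release/acquire pair in place, count_zero_reads(n) should come back 0 for any n. If both orderings are downgraded to std::memory_order_relaxed, the plain access to random_cell becomes a data race and the reader is no longer guaranteed to see the generated value, which is the broken behaviour described above (though a strongly ordered machine like x86 may well never show it in practice).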

As you can see there are no caches, only a single shared memory which is necessarily always coherent. There is no stale data. Yet we need fences for correctness.

You can now reintroduce caches, but MESI gives the rest of the CPU the illusion that it is actually talking to uncached, fully coherent memory, except that latency is now variable.

Ergo, generally, fences and caches are completely orthogonal concepts.

[0] BTW the example would be broken even on a machine with scoreboarding but no renaming (so not fully OoO).

[1] proper release and acquire require slightly stronger fences, but these are sufficient for this example

[2] a better option is to just prevent further stores from being dispatched.

edit: some light editing

Dylan16807 · on June 16, 2023

The writing thread I had in mind was not being reordered.

Consider instead a loop that walks through an array looking for a value of 8, then writes the address of that value to memory location X.

Another thread can read X, then read the location, and see a number that is not 8 but used to be there before the 8.

That's allowed by the memory model, but is it wrong to say "stale"?

Is it wrong to say that the 8 was set before X was set?

gpderetta · on June 16, 2023

Your example has a single store, so there is no reordering. There is no value before 8, and it doesn't require any barrier. If you make an additional store to the array, that would be equivalent to my example.

Dylan16807 · on June 16, 2023

I said there was a value before 8, I just didn't describe it well enough.

First off your example looks a little under-constrained (What if reader:0 happens before writer:0?), so let's assume all the writes of 0 at the start of every program happen before any other instructions.

Let there be a third thread, "builder". It writes 0 to every address in the array, then fills it with new values including an 8.

The "writer" thread loops repeatedly over the array until it finds an 8, then stores the address in X.

The "reader" thread waits for X to be nonzero then loads the value at [X].

In the toy coherent computer, reader will always load an 8.

But in a very weak memory model, one that doesn't enforce dependent loads, reader is allowed to get a 0.

The write to X and the write of 8 don't have to show up in any particular order.

But in that situation I would say that X is "more recent" than 8, because of causality. And if you asked me yesterday I would have called the 0 "stale".
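To make the three threads concrete, here is a rough C++ sketch of the setup. Everything in it is an assumption for illustration: a 16-cell array, the name load_through_published_pointer, and cell i being filled with i + 1 so that exactly one cell ends up holding 8. This version publishes X with a release store and reads it with an acquire load, so it shows the ordered case where reader must get 8.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Runs builder, writer and reader once; returns the value reader loaded through X.
inline int load_through_published_pointer() {
    std::array<std::atomic<int>, 16> cells{};   // the array (illustrative size)
    std::atomic<std::atomic<int>*> x{nullptr};  // "X": null until writer finds the 8
    int seen = -1;

    std::thread builder([&] {
        // Write 0 to every cell, then fill with new values; cells[7] becomes 8.
        for (auto& c : cells) {
            c.store(0, std::memory_order_relaxed);
        }
        for (std::size_t i = 0; i < cells.size(); ++i) {
            cells[i].store(static_cast<int>(i) + 1, std::memory_order_relaxed);
        }
    });

    std::thread writer([&] {
        // Loop over the array until an 8 shows up, then publish its address in X.
        for (;;) {
            for (auto& c : cells) {
                if (c.load(std::memory_order_relaxed) == 8) {
                    x.store(&c, std::memory_order_release);
                    return;
                }
            }
            std::this_thread::yield();
        }
    });

    std::thread reader([&] {
        // Wait for X to be set, then load the value at [X].
        std::atomic<int>* p = nullptr;
        while ((p = x.load(std::memory_order_acquire)) == nullptr) {
            std::this_thread::yield();
        }
        assert(p != nullptr);
        seen = p->load(std::memory_order_relaxed);
    });

    builder.join();
    writer.join();
    reader.join();
    return seen;
}
```

With release on the store to X and acquire on the load of X, writer's read of the 8 happens-before reader's load of [X], so reader has to see 8. If both are changed to std::memory_order_relaxed, the C++ memory model no longer orders the two and reader would be allowed to load the earlier 0, which is the outcome in question here.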

throwawaylinux · on June 16, 2023

Your example was hard to follow. What do you mean exactly? Write it in terms of the memory operations performed and the values observed by each CPU at each address, with assertions that are or are not violated according to your example.