> A system without cache coherency can still promise that updates won't be lost. There are lots of ways to write rules around update propagation, and cache coherency is just one of them.
Not without coherency performed in software though, which is the point.
> Cache coherency doesn't protect you from stale data unless you only read one memory address ever.
It does. Observing updates to different locations out of sequential order does not mean the data is stale. I guess that's also a colloquial-language issue: the data is up to date according to the constraints of the memory model.
> That depends on the memory model. You could have a system that doesn't guarantee cache coherency in general but works fine if you put in ordinary memory barriers.
What property of cache coherency could you lose and still have it working?
> Not without coherency performed in software though, which is the point.
How would you do cache coherency in software? I don't follow.
> It does. Observing updates to different locations out of sequential order does not mean the data is stale. I guess that's also a colloquial-language issue: the data is up to date according to the constraints of the memory model.
When you read address X that refers to address Y, it's impossible to read Y and know it's the version that X refers to (or a more recent version). I would call that Y being "stale, as per the memory model". I'm pretty sure that's a normal use of the word "stale" in the context of memory models?
> What property of cache coherency could you lose and still have it working?
Imagine a CPU with a weak memory model, where multithreaded code without memory barriers loses the property of always seeing the same order of accesses to a specific address. That would break some things, but that code was broken anyway.
Normal memory barrier semantics could force certain accesses to be viewed in the same order, including important accesses that would normally be enforced by cache coherency. You don't need it to be enforced twice. The code will run fine.
> How would you do cache coherency in software? I don't follow.
Hardware caches that are not coherent require explicit software maintenance, e.g. writebacks and invalidations/flushes, to stay coherent with other agents.
> When you read address X that refers to address Y, it's impossible to read Y and know it's the version that X refers to (or a more recent version). I would call that Y being "stale, as per the memory model". I'm pretty sure that's a normal use of the word "stale" in the context of memory models?
It isn't, because it's relative. "Most recent" is according to the observer, and if you couldn't previously observe something that is "newer" (within the rules of memory consistency model), then it is not stale.
> Imagine a CPU with a weak memory model, where multithreaded code without memory barriers loses the property of always seeing the same order of accesses to a specific address. That would break some things, but that code was broken anyway.
Not same order of access, same order of stores. If memory location x receives two stores, A and B, and CPU1 sees A, B and CPU2 sees B, A, now one thinks the location contains B and the other thinks it contains A. In the end, all CPUs can see different values at all memory locations. A normal barrier doesn't solve this, the stores are already done.
> It isn't, because it's relative. "Most recent" is according to the observer, and if you couldn't previously observe something that is "newer" (within the rules of memory consistency model), then it is not stale.
Why is most recent according to the observer, and not the core(s) that actually did the writes?
The value in Y caused the value in X. Surely that's worth something in terms of ordering?
Writer
0: mov $0 $random_cell // zero initialize $random_cell
1: mov $0 $random_cell_ready // zero initialize $random_cell_ready
3: rng r1 // generates a non-zero random number in r1, takes ~300 cycles
4: mov r1 $random_cell // write generated value to memory location $random_cell
5: mov $1 $random_cell_ready // sets a flag to notify that the value has been written
Reader
0: test $random_cell_ready
1: jz 0 // spin-wait for cell ready
2: mov $random_cell r1
Consider a 1-wide OoO[0] machine with a maximally relaxed memory model. There are no caches, only magic memory with 1-clock-cycle latency, so no coherence issues. At time t the writer starts computing a random number (writer:3): rng is going to take a few hundred cycles before writing the result to r1. At the next clock cycle t+1 it can't execute writer:4, as r1 is not ready; instead it executes writer:5, which has no dependencies. writer:4 only executes at around t+300.
At time t, the reader reads a zero from the flag, so at t+1 it loops back. At t+2 it sees the flag set to 1, so at t+3 the jump falls through and at t+4 we read 0 from $random_cell. That's obviously wrong, as the data is not there yet. But it is certainly not stale data: that is literally the last value that was written to this memory location; a newer value hasn't even been computed yet.
To fix this you need a fence #StoreStore between writer:4 and writer:5 and a corresponding fence #LoadLoad between reader:1 and reader:2 to implement release and acquire semantics [1]. In particular the fence on the writer side will stall the pipeline [2] until previous stores in program order have committed.
As you can see there are no caches, only a single shared memory which is necessarily always coherent. There is no stale data. Yet we need fences for correctness.
You can now reintroduce caches, but MESI gives the illusion to the rest of the CPU that it is actually talking to uncached, fully coherent memory, except that the latency is now variable.
Ergo, generally, fences and caches are completely orthogonal concepts.
[0] BTW the example would be broken even on a machine with scoreboarding but no renaming (so not fully OoO).
[1] proper release and acquire semantics require slightly stronger fences, but these are sufficient for this example
[2] a better option is to simply prevent further stores from being dispatched.
Your example has a single store; there is no reordering. There is no value before 8, so it doesn't require any barrier. If you made an additional store to the array, that would be equivalent to my example.
I said there was a value before 8, I just didn't describe it well enough.
First off, your example looks a little under-constrained (what if reader:0 happens before writer:0?), so let's assume all the writes of 0 at the start of every program happen before any other instructions.
Let there be a third thread, "builder". It writes 0 to every address in the array, then fills it with new values including an 8.
The "writer" thread loops repeatedly over the array until it finds an 8, then stores the address in X.
The "reader" thread waits for X to be nonzero then loads the value at [X].
In the toy coherent computer, reader will always load an 8.
But in a very weak memory model, one that doesn't enforce dependent loads, reader is allowed to get a 0.
The write to X and the write of 8 don't have to show up in any particular order.
But in that situation I would say that X is "more recent" than 8, because of causality. And if you asked me yesterday I would have called the 0 "stale".
Your example was hard to follow. What do you mean, exactly? Write it out in terms of the memory operations performed and the values observed by each CPU at each address, with assertions that are or are not violated according to your example.