> A system without cache coherency can still promise that updates won't be lost. There are lots of ways to write rules around update propagation, and cache coherency is just one of them.
Not without coherency performed in software though, which is the point.
> Cache coherency doesn't protect you from stale data unless you only read one memory address ever.
It does. Observing updates to different locations out of sequential order does not mean the data is stale. I guess that's also a colloquial-language issue: the data is up to date according to the constraints of the memory model.
> That depends on the memory model. You could have a system that doesn't guarantee cache coherency in general but works fine if you put in ordinary memory barriers.
What property of cache coherency could you lose and still have it working?
> Not without coherency performed in software though, which is the point.
How would you do cache coherency in software? I don't follow.
> It does. Observing updates to different locations out of sequential order does not mean the data is stale. I guess that's also a colloquial-language issue: the data is up to date according to the constraints of the memory model.
When you read address X that refers to address Y, it's impossible to read Y and know it's the version that X refers to (or a more recent version). I would call that Y being "stale, as per the memory model". I'm pretty sure that's a normal use of the word "stale" in the context of memory models?
> What property of cache coherency could you lose and still have it working?
Imagine a CPU with a weak memory model, where multithreaded code without memory barriers loses the property of always seeing the same order of accesses to a specific address. That would break some things, but that code was broken anyway.
Normal memory barrier semantics could force certain accesses to be viewed in the same order, including important accesses that would normally be enforced by cache coherency. You don't need it to be enforced twice. The code will run fine.
> How would you do cache coherency in software? I don't follow.
Hardware caches that are not coherent require explicit software maintenance, e.g. writebacks and invalidations/flushes, to stay coherent with other agents.
> When you read address X that refers to address Y, it's impossible to read Y and know it's the version that X refers to (or a more recent version). I would call that Y being "stale, as per the memory model". I'm pretty sure that's a normal use of the word "stale" in the context of memory models?
It isn't, because it's relative. "Most recent" is according to the observer, and if you couldn't previously observe something that is "newer" (within the rules of memory consistency model), then it is not stale.
> Imagine a CPU with a weak memory model, where multithreaded code without memory barriers loses the property of always seeing the same order of accesses to a specific address. That would break some things, but that code was broken anyway.
Not same order of access, same order of stores. If memory location x receives two stores, A and B, and CPU1 sees A, B and CPU2 sees B, A, now one thinks the location contains B and the other thinks it contains A. In the end, all CPUs can see different values at all memory locations. A normal barrier doesn't solve this, the stores are already done.
> It isn't, because it's relative. "Most recent" is according to the observer, and if you couldn't previously observe something that is "newer" (within the rules of memory consistency model), then it is not stale.
Why is most recent according to the observer, and not the core(s) that actually did the writes?
The value in Y caused the value in X. Surely that's worth something in terms of ordering?
Writer
0: mov $0 $random_cell // zero initialize $random_cell
1: mov $0 $random_cell_ready // zero initialize $random_cell_ready
3: rng r1 // generates a non-zero random number in r1, takes ~300 cycles
4: mov r1 $random_cell // write generated value to memory location $random_cell
5: mov $1 $random_cell_ready // sets a flag to notify that the value has been written
Reader
0: test $random_cell_ready
1: jz 0 // spin-wait for cell ready
2: mov $random_cell r1
Consider a 1-wide OoO[0] machine with a maximally relaxed memory model. There are no caches, only magic memory with 1-clock-cycle latency, so no coherence issues. At time t the writer starts computing a random number (writer:3): rng is going to take a few hundred cycles before writing the result to r1. At the next clock cycle t+1 it can't execute writer:4, as r1 is not ready; instead it executes writer:5, which has no dependencies. writer:4 only executes at around t+300.
At time t, the reader reads a zero from the flag, so at t+1 it loops back. At t+2 it sees the flag set to 1, so at t+3 the jump falls through and at t+4 we read 0 from $random_cell. That's obviously wrong, as the data is not there yet. But it is certainly not stale data: that is literally the last value that was written to this memory location; a newer value hasn't even been computed yet.
To fix this you need a fence #StoreStore between writer:4 and writer:5 and a corresponding fence #LoadLoad between reader:1 and reader:2 to implement release and acquire semantics [1]. In particular the fence on the writer side will stall the pipeline [2] until previous stores in program order have committed.
As you can see there are no caches, only a single shared memory which is necessarily always coherent. There is no stale data. Yet we need fences for correctness.
You can now reintroduce caches, but MESI gives the illusion to the rest of the CPU that it is actually talking to uncached, fully coherent memory, except that the latency is now variable.
Ergo, generally, fences and caches are completely orthogonal concepts.
[0] BTW the example would be broken even on a machine with scoreboarding but no renaming (so not fully OoO).
[1] proper release and acquire semantics require slightly stronger fences, but these are sufficient for this example
[2] a better option is to simply prevent further stores from being dispatched.
Your example has a single store; there is no reordering. There is no value before 8, so it doesn't require any barrier. If you made an additional store to the array, that would be equivalent to my example.
I said there was a value before 8, I just didn't describe it well enough.
First off, your example looks a little under-constrained (what if reader:0 happens before writer:0?), so let's assume all the writes of 0 at the start of every program happen before any other instructions.
Let there be a third thread, "builder". It writes 0 to every address in the array, then fills it with new values including an 8.
The "writer" thread loops repeatedly over the array until it finds an 8, then stores the address in X.
The "reader" thread waits for X to be nonzero then loads the value at [X].
In the toy coherent computer, reader will always load an 8.
But in a very weak memory model, one that doesn't enforce dependent loads, reader is allowed to get a 0.
The write to X and the write of 8 don't have to show up in any particular order.
But in that situation I would say that X is "more recent" than 8, because of causality. And if you asked me yesterday I would have called the 0 "stale".
Your example was hard to follow. What do you mean, exactly? Write it out in terms of the memory operations performed and the values observed by each CPU at each address, with assertions that are or are not violated according to your example.