> CPU’s not running the main thread don’t care about the execution order of the instructions of the main thread.
On x86, they do (the x86 family is unusual in having strong memory ordering), but that's not the issue here.
> They only see their local caches of the same memory location got changed from 0 to 5, 5 to 0, and back to 5 repeatedly.
Their local caches only ever see a 5 at that location, because the value in memory is already 5 at the moment they read that cache line. The operating system ensures that the main thread's write of 5 is flushed to memory[*] before the main thread starts the child thread, and also that the cache of the core running the child thread holds no stale data from before that moment. The location is set back to 0 only after all the child threads have exited, so there's no instant at which a child thread could read a 0 from main memory into its cache.
> When a new thread lands on a CPU with the old 0 cache value, it will hang.
A new thread can indeed land on a CPU core that still has an old 0 cached for that memory location; this could happen if that core had been running the main thread, and the main thread was migrated to another core before it set the value back to 5. Even so, the new thread will still see a 5 at that location, because the operating system invalidates the cache of a CPU core when necessary before starting a new thread on it.
[*] Actually, it only has to be flushed as far as the last level cache, or the "point of unification" in ARM terminology; I simplified a lot in this explanation.
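To make the ordering concrete, here is a minimal sketch of the pattern being discussed, assuming POSIX threads. The names `shared_value`, `child`, and `count_children_seeing_five` are made up for illustration and are not from the original code. It relies on POSIX listing `pthread_create` and `pthread_join` among the functions that synchronize memory, which is why a plain `int` is enough here.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical shared location: a plain int, no atomics and no volatile. */
static int shared_value = 0;

/* Each child stores the value it read into its own private slot. */
static void *child(void *arg) {
    int *seen = arg;
    /* This read is ordered after the parent's write of 5, because the
       parent wrote before calling pthread_create. */
    *seen = shared_value;
    return NULL;
}

/* Starts n children and returns how many of them observed the value 5. */
int count_children_seeing_five(int n) {
    if (n <= 0) {
        return 0;
    }
    pthread_t *threads = malloc((size_t)n * sizeof *threads);
    int *seen = malloc((size_t)n * sizeof *seen);
    assert(threads != NULL && seen != NULL);

    shared_value = 5;  /* written before any child thread exists */

    for (int i = 0; i < n; i++) {
        int rc = pthread_create(&threads[i], NULL, child, &seen[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < n; i++) {
        int rc = pthread_join(threads[i], NULL);
        assert(rc == 0);
    }

    shared_value = 0;  /* reset only after every child has exited */

    int count = 0;
    for (int i = 0; i < n; i++) {
        count += (seen[i] == 5);
    }
    free(threads);
    free(seen);
    return count;
}
```

Calling `count_children_seeing_five(4)` from a `main` should report that all four children saw the 5, whichever cores they ran on. If the reset to 0 were moved before the joins, the program would have a data race and this guarantee would no longer hold.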