NUMA isn't a feature; it's a design compromise. Ideally, every location in memory takes the same amount of time to access. But in large, complex designs you can increase performance for some memory at the price of lower performance for other memory, i.e. Non-Uniform Memory Access.
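You can see the compromise directly. Here's a rough sketch (an assumption: a Linux box with libnuma installed and at least two NUMA nodes; build with `gcc numa_probe.c -O2 -lnuma`) that times touching a buffer bound to the local node versus one forced onto the furthest node:

```c
/* Rough sketch: compare the time to walk memory placed on the local
 * NUMA node vs. a remote one. Assumes Linux + libnuma and >= 2 nodes. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SZ (64UL * 1024 * 1024)   /* 64 MiB: big enough to defeat the caches */

static double touch_ms(char *buf)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < SZ; i += 64)      /* one access per cache line */
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need libnuma and at least two NUMA nodes\n");
        return 1;
    }
    int far = numa_max_node();
    char *near_buf = numa_alloc_onnode(SZ, 0);    /* pages bound to node 0 */
    char *far_buf  = numa_alloc_onnode(SZ, far);  /* pages bound to the last node */
    if (!near_buf || !far_buf)
        return 1;
    numa_run_on_node(0);                          /* run on a CPU in node 0 */

    touch_ms(near_buf); touch_ms(far_buf);        /* warm-up pass: fault the pages in */
    printf("node 0 (local):  %.1f ms\n", touch_ms(near_buf));
    printf("node %d (remote): %.1f ms\n", far, touch_ms(far_buf));

    numa_free(near_buf, SZ);
    numa_free(far_buf, SZ);
    return 0;
}
```

On a multi-socket or multi-die NUMA machine you'd typically expect the remote pass to be measurably slower; on a genuinely uniform design the two numbers come out about the same.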
I believe these chips have Uniform Memory Access via a shared memory interface on the I/O die. Am I mistaken?
Edit, just confirmed:
>Thanks to this improved design, each chiplet can access the memory with equal latency. The multi-core beasts can support up to 4TB of DDR4 memory per socket.
They exhibit NUMA because if chiplet0 wants a line of memory that chiplet4 is currently holding, it has to go fetch it from chiplet4. So the degree of NUMA is improved over the previous generation, but it is still not UMA.
Generally, caches aren't considered to be "memory" in this sense; otherwise every multi-core chip would be considered NUMA, since they all have private caches. Instead, an architecture is normally called NUMA when cores have different access speeds to different parts of main memory, as when another socket has to forward you data from a RAM bank attached to it. This is something the OS generally has to be aware of in scheduling decisions, unlike caches, which are managed automatically by hardware.
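That's the key distinction: the NUMA topology is deliberately exposed to software so the OS (or a pinned application) can place work near its memory. A small sketch (assuming Linux with libnuma, built with `-lnuma`) that dumps the SLIT-style distance table a NUMA-aware scheduler consults, where 10 means "local" and bigger numbers mean slower remote access:

```c
/* Sketch: print the NUMA distance matrix that placement decisions are based on. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int n = numa_max_node();
    for (int from = 0; from <= n; from++) {
        for (int to = 0; to <= n; to++)
            printf("%4d", numa_distance(from, to));   /* 10 = local node */
        printf("\n");
    }
    return 0;
}
```

Caches, by contrast, have no such interface; the coherence protocol keeps them consistent without the OS lifting a finger.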
The chiplets have caches, which hold copies of memory. If one chiplet holds a line in an exclusive/modified state (say, because a core is using it for a lock), the other chiplets can't just fetch that line from main memory, because the copy there might be stale. They have to ask whoever holds the line to write it back and release it.
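You can feel that coherence cost without any NUMA in the picture at all: have two threads bounce one cache line between their caches, then give each thread its own line. A rough sketch assuming C11 atomics and pthreads (the struct layout and iteration count are just illustrative; build with `gcc bounce.c -O2 -pthread`):

```c
#include <pthread.h>
#include <stdio.h>
#include <stdatomic.h>
#include <time.h>

#define ITERS 50000000UL

/* Two counters sharing one 64-byte cache line: every increment drags the line
 * into exclusive/modified state on one core and invalidates the other's copy. */
static struct { _Atomic unsigned long a, b; } same_line;

/* The same counters padded onto separate lines: no coherence ping-pong. */
static struct { _Atomic unsigned long a; char pad[64]; _Atomic unsigned long b; } split;

static void *bump(void *p)
{
    _Atomic unsigned long *c = p;
    for (unsigned long i = 0; i < ITERS; i++)
        atomic_fetch_add_explicit(c, 1, memory_order_relaxed);
    return NULL;
}

static double run_ms(_Atomic unsigned long *x, _Atomic unsigned long *y)
{
    struct timespec t0, t1;
    pthread_t ta, tb;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, bump, x);
    pthread_create(&tb, NULL, bump, y);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    printf("same cache line:      %.0f ms\n", run_ms(&same_line.a, &same_line.b));
    printf("separate cache lines: %.0f ms\n", run_ms(&split.a, &split.b));
    return 0;
}
```

The shared-line case is typically several times slower, and the gap gets worse when the two threads land on different chiplets or sockets, since the line has to cross the I/O die (or the socket interconnect) on every handoff.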