NUMA is used when it is actually needed, which is not that often. Most other software gets by without being NUMA-aware (if you think competent parallel software developers are hard to come by, try finding competent NUMA people; it will make you reconsider that assessment).
And NUMA-like architectures on a single CPU die have become increasingly common, between Intel's multiple ring buses on larger Xeons and AMD's 4-core clusters (CCXs) on Ryzen. Even per-core L2 caches violate the assumption that a given memory address is equally accessible from any processor core. You can't pretend that memory is all equidistant from the processor cores unless you want everything running equally slowly.
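
For anyone who wants to see how non-equidistant their own machine is, here is a rough sketch that reads the node distance table Linux exposes under /sys/devices/system/node. The function names are my own, and it assumes a Linux box with that sysfs layout; anywhere else it just returns an empty table. Note that it only shows what the firmware reports as NUMA nodes, so on-die effects like ring buses or core clusters will not appear unless the platform exposes them as separate nodes.

```python
import glob
import os
import re


def parse_distance_line(line):
    """Parse one sysfs distance row such as '10 21' into a list of ints."""
    return [int(tok) for tok in line.split()]


def read_numa_distances(base="/sys/devices/system/node"):
    """Return {node_id: [relative distance to each node]}.

    Assumes the Linux sysfs layout; returns an empty dict if it is absent.
    """
    table = {}
    for path in glob.glob(os.path.join(base, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        if not match:
            continue
        try:
            with open(os.path.join(path, "distance")) as fh:
                table[int(match.group(1))] = parse_distance_line(fh.read())
        except OSError:
            # Node directory without a readable distance file: skip it.
            continue
    return table


if __name__ == "__main__":
    for node, row in sorted(read_numa_distances().items()):
        print(f"node {node}: {row}")
```

By convention a node's distance to itself is reported as 10, and larger values mean relatively slower access. A single-node machine prints one row with one entry, which is exactly the case where you can ignore all of this.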