
That's because on (most) modern systems, main RAM is unified (a shared memory space) and external to the CPU, connected through a fat but high-latency pipe.

A solution is to include RAM with each CPU core on-die. Afaik this is an uncommon approach because the semiconductor fabrication processes for RAM vs. CPU don't match well? But it's not impossible - ICs like this exist, and e.g. SoCs with integrated RAM, CPU caches etc. are a thing.

So imagine a 'compute module' consisting of a 6502-class CPU core + a few (dozen?) KB of RAM directly attached + some peripheral I/O to talk to neighbouring CPUs.

Now imagine a matrix of 1000s of such compute modules, integrated on a single silicon die, all concurrently operating @ GHz clock speeds. Sounds like a GPU, with main RAM distributed across its compute units. :-)

Examples:

  GreenArrays GA144
  Cerebras Wafer Scale Engine
(not sure about the nature of the 'CPU' cores on the latter: general purpose, or minimalist and specialized for AI/neural network processing?)
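
To make the 'matrix of compute modules' idea concrete, here's a minimal Python sketch (toy code, not modeled on the GA144, Cerebras, or any real part): each module computes only against its own directly attached RAM, and exchanges data only with its four nearest neighbours.

  GRID = 4  # 4x4 mesh for illustration; the idea scales to 1000s of cores

  class Module:
      """One compute module: tiny core + local RAM + mailboxes to neighbours."""
      def __init__(self, x, y):
          self.x, self.y = x, y
          self.ram = [0] * 1024   # 'a few KB' of directly attached RAM
          self.inbox = []         # words received from neighbouring modules

  def neighbours(x, y):
      # 4-connected mesh: a module can only talk to adjacent modules
      for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
          nx, ny = x + dx, y + dy
          if 0 <= nx < GRID and 0 <= ny < GRID:
              yield nx, ny

  mesh = {(x, y): Module(x, y) for x in range(GRID) for y in range(GRID)}

  # one lockstep cycle: local compute, then a one-word exchange with neighbours
  for (x, y), m in mesh.items():
      m.ram[0] = x * GRID + y                    # 'compute' in local RAM only
  for (x, y), m in mesh.items():
      for nx, ny in neighbours(x, y):
          mesh[(nx, ny)].inbox.append(m.ram[0])  # nearest-neighbour I/O only

  print(mesh[(1, 1)].inbox)   # an inner module hears from its 4 neighbours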

Afaik the 'problem' is more how to program such systems easily / securely, and how to arrange common peripherals like USB, storage, video output etc. in a practical manner so as to utilize the immense compute power + memory bandwidth of such a system.
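
That difficulty shows up even in the toy model above (hypothetical sketch, same mesh): with only nearest-neighbour links, reading a value from a far-away module becomes an explicit multi-hop routing problem that the programmer (or compiler) has to solve.

  def route(src, dst):
      """Naive dimension-ordered (X-then-Y) route through the mesh."""
      x, y = src
      hops = []
      while x != dst[0]:
          x += 1 if dst[0] > x else -1
          hops.append((x, y))
      while y != dst[1]:
          y += 1 if dst[1] > y else -1
          hops.append((x, y))
      return hops

  # a word in module (0,0)'s RAM must be forwarded 6 times to reach (3,3)
  print(route((0, 0), (3, 3)))  # [(1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]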




> Now imagine a matrix of 1000s of such compute modules, integrated on a single silicon die, all concurrently operating @ GHz clock speeds. Sounds like a GPU, with main RAM distributed across its compute units. :-)

Sounds like a new version of the Connection Machine (the classic / original one).



