The memory hierarchy can never completely flatten. Information takes physical space to store, and the speed of light is finite. So the more information you store, the farther apart things are, and the longer accesses will take.
That being said, this theoretical limit might not be the main bottleneck right now. Parallelism can also sidestep it, since not all information needs to be processed at the same location.
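To put rough numbers on the speed-of-light part, here's a quick back-of-envelope sketch in C. The bit density and capacities are made-up round numbers, not data for any real chip; the point is just how the light-speed bound grows with capacity.

```c
/* Back-of-envelope: speed-of-light lower bound on memory access latency.
 * The areal bit density and the capacities are illustrative assumptions. */
#include <stdio.h>
#include <math.h>

int main(void) {
    const double c = 3.0e8;              /* speed of light, m/s */
    const double bits_per_m2 = 1.0e14;   /* assumed areal bit density */

    /* For a square memory array, capacity fixes the side length,
     * and the far corner fixes the worst-case one-way distance. */
    for (double bytes = 1e6; bytes <= 1e12; bytes *= 1e3) {
        double area   = (bytes * 8.0) / bits_per_m2;   /* m^2 */
        double side   = sqrt(area);                    /* m */
        double dist   = side * sqrt(2.0);              /* far corner, m */
        double rtt_ns = 2.0 * dist / c * 1e9;          /* round trip, ns */
        printf("%8.0e bytes -> side %.3f mm, light-bound RTT %.4f ns\n",
               bytes, side * 1e3, rtt_ns);
    }
    return 0;
}
```

With these assumed numbers the bound is negligible at megabyte scale and only reaches a few nanoseconds at terabyte scale, which is why the limit is real but probably not the bottleneck today.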
> The memory hierarchy can never completely flatten. Information takes physical space to store, and the speed of light is finite. So the more information you store, the farther apart things are, and the longer accesses will take.
Not necessarily. Memristors have been shown to be capable of performing logical operations [1]. Given that, one can imagine computations migrating through the circuit to stay physically as close to the data as possible, thereby eliminating the need for caches.
Since we can use the same sort of MOSFETs found in SRAM to perform logical operations, you could equally well imagine computations migrating out into memory even without memristors. The problem is more one of computational model than physical process. People are certainly looking at things part-way there, like grid computing, but learning how to use these systems efficiently is a hard computer science problem.
Micron built DIMM modules with computation on them as an experiment in this. They of course simply put a CPU inside the DIMM controller (imagine each of your DIMMs were a GUMSTIX rather than a regular DIMM), but the concept they were shooting for was the same. I did some preliminary VHDL work on a 'memory' controller which was basically a DIMM/CPU bus for a pseudo-cluster of these things.
The bottom line was that it was buildable, but there were a lot of challenges with the way compilers and whatnot generated code (or didn't) for these sorts of systems. I think a couple of projects at UMich and UCB got funded, but I don't know if they produced any results.
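To make the programming-model problem a bit more concrete, here's a toy sketch of what "ship the computation to the DIMM" could look like from the host side. Everything in it (the `pim_*` names, the simulated DIMM-side kernel) is invented for illustration; it's not Micron's actual interface or anyone else's.

```c
/* Toy model of a compute-capable DIMM: the "DIMM-side" function stands in for
 * code running on the DIMM's embedded processor. All names here are invented
 * for illustration only. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int32_t *cells;     /* memory physically resident on the DIMM */
    size_t   n;
} pim_dimm;

/* Host -> DIMM: one bulk copy into DIMM-resident memory. */
static void pim_write(pim_dimm *d, const int32_t *src, size_t n) {
    d->cells = malloc(n * sizeof *src);
    memcpy(d->cells, src, n * sizeof *src);
    d->n = n;
}

/* "Kernel" executed by the DIMM's local processor: the operands never travel
 * back over the memory bus; only the 8-byte result does. */
static int64_t pim_reduce_sum(const pim_dimm *d) {
    int64_t acc = 0;
    for (size_t i = 0; i < d->n; i++)
        acc += d->cells[i];
    return acc;
}

int main(void) {
    int32_t host_data[1000];
    for (int i = 0; i < 1000; i++) host_data[i] = i;

    pim_dimm dimm = {0};
    pim_write(&dimm, host_data, 1000);                           /* data moves once     */
    printf("sum = %lld\n", (long long)pim_reduce_sum(&dimm));    /* compute stays local */

    free(dimm.cells);
    return 0;
}
```

The hard part the compilers struggled with is deciding, automatically, which pieces of a program belong on which DIMM's processor; the host-side plumbing above is the easy bit.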
I guess it's similar in idea to something like the Cell [1] processor, with 'processing elements'/cores sitting on a blob of local memory/cache and very NUMA-like access to everything else. Scaling the RAM per unit / number of units is probably where the complication lies. What sort of cluster sizes were you thinking of?
I dunno what the current capabilities of compilers for these Cell-like architectures are, but I recall that there were (still are?) serious problems getting good performance out of the PS3 due to the difficulty of its programming model. I'd hope they've advanced the state of the art some, though.
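Roughly, the pain is that nothing is transparently cached: each core sees only its small local store, and the programmer (or compiler) has to stage data in and out explicitly. Here's a schematic of that style with the DMA engine stubbed out as plain copies; the `dma_get`/`dma_put` names and the local-store size are made up for illustration, not the actual Cell SDK.

```c
/* Schematic of an explicit local-store programming model (Cell-SPE-like).
 * dma_get/dma_put are stand-ins for a real DMA engine, not a real API. */
#include <stdio.h>
#include <string.h>

#define LOCAL_STORE_FLOATS 256                 /* tiny on-core memory, assumed size */
#define N 4096

static float main_memory[N];                   /* big, far away, NUMA-ish          */
static float local_store[LOCAL_STORE_FLOATS];  /* small, fast, private to the core */

/* In real hardware these would be asynchronous transfers the programmer must
 * issue and wait on; here they are plain copies for illustration. */
static void dma_get(float *ls, const float *mem, size_t n) { memcpy(ls, mem, n * sizeof *ls); }
static void dma_put(float *mem, const float *ls, size_t n) { memcpy(mem, ls, n * sizeof *ls); }

int main(void) {
    for (int i = 0; i < N; i++) main_memory[i] = (float)i;

    /* The kernel must be hand-tiled to fit the local store: fetch a chunk,
     * compute on it locally, write it back, repeat. Nothing is cached for you. */
    for (int base = 0; base < N; base += LOCAL_STORE_FLOATS) {
        dma_get(local_store, &main_memory[base], LOCAL_STORE_FLOATS);
        for (int i = 0; i < LOCAL_STORE_FLOATS; i++)
            local_store[i] = local_store[i] * 2.0f + 1.0f;
        dma_put(&main_memory[base], local_store, LOCAL_STORE_FLOATS);
    }

    printf("main_memory[100] = %.1f\n", main_memory[100]);   /* prints 201.0 */
    return 0;
}
```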
My (entirely naive) brain screams MPI, but I don't know how well that would work in practice.
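MPI does at least encode the right discipline: each rank owns its slice of the data and computes on it locally, with only small results crossing the network. A minimal sketch using standard MPI calls (the distributed sum is just a stand-in workload):

```c
/* Minimal MPI sketch of "keep the computation next to the data": each rank
 * owns a slice of the data and only an 8-byte partial sum ever leaves it.
 * Build with mpicc, run with e.g. mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

#define CHUNK 1000000   /* elements owned by each rank (assumed size) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank generates and processes its own slice; the bulk data never moves. */
    double local_sum = 0.0;
    for (long i = 0; i < CHUNK; i++)
        local_sum += (double)(rank * (long)CHUNK + i);

    /* Only the partial results travel: a single reduction to rank 0. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %.0f\n", size, global_sum);

    MPI_Finalize();
    return 0;
}
```

Whether that message-passing discipline scales down gracefully to thousands of tiny compute-in-memory nodes is exactly the open question above.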