
Moving to e.g. shared scratchpad memory would be a major paradigm shift; you'd have to do a lot of coordination between CPU vendors, OS/kernel developers, high-performance library writers, etc. to make it happen.



> Moving to e.g. shared scratchpad memory would be a major paradigm shift

Shared scratchpad memory is (slowly) happening in the GPGPU world, at least in very limited circumstances.

But yeah, I think that's why GPGPU programmers are managing to get better scaling than on classical CPUs: GPGPU hardware is more willing to toy with the memory model, and GPGPU programmers are willing to use those highly specialized communication features.


I don't know much about GPGPU-land, but all of the difficulties I foresee have to do with multiprocessing/sharing/context switching... my guess is that only one logical program can use the scratchpad at a time?


> I don't know much about GPGPU-land, but all of the difficulties I foresee have to do with multiprocessing/sharing/context switching... my guess is that only one logical program can use the scratchpad at a time?

Well... that one logical program can have 1024 SIMD-threads (which is maybe 16 (AMD) or 32 (NVidia) "actual" threads, where an "actual thread" is a "ticking program counter" by my definition), but yeah, it's "one logical program" from the perspective of the GPU.

It's even more specific than "one logical program": it's "one threadgroup". So if you have a program that spins up 4096 SIMD-threads, only 1024 of them at a time can actually share a particular Shared Memory. (The GPU will allocate, say, 10kB to each "threadgroup", but the threadgroup covering threads 0-1023 can only touch its own 10kB block, while the threadgroup covering threads 1024-2047 can only touch a different 10kB block.)

GPU Shared Memory can be "split" between different programs. So one program can reserve 10kB, while a 2nd program can reserve 20kB, and then both can run on one GPU-unit. But the two programs are unable to touch each other's shared memory.
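
To make that concrete, here's roughly what the scratchpad looks like in CUDA terms (a minimal sketch; the sizes and kernel name are just illustrative): __shared__ memory is a per-block (per-threadgroup) scratchpad, and the barrier only synchronizes threads within that one block.

    // Each block (threadgroup) gets its own copy of this scratchpad;
    // threads in other blocks can't see it.
    __global__ void blockSum(const int *in, int *out) {
        __shared__ int scratch[1024];             // per-block shared memory

        int tid = threadIdx.x;
        scratch[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                          // barrier across this block only

        // Tree reduction done entirely inside this block's scratchpad.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) scratch[tid] += scratch[tid + stride];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = scratch[0];
    }

Launched as, say, blockSum<<<4, 1024>>>(d_in, d_out): that's 4096 SIMD-threads total, but each group of 1024 only ever shares its own scratch[].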


> all of the difficulties I foresee have to do with multiprocessing/sharing/context switching...

I'd like to be able to park a high performance actor on a particular CPU, and have it consuming its "inbox" queue, no context switches necessary.
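
You can get most of the way there today in userspace, at least on Linux: pin the actor's thread to one core and busy-poll the inbox so it never blocks. A rough C++ sketch (the Inbox here is a hypothetical stand-in for a real SPSC queue):

    #include <pthread.h>   // pthread_setaffinity_np (glibc/Linux)
    #include <sched.h>
    #include <atomic>

    struct Inbox {                        // hypothetical stand-in for a real SPSC queue
        std::atomic<int> msg{-1};
        bool pop(int &out) {
            int v = msg.exchange(-1, std::memory_order_acquire);
            if (v < 0) return false;
            out = v;
            return true;
        }
    };

    void actor_loop(Inbox *inbox, int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);  // park on `cpu`

        for (;;) {                        // spin: no blocking calls, no voluntary context switch
            int m;
            if (inbox->pop(m)) {
                // handle message m
            }
        }
    }

The catch is the kernel can still preempt you or schedule other work onto that core unless you also isolate it (isolcpus etc.), which is exactly the kind of cross-stack coordination problem mentioned upthread.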

> my guess is that only one logical program can use the scratchpad at a time?

How about registers or a scratchpad which has async pub/sub semantics? One CPU can write, and the rest can read? A mechanism like that would use the same hardware which already supports cache coherency, but would let programmers forget about a lot of potential headaches.
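
On current hardware you can fake the single-writer/many-reader part over ordinary coherent memory with a seqlock; what you don't get is the "async" half (readers still poll, and the writer still generates coherence traffic). A rough C++ sketch just to pin down the semantics (the names are made up):

    #include <atomic>
    #include <cstdint>

    // One writer core publishes; any number of reader cores poll.
    // The sequence counter lets readers detect torn reads (classic seqlock).
    struct Broadcast {
        std::atomic<uint64_t> seq{0};
        std::atomic<uint64_t> payload[4]{};          // the "broadcast registers"

        void publish(const uint64_t (&v)[4]) {       // single writer only
            uint64_t s = seq.load(std::memory_order_relaxed);
            seq.store(s + 1, std::memory_order_relaxed);         // odd: write in progress
            std::atomic_thread_fence(std::memory_order_release);
            for (int i = 0; i < 4; ++i)
                payload[i].store(v[i], std::memory_order_relaxed);
            seq.store(s + 2, std::memory_order_release);         // even: stable again
        }

        bool read(uint64_t (&out)[4]) const {        // any reader, on any core
            uint64_t s1 = seq.load(std::memory_order_acquire);
            if (s1 & 1) return false;                // writer mid-update; caller retries
            for (int i = 0; i < 4; ++i)
                out[i] = payload[i].load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            return seq.load(std::memory_order_relaxed) == s1;    // false => retry
        }
    };

A hardware pub/sub register file could in principle push updates to subscribed cores instead of making them poll, which is the part that would need new coordination.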


How did mainframe I/O channels work?


As I recall, you wrote channel instructions, or chains of channel instructions, as "CCW"s (Channel Command Words), which you dispatched to the channels via a "SIO" or "SIOF" instruction (Start I/O, or Start I/O Fast).

Once the operation started, the SIO(F) returned to you and the channel operated asynchronously from the CPU, moving the data directly in or out of memory. I think SIOF returned as soon as the channel received the instruction, whereas SIO waited until the operation actually began.

So the channel effectively acted as a sort of co-processor with the same access to memory as the other CPUs. I'm pretty sure the channel could use virtual addresses, but the memory of course had to be fixed (page-fixed) during the operation.
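
For a rough sense of what a CCW held (recalled from memory, so treat the details as approximate, and ignore the exact bit packing): each one was an 8-byte word carrying a command code, a data address, chaining flags, and a byte count.

    #include <cstdint>

    // Illustrative only, roughly the S/360-era fields (not the real bit layout):
    struct CCW {
        uint8_t  command;       // operation: read, write, control, transfer-in-channel, ...
        uint32_t data_address;  // where in storage the data goes (originally a 24-bit field)
        uint8_t  flags;         // e.g. "command chaining": another CCW follows this one
        uint16_t count;         // number of bytes to transfer
    };

SIO/SIOF essentially handed the channel the address of the first CCW in a chain like this, and the channel walked the chain on its own while the CPU kept running.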


How about another set of registers, which can be written by one CPU, and read by the others?

Could channels be implemented as co-processors with access to memory, going through the cache? I know there are instructions to fix cache lines.

I suppose some of this could be accomplished by compilers on current hardware, if they had information about CPU cores being targeted.


> How about another set of registers, which can be written by one CPU, and read by the others?

The L3 cache is doing something similar. Whatever you come up with has to cope with each core implementing out-of-order execution, so it can't be both trivially simple and ultra-fast.

The L2 and L1 caches are more local to the particular core, and so are faster. More generally, it seems to me unlikely that it would ever make sense to trade off against per-core performance.

> I suppose some of this could be accomplished by compilers on current hardware, if they had information about CPU cores being targeted.

Are we talking about a new software abstraction, or a performance-enhancement on existing hardware? The latter seems unlikely to me - the parallelism folks would've thought of it.



