> It is significantly faster to access memory directly via CXL than to use message passing with JSON or gRPC.
I'm not talking about supercomputing. This thread is about the feasibility of running larger models on "smaller GPUs".
Do you think that GPUs access host memory using JSON or gRPC?
CXL is nice because it allows coherent, low-latency access to pooled memory. But it's not going to move the needle much on running large models on GPUs, because large-model inference is more bandwidth-sensitive than latency-sensitive, and coherence isn't a major concern for it.
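A rough back-of-envelope sketch of why bandwidth is the ceiling (assuming a hypothetical 70B-parameter FP16 model and approximate peak link figures): if every weight has to cross the link once per generated token, the link's bandwidth caps tokens/sec no matter how low its latency is.

```python
# Back-of-envelope: bandwidth-bound ceiling on decode speed.
# Illustrative only; real systems cache weights, batch requests,
# and overlap transfers with compute.

def tokens_per_sec(model_bytes: float, link_gb_per_s: float) -> float:
    """Upper bound if all weights cross the link once per token (batch size 1)."""
    return (link_gb_per_s * 1e9) / model_bytes

model_bytes = 70e9 * 2  # hypothetical 70B params in FP16 ~= 140 GB

links = {  # approximate peak bandwidths
    "CXL / PCIe 5.0 x16 (~64 GB/s)": 64,
    "GDDR6X local VRAM (~1000 GB/s)": 1000,
    "HBM3 local VRAM (~3000 GB/s)": 3000,
}

for name, bw in links.items():
    print(f"{name}: ~{tokens_per_sec(model_bytes, bw):.2f} tokens/s ceiling")
```

Under those assumptions the CXL-attached case tops out well under one token per second, while local VRAM is an order of magnitude or two faster, and shaving latency doesn't change that.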
I also think RAM interfaces are going to scale much faster than IO interfaces, so CXL is at best a temporary win over local GPU memory for inference.