What about power consumption? Edit: My understanding from about a year ago is that AMD's and NVDA's chips were priced similarly in terms of performance per watt.
Wait, looking at that link, I don't see how it avoids downloading CUDA or ROCm. Do you use MLIR to compile to GPU without using the vendor-provided tooling at all?
Let's say the model has 50B parameters and 50 layers. That would mean about one billion values (50B / 50) have to travel over Wi-Fi for every generated token?
I wonder how much data that is in bytes and how long it would take to transfer.
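A rough back-of-envelope sketch, taking the one-billion-values figure from the question at face value. The bytes per value and the 500 Mbit/s effective Wi-Fi throughput are assumptions picked for illustration, not measurements, and `transfer_estimate` is just a hypothetical helper name.

```python
def transfer_estimate(num_values, bytes_per_value, link_megabits_per_s):
    """Return (total bytes, seconds) to move num_values over the link."""
    total_bytes = num_values * bytes_per_value
    # Convert bytes to bits, then divide by the link rate in bits per second.
    seconds = total_bytes * 8 / (link_megabits_per_s * 1e6)
    return total_bytes, seconds


# 50B parameters spread over 50 layers, as in the question above.
values_per_token = 50_000_000_000 // 50

# Assumed effective throughput; real Wi-Fi links vary a lot.
ASSUMED_WIFI_MBIT_PER_S = 500

for name, width in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    total, secs = transfer_estimate(values_per_token, width, ASSUMED_WIFI_MBIT_PER_S)
    print(f"{name}: {total / 1e9:.0f} GB per token, about {secs:.0f} s")
```

Under those assumptions, that is somewhere between 1 and 4 GB per token depending on precision, and tens of seconds per token on the assumed link.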
So…