
The most glaring omission is that they only compared against fp16 models, not against quantized models. And of course benchmark numbers can be misleading relative to real-world experience.

But if you wanted to make LLM-specific hardware (or x64 instructions tuned for LLMs), this model architecture makes that extremely cheap. Multiplication requires a lot of transistors; this architecture needs only two-bit adders. You could build SIMD instructions that do thousands of these in parallel for fairly little silicon cost.
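To make that concrete, here's a rough scalar sketch of what the inner loop could look like when each weight is ternary {-1, 0, +1} and packed into two bits. The packing scheme and function name are made up for illustration; a SIMD version would do the same thing across wide registers, still with no multiplier.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative sketch: dot product with ternary weights,
     * each stored as a 2-bit code (00 = 0, 01 = +1, 10 = -1).
     * No multiplies -- each weight contributes an add, a
     * subtract, or nothing. */
    int32_t ternary_dot(const uint8_t *w_packed, const int8_t *x, size_t n)
    {
        int32_t acc = 0;
        for (size_t i = 0; i < n; i++) {
            /* extract the i-th 2-bit weight code */
            uint8_t code = (w_packed[i / 4] >> ((i % 4) * 2)) & 0x3;
            if (code == 1)      acc += x[i];  /* weight = +1 */
            else if (code == 2) acc -= x[i];  /* weight = -1 */
            /* code == 0: weight is zero, skip */
        }
        return acc;
    }

The whole "multiply" step collapses into a conditional add/subtract, which is why the silicon cost is so low compared to an fp16 multiply-accumulate unit.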
