I really doubt it. Bitcoin mining is quite fixed, just massive amounts of SHA256. On the other hand, ASICs for accelerating matrix/tensor math are already around. LLM architecture is far from fixed and currently being figured out. I don't see an ASIC any time soon unless someone REALLY wants to put a specific model on a phone or something.
LLMs and many other models spend 99% of their FLOPs in matrix multiplication. And the TPU initially had essentially a single operation, i.e. multiply matrices. Even if an ASIC were 100x better than a GPU at all the other operations, that only touches the remaining 1%, so it would be roughly 1% faster overall.
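To put rough numbers on that (a quick Amdahl's-law sketch; the 99%/1% split and the 100x figure are just the assumptions above):

    # Amdahl's-law style estimate: if 99% of time is matmul and the ASIC
    # only speeds up the remaining 1%, the overall gain is tiny.
    matmul_frac = 0.99      # assumed fraction of runtime in matrix multiplication
    other_speedup = 100.0   # hypothetical 100x speedup on everything else

    new_time = matmul_frac + (1.0 - matmul_frac) / other_speedup
    print(f"overall speedup: {1.0 / new_time:.3f}x")  # ~1.01x, i.e. about 1%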
You can still optimize various layers of memory for a specific model, make it all 8 bit or 4 bit or whatever you want, maybe burn in a specific activation function, all kinds of stuff.
No chance you'd only get 1% speedup on a chip designed for a specific model.
Apple has the Neural Engine and it really does speed up many CoreML models: if most operators are implemented on the NPU, inference is significantly faster than on the GPU on my MacBook M2 Max (and it has a similar NPU to the one in e.g. the iPhone 13). That ASIC NPU just implements many of the typical low-level operators used in most ML models.
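For what it's worth, with coremltools you can load the same model restricted to CPU+GPU and again with the Neural Engine allowed, and compare latency yourself (the model path is a placeholder, and CPU_AND_NE needs a reasonably recent coremltools/macOS):

    import coremltools as ct

    # Load the same CoreML model twice with different compute unit restrictions.
    # "model.mlpackage" is a placeholder path for whatever model you're testing.
    gpu_model = ct.models.MLModel("model.mlpackage",
                                  compute_units=ct.ComputeUnit.CPU_AND_GPU)
    ane_model = ct.models.MLModel("model.mlpackage",
                                  compute_units=ct.ComputeUnit.CPU_AND_NE)

    # Operators the Neural Engine doesn't support fall back to CPU/GPU,
    # which is why "most operators implemented on the NPU" matters for the speedup.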
99% of the time is spent on matrix-matrix or matrix-vector calculations. Activation functions, softmax, RoPE, etc. basically cost nothing in comparison.
Most NPUs are programmable, because the bottleneck is data SRAM and memory bandwidth instead of instruction SRAM.
For classic matrix-matrix multiplication, the SRAM bottleneck is the number of matrix outputs you can store in SRAM. N rows and M columns get you N x M accumulator outputs. The calculation of each dot product can be split into separate steps without losing the N x M scaling, so the SRAM consumed by the row and column vectors is insignificant in the limit.
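A minimal sketch of that tiling argument, with plain NumPy standing in for the hardware and made-up tile sizes: the N x M accumulator tile is what sits in SRAM, while the K dimension is streamed through in chunks.

    import numpy as np

    def tiled_matmul(A, B, tile_n=8, tile_m=8, tile_k=4):
        """C = A @ B computed tile by tile.

        The (tile_n x tile_m) accumulator block is the part that has to live
        in SRAM; the dot products are split into K-chunks, so the row/column
        slices stay small relative to the accumulators as K grows.
        """
        N, K = A.shape
        K2, M = B.shape
        assert K == K2
        C = np.zeros((N, M), dtype=A.dtype)
        for i in range(0, N, tile_n):
            for j in range(0, M, tile_m):
                acc = np.zeros((min(tile_n, N - i), min(tile_m, M - j)), dtype=A.dtype)
                for k in range(0, K, tile_k):   # stream the dot product in steps
                    acc += A[i:i+tile_n, k:k+tile_k] @ B[k:k+tile_k, j:j+tile_m]
                C[i:i+tile_n, j:j+tile_m] = acc  # N x M accumulator outputs in total
        return C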
For the MLP layers in the unbatched case, the bottleneck lies in the memory bandwidth needed to load the model parameters. The problem is therefore how fast your DDR/GDDR/HBM memory and your NoC/system bus let you transfer data to the NPU.
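Back-of-the-envelope version of that bound (all numbers are assumptions: a 7B-parameter model at 8-bit and ~1 TB/s of memory bandwidth):

    # Unbatched decoding streams essentially every parameter once per token,
    # so tokens/s is roughly bandwidth / model size no matter how much compute you have.
    params = 7e9                 # assumed 7B-parameter model
    bytes_per_param = 1          # assumed 8-bit quantization
    bandwidth = 1e12             # assumed ~1 TB/s (HBM/GDDR class)

    model_bytes = params * bytes_per_param
    print(f"upper bound: {bandwidth / model_bytes:.0f} tokens/s")  # ~143 tokens/s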
Having a programmable processor that controls the matrix multiplication functional unit costs you silicon area for the instruction SRAM. For matrix-vector multiplication, the memory bottleneck is so big that it doesn't matter what architecture you are using; even CPUs are fast enough. There is no demand for getting rid of the not-very-costly instruction SRAM.
"but what about the area taken up by the processor itself?"
Wait... you were serious? The area taken up by an in-order VLIW/TTA processor is so insignificant that I jammed it into the routing gap between two SRAM blocks. Sure, the matrix multiplication unit might take up some space, but decoding instructions is such an insignificant cost that anyone opposing programmability must have completely different goals and priorities than LLMs or machine learning.
As far as I understand, the main issue for LLM inference is memory bandwidth and capacity. Tensor cores are already an ASIC for matmul, and they idle half the time waiting on memory.
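Arithmetic-intensity view of the same thing, with rough assumed numbers for an A100-class part:

    # Matrix-vector work does ~2 FLOPs (multiply + add) per 2-byte fp16 weight loaded,
    # i.e. ~1 FLOP/byte. A part with ~300 TFLOP/s of tensor-core fp16 and ~2 TB/s of
    # HBM needs ~150 FLOP/byte to keep the tensor cores busy, hence they sit idle.
    peak_flops = 300e12          # assumed tensor-core fp16 throughput
    bandwidth = 2e12             # assumed HBM bandwidth
    matvec_intensity = 2 / 2     # 2 FLOPs per 2-byte fp16 weight = 1 FLOP/byte

    needed_intensity = peak_flops / bandwidth   # FLOP/byte needed to be compute-bound
    print(f"need {needed_intensity:.0f} FLOP/byte, matvec gives {matvec_intensity:.0f}")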
LLM inference is a small task built into some other program you are running, right? Like an office suite with some sentence suggestion feature, probably a good use for an LLM, would be… mostly office suite, with a little LLM inference sprinkled in.
So, the “ASIC” here is probably the CPU with, like, slightly better vector extensions. AVX1024-FP16 or something, haha.