
How did the author just assume that CPUs are competitive for inference? Maybe yes if you only want to run a 7-billion-parameter model at a batch size of 1, but with batching (including continuous batching in vLLM), GPUs have two orders of magnitude higher throughput. And even assuming Moore's law is alive and well, it will take a decade to reach current GPU throughput. There is no way companies will shift to CPUs for inference.
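A rough sketch of where that batching advantage comes from (the bandwidth and model-size figures below are illustrative assumptions, not benchmarks): during decoding every forward pass has to stream the full set of weights from memory, so a larger batch amortizes that traffic over more generated tokens, until compute or KV-cache traffic becomes the limit.

```python
# Toy throughput model for decoding: one forward pass streams all weights once
# and produces one token per sequence in the batch. Ignores compute limits and
# KV-cache traffic, which dominate at large batch sizes.

weights_gb = 14          # assumption: 7B params at fp16 (~2 bytes/param)
bandwidth_gbs = {"CPU (DDR5)": 80, "GPU (HBM)": 2000}   # assumed figures

for name, bw in bandwidth_gbs.items():
    for batch in (1, 8, 64):
        seconds_per_pass = weights_gb / bw
        tokens_per_s = batch / seconds_per_pass
        print(f"{name:11s} batch={batch:2d}  ~{tokens_per_s:8.0f} tok/s")
```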



For local inference there often isn't a batch. If I chat with my own llama instance, the batch size is one. The model processes a single token at a time, doing a lot of vector-matrix multiplication, which is bandwidth bound. CPUs like the M1/M2 are very competitive here.

Also, for many applications local inference only needs to be fast enough. No need to do real-time object detection at 1000 FPS or chat at 300 tokens/s (code generation changes this).
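As a back-of-envelope example (the bandwidth and model-size numbers are assumptions): a 7B model quantized to roughly 4 bits is about 4 GB of weights, and at batch size 1 each token requires streaming all of them once, so tokens/s is roughly memory bandwidth divided by model size, comfortably above human reading speed on an M1/M2-class part.

```python
model_gb = 4.0           # assumption: 7B model at ~4-bit quantization
bandwidth_gbs = 100.0    # assumption: M1/M2-class unified memory bandwidth

# Weights are streamed once per token at batch size 1, so this is an upper bound.
tokens_per_s = bandwidth_gbs / model_gb
print(f"~{tokens_per_s:.0f} tok/s upper bound")   # ~25 tok/s, faster than most people read
```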


I understand that many people on HN prefer open-ish LLMs on local hardware, but sadly I think it doesn't make sense from a hardware-efficiency perspective. Transferring input/output text costs almost nothing, and local hardware can't be fully utilized by a few people. SaaS makes sense here, though I understand that privacy and censorship matter.


For straightforward chat, batching wouldn't be very useful, but it can still be very useful for building apps on top of local LLMs, which I'm hoping we'll see more and more of.


Speculative decoding has a higher effective batch size.
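A minimal sketch of the idea, with toy stand-ins for the draft and target models and greedy acceptance only: the cheap draft model proposes several tokens, and the expensive target model checks them all together (in one batched forward pass in a real implementation), so the big model's weight traffic is amortized over multiple candidate tokens per step.

```python
import random
random.seed(0)

def target_next(prefix):
    # Toy stand-in for the big model's greedy next token.
    return (sum(prefix) * 31 + 7) % 1000

def draft_next(prefix):
    # Toy stand-in for a small draft model that usually agrees with the target.
    t = target_next(prefix)
    return t if random.random() < 0.8 else (t + 1) % 1000

def speculative_step(prefix, k=4):
    # 1) Draft proposes k tokens autoregressively (cheap).
    ctx, proposal = list(prefix), []
    for _ in range(k):
        ctx.append(draft_next(ctx))
        proposal.append(ctx[-1])
    # 2) Target verifies the k proposed positions (simulated one by one here;
    #    a real implementation does this in a single batched forward pass),
    #    accepts the matching prefix, and emits its own token at the first
    #    mismatch, so every step yields at least one token.
    ctx, out = list(prefix), []
    for t in proposal:
        expected = target_next(ctx)
        out.append(expected)
        ctx.append(expected)
        if expected != t:
            break
    return out

print(speculative_step([1, 2, 3]))   # often several tokens per target pass
```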


> How did the author just assume that CPUs are competitive for inference?

CPUs have IGPs. And they are pretty good these days.

LLMs in particular are an odd duck because the compute requirements are relatively modest compared to the massive model size, making them relatively RAM bandwidth bound. Hence DDR5 IGPs/CPUs are actually a decent fit for local inference.

It's still inefficient, yeah. Dedicated AI blocks are the way to go, and many laptop/phone CPUs already have them; they just aren't widely exploited yet.


> How did the author just assume that CPUs are competitive for inference?

Apple marketing has been fierce. You see it in text on the internet, but in reality, it's just Nvidia everywhere (unfortunately).

If AI weren't so transformative, I'd say we have a sad few years ahead, but the tools have been so useful, I'll just bend the knee to Nvidia.


There are a variety of tricks for making CPU inference competitive, and start-ups that have made a business out of such software, e.g. NeuralMagic.

But yes, the author does not give a substantive position despite his expertise in the area (he's worked on, e.g., the USB TPU product Google used to sell).


Well, if CPUs are equipped with extra instructions like Intel AMX then why not? https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions
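As a hedged illustration (whether AMX is actually used depends on the CPU, OS, and PyTorch/oneDNN build, so treat this as an assumption): frameworks usually reach AMX indirectly. For example, a bfloat16 matmul in PyTorch on a Sapphire Rapids Xeon can be lowered to AMX tile instructions by oneDNN rather than the programmer issuing them by hand.

```python
import torch

# On an AMX-capable Xeon with a recent oneDNN-enabled PyTorch build, this bf16
# matmul may be dispatched to AMX tile instructions under the hood (assumption:
# no manual intrinsics or special "activation" needed at this level).
a = torch.randn(2048, 2048, dtype=torch.bfloat16)
b = torch.randn(2048, 2048, dtype=torch.bfloat16)
with torch.inference_mode():
    c = a @ b
print(c.shape, c.dtype)
```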


Only newer Intel server CPUs have this, and even then it's a really odd set of instructions to "activate" and use.

Even without AMX, llama.cpp is already fairly bandwidth bound for short responses, and the cost/response on Sapphire Rapids is not great. I bet performance is much better on Xeon Max (Sapphire Rapids with HBM), but those SKUs are very expensive and rare.


The author is suggesting we will start to see specially built servers that are optimized for AI inference. There's no reason these can't use special CPUs that utilize odd instructions. If inference does turn out to be optimized differently from training, I think it's likely we'll see a whole ecosystem around it, with specially built "inference" CPUs and the like.


> The author is suggesting we will start to see specially built servers that are optimized for AI inference.

So far, cool genAI projects are rarely ported to anything outside of Nvidia or ROCm. Hence I am skeptical of this accelerator ecosystem.

There is a good chance AWS, Microsoft, Google and such invest heavily in their own inference chips for internal use, but these will all be limited to those respective ecosystems.


The author is assuming inference will eventually be split across many cheaper machines.

To their credit, if an Nvidia box is $50,000 and a CPU box is $5,000, then for the same money a cluster of ten CPU boxes is only one order of magnitude behind on throughput.
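Spelling out that arithmetic under the figures above (the throughput ratio is an assumption taken from the "two orders of magnitude" claim upthread): if one GPU box has ~100x the throughput of one CPU box but costs 10x as much, equal spend on CPU boxes still leaves you ~10x behind.

```python
gpu_box_cost, cpu_box_cost = 50_000, 5_000   # figures from the comment above
gpu_throughput, cpu_throughput = 100.0, 1.0  # assumption: GPU box is 100x one CPU box

boxes = gpu_box_cost // cpu_box_cost          # 10 CPU boxes for the same spend
cluster_throughput = boxes * cpu_throughput   # ignores interconnect/sharding overhead
print(f"~{gpu_throughput / cluster_throughput:.0f}x gap at equal cost")
```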


I think (no source, just heard from a few folks) that if they run at full capacity, the electricity cost gets larger than the purchase cost within a few years. And energy per FLOP is an order of magnitude lower on GPUs.
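A rough calculation of that claim (power draw, facility overhead, and electricity price below are all assumptions, and they vary a lot by region): a fully loaded CPU server run around the clock racks up an electricity bill that reaches a $5,000 purchase price in a few years.

```python
power_kw = 1.0            # assumption: one fully loaded 2-socket CPU server
pue = 1.5                 # assumption: cooling/facility overhead multiplier
price_per_kwh = 0.12      # assumption: USD per kWh
box_cost = 5_000          # CPU box price from the comment above

annual_cost = power_kw * pue * 24 * 365 * price_per_kwh
print(f"~${annual_cost:,.0f}/yr, purchase price reached in ~{box_cost / annual_cost:.0f} years")
```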


Training can take hours, days, weeks, or months.

Doesn't the inference part only take seconds? Since it requires a fraction of the computation, can't CPUs be optimized for that? A few matrix multiplications.


Training an LLM can be batched - you can train on entire sentences / blocks at a time. But when doing inference, you have to generate one token at a time so you can feed the output token back into the input.

The optimization problem is that it's often not the CPU that's bottlenecked. It's RAM. As I understand it, if you run llama locally you need to pull a few gigabytes of weights through the CPU for every output token before your computer can start figuring out the next one. Since the weights don't fit in your CPU's cache, DDR bandwidth is the limiting factor: you're just pulling all the weights into your CPU over and over. GPUs are faster in part because they have much faster memory buses.

To really optimize this stuff on the CPU, we need more than a few new CPU instructions. We need to dramatically increase RAM bandwidth. The best way to do that is probably bringing RAM closer to the CPU, like in Apple's M1/M2 chips and Nvidia's new H100 chips. This will require a rethink of how PCs are currently built.
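A minimal sketch of the decode loop described above (the `model` function here is a hypothetical stand-in, not a real API): each iteration runs a full forward pass to get one next token, which is appended to the context before the next pass, which is why the weights get re-streamed for every generated token.

```python
def model(tokens):
    # Hypothetical stand-in for a full forward pass: in a real LLM this is
    # where all the weights get pulled through the memory bus once.
    return (sum(tokens) * 31 + 7) % 32000   # pretend next-token id

def generate(prompt_tokens, n_new=16, eos=0):
    ctx = list(prompt_tokens)
    for _ in range(n_new):
        nxt = model(ctx)      # one forward pass -> one token
        if nxt == eos:
            break
        ctx.append(nxt)       # feed the output back in as input
    return ctx

print(generate([101, 2009, 318]))
```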


Inference is only seconds on a GPU, but have a look at the FLOPS of modern GPUs versus CPUs - matrix multiplication throughput differs by two orders of magnitude. Seconds on the GPU is minutes on the CPU. And don't forget inference needs to scale in the data center; it needs to run repeatedly for many users.
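Rough peak numbers behind that claim (both figures below are assumptions based on published specs, and real workloads rarely hit peak): a data-center GPU with bf16 tensor cores is in the hundreds of TFLOPS, while a typical desktop CPU doing AVX2 FMAs is around one TFLOPS.

```python
gpu_tflops = 312.0                          # assumption: A100-class bf16 tensor-core peak
# assumption: 8-core desktop CPU, 2 AVX2 FMA units per core, 8 fp32 lanes, 4 GHz
cpu_tflops = 8 * 2 * 8 * 2 * 4.0e9 / 1e12   # ~1 TFLOPS fp32 peak

print(f"CPU ~{cpu_tflops:.1f} TFLOPS vs GPU ~{gpu_tflops:.0f} TFLOPS "
      f"(~{gpu_tflops / cpu_tflops:.0f}x)")  # roughly two orders of magnitude
```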


It could, and they are. But that's only relevant if you're running the model locally. If the model is being run at scale, then throughput matters and GPUs would still be king.


GPUs will be better optimized for large matrix multiplications than CPUs, by design.

And you need to run inference again and again, not just a single time (unlike training).


Desktop CPUs have integrated GPUs, so it's more complicated. If I infer on the GPU inside my CPU, how do you count that?


True. Memory bandwidth may be the most limiting factor.


With co-processors, he's saying.



