So they don't get into it until the very end, but the claim is that these things have so many cores that they're actually faster running CPU inference than GPUs while being a little bit cheaper? That'd be pretty fun if true - of all the things to put pressure on Nvidia, I would have bet on another GPU vendor before I bet that CPUs would leap back into being faster!



More like slightly slower but 10x cheaper. Honestly, using CPUs for inference just frees up more Nvidia GPUs to be used for training.


Oops, read the numbers backwards; you're right, the GPU was still a bit faster. Still, yeah, if it's significantly cheaper that's still a huge win.


It’s not possible for a CPU to be as efficient at inference as a GPU. The underlying physics just isn’t there if you look at estimates of the energy needed for data movement. The only way to make CPUs viable again is to colocate them with DRAM and eliminate the memory bus, and even then you’re going to need specialized cores with systolic arrays.
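A back-of-envelope version of that data-movement argument (the per-operation energies below are my own rough assumptions, in the ballpark of commonly cited estimates, not measured figures):

    # Rough, assumed per-operation energies (order-of-magnitude only; real
    # values depend on process node, memory type, and wire distance):
    PJ_PER_BIT_OFFCHIP_DRAM = 15.0  # assumption: ~10-20 pJ per bit over the memory bus
    PJ_PER_FP16_MAC = 0.5           # assumption: well under 1 pJ per multiply-accumulate

    # Batch-1 LLM decoding: each 16-bit weight is fetched once and used once.
    energy_move = 16 * PJ_PER_BIT_OFFCHIP_DRAM  # pJ to move one weight
    energy_math = 1 * PJ_PER_FP16_MAC           # pJ to actually use it
    print(f"move: {energy_move:.0f} pJ vs compute: {energy_math:.1f} pJ "
          f"(~{energy_move / energy_math:.0f}x dominated by data movement)")

Under those assumptions nearly all the energy goes into the bus, which is the case for doing the math where the data already lives.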


That would be true, but the prices of datacenter GPUs are so inflated (for both NVIDIA and AMD) that it does not matter that CPUs are less efficient from a technical point of view (i.e. in performance per watt and performance per die area): the performance per dollar of CPUs is currently much better. That was not the case 10 years ago, when GPUs were much cheaper and therefore also had better performance per dollar.


The prices of datacenter CPUs are also quite eye-watering. And their compute throughput is one to two orders of magnitude lower than that of a modern, high-end GPU.


"One to two orders of magnitude" may be true only for FP32 or FP16 and only when comparing a single GPU to a single CPU.

For FP64, the performance-per-watt ratio between the few datacenter-only GPU models that support FP64 and CPUs has been only around 3x for multiple GPU and CPU generations.

For inference using the AMX instruction set, I have not seen any useful published benchmark showing the performance-per-watt ratio between Emerald Rapids and NVIDIA, but the performance per dollar is certainly much worse for NVIDIA.
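For what it's worth, a rough way to get a bf16 number on an AMX-capable Xeon is a sketch like the one below; it assumes a PyTorch build whose oneDNN backend dispatches bf16 GEMMs to AMX on Sapphire/Emerald Rapids, and the thread count and matrix size are arbitrary assumptions:

    import time
    import torch

    torch.set_num_threads(28)  # assumption: 28 physical cores available
    n, iters = 4096, 20
    a = torch.randn(n, n, dtype=torch.bfloat16)
    b = torch.randn(n, n, dtype=torch.bfloat16)

    torch.matmul(a, b)  # warm-up
    t0 = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    dt = (time.perf_counter() - t0) / iters
    print(f"~{2 * n**3 / dt / 1e12:.1f} TFLOP/s bf16 matmul")

Divide the result by measured package power and by street price to get the per-watt and per-dollar numbers the comparison actually needs.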

What limits performance is normally either performance per dollar or performance per watt; it seldom matters whether that performance is provided by a single huge GPU or by multiple smaller GPUs or CPUs.

While the top server CPU models are overpriced, there are also models with very decent prices, like the 28-core Xeon 5512U @ $1230 or the AMD Ryzen 9 7950X.


Nobody uses FP64 for AI. Nobody uses consumer SKUs in the datacenters.


This has nothing to do with physics. Even an 8700G offers 4 TFLOPS of bfloat16 performance, and the NPU is a tiny part of the chip.

The only differences between a GPU and a CPU when it comes to inference are whether it has an inference accelerator, how large the on-chip SRAM is, and how much memory bandwidth it has.

The accelerators are going to be the same on a GPU or a CPU, so you're left with differences in SRAM and whether the chip is connected to DDR, GDDR, or HBM memory.
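To put rough numbers on the DDR/GDDR/HBM point: for batch-1 LLM decoding, the token rate is usually capped by how fast the weights can be streamed, regardless of whether the chip is a CPU or a GPU (all figures below are illustrative assumptions, not measurements):

    model_bytes = 7e9 * 2  # assumption: 7B parameters stored in bf16

    bandwidth_gbs = {      # assumption: rough, typical peak bandwidths
        "dual-channel DDR5":   90,
        "12-channel DDR5":    460,
        "GDDR6X card":       1000,
        "HBM3 accelerator":  3300,
    }

    for name, gbs in bandwidth_gbs.items():
        tokens_per_s = gbs * 1e9 / model_bytes
        print(f"{name:>18}: ~{tokens_per_s:5.1f} tokens/s ceiling")

Same model, same math; the attached memory sets the ceiling.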

The memory attachment, again, has very little to do with physics: there is no physical law that prevents you from hooking up GDDR to a CPU if it has a GDDR memory controller.

You are also ignoring the very real possibility of LPDDR PIM, which would negate most of the benefits of GPUs, since the most expensive operations never leave the memory chip in the first place. HBM PIM only made a GPU 2x faster vs. regular HBM, but LPDDR PIM makes a CPU 10x faster.


> The only way to make CPUs viable again is by colocating them with DRAM and eliminating the memory bus.

Wasn't this an idea that Micron was working on about 10 years ago?

Yeah, here it is, the Automata: https://thememoryguy.com/micron-announces-processor-in-memor...

Wonder whatever happened to that?


It was probably way ahead of its time. One major issue is that DRAM uses a completely different silicon process from compute, so if you don't have stacking (which is still very new), you can only put really shitty, inefficient, low-density compute there, and at that point it becomes a hard sell.



