Hacker News
Ampere Readies 256-Core CPU Beast, Awaits the AI Inference Wave (nextplatform.com)
48 points by rbanffy 3 months ago | 24 comments



So they don't get into it until the very end, but the claim is that these things have so many cores that they're actually faster running CPU inference than GPUs while being a little bit cheaper? That'd be pretty fun if true - of all the things to put pressure on Nvidia, I would have bet on another GPU vendor before I bet that CPUs would leap back into being faster!


More like slightly slower but 10x cheaper. Honestly, using CPUs for inference just frees up more Nvidia GPUs to be used for training.


Oops, read the numbers backwards; you're right, the GPU was still a bit faster. Still, yeah, if it's significantly cheaper that's still a huge win.


It's not possible for a CPU to be as efficient at inference as a GPU. The underlying physics just isn't there if you look at the estimates of the energy needed for data movement. The only way to make CPUs viable again is by colocating them with DRAM and eliminating the memory bus. And even then you're going to need specialized cores with systolic arrays.
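
A rough back-of-envelope in Python, with order-of-magnitude energy figures assumed from the usual architecture-textbook estimates rather than measured on any particular chip:

    # All energy numbers below are assumed, order-of-magnitude placeholders.
    pj_per_fp16_mac = 1.0     # ~1 pJ for an on-die FP16 multiply-accumulate (assumed)
    pj_per_dram_bit = 10.0    # ~10 pJ per bit fetched from off-chip DRAM (assumed)

    bytes_per_mac = 2         # worst case: one FP16 weight streamed per MAC, no reuse
    movement_pj = bytes_per_mac * 8 * pj_per_dram_bit

    print(f"compute: {pj_per_fp16_mac} pJ, data movement: {movement_pj} pJ")
    # Data movement dominates by roughly two orders of magnitude when weights
    # are streamed from DRAM with no on-chip reuse, which is the point about
    # colocating compute with DRAM.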


That would be true, but the prices of datacenter GPUs are so inflated (both for NVIDIA and for AMD) that it does not matter that CPUs are less efficient from a technical point of view (i.e. in performance per watt and performance per die area): the performance per dollar of CPUs is currently much better. That was not the case 10 years ago, when GPUs were much cheaper and therefore also had better performance per dollar.


The prices of data center CPUs are also quite eye watering. And their compute throughput is one to two orders of magnitude lower than that of a modern, high end GPU.


"One to two orders of magnitude" may be true only for FP32 or FP16 and only when comparing a single GPU to a single CPU.

For FP64, the ratio in the performance per watt between the few datacenter-only GPU models that support FP64 and CPUs has been only around 3 for multiple GPU and CPU generations.

For inference using the AMX instruction set, I have not seen any useful published benchmark showing the ratio between the performance per watt of Emerald Rapids and NVIDIA, but the performance per dollar is certainly much worse for NVIDIA.

What limits the performance is normally either the performance per dollar or the performance per watt; it seldom matters whether the performance is provided by a single huge GPU or by multiple smaller GPUs or CPUs.

While the top models of the server CPUs are overpriced, there are also models with very decent prices, like the 28-core Xeon 5512U @ $1230 or AMD Ryzen 9 7950X.
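
To illustrate why performance per dollar and performance per watt can rank the same parts differently, a tiny Python sketch; the $1230 CPU price is the list price mentioned above, every other number is a made-up placeholder, not a benchmark:

    # Hypothetical relative inference throughput, price ($) and power (W).
    parts = {
        "datacenter GPU": (10.0, 30000, 700),   # placeholders, not real specs
        "server CPU":     ( 1.0,  1230, 185),   # $1230 from the comment; rest assumed
    }
    for name, (perf, price, watts) in parts.items():
        print(f"{name:14s} perf/$ = {perf/price:.5f}  perf/W = {perf/watts:.4f}")
    # With these placeholders the GPU wins perf/W while the CPU wins perf/$,
    # which is the distinction being argued about here.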


Nobody uses FP64 for AI. Nobody uses consumer SKUs in the datacenters.


This has nothing to do with physics. Even an 8700G offers 4 TFLOPS of bfloat16 performance, and the NPU is a tiny part of the chip.

When it comes to inference, the only differences between a GPU and a CPU are whether it has an inference accelerator, how large the on-chip SRAM is, and how much memory bandwidth it has.

The accelerators are going to be the same on a GPU or a CPU, so you're left with differences in SRAM and in whether the chip is connected to DDR, GDDR, or HBM memory.

That again has very little to do with physics. There is no physical law that prevents you from hooking up GDDR to a CPU if it has a GDDR memory controller.

You are also ignoring the very real possibility of LPDDR PIM, which would negate most of the benefits of GPUs, since the most expensive operations never leave the memory chip in the first place. HBM PIM only made a GPU about 2x faster vs. regular HBM, but LPDDR PIM makes CPUs about 10x faster.
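
To put the memory-bandwidth point in concrete terms, a quick roofline-style estimate for batch-1 decode; the model size and bandwidth figures are assumed round numbers, not spec-sheet values:

    # Batch-1 LLM decode is memory-bound: each generated token streams the whole
    # weight set, so tokens/s is roughly bandwidth / model size in bytes.
    model_gb = 70 * 0.5          # e.g. a 70B-parameter model at 4-bit quantization (assumed)
    systems_gbs = {              # assumed round-number bandwidths
        "8-ch DDR5 server CPU":  300,
        "12-ch DDR5 server CPU": 450,
        "HBM datacenter GPU":   3000,
    }
    for name, bw in systems_gbs.items():
        print(f"{name:22s} ~{bw / model_gb:5.1f} tokens/s upper bound")
    # Whatever memory the cores happen to sit next to sets the ceiling,
    # regardless of whether they are labeled CPU or GPU.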


> The only way to make CPUs viable again is by colocating them with DRAM and eliminating the memory bus.

Wasn't this an idea that Micron was working on about 10 years ago?

Yeah, here it is, the Automata: https://thememoryguy.com/micron-announces-processor-in-memor...

Wonder whatever happened to that?


It probably was way ahead of its time. One major issue with this is that DRAM uses a completely different silicon process from compute, so unless you have stacking (which is still very new), you can only put really shitty, inefficient, low-density compute there, and at that point it becomes a hard sell.


If someone gets their hands on one and wants to build Deno with `cargo build --release`, I'd love to get myself release-optimized executables without waiting ten minutes for my M2 to fully compile and link the project...

Seriously. Hit me up via email.


I think what will emerge eventually are LLM-architecture-specific ICs.


Seems quite likely that there are already several in the works.


Among the hyperscalers, this leaves Ampere serving only Microsoft (Microsoft is also an investor in Ampere), since both AWS and now Google have their own designs. Beyond that there are only the much smaller cloud vendors.

They also have their own custom core design instead of using standard ARM Neoverse cores. Historically speaking, apart from Apple, there hasn't been a single vendor whose custom CPU core designs beat ARM's own IP on price/performance.

Would love to see how it performs against a comparable 3nm Zen 5c part with 128 cores / 256 threads.


Oracle Cloud also ships Ampere-based servers. It's probably a larger cloud than you realise.


It is big, but I don't think Oracle counts as a hyperscaler. It is smaller than even Tencent Cloud and Alibaba Cloud.


Microsoft already announced their own ARM server processor, so it's really Oracle that's left using Ampere.


I've been looking at these as a possible Mac Studio competitor for prosumer inferencing, but the current models on the market are not attractive for this purpose. Expect it to be quite a while before the model being advertised here is available to own.

I do sort of think this is where things are headed for inferencing, though: massively parallel DDR memory and a fast ARM CPU. GDDR- and HBM-based solutions just seem to cost too much.
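
For what it's worth, the DDR arithmetic works out roughly like this, assuming DDR5-5600 and 64 bits per channel (peak numbers, not sustained):

    # Per-channel bandwidth = transfer rate (MT/s) x 8 bytes per 64-bit transfer.
    mts = 5600                          # assumed DDR5-5600
    per_channel_gbs = mts * 8 / 1000    # 44.8 GB/s per channel
    for channels in (2, 8, 12):
        print(f"{channels:2d} x DDR5-{mts}: ~{channels * per_channel_gbs:5.1f} GB/s peak")
    # 2 channels (desktop) ~ 90 GB/s; 12 channels ~ 538 GB/s, which starts to
    # approach GPU-class bandwidth from commodity DIMMs.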


Any experiences on using Ampere as a development machine?


Jeff has a nice teardown of it, but it doesn't cover much actual usage:

https://www.jeffgeerling.com/blog/2023/everything-ive-learne...


So, is Intel just too far behind to catch up at this point?

It looks like there is a lot of progress by quite a few companies, but Intel is still floundering.


Intel is releasing a 288-core CPU pretty soon.


That is right, but the cores used by Intel in Sierra Forest are intermediate in performance between the slower Neoverse N2 cores and the faster Neoverse V2 cores that are used in the latest server CPUs introduced by NVIDIA, Amazon and Google.

For now it is not known how many Neoverse V2 cores Sierra Forest would be equivalent to, nor how the Ampere proprietary cores compare with the Neoverse cores, so it is not known whether the planned 256-core Ampere CPU will be faster or slower than a 288-core Sierra Forest.

However, if the Ampere CPU is ready only much later than Sierra Forest and the faster AMD Zen 5c CPUs (which are also likely to reach 256 cores), its performance might not matter much, as there will be too much competition for it.
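
Just to show the arithmetic, with purely hypothetical per-core factors (no public numbers exist for either core yet):

    # Aggregate throughput ~= core count x per-core performance.
    # The per-core factors are placeholders relative to a Neoverse N2.
    for name, cores, per_core in [("288-core Sierra Forest", 288, 1.2),
                                  ("256-core Ampere",        256, 1.4)]:
        print(f"{name:22s} {cores} x {per_core} = {cores * per_core:.0f} (arbitrary units)")
    # Small shifts in the unknown per-core factor are enough to flip which chip wins.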



