Everything I've learned building the fastest Arm desktop (jeffgeerling.com)
157 points by alphabettsy on Oct 27, 2023 | 52 comments



From the article: "The GPU can be seen by Windows, but Nvidia only publishes Arm drivers for Linux, not Windows. So in device manager you just see a Basic Display Adapter, and it can't really do anything."

Never thought I'd live in a world where drivers are published for Linux first. This is great.


I mean, Arm drivers for GPUs are used a ton in scientific computing and deep learning.

Nvidia sells ARM servers with 8-16 GPUs, to my knowledge.

So that makes sense.


Nvidia also wanted to buy ARM and has just announced their own ARM CPU, so yeah.


And they have been pushing Tegra for more than a decade.


Their embedded SoCs include GPUs as well.


The NVIDIA Jetson is Arm based, with (duh) their GPU. Funnily enough, at $DAYJOB I used an Altra Arm server to do builds of our codebase - it replaced having to build the code directly on a device sitting on a rack in our server room (slow! Especially on Xavier), and saved me from having to set up cross-compilation.


I'm very confused by the talk of "teraflops". Author references linpack (which is a double precision benchmark), but then says "if you really care about teraflops, you need a graphics card" and installs a 4070 Ti, which does like 600 gigaflops in double (IIRC). This would make more sense if we were talking about single precision (4070 does ~40 TFlops single?); which is it?

The original 1.2 teraflop CPU number also smells funny; 128 cores * 2 neon units / core * 2 doubles / neon unit * 2 flops / cycle * 2.8 billion cycles / second = 2.8 TFlops peak, an even halfway decent BLAS will get you to 80% of that, for 2.3 TFlops. Double that number for single precision. Either a completely untuned BLAS or benchmarking a problem that's much too small for the CPU under test.
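
For anyone following along, a quick sketch of that back-of-the-envelope peak (using the parent's own assumptions for Altra Max: 2 NEON pipes per core, 2 FP64 lanes per 128-bit vector, FMA counted as 2 flops, 2.8 GHz):

    # Theoretical FP64 peak for a 128-core Altra Max, per the parent's assumptions
    cores = 128
    neon_units_per_core = 2      # 128-bit NEON/ASIMD pipes per core
    fp64_lanes_per_unit = 2      # two doubles per 128-bit vector
    flops_per_lane = 2           # fused multiply-add counted as 2 flops per cycle
    clock_hz = 2.8e9             # 2.8 GHz

    peak = cores * neon_units_per_core * fp64_lanes_per_unit * flops_per_lane * clock_hz
    print(f"Peak FP64: {peak / 1e12:.2f} TFLOPS")           # ~2.87 TFLOPS
    print(f"80% of peak: {0.8 * peak / 1e12:.2f} TFLOPS")   # ~2.29 TFLOPS, the 2.3 figure above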


Caveat: I consider myself a noob at benchmarking, after spending probably 1,000 hours on it over the years.

The CPU benchmark used HPL linpack following a standard-ish Top500-style benchmark, mostly because I think it's fun to see how various single CPU systems compare to historic 'supercomputers' on the official lists.

There are different ways to calculate (and benchmark) flops, the way I'm doing it is with this open source project: https://github.com/geerlingguy/top500-benchmark — which can use OpenBLAS, Blis, or ATLAS, and I've tried all three un-tuned.
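
For reference, HPL's reported score is just a fixed operation count for the LU factorization and solve divided by wall-clock time; the N and runtime below are purely illustrative numbers, not actual results:

    # HPL reports GFLOPS as (2/3 * N^3 + 2 * N^2) / time -- the standard LU operation count
    N = 100_000        # problem size (an N x N matrix of doubles is ~80 GB of RAM)
    seconds = 450.0    # hypothetical wall-clock solve time

    flops = (2.0 / 3.0) * N**3 + 2.0 * N**2
    print(f"{flops / seconds / 1e9:.1f} GFLOPS")  # ~1481.5 GFLOPS with these made-up numbers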

I also worked with some Ampere engineers to run their optimized version for Ampere Altra Max: https://github.com/AmpereComputing/HPL-on-Ampere-Altra

On their own server systems with 8 channels of memory they topped out around 1.6 Tflops. There are other tests you can run and get more or less Tflops, but I based my own results on the top500 test.

See more discussion about the results and testing in the following links:

https://github.com/geerlingguy/top500-benchmark/issues/19

https://github.com/geerlingguy/top500-benchmark/issues/17

https://github.com/geerlingguy/top500-benchmark/issues/10

(And see some of the issues linked back to the Ampere repo in those issues.)

Edit: And regarding the video card mention—it was meant more as a generic reference (e.g. running A100 or 4090 since you can go much further there than Altra Max), and not specifically to the 4070 Ti... but I can see how that is not as clear!


Hey Jeff, I loved the video. I know it was more about "current state of the tech" and less about what we should actually be buying, but it would be very cool to hear more about how each of these setups is priced on the scale of "dollar per unit of performance" or something like that. (Or maybe that's not fair to do, since most consumer software can't handle all those cores yet?)

I'm also curious whether you think Apple's decisions on memory architecture (despite being non-upgradeable) will have a leg up in the long run. You mentioned that memory bandwidth tops out around 174GB/s. Although you handily beat the Mac Pro in a multicore benchmark thanks to core count, one of the Mac Pro's claims to fame is its memory bandwidth of 800GB/s, as well as its unified memory architecture.
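
(For context on those numbers, a back-of-the-envelope sketch: assuming all eight of the Altra's DDR4-3200 channels are populated - the exact DIMM config on this board is my assumption - the theoretical ceiling is about 205 GB/s, so ~174 GB/s measured is in the right ballpark.)

    # Back-of-the-envelope DDR4 bandwidth ceiling; channel count is an assumption
    channels = 8                # Altra supports 8 DDR4 channels
    transfers_per_sec = 3200e6  # DDR4-3200 = 3200 MT/s
    bytes_per_transfer = 8      # 64-bit data bus per channel

    peak_gbs = channels * transfers_per_sec * bytes_per_transfer / 1e9
    print(f"Theoretical peak: {peak_gbs:.1f} GB/s")             # 204.8 GB/s
    print(f"174 GB/s measured = {174 / peak_gbs:.0%} of peak")  # ~85%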


As with all things, it's a tradeoff. HBM on servers is similar to Apple's choice, and Xeon, EPYC, Nvidia H100, and some other designs incorporate it. There are good things (performance) and bad things (price/non-upgradeability) about it. Best of both worlds would be chip-on-module plus expansion slots, so the fast RAM is like L4 cache.


That's essentially how things are likely to go with CXL, though the latency isn't likely to be quite as good as on a dedicated DIMM connection or even as "good" as it was with IBM's OMI. The future (imminent in enterprise and sometime around when PCIe 6.0 hits consumer machines) looks to be mostly a combination of HBM and CXL memory.


Hi Jeff, great article!

Fast & efficient cpu cores, perhaps slightly memory bandwidth limited, sounds like a great compile/build machine.

Any chance of doing a Linux kernel compile, and time that? Bonus points for 2nd, 3rd etc run with much of the stuff in RAM.

Curious minds would like to know.


Also the slowest, according to the Cinebench single-core numbers.

But it can only get better from here. I'm glad that PC users are starting to benefit from Apple's "desktop ARM" revolution.


I seriously doubt ARM will hit consumer desktops any time soon. Servers, where people care about power efficiency? Yes. Laptops, where people care about power efficiency? Yes.

But on "gaming rigs" where people are content to eat Nvidia's absolute bed-shitting power consumption, and where replaceable RAM is a rallying cry? Not going to happen. It'll take until they're dragged along by the other two segments, sometime in 2-3 hardware generations. Maybe 2029-30 at the earliest.


An interesting aspect of this is that gaming notebooks/laptops already outsell gaming desktops by ~2:1¹, and that ratio will grow as desktops continue to become niche devices.

I like your estimate, but I think it's on the conservative side by at least a couple of years. The migration from x86 to ARM will be slow until it isn't.

¹ https://www.statista.com/statistics/1003576/gaming-pc-shipme...


yeah -- I like a beefy desktop, but man oh man, the power that the M-series chips in MacBooks deliver is nice: light package, good performance, no noise, less power consumption.

I'm sure people gaming on Windows will appreciate that too, given the success of the Steam Deck. If gamers could get 2K or 4K gaming on machines using 15-30W, whether ARM or x86, then us devs would similarly benefit as well.


Expectations of a sudden, rapid transition have been made for the better part of a decade. Microsoft teamed up with Qualcomm to build Windows-10-on-ARM laptops back in 2016.


Does Office 365 run on ARM? Are there Chrome or Firefox builds for ARM?

I don't see PTC, Cadence, Autodesk etc. releasing any ARM binaries soon.


For applications that aren't ARM-native, Microsoft's x86 emulation on ARM works well in my experience.


128 cores is impressive, but the single core performance is painfully low compared to the M series.


To be fair, the Neoverse N-series was designed for datacenter-style usage, where lower clock frequencies and going wide are more common even on x86. Some of the things the Apple M-series does for single-core performance in client parts may or may not carry over well to a design like this; in particular, 128-byte cache lines would potentially add a lot more false sharing, and the really big-and-fast per-core L2 is probably not practical even if you added the 16 KB page size that Apple uses to achieve it (L2 sizes didn't get any better with AmpereOne either, though). Those are big, good single-core wins that make sense for client SKUs. I think you'll always have some of those tradeoffs with CPUs like this.

Hopefully Qualcomm Oryon will spice things up on the client side a bit. Maybe we can get some real HEDT designs after that.


That's all true, but then maybe it's not "the fastest Arm desktop" if it's 25% faster in multi-core workloads but the M2 Ultra is _168% faster in single core_ (that's 168% faster as in 2.68x as fast, not 68% faster). There are plenty of workloads where the M2 Ultra will be significantly more performant.

I'm excited for the future of ARM desktops, but that also means they need to significantly increase single core performance. It looks like maybe the Snapdragon X Elite will get us there? Or at least significantly closer.


AmpereOne uses a customized CPU core too, and I haven't had a chance to test on it yet to see how it compares. Still probably far below the latest Qualcomm or Apple core designs, but should be better than the Altra/Altra Max series.

I would love to see an EPYC-style Arm CPU with M1 or better cores.


Outside classical C UNIX applications, most language runtimes are multithreaded anyway.

Windows has been multithreaded for decades across all layers; even classical Win32 written in C is using multiple threads.

Android, ChromeOS and macOS likewise.

In the end, single-core performance being that relevant is quite a niche case in modern software stacks.


Having multiple threads does not mean that they are all doing equally useful work. Single threaded performance is absolutely critical for a desktop machine.

Even in multithreaded desktop applications, it's rare to see them effectively use more than 8 threads.


There are some tasks where single core definitely limits the performance (some games especially). For most of the 'compute' oriented tasks like CAD/3D, LLMs, etc. multicore is great, and the slow single core speed doesn't seem to get in the way.

I would still rather have 128 M1-class cores than 128 Neoverse-N1 cores :)


I would rather have 10 M-series cores and use the rest of the area for cache.


Yeah, in the classical UNIX tradition.

macOS, ChromeOS, Windows and Android are heavily multithreaded; even when there is one main application thread, the underlying OS APIs are using auxiliary threads.

Easily observable in any system debugger.


A lot of software that doesn't support multithreading could easily support it by implementing a single threaded graph partitioning algorithm or an algorithm to find disconnected subgraphs. Your ability to feed your cores is now limited by this piece of single threaded code.

The easy way to get around this is to have heterogeneous cores. Ampere could have easily slapped 2 A76 cores for every 32 Ampere cores to prevent this problem.

The harder way to get around this is to multithread the partitioning algorithm, but that will convince more people that multithreading isn't worth it, because their Ryzen runs the same code faster even though it might only have 12 cores.

https://box2d.org/posts/2023/10/simulation-islands/
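
For the curious, the "find disconnected subgraphs" step described above boils down to connected components over the contact graph, e.g. with a single-threaded union-find. A minimal sketch (names are illustrative, not Box2D's actual API):

    # Single-threaded union-find that partitions bodies into "islands" (connected
    # components of the contact graph); each island can then be solved on its own core.
    def find(parent, x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def islands(num_bodies, contacts):
        parent = list(range(num_bodies))
        for a, b in contacts:              # each contact joins two bodies
            ra, rb = find(parent, a), find(parent, b)
            if ra != rb:
                parent[ra] = rb
        groups = {}
        for body in range(num_bodies):
            groups.setdefault(find(parent, body), []).append(body)
        return list(groups.values())       # hand each island to a worker thread

    print(islands(6, [(0, 1), (1, 2), (3, 4)]))  # [[0, 1, 2], [3, 4], [5]]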


Python enters the chat.


If we ignore the threads that might be executing in the background for GC, async I/O support, and, in some runtimes, the JIT compiler.


This is very promising.

The benchmark results are great, with the exception of the single core speeds. We just need non-Apple ARM to catch up in the single core speeds across the board -- they seem particularly slow here compared to everyone else. Even Intel and AMD I think outdo Apple ARM now on single core.

I think it is a bit of a shame that Apple's top of the line M1/M2 go to 24 cores and the rumoured M3 CPUs only go up to 32 cores at most. I think the Ultra should have 64 cores personally. Rendering and video people want that.


> Even Intel and AMD I think outdo Apple ARM now on single core.

Yeah they do currently. But, if the single-core performance jump we see between A15 and M2 (same architecture), about 25%, can be replicated between A17 and M3, we should expect Apple to retake the single-core performance lead, as M2 is only ~8-12% behind. It's not unreasonable to expect a perf jump like this, given it's mostly just shoving more watts through the same core architecture, and Apple's tremendous efficiency lead gives them more headroom to do that than Intel's 14th gen had over 13th.

It's also a totally valid question whether M3 will be based on A17 or A16. I'm not sure if we have strong rumors to point one way or the other. We'll know more next week.


Just nitpicking, but calling an Altra system "the fastest ARM desktop" is misleading. The CPU does have a huge flops number, but nothing "desktop" can saturate that many cores, especially when the system comes with a GPU. Altra is designed for cloud workloads, and might be useful for some specialized tasks, but high core count (>8) has little-to-no impact in the "desktop" context. Single-core performance is more important because it can improve response time without performing peculiar synchronization dances.


> The CPU does have a huge flops number, but nothing "desktop" can saturate that many cores, especially when the system comes with a GPU.

I've seen this kind of argument made against multicore CPUs when AMD and Intel started selling them. Critics started claiming these new chips were being designed like Gillette razors, and there was no way desktop applications, not even AAA games, could ever saturate so many cores. It took the release of Supreme Commander to get these critics to finally shut up.

The moral of the story is that today's software was designed targeting yesterday's hardware. Nevertheless, even today you'd be hard-pressed to find a single application that doesn't run multiple threads, and on top of that you'd be hard-pressed to find a single desktop user that only runs a single application when working on a computer. Even smartphones today are packing CPUs with 8 cores.

And finally, HN being a tech-related forum, most of the people here work on software. Your typical run-of-the-mill build process is embarrassingly parallel. This means that all it takes to saturate that many cores is for anyone to hit "build". And I won't even go into ML.


Yeah, I agree overall, but my complaint was that the author was covering literally "desktop" usages.

> The moral of the story is that today's software was designed targeting yesterday's hardware.

The problem with CPUs with high core counts is that a GPU can be more efficient in most cases. The tasks that specifically need a CPU are likely complex and dynamic - in terms of codeflow and dataflow - so they can't benefit from SIMD/SPMD. I don't think "desktop" needs any such complexities, especially when the actual complexity of desktop software has hardly increased for decades, except for web browsers.


This is not right. This is a desktop and I use it to write programs which implies invoking a compiler. Having a snappy UI is important but short compile times can be importanter(!) once a baseline latency has been achieved.


> Single-core performance is more important because

Because there are a lot of svchost.exe instances fighting for this core. /s


Anyone know the cost of a box like this? Couldn't find it in the article.


You can spec out prices here; these are the same boxes used in the article. The entry price is about $2k USD for the 32-core SKU: https://www.ipi.wiki/products/com-hpc-ampere-altra#

The Dev Kit is probably what most people here would want, but if you probe around the site you can also find the fully loaded (PSU, case, etc) option too.


Yeah sorry about that; there was one link to the dev kit buried in the text, but I've added links to the Dev Platform and Dev Kit near the top now.


$2,000 for the 32-core, up to $3,300 for the 128-core. It's non-trivial to get hold of. The biggest hurdle Ampere has is distribution (at least if the plan is to move from dev kits to consumer hardware); they are obnoxiously enterprise.


I've been thinking about buying one of these. Unfortunately they do not seem to support SVE, which is a showstopper for me. I wonder when we'll get the first affordable systems that have it.


Oh this is interesting. Thanks for the article Jeff.

I am trying to think of alternative desktop environments for the use of multiple cores. Perhaps in the future we shall all do parallelisable matrix multiplication, translate and scale relationship vectors with our desktops, and mash up new applications trivially, so we can use all 128 cores.


Wasn't there a Fujitsu A64FX workstation or am I totally misremembering?


On the cloud provider side of things, there is so much demand for Windows Server on Arm, it's astounding that Microsoft still haven't shipped a version yet.


Am I weird that my first thought about a 128 core CPU is all the scrolling in htop?


You can bring it to its knees with MSFS2020.


I lost him when he said this:

> And RAM goes a lot deeper too. Look at these two sticks of RAM. See how the one on the right has twice the number of memory modules? That allows the individual stick of RAM to pump through data more quickly than the one on the left, even though both of them are rated at DDR 3200 and CL22.

The bus width is defined by the DIMM standard. A 64-bit DIMM (which is what he is showing) always has a 64-bit data interface, regardless of what devices are on the DIMM. The DIMM with fewer devices on it just has 16-bit-wide microchips on it. That doesn't have anything to do with the latency or clock speed of the chips themselves.

And it's strange that he calls each microchip a "module". I would call that a microchip or a device. A "module" is an assembly that contains multiple microchips.

I just get the vibe that he's someone skilled at putting systems together but doesn't have any real engineering background.


He's referring to Rank Interleaving, and he's correct:

https://frankdenneman.nl/2015/02/20/memory-deep-dive/#:~:tex....

You've given off the vibe of someone who assumes they're correct without double checking themselves ;)


Memory rank has been a real factor for performance for a long time. For some reason the topic never got much traction in the PC building/tech world, probably because most people dismiss it, since it does sound questionable at first thought.


For most use cases it isn't noticeable to end users, especially your standard office worker or gamer. Even the highest end of gamers.

It's really only a concern in the realm of ultra-large-memory applications: science, machine learning, HPC, etc.



