The Ampere Altra Max Review: Pushing It to 128 Cores per Socket

Syonyk · on Oct 8, 2021

That was about as expected. Going to 128 cores for cache-resident compute workloads helps, it hurts for anything else. That is a seriously cache starved chip... only 16MB SLC?

Even with 1MB of L2... that's still a cache starved chip. The new iPhone has almost as much cache on it!

That said, I would absolutely love some sort of mid-range ARM desktop (that isn't Apple). There are the "small workstation" class ARM boxes (quite a few thousand dollars, 64 cores or so), there are the various small board ARM computers (4-6 cores, $100), and... there's really nothing in the middle.

Give me 4 X2 cores, a cluster of A74s clocked hard, and put it in a NUC type case with at least 16GB of RAM, preferably more, and some NVMe.

Teknoman117 · on Oct 8, 2021

Have you seen the honeycomb lx2? It has 16 cortex a72 cores @ 2.0 GHz, USB 3.0, quad 10G Ethernet, PCIe, and supports up to 64 GiB of DDR4-3200 in SODIMM format. Performance wise it’s around a x86 quad core from a few years ago.

https://shop.solid-run.com/product/SRLX216S00D00GE064H09CH/

It $750, but it’s the only thing that comes to mind. There’s a $1000 variant that adds 40 GbE support.

adrian_b · on Oct 8, 2021

Unfortunately, the more than 6-year old Cortex-A72 cores are really obsolete.

Especially when having many ARM cores, it is desirable to have cores that support at least the ARMv8.2-A ISA, which corrects the initial mistake of the 64-bit ARM ISA of omitting the atomic instructions.

Actually the mistake was corrected in ARMv8.1-A, but all easily available ARM cores support either ARMv8.2-A (Cortex-A55, Cortex-A75 and newer, also NVIDIA Carmel) or ARMv8.0-A (Cortex-A73 and older).

If you want to experiment with many ARM threads and how to organize the synchronization and communication between them, the results on ancient Cortex-A72 cores can be misleading.

Sadly the vast majority of ARM vendors offer only very old ARM cores. Except for the chips intended for mobile phones, the only devices with not too old ARM cores are slow devices with quadruple Cortex-A55 cores.

If you want more threads than that, the best that can be found would be an 8-core NVIDIA Xavier AGX or a 6-core NVIDIA Xavier NX.

An 8-core Xavier might actually have a speed not much different than a 16 core Cortex-A72, even if the NVIDIA cores are relatively slow when compared with more recent ARM cores, i.e. Carmel is a little slower than Cortex-A75.

Buying a Honeycomb lx2 makes sense only when your interest is in the multiple 10G Ethernet and you do not care about what kind of CPU is provided on the board, i.e. you are not interested in the board as a software development platform for future ARM devices.

danieldk · on Oct 8, 2021

Sadly the vast majority of ARM vendors offer only very old ARM cores. Except for the chips intended for mobile phones, the only devices with not too old ARM cores are slow devices with quadruple Cortex-A55 cores.

I fear that Asahi Linux on a Mac Mini (especially when the M1X comes out with possibly 6 or 8 high-performance cores) is going to be the best option for a modern Linux ARM workstation for a while.

Teknoman117 · on Oct 8, 2021

The largest hurdles in these sorts of projects tend to the GPU and video codecs.

I wish them luck and hope they succeed in entirety, but I can only early adopt so many things. Hopefully Apple doesn't reinvent their GPU every single year.

mhh__ · on Oct 8, 2021

That's a lotta money for not much oomph. Single threaded performance is probably going to hurt, not everything is a SPEC benchmark ;)

Mistletoe · on Oct 8, 2021

Can anyone provide insight into who buys that sort of device? It seems kind of pricey for what you get.

matoro · on Oct 8, 2021

I bought one of these. Unfortunately due to supply-chain holdup the order is pending indefinitely. The primary use-case is as a core router for high-throughput networks like ISP backbones, hence the focus on built-in SFP+ which would usually need an array of PCIe cards to accomplish on a standard box. But you can also use it as a NAS which is how I intend to because of the 4x SATA. Both of these use cases require only moderate CPU power and benefit from the ECC support.

smackeyacky · on Oct 8, 2021

I'm keen on an ARM workstation. I do all my dev work on Linux and deploy everything to Linux hosts. AWS offers ARM hosts at half the cost of amd64 ones, you can cross compile but I would be happier if my build box was the same processor.

What might work is some kind of local headless cloud machine running a pile of VMs until somebody builds something I can hang two big monitors off. Laptops are too constraining.

exikyut · on Oct 9, 2021

I just thought for a moment it could be interesting to try and convince Amazon to do for their ARM kit what Google have done with their TPU chipset (https://cloud.google.com/edge-tpu, https://google.com/search?q=tpu+board).

Then I realized that

- If such an initiative succeeded, the developer side of the fence would lose ground somewhere, because Amazon.

- It would never actually happen in the first place, because ARM is so fast-moving now that everyone's doing everything they can to maintain their first-mover advantage, which of course includes being as secretive as possible.

- In the ideal case where an engineer actually considered the situation, it would be all too easy (given the iteration that's happening right now) for them to counter-argue that it makes the most sense to just buy the Altera kit available on the retail market because it's technically the closest to what Amazon are using internally.

The evolutionary focus is entirely on servers because of the appreciable overheads associated with that market that can be tapped into to drive the R&D process - especially given the chip shortages at the moment. Inn the same vein, there are not likely to be any consumer-class implementations (or new products) until the supply chain has had a chance to figure itself out, which is probably not going to happen for a while.

:(

smackeyacky · on Oct 9, 2021

This is a shame for a lot of reasons. Chiefly that I would really like a graviton2 local server to get as close to replicating the metal my server will run on. Perhaps Amazon could just ship a PCI card with a system on it or something like that.

Its interesting to me that all of a sudden amd64 seems like a legacy system. I can hear the fan in my old proliant dev server from here and while I am somewhat sentimental about it, its definitely time to move on, preferably withou Apple involved. Maybe its time to start a 1980s style workstation company again.

exikyut · on Oct 10, 2021

Ha, I would not be surprised if a lot of the related iteration work is done via PCI(-e). Those cards are probably a) already being made and are well-understood, and b) part of an accepted single-order-quantity cycle, which would mean the runway needed to iterate a few more times to make the card releasable would theoretically not only be that hard to justify but be able to reuse existing processes.

I feel like the amd64 legacy (and x86 in general) will maybe be like IPv4. Itanium, OpenSPARC, SH4, MIPS, Coldfire, etc (incl. many more I'm forgetting) all eked out a living for a while after they "died" as platforms, while others like POWER are still around. x86 on the other hand has been the leading-edge architecture for virtually all client and server applications on for literally 40 years, without missing a beat. That graph has been pinned at pretty much 100% for that entire time.

So, yeah, the ARM revolution is here, but even in a world where Intel and AMD pivoted all their advertising to ARM, the x86 industry would still have enough ecosystem-driven momentum to continue to stay pegged at 100% for probably another 15-20 years. But that's not the reality - Intel and AMD aren't going to pivot away from all the contracts and partnerships and carefully nurtured deals and stuff they've hard won over the years, so there's definitely going to be continued life in the platform.

Or at least that's my mental model, which absolutely does not account for network effects.

This being said, the relative openness of the ARM licensing situation, along with the fact that the architecture hasn't fully matured in terms of peripheral and chipset configuration and whatnot, makes ARM a very very interesting platform to do exactly what you mention regarding 1980s-esque "industrial-scale experiment" type computing. I actually got bitten by precisely this bug a little while ago :) and have been noodling over the subject ever since.

Sadly, virtually every existing enterprise out there would probably fear such an idea as anything other than an eyes-only thought experiment, and I feel it would be ideologically impossible to fully convince security-conscious customers of the foundational integrity of the systems' roots of trust if I went down the VC route, even if every investor was a unicorn :/

dragontamer · on Oct 8, 2021

> That was about as expected. Going to 128 cores for cache-resident compute workloads helps, it hurts for anything else.

It looks like it has really good SPECjbb? I assume Java is non-cache resident? (Especially if SPECjbb is designed to emulate server-scale tasks). If anything, the SPECjbb results come full circle IMO: the large caches of Intel/AMD no longer help, and you're back to latency bound (Maybe??). Its very curious to me that this 16MB / 128core part is out performing the 32MB / 80core config on so many benchmarks.

I think you're right in that this processor has very little cache (and no hyperthreading). But it seems to just get around that issue by having a crapton of cores. If things are waiting on memory, it just brute-forces more cores to do more things.

I can't help but think that SMT2 or SMT4 would really help here. I have to imagine that most of those cores are waiting on RAM, given this architecture. Or maybe they ran the math and decided that 128 threads was enough to saturate the memory channels already?

I dunno, memory latency is always a killer, especially since this architecture has so little cache. My instincts suggest that SMT would still help.

cogman10 · on Oct 8, 2021

IDK what SPECjbb does, but it really sort of depends on a lot of factors around how the JVM was setup.

The JVM CAN be fairly cache friendly in the case where allocations are used relatively close to one another. The JVM will (typically) keep memory allocated at the same time in roughly the same memory region.

dragontamer · on Oct 8, 2021

https://www.spec.org/jbb2015/docs/designdocument.pdf

SPECjbb is a large, Java benchmark. The idea is to create a "realistic enterprise system", and then measure the number of requests before 100ms latency is violated (aka: critical-jOPs), as well as maximum throughput (aka: max-jOPs). The general model is on a hypothetical Supermarket: each time a POS system scans things, the Headquarters has to double-check the status of coupons, report back to the POS system and determine if coupons are still valid.

These hypothetical messages are GZipped XML that's getting passed back and forth between distinct JVMs, across the Grizzly-based non-blocking HTTPS server (default).

The benchmark artificially shoves more-and-more requests through the system until the 100ms response time is violated (ex: the system is beginning to show signs of being overburdened). Then it continues to shove more data into the system until the maximum throughput is measured.

It involves many JVMs doing many different tasks and communicating over Java-APIs. Thread pools, IPC, sockets, XML, HTTPS, encryption, the whole shebang.

LeifCarrotson · on Oct 8, 2021

I'm curious what workloads are suitable. You do have eight channels of 25 GB/s DDR4-3200 bandwidth (theoretically 200 GB/s) and 128 lanes of 2 GB/s PCIe 4.0(theoretically another 256 GB/s) to feed the thing, but getting it to make efficient use of those peripherals is another question altogether.

fer · on Oct 8, 2021

My thoughts exactly. Can be also useful for certain high volume workloads as you pass through data once and cache isn't all that useful.

But otherwise it is severely deprived of cache for anything in the middle.

TradingPlaces · on Oct 8, 2021

I think the upcoming Qualcomm-Nuvia chip will be what you want

volta83 · on Oct 8, 2021

You can get an Ampere Altra Q32-17 with 32 cores at 45W TDP.

rowanG077 · on Oct 8, 2021

well soon you can get a mac mini. It's doubtful you could get a better bang for your buck elsewhere.

MayeulC · on Oct 8, 2021

As we are witnessing multiple relatively small companies entering the server CPU market, it would be a really great thing if they could standardize on a compatible socket.

This way, instead of designing a motherboard for a single CPU vendor in a small market, integrators could target all of them at once.

Intel reportedly keeps changing sockets to please OEMs who want to sell motherboards; I would be happy to see something change in that landscape.

Otherwise, I'm extremely pleased to see more options are starting to become viable (again).

cameron_b · on Oct 8, 2021

If you abstract that ask just a bit there is already an answer, although it's more of an industrial answer. the COM-Express Spec is something of this Universal socket [0]

And while all of that link above looks like a creamy beige box cringefest, the COM-HPC[1] interface is being used for some heavy-hitting systems including Xeon and Ampere Altra [2]

So, yea, you can get this in something more interoperable.

0 - https://www.picmg.org/openstandards/com-express/

1 - https://www.picmg.org/openstandards/com-hpc/

2 - https://www.ipi.wiki/pages/com-hpc-altra

dragontamer · on Oct 8, 2021

> If you abstract that ask just a bit there is already an answer, although it's more of an industrial answer. the COM-Express Spec is something of this Universal socket [0]

So are PCIe slots, except GPUs seem to have won on that front (RIP Xeon Phi).

Moving forward: PCIe 5.0 is having more-and-more atomic instructions execute over the PCIe 5.0 specification. Its looking like PCIe 5.0 will have cache-cohesive, in particular, the Compute Express Link (CXL).

So PCIe is looking more-and-more like a "standardized socket", rather than an I/O mechanism.

jeffbee · on Oct 8, 2021

Does that LLVM comparison make sense? It seems to target the native platform, which doesn't seem like it would make for a valid comparison between x86 and arm.

tyingq · on Oct 8, 2021

Maybe for the relative 1 socket / 2 socket differences.