Arm Announces Neoverse V1, N2 Platforms and CPUs, CMN-700 Mesh (anandtech.com)
216 points by timthorn on April 27, 2021 | 92 comments



> AWS Graviton2-based EC2 Instances make up 14% of the installed base within AWS

> 49% of AWS EC2 instance additions in 2020 are based on Graviton2

Surprised at this level of Graviton2 adoption in AWS at this stage. Any clues as to who is using these instances?

Edit: Presumably Intel's shrinking Q1 2021 Data Center revenues are partly as a result of this.


>Presumably Intel's shrinking Q1 2021 Data Center revenues are partly as a result of this.

It was both AMD and ARM.

There are many workloads for which G2 offers an immediate cost/performance advantage. AWS charges per vCPU, which is one thread on Intel/AMD and one core on ARM. So you get ~30% better performance along with ~30% lower cost by using the ARM Graviton series. Most adopters have reported a total cost reduction of around 50%. For those running hundreds if not thousands of EC2 instances on workloads that fit this profile, that is too much saving to pass up.

There are many SaaS companies running on EC2 that have mentioned their success on Twitter and in various other places.

Worth pointing out: this is with Amazon installing as many as they can get from TSMC.

A few months ago on HN I wrote [1] about how half of the Intel DC market will be gone in a few years' time.

Edit: Another point worth mentioning: this is just as much of a threat to medium and smaller clouds like Linode and DO, which don't have access to ARM (yet). And even when they do get it, Amazon has the cost advantage of building its own chips instead of buying from a company (Ampere).

[1] https://news.ycombinator.com/item?id=25808856


Linode and DO could always offer a physical x86 core instead of a virtual SMT core. It would cut into margins somewhat, but maybe Intel and AMD would be more willing to discount when they have to play defense. I think one problem for the x86 guys is that because the demand for chips far exceeds supply, they’re still doing “fine” or even “well” right now. So the threat from ARM may still be perceived on mostly an intellectual level instead of provoking the necessary visceral survival response.


I think it's unlikely that anyone at Intel thinks -20% Q1 datacenter revenue YOY is doing well.

The question is what options do they have to deal with this?


I believe Intel couldn't have imagined how easily their biggest customers could turn into their biggest competitors overnight.

Even a decade ago that would've been unthinkable, but today, making a cookiecutter SoC is relatively easy because nearly everything can be taken off the shelf.

Production costs, though... sub-10nm mask set costs completely rule out anything resembling a startup competing in this area.

I think 65nm was the last golden opportunity to jump on the departing train. It was still possible to ship a cookie-cutter chip for under $1m; now... no way.

Now the semi industry is basically Airbus vs. Boeing.


Startups can absolutely compete here. There is sufficient capital to fund chip design (integration) and it is relatively low risk. We are going to see a huge number of Arm and RISC-V solutions on the market 14 months from now.


What companies are working on RISC-V solutions? I always feel like it's "5 years away", but I'm not sure if that's just my own perception.


A few RISC-V SBCs are already on the market. I suspect RISC-V will come to dominate the IoT/Edge space in the next few years before graduating to other market segments.

IoT/Edge deployments are less standardised than other computing workloads. Developers and integrators in this area already expect to deal with a lot of bother when working with a new chip. Also, the margins on these devices are usually razor thin, so the potential savings from not paying ARM licensing fees would be more appreciated.

Finally, RISC-V's modular approach allows for a greater level of flexibility and innovation, which will allow manufacturers to further differentiate and gain a competitive advantage. This is especially relevant for IoT/Edge solutions where thermal and power budgets are heavily constrained.


SiFive is probably the leader, though I admit I don't know much about their competition.


Agreed .. except weren't Ampere (2017) and Nuvia (2019) startups?


Ampere basically started as a re-labelled X-Gene from Applied Micro, which started back in the 40nm days. And they had quite a lot of cash to start with: their backer is the Carlyle Group, the biggest LBO shop in the world.

Nuvia basically never intended to really compete with Intel or AMD head-on. Their $30m stash would've been just enough for a single "leap of faith" tapeout on a generation-old node, and a year of life support after.

They were aiming for a quick sell from the start too.


Depends on your definition of startup I guess. Certainly seems to be enough capital available.

I definitely don't agree with the premise that it's now Boeing vs. Airbus (certainly less so than it was a few years ago when x86 was the only game in town).


Ampere kinda was an acquisition of Applied Micro, but its internal X-Gene uarch was dropped into the trash very quickly in favor of Arm Neoverse…


Do you actually know how much a sub-10nm mask set costs? There’s a lot of speculation from people who don’t have access to those numbers. Those who do are bound by NDAs.


I do hear figures in the single-digit millions of dollars for relatively small tapeouts.

Back in the 65nm and 40nm days, big tapeouts were already costing high six figures in masks alone.

And... masks are not the most expensive item in the sign-off costs these days.

Specialist verification, outsourced synthesis, layout, analog, physical, test, and other specialist services will easily cost more than the maskset for <40nm.

I would not be surprised if tier 1 fabless already spend $10m+ per design just on them.


You are absolutely correct that design costs swamp mask costs by far. For 7 nm, it costs more than $271 million for design alone (EDA, verification, synthesis, layout, sign-off, etc) [1], and that’s a cheaper one. Industry reports say $650-810 million for a big 5 nm chip.

[1] https://semiengineering.com/racing-to-107nm/


Cookie cutter?


There have been a bunch of higher-profile "we moved to Graviton2 and cut costs" stories. Twitter, for example, migrated: https://www.hpcwire.com/off-the-wire/twitter-selects-aws-and...


Also Netflix, although I don't think they've said what portion of their instances they've migrated: https://aws.amazon.com/ec2/graviton/customers/


They're the cheapest EC2 instance type, so they're very attractive to small scale deployments like side projects, personal sites etc. (basically anything that can run on one or two small nodes) where budget is a major concern. The t4g.micro is in the free tier as well, so that'll help.

I host a few very low traffic sites & I'm in the process of switching from a basic DO Droplet to a pair of low-end Gravitons. Will save me money and give better peak performance for my workloads.


> switching from a basic DO Droplet to a pair of low-end Gravitons. Will save me money and give better peak performance for my workloads.

I'm having trouble figuring this out - a t4g.micro is $6/month, before any storage or data transfer costs. The roughly equivalent DO offering is $5/month, inclusive of 25GB SSD and 1TB transfer. Even with a reserved instance discount and significantly less than 1TB outbound transfer, DO seems likely to be cheaper.


CPU power on $5 offerings from others is likely not as great. Also AWS did a free tier for everyone, and the spot market is fun…


Maybe, but it would take a _lot_ of people moving small deployments (where by definition the savings would be small, especially relative to the fixed costs of getting things to work on Arm) in a relatively short space of time to have this impact - so I'm sceptical (and if that is the case, then it must be very easy to move to Arm - which I'm also sceptical of).

More likely some very big customers (peer comment mentions Twitter) moving to Graviton2 for cost savings.


Graviton might be the top and/or default choice in their management console when you create an EC2 instance. That would swing things pretty quickly for all the free tier folks.

Edit: Nope, not yet, but close... you still have to change the radio button: https://imgur.com/a/W0Sweyy


I'm confused - the x86 box is ticked by default there.


Yes...edited my edit. I'm pretty sure there was no radio button for some time, you would have had to scroll into other choices to get a Graviton instance.


The company I work for has migrated hundreds of heavily-utilised Elasticsearch and Storm nodes to Graviton. No performance issues, pure cost saving. We’re working on the rest of our systems now. We’re going to save hundreds of thousands of dollars over the next few years.


AWS's own offerings such as RDS and internal control plane stuff are very likely using ARM behind the scenes.

I have evaluated going ARM, but I ended up deciding the savings were not worth it.

Not only do you need to maintain two architectures simultaneously for some time, but porting some stuff to ARM (e.g. Python) can be a pain in the ass.

Finally, my devs work on AMD64, and that would be another source of "why does this work in dev but not in prod".


Just wait for the M1 MacBook to become commonplace. Then it'll be "my devs use ARM".


> Finally, my devs work in AMD64 and that would be another source for "why does this work in dev but not prod".

I can see a use case for building a CI/CD pipeline on Raspberry Pi's.


RDS supports Graviton2 instance types; maybe people with supported versions just migrated.


"instance additions" also doesn't take instance size/performance into account. If ARM-based instances are overall smaller, that'd allow more of them, distorting the numbers...

Percentage of compute power would be cool to know here.


We haven't migrated yet but we expect to do some benchmarking this quarter for Aurora.

For EC2 we run on spot, and spot c5.metal is cheaper per vCPU than c6g.metal, so we haven't prioritized benchmarking our compute loads.


I wouldn't be surprised if AWS is using Graviton2 pretty heavily for internal processes as well, stuff like control planes for the major services like S3, SQS, SNS, etc...


AWS itself is probably the main user, I would guess. So you use them indirectly through AWS's vendor lock-in APIs.


I know many users. Basically any workload that isn't x86-specific and is cost-sensitive can benefit from moving to ARM instances. Database instances are good candidates, and big data workloads as well.


Funnily coincidental timing: the current post #2 (GCC 11.1 released) adds support for the CPUs mentioned here (currently post #4):

  AArch64 & arm

    A number of new CPUs are supported through arguments to the -mcpu and -mtune options in both the arm and aarch64 backends (GCC identifiers in parentheses):
        Arm Cortex-A78 (cortex-a78).
        Arm Cortex-A78AE (cortex-a78ae).
        Arm Cortex-A78C (cortex-a78c).
        Arm Cortex-X1 (cortex-x1).
        Arm Neoverse V1 (neoverse-v1).
        Arm Neoverse N2 (neoverse-n2).
Good to see work going into this at the proper time. (Not that this has been much of a problem for CPU cores in recent times. Still not a matter of course, though.)


These tunings will only be used if you compile stuff yourself with -march=native (or specify one particular model). Most software out there is compiled with generic, non-tuned optimizations. The tuning is rarely a huge deal, though.
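
For what it's worth, here's a rough way to check what a given build was targeting - a sketch only, and the compile commands in the comment are just illustrations using the new GCC 11 -mcpu names quoted above (note that -mcpu implies both the tuning and the feature baseline, while -mtune alone only affects scheduling):

    /*
     * check.c - does this build target a specific core or the generic baseline?
     * Illustrative compile commands (assumed, not prescriptive):
     *   gcc -O2 -mcpu=neoverse-n2 check.c   (core-specific features + tuning)
     *   gcc -O2 check.c                     (generic; roughly what most distro
     *                                        packages get)
     */
    #include <stdio.h>

    int main(void) {
    #ifdef __ARM_FEATURE_SVE
        puts("SVE was enabled at compile time");
    #else
        puts("generic build: no SVE at compile time");
    #endif
        return 0;
    }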


True, but it's still relevant for 3 things:

- when you have a particularly CPU-intensive application, you'd hopefully compile it to target your system

- the cloud providers can just do a custom Debian/Ubuntu/... build for their zillions of identical systems

- the library loading mechanism on Linux is slowly getting support for having multiple compile variants of a library packaged into different subdirectories of /lib (e.g. "/usr/lib64/tls/haswell/x86_64")

Also I was mostly trying to point out as a positive how well the interaction is working there between ARM and the GCC project. I wish it were like this for other types of silicon.

(CPU vendors all seem to be getting this right, and GPUs are slowly getting there, but much other silicon is horrible… e.g. wifi chips)


That is not entirely true. Binaries in the packaging systems might not be compiled for the most recent atomic instructions, which can really affect performance.

https://blog.dbi-services.com/aws-postgresql-on-graviton2-aa...

https://github.com/microsoft/STL/issues/488

We are about 9-14 months away from the right pieces making their way through the software ecosystem, at which point this will be almost a non-issue.

Exciting times for everyone!
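
To make the atomics point concrete, here's a minimal sketch - the flags are illustrative assumptions, and GCC's -moutline-atomics can also dispatch at runtime instead:

    /* counter.c - the same C11 atomic compiles very differently:
     *   gcc -O2 -S -march=armv8-a        counter.c   -> load-exclusive/store-exclusive retry loop
     *   gcc -O2 -S -march=armv8.2-a+lse  counter.c   -> single LSE instruction (LDADDAL)
     */
    #include <stdatomic.h>

    atomic_long counter;

    long bump(void) {
        /* One atomic add instruction with LSE; a retry loop without it. */
        return atomic_fetch_add_explicit(&counter, 1, memory_order_seq_cst) + 1;
    }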


Well, that - yeah. But it doesn't strictly have anything to do with the actual CPU-model-specific tuning that the news was about, only in that setting a specific CPU in -march (-mtune would not do it!) would imply the features. Typically, though, you'd just do -march=armv8-a+the+desired+features for that, like the first post you linked does.

Really the important piece for making distribution binaries not suck is ifuncs/multiversioning. But library and app authors currently have to use them deliberately. That is fine for manual optimizations that use intrinsics or assembly (and e.g. standard library atomics), but I'm not sure any compiler currently would automatically do that for autovectorization.
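
For anyone wondering what "deliberately use them" looks like in practice, here's a rough sketch of a GNU ifunc on Linux/glibc - the function names and the SVE check are made up for illustration:

    /* sum.c - manual multiversioning via a GNU ifunc resolver.
     * The resolver runs once at load time and picks an implementation. */
    #include <stddef.h>
    #include <sys/auxv.h>          /* getauxval, AT_HWCAP (glibc on Linux) */

    static long sum_generic(const long *v, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++) s += v[i];
        return s;
    }

    /* Pretend this one uses hand-written SVE intrinsics; kept scalar for brevity. */
    static long sum_sve(const long *v, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++) s += v[i];
        return s;
    }

    /* Returns the function pointer the dynamic linker will bind to "sum". */
    static long (*resolve_sum(void))(const long *, size_t) {
    #ifdef HWCAP_SVE
        if (getauxval(AT_HWCAP) & HWCAP_SVE)
            return sum_sve;
    #endif
        return sum_generic;
    }

    long sum(const long *v, size_t n) __attribute__((ifunc("resolve_sum")));

The per-CPU library subdirectories mentioned upthread solve a similar problem one level up, at whole-library granularity.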


Hah. I like it when I can enjoy my hammock instead of fine-tuning my code to weird limits for performance.

DDR5, PCIe 5.0, the SVE speedup and a 40% IPC improvement put a big smile on my face.


Apple's M1 has made ARM mainstream for laptops. Let's see which company does the same for the server space.

Hopefully ARM in the cloud will result in cheaper prices.


I nominate Amazon for this award. As mentioned in another comment here, ~50% of newly allocated EC2 instances are ARM.


Apple's M1 will make ARM mainstream on the server side.


Apple hasn't seemed interested historically. And the Nuvia folks left Apple to found their company explicitly because they thought an M1 style CPU core would do well in servers but Apple wasn't interested in doing that.


It's not that Apple will sell server chips. It's that developers can work locally on ARM, which makes it easier to deploy to servers. Linus Torvalds had a quote about this...


Linus' quote/post:

""" And the only way that changes is if you end up saying "look, you can deploy more cheaply on an ARM box, and here's the development box you can do your work on". """

(emphasis in original)

https://www.realworldtech.com/forum/?threadid=183440&curpost...

Thanks, I did not know about this!


It’s wild to consider that my next computer (an arm M1 Mac) will be compiling code for mobile (arm) and the cloud (arm). I wonder if we’ll ever see AMD releasing a competitive arm chip and joining the bandwagon.


Now, with NVIDIA acquiring ARM, I'm not sure AMD will do that.


Ah, ok, that makes sense.


Personally, I don't see many server admins choosing to pay the Apple tax to get the M1 into their data center. I don't see how the performance-per-watt ratio could pay off that kind of tax.


I did not mean to imply that the actual M1 will be used in data centers. Apple is quite popular among developers, and it's also a trendsetter, which will probably lead other computer manufacturers to adopt ARM for personal computers. Having more people use ARM on their personal computers will lead to more ARM adoption in the data center.


> Apple is quite popular among developers...

The great majority of developers use Windows or Linux according to every Stack Overflow survey from the past ten years. Only ~25% use a Mac.


I believe interpreting statistics from those surveys in this way isn't fair. There are many developers around the world, but the pattern of value/money generation among them is not uniform; in other words, a small percentage of developers work for the companies that pay the largest share of server bills, and the penetration rate of macOS devices among developers at top companies is probably higher than average. (I'm not implying that developers who work on non-macOS devices create less value - your device has nearly nothing to do with your impact. I'm just talking about a trend and a possible misinterpretation of the data.)


25% is still "quite popular".

If 25% of servers switch to ARM that is massive.


Ah, ok, I understand. Thanks for clarifying!


The OP wasn't suggesting Apple M1 chips in the data centre, but rather that Apple M1 chips in developer workstations will disrupt the inertia of x64 dev –> x64 prod. It will be easier for developers to choose ARM in production when their local box is ARM.


Apple have been hiring for Kubernetes and related roles. This may well be for their own devops for Apple services.

However I'd be amazed if they don't release some kind of managed service for running Swift code in the cloud. Caveat emptor, though.


I have a similar view of what Apple are up to. Too many high-profile hires to be purely toiling in the mines.


this is inevitable.


My operating systems teacher in 2001 was a total RISC fan and always said it would eventually overtake CISC.

I guess he didn't expect this to take until well after his retirement.


ARM today is probably more CISCy than what he considered CISC in 2001.


The best analysis of RISC vs CISC is John Mashey's classic Usenet comp.arch post, https://www.yarchive.net/comp/risc_definition.html

There he analyses existing RISC and CISC architectures, and counts various features of their instruction sets. They clearly fall into distinct camps.

But!

Back then (mid 1990s) x86 was the least CISCy CISC, and ARM was the least RISCy RISC.

However, Mashey's article was looking at arm32 which is relatively weird; arm64 is more like a conventional RISC.

So if anything, arm is more RISC now than it was in 2001.


amd64 is more RISC now than ia32 was in 2001 as well.


AArch64 is load-store + fixed-instruction-length, which is basically what "RISC" has come to mean in the modern day. X86 in 2001 was already… not that :)


I always understood it that way too.


Eh, it has a lot of instructions, but that was only the surface of RISC. It's a deeper design philosophy than that.


Also, isn't the x86 ISA just a translation layer today? I thought that on the metal there is a RISC-like architecture these days anyway.


Not really, because the variable-length instructions have consequences - mostly good ones, because they fit in memory better.

Also, the complex memory operands can be executed directly because you can add more ALUs inside the load/store unit. ARM also has more types of memory operands than a traditional RISC (which was just whatever MIPS did.)


I had the impression that the M1 outperforms others because it doesn't have variable-length instructions.

Why do you think they have good consequences?


As I've understood it, the tradeoff is:

The upside of variable-length instructions is that they are on average shorter, so you can fit more into your limited cache and make better use of your RAM bandwidth.

The downside is that your decoder gets way more complex. By having a simpler decoder Apple instead has more of them (8 wide decode) and a big reorder buffer to keep them filled.

Supposedly Apple solved the downside by simply throwing lots of cache at the problem and putting the RAM on-package.

I'm not a CPU guy and this is what I've gathered from various discussions so I'm happy to be corrected.


In most cases, yes, but it doesn't get rid of the complexity for compiler backends that can't directly target the real instruction sets Intel uses and have to target the compatibility shim layer instead.


Does anyone know if these processors make mobile development easier? I mean, it's the same architecture now, right?

The only thing I could find is https://www.genymotion.com/blog/just-launched-arm-native-and...


Not really. Maybe the emulators are faster but the main languages are managed anyway.


Plus, for iOS, even the iDevice simulator just builds for whatever platform you're testing on.


I hope that ARM servers with reasonable specs won't be exclusive to AWS and the other hyperscalers. For example, it would be nice if OVH would offer ARM-based dedicated servers.


Tl;dr

V1 = slightly tweaked Arm Cortex-X1 with SVE (the X1 is used in the Snapdragon 888), on 7nm, aiming at ~4W per core.

N2 = new Cortex with ARMv9, ~40% IPC improvement over N1 (or ~10% below V1), SVE2, on 5nm, aiming at ~2W per core, with a similar die size to N1. (I fully expect Amazon to go to 128 cores with their N2 Graviton.)

So in case anyone is wondering: no, it is not at Apple M1 level. Not anywhere close.

CMN-700 = more cores and support for memory partitioning, important for VMs.


The cores don't serve the same purpose as the M1 cores. M1 is optimized for single-thread performance at the cost of die size (and a bit of power). I don't have exact numbers, but say the Apple M1 core takes 1.5x the die area of N2; then you'd get better throughput by putting in 1.5x the number of N2 cores instead.


Yes. The M1 / A14 is also a 5W+ core, so a different set of trade-offs. I mentioned the M1 because it is the question that always comes up and people keep banging on about it. I wish most tech sites would simply point this out, since they have the reach. But it is obvious neither Apple nor ARM has any interest in their parts being named and compared in this way, and I guess tech sites won't do it and risk harming the relationship.


Nice TLDR!

It's an apples-and-oranges comparison to Apple's M1 chips (server vs. consumer), but it does hint at what's possible with the next-generation ARM Cortex "X2" cores that could appear in next year's flagship smartphones and laptops. A 30-40% IPC jump, partly enabled by moving to a 5nm fabrication process, is huge.

Given the right implementation, namely squeezing in more big cores than the current 1-3-4 configuration, it could close the gap with Apple considerably.


Process node changes generally don't do anything for IPC - IPC gains generally come from microarchitecture improvements - so I doubt the move to 5nm has anything to do with the IPC gain?


The node shrink lets you afford more transistors that provide more IPC.


I agree with that - but if you take an unchanged core and manufacture it at a different node, then you won't see a change in IPC, which in my book makes it questionable to attribute IPC gains to the process node.


This is good. If ARM takes off for consumers and business, I'm hoping RISC-V will also get some traction. There are alternatives to Intel's x86.


You can tell it's a modern CPU because its name matches the [A-Z][0-9] pattern.


That was a joke by the way. Sorry that I'm the only one who thought it was funny.


I thought it was funny!


High five, there are two of us!


As opposed to those old [a-z][0-9] CPUs that Intel puts out ;)


I wonder why they didn't use AWS-wide numbers rather than just EC2. I would have thought EC2 would lag in the transition while AWS services would make the switch quickly.


Because EC2 represents more realistic market adoption: it's more important to know whether you can run the software of your choice on ARM than whether Amazon can develop a service on an ARM stack.



