Memory access on the Apple M1 processor (lemire.me)
372 points by luigi23 on Jan 6, 2021 | 260 comments



Great practical information. Nice to see people who know what they are talking about putting data out there. I hope eventually these persistent HN memes about M1 memory will die: that it's "on-die" (it's not), that it's the only CPU using LPDDR4X-4267 (it's not), or that it's faster because the memory is 2mm closer to the CPU (not that either).

It's faster because it has more microarchitectural resources. It can load and store more, and it can do with a single core what an Intel part needs all cores to accomplish.


This seems to be a recurring theme with the M1, and one that, in a sense, actually baffles me even more than the alternative. There is no "magic" at play here, it's just lots and lots of raw muscle. They just seem to have a freakishly successful strategy for choosing what aspects of the processor to throw that muscle at.

Why is that strategy simultaneously remarkably efficient and remarkably high-performance? What enabled/led them to make those choices where others haven't?


I don't have any inside-Apple perspective, but my guess is having a tight feedback cycle between the profiles of their own software and the abilities of their own hardware has helped them greatly.

The reason I think so is that when I was at Google, it was seven years between when we told Intel what would be helpful and when they shipped hardware with the feature. Also, when AMD first shipped the EPYC "Naples" it was crippled by some key uarch weaknesses that anyone could have pointed out if they had been able to simulate realistic large programs, instead of tiny and irrelevant SPEC benchmarks. If Apple is able to simulate or measure their own key workloads and get the improvements into silicon in a year or two, they have a gigantic advantage over anyone else.


That's bizarre. As if CPU vendors were unable to run "realistic" workloads. If they truly aren't, that's because they are unwilling and then they are designing for failure and Apple can just eat their lunch.


As a data scientist, I feel this. Intel and AMD don't own an OS or an app store, and you might be surprised how hard it is to get good data. Data is the new gold. If a company can corner a piece of the market, it can collect data no one else can, and other companies are often forced to partner with it or they can't properly provide services that will keep them competitive.


This makes me think that any sort of data advantage Apple may have has nothing to do with them owning an OS. Intel has a massive computer network, managed by their own IT team, just like any other large corporation. Intel could collect whatever performance data they want from actual users of actual programs just as easily as Apple could.


Apple doesn’t just control an operating system or an App Store. They also control a development toolchain and the two primary languages compiled for their platforms, as well as most of the frameworks used in commonly used apps (excepting the dreaded electron). They have a platform that’s been tailored to be profiled and optimized.

One early benchmark showed allocating and destroying an NSObject performing drastically better on the M1 vs recent Intel Macs. This wasn’t an accident. It’s probably not representative of performance overall. They have enough vertical integration to make their own first party solutions clear optimization targets.


This sort of "data," that optimizing contention free locks could have big rewards, isn't something that you need to control the OS or compiler/profiler/debugger toolchain to understand and learn. And for that matter, Intel has excellent compilers and profilers too.

All it takes is looking at what's going on in commonly used code, deciding to optimize for X, Y, and Z, and committing to it. If Intel isn't doing this already, that's entirely the fault of current management for not making it a priority.

The only way that Apple's vertical integration helped them make that management decision is that they were able to say "our customer is a typical laptop user." Intel tries to cater to much larger markets, so perhaps when management goes to plan a laptop chip, they are less aggressive with deciding to optimize. But I have a feeling that Apple's optimizations are generally good for nearly all code, not just for specific use cases.


There's really no explanation for why EPYC "Naples" was so bad other than AMD did not internally understand the performance of realistic large-scale programs. I mean even if they had taken anything off the shelf, for free, like MySQL, they could have determined at some point before mass production that their CPU, in fact, sucked. But they shipped it and prospective customers rejected it.

Don't discount how a weak organization can make poor decisions even when all necessary information seems to be readily available.


Apple is the only large company with a functional organization. Could that be it? Coupled with their unparalleled ownership of a family of platforms (Intel's OS presence, by comparison, is nonexistent).


I think you're probably right, but it's funny because I think Intel was well known for having really exceptional organizational function in the past. They used to be paranoid about everything!


> Intel and AMD don't own an OS or an app store

They could run their benchmarks on the Debian repository and it would probably be representative enough.

If they haven't done so, perhaps because it seemed tedious or unimportant, well, that mistake would be on them.


Intel does own an OS, Clear Linux, but they probably lack profiles of typical usage of that OS, and there are probably not many users of it apart from Phoronix when they do benchmarking.

https://clearlinux.org/


It’s a big world out there. Workloads in data science vs gaming vs hft vs packet processing vs web servers are all extremely different.

Even if you know about them, you need an expert in each to truly push the hardware to the real limits that get hit in the respective industry. The small differences between real implementations and simulated loads can drastically alter the performance characteristics and cause proc manufacturers to miss the mark.


There’s another factor here, that I learned when I did a six month contract at Apple Retail Software Engineering.

In short, Apple has a number of 10x engineers that they move around to whatever project needs the most help, whether that is hardware, or application software, or operating system software, or services infrastructure, or whatever.

If some project starts getting enough negative “above the fold” coverage, then they will be temporarily gifted one or more of these special engineers.

Do that enough times, and those 10x engineers will gain enough experience in enough different areas that they will be able to reason well about the other end of whatever pipeline they’re on, and will know other 10x engineers that they can work with on that other end of the pipeline, because they had previously worked with them on some other project in the past months or years.

And those 10x engineers really will make a huge difference in what that project is capable of delivering.

The key failure of this operating mode is that most of the 10x engineers never get enough time to transfer much knowledge or skills to the others on the temporary team they are currently working with, and so things will start slowly deteriorating when they are necessarily moved on to the next project.

Rinse and repeat.


> I don't have any inside-Apple perspective, but my guess is having a tight feedback cycle between the profiles of their own software and the abilities of their own hardware has helped them greatly.

Also, we're seeing not their first chip but the first one that met their needs (demonstrably better performance on their workloads).

By which I mean: presumably macOS has been running on many generations of A-series processors, so they have had a lot of time to figure out which tweaks would be good and which turn out to be pessimizations or overkill. It doesn't hurt that there is significant internal overlap between modern macOS and iOS.


Interesting point. This would suggest pretty sizable synergies from the oft-rumored Microsoft acquisition of Intel.


Microsoft doesn't need to acquire Intel, they need to do what Apple did and acquire a stellar ARM design house that will build a chip with x86 translation, tailored to accelerate the typical workloads on Windows machines and sell those chips to the likes of Dell and Lenovo and tell developers "ARM Windows is the future, x86 Windows will be sunset in 5 years and no longer supported by us, start porting your apps ASAP and in the mean time, try our X86 emulator on our ARM silicon, it works great."

Microsoft proved with the Xbox and Surface series that they can make good hardware if they want; now they need to move into chip design.


Microsoft has a pretty good relationship with AMD from the Xbox. AMD already made an Arm Opteron. Windows has been multiplatform since NT 3.1 (Alpha, MIPS), with 3.51 adding PowerPC. You can download Windows for Arm for free and run it on a Raspberry Pi.

Microsoft has at least one homegrown processor that it has ported Windows and Linux to, the confusingly named 'Edge'.

https://www.theregister.com/2018/06/18/microsoft_e2_edge_win...

Microsoft doesn't even _need_ to target Arm, they could easily team up with AMD or go the whole thing solo and target anything from RISC-V, Arm to an in-house ISA.


> Windows has been multiplatform since NT 3.1 (Alpha, MIPS) and then in 3.51 adding in PowerPC.

What was the last version of Windows to support either of these platforms?


PowerPC was the last architecture standing. It was supported by NT 4.0 SP2 (technically this means Microsoft supported it on paid support contracts as recently as 2006).


Windows Server 2008 R2 supported Itanium.


There are rumors of Ryzen CPUs with ARM ISA, but I think AMD has enough on its plate.


Apple has at most 10% of the computer market and is just one player among many. I am skeptical Microsoft with their 90% dominance would or should be allowed this much power over the industry.


The traditional personal computer market isn't nearly as important as it used to be. However you slice the pie, there's no way you can define the pieces to be 10% Apple and 90% Microsoft with a straight face.


90% dominance of what is increasingly a small niche market. Apple controls a large fraction of the mobile device market, and everything else runs linux.


The fact is, outside of the tech scene most businesses and consumers run Windows. To say this is a "small niche market" is laughable. Microsoft is everywhere.


Compare how many millions of desktop PCs there are in the world, versus the billions of mobile devices.


From what I understand, Microsoft is excellent at hardware design. They’re just focused on a different market (services->business rather than consumers->services)


> Microsoft acquisition of Intel

Could that possibly be approved by governments?


Nope. I'm not a lawyer, but I doubt it.


Apple has been iterating on their proprietary mobile ARM-based processors since 2010, and has gotten really good at it. I would imagine that producing billions of consumer devices with these chips has helped give them a lot of experience in a shortened time frame.

I also wonder if having the hardware and software both worked on in-house is an advantage. I mean, if you're developing power management software for a mobile OS, and you're using a 3rd-party vendor, then you read the documentation, and work with the vendor if you have questions. If it's all internal, you call them, and could make suggestions on future processor design too based on OS usage statistics and metrics.


In fact there is clear evidence of this with the M1. It has optimised instructions to speed up retain/release of NSObject subclasses, which is a frequent operation on almost all Objective-C and Swift classes. They also designed the M1 to support a memory-ordering model used by x86 (and not required by ARM) to accelerate Rosetta-translated binaries. I'm sure there are more.


>Why is that strategy simultaneously remarkably efficient and remarkably high-performance? What enabled/led them to make those choices where others haven't?

The things people give them grief about:

(a) keeping a walled garden,

(b) moving fast and taking the platform in new directions all at once,

(c) controlling the whole stack

Which means they're not beholden to compatibility with third party frameworks and big players, or with their own past, and thus can rely on their APIs, third party devs etc, to cater to their changes to the architecture.

And they're not chained to the whims of the CPU vendor (as the OS vendor) or the OS vendor (as the CPU vendor) either, as they serve the role of both.

And of course they benchmarked and profiled the hell out of actual systems.


Neither A nor C makes any sense; they're not supported by evidence. There is no aspect of the Mac or macOS that can be realistically described as a "walled garden". It comes with a compiler toolchain and ... well, some docs. It natively runs software compiled for a foreign architecture. You can do whatever you want with it. It's pretty open.

A "walled garden" is when there is a single source of software.


"A" does matter a bit. Builds are uploaded to the App Store include bitcode, which Apple strips on distribution.

According to docs, enabling bitcode: "Includes bitcode which allows the App Store to compile your app optimized for the target devices and operating system versions, and may recompile it later to take advantage of specific hardware, software, or compiler changes."

It seems quite likely they have (and probably used) the capability to recompile any app on their platform to benchmark real workloads against prototype silicon changes.


Most non-Apple apps aren’t from the App Store


Doesn't matter for the purposes of what the parent mentioned.

It's enough that enough of them are.

Plus, even if most aren't, the "short tail" people use these days probably are.


Walled garden has many meanings, depending on context.

macOS promotes the App Store as the source of software (even if it's not the sole), and has walls like notarization requirements and the Gatekeeper to prevent weeds from intruding.

With the App Store, Apple knew it could count on a pool of N apps that follow its guidelines, have passed internal checks for API use, and can be converted quite easily to a different architecture.

Their control over the platform allowed them to enforce Metal and deprecate OpenGL pronto, to add new combined iOS/macOS UI libs, and to introduce Marzipan.

They have also added stuff like Universal Binary support, and most importantly Bitcode, which abstracts away parts of the underlying architecture.

All of those were steps towards the ARM/M1 (and future developments), and all were enabled by Apple's control of the hardware, the software, and - sure, partially - third party apps.


People seem to get creative about terminology. It's not remotely like what I'd consider a walled garden (xBox, iOS, Playstation, etc).


Running apps downloaded outside the store requires jumping through an increasing number of hoops, or vendors paying to get every build signed off by a single party.


You don't have to pay to have your apps signed off (notarized).


I'll have to check again because if that's the case it's still a walled garden, just with free admission.


Just right-click > Open and you can do anything you want.


Which is a short wall you just jumped. To get into the walled garden.


They could still do all this shit without the walled garden. To me, it suggests they aren't willing to compete. They're anti-competitive.


> They could still do all this shit without the walled garden.

They do. MacOS isn't a walled garden.

> They're anti-competitive

Have you heard of this little company from Washington called Microsoft? They have something like 85% of the PC market. There is another OS called Linux. About 85-90% of the internet runs on it.

I can understand a little where people get the idea the iPhone is anti-competitive, but we're talking about MacOS here.


It's the same cowardly leadership that stewards both iOS and OSX.

Ask Amphetamine about how open they are.


I do wish people would keep the quasi-religious aspects out of things.

Amphetamine is a perfect example of how the Mac isn't a walled garden. They always had the option to sell outside the App Store. That is fundamentally what distinguishes a walled-garden platform from an open one. They might have lost some sales because they couldn't participate in the Mac App Store, but they could still sell their product. Some companies choose to avoid the Mac App Store because they don't like Apple's policies.


>Ask Amphetamine about how open they are.

That's neither here nor there.

(a) Amphetamine could still be sold outside the Mac App Store.

(b) An app name could be problematic even in FOSS land. It's just that instead of Amphetamine being the name that causes it, it will be something else. E.g. with the trend of banning/changing terms like "master" (as in a replication primary, not as in the owner of slaves), unfortunately named apps could be thrown out of, or asked to rename themselves for inclusion in, a distro's package manager or a project.


There are a lot of juvenile named open source projects that will definitely get in trouble or already have. GIMP and LAME are examples.


With the walled garden, Apple can set enforceable timelines for the software ecosystem to adapt to architectural changes.

Remember the transition to arm64? Apple forced everything on the App Store to ship universal binaries.

Without the App Store walled garden, software isn’t required to keep up to date with architectural changes. Instead, keeping current is only a requirement to being featured on the App Store (which would just be a single way to install software, not the only method).


Well, and on the Mac, it's not the only method. The walled garden here has big open gates.

That said, all software on the Mac, post-Catalina, has to be 64-bit, whether it's distributed through the Mac App Store or not, because the 32-bit system libraries are no longer included at all.


>Well, and on the Mac, it's not the only method. The walled garden here has big open gates.

Gates are not incompatible with walled gardens. Most walled gardens have those.

Plus, I mentioned the walled garden as a good thing. It's part of the Apple proposition (even if not all get it), and part of what enables it to move at the speed it does (whether in the right or wrong direction).

But one can substitute "walled garden" with "tight control of the OS and hardware, imposed requirements on most third-party software, and willingness to enforce hard schedules (e.g. regarding removing 32-bit, OpenGL, etc.) on all (or tons) of its developers at once".


32-bit Windows software is actually supported through WINE and works in Rosetta.


This is about 32-bit OS X software/libs.


They are little tiny 6" tall walls that you can step over. Like micro-walls. Except for the bits where you have no walls at all. Like if you install literally any programming language, Homebrew, or MacPorts.

The walls in the walled garden only exist in the heads of people who never use a Mac.


>They could still do all this shit without the walled garden

With much slower adoption, pushback, and bike-shedding, like in the Microsoft and Linux world.

>To me, it suggests they aren't willing to compete.

Compete with what? With themselves? They compete with Windows (and to a degree Linux, though few care for that), and with Android. They'd compete with Windows Phone too if MS wasn't incompetent.

But they didn't do anything to preclude others from making their OS/hardware and selling it to customers. In fact, they have nowhere near a monopoly in either the desktop (10% or less) or the mobile space (40% or less).

Whereas MS for example, had 98% of the desktop (home and enterprise), and abused its power to threaten OEMs to do its bidding against Linux etc.


I will be honest: as long as Apple keeps these walled garden shenanigans going, I am not buying any of their hardware.


What's with the downvotes? I know some people don't mind, but it's a deal breaker for me, and it's something I don't want to support.

I don't care how good their hardware is. Moreover, good luck sourcing parts if the device has trouble. Apple will not sell you the parts, even if you wanted to buy them.

A walled garden does not make their hardware any better. If anything it makes it worse. For Mac users' sake, I hope Apple does not clamp down further on Macs.


>What's with downvotes?

Didn't downvote, but I think it's the same as when people read a "letter to the editor" of yore, declaring that some person "cancelled their subscription" because of something in the magazine.

A natural response is "Don't let the door hit you on your way out", which on HN might be expressed through a downvote by some.

>I don't care how good their hardware is. Moreover, good luck sourcing parts if the device has trouble. Apple will not sell you the parts. Even if you wanted.

Well, they repair all kinds of parts, and have guarantees and guarantee extension programs. But in any case, their allure was never "can find parts to build my own / repair damages forever" or in their stuff being cheap to own or fix/replace.

>A walled garden does not make their hardware any better.

Well, it does in a few ways. Mandating how the software is made, what software is sold, when it must adopt new libs to continue being sold, etc., means that they can move the platform in new directions faster.


> "Don't let the door hit you on your way out"

I never bothered to enter Apple's tyrannical ecosystem, so there is no door to hit me on the way out.

> Well, they repair all kinds of parts, and have guarantees and guarantee extension programs.

You cannot get a lot of surface-mount chips to repair a MacBook or iPhone without looking on the gray market. That's even before possible firmware issues if you do manage to find parts. Heck, even getting full replacement boards is basically impossible, unless they come from donor machines that have other problems.

> "Well, it does in a few ways. Mandating how the software is made, and what software is sold, when it should adapt new libs to continue being sold, etc, means that they can move the platform in different ways faster."

I disagree; allowing people to side-load does not stop Apple from having policies in place for its App Store. That's honestly the only problem. It's the owner's hardware; they should not need Apple's permission to run code on it. Unless the owner can sign software themselves and/or run it without Apple's consent, this will always be a problem. You can't even install gcc without jailbreaking an iPhone.

At the very least I should be able to install another OS on the device, like GNU/Linux. If Apple does not want to open iOS, the user should at least have that option for the hardware.

This is even before you get into how Apple treats developers. Have you read the entire App Store guidelines? Some of them are ridiculous. Some of the insanity prevents Firefox from even porting their own browser engine.


I think it's worth saying that because AMD have only just really hit their stride, Intel were under almost zero pressure to improve, which has really hurt them, especially on process.

x86 definitely carries a constant-factor overhead, but if Intel put their designs on 5nm they'd look pretty good too. Jim Keller (when he was still there) hinted that their offerings a year or so out are significantly bigger, to the point that he judged it worth mentioning, so I wouldn't write them off.


It seems like Apple listened when people talked about how all modern processors bottleneck on memory access and decided to focus heavily on getting those numbers better.

Of course this leads to the question: if everyone in the industry knew this was the issue, why weren't Intel and AMD pushing harder on it? They both already moved the memory controller onboard, so they had the opportunity to aggressively optimize it like Apple has done, but instead we have year after year where the memory lags behind the processor in speed improvements, to the point where it is ridiculous how many clock cycles a main-memory access takes on a modern x86 chip.


The Apple, Intel, and AMD memory controllers all look pretty similar in performance to me. Memory latency is the same at ~100 ns; Firestorm is clocked lower so latency is lower in terms of cycles. One Firestorm core can saturate the memory controller while Intel/AMD can't so that should be an advantage for single-threaded scenarios. Intel/AMD are behind, but I wouldn't say embarrassingly so and they haven't been lazy.


My guess is it has to do with limitations tied to the x86_64 instruction set. It doesn't matter how many modifications you make; if you don't start with a good foundation, you're going to be limited by that foundation.


I think the current consensus among experts is that the instruction set is not the limiting factor. Modern x64 microprocessors have a separate front-end that handles instruction decoding. These instructions are decoded to internal proprietary "micro-ops". The internal buffers and actual execution units see only these µops. One can measure where the bottlenecks are, and it's rare to find that the front-end is the bottleneck. While it's arguably true that x64 is a poorly designed "foundation", it's unlikely to be causing any performance difference here.


> One can measure where the bottlenecks are, and it's rare to find that the front-end is the bottleneck.

Part of this is due to the fact that x86 processor designers won't include more execution units than they can feed from their instruction decoders. Apple's processors are much wider than x86 on both the decode and execution resources, and it's pretty clear that the M1 would not perform as well if its decoders were as narrow as current x86 cores.


> it's pretty clear that the M1 would not perform as well if its decoders were as narrow as current x86 cores

This would imply that it's able to sustain ILP greater than 4 (or maybe 5 with macro-fusion). Does it actually manage to do this often? If so, that's really impressive. I was guessing that most of the advantage was coming from the improved memory handling, and possibly a much bigger reorder buffer to better take advantage of this, but I'm happy to be shown otherwise.


There are real differences in processors caused by their ISAs - it's not true that decoders mean it's all the same RISC in the backend.

For instance, it's hard to combine instructions together, which is actually an advantage for x86 (the complex memory operands come for free). But x86 also guarantees memory ordering that ARM doesn't, which is a drawback.

I'm not sure how important this is in practice.


> For instance, it's hard to combine instructions together, which is actually an advantage for x86 (the complex memory operands come for free).

True, although I just looked at the ARM assembly for Daniel's example, and it's making good use of "ldpsw" to load two registers from consecutive memory with a single instruction. So in this particular case, it may be a wash.
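
For illustration only (this is not Daniel's code, and `sum_pair`/`idx`/`table` are made-up names): a pair of adjacent 32-bit loads like the following gives an ARM64 compiler the opportunity to emit a single ldpsw, while an x86 compiler can fold each load into the addressing mode of the consuming instruction.

  #include <stdint.h>
  #include <stddef.h>

  /* Two adjacent signed 32-bit indices; on ARM64 a compiler may load the
     pair with one ldpsw (sign-extending both), whereas on x86 each 32-bit
     load can be folded into the consuming instruction's memory operand. */
  int64_t sum_pair(const int32_t *idx, const int64_t *table, size_t i) {
      int32_t a = idx[i];      /* first word of the pair */
      int32_t b = idx[i + 1];  /* adjacent word */
      return table[a] + table[b];
  }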

> But it also guarantees memory ordering that ARM doesn't which is a drawback.

Yes, I wasn't considering the memory model to be part of the instruction set. I agree that in general this could be a big difference in performance, although I don't think it comes up in Daniel's example.

I added a comment to Daniel's blog with my guess as to what's happening to cause the observed timings in his example. Feedback from anyone with better knowledge of M1 would be appreciated.


> I think the current consensus among experts is that the instruction set is not the limiting factor.

Yes and no. Yes, because modern superscalar CPUs don't execute the instructions directly, but rather use a different instruction set entirely (the "micro-ops") and effectively compile the native instructions into that. This makes them free to choose whatever micro-ops they want. Ergo the original instructions don't matter.

But .... that means there is a compile step now. For a while that was no biggie; it can be pipelined even if the encoding is complex. But now the M1 has 12 (iirc) execution units. In the worst case that means they can execute 12 instructions simultaneously, so they must decode 12 instructions simultaneously. That is a wee exaggeration, as it isn't quite that bad. In reality the M1 appears to compile 8 instructions in parallel.

This is where the rot sets in for x86. Every ARM64 instruction is 32 bits wide. So the M1 grabs 8 32-bit words and compiles them in parallel to micro-ops. Next cycle, it grabs another 8 32-bit words, compiles them to micro-ops, and so on. But x86 instructions can start on any byte boundary and can be 1 to 15 bytes in length. You literally have to parse the instruction stream a byte at a time before you can start decoding it. In practice they cheat a bit, making speculative guesses about where instructions might start and end, but when you're being compared to someone who effortlessly processes 32 bytes at a time, that's like pissing in the wind.
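
A toy way to see the difference (hypothetical code, not real ISA decoding; `length_of` is an imaginary callback): with fixed 32-bit instructions every decoder lane knows its start offset up front, while with variable-length instructions lane N cannot start until lane N-1's length is known.

  #include <stddef.h>
  #include <stdint.h>

  /* Fixed-width decode: every lane's start offset is known up front
     (4 * lane), so all eight "decodes" are independent and can proceed
     in parallel. */
  void decode_fixed(const uint32_t *stream, uint32_t out[8]) {
      for (int lane = 0; lane < 8; lane++)
          out[lane] = stream[lane];
  }

  /* Variable-length decode: instruction N's start depends on instruction
     N-1's length, so the loop carries a serial dependency. Real x86
     decoders speculate about boundaries to break this chain. */
  size_t decode_variable(const uint8_t *stream,
                         size_t (*length_of)(const uint8_t *),
                         size_t starts[8]) {
      size_t off = 0;
      for (int i = 0; i < 8; i++) {
          starts[i] = off;
          off += length_of(stream + off);  /* must finish before the next lane */
      }
      return off;
  }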

So the instruction set may not matter, but how you encode that instruction set does matter, a lot. Back in the day, when there were few caches and every instruction fetch cost a memory access, you were better off using tricks like one-byte encodings for the most common instructions to squeeze the size of the instruction stream down. That is the era x86 and amd64 hail from. (Notably, the ill-fated Intel iAPX 432 took it to an extreme, having instructions start and end on a bit boundary.) But now, with execution units operating in parallel and on-chip caches putting instruction stores right beside the CPUs, you are better off making storage size worse in order to gain parallelism in decoding. That's where ARM64 hails from.

It's interesting to watch RISC-V grapple with this. It's a very clever instruction set encoding that scales naturally between different word sizes. This also naturally leads to a very tight, compressed instruction set. But in order to achieve that they've got more coupling between instructions than ARM64 (but far, far less than x86), and any coupling makes parallelism harder. Currently RISC-V designs are all at the small, non-parallel end, so it doesn't affect them at all. In fact, at the low-power end it's almost certainly a win for them. But I get the distinct impression the consensus here on HN is that it will prevent them from hitting the heights ARM64 and the M1 achieve.


Great comment, and great accurate explanations of complex stuff! I'm still going to disagree with the conclusion, though. Yes, x64 has to jump over horrible hurdles to prevent instruction decode from being a bottleneck. Yes, this makes for more complex processors, possibly with poorer performance per Watt. But I'm asserting (in my reasonably expert opinion) that the ridiculous contortions currently in use are (in almost all cases) adequate to prevent the instruction decoder from being the bottleneck.

What's missing from your description is the extra level of decoded µop cache between the decoder and instruction queue on modern Intel chips. In a tight loop, this pre-decoder kicks in and replays the previously decoded µops at up to 6 per cycle. It's a mess (and complicated enough that Intel needed to disable part of it with a microcode update on Skylake) but it provides enough instruction throughput that the real bottleneck is almost always elsewhere. Specifically, the 4-per-cycle instruction retirement limit almost always maxes out my attempts at extremely tight loop code earlier than instruction decoding.

Which is to say, you are right about how much easier it is to decode ARM64 instructions, but I think you are wrong that decoding x64 is in-practice a limiting factor for performance. If you have a non-contrived example to the contrary, I'd love to see it.

More details here: https://en.wikichip.org/wiki/intel/microarchitectures/skylak...

And in this nice blog post: https://paweldziepak.dev/2019/06/21/avoiding-icache-misses/


> but instead we have year after year where the memory lags behind the processor in speed improvements

That's what cache and tlb are for.


Because they both use new memory standards to force a refresh in their CPU platforms to cause more churn and revenue.


Certainly Apple's processors are far ahead, but they're a full process generation (5nm) ahead of their competitors. They paid their way to that exclusive right through TSMC.

I'm sure they'll still come out ahead in benchmarks, but the numbers will be much closer once AMD moves to 5nm. You absolutely cannot fairly compare chips from different fab generations.

I don't see many comments hammering this point home enough... it's not like the performance gap is through engineering efforts that are leagues ahead. Certainly some can be attributed to that, and Apple has the resources to poach any talent necessary.


A node shrink gives you a choice of cutting power, improving performance, or some mix of the two.

Apple appears to have taken the power reduction when they moved to TSMC 5nm.

>The one explanation and theory I have is that Apple might have finally pulled back on their excessive peak power draw at the maximum performance states of the CPUs and GPUs, and thus peak performance wouldn’t have seen such a large jump this generation, but favour more sustainable thermal figures.

>Apple’s A12 and A13 chips were large performance upgrades both on the side of the CPU and GPU, however one criticism I had made of the company’s designs is that they both increased the power draw beyond what was usually sustainable in a mobile thermal envelope. This meant that while the designs had amazing peak performance figures, the chips were unable to sustain them for prolonged periods beyond 2-3 minutes. Keeping that in mind, the devices throttled to performance levels that were still ahead of the competition, leaving Apple in a leadership position in terms of efficiency.

https://www.anandtech.com/show/16088/apple-announces-5nm-a14...


That seems like a pretty good trade-off for mobile devices. They usually don't have sustained performance needs; you aren't going to render a movie or do other long-running computational tasks. But mobile has a lot of bursty power demands: launching dozens of apps many times a day, for example. You want your short-ish interactions with your phone to be snappy.


From a customer's perspective it's not my problem. Everyone had the opportunity to bid on that fab capacity and they decided not to.


Yeah, totally agreed. But if you read these comments, they seem to be in total amazement about the performance gap and not acknowledging how much of an advantage being a fab generation ahead is.

Customers don't care, but discussion of the merits of the chip should be more nuanced about this.

It also implies that the gap won't exist for very long, as AMD will move onto 5nm soon


> It also implies that the gap won't exist for very long, as AMD will move onto 5nm soon

... yes, if there is any capacity left. Capacity for the new process is a limited resource after all.


People keep pointing this out, but has Intel had such significant performance improvements since Sandy Bridge? With x86 it seems that lately you would be foolish to upgrade more often than once every 3-4 years because the difference is just not that significant.


The i7-2600K (Sandy Bridge) benchmarks at ~5000 on Passmark, and the i7-10700K at about 20,000. So it seems they've had quite a bit of improvement. Note this is going from 32nm to 14nm.

Intel is in a really bad place now (in a forward-looking sense), primarily due to their fab process falling behind TSMC and others. You can't design your way ahead while using old manufacturing technology

https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-2600K...


Over the last decade or so Apple has gone from 10x slower than Intel to parity, mostly by implementing techniques that were already known. Surpassing the state of the art may be harder to do consistently.


A node shrink is going to help AMD by 15% at best, they are much farther behind than that on performance per watt.

AMD has done mobile CPUs that look as if they are close to or even ahead of the M1 in performance, but they all use 2x to 4x as much power. When higher core count versions of Apple Silicon are available, they will be able to have double the core counts of AMD chips at the same power levels.

And each of those cores is significantly faster than an individual AMD core.


AMD's mobile chips are still on Zen 2 cores, so they are behind the M1 on single core performance in both integer and floating point.

It's their desktop chips running Zen 3 cores that trade blows on single core integer math, depending on which benchmark you look at.

https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

Of course, on multi-core performance you can buy a desktop chip with a much higher Zen 3 core count than Apple offers.


The Ryzen 5950X draws around 135 watts in actual use. That's roughly 8x the M1.

https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...

And Ryzen chips are offered with more cores, but that’s an extremely temporary advantage (reminds me of the friend who told me not to buy Apple stock because they didn’t have big screen phones).

When Apple fits 32 Firestorm cores in a 135 watt TDP package, AMD isn’t going to have an answer.


Power consumption is not linear as it relates to performance. CPUs designed for the desktop are going to use excessive power by design. They'll often use many times more power than mobile equivalent, but only have slightly better single core performance.

Here's a great example, Intel Core i9 (Desktop, 125w TDP) vs Intel Core i7 (Laptop, 15W). Huge power difference, only ~10% difference in single core clock speeds. https://cpu.userbenchmark.com/Compare/Intel-Core-i9-10900K-v...


It might not be linear in terms of single core performance, where Apple already dominates, but it sure has a nearly linear relationship to multicore performance.


Ok, but their multicore score is only on par, not better than equivalent class chips on older fab generations. Where is the huge gap?

https://wccftech.com/intel-and-amd-x86-mobility-cpus-destroy...

Or check multicore scores here: https://browser.geekbench.com/mac-benchmarks

The M1 is better by 20% or so, but is 2 fab generations ahead of the Intel chips.


The M1 only has 4 Firestorm cores and a 15 watt TDP. Its other four cores are Icestorm cores offering roughly 1/10th the performance and 1/3rd the power draw.

https://www.anandtech.com/show/16192/the-iphone-12-review/2

The Ryzen 4900H is a 55 watt TDP part with eight performance cores. Despite that it’s far behind the M1 in GeekBench single core and multicore scores.

https://browser.geekbench.com/v5/cpu/search?q=4900H

Now, as you've shown, it's way ahead in the Cinebench R23 multicore benchmark. Let's assume that's more representative of real-world use, and that being a process behind means a 5 nm successor would add 15% higher performance. That would increase its Cinebench multicore score to roughly 12,700, roughly 60% higher than the M1.

But all Apple has to do is come out with an M1X with eight Firestorm cores. That’s a multicore performance in the same range as the best possible 5 nm AMD CPU, and a TDP barely half of the AMD CPU. And far higher Cinebench single core, and GeekBench single/multicore ratings.

Obviously, to swap the Icestorm cores for Firestorm cores they need more transistors, or something else has to go (the on-chip GPU?).

And Apple won’t be making 20 hour MacBook Airs out of this M1X, but they will be able to make 14 hour MacBook Pros with faster discrete GPUs.

And they are about 6 months from doing exactly that, which is likely to land before AMD's 5 nm parts, so top-of-the-line 4900H laptops get smoked on all Cinebench and GeekBench scores while using nearly double the power.


The Apple M1 is fully optimized for efficiency because it's derived from a mobile SoC and is still aimed at portable laptops, while the Zen 3 (or Willow Cove) architecture is aimed at the whole laptop/desktop/server range, so they optimized for both efficiency and max performance.

Even though M1 is designed for efficiency, it sometimes outperforms AMD/Intel for performance. That confuses the story.


The 5nm vs 7nm vs 10nm vs whatever-nm narrative is highly reminiscent of the clock rate flame wars raging in the tech bubble over a decade ago. Back in the early aughts, when great engineering was still a thing, alternative CPU designs (DEC Alpha 21264, MIPS R10k/R12k, PA-RISC, POWER and later UltraSPARC designs) were consistently outperforming any x86 design by at least an order of magnitude or more whilst being clocked 30% to 50% lower than the x86 designs, especially in FP operations. The alternative designs explored and utilised wider and deeper pipelines, bigger L1 caches and various optimisations across the entire CPU arch. Every alternative CPU design had something unique to offer, and that was a great thing to read about and study.

The commoditisation of PC hardware has driven great CPU designs into extinction. Heck, even Oracle, now in the business of litigation for fun and massive profit, with its prodigious cash war chest, has discontinued the UltraSPARC architecture because it required extraordinary investments on multiple fronts. PC users have long been forced to be content with whatever bone the CPU architecture coloniser would throw at them. There appears to be a resurgence of great engineering with the M1, and hopefully that will lead to more thoughtful engineering in the medium to long term.

The M1 is fast due to: a solid, single vision of what a modern CPU should be like, continuous investment in R&D over an extended period of time, a well-concerted engineering effort, design-idea reuse across multiple product lines, supply chain management, and, of course, the manufacturing process. Nanometers do not make for a great CPU design but rather play a supporting role. If the nanometers were so important, the 2017 POWER9 design, manufactured on a 14 nm process with a smaller L1 cache, would not have been able to outperform existing x86 designs in 2020 in both single-core and multi-core setups (with 25% to 50% fewer physical cores). Ryzen 3 has narrowed the gap, but POWER9 still takes the lead and POWER10 is around the corner.

There is a great quote by Michael Mahon, a principal HP architect, in the foreword to the PA RISC 2.0 CPU architecture handbook from 1995:

The purpose of a processor architecture is to define a stable interface which can efficiently couple multiple generations of software investment to successive generations of hardware technology. Stability and efficiency are the goals, and the range of software and hardware technologies expected during the architecture’s life determine the scope for which the goals must be achieved

...

Efficiency also has evident value to users, but there is no simple recipe for achieving it. Optimizing architectural efficiency is a complex search in a multidimensional space, involving disciplines ranging from device physics and circuit design at the lower levels of abstraction, to compiler optimizations and application structure at the upper levels.

Because of the inherent complexity of the problem, the design of processor architecture is an iterative, heuristic process which depends upon methodical comparison of alternatives («hill climbing») and upon creative flashes of insight («peak jumping»), guided by engineering judgement and good taste.

To design an efficient processor architecture, then, one needs excellent tools and measurements for accurate comparisons when «hill climbing,» and the most creative and experienced designers for superior «peak jumping.» At HP, this need is met within a cross-functional team of about twenty designers, each with depth in one or more technologies, all guided by a broad vision of the system as a whole.

A well-executed holistic approach is the reason the entry-level M1 is fast. We need more of this «holistic-ism» in engineering everywhere.


Sorry, but alternate CPU designs were never outperforming x86 by an "order of magnitude", especially not at a lower clock speed. That is a complete exaggeration. I was around during that time period and can find nothing that supports this. The Alpha was fast, yes, but you're talking 2 to 3x best case with floating point compared to a 2 to 3x cheaper Intel system. I did find some old benchmarks: http://macspeedzone.com/archive/4.0/WinvsMacSPECint.html

UltraSPARC was not very competitive. Those machines were very, very expensive and you didn't get much bang for the buck. They weren't even that fast. The later chips had tons of threads but single threaded performance was pretty bad...


> There is no "magic" at play here, it's just lots and lots of raw muscle. They just seem to have a freakishly successful strategy for choosing what aspects of the processor to throw that muscle at.

There is no freakishly successful strategy at play there either. It's just that all previous attempts at a "fast ARM" chip were rather half-hearted ("add a pipeline stage here, add an extra register there, increase the datapath width there") and never squeezed the design to the limit.


The answer is that they have raw, hard numbers from the hundreds of millions of iPads/iPhones sold each year, and can use the metrics from those devices to optimize the next generation of devices.

These improvements didn't come from nowhere. It came from iterations of iOS hardware.


Something I've seen no one else mentioning: Apple's low-spec tier is $1000, not $70.


It's $699, for a complete device, not a part of one.


What's the screen resolution on that $699 "complete device"?


Comparable to a $70 processor, at least.


5120 by 2880


> What enabled/led them to make those choices where others haven't?

Others have to some extent — AMD is certainly not out of the game — so I'd treat this more as the question of how they've been able to go more aggressively down that path. One of the really obvious answers is that they control the whole stack — not just the hardware and OS but also the compilers and high-level frameworks used in many demanding contexts.

If you're Intel or Qualcomm, you have a wider range of things to support _and_ less revenue per device to support it, and you are likely to have to coordinate improvements with other companies who may have different priorities. Apple can profile things which their users do and direct attention to the right team. A company like Intel might profile something and see that they can make some changes to the CPU but the biggest gains would require work by a system vendor, a compiler improvement, Windows/Linux kernel change, etc. — they contribute a large amount of code to many open source projects but even that takes time to ship and be used.


Intel does lots of contributions across the OS (Linux and glibc) to compilers including their own (gcc, icc, ispc, etc). Their problems aren't their ability, it's that Intel is poorly managed and internal groups are constantly fighting with each other.

Also, compiler support for CPUs is very overrated. Heavy compiler investment was attempted with Itanium and debunked; giant OoO CPUs like Intel's or M1 barely care about code quality, and the compilers have very little tuning for individual models.


> Intel does lots of contributions across the OS (Linux and glibc) to compilers including their own (gcc, icc, ispc, etc). Their problems aren't their ability, it's that Intel is poorly managed and internal groups are constantly fighting with each other.

I wasn't just talking about Intel but the concept of separate CPU and compiler vendors in general. Intel contributes a ton of open source but even if they were perfectly organized it takes time for everything to happen on different schedules before it's generally available: get patches into something like Linux or gcc, wait possibly years for Red Hat to ship a release using the new version, etc. Certain users — e.g. game or scientific developers — might jump on a new compiler or feature faster, of course, but that's far from a given and it means they're not going to get the across-the-board excellent scores that Apple is showing.

> Also, compiler support for CPUs is very overrated. Heavy compiler investment was attempted with Itanium and debunked; giant OoO CPUs like Intel's or M1 barely care about code quality, and the compilers have very little tuning for individual models.

This isn't entirely wrong but it's definitely not complete. Itanium failed because brilliant compilers didn't exist and it was barely faster even with hand-tuned code, especially when you adjusted for cost, but that doesn't mean that it doesn't matter at all. I've definitely seen significant improvements caused by CPU family-specific tuning and, more importantly, when new features are added (e.g. SIMD, dedicated crypto instructions, etc.) a compiler or library which knows how to use those can see huge improvements on specific benchmarks. That was more what I had in mind since those are a great example of where Apple's integration shines: when they have a task like “Make H.265 video cheap on a phone” or “Use ML to analyze a video stream” they can profile the whole stack, decide where it makes sense to add hardware acceleration, and then update their choice of the compiler toolchain and higher-level libraries (e.g. Accelerate.framework) and ship the entire thing at the time of their choosing whereas AMD/Intel/Qualcomm and maybe nVidia have to get Microsoft/Linux and maybe someone like Adobe on board to get the same thing done.

That isn't a certain win — Apple can't work on everything at once and they certainly make mistakes — but it's hard to match unless they do screw up.


> Itanium failed because brilliant compilers didn't exist and it was barely faster even with hand-tuned code, especially when you adjusted for cost, but that doesn't mean that it doesn't matter at all.

What you said is true for libraries, I just don't think it's true for compiler optimizations. Even Apple's clang just doesn't have any new optimizations that work on their own; there are certainly new features but they're usually intrinsics and other things that need to be adopted by hand. They thought this would happen (it's what bitcode was sold as doing) but in practice it has not happened.


The big "enabler" was their mass-purchase of 5nm lithography across the board. Even still though, 4ghz*8c isn't anything new, and isn't really that remarkable besides the low TDP (which is incidentally dwarfed by the display, which draws up to 5x more power than the CPU does). I think the big issue is that Apple has painted themselves into a corner here: ARM won't play nice with the larger CPUs they want to make, and the pressure for them to provide a competent graphics solution on custom silicon is mounting. They spent a lot of time this generation marketing their "energy efficiency" and battery life, but many consumers/professionals (myself included) don't really care about either of these things.


"the display, which draws up to 5x more power than the CPU does" - wat? Apple-supplied monitoring tools report that M1 under full load (all CPU and GPU cores) can draw over 30 watts. Laptop displays don't consume 150 watts. If anything, the displays in either of the M1 portables likely consume about 5 times less than 30W, even at full brightness.

"ARM won't play nice with the larger CPUs they want to make" - wat? Apple holds an architectural license. This means they paid a lot upfront a long time ago and therefore have a more or less perpetual right to design their own Arm cores without input from Arm.

"a competent graphics solution" - Also wat? M1 has an excellent GPU. It doesn't compete with discrete GPUs that use 300 watts, but that's fine: M1 is the chip for entry level Macs, designed for the smallest and lightest segments of their notebook line. And in that product segment, it has been every bit as much a revelation as the CPU. It's very fast, and uses little power given the performance.

What exactly do you think is going to happen when they scale that basic GPU design up? Despite your dismissiveness, in modern silicon architecture energy efficiency is incredibly important: for any given power budget, the more efficient you are the more performance you can deliver. The performance Apple gets out of about 10W on M1 suggests they'll have few problems building a larger GPU to compete with Nvidia and AMD discrete GPUs.


No fighting the sales department on where to put the market segmentation bottlenecks?


Didn't they also make some interesting hires a few years ago, like Anand from Anandtech and some other silicon vets, who likely helped them design the M1 approach?


I see two main things behind it:

1. They are the only ones who have 5nm chips, because they paid TSMC a lot for that right.

2. They gave up on expandable memory, which lets them solder it right next to the CPU, which likely makes it easier to ship really high clocks. And/or they just spent the money it takes to get binned LPDDR4X at that speed.

So: a good CPU design, just like AMD and Intel have, but one generation ahead on node size, and fast RAM. It's not special low-latency RAM or anything, just clocked higher than maybe any other production machine, though enthusiasts sometimes clock theirs higher on desktops!


> So a good cpu design, just like AMD and Intel have

The design seems to be very different, in that it's far far wider, and supposedly has a much better branch predictor.

> fast ram

Is that a property of the RAM clock, or a function of a better memory controller? The RAM certainly doesn't appear to have any better latency.


Right, latency isn't (much) affected by a higher clock rate. Getting ram to run fast requires both good ram chips and good controller/motherboard.

And yes, obviously Apple's bespoke ARM CPU is quite a bit different from Zen 3 Ryzen's x86 CPU, but I'm not sure it is net-better. When Zen 4 hits at 5nm I expect it will perform on par with or better than the M1, but we won't know till it happens!


Frankly, I find Lemire does oversimplified, poorly controlled, back-of-the-envelope microbenchmarking all the time that provides little to no insight other than establishing a general trend. It's sophomoric and a poor demonstration of how to do well-controlled benchmarking that might yield useful, repeatable, and transferable results.


Can you give an example? I've seen Lemire correct his posts on many occasions and the source code is published. I don't know many blogs doing anything remotely like that.


Sure. He often benchmarks some small C++ code on his "laptop" CPU (which one exactly? microarchs matter!) while committing classic microbenchmarking pitfalls such as the following (a sketch of a more careful harness follows the list):

- benchmarking something small enough to inspect machine code, but not inspecting machine code

- not plotting distribution, average, variance etc

- no attention paid to CPU frequency governor settings

- measuring too short a run

- measuring too small a dataset, such that it fits entirely in L1
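
A minimal sketch of the kind of controls the list is asking for (my own generic example, not how Lemire structures his benchmarks): a dataset several times larger than any cache, repeated runs, a reported spread instead of a single number, and a sink so the compiler can't delete the work. The governor settings and machine-code inspection still have to be handled outside the code.

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N    (64 * 1024 * 1024)   /* 256 MB of uint32_t: far larger than any cache */
  #define REPS 20                   /* enough runs to see the spread */

  static double now_sec(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + ts.tv_nsec * 1e-9;
  }

  int main(void) {
      uint32_t *data = malloc((size_t)N * sizeof *data);
      if (!data) return 1;
      for (size_t i = 0; i < N; i++) data[i] = (uint32_t)i * 2654435761u;

      uint64_t sink = 0;            /* keeps the loop from being optimized away */
      double samples[REPS];
      for (int r = 0; r < REPS; r++) {
          double t0 = now_sec();
          for (size_t i = 0; i < N; i++) sink += data[i];
          samples[r] = (now_sec() - t0) * 1e9 / N;   /* ns per element */
      }

      double lo = samples[0], hi = samples[0], sum = 0;
      for (int r = 0; r < REPS; r++) {
          if (samples[r] < lo) lo = samples[r];
          if (samples[r] > hi) hi = samples[r];
          sum += samples[r];
      }
      printf("ns/elem: min %.3f  mean %.3f  max %.3f  (sink %llu)\n",
             lo, sum / REPS, hi, (unsigned long long)sink);
      free(data);
      return 0;
  }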


The most impressive thing I've seen is that, when accessed in a TLB-friendly fashion, the latency is around 30 ns.

Anandtech has a graph showing this, specifically the R per RV prange graph. I've verified this personally with a small microbenchmark I wrote. I've not seen anything else close to this memory latency.


Sorry, what would AMD's or Intel's "latest and greatest" numbers for the same be?


Here's the M1: https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

Scroll down to the latency vs size map and look at the R per RV prange. That gets you 30ns or so.

Similar for AMD's latest/greatest the Ryzen 9 5950X: https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...

The same R per RV prange is in the 60ns range.


Could this be coming from the page size being 4x as large for Apple Silicon versus x86? I don't fully understand the benchmark, but it appears to be accessing a variety of pages from the same first level TLB lookup?

It's been a long time since I dealt with this stuff (wanted to get 1GB huge pages in Linux for some huge huge hash tables), so maybe I'm misunderstanding.


Cachelines, page sizes, and size of the TLB all play a role. But with tinkering you can see those effects yourself and I played with 1, 2, 4, 8, 16, and 32 "pages" which I assumed were 4KB each and didn't see much difference. Measured latencies do increase slowly, but you expect that as the TLB becomes progressively more of a bottleneck.

If you use a 1GB array and see full random with much higher latency than a sliding window then you can be pretty sure that the page size is much less than 1GB.

Getting the cacheline off by a factor of 2 does make a small difference since you get occasional cache hits instead of zero, but as long as the array tested is several times larger than cache the impact is small.

But all in all the M1 has excellent memory bandwidth, excellent latency, and shows significantly better throughput on random workloads as you use more cores. Normal PC desktops have 2 memory channels (even the higher-end i7/i9/Ryzen 7/Ryzen 9); only the $$$$ workstation chips like Threadripper and some of the $$$$ Intel parts have more. The little ole M1 in a Mac mini, starting at $700, has at least 8 memory channels. So basically the M1 delivers on all fronts: larger and lower-latency caches, wide issue, large reorder buffers, excellent IPC, and excellent power efficiency.


Hi, Andrei here.

Just a clarification as to why the R per RV prange numbers are good: this pattern is simply aggressively prefetched by the region prefetcher in the M1, while Zen 3 doesn't pull things in as aggressively.


Thank you very much. So we are talking about doubling (or halving, depending what side you are looking from) the access times.


Mind sharing the micro benchmark you wrote? I’m curious to know how that would work


https://github.com/spikebike/pstream

It's designed to graph latency/bandwidth for 1 to N threads. My 1 thread numbers match Anandtech's. Use -p 0 for full random, which thrashes the TLB or -p 1 to be cache friendly (visit each cacheline once, but within a sliding window of 1 page).

To see the apple results (if you have gnuplot installed): ./lview results/apple-m1


I don't understand this competition to attribute the M1's speed to one specific change, while downplaying all of the others.

M1 is fast because they optimized everything across the board. The speed is the cumulative result of many optimizations, from the on-package memory to the memory clock speed to the architecture.


It's fast because they optimized everything across the board, and also paid for exclusive access to TSMC 5nm process.


What other laptop ships with LPDDR4X clocked at 4267? I agree though that being closer to the cpu isn't having any appreciable effect on latency, but being soldered close to the cpu probably does make it easier for them to hit that high clock rate.


As WMF mentions, Tiger Lake laptops like my Razer Book have the same memory. It is not appreciably closer to the CPU in the Apple design. In Intel's Tiger Lake reference designs the memory is also in two chips that are mounted right next to the CPU.


I have a Dell XPS 13 with a Tiger Lake CPU. Out of curiosity, running the script:

  > ./two_or_three
  N = 1000000000, 953.7 MB
  starting experiments.
  two  : 30.6 ns
  two+  : 39.6 ns
  three: 45.1 ns
  bogus 1422321000
This is much slower than the times Lemire reported for the M1. `two+` is 62% of the way between `two` and `three`, vs 88% for the M1.

EDIT: Adding `-march=native` didn't really change the results, which makes sense given that it's a memory benchmark.


Strange on my XPS 15 7590 (i7-9750H), I get

   N = 1000000000, 953.7 MB
   starting experiments.
   two  : 12.8 ns
   two+  : 13.7 ns
   three: 19.5 ns
   bogus 1422321000


I just ran it again, and got more or less the same results:

  N = 1000000000, 953.7 MB
  starting experiments.
  two  : 29.7 ns
  two+  : 36.5 ns
  three: 43.8 ns
This surprises me. Normally, it does very well in most benchmarks I run.

Looking a little closer at the script, it loads numbers from "random", a vector of 3 million `Int` (this is hard coded, separate from `N`). This vector is about 11.4 MiB.

The Tiger Lake CPU has 12 MiB of L3 cache (same as your i7-9750H), so it barely fits. Meanwhile, the L1 cache is 48 KiB and the L2 cache is 1.5 MiB -- huge compared to most recent CPUs, and a big benefit in most benchmarks, but at the cost of higher latency. https://www.anandtech.com/show/16084/intel-tiger-lake-review...

Skylake's L3 latency was 26-37 cycles, and Willow Cove's (Tiger Lake) is 39-45 cycles. That difference by itself isn't big enough to account for the difference we're seeing, so something else must be going on.


The caching of the random[] array (almost) shouldn't matter, as the access is sequential.

I'm wondering if the difference is the number of active memory channels. How many channels do your respective computers support? Do you have enough RAM installed so all channels are in use? Are you able to do a RAM bandwidth test by some other means to verify?

Another possibility is that for some reason the base latency is just different between your machines. A commenter added a pointer-chasing variation of Daniel's test on his blog. Maybe run this to find the full latency and see how the times differ?

Finally, there was one more commenter on the blog who reported anomalously fast times on a Windows laptop. It's possible there is a bug with Daniel's time measurements on windows.


I'll run the pointer-chasing version later. Thanks

Re timing issues: I am on Linux.


Yeah, that is strange. Why was it so slow? Trying on a desktop with a 7900X, I get

  N = 1000000000, 953.7 MB
  starting experiments.
  two  : 17.7 ns
  two+  : 19.1 ns
  three: 26.4 ns
  bogus 1422321000
This is again close to 50% slower than your i7-9750H numbers, but nearly twice as fast as my Tiger Lake laptop. I'll try again on the laptop and make sure I don't have other processes running.


Strange. I suppose the compiler shouldn't matter much here, right? At any rate, I'm using GCC 10.2.1.


I'm using gcc 10.2.0. I tried clang 11 and got more or less the same thing, so it doesn't seem to make much of a difference.

Neither did messing with flags (I tried -fno-semantic-interposition, -march=native, and a few others).


And (genuine question) how do the Tiger Lake laptops compare with the M1 MacBooks thus far?



The outcome seems to depend greatly on the physical design of the laptops. The elsewhere-mentioned Dell XPS 13 has a particularly poor cooling design, which is why I chose the Razer Book instead. Despite being marketed in a very silly way to gamers only, it seems to have competent mechanical design.


Gamers are likely to run their systems with demanding workloads, for hours, with a color-coded performance counter (FPS stat). They'll notice if it throttles. They're particularly demanding customers, and there's quite a bit of competition for their money.


How common are laptops for gamers? I always build my windows boxes but I’m a casual gamer.


I'm not sure. I know they've been getting more popular with the increased power of laptops and the ability to use external GPUs (via Thunderbolt 3). I'd guess desktops are more common, but some people will have both.


The raw CPU performance of the M1 is about 15% faster single core, and 50% faster multicore, while using nearly half as much power.


Mind running a memory latency benchmark on your Razer Book? Does it run linux by chance?



Tiger Lake laptops such as the XPS 13.


> it can do with a single core what an Intel part needs all cores to accomplish.

Care to explain what you mean specifically by this?


The M1 has extremely high single-core performance.


It is not 4 times faster than an Intel core, though...


It is in memory performance, which is what I assumed was being measured here.


How are you defining memory performance and where are your supporting comparisons? This article only discusses the M1's behavior, and makes no comparisons to any other CPU.


FWIW, I ran it on a MacBook Pro (13-inch, 2019, Four Thunderbolt 3 ports), 2.4 GHz Quad-Core Intel Core i5, 8 GB 2133 MHz LPDDR3:

  two  : 49.6 ns  (x 5.5)
  two+ : 64.8 ns  (x 5.2)
  three: 72.8 ns  (x 5.6)
EDIT to add: above was just `cc`. Below is with `cc -O3 -Wall`, as in Lemire's article:

  two  : 62.8 ns  (x 7.1)
  two+ : 69.2 ns  (x 5.5)
  three: 95.3 ns  (x 7.3)


You _need_ to use -mnative because it otherwise retains backwards compatibility to older x86.


  (base) Coding % cc -mnative two-three.c
  clang: error: unknown argument: '-mnative'

  (base) Coding % cc -v
  Apple clang version 12.0.0 (clang-1200.0.32.28)
  Target: x86_64-apple-darwin20.2.0
  Thread model: posix


It's spelled "-march=native" in gcc and "-arch x86_64h" in clang.

It doesn't make much difference though, autovectorization doesn't work very well and there is not a lot of special optimization for newer x86 CPUs.


All recent Intel Core-i microarchitectures require using full vector width loads to max out L1d bandwidth, because the load/store units don't actually care about the width of a load, as long as it doesn't cross a cache line (in which case the typical penalty is an additional cycle).

Only using 128 bit wide instructions on a core that has 512 bit hardware results in 4x less L1d bandwidth.
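
A hedged illustration of the point (AVX-512 intrinsics; assumes the length is a multiple of 8 and a CPU/compiler with -mavx512f; a sketch, not tuned code from the article):

  #include <immintrin.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Full-width 512-bit loads: each load covers an entire 64-byte cache line,
     so far fewer load instructions are needed to stream the same data. */
  uint64_t sum512(const uint64_t *a, size_t n) {
      __m512i acc = _mm512_setzero_si512();
      for (size_t i = 0; i < n; i += 8)          /* 8 x 64-bit lanes per load */
          acc = _mm512_add_epi64(acc, _mm512_loadu_si512(&a[i]));
      uint64_t lanes[8];
      _mm512_storeu_si512(lanes, acc);
      uint64_t s = 0;
      for (int j = 0; j < 8; j++)
          s += lanes[j];
      return s;
  }

  /* The same loop written with 128-bit loads issues 4x as many load
     instructions for the same bytes, which is what caps L1d bandwidth
     on a load-port-limited core. */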


there must be something wrong there, on my late 2014 laptop that mounts

    Type: DDR4
    Speed: 2133 MT/s
I get

    two  : 27.1 ns (3x)
    two+ : 28.6 ns (2.2x)
    three: 39.7 ns (3x)
which is not much, considering this is an almost six-year-old system with 2x slower memory.


Dunno, I didn't reboot and didn't close all other programs (browser, editor, mail, calendar, notes, editor)... Top shows

Load Avg: 2.36, 2.01, 1.97 CPU usage: 2.10% user, 3.39% sys, 94.49% idle


Sure, and it has a very large out-of-order execution engine, but it is not fundamentally different from what other super scalar processors do. So I am curious what the OP meant by that offhand comment.


One core of the M1 can drive the memory subsystem to the rails. A single core can copy (load+store) at 60GB/s. This is close to the theoretical design limit for LPDDR4X. A single core on Tiger Lake can only hit about 34GB/s, and Skylake-SP only gets about 15GB/s. So yes, it is close to 4x faster.


Thanks for clarifying. But this isn't any fundamental difference IMO. There isn't any functional limitation in an Intel core that means it cannot saturate the memory bandwidth from a single core, unless I am missing something.


One could argue that it's not "fundamental", but it's definitely a functional limitation of the current Intel cores. The memory bandwidth of a single core is hardware-limited by the number of "Line Fill Buffers". Each buffer keeps track of one outstanding L1 cacheline miss, thus the number of LFBs limits the memory-level parallelism (MLP). "Little's Law" gives the relationship between the latency, outstanding requests, and throughput. With 10 LFBs and the current latency of memory, it's physically impossible for a single core to use all available memory bandwidth, especially on machines with more than 2 memory channels.

The M1 chip allows higher MLP, presumably because it has more LFB's per core (or maybe they are using different approach where the LFB's are not per-core?). I apologize for using so many abbreviations. I searched to try to find a better intro, but didn't find anything perfect. I did come across this thread that (apparently) I started several years ago at the point where I was trying to understand what was happening: https://community.intel.com/t5/Software-Tuning-Performance/S....
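
To make the Little's Law point concrete, here is a back-of-the-envelope calculation with purely illustrative numbers (10 LFBs, 64-byte lines, and ~100 ns effective DRAM latency are assumptions, not measurements of any particular part):

  max bandwidth = outstanding lines x line size / latency
                = 10 x 64 B / 100 ns
                = 6.4 GB/s per core (demand misses only)
Hardware prefetchers bring in additional lines outside the LFB budget, which is why measured single-core numbers are higher, but the per-core ceiling is still well below what a many-channel memory system can supply. Raising the number of outstanding misses per core (as the M1 apparently does) raises that ceiling directly.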


Came across someone else mentioning the similar bandwidth constraint w.r.to LFB per core a month back.

https://news.ycombinator.com/item?id=25221968


I agree, it's not fundamental. It is, in particular, not that other popular myth, that it's "because ARM". It's only that 1 core on an Intel chip can have N-many outstanding loads and 1 core of an M1 can have M>N outstanding loads.


In other words, it’s better architecture. If anything this makes it seem more impressive to me.


No, it's the same architecture but with different parameters.

It's like a world where every car uses 4 cylinders, and then Apple comes along and makes a car with 5 cylinders.


Your analogy was so close! It's Apple comes along and makes an 8 cylinder engine. Since, you know, the other CPUs are 4-wide decode and Apple's M1 is 8-wide decode :)


Zen 3 is effectively 8 wide with their micro-op cache. Intel similarly has been 6 wide for ages.


You’ll get people guessing, since Apple itself puts out so little information.


>or that it's faster because the memory is 2mm closer to the CPU (not that either)

Not to disagree with your overall point, but 2mm is a long way when dealing with high frequency signals. You can't just eyeball this and infer that it makes no difference to performance or power consumption.


If it works, it works. There will be no observable performance difference for DDR4 SDRAM implementations with the same timing parameters, regardless of the trace length. There are systems out there with 15cm of traces between the memory controller pins and the DRAM chips. The only thing you can say against them is they might consume more power driving that trace. But you wouldn't say they are meaningfully slower.


You can't just eyeball the PCB layout for a GHz frequency circuit and say "yeah that would definitely work just the same if you moved that component 2mm in this direction". It's certainly possible to use longer trace lengths, but that may come with tradeoffs.

>The only thing you can say against them is they might consume more power driving that trace

Power consumption is really important in a laptop, and Apple clearly care deeply about minimising it.

For all we know for sure, moving the memory closer to the CPU may have been part of what's enabled Apple to run higher frequency memory with acceptable (to them) power draw.


>that it's "on-die" (it's not)

It appears to be mounted on the same chip package.

Why did Apple do this if not for speed?


On-package memory is not faster. I suspect it is more power efficient though.


It makes it easier to get to a particular clock speed. The geometry, interconnect lengths, etc. are all tightly controlled; there's less noise because you're not on the main PCB; and you have interconnect options beyond whatever your PCB process offers (e.g. commonly gold wires).


>HN memes about M1 memory will die

It is not only HN. It is practically the whole Internet. Go around the top 20 hardware and Apple forums and you see the same thing, vastly amplified by a few KOLs on Twitter.

I don't remember ever seeing anything quite like it in tech circles. People were happily running around spreading misinformation.


Yeah, I know. There was some kid on Twitter who was trying to tell me that it was the solder in an x86 machine (he actually said "a Microsoft computer") that made them slower. Apple, without the solder, was much faster.

According to this person's bio they had an undergraduate education in computer science ¯\_(ツ)_/¯


What is a KOL?


"Key Opinion Leader". I think it's the new word for "Influencer".


I am pretty sure KOL predates "influencer" in modern internet usage. Before that they were simply known as internet celebrities. Maybe it is rarely used now, so apologies for not explaining the acronym.


First I've heard of it!


Who introduced the term KOL and bestows the K title?


I can't find any info about the memory bus of the Apple M1. Is it 8 channels of 16 bits each? That's drastically different from AMD's 2 channels of 64 bits each.

It looks like the Apple M1 is much less eager when caching memory rows. Maybe because it doesn't have an L3 cache.

Edit: This test utilizes the 8x16-bit memory bus of the Apple M1 fully. It's mostly just fetching random locations from memory, which can all be parallelized by the CPU pipeline. It explains why the results are exactly 4x slower on my Ryzen 3 with 2 memory channels.

So the summary is that the M1 is optimized for dynamic languages that tend to do a DDoS attack on RAM with a lot of random memory access, but it might take a performance hit with compiled languages and traditional HPC techniques that tend to process data in sequence, like ECS.


That's an interesting observation - as someone who's built a few ECS implementations, one of the things I've always taken for granted is that things like cache line size are more or less set in stone, given the ubiquity of x86, so it's interesting to consider that the rise of ARM might create additional complexities there.

I'm a bit of two minds about this: on the one hand, for a long time I've wanted a language for writing allocators which is more explicit about memory, and offers good abstractions for low-level memory operations (maybe Zig is going in this direction). In some sense, it feels like the move towards programmers thinking less about memory management has been a bit of a dead-end, and what we really want is better tools for memory management. Fragmentation in terms of how processors handle memory goes against this goal in some ways.

On the other hand, it's a bit of a "holy grail" to imagine a hardware stack which obviates the need for memory optimization, and really does treat loading from and storing to memory anywhere on the heap as the same. But I imagine that the interesting things which the M1 is doing with memory are helping a lot with the worst case performance, and maybe even average case performance, but they're probably not doing much for the best case.
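
To make the "don't hard-code the line size" point concrete, here is a minimal sketch in C of querying it at startup and falling back to 64 bytes. The sysctl name matches what people report further down the thread; the Linux sysconf constant and the fallback value are my assumptions:

  #include <stdio.h>
  #include <stddef.h>
  #if defined(__APPLE__)
  #include <sys/sysctl.h>
  #else
  #include <unistd.h>
  #endif

  /* Return the data cache line size in bytes, falling back to 64. */
  static size_t cache_line_size(void) {
  #if defined(__APPLE__)
      size_t line = 0;
      size_t len = sizeof(line);
      if (sysctlbyname("hw.cachelinesize", &line, &len, NULL, 0) == 0 && line > 0)
          return line;
  #elif defined(_SC_LEVEL1_DCACHE_LINESIZE)
      long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
      if (line > 0)
          return (size_t)line;
  #endif
      return 64; /* conservative default */
  }

  int main(void) {
      printf("cache line size: %zu bytes\n", cache_line_size());
      return 0;
  }
An ECS or allocator can then size its per-component chunks and padding from that value instead of a baked-in 64.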


That would make sense for LPDDR4 but it apparently claims to have a 128 byte cache line size and I'm not sure how to square that with 16 bit channel width.


Is the article saying that the M1 is slower than we would have expected in this case?

My understanding, based on the article, is that on a normal processor we would have expected arr[idx] + arr[idx+1] and arr[idx] to take the same amount of time.

But the M1 is so parallelized that it goes to grab both arr[idx] and arr[idx+1] separately. So we have to wait for both of those to return. Meanwhile, on a less parallelized processor, we would have done arr[idx] first and waited for it to return, and the processor would realize that it already had arr[idx+1] without having to do the second fetch.

Am I understanding this right?


>> My understanding, based on the article, is that a normal processor, we would have expected arr[idx] + arr[idx+1] and arr[idx] to take the same amount of time.

That depends. If the two accesses are on the same cache line, then yes. But since idx is random, that will sometimes not be the case. He never says how big array[] is in elements or what size each element is.

I thought DRAM also had the ability to stream out consecutive addresses. If so then it looks like Apple could be missing out here.

Then again, if his array fits in cache he's just measuring instruction counts. His random indexes need to cover that whole range too. There's not enough info to figure out what's going on.


> There's not enough info to figure out what's going on.

If you only look at the article this is true. However, the source code is freely available: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...


I ran the benchmark on my system

It's a 6 years old system, fastest times are in the 25ns range

- 2-wise+ is 5% slower than 2-wise

- 3-wise is 46% slower than 2-wise

- 3-wise is 39% slower than 2-wise+

on the M1

- 2-wise+ is 40% slower than 2-wise

- 3-wise is 46% slower than 2-wise

- 3-wise is 4% slower than 2-wise+


Ouch. This is on my $2500 i5 MacBook Pro from 2018.

  $ ./two_or_three
  N = 1000000000, 953.7 MB
  starting experiments.
  two  : 53.3 ns
  two+ : 60.1 ns
  three: 78.6 ns
  bogus 1375316400
So:

- 2+ is 12% slower than 2

- 3-wise is 47% slower than 2

- 3-wise is 30% slower than 2+

Ratios aside, that's an interesting speed leap when the article gets ~9 ns for 2-wise. Mind, the laptop had lots of applications running, I didn't clear it up to do a proper benchmark, but still.


Interesting, I ran it on my laptop (i7-7700HQ) with the following results:

- 2-wise+ is 19% slower than 2-wise

- 3-wise is 48% slower than 2-wise

- 3-wise is 25% slower than 2-wise+

However, as mentioned in the post the numbers can vary a lot, and I noticed a maximum run-to-run difference of 23ms on two-wise.


I tried it on my old (2009) 2.5GHz Phenom II X4 905e (GCC 10.2.1 -O3, 64 bit) and got results almost perfectly matching the conventional wisdom:

  two  : 97.4 ns
  two+  : 97.9 ns
  three: 145.8 ns


He's only got 3 million random[] numbers. Whether that's enough depends on the cache size. It also bothers me to read code like this where functions take parameters (like N) and never use them.


TLDR: he is using a random index with a big enough array


He mentioned it's a 1GB array, and the source code is available.


That array is indexed by an array of random numbers, and there are only 3M of them. That should be enough: even at 4 bytes per index, the 3M indexes just fit in the 12MB cache, and then there are accesses to the big array competing for that cache as well.


It's a little confusing because they're conflating the idea that you almost certainly read at least an entire word (and not a single byte) at a time with the separate idea that you could fetch multiple words concurrently.


Any cached memory access is going to read in the entire cache line -- 64 bytes on x86, apparently 128 on M1. This is true across most architectures which use caches; it isn't specific to M1 or ARM.


(As I learned from recent Rust concurrency changes) on newer Intel it usually fetches two cache lines, so effectively 128 bytes, while AMD usually fetches 64 bytes. Those are the sizes they use for "cache line padded" values (i.e. making sure to separate two atomics by the fetch size to avoid threads invalidating the cache back and forth too much).
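
A minimal C11 sketch of what that padding looks like (the 128-byte figure is the assumption discussed above, not a requirement of any particular chip):

  #include <stdalign.h>
  #include <stdatomic.h>

  /* Keep two frequently written atomics at least 128 bytes apart so that
     writers on different cores don't keep invalidating each other's lines
     ("false sharing"), even on parts that fetch cache lines in pairs. */
  struct queue_counters {
      alignas(128) atomic_ulong head;
      alignas(128) atomic_ulong tail;
  };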


To be clear here, it fetches two cache lines but it doesn't put the second in exclusive state until it's written to; the unit of granularity is still 64 bytes. In a scanning read mode you will see the benefit but you won't see the contention on writes. (The contention will come from subsequent reads on that cache line though.)


Yes, almost certainly more than the word will be read, but it varies by architecture. I would think that, almost by definition, no less than a word can be read, so I went with that in my explanation.


It’s a good introduction, but it’s a bit disappointing that it ends that way. I’d love to read more about what’s behind the figure and more technical info about how it might work.


This isn’t specific to the M1, but I talk about cache lines in my last QCon presentation (where I also suggested that a 128-byte cache line wasn’t far away):

https://www.infoq.com/presentations/microarchitecture-modern...

However, the speed benefits come from a much larger L1 cache and the fact that the RAM is in the same package, which will reduce latency; that accounts for most of the benefit here.

The instruction cache is also a lot bigger, and a fixed-size ISA has the advantage that execution can be much wider than on x86, but that’s unlikely to be of benefit here, other than perhaps slightly in terms of queuing up multiple outstanding loads.


This post says that the m1 has a 128 byte cache line size. So that time has arrived!

https://news.ycombinator.com/item?id=25660769


Yup, my presentation was back in March 2020 and the M1 came out later in the year — sooner than I was expecting, TBH; I thought that it was a couple of years out when I said it :-)


Thanks!


A lot of commenters here are saying that Apple’s advantage is that it can profile the real workloads and optimise for that.

Well that’s true and could very well be an advantage. An advantage in that they did it, not in that only they have access to it.

Intel and AMD can trivially profile real world workloads too.

Did they? I don’t know what Apple did, but the impression I get is that Intel certainly hasn’t.


Of course every CPU designer is using real-world workloads to guide design.


Ok, summary:

This article lays out three scenarios:

1) accessing two random elements

2) accessing 3 random elements

3) accessing two pairs of adjacent elements (same as (1) but also the elements after each random element)

It then does some trivial math to use the loaded data.

A naive model might only consider memory accesses and might assume accessing an adjacent element is free.

On the M1 core, this is not the case. While the naive model might expect cases 1 & 3 to cost the same and case 2 to cost 50% more, instead cases 2 & 3 are nearly the same (3 slightly faster) and case 2 is about 50% more expensive than 1.
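
For reference, a simplified sketch of the three access patterns (not Lemire's exact code, which is linked elsewhere in the thread; array is a ~1 GB int array and idx1..idx3 are pseudo-random indexes):

  /* case 1, "two":   two independent random loads per iteration */
  answer ^= array[idx1] ^ array[idx2];

  /* case 2, "three": three independent random loads per iteration */
  answer ^= array[idx1] ^ array[idx2] ^ array[idx3];

  /* case 3, "two+":  two random loads, each paired with its neighbour */
  answer ^= array[idx1] ^ array[idx1 + 1] ^ array[idx2] ^ array[idx2 + 1];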


> A naive model might only consider memory accesses and might assume accessing an adjacent element is free.

Really depends on the level of naivety and the definition of "free". It would be less insane to write that accessing an adjacent element has negligible overhead if the data must be loaded from RAM and there are some OOO bubbles in which to execute the adjacent loads. If some data are in cache, the "free adjacent load" claim immediately becomes less plausible. If the latency of a single load is already filled by OOO, adding another one will obviously have an impact. If the workload is highly regular you can get quite chaotic results when making even some trivial changes (even sometimes when aligning the .text differently!)

And the proposed microbenchmark is way too simplistic: it is possible that it saturates some units in some processors and completely different units in others...

Is the impact of an extra adjacent load from RAM likely to be negligible in real-world workloads? Absolutely. With precise characteristics depending on your exact model / current frequency / other memory pressure at the time, etc.


I don't really understand the comparison, because it seems like scenario 3 (2+) is doing more XORs and twice the accesses to the array over the same number of iterations.

We have to assume these are byte arrays, yes? Or at least some size that's smaller than the cache line. You would still pay for the extra unaligned fetches. I don't think this is a valid scenario at all, M1 or not.

Anyone want to run these tests on an Intel machine and let us know if the author's "naive model" holds there?



The point of the naive model is that you assume memory accesses dominate

That is, the math part is so trivial compared to the memory access that you could do a bunch of math and you would still only notice a change in the number of memory accesses.

Also it looks like the response to yours links their test and the naive model predicts correctly


I think 5% is a non-trivial difference, but alright, it's a much bigger difference on the M1.

I guess I still don't understand what's going on here.

Scenario 1 has two spatially close reads followed by two dependent random access reads.

Scenario 3 (2+) has two spatially close reads, and two pairs of dependent random access reads of two spatially close locations.

Why does it follow that this is caused by a change in memory access concurrency? The two required round trips should dominate both on the M1 and an Intel but for some reason the M1 performs worse than that. Why?

I can't help but feel the first snippet triggers some SIMD path while the 3rd snippet fails to.


I think the 5% can maybe be accounted for by the cache line (you mentioned this above, and I don't think the experiment does anything to prevent the issue)? If it's 1/16th chance of crossing the cache line, that maybe is about 5% of the time? I say that with pretty low confidence though

I think you raise a good question, though -- what really is going on here? Is this just a missed optimization compiling for the m1?

Or is it actually something fundamental about how reads happen with an m1? I'm definitely not knowledgeable enough to know how to answer this


Shouldn’t you choose the random numbers such that array[idx1] ^ array[idx1 + 1] are guaranteed to fall in the same cache line? Assuming that it has that. Right now some accesses cross the end of the cache line.


Technically you are correct, but it's expected to cross a cache line only 1 in 16 times (or however many ints there are in a cache line). There is an implicit assumption that this is infrequent enough that it shouldn't increase the average time too much, but that assumption should be tested.
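
One way to test it: force each pair onto a single line and see whether the gap changes. A hedged sketch, assuming a 64-byte line, 4-byte ints, and a 64-byte-aligned array:

  /* Round each random index down to a multiple of 16 ints (= 64 bytes), so
     array[idx] and array[idx + 1] are guaranteed to share a cache line. */
  idx &= ~(size_t)15;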


What is the cache line size and page size on the M1?

    sysconf(_SC_PAGESIZE); /* posix */
Can you get direct processor information like LEVEL1_ICACHE_ASSOC and LEVEL1_ICACHE_LINESIZE from the M1?


`getconf PAGESIZE` returns 16384 on the base M1 MacBook Air.

The L1 cache values aren't there. The macOS `getconf` doesn't support -a (listing all variables), so they may just be under a different name.

edit: see replies for `sysctl -a` output


Is it possibly exposed via sysctl, which does support a flag to list all variables?


From sysctl -a on my M1:

    hw.cachelinesize: 128
    hw.l1icachesize: 131072
    hw.l1dcachesize: 65536
    hw.l2cachesize: 4194304
EDIT: also, when run under Rosetta hw.cachelinesize is halved:

    hw.cachelinesize: 64
    hw.l1icachesize: 131072
    hw.l1dcachesize: 65536
    hw.l2cachesize: 4194304


Compared to the i9-9880H in my 16" MacBook Pro:

    hw.cachelinesize: 64
    hw.l1icachesize: 32768
    hw.l1dcachesize: 32768
    hw.l2cachesize: 262144
    hw.l3cachesize: 16777216
The M1 doubles the line size, doubles the L1 data cache (i.e. same number of lines), quadruples the L1 instruction cache (i.e. double the lines), and has a 16x larger L2 cache, but no L3 cache.


M1 cache lines are double what Intel, AMD, and other ARM microarchitectures commonly use. That's a significant difference.


sysctl on m1 contains the cache sizes for the little cores (since those are CPUs 0-3)

big cores (CPU4-7) have 192KB L1I and 128KB L1D.


What's the precision of these ns level measurements?


The answer to that is usually very context-dependent, and depends on what you're measuring. As long as you look at a histogram first and don't blindly calculate (say) the mean, it should be obvious.

Two examples( that are slightly bigger than this but the same principles apply):

If you benchmark a std::vector at insertion, you'll see a flat graph with n tall spikes spaced at ratios of its reallocation amount apart, and it scales very, very well. The measurements are clean.

If, however, you do the same for a linked list you get a linearly increasing graph, but it's absolutely all over the place because it doesn't play nice with the memory hierarchy. The standard deviation for a given value of n might be a hundred times worse than the vector's.


clock_gettime(CLOCK_REALTIME) on macOS provides nanosecond-level precision.


I seem to recall OSX didn't used to have clock_gettime, so it's news to me that it even exists -- I might have been away from OSX too long.

Is there any performance difference between that and mach_absolute_time() ?


It was added some years ago, and I believe mach_absolute_time is actually now implemented in terms of (the implementation of) clock_gettime. The documentation on mach_absolute_time now even says you should use clock_gettime_nsec_np(CLOCK_UPTIME_RAW) instead.

macOS also has clock constants for a monotonic clock that increases while sleeping (unlike CLOCK_UPTIME_RAW and mach_absolute_time).
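
A minimal sketch of timing a snippet with the call the docs now recommend (macOS-specific; error handling omitted):

  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  int main(void) {
      /* CLOCK_UPTIME_RAW matches mach_absolute_time(): monotonic,
         nanosecond units, does not advance while the machine sleeps. */
      uint64_t t0 = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
      /* ... code under test ... */
      uint64_t t1 = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
      printf("elapsed: %llu ns\n", (unsigned long long)(t1 - t0));
      return 0;
  }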


I was part of the team that really pushed the kernel team to add support for a monotonic clock that counts while sleeping (this had been a persistent ask before just not prioritized). We got it in for iOS 8 or 9. The dance you otherwise have to do is not only complicated in userspace on MacOS, it's expensive & full of footguns due to race conditions (& requires changing the clock basis for your entire app if I recall correctly).


Do you have any insight as to why libdispatch added support for this new clock internally (in the last year or two), but did not expose it in any public API? In C I can manually construct a dispatch_time_t that will make libdispatch use CLOCK_MONOTONIC_RAW¹, if I'm willing to make assumptions about the format of dispatch_time_t² (despite libdispatch warning that the internal format is subject to change³). And I can't even do this in Swift. It would be really useful to have this functionality, so I'm mystified as to why it's hidden.

¹Technically it uses mach_continuous_time() first if available (which appears to be equivalent to CLOCK_MONOTONIC_RAW), then clock_gettime(CLOCK_BOOTTIME, &ts) on Linux, then clock_gettime(CLOCK_MONOTONIC, &ts), then some other API for Windows.

²Conveniently enough the value that is equivalent to DISPATCH_TIME_NOW using the monotonic clock is just INT64_MIN, at least in the current encoding.

³Swift makes assumptions about the internal format of dispatch_time_t so I don't know if it actually can meaningfully change. Newer versions of Swift now use stdlib on the system, but any app built with a sufficiently old version of Swift still embeds its own copy of the stdlib. Granted, additions (like the monotonic clock) should be fine, as the Swift API does not actually bridge from dispatch_time_t so it only ends up representing times it has APIs to construct. Since it doesn't have APIs to construct monotonic times, they won't break it.


I haven't worked at Apple in almost 5 years so I can't speak to more recent developments unfortunately. libdispatch integration may be trickier & have implementation details not suitable for a public API.

Specifically if I recall correctly the internal representation of time only allowed for 2 formats (wallclock vs monotonic). I suspect adding other types of clocks could pose back/forward compat challenges. However, this is a wild shot in the dark & I didn't really look into the internals of libdispatch. Maybe ask on their github page?

Generally upgrading a private API to public is a lot of work & goes through a lot of review (having monitored that mailing list & added 1 API during my time there). So if there's a private API probably some team at Apple needed it to deliver a feature but the maintainers were not confident the specific solution they chose generalized well & either requires some work or something else.


The way dispatch_time_t currently works, having the high bit (bit 63) clear means it's based on uptime (i.e. mach_absolute_time()), having bits 63 and 62 both set means it's based on the wall clock, and the addition here is bit 63 set with bit 62 clear, meaning it's based on continuous time. I don't think they have the room to add a 4th clock though.


Not yet, at least :)

  _mach_absolute_time:
  00000000000012ec        pushq   %rbp
  00000000000012ed        movq    %rsp, %rbp
  00000000000012f0        movabsq $0x7fffffe00050, %rsi   ## imm = 0x7FFFFFE00050
  00000000000012fa        movl    0x18(%rsi), %r8d
  00000000000012fe        testl   %r8d, %r8d
  0000000000001301        je      0x12fa
  0000000000001303        lfence
  0000000000001306        rdtsc
  0000000000001308        lfence
  000000000000130b        shlq    $0x20, %rdx
  000000000000130f        orq     %rdx, %rax
  0000000000001312        movl    0xc(%rsi), %ecx
  0000000000001315        andl    $0x1f, %ecx
  0000000000001318        subq    (%rsi), %rax
  000000000000131b        shlq    %cl, %rax
  000000000000131e        movl    0x8(%rsi), %ecx
  0000000000001321        mulq    %rcx
  0000000000001324        shrdq   $0x20, %rdx, %rax
  0000000000001329        addq    0x10(%rsi), %rax
  000000000000132d        cmpl    0x18(%rsi), %r8d
  0000000000001331        jne     0x12fa
  0000000000001333        popq    %rbp
  0000000000001334        retq


That may be the result of inlining clock_gettime, though that would imply a pretty different implementation from the one I am familiar with.

AFAIR on x86 a locked rdtsc is ~20 cycles. So to answer the GP's question, it has a precision in the few-nanoseconds range. Accuracy is a different question, i.e. compare numbers from the same die, but be a little more suspicious across dies.

No clue how this is implemented on the M1, or if the M1 has the same modern tsc guarantees that x86 has grown over the last few generations of chips.


I was actually misremembering a bit.

Sufficiently old versions of mach_absolute_time used a function called clock_get_time() on i386 (if the COMM_PAGE_VERSION was not 1). This changed in macOS 10.5 to a tiny bit of assembly that just read from _COMM_PAGE_NANOTIME on i386/x86_64/ppc (the arm implementation(!!) triggers a software interrupt). The i386/x86_64/ppc definitions were also copied into xnu.

For the next few years it kind of bounced back and forth between libc and xnu, and the routine was complicated by adding timebase conversion as needed. And at some point arm support was added back (it vanished when it first went to xnu), but this time using the commpage if possible.

As for M1, I assume it's using the arm64 routine in xnu, which can be found at https://opensource.apple.com/source/xnu/xnu-7195.50.7.100.1/....

As for clock_gettime_nsec_np(), at least as of Big Sur, it's in libc¹ instead of xnu and defers to mach_continuous_time()/mach_continuous_approximate_time()/mach_absolute_time()/mach_approximate_time() for the CLOCK_*_RAW[_*] clocks. And clock_gettime() for those clocks is implemented in terms of clock_gettime_nsec_np().

¹https://opensource.apple.com/source/Libc/Libc-1439.40.11/gen...


Yeah, clock_gettime is somewhat more complicated than this. If anything, it might have an inlined mach_absolute_time in it…


It's new in macOS Sierra. I believe mach_absolute_time is slightly faster but not by much–both just read the commpage these days to save on a syscall.


Still, single-digit nanosecond precision sounds marginal. 1 ns = 2 clock cycles at 2 GHz.


For people who know more about this stuff than me: are these sorts optimizations only possible because Apple controls the whole stack and can make the hardware & OS/software perfectly match up with one another or is this something that Intel can do but doesn't for some reasons (tradeoffs)?


> are these sorts optimizations only possible because Apple controls the whole stack and can make the hardware & OS/software perfectly match up with one another or is this something that Intel can do but doesn't for some reasons (tradeoffs)?

Interestingly it's the other way around. Apple is using TSMC's 5nm process (they don't have their own fabs), which is better than Intel's in-house fabs, so it's Intel's vertical integration which is hurting them compared to the non-vertically integrated Apple.

Also, the answer to "is this only possible because of vertical integration" is always no. Intel and Microsoft regularly coordinate to make hardware and software work together. Intel is one of the largest contributors to the Linux kernel, even though they don't "own" it. Two companies coordinating with one another can do anything they could do as an individual company.

Sometimes the efficiency of this is lower because there are communication barriers and isn't a single chain of command. But sometimes it's higher because you don't have internal politics screwing everything up when the designers would be happy with outsourcing to TSMC because they have a competitive advantage, but the common CEO knows that would enrich a competitor and trash their internal investment in their own fabs, and forces the decision that leads to less competitive products.


Integration is a petri dish. It can speed up both growth and decay, and it is indifferent to which one wins.


Not quite vertical integration, but TSMC's 5nm fabs are effectively Apple's fabs (exclusively, for a period of time).

During the iPod era, Toshiba's 1.8-inch HDD production was exclusively Apple's for music players; similarly, Apple gets all of TSMC's 5nm output for a period of time.


No, there's no cross-stack optimization here. The M1 gives very high performance for all code.


I think this gets lost in the fray between the "omg this is magic" crowd and the Apple haters. The M1 is a very good chip. Apple has hired an amazing team and resourced them well. But from a pure hardware perspective, the M1 is quite evolutionary. However, the whole Apple Silicon experience is revolutionary and magical due to the tight software pairing.

Both teams deserve huge praise for the tight coordination and unreal execution.


I think this is part of the reason where there are so many people trying to find reasons to downplay it: humans love the idea of “one weird trick” which makes a huge difference and we sometimes find those in tech but rarely for mature fields like CPU design. For many people, this is unsatisfying like asking an athlete their secret, and getting a response like “eat well, train a lot, don't give up” with nary a shortcut in sight.


Sort of; Intel and AMD are stuck with the variable-width instruction ISA that exists due to historical evolution. To do something different you need a new ISA.

Intel tried this with Itanium a while back and failed, because it is difficult to get software developers to target a new ISA and provide compilers and compiled code for everything unless you use a translation layer.

Apple is one step ahead here because their compilers already supported the ARM ISA (because iPhones use it) and they had both the OS and apps ready to go from day one of availability.

They also had translation technology that would allow mutating x86_64 code to ARM64 code so that old apps would (on the whole) run acceptably fast on the new chip.

To do the latter properly, Apple had to create a special mode to run the ARM chip with total store ordering for memory writes, which is not standard on ARM. (It would be a lot slower if they didn’t have that when running Rosetta-translated code.)

Only with both the OS being available and the OS influencing the ARM tweaks (e.g. TSO) could they pull it off.

They are also in the position of building the hardware that uses those chips, so they can mass-produce - and in fact replace - existing hardware.

Each of these things could be done in isolation by Intel/Windows/Apps but it would be difficult to do all three.

Even getting JavaScript maths in a special instruction was difficult enough on Intel, and that was something of benefit to any browser.

My guess is you’ll see Intel and AMD offering Arm chips in the near future, as both AWS (graviton) and Apple have shown the way to a new ARM future.


Intel only failed because AMD exists and had a license to produce x86-based CPUs.

Without AMD everyone would eventually be dragged into adopting Itanium.


There are at least two M1 optimisations targeting Apple's software stack:

1. Fast uncontended atomics. This speeds up reference counting, which is used heavily by Objective-C code (and Swift). The increase is massive compared to Intel. (A rough sketch of the operation involved follows below.)

2. A guaranteed memory-ordering (TSO) mode. This allows faster Arm code to be produced by Rosetta when emulating x86. Without it the emulation overhead would be much bigger (similar to what Microsoft is experiencing).
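
As promised above for item 1, here is a rough sketch of the kind of operation being sped up: the uncontended fast path of retain/release is essentially an atomic increment/decrement of a refcount word (illustrative only; Apple's actual runtime does considerably more):

  #include <stdatomic.h>
  #include <stdlib.h>

  typedef struct {
      atomic_long refcount;
      /* object payload follows */
  } obj_t;

  void retain(obj_t *o) {
      atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
  }

  void release(obj_t *o) {
      /* If the previous count was 1, this caller was the last owner. */
      if (atomic_fetch_sub_explicit(&o->refcount, 1, memory_order_acq_rel) == 1)
          free(o);
  }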


>our naive random-memory model

Doesn't everyone use the (I believe) still valid concepts of latency and bandwidth?


How do latency and bandwidth relate to the cost model for the code in the benchmark?

When creating the model discussed in the post, we're using it to try to make a static prediction about how the code will execute.

Note that the goal of the post is not to merely measure the memory access performance, it's to understand the specific microarchitecture and how it might deliver the benefits that we see in benchmarks.


Depends on context.

For example, what is the bandwidth and latency when you ask for the value at the same memory address in an infinite loop? And how does that compare to the latency and bandwidth of a memory module you buy on NewEgg?


L1 BW.

When people use BW in their performance models, they don’t use only 1 bandwidth, but whatever combination of bandwidths makes sense for the _memory access pattern_.

So if you are always accessing the same word, the first access runs at DRAM BW, and subsequent ones at L1 BW, and any meaningful performance model will take that into account.


The concepts are still broadly valid, the naivety being referred to is the assumption that two non adjacent memory reads will be twice as slow as one memory read or two adjacent reads.


Is this per core or shared between cores?


Per core I think, emphasis is mine.

> It looks like a single core has about 28 levels of memory parallelism, and possibly more.


I was wondering if this might be a shared resource though, since it doesn't seem they tested with multiple threads.


I’m super curious if it’s true that my 8GB M1 will die quickly because of the aggressive swaps. I guess time will tell.


FWIW, I have a 2010 MBA which was _heavily_ used for years as a primary development system. The SSD only started to show signs of degraded performance last year and that wasn't massive. I would be quite surprised if the technology has become worse.


Apple M1 chip adopts Warehouse/Workshop Model

- Warehouse: unified memory

- Workshop: CPU, GPU and other cores

- Product (material): information, data

there’s also a new unified memory architecture that lets the CPU, GPU, and other cores exchange information between one another, and with unified memory, the CPU and GPU can access memory simultaneously rather than copying data between one area and another. Accessing the same pool of memory without the need for copying speeds up information exchange for faster overall performance.

reference:

1. Developer Delves Into Reasons Why Apple’s M1 Chip is So Fast. https://www.macrumors.com/2020/11/30/m1-chip-speed-explanati...

2. The Grand Unified Programming Theory: The Pure Function Pipeline Data Flow with Warehouse/Workshop Model https://github.com/linpengcheng/PurefunctionPipelineDataflow



