My M1 MacBook is significantly "faster" than my previous Intel 16", even though per-core performance is roughly similar in CPU benchmarks (small advantage to M1).
I saw a previous HN comment about this being due to memory bandwidth and cache latency, but I can't seem to substantiate that comment.
M1 is a bigger core than any other core I know of. Just larger reorder buffers, wider execution, etc. etc.
Nominally speaking, it's inefficient at that point. You can pretty much fit 2 x86 cores inside the space of one M1 core, and each x86 core supports two threads (would you rather have 4x x86 threads, or 1x M1 core?). I'm intrigued that people continue to find benefits from such a large core (even without SMT / Hyperthreading).
--------
But yes, M1 has huge L1 cache, huge reorder buffers, and extremely wide execution. I'd expect it to win clock-for-clock vs any other core in the market.
But I'm not fully convinced that it's the best design / tradeoff. Intel's E-core + P-core approach suggests that modern CPU cores may have become overly big.
Apple is designing for more IPC and lower clocks, whereas Intel and AMD basically need to target their designs towards a 5GHz target clock rate to remain competitive in the marketplace. A lower clocked, but much wider core can be much more power efficient, and Apple almost never provided sufficient cooling to run Intel processors at sustained top speeds anyway. A larger die increases cost, but vertical integration means you don't have to find margin on each component, just the assembled product.
E-cores work the problem the other way: if you narrow the core, you get about half the work done per core, but in about a fourth the space. That's a win for throughput, but not for latency/interactivity. Drastically different per-core performance makes OS schedulers work harder, though.
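A back-of-the-envelope sketch of that tradeoff, taking the "half the work in a fourth the space" numbers above at face value (they're rough ratios, not measurements of any real core):

    /* P-core vs E-core throughput/latency math, using the rough
     * assumptions from the comment above: an E-core does ~0.5x the
     * work of a P-core in ~0.25x the area. Illustrative only. */
    #include <stdio.h>

    int main(void) {
        const double p_perf = 1.0, p_area = 1.0;   /* P-core = baseline    */
        const double e_perf = 0.5, e_area = 0.25;  /* assumed E-core ratio */

        /* In the area of one P-core you can fit four E-cores. */
        double e_per_p_area = p_area / e_area;
        double e_throughput = e_per_p_area * e_perf;
        printf("Throughput in one P-core's area: P=%.1f, 4xE=%.1f\n",
               p_perf, e_throughput);              /* 1.0 vs 2.0 */

        /* But a single hard-to-split task runs on ONE core, so its
         * latency is set by per-core speed, not aggregate throughput. */
        printf("Single-task latency (lower=better): P=%.1f, E=%.1f\n",
               1.0 / p_perf, 1.0 / e_perf);        /* 1.0 vs 2.0 */
        return 0;
    }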
Why can't you use E-cores for better latency and interactivity? Unless you're doing heavy-duty VR or video gaming, you don't ever need more than a single efficiency thread to run your UX. Just need to avoid doing any real workload on the UI thread.
A P-core is going to give you better latency than an E-core, because it does the work twice as fast. Yes, if you can split the work, you can get it done faster with 4 Es than one P, but lots of interactive work is hard to split. You can't meaningfully handle a network interrupt with more than one core; you can split your network interface into multiple queues and service each queue with a separate core, but parallelizing within the queue only brings sadness and cache coherency latency.
You might get some interactivity benefits depending on your workload and core configuration... One P vs. four Es is an easy choice, but between 4 P-cores and 16 E-cores, the 4 P-cores are probably going to feel more interactive because UI latency is lower, even though throughput is lower.
Area-wise I'm pretty sure Apple's P-cores are in line with Intel and AMD. Including L2 cache and other shared logic, M1's P-core was 4mm^2 per core, M2's P-core is about 5mm^2 [1], Zen 3 and 4 were about 4mm^2 [2], and Intel is at 7mm^2 for Golden Cove [3].
With M1/M2 in the ~5mm^2 area or so, I'd still argue that Zen3 cores are 1/2 the size of M1/M2 cores, transistor-for-transistor.
It's a fat core. Maybe it will work thanks to how advanced processes are getting. Maybe this will encourage others (ie: Intel) to experiment with larger cores as well. It's hard for me to say, but I do welcome the benchmarks.
Well yeah, area budgets per-core have never shrunk linearly with transistor density; it serves a wider range of use cases to balance beefing up cores with adding more of them. Like, Intel 7 is 20x denser than Intel 32nm, but a Sandy Bridge core is less than 3x larger than Golden Cove.
Also the L2 cache and shared logic make up a larger percentage of M1/M2 per-core at >45%; that's only 20% of the per-core area for Zen 3... if you include LLC that doubles the per-core Zen3 area but only adds 30% for M2...
Point is that M1/M2 and Zen 4 show that the per-core area budget within the same process is now similar across Apple and AMD, not an order of magnitude different. It used to be an order of magnitude different, like back on 32nm Apple A6 was about 8 mm^2/core and Sandy Bridge was 18.5 mm^2/core, or 30 mm^2 including LLC that A6 didn't have.
> Also the L2 cache and shared logic make up a larger percentage of M1/M2 per-core at >45%;
AMD's L2 cache is 1MB on Zen4, and L3 cache is like 4MB per core. (And AMD's L2 cache compares to Apple's L1 cache, while AMD's L3 is the LLC, comparable to Apple's L2 cache.)
AMD's L2 cache is per-core. AMD's L3 cache is a decentralized last-level cache: 32MB for 8 cores (equivalent to Apple's 16MB for 4 cores). Except... AMD's cores run 2 threads each while Apple's only run 1.
I think my overall point is clear: Apple's cores are abnormally large. AMD / Intel have smaller cores (and larger caches). This is _despite_ AMD/Intel shoving 2 threads per core through SMT or Hyperthreading.
Remember that only one thread gets that HUGE core on Apple. It's very, very unusual. Even POWER10 (which has oversized cores) allows 8x SMT (8 threads per core) to compensate for its oversized nature.
I mean, I was trying to be fair by counting L2 and associated logic. If you only count down to L1 (and no, Zen4's 15-cycle latency L2 is not comparable to Apple's L1 that achieves a 3-4 cycle latency; Zen4's combined L2+L3 averages close to Apple's 18-cycle L2), then a Zen4 core only takes 72% of that 3.84 mm^2, or 2.76 mm^2. M2's P-core is estimated at 2.756 mm^2 if scaled to match M1's density, or 2.519 mm^2 if you accept Apple marketing's scaling.
And the M1's P-core was 2.281 mm^2.
Hyperthreading barely costs any area, but anyway I guess you can say that thanks to that plus the clock speed advantage, Zen4 gets like 15% more performance per mm^2 than M2 P-cores? That's not a massive improvement by any measurement.
(as an aside: if annotations of Zen4 I've seen are correct, its branch predictor has almost as much SRAM as the entire µop+L1i+L1d caches. Which... actually I can completely believe of TAGE)
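Re: the latency figures above, a pointer-chase is the standard way to measure them yourself. A minimal sketch (buffer sizes and hop count are arbitrary; compile with -O2, then divide ns/load by your clock period to get cycles):

    /* Minimal pointer-chase latency probe: a random cyclic permutation
     * defeats the prefetcher, so each load's latency is fully exposed.
     * Run at ~16KB to see L1, 256KB-1MB for L2, tens of MB for DRAM. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double chase(size_t bytes, size_t hops) {
        size_t n = bytes / sizeof(void *);
        void **buf = malloc(n * sizeof(void *));
        size_t *idx = malloc(n * sizeof(size_t));
        for (size_t i = 0; i < n; i++) idx[i] = i;
        for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < n; i++)            /* one big cycle */
            buf[idx[i]] = &buf[idx[(i + 1) % n]];

        void **p = &buf[idx[0]];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < hops; i++)
            p = (void **)*p;                      /* serialized loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        if (p == NULL) puts("");                  /* keep p live */
        free(buf); free(idx);
        return ns / (double)hops;
    }

    int main(void) {
        size_t sizes[] = {16 << 10, 256 << 10, 4 << 20, 64 << 20};
        for (int i = 0; i < 4; i++)
            printf("%8zu KB: %.2f ns/load\n", sizes[i] >> 10,
                   chase(sizes[i], 10 * 1000 * 1000));
        return 0;
    }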
L2 on Zen3/Zen4 is __PER CORE__. That's private memory, inside each core, for its own operations.
If you cut out L2 cache, the Zen3 / Zen4 core shrinks significantly.
As per the Zen4 article you had:
> The L2 cache in the cores has increased from 512 kB to 1MB, which also increases the occupied area a bit, but the cores are still smaller overall than Zen 3 on 7nm thanks to the 5nm process. The area including L2 cache is 3.84mm²
The 3.84mm^2 figure _INCLUDES_ 1MB of L2 cache. If you wanna cut that out, doing so will damage your own argument, as the Zen4 core will shrink rather dramatically. (Especially with that "unshrinkable SRAM" argument you're trying to make).
-----------
Look, I don't even know where you're going with this. It shouldn't be a surprise to anybody that an 8-wide M2 core with like an 800-entry reorder buffer and a 600-entry register file will be bigger than a 6-wide AMD Zen4 core with like a 400-entry reorder buffer and a 300-entry register file.
M2 was designed to be big, fat, and wide in execution. That's just how it works. And it's a very interesting (arguably brilliant) tradeoff. But if you look at the damn chip, it's just bigger. That's what happens when you add more stuff to a core: the core gets larger.
AMD on the other hand, is narrower (especially on a per-thread basis: 2-threads fit on this smaller core), and instead spends way more transistors on L2 cache. Maybe _YOU_ don't like the tradeoff (Sure, I agree that AMD's L2 cache is 15 cycles latency), but maybe throughput is more important and you're overly focused on unimportant / hypothetical latency issues (the entire L2 cache can be accessed at full throughput IIRC).
At the end of the day, we gotta get the devices and then benchmark them with real programs to really see what the sum of all these tradeoffs is. But I don't think there's much argument to be had here that the M1/M2 Apple cores are just bigger. I mean... we know the buffer sizes. We all know Apple's buffers are just bigger.
Look at this. As I stated before, the biggest "penalty" to the AMD Zen4 core is the uop cache (which is unnecessary in the Apple chip). You can just... look at the damn die shot.
If you want to argue about legitimate space-saving ability of ARM systems, focus on _THAT_ part of the chip. You're talking about all sorts of things that aren't actually helping your side of the argument.
> But I don't think there's much argument to be had here that the M1/M2 Apple cores are just bigger
> You pretty much can make 2 cores fit inside of the M1 core
My entire point has simply been debunking this. I'm pointing out that Apple, Intel, and AMD have similarly large area budgets for their big cores. Like, looking at actual chips produced on the same TSMC processes, you cannot fit two Zen4 cores inside of the space taken by one M1 or M2 P-core. You cannot fit two Zen2 or Zen3 cores within the space taken by one A12Z P-core. All of them have a somewhat similar per-core area budget, with difference in L2/LLC cache tradeoffs being the biggest differentiator in area.
And yes, even outside of cache they make different tradeoffs with what they spend the area on. Zen4 spends area on 512b registers and 256b ALUs, and clocking past 5GHz. Apple spends it on scalar resources and deep reordering. I'm not arguing that one tradeoff is universally better than the other, just that they end up similarly big.
Since you brought up cache throughput, Zen4 does 32B/cycle between L1 and L2 [1]. Anandtech measured M1's L2 cache throughput at about 440 GB/s across the 4 P-cores [2], which works out to 34B/cycle/core. Which sure sounds like the same per-core throughput to me.
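The arithmetic, for anyone checking (the ~3.2GHz all-core P-cluster clock is my assumption, not an Anandtech figure):

    /* Reproducing the bytes/cycle/core figure above. */
    #include <stdio.h>

    int main(void) {
        double bw_bytes_s = 440e9;  /* Anandtech: L2 bandwidth, 4 P-cores */
        int cores = 4;
        double clock_hz = 3.2e9;    /* assumed all-core P-cluster clock   */
        printf("%.1f bytes/cycle/core\n", bw_bytes_s / cores / clock_hz);
        /* -> ~34.4, in line with Zen4's 32B/cycle L1<->L2 path */
        return 0;
    }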
> You can just... look at the damn die shot
I... did? That's how I estimated L2 cache and tags at 28% of Zen4's 3.84mm^2. Do you believe that is incorrect, or that the M1 and M2 P-core areas of 2.281-2.756 mm^2 quoted by Semianalysis are incorrect?
AMD shoves 512kB of L2 per core in Zen3, and 1MB of L2 per core in Zen4.
I'm pretty sure Zen3 has more SRAM (aka: L2 cache) than the M1/M2 (128kB + 192kB is a LOT of L1 cache, but it's still less SRAM than what AMD is stuffing into its cores).
Even with all the extra register files + ROB (also SRAM), the M1 just ain't getting close to the 512kB L2 alone (plus all the L1 I$ and L1 D$, and uop cache, and ROB and register files and 256-bit AVX registers on the Zen3).
If anything, bringing up the Apple L1 cache vs AMD L1/L2 per-core caches just emphasizes how big Apple's logic units are in comparison.
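A rough byte tally behind that claim (the Zen3 uop cache size in bytes is my own estimate; AMD specs it in uops, not bytes):

    /* Per-core SRAM comparison. Cache capacities are public spec-sheet
     * numbers; the uop cache byte size is a guess. ROB/register files
     * add more on both sides but are small by comparison. */
    #include <stdio.h>

    int main(void) {
        /* Apple M1 P-core (Firestorm) */
        int m1_l1i = 192, m1_l1d = 128;                  /* kB */
        /* AMD Zen3 core */
        int zen3_l1i = 32, zen3_l1d = 32, zen3_l2 = 512; /* kB */
        int zen3_uop = 32;  /* ESTIMATE: ~4K uops x ~8B, not official */

        printf("M1 P-core L1 total:       %d kB\n",
               m1_l1i + m1_l1d);                          /* 320 */
        printf("Zen3 L1+L2+uop SRAM:      %d kB\n",
               zen3_l1i + zen3_l1d + zen3_l2 + zen3_uop); /* 608 */
        return 0;
    }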
Interesting final point. Makes me think that a solution to the apparent crisis around Moore's law may be to break up monolithic super-chips into smaller co-processors. This would help alleviate the bottleneck of waste heat management that apparently these mega-chips seem to be getting throttled by. This is already being done for other reasons, but mitigating the volume-to-surface-area ratio extremes of large chips may have second-order benefits for computing performance.
> This would help alleviate the bottleneck of waste heat management that apparently these mega-chips seem to be getting throttled by.
The opposite.
A lot of these chips are being designed as "dark silicon", with extremely specific hardware that is rarely used.
When that silicon is dark, it forms a heatsink that can absorb excess heat from all other parts of the chip and helps regulate the temperature.
I expect more and more extremely specific circuitry to be added to future chips, so that more and more silicon can be "dark" and serve as a place for heat to go during the microseconds of execution.
Agreed. Most cores on most consumer CPUs are idle, or close to idle, much or most of the time. The big-little architecture is built around the idea of large cores idling / shut down most of the time, and only waking up to serve a peak load, e.g. when a new web page is being rendered.
I can even imagine spacing functional nodes wider on the silicon, where size and latency limitations allow, to give them more heat-dissipating area.
> The big-little architecture is built around the idea of large cores idling / shut down most of the time, and only waking up to serve a peak load
This would suggest placing E cores between (or even inside) P cores, every cluster of them maybe on a small chiplet.
I wonder if the idea could go as far as having efficient and performant execution units in the same processor, with the reorder buffer issuing instructions to different units according to how far down the instruction stream the result is needed.
At first I thought it would be the close integration of memory, but it seems the consistent instruction size makes the reorder buffers a lot simpler to implement than for x86, so you can get a much larger aarch64 reorder buffer for the same area as an x86 (amd64) one.
A ROB / Reorder Buffer is just RAM, from my understanding. Since writes are all out of order, the ROB holds the "future values" of things for sections of code that have already executed (due to out-of-order scheduling) and are waiting for "earlier" code to execute.
Ex: If line #105 has been executed out of order, and knows that "MemoryLocationX = 100" happens there... the ROB holds that value until lines #90 through #104 are done executing.
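A toy sketch of that "ROB is just RAM holding future values" idea. This is nothing like a real core's implementation, just the retire-in-order behavior:

    /* Results complete out of order, but only become architecturally
     * visible in order, when everything older has also completed. */
    #include <stdio.h>
    #include <stdbool.h>

    #define ROB_SIZE 8

    struct rob_entry {
        bool done;   /* has this instruction finished executing?   */
        int  dest;   /* "memory location" it will write             */
        int  value;  /* the future value, parked until retirement   */
    };

    static struct rob_entry rob[ROB_SIZE];
    static int head = 0;     /* oldest un-retired instruction */

    /* An instruction finishes out of order: its result sits in the ROB. */
    void complete(int slot, int dest, int value) {
        rob[slot] = (struct rob_entry){ .done = true, .dest = dest,
                                        .value = value };
    }

    /* Retirement walks in order from the head and stops at the first
     * unfinished instruction; younger completed results just wait. */
    void retire(int memory[]) {
        while (rob[head].done) {
            memory[rob[head].dest] = rob[head].value;  /* now visible */
            rob[head].done = false;
            head = (head + 1) % ROB_SIZE;
        }
    }

    int main(void) {
        int memory[16] = {0};
        complete(2, 5, 100);  /* "line 105" finished early...           */
        retire(memory);       /* ...but can't retire: slots 0-1 pending */
        printf("mem[5] after early completion: %d\n", memory[5]); /* 0   */
        complete(0, 1, 7);
        complete(1, 3, 9);
        retire(memory);       /* now slots 0,1,2 all retire in order    */
        printf("mem[5] after older ops done:   %d\n", memory[5]); /* 100 */
        return 0;
    }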
-----------------
Consistent instruction size would decrease:
* uop cache (AMD / Intel require a uop cache, where "decoded" instructions stay). This is more optional for ARM... and I'd personally expect this to be the "biggest cost" of x86 instructions.
* Decoder (due to the complexity of x86 instructions, the decoder on those chips is probably bigger).
The ROB / Reorder buffer is "after the decoder", probably just interacting with uops directly. I doubt that instruction set size-or-complexity has anything to do with ROB sizes.
Not really. You can always cut power consumption by running them at lower clocks, among other tricks. As far as I'm aware, AMD EPYC continues to hold the performance/watt crown in practical server applications (servers also cut down on GHz to keep things power-efficient; better to have many cores at lower clocks to win power-efficiency races).
The interesting thing to me is die-area. Because that's what determines how many cores you get per chip.
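For anyone wondering why downclocking works so well: dynamic power scales roughly as C·V²·f, and voltage has to rise with frequency, so power grows roughly with the cube of clock speed while performance grows only linearly. A sketch using that textbook rule of thumb (the cube exponent is an approximation, not a measurement of any chip):

    /* P ~ C * V^2 * f, with V scaling roughly with f, gives P ~ f^3. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        for (double f = 0.6; f <= 1.01; f += 0.2) {
            double power = pow(f, 3.0);     /* assumed P ~ f^3 */
            printf("clock %3.0f%%: perf %3.0f%%, power %3.0f%%, "
                   "perf/W %.2fx\n",
                   f * 100, f * 100, power * 100, f / power);
        }
        return 0;   /* 60% clock -> ~22% power -> ~2.8x perf/W */
    }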
> Not really. You can always cut power consumption by running them at lower clocks
But that ruins the original argument... Not that any amount of downclocking can let Intel or AMD get close on the perf/power graph. Besides, heat is the main bottleneck on data center density and operational cost.
This is the denialist argument I keep seeing where people want to have their cake and eat it (or perpetually pin their hopes on Zen+1 that hasn't actually shipped).
If Intel could match M1/M2 power/perf they'd do it. But they can't. They can win the perf crown by absolutely burning power or they can get massively trounced to kinda get close on power. Zen is better but still has the same fundamental trade off.
Instead of saying M1/M2 are nothing special I'm super excited to see actual competition in the CPU space for the first time in a long time - competition that has proven a lot of conventional wisdom to be bunk.
??? My argument is, look at server power/performance. AMD EPYC takes the crown. 80-core ARM does not.
Maybe M1 will win, but we've at best got like 8 cores right now. As I stated earlier: M1 cores are huge. I'm not convinced they're better yet, but if Apple wants to make a 32-core M1 or M2 and compare it to an AMD EPYC 64x2 computer, that's when I'll start looking. We will see what they can do moving forward, but I don't expect that this M1 core can scale to a manycore size like EPYC (or Xeon, to a lesser extent).
Benchmarks are always crap, but they're the best we've got. Benchmarks, for now, show that Dual EPYC still is your best computer. At least for the server-scale that Supermicro operates at.
-------
ARM themselves are keen on this. ARM has V, N, and E cores moving forward because they're hedging their bets. This Supermicro system is probably an N2 or N1 system? (I haven't looked into it much.)
No one else in the world is making cores as large as Apple's M1. It's an aberration, abnormally huge. If Apple fans find it useful then cool, but there are other workloads out here. I'm not fully convinced that such a large core like the M1's is the best design.
Very weird measure of efficiency. Perf/(Max TDP) doesn't really measure efficiency. Running a workload you care about and measuring power usage at the wall would be a MUCH better metric.
The Intel chip might well spend most of its time at Max TDP and throttling, while the M1 might only hit its maximum under unusual circumstances (like maxing out the fast cores, slow cores, matrix multiply (which is outside the core), memory controller, and iGPU simultaneously). There's not really any way to tell from the posted benchmark.
Wonder if the i7-1250 runs in any laptops without a fan.
Also, TDP is a spec point with a lot of BS. Intel's TDP is widely mocked for being far under the actual maximum power draw of the chip. I don't know how Apple works out their TDP, but it's almost certainly different from Intel's. And even with accurate numbers it's still not a great measure, because cores are at their least efficient when maxed out (like desktop CPUs usually are and laptop CPUs usually aren't). So it's a comparison that's unfair to Intel; you should really compare the real power consumption of an Intel CPU clocked to get equivalent performance to an M1. As it stands, the data is basically useless as a comparison.
But is it real competition at the μarch level or just a more advanced fabrication process being used? We have yet to see x86 chips on the same process as the Apple M1 or M2.
>Even the E-cores draw more power than the M1/M2 cores.
To be clear, this is also a product of the fabrication process and node size. Intel 12/13th gen are on Intel 7 [1] - which is fairly old at this point, while M1s are fabricated on a 5nm process [2]. Smaller lithography size isn't the whole story, but it certainly does make a difference.
I just really appreciate how quiet it runs. I basically never hear the fan, and I remember having to pin the fan in my old MacBook to max to force it to adequately cool the machine.
My personal macOS "machine" is a VM running on a supermicro threadripper pro board (along with a Linux desktop VM + random LXC/VMs). The macOS VM is significantly faster than my M1 Pro work system. I should run some benchmarks.
My M1 Air is only a few minutes behind my old i7-7700K in HandBrake, encoding a 4K 10-bit HDR movie into a 1080p version. I still have a 2017 16-inch MBP that throttles horribly on any major task.
I'm going to run some personal benchmarks on all 3 machines plus my new Ryzen 7950X. It still amazes me that the fanless M1 Air is nearly as fast as my old desktop and laptop.
Not to cause any arguments or anything, but comparing an Apple Intel laptop to their new ARM units isn't a great comparison, because I swear Apple started thermally gimping them several years before releasing the M1s to make them look even better.
Seriously. The thermals on the last couple of Intel Apple laptops are atrocious. And I mean from a design standpoint. With shitty coolers and fan arrangements, etc. It's kinda wild.
Perhaps the simplest explanation is they just wanted to optimize for size and weight as much as possible, and the copper/aluminum budget is a natural target.
Example 2: https://benchmark.clickhouse.com/hardware/ covers various servers; while AArch64 servers place well, the top results are from EPYC servers. Newer EPYC servers have 12-channel DDR5 memory - sounds like heaven for ClickHouse :)
> I saw a previous HN comment about this being due to memory bandwidth and cache latency, but I can't seem to substantiate that comment.
I don't know if Apple has released any stats, but it makes a certain amount of sense. The technical reason to put memory and the CPU/GPU on a single package is to decrease memory latency (and possibly boost operating frequency). There's probably also the option to have a really wide memory bus.
Of course, a nice side benefit for Apple is that people can no longer buy 3rd-party RAM at market rates.
That being said...
> My M1 MacBook is significantly "faster" than my previous Intel 16", even though per-core performance is roughly similar in CPU benchmarks (small advantage to M1).
The thing about benchmarks is, they're sometimes particular to a specific application, and even data sets. So M1 can be largely tied with an Intel chip across a broad range of benchmarks (e.g. Geekbench), but it can also be a lot faster on your specific workloads.
> The technical reason to put memory and the CPU/GPU on a single package is to decrease memory latency (and possibly boost operating frequency).
You've got that backwards, I think. Putting memory on package, or even just soldered on the PCB nearby rather than in removable DIMMs primarily helps hit higher frequencies or lower power at similar frequencies (GPUs and smartphones provide a wealth of examples). There's little or no impact on latency, because most DRAM latency occurs within the DRAM dies or inside the CPU's cache hierarchy rather than on the link between the CPU and DRAM.
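A ballpark decomposition to illustrate the point. Every number here is an assumption for illustration, not a measurement of any specific system:

    /* Where the ~100ns of a DRAM load roughly goes: the CPU<->DRAM
     * link itself is a tiny slice, so moving memory on-package
     * barely changes latency. All values are ballpark assumptions. */
    #include <stdio.h>

    int main(void) {
        double core_and_caches_ns = 30;  /* miss detection, queues, TLB    */
        double controller_ns      = 15;  /* scheduling, command queues     */
        double dram_array_ns      = 45;  /* tRCD + CAS inside the DRAM die */
        double wire_ns            = 2;   /* the actual package/PCB traces  */

        double total = core_and_caches_ns + controller_ns
                     + dram_array_ns + wire_ns;
        printf("total ~%.0f ns, of which the wires are ~%.0f%%\n",
               total, 100 * wire_ns / total);      /* wires: ~2% */
        return 0;
    }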
Seems to be both. One big advantage, which is not visible on the M1 (which has a 128-bit wide memory bus), is that the M1 Pro, Max, and Ultra have 2x, 4x, and 8x the memory bandwidth. Something that's much harder on the x86-64 side, and (AFAIK) impossible on an x86-64 laptop.
The latency does seem improved in the last level cache and main memory, at least compared to AMD's Ryzen 5000 series.
Sadly it looks like Anandtech has lost the people who did the deep dives, and I can't find similar latency numbers for cache and RAM for the Ryzen 7000 series.
Apple's power efficiency allows for a 512-bit wide memory system in a laptop, 4 times any x86-64 laptop. I do hope that AMD or Intel ups their game and allows for more aggressive memory systems on laptops. The only similar design I know of is the Xbox Series X and PS5, which both use AMD chips with improved memory systems to keep from bottlenecking the iGPU.
I think a big reason for Anandtech losing the deep dive expertise is that the founder of Anandtech, Anand Lal Shimpi, was hired by… Apple! That was a while ago (2014), but that sort of knowledge isn't easy to find I imagine.
The 16" MBP is competitive, on wall power, for short periods of time before it's thermally throttled (even while the fans scream). Which usually means single core benchmarks.
The 16" MBP, at least mine, thermally throttles at the drop of a hat. Teams video conf, backups, virus scan, even small builds, etc.
The M1 in comparison is quiet, cool, not sure I've heard the fan going. So sure the performance is similar, the perf/watt is not.
It's due to larger caches, memory bandwidth, and thermal throttling or lack thereof. Intel laptops tend to run hot and therefore tend to not reach peak performance for longer periods of time. With M1/M2 and especially with the pro laptops with fans you will get peak performance much or all of the time.
> I saw a previous HN comment about this being due to memory bandwidth and cache latency, but I can't seem to substantiate that comment.
That being said, I'd love to have this sort of performance out of servers. Arm released https://www.anandtech.com/show/17575/arm-announces-neoverse-... in 2022, but it's not really available yet anywhere.