Hacker News
Single-chip processors have reached their limits (ieee.org)
203 points by blopeur on April 4, 2022 | 152 comments



Despite the limitations apparently present in single chip/CPU systems, they can still provide an insane amount of performance if used properly.

There are also many problems that are literally impossible to make faster or more correct than by simply running them on a single thread/processor/core/etc. There always will be forever and ever. This is not a "we lack the innovation" problem. It's an information-theoretic / causality problem you can demonstrate with actual math & physics. Does a future event's processing circumstances maybe depend on all events received up until now? If yes, congratulations. You now have a total ordering problem just like pretty much everyone else. Yes, you can cheat and say "well these pieces here and here don't have a hard dependency on each other", but it's incredibly hard to get this shit right if you decide to go down that path.

The most fundamental demon present in any distributed system is latency. The difference between L1 and a network hop in the same datacenter can add up very quickly.

Again, for many classes of problems, there is simply no handwaving this away. You either wait the requisite # of microseconds for the synchronous ack to come back, or you hope your business doesn't care if John Doe gets duplicated a few times in the database on a totally random basis.


All of this is true, but I’ll just be nitpicky to make a point:

> Does a future event's processing circumstances maybe depend on all events received up until now?

In a parallel prefix sum, the final sum does depend on all prior inputs, but a good parallel implementation runs in O(log(n)) time. It is, of course, not a total ordering problem, but that’s not obvious at first glance — I’ve always thought it was a beautiful example of something that appears to be entirely sequential but actually parallelizes really well.
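To make the O(log(n)) claim concrete, here is a minimal sketch of a Hillis-Steele style inclusive scan. The function name `inclusive_scan` is made up for illustration, and the rounds are simulated one after another in plain Python; the point is only that every update inside a round is independent, so a machine with enough lanes could do each round in one step.

```python
def inclusive_scan(values, op):
    """Hillis-Steele inclusive scan, simulated sequentially.

    Returns (prefix_results, rounds). Each round reads only the previous
    round's array, so all n updates in a round could run in parallel.
    """
    result = list(values)
    n = len(result)
    stride = 1
    rounds = 0
    while stride < n:
        # Build a fresh array from the previous round's values only.
        result = [
            result[i] if i < stride else op(result[i - stride], result[i])
            for i in range(n)
        ]
        stride *= 2
        rounds += 1
    return result, rounds


sums, rounds = inclusive_scan([3, 1, 4, 1, 5, 9, 2, 6], lambda a, b: a + b)
print(sums)    # running totals of the input
print(rounds)  # 3 rounds for 8 inputs, i.e. log2(8)
```

This variant does more total additions than the sequential loop (roughly n log n instead of n), which is the usual trade: more work, fewer dependent steps.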

All of which is to say that, yeah, we’re latency bound at the end of the day, but there’s a lot more innovation and performance left to be wrung out of most systems. A little creativity goes a long way, and I like the idea of a universe where the software is the main determinant of performance — where you can’t get away with just waiting for the next generation of chips.


It's a very simple example and knowing that addition is commutative, it's obvious that there is no ordering.


Prefix sum depends on associativity, not commutativity.
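A small sketch of why associativity is the property that matters: a tree-shaped reduction regroups the operands but never reorders them. The helper `tree_reduce` is a made-up name for illustration. String concatenation (associative, not commutative) survives the regrouping; subtraction (not associative) does not.

```python
import operator


def tree_reduce(items, op):
    """Combine neighbours pairwise, level by level, preserving order.

    Assumes `items` is non-empty. Each level's combinations are
    independent of one another, so a level could run in parallel.
    """
    level = list(items)
    while len(level) > 1:
        paired = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd element carries over unchanged
        level = paired
    return level[0]


# Associative but not commutative: regrouping is harmless.
tree_concat = tree_reduce(["a", "b", "c", "d", "e"], operator.add)

# Not associative: (10 - 3) - (2 - 1) differs from ((10 - 3) - 2) - 1.
tree_sub = tree_reduce([10, 3, 2, 1], operator.sub)
left_to_right_sub = ((10 - 3) - 2) - 1

print(tree_concat)                  # abcde
print(tree_sub, left_to_right_sub)  # 6 4
```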


The alternative is speculative execution. If you can guess what the result is going to be, you can proceed to the next calculation and you get there faster if it turns out you were right.

If you have parallel processors, you can stop guessing and just proceed under both assumptions concurrently and throw out the result that was wrong when you find out which one it was. This is going to be less efficient, but if your only concern is "make latency go down," it can beat waiting for the result or guessing wrong.
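As a software analogy only (real CPUs do this in hardware, not with threads), here is a sketch of "run both sides, keep the right one". The names `eager_select`, `slow_condition`, `branch_a` and `branch_b` are hypothetical, and the sleeps stand in for real work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def eager_select(condition, if_true, if_false):
    """Start the condition and BOTH branches at once, return one result.

    The losing branch's work is simply wasted. Note that this executor
    still waits for the loser to finish when the `with` block exits; a
    hardware implementation would squash it instead.
    """
    with ThreadPoolExecutor(max_workers=3) as pool:
        cond = pool.submit(condition)
        taken = pool.submit(if_true)
        not_taken = pool.submit(if_false)
        winner = taken if cond.result() else not_taken
        return winner.result()


def slow_condition():
    time.sleep(0.05)  # stand-in for a slow dependency
    return True


def branch_a():
    time.sleep(0.05)
    return "A"


def branch_b():
    time.sleep(0.05)
    return "B"


# Roughly max(condition, branch) of latency instead of their sum,
# at the price of doing twice the branch work.
print(eager_select(slow_condition, branch_a, branch_b))
```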


Not necessarily. There are problems you can't speed up even if you are given a literal infinity of processors - the problems in EXP for example (well, EXP - NP). Even for NP problems, the number of processors you need for a meaningful speed up grows proportionally to the size of the problem (assuming P!=NP).


Computational complexity and parallelism are orthogonal. Many EXP algorithms are embarrassingly parallel. You still have to do 2^n calculations, but if you have 1000 processors then it will take 1000 times less wall clock time because you're doing 1000 calculations at once.

The reason parallelism doesn't "solve" EXP problems is that parallelism grows linearly against something whose time complexity grows exponentially. It's not that it doesn't work at all, it's that if you want to solve the problem for 2n in the same time as for n, you need to double the number of processors. So the number of processors you need to solve the problem in whatever you define as a reasonable amount of time grows exponentially with n, but having e.g. 2^n processors when n is 1000 is Not Gonna Happen.

Having 1000 processors will still solve the problem twice as fast as 500, but that's not much help in practice when it's the difference between 50 billion billion years and 100.
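A back-of-the-envelope sketch of that arithmetic, assuming an idealized embarrassingly parallel job of 2^n independent operations and a made-up rate of 1e9 operations per second per processor. Both function names are hypothetical.

```python
import math


def wall_clock_seconds(n, processors, ops_per_second=1e9):
    """Ideal wall-clock time for 2**n independent operations."""
    return 2 ** n / (processors * ops_per_second)


def processors_needed(n, budget_seconds, ops_per_second=1e9):
    """Processors required to finish 2**n operations within the budget."""
    return 2 ** n / (budget_seconds * ops_per_second)


# Doubling the processor count halves the time...
print(wall_clock_seconds(60, 500) / wall_clock_seconds(60, 1000))

# ...but under this model each +1 in n doubles the processors needed
# for the same time budget.
one_day = 24 * 60 * 60
print(processors_needed(61, one_day) / processors_needed(60, one_day))
print(f"{processors_needed(100, one_day):.3g} processors for n=100 in a day")
```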


A problem that is not in NP is, by definition, a problem that can't be solved in polynomial time even with an infinity of processors (or equivalently, with speculative execution given a perfect oracle). So, it can't be embarrassingly parallel.

So, a problem in EXP - NP (that is, it can be solved in exponential time, but it can't be solved in polynomial time even if given an infinity of processors) will require exponentially more time to solve for a higher n than a lower n, regardless of how many processors you have - even up to infinity. You may get some kind of speed-up with more processors, but it will not be enough to remove the exponential runtime blow-up.

An example of a problem like this is the Travelling Salesman Problem. As far as is known right now, there isn't even a non-deterministic algorithm that can solve the problem in polynomial time. Even if you had an infinity of processors, and could solve the problem for 1000 cities in 1 minute, you would still need ~2^1000 minutes to solve the problem for 2000 cities. Note: this bound is not proven, there may be some non-deterministic algorithm that can solve it in polynomial time.

I will also note that even for a problem that is in EXP and in NP, so an EXP problem that is embarrassingly parallel, doubling the problem size would require 2^n times as many processors to keep the same run time, not 2n (assuming its time complexity is O(2^n)).


But what if the processor may be made organic, or encoded in DNA, so that replication and protein synthesis is akin to solving the problem? The "computer" grows exponentially, rather than linearly, and so for larger problems, it grows to the size that solves the problem?

Or we end up with grey goo...


Are there any examples of reusable "computers" like this? I've seen some demonstrations but they are not generic to the problem and reusable, they seem to have the data encoded in the makeup of the "computer" itself.

I write computers in quotation marks because even if they solve computational problems they are so different from what we are calling computers today, eg: mold growing in a maze and similar.


None that are usable, I'm sure - it's merely a theoretical idea.


Except that synchronizing two or more cores is, as a rule of thumb, too expensive to do per-instruction?


It would have to be done on the same hidden low level register manipulation as current speculation already does. Instead of rolling back to the wrong branch and starting over you would just drop it and continue with the correct computation.

The issue would be if you have a slow (value load from memory) and a fast (cached value) branch, currently only the most often used one gets predicted, so if you always hit the fast branch the predictor will never touch the slow branch. I think if you evaluate both in parallel you end up having no way to avoid the penalties introduced by the slow path. Imagine having your CPU cache constantly invalidated by code that never "runs".


> they can still provide an insane amount of performance if used properly.

In other words, software bloat has eaten up the bulk of the performance.


Like the sibling comments have indicated, many apparently sequential computations are actually parallelizable. I'll just add finite automata as another example in this category. On the surface it looks quite sequential (process one input at a time), but there are nice parallel algorithms that you can even implement on a GPU!
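One way to see the parallelism (a sketch, not the method used by any particular GPU library): summarize each chunk of input as a table "start state -> end state", then compose the tables. Composition is associative, so the tables can be combined with a tree reduction or scan. The DFA below is a made-up three-state example that detects the substring "ab", and all function names are hypothetical.

```python
from functools import reduce

# Hypothetical DFA: 0 = nothing useful seen, 1 = last char was 'a',
# 2 = "ab" has been seen (absorbing state).
STATES = (0, 1, 2)


def step(state, ch):
    if state == 2:
        return 2
    if ch == "a":
        return 1
    if state == 1 and ch == "b":
        return 2
    return 0


def summarize(chunk):
    """Table mapping every possible start state to its end state."""
    table = []
    for start in STATES:
        state = start
        for ch in chunk:
            state = step(state, ch)
        table.append(state)
    return tuple(table)


def compose(first, second):
    """Apply `first`, then `second`. This operation is associative."""
    return tuple(second[first[s]] for s in STATES)


def run_chunked(text, chunk_size, start=0):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Each summarize() call is independent and could run on its own core.
    tables = [summarize(chunk) for chunk in chunks]
    identity = tuple(STATES)
    return reduce(compose, tables, identity)[start]


def run_sequential(text, start=0):
    state = start
    for ch in text:
        state = step(state, ch)
    return state


print(run_chunked("xxaxbab", 3), run_sequential("xxaxbab"))  # 2 2
```

The cost is that each chunk is processed once per possible start state, so this pays off when the state count is small relative to the available parallelism.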

So yeah, there's a lot of nonobvious parallelism in the world. Just because something seems sequential doesn't mean it is.

I think the real reason why a lot of software is sequential/inefficient in practice is because we're just not smart enough to figure out how to optimize it (or don't have the proper incentives/it's not a priority). But that's something that's conceivably solvable; it's not limited by laws of physics the way hardware is.


Wait what‽ Do you have any resources you can point to for parallelism in finite automata?


It's been a while, but here are a couple things I found on Google for the search "finite automata gpu":

https://github.com/vqd8a/DFAGE

https://onlinelibrary.wiley.com/doi/epdf/10.1002/stvr.1796


Good Coding is still in major deficit these days so absolutely true.

Going back to the "coding intensity" that used to be common on 8-bit processors to overcome bloatware is still open territory. You may use libraries efficiently in your own code, but the libraries themselves are very seldom efficient.

This is still one of the better ways of "extending Moore's Law" - stop bloating code.

Architecture innovation is another.

Chiplets are probably the best physical/electronics way right now to continue the status quo of laziness.


The best chiplet interconnect may turn out to be no interconnect at all. Wafer scale integration [1] has come up periodically over the years. In short, just make a physically larger integrated circuit, potentially as large as the entire wafer -- like a foot across. As I understand it, there's no particular technical hurdle, and indeed the progress with self-healing and self-testing designs with redundancy to improve yield for small processors, also makes really large designs more feasible than in the past. The economics never worked out in the favour of this approach before, but now we're at the scaling limit maybe that will change.

At least one company is pursuing this at the very high end. The Cerebras WSE-2 [2] ("wafer scale engine") has 2.6 trillion transistors with 850,000 cores and 40 gigabytes of on-chip RAM, on a single, giant, integrated circuit (shown in the linked article). I'm just an interested follower of the field, no expert, so what do I know. But I think that we may see a shift in that direction eventually. Everything on-die with a really big die. System on a chip, but for the high end, not just tiny microcontrollers.

[1] https://en.wikipedia.org/wiki/Wafer-scale_integration

[2] https://www.zdnet.com/article/cerebras-continues-absolute-do...


To clarify and contextualize a bit what you're saying:

The one big obstacle in creating larger chips is defects. There's just a statistical chance of there being a defect on any given surface area of the wafer, a defect that generally breaks the chip occupying that area of the wafer.

So historically, the approach was to make more smaller chips and trash those chips on the wafer affected by defects. Then came the "chiplet" approach where they can assemble those functional chips into a larger meta-chip (like the Apple M1 Ultra).

But as you're saying, changes in the way chips are designed can make them resilient to defects, so you no longer need to trash that chip on your wafer that's affected by such a defect, and can thus design a larger chip without fear of defects.

(Of course such an approach requires a level of redundancy in the design, so there's a tradeoff)
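For a rough feel of the numbers, here is a sketch using the simple Poisson yield model (yield = exp(-area * defect density)), which is a common first approximation rather than what any fab actually uses. The defect density of 0.1 per cm^2 and both function names are made-up assumptions.

```python
import math


def poisson_yield(area_cm2, defects_per_cm2):
    """Probability that a die of the given area has zero defects."""
    return math.exp(-area_cm2 * defects_per_cm2)


def yield_with_spares(units, spares, unit_area_cm2, defects_per_cm2):
    """Probability that at least `units` of `units + spares` identical
    blocks are defect-free, assuming blocks fail independently."""
    p = poisson_yield(unit_area_cm2, defects_per_cm2)
    total = units + spares
    return sum(
        math.comb(total, k) * p ** k * (1 - p) ** (total - k)
        for k in range(units, total + 1)
    )


D = 0.1  # hypothetical defects per cm^2

print(f"1 cm^2 die:                {poisson_yield(1, D):.3f}")
print(f"8 cm^2 die, no redundancy: {poisson_yield(8, D):.3f}")
print(f"8 blocks needed, 2 spares: {yield_with_spares(8, 2, 1, D):.3f}")
```

The spare blocks are the redundancy tradeoff mentioned above: extra area spent so that a single defect no longer kills the whole chip.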


> changes in the way chips are designed can make them resilient to defects, so you no longer need to trash that chip on your wafer that's affected by such a defect,

no, it's basically "chiplets but you don't cut the chiplets apart". You design the chiplets to be nodes in a mesh interconnect, and failed chiplets can simply be disabled entirely and then routed around. But they're still "chiplets" that have their own functionality and provide a coarser conceptual block than a core itself and thus simplify some of the rest of the chip design (communications/interconnect, etc).

note that technically (if you don't mind the complexity) there's nothing wrong with harvesting at multiple levels like this! You could have "this chiplet has 8 cores, that one has 6, that one failed entirely and is disabled" and as long as it doesn't adversely affect program characteristics too much (data load piling up or whatever) that can be fine too.

however, there's nothing about "changes in the way the chips are designed that makes them more resilient to defects", you still get the same failure rates per chiplet, and will still get the same amount of failed (or partially failed) chiplets per wafer, but instead of cutting out the good ones and then repackaging, you just leave them all together and "route around the bad ones".

The advantage is that MCM-style chiplet/interposer packaging actually makes data movement much more expensive, because you have to run a more powerful interconnect, where this isn't moving anything "off-chip", so you avoid a lot of that power cost. There are other technologies like EMIB and copper-copper bonding that potentially can lessen those costs for chiplets of course.

What Intel is looking at doing with "tiles" in their future architectures with chiplets connected by EMIB at the edges (especially if they use copper-copper bonding) is sort of a half-step in engineering terms here but I think there are still engineering benefits (and downsides of course) to doing it as a single wafer rather than hopping through the bridge even with a really good copper-copper bond. Actual full-on MCM/interposer packaging is a step worse than cu-cu bonding and requires more energy but even cu-cu bonding is not perfect and thus not as good as just "on-chip" routing. So WSI is designed to get everything "on-chip" but without the yield problems of just a single giant chip.


> there's nothing about "changes in the way the chips are designed that makes them more resilient to defects"

Of course there is. Scarequotes invalid.

Ever since the first PAL/GAL/CPLD/FPGA-like device, the essence of malleable silicon has existed, and has been incorporated into things more and more over time.

It's 40 year old stuff by now.


I'll add that many DRAM chips already do something like this, but ironically enough the re-routing mechanism adds complexity which is itself a source of problems (be it manufacturing or design, such as broken timing promises).

Also, NAND Flash storage (SSD) is designed around the very concept of re-routing around bad blocks, because the very technology means they have a wear-life.


> I'll add that many DRAM chips already do something like this, but ironically enough the re-routing mechanism adds complexity which is itself a source of problems (be it manufacturing or design, such as broken timing promises).

The best-performing solution there is probably software. Tell the OS about bad blocks and keep the hardware simple.


I think this is already implemented both in Linux and in Windows; you can tell the OS which RAM ranges are defective.

Doing this from the chip side is not there yet, apparently. I wonder when this will be included in the DRAM feature list, if ever. I suspect that detecting defects from the RAM side is not trivial.


> I suspect that detecting defects from the RAM side is not trivial.

Factory testing or a basic self-test mode could easily find any parts that are flat-out broken. And as internal ECC rolls out as a standard feature, that could help find weaker rows over time.


Yep, my last PC developed a defect in one of the RAM modules. Finding it using memtest86 was trivial; easier than figuring out exactly how to tell Windows what to do about it...

Of course it did take a little bit of a hunch to go from "the game I'm playing crashes at this point" to "maybe my RAM is defective". I suppose ECC would help spot this.


Spot and correct it too :)


Wikipedia link on microlithography if you want a rabbit hole about wafer making:

https://wikipedia.org/wiki/Microlithography

Being able to print something in nanometers is an overlooked technical achievement for human manufacturing.


If that rabbit hole appeals, the ITRS[1] reports (now called IRDS[2]) are a very good mid-level, year-by-year summary of the state of the art in chipmaking, including upcoming challenges and future directions.

> Being able to print something in nanometers is an overlooked technical achievement for human manufacturing.

IMO, a semiconductor fab probably is the highest human achievement in terms of process engineering. Not only do you "print" nanometric devices, you do it continuously, in a multi-month pipelined system and sell the results for as little as under a penny (micros, and even the biggest baddest CPUs are "only" a thousand pounds, far less than any other item with literally a billion functional designed features on it).

[1]: https://en.wikipedia.org/wiki/International_Technology_Roadm...

[2]: https://en.wikipedia.org/wiki/International_Roadmap_for_Devi...


Are those the same reports accessed by the process in https://irds.ieee.org/home/how-to-download-irds ?


For IRDS, yes.


This is briefly touched upon in Neal Stephenson's "Seveneves": after The Cataclysm happens, humanity regrows most technology, but for the most part ICs are limited to 8086-level devices, as even thousands of years later they are unable to reach the earlier level of semiconductor technology. The amount of technology and power in something as small as a cell phone or lowly Raspberry Pi is nothing short of a marvel, particularly at their price points.


> change in the way chips are designed can make them resilient to defects

This is already happening for almost all modern chips manufactured in the last 10+ years. DRAM chips have extra rows/cols. Even Intel CPUs have redundant cache lines, internal bus lines and other redundant critical parts, which are burned-in during initial chip testing.


The Cerebras WSE is designed so that each wafer can disable / efuse a portion of itself to account for defects. This is what you can do if you control the whole wafer.


The other big obstacle is that chips are square while wafers are round.


it depends on the exact shape of your mask of course, but typically losses around the edges are in the 2-3% range.

It's not really possible to fix this either since wafers need to be round for various manufacturing processes (spinning the wafer for coating or washing stages) and round obviously isn't a dense packing of the mask itself. It just kinda is how it is, square mask and round wafer means you lose a bit off the edges, fact of life.


The interesting part is loss scales with the die size.
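A toy geometric sketch of that scaling: count square dies that fit entirely inside a circular wafer, with the die grid centered on the wafer. It ignores edge exclusion zones, scribe lines, rectangular dies and grid offsets, so the absolute percentages are not meant to match real-world figures; only the trend is the point. Function names are made up.

```python
import math


def dies_per_wafer(wafer_diameter_mm, die_mm):
    """Count square dies lying fully inside the wafer (centered grid)."""
    r = wafer_diameter_mm / 2
    n = int(r // die_mm) + 1
    count = 0
    for ix in range(-n, n):
        for iy in range(-n, n):
            # Farthest corner of this die from the wafer center.
            x = max(abs(ix * die_mm), abs((ix + 1) * die_mm))
            y = max(abs(iy * die_mm), abs((iy + 1) * die_mm))
            if x * x + y * y <= r * r:
                count += 1
    return count


def edge_loss_fraction(wafer_diameter_mm, die_mm):
    """Fraction of wafer area not covered by complete dies."""
    used = dies_per_wafer(wafer_diameter_mm, die_mm) * die_mm * die_mm
    return 1 - used / (math.pi * (wafer_diameter_mm / 2) ** 2)


for die in (10, 30):
    print(die, dies_per_wafer(300, die), f"{edge_loss_fraction(300, die):.1%}")
```

In this model the 30 mm die wastes a noticeably larger share of a 300 mm wafer than the 10 mm die does, because bigger squares approximate the circle's edge more coarsely.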


... that's the exact opposite of every economic and yield advantage that chiplet design addresses, isn't it?

Want to fine tune your chip offering to some multiple of 8 cores (arbitrary example of the # cores on the chiplet)? Just a packaging issue.

Want to upbin very large corecounts that generally overclock quite well? For a massive unichip described, maybe there are sections of the chip that are clocking well and sections that aren't: you're stuck. With chiplets, you have better binning granularity and packaging.

Want to fine-tune various cache levels? I believe from what I've read that AMD is doing L3 on a separate chiplet (and vertically stacking it!). So you can custom-tune the cache size for chip designs.

You can custom-process different parts of the "CPU" with different processes and fabrication, possibly even different fab vendors.

You can upgrade various things like memory support and other evolving things in an isolated package, which should help design and testing.

The interconnects are the main problem. But then again, I can't imagine what a foot-wide CPU introduces for intra-chip communication, it probably would have its own pseudo-interconnect highway anyway.

Maybe you don't even need to reengineer some chiplets between processor generations. If the BFD of your new release is some improvement to the higher or lower cpus in the High-Low designs, but the other is the same, then that should be more organizational efficiency.

Intel and others effectively moved away from gigantic integrated circuits decades ago: motherboard chipsets were always done with a separate cheaper fab that was a gen or two behind the CPU.

Maybe once process tech has finally stabilized for a generation (it does seem to be stagnating more now), massive wafer designs will start to edge out chiplet designs, but right now it appears that the opposite has happened and will continue for the foreseeable future.


I said it upthread but maybe it's better to use the conceptual model of "chiplets but you don't cut the chiplets apart". You have some network of "chiplets" connected with some kind of interconnect (a mesh is the model everyone's used so far).

So a "chiplet" (a module) could still have some of its cores turned off, or be disabled entirely, and then you just route around them. The yields are thus the exact same you would expect from taking a wafer, printing a bunch of chiplets, and then not cutting it apart. Most of them (Zen3 is >90% with 8 functional cores now iirc) will be fine. A few will fail clock bins. A few will be dead. Etc.

https://cerebras.net/blog/wafer-scale-processors-the-time-ha...

A lot of the points you raise are fair, some of them can be engineered around and others not really.

* There is no reason that different "chiplets" can't have different clocks. big.LITTLE already deals with different clock domains and rates, your interconnect should be asynchronous and then it's just an aspect of the design. Maybe it's not desirable from a programming perspective but it's not a problem in hardware, and if you don't like an uneven canvas, then just disable those modules.

* However you can't do different voltages as easily. Having a FIVR (Fully-Integrated Voltage Regulator) in each "chiplet" is an option, just feed some high voltage like 2v or 3v and have the chip step it down. This is highly desirable for efficiency anyway. Or just feed a reasonably high voltage and if some of the chips don't clock well, then oh well, you use a little more voltage than necessary (Intel does this for Alder Lake - the chip uses the greater of "V_big" and "V_little" for whatever values are needed for frequencies at the moment - if the little cores get more voltage than they need, oh well). I don't think clock variations/chip quality generally varies as much as people imagine it does, especially once a process gets relatively mature, you can come up with some relatively good bars that almost all cores will pass.

* Wafer-scale chips have to be fabbed with everything on one process, that's true. With "tiles" and EMIB, or MCM, or regular stacking, you can mix and match them, including between vendors.

A lot of these "mix and match" bits don't really make sense/don't really fit the release model/don't really fit the physical engineering:

* A core block is already a self-contained (relatively) module that can be designed/tested/updated in isolation. Once you've got the bugs hammered out, scaling it arbitrarily large isn't a problem, and again, you probably are organizing it into "modules" of 8 cores or so anyway for organization's sake, even with wafer-scale.

* "version friction" is already a thing even in other chips. If RDNA2 isn't ready to go when AMD designs their Zen3 APUs... oh well, it doesn't go until the next generation. You could certainly iterate versions more quickly, but you can't change these behind your customer's back - if RDNA2 becomes available at some later date, you can't just swap it in, that's a new product and has to be validated and designed for separately.

* "steppings" already handle this in traditional chips. If there is some minor improvement in the memory controller, it becomes a new stepping of the chip. If it's a big change... it's a new release.

The big advantage of wafer-scale over current MCM/interposer tech at an engineering level is the much lower cost of data movement. MCM/interposer is expensive to push a signal all the way through a big interposer somewhere else, and it takes quite a lot of cache to even attempt to hide this. "Active interposer", EMIB, and advanced bonding technologies can help reduce this a lot, but it's "cheaper" to just push it to the module next to you and then have it pass the message along in turn.

MCM also really doesn't go to the scale wafer-scale does. What if I want a package with 64 CCDs on it? That's really what wafer-scale is about. But you can't do that with MCM since that's bigger than the reticle limit used to print an interposer. What you'd have to do is print "bridges" that go under the edges of the chiplets and carry the signal, and then mount the whole thing on an inert substrate (not printed like an interposer). And... that's exactly what EMIB does. But you'd have to do a ton of packaging work to get the size of chip that wafer-scale can give you, and each of those packaging stages (attaching each bridge, etc) has failure rates too. And from the sounds of it, it's actually quite high (AMD is supposedly taking a big hit on Milan-X yields due to the cache packaging step).


> Wafer scale integration [1] has come up periodically over the years.

Not merely years, but decades. I worked with a wafer-scale group in MIT Lincoln Laboratory back in the late 80s. I'd say there is a reason the technology hasn't taken off in the past 35 years, but who knows, maybe now is the time for wafer scale integration to shine.

Incidentally, that group was originally Ken Olsen's group, before he left to go found Digital Equipment Corporation (DEC). They had a lot of talent and produced some interesting wafers. The tech back then (4in wafers iirc) used laser reconfigurability to route around bad cells.


I think a part of this is that the clock speed increases made most other more expensive solutions not yet commercially viable, but now that 'the end' is in sight a lot of those more expensive solutions are making a comeback. You could see the same happening in software to mirror the far more exotic hardware that we use now compared to single core CPUs which for a very long time were the norm.


Nothing new under the sun. Ivor Catt was proposing wafer scale computing in the '70s. Large numbers of processors with the ability to route around defective units.

https://www.ivorcatt.org/icrns86jun_0004.htm


Calling wafer-scale "no interconnect" is kind of misleading since it's still very difficult to stitch reticles and it has yield challenges.


Indeed, that Zdnet article hints at how they solved that but doesn't provide detail enough to make sense of it.


The typical problem is the ~30mm reticle dimension limit. Unless you want to have the same pattern copied ~100x over the wafer you'll need to have more unique masks... A LOT of masks, and many more mask steps (because they can't step the same mask). For reasonable (FinFET) geometries that is much too expensive, even if you could buy the required aligners.

I suspect stacked wafer is much more cost competitive, and you get to use appropriate (RAM, CMOS, Flash) processes for each layer so long as their sizes match. Via size and density are such that speed is less of an issue than going to package and the yield hit isn't more than 2-3x.


Sounds like a great development if it works out.

But consider also that you can stick chiplets on top of each other vertically. That means you can put chiplets much closer together than if they were constrained to exist on the same single plane of the wafer.

Now how about stacking wafers on top of wafers? That could be super, but there might be technical difficulties, which maybe sooner or later can be overcome.


> But consider also that you can stick chiplets on top of each other vertically.

The problem there is heat dissipation. Already the performance constraint on consumer chips like the Apple M1 is how well it can dissipate heat in the product it's placed in (see MacBook Air vs Mac Mini). Stacking the chips just makes it worse.


AMD's 5800X3D and the upcoming generation of AMD/NVIDIA GPUs (both of which are rumored to feature stacked cache dies) are going to be real interesting. So far we haven't ever seen a stacked enthusiast die (MCM doesn't feature any active transistors on the interposer) and it will be interesting to see how the thermals work out.

This isn't even stacking compute dies either, stacking memory/cache is the low-hanging fruit but in the long term what everyone really wants is stacking multiple compute dies on top of each other, and that's going to get spicy real quick.

M1 is the other example but again, Apple's architecture is sort of unique in that they've designed it to run from the ground up at 3 GHz exactly, there's no overclocking/etc like enthusiasts generally expect. AMD is having to disable voltage control/overclocking on the 5800X3D as well (although that may be more related to voltage control rather than thermals - sounds like the cache die may run off one of the voltage rails from the CPU, potentially a FIVR could be used to drive that rail independently, or add an additional V_mem rail...)

And maybe that's the long-term future of things, that overclocking goes away and you just design for a tighter design envelope, one where you know the thermals work for the dies in the middle of the sandwich. Plus the Apple design of "crazy high IPC and moderately low ~3 GHz clocks" appears well-adapted for that reality.


Wide and slow is expensive. Very expensive. That's why Apple can do it and nobody else is doing it (in mobile Qualcomm is _cutting_ cache from Arm reference designs and in servers Graviton and Ampere are also cutting cache). It is cheaper for a given performance level to clock your cores as far as they'll go and cheap out on the width of your core if you know your customers either won't or can't care about power efficiency (because they have no better alternatives).


The fact that the M1 MacBook Air operates without needing a fan is very unusual for that level of performance.


the problem is signal propagation

for light to cross 1 foot should take ca. 1 ns


3D circuits would be denser (shorter propagation distances) than a planar circuit. In fact "computronium" is sort of an idea about how dense you can conceptually make computation.

You just can't really cool it that well with current technologies. Microfluidics are the current magic wand that everyone wishes existed but it's a ways away yet.


Signal propagation latency is only a problem if both end points need to be fully synchronous, which they don't have to be if they're independent units. Even the cores within single-chip multi-core CPUs aren't synchronous to each other, as far as I know.


The problem is cost. Instead of getting N CPUs per wafer you are now getting just 1 mega-CPU. So you need to sell the mega-CPU at N times the price to get the same ROI. So a solution for super computer applications. Not for everybody else.


Instead of microcircuits, megacircuits. I like it


speed of light ~ 1*10^9 feet/sec

To cross one foot - no less than 10^-9 sec = 1ns
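The same estimate with exact constants, plus a conversion into clock cycles. This is the vacuum speed of light; the `velocity_factor` parameter and the 0.5 value used below are illustrative assumptions, since signals in real on-chip or package wiring travel slower and are usually limited by other effects anyway.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact by SI definition
FOOT_IN_METERS = 0.3048               # exact by definition


def travel_time_ns(distance_m, velocity_factor=1.0):
    """Time for a signal to cover the distance, in nanoseconds."""
    return distance_m / (SPEED_OF_LIGHT_M_PER_S * velocity_factor) * 1e9


def cycles_elapsed(distance_m, clock_ghz, velocity_factor=1.0):
    """How many clock cycles pass while the signal is in flight."""
    return travel_time_ns(distance_m, velocity_factor) * clock_ghz


print(f"{travel_time_ns(FOOT_IN_METERS):.3f} ns per foot in vacuum")
print(f"{cycles_elapsed(FOOT_IN_METERS, 3.0):.2f} cycles at 3 GHz")
print(f"{cycles_elapsed(FOOT_IN_METERS, 3.0, velocity_factor=0.5):.2f} cycles at half of c")
```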


There are some new chip manufacturing techniques coming down the pipeline, which will lead to prices dropping and likely "wafer-scale" will get to the mainstream.


Could you elaborate? Would love to know more.


Unfortunately, I cannot.


Can you clarify what you mean? How can anything that would use a major share of a wafer or even a whole one be mainstream? Producing a wafer is expensive, the chips are only available to ordinary people in high income markets because you get hundreds or even thousands from a single wafer.


I remember back in the 80's the limit was considered to be 64K RAM chips, because otherwise the defect rate would kill the yield.

Of course, there's always the "make a 4 core chip. If one core doesn't work, sell it as a 3 core chip. And so on."


Hmm. I worked for a memory manufacturer in the 80s and I do not remember any limit.


That is mainstream news reporting for you since the 80s.


"Reached their limits" - I feel like I've heard this many many times before.

Not that I doubt it, but I've also just been impressed with the ingenuity that folks come up with in this space.


Microarchitecture performance innovation continues (the M1, Zen3, and Alder Lake cores are all significant improvements over their predecessors), and transistor performance and density continue to increase. We're on the verge of a new transistor architecture (FinFET -> GAA), and shrinks and other advancements continue.

Single-chip processors have not reached their limits. For some applications, one piece of silicon is not enough, and that has been true forever. Early cores were built with multiple chips, later ones had separate cache or FPU add-ons, we had (and still have) SMP multi-processors that wire up many chips together, and we have discrete GPUs.

So in any given year, with any given silicon technology and logic design, we have virtually always reached the limit of single chip processors, and gone beyond them with multiple chips. And at the same time, the limits of single chip processors have continued to expand year after year.

Both of these things remain true. Relative performance improvements have slowed significantly from where they were 20-30 years ago, but things are still ticking along.


I read articles like this as saying "reached their limits [as we currently understand them]." Sometimes we learn we were mistaken and more is possible but it's not reliable and, crucially, when it happens it happens in unexpected ways. The process of talking about when (and why) techniques have hit their useful limits is often key to unearthing the next step.


Agreed. I would be wary of reaching fundamental limits set by physics although I don't think we're there yet.

"It would appear that we have reached the limits of what is possible to achieve with computer technology, although one should be careful with such statements, as they tend to sound pretty silly in five years."

- attributed to von Neumann, 1949.


For example, there's lots to explore in the VLIW space.


Compilers, largely.


Yep, and also architectures whose state is simpler (for compilers) to model


We only have to solve one limitation per year to keep making progress year over year, and as it is, the semiconductor industry still seems to be solving large numbers of significant issues yearly. So while we don't necessarily get smooth, predictable improvement, a safe bet is that there will continue to be useful new developments 10-20 years out, even if they don't translate to the same kinds of gains as in years past.


"There's plenty of room at the bottom."


Actually, we are running out of room there.

That speech is more than 60 years old now. There was plenty of room at that time.

Of course, it also speculated that we would move into quantum computers at some point, which is still a possibility, but now we know that quantum computers won't solve every issue.


The M1 Ultra is fabricated as a single chip. The 12900K is fabricated as a single chip and is still a quarter the size of the M1 Ultra. Zen 3 puts 8 cores on a CCX instead of four because DDR memory controllers don't have infinite memory bandwidth (contrary to AMD's wishful nomenclature) and make shitty interconnects between banks of L3.

Chiplets are valid strategies that are going to be used in the future but there are still more tricks that CPU makers have up their sleeves that they need to use out of necessity. They're nowhere near their limits.


M1 ultra is two chips with an interconnect between them I thought? Or is the interconnect already on die with them?

(Edit: sounds like it is two: "Apple said fusing the two M1 processors together required a custom-built package that uses a silicon interposer to make the connection between chips. " https://www.protocol.com/bulletins/apple-m1-ultra-chip )


> M1 ultra is two chips with an interconnect between them I thought? Or is the interconnect already on die with them?

It's either depending on how you look at it. The active components of the interconnect are on the two M1 dies, but the interconnect itself goes through the interposer as well.


> The M1 Ultra is fabricated as a single chip.

I'm curious how much the M1 Ultra costs. It's such a massive single piece of glass I'd guess it's $1,200+. If that's the case it doesn't make sense to compare the M1 Ultra to $500 CPUs from Intel and AMD.


Dunno, M1 Ultra includes a decent GPU, which the $500 CPUs from Intel and AMD do not. Seems relatively comparable to a $700 GPU (like an RTX 3070 if you can find one) depending on what you are using. Sadly, Metal-native games are rare; many use some Metal wrapper and/or Rosetta emulation.

Seems pretty fair to compare an Intel alder lake or higher end AMD Ryzen AND a GPU (rtx 3070 or radeon 6800) to the M1 ultra, assuming you don't care about power, heat, or space.


Has anyone managed to reach the actual advertised 21 FP32 TFLOPS? I'm curious. Even BLAS or pure custom matmul stuff? How much of that is actually available? I can almost saturate and sustain an NVIDIA A40 or A4000 at their peak perf, so I'm wondering whether anyone has written something there.


What are you using to make the comparison to a 3070? Not close in any benchmarks I've come across.


I estimate that Apple's internal "price" for the M1 Ultra is around $2,000. Since most of the chip is GPU, it should really be compared to a combo like 5950X + 6800 XT or 12900K + 3080.


It wouldn't surprise me. M1 Ultra has 114 billion transistors and a total area of ~860 square mm. For comparison, an RTX 3090 has 28 billion transistors and a total area of 628 square mm.


Wouldn't the price be primarily based on capital investment and not so much on the unit itself? After all, it's essentially a print out on a crystal using reeeeeally expensive printers. AFAIK Apple's relationship with TSMC is more than a customer relationship.


In a parallel universe where Intel builds and sells this CPU- what's the price? Single chip, die size of 860 square mm, 114 billion transistors, on package memory.

It just got me thinking the other day since all of these benchmarks pit it against $500-$1000 CPUs and it doesn't seem to fall in that price range at all. Look at this thing:

https://cdn.wccftech.com/wp-content/uploads/2022/03/2022-03-...


That's the whole package though, together with the RAM and everything. The actual die is about the size of the thermal paste stain in that picture.


If there's a defective M1 Ultra, they can cut it in half and say those are two low-end M1 Max.


Wouldn't they only get, at most, one low-end M1 Max if there is a defect?


They sell cheaper models with some cores disabled, that's what I meant by low-end. Ever wondered what's the deal with the cheapest "7-core GPU" M1?


If the defect is in the right place, Apple apparently sells M1 Max chips with some GPU cores disabled.


also all the other shit that’s on the chip ram etc.


It is commonly said that on the new M1 Macs the RAM is on the chip; it is not. It is on the same substrate, but it's just normal (fast) DRAM chips soldered on nearby.


"chip" is ambiguous the way you're using it.

The M1 Ultra is two dies in one package. The package is what goes on the motherboard.

You can also count the memory modules as packages. It's not incorrect to say that M1 Ultra has a bunch of LPDDR5 packages on it; each LPDDR5 package may have multiple dies in it, or the whole LPDDR5 package may be referred to as a stack ("Ultra has 16 stacks of memory").

But depending on context it also wouldn't be incorrect to call the M1 Ultra package a chip, even if it's got more packages on it. From the perspective of the motherboard maker, the CPU BGA unit is the "package".

Anyway no, Ultra isn't a monolithic die in the sense you're meaning, it's two dies that are joined, Apple just uses a ridiculously fat pipe to do it (far beyond what AMD is using for Ryzen) such that it basically appears to be a single die. The same is true for AMD, Rome/Milan are notionally NUMA - running in NPS4 mode can squeeze some extra performance in extreme situations if applications are aware of it, and there's some weird oddities caused by memory locality in "unbalanced" configurations where each quadrant doesn't have the same amount of channels. It just doesn't feel like it because AMD has done a very good job hiding it.

However you're also right that we haven't reached the end of monolithic chips either. Splitting a chip into modules imposes a power penalty for data movement, it's much more expensive to move data off-chiplet than on-chiplet, and that imposes a limit on how finely you can split your chiplets (doing let's say 64 tiny chiplets on a package would use a huge amount of power moving data around, since everything is off-chip). There are various technologies like copper-copper bonding and EMIB that will hopefully lower that power cost in the future, but it's there.

And even AMD uses monolithic chips for their laptop parts, because of that. If any cores are running, the IO die has to be powered up, running its memory and infinity fabric links, and at least one CCD has to be powered up, even if it's just to run "hello world". This seems to be around 15-20W, which is significant in the context of a home or office PC.

It's worth noting that Ryzen is not really a desktop-first architecture. It's server-first, and AMD has found a clever way to pump their volumes by using it for enthusiast hardware. Servers don't generally run 100% idle, they are loaded or they are turned off entirely and rebooted when needed. If you can't stand the extra 20W at idle, AMD would probably tell you to buy an APU instead.


Some older stuff for reference: IBM POWER5 and POWER5+ (2004&2005) are MCM designs, had 2-4 CPU chips plus cache chips in same package.

Link: https://en.wikipedia.org/wiki/POWER5


Pentium pro from 1995 had two pieces of silicon in the package: https://en.wikipedia.org/wiki/Pentium_Pro


PPros are quite hard to find now because the "gold scavengers" loved them. As I recall, at the peak in 2008, they were $100 each and more for the ceramic packages. All that interconnect was tiny gold wires, apparently.


Heh, had no idea. They seemed to have a pretty limited run: they ran at up to 200 MHz, but were pretty quickly replaced by the Pentium II at 233 MHz on a single die.


I'm using one as a coaster.


> UCIe is a start, but the standard’s future remains to be seen. “The founding members of initial UCIe promoters represent an impressive list of contributors across a broad range of technology design and manufacturing areas, including the HPC ecosystem,” said Nossokoff, “but a number of major organizations have not as yet joined, including Apple, AWS, Broadcom, IBM, NVIDIA, other silicon foundries, and memory vendors.”

The fact that the standard doesn’t include anyone who is actually building chips makes me very pessimistic about it.


Looks like a lot of people who actually build chips are in the organization

https://www.uciexpress.org/membership


Is spectrum.ieee.org becoming another piece of mainstream (so to speak) journalism, where everything is dumbed down to basically Newspeak? The article is poorly written, the content is shallow, and the headline is clickbait.


The IET magazine is the same, it's basically New Scientist but more expensive and you get letters after your name too.

My favourite was the old PPARC Frontiers, was glossy but not breathless. But even then, these things never go into much detail.


I'm embarrassed to admit I still don't quite understand what a chiplet is, would be very grateful for your input here.

If a thread can run on multiple chiplets then this is awesome and seems like a solution.

If one thread == one chiplet, then*:

- a chiplet is equivalent to a core, except with speedier connections to other cores?

- this isn't a solution, we're 15 years into cores and single-threaded performance is still king. If separating work into separate threads was a solution, cores would work more or less just fine.**

* put "in my totally uneducated opinion, it seems like..." before each of these, internet doesn't communicate tone well and I'm definitely not trying to pass judgement here, I don't know what I'm talking about!

** generally, for consumer hardware and use cases, i.e. "I am buying a new laptop and I want it to go brrrr", all sorts of caveats there of course


AMD Epyc is (AFAIK) what popularized the term. Their current design has a memory controller (PCIe controller, 8 x 64-bit channels of RAM, etc.) and 8 chiplets, which are pretty much just 8 cores and an Infinity Fabric connection for a cache-coherent connection to other CPUs (in the same or other sockets) and DRAM.

So generally Epyc come with some multiple of 8 CPUs enabled (1 per chiplet) and the latency between cores on the same chiplet is lower than the latency to other chiplets.

This allows AMD to target high-end servers (up to 64 cores), low-end ones (down to 16), workstations with Threadripper (4 chiplets instead of 8), and high-end desktops (2 chiplets instead of 8) with the same silicon. This allows them to spend less on fabs, R&D, etc. because they can amortize the silicon over more products/volume. It also lets them bin them so chiplets with bad cores can still be sold. It's one of the things that lets AMD compete with the much larger volume Intel has, and do pretty well against Intel's numerous silicon designs.


> So generally Epyc come with some multiple of 8 CPUs enabled (1 per chiplet)

Not quite, AMD does use values other than 8 cores-per-CCD in Epyc as well. Take the 7402P, that's a 24C SKU, if you did that as "3 8-core chiplets" then you would only have 3 quadrants = 6 memory channels and 48 PCIe lanes. Those are done with 4 chiplets of 6 cores each. Same for 48C SKUs.

AMD also has a number of "frequency-optimized"/"cache-optimized" SKUs that are like, 4C or even 2C per CCD, with all the cache enabled, to allow maximum frequency and maximum cache-per-thread, for stuff like HFT where there just is no substitute for A Few Threads Going Really Fast. Or playing games to minimize your software license costs, as that is often based on core count.

However, the link is certainly a little bit amorphous. Some Epyc SKUs only have 4 memory channels, but through some magic they can still access all the memory slots like they were a full 8-channel part. I guess that means you have 2 quadrants active (2 memory controllers/PCIe controllers) but they can access the PHYs from the other two disabled memory controllers, not 100% on how that is implemented, but it exists.

https://www.servethehome.com/amd-epyc-7002-rome-cpus-with-ha...

Generally Epyc doesn't like "unbalanced" configurations though (this can incur severe performance penalties, like losing 2/3rds of your memory bandwidth level "severe" if you only populate 3/4 or 6/8 of your sticks) so that gets used very sparingly, and only in situations of "power of 2" resources I'm guessing.

As a general rule of thumb, assume anything Epyc-specific (IO die, IF links, etc) is designed to work with 4 of something, or at least 2 of something. All the "variation" happens at the CCD level. So a 24C is not 3x8C, it is always 4x6C as your baseline assumption (and it is). As mentioned sometimes memory controllers can be 2x but... generally four shall be the number that is counted, and the number of counting shall be four. Thou shalt not count to three, except that thou proceed to four. Five is right out.


Heh, right, I did say generally.

The 7453, 7443, 7413, and 7313P have fewer than 8 chiplets (the 7xx3 chips are Milan/Zen3). I don't believe any of them have less than full memory bandwidth, unlike the previous generation. The spec sheet mentioned PCIe x 128 for all of them as well.


> The spec sheet mentioned PCIe x 128 for all of them as well.

Due to the way they've sliced it, you always get full PCIe PHYs (128 lanes) just like you get full memory PHYs. They literally only gimped the memory bandwidth, like the controllers are gone but the PHYs remain and the other 4 controllers can use all of the PHYs. It's kinda weird, I don't think I've seen it done like that before.

Incidentally though this probably does mean some weirdness with locality at those extremes though - half of your lanes don't have any CPU cores locally and everything they do is running through the quadrant-interconnect.


"They literally only gimped the memory bandwidth".

As I mentioned, it looks like all the Epycs with fewer than 8 chiplets in the current Zen3/Milan generation have the full memory bandwidth.


No need to get defensive (on behalf of a multibillion-dollar corporation), I'm just more curious about how exactly they did the quad-channel SKUs in Rome.


Heh, didn't think it was defensive.

The 2nd gen epycs did have chips with reduced memory bandwidth, the 3rd gen didn't. I believe the posted URL from servethehome has a pretty good explanation. My theory is that the 2nd gen Epycs had a bottleneck in the chiplet uplink connections, so that 4 chiplets couldn't manage the full bandwidth. So maybe the 3rd gen increased those links so even 4 chiplets could still handle 100% of the available memory bandwidth.

It does boggle my mind that a $2k Apple desktop has 400GB/sec of memory bandwidth and that a $4k Apple desktop has 800GB/sec of memory bandwidth.


A chiplet is a full-fledged CPU with many cores on it. The term is used when multiple of these chips are stitched together with a high speed interconnect and plugged into the single socket on your motherboard.

If you ripped the lid off a Ryzen "chip", you would see multiple CPU dies underneath for the high end models.


Additionally - MCM - multi-chip module - instead of putting separate chips for various functions on a board, they're fused together in what from the outside looks like a single chip, but internally is 3 or 4 unrelated chips.

Examples at the Wikipedia article: https://en.wikipedia.org/wiki/Multi-chip_module


Are we moving this way because bigger chips with many cores have worse yields? So the answer is to make lots of little chips and fuse them together?


Yeah, in the very general case, chip errors are a function of die area. Cutting a die into four pieces so that when an error occurs in manufacturing, you only throw out a quarter of the die area is becoming the right model for a lot of designs.

Like all things chips, it's way more complicated than that, fractally, as you start digging in. For example, AMD started down this road initially because of their contractual agreements with GloFo to keep shipping what GloFo makes, but wanted the bulk of the logic on a smaller node than GloFo could provide, hence the IO die and compute chiplets model that still exists in Zen. It's still a good idea for other reasons, but they lucked out a bit by being forced in that direction before other major fabless companies.

This is also not a new idea, but sort of ebbs and flows with the economics of the chip market. See the VAX 9000 multi chip modules for an 80s take on the same ideas and economic pressures.


Their GPUs are likely to be multichip for the first time too with NAVI 31 (while Nvidia's next gen will still be single chip and likely fall behind AMD). It also seems that the cache will be 6nm while the logic will be 5nm, bonded together with some new TSMC technology. At least that can be inferred from some leaks:

https://www.tweaktown.com/news/84418/amd-rdna-3-gpu-engineer...


I've yet to see any sort of research out of AMD on MCM mitigations for things like cache coherency and NUMA. Nvidia on the other hand has published papers as far back as 2017 on the subject. On top of that even the M1 Ultra has some rough scaling spots in certain workloads and Apple is by far ahead of everyone else on the chiplet curve (if you don't believe me, try testing lock-free atomic load/store latency across CCX's in Zen3).

Also AMD claimed the MI250X is "multichip" but it presents itself as 2 GPUs to the OS and the interconnect is worse than NVLink.


There's a few ways to interpret that. Another interpretation could be that they are simply taping out Navi32 on two nodes, perhaps for AMD to better utilize the 5nm slots they have access to. Perhaps when Nvidia is on Samsung 10nm+++, then the large consumer AMD GPUs get a node advantage already being at TSMC 7nm+++, and so they're only using 5nm slots for places like integrated GPUs and data center parts that care about perf/watt.

But your interpretation is equally valid with the information we have AFAICT.


This is what Tesla's Dojo does (it's really a TSMC technology that they are the first to utilize). You can cut your wafer up into chips, ditch the bad ones, then reassemble them into a bigger wafery chip thing using some kind of glue. Then you can do more layers to wire them up.

I think they do it using identical chips but I guess there's no real reason you couldn't have different chips connected in one wafer. Expensive though!


Well fuse is one possibility. The AMD Epyc has generally an IO+memory controller die (called IOD) + 8 chiplets that are 8 cores each for most of the Epyc chips, however not all cores are enabled depending on the SKU.

However, Apple's approach does allow impressive bandwidth: 2.5TB/sec, which is much higher than any of the chiplet approaches I'm aware of.


I hope somebody with relevant knowledge can answer this question, please: what % of the costs is "physical cost per unit" and what % is maintaining the I+D, factories, channels...?

In other words, if a chip with 100x size (100x gates, etc.) made sense, would it cost 100x to produce or just 10x or just 2x?

Edit: providing there wouldn't be additional design costs, just stacking current tech.


There's many limiting factors... one is the reticle limit.

But most fundamental is the defect density on wafers. If you have, say, 10 defects per wafer, and you have 1000 chips on it: odds are you get 990 good chips.

If you have 10 chips on the wafer, you get 3-4 good chips per wafer.

Of course, there's yield maximization strategies, like being able to turn off portions of the die if it's defective (for certain kinds of defects).

For the upper limit, look at what Cerebras is doing with wafer scale. Then you get into related, crazy problems, like getting thousands of amperes into the circuit and cooling it.
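To put rough numbers on the 10-defects-per-wafer example, here is a minimal sketch. It assumes (my simplification, not a process model) that each defect lands on a uniformly random chip, independently of the others, and that any defect kills the chip it hits. The function name is made up for illustration.

```python
def expected_good_chips(chips_per_wafer: int, defects_per_wafer: int) -> float:
    """Expected number of defect-free chips on one wafer.

    Assumption: every defect hits one uniformly random chip, independently,
    and a single defect is enough to ruin that chip.
    """
    # Probability that a given chip is missed by every defect.
    p_chip_clean = (1 - 1 / chips_per_wafer) ** defects_per_wafer
    return chips_per_wafer * p_chip_clean


small_dies = expected_good_chips(chips_per_wafer=1000, defects_per_wafer=10)
large_dies = expected_good_chips(chips_per_wafer=10, defects_per_wafer=10)
print(f"1000 small chips: ~{small_dies:.0f} good")
print(f"10 large chips:   ~{large_dies:.1f} good")
```

Under these assumptions the small-die wafer keeps about 990 of 1000 chips, while the large-die wafer keeps only about 3.5 of 10. Real yield models (and redundancy or core-disabling tricks) will move those numbers, but the trend is the point.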


The way TSMC amortizes those fixed costs is to charge by the wafer, so if your chip is 100x larger it costs at least 100x more. (You will have losses due to defects and around the edges of the wafer.) You can play with a calculator like https://caly-technologies.com/die-yield-calculator/ to get a feel for the numbers.


It's been a while since I've been out of that industry, but back around the 45nm days, one of the biggest concerns was yield. If you've got 100x the surface area, the probability of there being a manufacturing defect that wrecks the chip goes up. Now, you could probably get away with selectively disabling defective cores, but the chiplet idea seems, to me, like it would give you a lot more flexibility. As an example, let's say a chiplet i9 requires 8x flawless chips, and a chiplet Celeron requires 4 chips, but they're allowed to have defects in the cache because the Celeron is sold with a smaller cache anyway.

In the "huge chip" case, you need the whole 8x area to be flawless, otherwise the chip gets binned as a Celeron. In the chiplet case, any single chip with a flaw can go into the Celeron bin, 8 flawless ones can be assembled into a flawless CPU, and any defective ones go into the re-use bin. And if you end up with a flawed chip that can't be used at the smallest bin size, you're only tossing 1/4 or 1/8 of a CPU in the trash.
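A rough sketch of why the chiplet case wastes less silicon, using the simple Poisson yield model Y = exp(-D * A). The defect density and chiplet area below are hypothetical numbers picked for illustration, and the sketch ignores binning of partially defective dies as well as packaging cost.

```python
import math


def poisson_yield(defects_per_mm2: float, area_mm2: float) -> float:
    """Fraction of dies with zero defects under a Poisson defect model."""
    return math.exp(-defects_per_mm2 * area_mm2)


DEFECT_DENSITY = 0.001  # defects per mm^2 (hypothetical)
CHIPLET_AREA = 80.0     # mm^2 per chiplet (hypothetical)
N_CHIPLETS = 8          # the "8x" case from the comment above

chiplet_yield = poisson_yield(DEFECT_DENSITY, CHIPLET_AREA)
monolithic_yield = poisson_yield(DEFECT_DENSITY, CHIPLET_AREA * N_CHIPLETS)

# Fraction of fabricated silicon that fails the "flawless" test in each case.
chiplet_scrap = 1 - chiplet_yield
monolithic_scrap = 1 - monolithic_yield
print(f"flawless chiplets:        {chiplet_yield:.1%}")
print(f"flawless monolithic dies: {monolithic_yield:.1%}")
```

With these made-up inputs, about 92% of chiplets come out flawless versus about 53% of the 8x-area monolithic dies. The chance that eight specific chiplets are all clean equals the monolithic yield, but with chiplets you pick the good ones after testing, so a defect costs you one small die instead of the whole big one.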


>would it cost 100x to produce or just 10x or just 2x?

Why would 100x something only cost 2x to produce?

>what % of the costs is "physical cost per unit" and what % is maintaining the I+D, factories, channels...?

Without unit volume and a definition of the first "cost" in the sentence, no one could answer that question. But if you want to know the BOM cost of a chip, it is simply the wafer price divided by the total usable chips, which depends on yield, where yield is a function of both the current maturity of the node and whether your design allows correction of defects for usable chips. Then add ~10% for testing and packaging.
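As a sketch of that BOM arithmetic: wafer price divided by usable chips, plus roughly 10% for testing and packaging. The wafer price, die count and yield below are hypothetical placeholders, not real foundry figures, and the function name is made up for illustration.

```python
def unit_bom_cost(wafer_price: float, dies_per_wafer: int, yield_fraction: float,
                  test_and_package_overhead: float = 0.10) -> float:
    """Per-chip cost: wafer price over usable dies, plus test/packaging overhead."""
    usable_dies = dies_per_wafer * yield_fraction
    if usable_dies <= 0:
        raise ValueError("no usable dies on the wafer")
    return wafer_price / usable_dies * (1 + test_and_package_overhead)


# Hypothetical inputs: a $10,000 wafer, 500 candidate dies, 80% yield.
example_cost = unit_bom_cost(wafer_price=10_000, dies_per_wafer=500, yield_fraction=0.8)
print(f"${example_cost:.2f} per chip")
```

With those placeholder inputs it works out to $27.50 per chip. Making the die bigger hurts twice in this formula: fewer dies fit on the wafer, and the yield fraction drops.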


> Why would 100x something only cost 2x to produce?

If you create an app, the cost is mostly developing it. Once you can sell a copy, you can sell 100x for more or less the same cost.

It's tricky for physical things. We tend to think that cost correlates with weight or volume, but that's wrong.

The price of a typical 100 ml (3.4 fl oz) perfume is around $50 in shops, for an "official" $100 price.

The cost of the juice is just $3, maybe $5. But that's not the total cost: the perfumist, the bottle, the box, shop markup, distribution markup, marketing, tv commercials, design, samples...

In this case, I guess I+D and machinery is a huge cost and price per unit is set not so much for the cost of producing one unit, but looking at what the whole costs are, putting a markup and dividing by expected units, hence my question.

By the way, nobody answered :-)

I know that the responses answered somehow, in an indirect way...


That is only in the case of software, where the unit cost does not increase since it is nearly zero. What you wrote about perfume and juice are not unit "cost", but "price". Unit Cost, or BOM or COGS does not include --> the perfumist, the bottle, the box, shop markup, distribution markup, marketing, tv commercials, design, samples...

I am assuming the "I" here stands for Initial Investment or CapEx. And in the context of a foundry customer it is more like R&D, since they don't invest in machinery. The foundry does that, i.e. Intel, TSMC, Samsung. The R&D cost, again, is not a percentage of "price" or "cost", since there is no expected volume to divide by in the first place. It also depends on the complexity of the chip as well as the node. On a leading-edge node, tooling becomes much more expensive. Expect the cost of entry to be $300M+ (for 7nm, definitely higher now for 5nm and 3nm) for tools, excluding masks and other steps. Making a chip that is 99.9% SRAM would have near-zero R&D cost except for the mask. The complexity of your chip and design would dictate how many masks are required before your final production. However, multiple SKUs (or die variations) could share the cost of the masks, which is in the double-digit millions per run.

And none of these includes packaging and testing.

R&D dominates the cost of a chip once you factor in engineering costs, and hence you often see Qualcomm (or any other fabless company) chasing volume even though it is often not in their best interest in terms of margin.

And none of these includes the cost of IP, to ARM, IMG, CEVA, Lattice etc., or patents. Most of these are also charged per unit. And probably other things I can't recall off the top of my head right now.


First, I apologize for the confusion, I+D wasn't translated to English: R&D.

> What you wrote about perfume and juice are not unit "cost", but "price".

Maybe "cost" has a precise definition in Economics that I'm not aware of, but "price" is definitely not right in this context either. In my example, I can't see how the bottle and the box are not costs.

For me (maybe not correct, but that's what I meant anyway) "cost" is any expense that you incur to put the product in the hands of the final buyer. And cost per unit is total expenses divided by number of units. Feel free to use other words instead of "cost" and "cost per unit" that fit in these definitions.

Let me clarify the original question so it makes more sense. Thirty years ago computers had a single processor. Then they started to make multiprocessor and multicore CPUs. Now it seems that miniaturization is reaching its hard limits so they're wondering how to increase computing power.

One of the logical solutions is to put more processors in the same computer, not in the same chip or wafer (so defect rate is irrelevant here), but more like a "cluster in a box". I guess that would bring a number of problems like heat, but this kind of power doesn't seem the kind you want in your pocket.

So my question was: would this make economic sense? If "cost per unit" is rigid, it wouldn't. If it's driven by market or R&D, it might make sense.


I'm not an expert, or even an amateur, here, but defects are inevitable. So if you _need_ 100x the size without defects and one defect ruins the chip, the cost might be 10000x to produce.


"More multi-chip processor designs" != "single-chip processors have reached their limits".


Maybe we will start to optimize our software instead of expecting accelerating CPUs to eat our bloat.


Is this a solution to a yield problem? Making physically bigger dies is no problem. Wafers are much larger than the individual dies. If the dies are just being laid out flat, there's no density gain.

Multi-chip modules are nothing new. They've been used mostly when either there was a yield problem, or you wanted two different fab technologies. The latter is seen in some imagers and radars.


What we need much more from here on are deterministic processors. There are awesome optimizers out there, including programs with genetic algorithms that find the best way to do a micro task X or Y (part of a much bigger program).

IMO we have a ton of slightly-higher-hanging fruit that we can pick in terms of optimization but the relentless march of the X86 / X64 architecture obstructed that innovation.

Might be time to look inwards and start working more on squeezing the CPUs we have right now for maximum performance.


I hope we are going to get back to a more asymmetric multi-processing arrangement in the near term where we abandon the fiction of a processor or two running the whole show with peripheral systems that have as little smarts as possible and promote them to at least second class citizens.

These systems are much more powerful than when these abstractions were laid down, and at this point it feels like the difference between redundant storage on the box versus three feet away is more academic than anything else.


The problem is AMP is very hard to program and debug. In embedded, one core is a scheduler and another is doing some real-time task (like Arm big.LITTLE). In larger automotive heterogeneous compute platforms, typically they are all treated as accelerators, or with bespoke Tier-1 integration (or like NVIDIA Xavier). And on top of that, OEMs always want to "reclaim" those spare cycles when the other AMP cores are underutilized, which is nigh impossible to do, so they fall back to symmetric MP. I think embedded is the only place for this to work right now.

EDIT: I'm not an expert in this field but I have been asked to do work in this domain, and this narrow sampling is what I encountered, but I'd like to learn more about tooling and strategies for more generic AMP deployments.


M1 is an AMP design, as is every iPhone SoC. It works well although you’ll be surprised if you try to run an SMP workload on every single core.


Yes, I actually said this in my original post: Arm big.LITTLE works, which is what M1 is, and embedded works well, which are SoCs. It is much more difficult in AMP systems that aren't necessarily on the same die, like automotive Tier 1 products.


Hmm, I thought big.LITTLE implied that only one of the cores was active at once and all the tasks migrated from one to the other. M1 doesn't do that, all cores can be on all the time and they just start on one of them depending on priority.

Why does it have such a horrible name?


Considering how seemingly impossible it has been to get GPUs and SIMD units accessible enough for programmers, I don't think we're ready for more until compiler technology and languages mature significantly.


That kind of exists since most I/O devices have CPU cores in them, although usually hidden behind register-based interfaces. Apple has taken it a little further by using the same core everywhere and creating a standard IPC mechanism.


The old Commodore serial bus was like that ... Everything had 'smarts' built in


Reminds me of the processor in the film Terminator 2: https://gndn.files.wordpress.com/2016/04/shot00332.jpg


Not if the C20 spec has anything to do with it. Program as OS, assign case selections in a switch statement to threads (or distribute them to other 'single-chip' CPUs). Boundless networked single ASIC chip possibilities!


It's surprising it took that many cores before the limit was reached!


I wonder if this will finally restart any research into non-Von Neumann architecture even for commercial uses like workstations and servers.


So is UCIe a competitor/potential successor for something like Intel's QPI (or whatever they are using now)?


A chip with a semi-FPGA as well as semi-ASIC strategy could work. The FPGA dev toolchain needs to improve.


"Single-Chip Processors Have Reached Their Limits

Announcements from XYZ and ABC prove that chiplets are the future, but interconnects remain a battleground"

This could easily have been written 10 years ago, and I bet someone will write it in 10 years again.

We need these really big chips with their big powerful cores because the nature of the computing we do only changes very slowly towards being distributed and parallelizable, and thus able to use a massive number of smaller but far more efficient cores.


Correct. Also known as Rent's rule. According to Wikipedia, it was first mentioned in the 1960s: https://en.wikipedia.org/wiki/Rent%27s_rule
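
For anyone who hasn't seen it, Rent's rule is usually written as T = t * g^p, where T is the number of external terminals (pins), g is the number of gates or blocks, and t and p are empirical constants with p < 1. A rough Python sketch follows; the function name and the default values t = 4 and p = 0.6 are made-up illustrative numbers, not measurements from any real chip:

```python
def rent_terminals(gates: float, t: float = 4.0, p: float = 0.6) -> float:
    """Estimate external terminals via Rent's rule: T = t * g**p.

    The defaults for t and p are hypothetical placeholders for illustration.
    """
    return t * gates ** p


if __name__ == "__main__":
    # Terminals grow with gate count, but terminals per gate shrink when p < 1.
    for gates in (1_000, 1_000_000, 1_000_000_000):
        terminals = rent_terminals(gates)
        print(f"{gates:>13,} gates -> ~{terminals:,.0f} terminals "
              f"({terminals / gates:.6f} per gate)")
```

With p < 1 the pin count grows more slowly than the gate count, which is one way to read why the interconnect keeps coming back as the battleground as the logic on a die grows.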


You're implying you can't put big powerful cores on chiplets but that's not true at all.


Hardly - performance per core hasn’t flatlined, but it has not kept up the growth rate we traditionally had over decades. That’s the problem.

So if you want better aggregate performance, more cores has been the plan for a decade+ now.

FLOP/s per core or whatever other metric you choose to use.

Previously it was possible to get 20-50% or more performance improvements even year to year for a core.
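
To put rough numbers on that, compounding a fixed yearly gain is just (1 + rate)^years. A quick sketch, assuming a constant rate and a 10-year window (both simplifications of mine, as is the helper name), using the 20% and 50% figures above:

```python
def compound_speedup(yearly_gain: float, years: int) -> float:
    """Total speedup after `years` of a constant fractional yearly gain."""
    return (1.0 + yearly_gain) ** years


if __name__ == "__main__":
    # 0.20 and 0.50 are the yearly gains quoted above; 10 years is an assumed window.
    for gain in (0.20, 0.50):
        print(f"{gain:.0%}/year for 10 years -> {compound_speedup(gain, 10):.1f}x")
```

That works out to roughly 6x and 58x per core over a decade, which is the kind of growth you now have to chase with more cores instead.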


I wasn't talking about improvement at all. This was about big strong cores versus efficient cores, which is a tradeoff that always exists.

You could choose between 20 strong cores or 48 efficient cores on the same die space across four chiplets, for example.


Theoretically, only a problem if certified Turing complete.



