A Look at Cerebras Wafer-Scale Engine: Half Square Foot Silicon Chip (wikichip.org)
88 points by rbanffy on Nov 17, 2019 | 59 comments



Interesting things I’ve learned about this chip after a little sleuthing.

The CEO implies in the FYI podcast that it can handle models of at most ~4 billion parameters per wafer. Respectable, but not as large as I assumed would be possible when I first read about the scale of the chip.

CEO claims model parallelism will actually work well with these devices. Would be intriguing to know the limits of this.

Based on the cooling requirements, a clock speed of at least 1 GHz appears likely. If we take the heretical position that 1 parameter >= 1 synapse, then 4 billion parameters is about the size of a bee brain, and 1 GHz is about 5 million times faster than a bee brain.

It would take about 20000 such chips to simulate a human-sized network. Not economical, and presumably the model parallelism would break long before this. But it is interesting to note that under the not-implausible assumption that 1 parameter >= 1 synapse, we are only a few orders of magnitude away from human-sized networks training at 5 million times real time.
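Rough numbers behind those two estimates, purely back-of-the-envelope (the human synapse count and biological update rate are my assumptions, not Cerebras figures):

    params_per_wafer = 4e9     # claimed max model size per wafer
    human_synapses   = 1e14    # order-of-magnitude literature figure (assumption)
    chip_clock_hz    = 1e9     # assumed ~1 GHz clock
    neuron_rate_hz   = 200     # generous biological update rate (assumption)

    print(human_synapses / params_per_wafer)  # ~25,000 wafers, same order as the ~20000 above
    print(chip_clock_hz / neuron_rate_hz)     # ~5,000,000x real time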


OK, a stupid question. Since this ANN is ~5 million times as fast as animal neurons, couldn't it be "multiplexed" to simulate a much larger network in biological time (with massive state-storage memory, I presume)? I realize there will be propagation dependencies, but each dependency layer could be precomputed before the next, I would guess?

Or is there a reason that this won't work (or at least won't be worth it) for ANN structures?


> Or is there a reason that this won't work (or at least won't be worth it) for ANN structures?

If all the weights and activations fit in on-chip memory, then you can do calculations at close to 100% efficiency. If you want to simulate a 20k-times-larger network, you also need to transfer the 4 billion parameters per iteration, which would take a significantly longer time. In other words, you would be seriously bottlenecked.
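To put a rough number on that bottleneck (assuming fp16 weights and the 100 GbE link mentioned further down the thread as the off-chip path -- both my assumptions):

    params_per_slice = 4e9
    bytes_per_param  = 2                       # fp16 (assumption)
    link_bytes_per_s = 100e9 / 8               # 100 GbE is roughly 12.5 GB/s

    reload_s = params_per_slice * bytes_per_param / link_bytes_per_s
    print(reload_s)              # ~0.64 s just to swap in one wafer's worth of weights
    print(reload_s * 20_000)     # ~3.5 hours to stream a 20,000x larger model through once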


A synapse is responsible both for storing its weight, and for operating on its weight. Therefore, model size and throughput are interrelated numbers. If a synapse in a bee brain fires at 10 Hz on average (my guess), total events per second is 10 times the brain ‘capacity’. Max firing rate will be a lot larger, since most neurons are silent most of the time.

In silicon we separate out the storage from the compute. Generally we have much more storage than we have compute, and we route the data to the compute units as we need computation. Therefore ALUs have aggregate throughput of only a small fraction of the model capacity per cycle; at 400,000 cores with, say, 8 16-bit operations per cycle each, that is ‘only’ ~10^6 operations per cycle.

A better way to get a measure of computational throughput is to compare those values. Numbers are hard to find, but 10^10 synaptic events/s for a bee and 10^15 edge calculations/s for Cerebras seem to be the right sort of ballpark. That's a difference of about 10^5, somewhat less than your estimate of ~10^7.
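The arithmetic, using the guessed figures above (all of them rough assumptions, not measured numbers):

    bee_synapses     = 1e9       # rough bee brain size (assumption)
    bee_rate_hz      = 10        # assumed average firing rate
    cores            = 400_000
    ops_per_cycle    = 8         # 16-bit ops per core per cycle (guess)
    clock_hz         = 1e9       # assumed ~1 GHz

    bee_events_per_s = bee_synapses * bee_rate_hz        # ~1e10 synaptic events/s
    chip_ops_per_s   = cores * ops_per_cycle * clock_hz  # ~3e15 edge calculations/s
    print(chip_ops_per_s / bee_events_per_s)             # ~3e5, i.e. roughly 10^5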


If we could do a full-wafer analog neural network, that's probably another 2 orders of magnitude.

That means we'll need 200 such chips.

Why is the wafer so expensive? Buying a manufactured 16nm wafer from TSMC is, I think, $4k-5k.

And there are still a few nodes left. While cost per gate doesn't go down much, it could solve some of the external bandwidth issues.

But all those things are really expensive and very risky. Who knows...


$4-5k per wafer (sounds very low to me) doesn't say anything about yield losses due to manufacturing errors. The smaller a CPU is, the higher your yields on a wafer, because each defect destroys fewer chips.

When you have only a few giant chips per wafer, each error becomes devastating, taking out a large fraction of the wafer. Errors are so common I wouldn't be surprised if it took them 5-10 wafers per working chip, especially if they don't do any binning or didn't design the chip to fuse off bad sections like Intel CPUs do (a single i3/i5/i7/i9 line is usually the same exact chip, with cores that have too many manufacturing errors fused off, so perfect parts become i9s, less perfect parts become i7s, etc.).
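The standard Poisson die-yield model makes the point concrete. The defect density below is a number I made up for illustration, not a real TSMC figure; the die area is the published WSE figure (~46,225 mm^2):

    import math

    d0 = 0.1                                # defects per cm^2 (illustrative assumption)
    small_die_cm2 = 1.0                     # a typical CPU-sized die
    wafer_die_cm2 = 462.0                   # Cerebras WSE, ~46,225 mm^2

    print(math.exp(-d0 * small_die_cm2))    # ~0.90: most small dies come out clean
    print(math.exp(-d0 * wafer_die_cm2))    # ~1e-20: a defect-free wafer-scale die is hopeless,
                                            # so bad sections must be fused off instead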


No, that's not how you do giant chips. Even at the scale of CPUs, there are enough bad parts that it's uneconomical to throw away chips. What Intel/AMD do, for example, is make 4-core chips; if one core doesn't work, they sell it as a 3-core or 2-core part, and if some of the cache doesn't work, they sell it as a lower-cache version.

In the case of Cerebras, they have to disable the bad blocks to get chips that actually work.


Sure, but you still end up with area that is critical and shared and doesn't yield to this strategy, and you also end up with defects that are bad enough that you still can't employ a wafer with portions turned off.

So, someone like Cerebras has to both make as little critical as possible and buy much more expensive wafers with lower defect rates and still get only moderate yields.


>> buy much more expensive wafers with lower defect rates

How do you do that? Where do I read more about this?


When you buy wafers, particle counts at given sizes are specified, along with all kinds of other properties -- some determined through non-destructive testing (and usable for binning) and some through destructive testing (and usable for characterizing lots).

Better wafers cost more.


But remember the CPU market is extremely price-sensitive and Intel/AMD have huge volumes. Manufacturers of large chips don't necessarily need to worry about the extra complication of binning in order to yield.


This wafer scale tech manages yield issues internally, bypassing bad circuits.

That probably means their wafer yield is 100%.

As for wafer prices: https://anysilicon.com/major-pure-play-foundries-revenue-per...

It's $6k for all wafers below 20nm, which would suggest 16nm is somewhat cheaper than that.


Also, it does not mean 16nm is cheaper than 20nm... the smaller the feature size, the more the wafer costs. And this is aggregate revenue at the foundry, which ignores things like tapeout costs.

Not trying to crap on the tech, but it's not as simple as going to the fab and getting small quantities of full-custom, 100%-yielding chips for $6k each.


>That probably means their wafer yield is 100%.

That doesn't make any sense, because defects are an inherent part of the manufacturing process. If manufacturing errors make 1% of the area unavailable, then even a perfect yield could at most be 99%.


I would be VERY shocked to see 100% yield... there's only so much you can do on-chip, though this strongly depends on what your definition of a working chip is :-P


Are you implying that -- barring all the unknowns about how the brain works -- this chip already has enough capability to simulate a human-sized network at natural speed?


I don't think so, because you need 20000 of them, and one is going to cost on the order of a million quid (I'd guess half a million). Even if you could afford all that, the network connections between them will be the limiting factor.


So... Half the speed? Still good!


Hmm... more like 20000 ants connected in some kind of network, I'd say.


That's still something...

https://youtu.be/Bcs3_b3VXSU


Unfortunately that archived link doesn't extend to "page 2".

Some more info is easily found at Cerebras's site:

https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-...

> a very big chip, with memory close to cores, all connected by a high-bandwidth, low-latency fabric.

and at this site:

https://www.servethehome.com/cerebras-wafer-scale-engine-ai-...


Looks exactly like a mesh of Inmos Transputers except on a single chip! Back in the late 80s, I spent months trying to get backpropagation to work on a Meiko computing surface - but no real speedups were possible.


The website loads fine? Page two is more interesting. It's about yield, which is not discussed on the Cerebras website.



Seems a bit odd that it's down. This post has only a few upvotes right now, so the hug of death must have been pretty small.


I wonder if there are other hug of death sources on the internet for sites like Wikichip. It has certainly survived more popular articles on HN.


Agree. Very interesting read though!


I wonder how small a small HoD really is? 10k concurrent requests?


You are way high for almost anything but a company website with an engineering team dedicated to it. Yes, there are exceptions (my blog could handle orders of magnitude more traffic than that, but I had a multi-tier CDN in front of it with 100% full-page caching; did it need it? No, but it was fun to set up).

Typically speaking, most personal / non-professional sites going down from a HoD on HN probably got under 500 concurrent requests.


There is an academic effort that has been working on a similar concept for 8+ years. A recent paper that also discusses some of the challenges (routing around defects, ...) is https://www.frontiersin.org/articles/10.3389/fnins.2019.0120... . If you dig into the publications and PhD theses based on this system, you will also find partial answers to some of the issues raised here in the comments (how to interconnect reticles, power supply -- the first prototype had really scary centimeter-thick copper bars supplying the wafer). There is a second-generation system in development.


> Due to the complexity involved, Cerebras is not only designing the chip, but they also must design the full system. ... The WSE will come in 15U units with a chassis for the WSE and another one for the power and miscellaneous components. The final product is intended to act like any other network-attached accelerator over a 100 GbE.

So it's NIC-bandwidth bottlenecked. 100 GbE is in the same ballpark as PCIe 3, but last I heard, 40-100 GbE was pretty CPU-intensive compared to the alternatives.
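Quick sanity check on the "same ballpark" claim, using raw line rates before protocol overhead (my arithmetic):

    gbe100_gb_per_s = 100 / 8            # 100 GbE: 12.5 GB/s
    pcie3_gb_per_s  = 0.985 * 16         # PCIe 3.0 x16: ~15.8 GB/s per direction

    print(gbe100_gb_per_s, pcie3_gb_per_s)  # 12.5 vs ~15.8: same order of magnitude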


The performance jump blows the mind. If AI is going to evolve like that, our expectations will look rather silly compared to reality. It's kinda like, [collectively] we haven't a clue what we are doing.


How are they connecting the dies that span reticles? A WL-CSP flow? Or just the regular process, but with the exposure for some of the layers offset so that some fields overlap?


Still no benchmarks, so this doesn't add all that much to the earlier coverage, unfortunately.


>They re-purposed the scribe lines – the mechanical barrier between two adjacent dies that are typically used for test structures and eventually for strangulation...

Glad I don't work at that fab! Would much rather my dies be singulated than strangulated


Can't deep learning computations be structured such that a large number of smaller interconnected dies (but with the same total area) give the same performance as this wafer?


Is the reason this hasn't been considered for CPUs before now that Intel and AMD don't want to produce wafer-scale chips?


Making a chip this large is difficult, expensive, and error prone.

It blows past the reticle limit, so you end up having to do multiple (carefully aligned) exposures for adjacent chunks of the design. Signals can't really travel more than a few mm to a cm or so without significant degradation, so you end up needing to add buffers all over the place. Good yield gets exponentially harder with larger designs, since a larger area has a higher probability of overlapping with a defect, so you need to harden the design with redundancy and with parts that can be selectively disabled if they turn out defective, so you don't have to throw out the whole part.
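A sketch of how that redundancy rescues the yield math, using a standard Poisson defect model (the defect density is an illustrative assumption of mine, not Cerebras's actual number):

    import math

    d0       = 0.1                          # defects per cm^2 (assumption)
    chip_cm2 = 462.0                        # wafer-scale die area, ~46,225 mm^2
    lam      = d0 * chip_cm2                # ~46 expected defects per wafer

    def poisson_cdf(k, lam):
        return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

    print(poisson_cdf(0, lam))              # ~1e-20: chance of a perfect wafer
    print(poisson_cdf(100, lam))            # ~1.0: chance of <=100 defects, which a design
                                            # with ~1% spare cores and routing can absorb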

FPGAs have been pushing towards this kind of craziness, but there aren't huge advantages to doing this with a CPU, since the only thing you can do with so much more area is add cores and cache. The cost of going off-die to additional cores isn't so bad. On the other hand, trying to synthesize a multi-FPGA design (while meeting timing) is torture.


Why not, then, build a motherboard with many processor slots that can handle the linking, and use traditionally sized processors? I don't have the background here and am curious.


That's possible already, but once you get over certain limits it is usually done at the machine level rather than at the processor level (2-, 4- and even 8-socket machines exist, but they are expensive). The kinds of problems people tend to solve on such installations (typically clusters of commodity hardware) are quite different from the programs you run on your day-to-day machine -- think geological analysis, weather prediction and so on.

At some point the cost of the interconnect hardware dominates the cost of the CPUs. Lots of parties, for instance Ivan Sutherland (https://news.ycombinator.com/item?id=723882), have tried their hand at this, but so far nobody has been able to pull it off successfully.

Eventually it will happen, though; this is an idea that's too good to remain without sponsors for long.


2-socket servers are actually the norm and dominate datacenters. 4-socket is still common, but usually only for the ability to address a large amount of RAM in a single machine or for niche commercial workloads. 8-socket x86 servers are very unusual.


To add context, this is mostly because the NUMA properties get weird. With 2 sockets, all of the inter-socket links can go directly to the other processor, and Xeons currently have 2-3 of those. With 4 and 8 sockets you end up with strange memory topologies whose hops are less predictable unless you know your application was written for them.


Thanks for the explanation!


We are actually moving away from wafer-scale integration, because sticking to smaller chips makes you a lot more resilient to defects, which are becoming even more of a problem at recent nodes. Recent chips from both Intel and AMD are based on tiny "chiplets", connected via some sort of fast in-package interconnect.


Chiplets are just an admission that this is a hard problem. In the longer run the cost of interconnecting chiplets will go up again due to their number, and at some point the crossover will be reached and we're off to some variation of self-healing hardware all on one die.


This is a 5 kW chip. It's like trying to cool 2 kettles running continuously. Very difficult!

To even get power into the chip they have to have wires running perpendicular to the board - ordinary traces don't cut it.

This is a really far-out design that isn't appropriate for the mass market at all.
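The power-delivery arithmetic shows why (the core voltage is my assumption for a 16nm-class part, and the 5 kW figure is the one quoted above):

    power_w      = 5000          # quoted chip power
    core_voltage = 0.8           # typical-ish 16nm core voltage (assumption)

    print(power_w / core_voltage)   # ~6,250 A; planar PCB traces can't carry that
                                    # without enormous IR drop, hence vertical power feeds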


Two kettles in the UK, closer to three in the US, where circuits typically top out at 1800 W. I actually looked into the feasibility of installing some 240 V circuits in my kitchen and ordering kettles/blenders/etc. from Europe... too much work lol.


It's very easy with immersion cooling. The article mentions roughly 20 kW of heat per 15U, which doesn't come anywhere near the 100 kW+ for which immersion cooling has been designed.


More like three space heaters.


It was considered quite seriously for CPUs back in the 1980s (the article touches on this). The problem is that only perhaps 30% of the dies on the wafer will work, so you end up with lots of dead silicon that needs routing around.

A current problem, and it's unclear how Cerebras is handling it, is that CPUs and DRAM use different fabrication processes which I guess can't be mixed on a single wafer, and you don't want your CPU to be too far away from its RAM. Edit: it seems they're using SRAM, not DRAM, so that explains it, but it must be low-density, power-hungry memory.


SRAM tends to use less power per bit than DRAM does since it doesn't need to be refreshed, especially with 8T cells -- well, less power when idle and per bit read. The difference in speed is large enough, though, that SRAM running at max bandwidth will use more power than DRAM running at max bandwidth. And there are cases, possibly even the Cerebras chip, where the cost of getting the bit where you need it outweighs the cost of reading it, and the greater density of DRAM might make it more efficient since you don't have to spend energy transporting data as far.


But does that need to be a problem for ML applications? In theory the training should be able to compensate for some broken hardware by simply moving weights around.


But then you'll have a model that's specific to that chip and can't be applied on another.


Presumably the big challenge is cooling?

Interconnects and I/O are probably a challenge too.


Can someone provide info on how you cool this beast?


A "cold plate" of chilled water cooled metal in direct contact with the wafer.


If I were to build a box out of these, I'd do immersion cooling and place the chip(s) across the front of the machine behind laminated glass.

Keeping the fluid cool would be left to the ugly parts nobody would ever see.


Immersion cooling might not work since it depends on convection flow to move the heated liquid away from the plate. With a cold plate, pumps and pressure can be used to move the chilled water so that water in contact with the plate is always cool and heat is extracted and moved to the chiller more rapidly.


You could circulate the fluid. Since it'd be boiling, temperature would remain constant near the surfaces and bubbles would immediately displace hot fluid. A condenser and heat exchanger would recycle the vapor back into the tank.



