AMD 3D Stacks SRAM Bumplessly

g0xA52A2A · on June 9, 2021

Previous discussion of the AnandTech article on this: https://news.ycombinator.com/item?id=27350632

twotwotwo · on June 9, 2021

The BIOS on an AMD reference platform refers to there being 1, 2, or 4 "X3D stacks", suggesting that eventually you might be talking about more cache than this: https://twitter.com/aschilling/status/1399701274489151489

(Who knows when/if they get 2/4-high working and at what price: the I/O die needs to be able to track tags proportionate to the total cache size, the compute chiplet needs to support it, and packaging/power/everything else needs to work out. Switch in BIOS doesn't mean the rest is there.)

Other clever thing about this for them is they can keep focusing on one base compute chiplet but turn it into a wider range of SKUs, on top of how they already do that with core count etc. Same chiplet can end up in a product with 32 or 96MB L3/CCD as soon as the first one-high stack comes out and, obviously, more if/when they get 2/4-high going.
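
A rough sketch of that SKU arithmetic, using the 32MB and 96MB figures from the comment above. The 64MB-per-layer value is only inferred from the difference between those two numbers, and linear scaling to 2-high and 4-high stacks is an assumption, not something AMD has confirmed. The function name is hypothetical.

```python
# Hypothetical illustration: L3 capacity per CCD for different stack heights.
BASE_L3_MB = 32       # on-die L3 per CCD, as quoted in the comment
STACK_LAYER_MB = 64   # assumed: inferred from 96 MB - 32 MB for a one-high stack


def l3_per_ccd(layers: int) -> int:
    """Total L3 per CCD in MB, assuming capacity scales linearly with layers."""
    if layers < 0:
        raise ValueError("layers must be non-negative")
    return BASE_L3_MB + layers * STACK_LAYER_MB


# Stack heights mentioned in the thread: none, 1-high, 2-high, 4-high.
capacities = {layers: l3_per_ccd(layers) for layers in (0, 1, 2, 4)}
for layers, mb in capacities.items():
    print(f"{layers}-high stack: {mb} MB L3 per CCD")
```

Under those assumptions the same chiplet would span four cache tiers without any change to the base die.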

baybal2 · on June 9, 2021

They already have 4 dies in 1 layer, just some of them being dummy, or possibly dead dies.

touisteur · on June 9, 2021

Or more thermally ideally placed. I had an EPYC 7371 with 16 cores like the 7351, but with a higher base frequency: 3.1GHz against 2.4GHz (more €€€ too!). The thing was a beauty.

Sephr · on June 9, 2021

You can also stack SRAM bumplessly using the wireless ThruChip Interface: https://en.wikichip.org/wiki/thruchip_interface

Escapado · on June 9, 2021

Interesting technology. Is everyone using it? And if not (I recall tsv being used a lot) then why? The wiki entry paints a very positive picture here.

Sephr · on June 10, 2021

Apparently Japan withdrew funding from the company developing this[1]. Hopefully another company can pick this up eventually to see if it's worth using. iirc the fraud was financial and the tech was legit, but please correct me if I'm wrong.

1. https://fuse.wikichip.org/news/1204/pezys-future-is-uncertai...

jleahy · on June 9, 2021

Large size and poor thermals, if I had to guess.

jl2718 · on June 9, 2021

I do not understand on-chip wireless. Why build an antenna for each channel? You could just multiplex high-impedance RF into any common wire and it’s the same thing.

monocasa · on June 9, 2021

Because it works better. The difference between two very close antennas and two halves of a transformer is just a matter of tuning and opinion.

jl2718 · on June 10, 2021

Well, yeah, I’m with you, but, you don’t need a transformer. You need a HF short. That’s easy. You can hook up a common conductor any way you want, connect as many independent channels as you want, put a big resistor from signal to ground at each terminal, and it works the same but smaller and lower power. Make the wire slightly resistive if you are concerned about reflection loops. I can’t think of a situation where signal couldn’t be improved by some kind of conduction.

monocasa · on June 11, 2021

Ah, I thought you were talking about the area for the loops.

The idea behind this is that running conductors through the substrate has its own issues, which ironically make it a lot more complicated than little transformers like this. It's all materials-science kind of stuff about how the extra steps to bond the layers conductively run into yield issues. They also thermally flex in weird ways, reminiscent of 2007-era BGAs that detached themselves from boards after the shift to RoHS. But way more fragile, and you can't fix it with a run through the reflow oven.

Obviously they've gotten past some of those issues in the past few years.

kabdib · on June 9, 2021

Is this affected by external magnetic fields, or other interference that a hard-wired connection is immune to?

I'm not in the habit of waving magnets around near CPUs, but I worry about susceptibility to EMI and transients.

lazide · on June 9, 2021

The only difference between a handheld magnet, radio waves, inductance, etc. is speed, precision, and magnitude.

So in practice it is highly unlikely a handheld magnet is going to do anything, for the same reason a cloud passing overhead isn't going to break a rock. The difference in air pressure is too low, the energy gradient too gradual, and it’s too far away anyway.

Take the same aggregate amount of energy into a tank of compressed air, and use a jackhammer and that rock is dust in no time.

Same physics principles; radically different impact in practice.

Assuming the magnet you have isn’t the electromagnetic version of a tornado and you aren’t waving it around at several thousand RPM anyway.

zelon88 · on June 10, 2021

What about size? Does the advantage disappear if you need to bundle a large number of these on a chip?

rbanffy · on June 9, 2021

I remember Sun working on wireless interconnects, but I think it was horizontal, inter-package.

oblak · on June 9, 2021

Since I am not exactly in the industry, it's always funny looking at these diagrams. Actual cores are so tiny compared to various caches, SIMD blocks and what have you

cogman10 · on June 9, 2021

Yup, been like that for a while. The vast majority of transistors in a CPU are dedicated to memory. Only a real tiny fraction are dedicated to logic.

lacksconfidence · on June 9, 2021

Perhaps particularly interesting about this approach by AMD is that the typical memory on a CPU die isn't as dense as it can be, because of the processes they have to apply to the wafer to also build the CPU transistors. With AMD moving to separate chips they can use a process that builds denser memory than what is typically seen on a cpu.

hinkley · on June 9, 2021

I would think this helps with yield issues on your new manufacturing node as well.

If you have a 33% yield on one new chiplet that doesn't triple the price per unit for the package.
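
A back-of-the-envelope sketch of that point. Every number below is hypothetical (wafer cost, dies per wafer, cost of the rest of the package), and it assumes dies are tested before assembly so only known-good dies are packaged, with no extra loss from the stacking step itself.

```python
def cost_per_good_die(wafer_cost: float, dies_per_wafer: int, yield_fraction: float) -> float:
    """Wafer cost spread over only the dies that pass test."""
    if not 0.0 < yield_fraction <= 1.0:
        raise ValueError("yield_fraction must be in (0, 1]")
    return wafer_cost / (dies_per_wafer * yield_fraction)


# Hypothetical inputs, chosen only to illustrate the arithmetic.
WAFER_COST = 9000.0        # cost of one wafer on the new node
DIES_PER_WAFER = 600       # small chiplets, so many per wafer
REST_OF_PACKAGE = 100.0    # other dies, substrate, assembly, test

mature = cost_per_good_die(WAFER_COST, DIES_PER_WAFER, 0.99)
early = cost_per_good_die(WAFER_COST, DIES_PER_WAFER, 0.33)

# The poorly yielding chiplet costs three times as much per good die...
chiplet_ratio = early / mature
# ...but the whole package rises by much less, because the chiplet is
# only one part of the total cost.
package_ratio = (REST_OF_PACKAGE + early) / (REST_OF_PACKAGE + mature)

print(f"chiplet cost ratio: {chiplet_ratio:.2f}x")
print(f"package cost ratio: {package_ratio:.2f}x")
```

How far the package ratio sits below the chiplet ratio depends entirely on how large the new chiplet's share of the total cost is.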

nine_k · on June 9, 2021

Yields are normally kept high by having extra device blocks, and achieving a working config by cutting some links on the die with a laser.

Some chips get downgraded in the process: if you can't sell a 4-core CPU with 8MB of L1 cache, you can disable the core that fails tests, and / or disable the parts of the cache that fail tests, and sell a 3-core part with 4MB cache; AMD did just that back in the day.

mschuster91 · on June 9, 2021

> Some chips get downgraded in the process: if you can't sell a 4-core CPU with 8MB of L1 cache, you can disable the core that fails tests, and / or disable the parts of the cache that fail tests, and sell a 3-core part with 4MB cache; AMD did just that back in the day.

How does that work anyway? I mean, how is a processor actually tested at the pre-packaging stage, given that you'd need to provide it with power and cooling for a test?

mechagodzilla · on June 9, 2021

They use a wafer tester that literally has a tiny bed-of-nails array on it that contacts the bumps for a given die on the wafer, and can move over the wafer in x and y. It provides power and test signals, although power is typically quite limited (<100A of current or so). The good die will then go on to get packaged and further tested before being assembled onto a PCB.

doikor · on June 9, 2021

A camera takes photos and compares them with how the die should look.

colejohnson66 · on June 9, 2021

I'd assume they attach it to the substrate, glue on the IHS (the lid), and run diagnostics on the completed chip. If any issues are detected, the processor would contain functions to disable certain blocks (through JTAG or something).

cogman10 · on June 9, 2021

Hard to really say if it will negatively or positively impact yield.

You should be able to cram in more chips per wafer. But, you might see more of those turn out to be duds due to the more complex layers. This 3d stacking has an amplifying effect to flaws in lower layers.

We'll see if Zen 4 or Zen 5 has 100MB caches... that'll be the true test.

derefr · on June 9, 2021

The proper yield comparison for TSV wouldn't be against the one-chip, less-stuff version, though. It'd be against what you'd have to do to achieve the same capacities without TSV: a multiplication of the number of mask layers per chip, to produce a single extremely "tall" chiplet. That'd be an extremely low-yield process (which is why nobody's doing it.)

rbanffy · on June 9, 2021

That's so true. In college I designed a discrete stack-based CPU and by far the biggest chip count was in the microcode EPROMs and SRAM for the register file. In those days RAM was fast, so it didn't have a cache (which would be even more SRAM).

55873445216111 · on June 9, 2021

Really impressive and a pleasantly surprising feature from AMD. I imagine the real target market for this is EPYC server CPUs. Eventually they might be able to cut the L3 cache out of the CCD die completely and rely only on external SRAM die as L3. This would give AMD a very flexible portfolio where they can offer different SKUs with different amounts of L3 cache at different price points, all using the exact same CCD die. What will be interesting is to see how much the die stacking impacts thermals.

_xy8h · on June 9, 2021

Also consider that cache is typically a substantial portion of the die. That's some impressive cost savings: you can etch the wafer without having to worry that the extra cache area will result in a defective chip. Smaller chips, more chips on a wafer, less loss overall per defect.
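
One way to put rough numbers on this is the simple Poisson yield model, Y = exp(-A * D0), where A is die area and D0 is defect density. The sketch below uses that textbook model with made-up die areas and a made-up defect density. The gross die count ignores edge loss and scribe lines, so it overestimates real counts. None of the figures are AMD's.

```python
import math


def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of defect-free dies under the Poisson model Y = exp(-A * D0)."""
    area_cm2 = die_area_mm2 / 100.0  # 100 mm^2 per cm^2
    return math.exp(-area_cm2 * defects_per_cm2)


def gross_dies(die_area_mm2: float, wafer_diameter_mm: float = 300.0) -> int:
    """Crude upper bound: wafer area divided by die area, ignoring edge loss."""
    wafer_area = math.pi * (wafer_diameter_mm / 2.0) ** 2
    return int(wafer_area // die_area_mm2)


D0 = 0.1  # hypothetical defects per cm^2

# Hypothetical die sizes: a compute die with and without extra cache area.
for label, area in (("smaller die", 80.0), ("larger die with more cache", 120.0)):
    y = poisson_yield(area, D0)
    n = gross_dies(area)
    print(f"{label}: {area:.0f} mm^2, yield {y:.1%}, about {n * y:.0f} good dies per wafer")
```

The smaller die wins twice in this model: more candidates fit on the wafer, and each one is less likely to contain a defect.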

gameswithgo · on June 9, 2021

the first thing they did in the presentation was show a 12% gaming FPS uplift.

market is anyone who wants faster stuff.

But yes, databases will love it, so will compilers.

phkahler · on June 10, 2021

>> Eventually they might be able to cut the L3 cache out of the CCD die completely and rely only on external SRAM die as L3.

That might also reduce latency? They've got all these cores around the cache, but in this case it would be directly above all the cores. Seems like it could reduce the maximum distance to cache. But then you'd need even more cores to keep the die sizes similar :-)

oscardssmith · on June 9, 2021

It wouldn't surprise me if they have a really expensive EPYC chip with 8 cores and 1GB of cache. A chip like that would be incredible for applications with per-core licensing.

tromp · on June 9, 2021

1GB on-chip SRAM is quite feasible at 7nm according to several companies designing a single chip ASIC for Grin's Cuckatoo32 Proof-of-Work, that needs 512MB of sequential access memory and 512MB of pure random access memory for a maximally efficient solver [1].

[1]https://forum.grin.mw/t/cuckatoo32-feasibility

fomine3 · on June 10, 2021

Also for APU. APU's GPU performance is heavily constrained by RAM bandwidth.