AMD's high-bandwidth memory explained (techreport.com)
82 points by geoffgasior on May 19, 2015 | 18 comments



It will be interesting to see how this holds up longer term. I read a great paper from ISSCC on making chips thin enough for stacking, and their conclusion was that it takes very little silicon to build most chips: thin enough that you could stack six old 130nm dies in the same overall thickness and get 22nm-class transistor density out of the stack of 130nm 'sheets'. Heat is still an issue of course, as is alignment (apparently if you heat one sliver before the others heat up, it can push itself out of alignment).

That said, I'd expect this to become the memory for laptops in the not too distant future. A Core i7 with 16GB of DRAM stacked on top of the CPU, and all that freed-up space used for more battery. Look for it in a Macbook near you :-)


I'd have thought the through-silicon vias that link the dies together as a bus would also act as a heatsink, balancing heat across the stack.

Also, with the lower clock speeds used there is less heat to deal with; heat may well be the current limiting factor, to avoid warping individual dies in the stack.

As for desktops etc., one avenue this does open up is perhaps a more standard socket: the CPU can change while in effect sitting on another socket of its own, allowing changes to be made at that level and maybe stretching sockets out to lifetimes some CPUs would normally never reach.


I had never looked at a GPU this way before, but it's essentially a "slocket". The name comes from the Pentium II and some Celerons, which mounted onto an add-on card that also carried the external L2 cache; the cache sat on the daughterboard for physical proximity and thus low latency. http://en.wikipedia.org/wiki/Pentium_II

Does this count as a 3DIC?

Also, obligatory: http://i.imgur.com/v7loIkF.jpg

I've seen GPUs advertised as having 128-bit buses, but I didn't know GDDR5 chips were only 32 bits wide. This feels a bit like the Rambus vs. DDR battle all over again: a high clock combined with a narrow bus caused higher latencies than a slower clock with a wider bus, which also had the added benefit of being cheaper overall.
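
For a rough sense of the width-versus-clock tradeoff, here's a back-of-the-envelope sketch in Python; the per-pin rates of 7 GT/s for GDDR5 and 1 GT/s for HBM are my own illustrative assumptions, not figures from the article:

    # Peak bandwidth = (bus width in bytes) * (transfers per second).
    # Per-pin transfer rates below are illustrative assumptions.

    def peak_bandwidth_gb_s(bus_width_bits, transfer_rate_mt_s):
        """Peak bandwidth in GB/s for a given width and transfer rate."""
        return (bus_width_bits / 8) * transfer_rate_mt_s / 1000

    # GDDR5-style: a narrow 32-bit chip clocked very high (~7000 MT/s per pin)
    gddr5_chip = peak_bandwidth_gb_s(32, 7000)     # ~28 GB/s per chip

    # HBM-style: a 1024-bit stack clocked low (~1000 MT/s per pin)
    hbm_stack = peak_bandwidth_gb_s(1024, 1000)    # ~128 GB/s per stack

    print(f"GDDR5 chip: {gddr5_chip:.0f} GB/s, HBM stack: {hbm_stack:.0f} GB/s")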


A slocket is a PCB with an edge connector, just like current graphics cards.

This is a multi-chip module, or MCM (http://en.wikipedia.org/wiki/Multi-chip_module). Intel also uses them to put the north bridge on the same package as the CPU for its mobile CPUs.

There are several innovative aspects to this MCM, but MCMs have been around for a long time.

A single GDDR5 chip is 32 bits wide. GPUs use many chips in parallel to achieve wider buses.
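
To make that concrete, here's a tiny sketch of how advertised bus widths map to chip counts, assuming one 32-bit channel per GDDR5 chip (the widths listed are just common examples, not figures from the article):

    # Each GDDR5 chip contributes a 32-bit channel; the GPU's total bus
    # width is the sum of the chips wired in parallel.

    GDDR5_CHIP_WIDTH_BITS = 32

    def chips_needed(total_bus_width_bits):
        """Number of 32-bit chips required for the given total bus width."""
        return total_bus_width_bits // GDDR5_CHIP_WIDTH_BITS

    for bus in (128, 256, 384, 512):
        print(f"{bus}-bit bus -> {chips_needed(bus)} chips in parallel")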


I guess the difference here is that the substrate is another silicon chip rather than a PCB or something similar. Pretty cool. There must have been some significant engineering challenges.


I see HBM evolving into an external cache, with off-substrate RAM eventually working its way back into cards.


I think you're wrong. Everything in semiconductor history to date has been focused on integrating more on-chip, or at the very worst, on-package. This puts the memory as close to the chip as it can get without driving the chip's cost up astronomically. There's absolutely nothing to suggest they'll de-integrate moving forward.

The biggest problem after this is how much the GPU and CPU have to fight over main system memory, which really brings us to the endgame of GPUs altogether. Sooner or later there won't be room enough for both in the picture; your single heterogeneous core or MCM will have both a CPU and a GPU on it (and probably half a dozen or more application-specific accelerators).


I wasn't suggesting there would be any de-integration, just that off-substrate memory will be reintroduced in addition to HBM.

If you need a point of reference, take a look at L2 cache. It was originally chips on the motherboard; then with the PII it was put on a processor card; then with the PIII it was eventually integrated into the die.

The exact same thing happened with L3 cache.


Yup. Memory sticks are the new tape (in the volatile storage hierarchy). Definitely potential for (re-)proliferation of NUMA optimizations.

The questions are whether G/CPUs have:

- multiple or single SKUs for on-package RAM capacities

- addon RAM ability via sticks, sockets and/or surface-mount


I assume there will be multiple SKUs for on-package RAM capacities. Neither Intel nor AMD has ever had qualms about adding more SKUs to their lineup.

It would come as no surprise if add-on RAM became a feature of the higher-end CPUs (e.g. i7, Xeon, Phenom, whatever).


Looks like 3D is going to be 'the next big thing' in DRAM, one way or the other.

We are going to have AMD HBM [1], NVIDIA stacked DRAM [2] and Hybrid Memory Cube [3]. But why do we need all three, when the latter is supposed to be a standard? Or are some of these actually duplicates?

[1] https://en.wikipedia.org/wiki/High_Bandwidth_Memory

[2] https://en.wikipedia.org/wiki/GeForce_1000_series

[3] http://en.wikipedia.org/wiki/Hybrid_Memory_Cube


AMD and Nvidia are both using HBM AFAIK. AMD is just farther along.


my guess: patents


This reminds me of the good old PlayStation 2's "Emotion Engine" hardware; it had a 2,560-bit wide memory bus in the GPU. Of course, that's three independent buses: read (1,024), write (1,024), and read/write (512), but it was still pretty wild back in the year 2000.
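
For fun, a rough check of what that bus width worked out to in bandwidth; the ~150 MHz clock is my assumption (the Graphics Synthesizer ran at roughly 147 MHz), so treat the result as approximate:

    # Aggregate eDRAM bandwidth = total bus width (in bytes) * clock rate.
    # The three buses described above sum to 2,560 bits.

    CLOCK_HZ = 150e6                     # assumed ~150 MHz GS clock
    buses_bits = {"read": 1024, "write": 1024, "read/write": 512}

    total_bytes_per_cycle = sum(buses_bits.values()) / 8    # 320 bytes
    bandwidth_gb_s = total_bytes_per_cycle * CLOCK_HZ / 1e9

    print(f"{sum(buses_bits.values())}-bit aggregate -> ~{bandwidth_gb_s:.0f} GB/s")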


To some extent, that's not really that impressive. AMD gets access to a breakthrough technology like HBM and 2.5D silicon interposers, and all we get is just a measly 50% improvement in memory bandwidth?

A more interesting configuration would be attaching 12 1GB HBM stacks to the GPU, achieving a memory bandwidth of 128GB/s * 12 = 1.5TB/s (it would increase power by only about 30W over the current model).
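
Spelling out the arithmetic in a quick sketch, with 128 GB/s per first-gen HBM stack and the current card assumed to use 4 stacks (my assumption, not a figure from the article):

    # Aggregate HBM bandwidth scales linearly with the number of stacks.
    PER_STACK_GB_S = 128

    for stacks in (4, 12):
        total = PER_STACK_GB_S * stacks
        print(f"{stacks} stacks: {total} GB/s (~{total / 1000:.1f} TB/s)")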

Maybe their GPU is too weak to support such massive memory bandwidth, and it would be quite hard to do so?


I have no idea how you could possibly choose "measly" as the adjective to pair with "50% improvement in memory bandwidth" (with the part about "using half as much power" being suspiciously missing from your summary).


according to the source, the issue is with the chip size:

One problem with HBM is especially an issue for large GPUs. High-end graphics chips have, in the past, pushed the boundaries of possible chip sizes right up to the edges of the reticle used in photolithography. Since HBM requires an interposer chip that's larger than the GPU alone, it could impose a size limitation on graphics processors. When asked about this issue, Macri noted that the fabrication of larger-than-reticle interposers might be possible using multiple exposures, but he acknowledged that doing so could become cost-prohibitive.

Fiji will more than likely sit on a single-exposure-sized interposer, and it will probably pack a rich complement of GPU logic given the die size savings HBM offers. Still, with HBM, the size limits are not what they once were.
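
As a rough illustration of why the reticle matters, here's an area-budget sketch; the reticle field and die sizes are my own ballpark assumptions, not figures from the article:

    # Area budget for a single-exposure interposer. All sizes are rough
    # assumptions: ~858 mm^2 reticle field (26 mm x 33 mm), a ~600 mm^2
    # high-end GPU die, and ~40 mm^2 per HBM stack.

    RETICLE_MM2 = 26 * 33      # ~858 mm^2 maximum exposure field (assumed)
    GPU_MM2 = 600              # large high-end GPU die (assumed)
    HBM_STACK_MM2 = 40         # roughly 5.5 mm x 7.3 mm per stack (assumed)
    STACKS = 4

    used = GPU_MM2 + STACKS * HBM_STACK_MM2
    print(f"GPU + {STACKS} stacks: {used} mm^2 of a {RETICLE_MM2} mm^2 field")
    print(f"Left over for spacing and routing: {RETICLE_MM2 - used} mm^2")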


Offhand, I'd be more worried about the interference. 1.5TB/s is a large amount of bandwidth to have running in close proximity along adjacent connectors.



