Nvidia H200 Tensor Core GPU (nvidia.com)
132 points by treesciencebot on Nov 13, 2023 | 123 comments



The H200 GPU die is the same as the H100, but it's using a full set of faster 24GB memory stacks:

https://www.anandtech.com/show/21136/nvidia-at-sc23-h200-acc...

This is an H100 141GB, not new silicon like the Nvidia page might lead one to believe.
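
For a sense of scale, a quick back-of-envelope in Python using the public spec-sheet numbers (treat them as approximate):

    # Back-of-envelope from the public spec sheets (approximate; H100 numbers
    # are for the SXM part).
    h100 = {"hbm_gb": 80, "bw_tb_s": 3.35}   # HBM3
    h200 = {"hbm_gb": 141, "bw_tb_s": 4.8}   # HBM3e, nominally six 24GB stacks

    print(f"capacity:  {h200['hbm_gb'] / h100['hbm_gb']:.2f}x")   # ~1.76x
    print(f"bandwidth: {h200['bw_tb_s'] / h100['bw_tb_s']:.2f}x") # ~1.43x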


It is remarkable how much GPU compute is limited by memory speed.


Depends on the workload.

Sometimes things really are compute bound, and sometimes you get a "big" workload that still fits nicely in the GPU's L2. Generative AI is mostly at the far end of "memory bound."
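
A rough way to see where a given kernel lands is to compare its arithmetic intensity against the hardware's ridge point; a sketch with rough, H100-class placeholder numbers:

    # Minimal roofline-style estimate: a kernel is memory-bound when its
    # arithmetic intensity (FLOPs per byte moved) is below the hardware's
    # "ridge point" (peak FLOP/s divided by memory bandwidth).
    peak_flops = 1.0e15   # ~1 PFLOP/s dense FP16/BF16, roughly H100-class
    mem_bw = 3.35e12      # ~3.35 TB/s HBM bandwidth

    ridge = peak_flops / mem_bw   # ~300 FLOPs/byte needed to be compute-bound

    def gemv_intensity(n, bytes_per_elem=2):
        # Matrix-vector product (e.g. batch-1 decoding): ~2*n*n FLOPs, but the
        # whole n*n weight matrix has to be streamed in from memory.
        return (2 * n * n) / (n * n * bytes_per_elem)   # ~1 FLOP/byte

    print(f"ridge point: {ridge:.0f} FLOPs/byte")
    print(f"GEMV intensity: {gemv_intensity(8192):.1f} FLOPs/byte")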

Some ML startups (like Graphcore) seemed to bet on large caches, sparsity and clever preprocessing instead of raw memory bandwidth, but I think their strategy was compromised when model sizes exploded. Even Cerebras was kinda caught off guard when their 40GB pizza was suddenly kind of cramped.


Current ML architectures tend to be heavily optimized for ease of very large scale parallelism in training, even at the expense of a bigger model size and compute cost. So there may be some hope for different architectures as we stop treating idle GPUs as being basically available for free and start budgeting more strictly for what we use.


In hlb-CIFAR10 the MaxPooling ops are the slowest kernels now and take longer than all 7 of the convolution operations combined, if I understand correctly.

Memory-bound operations seem to rather consistently be the limiting factor in my personal ML research work, at least. It can be rather annoying!


In many cases it's the same for CPUs. AMD's CPUs with bigger caches, thanks to 3D-stacked cache dies, are in a different league when it comes to perf.


What would make [HBM3E] GPU memory faster?

High Bandwidth Memory > HBM3E: https://en.wikipedia.org/wiki/High_Bandwidth_Memory#HBM3E


Compared to HBM3, you mean?

The memory makers bump up the speed the memory itself is capable of through manufacturing improvements. And I guess the H100 memory controller has some room to accept the faster memory.


More technically, I suppose.

Is the error rate due to quantum tunneling at so many nanometers still a fundamental limit to transistor density and thus also (G)DDR and HBM performance per unit area, volume, and charge?

https://news.ycombinator.com/item?id=38056088 ; a new QC and maybe in-RAM computing architecture like HBM-PM: maybe glass on quantum dots in synthetic DNA, and then still wave function storage and transmission; scale the quantum interconnect

Is melamine too slow for >= HBM RAM?


My understanding is that while quantum tunneling defines a fundamental limit to the miniaturization of silicon transistors, we are still not really near that limit. The more pressing limits are around figuring out how to get EUV light to consistently draw denser and denser patterns correctly.


From https://news.ycombinator.com/item?id=35380902 :

> Optical tweezers: https://en.wikipedia.org/wiki/Optical_tweezers

> "'Impossible' photonic breakthrough: scientist manipulate light at subwavelength scale" https://thedebrief.org/impossible-photonic-breakthrough-scie... :

>> But now, the researchers from Southampton, together with scientists from the universities of Dortmund and Regensburg in Germany, have successfully demonstrated that a beam of light can not only be confined to a spot that is 50 times smaller than its own wavelength but also “in a first of its kind” the spot can be moved by minuscule amounts at the point where the light is confined

FWIU, quantum tunneling is regarded as an error to be eliminated in digital computers, but it may be a sufficient quantum computing component: cause electron-electron wave function interaction and measure. But there is only a zero-or-one readout in adjacent RAM transistors. Lol, "Rowhammer for qubits".


"HBM4 in Development, Organizers Eyeing Even Wider 2048-Bit Interface" (2023) https://news.ycombinator.com/item?id=37859497


For anyone wondering how this applies to big LLMs: 144GB is big, but you'd need roughly double this to train GPT-3.x while fitting everything in memory at once.

Of course, even if 300GB GPUs were available tomorrow and you sold a million-dollar house to buy as many as that would allow, it'd still take years to train once.
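
For scale, a back-of-envelope (my numbers, and activations are ignored, so reality is worse):

    # Rough memory math for a 175B-parameter model (activations ignored).
    params = 175e9
    weights_fp16_tb = params * 2 / 1e12    # ~0.35 TB just to hold fp16 weights
    adam_state_tb = params * 16 / 1e12     # ~2.8 TB with fp32 master weights
                                           # plus Adam moments (ZeRO-style accounting)
    print(f"fp16 weights: ~{weights_fp16_tb:.2f} TB")
    print(f"mixed-precision Adam training state: ~{adam_state_tb:.1f} TB")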



I dunno. But that's a dual-GPU product, so it's not really 180GB.


They are also doing a custom 4 chip Hopper GPU for a European HPC initiative. Seems like a good move. Basically allows you to have 4 H100s connected via NVLink without needing a separate SXM board.


I'm holding out for 8 x MI300x on a single board. =)


This is a single-chip H100 NVL. Both are GH100s with the same tweaked, 20% wider 6144-bit HBM3e setup (versus 5120-bit on other H100s) running at a higher speed.

The HBM3e loadout is slightly different than what the H100 NVL's was going to be, but this definitely seems like a higher-binned H100. It's basically as if AMD had shipped the 7900 XT, then later started selling the 7900 XTX: same chip, but they brought up all the memory controllers on this one.


I'm curious: Do you think there is a realistic chance for another chip maker to catch up and overtake NVidia in the AI space in the next few years or is their lead and expertise insurmountable at this point?


This is launched in response to MI300X, and this should still not be enough to match AMD's product. This launches 2 quarters after MI300X, but B100 should arrive before AMD's MI400 generation.


AMD always launches impressive hardware specs. But they are way behind in software, which is more important than the hardware.


StableHLO[1] and IREE[2] are interesting projects that might help AMD here, from [1]:

> Our goal is to simplify and accelerate ML development by creating more interoperability between various ML frameworks (such as TensorFlow, JAX and PyTorch) and ML compilers (such as XLA and IREE).

From there, their goal would most likely be to work with the XLA/OpenXLA teams on XLA[3] and IREE[2] to make ROCm a better backend.

[1] https://github.com/openxla/stablehlo

[2] https://github.com/openxla/iree

[3] https://www.tensorflow.org/xla


If AMD launches hardware that is clearly faster, the software will move towards it


That's exactly what the CUDA monopoly is meant to prevent, and as a fervent supporter of OpenCL (with two commercial apps), this is exactly the case I always make: even if some GPU came out tomorrow costing $0 and with infinite performance, all these people who paint themselves into a corner are hosed.

Not that anyone cares, and everyone keeps using CUDA while simultaneously complaining about Nvidia GPU prices, as if those two things have nothing to do with each other...


Have you had good experience with this for portability though? On what classes of hardware and OS?

I did a bit of work in OpenCL almost 10 years ago, and found it decently portable on a range of NVIDIA GPUs as well as Intel iGPUs. On the high end I used something like the Titan X while on the low end it was typical GPUs found in business class laptops.

But my limited exposure to AMD was terrible by comparison. Even though I am away from that work now, I still tend to try to run "clpeak" and one of my simpler image processing scripts on each new system. And while I liked a Ryzen laptop for general use or even games, it seemed like OpenCL was useless there. It seemed my best option was to ignore the GPU and use Intel's x86_64 SIMD OpenCL runtime.


Yes, even ~2012 OpenCL code works incredibly well today for spectral path tracing: https://indigorenderer.com/indigobench

Also my fractal software incl OpenCL multi-GPU / mixed plaftorm rendering: https://chaoticafractals.com/

Both work on [ Nvidia, AMD, Intel, Apple ] x [ CPU, GPU ].

Some of the shared code here: https://github.com/glaretechnologies/glare-core

Don't let anyone tell you OpenCL is dead! Keep writing OpenCL software!!
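
If anyone wants to sanity-check which OpenCL platforms and devices a new machine exposes, a minimal sketch assuming the pyopencl package is installed:

    # List every OpenCL platform/device the installed runtimes expose
    # (clinfo/clpeak report similar information).
    import pyopencl as cl

    for platform in cl.get_platforms():
        print(platform.name, platform.version)
        for dev in platform.get_devices():
            print("  ", dev.name,
                  f"CUs={dev.max_compute_units}",
                  f"global_mem={dev.global_mem_size // 2**20} MiB")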


AIUI, your current best bet for good OpenCL implementation on less-than-cutting-edge AMD hardware is the Mesa Project's RustiCL work.


My own understanding is that OpenCL is semi-obsolete at the moment (although newer standards revisions are still coming out, so this may change in the future) with forward-looking projects mostly targeting Vulkan Compute or SYCL.

(There are some annoying differences in the low-level implementations of OpenCL vs. Vulkan Compute, due to their being based on SPIR-V compute "kernels" vs. "shaders" respectively, that make it hard for them to interop cleanly. So that's why the choice can be significant.)


OpenCL tooling has always been bad compared with CUDA's, and that isn't Nvidia's fault; it's Intel's and AMD's.

Only C was ever really supported; C++ and Fortran were never taken seriously enough, and other language stacks were never considered.

Thus everyone who enjoyed programming in anything other than C, with great libraries and graphical debuggers, flocked to CUDA. It remains to be seen whether SYCL and SPIR-V will ever matter enough to win some of those folks back.


The probability of a processor emerging that is so much cheaper that it warrants moving over, but that also its API is the OpenCL that you already target is very low. Aim for what is probable, not what is theoretically possible.


Most people who utilize this hardware aren't programming kernels directly for the GPU; they're using abstraction layers like PyTorch, TensorFlow, etc. For the developers of those kinds of frameworks, CUDA itself offers a lot of libraries like cuBLAS.

There are relatively few people capable of implementing these frameworks without a solid CUDA-like foundation, and those that do exist would need a very strong incentive to do it.
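
To illustrate how thick that abstraction layer is, the code most practitioners actually write never mentions CUDA at all; a minimal PyTorch sketch:

    # Typical user-level code: backend selection is one line, and everything
    # below it is vendor-neutral framework code.
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(1024, 1024).to(device)
    x = torch.randn(32, 1024, device=device)
    y = model(x)   # dispatched to cuBLAS/cuDNN (or CPU kernels) under the hood
    print(y.shape, device)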


If they were allowed to get significantly ahead that status quo would likely be disrupted pretty fast.


I thought CUDA was NVIDIA’s moat. Is that no longer the case, or did AMD come up with a good alternative?


The vast majority of work in ML isn't people working with CUDA directly - people use open source frameworks like PyTorch and TensorFlow to define a network and train it, and all the frameworks support CUDA as a backend.

Other backends are also available, such as CPU-only training. And you can export networks in reasonably-standard formats.
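
For example, getting a model into a reasonably standard format (ONNX) is only a couple of lines; a sketch with placeholder names:

    # Export a (placeholder) PyTorch model to ONNX so it can be run on
    # non-CUDA backends such as ONNX Runtime or TensorRT.
    import torch

    model = torch.nn.Sequential(torch.nn.Linear(784, 256),
                                torch.nn.ReLU(),
                                torch.nn.Linear(256, 10)).eval()
    dummy = torch.randn(1, 784)
    torch.onnx.export(model, dummy, "model.onnx",
                      input_names=["x"], output_names=["logits"])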

Nvidia's moat is much more mature framework support than AMD's cards have; widespread popularity due to that good framework support, ensuring everyone develops on Nvidia, thus maintaining their support lead; much faster performance than CPU-only training; and a price that, though high, is a lot less than an ML developer's salary.

If you need 24GB of VRAM and Nvidia offers that for $1,600 while AMD offers it for $1,300, how many compatibility problems do you want to deal with to save a single day's wages?

But Nvidia's moat is far from guaranteed. Huge users like OpenAI and Facebook might find improving AMD support pays for itself.


> Huge users like OpenAI and Facebook might find improving AMD support pays for itself.

At that scale they may actually develop their own hardware a la Google TPU.

If you want to just focus on the AI problem and not on infrastructure, just use NVidia. If you want control and efficiency, design your own. AMD kind of falls in a weird middle ground with respect to the massive companies.


Google's focus on TPUs has caused them many issues, though; other giant players might see spending FTE-equivalents on PyTorch etc. development as a better investment choice.


CUDA code can be forward-ported to AMD's HIP, which can be used with the ROCm stack. For a more standards-focused alternative there's also SYCL, which has implementations targeting a variety of hardware backends (including HIP) and may also target Vulkan Compute in the future.


> CUDA code can be forward-ported to AMD's HIP, which can be used with the ROCm stack.

Maybe in some cases, but that doesn't even really matter since hardware support is poor.




>I'm curious: Do you think there is a realistic chance for another computer maker to catch up and overtake IBM in the computer space in the next few years or is their lead and expertise insurmountable at this point?

Nothing is insurmountable. :)

https://en.wikipedia.org/wiki/The_Innovator's_Dilemma

>It describes how large incumbent companies lose market share by listening to their customers and providing what appears to be the highest-value products, but new companies that serve low-value customers with poorly developed technology can improve that technology incrementally until it is good enough to quickly take market share from established business.


I'm aware, but I'd argue that in addition to the competitive moats described by Hamilton Helmer (the 7 Powers guy), there is a real moat in unique technological expertise in the chip industry. E.g. the chip-making machines that ASML builds or the 3nm chips that TSMC produces have reached a level of sophistication that will take competitors 3+ years to replicate, granting them a sort of quasi-monopoly for the foreseeable future.


I'd say those are covered by Helmer under scale power. Spending millions on tiny process optimizations or other research is possible because of the large revenue streams coming in. For example, in terms of scale economies, only when you sell thousands of high-end GPUs per month can you hire people to write highly optimized compilers.


Maybe - though I think it's more of a cornered resource. You have specialized knowledge that attracts specialized people which create more specialized knowledge. Creates a sorta feedback loop.


Although no lead is insurmountable, the fixed capital investment and mature software ecosystem specific to this sector makes it harder to imagine what a competitor would look like.

Given how large the prize is, the next chapter of chip development is likely to be Nvidia vs. state-sponsored projects. China, in particular, will funnel further resources into acquiring this technology by any means necessary, including (more) industrial sabotage and outright theft. It's going to be interesting to see how this will play out. Up until a few years ago China was viewed as a formidable competitor for projects of this nature, but as the country has moved to become increasingly authoritarian, its decision-making and execution have declined in quality.



Photonics can run light-based matrix multiplication at a fraction of the cost of current GPUs; it's only a matter of time until it triggers a complete paradigm shift.


I keep reading of these upcoming photonic hardware architectures capable of 100x+ increases. I sure as hell hope they hit market soon.


Last time I checked, the biggest players only did inference architectures, not training. Presumably they shy away from implementing Hinton's forward-forward algorithm or something similar.


Google's TPUs are competitive, but can only be rented.


Based on MLPerf, Google is lagging further behind; Intel is the only one catching up a bit, but still.

AMD is trying to catch up too, but so far Nvidia remains the leader, a few years ahead.


Behind in what dimension? The most expensive Nvidia chips are much faster than Google TPUs, but the Google TPUs are competitive in terms of end to end training costs (roughly you can think of this as FLOPs per dollar).
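
The comparison that matters here is roughly delivered FLOPs per dollar; a sketch where every number is a made-up placeholder:

    # Delivered-FLOPs-per-dollar style comparison; every number below is a
    # made-up placeholder, not a real price or utilization figure.
    def training_cost_usd(total_flops, peak_flops, utilization, usd_per_hour):
        seconds = total_flops / (peak_flops * utilization)
        return seconds / 3600 * usd_per_hour

    run_flops = 1e23   # hypothetical total training FLOPs
    print(training_cost_usd(run_flops, peak_flops=1.0e15, utilization=0.40, usd_per_hour=3.0))
    print(training_cost_usd(run_flops, peak_flops=0.8e15, utilization=0.45, usd_per_hour=2.0))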


I use TPUs on Colab all the time and I'm freaking happy. Maybe there's a way to use H100's do the same thing, but my code is already written to host utterly massive files on gcs buckets as tfrecord files and load them up on TPUs during the training process. First started with the free ones and now I rent the newer ones because it's not that expensive. I recommend beginners try it out. I find the other architectures more expensive, at least for my use case.
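
For anyone curious, the whole workflow is roughly the following (a sketch; the bucket path and feature spec are placeholders):

    # Sketch of the Colab TPU + GCS TFRecord workflow; the bucket path and
    # feature spec are placeholders.
    import tensorflow as tf

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    def parse(record):
        spec = {"x": tf.io.FixedLenFeature([784], tf.float32),
                "y": tf.io.FixedLenFeature([], tf.int64)}
        ex = tf.io.parse_single_example(record, spec)
        return ex["x"], ex["y"]

    files = tf.io.gfile.glob("gs://my-bucket/train-*.tfrecord")
    ds = (tf.data.TFRecordDataset(files)
          .map(parse).shuffle(10_000).batch(1024).prefetch(tf.data.AUTOTUNE))

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
        model.compile(optimizer="adam",
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    model.fit(ds, epochs=1)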


AnandTech had a good saying: there are no bad products, just bad prices.

So if TPU clusters are priced right...


Maybe one of the big Chinese chip makers or AMD, but the growing popularity of CUDA (not compatible with other GPUs) makes that less likely in the near future


Maybe not overtake, but Microsoft and Amazon are going to eat some of the pie by pushing their own accelerators inside their ecosystems.


I don't think that type of question or logic applies when predicting stock markets.


Luckily no one is trying to predict a stock market here


The performance jumps that Nvidia has made in a fairly short amount of time are impressive, but I can't help but feel like there is a real need for another player in this space. Hopefully AMD can challenge this supremacy soon.


Or even just offer an alternative, along with Intel: https://www.servethehome.com/intel-shows-gpu-max-1550-perfor...

There aren't many Gaudi/Instinct cloud offerings even though the market is accelerator-starved.


You can use Gaudi2s at the new Intel Developer Cloud[0]. Not sure why they don't offer it on AWS though; seems a bit odd since they have the DL1 instances for the first-gen Gaudis.

[0]: https://developer.habana.ai/intel-developer-cloud/


Interesting, this looks like what I might want: https://eduand-alvarez.medium.com/llama2-fine-tuning-with-lo...


I'd prefer another player that doesn't rely on TSMC.


Doesn't that leave pretty much Samsung and Intel as the only options?


Maybe IBM, if the stars align: https://news.ycombinator.com/item?id=38256558


And Nvidia has used Samsung before.


So basically Samsung.


Nvidia has been test driving Intel’s foundry services - [0]

Intel is on track with their node rollout roadmap, according to their CEO - [1]

[0] - https://www.tomshardware.com/news/nvidia-ceo-intel-test-chip...

[1] - https://focustaiwan.tw/sci-tech/202311070017


Not sure why you were downvoted. Taiwan is in a precarious position and diversifying manufacturing away from them makes sense.


> diversifying manufacturing away from them

Diversifying manufacturing away from Taiwan makes their position more precarious, not less.


Both parent comments were likely referring to any entity other than Taiwan. If you are a fabless chip designer or one of their customers, it makes sense to diversify away from Taiwan, even if that comes at Taiwan's expense.


I worry that a lot of large cap companies either depend directly on TSMC (Nvidia, AMD, Apple) or depend on a company that depends on TSMC (Microsoft/OpenAI, Arm). It's TSMC all the way down and that scares me.

I never thought I'd root for Intel.


It’s easy to forget that TSMC was not the chipmaking leader until 7 or 8 years ago. Intel was. Things can change quickly in tech. Intel is trying to reclaim their old glory. We’ll see if they succeed.


Historians might look back at Intel’s and GloFo’s stumbles in the mid 2010s as being a pivotal turning point. Chips are the new oil and that makes the mid term future very dangerous.


I'd rather Intel. People have been pleading with AMD for years to compete with Nvidia, but AMD really has not put in a proper effort. They still don't look like they are putting in a proper effort.


AMD shipped Frontier. Compare and contrast with Intel's Aurora.

Epyc took the performance crown from Intel. Games consoles have been AMD for ages.

AMD are competing with Intel and Nvidia simultaneously with fewer resources than either, having come back from near bankruptcy in recent memory.

There's been plenty of effort and execution from team red.

It's commercially unfortunate that the crypto and now deep learning crowd don't particularly value the flexibility or control that comes from an open source toolchain. Regardless, I don't think the Cuda moat will hold out.


> They still don't look like they are putting in a proper effort.

Quite the contrary, they've turned around the company to focus on AI.

Legacy software projects are on hold, and software developers have been moved to work on AI under a new VP (a former Xilinx exec). They have also purchased some startups to get experienced AI developers.

Here is Andrew Ng giving a positive evaluation of AMD's software efforts https://youtu.be/KDBq0GqKpqA?t=2359


AMD was fighting Intel for its life. After a number of flops, it only got a big breakthrough on the CPU side with Zen just over half a decade ago, which is not that far back. Hopefully they now have a bit of money saved up in their war chest to help the GPU division.


This may be a naive question, but all the metrics seem to be for inference. Should we expect similar gains on training?


Yes. Training would especially benefit from the increased memory size.


Interesting, thanks. I wonder why they aren't marketing that more on this page; that seems important.


Can anyone explain to a layman what exactly I'm looking at in that picture? It looks like a neat little city or building from Bladerunner.


It's a server motherboard with 8 GPUs crammed on it, facing up. The tall towers are the GPU heatsinks. I believe the blade looking things on the side are CPU RAM, the heatsinks on the back are covering the CPUs, and the little heatsink in the middle must be the CPU VRMs. Fans are in the back, and they crammed some electrical components on the front where all the IO is.


Looks like an HGX drawer, so there are only GPUs on this. The heatsinks towards the front are probably on the NVLink switches.


Ah you are right.


Just FYI, if 1 of the 8 GPUs fails, you will be replacing the entire assembly; they are not modular.


Where does the H200 fit in if the B100 is coming out the same year with 2x the performance? Is the H200 just cheaper than the B100?


It's a different production line. They can keep producing both since they are both in demand anyway.

And the B100 is farther away. Nvidia always doubles the memory of its cards like this mid-generation.


Does the L40S fit in a similar way?

Most of the GPUs are backordered big time, but our vendors are chomping at the bit to sell us these.


Is the limit on the speed on inference a memory bandwidth issue or compute?



Memory bandwidth/latency, especially when you're at smaller batch sizes.
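
The batch-1 back-of-envelope, with placeholder numbers:

    # Why small-batch LLM decoding is bandwidth-bound: each generated token has
    # to stream (roughly) all the weights from HBM once. Placeholder numbers.
    params = 70e9         # 70B-parameter model
    bytes_per_weight = 2  # fp16/bf16
    hbm_bw = 3.35e12      # ~3.35 TB/s, H100-class

    tokens_per_s = hbm_bw / (params * bytes_per_weight)   # batch 1, ignoring KV cache
    print(f"~{tokens_per_s:.0f} tokens/s ceiling from bandwidth alone")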


Depends. One might say it's sometimes "cache size limited" too.


I had a shock when I looked up prices for H100 GPUs, wanting to use one just for personal experimentation and an upcoming hackathon. How much does this one cost? $300,000?


These are not for consumers -- these are datacenter-grade systems.

If you want a consumer GPU, you can go for the RTX 4090 (24GB VRAM) or the A6000 Ada (48GB VRAM) if you are building a workstation.

If you really need to "experiment" on an A/H100, then you can rent it by the hour through a cloud provider like Runpod.


To elaborate: you can't really buy these except in specific configurations from Supermicro (usually 8x H100) or the like. So take whatever chip-specific cost you have in mind, and 8x it, and add on the cost of CPU/memory/storage. NVIDIA doesn't bother to sell these in a configuration that you can plug into your desktop.


"GPU" - zero video output capabilities built in.


It can still process graphics, you just need to do a cross-adapter scanout or encode it for transmission over network.


Can it? I thought that capability ended with the A100.

It still has media encode/decode blocks. Big ones, in fact.


AIPU?


NPU is the term normally used


With the cookie banner and the ad banner, the page has barely a quarter of the screen space left on a mobile device.


Am I the only one that's annoyed by the non-alphabetical model numbers? Why not do a B100 after the A100 instead of jumping to H (supposing there won't be a C100 or D200 at some point)? Like, wtf Nvidia.


They name their architectures after scientists (Maxwell, Pascal, Turing, Volta, Ampere, Lovelace, Hopper). That's what the GPU's initial letter stands for.

As for the number, the die name counts down to 100 (with GA107, for instance, being a small GPU die and GA100 being the big one), and the big datacenter GPU as a product inherits the 100.


it'd be nice if they picked them in alphabetical order


The naming scheme goes back to at least 2004 (Curie), and Wikipedia has done the service of alphabetizing it for us: https://en.m.wikipedia.org/wiki/List_of_eponyms_of_Nvidia_GP...

Also, it occurred to me that Nvidia does sometimes increment the die number to 200 (e.g. GM200, as the Maxwell 100-series was a single small oddball die). It's possible that they "refreshed" the GH100 die and are codenaming it GH200.


At first they used "measures of hotness", but ran out a few generations in. Coincidentally, degrees of warmth are all named after scientists. So they continued with scientists.

https://en.m.wikipedia.org/wiki/Fahrenheit_(microarchitectur... -> https://en.m.wikipedia.org/wiki/Celsius_(microarchitecture) -> https://en.m.wikipedia.org/wiki/Kelvin_(microarchitecture) -> https://en.m.wikipedia.org/wiki/Rankine_(microarchitecture)


I'm surprised nobody at nvidia brought this up


Which scientist is letter B?



Bohr?


I recently had occasion to evaluate a database of 1200+ NVIDIA GPUs and can tell you that the only thing consistent about the model numbers is their inconsistency. For example, what is an RTX 4000? It could be the 2018 Quadro RTX 4000, the Quadro RTX 4000 Max-Q, or Quadro RTX 4000 Mobile (all Turing cards), but it could also be the RTX 4000 Mobile Ada Generation (Ada Lovelace card released 2023).


At least they haven’t tried to do a Tesla S, 3, X, Y progression.


Wait, did Tesla pick those model names for that reason?


It's run by a 13 year old boy. What did you expect?


Yes.


Why do they still sell hardware now that practically every other business has moved to being a service provider? If we set aside the fact that it would be an awful move for end-users, what's to stop Nvidia from cornering the market by only renting them in their own data centers? Is it the logistics of moving the massive training sets?


That would be a highly risky bet on Nvidia becoming AWS faster than AWS can become Nvidia.

What they're doing is instead trying to make sure that their GPUs continue to be seen as the best option in the short/medium term (by having them accessible everywhere), and trying to commoditize their complement by giving small cloud providers disproportionate GPU allocations, which they hope will drive customers from the big providers to the smaller ones that a) aren't trying to build their own ML hardware, b) will have less negotiating leverage with Nvidia in the long term.


Many of the largest HPC customers (read: DoD, DoE, NNSA) simply will not allow their code to sit on someone else's machine.


Semi industry players have a strong cultural memory that "competing with your own customers" is a bad plan.


I surmise that such a strategy would essentially hand their market share over to AMD on a silver platter.


What do you think all the other service providers are running their services on?


I'm asking why Nvidia doesn't maximize their profits by retaining the hardware and selling compute. They could capture the market from those other providers if they sold more FLOPS/kilowatt (or whatever metric is used). Compared to manufacturing GPUs/TPUs, running a datacenter (especially one that specializes in Nvidia hw) would seem to be a trivial task.


Google Cloud Platform hasn't managed to make much of a dent in AWS's business, despite being the only place you can get TPUs and BigQuery.

Becoming a successful cloud provider is far from trivial, even if you can offer technology no-one else has.



