Nvidia Announces H100 NVL – Max Memory Server Card for Large Language Models (anandtech.com)
122 points by neilmovva on March 21, 2023 | 107 comments



A bit underwhelming - H100 was announced at GTC 2022, and represented a huge stride over A100. But a year later, H100 is still not generally available at any public cloud I can find, and I haven't yet seen ML researchers reporting any use of H100.

The new "NVL" variant adds ~20% more memory per GPU by enabling the sixth HBM stack (previously only five out of six were used). Additionally, GPUs now come in pairs with 600GB/s bandwidth between the paired devices. However, the pair then uses PCIe as the sole interface to the rest of the system. This topology is an interesting hybrid of the previous DGX (put all GPUs onto a unified NVLink graph), and the more traditional PCIe accelerator cards (star topology of PCIe links, host CPU is the root node). Probably not an issue, I think PCIe 5.0 x16 is already fast enough to not bottleneck multi-GPU training too much.


It is interesting that Hopper isn’t widely available yet.

I have seen some benchmarks from academia but nothing in the private sector.

I wonder if they thought they were moving too fast and wanted to milk Ampere/Ada as long as possible.

Not having any competition whatsoever means Nvidia can release what they like when they like.


The question is, do they not have much production, or are OpenAI and Microsoft buying every single one they produce?


Why bother when you can get cryptobros paying way over MSRP for 3090s?


GPU mining died last year.

There's so little liquidity post-merge that it's only worth mining as a way to launder stolen electricity.

The bitcoin people still waste raw materials, and prices are relatively sticky with so few suppliers and a backlog of demand, but we've already seen prices drop heavily since then.


Right, that's why Nvidia is actually trying again. The money printer has run out of ink.


Not just cryptobros. A100s are the current top of the line and it’s hard to find them available on AWS and Lambda. Vast.AI has plenty if you trust renting from a stranger.

AMD really needs to pick up the pace and make a solid competitive offering in deep learning. They’re slowly getting there but they are at least 2 generations out.


I would take a huge performance hit to just not deal with Nvidia drivers. Unless things have changed, it is still not really possible to operate on AMD hardware without a list of gotchas.


It's still basically impossible to find MI200s in the cloud.

On desktops, only the 7000 series is kinda competitive for AI in particular, and you have to go out of your way to get it running quick in PyTorch. The 6000 and 5000 series just weren't designed for AI.


It's crazy to me that no other hardware company has sought to compete for the deep learning training/inference market yet ...

The existing ecosystems (cuda, pytorch etc) are all pretty garbage anyway -- aside from the massive number of tutorials it doesn't seem like it would actually be hard to build a vertically integrated competitor ecosystem ... it feels a little like the rise of rails to me -- is a million articles about how to build a blog engine really that deep a moat ..?


How could their moat possibly be deeper?

First of all you need hardware with cutting-edge chips. Chips which can only be supplied by TSMC and Samsung.

Then you need the software ranging all the way from the firmware and driver over something analogous to CUDA with libraries like cuDNN, cuBLAS and many others to integrations into pytorch and tensorflow.

And none of that will come for free, like it came to Nvidia. Nvidia built CUDA and people built their DL frameworks around it in the last decade, but nobody will invest their time into doing the same for a competitor, when they could just do their research on Nvidia hardware instead.

Realistically it's up to AMD or Intel.


There will probably be Chinese options as well. China has an incentive to provide a domestic competitor due to deteriorating relations with the U.S.


They certainly will have to try, since nvidia is banned from exporting A100 and H100 chips.


They do ship the A800 and H800 to China. The H800 is an H100 with reduced chip-to-chip interconnect bandwidth, and the A800 is likewise a cut-down version of the A100.


No other company has sought this?

https://www.cerebras.net/ Has innovative technology, has actual customers, and is gaining a foothold in software-system stacks by integrating their platform into the OpenXLA GPU compiler.


There are tons of companies trying; they just aren't succeeding.


Yes, I was expecting a RAM-doubled edition of the H100; this is just a higher-binned version of the same part.

I got an email from vultr, saying that they're "officially taking reservations for the NVIDIA HGX H100", so I guess all public clouds are going to get those soon.


You can also join a pair of regular PCIe H100 GPUs with an NVLink bridge. So that topology is not so new either.


>H100 was announced at GTC 2022, and represented a huge stride over A100. But a year later, H100 is still not generally available at any public cloud I can find

You can safely assume an entity bought as many as they could.


I was wondering today if we would start to see the reverse of this: small ASICs or some kind of LLM-optimized GPU for desktops, or maybe even laptops or mobile. It is evident, I think, that LLMs are here to stay and will be a major part of computing for a while. Getting this local, so we aren't reliant on clouds, would be a huge boon for personal computing. Even if it's a "worse" experience, being able to load up an LLM into our computer, tell it to only look at this directory and help out would be cool.


In fact, Qualcomm has announced a "Cloud AI" PCIe card designed for inference (as opposed to training & inference) [1, 2]. It's populated with NSPs like the ones in mobile SoCs.

[1] https://www.qualcomm.com/products/technology/processors/clou...

[2] https://github.com/quic/software-kit-for-qualcomm-cloud-ai-1...


Software/hardware co-evolution. Wouldn't be the first time we went down that road to good effect.

For anything that can be run remotely, it'll always be deployed and optimized server-side first. Higher utilization means more economy.

Then trickle down to local and end user devices if it makes sense.


Apple, Intel, AMD, Qualcomm, Samsung, etc. already have "neural engines" in their SoCs. These engines continue to evolve to better support common types of models.


Why is the sentiment here so much that LLMs will somehow be decentralized and run locally at some point? Has the story of the internet so far not been that centralization has pretty much always won?


Hackers want to run LLMs locally just because. It's not a mainstream thing.


It makes business sense as well. It doesn't make much sense to build an entire company around the idea that OpenAI's APIs are always available and you won't eventually get screwed. "Be careful of basing your business on top of another" and all that yadda yadda.

If you want to build a business around LLMs, it makes a lot of sense to be able to run the core service of what you want to offer on your own infrastructure instead of relying on a 3rd party that most likely doesn't give more than 1% care about you.


Running LLMs on your own servers doesn't mean PCs which is what this thread is about. A100/H100 is fine for a business but people can't justify them for personal use.


Because that is pretty much the pendulum swinging in the IT world. Right now it is solidly in 'centralization' territory, hopefully it will go back towards decentralization again in the future. The whole PC revolution was an excellent datapoint for decentralization, now we're back to 'dumb terminals' but as local compute strengthens the things that you need a whole farm of servers for today can probably fit in your pocket tomorrow, or at the latest in a few years.


Not sure this really tracks. Local compute has always been strengthening as a steady incline. Yet we haven't really experienced any sort of pendulum shift, it's always been centralization territory.

The reasoning seems mostly obvious to me here: people do not care for the effort that decentralization requires. If given the option to run AI off some website to generate all you want, people will gladly do this over using their local hardware due to the setup required.

The unfortunate part is that it takes so much longer to create not for profit tooling that is just as easy to use, especially when the calling to turn that into your for profit business in such a lucrative field is so tempting. Just ask the people who have contributed to Blender for a decade now.


Absolutely not. Computers used to be extremely centralized and the decentralization revolution powered a ton of progress in both software development and hardware development.

You can run many AI applications locally today that would have required a massive investment in hardware not all that long ago. It's just that the bleeding edge is still in that territory. One major optimization avenue is the improvement of the models themselves, they are large because they have large numbers of parameters, but the bulk of those parameters has little to no effect on the model output and there is active research on 'model compression', which has the potential to be able to extract the working bits from a model while discarding the non-working bits without affecting the output and realize massive gains in efficiency (both in power consumption as well as for running the model).

Have a look at the kind of progress that happened in the chess world with the initial huge ML powered engines that are beaten by the kind of program that you can run on your phone nowadays.

https://en.wikipedia.org/wiki/Stockfish_(chess)

I fully expect something similar to happen to language models.


The bleeding edge will always be in that territory. It still requires a massive investment today to run AI applications locally to produce anywhere near as good results. People are spending upwards of $2000 for a GPU just to get decent results when it comes to image generation, many forgoing this entirely and just giving Google a monthly fee to use their hardware.

Which is the point, decentralization will always be playing catch up here unless something really interesting happens. It has absolutely nothing to do with local compute power, that has always been on an incline. We just get fed scraps down the line.


Today's scraps are yesterday's state-of-the-art, and that's very logical and applies to far more than just AI applications. It's the way research and development result in products and the subsequent optimization. This has been true since the dawn of time in one form or another. At some point stone tools were high tech and next to unaffordable. Then it was bronze, then iron, and at some point we finally hit steam power. From there to the industrial revolution was a relatively short span and from there to electricity, electronics, solid state, computers, personal computers, mobile phones, smartphones and so on in ever decreasing steps.

If anything, the steps now follow each other so closely that we have far more trouble tracking the societal changes and dealing with them than we have with the lag between technological advancement and its eventual commoditization.


Nvidia's business model encourages this for starters. They charge a huge markup for their datacenter GPUs through some clever licensing restrictions. So it is cheaper per FLOP to run inference on a personal device.

Centralization of compute has not always won (even if that compute is mostly controlled by a single company). The failure of cloud gaming vs consoles, and the success of Apple (which is very centralized but pushes a lot of ML compute out to the edge) for example.


I think the sentiment is both. There will be advanced centralized LLMs and people want the option to have a personal one (or two). There needn't be a single solution.


Sure, for big business, but torrents are still alive and well.


I think it's because it feels more similar to Google Stadia than to Facebook.


A couple of the big players are already looking at developing their own chips.


Have been for years. Maybe lots of years. It's expensive to have a go (many engineers plus cost of making the things) and it's difficult to beat the established players unless you see something they're doing wrong or your particular niche really cares about something the off the shelf hardware doesn't.


Please give us consumer cards with more than 24GB VRAM, Nvidia.

It was a slap in the face when the 4090 had the same memory capacity as the 3090.

A6000 is 5000 dollars, ain't no hobbyist at home paying for that.


Nvidia don't want consumers using consumer GPUs for business.

If you are a business user then you must pay Nvidia gargantuan amounts of money.

This is the outcome of a market leader with no real competition - you pay much more for lower power than the consumer GPUs and you are forced into using their business GPUs through software license restrictions on the drivers.


That was always why the Titan line was so great - they typically unlocked features in between Quadro and Gaming cards. Sometimes it was subtle (like very good FP32 AND FP16 performance) or adding full 10 bit colour support if you had a Titan only. Now it seems like they have opened up even more of those features to consumer cards (at least the creative ones) with the studio drivers.


Hmmm ... "Studio Drivers" ... how are these tangibly different to gaming drivers?

According to this, the difference seems to be that Studio Drivers are older and better tested, nothing else.

https://nvidia.custhelp.com/app/answers/detail/a_id/4931/~/n...

What am I missing in my understanding of Studio Drivers?

""" How do Studio Drivers differ from Game Ready Drivers (GRD)?

In 2014, NVIDIA created the Game Ready Driver program to provide the best day-0 gaming experience. In order to accomplish this, the release cadence for Game Ready Drivers is driven by the release of major new game content giving our driver team as much time as possible to work on a given title. In similar fashion, NVIDIA now offers the Studio Driver program. Designed to provide the ultimate in functionality and stability for creative applications, Studio Drivers provide extensive testing against top creative applications and workflows for the best performance possible, and support any major creative app updates to ensure that you are ready to update any apps on Day 1. """


Isn't a new Titan RTX 4090 coming out soon?


An alleged photo of an engineering sample was spotted in the wild a while ago, but no one knows if it's actually going to end up being a thing you can buy.


We're NOT business users, we just want to run our own LLM at home.

Given the size of LLMs, this should be possible with just a little bit of extra VRAM.


Exactly, we're just below that sweet spot right now.

For example on 24GB, Llama 30B runs only in 4bit mode and very slowly, but I can imagine a RLHF finetuned 30B or 65B version running in at least 8bit would be actually useful, and you could run it on your own computer easily.


Do you know where the cutoff is? Does 32GB VRAM give us 30B int8 with/without a RLHF layer? I don't think 5090 is going to go straight to 48GB, I'm thinking either 32 or 40GB (if not 24GB).


> For example on 24GB, Llama 30B runs only in 4bit mode and very slowly

Why do you think adding VRAM, but not cores, will make it run faster?


I've been told the 4 bit quantization slows it down, but don't quote me on this since I was unable to benchmark at 8 bit locally

In any case, you're right it might not be as significant, however, the quality of the output increases with 8/16bit, and running 65B is completely impossible on 24GB
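To put rough numbers on where the cutoff sits, here is a back-of-the-envelope sketch of weight memory at different quantization levels. It assumes the nominal parameter counts (30e9 and 65e9) and counts weights only; activations, the KV cache and framework overhead are ignored, so real requirements are higher. The function name is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


VRAM_GB = 24  # a 3090/4090-class consumer card

# Nominal parameter counts; these are assumptions, not exact model sizes.
for name, n_params in [("30B", 30e9), ("65B", 65e9)]:
    for bits in (4, 8, 16):
        gb = weight_memory_gb(n_params, bits)
        verdict = "fits" if gb <= VRAM_GB else "does not fit"
        print(f"{name} @ {bits}-bit: {gb:.1f} GB of weights ({verdict} in {VRAM_GB} GB)")
```

Under these assumptions only the 4-bit 30B weights fit in 24GB, 8-bit 30B needs about 30GB before any overhead, and 65B does not fit in 24GB even at 4-bit.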


It's not impossible, there are several projects which load model layer by layer for execution from the disk or ram, but it will be much slower.


I don't think you understand though, they don't WANT you. They WANT the version of you who makes $150k+ a year and will splurge $5k on a Quadro.

If they had trouble selling stock we would see this niche market get catered to.


That IS me. $5K is not enough to run an LLM at home (beyond the non-functional reduced quantization smaller models).


Ahh yes, looks like I was too generous with my numbers, the new Quadro with 48GB VRAM is $7k, so you probably would need $14k and a Threadripper/Xeon/EPYC workstation because you won't have enough PCIE lanes/RAM/Memory Bandwidth otherwise.

So maybe more accurate is $200k+ a year and $20-30k on a workstation.

I grew up on $20k a year, the numbers in tech are baffling!


Nvidia can't do a large 'consumer' card without cannibalizing their commercial ML business. ATI doesn't have that problem.

ATI seems to be holding the idiot ball.

Port stable diffusion and clip to their hardware. Train an upsized version sized for a 48GB card. Release a prosumer 48gb card... get huge uptake from artists and creators using the tech.


GPUs are going to be weird, underconfigured and overpriced until there is real competition.

Whether or not there is real competition depends entirely on whether Intel's Arc line of GPUs stays in the market.

AMD strangely has decided not to compete. Its newest GPU, the 7900 XTX, is an extremely powerful card, close to the top of the line Nvidia RTX 4090 in raster performance.

If AMD had introduced it with an aggressively low price then they could have wedged Nvidia, which is determined to exploit its market dominance by squeezing the maximum money out of buyers.

Instead, AMD has decided to simply follow Nvidia in squeezing for maximum prices, with AMD prices slightly behind Nvidia.

It's a strange decision from AMD, who is well behind in market share and apparently seems uninterested in increasing that market share by competing aggressively.

So a third player is needed - Intel - it's a lot harder for three companies to sit on outrageously high prices for years rather than compete with each other for market share.


The root cause is that TSMC raised prices on everyone.

Since Intel GPUs are again TSMC manufactured, you really aren't going to see price improvements unless Intel subsidizes all of this.


>> The root cause is that TSMC raised prices on everyone.

This is not correct.



You are correct that the manufacturing cost has gone up.

You are incorrect that this is the root cause of GPU prices being sky high.

If manufacturing cost was the root cause then it would be simply impossible to bring prices down without losing money.

The root cause of GPU prices being so high is lack of competition - AMD and Nvidia are choosing to maximise profit, and they are deliberately undersupplying the market to create scarcity and therefore prop up prices.

"AMD 'undershipping' chips to help prop prices up" https://www.pcgamer.com/amd-undershipping-chips-to-help-prop...

"AMD is ‘undershipping’ chips to balance CPU, GPU supply Less supply to balance out demand—and keep prices high." https://www.pcworld.com/article/1499957/amd-is-undershipping...

In summary, GPU prices are ridiculously high because Nvidia and AMD are overpricing them because they believe this is what gamers will pay, NOT because manufacturing costs have forced prices to be high.


Isn't this price fixing? And if so, can this be prosecuted?


I suspect that the lack of CUDA is a dealbreaker for too many people when it comes to AMD, with the recent explosion in machine learning.


GPUs strike me as absurdly cheap given the performance they can offer. I'd just like them to be easier to program.


Depends on the GPU of course but at the top end of the market AUD$3000 / USD$1,600 is not cheap and certainly not absurdly cheap.

Much less powerful GPUs represent better value but the market is ridiculously overpriced at the moment.


The really interesting upcoming LLM products are from AMD and Intel... with catches.

- The Intel Falcon Shores XPU is basically a big GPU that can use DDR5 DIMMS directly, hence it can fit absolutely enormous models into a single pool. But it has been delayed to 2025 :/

- AMD have not mentioned anything about the (not delayed) MI300 supporting DIMMs. If it doesn't, it's capped at 128GB, and it's being marketed as an HPC product like the MI200 anyway (which you basically cannot find on cloud services).

Nvidia also has some DDR5 Grace CPUs, but the memory is embedded and I'm not sure how much of a GPU they have. Other startups (Tenstorrent, Cerebras, Graphcore and such) seem to have underestimated the memory requirements of future models.


> DDR5 DIMMS directly

That's the problem. Good DDR5 RAM's memory speed is <100GB/s, while Nvidia can have up to 2TB/s, and still the bottleneck lies in memory speed for most applications.


Not if the bus is wide enough :P. EPYC Genoa is already ~450GB/s, and the M2 Max is 400GB/s.

Anyway, what I was implying is that simply fitting a trillion parameter model into a single pool is probably more efficient than splitting it up over a power hungry interconnect. Bandwidth is much lower, but latency is also lower, and you are shuffling much less data around.
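For reference, the "wide bus" arithmetic can be sketched like this. It assumes a 64-bit (8-byte) data bus per DDR5 channel and theoretical peak rates with no protocol overhead; the configurations (dual-channel DDR5-6000 for a desktop, 12-channel DDR5-4800 for a Genoa-class server) and the 65GB model size are illustrative picks, not measurements.

```python
def ddr_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s: channels * transfers/s * bytes per transfer."""
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


def seconds_per_weight_pass(model_gb: float, bandwidth_gbs: float) -> float:
    """Time to stream every weight from memory once at the given bandwidth."""
    return model_gb / bandwidth_gbs


desktop = ddr_bandwidth_gbs(channels=2, mega_transfers_per_s=6000)  # assumed desktop setup
server = ddr_bandwidth_gbs(channels=12, mega_transfers_per_s=4800)  # assumed 12-channel server
hbm = 2000.0  # the "up to 2TB/s" figure quoted in the thread

MODEL_GB = 65.0  # hypothetical model footprint
for label, bw in [("desktop DDR5", desktop), ("12-channel DDR5", server), ("HBM", hbm)]:
    t = seconds_per_weight_pass(MODEL_GB, bw)
    print(f"{label}: {bw:.1f} GB/s, {t * 1000:.0f} ms to read {MODEL_GB:.0f} GB once")
```

So a desktop lands just under 100GB/s while twelve channels get into the ~450GB/s range, which is still several times short of HBM.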


Grace can be paired with Hopper via a 900GB/s NVLINK bus (500GB/s memory bandwidth), 1TB of LPDDR5 on the CPU and 80-94GB of HBM3 on the GPU.


That does sound pretty good, but it's still going chip to chip over NVLink.


I wonder how soon we'll see something tailored specifically for local applications. Basically just tons of VRAM to be able to load large models, but not bleeding edge perf. And eGPU form factor, ideally.


The Apple M-series CPUs with unified RAM are interesting in this regard. You can get a 16-inch MBP with an M2 Max and 96GB of RAM for $4300 today, and I expect the M2 Ultra to go to 192GB.


I'm not an ML scientist by any means, but perf seems as important as RAM from what I'm reading. Running prompts in internal chain of thought (eating up more TPU time) appears to give much better output.


It's not that perf is not important, but not having enough VRAM means you can't load the model of a given size at all.

I'm not saying they shouldn't bother with perf at all, mind you. But given some target price, it's a balancing act between compute and RAM, and right now it seems that RAM is the bigger hurdle.


I'm super duper curious if there are ways to glob together VRAM between consumer-grade hardware to make this whole market more accessible to the common hacker?


You can, for instance, connect two RTX 3090 with an NVLink bridge. That gives you 48 GB in total. The 4090 doesn't support NVLink anymore.


You actually can split a model [0] onto multiple GPUs even without NVLink, just using the PCIe for the transfers.

Depending on the model the performance is sometimes not all that different. I believe for solely inference on some models the speed difference may barely be noticeable, where for other training activities it may make 10+% difference [1]

[0] https://pytorch.org/tutorials/intermediate/model_parallel_tu...

[1] https://huggingface.co/transformers/v4.9.2/performance.html
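As a rough illustration of what splitting a model over several cards involves, here is a framework-free sketch that assigns consecutive layers to devices until each one's memory budget is used up. This is not the PyTorch API from [0]; the function name, the layer sizes and the 20GB usable budget per card are all made-up assumptions.

```python
def assign_layers(layer_sizes_gb, device_capacities_gb):
    """Greedily place consecutive layers on devices, in order.

    Returns a list with the device index chosen for each layer.
    Raises ValueError if the layers do not fit.
    """
    placement = []
    device = 0
    used = 0.0
    for size in layer_sizes_gb:
        # Move on to the next device once the current one cannot hold this layer.
        while device < len(device_capacities_gb) and used + size > device_capacities_gb[device]:
            device += 1
            used = 0.0
        if device == len(device_capacities_gb):
            raise ValueError("model does not fit on the given devices")
        placement.append(device)
        used += size
    return placement


# Hypothetical model: 60 layers of 0.5 GB each (30 GB total), two cards with
# 20 GB usable each (leaving headroom for activations on a 24 GB card).
placement = assign_layers([0.5] * 60, [20.0, 20.0])
print(f"GPU 0 holds {placement.count(0)} layers, GPU 1 holds {placement.count(1)} layers")
```

With a split like this, only the activations at the boundary between the two groups of layers have to cross PCIe on each forward pass, which is one reason inference can tolerate the slower link.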


> The 4090 doesn't support NVLink anymore.

Are you sure about that?



I remember reading about a guy who soldered 2GB VRAM modules on his 3060 12GB (replacing the 1GB modules) and was able to attain 24GB on that card. Or something along those lines.


How is this card (which is really two physical cards occupying 2 PCIe slots) exposed to the OS? Does it show up as a single /dev/gfx0 device, or is the unification a driver trick?


The two cards show as two distinct GPUs to the host, connected via NVLink. Unification / load balancing happens via software.


Kinda depressing if you consider how they removed NVLink in the 4090, stating the following reason:

> “The reason we took [NVLink] off is that we need I/O for other things, so we’re using that area to cram in as many AI processors as possible,” Jen-Hsun Huang explained of the reason for axing NVLink.[0]

"NVLink is bad for your games and AI, trust me bro."

But then this card, actually aimed at ML applications, uses it.

0. https://www.techgoing.com/nvidia-rtx-4090-no-longer-supports...


Market segmentation. Back when the Pascal architecture was the latest thing, it didn't make much sense to buy expensive Tesla P100 GPUs for many professional applications when consumer GeForce 1080 Ti cards gave you much more bang for the buck with few drawbacks. From the corporation's perspective it makes so much sense to differentiate the product lines more, now that their customers are deeply entrenched.


What exactly is an SXM5 socket? It sounds like a PCIe competitor, but proprietary to nvidia. Looking at it, it seems specific to nvidia DGX (mother?)boards. Is this just a "better" alternative to PCIe (with power delivery, and such), or fundamentally a new technology?


Yes to all your questions. It's specifically designed for commercial compute servers. It provides significantly more bandwidth and speed over PCIe.

It's also enormously more expensive and I'm not sure if you can buy it new without getting the nvidia compute server.


It's one of those /If you have to ask, you can't afford it/ scenarios.


The TDP row in the comparison table must be in error. It shows the card with dual GH100 GPUs at 700W and the one with a single GH100 GPU at 700-800W ?!


That's the SXM version, used for instance in servers like the DGX. It's also faster than the PCIe variation.


So it's essentially two H100's in a trenchcoat? (plus a sprinkling of "latest")


I would sell a kidney for one of these. It's basically impossible to train language models on a consumer 24GB card. The jump up is the A6000 ADA, at 48GB for $8,000. This one will probably be priced somewhere in the $100k+ range.


Use 4 consumer grade 4090s then. It would be much cheaper and better in almost every aspect. Also, even with this, forget about training foundational models. Meta spent 82k GPU hours on the smallest LLaMA and 1M hours on the largest.
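To make those GPU-hour figures concrete, here is a quick sketch converting them to wall-clock time on a 4-GPU box. It uses the rounded numbers quoted above and assumes, optimistically, that each card matches the throughput of the GPUs Meta used and that scaling is perfect; the helper name is invented for this example.

```python
def wall_clock_days(gpu_hours: float, n_gpus: int) -> float:
    """Days of wall-clock time if gpu_hours are spread evenly over n_gpus."""
    return gpu_hours / n_gpus / 24


# Rounded figures from the comment above, not exact values.
for label, gpu_hours in [("smallest LLaMA", 82_000), ("largest LLaMA", 1_000_000)]:
    days = wall_clock_days(gpu_hours, n_gpus=4)
    print(f"{label}: {days:,.0f} days (~{days / 365:.1f} years) on 4 GPUs")
```

Even under those generous assumptions that is over two years for the smallest model and decades for the largest.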


Go with 2x 3090s instead. 4000 series doesn't support SLI, so you're stuck with the max of whatever one card you get.


If I remember correctly the NVLINK adds 100GB/s (where PCIE 4.0 is 64GB/s). Is it really worth getting 3090 performance (roughly half) for that extra bus speed?


Ampere NVLink (NV3) was 600 GByte/sec, with Hopper (NV4) it's 900 GByte/sec. https://www.nvidia.com/en-us/data-center/nvlink/


That is for the data center NVLINK, according to Wikipedia, for GA102 (3090) it is 56.25GB/s in each direction, yielding 112.5GB/s total bus bandwidth.


Ah, that's true, thanks. It's the same type of NVLink as on the A40 GPU. https://images.nvidia.com/content/Solutions/data-center/a40/...


PCIe 4.0 x16 is 32 GB/s per direction.
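The figure follows from the per-lane signalling rate. A small sketch, assuming 16 GT/s per lane for PCIe 4.0, 32 GT/s for PCIe 5.0, and 128b/130b line encoding, and ignoring packet and protocol overhead (so achievable throughput is somewhat lower):

```python
def pcie_gbs_per_direction(gt_per_s: float, lanes: int) -> float:
    """Theoretical one-way bandwidth in GB/s for a PCIe 3.0+ style link."""
    encoding_efficiency = 128 / 130  # 128b/130b line code
    return gt_per_s * lanes * encoding_efficiency / 8  # 8 bits per byte


gen4_x16 = pcie_gbs_per_direction(16, 16)
gen5_x16 = pcie_gbs_per_direction(32, 16)
print(f"PCIe 4.0 x16: {gen4_x16:.1f} GB/s each way, {2 * gen4_x16:.1f} GB/s both ways")
print(f"PCIe 5.0 x16: {gen5_x16:.1f} GB/s each way, {2 * gen5_x16:.1f} GB/s both ways")
```

The 64GB/s number mentioned earlier in the thread is the two directions of a 4.0 x16 link added together.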


You think? It’s double 48 GB (per card) so why wouldn’t it be in the $20k range?


Machine learning is so hyped right now (with good reason) so customers are price insensitive.


I guess we'll see.


Tom's Hardware is estimating $80k.


NVIDIA is selling shovels in a gold rush. Good for them. Their P/E of 150 is frightening, though.


I was just saying to a colleague the day before this announcement that the inevitable consequence of the popularity of large language models will be GPUs with more memory.

Previously, GPUs were designed for gamers, and no game really "needs" more than 16 GB of VRAM. I've seen reviews of the A100 and H100 cards saying that the 80GB is ample for even the most demanding usage.

Now? Suddenly GPUs with 1 TB of memory could be immediately used, at scale, by deep-pocket customers happy to throw their entire wallets at NVIDIA.

This new H100 NVL model is a Frankenstein's monster stitched together from whatever they had lying around. It's a desperate move to corner the market as early as possible. It's just the beginning, a preview of the times to come.

There will be a new digital moat, a new capitalist's empire, built upon the scarcity of cards "big enough" to run models that nobody but a handful of megacorps can afford to train.

In fact, it won't be enough to restrict access by making the models expensive to train. The real moat will be models too expensive to run. Users will have to sign up, get API keys, and stand in line.

"Safe use of AI" my ass. Safe profits, more like. Safe monopolies, safe from competition.


I wonder how this compares to AMD Instinct MI300 128GB HBM3 cards?


Does AMD have a chance here in the short term (say 24 months)?


AMD seems to be focusing on traditional HPC, they've got a ton of 64 bit flops in their recent commercial model. I expect their server GPUs are mostly for chasing supercomputer contracts, which can be pretty lucrative, while they cede model training to NVidia.


For now Nvidia is a very dominant player for sure, but in the long run do you see it changing, with competition from AMD-Xilinx, Intel or potential AI hardware startups? Why have the startups or other big players failed to make a dent in Nvidia's dominance? Considering how big this market will be in the coming years, there should have been significant investment by other players, but they seem incapable of making even a competitive chip, while Nvidia, which is already so far ahead, is running even faster, expanding its software ecosystem across various industries.


Sarah Connor is totally coming for NVIDIA.



