A bit underwhelming - H100 was announced at GTC 2022, and represented a huge stride over A100. But a year later, H100 is still not generally available at any public cloud I can find, and I haven't yet seen ML researchers reporting any use of H100.
The new "NVL" variant adds ~20% more memory per GPU by enabling the sixth HBM stack (previously only five out of six were used). Additionally, GPUs now come in pairs with 600GB/s bandwidth between the paired devices. However, the pair then uses PCIe as the sole interface to the rest of the system. This topology is an interesting hybrid of the previous DGX (put all GPUs onto a unified NVLink graph), and the more traditional PCIe accelerator cards (star topology of PCIe links, host CPU is the root node). Probably not an issue, I think PCIe 5.0 x16 is already fast enough to not bottleneck multi-GPU training too much.
There's so little liquidity post-merge that it's only worth mining as a way to launder stolen electricity.
The bitcoin people still waste raw materials, and prices are relatively sticky with so few suppliers and a backlog of demand, but we've already seen prices drop heavily since then.
Not just cryptobros. A100s are the current top of the line and it’s hard to find them available on AWS and Lambda. Vast.AI has plenty if you trust renting from a stranger.
AMD really needs to pick up the pace and make a solid competitive offering in deep learning. They're slowly getting there, but they're at least two generations behind.
I would take a huge performance hit to just not deal with Nvidia drivers. Unless things have changed, it is still not really possible to operate on AMD hardware without a list of gotchas.
It's still basically impossible to find MI200s in the cloud.
On desktops, only the 7000 series is kinda competitive for AI in particular, and you have to go out of your way to get it running quickly in PyTorch. The 6000 and 5000 series just weren't designed for AI.
It's crazy to me that no other hardware company has sought to compete for the deep learning training/inference market yet ...
The existing ecosystems (cuda, pytorch etc) are all pretty garbage anyway -- aside from the massive number of tutorials it doesn't seem like it would actually be hard to build a vertically integrated competitor ecosystem ... it feels a little like the rise of rails to me -- is a million articles about how to build a blog engine really that deep a moat ..?
First of all you need hardware with cutting-edge chips. Chips which can only be supplied by TSMC and Samsung.
Then you need the software, ranging all the way from the firmware and driver, through something analogous to CUDA with libraries like cuDNN, cuBLAS and many others, up to integrations into PyTorch and TensorFlow.
And none of that will come for free, like it came to Nvidia. Nvidia built CUDA and people built their DL frameworks around it in the last decade, but nobody will invest their time into doing the same for a competitor, when they could just do their research on Nvidia hardware instead.
https://www.cerebras.net/ Has innovative technology, has actual customers, and is gaining a foothold in software-system stacks by integrating their platform into the OpenXLA GPU compiler.
Yes, I was expecting a RAM-doubled edition of the H100; this is just a higher-binned version of the same part.
I got an email from vultr, saying that they're "officially taking reservations for the NVIDIA HGX H100", so I guess all public clouds are going to get those soon.
>H100 was announced at GTC 2022, and represented a huge stride over A100. But a year later, H100 is still not generally available at any public cloud I can find
You can safely assume an entity bought as many as they could.
I was wondering today if we would start to see the reverse of this. Small ASICs, or some kind of LLM-optimized GPU for desktops, or maybe even laptops or mobile. I think it's evident that LLMs are here to stay and will be a major part of computing for a while. Getting this local, so we aren't reliant on clouds, would be a huge boon for personal computing. Even if it's a "worse" experience, being able to load up an LLM on our computer and tell it to only look at this directory and help out would be cool.
In fact, Qualcomm has announced a "Cloud AI" PCIe card designed for inference (as opposed to training & inference) [1, 2]. It's populated with NSPs like the ones in mobile SoCs.
Apple, Intel, AMD, Qualcomm, Samsung, etc. already have "neural engines" in their SoCs. These engines continue to evolve to better support common types of models.
Why is the sentiment here so much that LLMs will somehow be decentralized and run locally at some point? Has the story of the internet so far not been that centralization has pretty much always won?
It makes business sense as well. It doesn't make much sense to build an entire company around the idea that OpenAI's APIs are always available and you won't eventually get screwed. "Be careful of basing your business on top of another" and all that yadda yadda.
If you want to build a business around LLMs, it makes a lot of sense to be able to run the core service of what you want to offer on your own infrastructure instead of relying on a 3rd party that most likely doesn't care more than 1% about you.
Running LLMs on your own servers doesn't mean PCs which is what this thread is about. A100/H100 is fine for a business but people can't justify them for personal use.
Because that is pretty much the pendulum swinging in the IT world. Right now it is solidly in 'centralization' territory; hopefully it will swing back towards decentralization in the future. The whole PC revolution was an excellent data point for decentralization. Now we're back to 'dumb terminals', but as local compute strengthens, the things you need a whole farm of servers for today can probably fit in your pocket tomorrow, or at the latest in a few years.
Not sure this really tracks. Local compute has always been strengthening as a steady incline. Yet we haven't really experienced any sort of pendulum shift, it's always been centralization territory.
The reasoning seems mostly obvious to me here: people do not care for the effort that decentralization requires. If given the option to run AI off some website to generate all you want, people will gladly do this over using their local hardware due to the setup required.
The unfortunate part is that it takes so much longer to create not for profit tooling that is just as easy to use, especially when the calling to turn that into your for profit business in such a lucrative field is so tempting. Just ask the people who have contributed to Blender for a decade now.
Absolutely not. Computers used to be extremely centralized and the decentralization revolution powered a ton of progress in both software development and hardware development.
You can run many AI applications locally today that would have required a massive investment in hardware not all that long ago. It's just that the bleeding edge is still in that territory. One major optimization avenue is the improvement of the models themselves: they are large because they have large numbers of parameters, but the bulk of those parameters have little to no effect on the model output. There is active research on 'model compression', which has the potential to extract the working bits from a model while discarding the non-working bits without affecting the output, realizing massive gains in efficiency (both in power consumption and in what it takes to run the model).
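As a concrete (if crude) illustration of the idea, here's a minimal magnitude-pruning sketch using PyTorch's built-in utilities; this is not the compression methods from the research mentioned above, just the simplest possible version of "discard the low-impact weights":

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # Toy stand-in for "a model where most parameters barely matter".
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

    # Zero out the 50% smallest-magnitude weights in each Linear layer.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.5)
            prune.remove(module, "weight")  # bake the mask into the weights

    total = sum(p.numel() for p in model.parameters())
    zeros = sum((p == 0).sum().item() for p in model.parameters())
    print(f"{zeros}/{total} parameters zeroed out")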
Have a look at the kind of progress that happened in the chess world with the initial huge ML powered engines that are beaten by the kind of program that you can run on your phone nowadays.
The bleeding edge will always be in that territory. It still requires a massive investment today to run AI applications locally to produce anywhere near as good results. People are spending upwards of $2000 for a GPU just to get decent results when it comes to image generation, many forgoing this entirely and just giving Google a monthly fee to use their hardware.
Which is the point, decentralization will always be playing catch up here unless something really interesting happens. It has absolutely nothing to do with local compute power, that has always been on an incline. We just get fed scraps down the line.
Today's scraps are yesterday's state-of-the-art, and that's very logical and applies to far more than just AI applications. It's the way research and development result in products and the subsequent optimization. This has been true since the dawn of time in one form or another. At some point stone tools were high tech and next to unaffordable. Then it was bronze, then iron, and at some point we finally hit steam power. From there to the industrial revolution was a relatively short span, and from there to electricity, electronics, solid state, computers, personal computers, mobile phones, smartphones and so on in ever decreasing steps.
If anything the steps are now so closely following each other that we have far more trouble tracking the societal changes and dealing with them than that we have a problem with the lag between technological advancement and its eventual commoditization.
Nvidia's business model encourages this, for starters. They charge a huge markup for their datacenter GPUs through some clever licensing restrictions, so it is cheaper per FLOP to run inference on a personal device.
Centralization of compute has not always won (even if that compute is mostly controlled by a single company). The failure of cloud gaming vs consoles, and the success of Apple (which is very centralized but pushes a lot of ML compute out to the edge) for example.
I think the sentiment is both. There will be advanced centralized LLMs, and people want the option to have a personal one (or two). There needn't be a single solution.
Have been for years. Maybe lots of years. It's expensive to have a go (many engineers plus cost of making the things) and it's difficult to beat the established players unless you see something they're doing wrong or your particular niche really cares about something the off the shelf hardware doesn't.
Nvidia don't want consumers using consumer GPUs for business.
If you are a business user then you must pay Nvidia gargantuan amounts of money.
This is the outcome of a market leader with no real competition - you pay much more for less performance than the consumer GPUs, and you are forced into using their business GPUs through software license restrictions on the drivers.
That was always why the Titan line was so great - they typically unlocked features in between Quadro and Gaming cards. Sometimes it was subtle (like very good FP32 AND FP16 performance), sometimes it was full 10-bit colour support that you only got with a Titan. Now it seems like they have opened up even more of those features to consumer cards (at least the creative ones) with the studio drivers.
What am I missing in my understanding of Studio Drivers?
"""
How do Studio Drivers differ from Game Ready Drivers (GRD)?
In 2014, NVIDIA created the Game Ready Driver program to provide the best day-0 gaming experience. In order to accomplish this, the release cadence for Game Ready Drivers is driven by the release of major new game content giving our driver team as much time as possible to work on a given title. In similar fashion, NVIDIA now offers the Studio Driver program. Designed to provide the ultimate in functionality and stability for creative applications, Studio Drivers provide extensive testing against top creative applications and workflows for the best performance possible, and support any major creative app updates to ensure that you are ready to update any apps on Day 1.
""
An alleged photo of an engineering sample was spotted in the wild a while ago, but no one knows if it's actually going to end up being a thing you can buy.
Exactly, we're just below that sweet spot right now.
For example on 24GB, Llama 30B runs only in 4bit mode and very slowly, but I can imagine a RLHF finetuned 30B or 65B version running in at least 8bit would be actually useful, and you could run it on your own computer easily.
Do you know where the cutoff is? Does 32GB VRAM give us 30B int8 with/without a RLHF layer? I don't think 5090 is going to go straight to 48GB, I'm thinking either 32 or 40GB (if not 24GB).
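Napkin math, counting weights only (real usage also needs room for activations and the KV cache, so treat these as lower bounds):

    # Weights-only VRAM estimate: parameters x bytes per weight.
    def weights_gb(params_billion, bits):
        return params_billion * bits / 8  # 1e9 params * (bits/8) bytes = GB

    for params in (30, 65):
        for bits in (16, 8, 4):
            print(f"{params}B @ {bits}-bit: ~{weights_gb(params, bits):.0f} GB of weights")

    # 30B @ 8-bit is ~30 GB of weights alone, so 32GB is borderline once you
    # add activations/KV cache; 48GB would be comfortable. 65B @ 8-bit (~65 GB)
    # won't fit on any single consumer card.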
I've been told the 4-bit quantization slows it down, but don't quote me on this since I was unable to benchmark at 8-bit locally.
In any case, you're right that it might not be as significant; however, the quality of the output increases at 8/16-bit, and running 65B is completely impossible on 24GB.
Ahh yes, looks like I was too generous with my numbers. The new Quadro with 48GB VRAM is $7k, so you would probably need $14k plus a Threadripper/Xeon/EPYC workstation, because you won't have enough PCIe lanes/RAM/memory bandwidth otherwise.
So maybe more accurate is $200k+ a year and $20-30k on a workstation.
I grew up on $20k a year; the numbers in tech are baffling!
Nvidia can't do a large 'consumer' card without cannibalizing their commercial ML business. ATI doesn't have that problem.
ATI seems to be holding the idiot ball.
Port Stable Diffusion and CLIP to their hardware. Train an upsized version sized for a 48GB card. Release a prosumer 48GB card... get huge uptake from artists and creators using the tech.
GPUs are going to be weird, underconfigured and overpriced until there is real competition.
Whether or not there is real competition depends entirely on whether Intel's Arc line of GPUs stays in the market.
AMD has strangely decided not to compete. Its newest GPU, the 7900 XTX, is an extremely powerful card, close to the top-of-the-line Nvidia RTX 4090 in raster performance.
If AMD had introduced it at an aggressively low price then they could have wedged Nvidia, which is determined to exploit its market dominance by squeezing the maximum money out of buyers.
Instead, AMD has decided to simply follow Nvidia in squeezing for maximum prices, with AMD prices slightly behind Nvidia's.
It's a strange decision from AMD, which is well behind in market share and apparently disinterested in increasing it by competing aggressively.
So a third player is needed - Intel - since it's a lot harder for three companies to sit on outrageously high prices for years rather than compete with each other for market share.
You are correct that the manufacturing cost has gone up.
You are incorrect that this is the root cause of GPU prices being sky high.
If manufacturing cost was the root cause then it would be simply impossible to bring prices down without losing money.
The root cause of GPU prices being so high is lack of competition - AMD and Nvidia are choosing to maximise profit, and they are deliberately undersupplying the market to create scarcity and therefore prop up prices.
In summary, GPU prices are ridiculously high because Nvidia and AMD are overpricing them in the belief that this is what gamers will pay, NOT because manufacturing costs have forced prices to be high.
The really interesting upcoming LLM products are from AMD and Intel... with catches.
- The Intel Falcon Shores XPU is basically a big GPU that can use DDR5 DIMMs directly, hence it can fit absolutely enormous models into a single pool. But it has been delayed to 2025 :/
- AMD has not mentioned anything about the (not delayed) MI300 supporting DIMMs. If it doesn't, it's capped to 128GB, and it's being marketed as an HPC product like the MI200 anyway (which you basically cannot find on cloud services).
Nvidia also has some DDR5 Grace CPUs, but the memory is embedded and I'm not sure how much of a GPU they have. Other startups (Tenstorrent, Cerebras, Graphcore and such) seem to have underestimated the memory requirements of future models.
That's the problem. Good DDR5 RAM's memory speed is <100GB/s, while Nvidia's HBM goes up to 2TB/s, and the bottleneck still lies in memory speed for most applications.
Not if the bus is wide enough :P. EPYC Genoa is already ~450GB/s, and the M2 Max is 400GB/s.
Anyway, what I was implying is that simply fitting a trillion-parameter model into a single pool is probably more efficient than splitting it up over a power-hungry interconnect. Bandwidth is much lower, but latency is lower too, and you are shuffling much less data around.
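For a rough sense of scale: single-stream decoding has to stream essentially all the weights through the memory bus for every generated token, so tokens/s is bounded by roughly bandwidth divided by model size. Quick sketch with assumed numbers:

    # Upper bound on single-stream decoding speed: each generated token reads
    # every weight once, so tokens/s <= memory_bandwidth / model_bytes.
    MODEL_BYTES = 70e9  # e.g. a ~70B-parameter model at 8-bit (assumption)

    systems = {
        "dual-channel DDR5 (~100 GB/s)": 100e9,
        "EPYC Genoa, 12-channel DDR5 (~450 GB/s)": 450e9,
        "HBM GPU (~2 TB/s)": 2e12,
    }
    for name, bw in systems.items():
        print(f"{name}: <= ~{bw / MODEL_BYTES:.1f} tokens/s")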
I wonder how soon we'll see something tailored specifically for local applications. Basically just tons of VRAM to be able to load large models, but not bleeding edge perf. And eGPU form factor, ideally.
The Apple M-series CPUs with unified RAM are interesting in this regard. You can get a 16-inch MBP with an M2 Max and 96GB of RAM for $4300 today, and I expect the M2 Ultra to go to 192GB.
I'm not an ML scientist by any means, but perf seems as important as RAM from what I'm reading. Running prompts with an internal chain of thought (eating up more TPU time) appears to give much better output.
It's not that perf is not important, but not having enough VRAM means you can't load the model of a given size at all.
I'm not saying they shouldn't bother with RAM at all, mind you. But given some target price, it's a balance thing between compute and RAM, and right now it seems that RAM is the bigger hurdle.
I'm super duper curious if there are ways to glob together VRAM between consumer-grade hardware to make this whole market more accessible to the common hacker?
You actually can split a model [0] onto multiple GPUs even without NVLink, just using PCIe for the transfers.
Depending on the model, the performance is sometimes not all that different. I believe for inference alone on some models the speed difference may barely be noticeable, whereas for training it may make a 10+% difference [1].
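If anyone wants to try it, the lowest-effort route I know of is letting Hugging Face accelerate place the layers across whatever GPUs are visible; the checkpoint name and per-card memory caps below are placeholders:

    # Sketch: shard a causal LM across two 24GB cards over PCIe using
    # transformers + accelerate. Model id and memory caps are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "some/large-causal-lm"  # hypothetical checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",                    # split layers across visible GPUs
        max_memory={0: "22GiB", 1: "22GiB"},  # leave headroom on each card
    )

    inputs = tokenizer("Hello", return_tensors="pt").to(0)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))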
I remember reading about a guy who soldered 2GB VRAM modules on his 3060 12GB (replacing the 1GB modules) and was able to attain 24GB on that card. Or something along those lines.
How is this card (which is really two physical cards occupying 2 PCIe slots) exposed to the OS? Does it show up as a single /dev/gfx0 device, or is the unification a driver trick?
Kinda depressing if you consider how they removed NVLink in the 4090, stating the following reason:
> “The reason we took [NVLink] off is that we need I/O for other things, so we’re using that area to cram in as many AI processors as possible,” Jen-Hsun Huang explained of the reason for axing NVLink.[0]
"NVLink is bad for your games and AI, trust me bro."
But then this card, actually aimed at ML applications, uses it.
Market segmentation. Back when the Pascal architecture was the latest thing, it didn't make much sense to buy expensive Tesla P100 GPUs for many professional applications when consumer GeForce 1080 Ti cards gave you much more bang for the buck with few drawbacks. From the corporation's perspective it makes so much sense to differentiate the product lines more, now that their customers are deeply entrenched.
What exactly is an SXM5 socket? It sounds like a PCIe competitor, but proprietary to nvidia. Looking at it, it seems specific to nvidia DGX (mother?)boards. Is this just a "better" alternative to PCIe (with power delivery, and such), or fundamentally a new technology?
The TDP row in the comparison table must be in error. It shows the card with dual GH100 GPUs at 700W and the one with a single GH100 GPU at 700-800W ?!
I would sell a kidney for one of these. It's basically impossible to train language models on a consumer 24GB card. The jump up is the A6000 ADA, at 48GB for $8,000. This one will probably be priced somewhere in the $100k+ range.
Use 4 consumer-grade 4090s then. It would be much cheaper and better in almost every aspect. Also, even with this, forget about training foundational models: Meta spent 82k GPU hours on the smallest LLaMA and 1M hours on the largest.
If I remember correctly, NVLink adds 100GB/s (where PCIe 4.0 is 64GB/s). Is it really worth getting 3090 performance (roughly half) for that extra bus speed?
I was just saying to a colleague the day before this announcement that the inevitable consequence of the popularity of large language models will be GPUs with more memory.
Previously, GPUs were designed for gamers, and no game really "needs" more than 16 GB of VRAM. I've seen reviews of the A100 and H100 cards saying that the 80GB is ample for even the most demanding usage.
Now? Suddenly GPUs with 1 TB of memory could be immediately used, at scale, by deep-pocket customers happy to throw their entire wallets at NVIDIA.
This new H100 NVL model is a Frankenstein's monster stitched together from whatever they had lying around. It's a desperate move to corner the market as early as possible. It's just the beginning, a preview of the times to come.
There will be a new digital moat, a new capitalist's empire, built upon the scarcity of cards "big enough" to run models that nobody but a handful of megacorps can afford to train.
In fact, it won't be enough to restrict access by making the models expensive to train. The real moat will be models too expensive to run. Users will have to sign up, get API keys, and stand in line.
"Safe use of AI" my ass. Safe profits, more like. Safe monopolies, safe from competition.
AMD seems to be focusing on traditional HPC; they've got a ton of 64-bit FLOPS in their recent commercial parts. I expect their server GPUs are mostly for chasing supercomputer contracts, which can be pretty lucrative, while they cede model training to Nvidia.
For now Nvidia is a very dominant player for sure, but in the long run do you see that changing, with competition from AMD-Xilinx, Intel, or potential AI hardware startups? Why have the startups and other big players failed to make a dent in Nvidia's dominance? Considering how big this market will be in the coming years, there should have been significant investment by other players, but they seem incapable of making even a competitive chip, while Nvidia, which is already so far ahead, is running even faster, expanding its software ecosystem across various industries.
The new "NVL" variant adds ~20% more memory per GPU by enabling the sixth HBM stack (previously only five out of six were used). Additionally, GPUs now come in pairs with 600GB/s bandwidth between the paired devices. However, the pair then uses PCIe as the sole interface to the rest of the system. This topology is an interesting hybrid of the previous DGX (put all GPUs onto a unified NVLink graph), and the more traditional PCIe accelerator cards (star topology of PCIe links, host CPU is the root node). Probably not an issue, I think PCIe 5.0 x16 is already fast enough to not bottleneck multi-GPU training too much.