Sometimes things really are compute bound, and sometimes you get a "big" workload that still fits nicely in the GPU's L2. Generative AI is mostly at the far end of "memory bound."
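For a sense of scale, here's a hedged back-of-envelope calculation (rough public spec numbers, a hypothetical 70B-parameter model, batch-1 decoding) showing why single-stream LLM inference sits so far on the memory-bound side:

```python
# Roofline-style sketch with approximate public figures; not measurements.
peak_flops = 989e12   # H100 SXM dense FP16/BF16 tensor throughput, FLOP/s (approx.)
mem_bw = 3.35e12      # H100 SXM HBM3 bandwidth, bytes/s (approx.)
ridge = peak_flops / mem_bw          # ~295 FLOP/byte needed to become compute-bound

params = 70e9                        # hypothetical 70B-parameter model
flops_per_token = 2 * params         # ~2 FLOPs per weight per generated token
bytes_per_token = 2 * params         # every fp16 weight read once per token at batch 1

intensity = flops_per_token / bytes_per_token   # ~1 FLOP/byte
print(f"ridge: {ridge:.0f} FLOP/byte, decode intensity: {intensity:.0f} FLOP/byte")
# ~295 vs ~1: memory bandwidth, not compute, sets the token rate.
```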
Some ML startups (like Graphcore) seemed to bet on large caches, sparsity and clever preprocessing instead of raw memory bandwidth, but I think their strategy was compromised when model sizes exploded. Even Cerebras was kinda caught off guard when their 40GB pizza was suddenly kind of cramped.
Current ML architectures tend to be heavily optimized for ease of very large scale parallelism in training, even at the expense of a bigger model size and compute cost. So there may be some hope for different architectures as we stop treating idle GPUs as being basically available for free and start budgeting more strictly for what we use.
In hlb-CIFAR10 the MaxPooling ops are the slowest kernels now and take longer than all 7 of the convolution operations combined, if I understand correctly.
Memory-bound operations seem to rather consistently be the limiting factor in my personal ML research work, at least. It can be rather annoying!
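A quick profiling pass is usually enough to check which kernels dominate. A minimal sketch, assuming PyTorch with a CUDA device and using a stock torchvision model as a stand-in (not the hlb-CIFAR10 code):

```python
import torch
import torchvision

model = torchvision.models.resnet18().cuda()
x = torch.randn(512, 3, 32, 32, device="cuda")

# Profile one forward/backward pass and rank ops by GPU time to see which
# kernels (convs, pooling, elementwise, etc.) actually dominate.
with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU,
                    torch.profiler.ProfilerActivity.CUDA]) as prof:
    model(x).sum().backward()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```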
The memory makers bump up the speed the memory itself is capable of through manufacturing improvements. And I guess the H100 memory controller has some room to accept the faster memory.
Is the error rate due to quantum tunneling at so many nanometers still a fundamental limit to transistor density and thus also (G)DDR and HBM performance per unit area, volume, and charge?
https://news.ycombinator.com/item?id=38056088 ; a new QC and maybe in-RAM computing architecture like HBM-PM: maybe glass on quantum dots in synthetic DNA, and then still wave function storage and transmission; scale the quantum interconnect
My understanding is that while quantum tunneling defines a fundamental limit to miniaturization of silicon transistors we are still not really near that limit. The more pressing limits are around figuring out how to get the EUV light to consistently draw denser and denser patterns correctly.
>> But now, the researchers from Southampton, together with scientists from the universities of Dortmund and Regensburg in Germany, have successfully demonstrated that a beam of light can not only be confined to a spot that is 50 times smaller than its own wavelength but also “in a first of its kind” the spot can be moved by minuscule amounts at the point where the light is confined
FWIU, quantum tunneling is regarded as an error to be eliminated in digital computers, but it may be a sufficient quantum computing component: cause electron-electron wave function interaction and measure it. But the readout in adjacent RAM transistors is just zero or 1. Lol, "Rowhammer for qubits"
For anyone wondering how this applies to big LLMs: 144GB is big, but you'd need roughly double that to train GPT-3.x while fitting everything in memory at once.
Of course, even if 300GB GPUs were available tomorrow and you sold a million-dollar house to buy as many as that would allow, it'd still take years to train once.
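A hedged back-of-envelope sketch of that timing claim, using the common ~6 FLOPs/parameter/token rule of thumb and rough GPT-3-scale numbers (the GPU count, price, and utilization below are assumptions, not figures from the comment):

```python
params = 175e9
tokens = 300e9
train_flops = 6 * params * tokens      # ~3.2e23 FLOPs for one full training run

gpus = 25                              # very roughly what ~$1M buys at ~$40k per H100
per_gpu_flops = 989e12                 # H100 dense FP16 peak, FLOP/s (approx.)
mfu = 0.35                             # optimistic sustained model FLOPs utilization

seconds = train_flops / (gpus * per_gpu_flops * mfu)
print(seconds / (365 * 86400), "years")   # ~1.2 years here; a bigger "3.x"-scale
                                          # model or lower utilization makes it several
```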
They are also doing a custom 4 chip Hopper GPU for a European HPC initiative. Seems like a good move. Basically allows you to have 4 H100s connected via NVLink without needing a separate SXM board.
This is a single-chip H100 NVL. Both are GH100s with the same tweaked, 20% wider 6144-bit HBM3e interface (versus 5120-bit on other H100s), running at a higher speed.
The HBM3e loadout is slightly different from what the H100 NVL's was going to be, but this definitely seems like a higher-bin H100. It's basically as if AMD had shipped the 7900 XT, then later started selling the 7900 XTX: same chip, but with all the memory controllers brought up on this one.
I'm curious: Do you think there is a realistic chance for another chip maker to catch up and overtake NVidia in the AI space in the next few years or is their lead and expertise insurmountable at this point?
This is launched in response to the MI300X, and it should still not be enough to match AMD's product. It launches two quarters after the MI300X, but the B100 should arrive before AMD's MI400 generation.
StableHLO[1] and IREE[2] are interesting projects that might help AMD here, from [1]:
> Our goal is to simplify and accelerate ML development by creating more interoperability between various ML frameworks (such as TensorFlow, JAX and PyTorch) and ML compilers (such as XLA and IREE).
From there, their goal would most likely be to work with the XLA/OpenXLA teams on XLA[3] and IREE[2] to make ROCm a better backend.
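A tiny sketch of what that interoperability looks like in practice (assuming a JAX install; the function is a placeholder): the Python program is traced and lowered to StableHLO, which XLA or IREE can then compile for whatever backend plugin is present (CPU, CUDA, ROCm, TPU).

```python
import jax
import jax.numpy as jnp

def step(w, x):
    return jnp.tanh(x @ w)           # placeholder computation

w = jnp.ones((128, 64))
x = jnp.ones((8, 128))

print(jax.devices())                 # whichever backend plugin is installed
lowered = jax.jit(step).lower(w, x)  # trace and lower ahead of time
print(lowered.as_text()[:400])       # the StableHLO/MLIR module handed to the compiler
```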
That's exactly what the CUDA monopoly is meant to prevent, and as a fervent supporter of OpenCL (with two commercial apps), this is exactly the case I always make: even if some GPU came out tomorrow costing $0 and with infinite performance, all these people who paint themselves into a corner are hosed.
Not that anyone cares, and everyone keeps using CUDA while simultaneously complaining about Nvidia GPU prices, as if those two things have nothing to do with each other...
Have you had good experience with this for portability though? On what classes of hardware and OS?
I did a bit of work in OpenCL almost 10 years ago, and found it decently portable on a range of NVIDIA GPUs as well as Intel iGPUs. On the high end I used something like the Titan X while on the low end it was typical GPUs found in business class laptops.
But my limited exposure to AMD was terrible by comparison. Even though I am away from that work now, I still tend to try to run "clpeak" and one of my simpler image processing scripts on each new system. And while I liked a Ryzen laptop for general use or even games, it seemed like OpenCL was useless there. It seemed my best option was to ignore the GPU and use Intel's x86_64 SIMD OpenCL runtime.
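A minimal portability smoke test along those lines, assuming the pyopencl package and at least one installed OpenCL platform (placeholder kernel and sizes, not the commenter's actual scripts):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()        # whichever OpenCL platform/device is available
queue = cl.CommandQueue(ctx)

src = """
__kernel void add(__global const float *a, __global const float *b, __global float *out) {
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
"""
prg = cl.Program(ctx, src).build()

a = np.random.rand(1_000_000).astype(np.float32)
b = np.random.rand(1_000_000).astype(np.float32)
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_g = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prg.add(queue, a.shape, None, a_g, b_g, out_g)
out = np.empty_like(a)
cl.enqueue_copy(queue, out, out_g)
assert np.allclose(out, a + b)        # should pass identically on any conformant device
```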
My own understanding is that OpenCL is semi-obsolete at the moment (although newer standards revisions are still coming out, so this may change in the future) with forward-looking projects mostly targeting Vulkan Compute or SYCL.
(There are some annoying differences in the low-level implementations of OpenCL vs. Vulkan Compute, due to their being based on SPIR-V compute "kernels" vs. "shaders" respectively, that make it hard for them to interop cleanly. So that's why the choice can be significant.)
OpenCL tooling has always been bad compared with CUDA's, and that isn't Nvidia's fault; Intel and AMD are the ones to blame.
Only C was on offer; C++ and Fortran were never taken seriously enough, and other language stacks were never considered at all.
Thus everyone who enjoyed programming in anything other than C, with great libraries and graphical debuggers, flocked to CUDA. It now remains to be seen whether SYCL and SPIR-V will ever matter enough to win some of those folks back.
The probability of a processor emerging that is so much cheaper that it warrants moving over, and whose API also happens to be the OpenCL you already target, is very low. Aim for what is probable, not what is theoretically possible.
Most people who utilize this hardware aren't programming kernels directly for the GPU; they're using abstraction layers like PyTorch, TensorFlow, etc. For the developers of those kinds of frameworks, CUDA itself offers a lot of libraries, like cuBLAS.
There are relatively few people capable of implementing these frameworks without a solid CUDA-like foundation, and those who do exist would need a very strong incentive to do it.
The vast majority of work in ML isn't people working with CUDA directly - people use open source frameworks like PyTorch and TensorFlow to define a network and train it, and all the frameworks support CUDA as a backend.
Other backends are also available, such as CPU-only training. And you can export networks in reasonably-standard formats.
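That backend-agnosticism is mostly invisible at the framework level. A minimal PyTorch sketch (hypothetical toy model) of both points: the same code runs on CUDA or CPU, and the network can be exported to a reasonably standard format like ONNX.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
x = torch.randn(32, 128, device=device)
out = model(x)                    # identical code path on either backend

# Export in a vendor-neutral format so the network isn't tied to one runtime.
torch.onnx.export(model, x, "model.onnx")
```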
nvidia's moat is much more mature framework support than AMD's; widespread popularity due to that good framework support, ensuring everyone develops on nvidia, thus maintaining their support lead; much faster performance than CPU-only training; and a price that, though high, is a lot less than an ML developer's salary.
If you need 24GB of vram and nvidia offers that for $1600 while AMD offers it for $1300, how many compatibility problems do you want to deal with to save a single day's wages?
But nvidia's moat is far from guaranteed. Huge users like OpenAI and Facebook might find improving AMD support pays for itself.
> Huge users like OpenAI and Facebook might find improving AMD support pays for itself.
At that scale they may actually develop their own hardware a la Google TPU.
If you want to just focus on the AI problem and not on infrastructure, just use NVidia. If you want control and efficiency, design your own. AMD kind of falls in a weird middle ground with respect to the massive companies.
Google's focus on TPUs has caused them many issues, though; other giant players might see spending FTE-equivalents on PyTorch etc. development as a better investment choice.
CUDA code can be forward-ported to AMD's HIP, which can be used with the ROCm stack. For a more standards-focused alternative there's also SYCL, which has implementations targeting a variety of hardware backends (including HIP) and may also target Vulkan Compute in the future.
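How close that mapping is shows up even at the framework level: on a ROCm build of PyTorch, the HIP backend is exposed through the existing torch.cuda API, so most Python-side "CUDA" code runs unchanged. A small sketch, assuming a supported AMD GPU with ROCm installed:

```python
import torch

print(torch.version.hip)              # set on ROCm builds, None on CUDA builds
print(torch.cuda.is_available())      # True on a supported AMD GPU under ROCm

x = torch.randn(1024, 1024, device="cuda")   # "cuda" maps to the HIP device here
y = x @ x                                    # runs via rocBLAS instead of cuBLAS
print(y.device)
```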
>I'm curious: Do you think there is a realistic chance for another computer maker to catch up and overtake IBM in the computer space in the next few years or is their lead and expertise insurmountable at this point?
>It describes how large incumbent companies lose market share by listening to their customers and providing what appears to be the highest-value products, but new companies that serve low-value customers with poorly developed technology can improve that technology incrementally until it is good enough to quickly take market share from established business.
I'm aware, but I'd argue that in addition to the competitive moats described by Hamilton Helmer (7 powers guy) there is a real moat in unique technological expertise in the chip industry. E.g. the chip making machines that ASML makes or the 3nm chips that TSMC produces have reached a level of sophistication that will take 3+ years for competitors to replicate, thus granting them a sort of quasi monopoly for the foreseeable future.
I'd say those are covered by Helmer under scale power. Spending millions on tiny process optimizations or other research is possible due to the large revenue streams that are coming in. For example, in terms of scale economies, only when you sell thousands of high-end GPUs per month can you hire people to write highly optimized compilers.
Maybe - though I think it's more of a cornered resource. You have specialized knowledge that attracts specialized people which create more specialized knowledge. Creates a sorta feedback loop.
Although no lead is insurmountable, the fixed capital investment and mature software ecosystem specific to this sector makes it harder to imagine what a competitor would look like.
Given how large the prize is, the next chapter of chip development is likely to be nvidia vs state sponsored projects. China, in particular, will funnel further resources into acquiring this technology by any means necessary, including (more) industrial sabotage and outright theft. It's going to be interesting to see how this will play out. Up until a few years ago China was viewed as being a formidable competitor for projects of this nature, but as the country has moved to become increasingly authoritarian, so too have its decision making and execution declined in quality.
Photonics can run light-based matrix multiplication for a fraction of what current GPUs require; it's only a matter of time until they initiate a complete paradigm shift.
Last time I checked, the biggest players only did inference architectures, not training. Presumably they shy away from implementing Hinton's forward-forward algorithm or something similar.
Behind in what dimension? The most expensive Nvidia chips are much faster than Google TPUs, but the Google TPUs are competitive in terms of end to end training costs (roughly you can think of this as FLOPs per dollar).
I use TPUs on Colab all the time and I'm freaking happy. Maybe there's a way to use H100's do the same thing, but my code is already written to host utterly massive files on gcs buckets as tfrecord files and load them up on TPUs during the training process. First started with the free ones and now I rent the newer ones because it's not that expensive. I recommend beginners try it out. I find the other architectures more expensive, at least for my use case.
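A rough sketch of that workflow (assuming TensorFlow on a Colab TPU runtime; the bucket path, feature spec, and model are placeholders): TFRecord shards hosted on GCS are streamed straight to the TPU during training.

```python
import tensorflow as tf

# Attach to the Colab TPU and build a distribution strategy for it.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Stream TFRecord shards from a GCS bucket (hypothetical path and features).
files = tf.io.gfile.glob("gs://my-bucket/train-*.tfrecord")

def parse(record):
    feats = tf.io.parse_single_example(
        record, {"x": tf.io.FixedLenFeature([128], tf.float32),
                 "y": tf.io.FixedLenFeature([], tf.int64)})
    return feats["x"], feats["y"]

ds = (tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
      .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(1024, drop_remainder=True)      # TPUs want fixed batch shapes
      .prefetch(tf.data.AUTOTUNE))

with strategy.scope():                        # model/optimizer built for the TPU
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

model.fit(ds, epochs=1)
```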
Maybe one of the big Chinese chip makers or AMD, but the growing popularity of CUDA (not compatible with other GPUs) makes that less likely in the near future
The performance jumps that Nvidia has made in a fairly short amount of time are impressive, but I can't help but feel like there is a real need for another player in this space. Hopefully AMD can challenge this supremacy soon.
You can use Gaudi2s at the new Intel Developer Cloud[0]. Not sure why they don't offer it on AWS though; seems a bit odd since they have the DL1 instances for the first-gen Gaudis.
Both parent comments were likely referring to any entity other than Taiwan. If you are a fabless chip designer or one of their customers, it makes sense to diversify away from Taiwan, even if that comes at Taiwan's expense.
I worry that a lot of large cap companies either depend directly on TSMC (Nvidia, AMD, Apple) or depend on a company that depends on TSMC (Microsoft/OpenAI, Arm). It's TSMC all the way down and that scares me.
It’s easy to forget that TSMC was not the chipmaking leader until 7 or 8 years ago. Intel was. Things can change quickly in tech. Intel is trying to reclaim their old glory. We’ll see if they succeed.
Historians might look back at Intel’s and GloFo’s stumbles in the mid 2010s as being a pivotal turning point. Chips are the new oil and that makes the mid term future very dangerous.
I'd rather Intel. People have been pleading with AMD for years to compete with Nvidia, but AMD really has not put in a proper effort. They still don't look like they are putting in a proper effort.
AMD shipped Frontier. Compare and contrast with Intel's Aurora.
Epyc took the performance crown from Intel. Games consoles have been AMD for ages.
AMD are competing with Intel and Nvidia simultaneously with fewer resources than either, having come back from near bankruptcy in recent memory.
There's been plenty of effort and execution from team red.
It's commercially unfortunate that the crypto and now deep learning crowd don't particularly value the flexibility or control that comes from an open source toolchain. Regardless, I don't think the Cuda moat will hold out.
> They still don't look like they are putting in a proper effort.
Quite the contrary: they've turned around the company to focus on AI. Legacy software projects are on hold and software developers have been moved to work on AI under a new VP (a former Xilinx exec). They have purchased some startups to get experienced AI developers.
AMD was fighting Intel for its life. After a number of flops, it only got a big breakthrough on the CPU-side with Zen just over half a decade ago - which is not that far back. Hopefully they now have a bit of money saved up in their war chest to help the GPU division.
It's a server motherboard with 8 GPUs crammed on it, facing up. The tall towers are the GPU heatsinks. I believe the blade looking things on the side are CPU RAM, the heatsinks on the back are covering the CPUs, and the little heatsink in the middle must be the CPU VRMs. Fans are in the back, and they crammed some electrical components on the front where all the IO is.
I had a shock when I looked up prices for H100 GPUs, wanting to use one just for personal experimentation and for an upcoming hackathon. How much does this one cost? $300,000?
To elaborate: you can't really buy these except in specific configurations from Supermicro (usually 8x H100) or the like. So take whatever chip-specific cost you have in mind, and 8x it, and add on the cost of CPU/memory/storage. NVIDIA doesn't bother to sell these in a configuration that you can plug into your desktop.
Am I the only one that's annoyed by the non-alphabetical model numbers? Why not do a B100 after the A100 instead of jumping to H (supposing there won't be a C100 or D200 at some point)? Like, wtf Nvidia.
They name their architectures after scientists (Maxwell, Pascal, Turing, Volta, Ampere, Lovelace, Hopper). That's what the GPU initial stands for.
As for the number, the die name counts down to 100 (with GA107, for instance, being a small GPU die and GA100 being the big one), and the big datacenter GPU as a product inherits the 100.
Also, it occurred to me that Nvidia does sometimes increment the die to 200 (e.g. GM200, as the Maxwell 100-series was a single small oddball die). It's possible that they "refreshed" the GH100 die and are codenaming it GH200.
At first they used "measures of hotness", but ran out a few generations in. Coincidentally, degrees of warmth are all named after scientists. So they continued with scientists.
I recently had occasion to evaluate a database of 1200+ NVIDIA GPUs and can tell you that the only thing consistent about the model numbers is their inconsistency. For example, what is an RTX 4000? It could be the 2018 Quadro RTX 4000, the Quadro RTX 4000 Max-Q, or Quadro RTX 4000 Mobile (all Turing cards), but it could also be the RTX 4000 Mobile Ada Generation (Ada Lovelace card released 2023).
Why do they still sell hardware now that practically every other business has moved to being a service provider? If we set aside the fact that it would be an awful move for end-users, what's to stop Nvidia from cornering the market by only renting them in their own data centers? Is it the logistics of moving the massive training sets?
That would be a highly risky bet on Nvidia becoming AWS faster than AWS can become Nvidia.
What they're doing is instead trying to make sure that their GPUs continue to be seen as the best option in the short/medium term (by having them accessible everywhere), and trying to commoditize their complement by giving small cloud providers disproportionate GPU allocations, which they hope will drive customers from the big providers to the smaller ones that a) aren't trying to build their own ML hardware, b) will have less negotiating leverage with Nvidia in the long term.
I'm asking why Nvidia doesn't maximize their profits by retaining the hardware and selling compute. They could capture the market from those other providers if they sold more FLOPS/kilowatt (or whatever metric is used). Compared to manufacturing GPUs/TPUs, running a datacenter (especially one that specializes in Nvidia hw) would seem to be a trivial task.
https://www.anandtech.com/show/21136/nvidia-at-sc23-h200-acc...
This is an H100 141GB, not new silicon like the Nvidia page might lead one to believe.