A lot of comments in this thread are parroting the common CUDA-moat argument, but that really only applies to training and R&D. The majority of spend is on inference, and the world is standardizing around a handful of common architectures that have been, or are being, implemented performantly on non-CUDA stacks.
2. Inference is easier than training, so other cards will become competitive in inference performance.
3. As AI applications start to proliferate, inference costs will start to dominate training costs.
4. Hence Nvidia's dominance will not last.
I think the most problematic assumption is 3. Every AI company we've seen thus far is locked in an arms race to improve model performance. Getting overtaken by another company's model is very harmful to business performance (see Midjourney's reddit activity after DALL-E 3), while a SOTA release instantly produces large revenue leaps.
We also haven't reached the stage where most large companies can fine-tune their own models, given the sheer complexity of engineering involved. But this will be solved with a better ecosystem, and will then trigger a boom in training demand that does scale with the number of users.
Will this hold in the future? Not indefinitely, but I don't see this ending in say 5 years. We are far from AGI, so scaling laws + market competition mean training runs will grow just as fast as inference costs.
Also, 4 is very questionable. Nvidia's cards are not inherently disadvantaged in inference; they may not be specialized ASICs, but they're good enough for the job, with an extremely mature ecosystem. The only reason other cards can be competitive against Nvidia's in inference is Nvidia's 70% margins.
Therefore, all Nvidia needs to do to defend against attackers is lower their margins. They'll still be extremely profitable; their competitors, not so much. This is already showing in the A100/H100 bifurcation: H100s are used for training, while the now-older A100s are used for inference. Inference card providers will need to compete against a permanently large stock of retired Nvidia training cards.
Apple is still utterly dominant in the phone business after nearly 2 decades. They capture the majority of the profits despite
1. Not manufacturing their own hardware
2. The majority of market share by units sold belonging to, say, Chinese/Korean manufacturers
If inference is easy while training is hard, it could just lead to Nvidia capturing all the prestigious and easy profits from training, while the inference market becomes a brutal low-margin business with 10 competitors. That would be the Apple situation.
What stats are you looking at? Looking at https://subredditstats.com/r/midjourney , I see a slower growth curve after the end of July, but still growing and seemingly unrelated to the DALL-E 3 release, which was more like end of October publicly.
Do you mind expanding on this? What do you see as the biggest things that make that milestone > 5 years away?
Not trolling, just genuinely curious -- I'm a distributed systems engineer (read: dinosaur) who's been stuck in a dead-end job without much time to learn about all this new AI stuff. From a distance, it really looks like a time of rapid and compounding growth curves.
Relatedly, it also does look -- again, naively and from a distance -- like an "AI is going to eat the world" moment, in the sense that current AI systems seem good enough to apply to a whole host of use cases. It seems like there's money sitting around just waiting to be picked up in all different industries by startups who'll train domain-specific models using current AI technologies.
Intelligence lies on a spectrum. So does skill generality. Ergo, AGI is already here. Online discussions conflate AGI with ASI -- artificial superhuman intelligence, i.e. sci-fi agents capable of utopian/dystopian world domination. When misused this way, AGI becomes a crude binary which hasn't arrived. With this unearned latitude, people subsequently make meaningless predictions about when silicon deities will manifest. Six months, five years, two decades, etc.
In reality, your gut reaction is correct. We have turned general intelligence into a commodity. Any aspect of any situation, process, system, domain, etc. which was formerly starved of intelligence may now be supplied with it. The value unlock here is unspeakable. The possibilities are so vast that many of us fill with anxiety at the thought.
I also think the space of products that involve training per-customer models is quite large, much larger than might be naively assumed given what is currently out there.
It may be true that inference is 100x larger than training in terms of raw compute. But I think it very well could be that inference is only 10x larger, or same-sized.
And besides, you can look at it in terms of sheer inputs and outputs. The size of data yet to be trained on is absolutely enormous. Photos and video, multimedia. Absolutely enormous. Hell, we need giant stacks of H100s for text. Text! The most compact possible format!
I also think it's ludicrous to think that NVIDIA hasn't witnessed the rise of alternate architectures and isn't either actively developing them (for inference) or seriously deciding which of the many startups in the field to outright buy.
They already have inference-specialized designs and architectures. For example, the entire Jetson line, which is inference-focused (you can train on them, but like, why would you?). They have several DLA accelerators on chip, besides the GPU, that are purely for inference tasks.
I think Nvidia will continue to be dominant because it's still a lot easier to go from CUDA training to CUDA (TensorRT-accelerated, let's say) inference than to migrate your model to ONNX to get it running on some weird inference stack.
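To be fair, the export call itself is short; the pain is everything downstream (unsupported ops, the target runtime's quirks). A minimal sketch of that first step, with a made-up toy model standing in for whatever you actually trained:

```python
import torch

# Hypothetical toy model; any traceable torch.nn.Module follows the same path.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.ReLU(),
).eval()

dummy_input = torch.randn(1, 768)  # example input used to trace the graph

# Export to ONNX; exotic or custom ops are where this typically breaks.
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},  # allow variable batch size
)
```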
Well, sure, if your model is small and light enough. But there's no training a 7B+ model on one (well, you could, but it would be so, so, so slow. Like, decades?).
Unless there is a collapse in AI, I suspect inference will just keep exploding, and as prices go down, volume will go up. Margins will go down and maybe we will land back at prices similar to standard GPUs. Still very expensive, but not crazy.
Right, and they don't even need to lower their training margins, since training needs a fancy interconnect they can just ship the same chip at different prices based on interconnect (and are already doing so with the 40xx vs H100).
I sold my GOOG shares from 2005 to buy NVDA last year and definitely agree with the article.
The thing is that Wall St doesn't understand any of the technical details and will just see huge profit growth from NVDA and drive the price crazy high.
NVDA has a PE of 80 and is up 6x since October 2022. It's the third highest company by market cap in the SP500, how much higher do you think it could possibly go?
It will keep going up until a Wall St analyst puts in an insane price target like $5000. Then it will be time to sell. It should follow AMZN from the dotcom bubble.
Except that Google is buying H100s and nVIDIA is not buying TPUs. Nobody is buying TPUs, even the ancient ones Google is willing to sell.
nVIDIA's hardware offerings are products for sale. Google's TPUs are a value-add to their rental market. This distinction is not lost on people. There's a reason all Google's big-load TPU clients they list in press releases are companies Google invested in.
Yes. To get good perf on pytorch you need to use custom kernels written in CUDA, and all the dev work is done in CUDA, so if you want to use new SOTA projects as they come out, you'll probably want an NVIDIA GPU unless you want to spend time hand-optimizing.
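Concretely, the pattern those repos rely on is JIT-compiling hand-written CUDA into the Python process, which presupposes nvcc and an NVIDIA card. A rough sketch; the source files and the `forward` binding here are hypothetical stand-ins for whatever fused kernel a given repo ships:

```python
import torch
from torch.utils.cpp_extension import load

# JIT-compile a custom CUDA kernel at import time; requires nvcc + an NVIDIA GPU.
# The source files are hypothetical placeholders for a repo's hand-written kernel
# (fused attention, custom norm, etc.).
my_op = load(
    name="my_op",
    sources=["my_op.cpp", "my_op_kernel.cu"],
    verbose=True,
)

x = torch.randn(1024, 1024, device="cuda")
y = my_op.forward(x)  # assumes the extension exposes a `forward` binding
```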
ALL of the comments make no sense after reading this line:
> Unlike Nvidia, which offers its GPUs out for other companies to purchase,
> Google's custom-made TPUs remain in-house for use across its own products and services.
nobody can purchase one of these.
and even if someone external to google could purchase one, why would they trust google for software, documentation or (the big if with google) ongoing support? this isn't their core business.
A few notes: Nvidia always announces their hardware in advance of availability, while Google typically announces some time after they've started using it internally. Also TPU has been more cost-focused than Nvidia, since Google uses these chips internally to serve their web traffic while Nvidia is supply constrained and can name their price right now, plus they don't pay the electricity bills.
There are also a few things wrong with the specs you quoted. 282 GB is split between two GPUs, it's 144 GB per GPU (not sure where the extra 6 GB went). The TPU pod throughput number you quoted is interconnect bandwidth which you compared to GH200's memory bandwidth. Those numbers are not comparable.
I believe GH200 does beat TPU v5p per chip, however the differences are not anywhere near as large as your comparison suggested. And it's likely that TPU v5p is dramatically more cost effective but we don't have Google's internal numbers to prove that.
The 282 GB figure comes from a "dual configuration" server with two CPU chips and two GPU chips:
> the new platform will be available in a wide range of configurations. The dual configuration [...] comprises a single server with 144 Arm Neoverse cores, eight petaflops of AI performance and 282GB of the latest HBM3e memory technology.
Note that one Grace Hopper CPU has 72 cores so this configuration clearly has two CPUs and two GPUs in one server, and the 282 GB HBM3e total is split between the two GPUs with 144 GB each (again, not sure where the extra 6 GB went, maybe disabled for yield issues?).
Actual dollars would be nice, since that is what the end user understands. Otherwise, what are they talking about, 1-million-unit orders? But nice catch, I didn't see it.
These comparisons are "Google Pod" vs. "Nvidia Supership", not single chips. Pods/Superships are collections of GPUs/TPUs, memory, and high-speed interconnect.
One Grace Hopper has: H100 chip, Grace CPU with 72 cores, 282GB of HBM3e memory and 480 GB LPDDR5X for the CPU.
If you want to compare the host too, then you should also take into account the TPU host.
Each TPUv5p host has 208 vCPU, 448 GB ram, 200 Gbps NIC (for data loading), 4 TPUv5p chips (8 cores). Each TPU chip has 95GB HBM with 2765GBps memory bandwidth, 3D interconnect with 4800 Gbps (3D interconnect means every chip has 6x 800 Gbps feeds, each going to a different chip in the torus).
One big difference between TPUs and GPUs is that the former have a few beefy cores (2 per chip) while the latter have a lot of smaller cores. That can help or hinder depending on the workload. It makes it simpler to do deterministic training, which helps both debugging and optimization.
So, each system has 448GB host RAM and 4 TPU chips with 380GB HBM. The amount of host memory doesn't really matter that much: it's there to make sure that accelerators can be fed data at full speed.
TPU5p pods are organized in "cubes" of 16 machines (64 TPUs) that can be assembled in any shape for different kinds of data/model parallelism (see twisted tori in the TPUv4 paper). Each cube has 6080GB (6TB) of HBM ram and 29.3 petaflops (bfloat16) of theoretical peak performance. You can get multiple exaflops from a single pod and use multiple pods in the same datacenter with multislice if needed.
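If you want to sanity-check those cube numbers against the per-chip figures above, the back-of-the-envelope is simple (my arithmetic from the quoted numbers, not official specs):

```python
# Back-of-the-envelope from the per-chip numbers quoted above.
hbm_per_chip_gb = 95       # HBM per TPUv5p chip
chips_per_cube = 64        # 16 hosts x 4 chips per host

cube_hbm_gb = chips_per_cube * hbm_per_chip_gb
print(cube_hbm_gb)         # 6080 GB (~6 TB), matching the figure above

cube_bf16_pflops = 29.3    # quoted peak bfloat16 for one cube
per_chip_bf16_pflops = cube_bf16_pflops / chips_per_cube
print(round(per_chip_bf16_pflops, 3))  # ~0.458 PFLOPs (bfloat16) per chip, implied
```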
TPUs are more power efficient than GPUs (e.g. the A100 has a TDP 3x that of TPUv4, plus a 3D torus requires fewer connections than a fat-tree interconnect, and optics consume their fair bit of power), so you can scale to higher compute per datacenter before you hit that bottleneck on really, really large models.
Also notice that you are using bare metal numbers for Grace hopper and I can only quote (what is publicly available) userspace/VM numbers for TPUs. For example, the actual physical RAM on the hosts is higher than 448 GB; so 448 vs 480 is apples to oranges.
Your numbers are still off. A GH200 pod has 144 terabytes of RAM; it isn't even measured in gigabytes. It looks like TPUv5p may have 851 terabytes which would be a generation ahead but you didn't show those numbers.
nabla9 said "one Grace Hopper" not "one GH200 pod".
Actually, DGX GH200 seems to come in different sizes, something I can't find clearly stated on Nvidia's website; what I do see are entirely inconsistent specs. I'm thinking they've changed it a few times.
DGX GH200 is next gen, but afaik not available to anyone aside from internal partners. All the quoted data here is for several of them together in a chassis.
>Each NVIDIA Grace Hopper Superchip in NVIDIA DGX GH200 has 480 GB LPDDR5 CPU memory, at an eighth of the power per GB, compared with DDR5, and 96 GB of fast HBM3.
TPUs have the second best software stack after CUDA though. JAX and Tensorflow support it before CUDA in some cases and it's the only Pytorch environment that comes close to CUDA for support.
Google has historically been weak at breaking into markets that someone else has already established, and I think TPUs are suffering the same fate. There is not enough investment in making the chips compatible with anything other than Google's preferred stack (which happens to not be the established industry stack). Committing to getting torch to switch from device = "cuda" to device = "tpu" (or whatever) without breaking the models would go a long way imo.
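For anyone who hasn't touched it: the switch today isn't a string change, it goes through a separate torch_xla package with its own idiom. A minimal sketch of the current state of things (not a proposal):

```python
import torch

def get_device(backend: str) -> torch.device:
    """Illustrates the divergence: CUDA is a plain device string, TPU needs torch_xla."""
    if backend == "cuda":
        return torch.device("cuda")
    if backend == "tpu":
        import torch_xla.core.xla_model as xm  # separate package, different idiom
        return xm.xla_device()
    return torch.device("cpu")

device = get_device("cuda")  # most GitHub repos effectively hard-code this branch
model = torch.nn.Linear(16, 16).to(device)
y = model(torch.randn(8, 16).to(device))
```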
I always thought Google was actually pretty good at taking over established, or rising markets, depending on the opportunity or threat they see from a competitor. Either by timely acquisition and/or ability to scale faster due to their own infrastructure capabilities.
- Google search (vs previous entrenched search engines in the early '00s)
- Adsense/doubleclick (vs early ad networks at the time)
- Gmail (vs aol, hotmail, etc)
- Android (vs iOS, palm, etc)
- Chrome (vs all other browsers)
Sure, i'm picking the obvious winners, but these are all market leaders now (Android by global share) where earlier incumbents were big, but not Google-big.
Even if Google's use of TPUs is purely self-serving, it will have a noticeable effect on their ability to scale their consumer AI usage at diminishing costs. Their ability to scale AI inference to meet "Google scale" demand, and do it cheaply (at least by industry standards), will make them formidable in the "ai race". This is why altman/microsoft and others are investing heavily in AI chips.
But I don't think their TPU will be only self-serving; rather, they'll scale its use through GCP for enterprise customers to run AI. Microsoft is already tapping their enterprise customers for this new "product". But those kinds of customers will care more about cost than anything else.
The long-term game here is a cost game, and Google is very, very good at that and has a headstart on the chip side.
TPUs were originally intended to just be for internal use (to keep google from being dependent on Intel and nvidia). Making them an external product through cloud was a mistake (in my opinion). It was a huge drain on internal resources in many ways and few customers were truly using them in the optimal way. They also competed with google's own nvidia GPU offering in cloud.
The TPU hardware is great in a lot of ways and it allowed google to move quickly in ML research and product deployments, but I don't think it was ever a money-maker for cloud.
Having used it heavily it is nowhere near painless. Where can you get a TPU? To train models you basically need to use GCP services. There are multiple services that offer TPU support, Cloud AI Platform, GKE, and Vertex AI. For GPU you can have a machine and run any tf version you like. For tpu you need different nodes depending on tf version. Which tf versions are supported per GCP service is inconsistent. Some versions are supported on Cloud AI Platform but not Vertex AI and vice versa. I have had a lot of difficulty trying to upgrade to recent tf versions and discovering the inconsistent service support.
Additionally, many operations that run on GPU are just unsupported on TPU. Sparse tensors have pretty limited support, and there's a bunch of models that will crash on TPU and require refactoring. Sometimes pretty heavy, thousands-of-lines refactoring.
edit: Pytorch is even worse. Pytorch does not implement efficient TPU device data loading and generally has poor performance, nowhere near comparable to tensorflow/jax numbers. I'm unaware of any pytorch benchmarks where TPU actually wins. For tensorflow/jax, if you can get it running and your model suits TPU assumptions (so a basic CNN), then yes, it can be cost effective. For pytorch, even simple cases tend to lose.
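For context, here's roughly what the extra torch_xla plumbing looks like; the dataset and hyperparameters are placeholders, and the point is the TPU-specific loader and optimizer step you have to remember to use (skip them and the TPU mostly sits idle waiting on the host):

```python
import torch
import torch_xla.core.xla_model as xm
from torch_xla.distributed.parallel_loader import MpDeviceLoader

device = xm.xla_device()
model = torch.nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Placeholder dataset; the interesting part is the TPU-specific loader wrapper.
dataset = torch.utils.data.TensorDataset(
    torch.randn(1024, 128), torch.randint(0, 10, (1024,))
)
loader = torch.utils.data.DataLoader(dataset, batch_size=64)
tpu_loader = MpDeviceLoader(loader, device)  # moves batches to the TPU and marks XLA steps

for x, y in tpu_loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    xm.optimizer_step(optimizer)  # TPU-aware replacement for optimizer.step()
```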
> Mojo is a closed source language that will never reach mainstream adoption among ML engineers and scientists.
[Citation needed]
The creator, Chris Lattner, previously created LLVM, clang, and Swift. In each case he said these projects would be open sourced, and in each case they were. In each case they reached mainstream adoption in their respective target markets.
He's stated that Mojo will be open source.
If you're going to claim with great confidence that this language will have a different outcome to his previous ones, then you probably should have some strong evidence for that.
hmm, the creator says (from his podcast with Lex Fridman when I listened to it) that they are open sourcing it, but that it is a project borne out of a private effort at their company and is still being used privately. So the aim is to open source it while taking community input and updating their private code to reflect the evolving design, so that when they release it, their internal lang and the open-sourced lang won't diverge.
of course not ideal, but better than "open sourcing" it and refusing every request because it does not work for their codebase. worse than having it open source from the get go, of course.
assuming that day comes, does it have a competitor in the works? a python superset, compatible with python libs, that still lets you go bare metal to the point of directly programming GPUs and TPUs without CUDA or anything?
"never" means you believe it will never be open sourced, or a competitor will surpass it by the time it is open sourced. or that you believe the premise of the lang is flawed and we don't need such a thing. Which one is it?
From what I see, they have a pretty active community and there is demand for such a system.
The github says something similar:
>This repo is the beginning of our Mojo open source effort. We've started with Mojo code examples and documentation, and we'll add the Mojo standard library as soon as we get the necessary infrastructure in place. The challenge is that we use Mojo pervasively inside Modular and we need to make sure that community contributions can proceed smoothly with good build and testing tools that will allow this repo to become the source of truth (right now it is not). We'll progressively add the necessary components, such as continuous integration, build tools, and more source code over time.
Doesn't really matter. Google's own infra is all the client you need to justify pouring tens of billions into a project like this; it's a bonus if others start using it more in the cloud, but they have so much use for accelerators across their own projects that they aren't going to stop.
This needs to be taken with a massive grain of salt, as LLM training performance hugely depends on the framework used.
And by "performance" I mean the quality of the end result, the training speed, and what features device is physically capable of handling. While kinda discrete aspects in other fields, all of these are highly correlated in ML training.
One specific example I am thinking of is support for efficient long context training. If your stack doesn't, for instance, support flash attention (or roll its own custom implementation), then good luck with that.
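The most portable option right now is probably PyTorch's built-in SDPA, which dispatches to a fused flash-style kernel only when the backend provides one. A rough sketch (shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

# Long-context attention inputs: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 16, 8192, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 16, 8192, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 16, 8192, 64, device="cuda", dtype=torch.float16)

# Dispatches to a fused flash-attention-style kernel where the backend has one;
# on a stack without such a kernel you fall back to the memory-hungry
# materialized-attention path (or the sequence simply doesn't fit).
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```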
Another I am thinking of is quantization. Sometimes you want to train with a specific quantization scheme, and sometimes doing that is just a performance compromise. And vendors really love to fudge their numbers with the fastest, most primitive quantization on the device.
Compatibility - does using a TPU require reworking significant parts of your software and application stack? If so, that sucks for most companies whose researchers are used to using nvidia libraries, tooling and hardware to get their models running, and reworking the entire bottom end of that to work on an esoteric platform needs to equate to a huge cost savings at scale to be worth it.
Cost per tensor flop delivered - likely very low if google has optimized the silicon, memory, voltage, power and temperature envelope, networking, boards and chassis for the server running it, as well as optimizing for optimal process node efficiency/cost. They're probably not on bleeding edge tsmc process, but instead optimizing for total deployed running cost per pflop over 2-3 years.
It's also now public (as of November) that microsoft/azure have been working for many years on their own AI chip, dubbed "Maia" [1], and an accompanying appliance, with the obvious goal of taking some of that nvidia margin in-house (with openai/bing/copilot being a massive consumer of capacity). I think this will become even more commonplace with cloud vendors, even medium-size ones, than it already is. The knowledge and complexity barrier to designing a tensor processing unit seems pretty low, and it looks like most of the hard stuff is in the drivers and software, something cloud providers designing and integrating internally can bypass and control to a great degree.
It's also very hard to benchmark side by side on these since I'm sure cuda/nvidia hardware can do compute that a TPU cannot. AMD's machine learning accelerators look good on paper too, as do the tensor processors in the apple silicon, but on real world applications and use cases they don't often measure up save for a few optimal workflows.
Do you know how Intel Gaudi [1] can be purchased? I see a "Contact us" on their page, but any inside or public information about prices and the budget required for bulk purchases would be great.
Also, how does Gaudi3 compare to Google TPUs? Thanks! I know that Gaudi3 is not readily available yet.
Is there enough volume of this chip for it to mean anything to anyone? If there are only a few that no one can buy, and they are only better than nvidia at 1 out of 10'000 workloads, can people even use their existence to try to get lower H100 prices?
I am currently analyzing the feasibility and economics of building a cloud service for AI training. This involves day-to-day changes to the analysis, since all the "parts" are moving.
Hasn't this always been the case? I remember they released TPUs at IO 2018, and they were nice if you built FP16/FP32 models in Google Cloud TensorFlow/CoLab and never, ever ported them to anything else. Meanwhile, the cool new open-source stuff coming out every week usually requires a GPU and is rarely compatible with TPU without major changes. If you wait a few months, a TPU-compatible copycat appears on the Keras demos page, but by then you've lost interest.
It is pretty exciting that various companies are making compute for deep learning. I was starting to get worried that the TPUs were lagging behind Nvidia. More diversity in hardware is always nice.
Meanwhile I am still waiting for Tesla Dojo to be an actual thing...
Elon's comments on the latest earnings call, if you read between the lines, strongly imply that Dojo v1, while functioning, has been a disappointment. He seems to hold out hope for future versions but he is markedly less optimistic than for other programs he talked about, and considering his typical unjustifiably optimistic attitude toward future releases of AI-related products that seems to bode poorly for the project. I would not be surprised if it was cancelled or rebooted in the next year or two.
I don't recall if they made comments about public availability for Dojo before, but given the difficulty of competing in this market and the huge benefits scale brings in hardware, it would seem foolish to limit the system to internal customers only.
> Elon's comments on the latest earnings call, if you read between the lines, strongly imply that Dojo v1, while functioning, has been a disappointment
Maybe they shouldn’t have gone with an in-house design.
> it would seem foolish to limit the system to internal customers only.
My understanding from working with similar HPC systems is that public use cases are likely niche and many tasks don’t benefit from the sort of fast interconnect their setup gives them. If anything being quite public about their design was meant to allow other companies to copy their design and compete with nvidia on a united front perhaps.
It’s not really as versatile as AWS, GCP are to profitably work as a compute cloud provider though. Perhaps it could be a provider for other companies who don’t want to own their hardware to do training runs on.
I agree that it seems a bit much to try to design their own training supercomputer. But they seemingly did OK with their inference hardware, and they do have an awful lot of money these days. Anyway, even if it is literally only useful for a single task, and that task is training large transformers, there is a use case right now. People are desperate for hardware to train large transformers.
Is there any good way to program TPUs for non-ML work? The only official way appears to involve encoding the compute operation into a TF graph and loading that. Is there a better way to directly use the hardware?
AFAIK, XLA is the only compiler that can target TPUs, so anything written for TPUs would need to lower to HLO. That said, why would you want to run non-ML workloads on such an ML-specialized chip?
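That said, if someone really wanted to, JAX is probably the sanest route: anything expressible in jax.numpy gets jit-compiled through XLA, so a non-ML stencil can run on a TPU without hand-building TF graphs. A toy sketch, purely illustrative:

```python
import jax
import jax.numpy as jnp

@jax.jit  # compiles via XLA; on a TPU VM this lowers to HLO and runs on the TPU
def heat_step(u, alpha=0.1):
    # Simple 2D finite-difference diffusion update, nothing ML about it.
    lap = (jnp.roll(u, 1, 0) + jnp.roll(u, -1, 0) +
           jnp.roll(u, 1, 1) + jnp.roll(u, -1, 1) - 4 * u)
    return u + alpha * lap

u = jnp.zeros((1024, 1024)).at[512, 512].set(1.0)
for _ in range(100):
    u = heat_step(u)
print(float(u.sum()))  # total "heat" is conserved by this periodic stencil
```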
There are no benchmarks in this s*t article, just SWAGs. If this chip doesn't support FP8, there's no way it's going to beat the H100 on workloads that matter.
Could not find a Gaudi or Cerebras device available anywhere. And MI300 is risky since software support is poor. But if you have managed to make it work, perhaps there's potential.
Would like to try others, but the investment is too big. Would have to build an entire software ecosystem. I think most people are like this. Even geohot suffered trying to do this.
Meh, the comparison is somewhat pointless when it doesn't account for the slowdown that the vast majority of pytorch codebases experience on TPUs versus using JAX and its TPU-specific accelerations, and vice versa.
I understand that OpenAI used Azure for processing and not Google Cloud, right? Using Google Cloud would mean, assuming Google is a bad actor, that they could get the model if OpenAI trained their models with it.
My intention in writing that comment was different; it was my mistake, so let me correct and rephrase it. I'm thinking from a cybersecurity perspective.
Imagine a future where there are many players in the cloud training business. These players have the chance to retrieve the LLM model, and since this could be a high-stakes business, that is dangerous. I understand that currently cloud players can observe everything that happens in their clouds, including secret information, and that this is not happening, but it could, and it would not have to be "Google" itself; it could be an inside job.
I'm not sure this will ever be a solved problem. If it's in your threat model to worry about such things, you necessarily need to have your own servers, in a secured physical location you control. No amount of attestation and auditing will allow you to overcome that sort of threat. For the most part, I believe audit logs and severely restricted permissions provided to employees greatly disincentive the chances of an inside job occurring successfully.
More compute can't save you from dumb methods. Memorizing the whole of human text while being fooled by a prompt crafted by a 14-year-old doesn't mean it's intelligent.
I don't see how we're this far in and so many competitors are still trying to ignore the massive market penetration of CUDA. I don't care how fast your chip is if I can only run 20% of the already tiny amount of software capable of leveraging this acceleration.
I think this was a more accurate assessment a couple of years ago. AMD historically under-invested in their software stack, as is well documented. But they are catching up. The prevalence of pytorch makes it easier for hardware vendors to target a single set of libraries for implementation and get broad distribution. Even Apple has made major progress getting MPS support in more broadly, first directly into pytorch, and they are now charting their own course with MLX, with some early interesting successes. Zuck recently used the term "H100 equivalent" for compute, and my memory is he indicated roughly 35% of that compute was not from NVIDIA; that will all be AMD.
There's still work to do: lots of repositories still contain `if device=="cuda"`-type logic, but my own experience is that munging code around to use apple GPUs has gotten vastly easier this year, and I see more and more AMD GPU owners floating around github issues with resolvable problems. A year ago they were barely present.
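For anyone hitting that, the fix in most repos is a few lines of device-agnostic selection instead of the hard-coded string. A minimal sketch:

```python
import torch

def pick_device() -> torch.device:
    # Prefer CUDA (NVIDIA, and ROCm builds also report as "cuda"), then Apple MPS, then CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
model = torch.nn.Linear(32, 32).to(device)
x = torch.randn(4, 32, device=device)
print(model(x).device)
```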
All that said, people aren't ignoring it; entrepreneurs and cloud monopolies are putting real resources into opening things up. I think the playing field will continue to level out and get back to competitive over the next two to three years.
The main issue is that until you solve 99% of all the problems, you're still introducing substantial friction for every potential user, who may go from needing zero coding knowledge to needing very specialized coding knowledge.
And while many problems are trivial to fix at eg the PyTorch layer, lots of stuff like flash attention, DeepSpeed, etc. are coded directly in CUDA kernels.
Is that true, and is that sustainable? My understanding has been that only the relatively low-level libraries/kernels are written in CUDA, and all the magical algorithms are in python using various ML libraries. It's like how Intel BLAS isn't much of a moat: there are several open-source implementations and you can mix and match.
How is CUDA so sticky when most ML devs aren't writing CUDA but something several layers of abstraction above it? Why can't intel, AMD, google w/e come along and write an adapter for that lowest level to TF, pytorch or whatever is the framework of the day?
> Why can't intel, AMD, google w/e come along and write an adapter for that lowest level to TF, pytorch or whatever is the framework of the day?
A long long time ago, i.e. the last time AMD was competing with Intel (before this time, that is), we used to use Intel's icc in our lab to optimize for the Intel CPUs and squeeze as much as possible out of them. Then AMD came out with their "Athlon"(?) and it was an Intel-beater at that time. But AMD never released a compiler for it; I bet they had one internally, but we had to rely on plain old GCC.
These hardware companies don't seem to get that kick-ass software can really add wings to their hardware sales. If I were a hardware vendor, I would, if nothing else, make my hardware's software open so the community can run with it and create better software, which would result in more hardware sales!
It's a good question. I think fundamentally it's because no one wants to (or can) compete with Nvidia at making a general-purpose parallel processor. They all want to make something a bit more specialized, so they need to guess what functionality is needed and not needed.
This is a really tricky guess; case in point, AMD's latest chip can't compete on training because they could not get Flash Attention 2 working on the backward pass, due to their hardware architecture. [1]
Attempts to abstract at a higher layer have failed so far because that lower layer is really valuable, again Flash Attention is a good example.
The API that matters is Torch, and only the API. Letting NVIDIA charge famine prices is both a bad idea and a huge incentive to write software bridging the gap.
I don’t know if you do work in this space but this wasn’t accurate several years ago and it certainly isn’t accurate now.
The amount of extremely CUDA specific handwritten compute architecture code, custom kernels, etc has exploded and other than a few things here and there (yes, like torch SDPA) we’re waaaay past vanilla torch for anything beyond a toy.
The pricing and theoretical hardware specs of these novelties don't matter when an inferior-on-paper Nvidia product will wipe the floor with them in the real world.
People have spent 15 years wringing every last penny out of CUDA on Nvidia hardware.
There is some light shining through but you’ll still see things like “Woo-hoo FlashAttention finally supports ROCm! Oh wait, what’s that, FlashAttention2 has been running on CUDA for six months?”
Don’t even get me started on the “alternative” software stacks and drivers.
some stuff is still CUDA-dependent (a lot of scientific code). other stuff is not (major ML frameworks and many math libraries). nobody said CUDA was obsolete, just that there are great options for many use cases that no longer rely on it.