Google TPU v5p beats Nvidia H100 (techradar.com)
174 points by wslh 10 months ago | 133 comments



A lot of comments in this thread are parroting the common CUDA-moat argument, but that really only applies to training and R&D. The majority of spend is on inference, and the world is standardizing around a handful of common architectures that have been, or are being, implemented performantly in non-CUDA stacks.



The thesis of that essay is that:

1. Nvidia GPUs are dominant at training

2. Inference is easier than training, so other cards will become competitive in inference performance.

3. As AI applications start to proliferate, inference costs will start to dominate training costs.

4. Hence Nvidia's dominance will not last.

I think the most problematic assumption is 3. Every AI company we see thus far is locked in an arms race to improve model performance. Getting overtaken by another company's model is very harmful for business performance (see Midjourney's Reddit activity after DALL-E 3), while a SOTA release instantly results in large revenue leaps.

We also haven't reached the stage where most large companies can fine-tune their own models, given the sheer complexity of engineering involved. But this will be solved with a better ecosystem, and will then trigger a boom in training demand that does scale with the number of users.

Will this hold in the future? Not indefinitely, but I don't see this ending in say 5 years. We are far from AGI, so scaling laws + market competition mean training runs will grow just as fast as inference costs.

Also, 4 is very questionable. Nvidia's cards are not inherently disadvantaged at inference; they may not be specialized ASICs, but they are good enough for the job, with an extremely mature ecosystem. The only reason other cards can be competitive against Nvidia's in inference is Nvidia's 70% margins.

Therefore, all Nvidia needs to do to defend against attackers is lower their margins. They'll still be extremely profitable; their competitors, not so much. This is already showing in the A100/H100 bifurcation: H100s are used for training, while the now-old A100s are used for inference. Inference card providers will need to compete against a permanently large stock of retired Nvidia training cards.

Apple is still utterly dominant in the phone business after nearly 2 decades. They capture the majority of the profits despite

1. Not manufacturing their own hardware

2. The majority of the market share by units sold going to, say, Chinese/Korean manufacturers

If inference is easy while training is hard, it could just lead to Nvidia capturing all the prestigious and easy profits from training, while the inference market becomes a brutal, low-margin business with 10 competitors. This would lead to the Apple situation.


> See Midjourney's Reddit activity after DALL-E 3

What stats are you looking at? Looking at https://subredditstats.com/r/midjourney , I see a slower growth curve after the end of July, but still growing and seemingly unrelated to the DALL-E 3 release, which was more like end of October publicly.


> We are far from AGI

Do you mind expanding on this? What do you see as the biggest things that make that milestone > 5 years away?

Not trolling, just genuinely curious -- I'm a distributed systems engineer (read: dinosaur) who's been stuck in a dead-end job without much time to learn about all this new AI stuff. From a distance, it really looks like a time of rapid and compounding growth curves.

Relatedly, it also does look -- again, naively and from a distance -- like an "AI is going to eat the world" moment, in the sense that current AI systems seem good enough to apply to a whole host of use cases. It seems like there's money sitting around just waiting to be picked up in all different industries by startups who'll train domain-specific models using current AI technologies.


Intelligence lies on a spectrum. So does skill generality. Ergo, AGI is already here. Online discussions conflate AGI with ASI -- artificial superhuman intelligence, i.e. sci-fi agents capable of utopian/dystopian world domination. When misused this way, AGI becomes a crude binary which hasn't arrived. With this unearned latitude, people subsequently make meaningless predictions about when silicon deities will manifest. Six months, five years, two decades, etc.

In reality, your gut reaction is correct. We have turned general intelligence into a commodity. Any aspect of any situation, process, system, domain, etc. which was formerly starved of intelligence may now be supplied with it. The value unlock here is unspeakable. The possibilities are so vast that many of us fill with anxiety at the thought.


When discussing silicon deities, maybe we can skip the Old Testament's punishing, all-powerful deity and reach for Silicon Buddha.


Sounds like somebody has watched the movie "Her"


I also think the space of products that involve training per-customer models is quite large, much larger than might be naively assumed given what is currently out there.

It may be true that inference is 100x larger than training in terms of raw compute. But I think it very well could be that inference is only 10x larger, or same-sized.

And besides, you can look at it in terms of sheer inputs and outputs. The size of the data yet to be trained on is absolutely enormous. Photos and video, multimedia. Absolutely enormous. Hell, we need giant stacks of H100s for text. Text! The most compact possible format!


I also think it's ludicrous to think that NVIDIA hasn't witnessed the rise of alternate architectures and isn't either actively developing them (for inference) or seriously deciding which of the many startups in the field to outright buy.


They already have inference-specialized designs and architectures, for example the entire Jetson line, which is inference-focused (you can train on them, but like, why would you?). They have several DLA accelerators on-chip besides the GPU that are purely for inference tasks.

I think Nvidia will continue to be dominant because it's still a lot easier to go from CUDA training to CUDA (TensorRT-accelerated, let's say) inference than to migrate your model to ONNX to get it to run on some weird inference stack.


> you can train on them, but like, why would you?

Because you want to learn a new gait on your robot in a few minutes.


Well sure, if your model is small and light enough. But there's no training a 7B+ model on one (well, you could, but it would be so, so, so slow. Like, decades?).


Unless there is a collapse in AI, I suspect inference will just keep exploding, and as prices go down, volume will go up. Margins will go down and maybe we will land back at prices similar to standard GPUs. Still very expensive, but not crazy.


Right, and they don't even need to lower their training margins: since training needs a fancy interconnect, they can just ship the same chip at different prices based on interconnect (and they are already doing so with the 40xx vs the H100).


I sold my GOOG shares from 2005 to buy NVDA last year and definitely agree with the article.

The thing is that Wall St doesn't understand any of the technical details and will just see huge profit growth from NVDA and drive the price crazy high.


I don’t understand; hasn’t that already happened and you need to get out now?


We are still at the beginning of Wall St freaking out


NVDA has a P/E of 80 and is up 6x since October 2022. It's the third-highest company by market cap in the S&P 500; how much higher do you think it could possibly go?


It will keep going up until a Wall St analyst puts in an insane price target like $5000. Then it will be time to sell. It should follow AMZN from the dotcom bubble.


Except that Google is buying H100s and Nvidia is not buying TPUs. Nobody is buying TPUs, even the ancient ones Google is willing to sell.

Nvidia's hardware offerings are products for sale. Google's TPUs are a value-add to their rental market. This distinction is not lost on people. There's a reason all the big TPU clients Google lists in press releases are companies Google has invested in.


TPUs are mostly used outside of Google for training. There are better inference options.


Does pytorch somehow work better on CUDA? If not, who cares about CUDA?


Everyone else not using Pytorch for their GPU coding activities.


Yes. To get good perf on PyTorch you need to use custom kernels that are written in CUDA, and all the dev work is done in CUDA, so if you want to use new SOTA projects as they come out, you'll probably want an Nvidia GPU unless you want to spend time hand-optimizing.


None of the comments make sense after reading these lines:

> Unlike Nvidia, which offers its GPUs out for other companies to purchase,

> Google's custom-made TPUs remain in-house for use across its own products and services.

Nobody can purchase one of these.

And even if someone external to Google could purchase one, why would they trust Google for software, documentation, or (the big if with Google) ongoing support? This isn't their core business.


Machine learning is not the only thing graphics cards are used for.


Google is one generation behind?

Google announces TPU v5p with specs competing against H100 when NVIDIA GH200 Grace Hopper is announced.

>The new pods also have 95GB of high-bandwidth memory (HBM)

GH200 has 282GB of HBM3e memory

>These new pods provide a throughput of 4,800Gbps.

That equals 600 GB/sec. GH200 has 10 TB/sec of combined HBM3e memory bandwidth.
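A quick unit check on those two figures (a reply below makes the same point: the 4,800 Gbps number is chip interconnect bandwidth, while the 10 TB/sec figure is HBM memory bandwidth, so they measure different things):

    # 4,800 Gbps of TPU v5p interconnect, converted to bytes per second
    tpu_interconnect_GBps = 4800 / 8   # = 600 GB/s
    # The comparable *memory* figure for a v5p chip is ~2,765 GB/s of HBM
    # bandwidth (quoted further down in the thread), not 600 GB/s.
    tpu_hbm_GBps = 2765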


A few notes: Nvidia always announces their hardware in advance of availability, while Google typically announces some time after they've started using it internally. Also TPU has been more cost-focused than Nvidia, since Google uses these chips internally to serve their web traffic while Nvidia is supply constrained and can name their price right now, plus they don't pay the electricity bills.

There are also a few things wrong with the specs you quoted. 282 GB is split between two GPUs, it's 144 GB per GPU (not sure where the extra 6 GB went). The TPU pod throughput number you quoted is interconnect bandwidth which you compared to GH200's memory bandwidth. Those numbers are not comparable.

I believe GH200 does beat TPU v5p per chip, however the differences are not anywhere near as large as your comparison suggested. And it's likely that TPU v5p is dramatically more cost effective but we don't have Google's internal numbers to prove that.


That's mostly true. Nvidia also spends a lot of time internally testing their cards before shipping them out.

Nothing would be worse than a massive hardware recall.

If anything, Google would have an easier time going live sooner than Nvidia since they own all the TPUs and they're onsite.


> 282 GB is split between two GPUs

There is only one GPU in GH200 and it gets all 282 GB of HBM3e memory. There is another 480 GB of LPDDR5X memory for the other chip (the Grace CPU, not a GPU).

GH200 interconnect bandwidth through NVLink is 900 GB/s.


I don't believe this is true. GH200's GPU has 144GB of HBM3e according to their datasheet. https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-su...

The 282 GB figure comes from a "dual configuration" server with two CPU chips and two GPU chips:

> the new platform will be available in a wide range of configurations. The dual configuration [...] comprises a single server with 144 Arm Neoverse cores, eight petaflops of AI performance and 282GB of the latest HBM3e memory technology.

Note that one Grace Hopper CPU has 72 cores so this configuration clearly has two CPUs and two GPUs in one server, and the 282 GB HBM3e total is split between the two GPUs with 144 GB each (again, not sure where the extra 6 GB went, maybe disabled for yield issues?).

https://nvidianews.nvidia.com/news/gh200-grace-hopper-superc...


I am more interested in the TCO of TPU v5p. Right now it seems only Nvidia is making money; everyone else loses.


The article mentions TPU v5p is 2.1 x perf/TCO of H100.


I don't think it does?

> Google's v5p TPUs are up to 2.8 times faster at training large language models than TPU v4, and offer 2.1-times value-for-money.

That's a comparison of TPUv5p to TPUv4, not to H100.


Actual dollars would be nice, since that is what the end user understands. Otherwise, what are they talking about, 1-million-unit orders? But nice catch, I didn't see it.


I think Broadcom designs and builds TPUs for Google. Their stock was up 100% in 2023 and their profit margin was 39.31%.


They bought VMware for $61B in Nov of last year. Hard to think TPUs are driving much of this at Broadcom.


Google's TPUs are entirely designed in-house. Not sure where you heard Broadcom is involved.



Google doesn't build their TPUs. Broadcom does.


table of contents?



Thanks.


H100, H200, and GH200 are the same generation although the next generation B100 is coming this year.


These comparisons are "Google pod" vs. "Nvidia Superchip", not single chips. Pods/Superchips are collections of GPUs/TPUs, memory, and high-speed interconnect.

One Grace Hopper has: an H100 chip, a Grace CPU with 72 cores, 282GB of HBM3e memory, and 480 GB of LPDDR5X for the CPU.


If you want to compare the host too, then you should also take into account the TPU host.

Each TPUv5p host has 208 vCPU, 448 GB ram, 200 Gbps NIC (for data loading), 4 TPUv5p chips (8 cores). Each TPU chip has 95GB HBM with 2765GBps memory bandwidth, 3D interconnect with 4800 Gbps (3D interconnect means every chip has 6x 800 Gbps feeds, each going to a different chip in the torus).

One big difference between TPUs and GPUs is that the former have a few beefy cores (2 per chip) while the latter have a lot of smaller cores. It can help or hinder depending on the workload. It makes it simpler to do deterministic training, which helps both debugging and optimization.

So, each system has 448GB host RAM and 4 TPU chips with 380GB HBM. The amount of host memory doesn't really matter that much: it's there to make sure that accelerators can be fed data at full speed.

TPUv5p pods are organized in "cubes" of 16 machines (64 TPUs) that can be assembled into any shape for different kinds of data/model parallelism (see the twisted tori in the TPUv4 paper). Each cube has 6080 GB (6 TB) of HBM and 29.3 petaflops (bfloat16) of theoretical peak performance. You can get multiple exaflops from a single pod and use multiple pods in the same datacenter with Multislice if needed.
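Back-of-the-envelope, using only the per-chip numbers above (a sketch, not an official spec sheet):

    chips_per_cube = 16 * 4                                # 16 hosts x 4 TPU v5p chips
    hbm_per_chip_gb = 95
    cube_hbm_gb = chips_per_cube * hbm_per_chip_gb         # 64 * 95 = 6080 GB
    cube_peak_pflops_bf16 = 29.3
    per_chip_tflops_bf16 = 1000 * cube_peak_pflops_bf16 / chips_per_cube   # ~458 TFLOPS/chip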

TPUs are more power-efficient than GPUs (e.g. the A100 has a TDP 3x that of TPUv4; plus a 3D torus requires fewer connections than a fat-tree interconnect, and optics consume their fair bit of power), so you can scale to higher compute per datacenter before you hit that bottleneck on really, really large models.

Also notice that you are using bare metal numbers for Grace hopper and I can only quote (what is publicly available) userspace/VM numbers for TPUs. For example, the actual physical RAM on the hosts is higher than 448 GB; so 448 vs 480 is apples to oranges.

Source: https://cloud.google.com/tpu/docs/v5p-training + I am oncall for these platforms.

https://arxiv.org/ftp/arxiv/papers/2304/2304.01433.pdf for TPUv4 vs. A100

---

I believe Grace Hopper is an amazing platform. It is a more general one than TPUv5p pods. Each optimized for different constraints and TCO/perf.


Your numbers are still off. A GH200 pod has 144 terabytes of RAM; it isn't even measured in gigabytes. It looks like TPUv5p may have 851 terabytes which would be a generation ahead but you didn't show those numbers.


nabla9 said "one Grace Hopper" not "one GH200 pod".

Actually, the DGX GH200 seems to come in different sizes, something I can't find clearly stated on Nvidia's website, but what I do see are entirely inconsistent specs. I'm thinking they've changed it a few times.

https://developer.nvidia.com/blog/announcing-nvidia-dgx-gh20... describes the DGX GH200 as:

-256 NVIDIA Grace Hopper Superchips (1 CPU + 1 GPU)

-each Grace CPU has 480 GB LPDDR5 CPU memory

-each H100 GPU has 96 GB of HBM3

-each GPU can access all of the CPU and GPU memory in the whole system (144TB), at 900 GBps

At https://www.nvidia.com/en-us/data-center/dgx-gh200/ (linked to from the DGX page from the website) which has a Datasheet pdf they say the DGX GH200 has:

-32 NVIDIA Grace Hopper Superchips (1 CPU + 1 GPU)

-each Grace CPU has 72 "ARM Neoverse V2 Cores with SVE2 4X 128"

-each GPU can access "19.5TB" shared memory

-that's 624GB per superchip, which is weird. I expect it's actually 96GiB HBM3 + 512GiB LPDDR, a total of 19456GiB = 19.0TiB

And other people have found completely different specs elsewhere on the website!


DGX GH200 is next gen, but afaik not available to anyone aside from internal partners. All the quoted data here is for several of them together in a chassis.

>Each NVIDIA Grace Hopper Superchip in NVIDIA DGX GH200 has 480 GB LPDDR5 CPU memory, at an eighth of the power per GB, compared with DDR5, and 96 GB of fast HBM3.

(https://developer.nvidia.com/blog/announcing-nvidia-dgx-gh20...)

Looks like a single superchip has 96GB of HBM3, similar to a single v5p TPU. The H100s in the DGX H100 were 80GB per board.


I guess performance per watt is also a major factor?


None of this matters if they can't get the hardware stack to work correctly.

The media keeps missing the real lock-in Nvidia has: CUDA. It's not the hardware. It's the ability for someone to use it painlessly.


TPUs have the second-best software stack after CUDA though. JAX and TensorFlow support them before CUDA in some cases, and TPU is the only non-CUDA PyTorch backend that comes close to CUDA-level support.


TPUs are single-use-case, unlike CUDA.


Google has historically been weak at breaking into markets that someone else has already established, and I think the TPUs are suffering the same fate. There is not enough investment in making the chips compatible with anything other than Google's preferred stack (which happens to not be the established industry stack). Committing to getting torch to switch from device = "cuda" to device = "tpu" (or whatever) without breaking the models would go a long way, IMO.
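For context, the TPU path in PyTorch today goes through torch_xla rather than a plain device string; a minimal sketch of the status quo the comment is describing (the model code is just a placeholder):

    import torch

    # What most model code hard-codes today:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # What TPU support currently looks like instead (torch_xla):
    #   import torch_xla.core.xla_model as xm
    #   device = xm.xla_device()        # an "xla" device, not device="tpu"

    model = torch.nn.Linear(128, 10).to(device)
    x = torch.randn(8, 128, device=device)
    y = model(x)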


I always thought Google was actually pretty good at taking over established, or rising markets, depending on the opportunity or threat they see from a competitor. Either by timely acquisition and/or ability to scale faster due to their own infrastructure capabilities.

- Google search (vs previous entrenched search engines in the early '00s)

- Adsense/doubleclick (vs early ad networks at the time)

- Gmail (vs aol, hotmail, etc)

- Android (vs iOS, palm, etc)

- Chrome (vs all other browsers)

Sure, I'm picking the obvious winners, but these are all market leaders now (Android by global share) where earlier incumbents were big, but not Google-big.

Even if Google's use of TPUs is purely self-serving, it will have a noticeable effect on their ability to scale their consumer AI usage at diminishing costs. Their ability to scale AI inference to meet "Google scale" demand, and do it cheaply (at least by industry standards), will make them formidable in the "AI race". This is why Altman/Microsoft and others are investing heavily in AI chips.

But I don't think their TPUs will be only self-serving; rather, they'll scale their use through GCP for enterprise customers to run AI. Microsoft is already tapping their enterprise customers for this new "product". But those kinds of customers will care more about cost than anything else.

The long-term game here is a cost game, and Google is very, very good at that and has a headstart on the chip side.


TPUs were originally intended to just be for internal use (to keep google from being dependent on Intel and nvidia). Making them an external product through cloud was a mistake (in my opinion). It was a huge drain on internal resources in many ways and few customers were truly using them in the optimal way. They also competed with google's own nvidia GPU offering in cloud.

The TPU hardware is great in a lot of ways and it allowed google to move quickly in ML research and product deployments, but I don't think it was ever a money-maker for cloud.


> The media keeps missing the real lock-in Nvidia has: CUDA. It's not the hardware. It's the ability for someone to use it painlessly.

Really? What if someone writes a new back-end to PyTorch, TensorFlow and perhaps a few other popular libraries? Then will CUDA still matter that much?


> if someone writes a new back-end to PyTorch

If that was easy to do surely AMD would have done it by now? After many years of trying?


I am starting to wonder if AMD were even trying all this time.


PyTorch has had an XLA backend for years. I don't know how performant it is though. https://pytorch.org/xla


It's pretty fast, just not as nice to use. You need statically shaped tensors, and some functions are just not supported (last time I used it).
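To make the static-shapes point concrete: XLA compiles per tensor shape, so variable-length inputs are usually padded to a fixed size to avoid recompiling every step. A minimal sketch, assuming torch_xla is installed and a TPU is attached (`model` and `raw_batch` are placeholders):

    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()
    MAX_LEN = 512                              # fixed shape chosen up front

    def pad_to_max(batch):                     # batch: list of 1-D token tensors
        out = torch.zeros(len(batch), MAX_LEN, dtype=torch.long)
        for i, seq in enumerate(batch):
            n = min(len(seq), MAX_LEN)
            out[i, :n] = seq[:n]
        return out

    tokens = pad_to_max(raw_batch).to(device)
    logits = model(tokens)                     # same shape every step -> one compile
    xm.mark_step()                             # cut the lazy graph and execute it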


Can you do Unreal engine's Nanite, or Otoy Ray tracing in Pytorch?


TensorFlow and PyTorch support TPUs. It's pretty painless.


Having used them heavily, it is nowhere near painless. Where can you get a TPU? To train models you basically need to use GCP services. There are multiple services that offer TPU support: Cloud AI Platform, GKE, and Vertex AI. For GPU you can have a machine and run any TF version you like. For TPU you need different nodes depending on the TF version. Which TF versions are supported per GCP service is inconsistent. Some versions are supported on Cloud AI Platform but not Vertex AI, and vice versa. I have had a lot of difficulty trying to upgrade to recent TF versions and discovering the inconsistent service support.

Additionally, many operations that run on GPU are just unsupported on TPU. Sparse tensors have pretty limited support, and there are a bunch of models that will crash on TPU and require refactoring. Sometimes pretty heavy, thousands-of-lines refactoring.

Edit: PyTorch is even worse. PyTorch does not implement efficient TPU device data loading and generally has poor performance, nowhere near comparable to TensorFlow/JAX numbers. I'm unaware of any PyTorch benchmarks where TPU actually wins. For TensorFlow/JAX, if you can get it running and your model suits TPU assumptions (so a basic CNN), then yes, it can be cost-effective. For PyTorch, even simple cases tend to lose.
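For what it's worth, torch_xla does ship a device loader meant to overlap host-to-TPU transfers; whether it closes the gap with TF/JAX input pipelines in practice is exactly the complaint above. Rough sketch (`loader`, `model`, and `optimizer` are placeholders):

    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.parallel_loader as pl

    device = xm.xla_device()
    train_loader = pl.MpDeviceLoader(loader, device)   # wraps a normal DataLoader

    for x, y in train_loader:            # batches arrive already on the TPU
        optimizer.zero_grad()
        loss = model(x, y)
        loss.backward()
        xm.optimizer_step(optimizer)     # grad reduction + step; the loader inserts mark_step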


> TensorFlow and PyTorch support TPUs. It's pretty painless.

Unless you physically work next to the TPU hardware team, the torch support for TPUs is pretty brittle.


mojo language joins the chat: https://www.modular.com/max/mojo


Mojo is a closed source language that will never reach mainstream adoption among ML engineers and scientists.


> Mojo is a closed source language that will never reach mainstream adoption among ML engineers and scientists.

[Citation needed]

The creator, Chris Lattner, previously created LLVM, clang, and Swift. In each case he said these projects would be open sourced, and in each case they were. In each case they reached mainstream adoption in their respective target markets.

He's stated that Mojo will be open source.

If you're going to claim with great confidence that this language will have a different outcome to his previous ones, then you probably should have some strong evidence for that.


Hmm, the creator says (from his podcast with Lex Fridman, when I listened to him) that they are open-sourcing it, but that it is a project borne out of a private effort at their company and is still being used privately - so the aim is to open source it while taking community input and updating their private code to reflect the evolving design, so that when they release it, their internal lang and the open-sourced lang will not diverge.

Of course not ideal, but better than "open sourcing" it and refusing every request because it does not work for their codebase. Worse than having it open source from the get-go, of course.

Assuming that day comes, does it have a competitor in the works? A Python superset, compatible with Python libs, but one that lets you go bare metal to the point that you can directly program GPUs and TPUs without CUDA or anything?

"Never" means you believe it will never be open sourced, or that a competitor will surpass it by the time it is open sourced, or that you believe the premise of the lang is flawed and we don't need such a thing. Which one is it?

Here is their github btw: https://github.com/modularml/mojo

From what I see, they have a pretty active community and there is demand for such a system.

The github says something similar:

>This repo is the beginning of our Mojo open source effort. We've started with Mojo code examples and documentation, and we'll add the Mojo standard library as soon as we get the necessary infrastructure in place. The challenge is that we use Mojo pervasively inside Modular and we need to make sure that community contributions can proceed smoothly with good build and testing tools that will allow this repo to become the source of truth (right now it is not). We'll progressively add the necessary components, such as continuous integration, build tools, and more source code over time.


Yes, there is a much more performant competitor that actually supports Nvidia GPUs: [1] https://centml.ai/make-your-ml-models-run-faster-with-hidet/


...this has very little to do with Mojo. Mojo is not an Nvidia accelerator for a couple of ML frameworks.


And Nvidia does actually sell their hardware. Nobody will ever get their hands on one of these outside Google Cloud. It might as well not exist.


Well, sometimes they fall off the back of trucks, I guess: https://www.ebay.com/itm/134540730431 Archive link: https://archive.ph/7dPFo


Doesn't really matter. Google's infra is all the client you need to continue pouring tens of billions into a project like this - bonus if others start using it more in the cloud, but they have so much use for accelerators across their own projects that they aren't going to stop.


What's painful about using TPUs?


In a sentence: Google AI stuff is the vendor lock-in of 2024 Apple with the ecosystem value of 1994 Apple.

c.f. https://news.ycombinator.com/item?id=39149854


So you trade one vendor lock-in for another. Nothing lost.


But you can only buy it as a cloud service, yes? So presumably:

* Whether something 'beats' something else is actually a question of price/performance and...

* Whether or not the vendor in question is famous for dropping people in the shit.


Competitors will also never use it


This needs to be taken with a massive grain of salt, as LLM training performance hugely depends on the framework used.

And by "performance" I mean the quality of the end result, the training speed, and what features the device is physically capable of handling. While these are kinda discrete aspects in other fields, all of them are highly correlated in ML training.

One specific example I am thinking of is support for efficient long context training. If your stack doesn't, for instance, support flash attention (or roll its own custom implementation), then good luck with that.
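As a concrete example of that "does your stack have a fused attention path" question: on the PyTorch/CUDA side it is now usually just the built-in scaled_dot_product_attention, which dispatches to a FlashAttention-style kernel when the backend supports it (a sketch assuming an Nvidia GPU is present, not a benchmark):

    import torch
    import torch.nn.functional as F

    # batch=1, 8 heads, 4096-token context, head_dim=64
    q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)

    # Uses a fused (FlashAttention-style) kernel where available;
    # falls back to a slower math path on stacks without one.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)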

Another I am thinking of is quantization. Sometimes you want to train with a specific quantization scheme, and sometimes doing that is just a performance compromise. And vendors really love to fudge their numbers with the fastest, most primitive quantization on the device.


Additionally, Google has not been very honest recently with regards to their AI tech.


Two things matter at cloud-scale:

Compatibility - does using a TPU require reworking significant parts of your software stack and application stack? If so, that sucks for most companies whose researchers are used to using Nvidia libraries, tooling, and hardware to get their models running, and reworking the entire bottom end of that to work on an esoteric platform needs to equate to a huge cost savings at scale to be worth it.

Cost per tensor flop delivered - likely very low if google has optimized the silicon, memory, voltage, power and temperature envelope, networking, boards and chassis for the server running it, as well as optimizing for optimal process node efficiency/cost. They're probably not on bleeding edge tsmc process, but instead optimizing for total deployed running cost per pflop over 2-3 years.

It's also now public (as of November) that Microsoft/Azure have been working for many years on their own AI chip, dubbed "Maia" [1], and an appliance for it, with the obvious goal of taking some of that Nvidia margin in-house (and with OpenAI/Bing/Copilot being a massive consumer of capacity). I think this will become even more commonplace with cloud vendors - even medium-size ones - than it already is. The knowledge and complexity barrier to designing a tile processor unit seems pretty low, and it looks like most of the hard stuff is in the drivers and software - something cloud providers designing and integrating internally can bypass and control to a great degree.

It's also very hard to benchmark these side by side, since I'm sure CUDA/Nvidia hardware can do compute that a TPU cannot. AMD's machine learning accelerators look good on paper too, as do the tensor processors in Apple silicon, but on real-world applications and use cases they don't often measure up, save for a few optimal workflows.

[1] https://www.geekwire.com/2023/microsoft-unveils-custom-ai-ch...


Training on 1k+ v5p and H100 chips, this isn't true. We get 400 TFLOPS per chip on v5p (int8) and 650 (FP8) for H100s.

H100s are less stable in training, though.

V5ps are about the same as Intel Gaudi 2s, and both are cheaper than H100s on price/perf.

Gaudi 3s are 3-4x faster…


Do you know how Intel Gaudi [1] can be purchased? I see a "Contact us" on their page, but any inside or public information about prices and the budget required for bulk purchases would be great.

Also, how does Gaudi 3 compare to Google TPUs? Thanks! I know that Gaudi 3 is not readily available yet.

[1] https://habana.ai/products/gaudi/



Is there enough volume of this chip for it to mean anything to anyone? If there are only a few that no one can buy, and they are only better than Nvidia at 1 out of 10,000 workloads, can people even use their existence to try to get lower H100 prices?


I am currently analyzing the feasibility and economics of building a cloud service for AI training. This involves day-to-day changes to the analysis, since all the "parts" are moving.


Hasn't this always been the case? I remember they released TPUs at I/O 2018, and they were nice if you built FP16/FP32 models in Google Cloud TensorFlow/Colab and never ever ported them to anything else. Meanwhile the cool new open source stuff coming out every week usually requires a GPU and is rarely compatible with TPU without major changes. If you wait a few months a TPU-compatible copycat appears on the Keras demos page, but by then you've lost interest.


It is pretty exciting that various companies are making compute for deep learning. I was starting to get worried that the TPUs were lagging behind Nvidia. More diversity in hardware is always nice.

Meanwhile I am still waiting for Tesla Dojo to be an actual thing...


Tesla's Dojo is deployed already, I thought? I don't recall them ever having plans to make it publicly available.


Elon's comments on the latest earnings call, if you read between the lines, strongly imply that Dojo v1, while functioning, has been a disappointment. He seems to hold out hope for future versions but he is markedly less optimistic than for other programs he talked about, and considering his typical unjustifiably optimistic attitude toward future releases of AI-related products that seems to bode poorly for the project. I would not be surprised if it was cancelled or rebooted in the next year or two.

I don't recall if they made comments about public availability for Dojo before, but given the difficulty of competing in this market and the huge benefits scale brings in hardware, it would seem foolish to limit the system to internal customers only.


> Elon's comments on the latest earnings call, if you read between the lines, strongly imply that Dojo v1, while functioning, has been a disappointment

Maybe they shouldn’t have gone with an in-house design.

> it would seem foolish to limit the system to internal customers only.

My understanding from working with similar HPC systems is that public use cases are likely niche and many tasks don't benefit from the sort of fast interconnect their setup gives them. If anything, being quite public about their design was perhaps meant to allow other companies to copy it and compete with Nvidia on a united front.

It's not really versatile enough, the way AWS and GCP are, to profitably work as a compute cloud provider though. Perhaps it could be a provider for other companies who don't want to own their own hardware for training runs.


I agree that it seems a bit much to try to design their own training supercomputer. But they seemingly did OK with their inference hardware, and they do have an awful lot of money these days. Anyway, even if it is literally only useful for a single task, and that task is training large transformers, there is a use case right now. People are desperate for hardware to train large transformers.


Is there any good way to program TPUs for non-ML work? The only official way appears to involve encoding the compute operation into a TF graph and loading that. Is there a better way to directly use the hardware?


AFAIK, XLA is the only compiler that can target TPUs, so anything written for TPUs would need to lower to HLO. That said, why would you want to run non-ML workloads on such an ML-specialized chip?


I was interested in seeing if raytracing or similar embarrassingly parallel workloads could be ported to a TPU.


Just write your code in JAX and run as usual?
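For instance, an embarrassingly parallel kernel like ray-sphere intersection is just vectorized array math in JAX, which XLA will compile for a TPU the same way it does for a GPU (a sketch; the numbers and scene are made up):

    import jax
    import jax.numpy as jnp

    @jax.jit
    def hit_sphere(origins, dirs, center, radius):
        # Solve |o + t*d - c|^2 = r^2 per ray; dirs assumed normalized (a == 1).
        oc = origins - center
        b = 2.0 * jnp.sum(oc * dirs, axis=-1)
        c = jnp.sum(oc * oc, axis=-1) - radius ** 2
        disc = b * b - 4.0 * c
        t = (-b - jnp.sqrt(jnp.maximum(disc, 0.0))) / 2.0
        return jnp.where((disc > 0) & (t > 0), t, jnp.inf)   # nearest positive hit

    key = jax.random.PRNGKey(0)
    dirs = jax.random.normal(key, (1_000_000, 3))
    dirs = dirs / jnp.linalg.norm(dirs, axis=-1, keepdims=True)
    origins = jnp.zeros_like(dirs)
    t = hit_sphere(origins, dirs, jnp.array([0.0, 0.0, -3.0]), 1.0)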


Looking forward to seeing a Nanite implementation in JAX.


There are no benchmarks in this s*t article, just SWAGs. If this chip doesn't support FP8, there's no way it's going to beat the H100 on workloads that matter.


Googlers, stop downvoting me and get back to work!


You can only use it on Google Cloud, no? Nvidia is the only self-hostable product available.

But Google can build good framework integration. Will wait and see.


Gaudi, MI300, Cerebras...


Have you bought these?

Could not find Gaudi or Cerebras devices available anywhere. And MI300 is risky since software support is poor. But if you have managed to make it work, perhaps there's potential.

What stack are you using and what workload?


I don't work in ML but I know these are "call for pricing" products.


I don't know of anyone who uses these products. I would probably avoid.


Just don't complain about monopoly if you're not willing to try anything.


Would like to try others, but the investment is too big. Would have to build an entire software ecosystem. I think most people are like this. Even geohot suffered trying to do this.


Given the excessive run on GPUs, it's staggering that Google doesn't sell their TPUs (with some ludicrous exceptions).


I was looking for a power consumption and price comparison, as "faster" alone doesn't mean much otherwise.


Meh, the comparison is somewhat pointless when it doesn't account for the slowdown that the vast majority of PyTorch codebases experience on TPUs vs. using JAX and its TPU-specific accelerations, and vice versa.

https://arxiv.org/pdf/2309.07181.pdf

Meaning the TPU v5p is likely slower than H100 for most ML workloads that depend on pytorch.


Can we access TPUv4 and/or v5 in Google Cloud? What frameworks support them?


Maybe Google should invent a search engine that can answer such questions.


ChatGPT is replacing it and is better for many queries. The answer is there if you put in the OP's question.


I understand that OpenAI used Azure for compute and not Google Cloud, right? Using Google Cloud would mean, assuming Google is a bad actor, that they could get the model if OpenAI trained their models on it.


No one would use Google Cloud if they were worried about IP theft. This also seems very much irrelevant to the posted article.


My intention in writing that comment was different; it was my mistake, so I'll correct and rephrase it. I'm thinking from a cybersecurity perspective.

Imagine a future where there are many players in the cloud training business. These players have the chance to retrieve the LLM model, and since this could be a high-stakes business, that is dangerous. I understand that currently cloud players can observe everything that happens in their clouds, including secret information, and that this is not happening, but it could be not "Google" by themselves but an inside job.


I'm not sure this will ever be a solved problem. If it's in your threat model to worry about such things, you necessarily need to have your own servers, in a secured physical location you control. No amount of attestation and auditing will allow you to overcome that sort of threat. For the most part, I believe audit logs and severely restricted permissions provided to employees greatly disincentive the chances of an inside job occurring successfully.


More compute can't save you from dumb methods. Memorizing the whole of human text while being fooled by a prompt crafted by a 14-year-old doesn't mean it's intelligent.


Plenty of intelligent and educated humans have been fooled by "prompts", aka lies/social engineering, by others many years their junior.


I don't see how we're this far in and so many competitors are still trying to ignore the massive market penetration of CUDA. I don't care how fast your chip is if I can only run 20% of the already tiny amount of software capable of leveraging this acceleration.


I think this was a more accurate assessment a couple of years ago. AMD historically under-invested in their software stack, as is well documented. But they are catching up. The prevalence of PyTorch makes it easier for hardware vendors to target a single set of libraries for implementation and get broad distribution. Even Apple has made major progress getting MPS support in more broadly, first directly into PyTorch, and they are now charting their own course with MLX, with some early interesting successes. Zuck recently used the term "H100 equivalent" in compute, and my memory is he indicated roughly 35% of that compute was not from Nvidia -- that will all be AMD.

There's still work to do -- lots of repositories still contain `if device=="cuda"`-type logic -- but my own experience is that wrangling code to use Apple GPUs has gotten vastly easier this year, and I see more and more AMD GPU owners floating around GitHub issues with resolvable problems. A year ago they were barely present.
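The `if device=="cuda"` checks mentioned above are also the easiest part to fix; a minimal device-selection sketch (note that ROCm builds of PyTorch report themselves as "cuda", which is part of why AMD support has been getting easier):

    import torch

    def pick_device() -> torch.device:
        if torch.cuda.is_available():           # NVIDIA, and also AMD ROCm builds
            return torch.device("cuda")
        if torch.backends.mps.is_available():   # Apple silicon
            return torch.device("mps")
        return torch.device("cpu")

    device = pick_device()
    model = torch.nn.Linear(16, 4).to(device)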

All that said - people aren't ignoring it - and entrepreneurs and cloud monopolies are putting real resources into opening things up. I think the playing field will continue to level / get back to competitive over the next two to three years.


The main issue is that until you solve 99% of the problems, you're still introducing substantial friction for every potential user, who may have to go from needing zero coding knowledge to needing very specialized coding knowledge.

And while many problems are trivial to fix at, e.g., the PyTorch layer, lots of stuff like flash attention, DeepSpeed, etc. is coded directly in CUDA kernels.


Is that true, and is that sustainable? My understanding has been that only the relatively low-level libraries/kernels are written in CUDA and all the magical algorithms are in Python using various ML libraries. It's like how Intel's BLAS isn't much of a moat -- there are several open source implementations and you can mix and match.

How is CUDA so sticky when most ML devs aren't writing CUDA but something several layers of abstraction above it? Why can't Intel, AMD, Google, whoever come along and write an adapter for that lowest level to TF, PyTorch, or whatever is the framework of the day?


> Why can't Intel, AMD, Google, whoever come along and write an adapter for that lowest level to TF, PyTorch, or whatever is the framework of the day?

A long long time ago, i.e. the last time AMD was competing with Intel (before this time, that is), we used to use Intel's icc in our lab to optimize for the Intel CPUs and squeeze as much as possible out of them. Then AMD came out with their "Athlon"(?) and it was an Intel-beater at that time. But AMD never released a compiler for it; I bet they had one internally, but we had to rely on plain old GCC.

These hardware companies don't seem to get that kick-ass software can really add wings to their hardware sales. If I were a hardware vendor, I would, if nothing else, make my hardware's software open so the community can run with it and create better software, which will result in more hardware sales!


It's a good question. I think fundamentally it's because nobody wants to/can compete with Nvidia in making a general-purpose parallel processor. They all want to make something a bit more specialized, so they need to guess what functionality is needed and not needed.

This is a really tricky guess; case in point, AMD's latest chip can't compete on training because they could not get Flash Attention 2 working on the backward pass because of their hardware architecture. [1]

Attempts to abstract at a higher layer have failed so far because that lower layer is really valuable; again, Flash Attention is a good example.

[1] https://www.semianalysis.com/p/amd-mi300-performance-faster-...


They’re not ignoring it, they’re eroding it.

The API that matters is Torch, and only the API. Letting NVIDIA charge famine prices is both a bad idea and a huge incentive to write software bridging the gap.


For those like myself, Torch means nothing; I don't use GPUs for machine learning but rather for graphics programming and general-purpose compute.


I don’t know if you do work in this space but this wasn’t accurate several years ago and it certainly isn’t accurate now.

The amount of extremely CUDA-specific handwritten compute architecture code, custom kernels, etc. has exploded, and other than a few things here and there (yes, like torch SDPA) we're waaaay past vanilla torch for anything beyond a toy.

The pricing and theoretical hardware specs of these novelties don't matter when an inferior-on-paper Nvidia product will wipe the floor with the other in the real world.

People have spent 15 years wringing every last penny out of CUDA on Nvidia hardware.

There is some light shining through but you’ll still see things like “Woo-hoo FlashAttention finally supports ROCm! Oh wait, what’s that, FlashAttention2 has been running on CUDA for six months?”

Don’t even get me started on the “alternative” software stacks and drivers.


Also, the anti-CUDA folks keep forgetting the polyglot nature of the ecosystem, made possible since PTX was introduced in CUDA 3.0.

The graphical debugging tools allow stepping through GPU code just like on the CPU.


Some stuff is still CUDA-dependent (a lot of scientific code). Other stuff is not (major ML frameworks and many math libraries). Nobody said CUDA was obsolete, just that there are great options for many use cases that no longer rely on it.


> already tiny amount of software capable of leveraging this acceleration.

This is why everyone is trying to compete with CUDA. Everyone wants a slice of the pie as we go from tiny amount to everything.


You aren't the customer they want.



