TPU transformation: A look back at 10 years of our AI-specialized chips (cloud.google.com)
142 points by mariuz 5 months ago | 62 comments



It's crazy that Google doesn't spin-out their TPU work as a separate company.

TPUs are the second most widely used environment for training after Nvidia. They're the only environment outside CUDA that people build optimized kernels for.

If it were separate from Google, then there are a bunch of companies that would happily spend money on a real, working NVidia alternative.

It might be profitable from day one, and it surely would gain substantial market capitalization - Alphabet shareholders should be agitating for this!


People bring this point up here every couple of weeks. The cost competitiveness of TPUs for Google comes exactly from the fact that they make them in house and don't sell them. They don't need sales channels, support, leads, any of that stuff. They can design for exactly one software stack, one hardware stack, and one set of staff. You cannot just magically spin up a billion-dollar hardware company overnight with software, customers, sales channels, support, etc.

Nvidia has spent 20 years on this which is why they're good at it.

> If it were separate from Google, then there are a bunch of companies that would happily spend money on a real, working NVidia alternative.

Unfortunately, most people really don't care about Nvidia alternatives, actually -- they care about price, above all else. People will say they want Nvidia alternatives and support them, then go back to buying Nvidia the moment the price goes down. Which is fine, to be clear, but this is not the outcome people often allude to.


You can, or at least historically could, buy access to TPUs, and request it for non-profit projects too through the TPU Research Cloud programme. You have certainly been able to pay for a Colab Pro membership to get TPU access in notebooks, which is how many of the AI generation before ChatGPT learned to run models. TPUs, however, were kind of always for training, never really geared for inference.
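For what it's worth, checking what you actually got in one of those notebook sessions is trivial. A minimal sketch, assuming a Colab-style TPU runtime with JAX preinstalled:

    import jax
    import jax.numpy as jnp

    # List the accelerator devices visible to the runtime; on a TPU runtime
    # this prints TpuDevice entries, on a plain runtime just CpuDevice.
    print(jax.devices())

    # Tiny sanity check that a matmul actually dispatches to the accelerator.
    x = jnp.ones((1024, 1024))
    print((x @ x).sum())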


That is correct, and I should have been clearer: when I say "Buy them" I mean direct sales of the hardware from seller to buyer. I am not referring to cloud compute style sales. Yes, they have been offering TPUs through Google Cloud for a long while now, but this still falls under all the stuff I said earlier: they don't need to have sales pipelines or channels (outside GCloud's existing ones), they don't need to design the hardware/software for arbitrary environments, they have one set of staff and machines to train and support, etc. All of that stuff costs money and ultimately it results in an entirely different sales and financial model.

Google could spin the TPU division out of Google, but 99% of the time people refer to moves like that they omit the implied follow up sentence which is "I can then buy a TPU with my credit card off the shelf from a website that uses Stripe." It is just not that simple or easy.


Good point


> You cannot just magically spin up a billion-dollar hardware company overnight with software, customers, sales channels and support, etc.

I'm not saying it is easy or that it can be done magically.

Just noting that Groq (founded by the TPU creator) did exactly this.


Yes, and now after years of doing that Groq is pivoting to being a cloud compute company, renting their hardware through an API exactly the same way Google does.

Building out your own vertically integrated offering with APIs is comparatively a lot simpler and significantly less risky in the grand scheme. For one thing, cloud APIs naturally benefit from the opex vs capex distinction that is often brought up here -- this is a big sales barrier, and thus a big risk. This is important because you can flush mid-8-figures down the toilet overnight for a single set of photomasks, so you are burning significant capital way before your foot is ever close to the proverbial door, much less inside it. You aren't going to make that money back selling single PCIe cards to enthusiastic nerds on Hacker News; you need big fish. Despite allusions to the contrary (people beating down your door to throw you bathtubs of money with no question), this isn't easy.

Another good example of verticality is the software. The difference in scope and scale between "Tools that we run" and "Tools you can run" is actually huge. Think about things like model choice -- it can be much easier to support things like new models when you are taking care of the whole pipeline and a complete offering, versus needing to support compiler and runtime tools that can compile arbitrary models for arbitrary setups. You can call it cutting corners, but there's a huge amount of tricky problems in this space and the time spent on procedural stuff ("I need to run your SDK on a 15 year old CentOS install!") is time not spent on the core product.

There are other architectural reasons for them to go this route that make sense. But I really need to stress here that a big and important one is that hardware is, in fact, a very difficult business even with a great product.

(Disclosure: I used to work at Groq back in 2022 before the Cloud Compute offering was available and LLMs were all the rage.)


I don't think renting out hardware is a bad model at all. Google spinning out their TPU work in this manner could be fine.

I think some (large) buyers will want on-prem and they have large enough budgets to make that worthwhile.

I don't think "sell individual TPUs to random people" is a great model. Most are better served by the cloud rental approach (although they might not think so themselves).


Isn't Groq pivoting to the IaaS/SaaS model because hardware channel sales are hard and it's easier for everyone to just use an API?


Actually, they do sell them. Only the low-power edge versions, but still.


The TPUs are highly integrated with the rest of the internal Google ecosystem, both hardware and software. Untangling that would be ... interesting.


We have a perfectly reasonable blueprint for an ML accelerator that isn't tied into the google ecosystem: nvidia's entire product line.

Between that and the fact Google already sells "Coral Edge TPUs" [1] I'd think they could manage to untangle things.

Whether the employees would want to be spun off or not is a different matter, of course...

[1] https://coral.ai/products/


Do you think that NVidia is happy to not have an online ecosystem to tie to its GPUs, for added (sales) value? They are more than happy to entangle the GPUs with their proprietary CUDA language.

For a large, established, quasi-monopoly company it's always more attractive to keep things inside their walled gardens. Suggesting that Google should start supporting TPUs outside Google Cloud is like suggesting that Apple should start supporting iOS on non-Apple hardware.


> Do you think that NVidia is happy to not have an online ecosystem to tie to its GPUs, for added (sales) value?

I think nvidia is ecstatic about having commoditised their complement, and having the only ML acceleration option that's available from every cloud provider and on-prem.

Why have Amazon, Google and Microsoft as competitors when you can have them as customers instead?


This is indeed so. But if NVidia could have some recurring revenue from the GPUs, maybe in the form of leasing GPU farms it runs in a proprietary way, that would also be nice. In that alternative universe, it would still have Google and MS as customers, the way AWS has many large-scale companies as customers.


Knowing what I know about big corporations, the biggest entanglement is going to be IP ownership, political constraints and promises to shareholders.


There would probably be huge demand, but would Google be able to satisfy it? Is it currently able to satisfy its own demand?


That would be the point of spinning it out. They could have an IPO, raise as much capital as there is in the observable Universe, and build enough fabs to satisfy all the demand.


That wouldn't work. Even TPU v4 was on a 7nm node, and you don't build a 7nm fab just like that. If it were that easy, NVIDIA would already be building their own fabs, as they have basically raised as much capital as there is in the known universe (a bigger market cap than the entire London Stock Exchange), but they seem to prefer to let the fab experts get on with it rather than compete with them.

LLM AI is largely HBM-bottlenecked anyway, i.e. Samsung, SK Hynix and Micron are where the supply chain limits enter the picture.


Fabless companies that are large enough, such as Apple, front the capital for fab companies like TSMC to build fabs dedicated to their use. They do, in effect, build their own fabs. If the Google TPU group had the inclination, they could have done the same.

The memory industry just got busted by the COVID bubble and is not too keen to jump into the AI bubble.


They might front the money, but they don't own them. Apple gladly lets someone else own and operate the fabs and take the risk (which is smaller with Apple as a client).


Let's not forget that a 7nm fab has a very limited period of usefulness for the likes of Apple etc. The leading edge is always moving forward, and while it might be financially viable for some aspects of the process to be upgraded to the next node, that's not always the case, and that's where TSMC's hundreds of other customers join in and the (now old) equipment can still be used for many more years.

Edit: But perhaps with the exclusivity deals, the likes of TSMC are less reliant on spreading the cost over 15+ years than they used to be. To be clear, I was talking about long-term use.


The leading edge has slowed down a lot. Apple is still selling M1 chips, and AMD is just now releasing new models of Zen 3 AM4 chips.


There is more co-development and risk sharing than you think. TSMC has nodes only Apple uses.


They do, but as far as I know those nodes are just early/late versions and tweaks to the main, popular process.


But if Apple pays TSMC $$$$$$$$ in advance to build a 2nm node production line especially for Apple, and it turns out the 2nm node doesn't deliver the hoped-for improvements in power efficiency? The money's already spent.


Unless they've been issuing a ton of new shares recently and then selling them into the market at something resembling the current share price, the amount of capital they've raised is nowhere near their current market cap.

But it looks like they've actually been buying back some shares - they've got fewer shares outstanding than they did a year or two ago.

Not that it matters much - they've still got plenty of cash and other capital available.


There seems to be this idea that the supply of people who design and operate fabs is infinite, when it's actually a technically demanding job.

We don't even have enough McDonald's employees; how the hell are we going to suddenly have multiple companies creating fabs left and right? TSMC cannot even build their Arizona plant without running into a shortage of workers.


Every time someone says "we don't have enough employees", remember to add "....at the (almost certainly too low) wage being offered".


Maybe, but then where are those CPU fab experts working right now that offers them a higher wage?


Writing computer software, mostly.


China, supposedly.


Intel has been trying to make cutting-edge fabs... and we all know how that is going.

There is good reason nobody wants to be in the fab business.


> It's crazy that Google doesn't spin-out their TPU work as a separate company.

Not really. Google TPUs require Google's specific infrastructure and cannot be deployed outside a Google datacenter. The software is Google-specific, the monetization model is Google-specific.

We also have no idea how profitable TPUs would actually be as a separate company. The only customers of TPUs are Google and Google Cloud.


Why would you spin out a competitive moat?


Any activist investors lurking in here?


Impressive: “Overall, more than 60% of funded generative AI startups and nearly 90% of gen AI unicorns use Google Cloud’s AI infrastructure, including Cloud TPUs.”


Doesn’t Google Cloud’s AI infrastructure include Colab? That’s useful for so many things.


Google will also offer GCP credits for free Nvidia GPUs with almost no questions asked.

AWS and Azure (to a lesser extent) can also make this argument.


Any strings attached? Do you know if they’ll do it pre-funding?


"Use" does not mean "heavily rely on". If an AI startup uses Google Colab or runs one PoC with TPUs, then it would fall under this stat.


Apple Intelligence uses Google TPUs instead of GPUs.


That's not surprising, given JG and Ruoming's Google stints.

Google is going to dominate the LLM-ushered AI era. Google has been AI-first since 2016; they just haven't had the opening. Sam, inept at engineering, just has no idea how to navigate the delicate biz and eng competition.


If you read all their papers, they use a mix of them.


For training, yes, but there's no indication about inference workloads. Apple has said they would use their own silicon for inference in the cloud.


Plus the Apple "Neural Engine" which has shipped on millions of iPhones for local inference.


How are they connected? PCIe? Something like NVLink?


They use custom optical "Interchip Interconnect" within each 256-chip "pod" and their custom "Jupiter" networking between pods.

See https://cloud.google.com/blog/products/compute/introducing-t... and https://cloud.google.com/blog/topics/systems/the-evolution-o...


Optical circuit switches https://arxiv.org/abs/2304.01433


The real winner here is the marketing department, who managed to make this article a "celebration of successes" when in fact we know the TPU is yet another of those cases where Google had the lead by a mile and then... squandered it. And no, "it's on our cloud and Pixel phones" doesn't cut it at this level.


I have a strong suspicion that previous generations of TPU were not cost effective for decent AI, explaining Google's reluctance to release complex models. They have had superior translation for years, for example. But scaling it up to the world population? Not possible with TPUs.

It was OpenAI that showed you can actually deploy a large model, like GPT-4, to a large audience. Maybe Google didn't reach the cost efficiency with just internal use that NVIDIA does.


Google used to have superior translation, but that hasn't been the case for years now. Based on my experience, DeepL (https://www.deepl.com/) is vastly superior, especially for even slightly more niche languages. I'm a native Finnish speaker and I regularly use DeepL to translate Finnish into English in cases where I don't want to do it by hand, and the quality is just way beyond anything Google can do. I've had similar experiences with languages I'm less proficient in but still understand to an extent, such as French or German.


There are several talks out there where Google soft-admits that at least the early gens of TPUs really sucked, e.g.:

https://www.youtube.com/watch?v=nR74lBO5M3s

(note the lede on the TPU is buried pretty deep here)


I suspect it had much more to do with lacking product-market fit. They spent 10 years faking demos and dreaming about what they thought AI could do eventually, but since it never worked, the products were never released and so they never expanded. A well-optimized TPU will always beat a well-optimized GPU on efficiency.


Only because of Nvidia's margins. "Worse but cheaper" is actually great for a company of Google's scale, but it doesn't make for a particularly compelling press release or paper.


[flagged]



Edge TPUs are definitely not comparable to the datacenter TPUs. They only support TFLite, for one.
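And even then, only quantized TFLite models compiled specifically for the Edge TPU. A minimal sketch of what inference looks like with the pycoral bindings (the model and image paths are placeholders):

    from pycoral.utils.edgetpu import make_interpreter
    from pycoral.adapters import common, classify
    from PIL import Image

    # The model must be a TFLite file compiled for the Edge TPU
    # (e.g. with edgetpu_compiler); unsupported ops fall back to the CPU.
    interpreter = make_interpreter("model_edgetpu.tflite")
    interpreter.allocate_tensors()

    # Resize the input to the model's expected size and run one inference.
    image = Image.open("image.jpg").resize(common.input_size(interpreter))
    common.set_input(interpreter, image)
    interpreter.invoke()

    # Print the top-1 class id and score.
    for c in classify.get_classes(interpreter, top_k=1):
        print(c.id, c.score)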


Google Coral Edge TPUs have found a practical niche in low-power OSS Frigate NVR appliances, e.g. object recognition for security camera feeds.


Didn't they abandon edge TPUs?


Any references on that? For a couple of years, they were fetching 100% price premiums on eBay, due to high demand and low supply.



Helpful thread, thanks: Google support team churn after the distribution transition to ASUS IoT; Frigate devs were preparing to fork Google repos, then new Google devs appeared.

> Google is getting back on top of things aka coral support which is nice.. it seems that the original devs weren't on the project and new devs needed to be given notice. Hopefully this continues and things are kept up to date.. updated libcoral and pycoral libraries are coming as well.

It's good that Frigate brought attention to languishing Linux maintenance for Coral. Rockchip 3588 and other Arm SoCs have NPUs, which will likely be supported in time, but each SoC will require validation. Coral Edge TPUs were a convenient single target that worked with any x86 and Arm board, via USB or M.2 slot.



