Amazon Bets $150B on Data Centers Required for AI Boom (bloomberg.com)
46 points by thelastgallon 10 months ago | 32 comments



That's just $10b a year for 15 years, fairly conservative I'd say.

I expect a GPT-5 training cluster to cost $10b (say, 100k Blackwell chips plus associated infrastructure), and a GPT-6-capable cluster to cost $100b.

It sounds like a lot, but it's just typical mega-infrastructure funding. California's high-speed rail is also around $100b, for example, and Microsoft has $90b in cash reserves.

The gulf states are also piling into AI, because they finally see a truly viable alternative to oil: renting out GPU clusters. Just swapping to renewables is not enough, as their economies are predicated on having a huge, simple, defendable profit source that is exportable. Now they can redeploy the spare oil money to a use case that has predictable demand, is extremely capital intensive, and is far more productive than building new cities in the desert.

The world eventually built trillions of dollars' worth of power stations and roads. I expect the same for data centers, with entire farms replaced by humming GPU racks. But that will take many decades.


I really thought you were going off on quite a tangent until the last paragraph. It really proves your point. Arguably, we’ve already seen trillions in value generated from data centers.

I still question GPT-6 needing a $100b data center. I expect GPT-6 to be out before 2026, since I expect them to release new models regularly now, as a matter of marketing. Neither the cash nor the hardware purchasing capacity will exist by then.

I don’t give the gulf states that much credit either. I think they’re just spraying money at every opportunity. They were pretty into crypto before too. They’ve been around the valley with outsized checks forever. If I had endless money with an expiration date I’d also invest in everything. What would be interesting is to see them really commit and try to on-shore fabs and silicon engineering. That’s the ultimate move to gain geopolitical protection when oil interest stops.


> I expect a GPT-5 training cluster to cost $10b (say, 100k Blackwell chips plus associated infrastructure), and a GPT-6-capable cluster to cost $100b.

You just completely made up those numbers.


GPT-4 cost about $100 mil to train. That's opex (renting the GPUs). Expect the corresponding capex (buying the GPUs) on Azure's end to be 10x that. Hence $1 bil.

GPT-5 being 10x the training cost of GPT-4 is a reasonable estimate. GPT-4 was itself 10x the size of GPT-3. So $10 bil.

GPT-6 being another 10x over GPT-5 is therefore just an extrapolation, hence $100 bil.
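Spelled out, the napkin math is just this (every input below is an assumption from this thread, not a confirmed figure):

    # Back-of-the-napkin extrapolation; every number here is an assumption.
    gpt4_opex = 100e6          # rumoured GPT-4 training cost (rented GPUs)
    capex_multiplier = 10      # guess: provider's hardware spend vs. one training run
    scale_per_generation = 10  # guess: each GPT generation costs ~10x the previous one

    gpt4_capex = gpt4_opex * capex_multiplier        # ~$1b of GPUs behind GPT-4
    gpt5_capex = gpt4_capex * scale_per_generation   # ~$10b cluster for GPT-5
    gpt6_capex = gpt5_capex * scale_per_generation   # ~$100b cluster for GPT-6
    print(f"GPT-4: ${gpt4_capex/1e9:.0f}b, GPT-5: ${gpt5_capex/1e9:.0f}b, GPT-6: ${gpt6_capex/1e9:.0f}b")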


The cost of computing decreases over time. No one will pay $100 Billion to train GPT-6. That is absurd. The current top supercomputer in the world (Frontier) cost $600M.

It is rumoured that GPT-4 was trained on 10,000 A100 GPUs, released in 2020. Total cost: about $100M at $10k each.

Today they can buy more powerful hardware and train much larger models for the same cost.

https://nvidianews.nvidia.com/news/nvidia-blackwell-platform...

https://www.cerebras.net/product-chip
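As a rough sketch of the compute-per-dollar argument (the chip prices and FLOPS below are ballpark assumptions for illustration, not quoted figures):

    # Same budget, newer chips: illustrative comparison with assumed numbers.
    budget = 100e6                        # the rumoured GPT-4 hardware budget

    a100_price, a100_tflops = 10e3, 312   # A100: ~$10k, ~312 TFLOPS dense BF16 (assumed)
    b200_price, b200_tflops = 35e3, 2250  # Blackwell-class: rumoured ~$35k, ~2.25 PFLOPS (assumed)

    a100_total = (budget / a100_price) * a100_tflops  # total TFLOPS for $100M in 2020
    b200_total = (budget / b200_price) * b200_tflops  # total TFLOPS for the same $100M today
    print(f"Same budget buys roughly {b200_total / a100_total:.1f}x the raw FLOPS")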


It is mind-boggling to me that it was so cheap to get GPT-4. That is not even 0.1% of the cost of California HSR, yet it provides dramatically more value.


Not hard to beat zero. (So far.) Maybe HSR will work out, but funding is uncertain.


> The cost of computing decreases over time. No one will pay $100 Billion to train GPT-6. That is absurd. The current top supercomputer in the world (Frontier) cost $600M.

Just wanted to ask the question: do you think Frontier has provided more or less value to the world than GPT-4?


GPT-4 is not a piece of hardware. The comparison makes no sense. GPT-4 was enabled by hardware.

The idea is that you use something like a Frontier to create something like a GPT-4.

My point was about cost of computing. What kind of GPT can you train with a $600M computer?


Besides the "GPT-4 cost about $100 mil to train" number, everything else is still just a number you pulled out of your ass.

Why would you estimate that the $100 million training bill would require a billion in GPUs? That's kind of like saying getting $100 million of water would take a billion in plumbing.

I don't know what the number is, but I'm not going to start just making one up.


Have you ever heard of a "back-of-the-napkin" estimate?

Why so much reluctance to even try to make a reasoned guess?


Because his "back-of-the-napkin" estimate that someone is going to spend $100 billion to train a "GPT-6" (whatever that means) is laughably bad. These aren't estimates, these are just uninformed guesses pulled from nowhere.


I provided my reasoning, and you just seem incapable of understanding it. Here, The Information literally just wrote an article on a $100 billion data center: https://www.theinformation.com/articles/microsoft-and-openai...

It seems the most ignorant also tend to be the most arrogant.


>That's just $10b a year for 15 years, fairly conservative I'd say.

Reposting what I wrote over the years

[1] (11 months ago) It was only about a year or so ago that AWS was expanding as fast as they could: bringing up a new datacenter per week, getting Graviton 2 wafers from TSMC whenever extra capacity was available on top of their orders, and seeing no end of the expansion in sight. Now it seems all the demand is suddenly gone.

[2] (~2 years ago)

>Amazon said Thursday that revenue growth in its cloud-computing unit slowed in the third quarter to 27.5%.

27.5%. It is lower than their previous 33% over the past few years, but at the current size of AWS, growing 27.5% is still ridiculously good. To put this in perspective: if AWS had continued to grow at 33% in 2022 and 2023, the 2023 growth alone would equal the size of the entire AWS in 2018. It is not the first time Amazon has said they are limited by how fast they can build out datacenters and get hardware resources ready.
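To make the compounding concrete (the revenue figures below are rough, for illustration only):

    # Sanity check of the compounding claim; AWS revenue figures are ballpark assumptions.
    aws_2018 = 26e9   # AWS revenue in 2018, roughly (assumed)
    aws_2021 = 62e9   # AWS revenue in 2021, roughly (assumed)
    growth = 0.33     # hypothetical continued 33% growth

    aws_2022 = aws_2021 * (1 + growth)
    added_in_2023 = aws_2022 * growth  # revenue added in 2023 alone under 33% growth
    print(f"2023 increment ~${added_in_2023/1e9:.0f}b vs. all of AWS in 2018 ~${aws_2018/1e9:.0f}b")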

It will be interesting to see further details given out at AWS re:Invent 2022, especially on the Graviton roadmap.

It is also interesting that we have had a huge increase in compute density over the past 2 years, with more coming in the next 3-5 years, where a single-socket CPU can have 160 cores / 320 threads and more. Retrofitting older DCs with this kind of density would increase AWS's total compute by 2-3x minimum. At the current scale of AWS, continuing to spend money building DCs is pretty impressive in my book.

[1] https://news.ycombinator.com/item?id=35753169

[2] https://news.ycombinator.com/item?id=33384628


I can't wait for the whole Nvidia and data center stack to be disrupted by a completely new kind of computing device specifically for deep learning: analog and/or photonic integrators (probably paired with an adapted architecture like Hinton's forward-forward networks). The time is right and the incentives are in the trillions. Hook me up if you share the vision or know a guy.


I'm surprised they are not developing their own GPUs.



Those aren't anywhere near the flexibility of a GPU. It's a more specialized accelerator tailored to machine learning workloads.


Google already did this with TPUs and it supposedly saved them a ton in energy costs. I read somewhere that Nvidia doesn't gain as much by making the chips more cost-effective, only by making them more powerful.


Which is what people want.


Brawndo, it's got what plants crave.


That's the point. They don't want GPUs, they want AIPUs.


> It's a more specialized accelerator tailored to machine learning workloads.

Hard to imagine anyone using a cloud GPU for anything else.


and Trainium for training of course


And compete with CUDA? It's not the hardware that's the lock-in but CUDA.


The CUDA lock-in is overplayed. TensorFlow, PyTorch, and every other large framework support multiple hardware backends, including Google TPUs. Any company making a significant investment will steer some of it towards hardware support in the software they need.
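A minimal sketch of what that looks like in practice, using PyTorch (the torch_xla import is the separate TPU package and will only resolve on a TPU VM; everything else falls back to CPU):

    import torch

    # Pick whatever accelerator is present; the model code itself doesn't change.
    if torch.cuda.is_available():                  # Nvidia, or AMD via the ROCm build
        device = torch.device("cuda")
    else:
        try:
            import torch_xla.core.xla_model as xm  # Google TPUs via torch_xla
            device = xm.xla_device()
        except ImportError:
            device = torch.device("cpu")

    model = torch.nn.Linear(1024, 1024).to(device)
    x = torch.randn(8, 1024, device=device)
    print(model(x).shape, device)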


Name one model (besides Gemini, obviously) that was trained on non-Nvidia hardware.


Who knows; likely not many aside from some folks training on TPUs in GCP, but any large, well-funded corporation has a path laid out by Google, and by Apple with its M-series. You can build hardware and dedicated ML chips, and if you can do that, the software ecosystem knows how to handle it. CUDA isn't the moat; Nvidia's moat is still the chips. Building huge systems and ecosystems is a game for only the most capitalized entities, but all of them can do so. The software part is already a solved problem, at the cost of a new compiler.


That probably had less to do with CUDA and more to do with the fact that Nvidia dominates the high end of the market.


How many of the big, expensive training jobs are CUDA-specific? If it's billions of dollars of compute, rewriting the software to use whatever hardware is cheapest may make sense.


It takes time to re-engineer an entire ecosystem of tools. The whole 9-pregnant ladies in 1 month analogy comes to mind.

If you’re trying to accomplish a goal, how long are you willing to wait for your entire dependency tree to be engineered in-house. It’s happening slowly, but teams have to ship, and can’t wait for other teams to build fresh tools.

Additionally, the compute hardware is rented, and if there are no alternatives available for rent it doesn't matter. Data centers are full of Nvidia GPUs, not AMD GPUs and TPUs (because the support isn't there). It's a crazy chicken-and-egg situation where everyone would benefit but no one makes the move. It's slowly happening, but it's not yet at the point of totally replacing them.


That's the thing, they can be working on multiple paths in parallel.

They can be building on Nvidia and have a semiconductor team in another corner experimenting with alternatives for the future.

When profit margins are insane, there is always competition quietly brewing.

We just won’t know until they release it, because it's also a competitive advantage to keep your own plans under wraps until they're ready. Otherwise it may start an arms race that'll only drive up the cost of getting it done faster.



