That's just $10b a year for 15 years, fairly conservative I'd say.
I expect a GPT-5 training cluster to cost $10b (say 100k Blackwell chips plus associated infrastructure), and a GPT-6-capable cluster to cost $100b.
It sounds like a lot, but it's just typical mega-infrastructure funding. California's high-speed rail is also around $100b, for example, and Microsoft has $90b in cash reserves.
The Gulf states are also piling into AI, because they finally see a truly viable alternative to oil: renting out GPU clusters. Just swapping to renewables is not enough, as their economies are predicated on having a huge, simple, defensible profit source that is exportable. Now they can redeploy the spare oil money into a use case that has predictable demand, is extremely capital intensive, and is far more productive than building new cities in the desert.
In the end, the world built trillions of dollars' worth of power stations and roads. I expect the same for data centers, with entire farms being replaced by humming GPU racks. But that will take many decades.
I really thought you were going off on quite a tangent until the last paragraph. It really proves your point. Arguably, we’ve already seen trillions in value generated from data centers.
I still question GPT-6 needing a $100b data center. I expect GPT-6 to be out before 2026, since I expect them to regularly release new models now, as a matter of marketing. Neither the cash nor the hardware purchasing capacity will exist by then.
I don’t give the Gulf states that much credit either. I think they’re just spraying money at every opportunity. They were pretty into crypto before too. They’ve been around the Valley with outsized checks forever. If I had endless money with an expiration date, I’d also invest in everything. What would be interesting is to see them really commit and try to on-shore fabs and silicon engineering. That’s the ultimate move to gain geopolitical protection for when interest in oil dries up.
GPT-4 cost about $100 mil to train. That's opex (renting the GPUs). Expect the corresponding capex (buying the GPUs) on Azure's end to be 10x that. Hence $1 bil.
GPT-5 being 10x the training cost of GPT-4 is a reasonable estimate. GPT-4 was itself 10x the size of GPT-3. So $10 bil.
GPT-6 being another 10x over GPT-5 is therefore just an extrapolation, hence $100 bil.
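To make the chain explicit, here's that arithmetic as a tiny script; every input is a rough assumption, not a reported figure:

```python
# Sketch of the 10x-per-generation extrapolation above.
# All figures are rough assumptions, not reported numbers.
gpt4_train_opex = 100e6        # ~$100M rented-GPU bill for GPT-4 (rumoured)
capex_multiplier = 10          # assume buying the cluster costs ~10x the rental bill
generation_multiplier = 10     # assume each generation costs ~10x the previous one

capex = gpt4_train_opex * capex_multiplier   # ~$1B of GPUs behind GPT-4
for model in ("GPT-5", "GPT-6"):
    capex *= generation_multiplier
    print(f"{model} cluster: ~${capex / 1e9:.0f}B")
# GPT-5 cluster: ~$10B
# GPT-6 cluster: ~$100B
```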
The cost of computing decreases over time. No one will pay $100 Billion to train GPT-6. That is absurd. The current top supercomputer in the world (Frontier) cost $600M.
It is rumoured that GPT-4 was trained on 10,000 A100 GPUs, released in 2020. At roughly $10k each, that's about $100M of hardware.
Today they can buy more powerful hardware and train much larger models for the same cost.
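As a toy illustration of that point (the per-dollar improvement factor is purely an assumption):

```python
# Toy illustration of the comment's arithmetic and the "same budget, more compute" point.
# All numbers are assumptions for illustration, not vendor figures.
a100_price = 10_000            # assumed price per A100
a100_count = 10_000            # rumoured GPT-4 cluster size
budget = a100_price * a100_count
print(f"Rumoured GPT-4 hardware bill: ~${budget / 1e6:.0f}M")

# If a newer chip delivers, say, 3x the training throughput per dollar
# (an assumed figure), the same budget buys ~3x the effective compute.
improvement_per_dollar = 3.0
print(f"Same ${budget / 1e6:.0f}M on newer chips: ~{a100_count * improvement_per_dollar:,.0f} A100-equivalents")
```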
> The cost of computing decreases over time. No one will pay $100 Billion to train GPT-6. That is absurd. The current top supercomputer in the world (Frontier) cost $600M.
Just wanted to ask the question: do you think Frontier has provided more or less value to the world than GPT-4?
Besides the "GPT-4 cost about $100 mil to train" number, everything else is still just a number you pulled out of your ass.
Why would you estimate that the $100 million training bill would require a billion in GPUs? That's kind of like saying getting $100 million of water would take a billion in plumbing.
I don't know what the number is, but I'm not going to start just making them up.
Because his "back-of-the-napkin" estimate that someone is going to spend $100 billion to train a "GPT-6" (whatever that means) is laughably bad. These aren't estimates, these are just uninformed guesses pulled from nowhere.
>That's just $10b a year for 15 years, fairly conservative I'd say.
Reposting what I wrote over the years:
[1] (11 months ago) It was only about a year or so ago that AWS was expanding as fast as they could: bringing up a new datacenter per week, getting Graviton 2 wafers from TSMC whenever extra capacity was available on top of their orders, and seeing no end of expansion in sight. Now it seems all that demand is suddenly gone.
[2] (~2 years ago)
>Amazon said Thursday that revenue growth in its cloud-computing unit slowed in the third quarter to 27.5%.
27.5% is lower than their previous 33% over the past few years, but at the current size of AWS, growing 27.5% is still ridiculously good. To put this in perspective: if AWS continues to grow at 33% in 2022 and 2023, then the 2023 growth alone would equal the size of the entire AWS in 2018 (a quick compounding check follows after this repost). It is not the first time Amazon has said they are limited by how fast they can build out datacenters and get hardware resources ready.
It will be interesting to see further details given out at AWS re:Invent 2022, especially on the Graviton roadmap.
It is interesting that we have had a huge increase in compute density over the past 2 years, with more coming in the next 3-5 years, where a single-socket CPU can have 160 cores / 320 threads or more. Retrofitting older DCs with this kind of density would increase AWS's total compute by 2-3x minimum. That, at the current scale of AWS, they continue to spend money building DCs is pretty impressive in my book.
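For the compounding claim in [2] above, here's a quick sanity check, normalizing AWS's 2018 size to 1 and assuming a flat 33%/yr (illustrative only):

```python
# Check of the claim that, at 33%/yr growth, the growth during 2023 alone
# roughly equals all of AWS in 2018. Sizes are normalized (2018 = 1.0).
size = 1.0                        # AWS size in 2018, normalized
for year in range(2019, 2024):    # compound 33% growth through 2023
    prev, size = size, size * 1.33
growth_2023 = size - prev         # absolute growth during 2023
print(f"2023 size: {size:.2f}x of 2018; growth during 2023: {growth_2023:.2f}x of 2018")
# -> the growth during 2023 is ~1.0x the entire 2018 size
```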
I can't wait for the whole Nvidia and data center stack to be disrupted by a completely new kind of computing device built specifically for deep learning: analog and/or photonic integrators (probably paired with an adapted architecture like Hinton's forward-forward networks). The time is right and the incentives are in the trillions. Hook me up if you share the vision or know a guy.
Google already did this with TPUs and it supposedly saved them a ton in energy costs. I read somewhere that NVIDIA doesn’t gain as much from making the chips more cost-effective as from making them more powerful.
The CUDA lock-in is overplayed. TensorFlow, PyTorch, and any large framework support multiple hardware backends, including Google TPUs. Any company making a significant investment will steer some of it towards hardware support in the software they need.
Who actually uses them? Likely not many aside from some folks training on TPUs in GCP, but any large, well-funded corporation has a path laid out by Google, and by Apple with its M-series: you can build hardware and dedicated ML chips, and if you can do that, the software ecosystem knows how to handle it.
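For a concrete sense of that portability, here is a minimal PyTorch sketch that picks whichever backend is present; the TPU branch assumes the separate torch_xla package is installed:

```python
import torch

# Pick an accelerator backend if one is present; the model code below is
# unchanged either way. The TPU branch assumes torch_xla is installed.
try:
    import torch_xla.core.xla_model as xm
    device = xm.xla_device()      # Google TPU via the XLA backend
except ImportError:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)
y = model(x)                      # same call regardless of the hardware underneath
print(device, y.shape)
```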
CUDA isn't the moat, it's the chips.
NVIDIA's moat is still the chips. Building huge systems and ecosystems is a game only for the most capitalized entities, but all of them can do so.
The software part is already a solved problem, at the cost of a new compiler.
How many of the big, expensive training jobs are CUDA-specific? If it's billions of dollars of compute, rewriting the software to use whatever hardware is cheapest may make sense.
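A back-of-the-envelope break-even with made-up numbers, just to show the shape of that trade-off:

```python
# Break-even sketch for "rewrite the stack to run on cheaper hardware".
# All inputs are made-up illustrative figures, not real costs.
compute_budget = 2e9        # planned training spend on the incumbent hardware, $
hardware_discount = 0.20    # assumed saving from switching to cheaper chips
rewrite_cost = 100e6        # assumed engineering cost to port the software stack, $

savings = compute_budget * hardware_discount
verdict = "worth it" if savings > rewrite_cost else "not worth it"
print(f"Savings ~${savings / 1e6:.0f}M vs rewrite ~${rewrite_cost / 1e6:.0f}M -> {verdict}")
# With billions in compute, even a modest discount can dwarf a nine-figure port.
```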
It takes time to re-engineer an entire ecosystem of tools. The whole nine-pregnant-ladies-in-one-month analogy comes to mind.
If you're trying to accomplish a goal, how long are you willing to wait for your entire dependency tree to be re-engineered in-house? It's happening slowly, but teams have to ship, and can't wait for other teams to build fresh tools.
Additionally, the compute hardware is rented, and if there are no alternatives available for rent, it doesn't matter. Data centers are full of NVidia GPUs, not AMD GPUs or TPUs (because the support isn't there). It's a crazy chicken-and-egg situation where everyone would benefit but no one makes the move. It's slowly happening, but it's not there yet to totally replace them.
That's the thing, they can be working on multiple paths in parallel.
They can be building on Nvidia while having a semiconductor team in another corner experimenting with alternatives for the future.
When profit margins are insane, there is always competition quietly brewing.
We just won't know until they release it, because it's also a competitive advantage to keep your own plans under wraps until they're ready. Otherwise it may start an arms race that will only drive up costs to get it done faster.