> Their current training cluster would be the 5th largest supercomputer if Tesla stopped all real workloads, ran Linpack, and submitted it to the Top500 list.
which is trivial to do, and pretty much a must when bringing up a cluster to make sure it's working properly; so much so that most clusters do this on every maintenance, along with a bunch of other benchmarks;
and that they say this:
> cost equivalent versus Nvidia GPU, Tesla claims they can achieve 4x the performance, 1.3x higher performance per watt, and 5x smaller footprint.
but have no MLPerf results, tells you everything you need to know about it.
The list of long-term hype-only AI-hardware companies with billions of dollars of VC investment and literally nothing to show is incredible and keeps growing.
Every MLPerf round, the list of companies that want to submit is "huge", and one week before the deadline, 99.999% of them say "we'll submit next round", as they have for years.
It's as if people spent billions on creating an F1 team, then noticed during pre-season testing that the car can't even finish a lap. Then failed to even start a lap at every race of the season. And then did this again, year after year, for a decade. Burning billions and billions...
What's often overlooked is that just because you have a shit-ton of compute nodes doesn't mean you could make it into the TOP500. You might have the compute power, but the system most likely doesn't have the connectivity. E.g. the first slide says this is distributed over more than three locations, which essentially guarantees that the system as a whole doesn't have supercomputer-like connectivity. Worth pointing out that the networking in a supercomputer is a very significant chunk of the cost and power.
And if it is tailored for AI, it might not even do 32-bit floats, or only at a fraction of the advertised "AI FLOPS".
The more 3D renders in a presentation, the more skeptical I get. I've noticed that the quality of the 3D renders in a company's presentations/PR events is a direct leading indicator of its stock price. This is of course only anecdotal evidence, but I'm pretty sure the hypothesis can hold its ground against 50% of "ML papers" today.
> Their current training cluster would be the 5th largest supercomputer if Tesla stopped all real workloads, ran Linpack, and submitted it to the Top500 list.
That line is referring to their current Nvidia A100-powered supercomputer, which they brought up earlier this year with 5,760 A100 GPUs [0] (rough peak-FLOPS math below).
Read the line immediately before the one you quoted from the post:
> Tesla has been expanding the size of their GPU clusters for years. Their current training cluster would be the 5th largest supercomputer if Tesla stopped all real workloads, ran Linpack, and submitted it to the Top500 list.
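As a rough sanity check on that claim, here's some back-of-the-envelope math using NVIDIA's published per-A100 FP64 peaks (how much of that peak Linpack would actually sustain is a separate question):

```python
# Back-of-the-envelope peak (Rpeak) for 5,760 A100s, using NVIDIA's
# published per-GPU FP64 figures. A real Top500 entry reports Rmax,
# i.e. the fraction of this peak that HPL actually sustains.
num_gpus = 5760
fp64_tflops = 9.7          # A100 FP64 peak per GPU
fp64_tensor_tflops = 19.5  # A100 FP64 peak via tensor cores

print(f"Rpeak (FP64):        {num_gpus * fp64_tflops / 1000:.1f} PFLOPS")         # ~55.9
print(f"Rpeak (FP64 tensor): {num_gpus * fp64_tensor_tflops / 1000:.1f} PFLOPS")  # ~112.3
```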
Had to look into MLPerf as I don't follow supercomputing.
But I don't see them not doing that test as an issue, as they made it quite clear that their entire system is tailor-made, like an ASIC, to focus on neural nets and the specific compute pipelines relevant to what they care about. I would imagine a generic, broad-based ML benchmark would not only perform suboptimally on it, but also not be representative of what they're trying to do. It's not meant to be a general-purpose ML supercomputer; it's supposed to be a supercomputer that solves a narrowish niche of problems.
I only work on a lowly private cluster, but running standard benchmarks is utterly routine here (it is in fact automated). As others with HPC experience pointed out, running benchmarks is pretty much mandatory when bringing a new system up, not just to ensure it actually performs as promised, but also to weed out bad and marginal components. You do one or two weeks of intense benchmarking and testing, and you can be sure numerous nodes will fail and need parts replaced. When you apply patches, you benchmark again. Why? Because the supplier's fix for "code XYZ crashes nodes" is probably "let's uh just reduce power limits by a few %". Or because of unintentional issues limiting performance. When a node crashes, you benchmark it again. Why? Because Linpack and friends are good at making marginal hardware fail. (A toy sketch of this kind of screening is below.)
So it is literally unbelievable that Tesla not only stood up a cluster, but created their own hardware to do so, and yet didn't run any quotable benchmark and only has theoretical FLOPS numbers for marketing.
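For illustration only, here's a toy sketch of the kind of per-node screening described above. The hostnames, the wrapper script, and the expected figure are all made up; a real site would launch HPL/STREAM through its scheduler and parse the actual benchmark output.

```python
"""Toy sketch of per-node benchmark screening during bring-up (hypothetical)."""
import subprocess

EXPECTED_GFLOPS = 4000.0   # hypothetical per-node HPL figure from the acceptance specs
TOLERANCE = 0.90           # flag anything under 90% of the expected figure


def node_hpl_gflops(host: str) -> float:
    # Placeholder: run single-node HPL remotely via a made-up wrapper
    # script and assume it prints just the GFLOPS number.
    out = subprocess.run(
        ["ssh", host, "./run_hpl_single_node.sh"],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())


def screen(hosts: list[str]) -> list[str]:
    flagged = []
    for host in hosts:
        gflops = node_hpl_gflops(host)
        if gflops < EXPECTED_GFLOPS * TOLERANCE:
            flagged.append(host)
            print(f"{host}: {gflops:.0f} GFLOPS, below threshold, pull for service")
    return flagged


if __name__ == "__main__":
    screen([f"node{i:04d}" for i in range(1, 721)])
```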
> So it is literally unbelievable that Tesla not only stood up a cluster, but created their own hardware to do so, and yet didn't run any quotable benchmark and only has theoretical FLOPS numbers for marketing.
They haven't gotten to this point yet.
They have a single tile (a node). It wouldn't surprise me if the tile they have is just a prototype as well.
You have to have a cluster before you can start running cluster benchmarks.
I don't get it - they're using it for actual work, rather than burning power to run useless benchmarks for bragging rights - and you think that makes it hype?! Surely it's the opposite - running benchmarks rather than doing something useful is hype.
You have to connect thousands of cables, across hundreds of nodes with thousands of components, everything interconnected, and if you connect one wrong, the computer outputs incorrect results. You have to routinely update the software, and if a software upgrade introduces a 20% perf regression (which happens), then your 10 MW cluster starts burning 2 MW for nothing (rough numbers below). Or maybe your cooling system sucks, and after a minute of running at full capacity, you need to throttle your cluster to 0.1% of peak to keep it cool enough that it still runs "something".
That's why all systems in the Top500 I've been involved with (15 or so) run these benchmarks as integration tests on every single cluster maintenance (node updates, servicing, OS updates, etc.).
Submitting these results to the Top500 costs you nothing... if your cluster actually works. When you submit to the Top500, they ask for access so that they can re-run the benchmarks themselves, which typically happens during or after the next maintenance to avoid impacting any users.
If they haven't submitted, I'm 100% sure their cluster does not deliver what they say it should deliver on paper. Maybe it delivers 1% of it, or 0.01% of it (I've seen both cases in real life). If they haven't fixed it, then maybe it can't be fixed.
HPL, MLPerf, Spec, Stream, OSU.... these are not "benchmarks for bragging rights", these are tests that show that your system works.
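To put rough numbers on the regression example above (the electricity price is an assumption, purely for illustration):

```python
# Illustrative only: what a 20% perf regression "costs" on a 10 MW cluster.
cluster_power_mw = 10.0
regression = 0.20                       # 20% of the work is effectively lost
wasted_mw = cluster_power_mw * regression

price_per_mwh = 100.0                   # assumed $/MWh, made up for illustration
wasted_per_year = wasted_mw * 24 * 365 * price_per_mwh

print(f"Power burned for nothing: {wasted_mw:.1f} MW")
print(f"Roughly ${wasted_per_year:,.0f} per year at ${price_per_mwh:.0f}/MWh")  # ~$1,752,000
```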
> run these benchmarks as integration tests on every single cluster maintenance
Why run someone else's benchmark and not your own application to test performance? And what's the point of submitting to Top500? Why do you care how your system ranks? What's the business or technical purpose in that?
For the same reason that we don't test engines during a race. Only what's under test should change, with the rest being fixed.
It would be much more difficult to adapt some of your own applications as a test. If the results are bad, how would you even know whether it's the application's fault or a problem with the cluster? LINPACK, on the other hand, is well understood, and a ton of work has gone into making sure it uses all the power your cluster can deliver.
When a company claims that their new vaccine cures cancer, but refuses to apply for FDA approval without giving any reason, you seem to be the kind of person that camps in front of their offices overnight to get the first 10 shots, all in the neck.
Tesla is saying their hardware is 10x better than the competition.
I don't believe it. They don't publish any numbers, and everybody else does.
You believe them. Good for you.
I don't. I think not doing that is extremely suspicious, because if their cluster can be turned on, they have the results. So I'm 100% certain the reality is that their results just suck, and they don't publish them to save face.
You believe their shit is so good, they didn't even test it while installing it. Good for you.
You think there is a masterplan to keep the results secret. Good for you.
I think their results are just horrible, and that's why they don't show them.
With any serious company, this wouldn't even be a discussion, because they would just show the verified facts about what their cluster can do, or shut up about it if it's really so secret.
You don't see the NSA or the DOD bragging about being 100x better than "the competition".
To me, it looks like they are forced to sell the idea that they are doing something with these clusters to... drive up hype and get the stock back up, and since their numbers are actually horrible, this is what we get.
But they've got no obligation to prove anything to random people on the internet. They aren't asking for peer review. They probably don't want to waste time and energy running a benchmark to compete in someone else's top N list. What's in it for them?
> But they've got no obligation to prove anything to random people on the internet.
They are a publicly traded company.
I'm a Tesla investor and deserve to know.
If they are right and their hardware is 10x better than the competition's, then I'm happy. If they are wrong and they could have gotten 10x better outcomes by spending 100x less money, I'll be very pissed off.
> They probably don't want to waste time and energy running a benchmark to compete in someone else's top N list.
It costs no time. It's the first thing anybody does when building any cluster. It's part of accepting the hardware from your suppliers, to check that they didn't sell you shit.
I'm not an expert in securities law, but I don't believe being an investor entitles you to demand arbitrary proprietary information you want from a company. Otherwise people would buy one Apple share and find out what specs the new iPhone will have.
HPL is a stress test with a notionally useful output measurement - just about the most effective way of pushing CPU load to the limit, and it tells you what fraction of the theoretical maximum FLOPS you can actually achieve given the other system constraints like memory and network throughput.
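As a made-up example of that fraction (the numbers below are illustrative, not Tesla's; well-tuned GPU systems on the list often land somewhere around 60-70% HPL efficiency):

```python
# HPL efficiency: measured Rmax as a fraction of theoretical Rpeak.
# Both numbers are hypothetical.
rpeak_pflops = 100.0  # theoretical peak of a hypothetical cluster
rmax_pflops = 65.0    # what HPL actually sustained on it

print(f"HPL efficiency: {rmax_pflops / rpeak_pflops:.0%}")  # 65%
```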
Running benchmarks is not only useful but necessary. It's a point where planning meets the real world. Private clusters are also routinely benchmarked, often as part of validation.
In the end, if someone just tells you their supercomputer "would" be in the TOP500, they are really telling you that it is definitely not in the TOP500. There might be good reasons for that, but still, it's like claiming you would win Olympic gold if only you bothered to participate.
There might be some level of showmanship involved, but Tesla isn't selling those things, it is using them. Quite a different situation in comparison to a random startup which tries to convince investors to finance their products.