> Their current training cluster would be the 5th largest supercomputer if Tesla stopped all real workloads, ran Linpack, and submitted it to the Top500 list.
which is trivial to do, and pretty much a must when bringing up a cluster to make sure it's working properly; so much so that most clusters do this on every maintenance, along with a bunch of other benchmarks;
and that they say this:
> cost equivalent versus Nvidia GPU, Tesla claims they can achieve 4x the performance, 1.3x higher performance per watt, and 5x smaller footprint.
but have no MLPerf results, tells you everything you need to know about it.
The list of long-term hype-only AI-hardware companies with billions of dollars of VC investment and literally nothing to show is incredible and keeps growing.
Every MLPerf round, the list of companies that want to submit is "huge", and one week before the deadline, 99.999% of them say "we'll submit next round", as they have for years.
It's as if people spent billions on creating an F1 team, then noticed during pre-season testing that the car can't even finish a lap. Then failed to even start a lap at every race of the season. And then did this again, year after year, for a decade. Burning billions and billions...
What's often overlooked is that just because you have a shit-ton of compute nodes doesn't mean you could make it into the TOP500. You might have the compute power, but the system most likely doesn't have the connectivity. E.g. the first slide says this is distributed over more than three locations, which essentially guarantees that the system as a whole doesn't have supercomputer-like connectivity. Worth pointing out that the networking in a supercomputer is a very significant chunk of the cost and power.
And if it is tailored for AI, it might not even do 32-bit floats, or only at a fraction of the advertised "AI FLOPS".
The more 3D renders in a presentation, the more skeptical I get. I've noticed that the quality of the 3D renders in a company's presentations/PR events is a direct leading indicator of its stock price. This is of course only anecdotal evidence, but I'm pretty sure the hypothesis can hold its ground against 50% of "ML papers" today.
> Their current training cluster would be the 5th largest supercomputer if Tesla stopped all real workloads, ran Linpack, and submitted it to the Top500 list.
That line is referring to their current Nvidia A100-powered supercomputer, which they brought up earlier this year with 5,760 A100 GPUs [0] (rough peak-FLOPS math below).
Read the line immediately before the one you quoted from the post:
> Tesla has been expanding the size of their GPU clusters for years. Their current training cluster would be the 5th largest supercomputer if Tesla stopped all real workloads, ran Linpack, and submitted it to the Top500 list.
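As a rough sanity check on that claim, here's some back-of-the-envelope math using NVIDIA's published per-A100 FP64 peaks (how much of that peak Linpack would actually sustain is a separate question):

```python
# Back-of-the-envelope peak (Rpeak) for 5,760 A100s, using NVIDIA's
# published per-GPU FP64 figures. A real Top500 entry reports Rmax,
# i.e. the fraction of this peak that HPL actually sustains.
num_gpus = 5760
fp64_tflops = 9.7          # A100 FP64 peak per GPU
fp64_tensor_tflops = 19.5  # A100 FP64 peak via tensor cores

print(f"Rpeak (FP64):        {num_gpus * fp64_tflops / 1000:.1f} PFLOPS")         # ~55.9
print(f"Rpeak (FP64 tensor): {num_gpus * fp64_tensor_tflops / 1000:.1f} PFLOPS")  # ~112.3
```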
Had to look into MLPerf as I don't follow supercomputing.
But I don't see them not doing that test as an issue, as they made it quite clear that their entire system is tailor-made, like an ASIC, to focus on neural nets and the specific compute pipelines relevant to what they care about. I would imagine a generic, broad-based ML benchmark would not only perform suboptimally on it, but also not be representative of what they're trying to do. It's not meant to be a general-purpose ML supercomputer; it's supposed to be a supercomputer that solves a narrowish niche of problems.
I only work on a lowly private cluster, but running standard benchmarks is utterly routine here (it is in fact automated). As others with HPC experience pointed out, running benchmarks is pretty much mandatory when bringing a new system up, not just to ensure it actually performs as promised, but also to weed out bad and marginal components. You do one or two weeks of intense benchmarking and testing, and you can be sure numerous nodes will fail and need parts replaced. When you apply patches, you benchmark again. Why? Because the supplier's fix for "code XYZ crashes nodes" is probably "let's uh just reduce power limits by a few %". Or because of unintentional issues limiting performance. When a node crashes, you benchmark it again. Why? Because Linpack and friends are good at making marginal hardware fail. (A toy sketch of this kind of screening is below.)
So it is literally unbelievable that Tesla not only stood up a cluster, but created their own hardware to do so, and yet didn't run any quotable benchmark and only has theoretical FLOPS numbers for marketing.
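For illustration only, here's a toy sketch of the kind of per-node screening described above. The hostnames, the wrapper script, and the expected figure are all made up; a real site would launch HPL/STREAM through its scheduler and parse the actual benchmark output.

```python
"""Toy sketch of per-node benchmark screening during bring-up (hypothetical)."""
import subprocess

EXPECTED_GFLOPS = 4000.0   # hypothetical per-node HPL figure from the acceptance specs
TOLERANCE = 0.90           # flag anything under 90% of the expected figure


def node_hpl_gflops(host: str) -> float:
    # Placeholder: run single-node HPL remotely via a made-up wrapper
    # script and assume it prints just the GFLOPS number.
    out = subprocess.run(
        ["ssh", host, "./run_hpl_single_node.sh"],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())


def screen(hosts: list[str]) -> list[str]:
    flagged = []
    for host in hosts:
        gflops = node_hpl_gflops(host)
        if gflops < EXPECTED_GFLOPS * TOLERANCE:
            flagged.append(host)
            print(f"{host}: {gflops:.0f} GFLOPS, below threshold, pull for service")
    return flagged


if __name__ == "__main__":
    screen([f"node{i:04d}" for i in range(1, 721)])
```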
> So it is literally unbelievable that Tesla not only stood up a cluster, but created their own hardware to do so, and yet didn't run any quotable benchmark and only has theoretical FLOPS numbers for marketing.
They haven't gotten to this point yet.
They have a single tile (a node). It wouldn't surprise me if the tile they have is just a prototype as well.
You have to have a cluster before you can start running cluster benchmarks.
I don't get it - they're using it for actual work, rather than burning power to run useless benchmarks for bragging rights - and you think that makes it hype?! Surely it's the opposite - running benchmarks rather than doing something useful is hype.
You have to connect thousands of cables, across hundreds of nodes with thousands of components, everything interconnected, and if you connect one wrong, the computer outputs incorrect results. You have to routinely update the software, and if a software upgrade introduces a 20% perf regression (which happens), then your 10 MW cluster starts burning 2 MW for nothing (rough numbers below). Or maybe your cooling system sucks, and after a minute of running at full capacity, you need to throttle your cluster to 0.1% of peak to keep it cool enough that it still runs "something".
That's why all systems in the Top500 I've been involved with (15 or so) run these benchmarks as integration tests on every single cluster maintenance (node updates, servicing, OS updates, etc.).
Submitting these results to the Top500 costs you nothing... if your cluster actually works. When you submit to the Top500, they ask for access so that they can re-run the benchmarks themselves, which typically happens during or after the next maintenance to avoid impacting any users.
If they haven't submitted, I'm 100% sure their cluster does not deliver what they say it should deliver on paper. Maybe it delivers 1% of it, or 0.01% of it (I've seen both cases in real life). If they haven't fixed it, then maybe it can't be fixed.
HPL, MLPerf, Spec, Stream, OSU.... these are not "benchmarks for bragging rights", these are tests that show that your system works.
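To put rough numbers on the regression example above (the electricity price is an assumption, purely for illustration):

```python
# Illustrative only: what a 20% perf regression "costs" on a 10 MW cluster.
cluster_power_mw = 10.0
regression = 0.20                       # 20% of the work is effectively lost
wasted_mw = cluster_power_mw * regression

price_per_mwh = 100.0                   # assumed $/MWh, made up for illustration
wasted_per_year = wasted_mw * 24 * 365 * price_per_mwh

print(f"Power burned for nothing: {wasted_mw:.1f} MW")
print(f"Roughly ${wasted_per_year:,.0f} per year at ${price_per_mwh:.0f}/MWh")  # ~$1,752,000
```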
> run these benchmarks as integration tests on every single cluster maintenance
Why run someone else's benchmark and not your own application to test performance? And what's the point of submitting to Top500? Why do you care how your system ranks? What's the business or technical purpose in that?
For the same reason that we don't test engines during a race. Only what's under test should change, with the rest being fixed.
It would be much more difficult to adapt some of your own applications as a test. If the results are bad, how would you even know whether it's the application's fault or a problem with the cluster? LINPACK, on the other hand, is well understood, and a ton of work has gone into making sure it uses all the power your cluster can deliver.
When a company claims that their new vaccine cures cancer, but refuses to apply for FDA approval without giving any reason, you seem to be the kind of person that camps in front of their offices overnight to get the first 10 shots, all in the neck.
Tesla is saying their hardware is 10x better than the competition.
I don't believe it. They don't publish any numbers, and everybody else does.
You believe them. Good for you.
I don't. I think not doing that is extremely suspicious, because if their cluster can be turned on, they have the results. So I'm 100% certain the reality is that their results just suck, and they don't publish them to save face.
You believe their shit is so good, they didn't even test it while installing it. Good for you.
You think there is a masterplan to keep the results secret. Good for you.
I think their results are just horrible, and that's why they don't show them.
With any serious company, this wouldn't even be a discussion, because they would just show the verified facts about what their cluster can do, or shut up about it if it's really so secret.
You don't see the NSA or the DOD bragging about being 100x better than "the competition".
To me, it looks like they are forced to sell the idea that they are doing something with these clusters to... drive up hype and get the stock back up, and since their numbers are actually horrible, this is what we get.
But they've got no obligation to prove anything to random people on the internet. They aren't asking for peer review. They probably don't want to waste time and energy running a benchmark to compete in someone else's top N list. What's in it for them?
> But they've got no obligation to prove anything to random people on the internet.
They are a publicly traded company.
I'm a Tesla investor and deserve to know.
If they are right and their hardware is 10x better than the competition's, then I'm happy. If they are wrong and they could have gotten 10x better outcomes by spending 100x less money, I'll be very pissed off.
> They probably don't want to waste time and energy running a benchmark to compete in someone else's top N list.
It costs no time. It's the first thing anybody does when building any cluster. It's part of accepting the hardware from your suppliers, to check that they didn't sell you shit.
I'm not an expert in securities law, but I don't believe being an investor entitles you to demand arbitrary proprietary information you want from a company. Otherwise people would buy one Apple share and find out what specs the new iPhone will have.
HPL is a stress test with a notionally useful output measurement - just about the most effective way of pushing CPU load to the limit, and it tells you what fraction of the theoretical maximum FLOPS you can actually achieve given the other system constraints like memory and network throughput.
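As a made-up example of that fraction (the numbers below are illustrative, not Tesla's; well-tuned GPU systems on the list often land somewhere around 60-70% HPL efficiency):

```python
# HPL efficiency: measured Rmax as a fraction of theoretical Rpeak.
# Both numbers are hypothetical.
rpeak_pflops = 100.0  # theoretical peak of a hypothetical cluster
rmax_pflops = 65.0    # what HPL actually sustained on it

print(f"HPL efficiency: {rmax_pflops / rpeak_pflops:.0%}")  # 65%
```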
Running benchmarks is not only useful but necessary. It's a point where planning meets the real world. Private clusters are also routinely benchmarked, often as part of validation.
In the end, if someone just tells you their supercomputer "would" be in the TOP500, they are really telling you that it is definitely not in the TOP500. There might be good reasons for that, but still, it's like claiming you would win Olympic gold if only you bothered to participate.
There might be some level of showmanship involved, but Tesla isn't selling those things, it is using them. Quite a different situation in comparison to a random startup which tries to convince investors to finance their products.