I only work on a lowly private cluster, but running standard benchmarks is utterly routine here (it is in fact automated). As others with HPC experience pointed out, running benchmarks is pretty much mandatory when bringing a new system up, not just to ensure it actually performs as promised, but also to weed out bad and marginal components. After one or two weeks of intense benchmarking and testing, you can be sure numerous nodes will have failed and needed parts replaced. When you apply patches, you benchmark again. Why? Because the supplier's fix for "code XYZ crashes nodes" is probably "let's uh just reduce power limits by a few %". Or because of unintentional issues limiting performance. When a node crashes, you benchmark it again. Why? Because linpack and friends are good at making marginal hardware fail.
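To make the "linpack makes marginal hardware fail" point concrete, here's a toy sketch of the idea, not real HPL and not anything a real cluster would use: hammer the FPU with repeated identical matrix multiplies and check that every run gives the same answer. The `burn_in` name and parameters are made up for illustration; a flaky node tends to crash or silently miscompute under exactly this kind of sustained load.

```python
# Toy linpack-style burn-in sketch (illustrative only, not real HPL).
# Repeat the same matrix multiply and verify every run is identical:
# healthy hardware is deterministic, marginal hardware often isn't.
import hashlib
import random

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def burn_in(n=32, iters=5, seed=42):
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    digests = set()
    for _ in range(iters):
        c = matmul(a, b)
        digests.add(hashlib.sha256(repr(c).encode()).hexdigest())
    # A healthy node computes the same answer every single time.
    return len(digests) == 1

print(burn_in())  # prints True on healthy hardware; a marginal node
                  # would print False or crash outright
```

Real bring-up testing does this at vastly larger scale (full-memory HPL runs for days), which is why it's so effective at shaking out parts that pass a quick smoke test.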
So it is literally unbelievable that Tesla not only stood up a cluster, but created their own hardware to do so, yet didn't run any quotable benchmark and only has theoretical FLOPS numbers for marketing.
> So it is literally unbelievable that Tesla not only stood up a cluster, but created their own hardware to do so, yet didn't run any quotable benchmark and only has theoretical FLOPS numbers for marketing.
They haven't gotten to this point yet.
They have a single tile (a node). It wouldn't surprise me if the tile they have is just a prototype as well.
You have to have a cluster before you can start running cluster benchmarks.
Not clear what you're disputing here?