
Cool. Can anyone chime in on how this compares to other ML SoC/ASICs? I know many places are going hard on general purpose GPUs, but I’d imagine ASIC based supercomputers (like Google TPU) are the way to go forward.



This is a crazy oversimplification, but let me take a shot. A full comparison to the other players could easily fill a novel.

Most training SoCs are focused on building something the size of an NVIDIA GPU, but designed purely for ML rather than for general-purpose HPC compute (FP64) plus ML. Those accelerators are usually optimized for a few types of models. NVIDIA is the baseline, so competitors look for areas where they can get a large boost at a lower cost with something roughly the size of an A100/H100.

Cerebras is perhaps the biggest exception with its WSE-2, a wafer-scale chip. Because the chip is roughly 50x larger, Cerebras crosses higher-latency, higher-power off-package interconnect far less often. In turn, Cerebras drives performance and cost savings by not needing NVLink4 NVSwitches / InfiniBand.

Tesla's Dojo Tile is 25 D1 chips, each roughly the size of an NVIDIA GPU die, in a single package, with die-to-die communication handled by the base tile, which is then built out into scale-up units. Tesla has also focused on the interconnect and the pipeline feeding the D1s and Tiles.

Ultimately, I think it takes something beyond "solution X saves 30% over NVIDIA on these workloads in performance/$" to survive. NVIDIA has a massive software ecosystem and can handle more types of tasks than most of the other AI accelerators, and that goes beyond training to other parts of the data prep and movement pipeline. NVIDIA extracts high margins from that work, which is why some competitors are effectively pitching "it costs less and on some problems can be faster" architectures, while Tesla, Cerebras, Google, and a few others have another level of differentiation.

Nothing is perfect, nor was that explanation, but that is a high-level view of why the technology featured here is impactful.


It's a very focused and impressive effort. Training from SRAM can be one to two orders of magnitude faster than training from DRAM.
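
A toy roofline model makes the intuition concrete. Every number below is a made-up, illustrative assumption rather than a vendor figure, but it shows why a bandwidth-bound training kernel speeds up roughly in proportion to memory bandwidth once you feed it from on-chip SRAM instead of DRAM:

    # Toy roofline model: achievable throughput is capped by compute or memory.
    # All numbers are illustrative assumptions, not measured or vendor figures.

    def attainable_tflops(peak_tflops, bandwidth_tbps, flops_per_byte):
        # Roofline: min(compute peak, bandwidth * arithmetic intensity).
        return min(peak_tflops, bandwidth_tbps * flops_per_byte)

    peak = 300.0        # TFLOPS of 16-bit matrix math (assumed)
    intensity = 5.0     # FLOPs per byte moved (bandwidth-hungry kernel, assumed)

    dram_bw = 2.0       # TB/s, roughly HBM-class off-chip bandwidth
    sram_bw = 200.0     # TB/s, stand-in for aggregate on-chip SRAM bandwidth

    print(attainable_tflops(peak, dram_bw, intensity))  # 10.0  -> memory-bound
    print(attainable_tflops(peak, sram_bw, intensity))  # 300.0 -> compute-bound, ~30x

The exact speedup depends on the kernel's arithmetic intensity, which is why the realistic claim is "one to two orders of magnitude" rather than a fixed factor.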

Modern GPUs, even with some process advantage (and the H100 isn't shipping yet), strive too hard to be general-purpose processors, and that caps their peak deep learning performance.

Every serious player will have to make or buy their own TPU.


They'll be at a process disadvantage. D1 is allegedly TSMC 7nm, as per last year's information (https://www.tomshardware.com/news/tesla-d1-ai-chip).

An ASIC can strip out features it doesn't need and save some space. But a good chunk of a modern GPU is memory controllers, registers, and SIMD cores. And modern GPUs (both AMD's MI250X and NVidia's A100) have 16-bit matrix multiplication units (aka Tensor cores). Once we factor in the process disadvantage, I'm not sure the D1 will be as competitive as they hope.

Tesla's hope is that their D1 chip has more 16-bit matrix multiplication cores than the NVidia / AMD designs. But A100 is quite solid, and NVidia Hopper has been announced at HC34 (aka: NVidia's next generation).

https://www.nvidia.com/en-us/technologies/hopper-architectur...

-------

Most of this presentation on the Tesla Dojo is about the interconnect system. Alas, NVidia is on something like the 4th (or was it 5th?) generation of NVLink, available in their DGX servers (and I'm sure a Hopper version will come out soon).

AMD's not far behind either, with a lot of good HC34 presentations this year pointing out how AMD's "Frontier" supercomputer has huge bandwidth. In particular, each MI250X GPU is a twin-chiplet design (two GPU dies per package), with 5 high-speed links connecting it to other GPUs. There's a reason Frontier is the #1 supercomputer in the world right now, in both absolute double-precision FLOPS and green double-precision FLOPS-per-watt.

NVidia's Hopper will be on TSMC's 4N node, and AMD's MI250X is on 6nm. That means the 7nm D1 packs noticeably fewer transistors into the same area than NVidia's next-generation part.
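
For a rough sense of scale, the publicly reported die sizes and transistor counts work out as follows (approximate figures, and cross-vendor density comparisons are always a bit apples-to-oranges):

    # Approximate, publicly reported figures; treat the densities as ballpark only.
    chips = {
        # name:             (transistors_billion, die_area_mm2, node)
        "Tesla D1":          (50.0, 645.0, "TSMC N7"),
        "NVIDIA A100":       (54.2, 826.0, "TSMC N7"),
        "NVIDIA H100/GH100": (80.0, 814.0, "TSMC 4N"),
    }

    for name, (xtors, area, node) in chips.items():
        density = xtors * 1000.0 / area  # million transistors per mm^2
        print(f"{name:18s} {node:8s} ~{density:5.1f} MTr/mm^2")

    # D1 ~77.5, A100 ~65.6, H100 ~98.3 MTr/mm^2 -- the node gap is real,
    # but it is closer to ~1.3x density versus the D1 than a full 2x.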

> but I’d imagine ASIC based supercomputers (like Google TPU) are the way to go forward.

Only if you keep up with the process shrinks. 7nm is getting long in the tooth now. All eyes are looking forward to 5nm, 4nm, and even 3nm designs (now that Apple is the lead customer for TSMC's 3nm node).

-----------

That being said, if the 7nm node is cheaper, maybe this exercise was still cost-effective for Tesla. As the newer nodes obsolete the older nodes, the older nodes become more cost-effective.

Cost-efficiency is less popular / less cool, but still an effective business plan.

> but I’d imagine ASIC based supercomputers (like Google TPU) are the way to go forward.

The issue is that it probably costs hundreds of millions of dollars to design something like the D1. Sure, the per-unit cost once the chip is in mass production will be low, but chips have stupidly high startup costs (masks, engineering, research, etc.).

GPUs, on the other hand, are more general purpose and applicable to more situations. So you can sell the GPU to more customers and spread out the R&D costs. In particular, GPUs capture the attention of the video game crowd, who will fund high-end GPU research just to play video games.

Much like how Intel's laptop chips let its server chips share the R&D effort, NVidia's consumer GPUs share research and development costs with its high-end A100 cards.


Nvidia's stuff is good, but it's pretty high margin. They don't give access to it for cheap, and over the last five years the cost per unit of performance has been nearly flat. They're also more generalized than Tesla needs. Performance gains from process shrinks have also stagnated. A good time for a custom approach.


NVidia is claiming 1000 TFLOPS of 16-bit Tensor throughput on Hopper: https://developer.nvidia.com/blog/nvidia-hopper-architecture...

While Tesla is claiming less than 400 Tensor TFLOPS on the D1.

So yeah, NVidia's claims for the GH100 / Hopper GPU are roughly 2.5-3x faster than the D1. Which is no surprise: when your transistors are substantially smaller than the competition's, you can pack more of them into the same area and easily pull ahead on an embarrassingly parallel problem.

--------

Note that the A100, released in 2020, offers 312 TFLOPS of 16-bit Tensor matrix-multiplication throughput. Meaning the D1 is barely competitive against the two-year-old NVidia A100, let alone the next-generation Hopper.
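
Putting the vendor-claimed peaks side by side (these are marketing numbers, and the D1 figure below is Tesla's publicly stated ~362 TFLOPS BF16, so treat the ratios as rough at best):

    # Vendor-claimed peak 16-bit tensor throughput (dense); marketing figures.
    claimed_tflops = {
        "NVIDIA A100 (2020)":   312.0,
        "Tesla D1":             362.0,   # the "less than 400" figure above
        "NVIDIA H100 (Hopper)": 1000.0,  # approximate, per NVIDIA's Hopper blog
    }

    d1 = claimed_tflops["Tesla D1"]
    for name, tflops in claimed_tflops.items():
        print(f"{name:22s} {tflops:6.0f} TFLOPS  ({tflops / d1:.2f}x the D1)")

    # A100 ~0.86x, H100 ~2.76x -- a real gap to Hopper, but roughly 3x,
    # not an order of magnitude.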

And note that NVidia's server GPUs (like the A100 or GH100) already come in prepackaged supercomputer formats with extremely high-speed links between them. See NVidia's DGX line: https://www.nvidia.com/en-us/data-center/dgx-station-a100/

--------

You can't beat physics. Smaller transistors use less power, while more transistors offer more parallelism. A process-node advantage is huge.


The transistors aren't always literally smaller with each process node, and there are negative quantum effects that also kick in as you shrink.

But anyway, you dodged my entire point about the cost-to-performance ratio by looking only at performance. If NVidia insists on pocketing all the performance advantages of the process shrink as profit, then it still makes sense for Tesla to do this.


> But anyway, you dodged my entire point about cost-to-performance ratio by looking just at performance

"Dodged" ?? Unless you have the exact numbers for the amount of dollars Tesla has spent on mask-costs, chip engineers, and software developers, we're all taking a guess on that.

But we all know that such an engineering effort is a $100-million+ project, maybe even a $1-billion+ one.

All of our estimates will vary, and the people who work inside of Tesla would never tell us this number. But even in the middle hundreds-of-millions, it seems rather difficult for Tesla to recoup costs.

-------

Especially compared to, say, an AMD MI250X or Google's TPUs. (It's not like NVidia is the only option; they're just the most complete and braindead-easy option. AMD's MI250X has matrix cores that are competitive with the A100, albeit without the software polish of the CUDA ecosystem.)

------

Ex: A quickie search: https://news.ycombinator.com/item?id=26959053

> For 7 nm, it costs more than $271 million for design alone (EDA, verification, synthesis, layout, sign-off, etc) [1], and that’s a cheaper one. Industry reports say $650-810 million for a big 5 nm chip.

How many chips does Tesla need to make before this is economically viable? And for what? They seemingly aren't even outperforming the A100 or MI250x, let alone the next-generation GH100.
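
As a very rough break-even sketch (every input below is either a figure cited in this thread or a plain guess, clearly not inside knowledge):

    # Back-of-the-envelope break-even; all inputs are assumptions, not real data.
    design_cost = 300e6      # USD; roughly the 7nm design-cost figure quoted above
    gpu_price = 15_000.0     # USD; rough street-price guess for an A100-class card
    d1_unit_cost = 5_000.0   # USD; pure guess at Tesla's per-D1 manufactured cost

    # Chips needed before the avoided GPU spend covers the up-front design cost,
    # generously assuming one D1 displaces one A100-class GPU.
    break_even_units = design_cost / (gpu_price - d1_unit_cost)
    print(f"~{break_even_units:,.0f} chips to break even")  # ~30,000 chips

    # And that ignores the compiler and software teams whose cost NVIDIA's
    # price already amortizes across many customers.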

What's your estimate of the cost of an all-custom 7nm chip with no compiler infrastructure, no existing software support, and everything built from the ground up with no previous ecosystem?



