With no virtual memory, no caches, and interface processors instead of direct access to external DRAM, this thing must be a programming nightmare?
Having tons of small CPUs with fast local SRAM is of course not a new idea. Back in 1998, I talked to a startup that believed it could replace standard cell ASIC design with tiny CPUs that had custom instruction sets. (I didn't believe it could: it's extremely area inefficient and way too power hungry for that kind of application. The startup went nowhere.) And the IBM Cell is indeed an obvious inspiration.
But AFAIK, the IBM Cell was hard to program. I've seen PS3 presentations where it was primarily used as a software defined GPU, because it was just too difficult to use as a general purpose processor.
Now NOT being a general purpose processor is the whole point of Dojo, so maybe they can make it work. But from my limited experience with CUDA, virtual memory and direct access to DRAM are a major plus, even if the high performance compute routines make intensive use of shared memory. The fact that an interface processor is involved (how?) in managing your local SRAM must make synchronization much more complex than with CUDA, where everything is handled by the same SM that manages the calculations: your warp issues a load, it waits on a barrier, the calculations happen, sometimes in a side unit in which case you again wait on a barrier, you offload the data and wait on a barrier. And while one warp waits on a barrier, another warp can take over. It's pretty straightforward.
The Dojo model suggests that "wait on a barrier" becomes "wait on the interface processor".
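For contrast, here is a minimal sketch of that CUDA pattern, written in Python with Numba's CUDA dialect (assuming Numba and a CUDA-capable GPU are available; the kernel is illustrative and has nothing to do with Dojo):

    import numpy as np
    from numba import cuda, float32

    TILE = 256  # threads per block; elements staged per block

    @cuda.jit
    def scale_kernel(x, y, factor):
        tile = cuda.shared.array(TILE, dtype=float32)
        i = cuda.grid(1)
        t = cuda.threadIdx.x
        if i < x.size:
            tile[t] = x[i]              # the warp issues its loads...
        cuda.syncthreads()              # ...and waits on a barrier
        if i < x.size:
            tile[t] = tile[t] * factor  # the calculation happens
        cuda.syncthreads()              # barrier again before offloading
        if i < x.size:
            y[i] = tile[t]              # offload the data

    x = np.arange(1024, dtype=np.float32)
    y = np.zeros_like(x)
    blocks = (x.size + TILE - 1) // TILE
    scale_kernel[blocks, TILE](x, y, 2.0)
    print(y[:4])  # [0. 2. 4. 6.]

While one warp sits at a barrier, the SM schedules other warps, which is exactly the latency hiding described above; the open question is what the equivalent dance looks like when an interface processor owns the SRAM.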
If it only ever runs one program, and that program is an implementation of vanilla Transformers, that might be all it needs to be useful. Sufficiently large Transformers can do an incredible variety of tasks. If someone invents something better than vanilla Transformers, then they can write a second program for that.
Investing in a branch predictor when the intended workload doesn't seem at all scalar is also a confusing choice to me. And the 362 F16 TFLOPS sounds super impressive, except the memory bandwidth is, I think, 800 GB/s (or is it 5 times that? Or effectively less than that if data has to be passed along multiple hops? I'm a bit confused), which means having to do 1000 ops (or 200? or more?) on each 16-bit value loaded in. Maybe you could do that, but it feels like you'd probably end up bandwidth bound most of the time.
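Back-of-the-envelope version of that arithmetic (all numbers are the ones quoted above, and the bandwidth figure is the uncertain one):

    tflops = 362e12                     # claimed F16 FLOPs per second
    bandwidth = 800e9                   # bytes per second, if that's right
    values_per_s = bandwidth / 2        # each 16-bit value is 2 bytes
    print(tflops / values_per_s)        # ~905 ops per loaded value
    print(tflops / (5 * values_per_s))  # ~181 ops at 5x the bandwidth

So the "1000 ops (or 200?)" range above is about right, depending on which bandwidth figure applies.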
My understanding is they occasionally load weights into SRAM and then pump training data in at the sides of the die, with multiple cores operating on a wavefront of data. So the cores don't compete for host memory bandwidth, because the same data flows (transformed) through multiple cores.
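A toy model of that flow (the shapes and the four-stage pipeline are made up for illustration; this is not Dojo's real topology):

    import numpy as np

    rng = np.random.default_rng(0)
    # Weights loaded once into each "core's" local SRAM
    weights = [rng.standard_normal((64, 64)) for _ in range(4)]

    def pump(batch):
        x = batch
        for w in weights:               # the same data flows, transformed,
            x = np.maximum(x @ w, 0.0)  # through core after core, with no
        return x                        # trip back to host memory between

    out = pump(rng.standard_normal((8, 64)))
    print(out.shape)  # (8, 64)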
You are right that this won't work well with any language that assumes a "normal" processor. But a small language that is written for it could be fine.
From my understanding, the Cell was meant to be the GPU for the PS3, but Sony ran into the same issues and could not produce a reasonably performing SDK for it within the time limits (set by the MS Xbox 360), so they added in an Nvidia RSX GPU.
> could not produce a reasonably performing SDK for it within the time limits
It feels like "within the time limits" has always been the problem of difficult to program for software-dependant architectures: time vs competitors.
E.g. in the time it takes to write an intelligent compiler (IA-64), your better-resourced competitor (better resourced because they're getting revenue from the current market) has surpassed your performance via brute evolutionary force.
There are use cases out there (early supercomputing, NVIDIA) where radical, development-heavy architectures have been successful, but they generally either lacked a competitor (the former) or iterate ruthlessly themselves (the latter).
Sounds to me like a programming dream. The usual way of things these days is 'don't waste your employer's time trying to optimize; everything that can profitably be done has already been done by other people; you just have to accept that particular part of your skill set is useless'. Dojo would let you actually use a lot more of your skills.
It is incredibly odd that Tesla decided to invest in this level of engineering; even perfect hardware is, like, 1/5th the battle. Now you get to stand up the entire software stack, make it performant, and hope you can compete with whatever Nvidia has in the pipeline.
Let me play the devil's advocate: when you have a $1T company, investing a couple million into something like this signals to investors that you have big plans for the future and have a leg up on the competition in the self-driving space.
Even if that increases the stock price by 1%, this investment pays for itself a hundred times over.
A hundred million for the chip alone, a few hundred million for software (compiler, assembler, router, protocols, communications, load balancing, server, supercomputer utilization, power management).
The $100-million chip might very well be the easy part of this project...
As a software person it baffles me how we tend to overestimate our value compared to hardware. The chip design field is really demanding, especially with deep submicron processes and a rigorous testing & verification culture that blows almost everything software organizations do out of the water.
I doubt developing a compiler for this chip will be as costly as your estimate, especially given mature open-source compiler frameworks like LLVM (already adapted to similar TPU-like architectures) and upcoming ML-specific compiler frameworks like PyTorch Glow and Apache TVM. And especially given that in our industry there are people with advanced degrees who are willing to jump through additional hoops and even accept pay cuts, just to work on a cool compiler.
I respect the hardware for what it is and what it could enable.
I think they developed this during the years when all the Nvidia hardware was being gobbled up by crypto miners. The resulting pricing probably did not help either.
Tesla is big enough that they can afford to experiment a little. This sounds like an experiment that actually worked. Once they had it working, doing more of it and iterating on it is just business as usual for them.
The self-driving AI they use this chip for appears to be less than impressive from my perspective. Its mistakes seem to come up in the news quite frequently.
Are you aware of a better implementation, though? It might not be perfect, but it's still miles ahead of competing systems. Yes, it makes mistakes. But that's a rather poor argument for dismissing the whole strategy and technology.
From my point of view, Tesla is converging on good enough more rapidly than anyone else in the industry, courtesy of their AI strategy and technology, including the in-house developed chips.
This is a big problem, though. Any ADAS is a safety-critical system, and while "good enough" can help push volumes, it is extremely bad from a safety perspective. If you really want to read up on this, u/adamjosephcook/ has some awesome writing that goes into far deeper detail than I could on this particular issue.
Tesla has a path to economies of scale: they already announced that if Dojo works as expected they'll make it available to others as an AWS-style service.
Which is brilliant: they might end up making money on this.
AI is clearly here to stay. The demand for AI training will clearly explode in the future.
Running training in-house is not easy or cheap. You don't just plug in 1000 NVIDIA GPUs. You need a massive up-front payment for the GPUs, and you're basically running your own extremely energy-hungry datacenter.
Tesla might build and operate massive datacenters. They'll use as much as they need for internal needs and sell the remaining capacity to others.
This might take 5 years but the path to do it is clear.
I don’t see how they’re going to commercialise this as a cloud compute service.
For one, they’ve built a chip that operates in a fundamentally different way to other chips. So any other company that wanted to use it would have to invest a considerable amount of resources in building up the institutional knowledge to use it effectively.
Additionally, the lack of virtual memory and multi-tasking support renders it pretty much impossible to divide up compute between multiple customers. So, commercialising this would require customers renting out the whole unit, which is contrary to how cloud computing usually works.
Are there companies out there that have the capital and use cases necessary to fit into Dojo Cloud? Maybe, though not one I've worked for. Would they trust the stable genius currently heading up Tesla enough to make such an investment? Perhaps, but I wouldn't; then again, what do I know?
> Additionally, the lack of virtual memory and multi-tasking support renders it pretty much impossible to divide up compute between multiple customers. So, commercialising this would require customers renting out the whole unit, which is contrary to how cloud computing usually works.
Only if you want to subdivide the compute on each Dojo chip. You can still provide multi-tenant support by allocating entire Dojo chips to a single customer at a time. Even traditional time-division multi-tasking is possible, as long as you're happy to accept multi-second time slices. Then the overhead of clearing an entire Dojo chip (or batch of chips) and setting up a new application isn't too high.
If you're doing AI workloads, then none of the above is an issue. Training a large net takes days to weeks of continuous, single-task computation. So selling Dojo access in whole 1-hour blocks is a perfectly reasonable thing to do.
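Rough numbers on why whole-hour blocks make the switch overhead negligible (the SRAM size and reload bandwidth are illustrative guesses, not Tesla figures):

    sram_bytes = 10e9   # guess: ~10 GB of SRAM across a batch of chips
    reload_bw = 50e9    # guess: 50 GB/s effective drain/reload rate
    switch_s = 2 * sram_bytes / reload_bw  # drain old tenant, load new
    print(switch_s)            # ~0.4 s per switch
    print(switch_s / 3600)     # ~0.01% of a 1-hour block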
Reliability is an important factor here, and I don't mean technology.
Things don't look so good for everything that has to do with Musk. It's one thing today, another tomorrow.
>Things don't look so good for everything that has to do with Musk. It's one thing today, another tomorrow.
Such as? Except for FSD, his record is unmatched AFAIK when you take into account the novelty / complexity / difficulty.
One example, and certainly his main achievement: he said back in early 2014 that Tesla would sell and produce half a million cars by 2020, and they hit that number with 93.6% precision. https://youtu.be/BwUuo6e10AM?t=156
Some of Musk's stuff is great - other stuff isn't.
SpaceX? Great. Starlink? Sounds neat. Tesla? Pioneered electric cars with respectable performance and range.
But on the other hand, where's the hyperloop? Where's the affordable tunnelling? Where's the $35k Tesla - not available for order on the website, that's for sure. Where's the miniature submarine for rescuing children trapped in caves? Why has my buddy in Europe been waiting over a year for his Powerwall to be delivered? Why are these Norwegian Tesla owners on hunger strike? Where's the full self driving, with taxi service? Why on earth would anyone want to buy Twitter?
Makes it very difficult to know which of Musk's statements are just spitballing, which are unrealistic timescale guesses and which can be relied on.
Getting any serious project architecturally 'locked in' to a special type of CPU you can only get from Tesla would be a bold move.
It's simple: SpaceX, Tesla, and Starlink are evolutions of existing technology.
FSD, Hyperloop, and such would be revolutions straight out of a sci-fi movie.
They all fail because Musk would like to have those things, but in reality they are much more complicated than he says.
Yes, you still write checks to pay online, and rocket boosters to this day are single-use $100M pieces of hardware that we throw away into the ocean after each use.
Seriously. Musk is shady and weird, more so in the last couple of years, but come on.
It's about the perception of Musk.
He built some successful companies, but he is not the Tony Stark people tend to see him as.
He is a salesman, not a genius inventor.
Don't expect FSD in the near future and don't expect a Mars colony.
FSD, especially the way he was describing it, was a blunder. The Mars city is a pipe dream... but I absolutely understand why it makes people follow him: he's the only one who set out to build a private space company with the purpose of actually getting to Mars, with the side effect of completely uprooting the space launch industry. The achievements are undeniable, but it's the vision that makes the perception still be so surprisingly good. There's literally nobody else who says things like that and has the means to even try.
I assume you never actually built any cloud infrastructure yourself. Plus Tesla (aka Elon), well, says a lot of stuff, not all of it necessarily correct.
An internal research product is super far from any actual production usage. Especially if you go against established paradigms, which requires an enormous amount of effort (more than developing the silicon) to build tooling around, so people can design, program, debug, and monitor it.
But that's internal usage. Cloud is a totally different ballgame. You have to deal with thousands more requirements (and you cannot generally tell a customer to do something else instead, as you can with internal teams). And customers that have operating procedures totally different from yours, zero access to your internal knowledge, and infinitely less tolerance for BS answers (as a paying customer, not someone in the same boat).
Building a cloud is extremely hard, and there's a reason why Google is still losing money on it.
Plus, let's even say that your 5-year estimate is correct, Dojo is amazing and the future of tech, and they may have a viable product by then. Do you think that Nvidia won't advance their AI offering by then? Google TPU will stop being developed? Or will Tesla continue investing to churn out a new generation of Dojo every year?
> You have to deal with thousands more requirements (and you cannot generally tell a customer to do something else instead, as you can with internal teams).
You can. AWS started with S3 when everyone was using databases. As long as it's cheaper than its competition, a single use-case (you won't serve a website on these) has a market.
> You can. AWS started with S3 when everyone was using databases.
AWS started when there was no competitor.
Google started with a ton of world-class expertise when AWS was already up and running, while already operating a colossal network of server farms with special-purpose hardware (none of which Tesla has), and after all these years it has barely got a 10% market share.
What they want is a training engine that is cheaper than whatever AWS or Google (or anyone else) can offer. If I can point my PyTorch to it instead of an AWS GPU for less money, why not?
> What they want is a training engine that is cheaper than whatever AWS or Google (or anyone else) can offer.
Bold assumption, considering Tesla's hardware does not exist, the market is limited, and Google already has years of experience providing machine learning services with special-purpose hardware.
Their hardware. They have, at best, 1 supercomputer (though it's not actually clear to me whether they have more than some Dojo prototypes). That does not a cloud make.
Ah yes, I see that now. But assuming they make the computer, they could also lease it to one or more cloud providers as a service. They don't necessarily have to build the whole thing.
> Tesla has a path to economies of scale: they already announced that if Dojo works as expected they'll make it available to others as an AWS-style service.
"If we manage to put together a working processor, supporting hardware, OS, and possibly ad-hoc programming language, our next step is to also develop a bunch of web services to provide cloud hosting services."
Not very credible. As if the key to offering competing cloud hosting services were developing the whole hardware and software stack.
And network infrastructure, isolation between customers, scheduling of hardware allocation, etc. etc. Running one's own data center is quite different from inviting all sorts of third parties in.
> Yeah, but it’s not like this is rocket science or anything.
The key difference between this goal and SpaceX is that Elon Musk bought a private space company that already had the know-how and the market presence in a market with virtually no competitor.
In this case, Tesla is posting wild claims about vertically integrating the whole tech stack, stopping barely short of mining the semiconductor raw materials, with what goal? Competing with the likes of AWS, Google, and Microsoft in a very niche market?
Digging holes in the ground is hardly rocket science as well.
Ever since Elon became the world's richest man, people like this have shown up. I don't know if they have been misinformed or if they just want to say negative things about billionaires. But the early history of SpaceX is very well documented in the book Liftoff, if anyone wants to know the truth.
You say that Tesla might do this for others, AWS style.
Then you talk about the upfront in-house costs of setting up for GPU ops, but ignore that if an AWS-style model works for you, well, AWS is already capable of giving it to you with GPUs.
They aren't going after economies of scale. If you look closely at their design choices, they are building a pure scale-out vector machine unlike anything else currently on the market. I'm guessing they expect it to be head & shoulders ahead for their in-house workload.
During the first AI Day, Tesla already said that they have plans and ideas for 10x improvements for Dojo v2. And I'm sure one of the improvements is to use a 5nm or 3nm process.
They've already been working on this for almost 7 years. This is not some side project but a serious operation. They have to ship something sometime, and they decided to ship this now.
I'm confident that they know about 5nm process and chose 7nm for good reasons.
>And I'm sure one of the improvements is to use a 5nm or 3nm process.
Implementing a process shrink is not just scaling the masks by the appropriate percentage. It often means a completely different optical train at a different wavelength, a different pellicle, a change in the refractive index of the immersion fluid, and different multiple patterning. It takes months (years?) of work.
For many applications it's worth it, but it's not at all a slam dunk. The vast majority of ASICs are designed for a particular node and never upgraded. It's the kind of crazy long-term speculative capital investment TSLA might have indulged in when the stock was at $414 a share, but it's nowhere remotely near that today.
They're already engaging in layoffs, today. Why spend the money?
> vast majority of ASICs are designed for a particular node and never upgraded
They are, if the new node is cheap enough to justify the investment in shrinking the design. In most ASICs, being faster won’t make people rip out their embedded electronics for new models.
They might not work fast enough for your satisfaction, but can you point to any other car company designing chips like Dojo?
Hell, can you point to any company at all that can come out of the gate with a chip design competitive with the best design of NVIDIA, a company that makes nothing but chips?
After 7 years of work they are clearly confident that this will work, or else they would scrap the project instead of giving talks at conferences.
> After 7 years of work they are clearly confident that this will work, or else they would scrap the project instead of giving talks at conferences.
They are clearly confident that FSD will work, to the point of promising it "this year"...
Coincidentally, they've been promising "this year" for the last 7 years. And FSD still fails in spectacular/hilarious/horrifying ways, and isn't _remotely_ close.
> Hell, can you point to any company at all that can come out of the gate with a chip design competitive with the best design of NVIDIA, a company that makes nothing but chips?
The point was that very few companies can pull off a clean sheet chip design from nothing. Google has done it, but Google is an elite company. So saying Tesla isn't the only company that has done it because Google has done it only shows that Tesla is doing very well.
But it seems clear to me that the point isn't to compete with these other companies, but to vertically integrate a critical component of their systems. They can discard a lot of legacy concerns and focus on raw power.
The thing is that if Tesla just uses Nvidia, like everyone else, then Tesla's stuff is only differentiated by software. Everyone uses Nvidia and it becomes hard to set themselves apart. But if Nvidia is chasing a broad customer base and has all this extra stuff to think about, then Tesla could potentially be more nimble, and produce a holistic system design that solves exactly the problems they have with no cruft. This could result in a more advanced hardware platform, so their robotics products differentiate themselves with both hardware and software.
I am also happy because I am a robotics engineer, and in my opinion we need hardware that is 1000x more powerful than today to do what we really need. Nvidia wants to move at a certain pace, but if Tesla is trying to beat them on raw power, then Nvidia will play catch up and accelerate development of more powerful systems. This is great for everyone.
Internally Tesla can use whatever they want, and it might even make sense in the long term. But if they want to sell their chips for general model training, they had better be much better than the future Nvidia cards they will be competing with when they start selling. Like twice as fast at half the price and half the power - with perfect framework support. That last part is extremely important: if I see any mentions of anyone who changed "cuda" to "dojo" in their PyTorch code and ran into any issues, I'm not going to touch it with a ten foot pole. Just like I avoid TPUs because 2 years ago I heard people were having issues with them. And I'm the guy who has decided which hardware to spend millions of dollars on at several companies.
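To make that concrete, this is the entire change such a customer wants to make (runnable today with "cuda"; a "dojo" device string is hypothetical, no such PyTorch backend exists):

    import torch
    import torch.nn as nn

    # Ideally this string is the only thing that changes, "cuda" -> "dojo"
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = nn.Linear(16, 1).to(device)  # stand-in for a real network
    x = torch.randn(32, 16, device=device)
    loss = model(x).pow(2).mean()
    loss.backward()                      # everything else must just work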
Yeah I just don't think that is really the main part of their strategy. Maybe they would sell chips or boards or servers if they are already making them, but I think it is mostly about internal use so their end products have a competitive advantage as complete robots. Robotics needs HUGE advances in compute and with their own chips Tesla won't have to be dependent on a third party for their success.
All the stuff you talked about needing perfect support before you will touch it is something that takes a lot of work for Nvidia and others, slowing them down. Tesla can ignore all that and focus on performance for their specific application, and I think this gives them the freedom to lead the pack on raw performance for their application.
I'm not sure if you've watched the presentations on how their self-driving system is trained, but basically they have a million vehicles out in the real world with camera systems on them, they have a massive server farm that is collecting new data from vehicles all the time, and they train their neural net by running millions of scenarios in simulation and against real-world data collected from all those vehicles. And they have to re-train the system all the time with new data, and run it against old data to check for regressions. So they have this huge compute requirement to keep improving their system. They think that functional self-driving will revolutionize their business (setting aside the valid criticism, this is what Tesla thinks), so they need to be able to handle an ever-growing compute load that they have to be running constantly. So raw compute power is critical to the success of their plan. It may not be enough, but they certainly can't succeed without it. Their needs are very specific, though, and it sounds like they've found an architecture which is simpler than most Nvidia chips but has loads of power. So it sounds like they are making a good decision based on their specific needs. It is a huge risky bet, but then that's how Musk likes to do things.
This is surprising to me. Robotics clearly needs huge advances in algorithms (RL or something better). Do you mean you need faster hardware to discover those algorithms?
Oh we definitely need better algorithms too! But I’ve imagined that we’d want something like GPT-3 but for sensor experiences. So the way GPT-3 can ingest lots of text and predict a reasonable paragraph as a continuation of a prompt, we could have a system ingest lots of simultaneous sensor data in the form of LiDAR, cameras, IMU data, and touch sensor/skin sensor data, and then given the current state of those sensors it could predict what in the world is going to occur next, and use this as input to an RL system to make a choice of action. This seems to me to be both a useful system and one that could require a lot of compute. And that’s still probably not a complete AI system so there’s probably many many pieces required.
Looking at it another way, the human brain has wayyy more compute power than any of our current portable computers (robotics really needs to do most of its compute at the edge, on robot). Every robot I've ever worked with has been maxing out its CPU and GPU and still needed more compute.
When you look at Tesla’s hydra network for their self driving system you get an idea for what is needed in robotics, but just as we saw GPT-3 improve with network size, I suspect a lot of the base systems involved in a hydra net could improve with network size. And I suspect that there’s still more advanced stuff required when you move beyond a simple task like driving a car to a more advanced general purpose AI system. For example the Tesla self driving system doesn’t need any language centers, and we know GPT-3 and similar networks are large.
> robotics really needs to do most of its compute at the edge, on robot
Why can't you hook your robot up to a GPU cluster? Tesla already has 7k+ A100 GPUs; the question is: do they have algorithms which would clearly work better if only we could run them on 70k or 700k GPUs?
I mean, what you say makes sense, but have people actually tried all that and realized they need bigger GPU clusters? Is there a GPT-3 equivalent model in robotics which would probably work better if we scaled it up? If not, perhaps they should start with a small proof of concept before asking for thousands of GPUs. Original Transformer --> GPT1 --> GPT2 --> GPT3.
The problem with this is that autonomous robots need to function in the real world even without internet connectivity. For example I am designing a solar powered farming robot. We do have Wi-Fi and starlink here but Wi-Fi and internet go down. In general we think it makes most sense for it to be able to operate completely without a continuous internet connection. And take self driving cars where millisecond response times matter - those can’t rely on a continuous connection to function or someone will get killed. But as systems get more advanced it is my opinion that edge compute will be an important function. And edge compute can’t handle the state of the art networks that people are building for even text processing, let alone complete autonomous AGI.
And no, the models I’m talking about don’t exist yet, I am speculating on what I think will be required based on how I see things. But I’m not asking for thousands of GPUs. I’m just speculating in internet comments about what robotics will need for a fully embodied AGI to function. I believe more edge compute power is needed. Perhaps 1000x more power than the Nvidia Orin edge compute for robotics today.
Sure, I understand the need for the robot autonomy. The problem, as I see it, is that current robots (autonomous or not) suck. They suck primarily because we don't have good algorithms. Without such algorithms or models, it does not matter whether a robot is autonomous or not, or how much processing power it has. Only after we have the algorithms, the other issues you mentioned might become relevant. Finally, it's not clear yet if we need more compute than what's currently available (e.g. at Tesla) to develop the algorithms.
p.s. I don't think AGI is needed for robotics. I suspect that even an ant's level of intelligence (together with existing ML algorithms) might be enough for robots to do almost all the tasks we might want them to do today. It's ironic that robots today can see, read, speak, understand spoken commands, drive a car (more or less), play simple video games, and do a lot of other cool stuff, but still can't walk around an apartment without falling or getting stuck, do laundry, wash dishes, or cook food.
Also, even if the first version didn't make financial sense, building the team and the infrastructure required to make this work will likely translate into a next generation chip.
Part of Tesla's vertical integration mantra is that you are building internal capacity.
Remember AI? AGI has seemed just around the corner since the '60s. There we learned that the things that seemed complicated (such as beating a champion at chess) were simple, and the simple things are still almost mind-blowingly hard.
It seems to me like they took a big bet and it didn't pay off. They may not have felt like they could rely on Nvidia to deliver the promised Tensor Core performance.
It's a bit premature to call this a failure given they are barely at a stage of making first chips and validating them.
You don't seem to understand the timelines of designing chips like that from scratch.
Dojo has been in the works for almost 7 years. Tesla will continue to work on this for the next 20 years. There will be Dojo v2, Dojo v3, etc., just like at SpaceX there was Falcon 1, Falcon 9, Falcon Heavy.
This still might end up a failure, but they clearly feel confident enough to talk about Dojo publicly, which wasn't the case for the first 5 years of its development.
Thanks for explaining to me what I don’t know. Clearly you’re a Tesla fanboy.
There is no doubt that it is an amazing piece of tech, but I'm not confident Tesla will be able to pull off beating NVidia. Especially when compared to NVidia's Tensor Cores and economies of scale. I don't like their whole approach to self-driving ML. I know a lot of people disagree with me, so I would rather not get into it.
I think a lot of the talk of future performance is posturing in order to get NVidia chips at cheaper prices.
I'm sure that it's at least partially for them to have a better negotiating position with NVIDIA. But the reality is that Tesla actually has some very good expertise in chip design. Jim Keller worked there for several years and, along with Pete Bannon, designed their Autopilot HW3 computer, which is in 1+ million cars right now. At the time HW3 was released, it outperformed what NVIDIA had to offer. That said, it's not likely they'll be able to beat NVIDIA, but they may be able to beat them for their hyper domain-specific use cases. Additionally, NVIDIA chips are difficult to get and extremely expensive. Even if Tesla can get something that performs 80-90% as well as NVIDIA's but at significantly lower cost, it may still be worth it.
I know these things. I think some of the Dojo architecture was a reaction to their FSD chips being over-optimized for ResNet-style models. They're targeting extremely large models, which is a new frontier for ML and, in my view, not the panacea they hoped it would be.
I think with better ML they wouldn’t need so many chips, be it NVidia or Dojo.
Perhaps they do this so they can carry this tech over into their cars. If you are making millions of cars per year, with tens or hundreds of chips per car, it is easy to justify doing this on your own instead of outsourcing it.
The main concern mentioned in the comments seems to be that programmers will find it hard to work with this hardware.
But is that really the intention? Wouldn't it be enough if there is one program written for it, one that takes a bunch of inputs and a bunch of outputs and then creates a set of NN weights that perform well?
So the AWS style offer would not be "rent our hardware and program on it" but "rent our hardware/software stack, throw your input/output pairs at it and get back a set (billions) of NN weights that perform well".
This is meant to run in a datacenter, not a car, right? There's some value to custom ASICs if you can get it - Google seems happy with TPUs, and I assume those aren't on the latest Nvidia-competitive process.
If they are spending this much money on the data center chip, they probably will port the assembly language over to an ASIC inference chip for their cars.
I honestly can't see how they make (or save) enough money with just the data center chip here.
Maybe if the inference chip is on a cheaper 12nm process or something... Maybe?
NVidia chips can do this job. (EDIT: The job of training a neural network I should say. I don't think that's enough to get self driving, but Dojo ain't anything but a faster NN training chip anyway)
The question is if they are saving money compared to buying an off the shelf DGX system. I really doubt it.
Presumably Tesla, at the time they decided to pursue this option, thought they'd potentially have a competitive advantage with in-house designs.
It's entirely possible that's just their hubris showing, time will tell if this was the right decision or not. After seeing the NVidia presentation announcing their latest datacenter-scale AI hardware, I'd be surprised if Tesla's in-house design is more than just a massive cost center vs. buying something from NVidia.
But sometimes you do things that appear irrational in part to keep your talented engineers from seeking work elsewhere. Just look at NASA's SLS: how much of that is a jobs program in part to prevent hordes of talented folks from building rockets for competing nations?
But once Tesla is designing chips for their in-vehicle inference needs, they need to keep those people interested and the large-scale training side is arguably more interesting to DIY.
The biggest product of Tesla is its stock; there are no ifs and buts about it. This must change soon, since that mad money is barely enough to eke out a profitable quarter.
> After seeing the NVidia presentation announcing their latest datacenter-scale AI hardware
Did we watch the same presentation? NVidia knocked it out of the park.
Thread block cluster is obviously amazing. Routing between SMs / compute units will be far faster with this level of software abstraction, and it will be exceptionally easy to write code for. NVidia always impresses me with their advanced software techniques and clear understanding of the fundamental model of SIMD compute.
------
Ignoring those software details... the important thing is that GH100 will be on TSMC 4nm, which is 1.5 nodes ahead of the 7nm Dojo. A significant process advantage, representing 60+% less power usage and 300% the transistor density of the older 7nm tech.
Even if NVidia's GPU had issues, there's something to be said about just being a raw process node (or 1.5 nodes) ahead.
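Illustrative compounding behind those figures, assuming roughly 2x density and 0.6x power per full node step (generic scaling rules of thumb, not TSMC-published numbers):

    density_per_node = 2.0
    power_per_node = 0.6
    steps = 1.5                         # 7nm -> 4nm, counted as 1.5 nodes
    print(density_per_node ** steps)    # ~2.8x the transistor density
    print(1 - power_per_node ** steps)  # ~54% less power for equal work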
> Did we watch the same presentation? NVidia knocked it out of the park.
Perhaps I worded it poorly, I agree with you.
My meaning was vs. NVidia's latest tech it seems like Tesla's in-house datacenter NN could be nothing more than a huge cost center without even offering an advantage over what NVidia could sell them.
But like I said, if you have a staff of folks capable of building such things you have to keep them satisfied with practicing their craft or they leave.
> NASA's SLS: how much of that is a jobs program in part to prevent hordes of talented folks from building rockets for competing nations
Zero. Because NASA has no problem if those engineers go to work for another nation, as long as it isn't Russia or North Korea and co. And that wouldn't happen anyway.
Those people would likely work at one of the huge number of space startups, or just go to the usual ULA, Blue Origin, SpaceX, and so on.
You make the totally wrong assumption that SLS has anything to do with rational thought. It really doesn't.
Tesla is very vertically integrated. This is just how they operate. You can make the argument that they shouldn't be so vertically integrated, but it has worked for them thus far.
So... no? They are clearly leveraging the NVidia ecosystem right now. Now maybe they have ambitions to get off of NVidia, but they're doing so in a rather asinine fashion. There's probably half-a-dozen groups trying to make a faster systolic matrix multiplication unit for the deep learning crowd. Tesla probably should have either worked with those groups and/or bought one out, for example.
Sure. You only have to design a chip, design an assembly language, design a compiler, design the kernels, design a parallelization framework, design a server system to load-balance tasks, and then rework the pytorch/tensorflow code to use your new faster custom primitives that no one else has.
-----
Except step 1: "design a chip", is already something on the order of hundreds-of-megabucks of investment
AMD pays for this flexibility by spending about 44% of “Zeppelin” die area on logic other than cores and cache. Dojo only spends 28.9% of die area on things other than SRAM and cores.
Doesn't sound worth it to me. You're basically spending a ton of money to fab a chip for what amounts to something like a 30% training performance boost (see the arithmetic below). Just buy more chips.
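For what it's worth, one way to get to that ~30% figure from the area numbers quoted above:

    zeppelin_compute = 1 - 0.44  # "Zeppelin" fraction on cores + cache
    dojo_compute = 1 - 0.289     # Dojo fraction on cores + SRAM
    print(dojo_compute / zeppelin_compute)  # ~1.27, i.e. roughly +30%

That assumes performance scales with compute area fraction, which is generous to both designs.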
Neat :) Love seeing other people building new and weird architectures. I wonder what their actual use case for SMT was; I don't really buy the "one thread for vector, one thread for pulling data into SRAM". I'm also a bit surprised to see they didn't go for a VLIW ISA; it always seemed like these tightly integrated data processing chips were the ideal candidates, since binary compatibility isn't an issue when you're building your own HW.
In terms of absolute compute performance per chip, perf per watt, and perf per die area, it looks like Dojo matches or surpasses the best GPUs of today: «Tesla claims a die with 354 Dojo cores can hit 362 BF16 TFLOPS at 2 GHz»
For comparison, the fastest single-chip GPU today is the AMD MI250X which has 220 "cores" (compute units) totaling 383 BF16 TFLOPS at 1.7 GHz, and that's a monster 560 watt chip.
The Dojo chip is likely under this 560 W TDP, so more efficient. And Tesla provides roughly the same compute performance, but the chip is arrayed in 61% more cores, meaning it is far more suitable for handling branchy code. Also, Tesla claims the die measures only 645 mm², compared to 1540 mm² for AMD. So the wafer fabrication cost is roughly half(!) of AMD's!
If Tesla has truly managed to build that, I'm impressed.
Edit: I missed that Tesla claims "less than 600 watt" per chip. So we know it's comparable to or less than AMD (560 watt).
Edit 2: 25 dies are packed on a single system-on-wafer. That's 15 kW on a disc 30 cm (12 in) in diameter. Sheesh! That must require an ungodly liquid cooling system!
Edit 3: there is more info, including rendering of the host interface card at:
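Working the quoted numbers through (Dojo's "less than 600 watt" taken at face value):

    dojo   = {"tflops": 362, "mm2": 645,  "watts": 600}
    mi250x = {"tflops": 383, "mm2": 1540, "watts": 560}

    for name, c in (("Dojo die", dojo), ("MI250X", mi250x)):
        print(name,
              round(c["tflops"] / c["mm2"], 2),    # TFLOPS per mm^2
              round(c["tflops"] / c["watts"], 2))  # TFLOPS per watt
    # Dojo die 0.56 0.6
    # MI250X 0.25 0.68
    print(25 * dojo["watts"] / 1000, "kW per wafer")  # 15.0 kW

Per area, Dojo is more than 2x ahead; per watt, at the face-value 600 W the two land in the same ballpark, and Dojo only pulls ahead if the real draw is meaningfully below 560 W, as the parent suggests.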
A 2x efficiency improvement from a GPU to a specialised ASIC is not particularly impressive. How much would you gain by removing the graphics-related stuff from a GPU (texture pipelines, vertex processing, etc.)? In addition, they lose existing programming models like OpenCL and the compiler progress that happened in the last 10 years, and have to roll their own. The amount of SW work needed to get the same ease of use out of this as GPUs must be a lot. Maybe they made it more CPU-like to make it easier to program?
FP16 AI workloads are very important to AMD and Nvidia (the AMD MI250X and Nvidia A100/H100 were really designed for this), and yet Tesla leapfrogged them with a more than 2x reduction in die area, and more features (e.g. out-of-order exec). This is what's impressive. AMD, Nvidia, and even Intel should have been leading this, but they weren't. Seems to be a classic innovator's dilemma.
TOPS isn't indicative of actual performance: https://semiengineering.com/lies-damn-lies-and-tops-watt/ Also, this IMG chip is vaporware, and they can quote an arbitrarily high TOPS/W by trading voltage & frequency against die area, thereby increasing cost beyond reason. I'm not surprised IMG shares no other metric.
The same way we like to talk about odd OSs, like Lisp Machines, Plan 9, Oberon, Smalltalk, I'd love to see one of these things used as a desktop CPU (since I never got to play with one of the desktoppable Xeon Phis), just to see the weird and interesting OSs and applications that would emerge from it.
I know it must be horrendously difficult to program for general purpose use, but that's the point - figuring out what CAN be done and what an OS for it would look like.
Discussions about how this will fail to be productized as a cloud service are premature. There are many, many ways to do SAAS. If the benefit is there people will find a way to use it.
In any case, the problem is first moving the data, then the compute side. If Tesla is really going to do it, they have someone working on that already. SAAS is not rocket science, and the issues are well-known.
Can anyone tell if this architecture makes sense to ship in their actual cars to replace whatever is doing the computation there now? That is the only way I can see this making any sense to develop.
They have already developed and deployed custom silicon for their cars.
There is probably some overlap in the IP between these chips, but Dojo is optimized for operating in a large data center cluster for training purposes, whereas their car chips are optimized for running pre-trained models.
They want to get federal funding for chip manufacturing on US soil?
That would be typical of Elon Musk to get some federal funding $$$$$
But Elon can make it work on par with, if not better than, nVidia, just because Tesla can hardcode and over-optimize for one single task (FSD AI), while nVidia will always be limited by keeping their chips generic enough for all-purpose NN training/inference and even gaming, plus backwards compatibility for all the historical stuff.
Tesla can work from a blank slate and reap performance benefits there.
OK, but why? Is this somehow supposed to make their flaky "self-driving" work? Or what?
Certainly you can build custom chips for deep learning. It's mostly a simple inner loop, replicated many times, after all. Ideal case for ASICs. But will this actually benefit Tesla as a car company?
There was no reason to waste money on this. You don't want to focus on two incredibly resource-intense and error-prone processes in the same company (Automotive, Semiconductor).
This is a hype project simply created to pump up the stock price.
It's so hilarious when people think everything Tesla does is some 4D chess about the stock price, when in reality Musk really doesn't care about the stock price at all. He literally thinks multiple years ahead, and doesn't change that strategy because of some short-term headwinds.
> two incredibly resource-intense and error-prone processes in the same company (Automotive, Semiconductor).
If you think making these chips is excessive, you don't know Tesla. Tesla is the most vertically integrated car company in the world (maybe BYD compares).
Compare that to literally designing not only their own battery cells, but their own battery chemistries, plus their own battery materials process AND their own fully built-up battery manufacturing factories. They started doing this a decade ago, and only this year did the first cars roll off the line and get sold to customers. That is years and years of investment, planning, and internal capacity building.
In 2017 people were laughing at them for attempting to do all their own manufacturing. Now they have industry-dominating automotive margins. They literally have their own materials team that makes their own aluminum for casting, and they have co-developed production machines the size of a whole house. This is now slowly being copied by other manufacturers.
They also make their own glass, seats, and a whole host of other things that most car companies don't make. Doing these things in-house costs billions and billions in capex.
Also, Dojo is not the only chip they make; they have other chips that are already being sold in their cars by the millions every year. Dojo is a pittance of an investment compared to that, and a fraction of the total investment into the overall self-driving stack. Just because you don't agree with the overall strategy (assuming from your comment), that doesn't mean the many, many billions Tesla spends aren't real, or that it's all some sort of stock price play.
So, if you think Dojo is about the stock price, I don't know what to tell you; it's just wrong.
So many people here seem to be convinced that since NVidia is the leader in this tech, any other attempt to build up something comparable is a losing proposition, shouldn't even be attempted, is senseless, and everybody should just bow to them and buy from them.
Weird. I thought competition, even when the leader is far ahead, is a good thing, and up to the investors.
No idea whether they can make it work; it is really hard, especially adapting their tools. But they have written most of their tools themselves, so adapting should be possible. Making those tools available to outsiders one day? Can happen, doesn't have to.
> So many people here seem to be convinced that since NVidia is the leader in this tech, any other attempt to build up something comparable is a losing proposition
This is a non-issue if you understand the design space of TPU-like AI accelerators. The understanding shouldn't even be necessary, because the major players (Google, and now Amazon) already build their own TPUs and have been for years.
There's a legion of people convinced of their superior intellect, with "helpful" advice about what Musk / Tesla / SpaceX should or should not do.
After the first AI Day, Tesla stock dropped significantly. Wall Street doesn't understand Tesla as anything but a car company, and won't understand the AI until it starts producing billions in revenue.
It's an exceedingly poor stock pump.
They've been working on Dojo for almost 7 years. They kept it secret for the first 5.
Kind of contradicts the "stock pump" memes.
They started talking about it now because those efforts are very advanced and they want to hire more people.
It has not really been a secret; they just haven't been talking about the details much. It was pretty clear to anyone who works in related areas what they were up to.
While I agree that most custom silicon is a bit silly (and the costs dramatically underestimated), it's worth noting that most of the serious companies in the AV space have some amount of custom silicon for ML either in the works or already in production. Waymo is using TPUs, for example, and Cruise has its own chips as well. It's the cool thing to do in SV lately. It doesn't help that Nvidia is such a toxic partner that the prospect of not working with them makes custom silicon seem almost reasonable.
Given Musk's public statements on goals and strategies, I think you should view it more holistically.
With Musk, everything is a step towards the end goal(s), like colonizing Mars (which might even be the only goal, since it ties everything together; it's much easier to send humanoids to prepare for colonization, and it's not like HCI seems at all a priority, unlike in many other AI efforts).
My guess is that with Tesla, Musk aims to create AGI in a way he can control (as you may recall, he has previously talked about fears of the tech and control is a natural strategy to handle this). The cars are just the step that enables the next, so he basically just invests in the future over short term gains, which is a very reasonable business decision.
Custom silicon is likely unavoidable if you want to lead the AGI field, so he would just be investing in the next step of the plan. Tesla Bot is a good indicator of this direction.
> Tesla Bot is a good indicator of this direction.
Tesla Bot was the human in a bot costume dancing on stage [0], right? If so, then yes, this is quite a good indicator of the seriousness of this direction.
Urgh, they explicitly mentioned that it was a real human in the same presentation. Your obsession with bashing Tesla is unhealthy. Pick a valid criticism.