It is incredibly odd that Tesla decided to invest in this level of engineering. Even perfect hardware is, like, 1/5th of the battle. Now you get to stand up the entire software stack, make it performant, and hope you can compete with whatever Nvidia has in the pipeline.

Let me play the devil's advocate - when you have a $1T company, investing a couple million into something like this signals to investors that you have big plans for the future and have a leg up on the competition in the self-driving space.

Even if that increases the stock price by 1%, this investment pays for itself a hundred times over.

Setting up a hardware platform of this kind that is useable in production costs significantly more than a couple of million

Still a blip for a company that was toying with $1B in Bitcoin.

This is more like a couple hundred million, TBH.

Hundred million for the chip alone, a few hundred million for software (compiler, assembler, router, protocols, communications, load balancing, server, supercomputer utilization, power management).

The $100-million chip might very well be the easy part of this project...
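For scale, the arithmetic in this sub-thread can be sketched as below. The $1T valuation and the 1% bump are taken from the comments above; the three cost figures are stand-ins for "a couple million", "hundred million for the chip alone" and "a few hundred million" in total, not reported numbers. A rise in market cap is also not cash the company receives, so this is an illustration of the claim rather than a return-on-investment calculation.

```python
# Back-of-the-envelope check of the "pays for itself" claim.
MARKET_CAP_USD = 1_000_000_000_000   # "$1T company" (from the thread)
PRICE_BUMP_PERCENT = 1               # "increases the stock price by 1%"

# Hypothetical cost scenarios; only the rough magnitudes come from the thread.
COST_SCENARIOS_USD = {
    "a couple million": 2_000_000,
    "chip alone (hundred million)": 100_000_000,
    "chip + software (few hundred million)": 400_000_000,
}


def payback_multiple(cost_usd: int) -> float:
    """Market-cap gain from the assumed bump, divided by project cost."""
    gain_usd = MARKET_CAP_USD * PRICE_BUMP_PERCENT // 100
    return gain_usd / cost_usd


for label, cost in COST_SCENARIOS_USD.items():
    print(f"{label}: {payback_multiple(cost):,.0f}x")
```

Under these stand-in numbers, "a hundred times over" corresponds to a project cost of about $100M; at a few hundred million the multiple drops to the tens.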

As a software person it baffles me how we tend to overestimate our value compared to hardware. The chip design field is really demanding, especially with deep submicron processes and with rigorous testing & verification culture that blows out of the water almost everything software organizations do.

I doubt developing a compiler for this chip will be as costly as your estimate, especially given mature open-source compiler frameworks like LLVM (already adapted to similar TPU-like architectures) and upcoming ML-specific compiler frameworks like PyTorch Glow and Apache TVM. And especially given that in our industry there are people with advanced degrees who are willing to jump through additional hoops and even accept pay cuts, just to work on a cool compiler.

I respect the hardware for what it is and what it could enable.

I think they developed this during the years when all the Nvidia hardware was gobbled up by crypto miners. The resulting pricing probably did not help either.

Tesla is big enough that they can afford to experiment a little. This sounds like an experiment that actually worked. Once they had it working, doing more of it and iterating on it is just business as usual for them.

> This sounds like an experiment that actually worked.

Are you sure? We haven’t even seen the chip in any physical form yet. We haven’t seen it running any code.

The self-driving AI they use this chip for appears to be less than impressive from my perspective. Its mistakes seem to come up in the news quite frequently.

Dojo’s D1 chip for training is, AFAIK, not being used yet. Tesla still trains on their cluster of ~7,000 A100s.

And the D1 is for their training machine anyway, not the inference chip in the cars.

Are you aware of a better implementation though? It might not be perfect; it's still miles ahead of competing systems. Yes, it makes mistakes. But that's a rather poor argument to dismiss the whole strategy and technology.

From my point of view, Tesla is converging on good enough more rapidly than anyone else in the industry. Courtesy of their AI strategy and technology; including the in house developed chips.

> Tesla is converging on good enough

This is a big problem, though. Any ADAS is a safety-critical system, and while "good enough" can help push volumes, it is extremely bad from a safety perspective. If you really want to read up on this, u/adamjosephcook/ has some awesome writing that goes into far deeper detail than I could in explaining this particular issue.

In terms of most lives saved, I'd bet Tesla's AI is winning: all those people they automatically stopped from driving into rivers, etc.

Not to mention that in-house silicon is all about economies of scale, which makes this an even more puzzling move.

Tesla has a path to economies of scale: they already announced that if Dojo works as expected they'll make it available to others as an AWS-style service.

Which is brilliant: they might end up making money on this.

AI is clearly here to stay. The demand for AI training will clearly explode in the future.

Running training in-house is not easy or cheap. You don't just plug in 1000 NVIDIA GPUs. You need a massive up-front payment for GPUs and you're basically running your own extremely energy-hungry datacenter.

Tesla might build and operate massive datacenters. They'll use as much as they need for internal needs and sell the remaining capacity to others.

This might take 5 years but the path to do it is clear.

I don’t see how they’re going to commercialise this as a cloud compute service.

For one, they’ve built a chip that operates in a fundamentally different way to other chips. So any other company that wanted to use it would have to invest a considerable amount of resources in building up the institutional knowledge to use it effectively.

Additionally, the lack of virtual memory and multi-tasking support renders it pretty much impossible to divide up compute between multiple customers. So, commercialising this would require customers renting out the whole unit, which is contrary to how cloud computing usually works.

Are there companies out there that have the capital and use cases necessary to fit into Dojo Cloud? Maybe, though not one I’ve worked for. Would they trust the stable genius currently heading up Tesla enough to make such an investment? Perhaps, but I wouldn’t, but what do I know?

> Additionally, the lack of virtual memory and multi-tasking support renders it pretty much impossible to divide up compute between multiple customers. So, commercialising this would require customers renting out the whole unit, which is contrary to how cloud computing usually works.

Only if you want to subdivide the compute on each Dojo chip. You can still provide multi-tenant support by allocating entire Dojo chips to a single customer at a time. Even traditional time-division multi-tasking is possible as long as you’re happy to accept multi-second-long time slices. Then the overhead of clearing an entire Dojo chip (or batch of chips), and setting up a new application, isn’t too high.

If you’re doing AI workloads, then none of the above are an issue. Training a large net takes days to weeks of continuous, single task computation. So selling dojo access in whole 1 hour blocks is a perfectly reasonable thing to do.
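The whole-chip, coarse-time-slice idea above can be sketched as a toy allocator. Everything here (the `Request` type, the `allocate` function, the tenant names, the chip counts, the one-hour block) is hypothetical and only illustrates the scheduling idea; it says nothing about how Tesla actually schedules Dojo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    tenant: str   # hypothetical customer name (assumed unique per request)
    chips: int    # whole chips requested; never a fraction of a chip
    hours: int    # number of consecutive one-hour blocks


def allocate(requests, total_chips):
    """First-come, first-served placement of whole chips in one-hour blocks.

    Returns {tenant: (start_hour, chip_ids)}. A chip belongs to exactly one
    tenant during any block, so no on-chip isolation (virtual memory,
    multi-tasking) is assumed.
    """
    free_at = [0] * total_chips          # hour at which each chip becomes free
    schedule = {}
    for req in requests:
        if req.chips > total_chips:
            raise ValueError(f"{req.tenant} wants more chips than exist")
        # Pick the chips that free up earliest; the job starts once all are free.
        chosen = sorted(range(total_chips), key=lambda c: free_at[c])[: req.chips]
        start = max(free_at[c] for c in chosen)
        for c in chosen:
            free_at[c] = start + req.hours
        schedule[req.tenant] = (start, sorted(chosen))
    return schedule


requests = [Request("team-a", 3, 2), Request("team-b", 2, 1), Request("team-c", 2, 4)]
for tenant, (start, chips) in allocate(requests, total_chips=4).items():
    print(f"{tenant}: starts at hour {start} on chips {chips}")
```

The greedy placement is deliberately naive (it can leave a chip idle while a job waits for its partners); the point is only that tenants are separated in time and by chip, not within a chip.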

Part of the plan is to have PyTorch compatibility.

Dojo has its own IR but they also have a PyTorch to Dojo compiler.

People's opinion of Musk won't matter: either Dojo will be a capable service at a good price or it won't.

People will use it based on merits.

Reliability is an important factor here, and I don't mean technology. Things don't look so good for everything that has to do with Musk. Today like this, tomorrow like that

>Things don't look so good for everything that has to do with Musk. Today like this, tomorrow like that

Such as? Except for FSD, his record is unmatched AFAIK when you take into account the novelty / complexity / difficulty.

One example, and certainly his main achievement: he said Tesla would sell and produce half a million cars by 2020, back in early 2014, and they hit that number with a 93.6% precision. https://youtu.be/BwUuo6e10AM?t=156

Some of Musk's stuff is great - other stuff isn't.

SpaceX? Great. Starlink? Sounds neat. Tesla? Pioneered electric cars with respectable performance and range.

But on the other hand, where's the hyperloop? Where's the affordable tunnelling? Where's the $35k Tesla - not available for order on the website, that's for sure. Where's the miniature submarine for rescuing children trapped in caves? Why has my buddy in Europe been waiting over a year for his Powerwall to be delivered? Why are these Norwegian Tesla owners on hunger strike? Where's the full self driving, with taxi service? Why on earth would anyone want to buy Twitter?

Makes it very difficult to know which of Musk's statements are just spitballing, which are unrealistic timescale guesses and which can be relied on.

Getting any serious project architecturally 'locked in' to a special type of CPU you can only get from Tesla would be a bold move.

It's simple, SpaceX, Tesla and Starlink are evolutions of existing technology.

FSD, Hyperloop and such would be revolutions, like something out of a sci-fi movie. They all fail because Musk would like to have those things, but in reality they are much more complicated than he says.

How is the Las Vegas tunnel going? Or the brain implant?

He is good at marketing and developing existing technology.

But of his announced revolutions, none works.

yes, you still write checks to pay online and rocket boosters to this day are single use $100M pieces of hardware that we throw away into the ocean after each use.

seriously. musk is shady and weird, more so in the last couple of years, but come on.

It's about the perception of Musk. He built some successful companies but he is not Tony Stark, as people tend to see him. He is a salesman, not a genius inventor.

Don't expect FSD in the near future and don't expect a Mars colony.

FSD, especially the way he was describing it, was a blunder. the Mars city is a pipe dream... but I absolutely understand why it makes people follow him - he's the only one who set out to build a private space company with the purpose to actually get to Mars, with the side effect of completely uprooting the space launch industry. the achievements are undeniable, but it's the vision that makes the perception be so surprisingly good still. there's literally nobody else who says things like that and has the means to even try.

Reliability? Name one major cloud service where you can count on not getting randomly banned overnight. And yet people still use them.

Like I wrote, it's not about the technology but its chief.

Next week he tweets that he will take the service offline to buy AWS and then calls it off; that kind of reliability.

> I don’t see how they’re going to commercialise this as a cloud compute service.

The simplest, most obvious way would be “give me your datasets and we’ll train your model”.

I assume you never actually built any cloud infrastructure yourself. Plus Tesla (aka Elon), well, says a lot of stuff, not always necessarily correct.

An internal research product is super far from any actual production usage, especially if you go against some established paradigms. That requires an enormous amount of effort (more than developing silicon) to build tooling around it, so people can design, program, debug, and monitor it.

But that’s internal usage. Cloud is a totally different ballgame. You have to deal with thousands more requirements (and you cannot generally tell customers to do something else instead, as you can with internal teams). And you get customers that have operating procedures totally different from yours, 0 access to your internal knowledge and infinitely less tolerance for BS answers (as they are paying customers, not someone in the same boat).

Building cloud is extremely hard, and there’s a reason why Google is still losing money on it.

Plus, let’s even say that your 5 year estimate is correct, Dojo is amazing and the future of tech, and they may have a viable product by then. Do you think that Nvidia won't advance their AI offering by then? That Google TPU will stop being developed? Or will Tesla continue investing to churn out a new generation of Dojo every year?

> You have to deal with thousands more requirements (and you cannot generally tell customer to do something else instead, as you can with internal teams).

You can. AWS started with S3 when everyone was using databases. As long as it’s cheaper than its competition, single use-case (you won’t serve a website on these) has a market.

> You can. AWS started with S3 when everyone was using databases.

AWS started when there was no competitor.

Google started with a ton of world-class expertise when AWS was already up and running, while already operating a colossal network of server farms using special-purpose hardware, none of which Tesla has, and after all these years it has barely got a 10% market share.

What they want is a training engine that is cheaper than whatever AWS or Google (or anyone else) can offer. If I can point my PyTorch to it instead of an AWS GPU for less money, why not?

> What they want is a training engine that is cheaper than whatever AWS or Google (or anyone else) can offer.

Bold assumption, considering Tesla's hardware does not exist, the market is limited and Google already has years of experience providing machine learning services with special purpose hardware.

What doesn't exist?

Their hardware. They have, at best, 1 supercomputer (though it's not actually clear if they have more than some Dojo prototypes to me). That does not a cloud make.

Ah yes, I see that now. But assuming they make the computer, they could also lease it to one or more cloud providers as a service. They don't necessarily have to build the whole thing.

Kind of what Cray/HPE does with Microsoft's Azure - you could get your very own Cray to run your workloads.

Sadly, not a very interesting one running UNICOS or NOS/VE...

> Tesla has a path to economies of scale: they already announced that if Dojo works as expected they'll make it available to others as an AWS-style service.

"If we manage to put together a working processor, supporting hardware, OS, and possibly ad-hoc programming language, our next step is to also develop a bunch of web services to provide cloud hosting services."

Not very credible. As if the key to offering competing cloud hosting services is developing the whole hardware and software stack.

And network infrastructure, isolation between customers, scheduling hardware allocation, etc. Running one's own data center is quite different from inviting all sorts of third parties in.

Yeah, but it’s not like this is rocket science or anything.

> Yeah, but it’s not like this is rocket science or anything.

The key difference between this goal and SpaceX is that Elon Musk bought a private space company that already had the know-how and the market presence in a market with virtually no competitor.

In this case, Tesla is posting wild claims about vertically integrating the whole tech stack, stopping just short of mining the semiconductor raw materials, with what goal? Competing with the likes of AWS, Google, and Microsoft in a very niche market?

Digging holes in the ground is hardly rocket science as well.

He did what? The pre-existing know-how to build reusable rockets? Are you confusing SpaceX and Tesla?

Ever since Elon became the world's richest man people like this have shown up. I don't know if they have been misinformed or if they just want to say negative things about billionaires. But the early history of SpaceX is very well documented in the book Liftoff if anyone wants to know the truth.

You say that Tesla might do this for others, AWS style.

Then you talk about the up-front in-house costs of setting up for GPU ops, but ignore that if an AWS-style model works for you, well, AWS is already capable of giving it to you with GPUs.

They aren't going after economies of scale. If you look closely at their design choices, they are building a pure scale-out vector machine unlike anything else currently on the market. I'm guessing they expect it to be head & shoulders ahead for their in-house workload.

Cerebras could decide to compete in that space.

Dojo is rumored to be 7nm, to boot. They'll be competing with a process node disadvantage. TSMC is down to 3nm and 5nm already.

AMD chips like the MI250X are 6nm with matrix multiplication units. NVidia Hopper will be like 4nm IIRC?

During first AI day Tesla already said that they have plans and ideas for 10x improvements for Dojo v2. And I'm sure one of the improvements is to use 5nm or 3nm process.

They've already been working on this for almost 7 years. This is not some side project but a serious operation. They have to ship something sometime and they decided to ship this now.

I'm confident that they know about the 5nm process and chose 7nm for good reasons.

There will be a next version and a next process.

>And I'm sure one of the improvements is to use 5nm or 3nm process.

Implementing a process shrink is not just scaling the masks by the appropriate percentage. It often is a completely different optical train, at a different wavelength, different pellicle, changing the refractive index of the immersion fluid, different multiple patterning. It takes months (years?) of work.

For many applications it's worth it, but it's not at all a slam dunk. The vast majority of ASICs are designed for a particular node and never upgraded. It's the kind of crazy long-term speculative capital investment TSLA might have indulged in when the stock was at $414 a share, but it's nowhere remotely near that today.

They're already engaging in layoffs, today. Why spend the money?

> vast majority of ASICs are designed for a particular node and never upgraded

They are, if the new node is cheap enough to justify the investment in shrinking the design. In most ASICs, being faster won’t make people rip out their embedded electronics for new models.

Node is not everything; cheaper production and larger availability matter.

You both have limited availability of the node and limited availability of the particular chip.

Plus you are paying the margin of that company.

If you are running a machine at near 100% utilization (like supercomputers should), then your biggest cost will be power costs.

So power efficiency becomes king, and process node is the best way to minimize power consumption.
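To make the power argument concrete, here is a rough sketch of the annual electricity bill for a machine running flat out. All numbers (power draw, PUE, electricity price) are made-up placeholders, not figures for Dojo or for any Nvidia system, and the function name is hypothetical. The sketch does not show that power is the largest cost; it only shows that the bill scales linearly with power draw, which is why performance per watt matters at high utilization.

```python
HOURS_PER_YEAR = 24 * 365


def annual_energy_cost_usd(it_power_kw, utilization=1.0, pue=1.2, usd_per_kwh=0.10):
    """Yearly electricity cost under a simple linear model.

    it_power_kw  -- power drawn by the compute hardware at full load (placeholder)
    utilization  -- fraction of the year spent at full load (idle draw is ignored)
    pue          -- power usage effectiveness: facility power / IT power (placeholder)
    usd_per_kwh  -- electricity price (placeholder)
    """
    energy_kwh = it_power_kw * utilization * HOURS_PER_YEAR * pue
    return energy_kwh * usd_per_kwh


baseline = annual_energy_cost_usd(it_power_kw=1000)   # hypothetical 1 MW machine
efficient = annual_energy_cost_usd(it_power_kw=700)   # same work at 30% less power
print(f"baseline: ${baseline:,.0f}/yr, efficient: ${efficient:,.0f}/yr")
```

Whether a newer node actually delivers that kind of saving for a given design is a separate question; the 30% figure is only there to make the two calls differ.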

As far as I can tell Tesla doesn't even use Dojo seriously yet. The real work is still done on NVIDIA hardware.

Yes, per the talk Dojo is at the stage of having made first chips.

But your comment is neither here nor there.

What do you think is a timeline for designing a chip like Dojo from scratch?

Tesla has been working on this for almost 7 years (https://www.linkedin.com/in/ganesh-venkataramanan-99272a3/).

They might not work fast enough for your satisfaction but can you point to any other car company designing chips like Dojo?

Hell, can you point to any company at all that can come out of the gate with a chip design competitive with the best design of NVIDIA, a company that makes nothing but chips?

After 7 years of work they are clearly confident that this will work or else they would scrap the project instead of giving talks at conferences.

> After 7 years of work they are clearly confident that this will work or else they would scrap the project instead of giving talks at conferences.

They are clearly confident that FSD will work, to the point of promising it "this year"...

Coincidentally, they've been promising "this year" for the last 7 years. And FSD still fails in spectacular/hilarious/horrifying ways, and isn't _remotely_ close.

> Hell, can you point to any company at all that can come out of the gate with a chip design competitive with the best design of NVIDIA, a company that makes nothing but chips?


Sounds like Tesla is in good company then.

Yes. In addition to Nvidia and Google it will also have to compete with Intel, Cerebras, SambaNova, Graphcore, and maybe even AMD.

The point was that very few companies can pull off a clean sheet chip design from nothing. Google has done it, but Google is an elite company. So saying Tesla isn't the only company that has done it because Google has done it only shows that Tesla is doing very well.

But it seems clear to me that the point isn't to compete with these other companies, but to vertically integrate a critical component of their systems. They can discard a lot of legacy concerns and focus on raw power.

The thing is that if Tesla just uses Nvidia, like everyone else, then Tesla's stuff is only differentiated by software. Everyone uses Nvidia and it becomes hard to set themselves apart. But if Nvidia is chasing a broad customer base and has all this extra stuff to think about, then Tesla could potentially be more nimble, and produce a holistic system design that solves exactly the problems they have with no chuff. This could result in a more advanced hardware platform so their robotics products differentiate themselves with both hardware and software.

I am also happy because I am a robotics engineer, and in my opinion we need hardware that is 1000x more powerful than today to do what we really need. Nvidia wants to move at a certain pace, but if Tesla is trying to beat them on raw power, then Nvidia will play catch up and accelerate development of more powerful systems. This is great for everyone.

Internally Tesla can use whatever they want and it might even make sense in the long term, but if they want to sell their chips for general model training they had better be much better than the future Nvidia cards they will be competing with when they start selling. Like twice as fast at half the price and half the power - with perfect framework support. That last part is extremely important: if I see any mention of anyone who changed “cuda” to “dojo” in their PyTorch code and ran into any issues, I’m not going to touch it with a ten foot pole. Just like I avoid TPUs because 2 years ago I heard people were having issues with them. And I’m the guy who has decided which hardware to spend millions of dollars on at several companies.

Yeah I just don't think that is really the main part of their strategy. Maybe they would sell chips or boards or servers if they are already making them, but I think it is mostly about internal use so their end products have a competitive advantage as complete robots. Robotics needs HUGE advances in compute and with their own chips Tesla won't have to be dependent on a third party for their success.

All the stuff you talked about, needing perfect support before you will touch it, is something that takes a lot of work for Nvidia and others, slowing them down. Tesla can ignore all that and focus on performance for their specific application, and I think this gives them the freedom to lead the pack on raw performance for their application.

I'm not sure if you've watched the presentations for how their self driving system is trained, but basically they have a million vehicles out in the real world with camera systems on them, and they have a massive server farm that is collecting new data from vehicles all the time, and they train their neural net by running millions of scenarios in simulation and against real world data collected from all those vehicles. And they have to re-train the system all the time with new data, and run it against old data to check for regressions. So they have this huge compute requirement to keep improving their system. They think that functional self driving will revolutionize their business (setting aside the valid criticism, this is what Tesla thinks) so they need to be able to handle an ever growing compute load that they have to be running constantly. So raw compute power is critical to the success of their plan. It may not be enough, but they certainly can't succeed without it. But their needs are very specific, and it sounds like they've found an architecture which is simpler than most Nvidia chips, but has loads of power. So it sounds like they are making a good decision, based on their specific needs. It is a huge risky bet, but then that's how Musk likes to do things.

> Robotics needs HUGE advances in compute

This is surprising to me. Robotics clearly needs huge advances in algorithms (RL or something better). Do you mean you need faster hardware to discover those algorithms?

Oh we definitely need better algorithms too! But I’ve imagined that we’d want something like GPT-3 but for sensor experiences. So the way GPT-3 can ingest lots of text and predict a reasonable paragraph as a continuation of a prompt, we could have a system ingest lots of simultaneous sensor data in the form of LiDAR, cameras, IMU data, and touch sensor/skin sensor data, and then given the current state of those sensors it could predict what in the world is going to occur next, and use this as input to an RL system to make a choice of action. This seems to me to be both a useful system and one that could require a lot of compute. And that’s still probably not a complete AI system so there’s probably many many pieces required.

Looking at it another way, the human brain has wayyy more compute power than any of our current portable computers (robotics really needs to do most of its compute at the edge, on robot). Every robot I’ve ever worked with has been maxing out its CPU and GPU and still needed more compute.

When you look at Tesla’s hydra network for their self driving system you get an idea for what is needed in robotics, but just as we saw GPT-3 improve with network size, I suspect a lot of the base systems involved in a hydra net could improve with network size. And I suspect that there’s still more advanced stuff required when you move beyond a simple task like driving a car to a more advanced general purpose AI system. For example the Tesla self driving system doesn’t need any language centers, and we know GPT-3 and similar networks are large.

> robotics really needs to do most of its compute at the edge, on robot

Why can't you hook your robot up to a GPU cluster? Tesla already has 7k+ A100 GPUs, the question is do they have algorithms which would clearly work better if only we could run them on 70k or 700k GPUs?

I mean, what you say makes sense, but have people actually tried all that and realized they need bigger GPU clusters? Is there a GPT-3 equivalent model in robotics which would probably work better if we scaled it up? If not, perhaps they should start with a small proof of concept before asking for thousands of GPUs. Original Transformer --> GPT1 --> GPT2 --> GPT3.

The problem with this is that autonomous robots need to function in the real world even without internet connectivity. For example I am designing a solar powered farming robot. We do have Wi-Fi and starlink here but Wi-Fi and internet go down. In general we think it makes most sense for it to be able to operate completely without a continuous internet connection. And take self driving cars where millisecond response times matter - those can’t rely on a continuous connection to function or someone will get killed. But as systems get more advanced it is my opinion that edge compute will be an important function. And edge compute can’t handle the state of the art networks that people are building for even text processing, let alone complete autonomous AGI.

And no, the models I’m talking about don’t exist yet, I am speculating on what I think will be required based on how I see things. But I’m not asking for thousands of GPUs. I’m just speculating in internet comments about what robotics will need for a fully embodied AGI to function. I believe more edge compute power is needed. Perhaps 1000x more power than the Nvidia Orin edge compute for robotics today.

Sure, I understand the need for the robot autonomy. The problem, as I see it, is that current robots (autonomous or not) suck. They suck primarily because we don't have good algorithms. Without such algorithms or models, it does not matter whether a robot is autonomous or not, or how much processing power it has. Only after we have the algorithms, the other issues you mentioned might become relevant. Finally, it's not clear yet if we need more compute than what's currently available (e.g. at Tesla) to develop the algorithms.

p.s. I don't think AGI is needed for robotics. I suspect that even an ant's level of intelligence (together with existing ML algorithms) might be enough for robots to do almost all the tasks we might want them to do today. It's ironic that robots today can see, read, speak, understand spoken commands, drive a car (more or less), play simple video games, and do a lot of other cool stuff, but still can't walk around an apartment without falling or getting stuck, do laundry, wash dishes, or cook food.

>After 7 years of work they are clearly confident that this will work or else they would scrap the project instead of giving talks at conferences.

Being real for a second, Tesla don’t have an amazing track record of delivery on the AI software side of things.

Also, even if the first version didn't make financial sense, building the team and the infrastructure required to make this work will likely translate into a next generation chip.

Part of Tesla's vertical integration mantra is that you are building internal capacity.

>After 7 years of work they are clearly confident that this will work or else they would scrap the project instead of giving talks at conferences.

Doesn't the same apply to FSD? For how many years now has it been "ready next year"?

Remember AI? AGI has seemed just around the corner since the ’60s. There we learned that the things that seemed complicated (such as beating a champion at chess) were simple, and the simple things are still almost mind-blowingly hard.

It seems to me like they took a big bet and it didn’t pay off. They may not have felt like they could rely on Nvidia to deliver the promised Tensor Core performance.

It's a bit premature to call this a failure given they are barely at a stage of making first chips and validating them.

You don't seem to understand the timelines of designing chips like that from scratch.

Dojo has been in the works for almost 7 years. Tesla will continue to work on this for the next 20 years. There will be Dojo v2, Dojo v3 etc. just like at SpaceX there was Falcon 1, Falcon 9, Falcon Heavy.

This still might end up a failure but they clearly feel confident enough to talk about Dojo publicly, which wasn't the case for the first 5 years of its development.

It's even more premature to predict what will happen 20 years in the future.

Thanks for explaining to me what I don’t know. Clearly you’re a Tesla fanboy.

There is no doubt that it is an amazing piece of tech but I’m not confident Tesla will be able to pull off beating NVidia, especially when compared to NVidia’s Tensor Cores and economies of scale. I don’t like their whole approach to self driving ML. I know a lot of people disagree with me so I would rather not get into it.

I think a lot of the talk of future performance is posturing in order to get NVidia chips at cheaper prices.

I'm sure that it's at least partially for them to have a better negotiating position with NVIDIA. But the reality is that Tesla actually has some very good expertise in chip design. Jim Keller worked there for several years and along with Pete Bannon designed their Autopilot HW3 computer which is in 1+ million cars right now. At the time that HW3 was released it outperformed what NVIDIA had to offer. That said, it's not likely they'll be able to beat NVIDIA in general, but they may be able to beat them for their hyper domain-specific use cases. Additionally, NVIDIA chips are difficult to get and they're extremely expensive. Even if Tesla can only get something that performs 80-90% as well as NVIDIA but at significantly lower cost, then it may still be worth it.

I know these things. I think some of the dojo architecture was a reaction to their FSD chips being over optimized for resnet style models. They’re targeting extremely large models which is a new frontier for ML and in my view not the panacea they hoped it would be.

I think with better ML they wouldn’t need so many chips, be it NVidia or Dojo.

Perhaps they do this so they can carry over this tech into their cars. If you are making millions of cars per year, with tens or hundreds of chips per car, it is easy to reason doing this on your own instead of outsourcing it.

