True FSD will only come from LIDAR + vision. Relying only on vision feels exceedingly dangerous. I recall a recent (last 2 years?) case where a guy was essentially decapitated by a flat-bed semi because his car couldn’t recognize the semi’s bed due to it being right at camera level.
I think the pure-vision stuff is nifty, but “nifty” is dead last on my list of criteria when trying to transport my family safely from point A to point B.
So how does LIDAR read a sign (not just the shape of the sign, but the content)? How does it read the colour of a traffic light? How does it read brake light status or indicators on the car in front?
Oh, it can't?
So you need cameras too?
So you have to build a 3-dimensional sensor fusion system for LIDAR to work?
Wouldn't that fusion system be more complex, less performant, and more fallible than just choosing a single model (vision/LIDAR) and optimising around that?
10 million years of animal evolution makes a good case for stereoscopic vision as a sensing system for navigating the world.
LIDAR is a crutch for low-maturity software systems.
Well, bad news: Tesla Vision is not stereoscopic. It uses monocular cameras [1], except in the front, where three side-by-side cameras with different focal lengths (rated to roughly 60 m, 150 m, and 250 m) sit closer together than human eyes and cannot be used for stereoscopic depth calculation. So, even if we assumed that for some reason we would want to hamstring ourselves by restricting ourselves to evolution's solutions and the metaphorical equivalent of flapping airplanes, Tesla was apparently still too stupid to realize that animal evolution resulted in two eyes, not one.
It is actually shocking that anybody pushes the "good enough for evolution" narrative when Tesla has completely and utterly failed to do even that right. This is even ignoring all of the other purely mechanical inadequacies of cameras relative to eyes, such as resolution, dynamic range, dynamic attention-based focal lengths, being mounted on a mobile swivel to allow for parallax calculations, etc. Let alone the other neurological elements that are not fully understood but are integral to human-level perception. But no, they could not even get the two eyes instead of one eye part right.
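To put rough numbers on the baseline point: for an idealized, rectified stereo pair, depth is Z = f * B / d (focal length f in pixels, baseline B, disparity d), so a small disparity error turns into a depth error of about Z^2 / (f * B) times that error. Here is a minimal Python sketch. The function name, the 1000 px focal length, the 0.25 px matching error, and both baselines are illustrative assumptions, not Tesla's actual camera parameters.

```python
def stereo_depth_error(depth_m, baseline_m, focal_px, disparity_error_px=0.25):
    """Approximate depth uncertainty for an idealized rectified stereo pair.

    From Z = f * B / d, a disparity error dd gives dZ ~= Z**2 / (f * B) * dd.
    All default values are hypothetical and chosen only for illustration.
    """
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


if __name__ == "__main__":
    focal_px = 1000.0  # assumed focal length in pixels
    for baseline_m in (0.065, 0.03):  # roughly human eye spacing vs. a narrower pair
        for depth_m in (10.0, 50.0, 100.0):
            err = stereo_depth_error(depth_m, baseline_m, focal_px)
            print(f"baseline={baseline_m:.3f} m, depth={depth_m:5.1f} m -> error ~ {err:.2f} m")
```

Under these assumptions the error grows with the square of the distance and shrinks in proportion to the baseline, which is why closely spaced cameras are a poor basis for long-range stereo depth.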
Not all animals have stereoscopic vision. Many birds and fish see entirely separate images from their two eyes. Animals with injuries to one eye still have functional vision, even if worse. The "brain" part is the one that really helps with animal vision. And it includes evolved generational models, sensor fusion, memories of past experiences and other inputs beyond just two "cameras".
Overall, the complexity of modeling the world the way an animal does seems much, much greater than that of combining a few different sensors.
There is no reason to believe that we can achieve brain-like success at 3D vision with any current approach. As such, having multimodal sensors that animals don't have access to seems like a much more promising approach, and far closer to a sure bet. Basically, leverage technology that far surpasses animal senses to make up for the much dumber processing powers.
If Tesla had been limiting its ideas to theoretical research, or even applied research, I'd be all for vision only as a valid research avenue. But they are putting this thing on the streets where I walk, and they are charging people thousands of dollars with claims that it works today, and that it will do wonders tomorrow. That is simply not acceptable for a green field research concept that probably has decades left in front of it.
Some people think that because we can build cameras we can build vision. Most of vision happens in the brain, and we're nowhere close to being able to match the human brain's visual processing system (despite what some AI proponents would have you believe). That includes the ability to build 3D models of the environment based on two slightly different images, and then (most important) to infer what those models mean w.r.t. learned experience and common sense, and thus whether it's safe to run over them (e.g. a crumpled newspaper in the road) or not (a soccer ball rolling out into the road, which will often be followed by a child).
Assuming positive intent here, I'm not sure how you reason that a fusion system might be less performant than a single model?
My whole point is that a single system is not sufficiently safe to do what we're trying to do. The point of FSD isn't just to navigate without hitting stuff. The point is to do it with ~99.999% accuracy, ~99.999% of the time while flying down the road at 80 MPH.
Humans are _terrible_ at this, just check the deaths due to automobile accidents each year. We have some pretty amazing stereoscopic vision. But I don't like trusting my safety to another human, and I sure won't trust it to a machine whose vision isn't as good as mine.
For me to trust a machine, it needs to be an order of magnitude better than myself.
>Wouldn't that fusion system be more complex, less performant, and more fallible than just choosing a single model (vision/LIDAR) and optimising around that?
No, the whole point of sensor fusion is that it can be greater than the sum of its parts.
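One textbook way to see this is inverse-variance weighting of two range estimates. The sketch below is only an illustration: it assumes both sensors have independent, unbiased, roughly Gaussian errors, and the function name and all the numbers are made up for the example. It says nothing about how any production system actually fuses LIDAR and camera data.

```python
def fuse_estimates(est_a, var_a, est_b, var_b):
    """Inverse-variance weighted fusion of two independent, unbiased estimates.

    Returns (fused_estimate, fused_variance). Under the stated assumptions the
    fused variance is smaller than either input variance.
    """
    weight_a = 1.0 / var_a
    weight_b = 1.0 / var_b
    fused_var = 1.0 / (weight_a + weight_b)
    fused_est = fused_var * (weight_a * est_a + weight_b * est_b)
    return fused_est, fused_var


if __name__ == "__main__":
    # Hypothetical readings: a noisy camera range and a tighter LIDAR range.
    camera_range_m, camera_var = 52.0, 16.0
    lidar_range_m, lidar_var = 50.0, 0.04
    fused_range_m, fused_var = fuse_estimates(
        camera_range_m, camera_var, lidar_range_m, lidar_var
    )
    print(f"fused range ~ {fused_range_m:.3f} m, variance ~ {fused_var:.4f} m^2")
```

The catch is in the assumptions: if the errors are correlated, or one sensor is biased in a way the weights don't capture, the fused estimate can be worse than the better sensor alone, which is the fair part of the complexity objection.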