They do have stereo cameras and sensor fusion, and they detect more than just the lines on the road. Here is what the camera sees: https://www.youtube.com/watch?v=rACZACXgreQ
What in that video suggests sensor and/or stereo fusion to you?
I notice that the temporal coherence is pretty bad: pedestrians pop out of recognition when they go behind trees, and lane/exit boundaries wiggle all over the place and occasionally frame-pop into different configurations. A Kalman filter, for example, is a state estimator that maintains temporal coherence and makes heavy use of previous estimates and sensor inferences when computing the most recent estimate. It doesn't look to me like that kind of strategy is being used to maintain the vehicle's world model. IMO a good estimator wouldn't treat "a pedestrian popping out of existence" as the most likely estimate in any circumstance, let alone one where the pedestrian was clearly present in the previous 50 frames. I don't doubt they're using a KF on the vehicle's inertial movement, but based on the failures and this video, it sure doesn't look like a fusion technique is being used for the world model.
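To make the point concrete, here's a minimal sketch (not Tesla's code, and all the numbers are assumptions) of a constant-velocity Kalman filter tracking a pedestrian's position. When the detector loses the pedestrian behind a tree (measurement = None), the filter coasts on its prediction instead of dropping the track, which is exactly the temporal coherence the per-frame pop-in/pop-out in the video seems to lack:

```python
# Minimal sketch, not Tesla's code: a constant-velocity Kalman filter
# tracking a pedestrian's lateral position. Frame rate, noise levels,
# and walking speed are all assumed values for illustration.
import numpy as np

dt = 1 / 36  # assumed frame interval
F = np.array([[1, dt], [0, 1]])        # state transition: position, velocity
H = np.array([[1, 0]])                 # we only measure position
Q = np.diag([1e-4, 1e-3])              # process noise (hand-tuned guess)
R = np.array([[0.05]])                 # measurement noise (hand-tuned guess)

x = np.array([[0.0], [1.4]])           # initial state: 0 m, walking ~1.4 m/s
P = np.eye(2)                          # initial covariance

def step(x, P, z):
    # Predict forward one frame.
    x = F @ x
    P = F @ P @ F.T + Q
    if z is not None:                  # update only when the detector fired
        y = np.array([[z]]) - H @ x    # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S) # Kalman gain
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P
    return x, P

# Simulate 50 frames of detections, 20 frames of occlusion, then re-detection.
truth = 0.0
for t in range(90):
    truth += 1.4 * dt
    z = truth + np.random.normal(0, 0.2) if not 50 <= t < 70 else None
    x, P = step(x, P, z)
    print(f"frame {t:2d}  est={x[0,0]:.2f} m  occluded={z is None}")
```

During the occluded frames the estimate keeps walking forward at the last inferred velocity, with the covariance growing to reflect the missing measurements, rather than the pedestrian vanishing from the world model.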
There are left- and right-looking cameras, but the FOV overlap between them isn't very substantial, and there can't be stereopsis where there is no overlap. Per the Tesla website, there are three forward-looking cameras, each with a different FOV. The parallax baseline between them is only a few centimeters, too, so the depth sensitivity isn't going to be spectacular. It's certainly possible that there is some narrow-baseline stereo fusion, but it could only really happen inside the narrowest field of view, where coverage from more than one camera overlaps, and since the narrow camera is the one looking at distant objects, that's exactly the circumstance where a narrow baseline hurts the most. Based on that, the system doesn't really seem to be set up for stereopsis; if it's there, it seems like an afterthought.
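A back-of-envelope sketch of why a few-centimeter baseline gives weak depth sensitivity at driving distances (all numbers here are my assumptions for illustration, not Tesla specs; the focal length in pixels and the achievable disparity precision are guesses):

```python
# Stereo ranging: Z = f * B / d, so a disparity error sigma_d maps to a
# depth error of roughly Z^2 * sigma_d / (f * B). Error grows with the
# square of distance and shrinks only linearly with baseline.
f_px = 1400.0          # assumed focal length of the narrow camera, in pixels
sigma_d = 0.3          # assumed disparity matching noise, in pixels

def depth_error(baseline_m, depth_m):
    return depth_m**2 * sigma_d / (f_px * baseline_m)

for baseline in (0.03, 0.30):          # ~3 cm vs a hypothetical 30 cm rig
    for depth in (10, 30, 60):         # meters
        err = depth_error(baseline, depth)
        print(f"B={baseline*100:>4.0f} cm  Z={depth:>2d} m  ±{err:5.1f} m")
```

With these assumed numbers, a ~3 cm baseline gives on the order of ±25 m of depth uncertainty at 60 m, versus a couple of meters for a 30 cm baseline, which is why the quadratic-in-distance term matters so much in the narrow camera's working range.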
I could certainly be wrong, as I don't have access to the code. Are you going by some other secondary source/information?
To be fair, it could be that this is what the camera segmentation produces before it is combined with the other sensors and used to update the world model (which would then carry the temporal information).
I've got 20/20 vision in one eye; the other is legally blind without correction. My driver's license has a little note that it's not legal for me to drive without my glasses, which I never wear under any other circumstances.
So it's not as clear-cut as you make it out to be.
(And you know what? Even if it were legal for me to drive without those glasses, I'd still drive with them. Because ranging is important!)