Not an expert, I would guess at least these two things:
1. There are actually around 20 lenses in a camera, not one.
2. A lens converts a ray into a cone, right? And then you need to pass that cone through the other 20 lenses, which further modify its shape, differently at different wavelengths.
I would guess this is extremely computationally expensive done naively.
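To make the "differently at different wavelengths" part concrete, here is a rough sketch of a single refraction using Snell's law with a wavelength-dependent index. The Cauchy coefficients are approximate textbook values for a BK7-like glass and the function names are made up for illustration; a real lens model would repeat this at every surface of every element, per wavelength sampled.

```python
import math

# Approximate Cauchy coefficients for a BK7-like crown glass
# (illustrative values, not a vendor datasheet).
CAUCHY_A = 1.5046
CAUCHY_B = 0.00420  # in micrometres squared

def cauchy_index(wavelength_um):
    """Refractive index from the two-term Cauchy equation n = A + B / lambda^2."""
    return CAUCHY_A + CAUCHY_B / (wavelength_um ** 2)

def refraction_angle(incidence_rad, wavelength_um, n_outside=1.0):
    """Angle inside the glass for a ray arriving from air, via Snell's law."""
    n_glass = cauchy_index(wavelength_um)
    return math.asin(n_outside * math.sin(incidence_rad) / n_glass)

if __name__ == "__main__":
    incidence = math.radians(30.0)
    for name, wl in [("blue", 0.45), ("green", 0.55), ("red", 0.65)]:
        angle = math.degrees(refraction_angle(incidence, wl))
        print(f"{name:5s} {wl:.2f} um -> {angle:.4f} deg")
```

The per-surface difference between colours is small, but it accumulates across many surfaces, which is why one ray per pixel stops being enough once you simulate dispersion.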
Cone tracing is a thing that exists, but I think most path tracers just use regular rays. (There are also ray differentials, which, if I understand correctly, are a way of tracking how a cone of light would spread out after bouncing off curved surfaces, if the ray were a cone.)
As to why lenses would cause a problem specifically, I don't know the path tracing algorithm well enough to say for sure offhand, but it may have something to do with introducing random sampling before you even hit the first scene object. Usually in ray tracing the first hit is kind of a freebie; you can calculate direct illumination exactly, and it's only indirect illumination that's approximated. (In path tracing, it's approximated by doing a lot of random sampling.) So, not having to approximate the first-hit direct illumination reduces the noise quite a bit right off the bat.
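A minimal sketch of what "random sampling before you even hit the first scene object" could look like, assuming an idealized thin lens rather than a real multi-element one. Both functions and their parameters are hypothetical and not taken from any particular renderer: the pinhole version gives exactly one ray per pixel, while the thin-lens version has to pick a random point on the aperture for each sample.

```python
import math
import random

def _normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def pinhole_ray(px, py):
    """Pinhole camera: a pixel maps to exactly one ray, no randomness."""
    origin = (0.0, 0.0, 0.0)
    return origin, _normalize((px, py, 1.0))

def thin_lens_ray(px, py, aperture_radius, focus_dist, rng):
    """Thin-lens camera: the ray origin is a random point on the lens disk.

    All rays for one pixel converge on the same point of the plane of
    focus (z = focus_dist), so only out-of-focus geometry gets blurred.
    """
    # Where the pinhole ray for this pixel crosses the plane of focus.
    focus_point = (px * focus_dist, py * focus_dist, focus_dist)
    # Uniform sample on a disk: sqrt keeps the area density constant.
    r = aperture_radius * math.sqrt(rng.random())
    phi = 2.0 * math.pi * rng.random()
    origin = (r * math.cos(phi), r * math.sin(phi), 0.0)
    direction = _normalize(tuple(f - o for f, o in zip(focus_point, origin)))
    return origin, direction

if __name__ == "__main__":
    rng = random.Random(0)
    print(pinhole_ray(0.1, 0.2))
    for _ in range(3):
        print(thin_lens_ray(0.1, 0.2, 0.05, 5.0, rng))
```

With the pinhole, every sample for a pixel starts with the same primary ray, so the first hit is shared. With the lens, each sample can hit a different surface point, so even the first hit has to be averaged over many samples.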