> However I'm willing to bet that we'll soon have something much better: you'll describe something and you'll get a full 3D scene, with 3D models, source of lights set up, etc.

I agree with this philosophy: teach the AI to work with the same tools humans do. We already have plenty of human experts to refer to, and training material is everywhere.

There isn't a "text-to-video" expert we can query to help us refine the capabilities around SD. It's a one-shot, Jupiter-scale model with incomprehensible inertia. Contrast this with an expert-tuned model (i.e. one driven by natural language instructions) that can be adjusted precisely, down to the point of imperceptibility, with a single sentence.

The other cool thing about the "use existing tools" path is that if the AI fails partway through, a human operator can actually step in and attempt recovery.