The key element missing from the generated images is an understanding of form. In recognizing objects we generally rely on shape first (strong outlines and silhouettes, blobs of color, and so forth); only afterwards does our brain start to see forms in perspective.
When learning to draw, I gradually got a sense that what is really going on is that I'm gaining a more conscious command of different shapes, just like when I learned to write letters; but instead of abstract marks, I'm learning the shapes of hands, arms, etc., and from various perspectives. So if I study a lot of the same shapes in a topic like anatomy or wildlife, I can replicate them from memory with fairly accurate proportions.
The difference between me and the AI, in its current form, is that the AI continues along the path of being an extremely smart shape recognizer and reproducer (as it should, given that some of the first applications of the tech were text recognition). So it can output a lot of details I can't (without lots of reference) and blend in stylistic ideas I'm unaware of. But I, while having a much more limited visual library, can mix in more knowledge of perspective, of how anatomy and clothing work, and other kinds of logic. I can push the shapes to convey specific action and expression, design lighting situations, and so on.
The AI's ability to do it all in one step gives its results a very "savant" quality: it doesn't know what is and isn't a coherent image, but it has total mastery of making the shapes and applying rendering. Some of the things I've seen it do with prompts are wildly creative interpretations as a result. It's a good tool.
Those art ML models do operate on the wrong premise that input and output images are entirely raster fields. Most images should really be treated as curve fields, with the curves internally extrapolated into complete, color- or texture-filled 3D shapes via what are known as gestalt principles, volume estimation from shading, etc. Only the fill textures should be raster.
The current approach creates a huge limitation: input/output images are small (on the order of 512x512), and it produces a whole load of texture-turning-into-shape artifacts and vice versa.
It could possibly be overcome with a paradigm shift, though.
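As a rough sketch of what that alternative representation might mean (in Python; every name here is a hypothetical illustration, not an actual model architecture): instead of one dense pixel grid, an image becomes a list of parametric shape primitives, and rasterization happens only at the very end, at whatever resolution is requested.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    import numpy as np

    # Current premise: the whole image is one dense raster field,
    # so the resolution (e.g. 512x512) is baked into the representation.
    raster_image = np.zeros((512, 512, 3), dtype=np.uint8)

    @dataclass
    class BezierPath:
        # A closed outline as cubic Bezier control points: resolution-independent.
        control_points: List[Tuple[float, float]]

    @dataclass
    class ShapePrimitive:
        # One gestalt-completed shape: a parametric outline, a small raster
        # patch used only as fill texture, and a crude depth/volume cue.
        outline: BezierPath
        fill_texture: np.ndarray
        estimated_depth: float

    @dataclass
    class CurveFieldImage:
        # The alternative premise: an image as an ordered list of shapes.
        shapes: List[ShapePrimitive] = field(default_factory=list)

        def render(self, width: int, height: int) -> np.ndarray:
            # Rasterize only at the end, at whatever size is asked for.
            canvas = np.zeros((height, width, 3), dtype=np.uint8)
            # ...rasterize each shape's outline and warp its fill texture here...
            return canvas

The point is just that only the fill textures stay raster; outlines and depth cues live in a form that doesn't care about output resolution.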