Going to have to be the naysayer here. First, I'll say the simple fact that Stable Diffusion produces anything coherent is incredible. I'm blown away by the tech. However, my honest opinion of the results showcased in this article is not positive. Many of the result images contain bizarre distortions or dreamlike artifacts that severely disrupt the flow of the image. Especially the first one. It's clear that there's something like a person with long hair standing in front of a peak shrouded in clouds. But it only takes one or two moments to see the flaws in the output. I wonder if hyperparameter tuning would help.
Again, I think the results are impressive in their own right. But they seem impractical on account of the flaws in the details.
Weird, until I read your comment I was blown away. Then I had another proper look at the first image, and in many ways I had to turn off my brain's amazing 'upscaling' ability.
My brain had upscaled that human-like blob into a woman spinning around with a sword so that her hair covered her face.
Looking closely, none of that is really there, just a suggestion of it. And that is enough.
The more I learn about vision and sight the less I’m sure that we see reality.
The key element missing from the generated images is an understanding of form. In recognizing objects we generally rely on shape first (strong outlines and silhouettes, blobs of color, and so forth). Only afterwards does our brain start to see forms in perspective.
When learning to draw, I gradually got a sense that what is really going on is that I'm gaining a more conscious command of different shapes, just like when I learned to write letters; but instead of abstract marks, I'm learning the shapes of hands, arms, etc., and from various perspectives. So if I study a lot of the same shapes in a topic like anatomy or wildlife, I can replicate them from memory with fairly accurate proportions.
The difference between me and the AI, in its current form, is that the AI continues along the path of being an extremely smart shape recognizer and reproducer (as it should be, given that some of the first applications of the tech were to text recognition). So it can output a lot of details I can't (without lots of reference) and blend in stylistic ideas I'm unaware of. But I, while having a much more limited visual library, can mix in more details of the perspective, of how anatomy and clothing work, and other kinds of logic. I can push the shapes to convey specific action and expression, design lighting situations, and so on.
AI's ability to do it all in one step gives it a result that is very "savant", because it doesn't know what is and isn't a coherent image, but it has total mastery at making the shapes and applying rendering. Some of the things I've seen it do to prompts are wildly creative in interpretation as a result. It's a good tool.
Those art ML models indeed operate on the wrong premise that the input and output images are entirely raster fields; most of them should actually be considered curve fields, with the curves internally extrapolated into complete, color- or texture-filled 3D shapes by what's known as gestalt principles, volume estimation from shading, etc. Only the filling textures should be raster.
The current approach creates a huge limitation: input/output images are small (around 512x512), and there's a whole load of texture-turning-into-shape (and vice versa) artifacts.
It could possibly be overcome with a paradigm shift, though.
Artists have for the longest time used our brain's ability to upscale. Many paintings, even ones that seem super detailed like those by James Gurney in his Dinotopia series, will have blobs in the background. Based on silhouette and shape, our brain will recognize an extraordinary amount of detail that isn't actually there, such as the type of clothing and the action of a person. But if you look closer, it's a rectangular blob with a triangular blob within it to indicate clothes.
The difference between AI art and actual human art is the level of intention one can detect in it. When I look at human art I absolutely marvel at the cleverness of the artist to convey something that still looks like what I was imagining even when I look closer at it.
With AI art I look closer and realize that the blob presents more confusion the closer I look at it.
I've been telling anyone who will listen that AI art isn't stealing much lunch when it comes to professional art. But it may very well be a powerful tool for artists to speed up their workflows, and artists who refuse to use the tool risk being left behind the same way some artists got left behind in the illustration world once digital tools showed up.
> I’ve been telling anyone who will listen that AI art isn’t stealing much lunch when it comes to professional art.
Yet. These models have been out for only a matter of months. Just last year the state of the art was DALL-E 1, which is a toy in comparison[0] to DALL-E 2/Imagen/SD.
Making predictions is perilous but it would be surprising to me if computers did not have fully super-human artistic ability in the next 5 years.
I mean, for sure. I'm not going to make any predictions about the future when the field is so young; this is only about the current generation. Honestly though, art isn't the field to look at to see when the jump is coming. It's AI actually being able to recognize relationships between things: basic stuff like looking at enough pictures of horses and recognizing what a leg is and how many legs a normal horse has. Even Stable Diffusion, which has some of the best generation, will still give me five legs or two legs coming out of the same side. These kinds of images are a boon to artists, who will need to do all the final corrections.
Relationships between things are complicated. Someone resting their face on a fence is going to have a huge number of effects on the deformations of the face, especially the eyes and hair, depending on how they are resting. It's not enough for the AI to have seen enough pictures of faces on fences to be able to apply that in an image. It needs to understand what pressure and gravity are doing to the underlying structures. That's how human artists study, at least. It's why they can take that lesson and apply it to what they've learned about how skin behaves depending on the age of a person. They aren't copying. They are solving problems by thinking about the muscles underneath and how they change depending on any number of factors.
None of this even touches on lighting and colours.
If there’s one prediction of the future I’m willing to make, it’s that until research progresses on teaching computers to apply actual knowledge, AI within the creative space will remain assistive instead of replacing.
We see reality in some sense - light of various wavelengths reflects off physical objects and enters our eye. However, our processing and interpretation of that input is what can be more subjective, because every person's brain is going to process it differently based on experience (especially early life experience, when our brains are the most plastic). Also, some people have sensory differences (such as color blindness) that can influence the processing of the light that enters our eye.
Objective reality exists, for some definition of "exists" - there is physical matter present, with properties enabling some or all wavelengths of light to reflect (and similar for other senses like hearing). However, if we viewed reality devoid of the subjective processing, we'd "see" everything, but key existential concepts such as object permanence would not be possible, as that requires our brain be able to process and recognize an object in order to identify what the object is in the first place, to even be able to remember what it is. Not entirely unlike the iterative process of modern machine learning.
Are you naysaying Stable Diffusion or the idea of generative models for art in general?
It's hard to look at what's happened in the last few months and not think of it as akin to the invention of the steam engine, but for art.
It's not perfect, just as early steam engines had many flaws, were wildly inefficient, and produced irregular output. But the innovation that followed created the Industrial Revolution.
What's missing is the charm. In general, AI-generated visual content tends to have a very sterile look to it. The only image in said thread that looks even remotely decent to me is the one of a canyon/bridge. The rest are what I'd classify as voids.
>AI-generated visual content tends to have a very sterile look to it
I think it's because the prompts are often sterile as well. You have to add stuff like "matte painting" and "dream", just like in the first example in DreamStudio. "Very detailed landscape with xy" also works fine. Avoid prompts like "digital art" or "render".
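If you want to poke at prompt phrasing yourself, here's a rough sketch using the Hugging Face diffusers pipeline; the model id, prompts, and settings below are just illustrative defaults I'd start from, not anything from the thread:

    # Rough sketch with the diffusers StableDiffusionPipeline; assumes the
    # v1.4 weights are available and a CUDA GPU with enough VRAM.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        torch_dtype=torch.float16,
    ).to("cuda")

    # A "sterile" prompt vs. one padded with style keywords.
    prompts = {
        "plain": "landscape with mountains and a lake",
        "styled": "very detailed matte painting of a landscape with mountains and a lake, dreamlike",
    }
    for name, prompt in prompts.items():
        image = pipe(prompt, guidance_scale=7.5, num_inference_steps=50).images[0]
        image.save(f"{name}_landscape.png")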
The inaccuracy or weirdness of the resulting images has no bearing on how good or bad it is as art. Art has nothing to do with that. I would argue this is a shitty tech demo more than anything else.
I do not mean to discount the creator, as it's cool regardless; it just doesn't really have anything to do with art. They're literally just running some old computer images through a technology. That's it.
There will probably be good art conceived by good artists that uses this style and these techniques at some point, though.
Very similar stuff was said about the camera. Vermeer even essentially used one to win the realism competition in painting (of course later becoming Ayn Rand's favorite painter since he was the most "objective").
This is the opening act—of course the tech has all sorts of issues.
What I’ve found more amazing than the tech is how rapidly and intensively a community has formed around Stable Diffusion and all the stuff they’re doing with it. I’m more confident than not these issues will get worked out.
This whole thing has been a breath of fresh air. We can run this stuff on high-end workstations and we’re not beholden to tech giants for interesting creative ML applications.
The inconsistent and often incoherent shadows and lighting effects are immediately annoying in all generated images if you've spent any time working with visual media.
I'd be interested to know the parameters used, especially prompt_strength.
The correspondence to the original image is not especially high: the Leisure Suit Larry image, for example, enhances the original colours of the sea in a nicely realistic way, but all the foreground detail is essentially reinvented from scratch, including some very obvious omissions. In some of them the changes to perspective and more lifelike skull/canyons etc might improve on the original image, but it also flips even pretty basic stuff like which shoulder the woman's hand is placed on (and yes, once you look at that hand closely, the fingers SD has had to add in are all wrong...)
Ideally for this sort of use case you'd want high fidelity to the geometry of the original image but less fidelity to the palette (use more than 256 colours and naturalistic or artistic textures rather than lines and pixel dithering), but I'm not sure SD can manage that at the moment.
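For what it's worth, in the diffusers img2img pipeline that knob is called strength (Replicate exposes it as prompt_strength); lower values keep more of the source geometry and let less get reinvented. A minimal sketch, assuming the v1.4 weights, a CUDA GPU, and a made-up input filename:

    # Minimal img2img sketch; `strength` trades fidelity to the input image
    # (closer to 0) against freedom to repaint it (closer to 1).
    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        torch_dtype=torch.float16,
    ).to("cuda")

    # Hypothetical input screenshot; note that older diffusers versions call
    # the argument `init_image` instead of `image`.
    init = Image.open("lsl_screenshot.png").convert("RGB").resize((512, 512))

    out = pipe(
        prompt="detailed realistic painting of a man at a beach bar at sunset",
        image=init,
        strength=0.35,       # low: keep geometry, mostly repaint palette/texture
        guidance_scale=7.5,
    ).images[0]
    out.save("lsl_repainted.png")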
There is an art to conveying a feeling with limited resources. That's what makes early computer game images (the good ones... because there were plenty of bad ones) so special.
The same could be said even more strongly for words. I'm not a writer and I don't remember everything, but someone famous once said something about eliminating everything non-essential from writing to make it better. That's what makes a writer really great.
Even so, the AI-upscaled versions of the original art are impressive.
I said it before on this topic, and I'll repeat it. We will someday (soonish) have games where the art is generated in real time, unique for each player, based on good inputs. And it will be awesome. Every play and every experience will be relatively unique, but most or all of the plays will be excellent. That's an exciting prospect.
Why stop at the art? We'll have unique storylines and characters as well. Possibly even unique gameplay, although at that point we would pretty much be close to some sort of AGI anyway.
I found the intro of one of the King's Quest games run through SD a few days ago, but can't seem to find it now. It wasn't that impressive since I guess there wasn't much fine-tuning going on, but I liked the general idea. It had a few funny hiccups, but I'd have expected much more erratic behavior, because there is no information shared between frames (I guess). But maybe this concept can be improved upon?
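One cheap thing to try is reusing the same prompt and seed for every frame, so at least the noise pattern is shared; it doesn't give real temporal consistency, but it might cut down the flicker. A sketch of that idea (paths, prompt, and settings are made up):

    # Naive per-frame img2img with a fixed seed; nothing is shared between
    # frames beyond the prompt and the noise seed, so expect some flicker.
    import glob
    import os
    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        torch_dtype=torch.float16,
    ).to("cuda")

    os.makedirs("out", exist_ok=True)
    prompt = "detailed fantasy painting of a knight approaching a castle"

    for i, path in enumerate(sorted(glob.glob("kq_intro_frames/*.png"))):
        frame = Image.open(path).convert("RGB").resize((512, 512))
        gen = torch.Generator("cuda").manual_seed(1234)  # same seed for every frame
        out = pipe(prompt=prompt, image=frame, strength=0.4,
                   guidance_scale=7.5, generator=gen).images[0]
        out.save(f"out/frame_{i:04d}.png")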
Wonder if one day we can do realtime video conversion with this tool on a phone using the camera. It'd be mind-blowing to see it live in action, changing your backyard into any style you like. Maybe even with VR. An ultra trippy experience. I guess we have to wait 5 to 10 years.
This is cool, but img2img likely couldn't be easily used to make images in such games, because every scene would look very different. The problem is that you can't maintain the exact look between images. It could be used for artistic ideas and raw material though.
If you have over 4GB of VRAM you can run it locally. I've been experimenting recently and find that even with 10GB of VRAM I can only get 256x256 resolution images. I have a Dockerfile I can share that packages up the install process and removes censorship if anyone is interested. I find the censoring is extremely conservative.
Yes, Windows with an RTX 2070 Super. The logs say the app is trying to allocate just a few hundred MB more than what I have. I'm reasonably happy with 256x256 for now, just messing around.
You can run it with a command like this (I'm on Windows):

    docker run -it \
      -v <model file path>:/stable-diffusion/models/ldm/stable-diffusion-v1/model.ckpt \
      -v <outputs folder>:/stable-diffusion/outputs \
      -v <inputs folder>:/stable-diffusion/inputs \
      -v <cache folder>:/root/.cache \
      --gpus all knightley \
      python /stable-diffusion/scripts/txt2img.py --W 256 --H 256 --prompt "a horse wearing a top hat"

assuming you build the image and tag it "knightley".
I liked it until I noticed that it costs me $10 a day. I'm also not sure if they support img2img. I have used Hugging Face and then Replicate for it (the NSFW filter is very annoying).
Unless you’re looking at something else, it’s not per day, it’s a credit system that amounts to about 1 cent per image. How many images do you plan to generate a day?
I'm not against paying, but it should be somewhat reasonable. To put things into perspective: Colab Pro only costs about $10/month, and you will probably be able to generate at the same speed.
I did too, but I bought another 1000 generations and haven't run out yet. Many days I don't use it at all. I like it better than a monthly subscription like MidJourney.
Yeah, it’s amazing that things like Stable Diffusion can make coherent looking images. But why? Why convert iconic EGA and VGA *art* into algorithmic representations? To me, it doesn’t improve on the originals in any way. I would never play a game with the graphics replaced by these. To me, it is the constraints of older technologies that led artists to create imaginative works that transcended the materials available (big pixels with a limited palette) and helped create worlds where your own imagination could run wild. Replacing these images with photo realistic Thomas Kinkade-esque treacle is an abomination.
Why do people always ask “why?” on these things? Not everything has to be “useful” or have commercial value. The original games aren’t going anywhere. People just enjoy doing this kind of stuff, pushing the technology to make interesting things.
"""At the 1983 Academy Awards, Oscar voters declined to nominate [Tron’s] pioneering special effects. Lisberger said it was because they felt using computers as an animation tool was cheating.""" - https://www.moviefone.com/2017/07/08/19-things-you-never-kne...
Who said it's art? It's new and cool and fascinating. That's all.
Contrary to what you wrote, these AI-generated ones aren't replacing the originals. The originals are still there.
Sorry, but I think you're reading way too much into this... it's a fun toy, it's not replacing game art and it's not an "abomination". It's cool and new. That's all.