So I tried it with an image of a monkey that I often use for profile pictures (https://mathstodon.xyz/@OscarCunningham). This image wasn't made by Stable Diffusion. It gave me this prompt:
> a monkey plushie on a white background, photograph taken by steve buscemi from a zoom lens, studio lighting, ultrarealistic
Can someone tell me what Steve Buscemi is doing here?
CLIP Interrogator uses BLIP, an image-captioning model, and also tries a bunch of candidate prompt phrases against the image with CLIP. I guess you mean that this model uses the captioning model to generate the complete prompt? Is the code for this one available?
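For anyone curious, the CLIP half is basically ranking candidate modifier phrases by image-text similarity. A minimal sketch (the checkpoint and the tiny candidate list are just illustrative, not what the tool actually ships with):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public CLIP checkpoint; the real tool may use a different/larger one.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("monkey.png")  # hypothetical input image
# Toy list of modifier phrases; clip-interrogator uses large curated lists.
candidates = ["studio lighting", "ultrarealistic", "oil painting", "pixel art"]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores

# Keep the highest-scoring phrases as prompt modifiers.
for score, phrase in sorted(zip(logits[0].tolist(), candidates), reverse=True):
    print(f"{score:.2f}  {phrase}")
```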
Maybe it was trained on an image set of politicians. I put in an image of Dr. Evil doing air quotes and it came up with "FirstName LastName from FirstName LastName in star trek the next generation ( 2005 ) ( 2 0 1 9 )".
FirstName LastName being the name of a politician.
This is the thing about AI that I find simultaneously wonderful and terrifying. Something to do with me as a human, noticing a hilarious detail amidst a fathomless ocean. It affirms my humanity but the backdrop is dizzying randomness.
Isn’t it just CLIP, the model that made these image-generation models possible?
It’s good at describing a picture, but it’s not reverse engineering. The predicted prompt usually has very little in common with the actual prompt, and it’s worse when you use embeddings or fine-tuned models.
What's interesting to me is that it even tries to predict the prompt on images that came straight from Stable Diffusion with no editing - which is weird because such images actually do have the prompt embedded inside of them already. (At least, that's the case for me - the prompt and parameters are stored in a tEXt chunk in the PNG file, which can be read with, for example, "pngcheck -t".)
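For example, with Pillow (assuming an AUTOMATIC1111-style PNG, where the prompt and settings live under the "parameters" key; other UIs may use different keys):

```python
from PIL import Image

img = Image.open("sd_output.png")  # hypothetical filename
# PNG tEXt/iTXt chunks are exposed via .info; AUTOMATIC1111's web UI
# writes the prompt and generation settings under the "parameters" key.
print(img.info.get("parameters"))
```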
True, images generated through some UIs have prompts in the metadata; the aim here is to work on images people find online with no metadata. So it doesn't try to read the metadata but actually predicts a similar prompt.
It is based on an image-captioning model, so it's a different approach than CLIP Interrogator, though you are correct that the aim is not to get the exact prompt back but to get a prompt that generates images in a similar style.
I'm surprised that the results seem much better (more detailed and sometimes closer to the original prompt) than the regular CLIP interrogation (at least based on my limited experimentation).
But as you say, even so, it still has little in common with the original prompt.
I'm having a lot of fun dropping in my Midjourney images, getting a more detailed prompt, then putting it back into Midjourney for more interesting variations.
So, people are commenting that it's not very accurate etc., but I love it. Delightfully quirky tool for exploring prompts.
Also I laughed out loud after putting a selfie into it and getting "mark zuckerberg's face reflected in a mirror, close up, realistic photo, medium shot, dslr, 4k, detailed"
Hee, cute, but predictable once the ... showed up. Honestly I think the ending was unnecessary; it destroys all subtlety.
Of course, this is not actually how latent space works. It's the AI's understanding of concepts, not the inherent nature of concepts; that's why every model has its own version of "latent space". Though the understanding of latent space in the story is internally consistent; given a superintelligent image generator, you could do prompt engineering like this.
I often use terms like "sexy", "risque", etc. in the process of getting images that are quite sensible (like military people playing chess). I use img2img repeatedly, looking for particular photo-film aesthetics, and tend to accumulate prompts. Anyway, this would open me to charges of sexism (or worse, "misogyny"), and it makes me uneasy about using SD.
Edit/OH: it generates prompts for things like Excel screenshots but not for images made with the img2img model at Hugging Face. Fascinating.
Why does it choke on img2img creations? This is just fascinating. It gives a plausible prompt to at least a handful of real non-AI photos from DSLRs and iPhones alike.
I mean, it does mistake Frank Sinatra for Louis Armstrong -- I guess it makes errors. But it just refuses to process my img2img images. It breaks down. Why?
It made me so agitated I made a gallery[0]. Granted, some of those images are strange, but others are just normal people doing normal things.
img2img creations are probably confusing it because they totally rearrange how the prompt maps onto what's in the latent space for a given image. If you make an image of <subject> by passing in an image of <subject>, it's going to represent <subject> in a fundamentally different way than if it relied purely on its own "imagination" for the rendition.
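Roughly, in diffusers terms (a sketch; the model ID, filenames, and strength value are illustrative):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

prompt = "a monkey plushie on a white background"

# txt2img: the structure of the result comes entirely from the prompt.
txt2img = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
from_prompt = txt2img(prompt).images[0]

# img2img: diffusion starts from a noised copy of an existing image,
# so much of the final structure comes from that image, not the prompt.
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
init = Image.open("monkey.png").convert("RGB").resize((512, 512))
from_image = img2img(prompt, image=init, strength=0.6).images[0]
# A prompt predictor trained only on txt2img outputs sees two quite
# different "renditions" of the same subject here.
```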
Yes, it's strongly influenced by clip-interrogator, but I revamped the algorithm quite a bit. I think it could be improved even further without resorting to fine-tuning the BLIP model.
Agreed, the results are not good for this style of image. I have a model training on a much bigger dataset of image-prompt pairs, which should perform better on these.
How is this different from image captioning when the model used is a booru model? That's already a thing people do when making their training data for fine-tuning these models.
It actually works on top of an image-captioning model. SD also takes in keywords like "artstation" and "octane render", which are not covered by standard captioning; that's the difference between using an off-the-shelf captioning model and this.
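For comparison, off-the-shelf BLIP captioning looks like this, and it produces plain descriptions with none of those SD-specific keywords (the checkpoint name is just the public base model; this tool's fine-tuned weights would differ):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

image = Image.open("monkey.png").convert("RGB")  # hypothetical input
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
# Yields something like "a stuffed monkey on a white background" --
# no "artstation", "octane render", etc.
print(processor.decode(out[0], skip_special_tokens=True))
```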
Ok, this is very cool! I took one image from a series I generated, had it guess a prompt (very different from mine, but it doesn't matter), and had it regenerate another image that captures the same feeling: https://imgur.com/a/Jz0mBej
folks, the AI called me "beautiful" and said I look like Chris Pratt, despite being a middle-aged and overweight computer programmer. They need to monetize this immediately.
> the man with the stupid face of a homeless person, portrait photography, 1 9 7 0 s, street photo, old photography, highly detailed, hyperrealistic
The subject is actually Mary Ann Bevan (1874–1934), also known as Rose Wilmot, a woman who claimed the title of the ugliest woman in London, as she suffered from acromegaly.
The model was trained exclusively on Stable Diffusion-generated images, so it can be unpredictable with non-generated images, especially images with people in them.
Interesting... but my guess is it's using a big library of generated image-prompt pairs? So all its suggested prompts are right out of someone's 'stable diffusion prompt cheatsheet.pdf'. That is to say, it over-outputs commonly known artists and things like 'trending on deviant art'.
It works by using an image-captioning model fine-tuned on SD prompts, so it may be outputting commonly known artists based on their occurrence in the training data.
This doesn't work for anything that doesn't use the upstream default stable diffusion checkpoint. I generate a lot of images with Pastel-mix, Anything, Waifu Diffusion and Counterfeit, and none of those are giving sensible results with this tool.
The model underneath this is trained only on data from SD 1.4/1.5, so this would be expected. I have another model training which covers all the models you mentioned and should perform well on those.
Can I have some kind of direct contact with you? Email, Twitter DM, Telegram, heck, I would download some new kind of app just to have some conversations.
It called a picture of my girlfriend "unbelievably cute" so she's now a fan.
And it described my profile picture on Mastodon as "my husband from the future that looks similar to travis scott and mark owen. he is also a good boy, very!!!, and beautiful!!!"...