Img2Prompt – Get prompts from stable diffusion generated images (img2prompt.io)
183 points by sahil_chaudhary on Feb 8, 2023 | 75 comments



So I tried it with an image of a monkey that I often use for profile pictures (https://mathstodon.xyz/@OscarCunningham). This image wasn't made by Stable Diffusion. It gave me this prompt:

> a monkey plushie on a white background, photograph taken by steve buscemi from a zoom lens, studio lighting, ultrarealistic

Can someone tell me what Steve Buscemi is doing here?


Can you clarify whether or not Steve Buscemi actually took the photo?


This one also had me rolling, haha. Thank you, too. <3 :))))


How do you do fellow humans?


This sounds a lot like the results you get from CLIP interrogation. Maybe they just use that and made another online service for it?


It's actually based on a different approach: it uses an image-captioning model finetuned on image-prompt pairs.


CLIP Interrogator uses BLIP, an image captioning model, as well as trying a bunch of prompts with CLIP. I guess you mean that this model uses the captioning model to generate the complete prompt? Is the code for this one available?
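
For reference, a minimal sketch of how the clip-interrogator package is typically used (this assumes the pharmapsychotic/clip-interrogator API; model names and options may differ by version):

    # Sketch: CLIP Interrogator usage (pip install clip-interrogator).
    # Assumes the pharmapsychotic/clip-interrogator API; details may vary by version.
    from PIL import Image
    from clip_interrogator import Config, Interrogator

    image = Image.open("generated.png").convert("RGB")

    # BLIP produces a base caption; CLIP then scores artist/style/medium phrases
    # from built-in term lists and appends the best-matching ones.
    ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))
    print(ci.interrogate(image))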


Ah yes, this model treats this purely as image captioning. The model isn't open source yet.


Maybe it was trained on an image set of politicians. I put in an image of Dr. Evil doing air quotes and it came up with "FirstName LastName from FirstName LastName in star trek the next generation ( 2005 ) ( 2 0 1 9 )".

FirstName LastName being the name of a politician.


This is the thing about AI that I find simultaneously wonderful and terrifying. Something to do with me as a human, noticing a hilarious detail amidst a fathomless ocean. It affirms my humanity but the backdrop is dizzying randomness.


I generated a bunch of variations and eventually got something kinda close to your source, but it seems pretty hit or miss: https://i.pica.so/b014bb2f-ca05-499a-a4da-1d178f49bb06.jpg https://i.pica.so/5ee6e122-c2a4-44a7-8ae9-ecec24da0246.jpg


Steve Buscemi -> Parting Glances -> AIDS -> monkeys ???


I'm dying laughing right now, this is phenomenal comedy gold.

Thank you.


Pictures with zoom lenses, obviously :D


His style is just so beautifully unique


A photo taken in the style Steve Buscemi would have if he were a photographer.


Isn’t it just CLIP? The model that made these image generation models possible.

It’s good at describing a picture, but it’s not reverse engineering. The predicted prompt usually has very little in common with the actual prompt. And it’s worse when you use embeddings or fine-tuned models.


What's interesting to me is that it even tries to predict the prompt on images that came straight from Stable Diffusion with no editing - which is weird because such images actually do have the prompt embedded inside of them already. (At least, that's the case for me - the prompt and parameters are stored in a tEXt chunk in the PNG file, which can be read with, for example, "pngcheck -t".)
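
For anyone who wants to read that chunk without pngcheck, here's a quick sketch with Pillow (the "parameters" key mentioned below is the AUTOMATIC1111 convention and is an assumption; other front-ends may store the prompt under different keys):

    # Sketch: dump prompt metadata from a Stable Diffusion PNG using Pillow.
    # The "parameters" key is an AUTOMATIC1111 convention (assumption);
    # other UIs may use different keys.
    from PIL import Image

    img = Image.open("output.png")
    # Pillow exposes the PNG tEXt/iTXt chunks via the .text property.
    for key, value in getattr(img, "text", {}).items():
        print(f"{key}: {value}")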


True, images generated through some UIs have prompts in the metadata; the aim here is to work on images people find online with no metadata. So it doesn't try to read the metadata but actually predicts a similar prompt.


This is something specific to the AUTOMATIC1111 web UI. It's just a setting, but I believe it's on by default.


Not just AUTOMATIC1111. Other SD forks like InvokeAI also embed the prompt in the PNG.


It is based on an image-captioning model, so a different approach than CLIP Interrogator, though you are correct that the aim is not to get the exact prompt back but to get a prompt that generates a similar style of image.


I'm surprised that the results seem much better (more detailed and sometimes closer to the original prompt) than the regular CLIP interrogation (at least based on my limited experimentation).

But as you say, even so, it still has little in common with the original prompt.


I usually have problems coming up with prompts to generate the kind of images I want.

This tool is useful for reverse-engineering prompts from the kind of images I want, then generating new ones in the same style.

Very cool.


Glad you like it


I'm having a lot of fun dropping in my Midjourney images, getting a more detailed prompt, then putting it back into Midjourney for more interesting variations


Glad you enjoy it


So, people are commenting that it's not very accurate etc., but I love it. Delightfully quirky tool for exploring prompts.

Also I laughed out loud after putting a selfie into it and getting "mark zuckerberg's face reflected in a mirror, close up, realistic photo, medium shot, dslr, 4k, detailed"


Glad you enjoy it! It hasn't been trained on non-generated images, so it's unpredictable when uploading a real photo, especially one with people in it.


Shoutout to https://banana.dev, couldn't have made this demo without their hosting.


you spoke too soon, POSTs to https://www.img2prompt.io/api/banana are 504-ing


That was an issue on my end, should be fine now


Related - here's a fun short sci-fi story about a savant ("the prompt whisperer") who is able to intuit the prompt that was used to generate things: https://interconnected.org/home/2022/08/03/whisperer


Hee, cute, but predictable once the ... showed up. Honestly I think the ending was unnecessary; it destroys all subtlety.

Of course, this is not actually how latent space works. It's the AI's understanding of concepts, not the inherent nature of concepts; that's why every model has its own version of "latent space". Though the understanding of latent space in the story is internally consistent; given a superintelligent image generator, you could do prompt engineering like this.


Hugged to death already?

I often use terms like "sexy", "risque" etc. in the process of getting images that are quite sensible (like military people playing chess). I use img2img repeatedly looking for particular photo-film aesthetics and tend to accumulate prompts. Anyway, this would open me to charges of sexism (or worse "misogyny"), and makes me uneasy about using SD.

Edit/OH: it generates prompts for things like Excel screenshots but not for images made with the img2img model at Hugging Face. Fascinating.


Why does it choke on img2img creations? This is just fascinating. It gives a plausible prompt to at least a handful of real non-AI photos from DSLRs and iPhones alike.


The dataset used to train this model didn't have any img2img data, so that would explain it.


I mean, it does mistake Frank Sinatra for Louis Armstrong -- I guess it makes errors. But it just refuses to process my img2img images. It breaks down. Why?

It made me so agitated I made a gallery[0]. Granted, some of those images are strange, but others are just normal people doing normal things.

[0]: https://publish.obsidian.md/zero-chroma-infinity/Image+galle...


img2img creations are probably confusing it because they are totally rearranging how the prompt emerges from what's in the latent space with respect to a new image. So if you make an image of <subject> by passing in an image of <subject>, it's going to represent <subject> in a fundamentally different way than if it relies purely on its own "imagination" for the rendition.


Shameless plug: I have a similar open-source tool which uses locally executed pre-trained models, here https://github.com/kir-gadjello/extract_prompt


Nice, does this use clip-interrogator?


Yes, it's strongly influenced by clip-interrogator, but I revamped the algorithm quite a bit. I think it could be improved even further without resorting to fine-tuning the BLIP model.


Yeah I wouldn't say it's very close:

Tried it on last image I generated on: https://blazordiffusion.com/artifacts/50/50418_studio-ghibli...

Original Prompt:

> Studio ghibli, rocket explosion, jungle, solar, green technology, optimist future

> 8k, Bokeh effect, Cinematic Lighting, Octane Render, Iridescence, Vibrant

> by Beeple, Asher Brown Durand, Dan Mumford, Greg Rutkowski, WLOP

Img2Prompt:

> a vehicle in the grass, colorful light dust, cinematic lighting, trending on artstation, ultra detailed, art by akihito yoshida

Looks like a decent image classifier, but not useful for extracting the original stable diffusion prompt.


Agreed, the results aren't good for this style of image. I have a model training on a much bigger dataset of image-prompt pairs which should perform better on this.


How is this different from image captioning when the model used is a booru model? That's already something people do when making their training data for fine-tuning these models.


It actually works on top of an image-captioning model. SD also takes in keywords like "artstation" and "octane render" which aren't covered by standard captioning, hence the difference between using an off-the-shelf captioning model and this.
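
To illustrate the gap, here's a rough sketch of plain off-the-shelf BLIP captioning via the Hugging Face transformers API (this is only the baseline being described, not the img2prompt model, which isn't public):

    # Sketch: off-the-shelf BLIP captioning for comparison (not the img2prompt model).
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    image = Image.open("generated.png").convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40)
    # Produces plain captions like "a vehicle in the grass" -- no "octane render",
    # "trending on artstation", or artist names; finetuning on prompt pairs adds those.
    print(processor.decode(out[0], skip_special_tokens=True))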


Doesn't seem to be working - an alternative is to use https://tinybots.net/artbot/interrogate, which is a front-end to the crowdsourced https://stablehorde.net network.


Ok, this is very cool! I took one image from a series I generated, had it guess a prompt (very different from mine, but it doesn't matter) and had it regenerate another image that captures the same feeling: https://imgur.com/a/Jz0mBej


Nice, that's the aim: not to get the actual prompt back but to get a prompt which can generate the same style of images.


Something similar (image to text description) from a bit ago - Seeing AI app from Microsoft (2017) - https://youtu.be/bqeQByqf_f8

It's not prompt-based or intended to generate another image, but rather an accessibility tool.

And some related videos:

Seeing AI 2016 Prototype - A Microsoft research project - https://youtu.be/R2mC-NUAmMk

Seeing AI: Making the visual world more accessible - https://youtu.be/DybczED-GKE


folks, the AI called me "beautiful" and said I look like Chris Pratt, despite being a middle-aged and overweight computer programmer. They need to monetize this immediately.


I tested it with an actual (albeit colorized) photograph:

https://i.imgur.com/GpGG0SL.jpg

And got this prompt:

> the man with the stupid face of a homeless person, portrait photography, 1 9 7 0 s, street photo, old photography, highly detailed, hyperrealistic

The subject is actually Mary Ann Bevan (1874–1934), also known as Rose Wilmot, a woman who claimed the title of the ugliest woman in London; she suffered from acromegaly.


The model was trained exclusively on Stable Diffusion generated images, so it can be unpredictable with non-generated images, especially images with people in them.


Tbh I think that prompt is spot on.


Bad form, poor taste. Preferably not anywhere, but not on Hacker News, please.

Thank you.


It's not terribly bad at guessing!

I made a few generic military guys for a side project and the actual prompt isn't too far from what I used.


I sent a picture of Mr. Spock from the first pilot, looking back while walking on the transporter[1]. And it generated this prompt:

> john cena walking on stage at television talk show, very coherent!!!!!!!!!!!!!!!!!!!!!!

[1] https://i.imgur.com/L6hbWHX.jpg


Interesting, but my guess is it's using a big library of generated image and prompt pairs? So all its suggested prompts are right out of someone's 'stable diffusion prompt cheatsheet.pdf'. That is to say, it over-represents commonly known artists and things like 'trending on deviant art'.


It works by using an image-captioning model finetuned on SD prompts, so it may be outputting commonly known artists based on their occurrence in the training data.


This doesn't work for anything that doesn't use the upstream default stable diffusion checkpoint. I generate a lot of images with Pastel-mix, Anything, Waifu Diffusion and Counterfeit, and none of those are giving sensible results with this tool.


The model underneath this is trained only on data from SD 1.4/1.5, so this would be expected. I have another model training which covers all the models you mentioned and should perform well on those.


Can I have some kind of direct contact with you? Email, Twitter DM, Telegram, heck, I would download some new kind of app just to have some conversations.


You can find my email on my HN profile


I'd be interested to see the results of that. My email address is on my Hacker News profile if you want to talk there.


I tried it with a cat image, and it didn't quite capture the feel:

https://m.galaxybound.com/@vidar/109829298036416109


Interesting, it hasn’t been tested extensively for non-generated images.


It called a picture of my girlfriend "unbelievably cute" so she's now a fan.

And it described my profile picture on Mastodon as "my husband from the future that looks similar to travis scott and mark owen. he is also a good boy, very!!!, and beautiful!!!"...

Not sure how to take that ;)


Cool! I also have a project that does image captioning: https://github.com/DavidHuji/CapDec


Enjoying playing with this! It would be great if the generated images had their prompts embedded in the metadata..


Whenever I try I get a 500 error :/


Can we also reverse engineer text? Would love text2prompt, super helpful to get better at prompting.


Can someone explain how this differs in method and efficacy to CLIP?


This is neat, do you have any docs/posts about it? I presume it isn't on GitHub?


Not yet, looking into creating a write-up on how it works and possibly open-sourcing it.


Not working.



