

I work for a computer vision company. I use CLIP almost every day. Example use cases for which I have used CLIP:

  - Image classification
  - Automated labeling for classification models
  - Image clustering
  - Gathering images for model training that are sufficiently dissimilar from existing samples (see the sketch below)
  - Content moderation
CLIP is also being used widely in new research. SAM-CLIP (https://arxiv.org/abs/2310.15308), which my manager shared with me today, uses CLIP and knowledge distillation to train a new model. I have seen references to CLIP throughout multimodal LLM papers, too, although my knowledge of multimodal model architectures is nascent.
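
As a rough illustration of the clustering / dissimilarity use cases above (not the commenter's actual pipeline), here is a minimal sketch that compares CLIP image embeddings with cosine similarity via the Hugging Face transformers CLIP implementation; the filenames and the 0.9 threshold are placeholders:

  from PIL import Image
  import torch
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  # Placeholder filenames: an image already in the training set and a new candidate.
  images = [Image.open("existing_sample.jpg"), Image.open("candidate.jpg")]
  inputs = processor(images=images, return_tensors="pt")
  with torch.no_grad():
      emb = model.get_image_features(**inputs)   # one embedding per image
  emb = emb / emb.norm(dim=-1, keepdim=True)     # L2-normalise
  cosine = float(emb[0] @ emb[1])                # cosine similarity in [-1, 1]
  print("keep candidate" if cosine < 0.9 else "too similar, skip")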



I have a noob-to-CLIP question. When I've tried to use it to auto-caption photos, the result has been 4-5 words that only vaguely describe the image. Honestly it's usually something like "A woman holding a pencil", sometimes "A woman woman holding a pencil", or just "A woman".

Do different models do better or worse at this? Is this just untuned output? Are there parameters I should be tweaking? Sorry, I'm not able to give much more detail. I'm mostly using it within A1111's "Interrogate CLIP" option, but I've also tried a model I found on Replicate as well as installing it locally. Same results every time.

It seems vaguely useful, but it misses the mark a lot of the time. I'm assuming I'm doing something wrong.


For a task like that, I'd recommend LLaVA instead. It's still inaccurate, but it's a great deal more accurate than the other options I've tried. It also works with llama.cpp.

LLaVA is a multimodal language model you ask questions of. If you don't provide a question, then the default is "Describe this picture in detail". But if you have a concrete question, you're likely to get better results. You can also specify the output format, which often works.

(Make sure to use --temp 0.1; the default is far too high.)

It runs very slowly on CPU, but will eventually give you an answer. If you have more than about four or five pictures to caption, you probably want to put as many layers as possible on the GPU. This requires specific compilation options for CUDA; on an M1/M2 Metal support is built in by default, but offloading still needs to be turned on (-ngl 9999).
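
For reference, here is a minimal sketch of that workflow using the llama-cpp-python bindings rather than the CLI; the GGUF and mmproj filenames are placeholders for whatever LLaVA files you have downloaded, and n_gpu_layers=-1 plays the same role as -ngl 9999:

  import base64
  from llama_cpp import Llama
  from llama_cpp.llama_chat_format import Llava15ChatHandler

  def to_data_uri(path):
      # llama-cpp-python accepts images as data URIs in chat messages
      with open(path, "rb") as f:
          return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

  llm = Llama(
      model_path="llava-v1.5-7b.Q4_K_M.gguf",   # placeholder filename
      chat_handler=Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf"),
      n_ctx=2048,        # leave room for the image tokens
      n_gpu_layers=-1,   # offload all layers to the GPU
  )
  resp = llm.create_chat_completion(
      messages=[{
          "role": "user",
          "content": [
              {"type": "image_url", "image_url": {"url": to_data_uri("photo.jpg")}},
              {"type": "text", "text": "Describe this picture in detail."},
          ],
      }],
      temperature=0.1,   # same advice as --temp 0.1 above
  )
  print(resp["choices"][0]["message"]["content"])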


iirc "Interrogate CLIP" is a bit of a misnomer - what it's actually doing is generating a basic caption with BLIP ("a woman holding a pencil"), then iterating over categories and checking with CLIP if any items in those categories are depicted in that image, then concatenating any hits to the resulting caption.

This means the resulting caption is of the form "[BLIP caption], [category1 item], [category2 item], ...". It's very rudimentary.

To clarify: CLIP can tell you if a text label matches an image. It can't generate a caption by itself.
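
To make that concrete, here is a minimal sketch of the label-matching step using Hugging Face transformers; the candidate labels are made up, and in A1111 they would come from its category lists:

  from PIL import Image
  import torch
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  image = Image.open("photo.jpg")
  labels = ["a woman holding a pencil", "a man riding a bicycle", "an empty desk"]

  inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
  with torch.no_grad():
      logits = model(**inputs).logits_per_image   # similarity of the image to each label
  probs = logits.softmax(dim=-1)[0]
  for label, p in zip(labels, probs):
      print(f"{p:.3f}  {label}")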

There are more advanced captioning methods, but I'm not sure if they're exposed in A1111 (I haven't used it in some months).


Thank you for this! I've always been confused about BLIP vs CLIP. That makes a lot of sense, and it explains the weird duplication of a noun I see sometimes, like "A woman woman".


I suggest trying BLIP for this. I've had really good results from that.

https://github.com/salesforce/BLIP

I built a tiny Python CLI wrapper for it to make it easier to try: https://github.com/simonw/blip-caption
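
If you'd rather call BLIP directly from Python, a minimal captioning sketch with the transformers port of the base model looks roughly like this (the image filename is a placeholder):

  from PIL import Image
  from transformers import BlipProcessor, BlipForConditionalGeneration

  processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
  model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

  image = Image.open("photo.jpg").convert("RGB")
  inputs = processor(images=image, return_tensors="pt")
  out = model.generate(**inputs, max_new_tokens=30)
  print(processor.decode(out[0], skip_special_tokens=True))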


Very cool. I used CLIP and VQGAN in a grad school project ~2 years ago, when StyleGAN, StyleCLIP, and similar projects were emerging for controlled image manipulation.



