

I work for a computer vision company. I use CLIP almost every day. Example use cases for which I have used CLIP:

  - Image classification
  - Automated labeling for classification models
  - Image clustering
  - Gathering images for model training that are sufficiently dissimilar from existing samples (see the sketch below)
  - Content moderation
CLIP is also being used widely in new research. SAM-CLIP (https://arxiv.org/abs/2310.15308), which my manager shared with me today, uses CLIP and knowledge distillation to train a new model. I have seen references to CLIP throughout multimodal LLM papers, too, although my knowledge of multimodal model architectures is nascent.
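
As a rough illustration of the clustering / dissimilarity use cases above (not the commenter's actual pipeline), here is a minimal sketch that compares CLIP image embeddings with cosine similarity via the Hugging Face transformers CLIP implementation; the filenames and the 0.9 threshold are placeholders:

  from PIL import Image
  import torch
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  # Placeholder filenames: an image already in the training set and a new candidate.
  images = [Image.open("existing_sample.jpg"), Image.open("candidate.jpg")]
  inputs = processor(images=images, return_tensors="pt")
  with torch.no_grad():
      emb = model.get_image_features(**inputs)   # one embedding per image
  emb = emb / emb.norm(dim=-1, keepdim=True)     # L2-normalise
  cosine = float(emb[0] @ emb[1])                # cosine similarity in [-1, 1]
  print("keep candidate" if cosine < 0.9 else "too similar, skip")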



I have a noob-to-CLIP question. When I've tried to use it to auto-caption photos, the result has been 4-5 words that only vaguely describe the image. Honestly it's usually something like "A woman holding a pencil", sometimes "A woman woman holding a pencil", or just "A woman".

Do different models do better or worse at this? Is this just untuned output? Are there parameters I should be tweaking? Sorry, I'm not able to give much more detail. I'm mostly using it within A1111's "Interrogate CLIP" option, but I've also tried a model I found on Replicate as well as installing it locally. Same results every time.

It seems vaguely useful, but it misses the mark a lot of the time. I'm assuming I'm doing something wrong.


For a task like that, I'd recommend LLaVA instead. It's still inaccurate, but it's a great deal more accurate than the other options I've tried. It also works with llama.cpp.

LLaVA is a multimodal language model you ask questions of. If you don't provide a question, then the default is "Describe this picture in detail". But if you have a concrete question, you're likely to get better results. You can also specify the output format, which often works.

(Make sure to use --temp 0.1; the default is far too high.)

It runs very slowly on CPU, but will eventually give you an answer. If you have more than about four or five pictures to caption, you probably want to put as many layers as possible on the GPU. This requires specific compilation options for CUDA; on an M1/M2 Metal support is built in by default, but offloading still needs to be turned on (-ngl 9999).
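
For reference, here is a minimal sketch of that workflow using the llama-cpp-python bindings rather than the CLI; the GGUF and mmproj filenames are placeholders for whatever LLaVA files you have downloaded, and n_gpu_layers=-1 plays the same role as -ngl 9999:

  import base64
  from llama_cpp import Llama
  from llama_cpp.llama_chat_format import Llava15ChatHandler

  def to_data_uri(path):
      # llama-cpp-python accepts images as data URIs in chat messages
      with open(path, "rb") as f:
          return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

  llm = Llama(
      model_path="llava-v1.5-7b.Q4_K_M.gguf",   # placeholder filename
      chat_handler=Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf"),
      n_ctx=2048,        # leave room for the image tokens
      n_gpu_layers=-1,   # offload all layers to the GPU
  )
  resp = llm.create_chat_completion(
      messages=[{
          "role": "user",
          "content": [
              {"type": "image_url", "image_url": {"url": to_data_uri("photo.jpg")}},
              {"type": "text", "text": "Describe this picture in detail."},
          ],
      }],
      temperature=0.1,   # same advice as --temp 0.1 above
  )
  print(resp["choices"][0]["message"]["content"])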


iirc "Interrogate CLIP" is a bit of a misnomer - what it's actually doing is generating a basic caption with BLIP ("a woman holding a pencil"), then iterating over categories and checking with CLIP if any items in those categories are depicted in that image, then concatenating any hits to the resulting caption.

This means the resulting caption is of the form "[BLIP caption], [category1 item], [category2 item], ...". It's very rudimentary.

To clarify: CLIP can tell you if a text label matches an image. It can't generate a caption by itself.
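
To make that concrete, here is a minimal sketch of the label-matching step using Hugging Face transformers; the candidate labels are made up, and in A1111 they would come from its category lists:

  from PIL import Image
  import torch
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  image = Image.open("photo.jpg")
  labels = ["a woman holding a pencil", "a man riding a bicycle", "an empty desk"]

  inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
  with torch.no_grad():
      logits = model(**inputs).logits_per_image   # similarity of the image to each label
  probs = logits.softmax(dim=-1)[0]
  for label, p in zip(labels, probs):
      print(f"{p:.3f}  {label}")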

There are more advanced captioning methods, but I'm not sure if they're exposed in A1111 (I haven't used it in some months).


Thank you for this! I've always been confused about BLIP vs CLIP. That makes a lot of sense, and it explains the weird duplication of a noun I see sometimes, like "A woman woman".


I suggest trying BLIP for this. I've had really good results from that.

https://github.com/salesforce/BLIP

I built a tiny Python CLI wrapper for it to make it easier to try: https://github.com/simonw/blip-caption
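
If you'd rather call BLIP directly from Python, a minimal captioning sketch with the transformers port of the base model looks roughly like this (the image filename is a placeholder):

  from PIL import Image
  from transformers import BlipProcessor, BlipForConditionalGeneration

  processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
  model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

  image = Image.open("photo.jpg").convert("RGB")
  inputs = processor(images=image, return_tensors="pt")
  out = model.generate(**inputs, max_new_tokens=30)
  print(processor.decode(out[0], skip_special_tokens=True))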


Very cool. I used CLIP and VQGAN in a grad school project ~2 years ago, when StyleGAN, StyleCLIP, and similar projects were emerging for controlled image manipulation.



