In the demo I put in the Obama prank photo http://karpathy.github.io/2012/10/22/state-of-computer-visio... and asked "Why is this picture funny?" and it responded: "Question: Why is this picture funny? Answer: President Obama is taller than the average person."
Furthermore, the man on the scale is facing the other way and wouldn't know someone is stepping on it. There's an element of theory of mind there: you have to understand that the man on the scale is unaware of Obama's action.
> @karpathy: We tried and it solves it :O. The vision capability is very strong, but I still didn't believe it could be true. The waters are muddied some by a fear that my original post (or derivative work thereof) is part of the training set. More on it later.
Not quite there yet. I've been more impressed with the other new zero-shot multimodal models like Grounding DINO and Azure Dense Captioning. Really looking forward to putting multimodal GPT-4 through its paces as well.
Even at this scale the model's able to answer questions fairly impressively, but I created an image with some distinct shapes in different positions and it didn't go well [0]. I think that however they're doing the image encoding, it doesn't capture positional information, which, to my mind, limits a lot of use cases.
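We don't know exactly what encoder they're using, but here's a minimal sketch with CLIP's ViT (which a lot of these multimodal models build on, so this is just an illustrative stand-in) showing where spatial layout can get collapsed -- the per-patch tokens keep a coarse positional grid, while the pooled output is a single vector:

    import torch
    from PIL import Image
    from transformers import CLIPImageProcessor, CLIPVisionModel

    # CLIP-style ViT encoder; a stand-in for whatever encoder the demo uses
    processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
    model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.new("RGB", (224, 224))  # placeholder for a real test image
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        out = model(**inputs)

    # 1 CLS token + a 7x7 grid of patch tokens: position survives here
    print(out.last_hidden_state.shape)  # torch.Size([1, 50, 768])
    # a single pooled vector: spatial layout is collapsed away
    print(out.pooler_output.shape)      # torch.Size([1, 768])

If the LLM only ever sees something like that pooled vector (or a handful of query tokens distilled from it), "which shape is left of which" is gone before the language model even starts.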
It's not the image embedding; it's the training objective. Image-to-text is simply not a good enough task: it's really lossy, and the datasets are garbage, so it's not very robust.
Most of the parameters are in the language model (LLaMA-7B), so the techniques that let LLaMA run on a single GPU would pretty much carry over -- especially lower-precision tricks. If you only want to run inference/forward passes (no training), it should be pretty doable.
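For example, a minimal sketch with HuggingFace transformers -- the checkpoint path is a placeholder for locally converted weights, and device_map="auto" needs the accelerate package:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    path = "path/to/llama-7b"  # placeholder for locally converted weights
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(
        path,
        torch_dtype=torch.float16,  # half precision: ~14 GB of weights vs ~28 GB at fp32
        device_map="auto",          # let accelerate place layers on the GPU
        # or load_in_8bit=True (needs bitsandbytes) to roughly halve that again
    )

    inputs = tokenizer("The answer is", return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(out[0], skip_special_tokens=True))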
You can almost definitely run it on a consumer GPU if you swap out the language model for something smaller as well (although performance on the language side would definitely suffer).
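Rough weight-only arithmetic (ignoring KV cache and activation overhead) for why a smaller model fits:

    # billions of params * bytes per param = GB of weights (1e9 cancels out)
    for params_b in (7.0, 3.0, 1.3):
        for bytes_per_param, name in ((2, "fp16"), (1, "int8")):
            gb = params_b * bytes_per_param
            print(f"{params_b}B @ {name}: ~{gb:.1f} GB of weights")

So 7B at fp16 (~14 GB) is already marginal on a 16 GB card once you add the cache, while a ~1-3B replacement fits comfortably even at full half precision.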