Chameleon: Meta’s New Multi-Modal LLM (arxiv.org)
304 points by gabrielbirnbaum 5 months ago | 40 comments



Relevant thread on /r/LocalLLaMA ([1]). A relevant quote from the comments:

> There's a Twitter thread from one of the authors ([2]). This part seems pretty important: "The models in this paper were done training 5 months ago. We've progressed significantly since then."

1. https://www.reddit.com/r/LocalLLaMA/comments/1ctsala/newly_p...

2. https://x.com/ArmenAgha/status/1791275549815648473


Thanks for sharing! It's encouraging to see the authors actively improving their models post-publication. Exciting to witness science in action!


There’s some pretty nice fundamental research in here, and I appreciate the publication very much. What stood out to me is their discussion of the difficulties of using softmax across different tokenization spaces; super interesting analysis (they say different modalities compete by upping their own strength relative to the other modalities, leading to divergence), and the ultimate fix (which I can’t remember right now, and will leave as a tease for the interested paper reader).

They also noted the problem was most pronounced once they got up to the 34B size. It’s a good reminder that training large-scale models surfaces new and interesting problems. I imagine a lot of the techniques and know-how are not published; all those little bits of experience add up to a lot of competitive advantage in many venues, so once again, thanks to Zuck and co. for publishing.
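
(For the curious: one technique commonly used to tame this kind of attention-logit growth is query-key normalization, i.e. LayerNorm the queries and keys before the dot product so no modality can "win" by inflating its norms. A toy sketch below, not claiming this is exactly the paper's recipe.)

    # QK-norm: normalize queries and keys per head before the softmax so the
    # attention logits can't blow up. Toy sketch only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QKNormAttention(nn.Module):
        def __init__(self, dim: int, n_heads: int):
            super().__init__()
            self.n_heads = n_heads
            self.head_dim = dim // n_heads
            self.qkv = nn.Linear(dim, 3 * dim)
            self.out = nn.Linear(dim, dim)
            self.q_norm = nn.LayerNorm(self.head_dim)  # the "QK-norm" part
            self.k_norm = nn.LayerNorm(self.head_dim)

        def forward(self, x):  # x: (batch, seq, dim)
            b, s, _ = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            q = self.q_norm(q.view(b, s, self.n_heads, self.head_dim)).transpose(1, 2)
            k = self.k_norm(k.view(b, s, self.n_heads, self.head_dim)).transpose(1, 2)
            v = v.view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
            attn = F.scaled_dot_product_attention(q, k, v)  # softmax stays well-behaved
            return self.out(attn.transpose(1, 2).reshape(b, s, -1))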


the modality competition was one of my favorite insights, too!


Compared to Mirasol3B [1], this doesn't support audio as a modality. What Google has done with Mirasol3B is what made the "Astra" demo at Google I/O possible. They do a little cheating by converting audio to images (spectrograms) and video to 25 frames, with some sort of attention mechanism over the things that change across those frames. So the tokenizer is basically the same for audio, video, and images.
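
Roughly, the "audio as images" trick looks like this (torchaudio used purely as an illustration; Mirasol3B's actual pipeline differs in detail, and the file name and parameters here are made up):

    # Turn a waveform into a mel-spectrogram so an image-style tokenizer can consume it.
    import torchaudio

    waveform, sample_rate = torchaudio.load("clip.wav")  # hypothetical audio file
    to_mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=128
    )
    spectrogram = to_mel(waveform)  # (channels, n_mels, time): effectively a 2-D image

    # From here the spectrogram can be patchified/quantized with the same image
    # tokenizer used for video frames and stills.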

I believe Meta is going in this direction with multimodality as well. The new GPT voice mode is probably using the same architecture.

What's mind-boggling is that models perform better at the same parameter count when a new modality is added to them!

It seems obvious that 3D is the next modality.

[1] https://arxiv.org/pdf/2311.05698


Am I reading this correctly:

Training time was 4,282,407 GPU-hours. At a conservative 200 W per GPU, that's (4,282,407 × 200) / 1,000,000,000 ≈ 0.86 GWh, call it 1 GWh. At 10c/kWh, that's about $100,000?

So if you have a single equivalent GPU at home, that's ~500 years of training time and $100k in energy costs. Or, in practice, 3,000 GPUs for two months.
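
A quick back-of-envelope check of those figures (the 200 W per GPU and $0.10/kWh are the assumptions above):

    gpu_hours = 4_282_407
    watts_per_gpu = 200          # assumed average draw
    price_per_kwh = 0.10         # assumed electricity price, $/kWh

    energy_kwh = gpu_hours * watts_per_gpu / 1_000   # Wh -> kWh
    print(f"{energy_kwh / 1e6:.2f} GWh")             # ~0.86 GWh
    print(f"${energy_kwh * price_per_kwh:,.0f}")     # ~$86k in electricity
    print(f"{gpu_hours / (24 * 365):.0f} years on one GPU")   # ~489 years
    print(f"{gpu_hours / 3000 / 24:.0f} days on 3,000 GPUs")  # ~59 days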

The AI industry has to hope the world doesn't change fast enough for these models to become useless.

EDIT: price is $100k


Numbers like these really don't bode well for the long-term prospects of open-source models. I doubt the current strategy of waiting expectantly for a corporation to spoonfeed us yet another $100,000 model for free is going to work forever.

That $100k is conservative, too: it doesn't include the cost of buying/renting the hardware, the compute spent on experimental training runs, the cost of data acquisition, labeling, and cleaning, or the cost of RLHF fine-tuning.


> Numbers like these really don't bode well for the long-term prospects of open-source models. I doubt the current strategy of waiting expectantly for a corporation to spoonfeed us yet another $100,000 model for free is going to work forever.

I would add “in their current form” and agree. There are three things that can change here:

1. Moore’s law: The worldwide economy is built around the steady progression of cheaper compute. Give it 36 months and your $100,000 problem becomes a $25,000 problem.

2. Quantization and smaller models: There will likely be specializations of the various models (is this the beginning of the “monolith vs. microservices” debate?).

3. E2E training isn’t for everyone: Fine-tunes and alignment are more important than an end-to-end training run, IF we can coerce the behaviors we want out of the models by fine-tuning them. That, along with quantized models, (imho) unlocked vision models, which are now in the “plateau of productivity” of the Gartner hype cycle compared to a few years ago.

So as an example today, I can grab a backbone and pretrained weights for an object detector, and with relatively little data (from a few lines to a few tens of lines of code, and 50 to 500 images) and relatively little wall-clock time and energy (say 5 to 15 minutes) on a PC, I can create a customized object detector that detects -my- specific objects pretty well. I might need to revise it a few times, but it’ll get there.
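
That workflow really is about this small nowadays. A sketch using torchvision's pretrained Faster R-CNN (the dataloader, class count, and hyperparameters are placeholders):

    import torch
    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    NUM_CLASSES = 3  # background + two custom object classes (assumed)

    # Start from COCO-pretrained weights and swap in a new classification head.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

    optimizer = torch.optim.SGD(model.parameters(), lr=5e-3, momentum=0.9)

    model.train()
    for images, targets in my_dataloader:  # hypothetical loader over ~50-500 labeled images
        loss_dict = model(images, targets)  # detection models return a dict of losses in train mode
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()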

Why would we not see the same sort of progression with transformer architectures? It hinges on someone creating the model weights for the “greater good,” or on us figuring out how to do distributed training for open source in a SETI@home style (long live the blockchain, anyone?).


Yeah, there's no accounting for breakthroughs in training efficiency. I wouldn't count on Moore's Law, though: the amount of compute you can put into these problems is effectively unbounded, so more efficient silicon just means those with money can train even bigger models. 3D rendering is a decent analogy: Moore's Law has made it easy to render something comparable to the first Toy Story movie, but Pixar poured those gains back into more compute and is using it to do things you definitely can't afford to.


I wonder if a kind of SETI@home approach could work - although I'm guessing the limited VRAM of most consumer cards compared to an H100, as well as the much slower "virtual WAN interconnect" versus the Mellanox goodies that NVIDIA clusters enjoy, would be too big an obstacle?


Even if you could get that to work, how many people would be willing to run their >300W GPUs at full tilt 24/7 in order to contribute to the training cause? You would basically be asking people to deal with the logistics of running a cryptocurrency mining operation but without the prospect of getting paid for it.


Depends on the logistics. If I were confident about the security, I wouldn't mind letting my GPU participate in a distributed effort to significantly improve an open source model. This should be a few dollars a month on my power bill, not dozens or hundreds of dollars, especially if I undervolt.

Now, I don't know of any distributed training technique that would make a significant impact on improving a model, and that security component is a big "if". But if something promising comes along, I'd bet lots of people would be willing to donate some GPU time, especially if it were easy to set up.


Things like Petals (https://github.com/bigscience-workshop/petals) exist: distributed computing over willing participants. Right now corporate cash is being rammed into the space, so why not snap it up while you can; but the moment it dries up, projects like Petals will see more of the love they deserve.

I envision a future where crypto-style booms happen over tokens useful for purchasing priority computational time, which is earned by providing said computational time. This way researchers can daisy-chain their independent smaller rigs together into something with gargantuan capabilities.


1 GWh is 1 million kWh; multiplied by $0.10/kWh, that should give $100k in energy costs?


Yes, thanks. I had assumed I was off by a factor somewhere. Still, $100k seems small -- the total cost of production is in the $10M+ range.


$100k is small, but you only get away with $100k if you nail everything perfectly the first time around — something that we all know does not really happen. I think compiling is a good parallel to training: imagine if compiling your whole software project from scratch cost $100k. Sure, there are incremental builds etc., but the cost is steep no matter which way you look at it.


Thanks for the figures. I suppose with expenses like that, they will be motivated to research methods of updating models which have already been trained.

Edit: I see the price was updated


Assuming a $30k GPU with 3-year depreciation, that adds another $1.14/h, which is much more than the energy.
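
Continuing the back-of-envelope estimate above (the $30k price and 3-year straight-line depreciation are assumptions):

    gpu_price = 30_000                      # $ per GPU (assumed)
    amortization_h = 3 * 365 * 24           # 3-year depreciation in hours
    hourly_hw = gpu_price / amortization_h  # ~$1.14/h
    hourly_energy = 0.200 * 0.10            # 200 W at $0.10/kWh -> ~$0.02/h

    print(f"${hourly_hw:.2f}/h hardware vs ${hourly_energy:.2f}/h energy")
    print(f"Hardware share of 4,282,407 GPU-hours: ${hourly_hw * 4_282_407:,.0f}")  # ~$4.9M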


It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.


What's cool (and tough to keep up with) about this wave of tech is just how quickly it moves.

On the plus side, there are a lot of interesting things, and it's generally easy to follow/figure out what they did.

On the minus side it's a little exhausting, and there's so much money in it that it feels like the vast majority of it is grifting. To add to that, the people who are trying to catalog it (the AI guy on LI) are the griftiest of them all.

I've found the best way to keep up is to find one topic you want to learn about, deep-dive and read all the related papers, then explore breadth-first from there until you find another topic...


How do you keep your knowledge after a deep dive? Do you try to use it somehow? I've found that reading a lot usually does not contribute to long-term proficiency in a given topic unless it's followed by a non-trivial amount of practice.


Yeah so I’m lucky that I work in/adjacent to the space so it doesn’t get buried. Otherwise I think it’s near impossible to retain learning without practice.


It's still in the early phase where the bubble is building up. This is necessary if we want a prosperous market. Hopefully, after the bubble bursts, some good companies will remain (which is very likely).


> On the minus side it's a little exhausting, and there's so much money in it that it feels like the vast majority of it is grifting.

It is all grifting. The moment someone creates something that can improve upon itself, there will be an intelligence explosion, and it won’t need press releases or debates about its intelligence. The current path of research will not lead to that; if there were something to be discovered here, it would have been discovered already. It’s just new to the general consumer, and there’s a wow factor associated with it, like crypto and NFTs before. The truth is tech has lost its momentum and is desperate to find a new trick.


The rate of improvement is important. If it's only as intelligent as the average human, the rate of improvement will be slow, very slow compared to what human researchers can currently do.


I think the opposite. There is value in intelligent software, but IMO we’re a long way from AGI. So lots of grifting, but some gold along the way. And it’s intellectually interesting/nuanced (cool math, interesting infra), unlike crypto, which was more of a massive energy-burning waste than anyone likes to admit.


If our standard is cool math, crypto is also full of cool math. (Have you read Vitalik's explanation of Quadratic Arithmetic Programs?) Our standard can't be that low.

https://medium.com/@VitalikButerin/quadratic-arithmetic-prog...


Fair enough. I was more reacting to the concept of a blockchain, which is old news at this point and an extremely inefficient way to do 90% of what people are (we’re) trying to do with it.

Will read!


What does grifting mean to you?


> Recent multimodal foundation models are very widely adopted but still model different modalities separately, often using modality specific encoders or decoders

Is this accurate? I thought, for example, that Gemini Pro used image tokens, and GPT-4o something similar.

> without the need for separate image/text encoders

but then they say they pre-trained two different tokenizers, so maybe they just mean that the tokens go into the same attention layer? But then I thought that is how all the multi-modal stuff was happening already?

Also, two typos: "stabilitize" and "multiplicate".


That seems odd, since I also don't see how this differs from other approaches being published, except that what everyone else calls an image encoder (i.e., some type of pre-trained VAE architecture) they call a tokenizer. The Apple MM1 paper, for example, used ViT-L for its image encoder and then a C-Abstractor for its image tokenizer.


The biggest difference is that existing multimodal models (e.g. GPT-4V and MM1) trained the text model first, and then added in the image component after text training was done ("late fusion"). MM1 learns a projection into the text space, not discrete tokens, and thus cannot generate images.

Other work allows the model to learn the "tokenization" more explicitly during training. That's more similar to Adept's Fuyu architecture, which I am personally a fan of, but it also does not enable generating images.

You can generate images using late fusion as well, though I am not aware of other public work that discloses both early fusion and image generation.
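
Concretely, "early fusion with discrete tokens" boils down to something like the sketch below: quantized image codes share one vocabulary with the text tokens, so a single decoder can both read and emit them (all sizes and names here are illustrative, not the paper's):

    from typing import List

    TEXT_VOCAB_SIZE = 65_536     # ordinary BPE text tokens (illustrative)
    IMAGE_CODEBOOK_SIZE = 8_192  # codes from a learned image quantizer (VQ-style)
    TOTAL_VOCAB = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE

    def interleave(text_ids: List[int], image_codes: List[int]) -> List[int]:
        """Build one token stream for a single autoregressive decoder.

        Because image codes live in the same (shifted) vocabulary, the model
        can also emit them at inference time, i.e. generate images -- unlike a
        late-fusion model that only projects image features into text space.
        """
        shifted = [TEXT_VOCAB_SIZE + c for c in image_codes]
        return text_ids + shifted  # e.g. a caption followed by its image tokens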


Vision-language models use various encoders to project the image into tokens. This is just a way of getting a unified encoder across modalities.


I've only browsed it, but this is really interesting and I'm glad it was published.

I understand why a unified model is an interesting thing to work on, but doesn't the discovery of "modality competition" suggest that, at least in the short term, it might be even better to train specialized models for each modality plus some sort of modality supervisor (a glue-code model)?


'The whole is greater than the sum of its parts' is a very important line of research to investigate.


Does Meta plan to open source these models?


Are they downloadable?


Every third sentence is "the model was not trained on data from Meta's products".


That makes sense. You probably don't want to train your LLM on your uncle's dubious claims about the government, flat-earther content, and the like :)


At least they're clear about provenance, unlike OpenAI.

https://www.reddit.com/r/ChatGPT/comments/1bfa7s3/openai_cto...



