Sequential modeling enables scalable learning for large vision models

cs702 · 2023-12-05T16:28:10 1701793690

Upon reading this, my immediate thought is:

It's only a matter of time before we have robots powered by large models pretrained to "predict the next token" across a bunch of different sensory modalities -- sight, sound, smell, touch, taste, etc. in a variety of artificial and natural settings, including social-interaction settings. Learning to read, learning to talk, learning to interact with the physical world, and so on -- all of it could very well be built upon the simple idea of learning to "predict the next token."

We live in interesting times.

raidicy · 2023-12-05T17:00:29 1701795629

These sentiments are pretty close to my own. I read a paper that claimed that llms are General Pattern Machines and could be used to complete small games in gym environments. It seems to me that if these things really are General Pattern Machines all we have to do is figure out a way to represent any data as a pattern and try and predict the next step in the pattern right?

The multi token[1] project which allows you to take any type of data and turn it into a token it's pretty interesting and seems like it's going in this direction.

I would really like to see a framework where you can take any modality of any type turn it into a series of tokens and just cram it into a language model and effectively turning into a multimodal model with almost no effort.

[0]https://general-pattern-machines.github.io/ [1] https://github.com/sshh12/multi_token

kelseyfrog · 2023-12-05T21:24:02 1701811442

If they're looking for a name, The Glass Bead Game isn't a bad start.

texas2toss · 2023-12-05T23:06:06 1701817566

Probably my favorite “smart” book of all time

sshh12 · 2023-12-06T00:47:33 1701823653

Yeah this is exactly what I was thinking when building multi_token! Just bottlenecked so far by my own GPU, datasets, and freetime

raidicy · 2023-12-06T12:11:17 1701864677

Thank you thank you thank you! I think it's a wonderful repo it's a shame I haven't been able to dig deeper into it. I starred your project as soon as I saw it!

gwern · 2023-12-05T19:02:07 1701802927

That pretty much already exists. Look at DeepMind's Gato: all tasks and modalities are simply sequences of tokens, everything from 'predict English text' to 'predict VAE image token sequences' to 'predict robotic arms commands and movements IRL'.

cs702 · 2023-12-05T20:00:25 1701806425

Ah, yes, I'd forgotten about Gato. Thank you for reminding me. There's so much research activity that the Gato paper feels as if it was published eons ago. There's only so much I can retain in my puny little human mind at once!

In any case, I'm not sure Gato qualifies as a "large" model with 1.2B parameters -- it's kinda right below the threshold at which it could or would start exhibiting emergent behaviors. Maybe a new Gato with 10's or 100's of billions of parameters operating in the physical world?

gwern · 2023-12-05T22:16:37 1701814597

Yes. Gato was a good proof-of-concept that the Decision Transformer approach of 'just model literally everything as a sequence' scales well and doesn't exhibit some sort of catastrophic interference and can successfully imitation-learn from all the expert datasets, and a bit of transfer. But they need to push it at least another OOM or 2 to show major transfer, some emergences, and ideally do both from-scratch learning and additional learning on many new tasks. We continue to wait. :(

I hope it didn't all get rolled up into Gemini and become a state secret they'll never publish on again, or lost in the shuffle in the chaos of the DeepMind/Brain merger/liquidation.

cs702 · 2023-12-05T23:05:57 1701817557

> ...lost in the shuffle in the chaos of the DeepMind/Brain merger/liquidation

That's the most likely explanation, in my view.

jerpint · 2023-12-05T19:29:04 1701804544

GATO was heavily biased towards tasks in a simple simulator, but didn’t exhibit emergent behaviours

jefft255 · 2023-12-05T20:07:27 1701806847

I did that a few years back: https://arxiv.org/abs/2011.11751

cs702 · 2023-12-05T20:25:40 1701807940

With a large model? How many parameters?

See my other comment here:

https://news.ycombinator.com/item?id=38536178

jefft255 · 2023-12-06T15:13:57 1701875637

A couple millions IIRC. Nothing "large" compared to modern transformer models.

cs702 · 2023-12-07T14:17:07 1701958627

Thanks for getting back to me. That's what I thought. The magic seems to start happening in the low billions of parameters -- and I say "seems" b/c there's no consensus as to whether it's really truly magic! In any case, it's a shame that most of the human brainpower capable of improving SotA AI doesn't have access to large-scale resources.

og_kalu · 2023-12-05T17:15:37 1701796537

Oh yeah. The exciting thing is that is pretty low hanging fruit (at least for the common modalities)

What would a Large Language Model that can manipulate audio-visual data as expertly as it can manipulate text look like ? This is beyond just Text to Speech or Captioning and Image Q&A. I think we'll find out very soon.

jononor · 2023-12-06T10:44:36 1701859476

Yeah, I think this is now in the "we know it can be done, and how to do it". And also som

I think we might have audio2audio editing on the level of Stable Diffusion within a year or two. Based on recent progress with AudioLM etc.

For short clips, we will probably get video2video editing in the same timeframe. While it is computationally more challenging, it might end up being here before audio editing - because video is so popular in todays social media and marketing landscape.

Usable joint video and audio editing (that is coherent) will probably take longer.

Kelkonosemmel · 2023-12-05T21:29:37 1701811777

Yepp. I think a robot who can easily follow your commands should be doable in 10-20 years.

I plan to buy a farm when I have the money and I'm pretty sure while I will/want to do a lot of hands on renovation and sculpting (park, etc) long term some type of robot should be good and affordable enough to take over when I'm too old.

dottedmag · 2023-12-05T20:09:43 1701806983

One can argue there are ~8bln of these robots already roaming the Earth.

ww520 · 2023-12-05T20:48:50 1701809330

Let me guess. Someone will train a LLM on the stock prices to predict the stock market. It might work as well as how human has predicted the market.

astrange · 2023-12-05T23:00:13 1701817213

Past performance is not a guarantee of future results.

catchnear4321 · 2023-12-05T16:33:19 1701793999

across a bunch of different gradients. senses will be the next step, humans grok that good and easy. once multiple gradients are considered, non-sensory gradients are going to be next.

this is all a bunch of gobbledygook until it isn’t.

ldjkfkdsjnv · 2023-12-05T19:12:46 1701803566

Yeah its going to happen. We will see and speak with intelligent machines

agarsev · 2023-12-05T16:41:49 1701794509

I'm reading my thesis next tuesday, and the advances in AI the last couple of years have already made most of it obsolete :/

Anyway I'm excited and looking forward to the code and models to be released, hopefully I can use them for my research! I think it's easy to overlook how revolutionary the transformer "way" of doing things has been, and the fact that so many different tasks can be reformulated in a "language" way I believe hints at something deeper about how the universe, our minds and language work.

iandanforth · 2023-12-05T20:47:13 1701809233

If this paper is coming out of BAIR at a max 3B parameter model I suspect we'll quickly see much larger models from the industrial players. Hopefully Mistral takes an interest and releases an OSI licensed model.

og_kalu · 2023-12-05T14:14:21 1701785661

I'd simply been thinking about Large Vision Models in an annotation sense. QandA, Captioning...That sort of thing.

Even though it makes so much sense, I never thought about it like this. Inpainting, Object Detection, Rotation, Lighting, Segmentation, Edge Detection, Pose Estimation, Surface Normal, Colorization and much more achieved by a single model.

I believe this and Codi-2(https://codi-2.github.io/) offer a glimpse of the future of Large Multimodal Models.

BoiledCabbage · 2023-12-05T16:21:20 1701793280

One of the things sci-fi really seemed to get right is that we will have AGI long before we'll have agreement that it is actually AGI.

People will keep finding some small case or reason why not to call it AGI. And then finally once that last case is knocked down and we have agreement on a definition, we'll realize we crossed that threshold a "long" while back.

And I'm not saying we have AGI now, just that it's now clear to me how this process will play out.

(Where "long" in AI development timeliness probably doesn't mean the same thing "long" meant even in the 2010s.)

MorbidCuriosity · 2023-12-06T01:49:59 1701827399

I think the last hurdles will be sentience and a theory of the mind. Theory of the mind is probably something that a LM could predict given enough data, but sentience? I don't think we understand enough about our own sentience for us to create it in a machine.

naasking · 2023-12-06T05:22:55 1701840175

> I don't think we understand enough about our own sentience for us to create it in a machine.

Invention preceding understanding is the norm, not the exception. We created fire before understanding chemistry, and we constantly use pharmaceuticals without really understanding how they work. Invention first, then theory comes along to explain and generalize.

For all we know, sentience is a necessary side-effect of semantic processing of any kind, in which case LLMs already have a form of sentience.

So yes, you're right that we don't understand our own sentience. In fact, we understand so little that it could literally be staring us in the face right now and we don't realize it.

kenmorechalfant · 2023-12-06T04:05:07 1701835507

It goes back to the old question of whether matter is fundamental and consciousness emerges from it or the other way around. My inclination is the latter, by way of Descartes. "I think therefore I am." If you can think "I am me" then you know with certainty that your first person identity exists but all other information that comes to you through your senses will always be an incomplete picture of the material universe. I think this is the strongest argument why consciousness precedes matter (or pervades through all matter, depending how you look at it). If sentience is a fundamental property of the universe, the way I see it, all matter shares the same soul. That makes it perfectly logical for a machine that mimics the pathways of the brain to behave just like a brain, to me.

jononor · 2023-12-06T10:51:19 1701859879

It may be that we found out that sentience is not as magical or unique as we originally thought.

Some people claimed that "creativity" (another rather nebulous term) was a uniquely human trait, and machines could perform tasks that require this. But recent generative AI models have started to make people question that position.

og_kalu · 2023-12-06T03:52:57 1701834777

LLMs can already predict theory of mind in text and keep good track of character motivations and knowledge in a story.

fisf · 2023-12-06T12:12:03 1701864723

Neither of those are required for intelligence.

kaibee · 2023-12-05T15:44:47 1701791087

So this is artificial general intelligence, right?

ben_w · 2023-12-05T15:48:28 1701791308

The problem with the phrase "artificial general intelligence" is that everyone is arguing about the definition of all three words, and has a different threshold for the boolean pass/fail boundary.

og_kalu · 2023-12-05T17:45:08 1701798308

General Intelligence is a gradient, not a hard on/off. Obviously, these are machines so artificial. They're certainly not narrow in scope or abilities so general. They perform tasks we consider intelligent. So...sure. Like ENIAC, i imagine we'll build or have built agi well before everyone can agree it is so.

samstevens · 2023-12-06T04:21:19 1701836479

Very odd that they don't do linear probing on the model features and measure ImageNet validation accuracy. In general, this paper is missing comparisons to existing vision models on existing benchmarks.

The visual prompting is a neat trick and it's great to see scale continue to work, but without comparisons to other models or releasing the code/weights, I don't think this strategy is going to be competitive.

bambax · 2023-12-05T16:58:36 1701795516

Do captchas have a future? It seems inevitable that AI will beat humans on captcha real soon (if not already). What's next?

jazzyjackson · 2023-12-05T19:55:44 1701806144

I should patent this idea but here it goes anyway: in the future Captcha's will consist of requesting you make an antisemitic or misogynist remark to prove you are human, since the bots will be held to higher moral standards than man.

cooper_ganglia · 2023-12-05T20:33:42 1701808422

This is a completely silly comment, but I will admit that it made me chuckle just because of how incredibly random and shocking it was, especially here on HN, lol

PeterisP · 2023-12-05T21:21:23 1701811283

Obviously, not all bots will be held to such standards - these are the standards which apply to western corporations who offer free bots for PR and marketing purposes, since 'politically incorrect' behavior stunts that goal, however, anyone keeping their own bot (e.g. a spammer wanting to solve millions of captchas) has no problems running an 'uncensored' bot - right now you can get reasonably large open source models without the RLHF pretraining and thus also no attempt at 'moral standards' other than those you choose to put in when finetuning the bot for your purposes.

It's relatively trivial to flip the sign on that training data and have a bot that will instead refuse to make antifascist or feminist arguments; if someone wants a Hitlerbot to ghostwrite "My struggle" for them, there's nothing that could prevent them from finetuning such a model from one of the publicly available models; there is no one that can enforce any 'moral standards' on the bots other than their creators.

jksk61 · 2023-12-05T18:01:51 1701799311

I don't know about you but I use chat gpt to solve captcha because I can't.

swfsql · 2023-12-05T21:16:35 1701810995

Considering this, the only remaining captcha will be hard cold money. Paywalls.

dwaltrip · 2023-12-05T20:09:57 1701806997

Hypothesis: Intelligence is prediction?

keenmaster · 2023-12-06T01:41:06 1701826866

That intuitively makes sense to me for everything except higher-level goal selection. If someone works at an NGO instead of FAANG because they want to do good for the world, how can that be explained in terms of prediction? If someone minimizes consumption to reduce global warming, how did prediction result in them caring about global warming? I picked those examples on purpose, because they run against some core instincts. Maybe they’re predicting what maximizes peer approval, which optimizes other outcomes too? (Jk :D )

dwaltrip · 2023-12-06T01:47:12 1701827232

You are right, prediction doesn’t provide one with base motivations or drives. But those seem fundamentally different than raw intelligence.

Your examples aren’t an issue I don’t think. To take one: efforts to minimize one’s consumption to combat global warming could fall in the “help tribe survival” bucket, which makes evolutionary sense as a base drive. And it seems like a reasonably “intelligent” response to that base drive — the predicted results seems positive.

iandanforth · 2023-12-05T20:40:28 1701808828

You might enjoy 'On Intelligence' by Jeff Hawkins to learn more about this hypothesis. (It's an older book / theory at this point but still worth reading IMHO)

stevenhuang · 2023-12-05T23:10:59 1701817859

Yup, known as predictive coding/free energy principle https://en.wikipedia.org/wiki/Predictive_coding as one of leading theories for human consciousness

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6363942/

trhway · 2023-12-06T04:08:46 1701835726

which is very close as to say that intelligence is learning [from mis-prediction], a subject of cybernetics - common machinery across the scales and instances - using a model (brain in case of animals, model/embedding in case of computer, DNA in case of evolution through natural selection) a prediction (which for example in case of evolution is the next generation) is made, and based on the error of that prediction the model is updated for the generation of the next prediction (in case of evolution the DNA of the best fit is the model to be used in the next iteration).

romesc · 2023-12-06T00:58:17 1701824297

Or prediction is intelligence.

I think we need a more rigorous definition to go beyond tautology

red75prime · 2023-12-05T20:39:44 1701808784

Yep, prediction of what needs to be done.

fancyfredbot · 2023-12-05T20:12:19 1701807139

Next step, accept user input from VR glasses and we've basically got a holodeck.

toxik · 2023-12-05T15:43:12 1701790992

Would have been neat to see some animations since it is video frames in many cases.

mpeg · 2023-12-05T16:51:54 1701795114

The "add a frame to a video" use-cases are probably the least exciting here, the image annotation capabilities seem to me the bigger deal.

mickdarling · 2023-12-05T15:25:33 1701789933

Has anyone tried using transformers on weather forecasts yet?

ben_w · 2023-12-05T15:41:27 1701790887

Yes, several.

• https://arxiv.org/pdf/2106.14742.pdf

• https://rmets.onlinelibrary.wiley.com/doi/full/10.1002/met.2...

• https://ieeexplore.ieee.org/document/9671442

Also non-transformer models, because of the scaling complexity on input: https://arxiv.org/pdf/2212.12794.pdf

jerpint · 2023-12-05T19:29:28 1701804568

I would have loved to see videos on the blog post of completions

mola · 2023-12-05T19:06:21 1701803181

So this can solve visual analogies from iq tests?