Hacker News
Sequential modeling enables scalable learning for large vision models (yutongbai.com)
132 points by og_kalu 11 months ago | 58 comments



Upon reading this, my immediate thought is:

It's only a matter of time before we have robots powered by large models pretrained to "predict the next token" across a bunch of different sensory modalities -- sight, sound, smell, touch, taste, etc. in a variety of artificial and natural settings, including social-interaction settings. Learning to read, learning to talk, learning to interact with the physical world, and so on -- all of it could very well be built upon the simple idea of learning to "predict the next token."

We live in interesting times.


These sentiments are pretty close to my own. I read a paper claiming that LLMs are General Pattern Machines[0] and could be used to complete small games in gym environments. It seems to me that if these things really are general pattern machines, then all we have to do is figure out a way to represent any data as a pattern and predict the next step in that pattern, right?

The multi_token[1] project, which lets you take any type of data and turn it into tokens, is pretty interesting and seems to be going in this direction.

I would really like to see a framework where you can take any modality of any type, turn it into a series of tokens, and just cram it into a language model, effectively turning it into a multimodal model with almost no effort.
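
Roughly what I have in mind, sketched in PyTorch (this is my own illustration of the general recipe, not multi_token's actual API; the class name and dimensions are made up):

    import torch
    import torch.nn as nn

    class ModalityProjector(nn.Module):
        # Map features from any pretrained encoder (audio, image, sensor, ...)
        # into "soft tokens" that live in the language model's embedding space.
        def __init__(self, encoder_dim, lm_hidden_dim, tokens_per_item=8):
            super().__init__()
            self.tokens_per_item = tokens_per_item
            self.proj = nn.Linear(encoder_dim, lm_hidden_dim * tokens_per_item)

        def forward(self, encoder_features):
            # encoder_features: (batch, encoder_dim)
            x = self.proj(encoder_features)
            return x.view(x.shape[0], self.tokens_per_item, -1)  # (batch, k, lm_hidden_dim)

Splice those projected embeddings into the LM's input wherever a placeholder token sits, train only the projector (plus maybe LoRA adapters) with the usual next-token loss, and you effectively get a multimodal model without touching the architecture.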

[0] https://general-pattern-machines.github.io/
[1] https://github.com/sshh12/multi_token


If they're looking for a name, The Glass Bead Game isn't a bad start.


Probably my favorite “smart” book of all time


Yeah, this is exactly what I was thinking when building multi_token! Just bottlenecked so far by my own GPU, datasets, and free time.


Thank you thank you thank you! I think it's a wonderful repo; it's a shame I haven't been able to dig deeper into it. I starred your project as soon as I saw it!


That pretty much already exists. Look at DeepMind's Gato: all tasks and modalities are simply sequences of tokens, everything from 'predict English text' to 'predict VAE image token sequences' to 'predict robotic arm commands and movements IRL'.


Ah, yes, I'd forgotten about Gato. Thank you for reminding me. There's so much research activity that the Gato paper feels as if it was published eons ago. There's only so much I can retain in my puny little human mind at once!

In any case, I'm not sure Gato qualifies as a "large" model at 1.2B parameters -- it's kinda right below the threshold at which it could or would start exhibiting emergent behaviors. Maybe a new Gato with tens or hundreds of billions of parameters operating in the physical world?


Yes. Gato was a good proof-of-concept that the Decision Transformer approach of 'just model literally everything as a sequence' scales well, doesn't exhibit some sort of catastrophic interference, can successfully imitation-learn from all the expert datasets, and shows a bit of transfer. But they need to push it at least another OOM or 2 to show major transfer and some emergence, and ideally do both from-scratch learning and additional learning on many new tasks. We continue to wait. :(
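
For anyone who hasn't read the paper, the recipe really is about that simple - here's a toy sketch of the flattening step (my own illustration, not DeepMind's code; the vocabulary sizes and bin counts are made up):

    import numpy as np

    TEXT_VOCAB = 32_000   # assumed text tokenizer size
    IMG_VOCAB = 1_024     # assumed discrete image-patch codebook size
    BINS = 256            # continuous values get discretized into this many bins

    def discretize(x, low=-1.0, high=1.0):
        # Map continuous observations/actions to integer bins
        # (Gato uses mu-law companding; omitted here for brevity).
        x = np.clip(np.asarray(x, dtype=float), low, high)
        return ((x - low) / (high - low) * (BINS - 1)).astype(int)

    def flatten_episode(image_codes, proprio, actions):
        # Interleave per-timestep observation and action tokens into one flat
        # sequence, giving each modality its own slice of a shared vocabulary.
        seq = []
        for img, state, act in zip(image_codes, proprio, actions):
            seq += [TEXT_VOCAB + c for c in img]                               # image tokens
            seq += [TEXT_VOCAB + IMG_VOCAB + b for b in discretize(state)]     # proprioception
            seq += [TEXT_VOCAB + IMG_VOCAB + BINS + b for b in discretize(act)]  # actions
        return seq

Train plain next-token prediction on that stream (masking the loss to the action positions) and the same weights cover Atari, robot arms, captioning, and chat.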

I hope it didn't all get rolled up into Gemini and become a state secret they'll never publish on again, or lost in the shuffle in the chaos of the DeepMind/Brain merger/liquidation.


> ...lost in the shuffle in the chaos of the DeepMind/Brain merger/liquidation

That's the most likely explanation, in my view.


Gato was heavily biased towards tasks in simple simulators, and it didn't exhibit emergent behaviours.


I did that a few years back: https://arxiv.org/abs/2011.11751


With a large model? How many parameters?

See my other comment here:

https://news.ycombinator.com/item?id=38536178


A couple million IIRC. Nothing "large" compared to modern transformer models.


Thanks for getting back to me. That's what I thought. The magic seems to start happening in the low billions of parameters -- and I say "seems" b/c there's no consensus as to whether it's really truly magic! In any case, it's a shame that most of the human brainpower capable of improving SotA AI doesn't have access to large-scale resources.


Oh yeah. The exciting thing is that it's pretty low-hanging fruit (at least for the common modalities).

What would a Large Language Model that can manipulate audio-visual data as expertly as it can manipulate text look like? This is beyond just Text-to-Speech or Captioning and Image Q&A. I think we'll find out very soon.


Yeah, I think this is now in the "we know it can be done, and we know how to do it" stage.

I think we might have audio2audio editing on the level of Stable Diffusion within a year or two, based on recent progress with AudioLM etc.

For short clips, we will probably get video2video editing in the same timeframe. While it is computationally more challenging, it might end up being here before audio editing - because video is so popular in today's social media and marketing landscape.

Usable joint video and audio editing (that is coherent) will probably take longer.


Yepp. I think a robot that can easily follow your commands should be doable in 10-20 years.

I plan to buy a farm when I have the money, and I'm pretty sure that while I will (and want to) do a lot of hands-on renovation and sculpting (park, etc.) myself, in the long term some type of robot should be good and affordable enough to take over when I'm too old.


One can argue there are ~8bln of these robots already roaming the Earth.


Let me guess: someone will train an LLM on stock prices to predict the stock market. It might work about as well as humans have at predicting the market.


Past performance is not a guarantee of future results.


across a bunch of different gradients. senses will be the next step; humans grok those easily and intuitively. once multiple gradients are considered, non-sensory gradients are going to be next.

this is all a bunch of gobbledygook until it isn’t.


Yeah, it's going to happen. We will see and speak with intelligent machines.


I'm defending my thesis next Tuesday, and the advances in AI over the last couple of years have already made most of it obsolete :/

Anyway, I'm excited and looking forward to the code and models being released; hopefully I can use them for my research! I think it's easy to overlook how revolutionary the transformer "way" of doing things has been, and the fact that so many different tasks can be reformulated in a "language" way hints, I believe, at something deeper about how the universe, our minds, and language work.


If this paper is coming out of BAIR with at most a 3B-parameter model, I suspect we'll quickly see much larger models from the industrial players. Hopefully Mistral takes an interest and releases an OSI-licensed model.


I'd simply been thinking about Large Vision Models in an annotation sense: Q&A, captioning... that sort of thing.

Even though it makes so much sense, I never thought about it like this. Inpainting, Object Detection, Rotation, Lighting, Segmentation, Edge Detection, Pose Estimation, Surface Normal, Colorization and much more achieved by a single model.
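
The mechanism behind all of those, as I understand it, is just visual prompting over token sequences - a rough sketch of the idea (not the authors' code; the tokenizer and model objects here are placeholders):

    def visual_prompt(tokenizer, lvm, example_pairs, query_image, tokens_per_image=256):
        # Build a "visual sentence": [input, output, input, output, ..., query],
        # where every image is a fixed-length list of discrete VQGAN-style tokens.
        seq = []
        for inp, out in example_pairs:        # e.g. (photo, segmentation mask) pairs
            seq += tokenizer.encode(inp) + tokenizer.encode(out)
        seq += tokenizer.encode(query_image)
        # The autoregressive model continues the sequence; decoding the
        # continuation back to pixels gives the "answer" for whatever task
        # the example pairs demonstrated.
        pred_tokens = lvm.generate(seq, max_new_tokens=tokens_per_image)
        return tokenizer.decode(pred_tokens)

Swap the example pairs and the same frozen model does segmentation, inpainting, rotation, colorization, and so on.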

I believe this and Codi-2(https://codi-2.github.io/) offer a glimpse of the future of Large Multimodal Models.


One of the things sci-fi really seemed to get right is that we will have AGI long before we'll have agreement that it is actually AGI.

People will keep finding some small case or reason not to call it AGI. And then finally, once that last case is knocked down and we have agreement on a definition, we'll realize we crossed that threshold a "long" while back.

And I'm not saying we have AGI now, just that it's now clear to me how this process will play out.

(Where "long" in AI development timeliness probably doesn't mean the same thing "long" meant even in the 2010s.)


I think the last hurdles will be sentience and a theory of mind. Theory of mind is probably something that an LM could predict given enough data, but sentience? I don't think we understand enough about our own sentience for us to create it in a machine.


> I don't think we understand enough about our own sentience for us to create it in a machine.

Invention preceding understanding is the norm, not the exception. We created fire before understanding chemistry, and we constantly use pharmaceuticals without really understanding how they work. Invention first, then theory comes along to explain and generalize.

For all we know, sentience is a necessary side-effect of semantic processing of any kind, in which case LLMs already have a form of sentience.

So yes, you're right that we don't understand our own sentience. In fact, we understand so little that it could literally be staring us in the face right now and we don't realize it.


It goes back to the old question of whether matter is fundamental and consciousness emerges from it or the other way around. My inclination is the latter, by way of Descartes. "I think therefore I am." If you can think "I am me" then you know with certainty that your first person identity exists but all other information that comes to you through your senses will always be an incomplete picture of the material universe. I think this is the strongest argument why consciousness precedes matter (or pervades through all matter, depending how you look at it). If sentience is a fundamental property of the universe, the way I see it, all matter shares the same soul. That makes it perfectly logical for a machine that mimics the pathways of the brain to behave just like a brain, to me.


It may be that we find out that sentience is not as magical or unique as we originally thought.

Some people claimed that "creativity" (another rather nebulous term) was a uniquely human trait, and that machines could never perform tasks that require it. But recent generative AI models have started to make people question that position.


LLMs can already predict theory of mind in text and keep good track of character motivations and knowledge in a story.


Neither of those are required for intelligence.


So this is artificial general intelligence, right?


The problem with the phrase "artificial general intelligence" is that everyone is arguing about the definition of all three words, and has a different threshold for the boolean pass/fail boundary.


General Intelligence is a gradient, not a hard on/off. Obviously these are machines, so they're artificial. They're certainly not narrow in scope or abilities, so they're general. And they perform tasks we consider intelligent. So... sure. Like ENIAC, I imagine we'll build (or will already have built) AGI well before everyone can agree it is so.


Very odd that they don't do linear probing on the model features and measure ImageNet validation accuracy. In general, this paper is missing comparisons to existing vision models on existing benchmarks.

The visual prompting is a neat trick and it's great to see scale continue to work, but without comparisons to other models or releasing the code/weights, I don't think this strategy is going to be competitive.
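
(For anyone unfamiliar: linear probing just means freezing the pretrained backbone and training a single linear layer on its features, then reporting e.g. ImageNet validation accuracy. A minimal PyTorch sketch of that standard recipe - nothing from the paper itself:)

    import torch
    import torch.nn as nn

    def linear_probe(backbone, train_loader, feat_dim=1024, num_classes=1000, epochs=10):
        backbone.eval()                                  # backbone features stay frozen
        head = nn.Linear(feat_dim, num_classes)
        opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for images, labels in train_loader:
                with torch.no_grad():
                    feats = backbone(images)             # (batch, feat_dim)
                loss = loss_fn(head(feats), labels)
                opt.zero_grad()
                loss.backward()
                opt.step()
        return head                                      # evaluate on the ImageNet val split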


Do captchas have a future? It seems inevitable that AI will beat humans on captcha real soon (if not already). What's next?


I should patent this idea, but here it goes anyway: in the future, captchas will consist of requesting that you make an antisemitic or misogynist remark to prove you are human, since the bots will be held to higher moral standards than man.


This is a completely silly comment, but I will admit that it made me chuckle just because of how incredibly random and shocking it was, especially here on HN, lol


Obviously, not all bots will be held to such standards - those are the standards that apply to western corporations offering free bots for PR and marketing purposes, since 'politically incorrect' behavior stunts that goal. However, anyone running their own bot (e.g. a spammer wanting to solve millions of captchas) has no problem running an 'uncensored' bot - right now you can get reasonably large open-source models without the RLHF fine-tuning, and thus without any attempt at 'moral standards' other than those you choose to put in when fine-tuning the bot for your purposes.

It's relatively trivial to flip the sign on that training data and have a bot that will instead refuse to make antifascist or feminist arguments; if someone wants a Hitlerbot to ghostwrite "My struggle" for them, there's nothing that could prevent them from finetuning such a model from one of the publicly available models; there is no one that can enforce any 'moral standards' on the bots other than their creators.


I don't know about you, but I use ChatGPT to solve captchas because I can't.


Considering this, the only remaining captcha will be cold, hard money. Paywalls.


Hypothesis: Intelligence is prediction?


That intuitively makes sense to me for everything except higher-level goal selection. If someone works at an NGO instead of FAANG because they want to do good for the world, how can that be explained in terms of prediction? If someone minimizes consumption to reduce global warming, how did prediction result in them caring about global warming? I picked those examples on purpose, because they run against some core instincts. Maybe they’re predicting what maximizes peer approval, which optimizes other outcomes too? (Jk :D )


You are right, prediction doesn’t provide one with base motivations or drives. But those seem fundamentally different than raw intelligence.

Your examples aren't an issue, I don't think. To take one: efforts to minimize one's consumption to combat global warming could fall in the "help tribe survival" bucket, which makes evolutionary sense as a base drive. And it seems like a reasonably "intelligent" response to that base drive -- the predicted results seem positive.


You might enjoy 'On Intelligence' by Jeff Hawkins to learn more about this hypothesis. (It's an older book / theory at this point but still worth reading IMHO)


Yup, known as predictive coding / the free energy principle (https://en.wikipedia.org/wiki/Predictive_coding), one of the leading theories of human consciousness:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6363942/


Which is very close to saying that intelligence is learning [from mis-prediction] - a subject of cybernetics, with common machinery across scales and instances. Using a model (the brain in the case of animals, a model/embedding in the case of a computer, DNA in the case of evolution through natural selection), a prediction is made (which, in the case of evolution, is the next generation), and based on the error of that prediction the model is updated before the next prediction is generated (in the case of evolution, the DNA of the best fit is the model used in the next iteration).
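
The core loop is tiny - a toy numerical sketch of that predict -> error -> update cycle (my own illustration, not from the linked papers; a linear predictor stands in for whatever the "model" happens to be):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 4)) * 0.01      # the "model": weights of a linear predictor
    lr = 0.01

    def step(W, obs, next_obs):
        pred = W @ obs                       # predict the next observation
        err = next_obs - pred                # prediction error ("surprise")
        W += lr * np.outer(err, obs)         # update the model to shrink future error
        return W, err

Whether the model is synaptic weights, an embedding, or DNA, the loop is the same: predict, measure the error, update, repeat.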


Or prediction is intelligence.

I think we need a more rigorous definition to go beyond tautology.


Yep, prediction of what needs to be done.


Next step, accept user input from VR glasses and we've basically got a holodeck.


Would have been neat to see some animations since it is video frames in many cases.


The "add a frame to a video" use-cases are probably the least exciting here, the image annotation capabilities seem to me the bigger deal.


Has anyone tried using transformers on weather forecasts yet?



I would have loved to see videos of the completions on the blog post.


So this can solve visual analogies from IQ tests?



