New algorithm discovers language just by watching videos (csail.mit.edu)
167 points by geox 11 days ago | 79 comments





Given that blind people can learn to speak, audio alone must be enough to learn language. And given that deaf people can learn sign language, video alone must also be enough to learn language. That’s assuming that touch and emotion aren’t crucial to language learning.

Given Helen Keller's grasp of language, touch alone must be enough to learn language.

I've often wondered if there aren't some structures in the brain, selected for since the advent of language, that are good at picking up languages.

This is one of the most discussed and argued ideas in all of linguistics and philosophy. https://plato.stanford.edu/entries/innateness-language/

Indeed. I'm reminded of the time a childhood friend essentially discovered the inverted spectrum argument[1]. That is, we can't know that my qualia when perceiving the color red don't match yours when perceiving the color blue.

We were unfortunately young and in poor company, so the idea didn't receive the appropriate attention.

[1]: https://en.wikipedia.org/wiki/Qualia#Inverted_spectrum_argum...


I think Carl Sagan discusses this topic in "Broca's Brain".

https://en.wikipedia.org/wiki/Broca's_Brain


I think what you're missing is deliberate action and feedback. We don't just listen or watch as if the world was a movie. We act and communicate intentionally and with purpose. The response to our deliberate actions is in my view what we mostly learn from.

Blind people surely compensate for the lack of visual information by deliberately eliciting audio and tactile feedback that others don't need.

Also, watching others interact with the world is never the same thing as interacting with the world ourselves, because there's a crucial piece of information missing.

When we decide to act, we know that we just made that decision. We know when we made the decision, why we made it and what we wanted to achieve. We can never know that for sure when we observe others.

We can guess of course, but a lot of that guessing is only possible because we know how we would act and react ourselves. A machine that has never intervened and deliberately elicited feedback cannot even guess properly. It will need incredible amounts of data to learn from correlations alone.


Different kinds of experiences, but highly correlated.

Emotion is crucial, for sure.



It's a stretch to call this discovering language. It's learning the correlations between sounds, spoken words and visual features. That's a long way from learning language.

No it isn't, this is exactly what babies do.

Babies don't know language. They know some words. Dogs, too, recognize some words associated with what they see. Have they discovered language too?

No one knows what babies do. Anything else is speculation.

Babies do a lot more. They have feelings, like pain and pleasure, and different varieties of them. Then they have a lot of things hardcoded. And they have control of their bodies and the environment; for example, they quickly learn that crying helps them get what they want.

They do, and from it they learn many other things even before language, including some non-verbal expectation of what is likely to happen next, and that they have some ability to participate in the external world. Until recently, we have not seen anything that has gained some competence in language without having these precursors.

> including some non-verbal expectation of what is likely to happen next

Some of that isn't learned at all (other than through evolution). A newborn will flinch back from a growing circle on a screen.


Indeed - a half-billion years of training since the first neurons appeared - but here I'm thinking more of the sort of understanding which leads, for example, to reactions of surprise to incongruous outcomes.

https://news.mit.edu/2011/infant-cognition-0527


Ok, it’s a lot more like a Boltzmann brain in Plato’s cave learning what language is with zero other external context of what reality is.

None of which has much to do with learning language.

Actually the feedback part of learning is important. There's a famous experiment with cats in baskets that demonstrated that.

But AI isn't an animal so the same constraints don't necessarily apply. I think you'd have to have a particularly anti-AI & pedantic bent to complain about calling this language discovery.


Feedback, yes. Feedback from interacting with the physical reality by means of physical body with flexible appendages? Useful in general, but neither necessary nor sufficient in case of learning language.

Feedback is fundamental to deep neural networks; it's how they're trained. And to be honest, all of the things 'astromaniak mentions can be simulated and made part of the training data set too. While the "full experience" may turn out to be necessary for building AGI, the amount of success the field has had with LLMs indicates that it's not necessary for a model to learn language (or even to learn all languages).


A lot of a baby's language pickup is also based on what other people do in response to their attempts, linguistically and behaviourally. Read-only observation is obviously a big part of it, but it's not "exactly" the same.

Right. But that happens too with ML models during training - the model makes a prediction given a training example input, which is evaluated against the expected output, rinse, repeat. A single example is very similar to doing something in response to a stimulus and observing the reaction. Here it's more like predicting a reaction and getting feedback on accuracy, but that's part of our learning too - we remember things that surprise us, tune out those we can reliably predict.
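To make that concrete, here's a minimal sketch of that predict-evaluate-update loop in PyTorch (the model, shapes, and data are stand-ins, purely for illustration):

    import torch
    import torch.nn as nn

    # Stand-in data: 100 (stimulus, expected reaction) pairs.
    dataset = [(torch.randn(32), torch.randn(10)) for _ in range(100)]

    model = nn.Linear(32, 10)                     # toy "brain"
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for stimulus, reaction in dataset:
        prediction = model(stimulus)              # the model predicts a reaction
        loss = loss_fn(prediction, reaction)      # feedback: how far off was it?
        optimizer.zero_grad()
        loss.backward()                           # propagate the error signal
        optimizer.step()                          # adjust weights; rinse, repeat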

I don’t think that you are wrong, but what do you mean by language that can’t be covered by p(y|x)?

I think he's wrong, by the same logic. If sound, visual and meaning aren't correlated, what can language be? This looks like an elegant insight by Mr Hamilton, and very much in line with my personal prediction that we're going to start encroaching on a lot of human-like AI abilities once it's usual practice to feed models video data.

There is a chance that abstract concepts can't be figured out visually ... although that seems outrageously unlikely since all mathematical concepts we know of are communicated visually and audibly. If it gets more abstract than math that'll be a big shock.


Linguistic relativity: https://en.m.wikipedia.org/wiki/Linguistic_relativity :

> The idea of linguistic relativity, known also as the Whorf hypothesis, [the Sapir–Whorf hypothesis], or Whorfianism, is a principle suggesting that the structure of a language influences its speakers' worldview or cognition, and thus individuals' languages determine or influence their perceptions of the world.

Does language fail to describe the quantum regime, for which we have little intuition? Verbally and/or visually, sufficiently describe the outcome of a double-slit photonic experiment onto a fluid?

Describe the operator product of (qubit) wave probability distributions and also fluid boundary waves with words? Verbally or visually?

I'll try: "There is diffraction in the light off of it and it's wavy, like <metaphor> but also like a <metaphor>"

> If it gets more abstract than math

There is a symbolic mathematical description of a [double slit experiment onto a fluid], but then sample each point in a CFD simulation and we're back to frequentist sampling (and not yet a sufficiently predictive description of a continuum of complex reals).

Even without quantum or fluids to challenge language as a sufficient abstraction, mathematical syntax is already known to be insufficient even to describe all Church-Turing programs.

Church-Turing-Deutsch extends Church-Turing to cover quantum logical computers: any qubit/qudit/qutrit/qnbit system is sufficient to simulate any other such system, but there is no claim to sufficiency for universal quantum simulation. When we restrict ourselves to the operators defined in modern-day quantum logic, such devices are sufficient to simulate (or emulate) any other such devices; but observe that real quantum physical systems do not operate as closed systems with intentional reversibility the way QC does.

For example, there is a continuum of randomness in the quantum foam that is not predictable by, and thus not describable by, any Church-Turing-Deutsch program.

Gödel's incompleteness theorems: https://en.wikipedia.org/wiki/G%C3%B6del's_incompleteness_th... :

> Gödel's incompleteness theorems are two theorems of mathematical logic that are concerned with the limits of provability in formal axiomatic theories. These results, published by Kurt Gödel in 1931, are important both in mathematical logic and in the philosophy of mathematics. The theorems are widely, but not universally, interpreted as showing that Hilbert's program to find a complete and consistent set of axioms for all mathematics is impossible.

ASM (Assembly Language) is still not the lowest-level representation of code before electrons that don't split 0.5/0.5 at a junction without diode(s) and error correction; translate ASM to mathematical syntax (LaTeX and ACM algorithmic publishing style) and see if there's added value.


> Even without quantum or fluids to challenge language as a sufficient abstraction, mathematical syntax is already known to be insufficient even to describe all Church-Turing programs.

"When CAN'T Math Be Generalized? | The Limits of Analytic Continuation" by Morphocular https://www.youtube.com/watch?v=krtf-v19TJg

Analytic continuation > Applications: https://en.wikipedia.org/wiki/Analytic_continuation#Applicat... :

> In practice, this [complex analytic continuation of arbitrary ~wave functions] is often done by first establishing some functional equation on the small domain and then using this equation to extend the domain. Examples are the Riemann zeta function and the gamma function.

> The concept of a universal cover was first developed to define a natural domain for the analytic continuation of an analytic function. The idea of finding the maximal analytic continuation of a function in turn led to the development of the idea of Riemann surfaces.

> Analytic continuation is used in Riemannian manifolds, solutions of Einstein's [GR] equations. For example, the analytic continuation of Schwarzschild coordinates into Kruskal–Szekeres coordinates. [1]

But Schwarzschild's regular boundary does not appear to correlate to limited modern observations of such "Planck relics in the quantum foam", which could have [stable flow through braided convergencies in an attractor system and/or] superfluidic vortical dynamics in a superhydrodynamic theory. (Also note: Dirac sea (with no antimatter); Gödel's dust solutions; Fedi's unified SQS (superfluid quantum space): "Fluid quantum gravity and relativity" with Bernoulli, Navier-Stokes, and Gross-Pitaevskii to model vortical dynamics.)

Ostrowski–Hadamard gap theorem: https://en.wikipedia.org/wiki/Ostrowski%E2%80%93Hadamard_gap...

> For example, there is a continuum of randomness in the quantum foam that is not predictable by, and thus not describable

From https://news.ycombinator.com/item?id=37712506 :

>> "100-Gbit/s Integrated Quantum Random Number Generator Based on Vacuum Fluctuations" https://link.aps.org/doi/10.1103/PRXQuantum.4.010330

> The theorems are widely, but not universally, interpreted as showing that Hilbert's program to find a complete and consistent set of axioms for all mathematics is impossible.

If there cannot be a sufficient set of axioms for all mathematics, can there be a Unified field theory?

Unified field theory: https://en.wikipedia.org/wiki/Unified_field_theory

> translate ASM to mathematical syntax

On the utility of a syntax and typesetting, and whether it gains fidelity at lower levels of description

latexify_py looks neat, compared to sympy's often unfortunately reordered LaTeX output: https://github.com/google/latexify_py/blob/main/docs/paramet...


> If sound, visual and meaning aren't correlated, what can language be?

Ask someone with an articulation disorder. /s

Still, this is super reductive. Language can be feeling, it can be taste, it can be all sorts of things that are ineffable.

It is weird to me that the research into this assumes there is a baseline of repeatability somehow. At best, this is mimicry, imo.


What’s bizarre to me is that the entire field of “AI” research, including the LLM space, seems to be repeating the exact same mistakes as the ecosystem models of the 1920s and later: gross oversimplifications of reality, not because that’s what is actually observed, but because that’s the only way to make reality fit the desired outcome models.

It’s science done backwards, which isn’t really science at all. Not that I think these models have no use cases; they’re simply being used too broadly because the people obsessed with them don’t want to admit their limitations.


I missed all the replies yesterday. My point is that I don't think that learning the correlation between some words and visual concepts qualifies as discovering language. It may be that that's as far as this approach can go, so it never discovers the more sophisticated constructs of language. In that case it's no different from recognizing a barking dog, which is surely below "language". I am not a linguist, so I'm not sure what qualifies as "language" officially, but intuitively this falls short.

I think it's fair. It discerns which sounds are important - languagey sounds. It discovered language but did not understand it.

This is exactly how I taught myself to read at age 3, and by age 5 my reading comprehension was collegiate, so I don't really understand what you mean. Language is literally communication through patterns. Patterns of speech, patterns of thoughts, archetypes, it's all patterns.

Same, I could read before kindergarten just from my parents reading to me. My mom was shocked when she found out I could read; they hadn't started intentionally teaching me yet.

I just instructed my grandmother to read the same kid's book to me over and over again each night for a few weeks and matched the phonemes to the letters. At that point I'd already had a decent grasp of language so it felt like a very natural transition.

It bit me for a while later on in grade school, when I took an English class and realized I didn't know anything about how language was constructed; it was all purely intuition. Formalizing language took a concerted effort on my part, and my teachers didn't understand how to translate the material... because I can't just be told things at face value; there is always a nested mass of "but why?" that must be answered or I fundamentally don't understand.

Once I finally got over that hill, it was smooth sailing again. It also made learning foreign languages very hard in school; I failed both foreign language classes I took, until I again took my own time to learn the fundamentals, and now it all seems to stick.


Nice work. The authors train a model with two contrastive objectives: (1) predict which video clip corresponds to a given audio file, and (2) predict which audio clip corresponds to a given video clip. Together, these two training objectives induce the model to learn to associate different sounds from a given audio clip, like words forming a spoken sentence, with different objects shown in a video clip, like the pixels depicting the subject of the spoken sentence -- and vice versa.
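This isn't DenseAV's exact loss, but a minimal sketch of a symmetric contrastive objective of that flavor (the embedding networks are omitted; batch and dimension sizes are placeholders):

    import torch
    import torch.nn.functional as F

    def symmetric_contrastive_loss(audio_emb, video_emb, temperature=0.07):
        # audio_emb, video_emb: (batch, dim); row i of each is a true pair.
        a = F.normalize(audio_emb, dim=-1)
        v = F.normalize(video_emb, dim=-1)
        logits = a @ v.t() / temperature                 # batch x batch similarities
        targets = torch.arange(len(a))                   # true pairs on the diagonal
        loss_a2v = F.cross_entropy(logits, targets)      # (1) audio -> pick its video
        loss_v2a = F.cross_entropy(logits.t(), targets)  # (2) video -> pick its audio
        return (loss_a2v + loss_v2a) / 2

    # e.g. a batch of 80 paired 512-dim embeddings:
    loss = symmetric_contrastive_loss(torch.randn(80, 512), torch.randn(80, 512))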

If I understand the paper correctly, the model was trained on tens of thousands of video and audio clip pairs, randomly sampled 64 million times (80 samples/batch x 0.8 million training steps). For comparison, over the course of a baby's first two years of life, the baby will get trillions of sound samples (measured at 44.1 kHz) and billions of frames of visual data (measured at 30 fps). It's not too hard to imagine that a large AI model in the near future, trained on comparably vast quantities of audio and video data, can learn to recognize objects and words, like a baby.
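Back-of-envelope, assuming roughly 12 waking hours a day (the exact figures obviously vary):

    model_samples = 80 * 800_000             # 64,000,000 sampled clip pairs

    waking_seconds = 2 * 365 * 12 * 3600     # ~31.5 million seconds in two years
    audio_samples = waking_seconds * 44_100  # ~1.4e12: trillions of sound samples
    video_frames = waking_seconds * 30       # ~9.5e8: around a billion frames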

The most exciting takeaway, for me, is that a bigger model inside a robot body should be able to integrate learning in this manner across all five human data modalities -- vision, hearing, smell, touch, taste -- as well as any other data modalities for which the robot has sensors -- radar, lidar, GPS position, etc. We sure live in exciting times!

Link to paper, code, demo, and data:

https://mhamilton.net/denseav


For the first half-year to year, babies can't focus their eyes, IIRC.

Still several OOMs of difference.


> Still several OOMs of difference.

I agree. Nothing I wrote above disagrees with your statement.


Can't they just add more?

(I can't find the button to edit my comment)

Is memory really an expected bottleneck in the long term?


It sounds like you're viewing "OOM" as "Out of Memory", when the parent meant "Order of Magnitude" - meaning that a baby's "training data" is several Orders of Magnitude more than what the DenseAV model from the paper used.

To massively anthropomorphize convnets, perhaps this serves the purpose of training up shape / color recognition first before fine detail assessment happens.

Hah, neat. Had a similar idea long, long ago. Wanted to run it against the show Friends, to model the object interactions against speech-to-text results against a transcript. The idea was to create a speech-to-text slash plot generator using silly Bayes stuff that latches onto accurate speech-to-text pairings over inaccurate pairings when compared against the transcript. Friends, because it's such a popular show that has so much association and accessible transcripts.

I wish Friends was available dubbed in the language I'm learning (Hindi) - I know it well enough in English from repeated watching that I think it'd be the best thing I could do for my vocabulary and listening comprehension speed to watch it in Hindi.

There should be an app that pays people to dub popular shows into a broader range of languages, charges viewers, and has some kind of sidebar or subtitle integration to hover/click and learn more - check what a word was, its conjugation, etc.

The focus on language learning would also allow the translators to make more appropriate choices for that use, whereas subtitles (and probably dubbing) for entertainment often have variation that is confusing to learn from. I suppose that matters less if there's only one or the other, so no chance of conflict.


> because it's such a popular show that has so much association and accessible transcripts.

Because it's so stupid even some dumb code can understand the patterns. Wouldn't work with Seinfeld.


Fascinating. Language is a type of action evolved for information exchange, which maps latent "video", "audio" and "thoughts" into "sentences" and vice versa.

Interesting. How long till we have universal translators and/or can figure out the Voynich manuscript?

Right after you have no mouth and must scream.

The Voynich manuscript is probably too short, but training an LLM on two large text dumps of unrelated languages, where only one is known (the other perhaps an alien message), should make translation possible.

How is this different from what LLMs do (by reading written works)?

It learns to recognize which objects correspond to which sounds, and vice versa. It doesn't really understand linguistic concepts like grammar, sentences, logic, etc.

It doesn't really have much in common with LLMs. If the ideas were combined, the results might be interesting, although probably requiring very significant compute.


You’ll need a whole new class of training data. A camera strapped to someone’s head as they go about their day, scaled across continents, age groups and occupations. You can’t rely on YouTube, since that’s all edited. Not at all how a toddler would see the world.

Have you heard of live streaming and lifestreaming? There's countless hours of this exact type of footage available on twitch. I would bet Google is also already recording when doing mapping, so they should have videos going through every street in the world as well.

A lot of people don't know that twitch started as justin.tv and was specifically a life streaming platform.

I partially agree, but many humans do in fact learn new languages from YouTube, TV shows, etc. There are also practically infinite minutes of long-form conversation on YouTube, at least in English and Japanese - not edited, just 1+ hour videos of uncut talking.

Funnily enough, the comprehensible input movement for language learning has already started generating the type of content you're suggesting: long first-person-view walks, following day-to-day activities, usually with a monologue about what they are seeing in between.


I've been learning a language from TikTok videos for more than two years and that totally works; the results are there. This format is even more edited than YouTube.

You also got trained by ~5k hours of unedited real-time first person footage before you started to speak.

No I wasn't, this is my only exposure to this language, I don't live there.

Imagine an autonomous robot programmed with this algorithm just walking around in the world learning about it. So excited about the future!

All well and good until he rolls into a room where they are doing a Terminator movie marathon.

Now, do whales.

> “Another exciting application is understanding new languages, like dolphin or whale communication, which don’t have a written form of communication. Our hope is that DenseAV can help us understand these languages that have evaded human translation efforts since the beginning."

Yay

Seems to work.

I think my son's learning Japanese by watching Anime on Crunchyroll.


After watching about 500 hours of subtitled anime, I can say with total confidence that "nando" is an interrogative. Probably.

I think that one's nanto, "surprisingly". Nani (sometimes reduplicated, nani nani?) is "what?" or "what's this?", and nan demo (literally "what but") is "you see, ..."

Should've started when you were five. I learned English this way.

Linguists e.g. Chomsky: No you can't. Proof is left as an exercise.

I think Chomsky's position is that _humans_ are primed for language, but it doesn't mean there exists no system that can learn language from 0.

Also, the ability to understand language is a spectrum, not a binary quality.


And with that in mind there is an obvious counterargument to the original "No". If humans learn language through observation and innate knowledge encoded into the brain then we can construct an artificial system that discovers language purely through observation by encoding the same innate knowledge into the system.

Algorithms necessarily contain some knowledge about the data they are working with so that isn't cheating. Neural nets for example have to have some architecture and that predisposes them to learning certain patterns.
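For instance, a convolutional layer bakes in the assumption that the same local patterns matter wherever they appear in an image, something a plain dense layer over the same input doesn't; a quick PyTorch illustration with made-up sizes:

    import torch.nn as nn

    # Two layers over a 32x32 RGB image, producing 16 output channels/features:
    conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # filters shared across positions
    dense = nn.Linear(3 * 32 * 32, 16 * 32 * 32)       # every pixel pair connected

    n_params = lambda m: sum(p.numel() for p in m.parameters())
    print(n_params(conv))   # 448: the same 3x3 filters reused everywhere
    print(n_params(dense))  # ~50 million: no built-in notion of locality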


He does claim there is no system that can learn language from 0 given only as much (or less) training data as humans get. Which seems very likely false.

Yeah the argument is that there is a poverty of stimulus (not enough training data) in the case of children. The same doesn’t apply to ML models which need an abundance of stimulus.

There was some linguist (Deb Roy?) who videotaped his children growing up. One of his observations was that every learned syllable etc. was tied to observing it as a stimulus and trying to imitate it. Now it is true, children are pretty good learners, they can often learn something the first time they see it, but actually I have seen some LLM stuff about "instant learning" - e.g. the training is only done with one pass over the material. https://www.fast.ai/posts/2023-09-04-learning-jumps/

This is an interesting connection. If I had to guess, the Chomskyist response would be to say, yes but this only applies to LLMs that have been pre-trained already (i.e. have structures in place needed to understand language in general). I think a Chomskyist would say that language learning is precisely like fine-tuning and not the pre-training of foundation models.

Word-object correspondences can obviously be learnt from observation in theory - it is widely accepted that there is enough information to learn such correspondences, even on a limited dataset. The argument by linguists is about grammar, not vocabulary.

Chomsky's argument is exactly the opposite, that vocabulary is learnt from scratch, but that the framework for grammar is innate.


> Proof is left as an exercise.

A rather conclusive one has already been established in the ML field.


Historically, Chomsky has been wrong about a great many things.

Particularly about things he never said.


