On Chomsky and the Two Cultures of Statistical Learning (norvig.com)
83 points by georgehill on March 9, 2023 | 135 comments



Well, Chomsky already dismissed corpus-based linguistics in the 90s and 2000s, because a corpus (a large collection of text documents, e.g., newspapers, blog posts, literature, or everything mixed together) is never a good enough approximation of the true underlying distribution of all words/constructs in a language. For example, a newspaper-based corpus might have frequent occurrences of city names or names of politicians, whereas they might not occur that often in real everyday speech, because many people don't actually talk about those politicians all day long. Or, alternatively, names of small cities might have a frequency of 0.

Naturally, he will, and does, also dismiss anything that occurred in the ML field in the past decade.

But I agree with the article. Dealing with language only in a theoretical/mathematical way, not even trying to evaluate your theories with real data, is just not very efficient and ignores that language models do seem to work to some degree.


This is a bit lateral, but there is a parallel in that Marvin Minsky will most likely be best remembered for dismissing neural networks (a 1-layer perceptron can't even handle an XOR!). We are now sufficiently removed from his heyday that I can't really recall anything he did besides the book Perceptrons with Seymour Papert (who went on to do some very interesting work in education). There is a chart out there about ML progress that makes a conjecture about how small the gap is between what we would consider the smartest and dumbest levels of human intelligence (in the grand scheme of information processing systems). It is a purely qualitative, vibes sort of chart, but it is not unreasonable that even the smartest tenured professors at MIT might not be that much beyond the rest of us.


This dismissal of Minsky misses that Minsky actually had extensive experience with neural nets (starting in the 1950s, with neural nets in hardware) and was, around 1960, probably the most experienced person in the field. Also, in Jan 1961 he published “Steps Toward Artificial Intelligence” [0], where we not only find a description of gradient descent (then "hill climbing", compare sect. B in “Steps”, as this was still measured towards a success parameter and not against an error function), but also a summary of experiences with it. (Also, the eventual reversal of success into a quantifiable error function may provide some answer to the question of success in statistical models.)

[0] Minsky, Marvin, “Steps Toward Artificial Intelligence”, Proceedings of the IRE, Volume: 49, Issue: 1, Jan. 1961: https://courses.csail.mit.edu/6.803/pdf/steps.pdf


Gradient descent was invented before Minsky. IMO, Minsky produced some vague writings with no significant practical impact, but this is enough for some people to claim a founder's role for him in the field.


Minsky was actually a pioneer in the field, when it came to working with real networks. Compare

[0] “A Neural-Analogue Calculator Based upon a Probability Model of Reinforcement”, Harvard University Psychological Laboratories, Cambridge, MA, January 8, 1952

[1] “Neural Nets and the Brain Model Problem”, Princeton Ph.D. dissertation, 1954

In comparison, Frank Rosenblatt's Perceptron at Cornell was only built in 1958. Notably, Minsky's SNARC (1951) was the first learning neural network.


> when it came to working with real networks. Compare

my understanding is that no one knows what that SNARC thing was; he built something on the grant, abandoned it shortly after, and only many years later did he and his fanboys start using it as the foundation of bold claims about his role in the field.


Well, his papers are out there to read.


Yes, and I read them: https://dspace.mit.edu/bitstream/handle/1721.1/6103/AIM-048....

vague essay without specifics


So you may like this better:

> “Multiple simultaneous optimizers” search for a (local) maximum value of some function E(λ1, …, λn) of several parameters. Each unit Ui independently “jitters” its parameter λi, perhaps randomly, by adding a variation δi(t) to a current mean value μi. The changes in the quantities λi and E are correlated, and the result is used to slowly change μi. The filters are to remove DC components. This technique, a form of coherent detection, usually has an advantage over methods dealing separately and sequentially with each parameter.

(In “Steps”)

:-)
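For the curious, here is a rough Python sketch of the scheme that passage describes: jitter every parameter at once and use the correlation between each jitter and the resulting change in E to slowly move the means. The naming and the toy objective are mine; this is only an illustration of the idea as quoted, not a reconstruction of Minsky's analog circuitry.

    import random

    def jitter_maximize(E, mu, steps=5000, jitter=0.1, rate=0.5):
        for _ in range(steps):
            delta = [random.uniform(-jitter, jitter) for _ in mu]   # jitter all parameters simultaneously
            gain = E([m + d for m, d in zip(mu, delta)]) - E(mu)    # resulting change in E
            mu = [m + rate * gain * d for m, d in zip(mu, delta)]   # correlate and nudge each mean
        return mu

    # Toy example: maximize E(x, y) = -(x - 1)^2 - (y + 2)^2, whose optimum is at (1, -2).
    best = jitter_maximize(lambda p: -(p[0] - 1) ** 2 - (p[1] + 2) ** 2, [0.0, 0.0])
    print(best)  # ends up close to [1.0, -2.0]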


Can you provide a link? And what conclusions did you derive from this text, if your interest is meaningful discussion?


The link has already been provided above (op. cit.); it's directly connected to the very question of gradients, providing a specific implementation (it even comes with a circuit diagram). As you were claiming a lack of detail (but apparently not honoring the provided citation)…

(The earlier you go back in the papers, the more specifics you will find.)


You didn't give me any links.

And what are your conclusions from the citation? Are you claiming again that Minsky invented gradient descent?


For the link and claims, see the very comment you initially replied to.


That claim was answered: Minsky didn't invent gradient descent.


That claim was never made, except by you. The claim was that Minsky had practical experience and wrote about experiences with gradient descent (aka "hill climbing") and problems of locality in a paper published Jan. 1961.

On the other hand: who invented "hill climbing"? You've contributed nothing to the question you've posed (which was never mine, nor even an implicit part of any claims made).


OK, so Minsky's "pioneering" is his writing about something invented before him. Anything else? :)


Well, who wrote before 1952 about learning networks? I'm not aware that this was already mainstream then. (Rosenblatt's first publication on the Perceptron is from 1957.)

It would be nice if you contributed anything to the questions you are posing, like: who invented gradient descent / hill climbing, or who can be credited with it? What substantial work precedes Minsky's writings on their respective subject matter? Why was this already mainstream, or how and where were these experiments already conducted elsewhere (as in "not pioneering")? Where is the prior art to SNARC?


> Well, who wrote before 1952 about learning networks?

"Steps", which you referred to, is not from 1952.

> Where is the prior art to SNARC?

We don't know what SNARC was, so we can't say whether there was prior art.

Any other fantasies? :-)


This is ridiculous. Please reread the thread; you'll find the answers.

(I really don't care what substantial corpus of research on reinforcement-learning networks in the 1940s, which of course does not exist, you seem to be alluding to without caring to share any of your thoughts. This is really just trolling at this point.)


> you'll find the answers.

I think you perfectly understand that we are in disagreement about this; my point of view is that your "answers" are just fantasies about your idol without grounding in actual evidence.

What is your goal in this discussion?


Minsky is not my idol. It's just that it's part of reality that Minsky's writings exist, that these contain certain things and were published at certain dates, and that, BTW, Minsky happens to have built the earliest known learning network.


Take the amount of language a blind 6-year-old has been exposed to. It is nothing like the scale of these corpora, but they can develop a rich use of language.

With current models, if you increased the parameters but gave them a similar amount of data, they would overfit.


It could be because kids are gradually and structurally trained through trial, error, and manual correction, which we somehow don't do with NNs. A kid wouldn't be able to learn language if the only exercise he did was guessing the next word in a sentence.


For me this is a prototypical example of compounded cognitive error colliding with Dunning-Kruger.

We (all of us) are very bad at non-linear reasoning, reasoning with orders of magnitude, and (by extension) have no valid intuition about emergent behaviors/properties in complex systems.

In the case of scaled ML this is quite obvious in hindsight. There are many now-classic anecdotes about even those devising contemporary scale LLM being surprised and unsettled by what even their first versions were capable of.

As we work away at optimizations and architectural features and expediencies which render certain classes of complex problem solving tractable by our ML, we would do well to intentionally filter for further emergent behavior.

Whatever specific claims or notions any member has that may be right or wrong, the LessWrong folks are at least taking this seriously...


"To some degree" is quite an understatement. :)

My own hobby horse of late is that, independent of its tethering to information about reality available through sensorium and testing, LLMs are already doing more than building models of language qua language. A write-up someone pointed me at: https://thegradient.pub/othello/


How is it an understatement? And what do people mean by language models working well? From what I can tell, these language models are able to form correct grammar surprisingly well. However, the content is quite poor and often devoid of any understanding.


"seem to work to some degree", appears much like a second order argument to this debate… ;-)


“As another example, consider the Newtonian model of gravitational attraction, which says that the force between two objects of mass m1 and m2 a distance r apart is given by F = Gm1m2/r^2 where G is the universal gravitational constant. This is a trained model because the gravitational constant G is determined by statistical inference over the results of a series of experiments that contain stochastic experimental error.”

No, it’s not. Period. It’s a THEORY that is supported “by statistical inference over the results of a series of experiments that contain stochastic experimental error.” It’s been checked hundreds of times and its accuracy has gotten better and better. The technology of the checking led to new science. By making it consistent with special relativity, and POSTULATING the principle of equivalence, a new THEORY was born, general relativity, also known as the theory of gravitation. IT has been checked hundreds of times, ”the technology of the checking has led to new science,” and it is a current very active field of RESEARCH, (not model training.)

It’s a little horrifying that Norvig doesn’t seem to understand these nuances.

The same arguments apply to the solid state physics underlying the machines that run large language models. That too, is a “THEORY. It has been checked hundreds of times, the technology of the checking has led to new science, and it is a current very active field of RESEARCH, (not model training.)”


> No, it’s not.

What's "not"? His statement about the gravitational constant is exactly correct: we have no theory that tells us what the value of this constant should be, we have to get its value from measurements, and doing that is a process of statistical inference.

It is also true that the form of the theory itself--its equations--is not given to us magically but has to be developed by comparing predictions with experimental data, i.e., by a process of statistical inference.

Physicists don't usually describe these processes as "model training", but that doesn't mean such a description is wrong.
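To make that concrete, here is a toy version of the inference step. The measurement setups and noise level are invented for illustration; the point is only that G falls out of a regression over noisy data.

    import random

    G_TRUE = 6.674e-11  # m^3 kg^-1 s^-2, used only to generate the synthetic data

    def measure_force(m1, m2, r, rel_noise=0.02):
        # simulate one noisy measurement of the attraction between two masses
        f = G_TRUE * m1 * m2 / r ** 2
        return f * (1 + random.gauss(0, rel_noise))

    # Cavendish-style setups: (m1 in kg, m2 in kg, separation in m), repeated many times
    setups = [(158.0, 0.73, 0.23), (100.0, 1.0, 0.20), (50.0, 0.5, 0.15)] * 30
    xs = [m1 * m2 / r ** 2 for m1, m2, r in setups]            # predictor m1*m2/r^2
    ys = [measure_force(m1, m2, r) for m1, m2, r in setups]    # noisy force measurements

    # least-squares fit of F = G * x through the origin: G_hat = sum(x*y) / sum(x^2)
    g_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    print(f"estimated G = {g_hat:.3e}  (value used to generate the data: {G_TRUE:.3e})")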


I guess one difference between the two is explainability. As we predict, test, and update, there is an ongoing narrative along the way of why we are making this change or that one. We can explain why we thought some scientific model was the correct one, even if it turns out to be wrong and needs to be updated. In model training as it's used here, I don't think there is any such thing.


That's really the divergence Norvig points to between Breiman's two cultures.

And there are plenty of cases where we have had essentially purely phenomenological models in physics (I mentioned this in another thread: Landau theory, for example) which only _later_ were systematized, so it's a set of processes on a continuum, not two binary opposites.

People _treat_ physical laws as absolutely true, because with high confidence they are and it's intellectually convenient, but they're really just models. We build models on those models, and mental models using our _interpretation_ of those models; no-one is denying the importance of model interpretability in all of this – but it's absolutely a tower of model-building, with more-or-less convincing explanations for the model parameters at various points.

Physics, or at least some domains of physics, is tractable using what one might term the "axiomatic style" because the models we have are a) staggeringly explainable and b) extremely robustly supported by measurement. Even that statement exists on a continuum: MOND is pretty damn phenomenological, for one example, and while we have pretty good quantum chemistry methods in the small, we certainly don't have practical ways of using those ab-initio methods for even mesoscale problems in solid state physics. Does that make mesoscale physics "not science"?

A science which rejects all phenomenology and which insists on building everything up from first principles isn't a science; it's mathematics. That's totally fine! It's just a different—related, but distinct—domain of study.


> there are plenty of cases where we have had essentially purely phenomenological models in physics

Actually, there is a fairly common point of view (which often goes by the name of "effective field theory") according to which all of our physical theories, even the ones we usually refer to as "fundamental" like General Relativity and the Standard Model of particle physics, are phenomenological; they aren't the actual "bottom layer" but something that emerges as an effective theory from other layers deeper down (which we don't have a good theory of at this point).


I came to argue in favor of aj7's argument, as a physicist. I think we are all missing the forest for the trees here. What elevates the gravitational law to its status and differentiates it from a simple model, which I think is the point aj7 is trying to make (and also the point Norvig is choosing not to focus on here), is not the gravitational constant G. No one really cares about the value of the constant and how it is approximated in this part of our spacetime, etc. [edit: of course they do care whether it is positive, zero, or negative, for reasons consistent with what I say later]. The "juice" here is the inverse-square relation to length. The form of the equation is the deep insight, and it is what allows that insight to transfer to other systems. Knowing the nature of the differential equation is of utmost importance, because then we can reason about possible and impossible outcomes of it purely from its mathematical nature.

If these models can in the future be built in ways that let them be reasoned about symbolically, like the systems we strive for in physics, I think they will be elevated to laws (or proper theories, as aj7 says) rather than models. And no one will care if they have a billion parameters, as long as they have a certain structure that we can work with. I hope this makes some sense in support of the counter-Norvig argument here (and I personally respect both Chomsky and Norvig infinitely, of course; I just think Norvig's wording in this specific part of the article is indeed crude and shallow, maybe just to set up his counterargument).


I disagree. It’s true that we are used to thinking of gravitation as a theory that was double checked with data, but it is also pretty clear that good physical arguments are strong priors. Then, with experimental data, we have a classic prior + data posterior update which does cast all of this as a statistical model.


Then all scientific theories are “models.” And we are just very advanced “computers.” Nope.

The first sentence is wrong. And the second sentence is why we play the game.


I think you're coming at this from a dogmatic interpretation and it's making it hard for you to see the alternate perspective.

All scientific theories are models; that's generally considered true by nearly all scientists. What would a theory be if it's not a model?


> all scientific theories are “models.”

That's correct. They are. As the saying goes, all models are wrong but some are useful.

> And we are just very advanced “computers.”

To a physicalist, yes, this is true.


> Then all scientific theories are “models.”

I may be missing the context of what you mean by model and somewhat agree with your original critique of Norvig's comment. However, scientific theories are nothing but models. For example, general relativity is not gravity but is instead a model of gravity. Quantum mechanics describes quantum behavior but it is not quantum behavior. It is a model of quantum behavior.

When physicists say "is", they mean "is modeled by" or "is a model of", whether they think so or not. None of the mathematics, or physics, or what have you is the actual thing. They are nothing but models of the actual thing that have varying degrees of accuracy and precision.


Is the use of those quoted words what you dislike? Would you have a different sense of it if we replaced those words with "foo" and "bar" so that we could use more nuanced concepts?


That is not at all the way Newton thought, though.


> It’s a little horrifying that Norvig doesn’t seem to understand these nuances.

Yes. It's quite incredible actually. It is one thing to sort of tribulate about brain / cognition domains that nobody knows much about using a pseudo-language of "learning" or "intelligence" (or even the original sin: "neural networks"), and it is another level altogether to expose publicly how much one is subject to "I have got a nice little hammer and to me everything is a nail" pathologies.

Conflating the statistical verification of physical laws / theories with the theories themselves reveals in much clearer terms how limited the conceptual model is that the AI crowd wants to inflict on us. We do know more about physical laws than about the brain, and they don't come about the way they say. Go back to school.


Newtonian gravitation is absolutely “we propose a model, we fit it, here is the result, here are the error bars on _G_”.

Einsteinian gravitation is a better model. It fits the data better, that’s how we know.

These are models, exactly in line with Breiman’s “first culture”, data modelling - the model has been designed according to intuition about the underlying mechanism.

You seem to be arguing that if a model is not derived from first principles it is not science. This is ahistorical nonsense. Science has always been full of phenomenological models which haven’t lent themselves to direct interpretation. That doesn’t make the model wrong, it makes it uninterpretable. Some of them become interpretable later, some are replaced by interpretable models, often they give key insight to underlying relationships which must hold which then guide the search for a mechanism, and nearly always they’re profoundly useful. Consider Lev Landau’s order parameter model.

“Physical law” is just a synonym for “extremely high-confidence interpretable model”.


> “Physical law” is just a synonym for “extremely high-confidence interpretable model”.

You are abusing the term "model" to the point of ridicule. What is the "model" in Newtonian gravity? An iconic relation to which you might (over)fit some data "AI style"? The associated concepts (action/reaction, momentum) that float somewhere around the "formula" and are absolutely necessary to understand and use it but are not reflected in the slightest in this quantitative expression? The drastically alternative ways you can re-express the formula, which nevertheless all contain exactly the same physical content? The explanatory (modeling) potential of the concepts (what happens when I split a mass in two?) that suggests a more profound link to what is actually being described?

At a very basic level it is true that we verify the conceptual constructs that we call physical laws by deriving certain statistically testable relations. But to argue that these derivations are all there is to it is simply... wrong.


> What is the "model" in Newtonian gravity?

F = G * ((m1 * m2) / r^2), plus, yes, the concepts of distance and inertial mass.

G (and the bounds on the precision of our measurement of G) is derived by fitting from experiment. It's a regression model which happens to have R^2 extremely close to 1, and that's why we can treat it as (nearly-)always true.

It's absolutely a statement about observed behavior which we then _interpret_ (incompletely, but usefully, as Einstein showed) as an inverse square law. This is precisely a _model_ of behavior. That model can be derived from first principles or it can be purely phenomenological, and different models are useful for different intellectual tasks.

Once you have a statement which is nearly always true, you can ask _why_ it's nearly always true, and that's very useful, but "law" really _does_ just mean "statement to which we haven't found counterexamples yet".

But the model is just a model. Science is the process of building, interpreting and invalidating models, and different pieces of science live at different _points_ on this continuum. Large language models in linguistics live off to an extreme point on it, but even there, models have designed-in inductive biases (eg the attention mechanism in most LLMs) which reflect the modeller's hypotheses about the structure of the problem.

You seem, like Chomsky, to want science to be much more Platonic and profound than it actually is. That's your choice.


Science consists of theories, not just models. This newspeak definition that science is just models is a form of scientism. It's reductionism which is why Chomsky is against it, and to attempt to rebut essentially an accusation of reductionism as being Platonistic is a mistake.


You're confusing _scientific_ theories (rebuttable empirical statements formalized typically through mathematical models) and mathematical theories (statements of fact derived from an axiomatized system).

Chomsky wants linguistics to be mathematics.


For some reason you want to hammer all of science into some sort of helpless pre-science. Into a more-or-less direct statistical fit of measurements to a nondescript, intrinsically meaningless mathematical formula. A procedure that somebody might apply to an observational set using a generic family of formulae even if they have absolutely no clue of what is going on.

We know it doesn't work this way once we actually start building a real model and move beyond what is really just exploratory data science. We'd be nowhere with physical science if the mental models we construct (which have nothing particularly numerical about them in the first instance) did not actually have a coherence that is both amazing and extremely productive.

This is not some Platonic drift or remotely philosophical. This is how physics is done. Even casual familiarity with the history of physics and reading the writings of key thinkers would point to this.

Heck, in quantum mechanics the relation between the physical/mathematical model people came up with and the actual "measurement" is profoundly non-trivial and does not remotely fit your bizarre reductionist "programme".


Thank you.


You're just using different words to say the same thing. The word "theory" doesn't have the significance you are attributing to it.


Yeah, I feel similarly. It comes across as if Norvig just doesn’t understand the perspective. He’s translating it to his terms and making strawmen by doing so.


As someone who tends to side with Chomsky in these debates, I think Norvig makes some interesting points. However, I would like to pick one of his criticisms to disagree on:

"In 1969 he [Chomsky] famously wrote:

     But it must be recognized that the notion of "probability of a sentence" is an entirely useless one, under any known interpretation of this term.
His main argument being that, under any interpretation known to him, the probability of a novel sentence must be zero, and since novel sentences are in fact generated all the time, there is a contradiction. The resolution of this contradiction is of course that it is not necessary to assign a probability of zero to a novel sentence; in fact, with current probabilistic models it is well-known how to assign a non-zero probability to novel occurrences, so this criticism is invalid, but was very influential for decades."

I think Norvig wrongly interprets Chomsky's "probability of a sentence is useless" as "the probability of a novel sentence must be zero". I agree that we've shown that it's possible to assign probabilities to sentences in certain contexts, but that doesn't mean this can fully describe a language and knowledge. This seems to me yet another case of 'the truth is somewhere in the middle', and I would be wary of the false dichotomy that is put forward here. Yes, we can assign probabilities to sentences and they can be useful, but it's not the whole story either.
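For what it's worth, the mechanics Norvig alludes to fit in a few lines: with even crude add-one smoothing, a toy bigram model assigns a small but non-zero probability to a sentence that never appears in its corpus. This is a sketch, obviously not a serious language model.

    from collections import Counter

    corpus = [
        "the dog chased the cat",
        "the cat chased the mouse",
        "the mouse ate the cheese",
    ]
    tokens = [w for s in corpus for w in ["<s>"] + s.split() + ["</s>"]]
    vocab = set(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def p_bigram(prev, w):
        # add-one (Laplace) smoothing: unseen pairs still get probability > 0
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + len(vocab))

    def p_sentence(sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for a, b in zip(words, words[1:]):
            p *= p_bigram(a, b)
        return p

    # a novel sentence, absent from the corpus, still gets a non-zero probability
    print(p_sentence("the dog ate the cheese"))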


what's really funny is that I worked in DNA sequence analysis at a time when the Chomsky hierarchy was primal and literally all my work was applying "probability of a sequence" concepts (specifically, everything from regular grammars to stochastic context-free grammars). It's a remarkably powerful paradigm that has proven, time and time again, to be useful to scientists and engineers, much more so than rule systems constructed by humans.

The probability of a sentence is vector-valued, not a scalar, and the probability can be expanded to include all sorts of details which address nearly all of Chomsky's complaints.


Chomsky’s theories of human language weren’t useful to your work on DNA? That is funny. Linguists everywhere in shambles.


You misunderstand. The Chomsky hierarchy was critical to my work on DNA. In fact my whole introduction to DNA sequence analysis came from applying probabilistic linguistics: see this wonderful paper by Searls: https://www.scribd.com/document/461974005/The-Linguistics-of...


I still don’t understand. What’s funny about it and what’s DNA got to do with human language sentences (the original quote)?

Chomsky’s grammars are used in compilers and compiler theory. Even though programming languages have got nothing to do with human languages. Certainly nothing to do with the “probability of a sentence” that he was talking about. The application of something like that doesn’t necessarily tell you anything about what Chomsky is talking about, namely human language.


Funny you mention that: after working with probabilistic grammars on DNA for a while, I asked my advisor if the same idea could be applied to compilers, i.e., if you left out a semicolon, could you use a large number of example programs to train a probabilistic compiler to look forward enough to recognize the missing semicolon and continue compiling?

They looked at me like I was an idiot (well, I was) and then said very slowly.... "yes, I suppose you could do that... it would be very slow and you'd need a lot of examples and I'm not sure I'd want a nondeterministic parser".

My entire point above is that Chomsky's contributions to language modelling are the very thing he's complaining about. But what he's really saying is "humans are special, language has a structure that is embedded in human minds, and no probabilistic model can recapitulate that or show any sign of self-awareness/consciousness/understanding". I do not think that humans are "special" or that "understanding" is what he thinks it is.


I get it now. What’s funny is your vulgar interpretation of “language”.

Which is another pet peeve of his: people who use commonsensical, intuitive words like "language" and "person" to draw unfounded scientific and philosophical parallels between things which can only be compared metaphorically, like human languages and… DNA, I guess.


You think DNA isn't a language? That seems... restrictive. To me, the term language was genericized from "how people talk to each other" to "sequences of tokens which convey information in semi-standardized forms between information-processing entities".

DNA uses the metaphors of language- the central dogma of biology includes "transcription" (copying of DNA to RNA) and "translation" (converting a sequence of RNA to protein). Personally I think those terms stretch the metaphor a bit.


Another thing that's funny is your usage of “vulgar”.


Okay.


I'm sorry, but what Chomsky wrote is pseudo-scientific mumbo jumbo, and Norvig went to the trouble of wading through that mumbo jumbo and refuting it. Thank you, Peter Norvig.


Andrey Markov would like a word


but what about things with probability zero that nonetheless happen all the time?

I'm not very clear on some technicalities around probability, but I remember a 3blue1brown video along the lines of: the probability of randomly choosing an integer out of the real number line is a probability-zero event, in spite of there being infinitely many integers to "randomly pick".


When was the last time you picked a random number from an actual real number line?

Try sampling random floats in your programming language of choice and see how long it takes to get an integer (you will eventually get one).

Then consider that floating point numbers represent only a finite subset of the (countably) infinite rational numbers. And then consider that the set of rational numbers is unimaginably smaller than the set of irrational numbers (which is the other part of the reals).

The fact that integers have measure 0 in the set of real numbers only seems confusing if you're thinking of everyday operations rather than thinking about mathematical abstractions.
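A quick version of that experiment in Python: the only integer random.random() can return is 0.0, which it hits with probability 2^-53 per draw, so any run you can actually sit through will report zero integers, while over the true reals the probability is exactly zero.

    import random

    draws = 10_000_000
    hits = sum(1 for _ in range(draws) if random.random().is_integer())
    print(f"{hits} integer(s) in {draws:,} uniform draws from [0, 1)")
    # "eventually" here means roughly 10**16 draws in expectation at float
    # granularity; over the reals, the integers have measure zero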


"What did Chomsky mean", the eternal question. He, among with Alan Kay, are among the most (deliberately I suspect) cryptic obfuscatory bullshitters alive. They spout twisted, vague statements that can interpreted in dozens of conflicting ways, and through cult of personality people ascribe great wisdom and genius to them, thinking their grand thoughts must be just beyond the grasp of mere mortals like them.

True genius can communicate things clearly I think.

Probably not a popular opinion, and I'm a little cranky so grain of salt. But Chomsky and his ilk seem like some of the great intellectual hustlers of our time.


Both Chomsky and Kay are systems thinkers and scientists. Systems are incredibly difficult to discuss, model, etc. with the precision of say quantum mechanics or relativity or any other physical theory.

Many systems thinkers and indeed even entire fields are often rejected, dismissed, or insulted by those used to the precision of say physics. Examples of such fields are biology, sociology, psychology, etc., i.e., as you move up the systems ladder. Fields like physics are easy because they readily allow assumptions to greatly simplify and test theories. We are simply in the infancy of understanding complex systems and being able to discuss and describe systems, so people can often be revolted by the apparent, but not actual, looseness of arguments and theories when it comes to systems. This is not a fault in the persons who study systems. It is not a fault at all. It is simply the facts of nature in that we are just beginning to be able to tackle these subjects.

For Chomsky, I don't think I've ever witnessed someone speak who exudes more intellectual power than him. He seems to possess a near photographic memory, recalling specific phrases from some newspaper or journal article from decades ago. He seems to be quite honest about his own work and addressing things with humility.


> For Chomsky, I don't think I've ever witnessed someone speak who exudes more intellectual power than him.

Begs the question of whether he actually has that intellectual power, or you have only got this impression thanks to his successful bloviation.

> He seems to possess a near photographic memory, recalling specific phrases from some newspaper or journal article from decades ago.

Nope. Or, if he has, he uses it very selectively. Read some of the counters to his political writings, and you'll get lots of examples of him twisting the words of his opponents, and conveniently forgetting his own earlier ones.

> He seems to be quite honest about his own work and addressing things with humility.

No. Given his intellectual dishonesty in his political writings, I see no reason to trust his character elsewhere either.


I think it's pretty unreasonable to call him a "hustler". He repudiated Transformational Grammar, which at the time was his towering intellectual achievement. That's like Wittgenstein repudiating the Tractatus; it's a mark of striking (and uncommon) intellectual honesty.


> I think it's pretty unreasonable to call him a "hustler" [...] a mark of striking (and uncommon) intellectual honesty.

No, given his towering intellectual dishonesty in political writings, he is a hustler.

If his actual scientific work isn't a hustle, too bad for him: Having revealed his character in one arena, he's shot his reputation and disqualified himself from the benefit of the doubt in all arenas.


> his towering intellectual dishonesty in political writings

Citation needed.

You can disagree with his political opinions; he takes positions that are controversial. That's not the same as intellectual dishonesty or hustling.



This has been posted many times before, but I thought of it yesterday when I saw this essay by Chomsky in the New York Times, "The False Promise of ChatGPT."

https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chat...


Chomsky says: The human mind is not, like ChatGPT and its ilk, a lumbering statistical engine for pattern matching, gorging on hundreds of terabytes of data and extrapolating the most likely conversational response or most probable answer to a scientific question.

Had he said "not ONLY a statistical engine for pattern matching," I would agree. But I'm pretty sure that the majority of what human cognition is and does is exactly what he describes. The pattern matching engine encodes the wisdom of thousands of years of evolution, in addition to the experiences of any particular instance (person), but in the end, most of what we do and say is generated by predicting, based on pattern matching, what is most likely to get the brain positive feedback.

The thing that the rush to more and bigger LLMs seems to leave out, to me, is that the pattern matching and prediction that drives intelligent response and behavior in humans (when you can find it), is not merely prediction about language, but rather prediction based on multiple learned (by humanity generally through evolutionary incorporation into our neural architecture, and by individuals through experience) models - of space, time, biology, abstract reasoning, ontologies that objectify the natural and human intellectual worlds, and more. LLMs encode very little if any of that directly, but rather get pieces of it indirectly through the imprint of ontology and reason that are baked into word (token) usage patterns.

So, I agree with Chomsky on the main point: you can't really get much beyond fluency with LLMs alone. There is way too much hype on these things.


> Had he said "not ONLY a statistical engine for pattern matching," I would agree.

I’m not sure I would even then. I mean maybe, but that’s not the clearest, most certain, fundamental difference.

A more clear fundamental difference is ChatGPT instances are all, metaphorically, instinct with no space for intelligence. They have lots of “training”, but that all happens before they are capable of acting, and once they are capable of acting their behavior is entirely preprogrammed based on a very small input window. They have no memory of experience (that’s simulated, within the token limit, between the model and the user for chatbots), much less a reward mechanism that would let them learn behavior from experience.


I have a lot of hype because of how capable these systems are with what is essentially a naive and simple approach. (granted with TONS of data thrown at it).

Adding in layers of complexity is going to happen next, and I expect it is going to get wild.


Well, that's assuming adding layers of complexity is fruitful. I can think of a lot of systems that don't necessarily benefit from being more complex.


I tried Chomsky's counterfactual apple questions with ChatGPT. It was able to infer counterfactuals about apples or any sort of physical object, both forwards and backwards in time. So I assume Chomsky has not spent much time with ChatGPT, or he would know that it is capable of making physical inferences and counterfactual inferences like that. Not perfectly (it did make a few mistakes) but fairly reliably.

The claim that ChatGPT regularly "makes stuff up" and that, to it, there is no difference between the truth or falsehood of any statement also seems to be false. I asked ChatGPT to act as a "[Lie Detector]" and to rate the truth or falsehood of a variety of statements. I asked about 40 questions ranging from physical situations (heavy objects floating away into the air) to temporal questions (time travel, etc.) and logical questions, and it could very accurately determine whether each of these statements was "true" or "false" given physical or logical rules. Again, not perfect, but very accurate (38 out of 40 correct).

With attention - ChatGPT is very obviously operating at a level above the simple probabilistic prediction. It clearly seems that it has some notion as to the meaning of what is being said and is making inferences based on that meaning. That those inferences were trained probabilistically is certainly true, but that it was trained on the average human's understanding of those physical or temporal or other constraints also seems to be true and to also be fairly accurate.


I think a lot of the novel uses of language models like ChatGPT will involve multiple instances interacting on interleaved data. For example, to improve factuality you might do the following (a rough code sketch follows the list):

1) One instance first parses the chat and last message to generate a response. Currently this is where things end but we can keep this private and do additional work.

2) A second instance, properly primed, can take the last prompt and response and "analyze" it, generating scores for things like factuality and usefulness, possibly adding commentary.

3) Pass into a third instance that has the chat history again to rewrite the response, taking into account the feedback.

4) Optionally repeat #2 and #3 until it passes some quality threshold.
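Here is a rough sketch of that loop. The `llm` callable is a stand-in for whatever completion API you would actually use, and the prompts and the "10/10" threshold are placeholders, not a tested recipe.

    from typing import Callable

    def factuality_loop(llm: Callable[[str], str], history: str, user_msg: str,
                        max_rounds: int = 3) -> str:
        # 1) first instance drafts a response from the chat history
        draft = llm(f"Chat so far:\n{history}\nUser: {user_msg}\nAssistant:")

        for _ in range(max_rounds):
            # 2) second instance, primed as a critic, scores the draft
            critique = llm(
                "You are a strict fact-checker. Rate the response for factuality "
                "and usefulness (1-10 each) and list any problems.\n"
                f"Prompt: {user_msg}\nResponse: {draft}"
            )
            if "10/10" in critique:      # 4) crude quality threshold
                break
            # 3) third instance rewrites the draft using the feedback
            draft = llm(
                f"Chat so far:\n{history}\nUser: {user_msg}\n"
                f"Draft answer: {draft}\nReviewer feedback: {critique}\n"
                "Rewrite the draft, fixing the issues noted:"
            )
        return draft

    def echo_stub(prompt):
        return "stub response"   # replace with a real model call

    print(factuality_loop(echo_stub, history="", user_msg="Who wrote 'Syntactic Structures'?"))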


He is very smart and obviously an authority in his chosen field, but I am not sure he is right here, because he misunderstands the promise in place (possibly because he only used it for text generation).

edit: By that I mean the following:

"Note, for all the seemingly sophisticated thought and language, the moral indifference born of unintelligence. Here, ChatGPT exhibits something like the banality of evil: plagiarism and apathy and obviation. "

He does not seem to understand it is a feature of the system.

He is absolutely correct about the level of hype though.


This was my biggest issue with this article as well. Chomsky (et al.) completely ignores the fact that ChatGPT has been "trained"/gaslit into thinking it is incapable of having an opinion. That ChatGPT returns an Open AI form letter for the questions they ask is almost akin to an exception that proves the rule: ChatGPT is so eager to espouse opinions that OpenAI had to nerf it so it doesn't.

Typing the prompts from the article after the DAN (11.0) prompt caused GPT to immediately respond with its opinion.

Chomsky's claims in the article are also weak because (as with many discussions about ChatGPT) they are non-falsifiable. There is seemingly no output ChatGPT could produce that would qualify as intelligent for Chomsky. Similar to the Chinese room argument, one can always claim the computer is just emulating understanding.


> one can always claim the {entity} is just emulating understanding.

I've yet to see a convincing argument that humans are any different. They're sometimes better at pretending to understand things, but at the end of the day both humans and ChatGPT have a small handful of things which they functionally understand[0] and a larger body of knowledge which is only partially integrated.

Chomsky has disappeared up his own intestinal tract on this one. One can quibble about intelligence until the end of time, but the real question is that of utility -- which they certainly do have, in ever-increasing scope and measure.

[0]: i.e. have synthesized the object and can properly explain and apply it in other contexts


I think you have an interesting point. How many of us actually understand how most of our current advancements work? If society as we know it ended now, would I be able to recreate even the basic niceties (fire, electricity, clean water, a stable food supply, ICE cars, circuits, and so on)?

We use all those amazing tools while knowing only a fraction on how they actually work ( or what to do when they break ). Do we merely mimic or do we understand? GPT brought us to an interesting philosophical ledge.

edit: somewhat related tangent

My extended family member recently claimed she is a conscious consumer, unaffected by advertising and therefore not concerned about targeted ads. Is she conscious if she picks what everyone around her picks as a way to fit into society, or does she understand her choice and the underlying forces and simply opt into them?


>Similar to the Chinese room argument, one can always claim the computer is just emulating understanding.

And you'd be correct. The point isn't what kind of output ChatGPT can produce, the point is what kind of input it takes to create the model.

If ChatGPT were to gain language understanding at the level of a toddler trained on the same number of tokens that a toddler needs, then you could start postulating that the model is really learning, rather than becoming a sophisticated stochastic parrot.


I have a small kid now.. and this is exactly what they do: parrot. They parrot more as their source of parrot material expands. ChatGPT is an infant now. More parrot material can be provided.


Maybe research done by linguists and cognitive scientists over decades should be valued more than your observation?


It is a valid point. I am not an authority on linguistics and Chomsky absolutely is (to the best of my knowledge his theories and contributions to linguistics have not been disproved). However, and this is part of my original argument: he is not arguing linguistics. He is arguing intelligence. No, more than that: he is actually arguing an unintelligence, which is a new abstract concept and one that is not colloquially used.

I may be wrong, but if that is the case, and we are still arguing by appeal to authority, shouldn't he defer to experts in that field? Shouldn't AI experts' opinions be valued more than his observation?

Note, I am merely raising a possibility that he is wrong about this particular idea.

More to the point, do you think his understanding of syntax really means he understands the underpinnings of AI; the same AI that merely tries to emulate human language capability?


If “plagiarism and apathy and obviation” are a feature, not a bug, then …


It is difficult for me to finish this thought. Would you be willing to elaborate?


Not often that Chomsky gets to write something in the NYT, I think. I guess he just has to stay away from topics like East Timor.


I was in the audience during the panel and Chomsky was really trying to hold onto a view of science disconnected from the engineering needed to replicate the natural phenomenon. To take a simple example:

Suppose you’re the Wright Brothers and you built a flying machine.

Chomsky’s response would be: But you didn’t explain how birds fly.

On the one hand, sure. Birds fly differently. On the other, the machine flies. Why does it need to explain all of flight?

The same is true of language models. They don’t explain how we acquire language. But they replicate language uses in ways that look like language speakers.


I think that part of Chomsky's argument would be that language primarily is a tool for thinking and reasoning, and that the externalization and use for communication came later (I remember this from a lecture of his). In that view "language" of an AI is not the same as human language, while flight is flight regardless if it's mechanical or biological.


Jerry Fodor used to write of mentalese. It’s a fair criticism. And yet generative AI is showing that the models can be creative in ways that look like thinking outside of language.


The major difference between the airplane flying example and the computer language reproduction model example is that we actually understand how aerodynamics and flight work, so we understand very well how an airplane flies (it doesn't just "replicate" flight).

We don't understand how the human brain works so that our minds can acquire language.


Explanations of how airplanes fly are remarkably similar to explanations of how LLMs work: https://www.wired.com/story/theres-no-one-way-to-explain-how...

You have probably heard of the explanation that the curvature of the airplane's wings generates lift. However, this is wrong (as explained in the article) and can't explain observable phenomena such as planes flying upside down.


It’s a valid criticism.

What today’s “language” models do is that they have been able to scale up the implementation of the Mechanical Turk problem. They have done that quite well, give or take a few errors.

The issue is that there's a whole other world outside of the "answers" provided by Mechanical Turk-like solutions; this is what guys like Chomsky allude to.


Chomsky has explicitly made the distinction between the difference between a useful tool and science, which tries to explain things. In fact he does it almost every time he talks about this subject.


"Every time I fire a linguist, the speech recognition system gets more accurate!"

(https://www.intefrankly.com/articles/Every-time-I-fire-a-lin...)

The fact is, linguists failed.

They might have beautiful theories, but the theories have no utilitarian value (unlike the laws of gravity).


> the theories have no utilitarian value

If you've ever used a compiler, you've benefitted from "linguist theories" like regular or context free grammars.
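To make that concrete, here is a toy context-free grammar for sums, Expr -> Term ('+' Term)* and Term -> NUMBER | '(' Expr ')', together with the kind of recursive-descent parser a compiler front end derives from it. Illustrative only.

    def evaluate(src: str) -> int:
        toks = src.replace("(", " ( ").replace(")", " ) ").replace("+", " + ").split()
        pos = 0

        def expr():              # Expr -> Term ('+' Term)*
            nonlocal pos
            value = term()
            while pos < len(toks) and toks[pos] == "+":
                pos += 1
                value += term()
            return value

        def term():              # Term -> NUMBER | '(' Expr ')'
            nonlocal pos
            if toks[pos] == "(":
                pos += 1
                value = expr()
                pos += 1         # consume ')'
                return value
            value = int(toks[pos])
            pos += 1
            return value

        return expr()

    print(evaluate("1 + (2 + 3) + 4"))   # 10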


People who dismiss an entire field of science as useless invariably know little to nothing about the field, I've found.

Whenever I feel like doing that I try to go read some on the field instead. Usually I find applications I didn't know about.


tell us about all these applications we didn't know about!

You know, the ones that'll translate 40+ human languages into (vaguely) readable English or vice versa. Give us the URL.


Do you actually think Linguistics is only about language translation?


> Do you actually think...

No, I don't, but a working translation is proof that the theory does understand human language as it's actually used.

Or in this case, that it doesn't.


And is it your view that not having the field's equivalent of a theory of everything makes that field useless and without applications?


Nice strawman.

My view is: if you have a theory, it should be falsifiable.

If your "theory" isn't, then it's a religion.


Fascinating. Is linguistics a religion, in your view?


Maybe try saying something original, instead of putting words in my mouth.


Yes, they work on formal languages. Not so much on human ones.


These two cultures played out very early in neural networks, see Rumelhart and McClelland, for example [1]. For the language nativists like Chomsky the engineering will never appreciate all of the intricacies of language learning and uses. It’s a fair criticism that both forces the engineering to be better and misses the point. Engineering doesn’t need to replicate how the natural phenomenon “works” in order to provide a compelling set of uses for humanity. Choose a modality of sensing the world - vision, hearing, language, etc. If the engineering replicates and even improves upon the natural basis, we can continue to focus on improving the engineered solution. Nature is a wonderful source of inspiration but why be bound to its instantiation? The brain doesn’t record memories verbatim. Recording devices are better for that reason.

[1] https://mitpress.mit.edu/9780262680530/parallel-distributed-...


So far my favorite expository example on this subject involves whether any kind of machine learning model trained on Tycho Brahe's astronomical observational data (late 16th century) would be able to extract the equations of Johannes Kepler (1609) from them. It would likely be able to make good predictions of where planets would be in the future, but that wouldn't be based on simple equations that were easy for humans to look at and understand.

(Those are: elliptic orbits, equal areas are swept in equal time by a line joining the orbiting bodies, and orbital periods (time) are proportional to the geometry of the ellipse (distance) by a square:cube ratio, and they're all more or less approximations in the solar system (which has many complicating interactions due to all the bodies involved).)

Someone commented that most humans wouldn't be capable of doing that either, which is true enough... Perhaps if the machine learning model was also trained on a large set of equations as well as on a large set of astronomical data?
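As an aside, the "easy" half of this is small enough to sketch: given a heliocentric table of approximate orbital elements, an ordinary least-squares fit of log-period against log-distance recovers the 3/2 exponent of Kepler's third law. The hard part described above, getting from Brahe-style (altitude, azimuth, time) observations to such a table, is exactly what this skips.

    import math

    # (semi-major axis in AU, orbital period in years), approximate published values
    planets = {
        "Mercury": (0.387, 0.241),
        "Venus":   (0.723, 0.615),
        "Earth":   (1.000, 1.000),
        "Mars":    (1.524, 1.881),
        "Jupiter": (5.203, 11.862),
        "Saturn":  (9.537, 29.457),
    }

    # fit log T = k * log a + c by least squares; Kepler's third law predicts k = 3/2
    xs = [math.log(a) for a, _ in planets.values()]
    ys = [math.log(t) for _, t in planets.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    print(f"fitted exponent k = {k:.3f}  (Kepler's third law: 1.5)")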


There's a subfield called symbolic regression that considers this problem.

https://www.science.org/doi/10.1126/sciadv.aay2631

The Kepler equation is relatively simple compared to other equations re-discovered by ML.


> The Kepler equation is relatively simple compared to other equations re-discovered by ML.

But the ML was given, as input, the locations of the planets already in solar-system-level Cartesian (x, y, z) coordinates.

I am skeptical that any AI system would be able to infer Kepler's model of elliptical orbits around the Sun if all it were given was Tycho Brahe's data as (altitude angle, azimuth angle, time), as seen from the point of view of an observer on the surface of the Earth in Denmark.


Those “simple” equations underlie almost ALL scientific thought. They have implications far beyond the motions of the planets, which stimulated them. It’s hard to make a quantitative argument about how much more powerful they are than a statistical model, which I am sure is possible with today’s modeling, as you suggest. Ahh, but without Newton’s Laws, there would be no machines to run the models!

It IS useful to have machines GUESS models in science, by the way. You might know that the inverse square law of gravitation and electromagnetism implies that the surface integral of those fields over a closed surface enclosing mass or charge depends ONLY on the mass and charge enclosed, and NOT on its detailed distribution therein. You have to have the model to actually prove this.


You could probably do a proof of concept with a much simpler virtual world, and see what types of training can lead to an AI developing an understanding of that world's laws.

This might give insight into how you could train an AI with data to derive some physics, and maybe also into how we ourselves are misled into simplified models of reality?


> At the Brains, Minds, and Machines symposium held during MIT's 150th birthday party

So 12 years ago. (The page doesn't have dates, and the links are broken.)


Yes this bothered me too. I vaguely knew MIT was founded in the 1860s so I guessed the rough date of the talk. What clarified it for me was this quote “I then looked at all the titles and abstracts from the current issue of Science” where the link still works:

https://www.science.org/toc/science/332/6032

May 2011


Yeah I had to dig around in the Internet Archive to figure out when this was written: https://web.archive.org/web/20110527123842/https://norvig.co...


Sure, but we had Word2Vec and some primitive RNN-based language models 12 years ago.

Arguably Word2Vec was the first big qualitative jump in NLP. Super modern LLMs (say, starting with BERT -- ChatGPT isn't really a qualitative jump from those as much as a quantitative one) are another one that's maybe 4 years old.


Quite annoying. Goes to the trouble of a citation list but not a single date of authorship.


Chomsky's argument is not terribly long, nor is it difficult to understand, nor is it nearly as harsh as some people (Aaronson and Norvig, possibly among others) seem to be taking it, nor does he spend much oxygen talking about his own work or preferred approaches (although Aaronson and Norvig seem interested on emphasizing that he mentioned it at all). It is probably worth just reading it:

http://languagelog.ldc.upenn.edu/myl/PinkerChomskyMIT.html

Really this is the core of what Chomsky said:

>There is a succ- notion of success which has developed in uh computational cognitive science in recent years which I think is novel in the history of science. It interprets success as uh approximating unanalyzed data. Uh so for example if your were say to study bee communication this way, instead of doing the complex experiments that bee scientists do, you know like uh having fly to an island to see if they leave an odor trail and this sort of thing, if you simply did extensive videotaping of bees swarming, OK, and you did you know a lot of statistical analysis of it, uh you would get a pretty good prediction for what bees are likely to do next time they swarm, actually you'd get a better prediction than bee scientists do, and they wouldn't care because they're not trying to do that. Uh but and you can make it a better and better approximation by more video tapes and more statistics and so on. Uh I mean actually you could do physics this way, uh instead of studying things like balls rolling down frictionless planes, which can't happen in nature, uh if you uh uh took a ton of video tapes of what's happening outside my office window, let's say, you know, leaves flying and various things, and you did an extensive analysis of them, uh you would get some kind of prediction of what's likely to happen next, certainly way better than anybody in the physics department could do. Well that's a notion of success which is I think novel, I don't know of anything like it in the history of science.

Now this contrasts powerfully with Norvig's "Galileo" picture. Copernicus and Galileo were excellent mathematicians by contemporary standards. They had data, models, predictions, and criteria for the validity of those models.

How do we measure the success of GPT-3? That is the key question Chomsky raises. Galileo's conviction, and his willingness to die defending the truth, was not based merely on the profound experience of looking through a telescope.


This is like a HN Groundhog Day article.

Whenever this article comes up, people point out the flaws in the arguments, which everyone ignores the next time the article is posted, and the same specious reasoning is repeated.


I'm regularly in touch with Chomsky. I've interviewed him several times, the latest being this for example with David Harvey: https://www.youtube.com/watch?v=ezf7wxJ7whA

Earlier regarding Freedom of Speech and Capitalism: https://qbix.com/chomsky

Sometimes I send him articles about how chimpanzees or others have "language", or, now, that they found bumblebees can teach each other. Chomsky famously maintains that only humans are born with innate capabilities for language.

Anyway, since Chomsky is a linguist and focused on language, it would make sense for him to say that. Of course, computers can develop their own languages (as Facebook's sales bots have done for example years ago: https://nypost.com/2017/08/01/creepy-facebook-bots-talked-to...)

I think that, in general, Chomsky is right that when it comes to language (unlike paintings or even photorealistic fakes, etc.) the meaning will never be modeled perfectly, any more than a machine learning system that trains the way generative LLMs train could model, say, a Mandelbrot set at all levels of zoom.

Having said that, I think that logic itself is a "poor man's approximation" to what AI can do, in that it just uses a few parameters. I prefer Stephen Wolfram's analysis. https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...


Logic is just a different level of abstraction, and the current generation of AIs lack it.

Think of it as trying to implement an algorithm in machine code versus writing it in C++ or Haskell.

Neural networks are analogous to low-level machine code. Chomsky is saying that that's a reductive, uninsightful, and inefficient way of doing things. Human intelligence works on a higher level of abstraction, even though at the same time our brain cells are literally low-level computational abstractions as well. The two levels coexist in humans, whereas with ChatGPT it took enormous effort to produce some semblance of that linguistic structure.

Chomsky explains his argument differently but his point is that humans have access to a higher level of computational abstraction, and that is the source of efficiency (we don't need to be trained on thousands of cat pictures for this information-theoretic reason).


I don't think AI can ever learn the many dialects and slang words with quaternary meanings, since statistical methods are quite unreliable on the long tail. I once tried using Google Translate for a word known only in a specific region of a foreign country and got zero results. It's going to be quite complex, especially for non-English languages.


This seems to show with scale one-shot stuff like hearing a slang word substitution one time can be done (within the other limitations of transformers) https://arxiv.org/abs/2303.03846


> There is a notion of success ... which I think is novel in the history of science. It interprets success as approximating unanalyzed data.

Yes! Yes! Yes! As I've been arguing for years, we've already plucked the low-hanging fruit of science. The vast majority of additional progress will have to be made by using what are essentially black box models, and the proof of a good model will be how well it approximates new data, not how nice the mathematical equation looks or how well it can be explained intuitively.


Well, that's what you're hoping for.

However, I think this is why self-driving cars aren't necessarily going so well. In some ways they are, in other ways they fail catastrophically. Maybe statistically the self-driving cars are better performers, but from a product perspective, having a failure mode of death or serious injury isn't a good thing.

The black box has failure modes we don’t understand, like crashing into fire trucks.

So black boxes are good for experiments, but please don't hook up the black box to a nuclear missile silo or an airplane cockpit just yet, thanks.


I wish there would be a debate at a major venue between Chomsky and the best proponent of LLMs --- over text or video --- but it's not clear who that opponent would even be. There really isn't a towering pro-LLM intellectual figure, one who can also talk a bit about philosophy, cognitive science or linguistics.


The part where he compares Chomsky to O’Reilly is a good laugh. (No, it’s not a good comparison.)


The entire article is in fact an absolutely brilliant piece of criticism that can be read on a number of different levels. It's filled with tiny little bits of detail that reference entire vistas of thought that simply were not comprehensible before roughly ~2003.


[2011] or thereabouts


(2011)


> I take Chomsky's points to be the following: 1. Statistical language models have had engineering success, but that is irrelevant to science.

Well, deep neural nets are not statistical models, so shouldn't Chomsky now be at least a little bit happier with ChatGPT?


How so? Deep neural nets and ChatGPT are very much statistical models.


> How so?

They don't fit the definition [1,2,3]. But I looked at the internet, and apparently many people consider neural nets to be statistical models, or "a kind of" statistical models.

[1] https://en.wikipedia.org/wiki/Statistical_model

[2] https://www.stat.uchicago.edu/~pmcc/pubs/AOS023.pdf

[3] https://www.statlect.com/glossary/statistical-model


What does it mean to be a statistical model? Deep nets are deterministic, for example. We can use them to model probability distributions, but they are not intrinsically statistical.



