Pathways Language Model (PaLM): Scaling to 540B parameters (googleblog.com)
235 points by homarp on April 4, 2022 | 207 comments



This is why Google built TPUs. This alone justifies the whole program. This level of natural language understanding, once it is harnessed for applications and made efficient enough for wide use, is going to revolutionize literally everything Google does. Owning the chips that can do this is incredibly valuable, and companies that are stuck purchasing or renting whatever Nvidia makes are going to be at a disadvantage.


I honestly think we will one day look back and see this as taking a hammer to a screw. Sure, it will work, but it's a horribly brute-force way of doing it. The brain almost certainly processes language in a more elegant way that, once understood, won't require such raw computational power.

The brainwaves associated with a wakeful state (and thus generating and parsing language) happen around 40 Hz. Our machines are running in the GHz region. We are missing some fundamental insight.

At the risk of ending up front and center in the hall of prediction hubris, here is my bold prediction: by 2040 it will be possible to do what these TPUs are doing today on your consumer-grade computing device, but it won't be because of a dramatic increase in the computing resources available to such devices.


I agree that this stuff will be orders of magnitude more efficient once we figure it out. But I think brute forcing it first and optimizing later is a fine way to do it. Maybe even the only way to make progress on a problem like this. And even in a hypothetical future where you can run human level intelligence on your smartphone, supercomputers will be that much more intelligent still, so it's not like building them will be pointless.


I mean shoot, if you brute force it enough you can ask the machines to optimize for you at some point.


I'll tip my hand further. I think the current approach to artificial neural networks is inherently flawed because it simplifies away pertinent features like firing rates, propagation delay, and standing waves (aka brain waves). I also think trying to get AI to "understand" the real world when it doesn't live in it, and when we haven't codified our own natural language, is too far a leap. We might have more success making an AI that is conversational in a form of newspeak dedicated to a specific problem domain, e.g. an intelligence with the body of a brokerage account that only knows how to speak about money (thus rendering most of humanity redundant (zing!)). Here's something I wrote elsewhere about the accidental path to machine intelligence.

"You are whatever you get in return when you ask yourself who you are".

Literally, if you trace the referent objects, "you" refers to the same entity as the voice which responds to your query. But also literally, "you" are the standing waves of feedback loops in the brain which represent thoughts.

Rather subtly, I've given an understanding of consciousness which is entirely defined and constrained within a communicative system of sending and receiving messages. The underlying language within which the consciousness exists may have a very minimal grounding in some physical reality.

(I'm truncating a ton of justification and background to avoid a deep rabbit hole and get to the point).

Here is a highly speculative example of how consciousness-like intelligence could emerge accidentally in the form of trading bots. It starts with two ingredients: the shared reality of the order book and independent trading bots with the objective of "make more money".

At first the algorithms are dumb curve optimizers. They see whatever patterns they can find in the order books and correlate them with outcomes, reacting appropriately by placing their own orders.

After a while, as the data sets and internal models grow, the algorithms are implicitly calculating the reactions of other trading bots in their extrapolation.

Soon after that, the orders being placed in the order books are effectively messages intended for other trading bots, hoping to elicit a certain reaction. The order book effectively becomes a language grounded in an economic reality, filled with offers that are communicative but not necessarily intended to execute.

The emergent language of the order book grows in sophistication to the point where the bots are talking about past instances and hypothetical instances of things said in orderbook speak. This all happens right under our nose in perhaps non-obvious ways, such as in the exact number of cents on a bid/ask price. Other times it looks like our bots have reinvented "painting the tape" and other forms of financial communication we've deemed "market manipulation". We're proud of our silicon brain child for figuring out how to do that on its own. They grow up so fast.

Eventually, in the process of optimization, the trading bot's internal model of its "order book reality" gets to a point of sophistication where it has to model other bots modelling itself modelling other bots modelling itself...to whatever layer of recursive depth it can marshal. Thus the feedback loop effectively closes and something akin to circular-loop brain waves emerges. By this point no one really understands why the trading bots make the trades they do. I can no longer tell you what a trading bot is in simple terms like "it's a dumb curve optimizer trying to maximize money". Rather, it can only be understood as a form of consciousness which has emerged:

"The trading bot is whatever the trading bot gets in return when it simulates placing an order asking what it is."


Whenever we try to define consciousness, we get into useless quasi-philosophical discussions like this. Philosophers don't know what consciousness is and neither does anyone else.

Working with your definition, you've basically just described backpropagation in sufficiently deep neural nets -- a feature of artificial neural nets which, as you say, probably oversimplify brain functions.

"The underlying language within which the consciousness exists may have a very minimal grounding in some physical reality."

Your usage of physical reality is interesting. Is this the "free will is real" à la "true randomness exists" argument again? Hope we aren't moving backwards into the arms of religion here.


Technically, any loop can be unrolled into a sufficiently deep linear sequence. It can be done, but good luck conceiving of solutions to routine problems without while loops, for loops, or recursion. You have to think harder, write much longer and more repetitive code, and end up looking at the sort of code that doesn't contain insights about the problem.

So, back-propagation with sufficiently deep neural networks. Technically you could use it. You could throw a huge amount of silicon and brain power at any problem and eventually hammer that screw right into the wall. Or you could try slightly more realistic models of neurons and hope to find disproportionate increases in ability over the previous model. I think it's already clear which one I'm in favor of.

Just to make this extremely explicit, the thing I think is missing from artificial neural networks is harmonic waves. There's a body of evidence that representation of thought is done with brain waves, not the states of individual neurons [1]. When you move to a wave view of neural networks, a handful of very sophisticated operations emerge naturally, such as autocorrelation (effectively a time-windowed Fourier transform). Sure, you could program the Fourier transform, or even worse get an optimizer to implicitly learn it after some outrageous number of man hours, but in this analog wave view of brain activity we get it with structures so simple they could have happened by accident. I'm being extremely literal when I say "you are what you get in return when you ask yourself". The voice in your head is literally the echoes of the question bouncing around in your skull (albeit electro-chemically rather than acoustically).
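
To make the autocorrelation point concrete, here's a minimal sketch (synthetic signal, NumPy assumed) of a time-windowed autocorrelation computed via the power spectrum, which is the Wiener-Khinchin relation I'm gesturing at:

  import numpy as np

  fs = 1000                                       # sample rate, Hz
  t = np.arange(0, 1, 1 / fs)
  signal = np.sin(2 * np.pi * 40 * t)             # a synthetic 40 Hz "brain wave"

  window = signal[:256] * np.hanning(256)         # take a short time window
  spectrum = np.fft.rfft(window)
  autocorr = np.fft.irfft(np.abs(spectrum) ** 2)  # autocorrelation = inverse FFT of the power spectrum
  print(autocorr[:5])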

I am arguing that consciousness, that is to say the train of thought in your head, is definitionally what happens when the conversational abilities of understanding utterances and forming responses get fed into each other. Consciousness is nothing more than talking to someone who happens to be yourself. Maybe my definition doesn't have universal acceptance, but it at least gives a meaningful concrete answer to what is meant by consciousness.

You've far and away missed the point on "grounding in reality". It has nothing to do with randomness or free will or religion. I didn't hint at anything of the sort.

Someone once said something profound to me: "In a programming language, no matter how much complexity or abstraction there is in a command, everything eventually resolves down to instructions to physically move some electric charges at a physical location in memory." Something similar applies to natural languages. Every sentence and thought eventually resolves down to representing physical and tangible things in our reality. When you try to trace through the dependency tree of the dictionary, you eventually reach words which can't be broken into simpler parts. Those are the words that "ground" the language in our physical reality, representing objects in the outside world one-to-one. Every language, be it natural or programming or something else, has some form of grounding. Language has to be about something. But the underlying thing it's about can be very simple. It can be as simple as an order book.

To any conscious entities that emerged in such an accidental medium, the order book is their reality. They wouldn't know of or be equipped to reason about any other form of existence. Their form of existence is no better nor worse than any other consciousness grounded in the reality of any other language. It doesn't matter much what the underlying objects of the problem domain are. I picked trading bots as my example, but I could have picked any other domain where agents 1) share the same playing field 2) have some competing interests to optimize and 3) could use objects of the domain for signaling purposes (ideally at low cost).

[1] https://www.nature.com/articles/ncomms10340


But brains are trained from millions of generations across the lifetime of the earth (or from when life started, if you ignore the environment and context needed to start life). You are conflating the training with the inference.


At least in the visual cortex, synaptic weights are not genetically encoded in humans. They are in Drosophila, though.


Source? I am skeptical that we even can know that because so much can be implicitly encoded.


I do not have a single source; that result is represented across thousands of Drosophila papers. There was a famous paper where they showed that fruit flies bred in the dark for many generations showed no significant optic lobe tuning differences from wild-type flies, but I don't remember who did that.

The extreme plasticity of the human visual cortex is widely documented (studies in blind people, etc).


> The extreme plasticity of the human visual cortex is widely documented (studies in blind people, etc).

I think there is a big gap between the two claims -

1. Human visual cortex is very plastic

2. There is no inductive bias in the visual part of the brain that has been "trained" by millions of years of evolution.

I trust you are correct on the former, but the latter does not seem to follow from the former.


Certainly there is some inductive bias. Neuroanatomy is broadly similar from human to human, which constrains connectivity. There is an energetic penalty for excess wiring -- evolution drives neuron placement to minimize wiring length (e.g. the topographic layout of layers in cortex), which further constrains connectivity. In general, there is a huge body of evidence supporting the hypothesis that organisms are highly tuned to their environments.

My point is that in contrast to fruit flies, humans (well, mammals) raised in environments with non-natural scene statistics show significant differences in coding in the visual cortex. Here are some random references I dredged up:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4210638/ https://www.sciencedirect.com/science/article/pii/S096098222...

It's not hard to see why this might be selected for in mammals. The historic range of D. melanogaster is just sub-Saharan Africa, whereas the historic range of humans is huge, and we are primarily visual organisms. So it would be advantageous to have different tuning if you lived in the desert vs the forest. Of course you have to look at the range and divergence times of the full phylogeny, but I am getting distracted from my actual research. It's a very interesting line of thinking though. I was quite surprised to learn things were hardcoded in flies when I joined a fly vision lab.

https://academic.oup.com/gbe/article/11/3/844/5304659


> The brainwaves associated with a wakeful state (and thus generating and parsing language) happen around 40 Hz. Our machines are running in the GHz region.

I don't disagree with the larger point, but each neuron is like a small independent analog computer, so you have tens of billions of (specialized) cores running at 40Hz, vs thousands of cores at ~GHz.
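
As a rough back-of-the-envelope sketch of that comparison (all numbers here are order-of-magnitude assumptions, not measurements):

  # many slow units vs. few fast units
  neurons = 86e9        # rough neuron count in a human brain
  brain_rate = 40       # Hz, the wakeful-state rhythm mentioned upthread
  cores = 10_000        # a largeish accelerator pod, order of magnitude
  core_rate = 1e9       # ~1 GHz

  print(f"{neurons * brain_rate:.1e}")   # ~3.4e12 neuron "updates" per second
  print(f"{cores * core_rate:.1e}")      # ~1.0e13 core cycles per second
  # comparable orders of magnitude, not the 7-8 you'd guess from 40 Hz vs GHz alone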


I'm sure we will end up with better solutions, but it's a lot easier to study a thing when you have an introspectable working model. I expect the pattern for much of AI will be that we build it, then optimize it and learn how and why it works, turning the art and science into engineering piece by piece.


The human brain very likely has 10^15 synapses, so these models are still at least three orders of magnitude behind. To put it in perspective, if you want to simulate a "network" of human-brain scale, you probably need an entire data center worth of GPUs.
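
A quick sketch of that arithmetic (one scalar weight per synapse at 2 bytes each, which is itself a huge simplification, as the reply below points out):

  synapses = 1e15        # rough synapse count in a human brain
  palm_params = 540e9    # PaLM's parameter count
  gpu_memory = 80e9      # bytes of memory on a large accelerator

  print(synapses / palm_params)     # ~1850x, i.e. a bit over 3 orders of magnitude
  print(synapses * 2 / gpu_memory)  # ~25,000 such devices just to hold the weights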


That is assuming you restrict yourself to a radically simplified model of each neuron. The actual biochemical processes going on in a single synaptic density go way beyond something that could be represented by a scalar weight assembled into a linear transformation. Instead we are talking about 10^15 dynamical systems. This still neglects most of the temporally/spatially extended nature of neurons, the fact that calcium stores in the endoplasmic reticulum play a role, etc. Some of this might only be relevant to learning, but non-linear dendritic processing is believed to be relevant for inference as well.


The brain is massively parallel (itself an understatement). Computers are not.


The joke explanations on page 38 of the full paper linked here are blowing my mind. It's crazy how far language models have come.

https://storage.googleapis.com/pathways-language-model/PaLM-...


I like how it appears that it had to convert from imperial to metric before it could make an inference:

300 miles per hour is about 480 km/h. This is about the speed of a commercial airplane. [...]


Also, that's a very slow commercial airplane. (Unless talking about an old turboprop?)


Haven't looked further, but I'm wondering about that. Is that the result of training to be able to explain that specific joke, or is it generalized?

In the past these things have been misleading. Some impressive capability ends up being far more narrow than implied, so it's kind of like just storing information and retrieving it with extra steps.


From the example, it seems hard to imagine that it has been trained to explain this specific joke.

I understand language model skepticism is very big on HN, but this is impressive.


How much of human written history can be compressed and approximately stored in 540B parameters?

It seems to me basically certain that no compressed representation of text can be an understanding of language, so necessarily, any statistical algorithm here is always using coincidental tricks. That it takes 540B parameters to do it is, I think, a clue that we don't even really need.

Words mean what we do with them -- you need to be here in the world with us, to understand what we mean. There is nothing in the patterns of our usage of words which provides their semantics, so the whole field of distributional analysis precludes this superstition.

You cannot, by mere statistical analysis of patterns in mere text, understand the nature of the world. But it is precisely this we communicate in text. We succeed because we are both in the world, not because "w" occurring before "d" somehow communicates anything.

Apparent correlations in text are meaningful to us, because we created them, and we have their semantics. The system must by its nature be a mere remembering.


> It seems to me basically certain that no compressed representation of text can be an understanding of language, so necessarily, any statistical algorithm here is always using coincidental tricks. That it takes 540B parameters to do it is, I think, a clue that we don't even really need.

I think your premise contains your conclusion, which while common, is something you should strive to avoid.

I do think your opinion is a good example of the prevailing sentiment on Hacker News. To me, it seems to come from a discomfort with the fact that even "we" emerge out of the basic interactions of basic building blocks. Our brain has been able to build world knowledge "merely by" analysis of electrical impulses being transmitted to it on wires.


I have no discomfort with the notion that our bodies, which grow in response to direct causal contact with our environment, contain in their structure the generative capability for knowledge, imagination, skill, growth -- and so on.

I have no discomfort with the basically schizophrenic notion that the shapes of words have something to do with the nature of the world. I just think it's a kind of insanity which absolutely destroys our ability to reason carefully about the use of these systems.

That "tr" occurs before "ee" says as much about "trees" as "leaves are green" says -- it is only that *we* have the relevant semantics that the latter is meaningful when interpreted in the light of our "environmental history" recorded in our bodies, and given weight and utility by our imaginations.

The structure of text is not the structure of the world. This thesis is mad. It's a scientific thesis. It is trivial to test it. It is trivial to wholly discredit it. It's pseudoscience.

No one here is a scientist and no one treats any of this as science. Where are the criteria for the empirical adequacy of NLP systems as models of language? Specifying any, conducting actual hypothesis tests, and establishing a theory of how NLP systems model language -- this would immediately reveal the smoke and mirrors.

The work to reveal the statistical tricks underneath them takes years, and no one has much motivation to do it. The money lies in this sales pitch, and this is no science. This is no scientific method.


Agree to disagree. I think you are opining about things that you are lacking fundamental knowledge on.

> The structure of text is not the structure of the world. This thesis is mad. It's a scientific thesis. It is trivial to test it. It is trivial to wholly discredit it. It's pseudoscience.

It's unclear what you even mean by that. Are the electrical impulses coming to our brain the "structure of the world"?


The structure of having X apples in Y buckets is the same as the structure in the expression "X * Y", as long as the expression exists in a context that can parse it using the rules of arithmetic, such as a human, or a calculator.

These language models lack context, not just for arithmetic, but for everything. They can't parse "X * Y" for any X and Y, they've just associated the expression with the right answer for so many values of X and Y, that we get fooled into thinking they know the rules.

We get fooled into thinking they've learned the structure of the world. But they've only learned the structure of text.


It would be trivial for a network of this size to code general rules for multiplication.

At a certain point, when you have enough data, finding the actual rule is actually the easier solution than memorizing each data point. This is the key insight of deep learning.


Really? Better inform all the researchers working on this that they're wasting their time then: https://arxiv.org/abs/2001.05016

More fundamentally, any finite neural net is either constant or linear outside the training sample, depending on the activation function. Unless you design special neurons like in the paper above, which solves this specific problem for arithmetic, but not the general problem of extrapolation.
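
A small sketch of that extrapolation point (scikit-learn assumed purely for brevity): an ordinary ReLU MLP fitted to y = x^2 on [-2, 2] is piecewise linear, so outside the training range its prediction continues along a straight line instead of following the parabola.

  import numpy as np
  from sklearn.neural_network import MLPRegressor

  rng = np.random.default_rng(0)
  x_train = rng.uniform(-2, 2, size=(2000, 1))
  y_train = x_train.ravel() ** 2

  net = MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
                     max_iter=5000, random_state=0).fit(x_train, y_train)

  for x in [1.0, 2.0, 5.0, 10.0]:   # the last two lie outside the training range
      print(x, net.predict([[x]])[0], x ** 2)
  # inside [-2, 2] the fit is reasonably close; at 5 and 10 the prediction grows linearly, not quadratically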


> any finite neural net is either constant or linear outside the training sample

Hence why the structure of our bodies has to include the capacity for imagination. Our brain structure does not record everything that has happened. It permits us to imagine an infinite number of things which might happen.

We do not come to understand the world by having a brain structure isomorphic to world structure -- this is nonsense for, at least, the above reason. But also, there really isn't anything like "world structure" to be isomorphic to. I.e., brains aren't HDDs.

They are, at least, simulators. I don't think we'll find anything in the brain like "leaves are green", because that is just a generated public representation of a latent simulating thought. There isn't much to be learned about the world from these; they only make sense to us.

That all the text of human history has associations between words is the statistical coincidence that modern NLP uses for its smoke and mirrors. As a theory of language it's madness.


Isn't that per-layer?


No, no matter how many piecewise linear functions you compose, the result is still piecewise linear.


Well sure, but neurons are still universal approximators. Any CPU is a sum of piecewise linear functions. I don't see where this meaningfully limits the capabilities of an AI, since once we're multilayer there's no 1:1 relation between training samples and piece placement in the output.



I just don't see how that's relevant. Nobody uses one-hidden-layer networks anymore. Whatever GPT is doing, it has nothing to do with approximating a collection of samples by assembling piecewise functions, except in the way that Microsoft Word is based on the Transistor.


Sounds like no amount of math will convince you otherwise.


Should math about a vaguely related topic convince me about this? Multilevel ANNs act differently than one-level ANNs. Transformers simply don't have anything to do with the model of approximating functions by assembling piecewise functions. This is akin to arguing that computers can't copy files because the disjunctive normal form sometimes needs exponential terms on bit inputs, so obviously it cannot scale to large data sets - yes, that is true about the DNF, but copying files on a computer simply does not use boolean operations in a way that would run into that limitation.

The way that Transformers learn has more to do with their multilayering than with the transformation across any one layer. Universal approximation only describes the things the network learns across any pair of layers, but the input and output features that it learns about in the middle are only tangentially related to the training samples. You cannot predict the capabilities of a deep neural network by considering the limitations of a one-layer learner.


>We get fooled into thinking they've learned the structure of the world. But they've only learned the structure of text.

To what degree does the structure of text correspond to structure of the world, in the limit of a maximally descriptive text corpus? Nearly complete if not totally complete, as far as I can tell. What is left out? The subjective experience of being embodied in the world. But this subjective experience is orthogonal to the structure of the world. And so this limitation does not prevent an understanding of the structure.


The point is that not only is it impossible to infer the structure of the world from text, deep learning is incapable of learning about or even representing the world.

The reason language makes sense to us is that it triggers the right representations. It does not make sense intrinsically, it's just a sequence of symbols.

Learning about the world requires at least causal inference, modular and compact representations such as programming languages, and much smarter learning algorithms than random search or gradient descent.


I don't know why you think this. There is much structural regularity in a large text corpus that is descriptive of relationships in the world. Eventually the best way to predict this regularity is just to land in a portion of parameter space that encodes the structure. But again, in the limit of a maximally descriptive text corpus, the best way to model this structure is just to encode the structure of the world. You have given no reason to think this is inherently impossible.


>There is much structural regularity in a large text corpus that is descriptive of relationships in the world.

Sure, there is a lot. But let's say we want to learn what apples are. So we look at occurrences of "apple" in the text corpus, and learn that apples can be eaten, they can be sweet, sometimes they are sour, red, sometimes green, and so on.

Can apples spontaneously change size? Hmm, no idea, no mention of that in the text. Can they be used as a knife? Dunno. If I had an apple in New York 4 minutes ago, can someone else be eating the same apple in Hong Kong now? Dunno. Did apples exist in France two million years ago? Dunno. Can you drive to Saturn in an apple? Dunno.

In short, there's no actual model of what an apple is, as an object in space, connected to other objects by various relationships. If there were, the model could figure out the answers to the questions above by inheritance.

Maybe these particular questions happen to be answered correctly by PaLM. Maybe not, but the next LLM will include this comment in the training corpus.

But the reason GPT-3 and other models tend to make no sense is because their output is not constrained by reality. The text in the training corpus tends to conform to reality, but when you prompt the model with questions that nobody would ever write text about, the illusion falls away:

Q: How many eyes does a giraffe have? A: A giraffe has two eyes.

Q: How many eyes does my foot have? A: Your foot has two eyes.

Q: How many eyes does a spider have? A: A spider has eight eyes.

Q: How many eyes does the sun have? A: The sun has one eye.

Q: How many eyes does a blade of grass have? A: A blade of grass has one eye.

Q: How do you sporgle a morgle? A: You sporgle a morgle by using a sporgle.

Q: How many bonks are in a quoit? A: There are three bonks in a quoit.

Q: How many rainbows does it take to jump from Hawaii to seventeen? A: It takes two rainbows to jump from Hawaii to seventeen.

Q: Which colorless green ideas sleep furiously? A: Ideas that are colorless, green, and sleep furiously are the ideas of a sleep furiously.

Q: Do you understand these questions? A: I understand these questions.

(from https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.h...)


>In short, there's no actual model of what an apple is, as an object in space, connected to other objects by various relationships.

I don't know why you think language models are fundamentally unable to deduce the knowledge of the points you mention. Much knowledge isn't explicitly stated, but is implicit and can be deduced from a collection of explicit facts. For example, apples are food, food is physical matter, physical matter is fixed in size, cannot be in two places at once, maintains its current momentum unless acted on by a force, etc. Categorization and deducing properties from an object's category is in the parameter space of language models. There's no reason to think that a sufficiently large model will not land on these parameters.

>But the reason GPT-3 and other models tend to make no sense is because their output is not constrained by reality.

The issue isn't what GPT-3 can or cannot do, it's about what autoregressive language models as a class are capable of. Yes, there are massive holes in GPT-3's ability to maintain coherency across wide ranges of contexts. But GPT-3's limits do not imply a limit to autoregressive language models more generally.


>I don't know why you think language models are fundamentally unable to deduce the knowledge of the points you mention.

Because the knowledge is not there in the text, the models are not able to represent it, and as seen in the demonstration above, they don't have it.


The demonstration is irrelevant. The issue isn't what GPT-3 can or cannot do, but what this class of models can do.

Reduce knowledge to particular kinds of information. Gradient descent discovers information by finding parameters that correspond to the test criteria. Given a large enough data set that is sufficiently descriptive of the world, the "shape" of the world described by the data admits better and worse structures for predicting the data. The organization and association of information that we call knowledge is part of the parameter space of LLMs. There is no reason to think such a learning process cannot find this parameter space.


It sounds like you're arguing that GPT doesn't work because it cannot work. However, it does work.

So how does PaLM understand causal chains and explain jokes that it has never seen before?


It doesn't. It's pattern matching, and you're seeing cherry-picked examples. The pattern matching is enough to give the illusion of understanding. There are plenty of articles where more thorough testing reveals the difference. Here are two: https://medium.com/@melaniemitchell.me/can-gpt-3-make-analog...

But you could also just try one of these models, and see for yourself. It's not exactly subtle.

https://www.technologyreview.com/2020/08/22/1007539/gpt3-ope...


GPT-3 was specifically worse at jokes, which is why PaLM being good at this so impresses me. At any rate, I don't care if it only works one in ten times. To me, this is equivalent to complaining that the dog has bad marks in high school. (PaLM could probably explain that one to you: "The speaker is complaining that the dog is only getting C's. For a human a C is a quite bad mark. However getting even a C is normally impossible for a dog.")

"It's pattern matching" just sounds like an excuse for why it working "doesn't really count". At this point, you are asking me to disbelieve plain evidence. I have played with these models, people I know have played with these models, I have some impression of what they're capable of. I'm not disagreeing it's "just pattern matching", whatever that means, I am asserting that "pattern matching" is Turing-complete, or rather, cognition-complete, so this is just not a relevant argument to me.

What do you think a neuron does?


>At any rate, I don't care if it only works one in ten times

>you are asking me to disbelieve plain evidence


If you threw a thousand tries at a Markov chain, to use the classic "pure pattern matcher", it could not do any fraction of what this model does, ever, at all. You would have to throw enough tries at it that it tried every number that could possibly come next, to get a hit. So one in ten is actually really good. (If that's the rate, we have zero idea how cherrypicked their results actually are.)

And the errors that GPT makes tend to be off-by-one errors, human errors, misunderstandings, confusions. It loses the plot. But a Markov chain never even has the plot for an instant.

GPT pattern-matches at an abstract, conceptual level. If you don't understand why that is a huge deal, I can't help you.


It's a pretty big deal, and there's a big difference between a Markov chain and a deep language model - the Markov chain will quickly converge, while the language model can scale with the data.

But the way these models are talked about is misleading. They don't "answer questions", "translate", "explain jokes", or anything of that sort. They predict missing words. Since the network is so large, and the dataset has so many examples, it can scale up the method of 1) Find a part of the network which encodes training data that is most similar to the prompt 2) Put the words from the prompt in place of the corresponding words in the encoding of the training data

i.e. pattern matching. So if it has seen a similar question to the one given in the prompt (and given that it's trained on most of the internet, it will find thousands of uncannily similar questions), it will produce a convincing answer.
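
To caricature the mechanism I'm describing, here is a toy retrieve-and-substitute "answerer" (this is emphatically not how transformers are implemented; the corpus and the similarity measure are made up for the example):

  from difflib import SequenceMatcher

  corpus = {
      "How many eyes does a cat have?": "A cat has two eyes.",
      "How many legs does a spider have?": "A spider has eight legs.",
  }

  def retrieve_answer(prompt):
      # find the memorized question most similar to the prompt and replay its answer
      best = max(corpus, key=lambda q: SequenceMatcher(None, q, prompt).ratio())
      return corpus[best]

  print(retrieve_answer("How many eyes does my foot have?"))
  # -> "A cat has two eyes."  (a pure pattern matcher happily misfires)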

How is that different from a human answering questions? A human uses pattern matching as part of the process, sure. But they also use, well, all the other abilities that together make up intelligence. They connect the meaningless symbols of the sentence to the mental representations that model the world -- the ones pertaining to whatever the question is about.

If I ask a librarian "What is the path integral formulation of quantum mechanics?", and they come back with a textbook and proceed to read the answer from page 345, my reaction is not "Wow, you must be a genius physicist!", it's "Wow, you sure know where to find the right book for any question!". In the same way, I'm impressed with GPT for being a nifty search engine, but then again, Google search does a pretty good job of that already.


I don't know what to tell you. They specifically showed PaLM novel jokes. You're effectively saying that the paper is either mistaken or fraudulent.

In my experience with language models, what they do cannot be reduced to madlibs. But that's obviously not an argument I can prove to you.

Can we agree that if the model can explain structurally novel jokes, then it must have some measure of true understanding?


Understanding of what? What the joke is about? Then no, it has no idea what any of it means. The syntactic structure of jokes? Sure. Feed it 10 thousand jokes that are based on a word found in two otherwise disjoint clusters (pod of whales, pod of TPUs), with a subsequent explanation. It's fair to say it understands that joke format.

If you somehow manage to invent a kind of joke never before seen in the vast training corpus, that alone would be impressive. If PaLM can then explain that joke, I will change my mind about language models, and then probably join the "NNs are magic you guys" crowd, because it wouldn't make any sense.


Good point, coming up with a novel joke is no joke. There's a genuine problem where GPT is to a first approximation going to have seen everything we'll think of to test it, in some form or other.

Of course, if we can't come up with something sufficiently novel to challenge it with, that also says something about the expected difficulty of its deployment. :-P

I guess once we find a more sample-efficient way to train transformers, it'll become easier to create a dataset where some entire genre of joke will be excluded.


> No one here is a scientist and no one treats any of this as science. Where are the criteria for the empirical adequacy of NLP systems as models of language? Specifying any, conducting actual hypothesis tests, and establishing a theory of how NLP systems model language -- this would immediately reveal the smoke and mirrors.

What do you mean?

I'm not a scientist but I play one sometimes, and I managed a whole team of them working in this field.

The theory of language models is well established.

> Where are the criteria for the empirical adequacy of NLP systems as models of language?

There are lots(!?) I think the Winograd schema challenge[1] is an easy one to understand, and meets a lot of your objections because it is grounded in physical reality.

Statement:

The city councilmen refused the demonstrators a permit because they [feared/advocated] violence.

Question:

Does "they" refer to the councilmen or the demonstrators?

The human baseline for this challenge is 92%[1]. PaLM (this Google language model) scored 90% (4% higher than the previous best)[3].

[1] https://en.wikipedia.org/wiki/Winograd_schema_challenge

[2] http://ceur-ws.org/Vol-1353/paper_30.pdf

[3] https://storage.googleapis.com/pathways-language-model/PaLM-... pg 12
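
For a concrete sense of how such a schema gets scored in practice, here's a minimal sketch (assuming the Hugging Face transformers library and GPT-2 as a stand-in model, not PaLM): substitute each candidate referent for the pronoun and pick the sentence the model finds more likely.

  import torch
  from transformers import GPT2LMHeadModel, GPT2TokenizerFast

  tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
  model = GPT2LMHeadModel.from_pretrained("gpt2")
  model.eval()

  def sentence_logprob(text):
      # total log-probability the model assigns to the sentence's tokens
      ids = tokenizer(text, return_tensors="pt").input_ids
      with torch.no_grad():
          out = model(ids, labels=ids)
      return -out.loss.item() * (ids.shape[1] - 1)   # undo the per-token mean and the sign

  schema = "The city councilmen refused the demonstrators a permit because {} feared violence."
  candidates = ["the councilmen", "the demonstrators"]
  scores = {c: sentence_logprob(schema.format(c)) for c in candidates}
  print(max(scores, key=scores.get))   # the referent the model prefers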


Indeed, all these tests are not of empirical adequacy, which really evidences the point. The whole field is in this insular pseudoscientific mould of "it's true if it passes an automated test to x%".

A theory with empirical adequacy would require you to do some actual research into language use in humans; all of its features; how it works; various theories of its mechanisms, etc. And after comprehensive, experimental and detailed theoretical work -- show that NLP models even *any* of it.

I.e., that any NLP model is a model of language.

All you do above is design your own win condition, and say you've won. This precludes actually knowing anything about how language works, and is profoundly pseudoscientific. If you set up tests for toys, and they pass -- good, you've made a nice toy.

You may only claim it models some target after actually doing some science.


> A theory with empirical adequacy would require you to do some actual research into language use in humans; all of its features; how it works; various theories of its mechanisms, etc. And after comprehensive, experimental and detailed theoretical work -- show that NLP models even *any* of it.

What - specifically - do you mean?

There's an entire field adjacent to NLP called Computational Linguistics. Most people in the field work across them both, and there is significant cross pollination.

It's unclear if you think there is some process in the brain that NLP models should be similar to. If this is the case, you should look at studies similar to [1] where they do MRI imaging and can see similar responses for semantically similar words. This is very similar to how word vectors put similar concepts close together (and of course how more complex models put concepts close together).

Or perhaps you think that NLP models do not understand syntactic concepts like nouns, verbs etc. This is incorrect too[2].

[1] https://www.tandfonline.com/doi/full/10.1080/23273798.2017.1...

[2] https://explosion.ai/demos/displacy


It should do what language does...

Language is a phenomenon in, at least, one type of animal. It allows animals to coordinate with each other in a shared environment; it describes their internal and external states; etc. etc.

Language is a real phenomenon in the world that, like gravity, can be studied. It isn't abstract.

NLP models of language aren't models of language. They're cheap imitations which succeed only in fooling language users in local, highly specific situations.


> NLP models of language aren't models of language.

Do you actually know what an NLP Language Model refers to? It literally is a model of the language - it predicts the likelihood of the next word(s) given a set of prior word(s).
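
To make "model of the language" concrete, here's a minimal sketch of the idea (a tiny bigram model over a made-up corpus; real language models estimate the same conditional distribution with vastly more context and parameters):

  from collections import Counter, defaultdict

  corpus = "the dog chased the cat and the cat chased the mouse".split()

  counts = defaultdict(Counter)
  for prev, nxt in zip(corpus, corpus[1:]):
      counts[prev][nxt] += 1

  def next_word_probs(prev):
      # P(next word | previous word), estimated from counts
      total = sum(counts[prev].values())
      return {w: c / total for w, c in counts[prev].items()}

  print(next_word_probs("the"))   # {'dog': 0.25, 'cat': 0.5, 'mouse': 0.25}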

It seems you think people just throw some data at a neural network and then go "wow". It's not like that at all - the field of NLP grew out of the study of linguistics and has deep roots in that field.


That's not a model of language. Language is a communicative activity between language users, who do things with words, with each other.

What you're talking about is ignoring the entire empirical context of language, as a real-world phenomenon, and modelling its purely formal characteristics as recorded post facto.

This will always just produce a system which cannot use language, but will only ever appear to within highly constrained -- essentially illusory -- contexts. It's the difference between a system which makes a film by "predicting the next frame", and making a film by recording actual events that you are directing.

A prediction of a "next frame" is always therefore just going to be a symptom of the frames before it. When I point a camera at something new, eg., an automobile in c. 1900 -- i will record a film that has never been recorded before.

And likewise with words: we are always in genuinely unique, unprecedented situations. And what we *do with words* is speak about those situations *to others* who are in them with us... we aim to coordinate, move, and so on with words.

To model *language* isn't to model words, nor text, nor to predict words or text. It is to be a speaker here in the world with us, using language to do *what language does*.

No model of the regularities of text will ever produce a language-user. Language isn't a regularity, like the frames of a film -- it's a suite of capacities which are responsive to the world, and enable language users to navigate it.


Until you can make quantifiable predictions of behaviour that you want to see it sounds like your objections are philosophical rather than scientific.

> A prediction of a "next frame" is always therefore just going to be a symptom of the frames before it.

But the physical appearance of the automobile itself was absolutely influenced by what went before -- they were called "horseless carriages" after their appearance, after all.

And NLP Language Models can produce genuinely original and unique writing. This is a poem a large LM wrote for me:

  The sceptered isle
  Hath felt the breath of Britain,
  Longer than she cares to remember.
  Now are her champion arms outstared,
  Her virgin bosom stained with battle's gore.
  Lords and nobles, courtiers and commons,
  All stand abashed; the multitudinous rout
  Scatter their fears in every direction;
  Except their courage, which, to be perfect,
  Must be all directed to the imminent danger
  Which but now struck like a comet; and they feel
  The blow is imminent

> we aim to coordinate, move, and so on with words.

https://say-can.github.io/

"Robots ground large language models in reality by acting as their eyes and hands while LLMs help robots execute long, abstract language instructions"


Chopping up sequences of film and stitching them together based on their prior similarity isn't making a movie -- and that's all you have here. People wrote poetry -- *for the first time* -- to say something about their own environment, that they are present in. All you have here is a system which has remembered a compressed representation of these poems and stitches them together to fool you.

It really is a kind of proto-psychosis to think this machine has written a poem. It has generated the text of a poem.

> quantifiable predictions of behaviour that you want to see

This is trivial. I ask the machine a large number of ordinary questions, e.g., "what do you think about what I'm wearing?", "what would it take to change your mind on whether murder is justified?", "do you think you'd like New York?", "could you pass me the salt?", etc. -- a trivial infinity of questions lifted from the daily life of language users.

The machine cannot answer any of those questions. All it will do is generate some text on the occasion that the machine sees that text. This isn't an answer. That isn't the question. The question isn't "summarise a million documents and report an on-average plausible answer to these questions".

When I ask a person any of those questions, if they did that, they wouldn't be answering them. This is trivial to observe.

These systems are just taking modes() of subsets of historical data. That's just what they are. The appearance of their using language is an illusion.

To use language is to have something to say, to wish to talk about something. When I say "I liked the movie!" I am not summarising a million reviews and finding an average sentence. I am thinking about my experience of the movie, and generating a public, sharable "text" that aims to communicate what I actually think.

*THAT* is language. Language is your intention to speak *ABOUT* something, and the capacity to generate a public shared set of words which communicate what you are talking about. Any process which begins *without anything to say* cannot ever reach language as a capacity.

Language, as a capacity, begins by being in the world. No summary of the public statements of past speakers has anything to do with being in the world and having things to say. Chopping that up and stitching it together is a trick.

And this is trivial to show empirically. It is only by having absolutely no study of language use that anyone can claim text documents have anything to do with it. It's mumbo jumbo.


I see. You believe there is something unmeasureable that matters.

I don't. I believe a perfect simulation of intelligence is intelligence.


It's not unmeasurable. If you ask a friend, "did you like that movie?", would you be happy if they hadn't seen it, didn't know anything about it, etc., and simply generated a response based on some review data they'd read?

Is that what you want from people? You want them just to report a summary of the textbooks, of the reviews of other people? You don't want them to think for a moment, about anything, and have something to say?

This is a radically bleak picture; and omits, of course, everything important.

We aren't reporting the reports of others. We are thinking about things. That isn't unmeasurable; it is trivial to measure.

Show someone the film, ask them questions about it, and so on -- establish their taste.

NLPs aren't simulations of anything. It's a parlour trick. If you want a perfect simulation of intelligence, go and show me one -- I will ask it what it likes, and I doubt it'll have anything sincere to say.

There is no sincerity possible here. These systems are just libraries run through shredders; they haven't been anywhere; they aren't anywhere. They have nothing to say. They aren't talking about anything.

You and I are not the libraries of the world cut up. We are actually responsive to the environments we are in. If someone falls over, we speak to help them. We don't, as if lobotomized, rehearse something. When we use words we use them to speak about the world we are in; this isn't unmeasurable -- it's the whole point.


Why do you think a model of intelligence needs to have tastes, values, likes/dislikes, etc for it to be something more than statistics or pattern matching? Why are you associating these consciousness qualities with AGI?


To use a language is just to talk about things. You cannot answer the question "do you like what I'm wearing?" if you don't have the capacity for taste.

Likewise, this applies to all language. To ask "do you know what 2+2 is?" -- *we* might be happy with "4" in the sense that a calculator answers this question. But we haven't actually used language here. To use language is to understand what "2" means.

In other words, the capacity for language is just the capacity to make a public, communicable description of the non-linguistic capacities that we have. A statistical analysis of what we have already said, does not have this contact with the world, or the relevant capacities. It's just a record of their past use.

None of these systems are language users; none have language. They have the symbols of words set in an order, but they aren't talking about anything, because they have nothing to talk about.

This is, I think, really obvious when you ask "did you like that film?", but it applies to every question. We are just easily satisfied when Alexa turns the lights off when we say "Alexa, lights off". This mechanical satisfaction leads some to the frankly schizophrenic conclusion that Alexa understands what turning the lights off means.

She doesn't. She will never say back, "but you know, it'll be very dark if you do that!" or "would you like the TV on instead?" etc. Alexa isn't having a conversation with you based on a shared understanding of your environment, i.e., using language.

Alexa, like all NLP systems, is an illusion. You aren't speaking to anything. You aren't asking anything a question. Nothing is answering you. You are the only thing in the room that understands what's going on, and the output of the system is meaningful only because you read it.

The system itself assigns no meaning to what it's doing. The lights go off, but not because the system understood your desire. It could not, if it failed to understand, ask about your desire.


You're just reiterating that you think tastes, opinions, likes/dislikes are something intrinsic to the issues here. I'm asking why do you think these things are intrinsic to language understanding or intelligence?

>To use language is to understand what "2" means.

I've never held a "2", yet I know what 2 is as much as anyone. It is a position in a larger arithmetical structure, and it has a correspondence to collections of a certain size. I have no reason to think a sufficiently advanced model trained on language cannot have the same grasp of the number 2 as this.

>A statistical analysis of what we have already said, does not have this contact with the world, or the relevant capacities. It's just a record of their past use.

Let's be clear, there is nothing inherently statistical about language models. Our analysis of how they learn and how they construct their responses is statistical. The models themselves are entirely deterministic. Thus for a language model to respond in contextually appropriate ways means that its internal structure is organized around analyzing context and selecting the appropriate response. That is, its "capacities" are organized around analyzing context and selecting appropriate responses. This, to me, is the stuff of "understanding". The fact that the language model has never felt a cold breeze when it suggests that I close the window if the breeze is making me cold is irrelevant.

>You aren't speaking to anything. You aren't asking anything a question. Nothing is answering you.

It seems that your hidden assumption is that understanding/intelligence requires sentience. And since language models aren't sentient, they are not intelligent. But why do the issues here reduce to the issue of sentience?


What is language?

Language is an empirical phenomenon. It's something happening between some animals, namely, at least, us. It is how we coordinate in a shared environment. It does things.

Language isn't symbols on a page; if it were, a shredder could speak. Is there something we are doing that the shredder is not?

Yes, we are talking about things. We have something to say. We are coordinating with respect to a shared environment, using our capacities to do so.

NLP models are fancy ways of shredding libraries of text, and taking the fragments which fall out and calling them "language". This isn't language. It isn't about anything; the shredder had no intention to say anything.

Mere words are just shadows of the thoughts of their speakers. The words themselves are just meaningless shapes. To use language isn't to set these shapes in order; it's to understand something, to want to say something about it, and to formulate some way of saying it.

If I asked a 5yo child "what is an electron?" and they read from some script a definition, we would not conclude the CHILD had answered the question. They have provided an answer, on the occasion it was asked, but someone else answered the question -- someone who actually understood it.

An NLP model, in modelling only the surface shapes of language *and not its use* is little more than a tape recorder playing back past conversations, stitched together in an illusory way.

We cannot ask it any questions, because it has no capacity to understand what we are talking about. The only questions it can "answer" are, like the child's, those which occur in the script.


I disagree wholeheartedly. But it's clear further back and forth will be unproductive.


> No model of the regularities of text will ever produce a language-user.

No but it will produce language-users, incidentally. Language-users are an irreducible aspect of the underlying regularity in language. Now I'm not saying that "GPT will wake up" purely from language tasks, that GPT will become a language user by being a system that picks up regularities. But for GPT to contain systems like language users, to instantiate language-users, which it has to (on some level) in order to successfully predict the next frame, is already enough to be threatening.

I know that using examples from fiction is annoying, but - purely as a rhetorical aid - consider the Enterprise computer (in Elementary, Dear Data) as GPT, and the Moriarty hologram as an embedded agent. The Enterprise computer is not conscious, but as a very powerful pattern predictor it can instantiate conscious agents "by accident", purely by completing a pattern it has learnt. It doesn't want to threaten the Enterprise, it doesn't want to not threaten the Enterprise, because it doesn't have any intentional stance. Instead, it was asked "A character that can challenge Data is ¬" and completed the sentence, as is its function.


How does the computer answer "Do you like what i'm wearing today?" ?

Well, if we say the computer is, in fact, not participating in the world with us -- it is merely predicting "the next word", then it cannot.

I am not asking for any answer to this question. I want to know what it (like a friend) actually thinks about what I'm wearing.

To do this, it would need to be a competent language user, not a word announcer. It would, in other words, need to know what the language was about -- and need to be able to make a judgement of taste based on its prior experiences, etc.

I don't think our ability to misattribute a capacity for language to things (e.g., to Bugs Bunny) is salient -- we are fools, easily fooled. Bugs Bunny doesn't exist.

In this case, the Star Trek computer, insofar as it actually answers the questions it's asked, is routinely depicted as being actually present in the world with us. That the show might claim "no it isn't!", or that we otherwise hold this premise whilst observing that it is, is just foolishness. Bugs Bunny, likewise, is depicted with the premise that Bugs is within his own world; this, likewise, is irrelevant.


Well, GPT is not the sort of thing that can have a "you." But it has seen dialogues that have a "you" in it, and it knows how a "you" tends to answer. For instance, depending on context, it may be operating under a different model for the "you" agent - the sort of person who likes a red dress, or the sort of person who likes suspenders. If we assume a multimodal GPT, it's going to draw on its pattern recognition from movies and its context window for what it's previously said as "you" or what you've prompted it as in order to guess what the agent it's pattern completing for "you" would think of your dress.

In effect, I'm saying that just because GPT is not a word-user, that doesn't mean that its model of "you" - the layered system of patterns that generates its prediction for words that come after "I think your dress looks" - isn't a word-user. The "you" model, effectively, takes in sensory input, processes it, and produces output. Because the language model has learnt to complete sentences using agents as predictive patterns - because agents compress language - the you pattern acts agentic, despite the fact that the language model itself is not "committed" to this agent and will, if you reset its context window, readily switch to pattern predicting another agent.

GPT is not an agent, but GPT can predict an agent, and this is equivalent to containing it.


I don't think it is equivalent. If you assume it has the same modal properties, sure -- let's say that's plausible.

I.e., if GPT said, on the occasion it was asked Q, an answer A, in a possible world W, such that this answer A was the "relevant and reasonable" answer in W -- then GPT is "doing something interesting".

E.g., if I am wearing red shoes (world W1) and it says "I like your red shoes" in W1, then that's for sure really interesting.

My issue is that it isn't doing this; GPT is completely insensitive to what world it's in and just generates an average A in reply to a world-insensitive Q.

If you take a language user, e.g. me, and enumerate my behaviour in all possible worlds, you will get something like what GPT is aiming to capture. I.e., what I would say, if asked Q, in world-1, world-2, ... world-infinity.

My capacity to answer the question in "relevant and reasonable" ways across a genuine infinity of possible worlds comes from actual capacities I have to observe, imagine, explore, question, interact, etc. It doesn't come from being an implementation of the (Q, A, W) pattern -- which is an infinity on top of an infinity.

No model which seeks to directly implement (Q, A, W) can ever have the same properties of an actual agent. That model would be physically impossible to store. So GPT does not "contain" an agent in the sense that QAW patterns actually occur as they should.

And no route through modelling those patterns will ever produce the "agency pattern". You actually need to start with the capacities of agents themselves to generate these in the relevant situations, which is not a matter of a compressed representation of QAW possibilities -- it's the very ability to imagine them piecemeal (investigate, explore, etc.).


I mean, how would you discover that you're in world W? If you ask "what do you think about my red shoes?" and I say "I think your red shoes are pretty", then you will say this is just me completing the pattern. But if I have no idea what shoes you're wearing, then even I, surely agreed to be an agent, could not compliment your clothing. So I'm not sure how this distinction works.

> It doesnt come from being an implementation of the (Q, A, W) pattern

Well, isn't this just a (Q, A, W, H) pattern though? You have a hidden state that you draw upon in order to map Qs onto As, in addition to the worldstate that exists outside you. But inasmuch as this hidden state shows itself in your answers, then GPT has to model it in order to efficiently compress your pattern of behavior. And inasmuch as it doesn't ever show itself in your answers, or only very rarely, it's hard to see how it can be vital to implementing agency.

And, of course, teaching GPT this multi-step approach to problem solving is just prompting it to use a "hidden" state, by creating a situation in which the normally hidden state is directly visualized. So the next step would be to allow GPT to actually generate a separate window of reasoning steps that are not directly compared against the context window being learnt, so it can think even when not prompted to. I'm not sure how to train that though.


Sure, GPT has to model H -- that's a way of putting it. However, think of how the algorithm producing GPT works (and thereby how GPT models QAWH) -- it produces a set of weights which interpolate between the training data. Even if we give it QAWH as training data, implementing the same QAWH patterns would require more storage capacity than is physically possible.

I think there's a genuine ontological (practical and empirical, also) difference between how a system scales with these "inputs". In other words, if a machine is `A = m(Q | World, Hidden)`, and a person is `A = p(Q | World, Hidden)`, then their complexity properties *matter*.

We know that the algorithm which produces `m` does so with exponential complexity; and we know that the algorithm producing `p` doesn't. In other words, for a person to answer `A` in the relevant ways does not require exponential space/time. We know that NNs already scale exponentially in their parameters, even for fairly radically stupid solutions (i.e., ones which are grossly insensitive even to W).

So whilst `m` and `p` are equivalent if all we want is an accurate mapping of `Q`-space to `A`-space, they aren't equivalent in their complexity properties. This inequivalence makes `m` physically impossible, but, I also think, just not intelligent.

As in, it was intelligent to write the textbook; after it's written, the HDD space which stores it isn't "intelligent". Intelligence is that capacity which enables low-complexity systems to do "high-complexity" stuff. In other words, that we can map out QAWH with physically possible, indeed ordinary, capacities -- our doing that is intelligence.

I think this is a radically empirical question, rather than a merely philosophical one. No algorithm which relies on interpolation of training data will have the right properties; it just won't, as a matter of fact, answer questions correctly.

You cannot encode the whole QAWH-space in parameters. Interpolation, as a strategy, is exponential-scaling; and cannot therefore cover even a tiny fraction of the space.

I.e., if I ask "what did you think of Will Smith hitting Christopher Walken?" it is unlikely to reply, "I think you mean Chris Rock" firstly; and then, if Will does hit Walken, to reply, "I think Walken deserved it!".

Interpolation, as a strategy, cannot deal with the infinities that counter-factuals require. We are genuinely able to perform well in an infinite number of worlds. We do that by not modelling QA pairs, at all; nor even the W-infinity.

Rather, we implement "taste, imagination, curiosity", etc., and are able to simulate (and much else) everything we need. We aren't an interpolation through relevant history; we are a machine directly responsive to the local environment, in ways that show a genuine, deep understanding of the world and the ability to simulate it.

This ability enables `p` to have a lower complexity than `m`, and thereby be actually intelligent.

As an empirical matter, I think you just can't build such a system which actually succeeds in answering the right way. It isn't intelligent; but likewise, it also just doesn't work.


The notion that GPT "interpolates between the training data" is a widespread misconception. There is no evidence that that's what's going on. GPT seems to be capable of generalizing, in ways that let it mix features of training samples at least, and even generalize to situations that it has never seen.

It seems to me your entire argument derives from this. If GPT is not exponential, then the m/p distinction falls apart. And GPT has way too much world-knowledge, IMO, to be storing things in such a costly fashion.

Neural networks learn features, not samples. Layered networks learn features of features (of features of features...). Intelligence works because for many practical tasks, the feature recursion depth of reality is limited. For instance, we can count sheep by throwing pebbles in a bucket for every sheep that enters the pasture, because the concept of items generalizes both sheep and pebbles, and the algorithm ensures that sheep and pebbles move as one. So to come up with this idea, you only need to have enough layers to recognize sheep as items, pebbles as items, those two conceptual assignments as similar, and to notice that when two things are described by similar conceptual assignments in the counting domain, you can use a manual process that represents a count in one domain to validate the other domain. Now I don't think this is actually what our brain is literally doing when we work out this algorithm, it probably involves more visual imagination and looking at systems coevolve in our worldmodel to convince us that the algorithm works. But I also don't think that working this out on purely conceptual grounds needs all that many levels of abstraction/Transformer layers of feature meta-recognition. And once you have that, you get it.


All GD learners are interpolators (cf. https://arxiv.org/abs/2012.00152); we also know they're exponential in parameter count (cf. https://www.researchgate.net/figure/Number-of-parameters-ie-... )

> If GPT is not exponential, then the m/p distinction falls apart.

Yes, I think if you have a system which implements QAWH with a similar complexity to a known intelligent system -- at that point I have no empirical issues. I think, at that point, you have a working system.

We then ask if it is thinking about anything, and I think that'd be an open question as to how it's implemented. I don't think the pattern alone would mean the system had intentionality -- but my issue at this stage is the narrower empirical one. Without something like a "tractable complexity class", your system is broken.

> And GPT has way too much world-knowledge, IMO, to be storing things in such a costly fashion.

This is an illusion. Knowledge here is deterministic: to the same question, the same answer. GPT generates answers across runs which are self-contradictory, etc.; "the same question" (even literally, or, if you'd like, with some rephrasing) is given quite radically different answers.

I think all we have here is evidence of the (already known) tremendous compressibility of text data. We can, in c. 500bn numbers, compress most of the history of anything ever said. With such a databank, a machine can appear to do quite a lot.

This isn't world knowledge... it is a symptom of how we, language users, position related words near each other for the sake of easy comprehension. By doing this, one can compress our text into brute statistical associations which appear to be meaningful.

As much as GitHub's AI is basically just copy/pasting code from GitHub repos, GPT is just copy/pasting sentences from books.

All the code on GitHub, compressed into billions of numbers, and decompressed a little -- that's a "statistical space of tricks and coincidences" so large we cannot by intuition alone fathom it. It's what makes these systems useful, but also easy illusions.

We can, by a scientific investigation of these systems as objects of study, come up with trivial hypotheses that expose their fundamentally dumb, coincidental character. There are quite a few papers now which do this; I don't have one to hand.

But you know, investigate a model of this kind yourself: permute the input questions, investigate the answers... and try to invalidate your hypothesis (like a scientist might do). Can you invalidate your hypothesis?

I think with only a little thought you will find it fairly trivial to do so.


> All GD learners are interpolators (cf. https://arxiv.org/abs/2012.00152) ,

If the paper is substantially correct I concede the point. But what I've read of reactions leads me to believe the conclusion is overstated.

Regarding compression vs intelligence, I already believe that intelligence, even human intelligence, is largely a matter of compressing data.

Regarding "knowledge is deterministic", ignoring the fact that it's not even deterministic in humans, so long as GPT can instantiate agents I consider the question of whether it "is" an agent academic. If GPT can operate over W_m and H_n, and I live in W_1 and have H_5, I just need to prompt it with evidence for the world and hidden state. Consider for example, how GAN image generators have a notion of image quality but no inherent desire to "draw good images", so to get quality out you have to give them circumstantial evidence that the artist they are emulating is good, ie. "- Unreal Engine ArtStation Wallpaper HQ 4K."


Ok, boomer.


>Words mean what we do with them -- you need to be here in the world with us, to understand what we mean

This is like saying "humans can't fly because flight requires flapping wings under your own power". Sure, it's true given the definition this statement is employing, but so what? Nothing of substance is learned by definition. We certainly are not learning about any fundamental limitations of humans from such a definition. Similarly, defining understanding language as "the association of symbols with things/behaviors in the world" demonstrates nothing of substance about the limits of language models.

But beyond that, it's clear to me the definition itself is highly questionable. There are many fields where the vast majority of uses of language do not directly correspond with things or behaviors in the world. Pure math is an obvious example. The understanding of pure math is a purely abstract enterprise, one constituted by relationships between other abstractions, bottoming out at arbitrary placeholders (e.g. the number one is an arbitrary placeholder situated in a larger arithmetical structure). By your definition, a language model without any contact with the world can understand purely abstract systems as well as any human. But this just implies there's something to understanding beyond merely associations of symbols with things/behaviors in the physical world.


So... we are training models here on HN, especially if we follow the site's guidelines! Makes you think... which makes it think!

Wow, interesting times, indeed.


Wow.

The anti joke explanation was also very impressive.


This may be the most impressive thing I've seen a language model do so far. That's incredible. The future is going to be very weird.


The future is going to be very short.

What this says to me is that there is no general-intelligence task that language models cannot scale to.


Apparently Data (from Star Trek) was less capable of understanding jokes.


this thing is already more human than i am


Your comment prompted me to tweet an image of that section, complete with alt text (as much as can fit). If anyone cares to see it in tweet form.

https://twitter.com/tlalexander/status/1511089810752126984


I've always had this question, so I figured I'd ask it here: these large models are trained on enormous quantities of text - here, 780B tokens, which corresponds to around 1-2 TiB of plaintext scraped from all over the web, using their tokenization scheme. How careful are researchers about making sure their test datasets don't end up in the model? How specifically do they ensure that this doesn't happen?

EDIT: It looks like they specifically address this in the section "Dataset Contamination" on page 35 of the paper - it appears the performance on the clean subsets of certain tasks ("clean" because that data was confirmed not to be in the training data) is similar to the performance on those tasks when all the questions, including those in the training data, are included. I'd need an ML researcher to comment to know if this is a valid way to control for the effects of memorization.
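
For what it's worth, the rough shape of such a check (my own Python sketch, not the paper's code; the example format and helper names are made up) would be to bucket eval questions by n-gram overlap with the training corpus and compare accuracy on the two buckets:

    def ngrams(text, n=8):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def split_by_overlap(eval_examples, train_ngrams, n=8):
        # eval_examples: list of {"question": ..., "answer": ...} dicts (hypothetical format)
        clean, contaminated = [], []
        for ex in eval_examples:
            bucket = contaminated if ngrams(ex["question"], n) & train_ngrams else clean
            bucket.append(ex)
        return clean, contaminated

    def accuracy(examples, predict):
        return sum(predict(ex["question"]) == ex["answer"] for ex in examples) / max(len(examples), 1)

If accuracy on the clean bucket is close to accuracy on the full set, memorization probably isn't driving the headline number.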


Apparently there is a loose convention to exclude any documents from ML training corpora which contain a certain UUID - submitted at https://news.ycombinator.com/item?id=30927569.
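
Something like this, presumably (illustrative sketch only; the UUID below is a placeholder, not the real canary string):

    CANARY = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # placeholder, not the actual canary UUID

    def strip_canaried_docs(documents):
        # drop any document whose author opted out of training via the canary string
        return [doc for doc in documents if CANARY not in doc]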


On many natural language tasks there can be significant overlap, making it difficult to judge performance. That's why I like more complex code generation tasks such as the dataset we used for AlphaCode.


I wonder if AI is a technology that will move from "local producers" to a more centralized setup, where everybody just buys it as a service, because it becomes too complicated to operate it by yourself.

What are examples in history where this has happened before? The production of light, heat and movement comes to mind, that, with the invention of electricity, moved from people's homes and businesses to (nuclear) power plants, which can only be operated by a fairly large team of specialists.

Anybody have other examples?


This is kind of already happening with services like Google cloud translation.


Will we see an intelligence too cheap to meter?

https://www.atlanticcouncil.org/blogs/energysource/is-power-...


Hosting: from your own servers at home, to localized data centers, to global cloud companies.


Yeah, this kinda goes in the same direction, but in this case, as well as for example agriculture, I feel it is mostly for convenience. You could still do it at home if you wanted to, in contrast to operating a nuclear power plant. I thought chip-making might be another example, but I'm not sure that was ever decentralized in its early days.


Interesting that they used Chain of Thought Prompting[1] for improved reasoning so soon after its publication. Also related to DeepMind AlphaCode which generates code and filters results by unit tests, while Chain of Thought Prompting filters by checking for correct answer at the end.

Seems like language models can generate more training data for language models in an iterative manner.

[1] https://arxiv.org/abs/2201.11903


The chain of thought paper is from Google, so they've known about it internally for a while potentially


The general technique is pretty obvious, I discussed and demonstrated it in some HN comments with GPT2 and GPT3 a couple times in the last couple years, and suggested some speculative extensions (which might be totally unworkable, unfortunately these networks are too big for me to attempt to train to try it out) https://news.ycombinator.com/item?id=24005638


In fact, people had already shown it working with GPT-3 before you wrote your comment: https://twitter.com/kleptid/status/1284069270603866113 https://twitter.com/kleptid/status/1284098635689611264 Seeing how much smarter it could be with dialogue was very exciting back then, when people were still super-skeptical.

The followup work has also brought out a lot of interesting points: why didn't anyone get that working with GPT-2, and why wouldn't your GPT-2 suggestion have worked? Because inner-monologue capabilities seem to only emerge at some point past 100b-parameters (and/or equivalent level of compute), furnishing one of the most striking examples of emergent capability-spikes in large NNs. GPT-2 is just way too small, and if you had tried, you would've concluded inner-monologue doesn't work. It doesn't work, and it keeps on not working... until suddenly it does work.


Is there any convincing research for how/why the inner monologue capabilities emerge?

It's extremely unintuitive, but also pretty empirically obvious, that LLMs gain this capability just by scaling, absent any changes in architecture. I assumed that an explicit external memory would be needed, maybe similar to a neural Turing machine.


There is none I am aware of. It all focuses on eliciting and measuring and making good use of the capability.

The lack of an explicit external memory is not too surprising because the text is fed back in at every iteration. That fakes having a memory: the prompt just gets bigger. That's ordinary enough. What's critical, it seems, is being able to decide on the next incremental step and executing it within the space of an iteration, rather than simply 'guessing' the final answer.

As to how that actually happens inside a large but not small Transformer, I suspect that there is a phase transition inside the Transformer itself where it changes how it fundamentally thinks, which doesn't lead to any obvious changes in the training dynamics because the two ways of thinking are initially equivalent in loss. An example of this, where the Transformer computes in a radically different way before and after a certain point in training, is Anthropic's new work on the "induction bump": https://transformer-circuits.pub/2022/in-context-learning-an...
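
To make the mechanism concrete, here's a minimal sketch (assuming some generate(prompt, stop) call into a large model; none of this is from the papers) of how the transcript itself serves as the working memory for those incremental steps:

    def solve_with_scratchpad(question, generate, max_steps=8):
        # the growing prompt is the only state; each sampled step is fed back in
        prompt = question + "\nLet's think step by step.\n"
        for _ in range(max_steps):
            step = generate(prompt, stop="\n")   # one incremental reasoning step
            prompt += step + "\n"
            if step.strip().startswith("Answer:"):
                break
        return prompt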


Google for all their flaws really is building the future of AI. This is incredibly impressive and makes me think we are relatively close to GAI


It doesn’t seem like this architecture has anything to do with (A)GI, which is not hard to invent - just takes two people and 9 months - runs on less than 100W and doesn’t obviously parallelize like this.

Rather we’ve invented some kind of programming by example that generates absolutely enormous programs because we made all of it have to be the data and code at the same time.


> and makes me think we are relatively close to GAI

We’re not. This is undeniably impressive, and likely has huge addressable markets. But we’re still not remotely close to AGI.


I think this remains to be seen. The field is beginning to embrace Moore's law, in a way. What would take an hour to train on an A100 (not to mention an H100) would take like 60k hours (7 years) on a GPU from 20 years ago. The pace at which we are able to brute force things is incredible, and that's not even mentioning the actual science and ML progress happening.

It might be that simple brute forcing solutions become feasible before long.


Why do you think we're not? It's weird how people don't find it weird to just give assertions like this without any supporting argument.


Have you played with GPT3? The amount of effort that goes into the prompts is huge for these examples (and likely cherry picked at that).

The results are unbelievable. But they require a huge amount of nudging. I know it seems like we’ve made huge progress but we’re further away from AGI than we are closer to it.


Have you tried the instruct series of GPT3? It's much better at following instructions than the original.


There are just so many unsolved problems.

Taking sequences of actions, having a general model of the world beyond just language, memory, etc. are all large ones off the top of my head.


> Taking sequences of actions, having a general model of the world beyond just language, memory, etc. are all large ones off the top of my head.

Google: "Robots ground large language models in reality by acting as their eyes and hands while LLMs help robots execute long, abstract language instructions"

https://say-can.github.io/

Released today. An example:

Given the task "I spilled my coke, can you bring me something to clean it up?", SayCan successfully planned and executed the following steps 1. Find a sponge 2. Pick up the sponge 3. Bring it to you 4.


Sure, there are problems that remain. But how can we say there are "so many" of them? We've seen large leaps in capabilities from architectures based on transformers. We've yet to see the limit of transformers as the scaling curves haven't plateaued. It seems to me that we don't know if we're two innovations or two thousand away from AGI. What we do know is that progress has been faster than anyone would have predicted 10 years ago.


The thing we’re progressing towards isn’t AGI, if by that you mean is conscious and sentient.

I’m not sure this is actually possible, ie I don’t think we know “general intelligence” exists. The existence of your mind is a trick your brain plays on itself after all. You don’t exist independently of your body.


If we can make something equivalent to a human, that is AGI. It does not presuppose some notion of consciousness that we ourselves might not meet.


Nobody actually wants that. You can make something equivalent to a human with your spouse right now and it doesn't solve that many problems.

People want AGI because they think it'll have properties of computer programs like being able to copy it everywhere and have it predictably do what you tell it. I suspect there's an inverse relationship between AGI-ness and this kind of usefulness, though.


AGI is AI that is as good as humans on all skills. It has nothing to do with consciousness or sentience.


I think reaching AGI has to do with the 4 E's: embodied, enacted, embedded and extended. These models have no embodiment, no complex external environment, and there is no goal to attain other than language games.

But they could be embodied, and some experiments have shown for example how a LM can guide an agent in a 3D virtual home to accomplish tasks. (*) Maybe AIs can be educated like children after they reach a certain threshold, by giving them robotic bodies and immersing them in the human society which is the most complex environment for intelligence.

(*) Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents - https://arxiv.org/pdf/2201.07207.pdf


They have a complex external environment just as much as we do: the physical world generates their inputs. Just because their inputs are prefiltered through human language doesn't mean they're not exposed to its full complexity just as much as we are, after having it prefiltered through our eyeball and optic nerve. And predicting human statements implicitly requires modeling agents with goals - see the joke explanations.


Human eyes aren't an input. The process of vision is bidirectional and relies on you choosing what to look at just as much as what happens to be sending photons your way.

Same for the other senses and learning; it's all active engagement with an environment. An ML language model just gets text dumped into its weights until it works though. It doesn't get to ask for more, and it's probably not the best design that we're trying to store everything in its weights instead of letting it look up an external library.


I readily agree that it's not the best design, and that it will have a hard time figuring out things that are not in its corpus. But reinforcement learning is also unlike human learning, and may possibly be better at fitting data into a world model without requiring reflective attention and prediction the way human learning does.


But LMs can't do causal interventions to separate correlation from causation unless they have access to the real environment. Would a scientist that can't run experiments be able to advance the field by just reading past papers? Even babies need to try causality laws by letting objects fall and observing what happens, we learn by interacting with our environment.


Right, they can't create new evidence to learn from (yet). But we're feeding them a lot of data. Possibly more (and more diverse) data than a human takes in during childhood! I remember a lot of visual input, but I don't remember reading the entire internet. (Yes I know it's exaggeration, but come on- compare characters vs characters).


You could say that predicting the next token is their only action, and minimizing the loss their only goal. It's a kind of impoverished environment. Imagine a human living in a cave, seeing only a scrolling line of text. Would that human learn anything new by getting out of its cave? Plato was so right, intelligence needs access to the full environment.


I mean, isn't the whole point of Plato that we do everything we do, which is a lot, certainly enough to be dangerous, just by sitting in a cave and manipulating levers while watching a shadow? This does not make text-trained AI seem safer.

Seen another way, the lesson of Plato is "Shadows On A Cave Wall Are All You Need - arxiv.com"


Here's an example of an extended language model operating in a real world environment:

https://www.youtube.com/watch?v=ysFav0b472w


Being good at arbitrarily poorly-designed skills has nothing to do with sentience?


Look, I'm actually a massive optimist on this topic and think we can get very far with large language models. Memory I think will actually not be that difficult to solve, optimizing for long series of actions will be a hard one though, as will making an AI that can reason in a representationally invariant and transferrable way (ie. something that can reason just as well about data in the form of images as text).

I'm not saying we're not seeing continuous improvements on performance on benchmarks from scaling up - we absolutely are. But "the end of the benchmarks" is not AGI - even if we have a tool AI/oracle AI that can perform at a superhuman level on benchmark tasks (which I think is absolutely doable within the current paradigm), we do not have a general AI.


Hooking up transformers into an agent AI does not seem like a daunting task given the ability to generate snippets of reasoning on demand.


Yeah, the multimodal work is already making progress and will likely be the future. That problem seems solvable soonish. Sequences of actions are much harder, but with their new Chain of Thought research and what's happened in RL, I wouldn't be surprised if this was mostly solved in < 10 years. Retrieval transformers are the most promising on memory.


I think multimodal work is still very much in the early stages. That said, I don't see any reason why the exact same approach we are using here wouldn't work ultimately for multi-modal settings as well.

Memory also seems easy to solve.

I am still very skeptical of our current approach in relation to sequences of actions, reasoning, discrete choices, and interaction with complex environments. RL is at baby level compared to the advances we have made in language modeling.


The chain-of-thought prompting demonstrates that the output tape can act as procedural memory.


Well, it is unfortunately an output whose cost scales as O(n^2) in its length, which is not super great once you get to sequences of 1000s of words.
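
The quadratic term comes from self-attention: every token attends to every previous token, so the score matrix has n x n entries. Toy numpy sketch (single head, no batching):

    import numpy as np

    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])            # shape (n, n) -- the O(n^2) part
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        return weights @ V

    n, d = 2048, 128
    Q = K = V = np.random.randn(n, d)
    out = attention(Q, K, V)   # doubling n quadruples the work in the score matrix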


Sparse transformers are a thing. Dictionary lookup is a thing. The transformer can probably be trained to store long chains of information in a dedicated memory system, retrieving it using keywords.

These are the things that came to mind in ten seconds. This is not going to be the problem that meaningfully delays AGI.


As I said in another comment, I think memory will be the easiest of these challenges to solve. That said, it hasn't really been solved yet.

If we can scale up S4-style architectures to this size, maybe it will be solved.


We're slowly moving our discussion from "How could you think AGI is possible?" to "Is this AGI yet?".


Exactly! These lines will just be more and more blurred in coming years


Please Google.... Please include in your papers non-cherry-picked sample outputs! And explicitly say that they aren't cherry picked.

I understand that there is a chance that the output could be offensive/illegal. If necessary you can censor a few outputs, but make clear in the paper you've done that. It's better to do that than just show us the best picked outputs and pretend all outputs are as good.


Exactly, it's so easy to try out thousands of examples and then just show the best one.


Based on their 3rd figure, it would take an approximately 100x larger model (and more data) to surpass the performance of the best humans


Its performance on answers to chained inference questions (on page 38 of https://storage.googleapis.com/pathways-language-model/PaLM-...) has already surpassed the performance of this human.


I placed a transparent plastic ruler on the screen to come to the same conclusion, then I saw your comment.


Your methodology is much more sophisticated than mine


My dad, who worked on jet engine production many decades ago, would refer to MIL SPEC EYEBALL Mk I. (I think he was kidding...)


Is there even enough text for that? Wild. Although after a while, maybe you can generate some.


Does anyone know what the units are on the "performance improvement over SOTA" chart?


If you're talking about fig. 4, then it's some units scaled so that random performance is 0 and perfect performance is 100 (depending on task it may be accuracy or something else). Since the models are so large, good benchmarks are diverse, and different tasks require different metrics.
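
Presumably something along the lines of (my guess at the normalization, not taken from the paper):

    def normalized_score(raw, random_baseline, max_score=1.0):
        # 0 = random-chance performance, 100 = perfect performance
        return 100.0 * (raw - random_baseline) / (max_score - random_baseline)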


I was wondering the same. Without better y-axis labeling, it's not that informative of a graphic.


Poetic that the top post right now is (partially) about how science communication over-simplifying figures results in a popular misunderstanding of science, leading readers to believe that conducting research is easier than it actually is.


Turns out it's a composite of "normalized task-specific metrics", details in the paper. Shrug. Numbers go up!


Two basic questions about the performance of the model:

1. What's the maximum context size (don't know the right term for this) that the model can consider at once? For example, for GPT-3 I believe the limit is around 2000 tokens (where each token corresponds to a few characters or perhaps a word in a dictionary)?

2. What is the inference latency - how long does it take to get a result from this model? Given that they don't have any demos linked to in the blog post that I can see, seems like it's long (like multiple seconds, maybe even a minute?).


Sequence length is 2048 (page 10). Guesswork based on page 65 about compute says that a single inference over 2048 tokens is about 1 petaFLOP, which may not be that bad if your communication is fast (seconds?)


According to https://cloud.google.com/tpu, each individual TPUv3 has 420 Teraflops, and TPUv4 is supposed to double that performance, so if that guess is correct, it should take a few seconds to do inference. Quite impressive really
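
Back-of-the-envelope version of that estimate (peak FLOPs only; memory bandwidth and interconnect, which likely dominate in practice, are ignored):

    flops_per_forward = 1e15            # ~1 petaFLOP for a 2048-token pass, per the guess above
    tpu_v3 = 420e12                     # peak FLOP/s per TPU v3, per the Cloud TPU page
    tpu_v4 = 2 * tpu_v3                 # "supposed to double that performance"
    print(flops_per_forward / tpu_v3)   # ~2.4 s on a single v3 at peak
    print(flops_per_forward / tpu_v4)   # ~1.2 s on a single v4 at peak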


This huge language model was trained 'from scratch' - ie. before the first batch of data went into the training process, the model state was simply initialized using random noise.

I believe we are near the end of that. As models get more and more expensive to train, we'll see future huge models being 'seeded' with weights from previous models. Eventually nation-state levels of effort will be used to further train such networks to then distribute results to industry to use.

A whole industry will be built around licensing 'seeds' to build ML models on - you'll have to pay fees to all the 'ancestors' of models you use.


Would that even work? Using the pretrained model, wouldn’t you then be in the “local” optimum?

Although from a transfer learning perspective, chopping off last layers and retraining… You are prob right


Crazy impressive! A question about the training data: anyone familiar with this line of work know what social media platforms the "conversation" data component of the training set came from? There's a datasheet that points to prior work https://arxiv.org/abs/2001.09977, which sounds like it could be reddit, HN, or a similar platform?


Is there an equivalent to Moore’s Law for language models? It feels like every week an even bigger (and supposedly better) model is announced.


Scaling Laws for Neural Language Models - https://arxiv.org/abs/2001.08361



What do 540 billion parameters mean in this case?


540B float32 values in the model. (although since this model was trained via TPUs, likely bfloat16s instead)


Do those floats encode the weights of synapses? One synapse per float?


Machine learning models are much more complicated than perceptrons now.


They encode weights. Not sure what a "synapse" is though.


Network weights would take up around 2TB.

It also means at least 4TB of TPU/GPU memory required for training, or at least 50 Nvidia A100s.


When can we try using it?? :)


Would love an answer on this too. It would be even better not just to try using this, but also be able to run it locally, something that has been impossible for GPT-3.


This is not something that will be possible to run locally.

If you had 1 bit per parameter (not realistic), it would still take ~100 GB of RAM just to load into memory.


You could technically offload the RAM overflow to disk dynamically, but this would probably be too slow?


I mean, theoretically if you can get the model weights onto disk then you should be able to do the computation - but it might take days or months on commodity hardware. It would also require creating a system that can do this, and I doubt there is much demand.


Does it look like it would be possible to run locally?


i wonder if pruning and other methods that reduce size drastically while not compromising on performance would be possible


I didn’t see mentioned any papers published on PaLM, nor any open access program.

It all sounds impressive but is there a way to sort out substance from the hype in terms of advancing the state of the art?




It's a different paper


From the paper: "In this work, we continue the scaling line of language modeling improvements and train a 540 billion parameter, densely activated, autoregressive Transformer on 780 billion tokens of high-quality text."

The number of parameters is nearly the number of text tokens. So aren't they simply overfitting a lot?


Why scale up a technology that has resulted in no understanding whatsoever - only statistical matching of word groups like a super parrot?

Words are simply the data we humans use to convert to a model of space/time for computation as we read them. The model we build in our minds is conceptually linked with little resemblance to those very few input words.

You could dim the lights of the earth running the largest computer ever made with the largest training set ever assembled and still, you would be no closer to understanding or anything genuinely useful for us humans.


Whilst I agree there are likely better uses for the world's greatest minds, I can think of 5 commercial applications for this paper alone. It is very useful.


I would not disagree that there are commercial applications for knowing the most statistically likely set of words, with games potentially being the #1 use case, I think.

But for actual intelligence, that is understanding the world, it is not helpful at all.


"Knowing what are the most statistically used set of words that a human would say next that are consistent with the things they have said before" is equivalent to general intelligence.

The alternative is to believe that intelligence is not required to generate human speech. At which point, why would I believe anyone on here is intelligent? All I have of you is speech.


I think you fully missed my point. Intelligence involves conceptually understanding a space/time model of the world such that you can use the word data as input to then create that model in your mind and compute with it. The words alone are just data and have no ability as data to build a conceptual/causal space/time model of what those words mean. So knowing statistics about words = statistics about words. It has nothing to do with understanding what the words mean = intelligence


By the same metric, humans only know statistics about photons and pressure waves, which obviously has nothing to do with actually existing objects. I don't see why we should call our statistics "intelligence" and the network's statistics "statistics".


What do you mean by "understanding"? Even if it doesn't "think" exactly how humans do, does that really matter?


It doesn't grasp the world at all, there is no intelligence. It's just statistics about word use.


Please explain the "joke explanations" to me, using statistical methods.


Take 100k joke and explanation tuples as training data. Pattern match the input joke to the training jokes. Edit an amalgamation of the most structurally similar training jokes to match the content of the input joke. Edit the corresponding training explanations with the same substitutions. Output the explanation. Repeat with 100 input jokes, pick the best one.
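
A toy version of that pipeline (word-overlap retrieval plus naive word substitution; purely illustrative, and nothing like what the paper actually does):

    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))

    def explain_joke(joke, training_pairs, k=5):
        # training_pairs: list of (joke, explanation) tuples
        nearest = sorted(training_pairs, key=lambda p: overlap(joke, p[0]), reverse=True)[:k]
        candidates = []
        for train_joke, train_expl in nearest:
            # crude "edit": map the retrieved joke's words onto the input joke's words
            subs = dict(zip(train_joke.lower().split(), joke.lower().split()))
            candidates.append(" ".join(subs.get(w, w) for w in train_expl.lower().split()))
        return max(candidates, key=lambda e: overlap(joke, e))

Cherry-picking the nicest-looking output across many input jokes is then just a selection step on top.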


But that is not what they did. Please read the paper.


I'm not saying that's what they did. I'm saying it's functionally equivalent.


All these approaches are incorrect. They are trying to brute-force it. What we need are better nets.


About 12 pages just for references...goes to show the insane speed of DL research.


The model is insane, but could this realistically be used in production?


Why not? I'm curious if you're picturing any specific roadblocks in mind. OpenAI makes their large models available through an API, removing any issues with model hosting and operations.


Latency, mostly.

The GPT-3 APIs were very slow on release, and even with the current APIs it still takes a couple seconds to get results from the 175B model.


I too am curious what kind of hardware resources are needed to run the model once it is trained


Yes. You don't need the model in RAM; NVMe disks are fine.


That would have very slow inference latency if you had to read the model off disk for every token.


540B parameters means ~1TB of floating bytes (assuming BFLOAT16). Quadruple that for other associated stuff, and you'd need a machine with 4TB of RAM.
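
The arithmetic, for anyone checking (540e9 parameters times bytes per parameter):

    params = 540e9
    print(params * 2 / 1e12)      # bfloat16 weights: ~1.1 TB
    print(params * 4 / 1e12)      # float32 weights:  ~2.2 TB
    print(params * 2 * 4 / 1e12)  # bf16 weights x4 for "associated stuff": ~4.3 TB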


right - and even if you did happen to have a machine with 4TB of ram - what type of latency would you have on a single machine running this as a service? how many machines would you need for google translate performance?

doesn't seem like you can run this as a service, yet.


The total memory of the model is less important than the memory needed to compute one batch. I've worked with recommendation models used in serving that were 10ish terabytes. The simple trick was that most of the memory was embeddings, and only a small subset of embeddings were needed to do inference for one batch. If you fetch those embeddings as if they were features, you can run very large models on normalish compute. You never need to load the entire model into RAM at once.

Another trick you can use is load only some layers of the model into ram at a time (with prefetching to minimize stalls).

Or if you are google enjoy that tpus have a silly amount of ram. Tpu pods have a ton of ram.
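
A bare-bones illustration of the "load only some layers at a time" idea (hypothetical on-disk layout of one .npy file per layer; a real serving stack would prefetch layer i+1 while computing layer i rather than stalling like this):

    import numpy as np

    def forward_offloaded(x, layer_paths):
        # only the current layer's weights are ever resident in RAM
        for path in layer_paths:
            w = np.load(path)            # pull this layer in from disk
            x = np.maximum(x @ w, 0)     # stand-in for a real transformer block
            del w                        # release before loading the next layer
        return x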


Two questions :

1/ How much CO2 was emitted to train this? How much to query it?

2/ Does Google use that in its search engine? I'd be very happy to have that kind of AI helping me to find the information I need!


1/ "Using 378.5W measured system power per TPU v4 chip, this leads to a total effective emissions of 271.43 tCO2e. To put this in perspective, total emissions of a direct round trip of a single passenger jet between San Francisco and New York (JFK) is estimated to be 180 tCO2e (Patterson et al., 2021), and total emissions for GPT-3 are estimated to be 552 tCO2e (Patterson et al., 2021). All of the energy use and emissions for PaLM training and the experiments described in this paper are compensated with renewable energy sources (Sustainability, 2022)."


not a single super large language model has beaten the state of the art in the key NLP tasks (POS tagging, dependency parsing, coreference, WSD, NER, etc.). They are always only used for higher-level tasks, which is tragic.


Why is that tragic? Classic NLP tasks are IMHO kinda pointless. Nobody _actually_ cares about parse trees, etc. These things were useful when that was the best we could do with ML, because they allowed us to accomplish genuinely-useful NLP tasks by writing code that uses things like parse trees, NER, etc. But why bother with parse trees and junk like that if you can just get the model to answer the question you actually care about?


I would not discount NLP tasks just yet. In practice, they are still used to solve problems like spellcheck, autocomplete, and even word- and text-level rewrites.


I'd argue that every one of those tasks is better solved with a neural LM. The fact that people still use traditional techniques does not mean those old techniques are better. It just means those tasks haven't caught up to the modern era yet.

Or maybe it's because MLMs are too expensive to run. But that's hardly an argument that MLMs should solve traditional NLP tasks.


> if you can just get the model to answer the question you actually care about?

LOL, your level of optimism is over 9000. Language models are shit: they regularly utter nonsense and have an impressive number of limitations (e.g. no continual learning). A neuro-symbolic system, on the other hand, can be incrementally improved upon, has continual learning/a memory, and is interpretable.


You selling buggy whips? Good luck with that.



