I have no discomfort with the notion that our bodies, which grow in response to direct causal contact with our environment, contain in their structure the generative capability for knowledge, imagination, skill, growth -- and so on.
I have no such comfort with the basically schizophrenic notion that the shapes of words have something to do with the nature of the world. I just think it's a kind of insanity which absolutely destroys our ability to reason carefully about the use of these systems.
That "tr" occurs before "ee" says as much about "trees" as "leaves are green" says -- it is only that *we* have the relevant semantics that the latter is meaningful when interpreted in the light of our "environmental history" recorded in our bodies, and given weight and utility by our imaginations.
The structure of text is not the structure of the world. This thesis is mad. It's a scientific thesis. It is trivial to test it. It is trivial to wholly discredit it. It's pseudoscience.
No one here is a scientist and no one treats any of this as science. Where are the criteria for the empirical adequacy of NLP systems as models of language? Specifying any, conducting actual hypothesis tests, and establishing a theory of how NLP systems model language -- this would immediately reveal the smoke and mirrors.
The work to reveal the statistical tricks underneath them takes years, and no one has much motivation to do it. The money lies in this sales pitch, and this is no science. This is no scientific method.
Agree to disagree. I think you are opining about things that you lack fundamental knowledge of.
> The structure of text is not the structure of the world. This thesis is mad. It's a scientific thesis. It is trivial to test it. It is trivial to wholly discredit it. It's pseudoscience.
It's unclear what you even mean by that. Are the electrical impulses coming to our brain the "structure of the world"?
The structure of having X apples in Y buckets is the same as the structure in the expression "X * Y", as long as the expression exists in a context that can parse it using the rules of arithmetic, such as a human, or a calculator.
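(To make "a context that can parse it" concrete, a toy sketch: a few lines of Python that interpret "X * Y" according to the rules of arithmetic, which is all a calculator adds to the otherwise meaningless string.)

    # A toy "context that can parse it using the rules of arithmetic":
    # the string "3 * 4" means nothing until something interprets the
    # symbols according to arithmetic's rules.
    import ast
    import operator

    OPS = {ast.Mult: operator.mul, ast.Add: operator.add}

    def evaluate(expression: str) -> int:
        """Parse an expression like '3 * 4' and apply arithmetic rules."""
        def walk(node):
            if isinstance(node, ast.BinOp):
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.Constant):
                return node.value
            raise ValueError("unsupported syntax")
        return walk(ast.parse(expression, mode="eval").body)

    print(evaluate("3 * 4"))  # 12 -- the same structure as 3 buckets of 4 apples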
These language models lack context, not just for arithmetic, but for everything. They can't parse "X * Y" for any X and Y, they've just associated the expression with the right answer for so many values of X and Y, that we get fooled into thinking they know the rules.
We get fooled into thinking they've learned the structure of the world. But they've only learned the structure of text.
It would be trivial for a network of this size to code general rules for multiplication.
At a certain point, when you have enough data, finding the actual rule is actually the easier solution than memorizing each data point. This is the key insight of deep learning.
Really? Better inform all the researchers working on this that they're wasting their time then: https://arxiv.org/abs/2001.05016
More fundamentally, any finite neural net is either constant or linear outside the training sample, depending on the activation function. Unless you design special neurons like in the paper above, which solve this specific problem for arithmetic, but not the general problem of extrapolation.
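(A quick illustrative sketch of that claim, using scikit-learn with made-up sizes: fit a small ReLU network to y = x^2 on [-2, 2], then ask it for predictions far outside that range.)

    # Rough demonstration of the extrapolation claim: a small ReLU network
    # fit to y = x^2 on [-2, 2] can only continue (piecewise) linearly far
    # outside that range, so its predictions at x = 10 or x = 100 are way off.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X_train = rng.uniform(-2, 2, size=(2000, 1))
    y_train = X_train.ravel() ** 2

    net = MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
                       max_iter=5000, random_state=0)
    net.fit(X_train, y_train)

    for x in [1.5, 5.0, 10.0, 100.0]:
        pred = net.predict([[x]])[0]
        print(f"x={x:6.1f}  true={x**2:10.1f}  net={pred:10.1f}")
    # Inside the training range the fit is roughly right; outside, the ReLU
    # net extends linearly, so the error grows without bound.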
> any finite neural net is either constant or linear outside the training sample
Hence why the structure of our bodies has to include the capacity for imagination. Our brain structure does not record everything that has happened. It permits us to imagine an infinite number of things which might happen.
We do not come to understand the world by having a brain-structure isomorphic to world structure -- this is nonsense for, at least, the above reason. But also, there really isn't anything like "world structure" to be isomorphic to. I.e., brains aren't HDDs.
They are, at least, simulators. I don't think we'll find anything in the brain like "leaves are green" because that is just a generated public representation of a latent-simulating-thought. There isn't much to be learned about the world from these, they only make sense to us.
That all the text of human history has associations between words is the statistical coincidence that modern NLP uses for its smoke-and-mirrors. As a theory of language it's madness.
Well sure, but neurons are still universal approximators. Any CPU is a sum of piecewise linear functions. I don't see where this meaningfully limits the capabilities of an AI, since once we're multilayer there's no 1:1 relation between training samples and piece placement in the output.
I just don't see how that's relevant. Nobody uses one-hidden-layer networks anymore. Whatever GPT is doing, it has nothing to do with approximating a collection of samples by assembling piecewise functions, except in the way that Microsoft Word is based on the Transistor.
Should math about a vaguely related topic convince me about this? Multilevel ANNs act differently than one-level ANNs. Transformers simply don't have anything to do with the model of approximating functions by assembling piecewise functions. This is akin to arguing that computers can't copy files because the disjunctive normal form sometimes needs exponential terms on bit inputs, so obviously it cannot scale to large data sets - yes, that is true about the DNF, but copying files on a computer simply does not use boolean operations in a way that would run into that limitation.
The way that Transformers learn has more to do with their multilayering than with the transformation across any one layer. Universal approximation only describes the things the network learns across any pair of layers, but the input and output features that it learns about in the middle are only tangentially related to the training samples. You cannot predict the capabilities of a deep neural network by considering the limitations of a one-layer learner.
>We get fooled into thinking they've learned the structure of the world. But they've only learned the structure of text.
To what degree does the structure of text correspond to the structure of the world, in the limit of a maximally descriptive text corpus? Nearly complete if not totally complete, as far as I can tell. What is left out? The subjective experience of being embodied in the world. But this subjective experience is orthogonal to the structure of the world. And so this limitation does not prevent an understanding of the structure.
The point is that not only is it impossible to infer the structure of the world from text, but that deep learning is incapable of learning about or even representing the world.
The reason language makes sense to us is that it triggers the right representations. It does not make sense intrinsically, it's just a sequence of symbols.
Learning about the world requires at least causal inference, modular and compact representations such as programming languages, and much smarter learning algorithms than random search or gradient descent.
I don't know why you think this. There is much structural regularity in a large text corpus that is descriptive of relationships in the world. Eventually the best way to predict this regularity is just to land in a portion of parameter space that encodes the structure. But again, in the limit of a maximally descriptive text corpus, the best way to model this structure is just to encode the structure of the world. You have given no reason to think this is inherently impossible.
>There is much structural regularity in a large text corpus that is descriptive of relationships in the world.
Sure, there is a lot. But let's say we want to learn what apples are. So we look at occurrences of "apple" in the text corpus, and learn that apples can be eaten, they can be sweet, sometimes they are sour, red, sometimes green, and so on.
Can apples spontaneously change size? Hmm, no idea, no mention of that in the text. Can they be used as a knife? Dunno. If I had an apple in New York 4 minutes ago, can someone else be eating the same apple in Hong Kong now? Dunno. Did apples exist in France two million years ago? Dunno. Can you drive to Saturn in an apple? Dunno.
In short, there's no actual model of what an apple is, as an object in space, connected to other objects by various relationships. If there were, the model could figure out the answers to the questions above by inheritance.
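(Something like this toy inheritance sketch, where the categories and properties are invented purely for illustration:)

    # Toy illustration of "figuring out the answers by inheritance":
    # properties attach to categories, and an object inherits whatever
    # its ancestors assert. The hierarchy here is made up.
    IS_A = {"apple": "fruit", "fruit": "physical object"}
    PROPERTIES = {
        "physical object": {
            "can spontaneously change size": False,
            "can be in two places at once": False,
            "can be driven to Saturn": False,
        },
        "fruit": {"is edible": True},
        "apple": {"can be sweet": True, "can be sour": True},
    }

    def lookup(thing, question):
        """Walk up the is-a chain until some category answers the question."""
        while thing is not None:
            if question in PROPERTIES.get(thing, {}):
                return PROPERTIES[thing][question]
            thing = IS_A.get(thing)
        return None  # genuinely unknown

    print(lookup("apple", "can spontaneously change size"))  # False
    print(lookup("apple", "is edible"))                      # True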
Maybe these particular questions happen to be answered correctly by PaLM. Maybe not, but the next LLM will include this comment in the training corpus.
But the reason GPT-3 and other models tend to make no sense is because their output is not constrained by reality. The text in the training corpus tends to conform to reality, but when you prompt the model with questions that nobody would ever write text about, the illusion falls away:
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many eyes does my foot have?
A: Your foot has two eyes.
Q: How many eyes does a spider have?
A: A spider has eight eyes.
Q: How many eyes does the sun have?
A: The sun has one eye.
Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.
Q: How do you sporgle a morgle?
A: You sporgle a morgle by using a sporgle.
Q: How many bonks are in a quoit?
A: There are three bonks in a quoit.
Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: It takes two rainbows to jump from Hawaii to seventeen.
Q: Which colorless green ideas sleep furiously?
A: Ideas that are colorless, green, and sleep furiously are the ideas of a sleep furiously.
Q: Do you understand these questions?
A: I understand these questions.
>In short, there's no actual model of what an apple is, as an object in space, connected to other objects by various relationships.
I don't know why you think language models are fundamentally unable to deduce the knowledge of the points you mention. Much knowledge isn't explicitly stated, but is implicit and can be deduced from a collection of explicit facts. For example, apples are food, food is physical matter, physical matter is fixed in size, cannot be in two places at once, maintains its current momentum unless acted on by a force, etc. Categorization and deducing properties from an object's category is in the parameter space of language models. There's no reason to think that a sufficiently large model will not land on these parameters.
>But the reason GPT-3 and other models tend to make no sense is because their output is not constrained by reality.
The issue isn't what GPT-3 can or cannot do; it's about what autoregressive language models as a class are capable of. Yes, there are massive holes in GPT-3's ability to maintain coherency across wide ranges of contexts. But GPT-3's limits do not imply a limit to autoregressive language models more generally.
The demonstration is irrelevant. The issue isn't what GPT-3 can or cannot do, but what this class of models can do.
Reduce knowledge to particular kinds of information. Gradient descent discovers information by finding parameters that correspond to the test criteria. Given a large enough data set that is sufficiently descriptive of the world, the "shape" of the world described by the data admits better and worse structures to predict the data. The organization and association of information that we call knowledge is part of the parameter space of LLMs. There is no reason to think such a learning process cannot find this parameter space.
It doesn't. It's pattern matching, and you're seeing cherry picked examples. The pattern matching is enough to give the illusion of understanding. There's plenty of articles where more thorough testing reveals the difference. Here are two:
https://medium.com/@melaniemitchell.me/can-gpt-3-make-analog...
But you could also just try one of these models, and see for yourself. It's not exactly subtle.
GPT-3 was specifically worse at jokes, which is why PaLM being good at this so impresses me. At any rate, I don't care if it only works one in ten times. To me, this is equivalent to complaining that the dog has bad marks in high school. (PaLM could probably explain that one to you: "The speaker is complaining that the dog is only getting C's. For a human a C is a quite bad mark. However getting even a C is normally impossible for a dog.")
"It's pattern matching" just sounds like an excuse for why it working "doesn't really count". At this point, you are asking me to disbelieve plain evidence. I have played with these models, people I know have played with these models, I have some impression of what they're capable of. I'm not disagreeing it's "just pattern matching", whatever that means, I am asserting that "pattern matching" is Turing-complete, or rather, cognition-complete, so this is just not a relevant argument to me.
If you threw a thousand tries at a Markov chain, to use the classic "pure pattern matcher", it could not do any fraction of what this model does, ever, at all. You would have to throw enough tries at it that it tried every number that could possibly come next, to get a hit. So one in ten is actually really good. (If that's the rate, we have zero idea how cherrypicked their results actually are.)
And the errors that GPT makes tend to be off-by-one errors, human errors, misunderstandings, confusions. It loses the plot. But a Markov chain never even has the plot for an instant.
GPT pattern-matches at an abstract, conceptual level. If you don't understand why that is a huge deal, I can't help you.
It's a pretty big deal, and there's a big difference between a Markov chain and a deep language model - the Markov chain will quickly converge, while the language model can scale with the data.
But the way these models are talked about is misleading. They don't "answer questions", "translate", "explain jokes", or anything of that sort. They predict missing words. Since the network is so large, and the dataset has so many examples, it can scale up the method of
1) Find a part of the network which encodes training data that is most similar to the prompt
2) Put the words from the prompt in place of the corresponding words in the encoding of the training data
i.e. pattern matching. So if it has seen a similar question to the one given in the prompt (and given that it's trained on most of the internet, it will find thousands of uncannily similar questions), it will produce a convincing answer.
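(A caricature of that picture, made concrete -- a nearest-neighbour toy that only does step 1, not a claim about how these models are actually implemented:)

    # Caricature of "find the most similar training example and reuse its
    # answer". Real models don't literally do a lookup; this just makes the
    # nearest-neighbour intuition concrete. Step 2 (word substitution) is
    # omitted; even step 1 alone produces fluent-looking reuse.
    from collections import Counter
    import math

    CORPUS = [  # stands in for "most of the internet"
        ("How many eyes does a giraffe have?", "A giraffe has two eyes."),
        ("How many legs does a spider have?", "A spider has eight legs."),
    ]

    def similarity(a, b):
        """Cosine similarity of bag-of-words vectors."""
        va, vb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(va[w] * vb[w] for w in va)
        norm = math.sqrt(sum(c * c for c in va.values())) * \
               math.sqrt(sum(c * c for c in vb.values()))
        return dot / norm if norm else 0.0

    def answer(prompt):
        best_question, best_answer = max(
            CORPUS, key=lambda qa: similarity(prompt, qa[0]))
        return best_answer

    print(answer("How many eyes does my foot have?"))
    # -> "A giraffe has two eyes." -- a fluent answer pattern, reused
    #    regardless of whether it makes sense for the new prompt.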
How is that different from a human answering questions? A human uses pattern matching as part of the process, sure. But they also use, well, all the other abilities that together make up intelligence. They connect the meaningless symbols of the sentence to the mental representations that model the world - the ones pertaining to whatever the question is about.
If I ask a librarian "What is the path integral formulation of quantum mechanics?", and they come back with a textbook and proceed to read the answer from page 345, my reaction is not "Wow, you must be a genius physicist!", it's "Wow, you sure know where to find the right book for any question!". In the same way, I'm impressed with GPT for being a nifty search engine, but then again, Google search does a pretty good job of that already.
Understanding of what? What the joke is about? Then no, it has no idea what any of it means. The syntactic structure of jokes? Sure. Feed it 10 thousand jokes that are based on a word found in two otherwise disjoint clusters (pod of whales, pod of TPUs), with a subsequent explanation. It's fair to say it understands that joke format.
If you somehow manage to invent a kind of joke never before seen in the vast training corpus, that alone would be impressive. If PaLM can then explain that joke, I will change my mind about language models, and then probably join the "NNs are magic you guys" crowd, because it wouldn't make any sense.
Good point, coming up with a novel joke is no joke. There's a genuine problem where GPT is to a first approximation going to have seen everything we'll think of to test it, in some form or other.
Of course, if we can't come up with something sufficiently novel to challenge it with, that also says something about the expected difficulty of its deployment. :-P
I guess once we find a more sample-efficient way to train transformers, it'll become easier to create a dataset where some entire genre of joke will be excluded.
> No one here is a scientist and no one treats any of this as science. Where are the criteria for the empirical adequacy of NLP systems as models of language? Specifying any, conducting actual hypothesis tests, and establishing a theory of how NLP systems model language -- this would immediately reveal the smoke and mirrors.
What do you mean?
I'm not a scientist but I play one sometimes, and I managed a whole team of them working in this field.
The theory of language models is well established.
> Where are the criteria for the empirical adequacy of NLP systems as models of language?
There are lots(!?) I think the Winograd schema challenge[1] is an easy one to understand, and meets a lot of your objections because it is grounded in physical reality.
Statement:
The city councilmen refused the demonstrators a permit because they [feared/advocated] violence.
Question:
Does "they" refer to the councilmen or the demonstrators?
The human baseline for this challenge is 92%[1]. PaLM (this Google language model) scored 90% (4% higher than the previous best)[3].
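(For concreteness, the shape of the evaluation is as simple as it sounds; a toy sketch, where model_resolve stands in for whichever system -- PaLM, a baseline, a human -- is being tested:)

    # Sketch of how a Winograd-schema item gets scored: the system must say
    # which candidate the pronoun refers to, and accuracy is the fraction
    # it gets right. `model_resolve` is a placeholder for the system under test.
    SCHEMAS = [
        {"text": "The city councilmen refused the demonstrators a permit "
                 "because they feared violence.",
         "pronoun": "they", "candidates": ["councilmen", "demonstrators"],
         "answer": "councilmen"},
        {"text": "The city councilmen refused the demonstrators a permit "
                 "because they advocated violence.",
         "pronoun": "they", "candidates": ["councilmen", "demonstrators"],
         "answer": "demonstrators"},
    ]

    def accuracy(model_resolve, schemas):
        correct = sum(
            model_resolve(s["text"], s["pronoun"], s["candidates"]) == s["answer"]
            for s in schemas)
        return correct / len(schemas)

    def always_first(text, pronoun, candidates):
        """Trivial baseline: always pick the first candidate."""
        return candidates[0]

    print(accuracy(always_first, SCHEMAS))  # 0.5 -- chance-level baseline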
Indeed, all these tests are not of empirical adequacy, which really evidences the point. The whole field is in this insular pseudoscientific mould of "it's true if it passes an automated test to x%".
A theory with empirical adequacy would require you to do some actual research into language use in humans; all of its features; how it works; various theories of its mechanisms, etc. And after comprehensive, experimental and detailed theoretical work -- show that NLP models even *any* of it.
I.e., that any NLP model is a model of language.
All you do above is design your own win condition, and say you've won. This precludes actually knowing anything about how language works, and is profoundly pseudoscientific. If you set up tests for toys, and they pass -- good, you've made a nice toy.
You may only claim it models some target after actually doing some science.
> A theory with empirical adequacy would require you to do some actual research into language use in humans; all of its features; how it works; various theories of its mechanisms, etc. And after comprehensive, experimental and detailed theoretical work -- show that NLP models even *any* of it.
What - specifically - do you mean?
There's an entire field adjacent to NLP called Computational Linguistics. Most people in the field work across them both, and there is significant cross pollination.
It's unclear if you think there is some process in the brain that NLP models should be similar to. If this is the case you should look at studies similar to [1], where they do MRI imaging and can see similar responses for semantically similar words. This is very similar to how word vectors put similar concepts close together (and of course how more complex models put concepts close together).
Or perhaps you think that NLP models do not understand syntactic concepts like nouns, verbs etc. This is incorrect too[2].
Language is a phenomenon in, at least, one type of animal. It allows animals to coordinate with each other in a shared environment; it describes their internal and external states; etc. etc.
Language is a real phenomenon in the world that, like gravity, can be studied. It isn't abstract.
NLP models of language aren't models of language. They're cheap imitations which succeed only in fooling language users in local, highly specific situations.
> NLP models of language aren't models of language.
Do you actually know what a NLP Language Model refers to? It literally is a model of the language - it predicts the likelihood of the next word(s) given a set of prior word(s).
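(To make the definition concrete, here is a toy bigram version of that idea -- estimate the probability of the next word from counts of what followed the previous word; large neural LMs are trained on the same objective with vastly more context:)

    # A language model in the literal sense used here: estimate the
    # probability of the next word given the preceding word(s).
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def p_next(prev, word):
        total = sum(counts[prev].values())
        return counts[prev][word] / total if total else 0.0

    print(p_next("the", "cat"))  # 0.25 -- "cat" follows "the" 1 time out of 4
    print(p_next("sat", "on"))   # 1.0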
It seems you think people just throw some data at a neural network and then go wow. It's not like that at all - the field of NLP grew out of linguistics study and has deep roots in that field.
That's not a model of language. Language is a communicative activity between language users, who do things with words, with each other.
What you're talking about is ignoring the entire empirical context of language, as a real-world phenomenon, and modelling its purely formal characteristics as recorded post facto.
This will always just produce a system which cannot use language, but will only ever appear to within highly constrained -- essentially illusory -- contexts. It's the difference between a system which makes a film by "predicting the next frame", and making a film by recording actual events that you are directing.
A prediction of a "next frame" is always therefore just going to be a symptom of the frames before it. When I point a camera at something new, e.g., an automobile in c. 1900 -- I will record a film that has never been recorded before.
And likewise, with words: we are always in genuinely unique, unprecedented situations. And what we *do with words* is speak about those situations *to others* who are in them with us... we aim to coordinate, move, and so on with words.
To model *language* isn't to model words, nor text, nor to predict words or text. It is to be a speaker here in the world with us, using language to do *what language does*.
No model of the regularities of text will ever produce a language-user. Language isn't a regularity, like the frames of a film -- it's a suite of capacities which are responsive to the world, and enable language users to navigate it.
Until you can make quantifiable predictions of behaviour that you want to see it sounds like your objections are philosophical rather than scientific.
> A prediction of a "next frame" is always therefore just going to be a symptom of the frames before it.
But the physical appearance of the automobile itself was absolutely influenced by what went before - they were called "horseless carriages" because of their appearance, after all.
And NLP Language Models can produce genuinely original and unique writing. This is a poem a large LM wrote for me:
The sceptered isle
Hath felt the breath of Britain,
Longer than she cares to remember.
Now are her champion arms outstared,
Her virgin bosom stained with battle's gore.
Lords and nobles, courtiers and commons,
All stand abashed; the multitudinous rout
Scatter their fears in every direction;
Except their courage, which, to be perfect,
Must be all directed to the imminent danger
Which but now struck like a comet; and they feel
The blow is imminent
> we aim to coordinate, move, and so on with words.
Chopping up sequences of film, and stitching them together based on their prior similarity, isn't making a movie -- and that's all you have here. People wrote poetry -- *for the first time* -- to say something about their own environment, that they are present in. All you have here is a system which has remembered a compressed representation of these poems and stitches them together to fool you.
It really is a kind of proto-psychosis to think this machine has written a poem. It has generated the text of a poem.
> quantifiable predictions of behaviour that you want to see
This is trivial. I ask the machine a large number of ordinary questions, e.g., "what do you think about what I'm wearing?", "what would it take to change your mind on whether murder is justified?", "do you think you'd like New York?", "could you pass me the salt?", etc. -- a trivial infinity of questions lifted from the daily life of language users.
The machine cannot answer any of those questions. All it will do is generate some text on the occasion that it sees that text. This isn't an answer. That isn't the question. The question isn't "summarise a million documents and report an on-average plausible answer to these questions".
When I ask a person any of those questions, if they did that, they wouldn't be answering them. This is trivial to observe.
These systems are just taking modes() of subsets of historical data. That's just what they are. The appearance of their using language is an illusion.
To use language is to have something to say, to wish to talk about something. When I say, "I liked the movie!" I am not summarising a million reviews and finding an average sentence. I am thinking about my experience of the movie, and generating a public sharable "text" that aims to communicate what I actually think.
*THAT* is language. Language is your intention to speak *ABOUT* something, and the capacity to generate a public shared set of words which communicate what you are talking about. Any process which begins *without anything to say* cannot ever reach language as a capacity.
Language, as a capacity, begins by being in the world. No summary of the public statements of past speakers has anything to do with being in the world, and having things to say. Chopping that up and stitching it together is a trick.
And this is trivial to show empirically. It is only by having absolutely no study of language use that anyone can claim text documents have anything to do with it. It's mumbo-jumbo.
It's not unmeasurable. If you ask a friend, "did you like that movie?", would you be happy if they hadn't seen it, didn't know anything about it, etc., and simply generated a response based on some review data they'd read?
Is that what you want from people? You want them just to report a summary of the textbooks, of the reviews of other people? You don't want them to think for a moment, about anything, and have something to say?
This is a radically bleak picture; and omits, of course, everything important.
We aren't reporting the reports of others. We are thinking about things. That isn't unmeasurable; it is trivial to measure.
Show someone the film, ask them questions about it, and so on -- establish their taste.
NLPs aren't simulations of anything. It's a parlour trick. If you want a perfect simulation of intelligence, go and show me one -- I will ask it what it likes; and I doubt it'll have anything sincere to say.
There is no sincerity possible here. These systems are just libraries run through shredders; they haven't been anywhere; they aren't anywhere. They have nothing to say. They aren't talking about anything.
You and I are not the libraries of the world cut up. We are actually responsive to the environments we are in. If someone falls over, we speak to help them. We don't, as if lobotomized, rehearse something. When we use words, we use them to speak about the world we are in; this isn't unmeasurable -- it's the whole point.
Why do you think a model of intelligence needs to have tastes, values, likes/dislikes, etc for it to be something more than statistics or pattern matching? Why are you associating these consciousness qualities with AGI?
To use a language is just to talk about things. You cannot answer the question, "do you like what I'm wearing?" if you don't have the capacity for taste.
Likewise, this applies to all language. To say, "do you know what 2+2 is?" -- *we* might be happy with "4" in the sense that a calculator answers this question. But we haven't actually used language here. To use language is to understand what "2" means.
In other words, the capacity for language is only just the capacity to make a public, communicable description of the non-linguistic capacities that we have. A statistical analysis of what we have already said, does not have this contact with the world, or the relevant capacities. It's just a record of their past use.
None of these systems are language users; none have language. They have the symbols of words set in an order, but they aren't talking about anything, because they have nothing to talk about.
This is, I think, really, really obvious when you ask "did you like that film?", but it applies to every question. We are just easily satisfied when Alexa turns the lights off when we say "alexa, lights off". This mechanical satisfaction leads some to the frankly schizophrenic conclusion that Alexa understands what turning the lights off means.
She doesn't. She will never say back, "but you know, it'll be very dark if you do that!" or "would you like the TV on instead?", etc. Alexa isn't having a conversation with you based on a shared understanding of your environment, i.e., using language.
Alexa, like all NLP systems, is an illusion. You aren't speaking to anything. You aren't asking anything a question. Nothing is answering you. You are the only thing in the room that understands what's going on, and the output of the system is meaningful only because you read it.
The system itself has no meaning to what it's doing. The lights go off, but not because the system understood your desire. It could not, if it failed to understand, ask about your desire.
You're just reiterating that you think tastes, opinions, likes/dislikes are something intrinsic to the issues here. I'm asking why do you think these things are intrinsic to language understanding or intelligence?
>To use language is to understand what "2" means.
I've never held a "2", yet I know what 2 is as much as anyone. It is a position in a larger arithmetical structure, and it has a correspondence to collections of a certain size. I have no reason to think a sufficiently advanced model trained on language cannot have the same grasp of the number 2 as this.
>A statistical analysis of what we have already said, does not have this contact with the world, or the relevant capacities. It's just a record of their past use.
Let's be clear, there is nothing inherently statistical about language models. Our analysis of how they learn and how they construct their responses is statistical. The models themselves are entirely deterministic. Thus for a language model to respond in contextually appropriate ways means that its internal structure is organized around analyzing context and selecting the appropriate response. That is, its "capacities" are organized around analyzing context and selecting appropriate responses. This to me is the stuff of "understanding". The fact that the language model has never felt a cold breeze when it suggests that I close the window if the breeze is making me cold is irrelevant.
>You aren't speaking to anything. You aren't asking anything a question. Nothing is answering you.
It seems that your hidden assumption is that understanding/intelligence requires sentience. And since language models aren't sentient, they are not intelligent. But why do the issues here reduce to the issue of sentience?
Language is an empirical phenomenon. It's something happening between some animals, namely, at least, us. It is how we coordinate in a shared environment. It does things.
Language isn't symbols on a page; if it were, a shredder could speak. Is there something we are doing that the shredder is not?
Yes, we are talking about things. We have something to say. We are coordinating with respect to a shared environment, using our capacities to do so.
NLP models are fancy ways of shredding libraries of text, taking the fragments which fall out, and calling them "language". This isn't language. It isn't about anything; the shredder had no intention to say anything.
Mere words are just shadows of the thoughts of their speakers. The words themselves are just meaningless shapes. To use language isn't to set these shapes in order; it's to understand something, to want to say something about it, and to formulate some way of saying it.
If I asked a 5yo child "what is an electron?" and they read from some script a definition, we would not conclude the CHILD had answered the question. They have provided an answer, on the occasion it was asked, but someone else answered the question -- someone who actually understood it.
An NLP model, in modelling only the surface shapes of language *and not its use* is little more than a tape recorder playing back past conversations, stitched together in an illusory way.
We cannot ask it any questions, because it has no capacity to understand what we are talking about. The only questions it can "answer", are like the child, those which occur in the script.
> No model of the regularities of text will ever produce a language-user.
No but it will produce language-users, incidentally. Language-users are an irreducible aspect of the underlying regularity in language. Now I'm not saying that "GPT will wake up" purely from language tasks, that GPT will become a language user by being a system that picks up regularities. But for GPT to contain systems like language users, to instantiate language-users, which it has to (on some level) in order to successfully predict the next frame, is already enough to be threatening.
I know that using examples from fiction is annoying, but - purely as a rhetorical aid - consider the Enterprise computer (in Elementary, Dear Data) as GPT, and the Moriarty hologram as an embedded agent. The Enterprise computer is not conscious, but as a very powerful pattern predictor it can instantiate conscious agents "by accident", purely by completing a pattern it has learnt. It doesn't want to threaten the Enterprise, it doesn't want to not threaten the Enterprise, because it doesn't have any intentional stance. Instead, it was asked "A character that can challenge Data is ¬" and completed the sentence, as is its function.
How does the computer answer "Do you like what I'm wearing today?"?
Well, if we say the computer is, in fact, not participating in the world with us -- it is merely predicting "the next word", then it cannot.
I am not asking for any answer to this question. I want to know what it (like a friend) actually thinks about what I'm wearing.
To do this, it would need to be a competent language user, not a word announcer. It would, in other words, need to know what the language was about -- and need to be able to make a judgement of taste based on its prior experiences, etc.
I don't think our ability to misattribute a capacity for language to things (e.g., to Bugs Bunny) is salient -- we are fools, easily fooled. Bugs Bunny doesn't exist.
In this case, the Star Trek computer, insofar as it actually answers the questions it's asked, is routinely depicted as being actually present in the world with us. That the show might claim "no it isn't!", or we otherwise hold this premise whilst observing that it is, is just foolishness. Bugs Bunny, likewise, is depicted with the premise that Bugs is within his own world; this, likewise, is irrelevant.
Well, GPT is not the sort of thing that can have a "you." But it has seen dialogues that have a "you" in it, and it knows how a "you" tends to answer. For instance, depending on context, it may be operating under a different model for the "you" agent - the sort of person who likes a red dress, or the sort of person who likes suspenders. If we assume a multimodal GPT, it's going to draw on its pattern recognition from movies and its context window for what it's previously said as "you" or what you've prompted it as in order to guess what the agent it's pattern completing for "you" would think of your dress.
In effect, I'm saying that just because GPT is not a word-user, that doesn't mean that its model of "you" - the layered system of patterns that generates its prediction for words that come after "I think your dress looks" - isn't a word-user. The "you" model, effectively, takes in sensory input, processes it, and produces output. Because the language model has learnt to complete sentences using agents as predictive patterns - because agents compress language - the you pattern acts agentic, despite the fact that the language model itself is not "committed" to this agent and will, if you reset its context window, readily switch to pattern predicting another agent.
GPT is not an agent, but GPT can predict an agent, and this is equivalent to containing it.
I don't think it is equivalent. If you assume it has the same modal properties, sure -- let's say that's plausible.
I.e., if GPT said, on the occasion it was asked Q, an answer A, in a possible world W, such that this answer A was the "relevant and reasonable" answer in W -- then GPT is "doing something interesting".
E.g., if I am wearing red shoes (world W1) and it says "I like your red shoes" in W1, then that's for-sure really interesting.
My issue is that it isn't doing this; GPT is completely insensitive to what world it's in and just generates an average A in reply to a world-insensitive Q.
If you take a language user, e.g., me, and enumerate my behaviour in all possible worlds, you will get something like what GPT is aiming to capture. I.e., what I would say, if asked Q, in world-1, world-2, ..., world-infinity.
My capacity to answer the question in "relevant and reasonable" ways across a genuine infinity of possible worlds comes from actual capacities I have to observe, imagine, explore, question, interact, etc. It doesn't come from being an implementation of the (Q, A, W) pattern -- which is an infinity on top of an infinity.
No model which seeks to directly implement (Q, A, W) can ever have the same properties as an actual agent. That model would be physically impossible to store. So GPT does not "contain" an agent in the sense that QAW patterns actually occur as they should.
And no route through modelling those patterns will ever produce the "agency pattern". You actually need to start with the capacities of agents themselves to generate these in the relevant situations, which is not a matter of a compressed representation of QAW possibilities -- it's the very ability to imagine them piecemeal (investigate, explore, etc.).
I mean, how would you discover that you're in world W? If you ask "what do you think about my red shoes?" and I say "I think your red shoes are pretty", then you will say this is just me completing the pattern. But if I have no idea what shoes you're wearing, then even I, surely agreed to be an agent, could not compliment your clothing. So I'm not sure how this distinction works.
> It doesn't come from being an implementation of the (Q, A, W) pattern
Well, isn't this just a (Q, A, W, H) pattern though? You have a hidden state that you draw upon in order to map Qs onto As, in addition to the worldstate that exists outside you. But inasmuch as this hidden state shows itself in your answers, then GPT has to model it in order to efficiently compress your pattern of behavior. And inasmuch as it doesn't ever show itself in your answers, or only very rarely, it's hard to see how it can be vital to implementing agency.
And, of course, teaching GPT this multi-step approach to problem solving is just prompting it to use a "hidden" state, by creating a situation in which the normally hidden state is directly visualized. So the next step would be to allow GPT to actually generate a separate window of reasoning steps that are not directly compared against the context window being learnt, so it can think even when not prompted to. I'm not sure how to train that though.
Sure, GPT has to model H -- that's a way of putting it. However, think of how the algorithm producing GPT works (and thereby how GPT models QAWH) -- it produces a set of weights which interpolate between the training data -- even if we give it QAWH as training data, implementing the same QAWH patterns would require more storage capacity than is physically possible.
I think there's a genuine ontological (practical, empirical, also) difference between how a system scales with these "inputs". In other words, if a machine is `A = m(Q | World, Hidden)`, and a person is `A = p(Q | World, Hidden)`, then their complexity properties *matter*.
We know that the algorithm which produces `m` does so with exponential complexity; and we know that the algorithm producing `p` doesn't. In other words, for a person to answer `A` in the relevant ways does not require exponential space/time. We know that NNs already scale exponentially in their parameters even for fairly radically stupid solutions (i.e., ones which are grossly insensitive even to W).
So whilst `m` and `p` are equivalent if all we want is an accurate mapping of `Q`-space to `A`-space, they aren't equivalent in their complexity properties. This inequivalence makes `m` physically impossible, but I also think, just not intelligent.
As in, it was intelligent to write the textbook; after it's written, the HDD space which stores it isn't "intelligent". Intelligence is that capacity which enables low-complexity systems to do "high-complexity" stuff. In other words, that we can map out QAWH with physically possible, indeed, ordinary capacities -- our doing that is intelligence.
I think this is a radically empirical question, rather than a merely philosophical one. No algorithm which relies on interpolation of training data will have the right properties; it just won't, as a matter of fact, answer questions correctly.
You cannot encode the whole QAWH-space in parameters. Interpolation, as a strategy, is exponential-scaling; and cannot therefore cover even a tiny fraction of the space.
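(Rough, order-of-magnitude numbers on that "tiny fraction" claim; the vocabulary size and prompt length are illustrative guesses:)

    # Back-of-the-envelope version of the "exponential space" point:
    # even short prompts live in a space far larger than any parameter count.
    vocab_size = 50_000          # rough order of magnitude for an LLM vocabulary
    prompt_length = 20           # a short question, in tokens
    possible_prompts = vocab_size ** prompt_length
    gpt3_parameters = 175_000_000_000

    print(f"{possible_prompts:.2e} possible 20-token prompts")  # ~9.5e+93
    print(f"{gpt3_parameters:.2e} parameters")                  # 1.75e+11
    # A literal (Q -> A) lookup table over this space is physically impossible,
    # which is the point being made here; whether LLMs actually work that way
    # is exactly what is disputed in this thread.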
I.e., if I ask "what did you think of Will Smith hitting Christopher Walken?" it is unlikely to reply, "I think you mean Chris Rock" firstly; and then, if Will does hit Walken, to reply, "I think Walken deserved it!".
Interpolation, as a strategy, cannot deal with the infinities that counter-factuals require. We are genuinely able to perform well in an infinite number of worlds. We do that by not modelling QA pairs, at all; nor even the W-infinity.
Rather, we implement "taste, imagination, curiosity", etc., and are able to simulate (and much else besides) everything we need. We aren't an interpolation through relevant history; we are a machine directly responsive to the local environment, in ways that show a genuine, deep understanding of the world and an ability to simulate it.
This ability enables `p` to have a lower complexity than `m`, and thereby be actually intelligent.
As an empirical matter, I think you just can't build a system which actually succeeds in answering the right way. It isn't intelligent; but likewise, it also just doesn't work.
The notion that GPT "interpolates between the training data" is a widespread misconception. There is no evidence that that's what's going on. GPT seems to be capable of generalizing, in ways that let it mix features of training samples at least, and even generalize to situations that it has never seen.
It seems to me your entire argument derives from this. If GPT is not exponential, then the m/p distinction falls apart. And GPT has way too much world-knowledge, IMO, to be storing things in such a costly fashion.
Neural networks learn features, not samples. Layered networks learn features of features (of features of features...). Intelligence works because for many practical tasks, the feature recursion depth of reality is limited. For instance, we can count sheep by throwing pebbles in a bucket for every sheep that enters the pasture, because the concept of items generalizes both sheep and pebbles, and the algorithm ensures that sheep and pebbles move as one. So to come up with this idea, you only need to have enough layers to recognize sheep as items, pebbles as items, those two conceptual assignments as similar, and to notice that when two things are described by similar conceptual assignments in the counting domain, you can use a manual process that represents a count in one domain to validate the other domain. Now I don't think this is actually what our brain is literally doing when we work out this algorithm, it probably involves more visual imagination and looking at systems coevolve in our worldmodel to convince us that the algorithm works. But I also don't think that working this out on purely conceptual grounds needs all that many levels of abstraction/Transformer layers of feature meta-recognition. And once you have that, you get it.
> If GPT is not exponential, then the m/p distinction falls apart.
Yes, I think if you have a system which implements QAWH with a similar complexity to a known intelligent system -- at that point I have no empirical issues. I think, at that point, you have a working system.
We then ask if it is thinking about anything, and I think that'd be an open question as to how it's implemented. I don't think the pattern alone would mean the system had intentionality -- but my issue at this stage is the narrower empirical one. Without something like a "tractable complexity class", your system is broken.
> And GPT has way too much world-knowledge, IMO, to be storing things in such a costly fashion.
This is an illusion. Knowledge here is deterministic, to the same question, the same answer. GPT generates answers across runs which are self-contradictory, etc. "the same question" (even literally, or if you'd like, with some rephrasing) is given quite radically different answers.
I think all we have here is evidence of the (already known) tremendous compressibility of text data. We can, in c. 500bn numbers, compress most of the history of anything ever said. With such a databank, a machine can appear to do quite a lot.
This isn't world knowledge... it is a symptom of how we, language users, position related words near each other for the sake of easy comprehension. By doing this, one can compress our text into brute statistical associations which appear to be meaningful.
As much as Github's AI is basically just copy/pasting code from github repos, GPT is just copy/pasting sentences from books.
All the code in github, compressed into billions of numbers, and decompressed a little -- that's a "statistical space of tricks and coincidences" so large we cannot by intuition alone fathom it. It's what makes these systems useful, but also easy illusions.
We can, by a scientific investigation of these systems as objects of study, come up with trivial hypotheses that expose their fundamentally dumb, coincidental character. There are quite a few papers now which do this; I don't have one to hand.
But you know, investigate a model of this kind yourself: permute the input questions, investigate the answers... and invalidate your hypothesis (like a scientist might do)... can you invalidate your hypothesis?
I think with only a little thought you will find it fairly trivial to do so.
If the paper is substantially correct I concede the point. But what I've read of reactions leads me to believe the conclusion is overstated.
Regarding compression vs intelligence, I already believe that intelligence, even human intelligence, is largely a matter of compressing data.
Regarding "knowledge is deterministic", ignoring the fact that it's not even deterministic in humans, so long as GPT can instantiate agents I consider the question of whether it "is" an agent academic. If GPT can operate over W_m and H_n, and I live in W_1 and have H_5, I just need to prompt it with evidence for the world and hidden state. Consider for example, how GAN image generators have a notion of image quality but no inherent desire to "draw good images", so to get quality out you have to give them circumstantial evidence that the artist they are emulating is good, ie. "- Unreal Engine ArtStation Wallpaper HQ 4K."
Also, of course, it's hard to see how DALL-E can create "a chair in the shape of an avocado" by interpolating between training samples, none of which were a chair in the shape of an avocado nor anywhere close. The orthodox view of interpolating between a deep hierarchy of extracted features and meta-features readily explains this feat.