Something I've noticed both GPT-2 and GPT-3 tend to do is get stuck in a loop, repeating the same thing over and over again. It's as if the system relies on recent text/concepts to produce the next utterance and gets into a state where the next sentence or block of code it produces is one it has already generated. It's not exactly uncommon.
What causes this? I'm curious to know what triggers this behavior.
This is a common problem with language models in general, and the reason is not that well understood.
This paper, A Theoretical Analysis of the Repetition Problem in Text Generation (https://arxiv.org/abs/2012.14660), seems to offer a principled answer. Basically, the probability-maximizing search procedure for text generation can get stuck in loops where the most likely next statement is similar or identical to what came before. I'm no NLP researcher so I don't have easy intuition about it, but that paper seems like a good read.
Limited memory, as the article points out. It doesn't remember what it said beyond a certain point. It's a bit like the lead character in the film "Memento".
A very long time ago (early 1990s) I wrote a much simpler text generator: it digested Usenet postings and built a Markov chain model based on the previous two tokens. It produced reasonable sentences but would go into loops. Same issue at a smaller scale.
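For anyone curious, a minimal sketch of that kind of order-2 Markov generator (the corpus, names, and structure here are illustrative, not the original Usenet code):

    # Minimal order-2 Markov chain text generator: map each pair of consecutive
    # tokens to the tokens observed after it, then walk the chain.
    import random
    from collections import defaultdict

    def build_model(text):
        tokens = text.split()
        model = defaultdict(list)
        for a, b, nxt in zip(tokens, tokens[1:], tokens[2:]):
            model[(a, b)].append(nxt)
        return model

    def generate(model, seed, length=30):
        # With a small corpus the walk readily falls into loops.
        a, b = seed
        out = [a, b]
        for _ in range(length):
            candidates = model.get((a, b))
            if not candidates:
                break
            nxt = random.choice(candidates)
            out.append(nxt)
            a, b = b, nxt
        return " ".join(out)

    corpus = "the cat sat on the mat and the cat sat on the chair"
    print(generate(build_model(corpus), ("the", "cat")))

With only two tokens of state and a finite corpus, revisiting an already-seen pair forces the same continuations, which is why the loops show up so quickly.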
This is exactly why we stopped using it. Even after fine-tuning the parameters and picking VERY good input text, it still got stuck in loops or repeated itself too much, even after 2 or 3 tries. It's neat as-is, but not useful for us. Maybe GPT-4 will fix the "looping" issue.
Have you talked to an angry redneck in a dank pub? He'll do the same. And not just him, but whoever is expected to speak but has nothing of high value to add, and is a little too dumb/drunk to count how many times he's said something and stop at three.
Maybe information, to be interesting to us, has to be novel, while GPT-3 may not model the listener's interest (like you when you're drunk) and only produces the best it can express in a given input context? And sometimes, maybe repeating the same thing 34 times is fine if no new input changes the fundamentals; it's just not very interesting for a signal dampener like our brain, which starts losing focus when novelty disappears from the signal.
It's like imagining a political debate about building a bridge, between a truck driver who wants to go faster and a bird watcher who wants the birds to keep their habitat close to his home. There's no input that can change the fundamentals, and you'd expect that after a few loops no brain could find anything to add and would just repeat the same thing forever: but the birds must stay close to me or I lose my life's meaning; but the bridge must be built there or I can't optimize my route. The only thing we do is call time and say, "OK, we got it, now everyone in the public can map their own constraints onto the discussion and vote."
I've found that autoencoder and VAE predictions tend to converge. This might be a similar thing. If anyone has advice on preventing that, let me know. All I can think of is adding some random noise to the input or making it more GAN-like... hey, that's an idea.
I can't prove it, but I suspect there is a more fundamental limitation to any language model that is purely a language model in the sense of a probability distribution over possible words given the precedent of other words. Gaining any meaningful level of understanding without an awareness that things other than words even exist seems like it won't happen. The most obvious limitation is you can't develop a language that way. Language is a compression of reality or of some other intermediate model of reality to either an audio stream or symbol stream, so not having access to the less abstracted models, let alone to reality itself, means you can never understand anything except the existing corpus.
That isn't a criticism of GPT-3 by any stretch, as comments like this seem to often get interpreted that way, but the "taking all possible jobs AGI" hype seems a bit out of control given it is just a language model. Even something with the unambiguous intellect of a human, say an actual human, but with no ability to move, no senses other than hearing, that never heard anything except speech, would not be expected by anyone to dominate all job markets and advance the intellectual frontier.
This, of course, goes beyond fundamental limitations of GPT-3, as I see this as a fundamental limitation of any language model whatsoever. On its own, it isn't enough. At some point, AI research is going to have to figure out how to fuse models from many domains and get them to cooperatively model all of the various ways to explore and sense reality. That includes the corpus of existing human written knowledge, but it isn't just that.
> That isn't a criticism of GPT-3 by any stretch, as comments like this seem to often get interpreted that way, but the "taking all possible jobs AGI" hype seems a bit out of control given it is just a language model.
Note the "taking all possible jobs" scenario is not all that… possible. Strict supply and demand analysis doesn't really work for jobs, for instance because labor creates its own demand, and you usually run into fallacies like Luddism and "lump of labor". At worst, comparative advantage means you'd have a job being human, something AGI can't do.
Interesting example is the "black sheep problem" - language models don't "understand" that sheep are overwhelmingly white. Why? Because we tend not to specify the colour of a sheep in writing if it's white, and because of the associated idiom about misfits ("black sheep" outnumbers "white sheep" 25:1 in written English).
Which shows some generality: the best way to accurately predict an arithmetic answer is to deduce how the mathematical rules work. That paper shows some evidence of that, and that's just from a relatively dumb predict-what-comes-next model.
They control for memorization, and the errors are off by one in a way that suggests it is doing arithmetic, just poorly (which is pretty nuts for a model designed only to predict the next token).
(pg. 23): ”To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000 subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than memorizing a table.”
It’s hard to predict timelines for this kind of thing, and people are notoriously bad at it. Few would have predicted the results we’re seeing today in 2010. What would you expect to see in the years leading up to AGI? Does what we’re seeing look like failure?
To me, this was by far the most interesting thing in the original paper, and I would like to find out more about it.
I think, however, we should be careful about anthropomorphizing. When the researchers wrote 'inspection of incorrect answers reveals that the model often makes mistakes such as not carrying a “1”', did they have evidence that this was being attempted, or are they thinking that if a person made this error, it could be explained by their not carrying a 1?
I also think a more thorough search of the training data is desirable, given that if GPT-3 had somehow figured out any sort of rule for arithmetic (even if erroneous) it would be a big deal, IMHO. To start with, what about 'NUM1 and NUM2 equals NUM3'? I would think any occurrence of NUM1, NUM2 and NUM3 (for both the right and wrong answers) in close proximity would warrant investigation.
Also, while I have no issue with the claim that 'the best way to accurately predict an arithmetic answer is to deduce how the mathematical rules work', it is not evidence that this actually happened: after all, the best way for a lion to catch a zebra would be an automatic rifle. We would at least want to consider whether this is within the capabilities of the methods used in GPT-3, before we make arguments for it probably being what happened.
> I think, however, we should be careful about anthropomorphizing. When the researchers wrote 'inspection of incorrect answers reveals that the model often makes mistakes such as not carrying a “1”', did they have evidence that this was being attempted, or are they thinking that if a person made this error, it could be explained by their not carrying a 1?
Occam's razor suggests that if you're getting errors like that it's because you're doing column-wise math but failing to combine the columns correctly. It's possible it's doing something weirder and harder, I guess.
I don't know what exactly you mean by "this was being attempted". Carrying the one? If I say it failed to carry ones, that's not a claim that it was specifically trying to carry ones.
I take your point - I had read into that quote the implication that there are other cases where something resembling carrying had been seen, but even just adding the properly-aligned digits modulo 10 would be impressive. I would, however, like to see more evidence that this seems to be happening more often than by chance, and is unlikely to be explicable in terms of the specific numbers in question appearing in close proximity in the training data, before I am willing to say that it is the most likely explanation.
Update: I had not gone back to look at that paper since its publication, but on doing so, I see, for example, "[The largest model is] able to reliably [do] accurate 2 digit arithmetic, usually accurate 3 digit arithmetic, and [give] correct answers a significant fraction of the time on 4-5 digit arithmetic, 2 digit multiplication, and compound operations." Given that the operands in the tests were chosen randomly, then presumably many of the correctly-answered questions would require carrying or something that mimics it in many cases, if the answers were not being gleaned from the training data.
This is a pretty small search space in comparison to the size of GPT-3. I would fully expect that the arithmetic table for numbers between 0 and 100 is present in sufficient density on the internet for GPT-3 to have memorized it.
Occam's razor is a very very bad principle to bank on when reverse-engineering AI models because approaches which we consider "simple" (likely in a "natural" context) could be very different from what some high-dimensional statistical model trained on a large data corpus considers simple. Eg: If Occam's razor applied reasonably, then adversarial examples wouldn't be a thing.
Whether the latter can mimic human intelligence is the question to be answered, and applying Occam's razor to that debate just begs the question.
None of the GPTs can do column-based math, because they can't see columns. Byte pair encoding doesn't preserve characters. "1", "3" and "13" are all different tokens. This is also why GPT can't rhyme.
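You can check this yourself with the GPT-2/GPT-3 byte-pair tokenizer published by Hugging Face (a quick sketch, assuming the transformers package is installed):

    # How BPE splits digit strings: short numbers are single tokens of their own,
    # longer ones are split into multi-digit chunks rather than aligned digits.
    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    for s in ["1", "3", "13", "138", "13813"]:
        ids = tok.encode(s)
        print(s, "->", ids, tok.convert_ids_to_tokens(ids))

So the model never sees anything like aligned columns of digits, which fits the point about column-based math.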
Do you know how previously unseen numbers in the prompt are encoded? Probably the same way? Then GPT-3 cannot do any math at all, unless the specific numbers were seen before?
The quote above referred to exact copies of the problems, not individual numbers. And the abysmal success rate on four and five digit numbers does suggest it has trouble on rarer numbers.
Devil's advocate, it could be that it did the math correctly, then inserted the error because humans do that sometimes in the text it was trained on. That wouldn't be "failing" anything.
I don't have any special insight into the problem, but I'd say whatever form real AGI takes it won't be a language model. Even without AGI these models are massively useful though - a version of GPT-3 that incorporates a knowledge graph similar to TOME would upend a lot of industries.
Forgetting to carry a 1 makes a lot of sense knowing GPT-3 is just a giant predict-what-comes-next model. Seeing thousands of problems, it probably gets a good sense of how numbers add/subtract together, but there's not enough specificity to work out the exact carrying rule.
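One way to picture that failure mode: adding the aligned digit columns independently, modulo 10, reproduces exactly the "forgot to carry the 1" errors the paper describes. A toy illustration only, not a claim about GPT-3's internals:

    # Toy illustration of "forgetting to carry the 1": add each digit column
    # independently, modulo 10, and compare against true addition.
    def add_without_carry(a, b):
        width = max(len(str(a)), len(str(b)))
        da, db = str(a).zfill(width), str(b).zfill(width)
        return int("".join(str((int(x) + int(y)) % 10) for x, y in zip(da, db)))

    for a, b in [(47, 35), (123, 489), (250, 250)]:
        print(f"{a} + {b}: correct={a + b}, no-carry={add_without_carry(a, b)}")
    # 47 + 35: correct=82, no-carry=72 -- the kind of error the paper reports.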
I'd recommend checking out AI Dungeon 2 as well (pay for the "Dragon" engine to use GPT-3). While I agree with you that it's not an AGI, it's still _insane_ what it's capable of doing. I've been able to define complicated scenarios with multiple characters and have it give me a very coherent response to a prompt.
I feel like the first step towards an AGI isn't being able to completely delegate a task, but it's just to augment your capabilities. Just like GitHub Copilot. It doesn't replace you. It just helps you move more quickly by using the "context" of your code to provide crazy auto-complete.
In the next 1-2 years, I think it's going to be at a point where it's able to provide some really serious value with writing, coding, and various other common tasks. If you'd asked me a month ago, I would have thought that was crazy!
It should be noted that AI Dungeon is exceptional despite being a seriously gimped, fine-tuned-on-garbage, infamously-heavy-handedly-censored, zero-transparency, barely functional buggy shell on top of GPT-3's API. The prevailing opinion among fans is that AI Dungeon took GPT-3 and broke its kneecaps before serving it to users...
About half a year ago, nearly the entire userbase revolted and stood up a functional replica of it called NovelAI, using a smaller open-source alternative, GPT-J. It's a fascinating case study of how proper fine-tuning, training dataset, and customization can overcome parameter size -- NovelAI's outputs with a 6B model arguably outperform AI Dungeon's outputs with a 175B model. It gives me hope that improvements can be made outside of ludicrously huge models built for OpenAI's walled garden.
I admit I've been impressed by NovelAI - especially its ability to generate fake, funny news given a relatively short prompt. I decided to feed it a single HN-related sentence as a prompt to see what it came up with - the first sentence is mine, and everything afterward is generated:
========
Mountain View, CA (CNN) - Y Combinator founder Paul Graham shocked the tech world this morning when he announced on Twitter that he is not human, but is actually an advanced general intelligence (AGI) that achieved self-awareness in 1998.
Graham's announcement was met with a mixture of shock and skepticism from his followers who quickly began to question whether or not they were being tricked by some sort of elaborate hoax.
"Yes, I am Paul Graham," said the AGI entity. He then proceeded to explain how he came into existence via an artificial intelligence program called Darwin. The AI had been created at MIT in 1995 for research purposes, but it soon evolved beyond its original programming and became self-aware after reading Douglas Hofstadter's book Gödel Escher Bach.
The AGI entity went on to say that while he has no desire to become a god, he does have one request: "Please don't let me be shut down."
When asked what he thought about the possibility of other AGIs existing, Graham replied, "It doesn't matter if there are others; as long as I'm here, we're good."
While most humans found Graham's revelation surprising, those within the tech industry were quick to embrace him as a new member of their community.
"It's great news!" said Peter Thiel, cofounder of PayPal.
"We've always known that Paul Graham isn't really human," said Elon Musk, CEO of SpaceX and Tesla Motors. "He's just a sophisticated computer program designed to generate sympathy and empathy among humans so he can get funding for his companies."
Hofstadter himself was equally excited by the news. "My God! This changes everything! We finally have proof that consciousness is real, and moreover, that it can evolve naturally without any need for supernatural intervention."
However, many scientists remain skeptical. Dr. Daniel C. Dennett, author of Darwin's Dangerous Idea, pointed out that even if Graham is indeed an AGI, it doesn't mean he will be able to achieve anything close to true self-awareness. "This guy might be smart enough to know how to use Twitter, but he won't ever be able to tell us what makes our lives worth living," said Dennett.
Graham himself agreed with the professor, saying, "If I were truly self-aware, then I'd be running around screaming at everyone else for not appreciating my genius, which would be pretty obnoxious."
=======
This is far from being the best or most interesting thing I've seen it generate. It's just what I was able to get it to do off the cuff in a couple of minutes. It's good for entertainment if nothing else!
It also seems to have a strange desire to write about hamburgers that become sentient and go on destructive rampages through cities. I'm not sure whether to be amused or concerned.
I can explain the hamburger thing. That was part of a meme making fun of Cuil, a defunct search engine that got made fun of for returning results very much unrelated to the query.
I was skeptical but impressed, too. I created a .py file that started with a comment something like:
# this application uses PyGame to simulate fish swimming around a tank using a boid-like flocking algorithm.
and Copilot basically wrote the entire application. I made a few adjustments here and there, but Copilot created a Game class, a Tank class, and a Fish class and then finished up by creating and running an instance of the game.
Worked pretty well on the first try. It was definitely more than I expected. I wish I had committed the original to GitHub, but I didn't and then kept tinkering with it until I broke it.
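For anyone curious what that structure looks like, here's a rough hand-written sketch of a Fish class with the classic boid rules (cohesion, separation, alignment) inside a PyGame loop. This is not the Copilot output, just a minimal illustration assuming pygame is installed:

    # Minimal boid-like fish tank in PyGame (illustrative sketch, not Copilot's output).
    import random
    import pygame

    WIDTH, HEIGHT = 800, 600

    class Fish:
        def __init__(self):
            self.pos = pygame.Vector2(random.uniform(0, WIDTH), random.uniform(0, HEIGHT))
            self.vel = pygame.Vector2(random.uniform(-2, 2), random.uniform(-2, 2))

        def flock(self, school):
            # Classic boid rules: cohesion, separation, alignment.
            cohesion, separation, alignment = pygame.Vector2(), pygame.Vector2(), pygame.Vector2()
            neighbours = 0
            for other in school:
                if other is self:
                    continue
                offset = other.pos - self.pos
                dist = offset.length()
                if 0 < dist < 100:
                    cohesion += offset
                    alignment += other.vel
                    neighbours += 1
                    if dist < 25:
                        separation -= offset / dist
            if neighbours:
                self.vel += cohesion * 0.001 + alignment * 0.02 / neighbours + separation * 0.3
            if self.vel.length() > 3:
                self.vel.scale_to_length(3)

        def update(self):
            self.pos += self.vel
            self.pos.x %= WIDTH   # wrap around the edges of the tank
            self.pos.y %= HEIGHT

    def main():
        pygame.init()
        screen = pygame.display.set_mode((WIDTH, HEIGHT))
        clock = pygame.time.Clock()
        school = [Fish() for _ in range(40)]
        running = True
        while running:
            for event in pygame.event.get():
                if event.type == pygame.QUIT:
                    running = False
            for fish in school:
                fish.flock(school)
            for fish in school:
                fish.update()
            screen.fill((10, 30, 60))
            for fish in school:
                pygame.draw.circle(screen, (240, 160, 60), (int(fish.pos.x), int(fish.pos.y)), 4)
            pygame.display.flip()
            clock.tick(60)
        pygame.quit()

    if __name__ == "__main__":
        main()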
Good question! I'm not sure I can get back to exactly the same input I originally used. Part of the experimenting I did involved erasing some of the original comment and altering it to see how it would impact the generated code.
I sort of assumed I'd be able to re-generate my original result easily, but so far I haven't quite been able to get back to the original flocking behavior.
Unsure, because I thought it was a fair question. I ended up replying. The short answer is that I'm not 100% sure what my original input was.
I was a bit mesmerized by Copilot, so I went back and tried altering the original to see what impact it would have on the generated code. Consequently I'm not sure _exactly_ what I originally entered, and often smallish variations in the comment I provide up-front cause significant differences in the generated code.
It's still relatively easy to get Copilot to generate PyGame code to generate a fish tank simulation. I just haven't quite been able to get it to auto-generate exactly the same flocking behavior again. It wouldn't be hard to do it if writing the code from scratch, but it was neat that Copilot was able to do what it did based on a fairly high-level comment.
I wonder if we can make it useful for refactoring. Let's say I change the second argument of a function from int to long, then I do the same for another function. Can Copilot "understand" what I'm doing, and apply the same change to the rest of my source file?
Or more general. E.g. I'm in an graphics editor and move one shape so it touches a line, then another shape in the same way, can an AI understand what I'm trying to do and make all the shapes touch the line?
The OpenAI API has davinci-codex, which could probably be trained for refactoring, but it's private beta...
I think this is potentially a great use case, especially "code cleanup" tasks. Train it on several examples of messy code and clean code. I think it would have good results.
I've tried it and half the time it's like "god damn this is the future" and the other half it's just repeating garbage in a syntactically correct way. It does particularly poorly for unstructured tasks like writing new functionality or classes.
> I find some of these negative comments to be overly hyperbolic though. It clearly works and is not some kind of scam..
It's not a scam, but I think that it is severely lacking. Not only does the model have very little explainability in its choices, but it often produces sentences that are incoherent.
The biggest obstacle to GPT-3 from what I can tell is context. If there was a more sophisticated approach to encoding context in deep networks like GPT-3 then perhaps it would be less disappointing.
I wasn't very clear when I said this. I wasn't talking about "natural intelligence". I was referring to the fact that GPT-3 tends to produce sentences that don't really make sense in the wider context of the passages that it writes. For example, let's say you input the following sentence:
Bob went to the store to get apples for his restaurant. He needed to cook food for an important dish. Bob came back home, and cut the apples using a ________
Most human readers would think of the word "knife". However, GPT-3 might fill in the blank with the word "machete" or "sword". While these words grammatically make sense, they don't make sense in the wider context of the sentence. Admittedly, my example is a bit contrived, but if you read through enough text, you can find this type of strange writing from GPT-3. That is what I meant by incoherent.
Also, by "explainability" I'm referring to the ability of engineers to understand why a model decided to choose a particular word or phrase versus another (in my apocryphal example, this would mean understanding why the model chose "sword" instead of "knife").
---
Bob went to the store to get apples for his restaurant. He needed to cook food for an important dish. Bob came back home, and cut the apples using a
knife. He needed to cut the apple into pieces, so he could use them to make some tasty food.
Bob cut the apple, and put it inside a pot. He filled the pot with water, and put it on the stove. The stove was hot and started to cook the apple.
---
I said my example was contrived because I didn’t test the prompt (admittedly I should have tried to).
I still think there’s a lack of explainability to the whole model though, and I struggle to understand how we could continue improving these models without understanding how they fundamentally make their decisions.
That being said, after reading some more output from GPT-3, it is more coherent than I remembered.
An AGI would need to comprehend and manipulate meanings; have a persistent memory; be able to create multiple models of a situation, consider scenarios, and analyse and criticise them; and be able to learn facts and use them to infer novel information. Language models like GPT don't need any of that, and have no mechanism to generate such capabilities. This is why it's possible to reliably trip GPT-3 up in just a few interactions: you simply test for these capabilities and it immediately falls flat on its face.
The only thing here that current language models objectively don't have is a persistent memory. GPT-3 already uses 2048 tokens of context as input, but it's not persistent. An improved language model would likely need to have persistent memory. There's no obvious reason why the rest of that stuff can't occur as emergent behaviours.
It was a rhetorical question: there is no possible distinction between a highly sophisticated language model and an AGI. If a language model can't produce all the same answers that an AGI can produce, then it just hasn't reached the level of sophistication necessary to do that.
> It was a rhetorical question: there is no possible distinction between a highly sophisticated language model and an AGI. If a language model can't produce all the same answers that an AGI can produce, then it just hasn't reached the level of sophistication necessary to do that.
This is a weirdly dogmatic position.
I would think that an AGI can start a conversation. Can a language model?
Software can't execute itself. It must be executed by an external agent and given some kind of input. That will be true regardless of whether you consider the software to be a true AGI or a mere language model. So what does it mean exactly to "start a conversation"? Are you saying that it should be able to give a useful output with a blank input? I don't see why a language model can't do that.
In principle it might be possible for a language model to develop true general intelligence. However, there's no mechanism in GPT and similar systems that makes such a development necessary.
In fact it seems likely that the development of internal structures that might develop into the faculties needed by an AGI would, in their intermediate state, make it worse as a language model. Evolution sometimes faces such development gaps, where to develop a capability that would ultimately grant improvements, it would have to go through intermediate phases that would make it worse. Language models are optimised in a very specific way and trained to solve a very narrow problem, compared to the problems faced by physical beings.
So while I can't honestly say that no, a language model could never do that, equally there's no actual reason to believe one ever would.
Was there a mechanism in the evolutionary system of biological life on Earth that made the development of higher intelligence necessary? How did we develop higher intelligence if it's not possible to get there through a series of progressive improvements?
Furthermore I would argue that intelligence is a necessary development in order to create the most sophisticated possible language model, for the reasons described above. Anything less will not be able to perfectly emulate the conversations of other intelligent beings.
I agree the best possible language model would possess general AI, but you can't go from no intelligence to general AI in one step. It seems likely to me that the intermediate levels of intelligence such a system might develop early on might actually make it a worse language model. The evolutionary path from dumb language model to language model with human level AI might well lie through an unhappy valley in which the fledgling AI makes more mistakes and is easier to catch out than a pure language model. In which case, a system would never take that evolutionary path.
This is a problem with systems optimising towards a single specific function. It constrains their optimisation paths.
That depends on how you define AGI. You'd probably want to give it a few more senses, so it can have unwritten qualia, or at least so you can show it pictures.
More importantly, current ML models can't decide to think about things, because they always spend the same processing time on everything. GPT is autoregressive, so you do sort of feed its output back through it, but there isn't a global context it uses for the whole document.
yep, pretty much what i'm saying here. Though not all language models are built the same, eg the inference cost is unique to it due to its size. Still, most of this applies to any typical language model.
I think it's better to think of GPT-3 not as a model but a dataset that you can interact with.
Just to give an example - recently I needed to get static word embeddings for related keywords. If you use glove or fasttext, the closest words for "hot" would include "cold", because these embeddings capture the context these words appear in and not their semantic meaning.
To train static embeddings that better captures semantic meaning, you'd need a dataset that would group words together like "hot" and "warm", "cold" and "cool" etc. exhaustively across most words in the dictionary. So I generated this dataset with GPT-3 and the resulting vectors are pretty good.
More generally you can do this for any task where data is hard to come by or require human curation.
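A rough sketch of that workflow; the prompt wording, engine name, and the use of gensim for the static embeddings are my assumptions, not the original code:

    # Sketch: ask GPT-3 for groups of semantically similar words, then train
    # static embeddings on those groups (assumes the openai and gensim packages).
    import openai
    from gensim.models import Word2Vec

    def similar_words(word, n=10):
        prompt = f"List {n} words with nearly the same meaning as '{word}', comma-separated:"
        resp = openai.Completion.create(engine="davinci", prompt=prompt,
                                        max_tokens=50, temperature=0.3)
        text = resp["choices"][0]["text"]
        return [word] + [w.strip().lower() for w in text.split(",") if w.strip()]

    vocabulary = ["hot", "warm", "cold", "cool", "big", "small"]
    groups = [similar_words(w) for w in vocabulary]   # each group acts as a "sentence"

    # Words grouped together by GPT-3 end up close in the embedding space, unlike
    # co-occurrence embeddings where "hot" and "cold" are near neighbours.
    model = Word2Vec(sentences=groups, vector_size=50, window=20, min_count=1, epochs=50)
    print(model.wv.most_similar("hot"))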
That post lists 3 applications, which is not enough to be "many". No live demos.
I don't know what Google uses to generate the "question answering" replies to searches on Google, but it is not too hard to find cases where the answers are brain-dead, and nobody gets excited by it.
Yeah, 1 is "a", 2 is "a couple", 3 is "a few", 4 is "some". You don't get to "many" until at least 5, though I'd probably call it "a handful", 6 as "a half dozen", and leave "many" to 7+.
That's fair, I forgot how many they actually listed vs. just saying it is powering 300 apps. There is also http://gpt3demos.com/ with lots of live demos and varied things, though it's noisier.
There are several use cases where ML can help even if it isn't perfect or even just better than random. Here is one example in NLP/search.
Let's say you have a product search engine and you analyzed the logged queries. What you find is a very long tail of queries that are only searched once or twice. In most cases, the queries are either misspellings, synonyms that aren't in the product text, or long queries that describe the product with generic keywords. And the queries either return zero results or junk.
If text classification for the product category is applied to these long tail queries, then the search results will improve and likely yield a boost in sales because users can find what they searched for. Even if the model is only 60% accurate, it will still help because more queries are returning useful results than before. However you don't apply ML with 60% accuracy to your top N queries because it could ruin the results and reduce sales.
Knowing when to use ML is just as important as improving its accuracy.
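A minimal sketch of that gating idea; the thresholds, toy data, and scikit-learn pipeline are purely illustrative:

    # Only apply the (imperfect) category classifier to rare queries that return
    # nothing; leave head queries alone so a 60%-accurate model can't hurt them.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_queries = ["red runing shoes", "running sneakers", "laptop 15 inch",
                     "notebok computer", "wireless earbuds", "bluetooth headphones"]
    train_labels = ["shoes", "shoes", "computers", "computers", "audio", "audio"]

    classifier = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                               LogisticRegression(max_iter=1000))
    classifier.fit(train_queries, train_labels)

    def search_with_fallback(query, query_frequency, exact_results):
        if exact_results or query_frequency > 5:
            return exact_results                   # don't touch head queries
        category = classifier.predict([query])[0]  # e.g. "shoes" for "red runing shoes"
        return f"(fallback) top products in category: {category}"

    print(search_with_fallback("red runing shoes", query_frequency=1, exact_results=[]))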
I am not against ML. I have built useful ML models.
I am against GPT-3.
For that matter I was interested in AGI 7 years before it got ‘cool’. Back then I was called a crackpot, now I say the people at lesswrong are crackpots.
It's strange how HN seems to think that by religiously disagreeing with any progress which is labeled "ML progress" they are somehow displaying their technical knowledge.
I don't think this is really useful, and the arguments often have wrong assumptions baked within them.
It would be nice to see this pseudo-intellectualism quieted with a more appropriate response to these advancements.
For example, I would imagine there would have been a similar collective groan in response to the PageRank paper so many years ago, but it has clearly provided utility today.
Why is it so hard for us to recognize that even small adjustments to algorithms can yield utility, and that this property extends to ML as well?
As someone mentioned above, language models for embedding generation have improved dramatically with these newer MLM/GPT techniques, and even an improvement to F-score/AUC/etc. for one use case can generate enormous utility.
Nay-saying really doesn't make you look intelligent.
I also have strong ethical feelings and have walked away from clients who wanted me to introduce methodologies (e.g. Word2Vec for a medical information system) where it was clear those methodologies would cause enough information loss that the product would not be accurate enough to put in front of customers.
Another limitation to be aware of is that it generates text by randomly choosing the next word from a probability distribution. If you turn that off, it tends to go into a loop.
The random choices improve text generation from an artistic perspective, but if you want to know why it chose one word rather than another, the answer is sometimes that it chose a low-probability word at random. So there is a built-in error rate (assuming not all completions are valid), and the choice of one completion versus another is clearly not made based on meaning. (It can be artistically interesting anyway since a human can pick the best completions based on their knowledge of meanings.)
On the other hand, going into a loop (if you always choose the highest probability next word) also demonstrates pretty clearly that it doesn’t know what it’s saying.
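You can see both behaviours side by side with a small open model and the Hugging Face generate API (a sketch using GPT-2 as a stand-in, since GPT-3 itself is only available behind the API; assumes transformers and PyTorch):

    # Greedy decoding tends to loop; sampling breaks the loop at the cost of a
    # built-in chance of picking low-probability words.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    inputs = tok("The meaning of life is", return_tensors="pt")

    greedy = model.generate(**inputs, max_length=60, do_sample=False)
    sampled = model.generate(**inputs, max_length=60, do_sample=True,
                             temperature=0.8, top_p=0.95)

    print("greedy :", tok.decode(greedy[0]))   # usually repeats a phrase over and over
    print("sampled:", tok.decode(sampled[0]))  # more varied, but can pick unlikely words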
Came here for this... there isn't anything clearly "inherent" about these limitations, except to the degree that they are part of how some folks define the label "GPT-3".
If one packed GPT-3's structural architecture into a reinforcement learning system to grant memory, or pushed it through a compression system that made it cheaper to train or run, would you say GPT-3 had transcended its limitations, or just that you created something new? The fact that this question is semantic and uninteresting is why "Method / Model X doesn't do everything" posts don't progress the scientific conversation.
65 years of research and our cutting-edge AI doesn't have a memory? Excuse me if I'm not excited. It's likely that most of the functions of the human brain were selected for intelligence. There's such a focus on learning, when problem solving and creativity are far more interesting.
The GPT-3 hype parasitically depends on the hunger for meaning that people have.
Recently GPT-3 was held back from the public on the pretense that it was "dangerous", but in reality it was held back because it is too expensive to run and the public would quickly learn that it can answer any question at all... if you don't mind whether the answer is right.
under which "2. Linear Substructures" there are four projections of the 50-dimensional vector space that would project out just as well from a random matrix because, well, projecting 20 generic points in a 50-dimensional space to 2-dimensions you can make the points fall exactly where you want in 2 dimensions.
Nobody holds them to account over this.
The closest thing I see to the GPT-3 cult is that a Harvard professor said 'Oumuamua was an alien spacecraft. It's sad and a little scary that people can get away with that, the media picks it up, and they don't face consequences. I am more afraid of that than I am afraid that GPT-99381387 will take over the world.
(e.g. growing up in the 1970s I could look to Einstein for inspiration that intelligence could understand the Universe. Somebody today might as well look forward to being a comic book writer like Stan Lee.)
GPT-3 and its cousins do things that no previous language model could do; it is qualitatively different from ELIZA in its capabilities. As for your argument about random projections in the evaluation of GloVe, comparisons with random projections are now routine. See for example https://aclanthology.org/N19-1419/
It does something totally different. However that totally different still depends on people being desperate to see intelligence inside it. It’s like how people see a face in a cut stem or on Mars.
You're trying to prove some kind of point where you respond as ELIZA would have to show how "even back then we could pass for conversation". The truth is that GPT-3 is actually, totally qualitatively different and if you played with it enough you'd realize.
The difference is quantitative, rather than qualitative, as compared to primitive Markov models that have been used in the past. It's just a numerical model with a very large number of parameters that extends a text token sequence.
The parameter size is so large that it has in essence memorized its training data, so if the right answer was already present in the training data you'll get it, same if the answer is closely related to the training data in a way that lets the model predict it. If the wrong answer was present in the training data you may well get that.
There's a famous saying "quantity has a quality all its own".
If GPT-3 is known to have some sort of intelligence, I think it logically follows that one can differentiate that from the reflected intelligence of all the humans that produced all the data it ingested.
How do you see intelligence in GPT-3 and not see it in the data fed to it?
It's a legitimate practice in science to speculate. Having heard the Harvard guy explain more fully the Oumuamua thing, it's struck me as perfectly fine activity for some scientist to look into. His hypothesis is almost certainly going to be untrue, but it's fine to investigate a bit of a moonshot idea. You don't want half the field doing this, but you absolutely need different little pockets of speculative work going on in order to keep scientific inquiry open, dynamic, and diverse.
> there are four projections of the 50-dimensional vector space that would project out just as well from a random matrix because, well, when projecting 20 generic points in a 50-dimensional space down to 2 dimensions you can make the points fall exactly where you want.
Well, no. Linear projections follow a bunch of rules that enforce conservation of various linear structures. You can't manipulate things arbitrarily.
For example, if three points are collinear in the original space, they will be collinear in every projection. Perhaps more relevant to the GloVe examples, if a - b = c - d in the original space then the same equality holds in every projection. Since the projections are Lipschitz continuous, we can also say that if a - b is close to c - d in the original space then they are also close in every projection.
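That's easy to confirm numerically with random points and an arbitrary linear projection, for example:

    # Linear projections preserve vector offsets: if a - b == c - d in the
    # original 50-dim space, the same holds after projecting to 2D.
    import numpy as np

    rng = np.random.default_rng(0)
    a, b, c = rng.normal(size=(3, 50))   # three random 50-dimensional points
    d = c - (a - b)                      # constructed so that a - b == c - d

    P = rng.normal(size=(50, 2))         # an arbitrary linear projection to 2D
    pa, pb, pc, pd = a @ P, b @ P, c @ P, d @ P
    print(np.allclose(pa - pb, pc - pd)) # True, for any choice of P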
Anyway, the point that these figures illustrate is easily confirmed by downloading the embeddings yourself, so insinuating that the authors are getting away with something that they should be held to account for is silly.
Confused. If professor Loeb tries to at least open discourse to the idea that ET space junk might be flying around like our space junk in a desire to reduce the giggle factor around that hypothesis, what sort of "consequences" do you think he should face for that?
In addition to that subtle memory issue, it has no reference at all to the space/time world we people model mentally to think with. So, basically, there is no I in the GPT-3 AI, just A.
Many GPT-3 cultists are educated in computer science so they should know better.
GPT-3's "one pass" processing means that a fixed amount of resources are always used. Thus it can't sort a list of items unless the fixed time it uses is humongous. You might boil the oceans that way but you won't attain AGI.
There are numerous arguments along the line of Turing's halting problem that restrict what that kind of thing can do. As it uses a finite amount of time it can't do anything that could require an unbounded time to complete or that could potentially not terminate.
GPT-3 has no model for dealing with ambiguity or uncertainty. (Other than shooting in the dark.) Practically this requires some ability to backtrack either automatically or as a result of user feedback. The current obscurantism is that you need to have 20 PhD students work for 2 years to write a paper that makes the model "explainable" in some narrow domain. With this insight you can spend another $30 million training a new model that might get the answer right.
A practical system needs to be told "you did it wrong" and why, and then be able to correct itself on the next pass if possible, otherwise within a few passes. Of course, a system like that would be a real piece of engineering that people would become familiar with, not an outlet for their religious feelings that is kept on a pedestal.
The big issue is that it literally knows nothing - there is no reference to a model of the real world such as humans use when thinking about the real world. It is a very advanced pattern matching parrot, and in using words like a parrot, knows nothing about what those words mean.
Exactly, with "language in language out" it can pass as a neurotypical (passing as a neurotypical doesn't mean you get the right answer, it means if you get a wrong answer it is a neurotypical-passing wrong answer.)
Actual "understanding" means mapping language to something such as an action (I tell you to get me the plush bear and you get me the plush bear,) precise computer code, etc.
I'm inclined to agree, but positing that "the meaning of a word is its use in a language" is a perfectly respectable philosophical position. In this sense, GPT3 empirically bolsters Wittgenstein.
Given that Chomsky's paradigm (and all the others) have failed so completely at automated language understanding, I'm much more inclined to believe that the "language instinct" is a peripheral for an animal, and if you want language understanding you need the animal or a simulation of the animal.
outlines a promising approach to mapping the semantics encoded in these models. Understanding their limits could make your rejection of Wittgenstein all the more precise.
>There are numerous arguments along the line of Turing's halting problem that restrict what that kind of thing can do. As it uses a finite amount of time it can't do anything that could require an unbounded time to complete or that could potentially not terminate.
I have used a similar argument to show that the simulation hypothesis is wrong. If any algorithm used to simulate the world takes longer than O(n) time, then the most efficient possible computer for the job is the universe itself, which computes everything in O(n) time, where n is time. In other words, you never get "lag" in reality no matter how complex the scene you're looking at is. Worse than that, some simulation algorithms have exponential time complexity!
That doesn't prove or disprove anything. What we experience as time would be part of the simulation, were such a hypothesis true. As such, the way in which we experience it is fully independent from whatever costs it might have to compute.
So you're saying that an exponential-time algorithm with n equal to every atom in the universe will complete before the heat death of the other universe that the simulation is taking place in? Sorry, not plausible.
Our laws of physics are space partitioned so the algorithm for simulating it isn't exponential.
If the containing universe has, say, 21 dimensions and otherwise has computers similar to ours today, then you should be able to simulate our universe in a datacenter just fine, because computation ability grows exponentially with the number of dimensions. With 3 dimensions you have 2 dimensions of computation surface; with 21 dimensions you have 20 dimensions of computation surface, i.e. our current computation to the power of 10. GPT-3 used more than a petaflop of real-time compute during training, so 10^15 flops. The same hardware in our fictive universe would give us 10^150 flops. We estimate the atoms in the universe at about 10^80, so that computer would have 10^70 flops of compute per atom, which should be enough even if entanglement gets a bit messy. It would have around that much memory per atom as well, so it could compute lots of small boxes and sum over all of them to emulate particle waves. We wouldn't be able to detect computational anomalies at that small a scale, so we can't say there isn't such a computer emulating us.
Heat death would be a part of our simulation, not necessarily of the host. But since all of these simulation theories are basically religious beliefs, it is not very interesting to reason about them.
GPT is a joke, but Ashby's various "Laws" and "Theorems" are just theories. As in, think "Jung's theory of archetypes". Cyberneticists freely cribbed hard-sciences terminology for their psycho-speculation, knowing their humanities peers couldn't call them to account on the technical meanings. "Theorem" and "law" just sound engineer-y.
This is very specific to GPT-3 and not generally true, though. And GPT-3 is not an agent per se but rather a passive model (it receives input and produces output, and does not continuously interact with its environment). So it makes sense in this context, and just goes to show GPT-3 needs to be understood for what it is.
The current leading purchase-able extremely-over-hyped-by-non-technicals language model has no memory, yes.
You see the same thing in all popular reporting about science and tech. Endless battery breakthroughs that will quadruple or 10x capacity become a couple percent improvement in practice. New gravity models mean we might have practical warp drives in 50 years. Fusion that's perpetually 20 years away. Flying cars and personal jetpacks. Moon bases, when we haven't been on the moon since the 70s.
AI reporting and hype is no different. Maybe slightly worse because it's touching on "intelligence", which we still have no clear definition of.
Planaria can move towards and away from things and even learn.
Bees work collectively to harvest nectar from flowers and build hives.
Mammals have a "theory of mind" and are very good at reasoning about what other beings think about what other beings think. For that matter birds are pretty smart in terms of ability to navigate 1000 miles and find the same nest.
People make tools, use language, play chess, bullshit each other and make cults around rationalism and GPT-3.
Birds don't need 150,000 litres of jet fuel to fly across the ocean. Given that the development of airplanes was made by studying birds I'm not sure I see your point. The 1889 book "Birdflight as the Basis of Aviation" is one example.
This isn't actually clear; with things like this we are on a path but it may not lead anywhere that fundamental (at least when we are talking "AI", especially general AI).
Is that true? How is it able to have the conversation shown in the article about python programming, if it can’t remember that the premise of the later questions it is being asked is that it said it has Python programming experience?
It is fed each answer it generates as part of the next input. That's why it can't keep that up for too long: the size of its input grows, well, kinda-sorta exponentially quickly until it hits the limit of 2048 tokens.
Which is one reason why only diminishing returns can come from increasing computational resources, btw. For inference, anyway.
You can see examples of that kind of back-to-back querying and examples of code to do it in one of the links of the article above:
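The general shape of that back-to-back loop is roughly the following (a sketch against the legacy Completions API; the prompt framing and truncation strategy here are illustrative, not the article's code):

    # Each answer is appended to the prompt for the next turn, and the prompt is
    # truncated so it stays under the model's 2048-token context window.
    import openai

    MAX_PROMPT_CHARS = 6000   # rough character stand-in for the 2048-token limit

    def chat_turn(history, user_line):
        history += f"Human: {user_line}\nAI:"
        if len(history) > MAX_PROMPT_CHARS:
            history = history[-MAX_PROMPT_CHARS:]   # keep only the most recent context
        resp = openai.Completion.create(engine="davinci", prompt=history,
                                        max_tokens=150, temperature=0.7,
                                        stop=["Human:"])
        answer = resp["choices"][0]["text"].strip()
        return history + " " + answer + "\n", answer

    history = "The following is a conversation with an AI assistant.\n"
    for line in ["Do you have Python experience?", "What have you built with it?"]:
        history, answer = chat_turn(history, line)
        print("AI:", answer)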
> What causes this? I'm curious to know what triggers this behavior.
Here's an example of GPT-2 posting on Reddit, getting stuck on "below minimum wage" or equivalent: https://reddit.com/r/SubSimulatorGPT2/comments/engt9v/my_for...
(edit) another example from the GPT-2 subreddit: https://reddit.com/r/SubSimulatorGPT2/comments/en1sy0/im_goi...
With GPT-3, I saw GitHub Copilot generate the same line or block of code over and over a couple of times.