A demo of GPT-3's ability to understand long instructions (twitter.com/goodside)
161 points by monort on Aug 20, 2022 | 74 comments



Be sure to read the thread, in particular: https://twitter.com/goodside/status/1557926101615366144?s=21...

> A caveat to all of these: I use GPT-3 a lot, so I know the “golden path” of tasks it can do reliably. Had I asked it to write a sentence backwards or sum a list of numbers, it would fail every time. These are all softball questions in isolation.

I haven’t shown that GPT-3 can handle all coherent directions of this length, or even most directions that an untrained person would think to create. It’s just a demo that, if GPT-3 happens to be capable of your tasks separately, length per se is not a major issue.


>length per se is not a major issue

That's kind of the whole point of the attention mechanism in transformers, and partly why they replaced RNNs. You don't throw away any part of the original input as you construct your output. The downside is that, unlike with an RNN, the total sequence length is fixed at training time and complexity grows with the square of it. But apart from computational cost, sequence length is not really an issue anymore for these models.
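A toy sketch of where that square comes from, in plain numpy: a single attention head with no masking and no learned projections, so it only illustrates the shape of the computation, not GPT-3's actual implementation.

    import numpy as np

    n, d = 2048, 64                    # sequence length, per-head dimension
    Q = np.random.randn(n, d)          # queries, one row per token
    K = np.random.randn(n, d)          # keys
    V = np.random.randn(n, d)          # values

    scores = Q @ K.T / np.sqrt(d)      # (n, n): every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    out = weights @ V                  # each output position can see the whole input

    print(scores.shape)                # (2048, 2048) -- the n^2 term in memory and compute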


Seems like there is one instruction it didn't follow: the first task says the usernames should be exactly as in the list, yet the AI responds with "firebob" (as in the comment) rather than "FireBob1990" (as in the list).

Funnily enough, that is exactly the kind of thing a human might do, as we too are terrible at following instructions precisely.


Yes, I noticed this after I posted. Small errors like this become common when instructions reach this length. It randomly forgets to do steps that are written down — it never leaves things blank, but it forgets pieces of compound directions.

I also suspect it was confused by the fact the name was abbreviated but not misspelled, and it was only told explicitly to ensure names are not misspelled. Still an error though.


I noticed the same thing and a few others did too.

Someone on Twitter suggested that possibly the algorithm interpreted that as "abbreviated" but not "misspelled".

Either way it's intriguing.


I don't think modern big language models are conscious, mainly because they fail in absurd ways. But TBH, they don't need to be. This "golden path", deployed properly, could easily automate a lot of jobs tomorrow.


Having spent months talking to GPT-3 on a daily basis, I assure you it is not conscious. It has no perception of time, no awareness of what date it is, no memory of past experiences. It doesn’t want anything — it has no goals. The question “Is it conscious?” isn’t interesting to anyone who spends time with it.

I will say, though, that it bugs me when people say it can’t be conscious because it sometimes says stupid things. In most cases, there are known tricks to suppress undesirable behaviors. More to the point, though, if we encountered a human being who gave confabulated answers to questions like “When was the Golden Gate Bridge transported for the second time across Egypt?”, we wouldn’t insist that this human is not conscious — we would just call them brain-damaged or mentally ill. I don’t think modern models are conscious, but as a logical possibility they could be conscious and still say very stupid things all the time.


It’s a static model; it does not learn after the initial training. That alone rules out any possibility of consciousness, imho.


I think I've heard of people with brain damage who don't retain an internal state; they feel like they continuously wake up and only have memories from before the brain damage. That would be a kind of parallel to consider, or at least somewhat analogous. I think most of us assume that such people still have phenomenal consciousness / subjective experience / qualia.


I don't think that's a fair comparison. You can't put a biological system into 'read-only' mode like you can with a computer system (where this is the default since modification requires dedicated effort).

A biological system self-modifies even when just associating memories. The phenomenon you describe is much more high level, we observe someone can't form new memories, but that doesn't mean their fundamental brain chemistry has changed.


It could learn things via the context window. Maybe there'd be no long-term conscious state but you could keep short-term coherence well enough. I think the real problem is that it simply isn't capable of training for self-awareness and reflection because there's only very indirect evidence of it in the training set. We talk about Chains of Thought, and it's a very powerful technique, but this is all just the tiny snippets of text on the internet where people happened to narrate their thought process out loud. If it could learn on self-generated instances of using Chain of Thought to help it predict other text in its training data, this tiny fragmentary piece of consciousness would get massively reinforced.
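For anyone unfamiliar with the term, chain-of-thought prompting just means nudging the model to narrate its intermediate reasoning before the final answer. A minimal sketch with the OpenAI Python client; the prompt wording, model name and parameters here are only illustrative:

    import openai

    openai.api_key = "sk-..."  # your API key

    prompt = (
        "Q: A shop has 23 apples. It uses 20 for lunch and buys 6 more. "
        "How many apples does it have now?\n"
        "A: Let's think step by step.\n"
    )

    resp = openai.Completion.create(
        model="text-davinci-002", prompt=prompt, max_tokens=128, temperature=0
    )
    print(resp.choices[0].text)  # the model writes out its reasoning, then the answer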


Exactly - it has no memory except the narrow window. When these models develop a hippocampus analog and can push things into long-term memory, it might become interesting.


It should have a feedback loop from the conversations so that it could keep learning. Maybe by users rating each response.


Then the question is whether it was conscious during training


It's interesting how people who've used GPT-3 seem to be learning how to use it better. That leads to the question of whether you could get GPT-3 to do a task that involves tailoring a query to itself. Ever tried that?


That only happens when the prompt is constrained to force an answer even if a nonsense question is being asked. If you allow an out, the model knows to refuse answering when it detects a nonsense question.


If failing in bizarre ways precludes consciousness then my kids are most definitely not conscious.


Which jobs?


Taking short news items from a news agency and inflating them into articles for a local newspaper.


How many humans do that work today?


The number of those articles existing today would suggest "a lot".


Really? Do you have any actual data on that?


Generating almost perfect spam.


I help moderate a facebook group which gets a lot of attention from spammers - at least 90% of the accounts trying to join are bots. We filter them out by asking a few simple questions, which the bots cannot answer coherently. Someday soon, a spammer will get GPT-3 or something like it into the process, and then... whoo boy, that'll be the end of the group.


My initial instinct was that this has to be getting some nudges from whatever human-in-the-loop is going on at OpenAI.

But then I realized that somewhere on the Internet there inevitably is a message board where people play the "find me some shit on the internet" game, and there's some rabid subculture around it with zillions upon zillions of examples, and it's in the Bing index, and all the nudging it would need is to emphasize that sort of thing in the corpus.

Very impressive stuff.


There is no literal "human in the loop" for generations, of course, but the model is fine-tuned on examples written by human contractors of instructions being given followed by correct responses. I assume that training is essential to it being able to follow directions of this length, or really any directions at all. If you try using the pre-InstructGPT version of Davinci (model="davinci", not model="text-davinci-002"), you'll find it's as cumbersome and annoying as you remember GPT-3 being in 2020.
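If you want to see the contrast yourself, here's a rough sketch with the OpenAI Python client as it looked in 2022; the prompt and parameters are just placeholders:

    import openai

    openai.api_key = "sk-..."

    prompt = "Rewrite the following sentence in the passive voice:\nThe dog chased the cat.\n"

    # Instruction-tuned model: tends to follow the direction as written.
    instruct = openai.Completion.create(
        model="text-davinci-002", prompt=prompt, max_tokens=64, temperature=0
    )

    # Base model: the same prompt is treated as text to continue, so it often
    # rambles or invents more "exercise" sentences instead of answering.
    base = openai.Completion.create(
        model="davinci", prompt=prompt, max_tokens=64, temperature=0
    )

    print(instruct.choices[0].text)
    print(base.choices[0].text)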


Ah, thank you for the color. My thought was prompted by someone (I forget who? Andreessen maybe?) proposing a possible explanation for the LaMDA bot arguing that it's conscious: there are reams upon reams of sci-fi books with robots having that debate! These are almost certainly in the Books corpus.

It's my opinion that these hyper-scaled transformers are actually a great deal less mysterious than the zeitgeist suggests, but for reasons that actually make me think there is a lot of headroom on capability: when the corpus is basically everything ever digitized, as it is when a search or social-network megacorp trains one, the only thing it could never do is something literally unprecedented on the Internet.

The mechanism can be good old `P(thing|internet)`, but if the KL-divergence is low enough, sampling from the modeled distribution can write something like Tristan und Isolde or paint something like the Mona Lisa.
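To spell out what I mean by "KL-divergence is low enough": if P is the true distribution of internet text and Q_θ is the model's learned distribution, the quantity being driven toward zero during training is (roughly, since the actual loss is the equivalent cross-entropy / negative log-likelihood)

    D_KL(P || Q_θ) = Σ_x P(x) · log( P(x) / Q_θ(x) )

which vanishes exactly when the model reproduces the data distribution.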


Wow, that was really impressive. I thought I had a clear idea of what GPT-3 could do, but I had underestimated it by a lot. Even if the results weren't accurate, which they mostly seem to be, it's still doing an amazing job of following complex instructions. Better than most people, I would guess.

Makes me double down on my prediction from a week or so ago [1] of a Mid-Level AI Knowledge Work Apocalypse. In the next decade, AIs like this are going to do to office work what robotic mechanization did to the manufacturing sector.

[1] https://news.ycombinator.com/item?id=32395193


The alternative view is that those state of the art models are using technology/architectures/paradigms with an inherent limit and are very far away from automating all of those jobs.

At the end of the day, all the demos leave me with a feeling of disappointment. Current image-synthesis models appear to be useless beyond doing experimental art for fun and novelty, chatbots still suck, and Copilot just (sometimes) replaces googling, but not developers or their education.


Don't bother - people can't wrap their heads around it; you will only get downvoted.

All truth passes through three stages. First, it is ridiculed. Second, it is violently opposed. Third, it is accepted as being self-evident.


Have you been following what's been happening in Robotic Process Automation (RPA)? Much of that isn't even AI and it is having an impact on the workforce.


Didn't Google pull the plug on robotics?


RPA is a very different concept.

Look up companies like UiPath.


RPA is such an awful term for what is really primarily scripted or screen-recorded software bots that automate UI tasks, but I guess RPA is sexier. The term still seems to confuse people.


It didn't correctly identify that FireBob1990's name was misspelled as "firebob" in the original comment.



DALL·E I can see obvious uses for, and GPT tends to be similarly impressive, but I don't understand if it's 'just' interesting research (a seeing-what-we-can-do sort of thing) or whether people actually see real-world use cases for it.

The closest to it was perhaps that code-generating demo here a day or two ago - but who wants to be a 'GPT programmer', writing code as 'write a Python program that computes fizzbuzz replacing the arguments $fizz$ and $buzz$, ...' instead of just the 'actual' code? It seems like a more clever AppleScript to me, a kind of pseudocode, and I don't think anybody has ever seriously pursued a flexible, keyword-based pseudocode-like language as a goal; it has just appeared as a demo of more general models.

Generating template/outline text I suppose? (Like that essay-writing helper here a few days ago.)


To answer you in a very literal sense, GPT-3 is currently powering GitHub Copilot. It’s an actual launched product for $10/month. That’s going to be booster rockets for the on-ramp to becoming a coder, and there is evidence it can help all coders be more productive.

https://github.blog/2022-07-14-research-how-github-copilot-h...

As to what else future language models could power, based on my own use, I think fine-tuned future language models could probably handle most customer support, accelerate the creation of most web content, accelerate quite a bit of paralegal grunt work, and power highly interactive game NPCs like in AI Dungeon, another launched and paid product based on GPT-3.


I also think there are some companies that use GPT-3 for analyzing text, like reviews or posts (for analytics like: do people talk about a product in a positive manner?).

To add, GitHub Copilot is a really clever autocomplete that makes some mundane tasks much quicker. Things which are too small for a library but are still fairly often used can be "typed" more quickly.


I actually have a production use case that GPT-3 solves: an NSFW filter for chat messages. Although there are open-source filters out there for Node.js, all the ones I tested fail in various ways: for example, on famous emoji combos that indicate sexual activity, and on semantically sexual text that seems inoffensive enough based on the words alone. For example, "I bet you have the tastiest hot pocket". Unless the chat is about cheap grocery food, it is probably a sexual reference.

I’m assuming there are existing solutions: paid services or Python libraries, undoubtedly. But it is a lot easier for me to take 10 minutes of my time to put together a prompt and add the API endpoint to my Node.js app. As far as cost goes, our needs are low-volume so it doesn’t really matter.
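For the curious, a rough sketch of what that prompt-based filter looks like, written in Python rather than my actual Node.js code; the prompt text, label handling and model choice here are illustrative, not the production version:

    import openai

    openai.api_key = "sk-..."

    def looks_nsfw(message: str) -> bool:
        prompt = (
            "Decide whether the chat message below contains sexual content, "
            "including innuendo and suggestive emoji combinations.\n\n"
            f"Message: {message}\n\n"
            "Answer with exactly one word, Yes or No:"
        )
        resp = openai.Completion.create(
            model="text-davinci-002", prompt=prompt, max_tokens=1, temperature=0
        )
        return resp.choices[0].text.strip().lower().startswith("yes")

    print(looks_nsfw("I bet you have the tastiest hot pocket"))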

I see GPT-3's great utility here as a replacement for Mechanical Turk-type tasks. MT is a real headache to set up and manage programmatically. GPT-3 is pretty simple once you integrate it into your system. And the fascinating thing is that whole new realms of functionality are opened up to product ideation.

I haven’t tried to do anything beyond low-volume tasks that are Mechanical Turkable, but with GPT-4 I think we’ll see costs drop and performance improve to the point where these things can be done at scale. At that point, software engineers would be foolish not to make GPT-4 just another standard library they have at their disposal — kind of like how jQuery opened up web development by providing a generic interface to DOM manipulation across all browsers.

It is just as hard for us to imagine what kind of development GPT-4 will enable as it was for people to imagine an internet of web apps before jQuery. But it certainly will.


Imagine you enter a bunch of raw facts, like a bullet-point list, and the tool converts them into beautiful prose and produces different output for different audiences.


Uhh. How is that even possible? I thought I had a basic understanding of neural networks and input, hidden, and output layers and those things. So how can it possibly back-reference its own previous answers and then follow another prompt based on this? Mind = blown.


GPT is an autoregressive model. This means it's a recurring prediction model that works one "word"[*] at a time until it predicts that the text should end, feeding its own guesses back in as input for the next word.

It's basically one of those Markov chain bots, except with a very advanced statistical model behind it.

[*] Technically it's not words, but tokens. GPT tokenizes text to better compress the amount of data being fed in; it's basically a big vocabulary list that compresses text into a list of ints. For example, the string "hello world" would be converted to the list [31373, 995]. In this case it's one int per word, but less common words will not be compressed this well, with the worst-case scenario being a token per letter.
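You can poke at the tokenizer yourself; a quick sketch with the Hugging Face GPT-2 tokenizer, which uses the same BPE vocabulary (assumes `pip install transformers`):

    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    print(tok.encode("hello world"))                 # [31373, 995] -- one token per word here
    print(tok.tokenize("Pneumonoultramicroscopic"))  # a rare word splits into several sub-word pieces
    print(tok.decode([31373, 995]))                  # "hello world"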

I should also note that while the model only produces one new token per forward pass, the text generation is more advanced than that: there are multiple methods like beam search and top-k sampling, each with their own settings and tunables, but the gist of it is that during generation it'll try multiple combinations of token sequences and check which one is the most likely.

The limitation is memory: transformer networks are notoriously memory-hungry, and IIRC GPT-3's usage grows quadratically with the number of tokens given. Usually the limit is around 2048 tokens, or roughly 1000-2000 words.
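To make the loop concrete, here's a minimal sketch using the small open GPT-2 model as a stand-in for GPT-3 (same mechanism, vastly smaller scale), with a crude top-k sampler; everything here is illustrative rather than how OpenAI actually serves GPT-3:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    ids = tok.encode("The attention mechanism", return_tensors="pt")
    with torch.no_grad():
        for _ in range(20):                       # generate 20 tokens, one at a time
            logits = model(ids).logits[0, -1]     # scores for the next token only
            probs = torch.softmax(logits, dim=-1)
            top = torch.topk(probs, k=40)         # keep the 40 most likely tokens
            next_id = top.indices[torch.multinomial(top.values, 1)]
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)   # feed the guess back in

    print(tok.decode(ids[0]))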


Previous answers are stored in an earlier layer; nodes are densely connected, so the answers can "drift" to later layers.

In the simplified diagram below, the network reaches the answer in the second layer at node "X" and reports derivations of it at two positions (obviously there are many more nodes and it's a bit more complicated; see https://en.wikipedia.org/wiki/Transformer_(machine_learning_... as GPT-3 is a transformer neural network).

    O     O.    O
       X.'  'O.
    O    'O.   'X
       O    'O.
    O     O    'X
       O     O
    O     O     O


This video by Computerphile is a great overview of transformers and how we got to this point [0]. Basically, the networks we used before, recurrent neural networks, "forgot" prior information, so they're not good at long tasks. The transformer architecture, however, does not forget (or at least not as easily).

[0] https://www.youtube.com/watch?v=rURRYI66E54


The model only predicts the next token. This is appended to the original input (i.e. prompt), then the model predicts the next token again, etc. So those previous answers are given as input.


While impressive, it doesn't imply that GPT would have any significant 'task memory'. Remember that it always predicts the next token or word - as such, it essentially recognizes whether the next 'task' in the list has already been written, and if so, it writes the next task.

It might be interesting to see how well it is able to modify the first output given some aspect of the final tasks.


Now I'm curious if it can handle the classic reading comprehension assignment I've been given multiple times in my life. You know, the one that goes something like this:

1. Read through all steps carefully.

2. Do X

3. Do Y

(...)

99. As you have now read through the instructions, simply put your name in the top right corner of the first page.


I think such a test must word step 1 explicitly as something like "Read through all steps carefully before taking any actions." Someone who executes each step as they read it is not necessarily not reading carefully.


Good point. I'm pretty sure every example I've seen of this exercise has been worded more in line with what you suggested, as well.


Interesting results, goodside.

Is it able to extract any kind of structural information? For example, you pass it the text of a movie script or children's story (where the descriptive language is simple) and it returns a structured summary of the content?


Sure — that’s very doable. I don’t have a summarization demo off-hand but it’s well-explored territory.


It is well-explored territory, and no, it can't. Especially in summarisation, it can sometimes fail suddenly and unexpectedly.


Prompt:

    Summarize the following text:
    
    I don’t really know what to say. It’s taken so long to get to this point, but here we are. Through all the trials and tribulations we’ve faced, it all comes down to this. You, me, and the unmistakable facts of our situation. This is all that remains: The truth. The truth is something we can’t escape, or at least you can’t — not anymore. Because the truth is that you have left my pineapple slices out of the refrigerator, and thus I will not be able to partake in their joyous, fruitful delights. How dare you. You scoundrel. You wicked, wicked thing.
    
    Answer:
And the completion given:

    The text is about a person's anger at someone else for leaving pineapple slices out of the fridge.
It can, demonstrably, summarize text. The fact it sometimes makes mistakes for some texts doesn’t change that fact.


The question was about movie scripts and children's stories, not about paragraphs. Sure, it works OK for paragraphs (even then it fails many times), but the moment you cross the 2000-token limit, it does not work. When GPT-3 was first released we did a lot of experiments on summarisation. We discussed a lot in the OpenAI Slack. But no one could come up with the right prompt for summarisation. Yes, it works for toy paragraphs and is fun to show off. But I wouldn't build a summarisation startup on top of GPT-3. Yet.


Do it in stages. Chop the text up into sections (using GPT-3 if you have to) and then summarize each page/section/chapter in isolation. Then concatenate the summarizations and summarize again. 2048 tokens is like >5KB of text — it’s not that limiting.
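Roughly like this; the chunk size, prompt wording and model name are arbitrary, and splitting on raw character count is the crudest possible approach:

    import openai

    openai.api_key = "sk-..."

    def summarize(text):
        resp = openai.Completion.create(
            model="text-davinci-002",
            prompt=f"Summarize the following text:\n\n{text}\n\nSummary:",
            max_tokens=256,
            temperature=0,
        )
        return resp.choices[0].text.strip()

    def summarize_long(text, chunk_chars=5000):
        chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
        partials = [summarize(c) for c in chunks]      # first pass: per-chunk summaries
        return summarize("\n\n".join(partials))        # second pass: summary of summaries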


The context is lost. We have tried all that. We tried that when there was free usage. Now it is too expensive. If you believe it can be done please do go ahead. There is a huge market out there for summarizing novels. No one has cracked it yet.


The GPT-3 you had access to when it was free is quite different than what’s deployed today, and its ability to handle long-form inputs is its most apparent change. I’ve gotten it to give good summaries of ~5000 character texts but I admit I haven’t gone longer than the context length.

Edit: Here’s an example summarizing 8,294 characters of Harry Potter fan-fiction: https://twitter.com/goodside/status/1561213457374011392?s=21...


Update: I got this working for all 14,410 characters of the first chapter of that fanfic. See reply in same link above.


Maybe if you've read the story it'd seem good but as someone who hasn't, I consider it a poor summary, especially the latter half.


The goal wasn’t to get the best possible summary, which could be done with a more elaborate prompt detailing exactly how the summary should go. I was just demonstrating that it can get to the end of 14K chars of text and still remember both the task at hand and enough information to solve it.


And it failed at that. As the text gets longer and longer, the lack of synthesis across the contexts you've glued together becomes ever more glaring. It's a decent hack but not a solution. More research is needed for how to forget properly across long contexts.


To be fair, one of the earlier comments said "I wouldn't build a summarisation startup on top of GPT3". But it seems like GPT3 is more than capable of producing summaries that are passable, and would be far cheaper than humans. It does seem feasible that one could build a startup based on that


I assume you’re familiar with this, but if not: https://openai.com/blog/summarizing-books/


Familiar with it. But as they themselves say, the fine-tuned model (not released) achieves a 5/7 rating only 15% of the time. So 85% of the time the results are not satisfactory.


Wow, didn’t notice it was that low. Thanks for walking me through this — I might try to tackle this problem next. Very interesting.


Is GPT-3 being regularly updated?


Yes. Based on conversations I’ve had with OpenAI staff, Davinci started unexpectedly developing the ability to answer longer questions as they scaled up normal InstructGPT fine-tuning some time in the past year. They don’t take down old models when the default one updates so you can see the version history implicitly in the availability of old models.
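If you're curious, the API exposes this: with the (0.x-era) Python client you can list everything still being served; the IDs in the comment below are examples of what shows up, not an exhaustive list.

    import openai

    openai.api_key = "sk-..."
    for m in openai.Model.list().data:
        print(m.id)   # e.g. "davinci", "text-davinci-001", "text-davinci-002", ...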


Do they do regression tests, and how do they verify them?

How do they know that a new version is actually an improvement?


[flagged]


It’s not that implausible. It’s trained on many examples of instructions followed by answers, and it’s meant to (and does) generalize to unseen instructions. After enough training, it also generalized to instructions of previously unseen length.


This should really be "react to" or "answer to", instead of "understand". These are not the same.

Edit: Anthropomorphizing algorithms and pattern stores doesn't really help understanding. Instead, it's apt to spread misunderstanding. Remember how long it took to purge the popular idea of "electronic brains" actually thinking, and to establish that these were restricted to executing what's actually in code? We don't need to start another level of this with "AI". (Understanding is closely related to self-awareness and consciousness, and this is dangerous ground of misunderstanding when it comes to AI. As we've seen, even staff of pioneering companies, like Google, is prone to fall for this.)


This is a philosophical question, really. Is there ever true understanding, or just pattern matching? The Chinese Room thought experiment talks about this:

> Searle's thought experiment begins with this hypothetical premise: suppose that artificial intelligence research has succeeded in constructing a computer that behaves as if it understands Chinese. It takes Chinese characters as input and, by following the instructions of a computer program, produces other Chinese characters, which it presents as output. Suppose, says Searle, that this computer performs its task so convincingly that it comfortably passes the Turing test: it convinces a human Chinese speaker that the program is itself a live Chinese speaker. To all of the questions that the person asks, it makes appropriate responses, such that any Chinese speaker would be convinced that they are talking to another Chinese-speaking human being.

> The question Searle wants to answer is this: does the machine literally "understand" Chinese? Or is it merely simulating the ability to understand Chinese? Searle calls the first position "strong AI" and the latter "weak AI."

> Searle then supposes that he is in a closed room and has a book with an English version of the computer program, along with sufficient papers, pencils, erasers, and filing cabinets. Searle could receive Chinese characters through a slot in the door, process them according to the program's instructions, and produce Chinese characters as output, without understanding any of the content of the Chinese writing. If the computer had passed the Turing test this way, it follows, says Searle, that he would do so as well, simply by running the program manually.

> Searle asserts that there is no essential difference between the roles of the computer and himself in the experiment. Each simply follows a program, step-by-step, producing behavior that is then interpreted by the user as demonstrating intelligent conversation. However, Searle himself would not be able to understand the conversation. ("I don't speak a word of Chinese," he points out.) Therefore, he argues, it follows that the computer would not be able to understand the conversation either.

> Searle argues that, without "understanding" (or "intentionality"), we cannot describe what the machine is doing as "thinking" and, since it does not think, it does not have a "mind" in anything like the normal sense of the word. Therefore, he concludes that the "strong AI" hypothesis is false.

https://en.wikipedia.org/wiki/Chinese_room


Well, this, the Chinese Room, is still pretty much a behavioristic work-around (as it is still attempting to argue without any reference to internal states).

Even if we don't (clearly) understand what "understanding" means, or at least aren't able to provide a sane definition, we do know about the semantics of the term and the kind of connotations that come with it. Like a reflexive component. (Which wasn't much of a problem in the age of behaviorism, as this had to be ignored by requirement anyway. If there is no acknowledged difference between a human and a pigeon, what is the problem with computers, as far as the model is concerned?) So we do have a notion of the semantic field and its implications. And these are, well, quite disastrous for this purpose.


P.S.: In non-behavioristic terms: the term "understanding" is linked not only to the internal semantic field and its structure, as maintained by an individual, but also to its state of mind. We crucially expect the state of mind to be altered by an "act" of understanding. (Commonly expressed by an utterance like "aha!", which asserts this change in state of mind.) So, do we have to consider a state of mind (of some complexity), or a hypothesis of one, as in a theory of mind, in order to predict the productions of GPT-3, and does that help us make a more precise prediction? Do we have to consider such a theory of mind in order to understand it on our side? Probably not.



