Hacker News
GPT-3 and Scaling Trends (nostalgebraist.tumblr.com)
62 points by luu on June 4, 2020 | 18 comments



This blogger implies that GPT-3 could be within an order of magnitude of reaching a threshold of data efficiency. The implication is that a GPT model with >1 trillion parameters would begin to see a reduction in data efficiency. I have read all the GPT-related papers and I’m frankly not sure what nostalgebraist thinks this would mean, practically speaking. All complex problem domains see a dropoff in data efficiency once the “easy” structure is successfully learned. The lesson here might simply be that GPT-type models are close (within an order of magnitude, so “close” is subjective) to being able to learn all the “obvious” regularities in massive language datasets, leaving the increasingly subtle regularities that require very specific, very hard-to-reach (through ravines of local minima), or very “big” learned abstractions to discover. Since GPT-3 can already do some rather incredible things in the zero-shot case, to say nothing of the few-shot case, this fails to make me feel suddenly dissatisfied with the performance of large transformer models.


Does anyone know of research into chatbot memory? I found this one: https://deepai.org/publication/a-proposal-for-intelligent-ag...

GPT-2 was a solution to a certain kind of problem: is it possible to throw out the idea of representing language while still having good performance on a variety of language tasks? But there doesn't seem to be anything like that for "memory," which seems separate and distinct from "tasks."

In concrete terms, I'm interested in models that have memory in the sense of "When you inference from the model, you leave a lasting impact on the model." Inferencing from the model should cause a change in the model parameters. Yet most models currently seem to assume that model parameters can be frozen without losing something essential to the end goal.

We're very task-oriented. But none of these chat bots can remember my name, or anything about me, and it's always bugged me. GPT-2 (and now GPT-3) punts the problem to a sufficiently clever programmer: just figure out how to encode all the "memory" into the context window, and then out pops the results you want. But that feels rather like arguing "Just come up with a technique that works, and it will work." Perhaps it's true, but not too helpful.

If you hear the same name a few times, you'll remember it a long time, and start associating it with someone's face. It seems like language models could do something similar. I don't know precisely what; maybe someone here does.

You could designate part of the model as long-term memory, short-term, etc. Inferencing from the model could cause larger effects in the short-term area than the long-term area (equivalent to a higher learning rate).
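A rough sketch of that in PyTorch (the split into long_term/short_term submodules and all the numbers are purely illustrative, not any existing architecture):

    import torch
    import torch.nn as nn

    # Toy model split into a "long-term" and a "short-term" region.
    class MemoryModel(nn.Module):
        def __init__(self, dim=128):
            super().__init__()
            self.long_term = nn.Linear(dim, dim)   # slow-changing parameters
            self.short_term = nn.Linear(dim, dim)  # fast-changing parameters

        def forward(self, x):
            return self.short_term(torch.relu(self.long_term(x)))

    model = MemoryModel()

    # Give the short-term region a much higher learning rate than the
    # long-term one, so each interaction perturbs it more.
    optimizer = torch.optim.SGD(
        [
            {"params": model.long_term.parameters(), "lr": 1e-5},
            {"params": model.short_term.parameters(), "lr": 1e-2},
        ],
        lr=1e-5,  # default, overridden per group above
    )

    def update_on_interaction(x, target):
        # Every inference also takes one gradient step, leaving a lasting
        # (mostly short-term) trace in the weights.
        loss = nn.functional.mse_loss(model(x), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()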


Are you aware of the Neural Turing Machine (https://arxiv.org/abs/1410.5401) or Differentiable Neural Computer (https://deepmind.com/blog/differentiable-neural-computers/)?

They're not exactly what you describe, which I suppose is a truly online model that knows how/when to update its own parameters.

But those two models do incorporate a similar concept of external memory, whereby the controller is trained via BP to read/write to a tensor (essentially a form of soft-addressable memory available at inference time).
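Stripped down, that content-based read looks something like this (a PyTorch sketch; the controller, write heads, and interpolation machinery of the real NTM/DNC are omitted):

    import torch
    import torch.nn.functional as F

    def memory_read(memory, key, sharpness=10.0):
        # memory: (N, D) tensor of N slots; key: (D,) query from the controller.
        # Cosine similarity between the key and every slot, as in the NTM paper.
        sims = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)  # (N,)
        weights = F.softmax(sharpness * sims, dim=0)                 # soft addressing
        return weights @ memory                                      # (D,) read vector

    memory = torch.randn(16, 64)  # 16 slots of width 64
    read_vector = memory_read(memory, torch.randn(64))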

As far as I recall, these were never applied beyond toy problems, and it seems this line of research hasn't been very active (at least since the Transformer's "memorize all the things" approach started performing exceptionally well on all the benchmarks). I haven't read the paper you linked just yet - it may well be relevant.


For updating its own weights, there is the idea of fast weights (https://arxiv.org/abs/1610.06258).

The idea of a model updating its own weights is not new. Schmidhuber has done some work on this (http://people.idsia.ch/~juergen/deep-learning-miraculous-yea...). The main idea there is that the model can even modify itself, rather than having two separate nets (see Schmidhuber, "Steps towards `self-referential' learning").

Online learning / continual learning is yet another (orthogonal) topic. This is the setting where new (training) data becomes available all the time, and the model should use that data, i.e. all input (from inference) is used to update and train the model further. This can be done by standard backpropagation. The problem is usually to overcome catastrophic forgetting in this case. See for example: https://deepmind.com/blog/article/enabling-continual-learnin...
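For a flavour of what such an update can look like, here is a very rough PyTorch sketch of one online step with an EWC-style quadratic penalty against forgetting (old_params and fisher would be snapshots of the parameters and their estimated importances from earlier data; lam is a made-up hyperparameter):

    import torch

    def online_step(model, optimizer, loss_fn, x, y, old_params, fisher, lam=100.0):
        # Fit the new example, but penalize moving parameters that were
        # important for earlier data (elastic weight consolidation style).
        loss = loss_fn(model(x), y)
        penalty = 0.0
        for name, p in model.named_parameters():
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
        total = loss + (lam / 2.0) * penalty
        optimizer.zero_grad()
        total.backward()
        optimizer.step()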


It's really difficult to work with tabular/structured data in these deep nets, particularly when the structured data has a variable number of entries. This has less to do with theoretical constraints and more to do with the mechanics of our current ML toolchains.

I'd imagine if the toolchain problem was solved we'd see a lot more research in this direction.


There has been some work on grounding NN conversations by giving them a factual memory / knowledge base - [1], [2], [3]. Personally I would love to see more work in this area, where you have a robust language generator guided by (1) a high-level task learner and/or (2) a knowledge base.

[1] Commonsense Knowledge Aware Conversation Generation with Graph Attention https://www.ijcai.org/Proceedings/2018/0643.pdf

[2] A Knowledge-Grounded Neural Conversation Model https://arxiv.org/pdf/1702.01932.pdf

[3] Conversing by Reading: Contentful Neural Conversation with On-demand Machine Reading https://arxiv.org/pdf/1906.02738.pdf


"Recurrent neural net" is the general term for a net with memory as you describe. Indeed LSTMs, a type of recurrent net, used to be state of the art on language tasks until the GPT transformer models. I'm sure somebody somewhere is working to make a transformer with recurrency. The neural turing machine mentioned in another comment is such an example but it seems to have been abandoned.

The main problem with recurrent models is that it's hard to train them with backprop over long sequences. For example, GPT-3 can handle sequences of up to ~2000 tokens, I believe. I'm not sure what the longest sequence LSTMs could be trained on, but it was probably shorter.


LSTMs typically forget after more than a few hundred tokens (vanishing gradients?), so while you could probably BPTT 2000+ steps these days, there wouldn't be much point.

> I'm sure somebody somewhere is working to make a transformer with recurrency. The neural turing machine mentioned in another comment is such an example but it seems to have been abandoned.

Yeah, there's a bunch of Transformer variants which use either recurrency, compression for long-range context, or efficient attention approximations for windows so large as to obviate recurrency. The NTM hasn't been shown to be useless so much as alternatives like Transformers have proven way easier to implement & scale up to get similar performance, but it pops up occasionally; a particularly surprising recent appearance was Nvidia's GameGAN, which uses an NTM-like memory module for learning to model Pac-Man: https://nv-tlabs.github.io/gameGAN/


I recently read a paper that enables very long unrolls in RNNs thanks to O(1) memory requirements (in the number of unroll steps): https://arxiv.org/pdf/2005.11362.pdf


Memory for chatbots is usually in the form of an external memory, i.e. not compressed into the weights of a neural network but presented alongside the input and selected with some form of attention, like transformers do.

In your specific chatbot use case the obvious external memory to present to the bot is the full chat history. Alternatively you can manually extract features from the full chat history and present those instead.
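In the crudest form that just means concatenating the history into whatever the model sees at each turn and truncating from the oldest side when it no longer fits (a toy sketch, not any particular chatbot's API; the word count stands in for a real tokenizer):

    def build_prompt(chat_history, new_message, max_tokens=2000):
        # Present the full chat history as "external memory" by prepending it,
        # dropping the oldest turns once the token budget is exceeded.
        lines = chat_history + ["User: " + new_message, "Bot:"]
        while sum(len(line.split()) for line in lines) > max_tokens and len(lines) > 2:
            lines.pop(0)  # forget the oldest turn first
        return "\n".join(lines)

    history = ["User: Hi, my name is Alice.", "Bot: Nice to meet you, Alice!"]
    prompt = build_prompt(history, "Do you remember my name?")
    # The bot's "memory" of the name now lives entirely in the prompt text.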

Language models are just always trying to predict the next character.

Language models are kind of the jack of all trades. They are generic, and given enough parameters, enough data, and enough compute they will learn to solve all tasks simultaneously. But they are not incentivized to solve the task you are interested in; they are just trying to learn how to best predict the next character with the finite capacity they have, and will only learn memory insofar as it helps them achieve that task.

If you are interested in a specific task you can inject your specific desires either into the structure of the model, or in the structure of the dataset.

There are various lines of thought when you want to achieve a specific task.

-The GPT approach is to make the model bigger and bigger and solve tasks as one-shot or zero-shot learning: not touching the structure of the model anymore, but training it on tasks encoded as structured text.

-You can find some literature about question answering tasks. You can have a transformer "encode" a big document and answer the question by extracting just the relevant information.

-This means, for example, that if you want your chatbot to remember your name or any information you gave it before, one quick way is to give it your whole chat history (it's probably not that big). It's akin to making the context window big. You can use tricks like LSH transformers, or some form of hierarchical memory, to avoid being too memory-constrained by the attention.

-You can also encode things "manually": train a separate neural network to answer the questions you are interested in from the chat history, like "what is his name?", "how old is he?", etc., and save the answers as context information that will be presented as input knowledge for your specific chatbot training task (see the sketch after this list).

-You can also update the weights continuously, as is done in reinforcement learning. Fixed-size neural networks need to be shown information multiple times before they can ingest it, so you'll need some kind of replay memory. It's also not a great idea to update all of the billions of parameters every time you want to retain a piece of information, so you will probably need sparse operations such that only a few parameters are updated at a time. If you look hard enough you'll notice that most traditional database operations can be encoded as sparse neural network operations. If you look even harder you'll notice that a lot of information retrieval algorithms are just gradient descent on some form of sparsely encoded neural network operations.

-You can also use GANs for text so that the loss function being optimized is closer to the task at hand.
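To illustrate the "manual" extraction point above, here is a sketch that uses an off-the-shelf extractive question-answering pipeline as a stand-in for whatever network you would actually train (assumes the Hugging Face transformers library; the questions and history are made up):

    from transformers import pipeline

    # Generic extractive QA model standing in for a purpose-trained extractor.
    qa = pipeline("question-answering")

    def extract_facts(chat_history, questions):
        # Pull a handful of persistent facts out of the raw chat history and
        # return them as a compact context block for the chatbot.
        context = "\n".join(chat_history)
        return {q: qa(question=q, context=context)["answer"] for q in questions}

    history = ["User: Hi, my name is Alice and I'm 34.", "Bot: Nice to meet you, Alice!"]
    facts = extract_facts(history, ["What is the user's name?", "How old is the user?"])
    # These facts can be prepended to every prompt instead of the full history.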


The author mentions that the shape of the curve from 117M parameters to 175B is interesting, but doesn't show it. Does anyone have the graph?


Presumably it's one of these graphs? https://i.imgur.com/r7xJfi1.jpg https://lh6.googleusercontent.com/VHmmdKYio39kz017ECfPwCxGPT...

There are a lot of graphs in the paper, though.


I wasn't referring to one specific graph.

The GPT-3 paper (https://arxiv.org/pdf/2005.14165.pdf) has a very large number of graphs showing parameter count on the horizontal axis and some kind of prediction quality metric on the vertical axis. Most of the interesting ones are in Appendix H.

"The point isn’t the performance at 175B, but the shape of the curve as it passes from 117M to 175B" was referring to a general point about how to interpret any/all of those graphs, not a particular one of them.


I think it's this one [1], there are also more detailed graphs in section 3 of the paper.

[1] https://imagehost.imageupload.net/2020/06/04/IMG_20200604_09...


I read some stuff about GPT-3, and what I noticed was that although it was doing some amazing things, there were a few tests that it did very badly on. I think like 30%-ish?

It would be interesting to read an article that focused on that, for contrast, and to give insight into what is still lacking.


Mine goes to 11.


I hate to be the critic, but this article reads like it was generated by an algorithm similar to the one that "wrote" those pretend scientific research papers that were accepted by predatory publishers...

http://news.mit.edu/2015/how-three-mit-students-fooled-scien...

There's a whole lot of acronyms and basically zero context. The language is so generic you could swap out many of the terms and acronyms with random ones and it would still seem like it makes sense.

When I read the title my first thought was, "Wait: When did GUID Partition Table format version 2 come out? They're already talking about version 3‽"


I don't think the audience for this article is meant to include people that aren't familiar with GPT-2/GPT-3.



