Are there any possible technological or scientific leaps on the horizon that would reduce training time by an order of magnitude or more? GPT-3 took 355 GPU-years to train on incredibly expensive hardware, which means small players have no chance to push the state of the art.
As models get bigger, fewer and fewer neurons are activated by any given input. If you can somehow predict which neurons will activate, you can skip the vast majority of the computational load. I read a paper arguing that only about 0.5% of the neurons are actually active in a 200-million-parameter model, so you could get a roughly 200x improvement from that alone.
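If you want to see what that looks like mechanically, here is a minimal sketch of LSH-based neuron selection in the spirit of SLIDE-style approaches, using SimHash with a single hash table (all sizes and data here are made up; real implementations use many hash tables and careful engineering to get their speedups):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, n_neurons, n_bits = 256, 10_000, 8
W = rng.standard_normal((n_neurons, d_in))     # one weight row per neuron in the layer
planes = rng.standard_normal((n_bits, d_in))   # random hyperplanes for SimHash

def simhash(v):
    # Bucket key = which side of each random hyperplane v falls on.
    return tuple((planes @ v > 0).tolist())

# Offline: index every neuron by the hash of its weight vector.
buckets = {}
for i in range(n_neurons):
    buckets.setdefault(simhash(W[i]), []).append(i)

def sparse_forward(x):
    # Only evaluate neurons that hash to the same bucket as the input -- the ones most
    # likely to have a large pre-activation. Everything else is treated as zero.
    active = buckets.get(simhash(x), [])
    out = np.zeros(n_neurons)
    if active:
        out[active] = W[active] @ x
    return np.maximum(out, 0.0)   # ReLU

x = rng.standard_normal(d_in)
print(f"evaluated {len(buckets.get(simhash(x), []))} of {n_neurons} neurons")
```

With 8 hash bits and random data, only a fraction of a percent of the neurons land in any given bucket, which is exactly the kind of sparsity being exploited.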
What this tells you is that there is very little money in optimizing deep learning, and that NVIDIA has made it very easy to just throw more hardware at the problem.
Oh, there are a lot of people working on optimizing AI, among hobbyists, academia, and corporations alike.
The thing is, if you come up with a neat optimization that saves 30% of compute for the same results, then typically instead of reducing your compute budget by 30%, you increase your model/data size by 30% and get better results.
This is hard a priori, but fairly easy post facto. Model distillation isn't common practice yet, but it has already been demonstrated to be quite effective for specific use cases.
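For what it's worth, the core of distillation is just a modified training loss; here is a minimal PyTorch sketch along the lines of the classic Hinton et al. recipe (the `teacher`, `student`, `loader`, and hyperparameters are placeholders, not anything from a specific paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend cross-entropy on hard labels with a KL term that pulls the student's
    temperature-softened distribution toward the teacher's."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # standard T^2 scaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Hypothetical training loop: teacher is the big frozen model, student the small one.
# for x, y in loader:
#     with torch.no_grad():
#         t_logits = teacher(x)
#     loss = distillation_loss(student(x), t_logits, y)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```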
It is somewhere from 8x to 25x faster than doing dense machine learning. The speedup was higher on the original CPU implementation and the GPU paper mentions that if there isn't enough shared memory on the GPU it will have to switch to an algorithm that has more overhead.
Edit: There is a paper on sparse spiking gradient descent promising a 150x improvement. I am not sure how practical this is, because spiking neural network hardware heavily limits your model size, but here it is:
I wonder about this, too. OpenAI's biggest 'moat' is that their model takes so much resources to train, not that their algorithms are particularly secret.
One idea I had was to not use one single model to learn all steps of the task, but to break it up. The human brain has dedicated grammar-processing parts. It is unclear whether something like a universal grammar exists, but we have at least an innate sense of rhythm. Applied to NLP, you could heavily preprocess the input: tokenize it, annotate parts of speech, maybe add pronunciation so the model doesn't have to think about weird English spelling rules, and so you can deal with audio more easily later. So I would build all these little expert-knowledge black boxes and offer them as input to my network.
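That kind of preprocessing front-end is easy to prototype with off-the-shelf tools; a rough sketch using spaCy (assuming the `en_core_web_sm` model is installed; the exact feature set is just an example):

```python
# Run cheap, classical NLP annotators up front and hand their outputs to the
# network as extra features. Requires: pip install spacy &&
# python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def annotate(text):
    doc = nlp(text)
    return [
        {
            "token": tok.text,
            "lemma": tok.lemma_,
            "pos": tok.pos_,   # coarse part of speech, e.g. NOUN, VERB
            "dep": tok.dep_,   # syntactic dependency label
        }
        for tok in doc
    ]

print(annotate("The model doesn't have to learn English spelling rules."))
```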
But there is also some inherent resource cost in large language models. If you want to store and process the knowledge of the world, it is going to be expensive no matter what. Maybe we could split the problem into two parts: understanding language, and world knowledge (with some messy middle ground). I believe you could replace the world knowledge with a huge graph database or triple store. Not just subject-verb-object, but with attribution and certainty numbers for every fact. The idea would be to query the database at inference time. I don't know how to use this in conjunction with a transformer network like GPT-3, so you'd likely need a very different architecture.
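To make the shape of that data concrete, a toy version of such a fact store might look like this (all names, sources, and certainty numbers are invented; a real system would sit on a proper triple store):

```python
from dataclasses import dataclass

@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    source: str       # attribution
    certainty: float  # 0.0 - 1.0

facts = [
    Fact("water", "boils_at", "100 C at sea level", "physics textbook", 0.99),
    Fact("water", "boils_at", "90 C", "random forum post", 0.30),
]

def query(subject, predicate, min_certainty=0.5):
    """Return matching facts from trustworthy-enough sources, best first."""
    hits = [f for f in facts
            if f.subject == subject and f.predicate == predicate
            and f.certainty >= min_certainty]
    return sorted(hits, key=lambda f: f.certainty, reverse=True)

print(query("water", "boils_at"))
```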
The big benefit of this would be that the language part could be trained without the world-knowledge part, using far fewer resources. But there are other benefits, too. ChatGPT is trained to "win the language game". But as they say, winning the argument does not make you right. If you have a clean fact database, you can have it weigh statements from trustworthy sources more heavily. You then basically have a nice natural-language frontend to a logical reasoning system that can respond with facts (or better: conclusions).
GPT and the human brain (at least the language/speech part) have nothing in common. We, as humans, do not use language in a generative way; it is derived from a higher or a very low level of abstraction (intentions, emotions, etc.) and is explicitly used to communicate something. Even this text is based on previous knowledge, saved in an abstract way, and while writing it I must follow the syntax of the language and the right word order; otherwise you, the person reading this, will not understand what I mean. While GPT can generate the same text, it has no motivation and no need to communicate (while I just wanted to feel good by contributing something to HN).
> and while writing it I must follow the syntax of the language and the right word order; otherwise
A good example that is not, word randomised order and kombination with Mrs Spelling and fonetic spel-ing prevent ye knot that which I wrote you to komprehend.
(My apologies to non-native speakers of English; if someone did that to me in German I'd have no clue what was meant).
A better point is that GPT-3's training set is more tokens than the number of times an average human synapse fires in a lifetime, squeezed into a network with about 3 orders of magnitude fewer parameters than the human brain has synapses.
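Rough arithmetic behind that comparison, with heavily rounded numbers (the firing rate and synapse count are order-of-magnitude assumptions, not measurements):

```python
gpt3_tokens = 3e11            # ~300B training tokens (reported for GPT-3)
gpt3_params = 1.75e11         # 175B parameters

firing_rate = 0.3             # Hz, assumed average rate seen by a synapse
lifetime_s  = 80 * 365 * 24 * 3600              # ~2.5e9 seconds
firings_per_synapse = firing_rate * lifetime_s  # ~7.6e8

human_synapses = 1e14         # commonly cited range is 1e14 - 1e15

print(gpt3_tokens / firings_per_synapse)  # ~400 tokens per assumed lifetime firings of one synapse
print(human_synapses / gpt3_params)       # ~570x more synapses than params (more with the 1e15 estimate)
```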
It's wrong to model AI as anything like natural intelligence, but if someone insists, my go-to comparison (with an equivalent for image generators) is this: "Imagine someone made a rat immortal, then made it browse the web for 50,000 years. It's still a rat, despite being very well-trained."
> (My apologies to non-native speakers of English; if someone did that to me in German I'd have no clue what was meant).
At least for me it's perfectly understandable (except the "Mrs" part). This reminds me of those "did you know you can flip characters randomly and our brain can still understand the text" copypastas that can be found everywhere. I think it's probably quite similar for word order: as long as your sentence structure is not extremely complicated, you can probably get away with changing it any way you like. Just like nobody has issues understanding Yoda in Star Wars.
Although I think there are some limits to changing word order - I can imagine complicated legal documents might get impossible to decipher if you start randomizing word order.
These are conceptual "differences" that don't actually explain the mechanics of what's going on. For all you know "motivation", "intentions", etc. are also just GPT-like subsystems, in which case the underlying mechanics are not as different as you imply.
That's the hardware it runs on, not the software architecture of GPT. I could equally say that transistors are faster than synapses by the same ratio that marathon runners are faster than continental drift.
It seems to me that a lot of everyday communication is rather statistical in nature. We don’t necessarily think deeply about each word choice but instead fall back on well worn patterns and habits. We can be more deliberate about how we compose our sentences but most situations don’t call for it. It makes me wonder if we don’t all have a generative language model embedded in our brains that serves up the most likely next set of words based on our current internal state.
Here we go again. They must have something in common, because for about 90% of the tasks the language model agrees with humans, even on novel tasks.
> We, as humans, do not use language in a generative way
Oh, do you want to say we are only doing classification from a short list of classes and don't generate open ended language? Weird, I speak novel word combinations all the time.
No, what I mean is that the next word I speak or write is not based on a statistical model, but on a world model that includes a language structure based on a defined syntax and cultural variety. I actually mean what I say, while ChatGPT just parrots its weights around and produces output based purely on statistics. There is zero modeling that translates into the real world (what we normally call "understanding" and "experience").
Oh, I see. Then I agree with you, an isolated model can't do any world modelling on its own. No matter how large it is, the real world is more complex.
It might be connected to the world, of course. And it might even use toys such as simulators, code execution, math verification and fact checking to further ground itself. I was thinking about the second scenario.
The more experience I get, the more I wonder if this is really the case for us. We certainly have some kind of abstract model in our heads when thinking deeply about a problem. But in many settings - in a work meeting, or socially with friends - I think it is a much more automatic process. The satisfaction you get when saying the right thing, the dread when you say something stupid: It is just like playing a game. Maybe the old philosophical concept of society as merely "language games" is correct after all. A bit silly but I find the thought makes annoying meetings a bit more bearable.
But you are of course right with GPT, it has no inner life and only parrots. It completely lacks something like an inner state, an existence outside of the brief moment it is invoked, or anything like reflection. Reminds me of the novel "Blindsight" (which I actually haven't read yet, but heard good things about!) where there are beings that are intelligent, but not conscious.
Their biggest moat is high-quality data: both their proprietary datasets (WebText, WebText2, etc.), but also now their human-annotated data. Another, secondary moat is their expertise in training models using PPO (their RL method); they can get results that are quite a bit better than other labs. I say this moat is secondary because it's possible that you can get similar results with other RL algorithms (e.g. DeepMind using MPO), and because maybe you don't really need RL from Human Feedback and just fine-tuning on instructions is enough.
I find OpenAI having exclusive access to that kind of high-quality data more concerning than them having access to their current amount of compute and currently trained model. A couple of million dollars' worth of compute is within reach of any medium-sized research university, larger company, or any country worth mentioning. And seeing as Moore's law still applies to GPUs, the cost will only fall.
However, high-quality data is scarce. I would be willing to fund a proper effort to create high-quality data.
It's not just about compute; if that were the case, then models like BLOOM and OPT, which also have 175 billion parameters, would have the same performance for real-world use cases as GPT-3, but they don't. Datasets are also very important.
An interesting outcome of the nanoGPT repo is this struggle to exactly match the Chinchilla findings[0], even after discussing it with the authors.
A larger discussion is that the scaling laws give you the loss-optimal use of training compute, but the pre-training loss only improves predictions on the corpus, which contains texts written by people who were wrong or whose prose was lacking. In a real system, what you want to optimize for is accuracy, composability, and inventiveness.
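For context, the Chinchilla rule of thumb people try to reproduce fits in a few lines; a hedged sketch using the common C ≈ 6·N·D approximation and the roughly 20-tokens-per-parameter heuristic (the paper's exact fitted constants differ a bit):

```python
import math

def compute_optimal(C_flops, tokens_per_param=20.0):
    """Given a training compute budget C (FLOPs), return approx. optimal (params, tokens)."""
    N = math.sqrt(C_flops / (6.0 * tokens_per_param))  # from C = 6*N*D and D = 20*N
    D = tokens_per_param * N
    return N, D

N, D = compute_optimal(3.14e23)   # GPT-3's reported training compute
print(f"~{N/1e9:.0f}B params, ~{D/1e9:.0f}B tokens")
# GPT-3 actually used 175B params and ~300B tokens, i.e. far fewer tokens per
# parameter than this rule suggests -- which is the Chinchilla paper's point.
```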
I highly doubt this in practice on a large scale. Outside of the common phenomena of "most large NNs are undertrained" and "less but better data sometimes beats more but worse data", there are no other obvious mechanisms to explain why a smaller model with the same or a similar architecture would be better than a larger one.
I claim instead that we are still hardly scratching the surface of how we evaluate NLP systems. Also, some fields have straight-up trash evaluation schemes. Summarization and ROUGE scores are total BS, and I find the claim that they even correlate with high-quality summaries suspect. I say this with publications in that subfield, so I have personal experience with just how crummy many summarizers are.
What do you mean by "small players have no chance"? OpenAI was founded in 2015; it used to be a "small player" which just got things right and grew with it - we're not talking of Google or Facebook investing a chunk of their billions in cash. In Germany, AlephAlpha has built their own supercomputer and is training similarly sized models. It's expensive for sure, but well within the possibilities of startups. In France, researchers trained the similarly sized BLOOM model https://huggingface.co/bigscience/bloom. They claim it cost between $2 million and $4 million.
Sure, a single researcher can't replicate this at their university, but even though OpenAI likes to publish it this way, we're not really talking about research here. Research was inventing the transformer architecture, this is just making it bigger by (very smart) engineering choices. It's something companies should do (and are doing), not researchers.
> we're not talking of Google or Facebook investing a chunk of their billions in cash
OpenAI had raised $1B from Microsoft in 2019 and used it to train a 175B param model. Now, they have raised $10B and are training GPT-4 with 1.5T params. GPUs are capital intensive and as long as there are returns to bigger models, that's exactly where things will go.
It could actually work. It would be an incredibly gutsy move and I love it, and they'd probably earn a lot of respect. They’d get so much press for it. And if it held up, it’d probably be one of the things that MS is remembered for.
Small players should focus on applications of this tech.
We now know that whatever AI Models succeed in the future, they'll be trained by a huge company and finetuned to a specific use case. Small companies should be working on use cases, and then just upgrade to the latest SOTA model.
> Small players should focus on applications of this tech.
That sounds a bit condescending. We are probably at a point where the government should intervene and help establish a level playing field.
Otherwise we are going to see a deeper divide between multibillion-dollar businesses conquering multiple markets and everyone else, a sort of neo-fiefdom situation.
This is not good.
It's not that condescending, that's today's reality. Should I feel entitled to $600k of training time that may or may not work? Do you think the government is a good actor to judge if my qualifications are good enough to grant me resources worth a house?
It's quite reasonable for small players to make use of already-trained models.
> Do you think the government is a good actor to judge if my qualifications are good enough to grant me resources worth a house?
Governments already routinely do that for pharmaceutical research or for nuclear (fusion) research. In fact, almost all major impact research and development was funded by the government, mostly the military. Lasers, microwaves, silicon, interconnected computers - all funded by the US tax payer, back in the golden times when you'd get laughed out of the room if you dared think about "small government". And the sums involved were ridiculously larger than the worth of a house. We're talking of billions of dollars.
Nowadays, R&D funding is way WAY more complex. Some things like AI or mRNA vaccines are mostly funded by private venture capital, some are funded by large philanthropic donors (e.g. Gates Foundation), some by the inconceivably enormous university endowments, a lot by in-house researchers at large corporations, and a select few by government grants.
The result of that complexity:
- professors have to spend an absurd percentage of their time "chasing grants" (anecdata, up to 40% [1]) instead of actually doing research
- because grants are time-restricted, it's rare to have tenure track any more
- because of the time restriction and low grant amounts, it's very hard for the support staff as well. In Germany and Austria, for example, extremely low paid "chain contracts" are common - one contract after another, usually for a year, but sometimes as low as half a year. It's virtually impossible to have a social life if you have to up-root it for every contract because you have to take contracts wherever they are, and forget about starting a family because it's just so damn insecure. The only ones that can make it usually come from highly privileged environments: rich parents or, rarely, partners that can support you.
Everyone in academia outside of tenured professors struggles with surviving, and the system ruthlessly grinds people to their bones. It's a disgrace.
Pharmaceutical or nuclear research doesn't really qualify as "small scale", which is how this thread started. I know there are massive amounts of money handed out by governments to fund research, but for a 3-guy startup in a garage that's probably hopeless. Public money is cursed anyway; it's better not to touch it.
I've also read in many places that academic research funding is way too misaligned. It's a shame, really.
I'm not being condescending at all; we've learned that the value in AI is in the applications. If you think the government should regulate the field, it should be to make AI models a commodity, like electricity.
> In our experiments on the Pile, a standard language modeling benchmark, a 7.5 billion parameter RETRO model outperforms the 175 billion parameter Jurassic-1 on 10 out of 16 datasets and outperforms the 280B Gopher on 9 out of 16 datasets.
The research is still ongoing, although perhaps lower-profile than what appears in the press.
RETRO did get press, but it was not the first retrieval model, and in fact was not SOTA when it got published; FiD was, which later evolved into Atlas[0], published a few months ago.
How long does it take to train a human? It's useless for two years then maybe it can tell you it needs to poop.
The breakthrough will be developing this equivalent in an accessible manner and us taking care to train the thing for a couple of decades but then it becomes our friend.
Neither does OpenAI. It costs so much and still delivers so little. A human can generate breakthroughs in science and tech that can be used to reduce carbon emissions. ChatGPT can do no such thing.
What percentage of humans make meaningful contributions to advancing science or technology? The overwhelming majority of us are just worker bees servicing the needs of the human population.
I agree with you on this point. It's also arguable that fewer people, with a better education system, could yield the same result with less environmental impact.
But my point, poorly explained, is that whatever ChatGPT is, it isn’t original or creative thought as a human would do it.
Chomsky’s example (which is based off Turing): Do submarines swim? Yes, they swim — if that’s what you mean by swimming.
We don't have any clear definitions for "creativity" to begin with. In practice, in these contexts, it seems to be defined as "whatever only humans can do" - that is, the goalposts are automatically moved with every AI advancement.
How could they be moved when they aren't even defined in the first place? Scientists don't even know where to begin when it comes to studying the mind and human consciousness.
But yes, scientists can look at your experiments and show that they don't have anything in common with human thought.
> What percentage of humans make meaningful contributions to advancing science or technology?
I’m a nobody that you’ve never heard of and I’ve arguably made meaningful contributions. If that’s true, don’t you think there could be way more people out there than you or sibling commenter imply?
You can't know that. Currently, 8 billion humans generate a few scientific breakthroughs per year. You'd have to run several billion ChatGPTs for a year with zero breakthroughs to have any confidence in such a claim.
With billions of GPT output streams, how do you actually discover and rank what’s significant? Screen them through some even more powerful models? I imagine it’s like a volcano eruption of text where some are absolutely brilliant and most is worthless and finding the jewels is even more demanding than generating it all.
Some theories are easily testable. For instance, ask it to write some code to efficiently solve traveling salesman problems, and then test the code on some sample problems. You can score the quality of solutions and time taken, and manually inspect the best ones.
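A sketch of what such an automatic scoring harness could look like (the `solve` argument stands in for whatever code the model produced; the nearest-neighbour heuristic is only a baseline to compare against):

```python
import time
import numpy as np

def random_instance(n, seed=0):
    rng = np.random.default_rng(seed)
    pts = rng.random((n, 2))
    return np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)  # distance matrix

def tour_length(dist, tour):
    return sum(dist[tour[i], tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def score(solve, n=100, trials=5):
    lengths, times = [], []
    for s in range(trials):
        dist = random_instance(n, seed=s)
        t0 = time.perf_counter()
        tour = solve(dist)
        times.append(time.perf_counter() - t0)
        assert sorted(tour) == list(range(n)), "not a valid tour"
        lengths.append(tour_length(dist, tour))
    return float(np.mean(lengths)), float(np.mean(times))

# Trivial baseline to compare the generated code against.
def nearest_neighbour(dist):
    n, unvisited, tour = len(dist), set(range(1, len(dist))), [0]
    while unvisited:
        nxt = min(unvisited, key=lambda j: dist[tour[-1], j])
        tour.append(nxt); unvisited.remove(nxt)
    return tour

print(score(nearest_neighbour))   # (mean tour length, mean seconds per instance)
```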
At this point there is no framework that suggests GPT understands the underlying data. It can’t assign meaning as a human would. It can’t consume hundreds of math textbooks and learn the principles of math and then apply them more broadly to science textbooks and research papers. It can’t even reliably add two numbers.
Yes, brute forcing with hard AI can produce many thoughts. But the AI wouldn’t know they are correct. It couldn’t explain why. Any discovery would only be attributable to randomness. It wouldn’t be learning from itself and its priors.
> At this point there is no framework that suggests GPT understands the underlying data. It can’t assign meaning as a human would.
Actually there are many indications that GPT understands the data, because its output mostly makes sense. The reason it can't assign meaning the way a human would is because a human can correlate words with other sensory data that GPT doesn't have access to. That's where GPT creates nonsense.
Think carefully about what "understanding" means in a mechanistic sense. It's a form of compression, and a few billion parameters encoding the contents of a large part of the internet seems like pretty good compression to me.
GPT doesn't display understanding of purely abstract systems, so I doubt it's an issue of lacking sensory information. It can't consistently do arithmetic, for example - and I think it would be presumptuous to insist that sensory information is a prerequisite for mathematics, even though that's how humans arrived at it.
It's not yet clear why it struggles with arithmetic. It could be data-related, could be model-related, although scaling both seems to improve the situation.
In any case, GPT could still understand non-abstract things just fine. People with low IQ also struggle with abstract reasoning, and IQ tests place GPT-3 at around 83.
I still think that this will be a major form of AI that is accessible to the public at large and it will enable productivity improvements at all levels.
I'm not joking, this is really something I think will/should happen.
Alternatively, are there ways to train on consumer graphics cards, similar to SETI@Home or Folding@Home? I would personally be happy to donate gpu time, as I imagine many others would as well.
There absolutely are! Check out hivemind (https://github.com/learning-at-home/hivemind), a general library for deep learning over the Internet, or Petals (https://petals.ml/), a system that leverages Hivemind and allows you to run BLOOM-176B (or other large language models) that is distributed over many volunteer PCs. You can join it and host some layers of the model by running literally one command on a Linux machine with Docker and a recent enough GPU.
Disclaimer: I work on these projects, both are based on our research over the past three years
Or how to apply communism to software engineering. I like that.
More seriously, the risk that a few companies become even more powerful thanks to their restricted access to such NNs is very frightening. The worst part is that, without legal restrictions, there is nothing we can do against it. And I doubt legal restrictions will come in the next months or years.
Well at that point, some people might have the crazy crazy insight that no matter how big the model is, or how many GPUs they have, it burns up all the same.
"We are waiting for OpenAI to reveal more details about the training infrastructure and model implementation. But to put things into perspective, GPT-3 175B model required 3.14E23 FLOPS of computing for training. Even at theoretical 28 TFLOPS for V100 and lowest 3 year reserved cloud pricing we could find, this will take 355 GPU-years and cost $4.6M for a single training run. Similarly, a single RTX 8000, assuming 15 TFLOPS, would take 665 years to run."
That's still including margins of cloud vendors. OpenAI had Microsoft providing resources which could do that at much lower cost. It still won't be cheap but you'll be way below $5m if you buy hardware yourself, given that you're able to utilize it long enough. Especially if you set it up in a region with low electricity prices, latency doesn't matter anyway.
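The quoted figures check out with simple arithmetic, using the numbers straight from the quote:

```python
total_flops = 3.14e23          # quoted training compute for GPT-3 175B

v100_flops = 28e12             # theoretical TFLOPS quoted for a V100
rtx8000_flops = 15e12          # assumed TFLOPS for an RTX 8000

seconds_per_year = 365 * 24 * 3600

print(total_flops / v100_flops / seconds_per_year)     # ~355 GPU-years on V100s
print(total_flops / rtx8000_flops / seconds_per_year)  # ~664 GPU-years on RTX 8000s (quote rounds to 665)
```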
I think AI is going to go the way of the hard sciences where the age of tinkerers making progress by leaps and bounds in their basement is over and incremental progress is going to be the domain of universities or large private companies that can afford to throw money behind it. I would love to be proven wrong and see radical shifts in how people approach these problems. Seems like the cycle started and got to this point way too soon for AI though
My take on this is that (good) content is still one of the bigger problems, particularly the question of who exactly the original training data belongs to (or where it comes from). There's a certain risk (we'll see with GitHub Copilot soon) that things will slow down for a bit until the licensing issues are all sorted out. This can only be solved (for now) by bringing in public funding/data, which universities have always been a very good proxy for. Which also means it (usually) should be open access to the public, to some extent (and useful for the garage folks to catch up a bit). But once we're past that, it'll be all about that giant body of pre-trained data, securely kept within the next Facebook or Microsoft, amounting to literal data gold (just much higher value at a lot less weight).
> Are there any possible technological or scientific leaps on the horizon
Yes. From 2017: "Prediction 4: The simplest 2D text encodings for neural networks will be TLs. High level TLs will be found to translate machine written programs into understandable trees."
We have something coming out that is an OOM better than anything else out there right now.
small players will never have a chance to push the state of the art, as whatever optimization there is will also be applied at large scale with more money
Take a leaf from SETI@home's book and try to come up with a distributed, volunteer-based approach to training an open-source LLM. There is already an enormous amount of suitable ML hardware on end-user devices.
Good point, but perhaps a leap could take small players into the territory of language models that are large enough to be useful. GPT-3 crossed that threshold.
> Could this be distributed? Put all those mining GPUs to work.
Nope. It's a strictly O(n) process. If it weren't for the foresight of George Patrick Turnbull in 1668, we would not be anywhere close to these amazing results today.
In theory, yes. "Hogwild!" is one approach to asynchronous distributed training: in essence, each worker is given a chunk of data, computes the gradient, and sends it to a central authority. The authority accumulates the gradients and periodically pushes new weights.
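A toy parameter-server loop in the spirit of that description (note this is the synchronous, centralized variant rather than Hogwild!'s lock-free shared-memory updates; the regression problem and all constants are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0, 0.5])
w = np.zeros(3)                      # the "central authority's" weights
lr = 0.1

def worker_gradient(w, n=64):
    # Each worker draws its own mini-batch of a simple linear-regression problem
    # and returns the gradient of the mean squared error w.r.t. the current weights.
    X = rng.standard_normal((n, 3))
    y = X @ true_w + 0.01 * rng.standard_normal(n)
    return 2.0 / n * X.T @ (X @ w - y)

for step in range(200):
    grads = [worker_gradient(w) for _ in range(4)]  # 4 "workers" per round
    w -= lr * np.mean(grads, axis=0)                # server accumulates and updates

print(w)   # should be close to true_w
```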
There is also Federated Learning which seemed to start taking off, but then interest rapidly declined.
Wikimedia and other organizations that deal with moderation might want to keep this technology out of the hands of the general public for as long as possible.
There are a couple of cases where small changes in the model make training much quicker. For example, the currently leading Go AI, KataGo, requires much less time to train than AlphaGo did.
Yes. There are plenty of forward leaps; most of them are not new and are just waiting to be integrated or released:
Let's pave the road for a SkyNet hard lift-off:
- The first obvious one is the use of an external knowledge store: instead of having to store facts in the neural weights, where they struggle, just store them in a database and teach your neural network to use it. (This is also similar to something like WebGPT, where you allow your network to search the web.) This lets a network of 1G parameters (plus external indexes of a few TB) perform like a network of 100G parameters, with better scaling properties too. You can probably gain at least 2 orders of magnitude there (a minimal sketch of the idea follows after this list).
- The second leap is better architecture for your neural networks: approximating transformers, which are quadratic in compute, with something linear (Linformer) or n log n (Reformer) can get you an order of magnitude simply by reducing your iteration time. Similarly, some architectures based on sparsity give you faster computation (although part of the gain is eaten by the lower efficiency of sparse memory access patterns). Or use (analog bits) diffusion to generatively pre-train a sentence at a time instead of token by token. You can probably gain between 1 and 3 orders of magnitude here if you write and optimize everything manually (or have your advanced network/compiler optimize your code for you).
- The third leap is reduced domain: you don't have a single network that you train on everything. Training one network per domain allows you to have a smaller network that computes faster. It also allows you to focus your training on what matters to you: for example, if you want a mathematics network, its parameters are not influenced much by showing it football pictures.
There are at least 2 orders of magnitude there.
- The fourth one is external tool usage. It's related to the first one, but whereas the first is readily differentiable, this one requires some reinforcement learning (that's what decision transformers are used for).
- Compression: compress everywhere. The bottlenecks are memory-bandwidth related, so work in compressed form where relevant. One order of magnitude.
- Distributed training: the memory bandwidth inside a GPU is on the order of TB/s, whereas transfers to the GPU are on the order of 10 GB/s, so there is an advantage to having the parameters reside on the GPU; but GPU memory is limited, so distributed training (something like petals.ml) lets participants pool their memory bandwidth by collaborating. Each actor can probably gain an order of magnitude, provided they can keep bad actors away.
- Use free resources: the other day Steam had 10M users with GPUs sitting around doing nothing; just release a Dwarf Fortress mod with prettier pictures and use the compute for more important tasks.
- Remove any humans from the loop: it's faster to iterate when you don't have to rely on any human, either for dataset construction or model building.
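As promised under the first point, here is a toy sketch of the external-knowledge-store idea: embed the query, pull the nearest stored facts, and prepend them to the prompt so the network itself doesn't have to memorize them (the hashing "embedding" is a deliberate placeholder; a real system would use a learned encoder plus an ANN index like FAISS):

```python
import re
import numpy as np

DIM = 64

def word_vec(word):
    # Deterministic pseudo-random vector per word (a stand-in for a learned embedding).
    seed = int.from_bytes(word.encode(), "little") % (2**32)
    return np.random.default_rng(seed).standard_normal(DIM)

def embed(text):
    words = re.findall(r"[a-z0-9]+", text.lower())
    v = sum((word_vec(w) for w in words), np.zeros(DIM))
    return v / (np.linalg.norm(v) + 1e-9)

facts = [
    "The Eiffel Tower is 330 metres tall.",
    "Water boils at 100 C at sea level.",
    "GPT-3 has 175 billion parameters.",
]
index = np.stack([embed(f) for f in facts])   # one normalized vector per stored fact

def retrieve(query, k=2):
    scores = index @ embed(query)             # cosine similarity (vectors are unit-norm)
    return [facts[i] for i in np.argsort(scores)[::-1][:k]]

query = "How tall is the Eiffel Tower?"
prompt = "Context:\n" + "\n".join(retrieve(query)) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)   # this augmented prompt is what the (much smaller) model would see
```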