I think many people don't appreciate just how simple state-of-the-art neural net techniques are. It's great to have an illustration of just how little code you need to get the results that have amazed people.
You could say that it relies on PyTorch which is a lot of code, but most of the complexity comes from the need to do GPU/TPU/SIMD acceleration. A truly from-scratch CPU-only implementation in C would still not be a large amount of code.
As a developer with an interest in neural-net-based ML, my eyes start to glaze over a bit when I see so many crazy-looking numpy operations even just to get a trivial representation of data.
I guess it's just something I have to get used to, but I wish there were an interface for the same nd-array-based logic that was designed more for developers like me than for data scientists who perform surgery on 50-d arrays all day long.
Don’t look at code, because it bakes in so many derived values and simplifications that it’s hard to recover the original ideas. If you can find the original math, and walk through the derivatives (seeing zeros crop up and get simplified away), it starts to make a lot more sense.
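For instance, the softmax-plus-cross-entropy gradient is a classic case where the zeros show up: most terms involving the one-hot target vanish and the derivative collapses to (probabilities - target). A minimal numpy sketch of that check (my own illustration, not anything from the repo):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())            # shift for numerical stability
        return e / e.sum()

    def cross_entropy(z, y):
        return -np.log(softmax(z)[y])      # y is the index of the true class

    z = np.random.randn(5)                 # logits
    y = 2                                  # true class

    analytic = softmax(z)
    analytic[y] -= 1.0                     # dL/dz = p - one_hot(y), after the zeros cancel

    # finite-difference check of the simplified formula
    eps = 1e-6
    numeric = np.zeros_like(z)
    for i in range(len(z)):
        zp, zm = z.copy(), z.copy()
        zp[i] += eps
        zm[i] -= eps
        numeric[i] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

    print(np.allclose(analytic, numeric, atol=1e-4))   # True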
Sorry if this sounds condescending but just learn linear algebra and numpy. Linear algebra really is a minimum requirement if you want to get anywhere near ML. College freshmen can learn linear algebra and numpy and so can you. You'll only get even more lost and frustrated if you try to go into it thinking in terms of for loops.
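To make that concrete, here is a toy comparison (mine, not from minGPT) of the same batched linear layer written with for loops and as a single numpy expression; most of the "crazy-looking" operations are this pattern stacked a few layers deep:

    import numpy as np

    batch, d_in, d_out = 32, 64, 16
    x = np.random.randn(batch, d_in)       # a batch of inputs
    W = np.random.randn(d_in, d_out)       # weights
    b = np.random.randn(d_out)             # bias

    # for-loop version: one example, one output unit, one input unit at a time
    out_loop = np.zeros((batch, d_out))
    for n in range(batch):
        for j in range(d_out):
            for i in range(d_in):
                out_loop[n, j] += x[n, i] * W[i, j]
            out_loop[n, j] += b[j]

    # vectorized version: the whole batch in one line
    out_vec = x @ W + b

    print(np.allclose(out_loop, out_vec))  # True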
It's kind of like saying Redis is just a dict. The optimizations across the entire software/hardware stack are what make these innovations possible. A C implementation would not work in practice for large models.
That's not really a C implementation of GPT-2 since it cannot be used to do the thing everyone cares about: self-supervised learning from text. In fact, it doesn't even use the weights in the same way GPT-2 does, so it's not clear how close it is to GPT-2's inference mode. The source isn't even on the page.
This is very cool, thanks for sharing! From the readme (https://bellard.org/nncp/readme-gpt2tc.txt), the program benchmarks very comparably to CMIX, which is the top algorithm on the Large Text Compression Benchmark
(http://mattmahoney.net/dc/text.html). I'm guessing that any GPT implementation would be ineligible for the benchmark because of its file size but impressive nonetheless.
You mean CPU implementation. C language isn't the culprit here, and low-level parts of ML frameworks are normally written in C or C++ as well and compiled with nvcc or equivalent.
It really depends on what level you allow yourself to start from.
If you hand-derive all the equations in your program you are going to have a hard time ... but automatic differentiation (AD) and existing linear algebra libraries (for gradient descent/Adam) make the job pretty easy.
I recently dabbled in automatic (song-)mixing ... reimplemented the code of some open audio plugins in a way that made them automatically differentiable and "trained" the whole thing on some multitracks with reference mixes.
Sadly that didn't result in a good mix ... but in theory it should work.
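For anyone curious what leaning on AD plus an off-the-shelf optimizer looks like, here is a bare-bones PyTorch sketch (a toy of mine, not the mixing code): you only write the forward computation, and autograd and Adam handle the rest.

    import torch

    # toy "plugin": a single learnable gain applied to a signal
    gain = torch.tensor(1.0, requires_grad=True)
    optimizer = torch.optim.Adam([gain], lr=1e-2)

    signal = torch.randn(1000)        # stand-in for an audio track
    target = 0.25 * signal            # stand-in for the reference mix

    for step in range(500):
        output = gain * signal        # forward pass, written once
        loss = ((output - target) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()               # autograd derives d(loss)/d(gain)
        optimizer.step()

    print(gain.item())                # ends up close to 0.25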
I wonder, is a community-trained model feasible? As in, get a few tens of thousands of devs to run a seti@home-type app on their GPUs during the night, and at the end you get access to the 175B-parameter trained model. It supposedly cost $5M to train, but IIRC that was estimated at cloud GPU prices; if you're using spare capacity you're just paying for electricity.
Fun idea. GPT @ Home :D. Scatter of the inputs would be very cheap as they are tiny LongTensors (sequences of indices), but the Gather of the gradients seems like a bottleneck. These models can be quite large. Maybe each worker only communicates back some sparse or potentially precision-reduced gradients? In the limit, I recall papers that were able to only communicate one bit per dimension. May also be possible to further reduce the number of weights by weight sharing, or e.g. with HyperNetworks.
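A rough sketch of what a one-bit-per-dimension worker update could look like (signSGD-style; the names and details are my own illustration, not something from the parent):

    import torch

    def compress(grad):
        # keep only the sign of each gradient component (conceptually 1 bit each),
        # plus one scalar scale so the server can reconstruct magnitudes
        scale = grad.abs().mean()
        bits = grad >= 0
        return bits, scale

    def decompress(bits, scale):
        return torch.where(bits, scale, -scale)

    grad = torch.randn(10_000)                 # pretend this is one worker's gradient
    bits, scale = compress(grad)
    approx = decompress(bits, scale)
    print((grad - approx).abs().mean())        # reconstruction error of the 1-bit scheme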
I have long wanted to see a proof-of-work cryptocurrency that does neural network training instead of burning through hashes. Imagine if 0.5% of the planet's energy consumption (9 GW) were used for training neural networks instead of mining bitcoin! It would also solve the problem of ASICs being 1000x more efficient than GPUs, so everyone could participate. It would incentivise the development of efficient neural network training hardware. Somebody do this already!
It does updates to the weights based on 1-bit-precision updates each iteration.
It would be fairly trivial to go to less than 1 bit of precision too - simply set some threshold (e.g. 3), and wherever the difference between the weight on the server and the client is greater than the threshold, transmit a binary "1", else send a binary "0". Then entropy-code all the resulting bits.
By adjusting the threshold up and down, you trade off the size of the data to send vs. the precision.
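A hedged sketch of that scheme in numpy (my reading of it; the sign handling is an extra detail the description above leaves out but the server would need in practice):

    import numpy as np

    def encode_update(server_w, client_w, threshold=3.0):
        # "1" wherever the client weight has drifted more than `threshold` from the server
        mask = np.abs(client_w - server_w) > threshold
        # direction of the drift for the flagged weights
        signs = np.sign(client_w - server_w)[mask]
        return mask, signs            # the mask is what gets entropy-coded before sending

    def apply_update(server_w, mask, signs, step=3.0):
        updated = server_w.copy()
        updated[mask] += signs * step
        return updated

    server = np.zeros(1000)
    client = server + 2.5 * np.random.randn(1000)
    mask, signs = encode_update(server, client)
    print(mask.mean())                # raising the threshold shrinks this (and the message)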
Efficiency and resource cost are the big question, though. You don't pay for the electricity or part wear that you don't use, and home computers or workstations may not be as efficient at a training run as a task-specific setup. AI@home might end up costing more, and increasing the footprint of the model more, than doing it all in one place.
Part of the magic really needed is finding simpler ways to achieve the same levels of model robustness.
Sure, part wear is relevant, but I feel that most parts worldwide get chucked way before they are worn out. Electrical efficiency is probably quite a bit worse, though. Although you could possibly find an opportunity in maximizing the load in regions that have surplus energy and/or renewable sources.
> huggingface/transformers has a language-modeling example. It is full-featured but as a result also somewhat challenging to trace. E.g. some large functions have as much as 90% unused code behind various branching statements that is unused in the default setting of simple language modeling.
I don't understand this criticism of Transformers. Doesn't tracing (in both TorchScript and ONNX forms, which Transformers supports for exporting) just take the relevant model graph and freeze it? I don't think either contains the somewhat-weighty performance code.
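For reference, a generic sketch of what tracing does (not Transformers-specific; the toy model below is just for illustration):

    import torch

    model = torch.nn.Sequential(
        torch.nn.Embedding(100, 32),
        torch.nn.Flatten(),
        torch.nn.Linear(32 * 8, 100),
    )
    model.eval()

    example = torch.randint(0, 100, (1, 8))      # dummy input ids
    traced = torch.jit.trace(model, example)     # records only the ops actually executed
    traced.save("model_traced.pt")               # frozen graph; untaken branches don't end up in it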
Try to read the code and understand how it works and you will find it very challenging to interpret. And not just that: even the documentation is very sparse and hard to read. Compare that to the OpenAI code, which is so concise and easy to read. There is mastery in doing that, deep mastery. Few repositories in the TensorFlow or PyTorch organizations get to that level.
Honestly the library doesn't seem that hard to understand, although it can be under documented at times - I found looking through the source very helpful.
minGPT is actually quite performant too, the min refers to breadth of supported functionality (eg the absence of support for various additional conditioning, exotic masking, masked LMs, finetuning, pruning, etc).
This is really cool to see; I'm glad karpathy shared the work.
The internet at large has been really good at taking opaque machine learning research and elucidating the details and recreating the results. I've seen a few posts/repositories/etc doing that for GPT-3 as well but man, 175B parameters is just so far out of reach for hobbyists. It's really a shame.
In time further research will likely make language models more efficient to where something GPT-3-like can be trained at the hobbyist level. Probably a blended model like SHA-RNN or something with dedicated memory in its architecture, so the model isn't burning precious weights on remembering e.g. Lincoln's birthday. In the meantime though it makes me sad that something as impressive as GPT-3 is solely the toy of corporations.
I'm curious: Did you get comparable results to openai? I know a few people tried to train GPT-2 themselves (before it was openly released) and their results were quite inferior.
It's fun to find the sources to see how GPT output correlates with the input data:
Input texts:
Go, rate thy minions, proud insulting boy!
Hither to London, to be crown'd our king. / Welcome, sweet prince, to London, to your chamber.
/ Post you to London, and you will find it so /
Now to London,
To see these honours in possession.
How will the country for these woful chances
Misthink the king and not be satisfied!
Output:
Go, rating to London, with all these woful chances
Misthink the king and not be satisfied!
GPT is using multi-headed attention, so of course it's not as simple as putting a few texts together, but I was still interested in finding some similar texts (that can be done because the training data is only 1MB).
That's a really interesting idea. Could you go into detail about how you're searching for similar texts using GPT?
It's true that the probability distribution is a sort of "edit distance". And GPT has already been used for text compression: https://bellard.org/nncp/gpt2tc.html so it seems not too far of a stretch to use it for similarity matching.
(Sure, perhaps there are more efficient or more effective techniques than using GPT for this, but I like the idea and am curious how it works.)
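If anyone wants to play with it, here is one way it could work (my sketch using the stock GPT-2 from huggingface, not the parent's actual method): score each candidate source passage by how unsurprising it makes the generated output, i.e. the model's average per-token negative log-likelihood of the output conditioned on the candidate.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def nll_of_output_given(prefix, output):
        # average negative log-likelihood of `output` tokens, conditioned on `prefix`
        prefix_ids = tokenizer.encode(prefix, return_tensors="pt")
        output_ids = tokenizer.encode(" " + output, return_tensors="pt")
        full_ids = torch.cat([prefix_ids, output_ids], dim=1)
        labels = full_ids.clone()
        labels[:, :prefix_ids.shape[1]] = -100   # don't score the prefix positions
        with torch.no_grad():
            loss = model(full_ids, labels=labels).loss
        return loss.item()

    output = "Misthink the king and not be satisfied!"
    for candidate in ["Now to London,", "Go, rate thy minions, proud insulting boy!"]:
        print(candidate, nll_of_output_given(candidate, output))   # lower = more "similar"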
Just started a run of play_char on my 8GB 2070. Had to drop batch size to 128. Getting ~2.2 iterations per second, so it looks like it's going to take two hours for training to finish. I don't expect my training to differ from karpathy's, but I'm curious to play around with the trained model a bit. Already ran the math one and got the ~same results.
In the committed notebook the final training loss for play_char was 0.30588. Yet my training got down to 0.02638. Odd. Either way the resulting model seems to be just as good/bad. Like char-rnn it's amazing to see it spell words from scratch consistently. It has a good grasp on structure and even a passable grasp on grammar. But also like char-rnn it lacks any ability to form coherent sentences.
EDIT: I'm running it on the IMDB dataset now ... just to see.
Running for two epochs on the IMDB dataset (133MB corpus) it only got to a loss of 1.1. Likely the regularization is too high (I didn't tweak the hyperparameters at all, and assume regularization was quite high for the limited tinyshakespeare corpus). Either way, it at least started to learn more grammar:
Prompt: This is my review of Lord of the Rings.
> I can't tell why the movie is a story with a lot of potential
the main reason I want to see a movie that is Compared to the Baseball movie 10 Both and I can say it was not just a bad movie.
From the GitHub Readme "The rest of the complexity is just being clever with batching (both across examples and over sequence length) so that training is efficient."
What kind of complexities is he talking about? Is it simply the complexity of having a batch dimension (compared to simpler single-input code)?
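My reading of it (a sketch of the general pattern, not a quote of the repo): the tokenized corpus is sliced into block_size+1 chunks, each chunk yields inputs and next-token targets at every position via a one-token shift, and B of those get stacked into a (B, T) batch, so every batch carries B*T prediction problems.

    import torch

    block_size = 8
    data = torch.randint(0, 65, (1000,))            # stand-in for a tokenized corpus

    def get_example(i):
        chunk = data[i : i + block_size + 1]
        x = chunk[:-1]                              # tokens 0..T-1
        y = chunk[1:]                               # tokens 1..T: the "next token" at every position
        return x, y

    # batching across examples: stack 32 slices into a (B, T) tensor
    starts = range(0, 32 * block_size, block_size)
    xb = torch.stack([get_example(i)[0] for i in starts])
    yb = torch.stack([get_example(i)[1] for i in starts])
    print(xb.shape, yb.shape)                       # torch.Size([32, 8]) torch.Size([32, 8])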
I've been delaying introducing myself to this field seriously, but this feels attractive to me. I have a 32GB dual-Xeon box over RDP - would you recommend running this locally?
Nice, but I'm afraid people will write on their resumes that they trained a GPT model from scratch, and when asked about the details will admit they just ran minGPT without understanding it.
This is the AI story nowadays.
The best solution is to ignore minGPT and write your own version.
That seems fine. How many reddit clones have we seen?
For what it's worth, as a hiring manager who hires technical people, but not software engineers, these kinds of side projects can really help folks with a less formal education.
The upside is that if they DO understand it or at the very least learned something interesting or useful while working on the project it will help them a lot in an interview.
Folks who do what you're worried about exist, and they just don't get hired by people who aren't impressed with the shallow use of some new technology. There are also plenty of companies where that person will be fine, not really need to use GPT to do their job, and everyone will be happy anyways :)
How would you feel if somebody claimed to know Java just because they can call the Elasticsearch API (or substitute any tool written in Java)? Are you going to hire them and put them on a project that involves coding in Java? The issue is a false claim on a resume, which shows a lack of integrity.
At the other extreme, would you go in for surgery with a doctor who falsely claimed to have performed surgery before?
Some companies are looking for someone who knows enough Java to use an ElasticSearch API - it's about knowing your audience, and I think a lot of new folks just don't know how to target their applications.
For what it's worth, I've gone back and forth over this in my career, and I do think it has a bit to do with the strength of the claim, but there's an amount of naïveté that creeps into resumes, especially since advice is so widely varied: do you brag, sell yourself, show real projects, or lean on your past titles?
Anyways, there's no right way, and I've decided that the job seeker is in a position of weakness to employers and the industry in general, and if people are seeking to better themselves - great.
When I interview and hire people, I'm the one who screens out people who lie or are incapable of doing their job, what they put on their resume is just part of the process.
Your medical example is a bad one; sorry, I'm not going to engage with it. There are gatekeepers in almost all industries, and I'm not proposing changing them by saying that someone putting Java on their resume is not the same as a doctor lying about their qualifications: one, because systems exist to vet those qualifications, and two, because they're simply not on the same spectrum.
How many people have you screened, hired, and for how many roles? Is this a problem you've run into or experienced professionally, or just something that annoys you?
I've experienced both. When I ran a satellite, I employed interns, and most were CS majors who claimed to know C, C++, and Java. I asked them to write strcpy in C as my interview; none of them could do it, and these people were juniors in college. So no, I'm not too worried about what people put on their resume, because it's just never representative of their actual abilities.
If someone claims to know Java and gets a job with no actual test of their skill, that's the manager's fault.
Granted, that's the case with any machine learning project on a resume. People were using scikit-learn/NLTK for abstracting/simplifying training long before TensorFlow and PyTorch were the new hotness.
Those projects are good things to press in a phone interview.
Does just training a model get you a good job nowadays? I don't think anyone competent hiring people would be impressed by that, except at garbage places with garbage hiring managers.
I would expect an entry level person to be able to verbally answer basic questions such as:
- Explain why an NN with n nodes across multiple hidden layers can model a more complex structure than an NN with n nodes but only one hidden layer.
- When using an NN, when is it appropriate and not appropriate to utilize a cross-entropy cost function?
- Why can a single perceptron not approximate an XOR operation?
- Why is neural network (NN) training data divided into three sets: training, generalisation, and validation? What is the purpose of each? Must the three sets be mutually exclusive?
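On the XOR one, a quick numeric illustration (my own, in PyTorch): no single line separates {(0,0),(1,1)} from {(0,1),(1,0)}, so a lone perceptron bottoms out near chance, while one small hidden layer solves it.

    import torch

    X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = torch.tensor([[0.], [1.], [1.], [0.]])

    def train(model, steps=2000):
        opt = torch.optim.Adam(model.parameters(), lr=0.1)
        for _ in range(steps):
            loss = torch.nn.functional.binary_cross_entropy_with_logits(model(X), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return loss.item()

    single = torch.nn.Linear(2, 1)                       # a lone perceptron
    hidden = torch.nn.Sequential(torch.nn.Linear(2, 4),  # one small hidden layer
                                 torch.nn.Tanh(),
                                 torch.nn.Linear(4, 1))

    print(train(single))   # stuck around ln(2) ~ 0.69: XOR isn't linearly separable
    print(train(hidden))   # drops to ~0: the hidden layer bends the decision boundary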
The Tesla Autopilot rewrite due in 6-10 weeks that'll enable self-driving cars, promised by Elon on Friday, must be going well if the head of Autopilot is playing around with language ML models and open-sourcing them.
Laundering a guideline about not being rude about people's work into not talking about the people behind the work, especially when they're running a years long, millions of dollars fraud, isn't exactly the HN you want, sir.
Plenty of commenters express skepticism about self-driving cars and/or about Tesla while staying within the site rules. I'm sure you can too, if you want to.
Would you please stop posting in the flamewar style, and especially please stop snarking on HN? I realize it's an internet tradition, but we've all had lots of opportunity to see what its systemic effects are when it dominates the culture, and we don't want those here.
That has nothing to do whatsoever with Tesla's "4 years and running of consumer fraud". Whatever your thoughts on Tesla's public image or business behaviors, it's not especially relevant to the discussion on machine learning.
If they’re giving up on autopilot then fine. But if people’s lives are riding on the quality of the work then I think it sets a high bar for personal projects.
I'm sorry, what? Are people not allowed to have free time and hobbies anymore?
Just because someone's working on a high-profile high-risk project does not in any way change their personal time: it's theirs to do what they want with.
Society has given Karpathy et al. a lot of latitude to take liberties with other people's lives. In my view this creates an obligation to do the best job humanly possible.
That's all well and good for his job, and even that assertion is debatable. He's a human after all.
But this is a personal project. Something that is unrelated to his job. So it doesn't really matter what "bar" it is because he's doing it for his own benefit, not Tesla's or "humanity's".
Unless you're implying he can't have personal hobbies anymore. Which was the exact point of my previous comment.
His job affects people's lives. He certainly is human. I'm not criticising his project. I'm not saying he can't have personal hobbies.
And I'm not the one that decides the bar.
I'm saying that there is a bar and it's higher when lives depend on it.
There is a long list of activities that people could do in their personal time that would be considered inappropriate. What is done outside of work is relevant. It can form part of a character assessment which can have political and legal ramifications.
If Tesla ends up with a dangerous product then it becomes very relevant.
There is no bar on personal projects. Period. That's the whole point of it being a personal project.
At least you've made it perfectly clear that you don't think people should have lives outside of work. Hope you also subscribe to the same ideology yourself and don't have any personal projects or hobbies (why are you on HN anyway, shouldn't you be working right now?) or else you'd easily be considered a hypocrite.
There is a bar on personal projects. Period. Period.
Either you're misunderstanding logic, law, or society. If Karpathy fails, he runs a real risk of having his life very closely examined. The rules are different for different people.
I do machine learning medical work for the coronavirus. I work as close to optimally as is humanly possible. Occasional Hacker News is one of my few indulgences.
Agree to disagree. My company has no control over the work I do in my personal time, and neither does some random person on the other side of the world. This applies to me, you, and everyone else. Nobody is special because "they do important work".
By your logic, you should have no indulgences, so you should probably stop posting.