Karpathy's MinGPT (github.com/karpathy)
374 points by aliabd on Aug 17, 2020 | 102 comments



I think many people don't appreciate just how simple state-of-the-art neural net techniques are. It's great to have an illustration of just how little code you need to get the results that have amazed people.

You could say that it relies on PyTorch, which is a lot of code, but most of that complexity comes from the need for GPU/TPU/SIMD acceleration. A truly from-scratch, CPU-only implementation in C would still not be a large amount of code.
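To make that concrete, here's a back-of-the-envelope numpy sketch of a single causal self-attention head, the core operation GPT is built around (an illustration, not minGPT's actual code):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def causal_self_attention(x, Wq, Wk, Wv):
        # x: (T, C) sequence of T token embeddings of dimension C
        T, C = x.shape
        q, k, v = x @ Wq, x @ Wk, x @ Wv        # project to queries, keys, values
        att = (q @ k.T) / np.sqrt(k.shape[-1])  # (T, T) attention scores
        mask = np.triu(np.ones((T, T)), k=1).astype(bool)
        att[mask] = -np.inf                     # causal mask: no peeking at the future
        return softmax(att) @ v                 # weighted sum of values

    rng = np.random.default_rng(0)
    T, C = 8, 16
    x = rng.standard_normal((T, C))
    Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
    print(causal_self_attention(x, Wq, Wk, Wv).shape)  # (8, 16)

Stack a few of these heads, add MLPs, residual connections, and layer norm, and you have essentially the whole architecture.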


As a developer with an interest in neural-net-based ML, my eyes start to glaze over a bit when I see so many crazy-looking numpy operations even just to get a trivial representation of data.

I guess it's just something I have to get used to, but I wish there was an interface to do the same nd-array based logic that was designed more for developers like me rather than data scientists who perform surgery on 50d arrays all day long.


I agree; check out NamedTensor for an attempt to fix this: http://nlp.seas.harvard.edu/NamedTensor


Don’t look at code, because it bakes in so many derived values and simplifications that it’s hard to recover the original ideas. If you can find the original math, and walk through the derivatives (seeing zeros crop up and get simplified away), it starts to make a lot more sense.


Have you checked out numba? You can write regular Python for loops and numpy calls and it will compile them, instead of the numpy vectorization mess.
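A tiny sketch of the numba style, for the curious (plain loops, JIT-compiled; my own toy example):

    import numpy as np
    from numba import njit

    @njit
    def pairwise_dists(X):
        # naive double loop; numba compiles it to machine code
        n, d = X.shape
        out = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                s = 0.0
                for k in range(d):
                    diff = X[i, k] - X[j, k]
                    s += diff * diff
                out[i, j] = np.sqrt(s)
        return out

    X = np.random.rand(500, 3)
    D = pairwise_dists(X)  # first call compiles, subsequent calls run at C-like speed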


Good luck with anything involving non-trivial types. Numba is awesome but also quite finicky.


Sorry if this sounds condescending but just learn linear algebra and numpy. Linear algebra really is a minimum requirement if you want to get anywhere near ML. College freshmen can learn linear algebra and numpy and so can you. You'll only get even more lost and frustrated if you try to go into it thinking in terms of for loops.
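For what it's worth, the mental shift is mostly just replacing explicit loops with whole-array operations, e.g. (a trivial sketch):

    import numpy as np

    X = np.random.rand(1000, 64)   # a batch of 1000 input vectors
    W = np.random.rand(64, 10)     # a weight matrix

    # loop version: one row at a time
    out_loop = np.zeros((1000, 10))
    for i in range(X.shape[0]):
        out_loop[i] = X[i] @ W

    # vectorized version: same math, one call
    out_vec = X @ W

    assert np.allclose(out_loop, out_vec)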


It's kind of like saying Redis is just a dict. The optimizations in the entire software/hardware stack are what make these innovations possible. A C implementation would not work in reality for large models.


Here is a C implementation of GPT-2. https://bellard.org/nncp/gpt2tc.html

I don't disagree that hardware acceleration is key in enabling these models, but I still find it interesting how simple the core techniques are.


Is there anything Bellard hasn’t done?


He hasn't released blueprints for a cheap homemade 1 MW fusion reactor yet, but otherwise - yep, it seems he's covered everything else.


He didn't rewrite everything in Rust.


Once he picks up Rust, he'll become a 10x programmer.


What a loss. To be downgraded that much.


That's not really a C implementation of GPT-2 since it cannot be used to do the thing everyone cares about: self-supervised learning from text. In fact, it doesn't even use the weights in the same way GPT-2 does, so it's not clear how close it is to GPT-2's inference mode. The source isn't even on the page.


Hm, I notice the source code is missing? Or did I overlook it?


This is very cool, thanks for sharing! From the readme (https://bellard.org/nncp/readme-gpt2tc.txt), the program benchmarks very comparably to CMIX, which is the top algorithm on the Large Text Compression Benchmark (http://mattmahoney.net/dc/text.html). I'm guessing that any GPT implementation would be ineligible for the benchmark because of its file size but impressive nonetheless.


> A C implementation

You mean a CPU implementation. The C language isn't the culprit here, and the low-level parts of ML frameworks are normally written in C or C++ as well and compiled with nvcc or an equivalent.


It's simple in the same way Conway's Game of Life is simple.


It really depends on what level you allow yourself to start from.

If you hand-derive all the equations in your program you are going to have a hard time ... but automatic differentiation (AD) and existing linear algebra libraries (for gradient descent/Adam) make the job pretty easy.

I recently dabbled in automatic (song-)mixing ... reimplemented the code of some open audio plugins in a way that made them auto-differentiable and "trained" the whole thing on some multitracks with reference mixes. Sadly that didn't result in a good mix ... but in theory it should work.
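For anyone curious what that looks like, here's a minimal toy version of the idea (my own sketch in PyTorch, not the commenter's code): learn per-stem gains by gradient descent so the summed mix matches a reference.

    import torch

    # toy data: 4 stems of 1 second of audio at 16 kHz (random stand-ins here)
    stems = torch.randn(4, 16000)
    reference_mix = torch.randn(16000)  # in reality: a human-made reference mix

    log_gains = torch.zeros(4, requires_grad=True)  # learnable per-stem gains
    opt = torch.optim.Adam([log_gains], lr=0.05)

    for step in range(500):
        mix = (stems * log_gains.exp().unsqueeze(1)).sum(dim=0)  # differentiable "mixer"
        loss = torch.mean((mix - reference_mix) ** 2)            # match the reference
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(log_gains.exp())  # learned linear gain per stem

A real attempt would swap the single gain per stem for differentiable EQ/compressor stages and use a perceptual loss instead of plain MSE.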

What's your "simple deep learning" story?


Karpathy is also very good at making complex architectures seem very simple.


I wonder, is a community-trained model feasible? As in, get a few tens of thousands of devs to run a SETI@home-type app on their GPUs during the night, and at the end you get access to the 175B trained model. IIRC the ~$5M training cost was estimated at cloud GPU prices; if you're using spare capacity you're just paying for electricity.


Fun idea. GPT @ Home :D. Scatter of the inputs would be very cheap as they are tiny LongTensors (sequences of indices), but the Gather of the gradients seems like a bottleneck. These models can be quite large. Maybe each worker only communicates back some sparse or potentially precision-reduced gradients? In the limit, I recall papers that were able to only communicate one bit per dimension. May also be possible to further reduce the number of weights by weight sharing, or e.g. with HyperNetworks.
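A rough sketch of what "one bit per dimension" can look like in practice (my own gloss, signSGD-style with error feedback; not taken from any specific paper):

    import numpy as np

    def compress(grad, residual):
        g = grad + residual                 # carry over last round's quantization error
        scale = np.abs(g).mean()            # one float per tensor
        bits = np.sign(g)                   # one "bit" per dimension (+1 / -1)
        new_residual = g - scale * bits     # error feedback, kept locally on the worker
        return bits, scale, new_residual

    def decompress(bits, scale):
        return scale * bits

    rng = np.random.default_rng(0)
    grad = rng.standard_normal(10)
    bits, scale, residual = compress(grad, np.zeros(10))
    print(np.round(grad, 2))
    print(np.round(decompress(bits, scale), 2))

The worker only ships `bits` (plus one scale float) back to the parameter server; the residual makes sure the rounding error isn't lost, just delayed.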


I've long wanted to see a proof-of-work cryptocurrency that does neural network training instead of burning through hashes. Imagine if 0.5% of the planet's energy consumption (9 GW) were used for training neural networks instead of mining bitcoin! It would also solve the problem of ASICs being 1000x more efficient than GPUs, so everyone could participate. It would incentivise the development of efficient neural network training hardware. Somebody do this already!


I built this a few years ago:

https://github.com/Hello1024/shared-tensor

It updates the weights with 1-bit precision each iteration.

It would be fairly trivial to go to less than 1 bit of precision too - simply set some threshold (e.g. 3), and wherever the difference between the weight on the server and the client is greater than 3, transmit a binary "1", else send a binary "0". Then entropy-code all the resulting bits.

By adjusting the threshold up and down, you trade off the size of the data to send vs. precision.
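Something like this, I'd guess (my own sketch of the scheme described above, not code from the repo; the sign handling is an extra assumption on my part):

    import numpy as np

    def encode_update(server_w, client_w, threshold=3.0):
        # one bit per weight: 1 if the client has drifted more than
        # `threshold` from the server's copy, else 0
        diff = client_w - server_w
        bits = (np.abs(diff) > threshold).astype(np.uint8)
        signs = np.sign(diff)
        return bits, signs

    def apply_update(server_w, bits, signs, step=1.0):
        # nudge each flagged weight by a fixed step in the reported direction
        return server_w + bits * signs * step

    rng = np.random.default_rng(0)
    server_w = np.zeros(8)
    client_w = rng.normal(0, 4, 8)
    bits, signs = encode_update(server_w, client_w)
    print(bits)  # mostly zeros, so an entropy coder squeezes it well
    print(apply_update(server_w, bits, signs))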


I read a paper that did exactly what you describe but of course I can't find it now...



I wonder if it would just be simpler to collect the money that would be spent on electricity and use the cloud anyway.


People's home GPUs are much, much cheaper than using the cloud because of the way Nvidia's licensing works to jack up cloud GPU prices.


Distributed training is indeed a thing! A few random Arxiv pulls:

https://arxiv.org/abs/2007.03970 https://arxiv.org/abs/1802.09941

Efficiency and resource cost is the big question though. You don't pay for the electricity or part wear that you don't use, and home computers or workstations may not be as efficient at performing a training run vs a task-specific setup. AI@home might end up costing even more, and increase the footprint of the model more, than doing it all together.

Part of the magic really needed is finding simpler ways to achieve the same levels of model robustness.


> electricity or part wear

Sure, part wear is relevant, but I feel that most parts worldwide get chucked way before they are worn out. Electrical efficiency is probably quite a bit worse, though. Although you could possibly find an opportunity in maximizing the load in regions that have surplus energy and/or renewable sources.


Seems doable to me


> huggingface/transformers has a language-modeling example. It is full-featured but as a result also somewhat challenging to trace. E.g. some large functions have as much as 90% unused code behind various branching statements that is unused in the default setting of simple language modeling.

I don't understand this criticism of Transformers. Doesn't tracing (in both TorchScript and ONNX forms, which Transformers supports for exporting) just take the relevant model graph and freeze it? I don't think either contains the somewhat-weighty performance code.


FWIW, I took that to mean "code path is challenging [for a human being] to trace."


Try to read the code and understand how it works and you will find it very challenging to interpret. And not just that: even the documentation is very sparse and hard to read. Compare that to the OpenAI code, which is so concise and easy to read. There is mastery in doing that, deep mastery. Few repositories in the tensorflow or pytorch organizations get to that level.


Agreed re: OpenAI's GPT implementation. It took roughly a year to appreciate how simple it is. https://github.com/openai/gpt-2/blob/0574c5708b094bfa0b0f6df...

Especially compared to StyleGAN, BERT, or pretty much anything else.

I used to hate the OpenAI GPT codebase: zero comments? no classes? What does "mlp" even mean? But over time, I find myself reverting to their style.


Honestly the library doesn't seem that hard to understand, although it can be under-documented at times - I found looking through the source very helpful.


I think the argument here is about pedagogy not performance.


minGPT is actually quite performant too; the "min" refers to the breadth of supported functionality (e.g. the absence of support for various additional conditioning, exotic masking, masked LMs, finetuning, pruning, etc.).


GPT training performance on the CPU is funny. The vocab size and context window size have a massive effect on both speed and accuracy.


Sure thing! I only meant to imply the relative ordering of considerations.


I agree with this. Deep learning models change so fast that there is no point maintaining super-reusable code.


If you just want to understand the Transformer, here is a clean implementation:

https://github.com/blue-season/pywarm/blob/master/examples/t...


and here's a breakdown of the architecture:

http://dugas.ch/artificial_curiosity/GPT_architecture.html


These 4 videos (~45 mins) do an excellent job at explaining attention, multi-headed attention, and transformers: https://www.youtube.com/watch?v=yGTUuEx3GkA



This is really cool to see; I'm glad karpathy shared the work.

The internet at large has been really good at taking opaque machine learning research and elucidating the details and recreating the results. I've seen a few posts/repositories/etc doing that for GPT-3 as well but man, 175B parameters is just so far out of reach for hobbyists. It's really a shame.

In time further research will likely make language models more efficient to where something GPT-3-like can be trained at the hobbyist level. Probably a blended model like SHA-RNN or something with dedicated memory in its architecture, so the model isn't burning precious weights on remembering e.g. Lincoln's birthday. In the meantime though it makes me sad that something as impressive as GPT-3 is solely the toy of corporations.


We recently trained GPT-3 (SMALL) at work on our GPU cluster for fun, took 4 days across a couple dozen machines...

Millions of dollars in CAPEX and OPEX just for one model


I'm curious: Did you get comparable results to openai? I know a few people tried to train GPT-2 themselves (before it was openly released) and their results were quite inferior.


You're saying your project cost millions of dollars, or the big boys' projects did?


If "4 days across a couple dozen machines" cost millions, something is very wrong.


Not if it was a couple dozen of these machines:

https://www.hardwarezone.com.sg/tech-news-nvidia-dgx-a100-su...


Running your own DC is quite expensive with GPU hardware. One DGX-2 is $400k and draws something like 24kW.


> draws something like 24 kW.

That number is off. The DGX-2 consumes 10 kW at peak [0] and the DGX-2H consumes 12 kW at peak [1].

[0] https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent...

[1] https://www.nvidia.com/content/dam/en-zz/es_em/Solutions/Dat...


It's fun to find the sources to see how GPT output correlates with the input data:

Input texts:

Go, rate thy minions, proud insulting boy!

Hither to London, to be crown'd our king. / Welcome, sweet prince, to London, to your chamber. / Post you to London, and you will find it so / Now to London, To see these honours in possession.

How will the country for these woful chances Misthink the king and not be satisfied!

Output:

Go, rating to London, with all these woful chances Misthink the king and not be satisfied!


Did you understand the correlation? And what part of the source code is doing it? Or did you just have fun without any understanding?


GPT is using multi-headed attention, so of course it's not as simple as putting a few texts together, but I was still interested in finding some similar texts (that can be done because the training data is only 1MB).


That's a really interesting idea. Could you go into detail about how you're searching for similar texts using GPT?

It's true that the probability distribution is a sort of "edit distance". And GPT has already been used for text compression: https://bellard.org/nncp/gpt2tc.html so it seems not too far of a stretch to use it for similarity matching.

(Sure, perhaps there are more efficient or more effective techniques than using GPT for this, but I like the idea and am curious how it works.)
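One way it could plausibly be done (a sketch using HuggingFace's pretrained GPT-2, which is not what anyone in this thread actually did; the thread's model is a small Shakespeare-trained minGPT): score how surprising candidate text B is when conditioned on text A, and treat a lower conditional loss as higher similarity.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def conditional_nll(context, candidate):
        # average negative log-likelihood of `candidate` tokens given `context`
        ctx = tok(context, return_tensors="pt").input_ids
        cand = tok(candidate, return_tensors="pt").input_ids
        ids = torch.cat([ctx, cand], dim=1)
        with torch.no_grad():
            logits = model(ids).logits
        preds = logits[0, ctx.shape[1] - 1 : -1]  # positions that predict the candidate
        return torch.nn.functional.cross_entropy(preds, cand[0]).item()

    a = "Hither to London, to be crown'd our king."
    print(conditional_nll(a, " Welcome, sweet prince, to London."))
    print(conditional_nll(a, " The stock market closed higher today."))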


I was using Ctrl-F in Chrome to search for the words in the training data (Shakespeare texts).

With Transformer models it's quite easy to print out the top Query-Key pairs to debug what happened, but that was not my intention.


Why not use TF-IDF and just use Elasticsearch?
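For reference, the plain TF-IDF route is only a few lines (a sketch, using lines quoted elsewhere in this thread as the corpus):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = [
        "Hither to London, to be crown'd our king.",
        "Welcome, sweet prince, to London, to your chamber.",
        "Go, rate thy minions, proud insulting boy!",
    ]
    query = ["Go, rating to London, with all these woful chances"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(corpus)
    q = vec.transform(query)
    print(cosine_similarity(q, X))  # highest score = most similar source line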


Is it possible to build a useful transformer without investing millions of dollars?


People have built passable translation systems with transformers using a single high-end GPU.


This is great. Years ago his RNN example helped me a lot.


Just started a run of play_char on my 8GB 2070. Had to drop batch size to 128. Getting ~2.2 iterations per second, so it looks like it's going to take two hours for training to finish. I don't expect my training to differ from karpathy's, but I'm curious to play around with the trained model a bit. Already ran the math one and got the ~same results.


In the committed notebook the final training loss for play_char was 0.30588. Yet my training got down to 0.02638. Odd. Either way the resulting model seems to be just as good/bad. Like char-rnn it's amazing to see it spell words from scratch consistently. It has a good grasp on structure and even a passable grasp on grammar. But also like char-rnn it lacks any ability to form coherent sentences.

EDIT: I'm running it on the IMDB dataset now ... just to see.


Running for two epochs on the IMDB dataset (133MB corpus) it only got to a loss of 1.1. Likely the regularization is too high (I didn't tweak the hyperparameters at all, and assume regularization was quite high for the limited tinyshakespeare corpus). Either way, it at least started to learn more grammar:

Prompt: This is my review of Lord of the Rings.

> I can't tell why the movie is a story with a lot of potential the main reason I want to see a movie that is Compared to the Baseball movie 10 Both and I can say it was not just a bad movie.


From the GitHub README: "The rest of the complexity is just being clever with batching (both across examples and over sequence length) so that training is efficient."

What kind of complexities is he talking about? Is it simply the complexity of having a batch dimension (compared to simpler single-input code)?
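For what it's worth, my read of the sequence-length part (roughly what minGPT's CharDataset does, reconstructed from memory, so treat the details as approximate): the targets are just the inputs shifted by one position, so every position in a block is a training example, and blocks are then stacked into a batch.

    import torch

    text = "some long training text ..."
    stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
    block_size = 8

    def get_example(idx):
        chunk = text[idx : idx + block_size + 1]      # block_size + 1 characters
        dix = [stoi[c] for c in chunk]
        x = torch.tensor(dix[:-1], dtype=torch.long)  # inputs
        y = torch.tensor(dix[1:], dtype=torch.long)   # targets = inputs shifted by one
        return x, y                                   # every position is a prediction task

    # batching across examples: just stack several blocks
    xs, ys = zip(*(get_example(i) for i in range(4)))
    x_batch, y_batch = torch.stack(xs), torch.stack(ys)
    print(x_batch.shape, y_batch.shape)  # torch.Size([4, 8]) torch.Size([4, 8])

So one batch of B blocks of length T gives B*T next-token predictions per forward pass, which is where most of the efficiency comes from.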


A TensorFlow re-implementation of minGPT: https://github.com/kamalkraj/minGPT-TF


I've been putting off getting into this field seriously, but this feels attractive to me. I have a 32GB dual-Xeon machine I use over RDP; would you recommend running this locally?


[flagged]


Personal attacks are not ok on Hacker News. Also, please don't fulminate in comments generally. We want curious conversation, not flamewar.

https://news.ycombinator.com/newsguidelines.html


Nice, but I am scared people will write in their resumes that they trained a GPT model from scratch, and when asked about it in detail they will admit they just ran minGPT without understanding it. This is the AI story nowadays. The best solution is to ignore minGPT and write your own version.


That seems fine. How many reddit clones have we seen?

For what it's worth, as a hiring manager who hires technical people, but not software engineers, these kinds of side projects can really help folks with a less formal education.

The upside is that if they DO understand it or at the very least learned something interesting or useful while working on the project it will help them a lot in an interview.

Folks who do what you're worried about exist, and they just don't get hired by people who aren't impressed with the shallow use of some new technology. There are also plenty of companies where that person will be fine, not really need to use GPT to do their job, and everyone will be happy anyways :)


How would you feel if somebody claimed they knew Java just because they can call the Elasticsearch API (or substitute any tool written in Java)? Are you going to hire them and put them on a project that involves coding in Java? The issue is a false claim on a resume, which shows no integrity. At the other extreme, would you go to a doctor for surgery who falsely claimed to have performed surgery before?


Some companies are looking for someone who knows enough Java to use an ElasticSearch API - it's about knowing your audience, and I think a lot of new folks just don't know how to target their applications.

For what it's worth, I've gone back and forth over this in my career, and I do think it has a bit to do with the strength of the assertion, but there's an amount of naïveté that creeps into resumes, especially since advice is so widely varied: do you brag, sell yourself, show real projects, or lean on your past titles?

Anyways, there's no right way, and I've decided that the job seeker is in a position of weakness to employers and the industry in general, and if people are seeking to better themselves - great.

When I interview and hire people, I'm the one who screens out people who lie or are incapable of doing their job, what they put on their resume is just part of the process.

Your medical example is a bad one, sorry; I'm not going to engage with it. There are gatekeepers in almost all industries, and I'm not proposing changing them by saying that someone putting Java on their resume is not the same as a doctor lying about their qualifications. One, because systems exist to vet those qualifications, and two, because they're simply not on the same spectrum.

How many people have you screened, hired, and for how many roles? Is this a problem you've run into or experienced professionally, or just something that annoys you?

I've experienced both. When I ran a satellite, I employed interns, and most were CS majors who claimed to know C, C++, and Java. I asked them to write strcpy in C as my interview; none of them could do it, and these people were juniors in college. So no, I'm not too worried about what people put on their resume, because it's just not representative of their actual abilities ever.

If someone claims to know Java and gets a job with no actual test of their skill, that's the manager's fault.


Granted, that's the case with any machine learning project on a resume. People were using scikit-learn/NLTK for abstracting/simplifying training long before TensorFlow and PyTorch were the new hotness.

Those projects are good things to press in a phone interview.


What are some questions you'd ask in an interview?


This is an invented problem. I would assign zero space in my head to this if I were hiring.


It's a good starting point.


Does just training a model get you a good job nowadays? I don't think anyone competent hiring people would be impressed by that, except at garbage places with garbage hiring managers


Wouldn't it be enough for entry level? What else would you expect for an entry-level ML position at a small company?


I would expect an entry level person to be able to verbally answer basic questions such as:

- Explain why an NN with n nodes across multiple hidden layers can model a more complex structure than an NN with n nodes but only one hidden layer.

- When using an NN, when is it appropriate and not appropriate to utilize a cross-entropy cost function?

- Why can a single perceptron not approximate an XOR operation?

- Why is neural network (NN) training data divided into three sets: training, generalisation, and validation? What is the purpose of each? Must the three sets be mutually exclusive?

Etc.


"- Explain why an NN with n nodes across multiple hidden layers can model a more complex structure than an NN with n nodes but only one hidden layer."

I'd be curious to hear your answer on that one.


the Tesla Autopilot rewrite due in 6-10 weeks that'll enable self-driving cars, promised by Elon on Friday, must be going well if the head of Autopilot is playing around with language ML models and open sourcing them

https://twitter.com/elonmusk/status/1294374864657162240?s=20


"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

https://news.ycombinator.com/newsguidelines.html


Laundering a guideline about not being rude about people's work into not talking about the people behind the work, especially when they're running a years-long, multimillion-dollar fraud, isn't exactly the HN you want, sir.


Plenty of commenters express skepticism about self-driving cars and/or about Tesla while staying within the site rules. I'm sure you can too, if you want to.

Would you please stop posting in the flamewar style, and especially please stop snarking on HN? I realize it's an internet tradition, but we've all had lots of opportunity to see what its systemic effects are when it dominates the culture, and we don't want those here.


He's allowed to do other things. Also, maybe his part of the project is done. Also the thing might be delayed.


[flagged]


Per the guidelines:

> Please don't comment about the voting on comments. It never does any good, and it makes boring reading.

https://news.ycombinator.com/newsguidelines.html


I downvoted because complaining about Tesla is irrelevant to the topic, not to express pro-Tesla sentiment.


Karpathy is the head of Tesla autopilot, and the headline is Karpathy's MinGPT.


That has nothing to do whatsoever with Tesla's "4 years and running of consumer fraud". Whatever your thoughts on Tesla's public image or business behaviors, it's not especially relevant to the discussion on machine learning.


If they’re giving up on autopilot then fine. But if people’s lives are riding on the quality of the work then I think it sets a high bar for personal projects.


I'm sorry, what? Are people not allowed to have free time and hobbies anymore?

Just because someone's working on a high-profile high-risk project does not in any way change their personal time: it's theirs to do what they want with.


Society has given Karpathy et al. a lot of latitude to take liberties with other people's lives. In my view this creates an obligation to do the best job humanly possible.


That's all well and good for his job, and even that assertion is debatable. He's a human after all.

But this is a personal project. Something that is unrelated to his job. So it doesn't really matter what "bar" it is because he's doing it for his own benefit, not Tesla's or "humanity's".

Unless you're implying he can't have personal hobbies anymore. Which was the exact point of my previous comment.


His job affects peoples lives. He certainly is human. I'm not criticising his project. I'm not saying he can't have personal hobbies. And I'm not the one that decides the bar.

I'm saying that there is a bar and it's higher when lives depend on it.

There is a long list of activities that people could do in their personal time that would be considered inappropriate. What is done outside of work is relevant. It can form part of a character assessment which can have political and legal ramifications.

If Tesla ends up with a dangerous product then it becomes very relevant.


There is no bar on personal projects. Period. That's the whole point of it being a personal project.

At least you've made it perfectly clear that you don't think people should have lives outside of work. Hope you also subscribe to the same ideology yourself and don't have any personal projects or hobbies (why are you on HN anyway, shouldn't you be working right now?) or else you'd easily be considered a hypocrite.


There is a bar on personal projects. Period. Period.

Either you're misunderstanding logic, law, or society. If Karpathy fails he runs a real risk of having his life very closely examined. The rules are different for different people.

I do machine learning medical work for the coronavirus. I work at near optimal as humanly possible. Occasional hackernews is one of my few indulgences.


Agree to disagree. My company has no control over the work I do in my personal time, and neither does some random person on the other side of the world. This applies to me, you, and everyone else. Nobody is special because "they do important work".

By your logic, you should have no indulgences, so you should probably stop posting.


Did you consider that taking time off is doing the best job possible?


That certainly depends on how much time off and what was done in that time.


Ok, leave it up to him to make that call.




