MinGPT: Minimal PyTorch re-implementation of GPT (github.com/karpathy)
223 points by memorable on Sept 6, 2022 | 24 comments



Hah funny to see this on HN, it is a relatively old project but one that I continue to love and still work on. I was trying to train a GPT one day and discovered that available implementations were quite complex, spread across many files, and took way too many kwarg switches for esoteric/rare options that just bloated and complexified the code. But in my head a GPT was a super simple, neat, isotropic model, so I got all worked up and wrote minGPT.

The project went on to have more impact than I originally imagined and made its way into a number of projects and papers. One of those I found only a few days ago here: https://twitter.com/karpathy/status/1566100736076697600 . What I love about these projects is that the authors often "hack up" minGPT in code directly. They don't configure a comprehensive kwarg monster. I think there's a beauty in that. Very often I wish we had more gists and fewer frameworks - to look at code chunks, understand them completely, tune them to our needs, and re-use them in projects, similar to how bacteria trade little DNA plasmids. minGPT is written for those who want that for their GPT projects. There's plenty of cons to this approach too, ultimately I think there's value in both approaches.

Coming up, the theme of future minGPT development: more examples, and more teeth - it should be possible to demonstrate the training of relatively serious (~few B) models with minGPT on a single multi-GPU node and reproduce some benchmarks around that scale, but never sacrifice its readability.


I completely agree! I personally find these powerful new network releases border on the depressing, in that they aren’t really network releases but huge training systems of dispersed YAMLs. YOLOv4 was a case in point where I was too overwhelmed to try and integrate it into a project I was working on.

PS you are a hero of mine - I’m an academic medical doctor for whom CS231n was my first foray into AI, and since then I’ve gone on to gold medal in a couple of Kaggle competitions and secure 5 years of higher research funding to pursue clinical AI. I am immensely grateful to you and Fei-Fei Li.


For anyone else who was new to the phrase "isotropic model":

https://github.com/christianversloot/machine-learning-articl...


This works for an architecture which has been well tuned and studied before, like LSTM or Transformer.

Once you do research on the model, testing things out, it often tends to become such a kwarg monster in many frameworks.

Having everything (relevant) in one file (even the hyperparams in the config file itself) allows you to copy the file for every experiment and modify it in place. This avoids the kwargs mess. But then the config files are very complex, and can become messy in other ways (esp. for research projects). Example: https://github.com/rwth-i6/returnn-experiments/blob/master/2...

Such an approach makes it much more flexible and does not mess with the baseline code. As you say, it's more like an evolutionary DNA-like approach, where you then tend to do crossovers with other evolved, good-performing configs, etc.


In Python I have found that a good way to deal with large config files is to use dataclasses and serialize/deserialize them with OmegaConf.
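Something along these lines (the config fields and file names here are made up for illustration, not taken from any particular project):

    # a minimal sketch of the dataclass + OmegaConf pattern described above
    from dataclasses import dataclass
    from omegaconf import OmegaConf

    @dataclass
    class TrainConfig:
        model_type: str = "gpt-mini"
        n_layer: int = 6
        n_head: int = 6
        n_embd: int = 192
        learning_rate: float = 3e-4
        max_iters: int = 10_000

    # typed base config, per-experiment overrides, merged and saved
    # alongside the run so every experiment is a self-contained file
    base = OmegaConf.structured(TrainConfig)
    overrides = OmegaConf.create({"n_layer": 12, "learning_rate": 6e-4})
    cfg = OmegaConf.merge(base, overrides)

    print(OmegaConf.to_yaml(cfg))
    OmegaConf.save(cfg, "experiment_001.yaml")

The typed dataclass gives you validation and autocompletion, while the saved YAML per experiment preserves the copy-and-modify workflow described above.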


Thanks for making it! There is immense value in something you can just dive into and hack on. I’ve been hacking on Stable Diffusion/latent diffusion these past couple weeks, and you don't know how much time something similar would have saved me!


Are there any similarly structured projects around?


What's required for an AGI to update karpathy's HN bio?


This is actually a pretty neat, self-contained implementation that can be super easily extended beyond stereotypical natural language models, for example to create world models for video games [1] or to create robot models that can learn to imitate from large, chaotic human demonstration data [2] (disclaimer, I'm an author on the second one). Basically, GPT (or minGPT) models are EXCELLENT sequence modelers, almost to the point where you can throw any sensible sequence data at it and hope to get interesting results, as long as you don't overfit.

Even though I have only been working on machine learning for around six years, it's crazy to see how the landscape has changed so fast so recently, including diffusion models and transformers. It's not too much to say that we might expect more major breakthroughs by the end of this decade, and end up in a place we can't even imagine right now!

[1] https://github.com/eloialonso/iris [2] https://github.com/notmahi/bet
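To make the "throw any sensible sequence at it" point concrete, here is a rough sketch of training minGPT on a toy task (sorting short digit sequences), loosely modeled on the repo's demo notebook; the config fields and Trainer API are written from memory, so the exact names may differ in the current code.

    # toy example: teach a tiny GPT to sort sequences of digits
    import torch
    from torch.utils.data import Dataset
    from mingpt.model import GPT
    from mingpt.trainer import Trainer

    class SortDataset(Dataset):
        """Each example is 6 random digits followed by the same digits sorted."""
        def __init__(self, size=10000, length=6, num_digits=10):
            self.size, self.length, self.num_digits = size, length, num_digits
        def __len__(self):
            return self.size
        def __getitem__(self, idx):
            inp = torch.randint(self.num_digits, (self.length,))
            sol = torch.sort(inp)[0]
            cat = torch.cat((inp, sol))
            x, y = cat[:-1].clone(), cat[1:].clone()
            y[:self.length - 1] = -1  # only the sorted half contributes to the loss
            return x, y

    train_dataset = SortDataset()

    model_config = GPT.get_default_config()
    model_config.model_type = 'gpt-nano'
    model_config.vocab_size = 10   # the "tokens" are just the digits 0..9
    model_config.block_size = 11   # 2 * 6 - 1: input plus sorted output, shifted by one
    model = GPT(model_config)

    train_config = Trainer.get_default_config()
    train_config.max_iters = 2000
    trainer = Trainer(train_config, model, train_dataset)
    trainer.run()

Swap the dataset for tokenized game frames or discretized robot actions and the same small model file carries over unchanged, which is what makes the minimal implementation so easy to hack on.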


> Even though I have only been working on machine learning for around six years, it's crazy to see how the landscape has changed so fast so recently, including diffusion models and transformers.

It's pretty wild considering how hidden Markov models were considered state of the art not all that long ago.


Some people demean GPT-3 saying it's just a Markov model.


Karpathy really seems to have discovered there are a lot of hours in the day now that he doesn't work for Tesla.


He was doing this kind of stuff while he was at Tesla too - https://github.com/karpathy/cryptos


Pretty sure he wrote this while working at Tesla also


Not only him. The tech boom in the past decade made a lot of great programmers rich, and it is a good thing. Look also at how Aras Pranckevičius (of Unity fame) is now contributing to Blender. (Also, to some extent, Rui Ueyama (of mold fame) and Raph Levien (of xi editor fame), although I'm not certain about their financial standing.)


This implementation is quite old now actually - although I agree, it certainly seems that way otherwise :)


I love your approach and philosophy around programming. If anyone is unaware, Karpathy has a relatively small YouTube channel he started a few weeks ago. https://youtu.be/VMj-3S1tku0


Related:

Karpathy's MinGPT - https://news.ycombinator.com/item?id=24189497 - Aug 2020 (102 comments)


Nice! I remember studying Karpathy’s character RNN code way back when - a great study resource. Looking forward to understanding this example too!


I am working on a video lecture series that will step through it and "spell it out". Without that, even this code can be a bit opaque for someone who is new to the field and e.g. uncomfortable with n-dimensional array manipulations or the surrounding language modeling concepts.


That's very generous of you. Thanks.

All we need now is a good, local copilot implementation. Has anyone done anything like that with minGPT that you know of? Something to inspire the masses like stable diffusion has with images.


Here I was thinking someone had recreated the GUID Partition Table in some form of MicroPython. Perhaps someday.


Is there a Colab available yet?


With enough training data and enough GPUs to do the model training, you'll be there! Goes to show that for AI, the code really isn't the important part. AI is and always has been about data and compute.



