Let's build GPT: from scratch, in code, spelled out by Andrej Karpathy [video] (youtube.com)
1110 points by georgehill on Jan 17, 2023 | 104 comments



Just started watching and Andrej is an excellent "thingexplainer". His in-depth knowledge of the underlying atoms/bits comes through. As an extra benefit, the course page at https://karpathy.ai/zero-to-hero.html links to his Discord chat. It's a very active community and Andrej himself is very active there. So feel free to watch the lectures, cry through the assignments, and come to the Discord with questions. They will be answered.


He has a website[0] for these videos

[0] https://karpathy.ai/zero-to-hero.html


I might be too new to this area -- but is this actually explaining how to create a small version of the actual trained model, not just "using the trained model for X"? I can imagine that in the future people won't start from pure scratch -- there will be building blocks that everybody starts from -- but I'm mostly wondering how hard it is to actually replicate what OpenAI has done if you had the money to pay for the training?


rough steps:

1. collect a very large dataset, see: https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla... . scrape, de-duplicate, clean, wrangle. this is a lot of work regardless of $.

2. get on a call with the sales teams of major cloud providers to procure a few thousand GPUs and enter into too-long contracts.

3. "pretrain" a GPT. one common way to do this atm is to create your own exotic fork of MegatronLM+DeepSpeed. go through training hell, learn all about every possible NCCL error message, see the OPT logbook as good reference: https://github.com/facebookresearch/metaseq/blob/main/projec...

4. follow the 3-step recipe of https://openai.com/blog/chatgpt/ to finetune the model to be an actual assistant instead of just "document completor", which otherwise happily e.g. responds to questions with more questions. Also e.g. see OPT-IML https://arxiv.org/abs/2212.12017 , or BLOOMZ https://arxiv.org/abs/2211.01786 to get a sense of the work involved here.
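
To make step 3 concrete at toy scale, here is a minimal sketch of one pretraining step in PyTorch (hedged: this assumes a GPT-style model whose forward pass maps (batch, seq) token ids to (batch, seq, vocab) logits; `pretrain_step` is a made-up name for illustration):

    import torch
    import torch.nn.functional as F

    def pretrain_step(model, optimizer, tokens):
        # tokens: (batch, seq+1) integer tensor sampled from the corpus
        x, y = tokens[:, :-1], tokens[:, 1:]  # inputs and next-token targets
        logits = model(x)                     # (batch, seq, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        return loss.item()

The hard part at scale is not this loop but sharding it across thousands of GPUs, which is what the MegatronLM+DeepSpeed forks above are for.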


I was so confused by the saltiness until I saw the username. I'm sure you've earned it.

I got into deep learning because of your char-rnn posts a while ago -- they inspired me to do an undergrad thesis on the topic. I read arxiv papers after that and implemented things from the ground up until a startup liked my work and hired me for a neural network engineer position.

Fast forward a few years and I was enamoured with minGPT and it stuck with me. I wanted a CIFAR10 experimentation toolbench, so I took my best swing at applying the minGPT treatment to the current best single-GPU DAWNBench entry, added a few tweaks, and got https://github.com/tysam-code/hlb-CIFAR10. It currently (AFAIK) holds the world record for training to the 94% mark, by a fair bit.

It's about 600 lines in a monolithic file, only requiring torch and torchvision, but it's my first project like this and I'd like to learn how to better minify codebases like this. It seems like the hardest part is knowing how to structure inheritance and abstraction, but I don't know if you had any good outside references/resources that you used or would recommend. If you have any feedback or help, I am open to receiving it, as I am very much a newbie at this particular art/science. It is quite a fun one, however (especially as it is a useful tool for my day-to-day work).

I'm also hoping to apply the same treatment to a small language model at some point by taking the DAWNBench approach -- picking a good target validation loss value or some reasonable metric, then optimizing around that obsessively to build a good tiny reference model. I don't know if you'd know anyone that's interested in that kind of thing, but I feel like that would be a fun next step for me.


> I was so confused by the saltiness

I was confused by what you thought was salty. I don't see it remotely.


Same. I think it just lacks wide-eyed optimism.


Extremely interested in your take on where language/reasoning competency ends and knowledge retrieval begins.

OpenAI stuff has succeeded in part because it can synthesize good bullshit* on a huge variety of topics. For many purposes this makes it as good as asking someone in the same room to look something up for you on Wikipedia.

But while vast general and somewhat special knowledge is very impressive, comprehension and reasoning ability can exist without it. We know from our own human experience that general knowledge is useful to have, but not the same thing as intelligence or wisdom. It seems rational to think that the size of model needed to get ChatGPT's adequate level of coherence and rationality is much less than that required to also encode sufficient general knowledge to be informative on just about any topic, most of which are not language specific.

* in the Frankfurtian sense of 'information provided without regard to its correctness'


This is why it has always seemed to me that the 'chat bot' -> AI pathway has felt quite analogous to the 'chess bot' -> AI pathway. We're constantly trying to replicate things that look like demonstrations of intelligence, but never really bothering with what intelligence is. What I mean is that a man lifting 400kg is a demonstration of exceptional athleticism. A 400kg man sitting on a balance and having 400kg go up on the other side is not, even though if we only observe the output (400kg goes up) then it is absolutely identical.

This isn't just a 'only humans can be intelligent' type argument, but emphasizing that what we want and what we're pursuing seem to be quite different. Newton deriving the inverse square law of gravitational attraction by observing things fall on Earth and watching the celestial bodies in the sky - that is an application of the sort of intelligence that we want. Asking a student to memorize and later recite that the gravitational force is proportional to m1*m2/r^2 is the sort of intelligence that we're building. And it's not like the latter leads to the former, of course it's the exact opposite!


We don't know how to define the essence of intelligence, so all we can do is knock down all the strawmen we traditionally attribute to intelligence, until we're either left with the essence of true intelligence, or we knock everything down and find intelligence was just a bunch of tricks after all.


I don't think the essence is especially elusive. It's the ability to make novel, meaningful, and useful discoveries from precepts that don't immediately "obviously" lead to those discoveries. Observing the sky and nature leading to a mathematical formulation of gravity is an absolutely amazing leap. In part because of what was done, but perhaps even more so for even beginning to imagine it was something that could be done.

Imagine stargazing in a time prior to Newton, observing objects falling on Earth, and somehow managing to derive an accurate mathematical formulation of something you had no reason to imagine even existed. Man hadn't been to space, let alone the moon, and so for all we knew, an apple dropped anywhere would just as well fall. Perhaps how Newton may have begun his discovery was by asking himself where it would fall, but that's a tangent.

---

Humanity's entire existence has been but a blink of time on any sort of timescale, besides our own lives. And in that time we went from the bleeding edge of technology, perhaps pun intended, being 'poke them with the pointy end' to having men travel into space, voyage to other 'planets', land on them, and imminently live on them.

A truly intelligent machine, given its capacity for practically infinite storage, infinitely more accurate recall, and arbitrarily small 'generations' (in terms of self-recursive evolutionary improvement), ought to be able to not only match this, from a similar starting base of knowledge, but move rapidly beyond it at an extremely swift rate.

So I don't think it would be particularly ambiguous, or debatable. An intelligent machine would quickly advance nearly all fields of humanity by an unimaginable amount. Given the ability of a machine to also scale its own processing capacity to arbitrarily higher degrees (while we're, more or less, stuck with fixed 'hardware'), this should be able to continue for an exceptionally long period of time as well.

You would end up completely revolutionizing humanity to a degree we can't even really imagine today - repeatedly, and it would all happen within a matter of years. I'm even being somewhat generous here by allowing for a greater intelligence to remain mindless and servile, which seems quite antithetical to intelligence. But if we can "just" get to here, as described, I think that'd be pretty compelling, servility notwithstanding.


> practically infinite storage, infinitely more accurate recall

You can already hold all of humanity's written cultural knowledge on a thumbnail-sized drive. Recall is pretty accurate, too.

> revolutionizing humanity

Tell that to anyone who lived 100 or 1000 years ago. World-wide instant communication. A cultural species of interconnected thought, a network the size of a planet. Augmented with tools of perfect memory recall, error-free precision calculations, way beyond what an unaugmented human brain can do. Welcome to the present. We even got remote-controlled robots working for us on other planets.

I'll keep my ad-blocker enabled, though. But allow me to think of it as a brain augmentation. We do what we always did: cultural evolution. We are creating tools to augment ourselves, not machines that are independent from us. As we integrate those tools into our daily routine, the tools are shaping our practice and needs. So we again create different tools, or ways of living. We diversify, then copy the successful. What's the point of creating a new god of intelligence? We have plenty of gods already. Let's maybe study and discuss non-human intelligence instead. (Edit: rewrote last paragraph.)


> I don't think the essence is especially elusive. It's the ability to make novel, meaningful, and useful discoveries from precepts that don't immediately "obviously" lead to those discoveries.

But that's trivial simply by enumerating all Turing machines that reproduce the observed outputs. The point of intelligence is that it somehow "filters" the space of possible theories in some specific, computable way.

The ideal model of this is Solomonoff induction, which orders Turing machines by Kolmogorov complexity, but that ordering is not computable. So intelligence is some computable approximation of this, but discovering the specifics of how that works is non-trivial.
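
For reference, a minimal statement of the standard construction (hedging: this is the textbook form, with U a universal prefix Turing machine and p ranging over programs):

    M(x) = \sum_{p \,:\, U(p) = x*} 2^{-|p|}, \qquad K(x) = \min \{ |p| : U(p) = x \}

M is the Solomonoff prior (the sum runs over programs whose output begins with x), K is Kolmogorov complexity, and neither is computable, which is exactly the gap being pointed at above.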


I would define intelligence not in terms of search, but creation. And the two are indeed different. As one simple example - early man had no concept of math, or even numbers. Incidentally, the same is even true of some isolated tribes to this day [1]. Somehow we created numbers, seemingly from nothing. And it was only this creation that enabled us to move onto even more creation where the search space continues to grow ever wider, yet we continue to pull something from nothing. It's not like we had any real basis for the formulation of numbers, or even reason to imagine they existed.

The further you go back in our development, the greater the distinction between creation and search becomes. The article itself even gets into a bit of a paradox on this note. It suggests that language defines thought, and since they have no numbers in their language - they cannot think about numbers. But then how do we have numbers? Somebody was certainly able to, and it's not because they started with numbers in their language. And for that matter how do we even have language? Another thing that was developed from absolutely nothing. Go far enough back in our evolutionary timeline and we wouldn't have even had the ability to express e.g. 'angry noise'. Yet somehow, we created such things - again seemingly from nothing. And I think that is the purest essence of intelligence.

[1] - https://www.nature.com/articles/news040816-10


But everything expressible by humans is expressible by a Turing machine, so there is no fundamental difference between search and creation since Turing machines are recursively enumerable.

We didn't create numbers seemingly from nothing; they were necessary to track our food and our children or family. Even crows have the ability to count.


Again, read the paper. Numbers were thought to be an intuitive concept - they are not, not even amongst humans. One no more needs numbers to keep track of their family or food than one needs calculus to say, "Wow, that thing's speeding up." They have "one", "two", and "many", and that works for all their purposes.

Basically, look to any example where what is discovered is not a recombination of preexisting knowledge but the emergence of new knowledge, and you'll find search is pointless. As an example, consider hand washing. Nowadays we all intuit that hand washing is a good way to prevent the spread of disease. But of course that intuition is because we all know of and accept a germ-based theory of disease. A couple of hundred years ago this was not true. Go a little further back and the concept of germs did not even exist. And so surgeons did not regularly wash their hands, even before doing things like surgery.

Now I challenge you, even in wildly hand-wavey fashion, to describe the creation of a Turing machine that could, from the basis of knowledge of an individual of such times, "discover" the secret of hand washing. There were no records kept on hand washing: illness rates or anything of the sort, because nobody even stopped to consider the impact it might be having.

The difficulty you're going to face here is that there is no preexisting knowledge to draw upon. You are not "searching" for an answer, but having to create it, seemingly from nothing.


> Again, read the paper. Numbers were thought to be an intuitive concept - they are not, not even amongst humans.

Your link doesn't prove anything, it contains testimony from experts arguing both sides. Odd that you think one side is automatically correct from one study that was inconclusive and a clear example of an exception to the rule, at best.

> Basically look to any example, where what is discovered is not a recombination of preexisting knowledge but the emergence of new knowledge and you'll find search is pointless

I think you'll find it much harder to argue this point than you think. Most such discoveries result from simple observations of the world, so the information was already out there, people just didn't notice it before.

> Now I challenge you to, even in wildly hand-wavey fashion, to describe the creation of a turing machine that could, from the basis of knowledge of an individual of such times, "discover" the secret of hand washing

Exactly the way it happened: someone noticed that fewer people died in hospitals where the doctors washed their hands after performing autopsies. The scientific process is reliable because it's mechanistic, repeatable. All scientific knowledge derives from simple, repeatable observations like this.

The closest thing you'll find to true invention is maybe math and various logics. But even then, this is often simply a process of permuting existing axioms, and adding a new randomly generated axiom to see if anything interesting happens. This is a search process, most of whose results will be internally inconsistent and so get discarded quickly by the human mind with its effective pattern matching.


Hand washing records were not kept for the same reason I've mentioned multiple times - nobody ever thought it relevant, so it wasn't considered relevant. So it's not like you can simply search the records for something which does not exist! Even when one doctor finally did discover the value of handwashing, through a remarkable degree of serendipity, his hypothesis, based on anecdotal evidence, was rejected because it did not line up with scientific thought of the time. He ended up in an insane asylum and died. Like a Greek tragedy, his cause of death was an infected wound on his hand, very possibly caused by excessive washing! [1]

So now we return to the same question. How do you expect a machine to simply discover the value of hand washing? Let alone carry out tests? You have no data on hand washing whatsoever as it's not seen as relevant. It's a rather random hypothesis that, given the knowledge of the time, would have less than zero basis for support. There is no logical reason for its discovery, nor ought it ever be prioritized highly in any way whatsoever.

And this is, in many ways, the rule more than the exception for discovery.

[1] - https://www.nationalgeographic.com/history/article/handwashi...

----

On the numbers issue: I was not referring to "views" but facts. The tribesmen had terms only for 1, 2, and many. And while the article doesn't mention it, it's safe to assume they have no system of mathematics. Assuming they are not uniquely limited, we were all in a similarly limited state of understanding at some point. Searching for where we are, from the state of where they are, will yield no results. Yet somehow, we achieved it.


hence the "artificial" in the name


On number 2, even if you are John Carmack you may have trouble getting the right people on the phone.

https://twitter.com/id_aa_carmack/status/1305967411749892098...

Anyone at Google Cloud out there? It seems I can't get my GPU quota raised to 40 x V100 as an independent researcher. I was told that setting up a website would help, but I would rather not. I can pay the bills...


If you're actually an independent researcher, sometimes you can find professors at universities or national labs that are willing to help out in exchange for credits on the paper. I've had success at [redacted] labs in the New Mexico region as well as folks from my previous university. The trick is asking people who do research that's sort of adjacent to your field.


@ingenieroariel, sorry for this trouble. I am product manager for Cloud TPU, I would be happy to connect you with my GPU colleagues and also explore if Cloud TPU can help with your research as well. What's the best way to connect with you?


I gave up and moved to Lambda Labs, then they ran out of quota across the board, and now I use a combination of Vast and CoreWeave.


b-but you work for- why wouldn't they... you know what, never mind.


Oh, I quit recently. Very surprisingly, I learned it was harder to get access to GPUs at big tech companies outside their dedicated research teams than it was for a scrappy hacker on the outside working on side projects with lots of savings. So I quit to work on those projects. I miss having a dedicated infra team, but I don't miss having to beg for resources. I wanted to use Google Cloud, since I use some of their other services for this, but I could not figure out how to get them to increase quotas and take my money. I was willing to pay nearly double the hourly rate I am currently paying at CoreWeave to use only one cloud provider, but they just wouldn't sell it to me.


Out of curiosity, just how many resources are you able to access with Google Colab, and can you use it for AI training? I've been becoming more interested in fintech and trying my hand at predictive/pattern-recognition AI, but the costs are incredibly prohibitive to getting started, both hardware and data.


It is sometimes far more convenient to avoid the paper trail and bureaucracy of having to provision things internally, similar, I'm sure, to how renting GPUs online for short periods of time avoids having to pay the maintenance and time costs of keeping them onsite.


You can skip to step 4 using something like GPT-J as far as I understand: https://github.com/kingoflolz/mesh-transformer-jax#links

The pretrained model is already available.
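
As a rough sketch, loading it with the Hugging Face transformers library looks something like this (hedged: assumes the library is installed and you have roughly 25GB of memory for the fp32 weights; the repo id is EleutherAI's published checkpoint):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

    inputs = tok("The meaning of life is", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20)
    print(tok.decode(out[0]))

From there you would fine-tune on instruction data rather than pretrain from scratch.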


GPT-J I think hasn't gone beyond 20B parameters, and while it is not the most obvious reading, I think the original question is asking about the full 175B+ parameter kind of model. :) :thumbsup:


It's cool you're on here and I'm sure I speak for many people in saying I really appreciate your video series and comments like the above!

Thank you!


>> de-duplicate, clean, wrangle. this is a lot of work regardless of $.

This sounds like a great job for a specialized GPT!


Thanks for laying out the plan. I was trying to understand the cost of each of these steps below and started wondering about the following:

> rough steps:

> 1. collect a very large dataset, see: https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla... . scrape, de-duplicate, clean, wrangle. this is a lot of work regardless of $.

Pile seemed quite clean and manageable to me (I was able to preprocess it in ~8 hours for a simple task on consumer-grade hardware). Is the Pile clean and rich enough for LLM training too?
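
For anyone curious, the Pile shards are zstd-compressed JSON lines with a "text" field, so a preprocessing pass over one shard is just (the shard path and the zstandard package are assumptions):

    import io, json
    import zstandard as zstd

    with open("pile/train/00.jsonl.zst", "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            text = json.loads(line)["text"]
            # clean / filter / de-duplicate here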

> 2. get on a call with the sales teams of major cloud providers to procure a few thousand GPUs and enter into too-long contracts.

It seems like the standard InstructGPT model itself is based on a ~1 billion param GPT model. Wouldn't that fit on a 24GB RTX 3090? Might take longer, maybe not enough opportunity for hyper-parameter search, but still possible, right? Or is hyper-parameter search on a thousand machines in parallel the real magic sauce here?
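
Back-of-envelope (hedged: assuming a ~1.3B-parameter model and plain fp32 Adam, which needs roughly 16 bytes per parameter before activations):

    params = 1.3e9
    bytes_per_param = 4 + 4 + 8  # fp32 weights + grads + Adam m and v states
    print(params * bytes_per_param / 2**30)  # ~19.4 GiB, activations excluded

So it only just fits in 24GB before counting activations; mixed precision, gradient checkpointing, or offloading would be doing the heavy lifting rather than hyper-parameter search.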

> 3. "pretrain" a GPT. one common way to do this atm is to create your own exotic fork of MegatronLM+DeepSpeed. go through training hell, learn all about every possible NCCL error message, see the OPT logbook as good reference: https://github.com/facebookresearch/metaseq/blob/main/projec...

Sounds like a good opportunity to learn. No pain, no gain :-)

> 4. follow the 3-step recipe of https://openai.com/blog/chatgpt/ to finetune the model to be an actual assistant instead of just "document completor", which otherwise happily e.g. responds to questions with more questions. Also e.g. see OPT-IML https://arxiv.org/abs/2212.12017 , or BLOOMZ https://arxiv.org/abs/2211.01786 to get a sense of the work involved here.

Maybe somebody will open source the equivalent datasets for this soon? Otherwise the data collection seems prohibitively expensive for somebody trying to do this for fun: contract expert annotators, train them, annotate/reannotate for months?


I would love to know what your thoughts are on how software engineering (and jobs in general) will change over the next 10 years, and what we lowly developers can do to keep up & maybe even be involved in that change.


Andrej wrote an interesting post some years back titled Software 2.0 about the direction he saw software engineering going. It's more about changes in software than the changes in the job market, but I suspect you'd still find it interesting. https://karpathy.medium.com/software-2-0-a64152b37c35


Thanks!


GPUs are much less efficient than cores of the type Cerebras has made.


Step number 1 is already the first problem.


Today, to me, is the equivalent of the phone phreaking days, when people are just doing as much as they can and getting away with as much as they can for as long as they can, until the regulations come.

It will be an interesting time in the next few years, as I think StabilityAI's guerrilla marketing tactics have inadvertently, by proxy, also placed the ML dataset debate right in the laps of the larger consumer market.


He's building the model from scratch, as the title suggests. He only trains a small model with ~10M parameters on a tiny dataset, something that is feasible with a single GPU. In comparison, GPT-3 has 175B parameters.

> wondering like how hard is it to actually replicate what openAI has done if you had the money to pay for the training?

It would most certainly be possible for another company to build something very similar (models of similar size have even been released publicly). I'm honestly unsure why Microsoft would rather pay $10B to acquire less than half of OpenAI when they have the hardware to do it themselves (OpenAI uses MS cloud products). Must be some business reasons I don't understand. OpenAI definitely has some very talented people working for it, though.


>I'm honestly unsure why Microsoft would rather pay $10B to acquire less than half of OpenAI, as they have the hardware to do it (OpenAI uses MS cloud products.)

Because the hardware is the least interesting part of it?

Microsoft buys the know-how, the talent, and perhaps some patents, but most importantly the GPT brand name...


Does the time to train the model increase linearly with the number of parameters, or exponentially?

In other words, GPT-3 is 17,500X the number of parameters but does that mean you can train it in 17,500X the amount of time it takes to train the 10M param model?


In theory it should be linear; however, the parallelization is not perfect, and some overlapping parts of gradients are computed on multiple GPUs at the same time, so expect some constant-factor slowdown on average.


On top of what other people have said about parallelism overheads, you normally need more data to train a bigger network and the training time is roughly proportional to network size * training data.

IIRC OpenAI used a million times more data to train GPT-3 than Karpathy used in this video, so a naive estimate would be that it would take about 20 billion times more compute. This could be a significant overestimate, since Karpathy probably used each bit of the training set more times than OpenAI did.
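
The usual rule of thumb from the scaling-laws papers is compute ~ 6 * N * D FLOPs for N parameters and D training tokens, which gives a similar ballpark (the 300B token figure is from the GPT-3 paper):

    N = 175e9         # GPT-3 parameters
    D = 300e9         # training tokens reported for GPT-3
    print(6 * N * D)  # ~3.2e23 FLOPs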


I am not from the LLM world, but I believe it's mostly constrained by the standard multiprocessing limits -- communication and synchronization of multiple workers, some of which operate over an exceedingly slow Ethernet interface.


They are buying the talent like they would when they buy any company. They are certainly not buying a single trained model.


How much of the purchase price is a barter for Azure compute-time? That may explain a lot.


By single GPU, would a normal one suffice?


I use this model to help me reason through it.

Realistically, it takes five steps. At a glance, the five steps are simple. But when you dig in, you realize that people have 15+ years of experience in each of the individual steps. At a glance, their insight into the individual problems will seem way too big to actually use, but as you dig deeper, you’ll find edge case after edge case that uses those insights.

So, it’s really five steps, but very smart people have devoted their lives to figuring out each one of those. We’re lucky to live in a time where we can stand on those shoulders, but it can take quite the leap to get up there in the first place.

And of course, at each point you’ll get some nice rewards. It feels good. It’s exciting. And my inner twelve year old feels like this is why we fell in love with tech in the first place.

So, it is quite hard but the fact it exists means it’s obtainable. If you get into it, I have a lot of respect for you and hope you have a ridiculous amount of fun. On the other hand, if you get into it and just don’t enjoy it, who cares? There are many other interesting fields!!


As someone who hasn't really done much deep learning, I've always wondered if the work itself is fulfilling or if it is just the fact that there are absurdly cool outcomes. The math isn't super complex; it seems the majority of the effort is data cleaning and tuning. Is it just a massive labor of love? I also worry that the labor itself doesn't build on itself and becomes obsolete knowledge, like a web framework.


Yeah, he's explaining how you would create the base model, which is actually one of the more straightforward parts given that they've published their architecture (though I'm sure they've withheld a bit of their special sauce).

In reality, putting aside the millions of $$ needed to pay for the GPUs to train the model, the complexity actually lies in the training data acquisition/cleaning and the infrastructure needed to harness the 1000s of GPUs to train it in a remotely reasonable timeframe.

That being said there are a number of companies (Google, AI21, Cohere, and probably others) who have successfully created large language models like GPT3, so it's definitely not impossible when you have the resources.


Yeah, that would be really interesting: open source models with quality similar to GPT that even smaller players could use to train models tailored for their application. Kind of the language equivalent of what the open source community has already achieved for image models (e.g. replicating DreamBooth for Stable Diffusion). I think one problem standing in the way is that the computational requirements for language models just seem to be a lot higher.


A bit off topic but, the power of GPT (and DL in general) is in the data. Yet, we’ve allowed private enterprises to control what should be distinctly public goods. I don’t know where we took the wrong turn within the past decade but we desperately need to correct this mistake.


I am not getting the angle here. Anyone, including you, can write GPT-like code, train the model with public data, and release it for free. It may cost a few million in GPU costs to train, but if what you say is that important, surely there are folks here (I am assuming a good chunk of HN folks have a decent amount of disposable income) who will donate it for the public good? If there are not, then they either don't consider it important or are just virtue signaling. Or the effort is actually hard to implement.

I am totally okay with OpenAI being worth $30 billion or whatever when compared to crypto scams being worth billions.


I'm interested in solving the problems you mention in this space. For the sake of simplicity, I will also agree that the data and the models are free if you know how and where to look. The problem is: what then? Who has the money and/or compute capacity to do the work at a scale that can compete with the industry behemoths?

I've been slowly building out a home lab to test mesh computing in this space. Perhaps there is a way to carve the workloads into chunks that can be deployed to a distributed mesh of trusted nodes that have hardware specs suitable for the task, then somehow aggregate the results and distribute the entire package back to the network of contributors of that compute capacity. In other words, I will agree to lend you my compute capacity in exchange for a copy of the model you are training. I'd love to collaborate with folks, grow this idea, and get a legit open source project going.

Let's build the "Constellation". If anyone wants to geek out and make this happen, I'd love to chat. art.aquino at compute dot tech

Building compute clusters and cool software is a passion of mine, so I'm looking to build a network of like-minded folks, without any commitment, just to help each other.


For a distributed computing/BitTorrent-style method of running these LLMs, see: https://github.com/bigscience-workshop/petals.


How can we really stop this though?

I feel like Microsoft is doing what Microsoft does yet again. They piggybacked on the whole "open source" angle, but this time, rather than being "software vendor lock-in closed source assholes", they're now being "IP theft open source" assholes.


It's not just Microsoft. Google probably has models at least on par with OpenAI.

Facebook probably too.


> I don’t know where we took the wrong turn within the past decade but we desperately need to correct this mistake.

I don't know if these ideas are worse technically or politically, or both, but what comes to mind are these alternatives:

1. Have someone start a non-profit that curates public data goods, maybe gaining access to data through voluntary donations and through buying all the data provider feeds, and funding through subscriptions by people and organizations who want to understand the data provider infrastructure.

2. Get legislation passed that identifies public data goods and requires that they be made available to all.


Totally agree. I am also doing some GCloud training. This should also be a public good. When I move my machines to Google/AWS/..., my business changes ownership and I become a slave chained by the "competition" and "pricing".

To fix this we need to change mentality and governments.

But people are satisfied with their:

"Mirror, Mirror on the Wall, Who's the Coolest of Them All? ... You, because you are indistinguishable from the others!!!!"


Andrej’s entire series is by far one of the most useful resources I’ve found. Even as an instructor of these topics myself I learn something new about the material and about how to teach it!


Karpathy also has a great Recipe for Training Neural Networks:

http://karpathy.github.io/2019/04/25/recipe/


Just finished this, need a part 2 for the PPO reward functionality


This is really great, thank you. I would love to see a real "from scratch" version that doesn't use torch et al., though.


https://youtu.be/VMj-3S1tku0

The entire series is at https://karpathy.ai/zero-to-hero.html, first video is literally from scratch (i.e. just Python, the only external dependency is for plotting graphs if I recall correctly).


Andrej's "Building Makemore" series is exactly this, it includes a wonderful lecture where he computes all the gradients for a simple network by hand and compares them against the values produced by torch's autograd.


I find Makemore to be my favorite neural-net-based (w)rapper.


Same, but one that doesn't assume python or an operating system.


Same, but I would honestly rather we get away from silicon or anything resembling what we call these "computers" these days. Things are too complicated and we need to truly get back to the basics: rocks on a hill.


Abacus-driven deep learning with trained monkeys flipping gradients.


Outside tools are not required.

We'll selectively breed humans until one can just compute it in...meat[0].

https://www.youtube.com/watch?v=7tScAyNaRdQ


in vitro computing


You don't need to flip gradients, just need to train the monkeys to operate matchboxes: https://en.wikipedia.org/wiki/Matchbox_Educable_Noughts_and_...


Reminds me of that part in the Three-Body Problem novel where the emperor in the videogame builds a CPU out of soldiers as bits.


Sounds like a project for Ben Eater https://eater.net/


That would be next to impossible to do even in a semester long course...



Does one really need Python? What about using Lua?

https://news.ycombinator.com/item?id=7928738


Fantastic material! By the way, it has one of the simplest explanations of the difference between BatchNorm and LayerNorm.
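
For anyone skimming: the difference is just which axis the normalization statistics are computed over. A minimal sketch, assuming a (batch, features) activation tensor:

    import torch

    x = torch.randn(32, 64)  # (batch, features)
    eps = 1e-5

    # BatchNorm: normalize each feature using stats across the batch dim
    bn = (x - x.mean(0, keepdim=True)) / torch.sqrt(x.var(0, unbiased=False, keepdim=True) + eps)

    # LayerNorm: normalize each example using stats across its own features
    ln = (x - x.mean(1, keepdim=True)) / torch.sqrt(x.var(1, unbiased=False, keepdim=True) + eps)

(The real layers also add learnable scale/shift parameters and, for BatchNorm, running statistics for inference.)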


For me he is the best educator in this space bar none. When Karpathy explains stuff it just clicks in my head.


His cs231n course is how I learnt about neural networks. And now on to learning GPT and Transformers.


Now be meta and build GPT with GPT.


I really hate the phrase "from scratch" in the title.


The char-rnn guy does it again.


I am a simple man. I see a video post by karpathy, I upvote and watch.

<end reddit-speak>

I discovered Andrej very recently and I am a huge fan. Kudos to this whole effort!

Two ideas --

1. While these explainers are outstanding -- I can think of supplementary material/presentation that can nicely complement these explanations if they are presented visually. Especially the concepts of multidimensional tensors.

Something like what 3B1B (or his followers that create 'Summer of Math Exposition' videos) does -- which is not a skill I have.

I am thinking of creating some visual slides (my forte) but would there be interest in making this a larger collaboration that creates explainers for "visual learners"?

2. There should really be a discussion forum for people who follow along with these "makemore" tutorials -- to have discussions about each specific video, in fact each specific timestamped chapter of these videos, in that context.

Is there a framework or tool that lets us integrate YouTube videos and timestamped chapters into a "discussion forum" -- whether a simple website or a Discord/Slack? Once again this is slightly outside my skillset, but if it appeals to people, maybe some ideas and effort can come together to make this happen?

EDIT: #facepalm -- I see there is already a Discord [0] on the webpage [1] (not likely tied to very specific chapters as I imagined, but it should be an excellent start anyway)

[0] - https://discord.gg/3zy8kqD9Cp

[1] - https://karpathy.ai/zero-to-hero.html


In case you're not aware, 3B1B has a GitHub repo for the engine he uses for the math animations, so that others can use it to make similar things: https://github.com/3b1b/manim

There's also a loose group of people already doing the visual learners "explainers" thing over here: https://explorabl.es/ (you can scroll down for links to tools they use to make their explainers).

But yes, I also feel this is an important development and that this should be an ongoing way of teaching people things. Formal education has IMO stalled out around the printing press but there are massive opportunities on computers (and especially on globally networked computers) to take that a step further and leverage the capabilities of computers to make education even more engaging and information-dense.


fastai uses its forums to manage discussions for specific chapters. It wouldn't be hard to set up, but the instructor would have to announce it in the video or description to actually send students to the forum.

fastai forum : https://forums.fast.ai/


I had the pleasure of working under Andrej a few years ago. The problems we were solving were very interesting, until this other guy, who is a billionaire, would throw a wrench (and fits) into the whole thing.


I’d be interested to hear more. It was surprising to hear of Andrej leaving.


You mean he got hired by Elon Musk to work at Tesla? :P


Karpathy's videos (and blogs) are excellent. I wonder how history will reflect on his time at Tesla, however.


What do you mean by "however"? He built the entire FSD architecture - in less time and with fewer people than Google, Apple, and god knows who else. He's likely saved more lives than most doctors at this point...


No he didn't; a (large, from what I hear) team did. However, Karpathy is still a smart, well-accomplished guy, and he has a talent many academics don't: being able to write actual production-grade, high-quality software. And he's a good communicator.


It's probably a reaction to: https://news.ycombinator.com/item?id=34415413

which I'm sure was more a marketing error than anything else.


> likely saved more lives than most doctors at this point

Care to elaborate on this ridiculous claim?


What's ridiculous about stats? A Tesla with Autopilot or FSD is more than 10x less likely to be in a collision, based on various US agencies' stats. The average chance of collision is 1 in 366 for every 1000 miles driven. Something like 1 in 20000 of those result in fatalities. Now, take millions of Teslas equipped with any assisted-driving tech, multiply by miles driven, include the above averages, and divide by ten. Doctors would be jelly.

Perhaps you can elaborate on what strikes you as ridiculous about fairly straightforward stats? Surprised to see you on a ML thread...


Can you cite "various US agencies stats"? I say this as someone who used FSD Beta for a full year and Autopilot (with Navigate on Autopilot) for over 3 years now. It would actually help the conversation here to cite it, though I never knew third-parties evaluated Tesla's claims/statistics.


There's also a strong selection bias in a bulk stat like the one the parent mentioned. Teslas are expensive cars, and I'm guessing their drivers are likely to be older, more affluent, and more educated than average - this all correlates with fewer accidents/fatalities regardless of FSD [1][2].

[1] https://pubmed.ncbi.nlm.nih.gov/24103823/

[2] https://personalinjurylawyersaustintx.com/blog/education-lev...


You are absolutely correct - Tesla's own report acknowledges this, but it also shows the collision rate dropping much further when Autopilot is active: https://tesla-cdn.thron.com/delivery/public/image/tesla/5f93...

Reference: https://www.tesla.com/VehicleSafetyReport


Millions of Teslas equipped with 'any assisted-driving tech' != Autopilot or FSD. We have to compare deaths in Teslas with deaths in other vehicles too, as that's what people would use if not a Tesla.

The chance of fatality per collision is way off.

If you’re going to go down a rabbit hole we need to look at lives impacted by lithium mining compared to regular combustion engine vehicles.

While I appreciate your point I’d be surprised if the number of QALYs is higher from working on FSD at Tesla compared to being a doctor.


You're probably right that fatalities per collision is off but plug in your own numbers and then divide by ten... the point still stands. Autopilot has saved countless lives. And it was architected by a very small team, led by this one man. But I'll admit, the comparison with doctors is a little hyperbolic. Although in a few years of growth and global scale, who knows...?


Tesla kick started the global EV revolution, so probably favorably regardless of the few negatives that inherently come along the way.


As if history will care?


The history AI will have an opinion that mirrors the politics of the user who asks it what it thinks of Karpathy's time at Tesla.


Only in the default bubble mode.



