GPT4 is 8 x 220B params = 1.7T params (twitter.com/swyx)
388 points by georgehill on June 21, 2023 | 206 comments



weird title, note that the tweet said "so yes, GPT4 is *technically* 10x the size of GPT3, and all the small circle big circle memes from January were actually... in the ballpark?"

It's really 8 models that are 220B, which is not the same as one model that is 1.7T params. There have been 1T+ models via mixtures of experts for a while now.

Note also the follow up tweet: "since MoE is So Hot Right Now, GLaM might be the paper to pay attention to. Google already has a 1.2T model with 64 experts, while Microsoft Bing’s modes are different mixes accordingly"

There is also this linked tweet https://twitter.com/LiamFedus/status/1536791574612303872 - "They are all related to Switch-Transformers and MoE. Of the 3 people on Twitter, 2 joined OpenAI. Could be related, could be unrelated"

Which links to this tweet: "Today we're releasing all Switch Transformer models in T5X/JAX, including the 1.6T param Switch-C and the 395B param Switch-XXL models. Pleased to have these open-sourced!"

Anyway... remember not to just read the headlines, they can be misleading.


More Efficient In-Context Learning with Generalist Language Model (GLaM) on the Google blog (2021): https://archive.is/cuyW0

Btw, MoE is one of the 5 other ways Google thinks LLMs can be more efficient: https://archive.is/2XMvh


(OP here) - yeah i know, but i also know how AI twitter works so I put both the headline and the caveats. i always hope to elevate the level of discourse by raising the relevant facts to those at my level/a little bit behind me in terms of understanding. i think there's always a fine balance between getting deep/technical/precise and getting attention, and you have to thread the needle in a way that feels authentic to you to do this "job"

ultimately my goal is to Learn in Public and demonstrate to experts that spending their time teaching/sharing with me is a good use of time because i will augment/amplify/simplify their message.

(pls give the full podcast a listen/read/watch, George went deep on tinygrad/tinybox/tinycorp and there's lots there he IS the authority on, and people are overly fixated on the GPT4 rumor https://www.latent.space/p/geohot#details )


Makes it worse, since you have this understanding and still went with this explanation that it's 1.7 trillion patients


1.7 trillion patients? We're going to need more hospital beds ...


Oh, is that how Bing's image descriptions work? If you upload an image and ask for a description, it'll give a really good one. But if you ask it where it got it from, it says it got it from another algorithm. So is it using another expert AI to describe, and then feed it into GPT4 for conversion into language?


Pretty sure this is how the human brain works. If we add models that can transform information into updates to existing models, then add a model that can process streams of information and determine which models to update with the information (and can also apply that process to itself), we start to approach a primitive brain, no?


> Pretty sure this is how the human brain works

That's how we thought it worked in the 1960s, yes.


How do we think it works currently?


For a moment I thought you were Andrej Karpathy!


Title should be: "GPT4 is 220B"

Factual, and surprising to many.


GPT-4 is 1.7T params in the same way that an AMD Ryzen 9 7950X is 72 GHz.


You might be surprised to learn that dell "hpc engineers" (or maybe HPE?) have attempted to sell me hardware using that exact logic to compare servers.


Well, they're not wrong...

Given a highly-parallelizable numerics job, and all other things being equal, a 20 core 2.0 GHz machine will run the job in the same time as a 40 core 1.0 GHz machine.

...though usually we'd just give it in FLOPS - though FLOPS doesn't describe non-float operation performance.

Probably just a sales/marketing person who learned a new technical word and wants to sound knowledgeable to the customer, but it backfires.


At the risk of torturing this analogy beyond recognition, it’s more like having an 8 core CPU but with a weak 150W PSU such that you can only run one core at a time.


Analogy sufficiently tortured


Is each core specialized to some particular type of task or something like that? Which would make this less silly of an idea.


in my understanding with an 8-way MoE you're still running all 8 for every request. only if you're making a router model would it be the "weak PSU" analogy. and even then a sufficiently perfect router would essentially have to know enough about its routes to almost simulate them, which idk if i can believe works


Amdahl's law though ...


...doesn't really bite you for embarrassingly parallel numerics work.


VMware still uses GHz for displaying load and compute capacity on your servers.


Add the hurts please!


No, GPT-4 is 1.7T params in the same way that an AMD Ryzen 9 7950X is (24 times X million transistors). It's not wrong to say GPT-4 is 1.7T parameters, your statement is wrong. Parameters can be configured in different ways, just as transistors on a chip can. In the same way that having 24x transistors does not imply 24x performance, having 24x parameters does not imply 24x performance.


I think frequency is not additive like the parameters. If we are looking for analogy in compute power, then FLOPs is a better analogy to parameters.


I'm fairly certain the point of the gp was that the number of parameters is also not additive.


Parameters are the number of weights needed to specify the model given the architecture; they are exactly additive.


A tree is not a clique.


if you have a basket with 4 apples and a basket with 3 pears, do you not have 7 fruits?

just because the first selection layer is very thin doesn't mean that the network cannot be considered composable


You can’t claim it’s a 7-fruit Basket, unless you find a way to fit them in the same basket


but you can claim it's a 7-fruit grouping of baskets, we're just in a territory where the fruit count matters less and less now

which happens to be consistent with what openai is telling us


If you can add FLOPs, and you have a number of cores with the same fixed number of FLOPs/cycle, it follows that you can add their frequency.

For the vast majority of real-life workloads there are vast differences between one core having X GHz or Y FLOPs vs lots of cores that sum up to that number. Which is the point GP tried to make.


I wouldn’t trust anything geohot says. He doesn’t have access to any inside information.


https://twitter.com/soumithchintala/status/16712671501017210...

It looks like at least one other person has also heard the same information.


To be fair, this was already a common whisper at the time, so it could be a Chinese whisper effect. Even I thought this on release day: https://news.ycombinator.com/item?id=35165874

What is weird is how competitive the open-source models are to the closed ones. For instance, PaLM-2 Bison is below multiple 12B models[0], despite it being plausibly much bigger[1]. The gap with GPT-4 is not that big; the best open-source model on the leaderboard beats it 30% of the time.

[0]: https://chat.lmsys.org/?leaderboard

[1]: https://news.ycombinator.com/item?id=35893357


From my perspective, there's a vast divide between open source models and GPT4 at present. The lmsys leaderboard rankings are derived from users independently engaging with the LLMs and opting for the answers they find most appealing. Consequently, the rankings are influenced not only by the type of questions users ask but also by their preference for succinctness in responses. When we venture into the realm of more complex tasks, such as code generation, GPT4 unequivocally outshines all open source models.

That said, the advancements in models like Orca and the "Textbooks are all you need" paper are noteworthy (https://arxiv.org/pdf/2306.02707.pdf, https://arxiv.org/abs/2306.11644). I'm optimistic about what future smaller models could achieve.


I agree. Also, ELO rankings can artificially deflate strong performers when having too few binary comparisons. The Guanaco paper points that out, giving instead a 326-point lead to GPT-4 over Guanaco-65B across 10k orderings, based on GPT-4’s opinion, which corresponds to Guanaco winning about 13% of the time. But even that only relies on the Vicuna benchmark set.


There are huge differences. I used OpenAI’s APIs in my last book on LangChain and LlamaIndex, and GPT-3.5 and GPT-4 are good enough right now to support building applications for years (although I look forward to improvements).

I am writing a new book, Safe For Humans AI, in which I am constraining myself to using open models that can be run on a high end PC or a leased GPU server. Yesterday I was exploring what I could do with T5-flan-XXL, and it is useful, but not as effective a tool as the closed OpenAI models. I will do the same with Orca this week.


I think a lot of people see a larger gap between open-source models and GPT-4 than is present because they test out the 7B models which fit on their machine. Models like T5-FLAN-XXL are very far below the quality one can expect from the best in open-source, and barely usable for CoT or tool use.

Especially for LangChain, I recommend using ≥33B models like Guanaco or WizardLM. Guanaco-65B honestly feels on par with ChatGPT-3.5. (To be clear, there is a large gap with GPT-4 though.) It is a costly test, although GPTQ (for instance on exllama) helps make it affordable.

I haven’t tried Orca since they haven’t released the weights yet, but it doesn’t seem like they have a 33B version.


Thanks for your advice! I am trying to write the book examples using an A6000 (48 GB of video memory), but I may have to go higher end. The hourly lease charge on an A6000 is very inexpensive and I wanted the examples to be "almost free" to run. I will see what is required for Guanaco-65B.


> The gap with GPT-4 is not that big...

Oh it is though. I've tried several OS models and nothing comes even close to GPT-4. Turns out ClosedAI has a moat after all.


AI development is taking place at breathtaking velocity and GPT-4 isn't the end of the road. OpenAI's moat relies on them improving at a faster pace than competitors. I wouldn't take it for granted that OpenAI's lead will remain the same in a year as it is today. Very possibly they could expand their lead. Also possibly, their lead could vanish.


Individual developers' hardware and data have limits; I don't see the OS models breaking into 100B parameters anytime soon


GPT-4 came roughly 12 months after GPT-3.5, and I'd say the competition is roughly approaching 3.5 now, so I'm hoping they will reach GPT-4 level early 2024. Not that far off!

Wish there was a bookmark/remind-me function so I could come back in 8 months' time and see how close I was.


It’s funny that this post is trending on HN right next to the post about a paper showing how to build a model 1000x smaller than 1.7T that can code better than LLMs 10x larger.


I don't find it funny, I find it scary and mind-blowing: the impact of these headlines is additive - this one confirms the effectiveness of combining models, and the other one suggests you could cut the model size a couple orders of magnitude if you train on clean enough data. Together, this points at a way to achieve both GPT-4 that fits on your phone, and a much more powerful model that's not larger than GPT-4 is now.


If it truly is the training data that's making models smart, then that would explain that there is both a minimum and maximum "useful" size to LLMs. The recent stream of papers seems to indicate that the cleaner the input data, the less size is required.

That would negate, at least partially, the "we have 20 datacenters" advantage.


I think it is. I also think this is what OpenAI did. They’ve carefully crafted the data.

I don’t think they have an ensemble of 8 models. First, this is not elegant. Second, I don’t see how this could be compatible with the streaming output.

I’d guess that GPT4 is around 200B parameters, and it’s trained on a dataset made with love, that goes from Lorem Ipsum to a doctorate degree. Love is all you need ;)


Is it a large language model anymore? Something doesn't quite add up.


It's still large. But it might no longer need to be "only few entities on the planet can afford to make one, and not at the same time, since NVIDIA can pump out GPUs only so fast" large.


This doesn't seem likely in the near term. An iPhone 13 Pro has 6 GB of memory, which might be enough for a single 1.3B parameter model if you purged everything else out of RAM. Combining 8 (or even 4) of them on a single phone won't happen anytime soon at the rate phones improve.

Plus, none of the smaller models are really appreciably close to GPT-4 on most metrics. It's not clear to me you could get there at all with 1.3b models. Maybe somebody gets there someday with 65b models, but then you're far out of the reach of phones.


Are you assuming 32 bit parameters? You definitely don't need to keep that much. People have been cramming llama into 3-4 bits. You wouldn't want 8x1.3b but it should fit.

Also Apple has never gone for high RAM in mobile devices. I could go get 12GB in a brand new phone for $400, and some high end phones have had 16GB since 2020.

So combined, you could do normal app stuff with 2.5GB and 20-25b 3-bit inference with the other 9.5GB.



2x0=0


It’s fine to say you don’t believe him, it’s fine to say he’s wrong, but if you want to claim he doesn’t have inside information, I would expect a standard of evidence as good as that you would expect for Geohot’s claim.

So tell me: how do you know Geohot doesn't have inside information?


It's up to Geohot to prove his claim. Although he may have insider information, he isn't in a position known to have inside information. It might be more correct to phrase it that way, but we all got the meaning.


That's not how it works out in practice. For example, journalists who have insider info from sources and leak it on the regular don't do anything to prove they have insider information, they build up reputation based on how many predictions they make.

There's no point in outing your sources, that's how you lose them.


Until reputation is built you aren't trusted. It seems geohot's reputation still doesn't convince everyone.


He has a reputation but it's for being untrustworthy.


Prove a negative you mean?


It's up to anyone making a claim, if they want people who aren't yet convinced to become convinced, to provide some supporting evidence. This basic rule of discourse applies whether one makes a positive claim or a negative claim.

Bluedevilzn made a negative claim, so naturally I'm asking if they can prove their negative.


He doesn't strike me as the type of person to lie (except when trolling). His reputation is solid enough that I'm sure he's had discussions with people in the space.


>his reputation is solid

Eh, is it? Not sure if I consider him an authority on anything anymore.

https://www.reddit.com/r/ProgrammerHumor/comments/z2y8i0/fro...

>This is the interview. Build this feature. You don't get source access. Link the GitHub and license it MIT.

is akin to "Build this for free, license it MIT so I can use it without any issues, and oh, btw, I dont have authority to hire you, teehee."


I remember watching a live stream [1] of him going through Twitter engineering articles and trying to reverse engineer the Twitter frontend and backend, and he clearly had absolutely no clue of how anything remotely related to the web works.

He was just clicking around compulsively, jumping between stuff randomly without even reading it, while not understanding what he was looking at. I have no idea how he's successful in tech, judging from what I saw there.

[1] https://youtu.be/z6xslDMimME


> I have no idea how he's successful in tech, judging from what I saw there.

I think judging only from what you saw there is the issue. If you look somewhere like Wikipedia [0], you'll see he was the first person to jailbreak the iPhone, the first person to achieve hypervisor access to the PS3's CPU, he took first place in several notable CTFs (at least one time as a one-person team), he worked on Google's Project Zero team (and created a debugger called QIRA while he was there), he founded comma.ai, and the list goes on.

[0]: https://en.wikipedia.org/wiki/George_Hotz


He's good at breaking APIs. That's a different skill from making APIs and things.


Well he picked the right problem at the time which was (and still is!) search. I was really looking forward to some progress on this and then he deleted tweets and dropped off the system. I would love to see a podcast of him talking about that & fyi @realGeorgeHotz is back on it now to promote tinygrad/tinybox/tinycorp.


As someone who has spent 20+ years working in large companies you can spot developers like him a mile away i.e. the sort of behaviour you see with skilled but arrogant grads/interns.

The right approach when you're new is to quietly pick a simple problem away from the core services where you can learn the processes and policies needed to get something into Production. More so when you're in a company that is undergoing a brain drain.

Then you can progressively move to solving more demanding and critical issues.


You should write a book or a blog about how to not rock the boat at large companies. Maybe you can call it Going Nowhere Fast.


Large companies normally (unlike Twitter) have achieved strong product market fit and are all about tweaks that deliver marginal revenue. Given that they already have a big slice of the market, the multiplier on marginal revenue makes what you're doing worthwhile.

"Rocking the boat" is more likely than not to make things worse.


You don't need a book about it when it's a concept children understand.

Before you run you learn to walk.


Some people were just born to run. Their legs don’t really work right for walking.


This wasn’t very impressive: https://www.pcmag.com/news/hacker-george-hotz-resigns-from-t...

Spent two weeks trying to find someone to build a faceted search UI and then quit.


I think deciding to get away from the Musk/Twitter debacle as soon as you realize how bad it is isn't necessarily a bad thing..


How is it a debacle and how is it bad?


Because Twitter is trending towards bankruptcy.

a) It is valued at about a 1/4 of what it was purchased at.

b) Twitter Blue has generated an irrelevant amount of revenue and churn is increasing [1].

c) Roadmap looks poor. Video is a terrible direction where only Google, Amazon, TikTok etc have been able to make the numbers work and that's because it is subsidised through other revenue sources. Payments is DOA given Twitter's inability to comply with existing regulations let alone how difficult KYC/AML is to manage.

d) Regulatory and legal risk increases by the day. Lawsuits continue to pile up and EU/FTC are hovering around as Twitter is not in compliance with previous agreements.

e) Brand safety continues to be a huge challenge that isn't solvable without effectively going back to what Twitter was previously.

f) BlueSky and Instagram are both releasing competitor apps to the broader public in the coming months. The market simply won't sustain this many text-based social media apps.

[1] https://github.com/travisbrown/blue


a) Meta had similar swings of over 3x between a 2021 high and a 2022 low. We can only guess what Twitter would be worth today if the stock was floated, but it isn’t. Substituting market valuations with our personal beliefs isn’t interesting.

b) Every social media company has a well populated graveyard of failed experiments behind them.

c) Opinion.

d) Business as usual for every social media company since the dawn of time.

e) Opinion. And advertiser behaviour suggests otherwise.

f) History is littered with new entrants which fail to unseat the incumbent. It does happen, but it’s statistically rare.


a - f) Opinion.


Precisely. It’s all opinion.


Except for data points like Twitter Blue subscribers, number of active lawsuits, failed video startups etc.

As well as comments from Musk himself about the decline in the value of Twitter and from advertisers themselves about the challenges around brand safety.


Twitter was trending towards bankruptcy anyway.

At best it has sped up towards that destination, at worst changes will eventually avoid it.


g) bot account traffic has gone through the roof after constraints were removed. Follows from random bots occur multiple times a day on accounts with a couple dozen followers

h) the tweets served are no longer weighted towards followers but instead whoever paid for blue checks

i) accounting on views is entirely wrong with a tweet registering a billion views


His biggest issue was trying to convince the remaining engineers and Elon that a refactor was more promising than continuing to build on the layers of spaghetti code that runs twitter.

I don't disagree with him, it's just clear that he didn't understand how dire the financial situation was and that even a progressive refactor starting with the most basic features would take considerable engineering hours and money.


As the saying goes, if you don’t have time to do it right, you’d better have time to do it twice.


He's a crypto scammer. Look up cheap eth


I mean that was marketed as a memecoin from the beginning. More so than doge even.


He added some code to give himself tons of it and didn't report it to anyone, dismissed it as a joke or something when it was found and then disappeared after hyping the project. Sound familiar?

To me he joins the ranks of the most basic shady crypto types.


You mean he has become Elon Musk?


That's luckily not relevant here. Multiple sources have independently confirmed the same rumors. This means that there is a significant probability of it being true.


What exactly is "it"? What is being discussed here? The size of the model? Who cares? People will make the same API call to get the results. Everyone is acting like a glorified AI expert now smh


The rumor here is that GPT-4 isn't one giant model, but instead a "mixture of experts" combination of 8 smaller GPT-3 sized models with different training sets.

This has implications for what may be possible on consumer hardware.


I find it interesting that geohot says it is what you do "when you are out of ideas." I can't help but think that having multiple blended models is what makes GPT-4 seem like it has more "emergent" behavior than earlier models.


OpenAI has always stuck to “simpler” approaches that scale. It’s the best bet to make and it’s paid off time and time again for them. I don’t think it’s so much “they’re out of ideas”, it’s just an idea that scales well


Yeah, and who says that this is all that they're currently working on? They likely found this to work better, launched, and did other things in the meantime.


Are the models specifically trained to be experts in certain domains?

Or are the models all trained on the same corpus, but just queried with different parameters?

Is this functionally the same as beam search?

Do they select the best output on a token-by-token basis, or do they let each model stream to completion and then pick the best final output?


If it's similar to the switch transformer architecture [1], which I suspect it is, then the models are all trained on the same corpus and the routing model learns automatically which experts to route to.

It's orthogonal to beam search - the benefit of the architecture is that it allows sparse inference.

[1] https://arxiv.org/pdf/2101.03961.pdf
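For intuition, here's a minimal sketch of that switch-style top-1 routing (toy sizes, not anything OpenAI has confirmed): a small learned router sends each token to exactly one expert, so only a fraction of the parameters run per token.

    # Minimal sketch of switch-style top-1 routing (after arXiv:2101.03961).
    # Layer sizes and expert count are illustrative.
    import torch
    import torch.nn as nn

    class SwitchFFN(nn.Module):
        def __init__(self, d_model=512, d_ff=2048, n_experts=8):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)   # learned routing weights
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):                       # x: (tokens, d_model)
            probs = self.router(x).softmax(dim=-1)  # routing distribution per token
            top_p, top_i = probs.max(dim=-1)        # pick the single best expert
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = top_i == i                   # tokens routed to expert i
                if mask.any():
                    # only the chosen expert runs for each token (sparse inference),
                    # scaled by the router probability as in Switch Transformers
                    out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
            return out

    x = torch.randn(16, 512)   # a batch of 16 token embeddings
    y = SwitchFFN()(x)         # each token only touches one expert's weights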


So in layman's terms, does this mean that on top of a big base of knowledge (?) they trained 8 different 220B models, with each model specialized in different areas, in practice like an 8-unit "brain"? PS. I'm thinking now about how the human brain does something similar, as our brain is split into two halves and each one specializes in different tasks.


Yeah, that's pretty close. It might be more precise to say they trained one big model that includes 8 "expert networks" and a mechanism to route between them, since everything is trained together.

There isn't a lot of public interpretability work on mixture-of-expert transformer models, but I'd suspect the way they specialize in tasks is going to be pretty alien to us. I would be surprised if we find that one of the expert networks is used for math, another for programming, another for poetry etc. It's more likely we'll see a lot of overlap between the networks going off of Anthropic's work on superposition [1], but who really knows?

[1] https://transformer-circuits.pub/2022/toy_model/index.html


Thank you for the explanation, I still have a hard time understanding how transformers work so amazingly well and tech is already quite a few steps over that idea.


Andrej Karpathy's "zero to hero" series [1] was how I learned the fundamentals of this stuff. It's especially useful because he explains the why and provides intuitive explanations instead of just talking about the what and how. Would recommend it if you haven't checked it out already.

[1] https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxb...


They probably trained all 8 experts on the same data. The experts may have become good at different topics, but no human divided up the topics.

The output isn't just the best of the 8 experts - it is a blend of the opinions of the experts. Another (usually smaller) neural net decides how to blend together the outputs of the networks, probably on a per-token basis (ie. for each individual word (ie. token), the outputs of all the experts are consulted and blended together, and a word is picked (sampled), before moving onto the next word)
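A toy version of that blending step, to make it concrete (the gating net, shapes and vocabulary size are all made up; this is just the "weighted mix of expert opinions" idea, not anything known about GPT-4):

    # Toy per-token blending of expert outputs by a small gating network.
    import torch
    import torch.nn as nn

    n_experts, d_model, vocab = 8, 512, 32000
    gate = nn.Linear(d_model, n_experts)             # the small "blender" network

    def blend_next_token_logits(hidden, expert_logits):
        # hidden:        (batch, d_model)           current context representation
        # expert_logits: (batch, n_experts, vocab)  each expert's next-token logits
        weights = gate(hidden).softmax(dim=-1)       # how much to trust each expert
        return (weights.unsqueeze(-1) * expert_logits).sum(dim=1)  # weighted blend

    hidden = torch.randn(2, d_model)
    expert_logits = torch.randn(2, n_experts, vocab)
    logits = blend_next_token_logits(hidden, expert_logits)    # (2, vocab)
    next_token = logits.softmax(dim=-1).multinomial(1)         # sample a word, move on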


I guess that neural network has to be capable of identifying the subject and knowing at every moment which network is most capable for that subject; otherwise I can't understand how it could possibly evaluate which is the best answer.


Results of this sort of system frequently look almost random to human eyes. For example one expert might be the "capital letter expert", doing a really good job of putting capital letters in the right place in the output.


Democracy of descendant models that have been trained separately by partitioning the identified clusters with strong capabilities from an ancestor model, so, in effect, they are modular, and can be learned to be combined competitively.


Heh. I understand all these words separately. Btw, which of the parent comment's questions is this an answer to?


Ancestor model = pre-trained model that used a large diverse corpus

Descendant models = models fine tuned from an ancestor on one particular domain, e.g. by partitioning your training data by subject or source

Democracy = some weighted mix of the descendant models is used to find the next token


So if this is true - which is a big if since this looks like speculation rather than real information - could this work with even smaller models?

For example, what about 20 x 65B = 1.3T params? Or 100 x 13B = 1.3T params?

Hell, what about 5000 x 13B params? Thousands of small highly specialized models, with maybe one small "categorization" model as the first pass?


Well at the end of the day you’ll also need a model for ranking the candidates, which becomes harder as the number of candidates grows. And the mean quality of any one candidate response will drop as the model size decreases, as will the max quality.



« We can’t really make models bigger than 220B parameters »

Can someone explain why?


Because of memory bandwidth. An H100 has 3,350 GB/s of bandwidth; more GPUs will give you more memory but not more bandwidth. If you load 175B parameters in 8-bit then you can theoretically get 3350/175 ≈ 19 tokens/second. In MoE you need to process only one expert at a time, so a sparse 8x220B model would be only slightly slower than a dense 220B model.
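Spelling that estimate out as a rough calculation (it assumes every active weight is streamed from memory once per generated token and ignores batching, KV cache and overlap, so treat it as a rough ceiling on single-stream speed):

    # Back-of-envelope tokens/s from memory bandwidth alone.
    def tokens_per_second(bandwidth_gb_s, active_params_b, bytes_per_param=1):
        # bytes_per_param=1 corresponds to 8-bit quantized weights
        return bandwidth_gb_s / (active_params_b * bytes_per_param)

    print(tokens_per_second(3350, 175))   # dense 175B on one H100 -> ~19 tokens/s
    print(tokens_per_second(3350, 220))   # one 220B expert active per token -> ~15 tokens/s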


Okay, memory bandwidth certainly matters, but 19 tokens a second is not some fundamental lower limit on the speed of a language model and so this doesn't really explain why the limit would be 220b rather than say 440b or 800b?


It's not a fundamental limit. Google PaLM had 540B parameters as a dense model. But it's a practical limit because models with over 1T parameters would be extremely slow even on the newest GPUs. Even now, OpenAI has a limit of 25 messages. You can read more here: https://bounded-regret.ghost.io/how-fast-can-we-perform-a-fo...


I'm not trying to say memory bandwidth isn't a bottleneck for very large models. I'm wondering why he picked 220b which is weirdly specific. (To be honest although I completely agree the costs would be very high, I think there are people who would pay for and wait for answers at seconds or even minutes per token if they were good enough, so not completely sure I even agree it's a practical limit)


It doesn't fit in VRAM.


I’ve been a bit surprised that Nvidia hasn’t gone to extreme lengths to fit 1 TB of memory on a card just for this reason.


The issue, as pointed out above, is primarily bandwidth (at inference), not addressable memory. Put simply, the best bandwidth stack we currently have is on-package HBM -> NVLink -> Mellanox InfiniBand, and for inference speed you really can't leave the NVLink bandwidth (read, 8x DGX pod) for >100B parameters. And stacking HBM dies is much harder (read, expensive) than GDDR dies, which is harder than DDR, etc.

Cost aside, HBM dies themselves aren't getting significantly denser anytime soon, and there just simply isn't enough package space with current manufacturing methods to pack a significantly increased number of dies on the GPU.

So I suspect the major hardware jumps will continue to be with NVLink/NVSwitch. NVLink 4 + NVSwitch 3 actually already allows for up to 256x GPUs https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-ho... ; increased numbers of links will let ever increasing numbers of GPUs pool with sufficient bandwidth for inference on larger models.

As already mentioned, see this HN post about the GH200 https://news.ycombinator.com/item?id=36133226, which has some further discussion about the cutting edge of bandwidth for Nvidia DGX and Google TPU pods.


Thanks for this info!


https://nvidianews.nvidia.com/news/nvidia-announces-dgx-gh20...

I think they _are_ going pretty extreme now.


Offtopic, but as a VR gamer that article just made me very sad. I was really hoping to see NVidia produce some decent cards in the near future, but looks like their main revenue is really going to be gargantuan number-crunchers. They'll likely only keep increasing the VRAM of gaming cards by arbitrarily-small numbers once every few years :-(


Gaming seems a lot less important than AI, in particular the graphical fidelity. Even games with crappy graphics can be fun. Crappy AI, not so much.


Yes, the future of VR gaming looks closer to the Sony Playstation or even the Apple Vision than NVIDIA's products.


> GPT-4: 8 x 220B experts trained with different data/task distributions and 16-iter inference.

There was a post on HackerNews the other day about a 13B open source model.

Any 220B open source models? Why or why not?

I wonder what the 8 categories were. I wonder what goes into identifying tokens and then trying to guess which category/model you should look up. What if tokens go between two models, how do the models route between each other?


220B open source models wouldn't be as useful for most users.

You need two RTX 3090 24GB cards already to run inference with a 65B model that is 4bit quantized. Going beyond that (already expensive) hardware is out of reach for the average hobbyist developer.


You could run it quantized to 4 bits on CPU with 256GB ram, which is much cheaper to rent/buy. Sure it might be somewhat slow, but for lots of use cases that doesn't matter.


Benchmarks I've run with a Ryzen 7950X, 128 GB RAM, and an Nvidia GeForce 3060 with 12 GB VRAM show a slowdown of less than 2x when not using the GPU, with llama.cpp as the inference platform and various ggml open source models in the 7B-13B parameter range.

The Ryzen does best with 16 threads, not the 32 it is capable of, which is expected due to it having 16 CPU cores.


Llama.cpp running on the GPU is pretty slow. Better try with something else. The speedup going from CPU to RTX 3090 is usually around 10x or 15x.


Google open-sourced (Apache 2.0) the Switch Transformers C-2048 model (1.6T parameters for 3.1 TB): https://huggingface.co/google/switch-c-2048


I think it’s just an ensemble of models, so you do some kind of pooling/majority vote on your output tokens


Would this be before or after inference? Is there some sort of delegation based on the subject matter?


If it is output tokens then it is after the inference.


same. i wish i had asked george instead of nodding along like an idiot. he probably wouldn't know but at least he'd speculate in interesting ways.


It was a great interview, thank you.


If it’s trivial then why does every other competitor suck at replicating it? Is it possible this is just a case of sour grapes that this intellectual is annoyed they’re not at the driving wheel of the coolest thing anymore?


I don't think he thinks it's trivial - just that it's not some revolutionary new architecture. It's an immense engineering effort rather than a fundamental breakthrough.


That's on the order of 25 4090 GPUs to run inference. Not a crazy number by any means. We will see consumer robots running that by the end of the decade, mark my words.


Language models have tiny inputs (words), and tiny outputs (words), and no tight latency requirements (inference takes seconds). That makes them perfect for running where compute is cheap (ie. next to a hydrothermal plant in iceland) and querying them over the internet.


It's a lot of electricity to power it, a lot of heat that gets generated... and more electricity to cool it back down


25 GPUs? A lot of electricity?


can you explain the math of how you got to 25 gpus? everyone seems to know these conversions and idk if i missed the memo or something


220B params * 8 = 1760B params

1760B parameters take ~1,760 GB of (V)RAM when 8-bit quantized. Plus you will need some memory for state. So you will need at least 1760/24 = 74 consumer-grade GPUs (3090/4090) or 1760/80 = 22 professional-grade GPUs (A100/H100).
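The same arithmetic as a tiny helper (lower bounds only; real deployments also need memory for activations and the KV cache):

    # GPUs needed just to hold the weights, at ~1 GB per 1B params in 8-bit.
    import math

    def gpus_needed(total_params_b, gpu_vram_gb, bytes_per_param=1):
        vram_needed_gb = total_params_b * bytes_per_param
        return math.ceil(vram_needed_gb / gpu_vram_gb)

    print(gpus_needed(8 * 220, 24))   # consumer 3090/4090-class -> 74
    print(gpus_needed(8 * 220, 80))   # A100/H100 80 GB-class    -> 22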


I was incorrectly calculating based on 1 weight == 1 transistor which is totally wrong. This figure that you provided is more accurate.

We can see that today, the MI300X can already run inference for some open source LLMs: https://www.youtube.com/watch?v=rYVPDQfRcL0

There are almost certainly algorithmic optimizations still available. LLM-scale computing inside of consumer robots should be achievable by the end of the decade. In fact electric cars are probably the best "housing" for this sort of hardware.


thank you for this! 1B params ~ 1 GB VRAM at 8-bit is a very nice equivalence to hold in my head.


If parameters are 8 bit (aka a byte), one billion parameters is roughly 1 billion bytes, aka 1 gigabyte. ;)


doh. of course


It is my understanding that Nvidia has castrated these cards by removing NVLink?


How are 25 4090 GPUs supposed to run inside a robot?


They mean 25 x RTX 4090 GPUs. 4090 is a model number


I think the question stands. You can't fit 25 4090s in a robot (unless we're talking about something massive with an equally massive battery), and even if you could an LLM wouldn't be appropriate for driving a robot.

Given the pace of improvements I don't see how you compress 25 4090s into a single GPU in 7 years. A 4090 isn't 25 times the power of a GTX 980, it's closer to maybe three times.


Teslas are already consumer items that rock massive batteries.

My 25 count is off. It's probably closer to 75 GPUs right now.

Let's say today's models are running vanilla transformers via pytorch without any of Deepmind's Flamingo QKV optimizations. In 10 years algorithmic optimizations, and ML platform improvements push that efficiency up 3-5 fold. We're down in the ballpark of 25 GPUs (again)

Now, we ditch the general purpose GPUs entirely and go for specially built AI inference chip. The year is 2033 and specialty inference chips are better and more widespread. Jettison those ray tracing cores, and computer rendering stuff. Another 3x improvement and we're at ~8 GPUs

Now you said the 4090 is about 3.5x faster than the 780 from a decade ago. We are now on the order of 2 chips to run inference.

These models won't just be running in Teslas, they will be running in agricultural vehicles, military vehicles, and eventually robot baristas.


I guess you’re onto something, that combined with reductive, concave, inference architecture gains and new flux capacitor designs we are on 10x performance maximum optimisation curve.

I can taste that robot barista cappuccino now.


Light-based (photonic) or quantum computing could possibly deliver by that time


Neither of these are likely nor necessary to deliver the necessary results.


And to give a bit more context, it's one of the top consumer grade cards available (and has 24GB of RAM). It costs on the order of $1.6k instead of $15-25k of H100.


That just goes to show how huge the markup is on those H100 cards! Ten 4090 cards have more compute and more memory (240 GB!) than an H100 card. The cost of the 4090s is the same or lower for about 4x the compute.


i believe that's George's goal with tinybox



thanks :) posted it on HN but got no traction, definitely underperformed my expectations. my only explanation is i think i screwed up the title :/

i somewhat knew that people were going to obsess over the gpt4 tidbit, but hey ultimately George is just relaying secondhand info, and I wish more people focused on the tinycorp/tinybox/tinygrad story. I'm going to try again tomorrow to tell the story better.


We're sourcing off Geohot? Yikes.


At a minimum he glossed over the multimodal capabilities of GPT-4. If they use the same set of tokens, it’s unclear how this doesn’t pollute text training data. If they use separate tokens, the model size should be bigger.


I assume they just intersperse and/or use a special <IMAGE> token to prime the models appropriately. This has been done before by others training on e.g. HTML with <img> tags replaced with image tokens from a vision model.


This is the same strategy used by Palm-E and Mini-GPT4. They have a projection layer that transforms the vision embeddings into the token embedding space.
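Roughly, the projection-layer trick looks like this (a minimal sketch in the spirit of PaLM-E / MiniGPT-4; the dimensions and the frozen vision encoder are assumptions, not GPT-4 details):

    # Map vision-encoder embeddings into the LLM's token-embedding space and
    # splice them into the text sequence. Dimensions are invented for illustration.
    import torch
    import torch.nn as nn

    d_vision, d_model = 1024, 4096
    project = nn.Linear(d_vision, d_model)   # learned vision -> token-space map

    def build_input(text_embeds, image_patches):
        # text_embeds:   (n_text, d_model)    embedded text tokens
        # image_patches: (n_patch, d_vision)  output of a frozen vision encoder
        image_tokens = project(image_patches)                  # now "look like" tokens
        return torch.cat([image_tokens, text_embeds], dim=0)   # one sequence for the LLM

    seq = build_input(torch.randn(12, d_model), torch.randn(49, d_vision))
    print(seq.shape)   # torch.Size([61, 4096])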


Reminds me of this: https://en.wikipedia.org/wiki/Society_of_Mind

> In his book of the same name, Minsky constructs a model of human intelligence step by step, built up from the interactions of simple parts called agents, which are themselves mindless. He describes the postulated interactions as constituting a "society of mind", hence the title.


I often hear the idea that digital is faster than biology. This seems mostly derived from small math computations.

Yet it seems the current form of large language computation is much, much slower than our biology. Making it even larger will be necessary to come closer to human levels, but what about the speed?

If this is the path to GI, the computational levels need to be very high and very centralized.

Are there ways to improve this in its current implementation other than caching & more hardware?


OpenAI's modus operandi is basically "does it get better if we make it bigger". Of course they are constrained by economic factors like the cost of training and inference, but if they have the choice between making the model better or more efficient they choose the better model.

I believe over the next years (and decades) we will figure out that a lot of this can be done much more efficiently.

Another problem with the analogy to humans is obviously that these models know much more than any one human can remember. They are trained on our best approximation of the total sum of human knowledge. So the comparison to any one human will always be fraught with problems.


This is probably not the path to GI. First we would need a precise scientific formalism to accurately describe intelligence, which currently does not exist. Second, it may or may not end up being tied to consciousness, and there's a thing called the hard problem of consciousness, that possibly might not be solvable.

It might end up being the kind of thing where if you want to accurately model consciousness, you would need a computer the size of the universe, and it's gotta run for like 13.8 billion years or so, but that's just my own pure speculation - I don't think anybody even has a clue on where to start tackling this problem.

This is not to discourage progress in the field. I'm personally very curious to see where all this will lead. However, if I had to place a bet, it wouldn't be on GI coming any time soon.


To accurately model consciousness it seems like we'd only need a computer the size of our brains, we'd just need to be as efficient as it.


Seems like it, doesn't it? I'd be curious to see if and how we can get there. What I was highlighting are some serious challenges along the way that might end up leading us to insights about why it might be harder than we think, or why there may be factors that we aren't considering.

It's very easy to say "brain is made of fundamental particles and forces, all we have to do is create a similar configuration or a model of them," but it's precisely in the task of understanding the higher order patterns of those fundamental particles and forces where we seem to run into some serious challenges that as of yet remain unaddressed.

The AI/ML way of approaching this is more of a top-down approach, where we just sort of ignore the fact that we don't understand how our own brains/minds work and just try to build something kind of like it in the folksy sense. I'm not discouraging that approach, but I'm very curious to see where it will lead us.


For all we know these models should be compressible to much smaller sizes.


What do you mean? ChatGPT/GPT4 is way faster than a human in every task.


This seems pretty consistent with what Sam Altman has said in past interviews regarding the end of continuously increasing scale and having multiple smaller specialist models: https://finance.yahoo.com/news/openai-sam-altman-says-giant-...


I think this could be fake. He says it’s an MoE model, but then explains that it’s actually a blended ensemble. Anyone else have thoughts on that?


What's 16 iter inference?


16 iterations of inference. Like each of 8 models is processed 2 times.


Is this still orders of magnitude smaller than a human brain?

How many? Based on current human neurons/synapses knowledge?


We really have no idea how to directly compare the two.

Also, vast portions of the human brain are dedicated to the visual cortex, smelling, breathing, muscle control... things which have value to us but which don't contribute to knowledge work when evaluating how many parameters it would take to replace human knowledge work.


While those portions of the brain aren't specific to learning intellectual or academic information, they might be crucial to making sense of data, help in testing what we learn, and help bridge countless gaps between model/simulation and reality (whatever that is). Hopefully that makes sense. Sort of like... Holistic learning.

I wonder if our brains and bodies are not all that separate, and the intangible features of that unity might be very difficult to quantify and replicate in silico.


We can say that such and such part of the brain is "for" this or that. Then it releases neurotransmitters or changes the level of hormones in your body which in turn have cascading effects, and at this point information theory would like to have a word.

"If our small minds, for some convenience, divide this glass of wine, this universe, into parts -- physics, biology, geology, astronomy, psychology, and so on -- remember that nature does not know it!" -Richard Feynmann


Reality only computes at the level of quarks - a LessWrong post


It's really interesting that the human organism requires so much computational power to support life.


I think it's even more interesting that the required amount of energy to do that high computational work isn't that high. Evolution has been working on it for a long time, and some things are really inefficient but overall it does an OK job at making squishy machines.


The human brain uses roughly 20 watts, which is really a remarkably low number.

https://psychology.stackexchange.com/a/12386


That's 20% of total energy consumption.


I had a good chuckle at "squishy machines". That's a really interesting way to think about it. It makes me wonder if, some day, we will be able to build "squishy machines" of our own, capable of outperforming silicon while using a tiny fraction of the energy.


Not really, as we only use 10% of our brains. /s


on top of that I would add that humans have very high-DPI touch sensors across the full body (skin)


2 orders of magnitude smaller, assuming 100T synaptic connections in the human brain.


And how big is the memory and language part…?


In the podcast they talk about 20 petaFLOPS as the human-brain equivalent for comparison.


It’s remarkable.

I’m curious how long until people are just using training brains in a jar to compute.


They already are, for the hyperhyperparameter choices.


Well, not in a jar ... Right?


Open plan noisy offices are almost that.


How would you know that you're not in a jar? :-)


Not yet, but in our last call we assured our stakeholders that we are very close to that.


We have no idea how to estimate the computational capacity of the brain at the moment. We can make silly estimates like saying that 1 human neuron is equivalent to something in an artificial network. But this is definitely wrong, biological neurons are far more complex than this.

The big problem is that we don't understand the locus of computation in the brain. What is the thing performing the meaningful unit of computation in a neuron? And what is a neuron really equivalent to?

The ranges are massive.

Some people say that computation is some high level property of the neuron as a whole, so they think each neuron is equivalent to just a few logic gates. These people would say that the brain has a capacity of about 1 petaFLOP/s. https://lips.cs.princeton.edu/what-is-the-computational-capa...

Then there are people who think every Na, K, and Ca ion channel performs meaningful computation. They would say the brain has a capacity of 1 zettaFLOP/s. https://arxiv.org/pdf/2009.10615.pdf

Then there are computational researchers who just want to approximate what a neuron does. Their results say that neurons are more like whole 4-8 layer artificial networks. This would place the brain well somewhere in the yottaFLOP/s range https://www.quantamagazine.org/how-computationally-complex-i...

And we're learning more about how complex neurons are all the time. No one thinks the picture above is accurate in any way.

Then there are the extremists who think that there is something non-classical about our brains. That neurons individually or areas of the brain as a whole exploit some form of quantum computation. If they're right, we're not even remotely on the trajectory to matching brains, and very likely nothing we're doing today will ever pay off in that sense. Almost no one believes them.

Let's say the brain is in the zettaFLOP/s range. That's 10^21 FLOP/s. Training GPT-3 took about 10^23 FLOPs total over 34 days. 34 days is 2,937,600 seconds, roughly 3x10^6. 10^23 / 3x10^6 is about 3x10^16 FLOP/s. So by this back-of-the-envelope computation the brain has about 4-5 orders of magnitude more capacity, i.e. tens of thousands of times more. This makes a lot of sense; they're basically using a petaFLOP/s-class supercomputer, which we already knew. We'll have zettaFLOP/s supercomputers soon; for yottaFLOP/s, people are worried we're going to hit some fundamental physical limits before we get there.
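The back-of-the-envelope comparison above, spelled out (every figure is an order-of-magnitude guess, not a measurement):

    # Rough brain-vs-GPT-3-training-cluster comparison from the numbers above.
    gpt3_training_flop = 1e23             # rough total training compute for GPT-3
    training_seconds = 34 * 24 * 3600     # "over 34 days"
    cluster_flops = gpt3_training_flop / training_seconds   # ~3e16 FLOP/s sustained

    brain_estimates = {"petaFLOP/s": 1e15, "zettaFLOP/s": 1e21, "yottaFLOP/s": 1e24}
    for name, flops in brain_estimates.items():
        print(f"{name} brain / cluster ratio: {flops / cluster_flops:.1e}")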

All of this is a simplification and there are problems with every one of these estimates.

But, in some sense none of it means anything at all. You can have an extremely efficient algorithm that runs 1 million times faster than an extremely inefficient algorithm. Machines and brains do not run the same "software", the same algorithms. So comparing their hardware directly doesn't say anything at all.


This is an important point. On the one hand real neurons are a heck of a lot more complex than a single weight in a neural network, so exactly mimicking a human brain is still well outside our capabilities, even assuming we knew enough to build an accurate simulation of one. On the other hand, there's no intrinsic reason why you would need to in order to get similar capabilities on a lot of areas: especially when you consider that neurons have a very 'noisy' operation environment, it's very possible that there's a huge overhead to the work they do to make up for it.


Yep, this is what I am thinking.

Since it is already giving interesting results, let's say a "brain" is a connectome with its current information flow.

Comparing AI with a brain in terms of scale is somewhat hazardous, but with what we know about real neurons and synapses, one brain is still several orders of magnitude above the current biggest AIs (not to mention, current AI is 2D and very local, whereas the brain is 3D and much less locality-constrained). The "self-awareness" zone would need a connectome at least 1000x bigger than that of the current biggest AI: redundant, with a saveable flow of information, 3D, and with fewer locality constraints. Not to mention realtime rich inputs/outputs, and years of training (like a baby human).

Ofc, this is beyond us, we have no idea of what's going on, and we probably won't. This is totally unpredictable; anybody saying otherwise is either trying to steal money for some BS AI research, or a real genius.


Very interesting summary.

Now the question I have is how small a model could be that can fascinate a big population of smarter creatures. On one hand we know that the human brain has a lot of power, but on the other it could be "manipulated" by less intelligent creatures.


I really wonder if it is the case that the image processing is simply more tokens appended to the sequence. Would make the most sense from an architecture perspective, training must be a whole other ballgame of alchemy though


Probably. Check the kosmos-1 paper from Microsoft that appeared a few days before GPT4 was released: https://arxiv.org/abs/2302.14045


I like how geohot takes these concepts that seem impossibly complex to an outsider (mixture of models, multi modality) and discusses them so casually that they seem accessible to anyone. The flippancy is refreshing.


There's another layer to that sort of...erasure. You'll see it in a few years.


Interesting. And makes sense. E.g. I could see one of the eight being closely focused and trained on GitHub-like data. Could help it stay on task too


i'd love any and all informed speculation on what the 8 models could be. it seems fascinating that we don't know.


Sad that there is a grammar error lining the top of the video.


literally unwatchable



