weird title, note that the tweet said "so yes, GPT4 is *technically* 10x the size of GPT3, and all the small circle big circle memes from January were actually... in the ballpark?"
It's really 8 models that are 220B, which is not the same as one model that is 1.7T params. There have been 1T+ models via mixtures of experts for a while now.
Note also the follow up tweet: "since MoE is So Hot Right Now, GLaM might be the paper to pay attention to. Google already has a 1.2T model with 64 experts, while Microsoft Bing’s modes are different mixes accordingly"
There is also this linked tweet https://twitter.com/LiamFedus/status/1536791574612303872 -
"They are all related to Switch-Transformers and MoE. Of the 3 people on Twitter, 2 joined OpenAI. Could be related, could be unrelated"
Which links to this tweet:
"Today we're releasing all Switch Transformer models in T5X/JAX, including the 1.6T param Switch-C and the 395B param Switch-XXL models. Pleased to have these open-sourced!"
Anyway... remember not to just read the headlines, they can be misleading.
(OP here) - yeah i know, but i also know how AI twitter works so I put both the headline and the caveats. i always hope to elevate the level of discourse by raising the relevant facts to those at my level/a little bit behind me in terms of understanding. i think there's always a fine balance between getting deep/technical/precise and getting attention, and you have to thread the needle in a way that feels authentic to you to do this "job"
ultimately my goal is to Learn in Public and demonstrate to experts that spending their time teaching/sharing with me is a good use of time because i will augment/amplify/simplify their message.
(pls give the full podcast a listen/read/watch, George went deep on tinygrad/tinybox/tinycorp and there's lots there he IS the authority on, and people are overly fixated on the GPT4 rumor https://www.latent.space/p/geohot#details )
Oh, is that how Bing's image descriptions work? If you upload an image and ask for a description, it'll give a really good one. But if you ask it where it got it from, it says it got it from another algorithm. So is it using another expert AI to describe, and then feed it into GPT4 for conversion into language?
Pretty sure this is how the human brain works. If we add models that can transform information into updates to existing models, then add a model that can process streams of information and determine which models to update with the information (and can also apply that process to itself), we start to approach a primitive brain, no?
You might be surprised to learn that dell "hpc engineers" (or maybe HPE?) have attempted to sell me hardware using that exact logic to compare servers.
Given a highly parallelizable numerics job, and all other things being equal, a 20-core 2.0 GHz machine will run the job in the same time as a 40-core 1.0 GHz machine.
...though usually we'd just state it in FLOPS, which of course says nothing about non-float operation performance.
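A toy illustration of that logic (hypothetical numbers; assumes a perfectly parallel job, identical FLOPs per cycle per core, and no scaling overhead):

```python
# Toy comparison of two hypothetical machines on a perfectly parallel job.
# FLOPS_PER_CYCLE is an assumed, illustrative value.
FLOPS_PER_CYCLE = 16

def peak_gflops(cores: int, ghz: float) -> float:
    """Peak throughput in GFLOP/s under the stated assumptions."""
    return cores * ghz * FLOPS_PER_CYCLE

print(peak_gflops(20, 2.0))  # 640.0
print(peak_gflops(40, 1.0))  # 640.0 -> same aggregate throughput
```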
Probably just a sales/marketing person who learned a new technical word and wants to sound knowledgeable to the customer, but it backfires.
At the risk of torturing this analogy beyond recognition, it’s more like having an 8 core CPU but with a weak 150W PSU such that you can only run one core at a time.
in my understanding, with an 8-way MoE you're still running all 8 for every request. only if you're making a router model would the "weak PSU" analogy apply. and even then a sufficiently perfect router would essentially have to know enough about its routes to almost simulate them, which idk if i can believe works
No, GPT-4 is 1.7T params in the same way that an AMD Ryzen 9 7950X is (24 times X million transistors). It's not wrong to say GPT-4 is 1.7T parameters, your statement is wrong. Parameters can be configured in different ways, just as transistors on a chip can. In the same way that having 24x transistors does not imply 24x performance, having 24x parameters does not imply 24x performance.
If you can add FLOPs, and you have a number of cores with the same fixed number of FLOPs/cycle, it follows that you can add their frequency.
For the vast majority of real-life workloads there are vast differences between one core having X GHz or Y FLOPs vs lots of cores that sum up to that number. Which is the point GP tried to make.
To be fair, this was already a common whisper at the time, so it could be a Chinese whisper effect. Even I thought this on release day: https://news.ycombinator.com/item?id=35165874
What is weird is how competitive the open-source models are to the closed ones. For instance, PaLM-2 Bison is below multiple 12B models[0], despite it being plausibly much bigger[1]. The gap with GPT-4 is not that big; the best open-source model on the leaderboard beats it 30% of the time.
From my perspective, there's a vast divide between open source models and GPT4 at present. The lmsys leaderboard rankings are derived from users independently engaging with the LLMs and opting for the answers they find most appealing. Consequently, the rankings are influenced not only by the type of questions users ask but also by their preference for succinctness in responses. When we venture into the realm of more complex tasks, such as code generation, GPT4 unequivocally outshines all open source models.
I agree. Also, Elo rankings can artificially deflate strong performers when there are too few binary comparisons. The Guanaco paper points that out, instead giving GPT-4 a 326-point lead over Guanaco-65B across 10k orderings, based on GPT-4's own judgment, which corresponds to Guanaco winning about 13% of the time. But even that relies only on the Vicuna benchmark set.
There are huge differences. I used OpenAI’s APIs in my last book on LangChain and LlamaIndex, and GPT-3.5 and GPT-4 are good enough right now to support building applications for years (although I look forward to improvements).
I am writing a new book, Safe For Humans AI, in which I am constraining myself to using open models that can be run on a high-end PC or a leased GPU server. Yesterday I was exploring what I could do with T5-flan-XXL, and it is useful, but not as effective a tool as the closed OpenAI models. I will do the same with Orca this week.
I think a lot of people see a larger gap between open-source models and GPT-4 than is present because they test out the 7B models which fit on their machine. Models like T5-FLAN-XXL are very far below the quality one can expect from the best in open-source, and barely usable for CoT or tool use.
Especially for LangChain, I recommend using ≥33B models like Guanaco or WizardLM. Guanaco-65B honestly feels on par with ChatGPT-3.5. (To be clear, there is a large gap with GPT-4 though.) It is a costly test, although GPTQ (for instance on exllama) helps make it affordable.
I haven’t tried Orca since they haven’t released the weights yet, but it doesn’t seem like they have a 33B version.
Thanks for your advice! I am trying to write the book examples using an A6000 (48 GB of video memory), but I may have to go higher end. The hourly lease charge on an A6000 is very inexpensive, and I wanted the examples to be "almost free" to run. I will see what is required for Guanaco-65B.
AI development is taking place at breathtaking velocity and GPT-4 isn't the end of the road. OpenAI's moat relies on them improving at a faster pace than competitors. I wouldn't take it for granted that OpenAI's lead will remain the same in a year as it is today. Very possibly they could expand their lead. Also possibly, their lead could vanish.
GPT-4 came roughly 12 months after GPT-3.5, and I'd say the competition is roughly approaching 3.5 now, so I'm hoping they will reach GPT-4 level early 2024. Not that far off!
Wish there was a bookmark/remind-me function so I could come back in 8 months' time and see how close I was.
It’s funny that this post is trending on HN right next to the post about a paper showing how to build a model 1000x smaller than 1.7T that can code better than LLMs 10x larger.
I don't find it funny, I find it scary and mind-blowing: the impact of these headlines is additive - this one confirms the effectiveness of combining models, and the other one suggests you could cut the model size by a couple orders of magnitude if you train on clean enough data. Together, this points at a way to achieve both a GPT-4 that fits on your phone and a much more powerful model that's no larger than GPT-4 is now.
If it truly is the training data that's making models smart, then that would explain that there is both a minimum and maximum "useful" size to LLMs. The recent stream of papers seems to indicate that the cleaner the input data, the less size is required.
That would negate, at least partially, the "we have 20 datacenters" advantage.
I think it is. I also think this is what OpenAI did. They’ve carefully crafted the data.
I don’t think they have an ensemble of 8 models. First, this is not elegant. Second, I don’t see how this could be compatible with the streaming output.
I’d guess that GPT4 is around 200B parameters, and it’s trained on a dataset made with love, that goes from Lorem Ipsum to a doctorate degree. Love is all you need ;)
It's still large. But it might no longer need to be "only few entities on the planet can afford to make one, and not at the same time, since NVIDIA can pump out GPUs only so fast" large.
This doesn't seem likely in the near term. An iPhone 13 Pro has 6 GB of memory, which might be enough for a single 1.3B-parameter model if you purged everything else out of RAM. Combining 8 (or even 4) of them on a single phone won't happen anytime soon at the rate phones improve.
Plus, none of the smaller models are really appreciably close to GPT-4 on most metrics. It's not clear to me you could get there at all with 1.3B models. Maybe somebody gets there someday with 65B models, but then you're far out of the reach of phones.
Are you assuming 32-bit parameters? You definitely don't need to keep that much. People have been cramming LLaMA into 3-4 bits. You wouldn't want 8x1.3B, but it should fit.
Also Apple has never gone for high RAM in mobile devices. I could go get 12GB in a brand new phone for $400, and some high end phones have had 16GB since 2020.
So combined, you could do normal app stuff with 2.5 GB and run 20-25B 3-bit inference with the other 9.5 GB.
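Rough arithmetic behind those numbers (a sketch; weights only, ignoring KV cache and runtime overhead):

```python
# Rough memory check for quantized models on a 12 GB phone (illustrative only).
def model_size_gb(params_billions: float, bits_per_param: int) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

print(8 * model_size_gb(1.3, 4))  # eight 1.3B experts at 4-bit: ~5.2 GB
print(model_size_gb(25, 3))       # one 25B model at 3-bit: ~9.4 GB
```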
It’s fine to say you don’t believe him, it’s fine to say he’s wrong, but if you want to claim he doesn’t have inside information, I would expect a standard of evidence as good as that you would expect for Geohot’s claim.
So tell me: how do you know that Geohot doesn't have inside information?
It's up to Geohot to prove his claim.
Although he may have inside information, he isn't in a position known to have inside information. It might be more correct to phrase it that way, but we all got the meaning.
That's not how it works out in practice. For example, journalists who have insider info from sources and leak it regularly don't do anything to prove they have insider information; they build up a reputation based on how their predictions pan out.
There's no point in outing your sources, that's how you lose them.
It's up to anyone making a claim, if they want people who aren't yet convinced to become convinced, to provide some supporting evidence. This basic rule of discourse applies whether one makes a positive claim or a negative claim.
Bluedevilzn made a negative claim, so naturally I'm asking if they can prove their negative.
He doesn't strike me as the type of person to lie (except when trolling). His reputation is solid enough that I'm sure he's had discussions with people in the space.
I remember watching a live stream [1] of him going through Twitter engineering articles and trying to reverse engineer the Twitter frontend and backend, and he clearly had absolutely no clue of how anything remotely related to the web works.
He was just clicking around compulsively, jumping between stuff randomly without even reading it, while not understanding what he was looking at. I have no idea how he's successful in tech, judging from what I saw there.
> I have no idea how he's successful in tech, judging from what I saw there.
I think judging only from what you saw there is the issue. If you look somewhere like Wikipedia [0], you'll see he was the first person to jailbreak the iPhone, the first person to achieve hypervisor access to the PS3's CPU, he took first place in several notable CTFs (at least once as a one-person team), he worked on Google's Project Zero team (and created a debugger called QIRA while he was there), he created comma.ai, and the list goes on.
Well, he picked the right problem at the time, which was (and still is!) search. I was really looking forward to some progress on this, and then he deleted tweets and dropped off the system. I would love to see a podcast of him talking about that. FYI, @realGeorgeHotz is back on it now to promote tinygrad/tinybox/tinycorp.
As someone who has spent 20+ years working in large companies you can spot developers like him a mile away i.e. the sort of behaviour you see with skilled but arrogant grads/interns.
The right approach when you're new is to quietly pick a simple problem away from the core services, where you can learn the processes and policies needed to get something into Production. More so when you're in a company that is undergoing a brain drain.
Then you can progressively move to solving more demanding and critical issues.
Large companies normally (unlike Twitter) have achieved strong product market fit and are all about tweaks that deliver marginal revenue. Given that they already have a big slice of the market, the multiplier on marginal revenue makes what you're doing worthwhile.
"Rocking the boat" is more likely than not to make things worse.
a) It is valued at about a 1/4 of what it was purchased at.
b) Twitter Blue has generated an irrelevant amount of revenue and churn is increasing [1].
c) Roadmap looks poor. Video is a terrible direction where only Google, Amazon, TikTok etc have been able to make the numbers work and that's because it is subsidised through other revenue sources. Payments is DOA given Twitter's inability to comply with existing regulations let alone how difficult KYC/AML is to manage.
d) Regulatory and legal risk increases by the day. Lawsuits continue to pile up and EU/FTC are hovering around as Twitter is not in compliance with previous agreements.
e) Brand safety continues to be a huge challenge that isn't solvable without effectively going back to what Twitter was previously.
f) BlueSky and Instagram are both releasing competitor apps to the broader public in the coming months. The market simply won't sustain this many text-based social media apps.
a) Meta had similar swings of over 3x between a 2021 high and a 2022 low. We can only guess what Twitter would be worth today if the stock was floated, but it isn’t. Substituting market valuations with our personal beliefs isn’t interesting.
b) Every social media company has a well populated graveyard of failed experiments behind them.
c) Opinion.
d) Business as usual for every social media company since the dawn of time.
e) Opinion. And advertiser behaviour suggests otherwise.
f) History is littered with new entrants which fail to unseat the incumbent. It does happen, but it’s statistically rare.
Except for data points like Twitter Blue subscribers, number of active lawsuits, failed video startups etc.
As well as comments from Musk himself about the decline in the value of Twitter and from advertisers themselves about the challenges around brand safety.
g) bot account traffic has gone through the roof after constraints were removed. Follows from random bots occur multiple times a day on accounts with a couple dozen followers
h) the tweets served are no longer weighted towards followers but instead whoever paid for blue checks
i) accounting on views is entirely wrong with a tweet registering a billion views
His biggest issue was trying to convince the remaining engineers and Elon that a refactor was more promising than continuing to build on the layers of spaghetti code that runs twitter.
I don't disagree with him, it's just clear that he didn't understand how dire the financial situation was and that even a progressive refactor starting with the most basic features would take considerable engineering hours and money.
He added some code to give himself tons of it and didn't report it to anyone, dismissed it as a joke or something when it was found and then disappeared after hyping the project. Sound familiar?
To me he joins the ranks of the most basic shady crypto types.
That's luckily not relevant here. Multiple sources have independently confirmed the same rumors. This means that there is a significant probability of it being true.
What exactly is "it"? What is being discussed here? The size of the model? Who cares? People will make the same API call to get the results. Everyone is acting like a glorified AI expert now smh
The rumor here is that GPT-4 isn't one giant model, but instead a "mixture of experts" combination of 8 smaller GPT-3 sized models with different training sets.
This has implications for what may be possible on consumer hardware.
I find it interesting that geohot says it is what you do "when you are out of ideas." I can't help but think that having multiple blended models is what makes GPT-4 seem like it has more "emergent" behavior than earlier models.
OpenAI has always stuck to “simpler” approaches that scale. It’s the best bet to make and it’s paid off time and time again for them. I don’t think it’s so much “they’re out of ideas”, it’s just an idea that scales well
Yeah, and who says that this is all that they're currently working on? They likely found this to work better, launched, and did other things in the meantime.
If it's similar to the switch transformer architecture [1], which I suspect it is, then the models are all trained on the same corpus and the routing model learns automatically which experts to route to.
It's orthogonal to beam search - the benefit of the architecture is that it allows sparse inference.
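A minimal sketch of that kind of learned routing (Switch-style top-1 gating in PyTorch with toy dimensions; an illustration of the idea, not OpenAI's or Google's actual implementation):

```python
import torch
import torch.nn as nn

class SwitchStyleLayer(nn.Module):
    """Toy MoE layer: a learned router sends each token to ONE expert,
    so only a fraction of the parameters run per token (sparse inference)."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (n_tokens, d_model)
        gate = self.router(x).softmax(dim=-1)    # learned routing probabilities
        top_p, top_i = gate.max(dim=-1)          # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_i == e
            if mask.any():                       # only the chosen expert runs
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(SwitchStyleLayer()(tokens).shape)          # torch.Size([10, 64])
```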
So in layman's terms does this mean that on top of big base of knowledge (?) they trained 8 different 220B models and each model specialized in different areas, in practice like an 8 units "brain"?
PS. Thinking now about how the human brain does something similar, as our brain is split into two halves and each one specializes in different tasks.
Yeah, that's pretty close. It might be more precise to say they trained one big model that includes 8 "expert networks" and a mechanism to route between them, since everything is trained together.
There isn't a lot of public interpretability work on mixture-of-expert transformer models, but I'd suspect the way they specialize in tasks is going to be pretty alien to us. I would be surprised if we find that one of the expert networks is used for math, another for programming, another for poetry etc. It's more likely we'll see a lot of overlap between the networks going off of Anthropic's work on superposition [1], but who really knows?
Thank you for the explanation, I still have a hard time understanding how transformers work so amazingly well and tech is already quite a few steps over that idea.
Andrej Karpathy's "zero to hero" series [1] was how I learned the fundamentals of this stuff. It's especially useful because he explains the why and provides intuitive explanations instead of just talking about the what and how. Would recommend it if you haven't checked it out already.
They probably trained all 8 experts on the same data. The experts may have become good at different topics, but no human divided up the topics.
The output isn't just the best of the 8 experts - it is a blend of the opinions of the experts. Another (usually smaller) neural net decides how to blend together the outputs of the networks, probably on a per-token basis (i.e. for each individual word (token), the outputs of all the experts are consulted and blended together, and a word is picked (sampled), before moving on to the next word).
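A toy numeric illustration of that per-token blending (made-up numbers, two experts, a four-word vocabulary):

```python
import torch

vocab = ["the", "cat", "sat", "quietly"]

expert_logits = torch.tensor([
    [2.0, 0.5, 0.1, 0.1],   # expert A's next-token logits (made up)
    [0.2, 2.5, 0.3, 0.0],   # expert B's next-token logits (made up)
])
gate = torch.tensor([0.3, 0.7])  # gating net's weights for this token

blended = (gate[:, None] * expert_logits).sum(dim=0)  # weighted blend of the experts
probs = blended.softmax(dim=-1)
print(vocab[int(probs.argmax())])  # "cat" -- the blend leans on expert B here
```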
I guess that neural network has to be capable of identifying the subject and knowing at every moment which network is the most capable for that subject; otherwise I can't understand how it could possibly evaluate which is the best answer.
Results of this sort of system frequently look almost random to human eyes. For example one expert might be the "capital letter expert", doing a really good job of putting capital letters in the right place in the output.
A democracy of descendant models that have been trained separately, by partitioning out the clusters of strong capabilities identified in an ancestor model; so, in effect, they are modular, and how to combine them competitively can itself be learned.
Well at the end of the day you’ll also need a model for ranking the candidates, which becomes harder as the number of candidates grows. And the mean quality of any one candidate response will drop as the model size decreases, as will the max quality.
Because of memory bandwidth.
The H100 has 3,350 GB/s of memory bandwidth; more GPUs will give you more memory, but not more bandwidth.
If you load 175B parameters in 8-bit, then theoretically you can get at most 3350/175 ≈ 19 tokens/second.
In an MoE you only need to process one expert at a time, so a sparse 8x220B model would be only slightly slower than a dense 220B model.
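The arithmetic behind that estimate (a theoretical ceiling from bandwidth alone; real systems lose more to overhead):

```python
# Every parameter must be streamed from memory once per generated token,
# so bandwidth / model size gives an upper bound on tokens per second.
def max_tokens_per_sec(params_billion: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / (params_billion * bytes_per_param)

print(max_tokens_per_sec(175, 1.0, 3350))   # dense 175B at 8-bit: ~19 tok/s
print(max_tokens_per_sec(220, 1.0, 3350))   # one 220B expert active: ~15 tok/s
print(max_tokens_per_sec(1760, 1.0, 3350))  # fully dense 1.76T: ~2 tok/s
```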
Okay, memory bandwidth certainly matters, but 19 tokens a second is not some fundamental lower limit on the speed of a language model and so this doesn't really explain why the limit would be 220b rather than say 440b or 800b?
It's not a fundamental limit. Google's PaLM had 540B parameters as a dense model.
But it is a practical limit, because models over 1T parameters would be extremely slow even on the newest GPUs. Even now, OpenAI has a limit of 25 messages.
You can read more here: https://bounded-regret.ghost.io/how-fast-can-we-perform-a-fo...
I'm not trying to say memory bandwidth isn't a bottleneck for very large models. I'm wondering why he picked 220B, which is weirdly specific. (To be honest, although I completely agree the costs would be very high, I think there are people who would pay for and wait for answers at seconds or even minutes per token if they were good enough, so I'm not completely sure I even agree it's a practical limit.)
The issue, as pointed out above, is primarily bandwidth (at inference), not addressable memory. Put simply, the best bandwidth stack we currently have is on-package HBM -> NVLink -> Mellanox InfiniBand, and for inference speed you really can't leave NVLink bandwidth (read: an 8x DGX pod) for >100B parameters. And stacking HBM dies is much harder (read: expensive) than GDDR dies, which is harder than DDR, etc.
Cost aside, HBM dies themselves aren't getting significantly denser anytime soon, and there just simply isn't enough package space with current manufacturing methods to pack a significantly increased number of dies on the GPU.
So I suspect the major hardware jumps will continue to be with NVLink/NVSwitch. NVLink 4 + NVSwitch 3 actually already allows for up to 256 GPUs https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-ho... ; increased numbers of links will let ever-increasing numbers of GPUs pool with sufficient bandwidth for inference on larger models.
As already mentioned, see this HN post about the GH200 https://news.ycombinator.com/item?id=36133226, which has some further discussion about the cutting edge of bandwidth for Nvidia DGX and Google TPU pods.
Offtopic, but as a VR gamer that article just made me very sad. I was really hoping to see NVidia produce some decent cards in the near future, but looks like their main revenue is really going to be gargantuan number-crunchers. They'll likely only keep increasing the VRAM of gaming cards by arbitrarily-small numbers once every few years :-(
> GPT-4: 8 x 220B experts trained with different data/task distributions and 16-iter inference.
There was a post on HackerNews the other day about a 13B open source model.
Any 220B open source models? Why or why not?
I wonder what the 8 categories were. I wonder what goes into identifying tokens and then trying to guess which category/model you should look up. What if tokens go between two models, how do the models route between each other?
220B open source models wouldn't be as useful for most users.
You need two RTX 3090 24GB cards already to run inference with a 65B model that is 4bit quantized. Going beyond that (already expensive) hardware is out of reach for the average hobbyist developer.
You could run it quantized to 4 bits on CPU with 256GB ram, which is much cheaper to rent/buy. Sure it might be somewhat slow, but for lots of use cases that doesn't matter.
Benchmarks I've run with a Ryzen 7950X (128 GB RAM) and an Nvidia GeForce 3060 with 12 GB VRAM show a slowdown of less than 2x when not using the GPU, with llama.cpp as the inference platform and various ggml open source models in the 7B-13B parameter range.
The Ryzen does best with 16 threads, not the 32 it is capable of, which is expected due to it having 16 CPU cores.
If it’s trivial then why does every other competitor suck at replicating it? Is it possible this is just a case of sour grapes that this intellectual is annoyed they’re not at the driving wheel of the coolest thing anymore?
I don't think he thinks it's trivial - just that it's not some revolutionary new architecture. It's an immense engineering effort rather than a fundamental breakthrough.
That's on the order of 25 4090 GPUs to run inference. Not a crazy number by any means. We will see consumer robots running that by the end of the decade, mark my words.
Language models have tiny inputs (words), and tiny outputs (words), and no tight latency requirements (inference takes seconds). That makes them perfect for running where compute is cheap (i.e. next to a hydrothermal plant in Iceland) and querying them over the internet.
1760B parameters take ~1760 GB of (V)RAM when 8-bit quantized. Plus you will need some memory for state.
So you will need at least 1760/24=74 consumer-grade GPUs (3090/4090) or 1760/80=22 professional-grade GPUs (A100/H100).
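The same count as a quick script (weights only; it ignores activations, KV cache, and interconnect overhead):

```python
import math

# GPUs needed just to hold the weights of a 1760B-parameter model at 8-bit.
def gpus_needed(params_billion: float, gb_per_gpu: int, bytes_per_param: float = 1.0) -> int:
    return math.ceil(params_billion * bytes_per_param / gb_per_gpu)

print(gpus_needed(1760, 24))  # consumer 3090/4090-class: 74
print(gpus_needed(1760, 80))  # A100/H100-class: 22
```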
There are almost certainly algorithmic optimizations still available. LLM-scale computing inside of consumer robots should be achievable by the end of the decade. In fact electric cars are probably the best "housing" for this sort of hardware.
I think the question stands. You can't fit 25 4090s in a robot (unless we're talking about something massive with an equally massive battery), and even if you could an LLM wouldn't be appropriate for driving a robot.
Given the pace of improvements I don't see how you compress 25 4090s into a single GPU in 7 years. A 4090 isn't 25 times the power of a GTX 980, it's closer to maybe three times.
Teslas are already consumer items that rock massive batteries.
My 25 count is off. It's probably closer to 75 GPUs right now.
Let's say today's models are running vanilla transformers via pytorch without any of Deepmind's Flamingo QKV optimizations. In 10 years algorithmic optimizations, and ML platform improvements push that efficiency up 3-5 fold. We're down in the ballpark of 25 GPUs (again)
Now, we ditch the general purpose GPUs entirely and go for specially built AI inference chip. The year is 2033 and specialty inference chips are better and more widespread. Jettison those ray tracing cores, and computer rendering stuff. Another 3x improvement and we're at ~8 GPUs
Now, you said the 4090 is about 3x faster than the GTX 980 from a decade ago. We are now on the order of 2-3 chips to run inference.
These models won't just be running in Teslas, they will be running in agricultural vehicles, military vehicles, and eventually robot baristas.
I guess you’re onto something, that combined with reductive, concave, inference architecture gains and new flux capacitor designs we are on 10x performance maximum optimisation curve.
And to give a bit more context, it's one of the top consumer-grade cards available (and has 24 GB of RAM). It costs on the order of $1.6k, versus $15-25k for an H100.
That just goes to show how huge the markup is on those H100 cards! Ten 4090 cards have more compute and more memory (240 GB!) than an H100 card. The cost of the 4090s is the same or lower for about 4x the compute.
thanks :) posted it on HN but got no traction, definitely underperformed my expectations. my only explanation is i think i screwed up the title :/
i somewhat knew that people were going to obsess over the gpt4 tidbit, but hey ultimately George is just relaying secondhand info, and I wish more people focused on the tinycorp/tinybox/tinygrad story. I'm going to try again tomorrow to tell the story better.
At a minimum he glossed over the multimodal capabilities of GPT-4. If they use the same set of tokens, it’s unclear how this doesn’t pollute text training data. If they use separate tokens, the model size should be bigger.
I assume they just intersperse and/or use a special <IMAGE> token to prime the models appropriately. This has been done before by others training on e.g. HTML with <img> tags replaced with image tokens from a vision model.
This is the same strategy used by Palm-E and Mini-GPT4. They have a projection layer that transforms the vision embeddings into the token embedding space.
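A minimal sketch of that kind of projection (toy dimensions; the 1024-dim vision embeddings and 4096-dim token embeddings are assumptions, not any specific model's real sizes):

```python
import torch
import torch.nn as nn

# Project frozen vision-encoder embeddings into the LLM's token embedding
# space so image "tokens" can be spliced into the text sequence.
d_vision, d_model = 1024, 4096

projector = nn.Linear(d_vision, d_model)

image_embeds = torch.randn(1, 257, d_vision)   # e.g. ViT patch embeddings
image_tokens = projector(image_embeds)         # now shaped like text embeddings

text_tokens = torch.randn(1, 12, d_model)      # embedded prompt tokens
sequence = torch.cat([image_tokens, text_tokens], dim=1)  # fed to the LLM
print(sequence.shape)                          # torch.Size([1, 269, 4096])
```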
> In his book of the same name, Minsky constructs a model of human intelligence step by step, built up from the interactions of simple parts called agents, which are themselves mindless. He describes the postulated interactions as constituting a "society of mind", hence the title.
I often hear the idea that digital is faster than biology. This seems mostly derived from small math computations.
Yet it seems the current form of large language model computation is much, much slower than our biology. Making it even larger will be necessary to come closer to human level, but what about the speed?
If this is the path to GI, the computational levels need to be very high and very centralized.
Are there ways to improve this in its current implementation other than caching and more hardware?
OpenAI's modus operandi is basically "does it get better if we make it bigger?" Of course they are constrained by economic factors like the cost of training and inference, but if they have the choice between making the model better or more efficient, they choose the better model.
I believe over the next years (and decades) we will figure out that a lot of this can be done much more efficiently.
Another problem with the analogy to humans is obviously that these models know much more than any one human can remember. They are trained on our best approximation of the total sum of human knowledge. So the comparison to any one human will always be fraught with problems.
This is probably not the path to GI. First we would need a precise scientific formalism to accurately describe intelligence, which currently does not exist. Second, it may or may not end up being tied to consciousness, and there's a thing called the hard problem of consciousness, that possibly might not be solvable.
It might end up being the kind of thing where if you want to accurately model consciousness, you would need a computer the size of the universe, and it's gotta run for like 13.8 billion years or so, but that's just my own pure speculation - I don't think anybody even has a clue on where to start tackling this problem.
This is not to discourage progress in the field. I'm personally very curious to see where all this will lead. However, if I had to place a bet, it wouldn't be on GI coming any time soon.
Seems like it, doesn't it? I'd be curious to see if and how we can get there.
What I was highlighting are some serious challenges along the way that might end up leading us to insights about why it might be harder than we think, or why there may be factors that we aren't considering.
It's very easy to say "brain is made of fundamental particles and forces, all we have to do is create a similar configuration or a model of them," but it's precisely in the task of understanding the higher order patterns of those fundamental particles and forces where we seem to run into some serious challenges that as of yet remain unaddressed.
The AI/ML way of approaching this is more of a top-down approach, where we just sort of ignore the fact that we don't understand how our own brains/minds work and just try to build something kind of like it in the folksy sense. I'm not discouraging that approach, but I'm very curious to see where it will lead us.
We really have no idea how to directly compare the two.
Also, vast portions of the human brain are dedicated to the visual cortex, smelling, breathing, muscle control... things which have value to us but which don't contribute to knowledge work when evaluating how many parameters it would take to replace human knowledge work.
While those portions of the brain aren't specific to learning intellectual or academic information, they might be crucial to making sense of data, help in testing what we learn, and help bridge countless gaps between model/simulation and reality (whatever that is). Hopefully that makes sense. Sort of like... Holistic learning.
I wonder if our brains and bodies are not all that separate, and the intangible features of that unity might be very difficult to quantify and replicate in silico.
We can say that such and such part of the brain is "for" this or that. Then it releases neurotransmitters or changes the level of hormones in your body which in turn have cascading effects, and at this point information theory would like to have a word.
"If our small minds, for some convenience, divide this glass of wine, this universe, into parts -- physics, biology, geology, astronomy, psychology, and so on -- remember that nature does not know it!" -Richard Feynmann
I think it's even more interesting that the required amount of energy to do that high computational work isn't that high. Evolution has been working on it for a long time, and some things are really inefficient but overall it does an OK job at making squishy machines.
I had a good chuckle at "squishy machines". That's a really interesting way to think about it. It makes me wonder if, some day, we will be able to build "squishy machines" of our own, capable of outperforming silicon while using a tiny fraction of the energy.
We have no idea how to estimate the computational capacity of the brain at the moment. We can make silly estimates like saying that 1 human neuron is equivalent to something in an artificial network. But this is definitely wrong, biological neurons are far more complex than this.
The big problem is that we don't understand the locus of computation in the brain. What is the thing performing the meaningful unit of computation in a neuron? And what is a neuron really equivalent to?
The ranges are massive.
Some people say that computation is some high level property of the neuron as a whole, so they think each neuron is equivalent to just a few logic gates. These people would say that the brain has a capacity of about 1 petaFLOP/s. https://lips.cs.princeton.edu/what-is-the-computational-capa...
Then there are people who think every Na, K, and Ca ion channel performs meaningful computation. They would say the brain has a capacity of 1 zettaFLOP/s. https://arxiv.org/pdf/2009.10615.pdf
Then there are computational researchers who just want to approximate what a neuron does. Their results say that neurons are more like whole 4-8 layer artificial networks. This would place the brain somewhere in the yottaFLOP/s range https://www.quantamagazine.org/how-computationally-complex-i...
And we're learning more about how complex neurons are all the time. No one thinks the picture above is accurate in any way.
Then there are the extremists who think that there is something non-classical about our brains. That neurons individually or areas of the brain as a whole exploit some form of quantum computation. If they're right, we're not even remotely on the trajectory to matching brains, and very likely nothing we're doing today will ever pay off in that sense. Almost no one believes them.
Let's say the brain is in the zettaFLOP/s range, i.e. 10^21 FLOP/s. Training GPT-3 took about 10^23 FLOPs total over 34 days. 34 days is roughly 3x10^6 seconds, so the training throughput was roughly 3x10^16 FLOP/s. By this back-of-the-envelope computation the brain has about 4-5 orders of magnitude more capacity, somewhere around 30,000x. This makes a lot of sense: they're basically using a tens-of-petaFLOP/s supercomputer, which we already knew. We'll have zettaFLOP/s supercomputers eventually, and maybe yottaFLOP/s, though people are worried we're going to hit some fundamental physical limits before we get there.
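A quick sanity check of that arithmetic (same rough assumed numbers, not measurements):

```python
SECONDS_PER_DAY = 86_400

gpt3_training_flops = 1e23                   # ~total FLOPs to train GPT-3 (rough)
training_seconds = 34 * SECONDS_PER_DAY      # ~2.9e6 s
training_throughput = gpt3_training_flops / training_seconds   # ~3.4e16 FLOP/s

brain_flops = 1e21                           # the "zettaFLOP/s" brain estimate

print(f"{training_throughput:.1e} FLOP/s")                          # ~3.4e+16
print(f"{brain_flops / training_throughput:,.0f}x more capacity")   # ~29,000x
```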
All of this is a simplification and there are problems with every one of these estimates.
But, in some sense none of it means anything at all. You can have an extremely efficient algorithm that runs 1 million times faster than an extremely inefficient algorithm. Machines and brains do not run the same "software", the same algorithms. So comparing their hardware directly doesn't say anything at all.
This is an important point. On the one hand, real neurons are a heck of a lot more complex than a single weight in a neural network, so exactly mimicking a human brain is still well outside our capabilities, even assuming we knew enough to build an accurate simulation of one. On the other hand, there's no intrinsic reason why you would need to in order to get similar capabilities in a lot of areas: especially when you consider that neurons operate in a very 'noisy' environment, it's very possible that there's a huge overhead to the work they do to make up for it.
Since it is already giving interesting results, let's say a "brain" is a connectome with its current information flow.
Comparing AI with a brain in terms of scale is somewhat hazardous, but with what we know about real neurons and synapses, one brain is still several orders of magnitude above the current biggest AIs (not to mention that current AI is 2D and very local, whereas the brain is 3D and much less locality-constrained).
The "self-awareness" zone would need a connectome at least 1000x bigger than that of the current biggest AIs: redundant, with a saveable flow of information, 3D, and with fewer locality constraints. Not to mention realtime rich inputs/outputs, and years of training (like a baby human).
Of course, this is beyond us; we have no idea what's going on, and we probably never will. This is totally unpredictable, and anybody saying otherwise is either trying to steal money for some BS AI research, or is a real genius.
Now the question I have is how small a model could be and still fascinate a big population of smarter creatures. On the one hand we know that the human brain has a lot of power, but on the other it could be "manipulated" by less intelligent creatures.
I really wonder if it is the case that the image processing is simply more tokens appended to the sequence. Would make the most sense from an architecture perspective, training must be a whole other ballgame of alchemy though
I like how geohot takes these concepts that seem impossibly complex to an outsider (mixture of models, multi modality) and discusses them so casually that they seem accessible to anyone. The flippancy is refreshing.
Interesting. And it makes sense. E.g. I could see one of the eight being closely focused on and trained on GitHub-like data. Could help it stay on task too.