"Announced at today's ISC23 keynote, the Aurora genAI model is going to be trained on General text, Scientific texts, Scientific data, and codes related to the domain."
Not only is the model not yet trained, but the datacenter that will train it has not yet been built: "the Aurora supercomputer is said to launch later this year"
Just an FYI, that's when it'll be completed and access opened up to a wider audience. Many parts are already built and being tested. We're talking about a DOE computer, not an AWS datacenter.
“The 2 Exaflops Aurora supercomputer is a beast of a machine and the system will be used to power the Aurora genAI AI model.”
and instead of a data center with a distributed load of Nvidia cards, they’re going to do 1 supercomputer? Are these guys intentionally stuck in 2007, or am I missing something?
> and instead of a data center with a distributed load of Nvidia cards
You're missing something. Most models are trained on supercomputers. You still use DDP, but you also use Slurm. A supercomputer is just a many-node machine, but one where you REALLY care about interconnect. It's why the fabric is one of the big points here. I/O is generally your bottleneck in these systems. Each node should be as powerful a system as possible because of this.
A datacenter doesn't need a PB/s connection between machines. Your I/O probably isn't your bottleneck and so the tradeoff between PCIe and ethernet (infiniband or whatever) isn't a big deal. Usually because your processes are __independent__ (containers). In a supercomputer, your processes may be parallel, but that doesn't mean they're independent. You'll spend a lot of time writing code to be non-blocking but you're going to have to map reduce at some point.
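To make the DDP-plus-Slurm point concrete, here is a minimal sketch of how a process in a multi-node PyTorch job typically wires itself up inside a Slurm allocation. This is a generic NCCL/CUDA sketch of the pattern, not Aurora's actual software stack (Aurora uses Intel GPUs), and the tiny model is a stand-in; it also assumes the job script exports MASTER_ADDR/MASTER_PORT.

    # Minimal sketch: one process per GPU inside a Slurm allocation.
    # Assumes MASTER_ADDR/MASTER_PORT are exported by the batch script.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # Slurm tells each process who it is in the job.
        rank = int(os.environ["SLURM_PROCID"])
        world_size = int(os.environ["SLURM_NTASKS"])
        local_rank = int(os.environ["SLURM_LOCALID"])

        # The backend (NCCL here, or another fabric-aware backend) runs the
        # collectives over the high-speed interconnect -- this is where the
        # fabric quality matters.
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in model
        model = DDP(model, device_ids=[local_rank])

        # ... training loop: every optimizer step implies a gradient
        # all-reduce across all nodes, so interconnect bandwidth and latency
        # bound throughput, not just raw FLOPS per node.

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

The point is that the gradient all-reduce couples every node on every step, which is exactly why the interconnect, rather than per-node compute alone, tends to be the limiting factor in these systems.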
Racks full of GPU nodes are called supercomputers. Supercomputers that ran a single instance of an OS kernel are from even farther back, maybe the last century.
It's a datacenter, not a single machine, obviously. All "supercomputers" are datacenters today. They can use Nvidia or other processing units, it doesn't really matter.
- 10,000 nodes with 2xCPU + 6xGPU with unified memory
- 10PB of aggregate RAM
- 230PB of storage
Scale matters in supercomputing. Keeping latency low between everything allows you to solve different classes of problems that get no speedup benefit from simply offloading compute to Nvidia cards. It makes a huge difference how everything is connected.
One H100 has the same memory bandwidth as the two CPUs in each of the compute nodes. But the supercomputer has >9000 compute nodes, so 62x H100 GPUs might fall short there.
They also might not be measuring the same thing. H100 GPUs have "3026 TFLOPS":
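A rough back-of-the-envelope for the bandwidth comparison above. The per-device figures are assumed ballpark numbers from public spec sheets, not anything official about Aurora; the only input taken from the comment is that one H100 roughly matches a node's two CPUs and that there are >9000 nodes.

    # How many H100s would it take to match just the aggregate CPU memory
    # bandwidth of a >9000-node machine? Per-device numbers are assumptions.
    H100_BW_TBS = 3.35      # assumed HBM3 bandwidth of one H100 SXM, TB/s
    CPU_PAIR_BW_TBS = 3.3   # assumed combined bandwidth of the two CPUs per
                            # node (roughly one H100's worth, per the comment)
    NODES = 9000

    aggregate_cpu_bw = NODES * CPU_PAIR_BW_TBS          # ~29,700 TB/s
    h100_equivalents = aggregate_cpu_bw / H100_BW_TBS   # ~8,900 H100s

    print(f"Aggregate CPU bandwidth: {aggregate_cpu_bw:,.0f} TB/s")
    print(f"H100s needed to match it: {h100_equivalents:,.0f}")

So if one H100 really does match one node's CPU pair, you would need on the order of 9,000 H100s just to equal the CPUs' memory bandwidth, before even counting the six GPUs per node; 62 H100s is nowhere close.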
I can only assume someone is working on a BS detector. And holy fuck, I would pay cash money for a way to identify BS without having to delve into it myself. Imagine not having to dive into a paper only to find out that the sample size is so small it shouldn't even have been considered for submission, or better yet, having online news labeled as 'propaganda beneficial to side x'.
One of the few interesting ideas from Neal Stephenson’s book Fall is the idea that the rich pay people to curate the information they see on the web and in their social feeds. The poor have to wade through all of the machine generated content and propaganda and figure out what is true on their own.
An AI that does that detection would be both wonderful and dangerous.
(That book abandoned its only interesting ideas and went totally off the rails a few chapters later IIRC.)
In a sense, it already is. My voice is drowned in a sea of SEO-optimized gibberish. Even if I had something interesting and novel to say, an average person would be hard pressed to 1) find me 2) convince me to exchange ideas 3) pay money for it. There is usually a market for experts with niche appeal, but, well, the market only needs so many.
Still, it is a great question, and part of me wonders about the what-ifs of this evolution.
According to Intel's own press release, "Argonne is spearheading an international collaboration to advance the project, including Intel; HPE; Department of Energy laboratories; U.S. and international universities; nonprofits; and international partners, such as RIKEN." Maybe the headline should say "Argonne Labs announces GPT-4 competitor"
I'll be interested to see how this performs if/when it is released.
I'm not convinced that simply increasing the number of parameters improves the models, and I don't think we should be putting out press releases "spec-chasing" increases in the number of parameters. I've also found in my research that larger models often do not perform better, but this is becoming more difficult to explain to non-technical folks due to irresponsible marketing. If we aren't careful about the bold marketing claims, we'll disillusion many people and reduce the potential for future growth—we don't want another AI Winter.
There's a significant environmental cost to training these large models, and it's harder to understand, use, and control these very large models (even as a researcher).
Parameter count is strictly better IF the number of tokens (and ideally the quality of those tokens) trained on also increases, and if the training runs for longer (most LLMs are way undertrained).
Most of the huuuuuge models failed on most or all of these fronts, and that's why they suck compared to Llama or Alpaca or Vicuna.
That's not true. For the same number of training tokens, bigger is better. And for the same size, more tokens is better. So obviously more tokens and bigger is better.
>I'm not convinced that simply increasing the number of parameters improves the models
How so? LLaMA 33B is clearly worse than LLaMA 65B, and that's 'only' a 2x increase.
Are you aware of any example where a larger model fails to outperform a smaller one when all else is equal (tokens, architecture, data quality, etc.)?
Obviously for a fixed amount of training compute more parameters can be bad, but there's a tradeoff where more parameters mean you train on fewer tokens.
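For anyone who wants to see that tradeoff explicitly, here is a small sketch using the common Chinchilla rules of thumb (training compute roughly 6·N·D FLOPs, and roughly 20 training tokens per parameter for compute-optimal training). The constants are approximations, not exact values, and the budget chosen below is simply the rough Chinchilla-70B budget.

    # Parameters-vs-tokens tradeoff at a fixed compute budget, using the
    # usual approximations: C ~ 6 * N * D FLOPs, D_opt ~ 20 * N tokens.
    def training_flops(params: float, tokens: float) -> float:
        """Approximate training compute in FLOPs."""
        return 6 * params * tokens

    def tokens_for_budget(params: float, flops_budget: float) -> float:
        """How many tokens a fixed compute budget buys at a given model size."""
        return flops_budget / (6 * params)

    # Roughly the Chinchilla budget: 70B params trained on 1.4T tokens.
    budget = training_flops(70e9, 1.4e12)

    for params in (70e9, 175e9, 1e12):
        tokens = tokens_for_budget(params, budget)
        ratio = tokens / params
        print(f"{params/1e9:6.0f}B params -> {tokens/1e12:5.2f}T tokens "
              f"({ratio:4.1f} tokens/param; ~20 is compute-optimal)")

At a fixed budget, the 1T-parameter model ends up wildly undertrained by the ~20 tokens/param rule of thumb, which is the point both sides above are circling.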
> There's a significant environmental cost to training these large models, and it's harder to understand, use, and control these very large models (even as a researcher).
Exactly. The worst part is that there is NO viable, efficient way of training, fine-tuning, or running inference with these massive deep learning models, and there hasn't been for more than a decade, ever since deep neural networks moved to GPUs for training. It still requires a substantial amount of compute power and energy that is incinerating the planet, and the result is large black-box models that cannot be trusted or transparently explain their decisions, especially for safety-critical or high-risk tasks.
Crypto at least has an alternative to the wasteful proof-of-work system, and Ethereum which was formerly PoW has shown it is possible to switch to a greener alternative consensus method.
Deep Learning, however, still has not shown such a viable switch and still needs to burn the planet with more data centers of GPUs, ASICs, and FPGAs to create hallucinating models that have been shown to break on a single pixel, to confidently regurgitate nonsense as truth with little real reasoning, and to answer with demonstrably false information.
LLMs like this one are still essentially snake-oil BS generators hiding behind regurgitation and sophistry to pretend to show signs of 'intelligence'.
EDIT: It is all true. [0] There is no amount of green-washing to hide the problem of deep learning systems wasting essential resources like thousands of running taps of water. Literally.
Intel could just make every CPU/GPU they manufacture run a 'burn-in test' for a week or so, where it runs a bunch of self-tests and then trains a massive network for them.
If sales slow down, instead of having the excess stock in a box in a warehouse, have it training their neural nets.
This kind of news isn't of interest until the model is built, tested side by side with GPT-4, and shown to be at least in the same ballpark of performance.
So far, nobody has claimed to exceed GPT-4's performance (either by academic metrics or by YouTubers' subjective evaluations).
Isn't there more and more research coming out showing that at a certain point (~200B), parameters have significantly decreasing returns and it's better to just do some supervised learning on top of the base model?
I love bashing Intel as much as the next person, but looking at the number of tokens LLaMA[1] was trained on (1,400B tokens) and the number of parameters of GPT-3[2] vs. the number of tokens it was trained on (175B params vs. 300B tokens), it's not like the announced 1,000B-param model is unreasonable. Taking 175/300 * 1,400 yields ~817B parameters, which is fairly close to 1,000B.
Not outright the most efficient utilization of data, but as others have mentioned, there might still be something left to gain (from not solely optimizing for compute). E.g. look at Figure 9 in [3]. Although the largest model obviously utilized the data most efficiently, it's not perfectly clear - to me at least - whether some over-parameterization will necessarily lead to a decrease in test/out-of-sample performance.
Of course, LLaMA was trained with way fewer parameters relative to dataset size. I've just mentioned LLaMA as a point of reference for the largest dataset known to me.
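The extrapolation in that comment, spelled out; the only inputs are the numbers quoted above (GPT-3's 175B params / 300B tokens and LLaMA's ~1.4T-token dataset), nothing new.

    # The parent comment's extrapolation, spelled out.
    gpt3_params = 175e9        # GPT-3 parameters
    gpt3_tokens = 300e9        # GPT-3 training tokens
    llama_tokens = 1_400e9     # LLaMA's reported training set, ~1.4T tokens

    # Keep GPT-3's params-to-tokens ratio, scale to a LLaMA-sized dataset:
    extrapolated_params = gpt3_params / gpt3_tokens * llama_tokens
    print(f"{extrapolated_params/1e9:.0f}B parameters")   # ~817B, near the announced 1,000B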
There are questions about why this announcement matters if the model isn't even trained yet. There are a few reasons:
1) This is going to be using a publicly funded computer, which is the most powerful in the world AND was also announced today (btw, the specs are better than what was initially planned). The program gives justification for the machine and that public money (though in comparison to many things governments spend money on, this is very cheap).
2) THAT'S A LOT OF GPUS. This will, as far as I'm aware, be the biggest training run ever. GPT-4 was supposedly trained on 10k A100s. Remember that Summit, the #3 computer, has >27k GPUs (V100s), but Aurora has almost 64k. This is going to be a big engineering feat in and of itself. Now it won't make training 6x faster (rough ratio sketched just below), but it will definitely be much better. It does say that the US government is taking this very seriously.
That's the point of this kind of announcement: marketing, propaganda, and a cool engineering project.
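Where the "6x" above comes from, roughly. The Aurora count is the commonly cited ~64k figure and the 10k-A100 number for GPT-4 is itself a rumor, so both inputs are claims rather than official numbers.

    # Rough ratio behind the "6x" remark; both counts are claims, not official.
    aurora_gpus = 63_744   # commonly cited count (~10.6k nodes x 6 GPUs)
    gpt4_a100s = 10_000    # rumored A100 count for GPT-4 training

    print(f"Naive ratio: {aurora_gpus / gpt4_a100s:.1f}x")   # ~6.4x
    # Real speedup would be lower: per-GPU throughput differs, and scaling
    # efficiency drops as the gradient all-reduce spans more nodes.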
Something to keep in mind is that this datacenter will be doing many other projects as well; it's not a dedicated LLM maker. Maybe the LLM training won't have access to all those GPUs, or maybe it will only have time-shared access split with the other projects.
Not really a datacenter, it is a supercomputer. And yes, it is important to keep in mind that there aren't just single users. But with almost 64k GPUs it isn't unreasonable for multiple teams to get 10k+ GPUs at a time. I've seen runs on Summit use >50% capacity. Let's be clear, the teams aren't just going to start off using all the nodes at once. You step up.
> the target size for the new model is 1 trillion parameters. Meanwhile, the target size for the free & public versions of ChatGPT is just 175 million in comparison. That's a 5.7x increase in number of the parameters.
Has there been any official confirmation of the number of parameters used in GPT-4? Pages around the web seem to have conflicting reports on the number, ranging from 1T to 170T, but there doesn't seem to be anything official.
Just to expand your lower bound: some people suspect it is not much bigger than GPT-3.5's 175B parameters, but this is also speculation. I don't think there has been anything official.
It can't be smaller, though; the latency difference between them implies that GPT-4 is significantly larger. This is unlike PaLM, where latency decreased.
That is correct. The open science practiced at OpenAI managed to publish a 100-page GPT-4 technical report without using the word "parameters" a single time :-) (I checked).
I don't see any reason why someone should train a 1T model; they probably don't have a big enough dataset to make it worth it, so with the same compute or less you can get a better model by training a smaller one (Chinchilla scaling).
> ... the target size for the new model is 1 trillion parameters.
> Meanwhile, the target size for the free & public versions of ChatGPT is just 175 million in comparison. That's a 5.7x increase in number of the parameters.
5.7x seems off by several orders of magnitude, or do I need more caffeine?
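The caffeine check, for what it's worth: the 5.7x only works if the article meant 175 billion (GPT-3/3.5's published size, as discussed above), not 175 million.

    # Sanity check on the article's "5.7x" claim.
    target = 1e12            # 1 trillion parameters (announced target)

    print(target / 175e9)    # ~5.7   -> consistent with 175 *billion* (GPT-3/3.5)
    print(target / 175e6)    # ~5714  -> what "175 million" would actually imply

So the ratio itself is fine; the quoted "million" is almost certainly a typo for "billion".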
"matter of a few months" / "on par offerings"... My (very very) limited impression is both of those statements aren't fair. Google/intel/etc have been working on lots of AI-related projects, even LLM efforts over years. It's not like any of these companies are bootstrapping LLMs or just posted their first AI engineer job a few months ago.
I also don't think I'd call what they're offering "on par" with GPT-4. They're competitive, or impressive substitutes, but my personal tests of Bard, at least, aren't getting me results as useful as GPT-4's.
Source: too lazy to look up any. This is just my lame impression. Feel free to downvote.
I also thought it was curious. There is a wide range of possibilities for this.
My favorite hypothesis is that OpenAI was navigating the unknown open sea trying to answer the question "Is it possible to achieve this?" without knowing if they would ever find a new land.
The rest of them already know the answer is "Yes it is" so it's not an unknown open sea anymore with a dubious destination. They know it's possible, it's just a matter of money and time. They probably hired whatever engineer was available, and they paid tons of money to come up with a competitive product.
My own theory is Google had this earlier but couldn't figure out a way to make money from it. I think they felt threatened by it and didn't release what they had.
There was a Google engineer last summer who leaked a whole giant chat transcript with one of their LLMs that was pretty impressive, before ChatGPT opened up at the end of the year.
LibreOffice can accomplish most office tasks. How much moat do Office and GSuite have?
Moats are less than 50% tech advantages, and way more than 50% user/product/ecosystem/API integrations. Even if the self-proclaimed 'on par' competition is actually on par, it doesn't matter at all until they release APIs and start doing products. So far only Google has even begun doing that. The real battle will be over integrations - if OpenAI has that, they'll have their moat alright. Same applies to Google if they actually take off (which is why we should ignore the spin and look at what they're doing).
I think OpenAI is building a moat right in front of us through their plug-ins idea.
Executed well, plug-ins could become ChatGPT's App Store equivalent. Once that happens, OpenAI is undoubtedly going to convert a portion of search traffic into agent-delegated work, i.e., people will use ChatGPT to start and terminate their searches, essentially delegating the work of collating data from multiple sources to ChatGPT.
FD - I've applied to release a ChatGPT plug-in and this space looks very interesting to me.
PS - Earlier today, I submitted to HN a link to my blog where I analyzed plugins. The HN post sank without a trace, but I wrote the blog post using the Wolfram plugin in ChatGPT and it was a breeze. I genuinely feel I've seen the future.
This is only a factor if the underlying AI/LLM model becomes as good as OpenAI's AND serves millions of people. While open-source LLMs are soon going to gain parity with GPT-4/Bard, it is unlikely that they will hit the kind of usage numbers that an OpenAI/Google/Meta can deliver.
Even there, I'm not sure Google/Bing will cannibalize their own product to allow third parties to inject data into a search interaction.
ChatGPT is a fundamentally different product - it's a humanlike intelligence which is always ready to talk to you and assist you. We haven't ever had anything of comparable quality and reach. The nearest was Alexa but even she was limited to Amazon's catalog and shopping. It's like ChatGPT can be everyone's personal gofer and THAT is huge, imo.
No, that's incorrect. It is auto-fill on steroids, trained on years of internet postings from actual humans (e.g., Reddit, Twitter, etc.).
That input is going to dry up - no longer free (Reddit has said all future posts are going to cost $ to access). Good for OpenAI as it has such a head start, but many companies can source or tap into legitimate alternative user data streams - FAAMG for sure, but also any company that already has a foothold and can convince its userbase to provide training data.
Google has its finger on the pulse of everything that goes through google search, gmail, etc. There's a reason everyone 15 years ago thought that Google would produce the first game-changer AI.
OpenAI is getting stuff from Bing, but Microsoft controls that data.
> I think OpenAI is building a moat right in front of us through their plug-ins idea.
I think OpenAI is doing the opposite by massively degrading GPT-4 to support the load from all these integrations (as well as their new app). Their moat was the head-and-shoulders-above-their-competitors quality of GPT-4, which has now taken a huge nosedive. I'm not sure why anyone would pay for it instead of GPT-3.5-Turbo. It went from being a competent, if somewhat error-prone, coder to doing things like randomly inserting C# code blocks in text.
Unless I'm missing something, the spec you expose to ChatGPT only tells it which API endpoint to hit. The code powering that endpoint is not visible to ChatGPT.
The only argument you could make is that you are giving it machine-understandable text describing an API endpoint. In the future, it might not even need the text description if the API endpoint is named well. Quite a stretch, though.
Do you find they are better than the opposite, i.e., an app using the GPT-4 API? To me the plugins seem back-to-front as far as optimal architecture is concerned.
1. Single point of entry: you don't need ten different apps to achieve ten different outcomes. You can do it all right inside ChatGPT.
2. Since I can have up to three plugins active inside ChatGPT (as of today), I can express more complex workflows than if I had one app each with no trivial way to stream data from one app to the next.
It's like Zapier for your ideas. You talk to ChatGPT > it extracts data from a plugin > you prod it along a bit more > it talks to a second plugin to do X > more chat > talk to third plugin etc etc.
Zapier itself only lets you flow data from one app to the next, not act on that data.
I expect that as the LLM evolves and matures, they will allow more plugins and start eating up different industries. E.g., why do you need a Google Drive when your ChatGPT can also store files for you?
>why do you need a Google Drive when your ChatGPT can also store files for you?
If the idea is to keep the actual files rather than summaries, then there's an entire world of requirements (data storage reliability, access control, auditing, integration) where OpenAI etc. have no competitive advantage, plus their own issues (e.g., prompt injection). LLMs are a bad fit whenever you need them to act like a computer; they replace human-style processing. For math, get a calculator.
In this case, I'd expect some way to interface to OneDrive or maybe even BackBlaze.
I don't think their brand is as strong as people believe. They are pejoratively referred to as "ClosedAI"
They have a lot of downtime and change the model out from under people, with only 3 months of support. It is very hard to build a product on OpenAI in this situation.
A close-enough model that can run on your own hardware or cloud will get people to move. We are watching and waiting for that point for our own products. For now, they are called experiments and demo applications.
The people referring to them as "ClosedAI" are not a significant portion of the population. I've overheard strangers talking about ChatGPT (honestly, half the time it's "ChatGTP"); none of them give a shit about OpenAI's philosophy. At most, they have a vague notion that Elon Musk was involved.
So this thing doesn't actually exist yet?