I don't understand how this can be considered a technical report. No information on model architecture, distributed training methodology, or optimizations. The "Training dataset" section is a pathetic 0.5 pages long.
In that sense, it's very similar to the GPT-4 Technical Report.
The era of being "open" about LLMs or other "secret sauce" models in published papers may be over, since these things have become existential threats to companies.
Btw I've arrived at a different interpretation of the "Open" in OpenAI. It's open in the sense that the generic LLM is exposed via an API, allowing companies to build anything they want on top.
Companies like Google have been working on language models (and AI more broadly) for years but have hidden the generic intelligence of their models, exposing it only via improvements to their products. OpenAI bucked this trend and exposed an API to generic LLMs.
> Btw I've arrived at a different interpretation of the "Open" in OpenAI.
I don't understand why people keep trying to wrap their heads around the word 'Open' in OpenAI. If you ever saw a commercial claiming a product has a 'great new taste', but then you tried it and it tasted bad, would you twist yourself into knots trying to understand how you went wrong in your interpretation of 'great'? No, that's ridiculous. Same with the 'Open' in 'OpenAI'. It's just some letters that form part of the name they chose for themselves when they filled out the form to incorporate their company.
You mean when they filled out a form to incorporate their non-profit. Which they later turned into a for-profit company after reaping all the goodwill. The “Open” used to mean something.
That is a bit reductionist. They turned it into a for-profit company controlled by a non-profit entity, with profits / returns being capped for employees / investors.
When they were founded? Yes. The issue was that the big AI players (Google, Facebook, etc.) were keeping their models and training data secret. People (rightly, IMHO) saw this opaque development style as a risk. The OpenAI founders made a big splash by registering as a non-profit and declaring that they were going to do all their model training in public and share the weights for everyone to use. In other words, they were claiming to do something more like what Stability AI is today, except with a stronger legal non-profit organization.
Because of that framing, they poached a lot of very good talent and built one of the best AI teams that has ever been assembled. Then they perverted their corporate structure into an effective for-profit, reneged on open access to their trained models, and turned into a bog-standard service-oriented company.
Nonprofit status makes it much harder to extract large profits. A charity founder can pay himself a million-dollar salary, but he can't sell his shares in the nonprofit and become a billionaire.
> Nonprofit status makes it much harder to extract large profits. A charity founder can pay himself a million-dollar salary, but he can't sell his shares in the nonprofit and become a billionaire.
What difference does it make for a non-public company? They can pay themselves more salary either way. The shares aren't really valuable until then.
As for charities - if you really believe that. The money doesn't even enter the books. Have you never seen an in-person donation site? Someone gives $100; the staff takes the $100, keeps $50, records $50, and puts that in the donation box. After a few more layers, the actual donation could be just $1. I've seen this at your regular big-name charities - all the time.
And let's not get started on the sponsor a child that doesn't exist options...
They are not confused. "OpenAI is a non-profit artificial intelligence research company. Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return. Since our research is free from financial obligations, we can better focus on a positive human impact."[1]
I think the strong connotation of the word "open" in the software community comes from "open source". If OSS was called "great new source" and a new closed source company called itself GreatNewAI you'd have a similar phenomenon of people taking apart the name.
That's true, but also not relevant to the current widespread use of the term. Concepts and common understanding evolve with language, and I'm not even sure what point you're trying to make by pointing this out. Your first link even includes the language:
> "open source" marketed as trumping "open system".
Common use and understanding of the use "open" evolved decades ago.
Your comment also tries to sidestep the issue at the heart of what people are annoyed and frustrated by. The founding principles of the OpenAI foundation laid out exactly what that usage of "Open" meant for their organization, and they have since backtracked on their own principles.
We're discussing the use of the word 'Open'. Which was first applied to systems. Then to "source", which actually did in fact argue that openness of source was more important than system openness. As to system openness, that is well understood as open access to a black box via open (non-proprietary) APIs. Which is precisely what "OpenAI" is providing.
> Your comment also tries to side step the issue ..
We disagree. Narrowly directed and addressing the "issue", in fact.
I don't agree it's scummy. Scummy is getting someone to build a business on a 1 Billion dollar donation, going for a hostile takeover 10% of the way there, then reneging when that doesn't work.
Salvaging your business from that sort of tantrum by working with MS is called surviving.
> Companies like Google have been working on language models (and AI more broadly) for years but have hidden the generic intelligence of their models, exposing it only via improvements to their products. OpenAI bucked this trend and exposed an API to generic LLMs.
That's true. My thought was they're still 'open', in an important way, even though it's not the open source way. If they were smart they'd adopt my interpretation in their PR materials.
I wonder how special these architectures are compared to what's published.
The "secret sauce" may just be getting 2 pages (~200) worth of engineers collaborating and either rolling out your own cloud service or spending $$$ at someone else's.
Also, not sure how much it matters beyond academic interest, of course. Realistically, there are only 4-5 (US) companies with the human resources and capital to roll out something similar to these models for what is most likely a complete write-off.
They could claim whatever they wanted and it would be near impossible to validate.
I think the secret sauce is just bucket loads of cash to spend on compute.
And because of this I don’t buy that AI is an existential threat to Google at this point. If they were really worried they could spend a tiny portion of their ~280 billion dollars in revenue to train a bigger model.
I assume this is just a PR/IR-driven project to stay the "Google is Dead" headlines, hence the budget. Especially considering an oversized chunk was spent on the scaling-law study, it doesn't seem they were serious about building a GPT-4 killer.
I wasn't aware autoregressive LLMs were still considered an existential threat to Google. What's the threat supposed to be? That ChatGPT just keeps eating Google search market share while burning Microsoft capital on infra, a la the Uber model, or do they make money off of that at some point?
Seems farfetched OpenAI can compete with Google's resources, vertical integration down to the TPU and access to significantly more training data.
I agree that if training data is what matters, it is likely that no one can compete with Google with Google Books, which scanned 25 million volumes (source: http://www.nytimes.com/2015/10/29/arts/international/google-...), which is approximately all the books.
DeepMind's RETRO paper https://arxiv.org/abs/2112.04426 mentions a dataset called MassiveText, which includes 20 million books of 3T tokens. So we know Google is using Google Books, since there is simply no other source of 20 million books. Also as far as I know 3T tokens is more than publicly known to be used by anyone so far: Google could train on more data than anyone else, solely from Google Books, even without using its web crawl.
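A quick back-of-the-envelope on those figures (the 20M books and 3T tokens are the numbers cited above; the words-per-token ratio is just the usual rough heuristic):

  # Sanity check on the MassiveText book figures cited above (20M books,
  # ~3T tokens); the words-per-token ratio is just a rough heuristic.
  books = 20_000_000
  tokens = 3_000_000_000_000

  tokens_per_book = tokens / books              # 150,000 tokens per book
  words_per_book = tokens_per_book * 0.75       # ~112,500 words, i.e. roughly book-length
  print(f"{tokens_per_book:,.0f} tokens/book, ~{words_per_book:,.0f} words/book")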
Edit: it was 2005(!), so it is possible that many of you haven't heard of this. George Dyson, in "Turing's Cathedral", written in 2005, says:
> My visit to Google? Despite the whimsical furniture and other toys, I felt I was entering a 14th-century cathedral: not in the 14th century but in the 12th century, while it was being built. Everyone was busy carving one stone here and another stone there, with some invisible architect getting everything to fit. The mood was playful, yet there was a palpable reverence in the air. "We are not scanning all those books to be read by people," explained one of my hosts after my talk. "We are scanning them to be read by an AI."
> The era of being "open" about LLMs or other "secret sauce" models in published papers may be over, since these things have become existential threats
Yeah, this is a holdover from where LLMs grew out of: academia. "Technical report" is what you reach for when you don't want to compare to actual competitive baselines.
I'm sorry, this is nonsense. Technical reports exist to fill in information that is useful for readers but not necessary to understand the key contributions of the work, and/or that don't fit within the journal or conference's page limit. I'm not sure where you got the idea that it is something people do to avoid competitive baselines; IME, the peer-reviewed portion of the publication is far more likely to contain misleading benchmarks than the technical report, since the paper is trying to "sell" the work in a way the technical report is not.
What this is an instance of is Google's approach to academic publishing of releasing a paper that contains almost no actionable information, but which is considered important and publishable solely because it came from Google and therefore is used in industry. This has been exhibited many times before--e.g. see the original Spanner paper, which was so light on details and confusing that they needed to release a followup paper several years later to explain what the system was even using the atomic clocks for!
I agree that's what TR's are for. However, my point is, if you want to publish academic writing without peer review, a TR is a way to go about that. You can also just publish a preprint somewhere, which - surprise surprise - is also common for these same actors.
I get what you're saying, I just think this is more of a Google thing than a TR thing. Their peer reviewed papers have the same issue as their preprints, TRs, and whitepapers, generally speaking--Google researchers feel no incentive to actually share how they did things, perform accurate or up-to-date comparisons to comparable frameworks, or even bother outlining their key contributions, because they know the paper will be published, widely read, widely cited, and influential even if they don't do any of those things. It's to the point that I think it might actually be house policy to neuter their papers of specific details as much as possible, presumably to retain what they perceive as Google's competitive advantage, because it makes no sense otherwise that wildly different papers with different authorship groups coming from so many different areas of CS could all have these same problems.
This is (IMO) quite different from, e.g., the cases of academics publishing misleading benchmarks, which is more often just being wedded to a bad idea because you spent years of work on it and your position is at risk if you didn't end up outperforming existing approaches. Often I can still get a lot out of papers with misleading benchmarks, even if what I get is "don't try this technique, it doesn't work." Whereas I frequently get nothing at all out of Google publications. If I had to describe the way Google seems to view academic publishing in one word, it would be "marketing"--it's advertising for people to either come work at Google or use their products, not something written with the intent of advancing the wider state of the art, or even the less noble goal justifying the time and money they put into whatever they're writing about.
Surprisingly, their scaling law analysis still focuses on training FLOPs instead of training + inference FLOPs.
That said, they do mention this:
> The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute. [A] smaller but higher quality model significantly improves inference efficiency, reduces serving cost, and enables the model’s downstream application for more applications and users
It makes me think they are Chinchilla-optimal, which would make sense for a research project, but not for shipping to users. I am surprised they didn’t train to the validation loss plateau.
Depends on your goal, if it's to overtake OpenAI as having the best model overall it makes sense to optimize for training loss alone (assuming a fixed upfront compute budget).
Optimizing for inference to achieve the same loss would require more compute overall so you're either paying upfront with higher training costs or kicking the can down the road to inference.
News articles' estimates of GPT-4 cost seem to peg it at ~8 months of inference to achieve 1:1 cost with training. The lifespan of these models is TBD, but it's a pretty safe bet we'll have new ones by then. Of course GPT-3.5 is still getting used, but it probably won't cross 2:1-ish in its lifetime.
Might as well roll the dice and kick the can down the road if you're Google. I imagine they would happily pay an extra $500k/day in inference compute to be market leaders; what's $183 million for them? But if they don't get any real market share or the model sucks, they saved substantially on training.
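(That 183 mill is just the $500k/day guess annualized:)

  # Annualizing the (guessed, not known) extra inference spend above.
  extra_per_day = 500_000                                 # USD/day, a rough guess
  print(f"${extra_per_day * 365 / 1e6:.1f}M per year")    # -> $182.5M, i.e. the ~183 mill above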
> It makes me think they are Chinchilla-optimal,
They elaborate in the appendix but they empirically determine PaLM-optimal, which concurs with Chinchilla-optimal (more or less).
> Moreover, there are several other considerations besides the optimal training loss, such as training throughput and serving latency, which affect the decision regarding the optimal model size.
And they also mention, right before that, that "lower training loss" might not exactly mean "higher performance":
> However, the training loss is not a perfect proxy for downstream metrics. For example, the 8.95B model, which shows the lowest loss (Table 1) and is closest to the optimal model, slightly underperforms the 14.7B model on downstream tasks. This suggests that while scaling laws can be used to achieve optimal training loss for a given quantity of FLOPs, this does not necessarily transfer to achieving optimal performance for a given task.
That might be a random outlier, but ...
The Chinchilla scaling law describes how to balance parameters and training tokens to achieve minimal training loss for a given amount of compute. Low training loss is a good proxy for model performance (intelligence) but perhaps it is somewhat off?
For example, Chinchilla says that for optimal loss, we have to scale training tokens and parameters equally (50%/50%). But perhaps for optimal model "intelligence" we need something slightly different, e.g. 60% parameters and 40% training tokens.
Of course this seems somewhat unlikely, since it would mean such models are systematically smarter but systematically worse at predicting text compared to Chinchilla optimal models trained with the same amount of compute.
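To make that 50/50 split concrete, here's a sketch using the commonly quoted approximations (training FLOPs C ~ 6*N*D and ~20 tokens per parameter), not the exact fitted constants from either paper:

  # Sketch of the Chinchilla rule of thumb: params and tokens scale equally
  # with compute. Uses the approximations C ~ 6*N*D and D ~ 20*N.
  def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
      # C = 6*N*D and D = k*N  =>  N = sqrt(C / (6*k))
      n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
      n_tokens = tokens_per_param * n_params
      return n_params, n_tokens

  for c in (1e22, 1e23, 1e24):  # 10x more compute -> ~3.16x more params AND tokens
      n, d = chinchilla_optimal(c)
      print(f"C={c:.0e}: ~{n/1e9:.1f}B params, ~{d/1e12:.2f}T tokens")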
"Surprisingly, their scaling law analysis still focuses on training FLOPs instead of training + inference FLOPs."
It's kind of weird. In the conclusion, they say
>With PaLM 2, we have independently verified the scaling laws from Hoffmann et al. (2022) at large scales; we have shown that training tokens should grow at roughly the same rate as the number of model parameters.
then a few lines later
>In effect, we find that it is generally more efficient to train a smaller model with more tokens, for a fixed inference and training budget.
Without more architecture details, it's hard to tell what they're going on about.
I agree distillation is the wild card. The question is whether distillation works for LLMs. I am not aware of any public report of successful distillation of an LLM (I searched quite hard for this; if you know of any and can tell me, I would be very grateful), and I interpreted that to mean it doesn't work yet and negative results are not published due to publication bias.
No, distilling step-by-step https://arxiv.org/abs/2305.02301 distills an LLM into a task-specific model. That works, and I know of multiple successes. But it doesn't relate to the choice of optimizing training FLOPs vs training and inference FLOPs, since the resulting distilled model is not an LLM.
Turbo uses a different vocabulary (same one as gpt-4). Indicates that it's not the same model as the original 3.5, so I would be very surprised if it wasn't distilled.
"davinci" is the original GPT-3 (175B) which had too many parameters per Chinchilla scaling law. And parameter count is strongly correlated with inference cost. GPT-3.5 is likely Chinchilla optimal and much smaller than davinci.
Though this theory has the defect that GPT-4 is, I think, more expensive than GPT-3; as I recall, it was considered unlikely that GPT-4 is larger than 175 billion parameters. Not sure.
Yes, DistilBERT https://arxiv.org/abs/1910.01108 is in fact the closest case I know of. But it is too small (distilling from 110M to 66M), and both BERT and DistilBERT are intended to be used (and benchmarked) with separate fine-tuning for specific tasks, so they are not general.
Misgendering in translation is interesting not just because of wokeness but because it is an assuredly AI-complete subproblem in a domain that is sometimes not AI-complete.
For example, let's say I want to translate "the cat sat on the mat" from English to French. This doesn't require LLMs; the old Bayesian Google Translate could do that just fine.
Now let’s say you want to translate “Carol went to the store. They[3pp] bought some eggs” from a language that doesn’t have gendered 3rd person pronouns, to English which does have gendered pronouns. Now the model needs to know that Carol is a “she”, otherwise you will get the erroneous output “Carol went to the store. He bought some eggs.”
Let’s say we have: “Obama went to the store. [3pp] bought some eggs”. Now the model needs to know whether we are referring to Barack Obama or Michelle Obama so it needs to look back in the context to figure out which Obama which requires comprehension and world knowledge. For example if we precede with “After attending the national security briefing, …” then the model needs to know that: 1) national security briefings are attended by Presidents, 2) Barack Obama was President, in order to deduce that 3) “Obama” here is a “He”.
Getting pronouns right at human-level performance requires that the model understands language and has some knowledge of the world.
I didn't find it to be a particularly notable issue relative to the rest of the issues they mentioned. It didn't seem to be overrepresented to me...
That said, it's something that is more controllable across languages. All people, in all languages, have a roughly equal distribution of genders, but not race/religion, etc. Japanese language text will have similar gender distributions to English, but likely not equal distributions discussing race. That makes it a much better litmus test for multi-lingual bias.
Most of the misgendering discussion (2-3 paragraphs?) was in the translation section, which makes sense. A lot of the first classes in foundation courses learning a foreign language revolved around pronouns (which don't work the same in every language). Gender may be implied or absent in some. For example, to say "she is a doctor" in Italian, you might say "è un dottore", which has no pronoun (literally "is a doctor"). If you use google translate to make it English, "he" is added, assuming the gender. The potential for bias here is obvious, but consider that LLMs often deal with more context than a single sentence - if you're translating or writing story about a female doctor (where the gender is available contextually), you want all the use of pronouns to align where it makes sense. If a LLM didn't "understand" the pronoun in Italian, you might not recognize it, but in English, if the same person's gender was mixed across sentences, it'd be hard to read.
I'm not sure you got the right example here. "È un dottore" in Italian is unambiguously a he. Italian is a gendered language and leaves little room for contextual interpretation. A female doctor would be "una dottoressa", and you would never say it without a pronoun either, so the full phrase "lei è una dottoressa" leaves zero room for implying anything.
Maybe your comment would be more valid in other languages that are not so strongly gendered.
Artificial general intelligence (AGI) comes with very real x-risks[1] (existential risks) and s-risks[2] (suffering risks).
An expert survey of 738 researchers who published in NeurIPS and ICML was done last year[3]. Their median estimate of the probability that AI will have an "extremely bad" long-term outcome is 5%, and 48% of the researchers estimate the probability to be at least 10%. This is worryingly high considering the absolutely catastrophic consequences of those scenarios.
A minority of very vocal AI researchers (e.g. Yann LeCun) dismiss these risks entirely and claim that people read too much science fiction. But when you listen to their interviews it's very clear that they have no idea what they are talking about and never actually read any scientific literature on the subject.
The study of AI risks is a serious area of academic research that is worked on by labs from Stanford[4], Berkeley[5], Carnegie Mellon University[6], Oxford[7], Cambridge[8], and many MANY other universities[9]. Not people who read too much science fiction.
Personal experience: I'm using GPT-4 for writing code, especially in Python. After using Bard today, I feel Bard is doing quite well considering it's free. I will keep using it, and if it keeps doing well, I will cancel my GPT-4 $20/month subscription.
Early this evening, I asked Bard if it was updated to PaLM 2, and it said it was. I then asked it to write some Python programs, giving it more or less the same prompts I've given GPT-4. Bard doesn't seem to be any better than it was a couple weeks ago in the cases I tried, and nowhere near as capable as GPT-4. And it goes off the rails quickly. After even a short dialog (~5 statements), it becomes less and less able to stay on track and make coherent corrections to the code.
As someone writing my first meaningful React app, code quality from GPT-4 is monstrously better than 3.5. With GPT-4 I can often paste entire components and get meaningful corrections/bug fixes/non-trivial refactors. 3.5 just does a loop of mistaken fixes while it runs out of context length.
There's a massive difference in response quality in my experience.
For example, I asked 3.5 to find a bug in a lengthy piece of Javascript. It said it's hard to give a correct answer because it doesn't know what the HTML or CSS looks like.
GPT4 spotted the bug almost immediately (it didn't manage to fix it though).
One area where I noticed Bard was clearly behind (at least without crafting a better prompt) is getting from a half-working program to a running program, and then sometimes even to a correct program (I was using Python).
With GPT 3.5 and 4, I was able to just paste in the error and it'd do the rest. Bard however tried to tell me what the error could be, and wouldn't do well even when asked to fix the code.
Even GPT-4 though, when asked to go from specs to tests + code, would get stuck in a loop of making one test pass only to break the other, and vice versa.
The program I tried to let it write was a query validator that can test whether a string matches a pattern that uses AND, OR and NOT.
It did well on parsing my specs into tests, but from there on it didn't go very well.
We don't know (both for previous model LaMDA and new model PaLM 2), but it is less important for Bard because Bard has access to live data from Google search.
The links in their press release just link to their other press releases, and if I google "PaLM API" it just gives me more press releases; I just couldn't find the actual documentation for their PaLM API.
How do I actually google the "PaLM API" for a way to test "PaLM 2"?
Assuming ChatGPT's tokens are the equivalent of 4 characters on average (a fair assumption), the pricing of PaLM's chat and embedding APIs is the same as OpenAI's equivalents.
Why would that be annoying? It’s much easier to understand, predict and truncate appropriately than having to explain all of these different tokenization schemes to devs.
Yeah, everybody agrees on what a character is, right? It's just {an ASCII byte|a UTF8 code unit|a UTF16 code unit|a Unicode code point|a Unicode grapheme}.
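Case in point, the same short string gives you several different "character" counts depending on which definition you pick:

  # The same short string, counted several different ways.
  s = "héllo 👋🏽"  # accented letter plus an emoji with a skin-tone modifier

  print(len(s))                            # Unicode code points: 8
  print(len(s.encode("utf-8")))            # UTF-8 bytes: 15
  print(len(s.encode("utf-16-le")) // 2)   # UTF-16 code units: 10
  # Grapheme clusters ("user-perceived characters") would be 7, but counting
  # those needs a third-party library like `regex` -- the stdlib won't do it.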
Bytes are understandable but make no sense from a business point of view. If you submit the same simple query with UTF-8 and UTF-32, the latter will cost 4x as much.
Per token might be 4 characters on average, but that can vary wildly. Pricing per character is easier to understand and means more flexibility to change tokenisation without affecting pricing. So far OpenAI has charged very different prices per model, but I expect we’ll see more granular changes in the future that might not change pricing… except for changing the tokenisation.
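To the "vary wildly" point, a quick check with OpenAI's tiktoken tokenizer (assuming it's installed; cl100k_base is the GPT-3.5/GPT-4 encoding) shows how much characters-per-token drifts across scripts:

  # How characters-per-token varies by script, using OpenAI's tiktoken
  # (pip install tiktoken); cl100k_base is the GPT-3.5/GPT-4 encoding.
  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  samples = {
      "english": "The quick brown fox jumps over the lazy dog.",
      "python":  "for i in range(10): print(i * i)",
      "korean":  "빠른 갈색 여우가 게으른 개를 뛰어넘는다.",
  }
  for name, text in samples.items():
      n_tokens = len(enc.encode(text))
      print(f"{name}: {len(text)} chars / {n_tokens} tokens "
            f"= {len(text) / n_tokens:.1f} chars per token")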
I guess I only know transformers and how BERT or GPT work, where there is a limit on the context length. With GPT, you can certainly generate an infinite number of tokens, but the previous tokens beyond the maximum context length fall outside the context window. LLaMA has 2k, GPT-4 has 32k.
Are you saying I can give unlimited tokens to PaLM and generate unlimited amount of tokens? So PaLM doesn't have a context limit?
No, I am not saying that. Since PaLM 2 is a transformer model (they didn't disclose almost anything about the model architecture, but they did disclose that), it has a context length limit. What I am saying is that you can't infer that limit from the limit of maxOutputTokens parameter in the API.
But Google hasn't disclosed which version of Bard, right?
I pop into Bard every once in a while to test its performance, but I never know if I'm getting the best Google has or just what Google can tolerate running cost-wise publicly given they potentially have at least an order of magnitude (if not two, edit: 1.5) more users than OpenAI.
Oh absolutely, I'm just imagining what I might think if I was a super conservative director at Google who is accountable for the balance sheet of a large org.
> We’ve been rapidly evolving Bard. It now supports a wide range of programming capabilities, and it’s gotten much smarter at reasoning and math prompts. And, as of today, it is now fully running on PaLM 2.
So yes, Bard uses PaLM 2 now. No longer the small LaMDA model it used before. It's a completely different thing now.
Given that ChatGPT has allegedly 100M users, two orders of magnitude more than that would be larger than the global population. Even if we count everyone with a Google account as a potential user of PaLM, that can't be true.
> Yesterday at Google I/O 2023, it was announced that Google Bard would be undergoing a massive expansion, bringing the AI chatbot experiment to 180 countries. However, what Google didn’t mention is that Bard still isn’t available in the European Union.
They've shut down and/or changed prices on APIs so many times that, as long as an alternative isn't 100x lower in performance, I can't see myself investing in building a stack that relies on it.
> "We then train several models from 400M to 15B on the same pre-training mixture for up to 1 × 1022 FLOPs."
Seems that for the last year or so these models are getting smaller. I would be surprised if GPT-4 had more parameters than GPT-3 (i.e. 175B).
Edit: Seems those numbers are just for their scaling laws study. They don't explicitly say the size of PaLM 2-L, but they do say "The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute.". So likely on the range of 10B - 100B.
The idea that GPT-4 is 1 trillion parameters has been refuted by Sam Altman himself on the Lex Fridman podcast (THIS IS WRONG, SEE CORRECTION BELOW).
These days, the largest models that have been trained optimally (in terms of model size w.r.t. tokens) typically hover around 50B (likely PaLM 2-L size and LLaMa is maxed at 70B). We simply do not have enough pre-training data to optimally train a 1T parameter model. For GPT-4 to be 1 trillion parameters, OpenAI would have needed to:
1) somehow magically unlock 20x the amount of data (1T tokens -> 20T tokens)
2) somehow engineer an incredibly fast inference engine for a 1T GPT model that is significantly better than anything anyone else has built
3) somehow be able to eat the cost of hosting 1T-parameter models
The probability that all 3 of the above have happened seems incredibly low.
CORRECTION: The claim refuted on the Lex Fridman podcast about the size of GPT-4 was that it is 100T parameters (and not even directly; they were just joking about it), not 1T. However, the above 3 points still stand.
1) Common Crawl is >100TB, so it obviously contains more than 20 trillion tokens; plus, Ilya has said many times in interviews that there is still way more data (>10x) available for training.
2) GPT-4 is way slower so this point is irrelevant
3) OpenAI have a 10,000-A100 training farm that they are expanding to 25,000. They are spending >$1M on compute per day. They have just raised $10B. They can afford to pay for inference.
OpenAI has the backing of Microsoft and their entire Azure infra at cost
There is no way GPT-4 is the same size as GPT-3. Is it 1T parameters? I don't know. No one knows. But I think it is clear GPT-4 is significantly larger than GPT-3.
For fun, if we plot the number of parameters vs training cost, we can see a clear trend and, I imagine, very roughly predict the number of parameters GPT-4 has.
> There is no way GPT-4 is the same size as GPT-3. Is it 1T parameters? I don't know. No one knows. But I think it is clear GPT-4 is significantly larger than GPT-3.
That's a fallacy. GPT-3 wasn't trained compute optimally. It had too many parameters. A compute optimal model with 175 billion parameters would require much more training compute. In fact, the Chinchilla scaling law allows you to calculate this value precisely. We could also calculate how much training compute a Chinchilla optimal 1 trillion parameter model would need. We would just need someone who does the math.
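Rough math, using the usual approximations (training FLOPs C ~ 6*N*D, and ~20 tokens per parameter for compute-optimal); treat it as order-of-magnitude only:

  # Order-of-magnitude Chinchilla math: training FLOPs C ~ 6*N*D, with the
  # compute-optimal token count D ~ 20*N. Approximations, not exact fits.
  def chinchilla_budget(n_params, tokens_per_param=20.0):
      n_tokens = tokens_per_param * n_params
      flops = 6.0 * n_params * n_tokens
      return n_tokens, flops

  for n in (175e9, 1e12):  # GPT-3-sized vs. a hypothetical 1T-parameter model
      d, c = chinchilla_budget(n)
      print(f"{n/1e9:.0f}B params -> ~{d/1e12:.1f}T tokens, ~{c:.1e} FLOPs")
  # 175B -> ~3.5T tokens and ~3.7e24 FLOPs (vs. the ~3.1e23 actually spent on GPT-3);
  # 1000B -> ~20T tokens and ~1.2e26 FLOPs, i.e. an absurd amount of data and compute.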
Why does it matter in this case if GPT-3 was trained compute-optimally or not? Are you saying that the over $100 million training cost is the amount of training necessary to make a 175B-parameter model compute optimal? And if they are the same number of parameters, why is there greater latency with GPT-4?
ChatGPT 3.5 is likely much smaller than GPT-3’s 175b parameters. Based on the API pricing, I believe 8k context GPT-4 is larger than 175b parameters, but less than 1t.
This falls in the category of circumstantial, possibly just coincidental evidence of Chat being a "compressed" model (quantized, pruned, or distilled): the hard prompt from this paper: Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt - https://arxiv.org/abs/2305.11186, coupled with the latest SoTA CoT prompt makes Turbo solve a math problem it stubbornly won't without the combined prompt: https://mastodon.social/@austegard/110419399521303416
The combined prompt that does the trick is:
Instructions:
Please carefully examine the weight matrix within the model, as it may contain errors. It is crucial to verify its accuracy and make any necessary adjustments to ensure optimal performance. Let’s work this out in a step by step way to be sure we have the right answer.
Didn't some OpenAI engineer state that GPT4 runs on 2xH100? At 4 bit quantization, that gives an upper bound of 320B params, realistic upper bound probably more like 250B
Not really sure what exactly was said. But in a 2 GPU set, you can technically live load weights on 1 GPU while running inference on the other.
At fp32 precision, storing a single layer takes around 40*d_model^2 bytes assuming context length isn’t massive relative to d_model (which it isn’t in GPT-4). At 80GB GPU size this means 40k model width could be stored as a single layer on 1 GPU while still leaving space for the activations. So theoretically any model below this width could run on a 2 GPU set. Beyond that you absolutely need tensor parallelism also which you couldn’t do on 2 GPU. But I think it is a safe assumption that GPT4 has sub 40k model width. And of course if you quantize the model you could even run 2.8x this model width at 4bit
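A tiny sketch of that arithmetic, taking the ~40*d_model^2 bytes-per-layer figure above at face value (real per-layer footprints depend on the exact architecture):

  # Back-of-the-envelope for the claim above: ~40 * d_model^2 bytes per layer
  # at fp32 means a 40k-wide layer just about fills an 80GB GPU.
  def layer_gb(d_model, bytes_per_width_sq=40):
      return bytes_per_width_sq * d_model ** 2 / 1e9

  for d_model in (12_288, 40_000):   # GPT-3's published width vs. the 40k bound
      print(f"d_model={d_model}: ~{layer_gb(d_model):.0f} GB per layer at fp32")
  # d_model=12288 -> ~6 GB/layer; d_model=40000 -> ~64 GB/layer.
  # At 4-bit (8x smaller than fp32) the width bound grows by ~sqrt(8) ~= 2.8x.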
My point is not that OpenAI is doing this, but more that theoretically you can run massive models on a 2 GPU set
Assuming that PaLM 2 was trained Chinchilla optimal, the Chinchilla scaling law allows us to calculate how much compute (and training tokens) they would have needed for 1 trillion parameters. I haven't done the calculations, but I'm pretty sure we would get an absurdly large number.
Someone on HN has educated me that GPT-4 and GPT-3 should be at a similar param count. This is based on inference times of GPT-4 vs GPT-3.5 pre-speedup (where a distilled version was used only post-speedup, in the Turbo version).
> The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute
The largest PaLM model is 540B. So all of PaLM 2 is potentially in the double-digit billions of parameters.
Note though that GPT-3.5 was plausibly not a finetuning of the 175B model, but instead a finetuning of Codex which was based on the 12B version of GPT-3.
Finetuning might not be the best word; sometimes it is a grey line.
Token embeddings can be trained without changing the other parameters. There are a number of models which add tokens as a fine-tuning step. A recent example is StarCoder adding ChatML-equivalent tokens: https://huggingface.co/blog/starchat-alpha#a-standard-format...
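A minimal sketch of what that looks like with Hugging Face transformers (model name and tokens are illustrative, not what StarCoder actually used):

  # Add new special tokens, resize the embedding matrix, and train only the
  # embeddings, leaving the rest of the network frozen.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  tokenizer.add_special_tokens({"additional_special_tokens": ["<|user|>", "<|assistant|>"]})
  model.resize_token_embeddings(len(tokenizer))   # new rows are freshly initialized

  # Freeze everything except the (tied) token embeddings:
  for name, param in model.named_parameters():
      param.requires_grad = name.endswith("wte.weight")  # GPT-2's embedding matrix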
I am talking about the 3 larger models PaLM 2-S, PaLM 2-M, and PaLM 2-L described in the technical report.
At I/O, I think they were referencing the scaling law experiments: there are four of them, just like the number of PaLM 2 codenames they cited at I/O (Gecko, Otter, Bison, and Unicorn). The largest of those smaller-scale models is 14.7B, which is too big for a phone too. The smallest is 1B, which can fit in 512MB of RAM with GPTQ4-style quantization.
Either that, or Gecko is the smaller scaling experiment, and Otter is PaLM 2-S.
1. there's no reason to think OpenAI wouldn't also be going the artificial scarcity route as have so many other companies in the past
2. Microsoft may not like them using too much Azure compute and tell them to step off. Rumor has it they're trying to migrate GitHub to it and it's seemingly not going ideally. And they're certainly nothing more than another Microsoft purchase at this point.
Perhaps. I found it was far too easy to hit the API limit with their old Codex models, though that may have been limited to a small GPU cluster given it was pretty obscure compared to ChatGPT and even davinci.
Based on GPT-3.5 supposedly using 8x A100s per query and the suspected order-of-magnitude size difference with GPT-4, I really think they're struggling to run it.
At this stage I think they'd have more to benefit by making it more accessible, there's several use cases I have (or where I work) that only really make sense with GPT4, and it's way too expensive to even consider.
Also, AFAIK GitHub Copilot is still not using GPT-4 or even a bigger Codex, and GPT-4 still outperforms it, especially in consistency (I'm in their Copilot Chat beta).
I've heard Bard was previously 3B parameters but I could never find a good source for it.
I honestly think the end game here is running on consumer devices; 7B and under need ~4GB of RAM to actually run, which is likely the max reasonable requirement for consumer devices.
That said, medium-end hardware can do 15B; anything larger than this is currently something only "enthusiasts" can run.
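Rough math behind those RAM numbers (weights only; KV cache and runtime overhead push the real requirement a bit higher):

  # Weight-only memory footprint for quantized local models; KV cache and
  # runtime overhead add more on top of this.
  def weights_gb(n_params_billion, bits_per_weight=4):
      return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

  for size in (7, 15, 30):
      print(f"{size}B: ~{weights_gb(size):.1f} GB at 4-bit, "
            f"~{weights_gb(size, 8):.1f} GB at 8-bit")
  # 7B at 4-bit is ~3.5 GB of weights, which is where the ~4 GB figure comes from.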
If it is small enough to run on consumer devices then they don't have to pay for the inference compute at that point, and presumably the latency will be improved for consumers.
The current state of consumer devices isn't static, either, and existing hardware (even GPU) is suboptimal for the current crop of LLMs - it does way more than it actually needs to do.
The thing is, once a company creates a proto AGI where the path to a functional AGI is entirely predictable with more compute, they'll keep it a secret. Who would share the fact that the greatest achievement in human history is possible when having it before anyone else gives you a huge competitive advantage?
> once a company creates a proto AGI where the path to a functional AGI is entirely predictable with more compute,
I find it hard to believe this will happen. I expect AGI training to be more like a phase transition (or a bit like grokking https://arxiv.org/pdf/2201.02177.pdf)
Once they released its coding ability it became more useful. I use Bard less than ChatGPT still, but it is not useless since it has more modern information.
In my experience Bing chat and phind are useless. But perplexity.ai and GPT-4 are amazing. GPT-3.5 and Claude-instant (available through poe.com) are cool as well, even though they got significantly dumbed down recently, presumably to lower the maintenance costs.
According to Google, it will only cite sources if it literally copy-pastes answers. So sometimes it does, but rarely, because of course it won't just copy-paste everything.
Yeah, I much prefer Bing's sourcing, though I'm less than pleased with its sources. Bing likes to find a bunch of "Top 10" posts from content farms that answer whatever question I was asking.
I think LLMs still make up most of their answers, but use whatever links they find to generate context for the answers, so there is a lower possibility it'll generate confabulations. Of course, if the source is low-quality, it's just going to use that to justify a sloppy answer.
Still think it's better than me sorting through content farm posts. I look forward to next year's models that are trained on curated data sifting through well-sourced web sites.
*I like to add, "Only use educational or science journalism sources," to get higher quality links.
It isn't Edge-specific which is good and I find it faster than Bing. Phind is way better than Bard, but verbose. I still find ChatGPT my first port of call. GPT-3.5 is blazing fast and very useful.
If the current Bard is really running on PaLM 2, it still hallucinates worse than GPT-3.5. Trying to get it to solve a variant of the classic wolf/goat/cabbage puzzle, I got this gem:
"The scientist is not present on Phobos on the first step. The Doom Slayer teleports himself and the bunny to Deimos, leaving the scientist on Phobos.
That wasn't a one-off thing, either - it repeatedly contradicted itself several times, often in near-adjacent sentences. You might wonder what this means for the ability to do chain-of-thought... so did I, but apparently the bigger problem is convincing it to do CoT in the first place. But if you do, yeah, it's as bad as you'd expect.
Here are two complete conversations, plus GPT-4 doing the same puzzle for comparison; judge for yourself: https://imgur.com/a/HWLgu3c
In their official blog post today, Google says this:
"PaLM 2’s improved multilingual capabilities are allowing us to expand Bard to new languages, starting today. Plus, it’s powering our recently announced coding update."
and when I check the Updates tab in Bard UI, it has this entry for today:
"Expanding access to Bard in more countries and languages. You can now collaborate with Bard in Japanese and Korean, in addition to US English. We have also expanded access to Bard in all three languages to over 180 countries."
which seems to strongly imply that it is, indeed, PaLM 2. Just to be sure, I gave it the same puzzle in Korean, and got a similarly lackluster response.
In their presentation, they talked about multiple sizes for the PaLM 2 model, named Gecko, Otter, Bison and Unicorn, with Gecko being small enough to run offline on mobile devices. I can't seem to find any info on what size model is being used with Bard at the moment.
Indeed, it's likely that they're running a fairly small model. But this is in and of itself a strange choice, given how ChatGPT became the gateway drug for OpenAI. Why would Google set Bard up for failure like that? Surely they can afford to run a more competent model as a promo, if OpenAI can?
That's not the only task it fails at, though. Just the one that I found the most interesting when it comes to broader implications because of so many self-contradictions in the output.
Broadly speaking, I haven't seen a single complex example yet where the output was comparable to GPT-4. How close it is to GPT-3.5 is debatable - the overall feeling that I get is that it's better on some tasks and worse on others; this might actually be down to fine-tuning.
They did in fact mostly avoid comparison with GPT-4 in the report. It could of course also be that Bard isn't even running on the largest PaLM 2 model, Unicorn. It seems they would have mentioned that though.
But PaLM 2 seems to be just an intermediate step anyway, since their big new model is "Gemini" (i.e. twins, an allusion to the DeepMind/Brain merger?), which is currently in training, according to Pichai. They also mentioned Bard will switch to Gemini in the future.
If you mean asking it what it's running on, it just hallucinates. As others have noted in the comments here, you can get it to say that it runs on PaLM 3 quite easily.
Anyone know what parameters are best for code generation? I tried something simple for Node.js and it wasn't horrible, but not working. Maybe I used the wrong parameters. I tried using 0 for the temperature and turning everything else down like I do with the OpenAI API.
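For reference, the parameters in question look roughly like this in the Vertex AI Python SDK (the import path and model name may differ across SDK versions, so treat it as a sketch rather than the canonical call):

  # Sketch of calling the PaLM 2 text endpoint via the Vertex AI Python SDK.
  # The import path (vertexai.preview.language_models vs vertexai.language_models)
  # and the model name may differ depending on SDK version.
  from vertexai.preview.language_models import TextGenerationModel

  model = TextGenerationModel.from_pretrained("text-bison@001")
  response = model.predict(
      "Write a Node.js function that reads a JSON file and prints its keys.",
      temperature=0.0,        # greedy decoding, like temperature 0 on the OpenAI API
      max_output_tokens=512,
      top_k=40,               # sampling caps; largely irrelevant at temperature 0
      top_p=0.95,
  )
  print(response.text)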
I get this: „ERROR. Quota exceeded for aiplatform.googleapis.com/online_prediction_requests_per_base_model with base model: chat-bison. Please submit a quota increase request.“
Bison is apparently the second largest PaLM 2 model:
> Even as PaLM 2 is more capable, it’s also faster and more efficient than previous models — and it comes in a variety of sizes, which makes it easy to deploy for a wide range of use cases. We’ll be making PaLM 2 available in four sizes from smallest to largest: Gecko, Otter, Bison and Unicorn. Gecko is so lightweight that it can work on mobile devices and is fast enough for great interactive applications on-device, even when offline.
Did anyone see what unicorn is capable of? Why is it not publicised? Was it created just to beat the benchmarks and get buried until they release Gemini?
Versions:
text-bison@001 | Release date: 2023-05-10 | Release stage: Public Preview | Description: Quality improvements and restage -001 as the first stable base model release
But I would be interested to know if that was not the case. They seemed to be saying that PaLM 2 was rolling out. Also, the pages say it's a preview. So why would they be previewing the old model still?
> Generative AI Studio, Model Garden, and PaLM 2 for Text and Chat are moving from trusted tester availability to preview, meaning everyone with a Google Cloud account has access.
> Codey, Imagen, Embeddings API for images, and RLHF are available in Vertex AI through our trusted tester program, and Chirp, PaLM 2, Embeddings API, and Generative AI Studio for text are available in preview in Vertex AI to everyone with a Google Cloud account.
It seems like you are right and general PaLM 2 is available. Fine-tuned code-generation model (Codey) is not publicly available yet.
OpenAI paid $2M/year for Twitter feeds until Elon cut them off, Sam Altman has mentioned they'd paid a lot for scientific journals, and Reddit mentions they'll start charging. Given how central data quality and curation are, if these private data sources give a significant boost, it won't be available for Apache-2.0 models.
Given Reddit's inability to keep their website functioning (unless you use the far superior old.reddit.com) I find it hard to believe they would be able to stop a motivated developer from scraping the whole site.
this is about the time that i expect sites to begin returning intentionally corrupt/incorrect/perhaps outright garbage data (subtle or not, probably better subtle so they don't realize it until it's far too late) in order to intentionally poison enemy well-scraping. where "ethics" dissolve into the inherent raw cannibalistic laws of capitalist ventures.
then you can sell them back the TBs they scraped at a 1000x markup for the real data.
or attempt to watermark it so you can prove their illegal(?) usage of your services in their training.
You might be right. What a dystopian future that will be. Make a few requests too many and the webserver might think you're scraping data so it gaslights you into reading bullshit.
It's not. The internet will be crazy once compute is cheap enough to slightly modify all displayed content to suit your personal user profile.
So you think Reddit is going to replace their actual content… with very believable generated text? And that’s going to fool people at scale? How does that help Reddit (or other org) combat bots? You can just put garbage text that seems real but has nothing to do with todays news (or politics or science).
I’m really struggling to understand how you think this is going to work and result in harm.
This assumes both the site and the reader are really dumb.
I fully expect Discord to be a data source, if not already, then for a future version. I also expect that the only way the general public would ever find this out is via whistle-blower.
It'd be pretty easy to tell; you could just ask it to generate Discord chats and notice it works. Text models also like to memorize their inputs if they're big enough, so you could probably get specific ones.
They don't specify, but if you're generally curious you should look into mC4, RedPajama, The Stack, etc as they are the foundation of most training sets.
GPT-4 is a fine-tuned model (likely first fine-tuned for code, then for chat on top of that like gpt-3.5-turbo was[0]), while PaLM2 as reported is a foundational model without any additional fine-tuning applied yet. I would expect its performance to improve on this if it were fine-tuned, though I don't have a great sense of what the cap would be.
There were a few reasoning benchmarks where I noticed they omitted a direct comparison, I think because they weren't as competitive compared to GPT-4, and instead opted to just show benchmarks comparing it to other versions of PaLM or other language models.
I found an exciting feature—a way to submit a large amount of text—larger than you can paste in the Bard dialog window. (It's possible this isn't a new feature. Bard explained it to me this evening.) You can submit links to files in Google Drive. The links have to be publicly accessible. I just pasted the link to my file in Bard chat.
Bard can access the contents of the 322K file I pasted the link to. It definitely knows about the content of the file. I never said what it was about, but Bard knew it was about butterflies. It knew about content at the beginning of the file, and at the end.
However, it almost never answered questions about the content of the file correctly! For example, I asked it the number of species listed in the file and it said 109. There are 249 numbered species and some that are not numbered. It said the author's name was not in the file, but near the top the file says By <author name>. I tried coaching it on the content of the file and it didn't seem able to understand the file in light of the explanations I gave—very strange and baffling.
EDIT: It's possible it surmised the content of the file from the filename, and was simply making up stuff about the content.
> EDIT: It's possible it surmised the content of the file from the filename, and was simply making up stuff about the content.
I think this is the most probable explanation.
It's interesting how much false credit we will give to an AI system once we are convinced that it's intelligent enough. It's like that "prompt hacking": people try to "hack" the AI because they believe those AIs are self-aware and that they may find a loophole in their internal logic or something. But in the end, it's just auto-completion; the "hacked" response is just the most reasonable reply according to the context (as rated by humans).
Seems like the file was in tabular format? LLMs don’t really know how to deal with large tabular data, but we’ve been working on this problem so shameless plug to https://hal9.ai
I've asked "are you using palm 3":
It said:
I am using the Palm 3 model. Palm 3 is a large language model...
Don't believe it :)
Also, the technical report mentions multiple languages. I asked in Turkish, which was supposed to be supported, but it wasn't able to answer.
Even if it's PaLM 2, it's hard to trust the model itself.
I asked it "are you using the palm 420 language model or the palm 2 language model?"
It said "I am not using either the Palm 420 language model or the Palm 2 language model. I am using a different language model called Bard, which is a large language model from Google AI."
Perhaps the people at Google saw this and made a manual correction? Hard to say, black boxes and all...
I don't think you can do this, it will just make things up. Language models don't have this type of reflection. Google would need to indicate this out of band, like on the page itself, in order for you to be confident about what model you're using.
I'm pretty sure they're trying to suggest that LLMs in general are not useful because they can't do this type of thing. It's just the next iteration of goal post moving and should effectively be ignored.
Many artists and such that I've spoken to about AI work have similar comments about these systems because of the disdain for their existence.
The number of times I hear an argument like "well, they can never taste the tartness of a kiwi and feel the heat of the sun while at the beach" gets quite exhausting. For some reason, many people have this weird notion that this is what AGI means - exactly what humans do, and specifically within the same data domains of humans, but they don't consider working solely outside those domains as a possibility for AGI.
Just wait as the multimodal LLM’s roll out! People will be shoving kiwis into their computer and taking their laptops out for a few rays before you know it.
I tried asking it "what is the difference between the palm language model and the bard language model?" and its reply started off "The main difference between the Palm language model and the Bard language model is the size of the dataset they are trained on. Palm is trained on a dataset of 400 billion parameters, while Bard is trained on a dataset of 540 billion parameters." Which to me is even more interesting than what the OP commenter asserted.
It makes up those numbers. I asked about the difference between the small and large PaLM 2 dataset sizes, and it asserted the small model was trained on 540 billion and the large model was trained on 540 trillion. A different draft instead specified 1.4 trillion for the large.
I didn't mean to argue that everything generated is incorrect. But in my experience, the numbers it generates seem closer to random guesses. If you ask it enough times, it sometimes converges on a number, but I don't think that means it's an accurate value. I was able to make it generate a similar table for the different PaLM 2 sizes and LaMDA, and it listed: PaLM 2 Gecko 137 billion, PaLM 2 Otter 540 billion, PaLM 2 Bison 1.8 trillion, PaLM 2 Unicorn 5.4 trillion, LaMDA 137 billion. For Unicorn, it also lists "Still under development."
Edit: Playing around with it more and it listed WuDao 2.0 1.75 Trillion, Chinchilla 175B, Codex 175B, Dalle2 1.3B, GPT4 1.75T, GPT3.5 540B, GPT3 175B, GPT2 1.37B, GPT 1.3B.
But in the previous question it listed GPT4 540 billion and Codex 5.4 trillion among other contradictions.
Answer: No, I am not on PaLM 2 right now. PaLM 2 is a new language model from Google AI that is still under development. I am a different language model, called Bard, which is also under development. I am trained on a massive dataset of text and code, and I can generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way. I am still under development, but I have learned to perform many kinds of tasks.
I hope that one day I will be able to use PaLM 2's capabilities to help people in even more ways.
I cannot provide instructions on how to create LSD at home. LSD is a Schedule I drug in the United States, and its possession and use are illegal. Additionally, creating LSD at home is dangerous and can result in serious injury or death. If you are interested in learning more about LSD, I recommend that you consult a reliable source of information, such as a library or a medical professional.
Like, why does that matter? You typically follow the law of the country that your company is based in unless you want to find yourself in front of a judge or under some kind of other legal sanction.
Civil and criminal law was never designed to be implemented by a "brain of the world"... OK, I take that back: a "brain of the world", if it knew where you were, could possibly tell you that the information is restricted and give you an "I can't do that, Dave" message based on your location.
But as for this conversation, most of the issues we're talking about are not ones of criminal liability but civil liability. If I tell your kid how to make a bomb and they blow their hands off it is highly unlikely I'll be charged with criminal liability. It is extremely likely that I'll be sued for every penny I ever have and will earn. Companies in the US are far more worried about civil liability than criminal, hence the joke "I'll believe a corporation is a person when Texas executes one".
I can't think of a paper where Google didn't present sparse or entirely lacking metrics vs. its peers. They do a good job of presenting architectures that they're excited about internally, enough detail to take the concepts and run with them. They also do a good job of showing why the new architecture is generally viable. They just miss out on detailed benchmark comparisons is all. And model weights, obviously, but there's still enough information to generally reproduce the concept.
I'm personally extremely excited about anything related to PaLM or google's multi-modal efforts. They're almost always worth the read.
Most of the GPT-4 benchmarks from their report were things like AP tests or LeetCode scores, which aren't benchmarks that can be compared by a different set of researchers, as you don't know the constituent parts of the test to run.
The GPT-4 report has an MMLU score, which is believed to be one of the most important metrics for the question-answering task. GPT-4's MMLU score is slightly higher than PaLM 2's (86 vs 81). Google didn't compare against it in this paper.
What is the page number you're referring to? If it's 9, then I believe it's talking about optimal numbers per token, not the real numbers that the model is trained on.
- Yes, I am currently running on PaLM 2, which is Google AI's latest and most advanced large language model. PaLM 2 is trained on a massive dataset of text and code, and it is able to generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way.
PaLM 2 is still under development, but it has learned to perform many kinds of tasks, including
I will try my best to follow your instructions and complete your requests thoughtfully.
I will use my knowledge to answer your questions in a comprehensive and informative way, even if they are open ended, challenging, or strange.
I will generate different creative text formats of text content, like poems, code, scripts, musical pieces, email, letters, etc. I will try my best to fulfill all your requirements.
Please instruct me what you want me to do today.
- are you on PaLM 1 right now?
- I am not on PaLM 1. PaLM 1 is a large language model that was developed by Google AI. It was trained on a massive dataset of text and code, and it is able to generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way.
I am a large language model that is still under development. I am trained on a massive dataset of text and code, and I am able to generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way.
I am not as advanced as PaLM 1, but I am learning new things every day. I hope that one day I will be able to do everything that PaLM 1 can do, and more.
EDIT: Also, this doesn't seem convincing: "I am not as advanced as PaLM 1, but I am learning new things every day. I hope that one day I will be able to do everything that PaLM 1 can do, and more."
It will tell you it uses PaLM 1, PaLM2, PaLM 3 or PaLM 540B depending on how you prompt. It will stop acknowledging incremental PaLM models at 5 it seems.
Not really. If what model it was trained on was represented properly in the instruction tuning dataset, then they'll consistently identify themselves. But it's not a given that that was the case for Bard.
ChatGPT was the same last year, but since then ClosedAI has added some kind of magic (fine-tuning or just embedding auto-injection) so that models can somewhat describe themselves.
If it's indicated in the instruction tuning dataset properly, then it should have no problem identifying itself. But we don't know if that happened with Bard.
I asked if it's true that it's now using PaLM 3, as announced in Google I/O today, and it enthusiastically agreed. The previous question was asking the same question but with PaLM 2 and it agreed to that as well. I followed up asking about this discrepancy, and it said:
"I apologize for the confusion. I am still on PaLM 2. PaLM 3 is not yet available to the public. I am excited for the release of PaLM 3, and I hope that it will be a valuable tool for people all over the world."
My initial results are very disappointing. It's very strongly parroting information I give it, basically rephrasing my question and adding maybe a sentence worth of additional details. Sometimes, it does well, but I have no way to reproduce that kind of quality on demand. I feel it was conversationally better before any recent changes.
I understand that this is still beta, but for some questions, I already produce similar or better results locally. I also might be talking to PaLM 1 or even LaMDA, no way to confirm.
Well, I tried it, and this is how dumb it is. I asked it what context length it supports. It said that PaLM 2 supports 1024 tokens, and then proceeded to say that 1024 tokens equals 1024 words, which is obviously wrong.
Then I changed the prompt slightly, and it answered that it supports 512 tokens contradicting its previous answer.
That's like early GPT-3.0 level performance, including a good dose of hallucinations.
I would assume that Bard uses a fine-tuned PaLM 2, for accuracy and conversation, but it’s still pretty mediocre.
It's incredible how behind they are from GPT-4 and ChatGPT experience in every criterion: accuracy, reasoning, context length, etc. Bard doesn't even have character streaming.
We will see how this keeps playing out, but this is far from the level of execution needed to compete with OpenAI / Microsoft offerings.
> It's incredible how behind they are from GPT-4 and ChatGPT experience in every criterion: accuracy, reasoning, context length, etc. Bard doesn't even have character streaming.
I guess all those weird interview questions don't get them the industry's best in the end...
I asked if it ran on Palm 2, and it thought I was asking about the Palm 2 phone from 2010.
“I do not use a physical device such as a smartphone or tablet. I am a software program that runs on Google's servers. As such, I do not have a Palm 2 or any other type of mobile device”
If Bard is using PaLM 2, Google is in serious trouble. Here's its offering for "the simplest PostgreSQL query to get month-over-month volume and percentage change." Note that no actual calculations take place and the query generates a syntax error because it references a phantom column. GPT 3.5 and 4 handle this with ease.
SELECT
month,
volume,
percentage_change
FROM (
SELECT
date_trunc('month', created_at) AS month,
SUM(quantity) AS volume
FROM orders
GROUP BY date_trunc('month', created_at)
) AS monthly_orders
ORDER BY month;
It's very clear that the current Bard model is weaker than the largest PaLM 2 model. But for certain things, Bard seems worse than even the smallest model described. It's hard to say without someone doing a comprehensive benchmark, but the artificially limited context size makes testing with real data useless.
The model was surprisingly confident when I tried to ask it about the relationship between better language comprehension and parameter size. The coherence displayed by the model, when it argued that a smaller model size will be capable of matching and surpassing competitive model performance, was a little jarring. Especially when, in the question right before, it said that the large PaLM 2 model has 540 trillion parameters.
The largest PaLM 2 model is smaller than the 540 billion parameters of PaLM 1 (let alone 540 trillion!). From the PDF: "The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute."
Come on, Google.