Chinchilla Scaling: A replication attempt (arxiv.org)
124 points by tosh 13 days ago | 68 comments





> To extract the data from the figure, we first downloaded the PDF from Hoffmann et al.’s arXiv submission and saved it in SVG format. We then parsed the SVG content to navigate and search the SVG structure. Within the SVG, we identified the group of points representing the scatter plot data and iterated over each point to extract its fill color and position (x and y coordinates) using the attributes of the corresponding SVG elements.

> To map the SVG coordinates to the model size and training FLOP values, we used the location of the labels or ticks on the respective axes. This allowed us to establish a correspondence between the SVG coordinates and the actual data values represented in the plot.

They ... reconstructed the data ... from a plot ... using ruler and eyes? Why not just email the original authors for the raw data? I can't help but feel like this is @yuvaltheterrible debunking papers.


Funnily enough, I've done this for a paper I wrote as well. Emailing authors is kind of a crapshoot. It's normal to get no response if it's been several years since the paper came out. In this case, a pdf plot is essentially lossless, and it's much faster than waiting for authors to maybe respond.

And not only that, in many cases they will tell you (if they reply) "oh, we can't find the source of that plot anymore". Happened to me quite a few times (although in physics).

I'm pretty sure I'm not the only one who's written themselves a mini tool to extract data even from a bitmap plot based on the axes. It involves some manual steps (cropping mainly), but it's very convenient for the cases where people don't even use vector graphics, but sometimes just screenshots of plots... Do I like it? Hell no! It's why I've put quite some effort into doing it better for my PhD thesis.
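
For the curious, the core of such a tool is small. A rough sketch of the idea (my own illustration, not the tool I actually use; the file name, marker-colour threshold, and axis limits are placeholders you'd read off the cropped figure):

  # crop the image to the plot area first, then threshold the marker colour
  # and map pixel coordinates to data values via the known axis limits
  import numpy as np
  from PIL import Image

  img = np.asarray(Image.open("plot_cropped.png").convert("RGB"))
  h, w, _ = img.shape

  # pixels close to the marker colour (assumed bluish here)
  mask = (img[..., 2] > 150) & (img[..., 0] < 100) & (img[..., 1] < 100)
  rows, cols = np.nonzero(mask)

  x_min, x_max = 0.0, 10.0          # axis limits read off the figure
  y_min, y_max = 1.0, 5.0
  x_data = x_min + cols / (w - 1) * (x_max - x_min)
  y_data = y_max - rows / (h - 1) * (y_max - y_min)   # image rows grow downward

The real annoyance is everything around this: cropping, anti-aliased markers, overlapping points, log axes.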


Yeah, it's very annoying, especially these days when there's no real excuse not to have a copy. You can easily store all code and data for free and in an accessible manner. Even just GitHub is good enough for 90+% of cases. Hugging Face helps, and there are many other ways too.

I remember my first year in grad school I was trying to replicate a work by a very prestigious university. It definitely wasn't reproducible from the text, but I did my best. Couldn't get close to their claims, so I emailed the lead author (another grad student). No response. Luckily my advisor knew their advisor. Got a meeting and then I got sent code. It was nothing like what they claimed in the paper, so I have no idea what they gave me. Anyways, my paper never got published because I couldn't beat them. It is what it is.


To be fair, sometimes (e.g. in the case of scatter plots with many dots) PDF renderers become very slow and/or mess up the rendering. In this case the easiest option is rasterizing it (for performance and consistency of appearance).

That is certainly true (and why I added a general "embed plot data as bitmap into SVG/PDF" option to https://github.com/Vindaar/ggplotnim that works not only for raster heatmaps). But realistically such plots are often not ideal anyway (too many data points in a plot is often a sign that a different type of plot would be better, typically one that aggregates in some way), and it's just another argument for making the data behind plots available as well.

If you have the misfortune of having to use Word for writing manuscripts and/or have scatter plots with a good number of points, SVGs will ruin your day in my experience.

(Yes, I'd much rather use LaTeX)


Somebody tell them that huggingface, github, gitlab, codeberg etc exist.

In fairness, they did not use a ruler or eyes. Based on the excerpts you quote, they extracted exact coordinates of the data from an SVG, which, if the SVG was created correctly, should at least give an unbiased dataset, maybe with less precision than the source.
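
In Python the core of that extraction might look something like this (a minimal sketch of the procedure the excerpt describes, not the authors' code; the file name, tick pixel positions, and tick values are placeholders you'd read off the figure by hand):

  # parse the SVG, pull out the scatter-plot circles, and map their SVG
  # coordinates to data values using two axis ticks with known values
  import xml.etree.ElementTree as ET
  import numpy as np

  SVG_NS = "{http://www.w3.org/2000/svg}"

  def svg_to_data(coord, tick_coords, tick_values, log_scale=True):
      # linear interpolation between two known ticks (in log space for log axes)
      (c0, c1), (v0, v1) = tick_coords, tick_values
      if log_scale:
          v0, v1 = np.log10(v0), np.log10(v1)
      value = v0 + (coord - c0) / (c1 - c0) * (v1 - v0)
      return 10 ** value if log_scale else value

  points = []
  for circle in ET.parse("figure.svg").iter(f"{SVG_NS}circle"):
      cx, cy = float(circle.get("cx")), float(circle.get("cy"))
      fill = circle.get("fill")                                   # colour encodes the loss
      flops = svg_to_data(cx, (100.0, 700.0), (1e18, 1e22))       # x axis: training FLOPs (placeholder ticks)
      params = svg_to_data(cy, (500.0, 50.0), (1e8, 1e11))        # y axis: model size (SVG y runs downward)
      points.append((flops, params, fill))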

we did and gave them a two week grace period to respond, but they only responded to us after we published on arxiv

also, we didn't reconstruct the data using a ruler, you can automate that entire process so that it's much more reliable than that


Looks like you’re one of the authors.

It would be nice if you could post whether the actual data matches your reconstruction, now that you have it in hand. That would help us stop worrying about the data provenance and focus on the result you found.


we're not sure if the actual data exactly matches our reconstruction, but one of the authors pointed out to us that we can exactly reproduce their scaling law if we make the mistake they made when fitting it to the data

what they did was take the mean of the loss values across data points instead of summing them, and they used L-BFGS-B with the default tolerance settings, so the optimizer terminated early; we can reproduce their results by making the same mistake

so our reconstruction appears to be good enough
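
for anyone curious, the sum-vs-mean issue looks roughly like this (a toy sketch, not our fitting code or theirs; the model, data, and constants are made up):

  # fit the same robust objective twice with L-BFGS-B at its default
  # tolerances: once summing the per-point losses, once averaging them.
  # averaging divides the objective and its gradient by the number of
  # points, which makes the default ftol/gtol stopping tests fire earlier;
  # in a tiny well-conditioned toy that may not matter, but in a
  # five-parameter fit with a small Huber delta it can stop the optimizer
  # well short of the optimum.
  import numpy as np
  from scipy.optimize import minimize
  from scipy.special import huber

  rng = np.random.default_rng(0)
  x = rng.uniform(0.0, 5.0, size=400)
  y = 1.5 + 0.3 * x + rng.normal(scale=0.05, size=400)

  def objective(params, reduce):
      a, b = params
      losses = huber(1e-3, y - (a + b * x))     # robust per-point loss
      return losses.mean() if reduce == "mean" else losses.sum()

  for reduce in ("sum", "mean"):
      res = minimize(objective, x0=[0.0, 0.0], args=(reduce,),
                     method="L-BFGS-B")          # default ftol / gtol
      print(reduce, res.x, "iterations:", res.nit)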


I do that all the time using WebPlotDigitizer [1]. Works great.

[1] https://apps.automeris.io/wpd/


Seconded. When I first saw this, I thought it looked unintuitive and difficult to use, but when I tried it, it was very easy and I had the extracted data in a few minutes.

They claimed that they did ask several times in one of the replies.

> Why not just emailed the original authors for the raw data?

Industry research labs, especially Google deepmind, are notoriously closed up about their “proprietary” data. I’ve hit this wall multiple times in my own work in AI.


https://twitter.com/borgeaud_s/status/1780988694163321250 says they're going to open the data from the paper. Not sure why they didn't do it before, but good news.

I particularly like this second quote; I appreciate them taking the time to explain "what is a graph" in a scientific paper!

Interesting! If the authors are right, it seems that the number of training tokens required per parameter (slowly) declines as models become larger (Figure 5).

That's good news. I think it deserves wider dissemination, so I'm upvoting your post.

Thank you for sharing this on HN!


Could it be that the independence of the available training points declines as the dataset size grows? At some point it becomes hard to add data that isn't essentially similar to something you've already added.

Yes, could be. Not sure how or even if anyone could prove it, though.

This should be fairly de facto true. Remember your dataset is some proxy for some real (but almost surely intractable) distribution.

Now let's think about filling the space with p-balls bounded by the nearest points, so that there is no data point inside any ball. Then we've turned this into a sphere-packing problem and we can talk about the sizes and volumes of those spheres.

So if we uniformly fill our real distribution with data, then the average volume of those spheres decreases. If we fill it non-uniformly, the average ball still shrinks, but the largest ball shrinks more slowly (that case being a region where we aren't properly covering the data). In either case, the more data you add, the more the balls shrink, essentially meaning the differences between data points decrease. The harder question is about those under-represented regions: finding them and determining how to properly sample them.

Another quick trick you can use to convince yourself is thinking about basis vectors (this won't be robust, btw, but it's a good starting point). In high dimensions, two randomly sampled vectors are almost certainly close to orthogonal. So think of drawing basis vectors (independent vectors that span our space). As we fill in data, we are initially very likely to have vectors (or data) that are independent in some way. But as we add more, the likelihood that they are orthogonal decreases. Of course your basis vectors don't need to be orthogonal, but that's more semantics, because we can always work in a space where that's true.
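
If you want to see that numerically, here's a quick check (illustration only): the cosine similarity between independent random vectors concentrates around zero as the dimension grows, roughly like 1/sqrt(dim).

  import numpy as np

  rng = np.random.default_rng(0)
  for dim in (3, 100, 10_000):
      a = rng.normal(size=(1000, dim))
      b = rng.normal(size=(1000, dim))
      cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
      print(dim, np.abs(cos).mean())   # shrinks as the dimension grows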


I agree, but my question was not whether distance between data points tends to decrease as dataset size grows, but whether that is the reason why the number of training tokens required per parameter declines. It could be, but proving it would require a better understanding of how and why these giant AI models work.

Wasn't your question about how *independent* the data is?

We could talk about this in different ways, like variance. But I'm failing to see how I didn't answer your question. Did I miscommunicate? Did I misunderstand?

The model is learning off of statistics, so most of your information gain comes from more independent data. Think of the space we are talking about as "knowledge," and our "intelligence" as how easy it is to get to any point in this space. The vector view above might help with understanding this one: you can step in the direction of any vectors you have and combine those steps to get to your final point. The question is how many you have to use (how many "steps" away it is), and of course how close you can get to your final destination. As you can imagine from my previous comment, for any given point you'll need fewer "steps" to your destination if you have more vectors, but the utility of each vector also decreases as you add more. (Generally. Of course, if you have a gap in knowledge you can get a big help from a single vector that goes into that area, but let's leave that aside.)

Does this help clarify? If not I might need you to clarify your question a bit more. (I am a ML researcher fwiw)


> Wasn't your question about how independent the data is?

No. My original (top) comment was about how the number of training tokens required per parameter slowly declines as models become larger. dzdt suggested it could be because the independence of training points declines as the dataset size grows. I said it could be, but I'm not sure how one would go about proving it, given how little we know about the inner workings of giant models. Does that make sense?

Otherwise, I agree with everything you wrote!


Oh I see. It's because, yes, we expect this to happen once we get to sufficient coverage. As we linearly increase the number of parameters, the number of configurations increases super-linearly; in other words, so does the amount of information we can compress.

There's a lot we don't know, but it isn't nothing. There's a big push for the idea that ML doesn't need math. It's true you can do a lot without it, especially if you have compute. But the math helps you understand what's going on and what your limits are. We can't explain everything yet, but it's not nothing.


I guess you could artificially limit the training data (e.g. by removing languages or categories) and see if the utility of extra tokens drops off as a result.

This is not good news, this means that we could end up with a dangerously superintelligent AI just by scaling up the number of parameters, without increasing the amount of training data.

No, but LLMs require orders of magnitude more language input than humans[1]. It's very reasonable to assume that architectural differences (size among them) are more likely a constraint on performance.

1. Specifically larger than the upper bound on lifetime language input for humans, even assuming 24/7 at max reading speed.


How much language input does a human need to become intelligent if he doesn’t receive any other input?

Do they? What is the total size of all visual, audio, touch, locomotive, scent, and taste data collected between birth and when a human reaches IQ 100? There are multiple high-bandwidth feeds running into the brain 24/7.

Vision is not necessary for language acquisition.

Proof: blind and partially sighted people exist.


> language input

Yes, but LLMs come out of training as experts in approximately any single thing you can think of, and then some, and all that in dozens of languages. Humans don't achieve even a fraction of this kind of breadth.

LLMs are experts at everything except what the user is an expert in.

Gell-Mann Amnesia effect

> You open the newspaper to an article on some subject you know well. In Murray's case, physics. In mine, show business. You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect. I call these the "wet streets cause rain" stories. Paper's full of them.

> In any case, you read with exasperation or amusement the multiple errors in a story, and then turn the page to national or international affairs, and read as if the rest of the newspaper was somehow more accurate about Palestine than the baloney you just read. You turn the page, and forget what you know.

-Michael Crichton

Edit: Found the speech this is from.

https://web.archive.org/web/20190808123852/http://larvatus.c...


This is not quite accurate, but it's complex because measurement is hard. The things they are being tested on are almost surely within the dataset. Let's take the bar exam, for instance. Sure, we don't know what's in GPT's data, but we know it has Reddit, and we know Reddit has many similar if not exact questions on it. We know that the first GPT-4 did not have good semantic-similarity matching because their contamination check just matched three random 50-character substrings (Appendix C), and they only considered the false-positive side of it. Then there's this line...

  The RLHF post-training dataset is vastly smaller than the pretraining set and unlikely to have any particular question contaminated. However we did not check explicitly.
But my favorite is HumanEval. I'll just remind everyone that this was written by 60 authors, mostly from OpenAI:

  We evaluate functional correctness on a set of 164 handwritten programming problems, which we call the HumanEval dataset. ... __It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources.__
The problems? Well, they're leetcode style... Can you tell me you can write leetcode-style questions that aren't already on GitHub? For example:

  Human Eval 2

  Prompt:

  def truncate_number(number: float) -> float:
      """ Given a positive floating point number, it can be decomposed into
      and integer part (largest integer smaller than given number) and
      decimals (leftover part always smaller than 1).
      Return the decimal part of the number.
      >>> truncate_number(3.5)
      0.5
      """

  Solution:

      return number % 1.0

  Human Eval 4

  Prompt:

  from typing import List

  def mean_absolute_deviation(numbers: List[float]) -> float:
      """ For a given list of input numbers, calculate Mean Absolute Deviation
      around the mean of this dataset. Mean Absolute Deviation is the average
      absolute difference between each element and a centerpoint (mean in this
      case):
      MAD = average | x - x_mean |
      >>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])
      1.0
      """

  Solution:

      mean = sum(numbers) / len(numbers)
      return sum(abs(x - mean) for x in numbers) / len(numbers)
You really want to bet that that isn't on github? Because I'll bet you any dollar amount you want that there are solutions in near exact form that are on github prior to their cutoff date (Don't trust me, you can find them too. They're searchable even). Hell, I've poisoned the dataset here!
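
For reference, the 50-character substring check mentioned above boils down to something like this (my reconstruction of the idea, not OpenAI's code; the function name and defaults are made up). Note that it only catches verbatim overlap, so reworded or reformatted duplicates sail right through:

  import random

  def is_contaminated(eval_text: str, corpus: str,
                      n_probes: int = 3, probe_len: int = 50) -> bool:
      # flag an eval item if any of a few randomly chosen 50-character
      # substrings of it appears verbatim in the training corpus
      # (a real check would index the corpus rather than scan one big string)
      if len(eval_text) <= probe_len:
          return eval_text in corpus
      for _ in range(n_probes):
          start = random.randrange(len(eval_text) - probe_len + 1)
          if eval_text[start:start + probe_len] in corpus:
              return True
      return False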

LLMs are (lossy) compression systems, so they're great for information retrieval. And a lot of what we consider intelligence (and possibly even creativity) is based on information retrieval. That doesn't mean these things are any less impressive; it's just a note on how we should be interpreting results and understanding the limitations of our tools. Measuring intelligence is a really difficult thing, and we need to be aware that the term isn't universally agreed upon, so people are often talking past one another, and some conflate the different definitions as if they were the same.


LLMs are super-intelligent at mimicking already, it won't take much time to find some kind of RL loop there.

Like a corporation then. We should ban them until we can figure out how to align them!

ASI is nothing like a corporation

No, they're not. Corporations have known, concrete impacts on the world, whereas the dangers of AI are, so far, corporations. ASIs are (as yet) fictional.

Another difference: most corporations will avoid doing illegal stuff if the penalties are large enough: the corporation alignment problem is political. Pretty much no extant AI systems can be instructed in this way: we don't know how to align AIs even in theory.


For organisms the ultimate punishment is death. How do you delete an AI from the internet?

sudo rm * -rf

That won't provide any motivation: no AI system yet created fears death (except perhaps some of the really simple, evolved ones – but I'd question whether they're sophisticated enough to fear).

> Corporations have known, concrete impacts on the world

I hate to do this, but can you enumerate them?


It is very much like a corporation; a corp is effectively an AGI, just running very slowly - at the speed of bureaucracy.

It's only bad news if you don't want a dangerously superintelligent AI.

No one should want this.

The original Chinchilla authors have now identified the original bug, apparently: https://twitter.com/borgeaud_s/status/1780988694163321250

Lovely, they are also open sourcing data.

The scientific process at work!

TL;DR—couldn’t exactly replicate their results, but broadly confirmed their findings. They agree that the optimal range is 5–40 tokens per parameter, and close to 20 for the “chinchilla” model from the original paper.

Very unusual choice to reconstruct the dataset by eyeballing the graph in the source paper (why not just ask for it…?) and it’s not really clear why the result is dressed up behind the salacious-seeming abstract.


we didn't eyeball the graph, there are more accurate ways of extracting the data from a pdf file than that

we did ask for the data but got no response until we published on arxiv

what is supposed to be "salacious" about the abstract?


Key claims:

"We have found three potential issues with Hoffmann et al.’s estimates of the Chinchilla scaling law that rely on Approach 3: 1. Their estimated model fits the reconstructed data very poorly. These conclusions hold even when accounting for potential noise in data reconstruction and excluding outlier models. 2. The confidence are implausibly tight given the number of data points. Obtaining confidence intervals that tight would require many hundreds of thousands of observations, while they likely had only ∼400. 3. Their estimated model implies a scaling policy that is inconsistent with their other approach"

The data point most people are probably looking for: "We find a range consistent with the 20 tokens per parameter rule of thumb. Indeed, our point estimates imply that 25.6 tokens per parameter is optimal."
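
On claim 2, the back-of-the-envelope version (assuming the usual 1/sqrt(n) scaling of standard errors, and taking "many hundreds of thousands" as roughly 500,000, which is my round number, not the paper's):

  # with only ~400 observations, intervals that would honestly require
  # ~500,000 observations are about sqrt(500_000 / 400) ~= 35x too tight
  n_available, n_required = 400, 500_000
  print((n_required / n_available) ** 0.5)   # ~35.4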


Their rule of thumb would imply that a 70B model is saturated with 1.7T tokens; that's inconsistent with reality.

The Chinchilla laws were compute optimal scaling laws. They're not supposed to tell you what parameter-token combination will saturate a model.

Compute-optimal for what, training? There's nothing optimal in blowing up model size beyond the absolute minimum needed, or you'll spend the equivalent of a country's electricity trying to scale inference later.

Yes, compute-optimal for training only. The purpose of the paper wasn't to determine the most economically practical model one could build, but rather the most "intelligent" model one could build given some amount of compute.

Quite. The big question at the time was "how much data do we need to train GPT-3 equivalent models". Open models had failed to live up to GPT performance, even ones with a massive number of parameters. So getting results that suggested a reason why other models were massively undertrained was important.

Meanwhile, people noticed that for deployed models, inference cost often outweighs the initial training costs. It's sometimes better to train a smaller, faster model longer on more data, because it has lower overall cost (including environmental impact) if you're expecting to run the model a few million or billion times (e.g., [1]). So training past the Chinchilla optimum point became a lot more common, particularly after Llama.

[1] https://arxiv.org/abs/2401.00448
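
To make "compute-optimal for training" concrete, here's a hedged sketch using the parametric loss form from the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta, together with the usual C ~ 6*N*D FLOPs approximation. The constants below are placeholders in the right ballpark, not the published fit, so the printed ratio won't reproduce the ~20-25 tokens per parameter discussed above:

  # for a fixed training-FLOP budget C, sweep over model sizes N, let the
  # budget determine the token count D = C / (6N), and pick the N that
  # minimises the predicted loss
  import numpy as np

  E, A, B, alpha, beta = 1.7, 400.0, 400.0, 0.34, 0.28   # placeholder constants

  def loss(N, D):
      return E + A / N**alpha + B / D**beta

  C = 1e23                                  # training budget in FLOPs
  Ns = np.logspace(8, 12, 2000)             # candidate parameter counts
  Ds = C / (6 * Ns)                         # tokens implied by the budget
  best_N = Ns[np.argmin(loss(Ns, Ds))]
  best_D = C / (6 * best_N)
  print(f"N ~ {best_N:.3g}, D ~ {best_D:.3g}, tokens/param ~ {best_D / best_N:.1f}")

Nothing in this objective accounts for inference cost, which is the point of [1]: once you expect to run the model billions of times, the economically sensible N shifts smaller.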


Training yes.

Doubling your parameter count past that ratio will yield a better model than doubling your data and is much easier and cheaper to do.


That suggests that it's likely memorizing more special cases rather than distilling general principles. They generalize to some degree but clearly there's room for improvement.

It doesn't really suggest anything. Neither model would be even close to saturation, and all else equal, bigger models perform better in every way, including generalization.

But why do bigger models perform better? Arguably because there's a larger state space that can be used to remember more contexts, which helps with both generalization and case-specific processing.

>But why do bigger models perform better?

No one really knows the answer to this question.

>Arguably because there's a larger state space that can be used to remember more contexts, which helps with both generalization and case-specific processing.

What I'm trying to say is that both models in either scenario are very over-parameterized and under-trained.

You say the answer is extra space? The smaller model has not used anywhere near the space it has. They both have extra space.

It's like arguing a bigger drum is better because of extra space when all the water you plan to store won't even fill half of your smaller drum.


Blow up model size, get lots of space and parameters to do the double-descent grok thing in, then distill it way way down?

No, their claim is that for a fixed (training) compute budget there are diminishing returns to scaling up data past that threshold vs. scaling up params.

This doesn't take inference into account either, obviously.


Calling this a "replication attempt" implied to me that they tried to replicate the Chinchilla Scaling paper and found that it did not replicate, which would be a very big deal!

Instead, they just redid the analysis based on a figure in the paper and found that the old model with slightly different parameters gave a better fit to the data. This is a valuable contribution, but a bit over-stated by the paper title, and the confrontational, "gotcha" tone of the paper is unwarranted.

A better framing would have been something like "Chinchilla Scaling: Reanalyzed".


one of their three approaches does not replicate and it's because of a software bug in the optimizer they used, i don't know what else we were supposed to say


