Chinchilla Scaling: A replication attempt (arxiv.org)
124 points by tosh 13 days ago | 68 comments





> To extract the data from the figure, we first downloaded the PDF from Hoffmann et al.’s arXiv submission and saved it in SVG format. We then parsed the SVG content to navigate and search the SVG structure. Within the SVG, we identified the group of points representing the scatter plot data and iterated over each point to extract its fill color and position (x and y coordinates) using the attributes of the corresponding SVG elements.

> To map the SVG coordinates to the model size and training FLOP values, we used the location of the labels or ticks on the respective axes. This allowed us to establish a correspondence between the SVG coordinates and the actual data values represented in the plot.

They ... reconstructed the data ... from a plot ... using ruler and eyes? Why not just email the original authors for the raw data? I can't help but feel like this is @yuvaltheterrible debunking papers.


Funnily enough, I've done this for a paper I wrote as well. Emailing authors is kind of a crapshoot. It's normal to get no response if it's been several years since the paper came out. In this case, a pdf plot is essentially lossless, and it's much faster than waiting for authors to maybe respond.

And not only that, in many cases they will tell you (if they reply) "oh, we can't find the source of that plot anymore". Happened to me quite a few times (although in physics).

I'm pretty sure I'm not the only one who's written themselves a mini tool to extract data even from a bitmap plot based on the axes. It involves some manual steps (cropping mainly), but it's very convenient for the cases where people don't even use vector graphics, but sometimes just screenshots of plots... Do I like it? Hell no! It's why I've put quite some effort into doing it better for my PhD thesis.
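
For the curious, the core of such a tool is small. A rough sketch of the idea (my own illustration, not the tool I actually use; the file name, marker-colour threshold, and axis limits are placeholders you'd read off the cropped figure):

  # crop the image to the plot area first, then threshold the marker colour
  # and map pixel coordinates to data values via the known axis limits
  import numpy as np
  from PIL import Image

  img = np.asarray(Image.open("plot_cropped.png").convert("RGB"))
  h, w, _ = img.shape

  # pixels close to the marker colour (assumed bluish here)
  mask = (img[..., 2] > 150) & (img[..., 0] < 100) & (img[..., 1] < 100)
  rows, cols = np.nonzero(mask)

  x_min, x_max = 0.0, 10.0          # axis limits read off the figure
  y_min, y_max = 1.0, 5.0
  x_data = x_min + cols / (w - 1) * (x_max - x_min)
  y_data = y_max - rows / (h - 1) * (y_max - y_min)   # image rows grow downward

The real annoyance is everything around this: cropping, anti-aliased markers, overlapping points, log axes.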


Yeah, it's very annoying, especially these days when there's no real excuse not to have a copy. You can easily store all code and data for free and in an accessible manner. Even just GitHub is good enough for 90+% of cases. Hugging Face helps, and there are many other ways too.

I remember my first year in grad school I was trying to replicate a work by a very prestigious university. It definitely wasn't reproducible from the text, but I did my best. Couldn't get close to their claims, so I emailed the lead author (another grad student). No response. Luckily my advisor knew their advisor. Got a meeting and then I got sent code. It was nothing like what they claimed in the paper, so I have no idea what they gave me. Anyways, my paper never got published because I couldn't beat them. It is what it is.


To be fair, sometimes (e.g. in the case of scatter plots with many dots) PDF renderers become very slow and/or mess up the rendering. In this case the easiest option is rasterizing it (for performance and consistency of appearance).

That is certainly true (and why I added a general "embed plot data as bitmap into SVG/PDF" option to https://github.com/Vindaar/ggplotnim that works not only for raster heatmaps). But realistically such plots are often not ideal anyway (too many data points in a plot is often a sign that a different type of plot would be better, typically one that aggregates in some way), and it's just another argument for making the data behind plots available as well.

If you have the misfortune of having to use Word for writing manuscripts and/or have scatter plots with a good number of points, SVGs will ruin your day in my experience.

(Yes, I'd much rather use LaTeX)


Somebody tell them that huggingface, github, gitlab, codeberg etc exist.

In fairness, they did not use a ruler or eyes. Based on the excerpts you quote, they extracted exact coordinates of the data from an SVG, which, if the SVG was created correctly, should at least give an unbiased dataset, maybe with less precision than the source.
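
In Python the core of that extraction might look something like this (a minimal sketch of the procedure the excerpt describes, not the authors' code; the file name, tick pixel positions, and tick values are placeholders you'd read off the figure by hand):

  # parse the SVG, pull out the scatter-plot circles, and map their SVG
  # coordinates to data values using two axis ticks with known values
  import xml.etree.ElementTree as ET
  import numpy as np

  SVG_NS = "{http://www.w3.org/2000/svg}"

  def svg_to_data(coord, tick_coords, tick_values, log_scale=True):
      # linear interpolation between two known ticks (in log space for log axes)
      (c0, c1), (v0, v1) = tick_coords, tick_values
      if log_scale:
          v0, v1 = np.log10(v0), np.log10(v1)
      value = v0 + (coord - c0) / (c1 - c0) * (v1 - v0)
      return 10 ** value if log_scale else value

  points = []
  for circle in ET.parse("figure.svg").iter(f"{SVG_NS}circle"):
      cx, cy = float(circle.get("cx")), float(circle.get("cy"))
      fill = circle.get("fill")                                   # colour encodes the loss
      flops = svg_to_data(cx, (100.0, 700.0), (1e18, 1e22))       # x axis: training FLOPs (placeholder ticks)
      params = svg_to_data(cy, (500.0, 50.0), (1e8, 1e11))        # y axis: model size (SVG y runs downward)
      points.append((flops, params, fill))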

we did and gave them a two week grace period to respond, but they only responded to us after we published on arxiv

also, we didn't reconstruct the data using a ruler, you can automate that entire process so that it's much more reliable than that


Looks like you’re one of the authors.

It would be nice if you could post whether the actual data matches your reconstruction, now that you have it in hand. That would help us stop worrying about the data provenance and focus on the result you found.


we're not sure if the actual data exactly matches our reconstruction, but one of the authors pointed out to us that we can exactly reproduce their scaling law if we make the mistake they made when fitting it to the data

what they did was take the mean of the loss values across data points instead of summing them, and they used L-BFGS-B with the default tolerance settings, so the optimizer terminated early; we can reproduce their results by making the same mistake

so our reconstruction appears to be good enough
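
for anyone curious, the sum-vs-mean issue looks roughly like this (a toy sketch, not our fitting code or theirs; the model, data, and constants are made up):

  # fit the same robust objective twice with L-BFGS-B at its default
  # tolerances: once summing the per-point losses, once averaging them.
  # averaging divides the objective and its gradient by the number of
  # points, which makes the default ftol/gtol stopping tests fire earlier;
  # in a tiny well-conditioned toy that may not matter, but in a
  # five-parameter fit with a small Huber delta it can stop the optimizer
  # well short of the optimum.
  import numpy as np
  from scipy.optimize import minimize
  from scipy.special import huber

  rng = np.random.default_rng(0)
  x = rng.uniform(0.0, 5.0, size=400)
  y = 1.5 + 0.3 * x + rng.normal(scale=0.05, size=400)

  def objective(params, reduce):
      a, b = params
      losses = huber(1e-3, y - (a + b * x))     # robust per-point loss
      return losses.mean() if reduce == "mean" else losses.sum()

  for reduce in ("sum", "mean"):
      res = minimize(objective, x0=[0.0, 0.0], args=(reduce,),
                     method="L-BFGS-B")          # default ftol / gtol
      print(reduce, res.x, "iterations:", res.nit)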


I do that all the time using WebPlotDigitizer [1]. Works great.

[1] https://apps.automeris.io/wpd/


Seconded. When I first saw this, I thought it looked unintuitive and difficult to use, but when I tried it, it was very easy and I had the extracted data in a few minutes.

They claimed that they did ask several times in one of the replies.

> Why not just emailed the original authors for the raw data?

Industry research labs, especially Google deepmind, are notoriously closed up about their “proprietary” data. I’ve hit this wall multiple times in my own work in AI.


https://twitter.com/borgeaud_s/status/1780988694163321250 says they're going to open the data from the paper. Not sure why they didn't do it before, but good news.

I particularly like this second quote; I appreciate them taking the time to explain "what is a graph" in a scientific paper!

Interesting! If the authors are right, it seems that the number of training tokens required per parameter (slowly) declines as models become larger (Figure 5).

That's good news. I think it deserves wider dissemination, so I'm upvoting your post.

Thank you for sharing this on HN!


Could it be that the independence of the available training points declines as the dataset size grows? At some point it becomes hard to add data that isn't essentially similar to something you've already added.

Yes, could be. Not sure how or even if anyone could prove it, though.

This should be fairly de facto true. Remember your dataset is some proxy for some real (but almost surely intractable) distribution.

Now let's think about filling the space with p-balls bounded by the nearest points, so that there is no data point inside any ball. Then we've turned this into a sphere-packing problem and we can talk about the sizes and volumes of those spheres.

So if we uniformly fill our real distribution with data, then the average volume of those spheres decreases. If we fill it non-uniformly, the average ball still shrinks, but the largest ball shrinks more slowly (that case being a region where we aren't properly covering the data). In either case, the more data you add, the more the balls shrink, essentially meaning the differences between data points decrease. The harder question is about those under-represented regions: finding them and determining how to properly sample them.

Another quick trick you can use to convince yourself is thinking about basis vectors (this won't be robust, btw, but it's a good starting point). In high dimensions, two randomly sampled vectors are almost certainly close to orthogonal. So think of drawing basis vectors (independent vectors that span our space). As we fill in data, we are initially very likely to have vectors (or data) that are independent in some way. But as we add more, the likelihood that they are orthogonal decreases. Of course your basis vectors don't need to be orthogonal, but that's more semantics, because we can always work in a space where that's true.
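
If you want to see that numerically, here's a quick check (illustration only): the cosine similarity between independent random vectors concentrates around zero as the dimension grows, roughly like 1/sqrt(dim).

  import numpy as np

  rng = np.random.default_rng(0)
  for dim in (3, 100, 10_000):
      a = rng.normal(size=(1000, dim))
      b = rng.normal(size=(1000, dim))
      cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
      print(dim, np.abs(cos).mean())   # shrinks as the dimension grows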


I agree, but my question was not whether distance between data points tends to decrease as dataset size grows, but whether that is the reason why the number of training tokens required per parameter declines. It could be, but proving it would require a better understanding of how and why these giant AI models work.

Wasn't your question about how *independent* the data is?

We could talk about this in different ways, like variance. But I'm failing to see how I didn't answer your question. Did I miscommunicate? Did I misunderstand?

The model is learning off of statistics, so most of your information gain comes from more independent data. Think of the space we are talking about as "knowledge," and our "intelligence" as how easy it is to get to any point in this space. The vector view above might help with understanding this one: you can step in the direction of any vectors you have and combine those steps to get to your final point. The question is how many you have to use (how many "steps" away it is), and of course how close you can get to your final destination. As you can imagine from my previous comment, for any given point you'll need fewer "steps" to your destination if you have more vectors, but the utility of each vector also decreases as you add more. (Generally. Of course, if you have a gap in knowledge you can get a big help from a single vector that goes into that area, but let's leave that aside.)

Does this help clarify? If not I might need you to clarify your question a bit more. (I am a ML researcher fwiw)


> Wasn't your question about how independent the data is?

No. My original (top) comment was about how the number of training tokens required per parameter slowly declines as models become larger. dzdt suggested it could be because the independence of training points declines as the dataset size grows. I said it could be, but I'm not sure how one would go about proving it, given how little we know about the inner workings of giant models. Does that make sense?

Otherwise, I agree with everything you wrote!


Oh I see. It's because, yes, we expect this to happen once we get to sufficient coverage. As we linearly increase the number of parameters, the number of configurations increases super-linearly; in other words, so does the amount of information we can compress.

There's a lot we don't know, but it isn't nothing. There's a big push for the idea that ML doesn't need math. It's true you can do a lot without it, especially if you have compute. But the math helps you understand what's going on and what your limits are. We can't explain everything yet, but it's not nothing.


I guess you could artificially limit the training data (e.g. by removing languages or categories) and see if the utility of extra tokens drops off as a result.

This is not good news, this means that we could end up with a dangerously superintelligent AI just by scaling up the number of parameters, without increasing the amount of training data.

No, but LLMs require orders of magnitude more language input than humans[1]. It's very reasonable to assume that architectural differences (size among them) are more likely a constraint on performance.

1. Specifically larger than the upper bound on lifetime language input for humans, even assuming 24/7 at max reading speed.


How much language input does a human need to become intelligent if he doesn’t receive any other input?

Do they? What is the total size of all visual, audio, touch, locomotive, scent, and taste data collected between birth and when a human reaches IQ 100? There are multiple high-bandwidth feeds running into the brain 24/7.

Vision is not necessary for language acquisition.

Proof: blind and partially sighted people exist.


> language input

Yes, but LLMs come out of training as experts in approximately any single thing you can think of, and then some, and all that in dozens of languages. Humans don't achieve even a fraction of this kind of breadth.

LLMs are experts at everything except what the user is an expert in.

Gell-Mann Amnesia effect

> You open the newspaper to an article on some subject you know well. In Murray's case, physics. In mine, show business. You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect. I call these the "wet streets cause rain" stories. Paper's full of them.

> In any case, you read with exasperation or amusement the multiple errors in a story, and then turn the page to national or international affairs, and read as if the rest of the newspaper was somehow more accurate about Palestine than the baloney you just read. You turn the page, and forget what you know.

-Michael Crichton

Edit: Found the speech this is from.

https://web.archive.org/web/20190808123852/http://larvatus.c...


This is not quite accurate, but it's complex because measurement is hard. The things they are being tested on are almost surely within the dataset. Let's take the bar exam, for instance. Sure, we don't know what's in GPT's data, but we know it has Reddit, and we know Reddit has many similar if not exact questions on it. We know that the first GPT-4 did not have good semantic-similarity matching because their contamination check just matched three random 50-character substrings (Appendix C), and they only considered the false-positive side of it. Then there's this line...

  The RLHF post-training dataset is vastly smaller than the pretraining set and unlikely to have any particular question contaminated. However we did not check explicitly.
But my favorite is HumanEval. I'll just remind everyone that this was written by 60 authors, mostly from OpenAI:

  We evaluate functional correctness on a set of 164 handwritten programming problems, which we call the HumanEval dataset. ... __It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources.__
The problems? Well, they're leetcode style... Can you tell me you can write leetcode-style questions that aren't already on GitHub? For example:

  Human Eval 2

  Prompt:

  def truncate_number(number: float) -> float:
      """ Given a positive floating point number, it can be decomposed into
      and integer part (largest integer smaller than given number) and
      decimals (leftover part always smaller than 1).
      Return the decimal part of the number.
      >>> truncate_number(3.5)
      0.5
      """

  Solution:

      return number % 1.0

  Human Eval 4

  Prompt:

  from typing import List

  def mean_absolute_deviation(numbers: List[float]) -> float:
      """ For a given list of input numbers, calculate Mean Absolute Deviation
      around the mean of this dataset. Mean Absolute Deviation is the average
      absolute difference between each element and a centerpoint (mean in this
      case):
      MAD = average | x - x_mean |
      >>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])
      1.0
      """

  Solution:

      mean = sum(numbers) / len(numbers)
      return sum(abs(x - mean) for x in numbers) / len(numbers)
You really want to bet that that isn't on github? Because I'll bet you any dollar amount you want that there are solutions in near exact form that are on github prior to their cutoff date (Don't trust me, you can find them too. They're searchable even). Hell, I've poisoned the dataset here!
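
For reference, the 50-character substring check mentioned above boils down to something like this (my reconstruction of the idea, not OpenAI's code; the function name and defaults are made up). Note that it only catches verbatim overlap, so reworded or reformatted duplicates sail right through:

  import random

  def is_contaminated(eval_text: str, corpus: str,
                      n_probes: int = 3, probe_len: int = 50) -> bool:
      # flag an eval item if any of a few randomly chosen 50-character
      # substrings of it appears verbatim in the training corpus
      # (a real check would index the corpus rather than scan one big string)
      if len(eval_text) <= probe_len:
          return eval_text in corpus
      for _ in range(n_probes):
          start = random.randrange(len(eval_text) - probe_len + 1)
          if eval_text[start:start + probe_len] in corpus:
              return True
      return False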

LLMs are (lossy) compression systems, so they're great for information retrieval. And a lot of what we consider intelligence (and possibly even creativity) is based on information retrieval. That doesn't mean these things are any less impressive; it's just a note on how we should be interpreting results and understanding the limitations of our tools. Measuring intelligence is a really difficult thing, and we need to be aware that the term isn't universally agreed upon, so people are often talking past one another, and some conflate the different definitions as if they were the same.


LLMs are super-intelligent at mimicking already, it won't take much time to find some kind of RL loop there.

Like a corporation then. We should ban them until we can figure out how to align them!

ASI is nothing like a corporation

No, they're not. Corporations have known, concrete impacts on the world, whereas the dangers of AI are, so far, corporations. ASIs are (as yet) fictional.

Another difference: most corporations will avoid doing illegal stuff if the penalties are large enough: the corporation alignment problem is political. Pretty much no extant AI systems can be instructed in this way: we don't know how to align AIs even in theory.


For organisms the ultimate punishment is death. How do you delete an AI from the internet?

sudo rm * -rf

That won't provide any motivation: no AI system yet created fears death (except perhaps some of the really simple, evolved ones – but I'd question whether they're sophisticated enough to fear).

> Corporations have known, concrete impacts on the world

I hate to do this, but can you enumerate them?


It is very much like a corporation; a corp is effectively an AGI, just running very slowly - at the speed of bureaucracy.

It's only bad news if you don't want a dangerously superintelligent AI.

No one should want this.

The original Chinchilla authors have now identified the original bug, apparently: https://twitter.com/borgeaud_s/status/1780988694163321250

Lovely, they are also open sourcing data.

The scientific process at work!

TL;DR—couldn’t exactly replicate their results, but broadly confirmed their findings. They agree that the optimal range is 5–40 tokens per parameter, and close to 20 for the “chinchilla” model from the original paper.

Very unusual choice to reconstruct the dataset by eyeballing the graph in the source paper (why not just ask for it…?) and it’s not really clear why the result is dressed up behind the salacious-seeming abstract.


we didn't eyeball the graph, there are more accurate ways of extracting the data from a pdf file than that

we did ask for the data but got no response until we published on arxiv

what is supposed to be "salacious" about the abstract?


Key claims:

"We have found three potential issues with Hoffmann et al.’s estimates of the Chinchilla scaling law that rely on Approach 3: 1. Their estimated model fits the reconstructed data very poorly. These conclusions hold even when accounting for potential noise in data reconstruction and excluding outlier models. 2. The confidence are implausibly tight given the number of data points. Obtaining confidence intervals that tight would require many hundreds of thousands of observations, while they likely had only ∼400. 3. Their estimated model implies a scaling policy that is inconsistent with their other approach"

The data point most people are probably looking for: "We find a range consistent with the 20 tokens per parameter rule of thumb. Indeed, our point estimates imply that 25.6 tokens per parameter is optimal."
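
On claim 2, the back-of-the-envelope version (assuming the usual 1/sqrt(n) scaling of standard errors, and taking "many hundreds of thousands" as roughly 500,000, which is my round number, not the paper's):

  # with only ~400 observations, intervals that would honestly require
  # ~500,000 observations are about sqrt(500_000 / 400) ~= 35x too tight
  n_available, n_required = 400, 500_000
  print((n_required / n_available) ** 0.5)   # ~35.4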


Their rule of thumb would imply that a 70B model is saturated with 1.7T tokens; that's inconsistent with reality.

The Chinchilla laws were compute optimal scaling laws. They're not supposed to tell you what parameter-token combination will saturate a model.

Compute-optimal for what, training? There's nothing optimal in blowing up model size beyond the absolute minimum needed, or you'll spend the equivalent of a country's electricity trying to scale inference later.

Yes, compute-optimal for training only. The purpose of the paper wasn't to determine the most economically practical model one could build, but rather the most "intelligent" model one could build given some amount of compute.

Quite. The big question at the time was "how much data do we need to train GPT-3 equivalent models". Open models had failed to live up to GPT performance, even ones with a massive number of parameters. So getting results that suggested a reason why other models were massively undertrained was important.

Meanwhile, people noticed that for deployed models, inference cost often outweighs the initial training costs. It's sometimes better to train a smaller, faster model longer on more data, because it has lower overall cost (including environmental impact) if you're expecting to run the model a few million or billion times (e.g., [1]). So training past the Chinchilla optimum point became a lot more common, particularly after Llama.

[1] https://arxiv.org/abs/2401.00448
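
To make "compute-optimal for training" concrete, here's a hedged sketch using the parametric loss form from the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta, together with the usual C ~ 6*N*D FLOPs approximation. The constants below are placeholders in the right ballpark, not the published fit, so the printed ratio won't reproduce the ~20-25 tokens per parameter discussed above:

  # for a fixed training-FLOP budget C, sweep over model sizes N, let the
  # budget determine the token count D = C / (6N), and pick the N that
  # minimises the predicted loss
  import numpy as np

  E, A, B, alpha, beta = 1.7, 400.0, 400.0, 0.34, 0.28   # placeholder constants

  def loss(N, D):
      return E + A / N**alpha + B / D**beta

  C = 1e23                                  # training budget in FLOPs
  Ns = np.logspace(8, 12, 2000)             # candidate parameter counts
  Ds = C / (6 * Ns)                         # tokens implied by the budget
  best_N = Ns[np.argmin(loss(Ns, Ds))]
  best_D = C / (6 * best_N)
  print(f"N ~ {best_N:.3g}, D ~ {best_D:.3g}, tokens/param ~ {best_D / best_N:.1f}")

Nothing in this objective accounts for inference cost, which is the point of [1]: once you expect to run the model billions of times, the economically sensible N shifts smaller.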


Training yes.

Doubling your parameter count past that ratio will yield a better model than doubling your data and is much easier and cheaper to do.


That suggests that it's likely memorizing more special cases rather than distilling general principles. They generalize to some degree but clearly there's room for improvement.

It doesn't really suggest anything. Neither model would be even close to saturation, and all else equal, bigger models perform better in every way, including generalization.

But why do bigger models perform better? Arguably because there's a larger state space that can be used to remember more contexts, which helps with both generalization and case-specific processing.

>But why do bigger models perform better?

No one really knows the answer to this question.

>Arguably because there's a larger state space that can be used to remember more contexts, which helps with both generalization and case-specific processing.

What I'm trying to say is that both models in either scenario are very over-parameterized and under-trained.

You say the answer is extra space? The smaller model has not used anywhere near the space it has. They both have extra space.

It's like arguing a bigger drum is better because of extra space when all the water you plan to store won't even fill half of your smaller drum.


Blow up model size, get lots of space and parameters to do the double-descent grok thing in, then distill it way way down?

No, their claim is that for a fixed (training) compute budget there are diminishing returns to scaling up data past that threshold vs. scaling up params.

This doesn't take inference into account either, obviously.


Calling this a "replication attempt" implied to me that they tried to replicate the Chinchilla Scaling paper and found that it did not replicate, which would be a very big deal!

Instead, they just redid the analysis based on a figure in the paper and found that the old model with slightly different parameters gave a better fit to the data. This is a valuable contribution, but a bit over-stated by the paper title, and the confrontational, "gotcha" tone of the paper is unwarranted.

A better framing would have been something like "Chinchilla Scaling: Reanalyzed".


one of their three approaches does not replicate and it's because of a software bug in the optimizer they used, i don't know what else we were supposed to say


