
> To extract the data from the figure, we first downloaded the PDF from Hoffmann et al.’s arXiv submission and saved it in SVG format. We then parsed the SVG content to navigate and search the SVG structure. Within the SVG, we identified the group of points representing the scatter plot data and iterated over each point to extract its fill color and position (x and y coordinates) using the attributes of the corresponding SVG elements.

> To map the SVG coordinates to the model size and training FLOP values, we used the location of the labels or ticks on the respective axes. This allowed us to establish a correspondence between the SVG coordinates and the actual data values represented in the plot.
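(For a sense of what that looks like in practice, here is a minimal Python sketch of that kind of extraction. It assumes the scatter points are plain <circle> elements; the file name and the tick calibration values are placeholders, not the real figure's.)

    import math
    import xml.etree.ElementTree as ET

    SVG_NS = {"svg": "http://www.w3.org/2000/svg"}

    # Placeholder calibration: SVG positions of two ticks on each
    # (log-scale) axis, read off from the tick labels in the file.
    X_TICKS = [(100.0, 1e18), (500.0, 1e24)]  # (svg_x, training FLOP)
    Y_TICKS = [(400.0, 1e8), (50.0, 1e11)]    # (svg_y, model size)

    def svg_to_value(coord, ticks):
        """Interpolate linearly in log space between two known ticks."""
        (c0, v0), (c1, v1) = ticks
        t = (coord - c0) / (c1 - c0)
        return 10 ** (math.log10(v0) + t * (math.log10(v1) - math.log10(v0)))

    tree = ET.parse("figure.svg")
    points = []
    for circle in tree.getroot().iterfind(".//svg:circle", SVG_NS):
        cx, cy = float(circle.get("cx")), float(circle.get("cy"))
        fill = circle.get("fill")  # fill color encodes which series a point belongs to
        points.append((svg_to_value(cx, X_TICKS), svg_to_value(cy, Y_TICKS), fill))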

They ... reconstructed the data ... from a plot ... using a ruler and their eyes? Why not just email the original authors for the raw data? I can't help but be reminded of @yuvaltheterrible debunking papers.




Funnily enough, I've done this for a paper I wrote as well. Emailing authors is kind of a crapshoot. It's normal to get no response if it's been several years since the paper came out. In this case, a pdf plot is essentially lossless, and it's much faster than waiting for authors to maybe respond.


And not only that, in many cases they will tell you (if they reply) "oh, we can't find the source of that plot anymore". Happened to me quite a few times (although in physics).

I'm pretty sure I'm not the only one who's written themselves a mini tool to extract data even from a bitmap plot based on the axes. It involves some manual steps (cropping mainly), but it's very convenient for the cases where people don't use vector graphics at all, sometimes even just screenshots of plots... Do I like it? Hell no! That's why I've put quite some effort into doing it better for my PhD thesis.
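The core of such a tool fits in a few lines; here's a rough sketch with NumPy/PIL, where the file name, marker color, and axis calibration values are all placeholders you'd fill in per plot:

    import numpy as np
    from PIL import Image

    # Pixel coordinates of two known positions on each axis (found while
    # cropping) and the data values they correspond to -- placeholders.
    X_CAL = [(120, 0.0), (980, 10.0)]  # (pixel_x, data_x)
    Y_CAL = [(640, 0.0), (60, 1.0)]    # (pixel_y, data_y)

    def pix_to_data(p, cal):
        (p0, v0), (p1, v1) = cal
        return v0 + (p - p0) * (v1 - v0) / (p1 - p0)

    img = np.asarray(Image.open("plot.png").convert("RGB"))
    target = np.array([31, 119, 180])  # marker color to look for
    mask = np.abs(img.astype(int) - target).sum(axis=-1) < 30
    ys, xs = np.nonzero(mask)          # pixel positions of matching pixels
    data = [(pix_to_data(x, X_CAL), pix_to_data(y, Y_CAL)) for x, y in zip(xs, ys)]

Clustering the matching pixels into individual markers is the fiddly part in practice.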


Yeah, it's very annoying, especially these days when there's no real excuse not to have a copy. You can easily store all code and data for free and in an accessible manner. Even just GitHub is good enough for 90+% of cases. Hugging Face helps, and there are many other ways too.

I remember in my first year of grad school I was trying to replicate a work by a very prestigious university. It definitely wasn't reproducible from the text alone, but I did my best. I couldn't get close to their claims, so I emailed the lead author (another grad student). No response. Luckily my advisor knew their advisor. We got a meeting and then I got sent code. It was nothing like what they claimed in the paper, so I have no idea what they gave me. Anyway, my paper never got published because I couldn't beat them. It is what it is.


To be fair, sometimes (e.g. in the case of scatter plots with many dots) PDF renderers become very slow and/or mess up the rendering. In that case the easiest option is rasterizing the plot (for performance and consistency of appearance).
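For what it's worth, matplotlib supports exactly this per artist: pass rasterized=True to the scatter call and only the points are embedded as a bitmap, while axes, ticks, and labels stay vector in the saved PDF/SVG. A minimal example:

    import numpy as np
    import matplotlib.pyplot as plt

    x, y = np.random.randn(2, 100_000)
    fig, ax = plt.subplots()
    # Only the scatter points get rasterized; the rest stays vector.
    ax.scatter(x, y, s=2, alpha=0.3, rasterized=True)
    fig.savefig("scatter.pdf", dpi=300)  # dpi controls the bitmap resolution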


That is certainly true (and why I added a general "embed plot data as bitmap into SVG/PDF" option to https://github.com/Vindaar/ggplotnim that works not only for raster heatmaps). But realistically such plots are often not ideal anyway: too many data points in a plot is often a sign that a different type of plot would be better, typically one that aggregates in some way. And it's just another argument for making the data behind plots available as well.


If you have the misfortune of having to use Word for writing manuscripts and/or have scatter plots with a good number of points, SVGs will ruin your day in my experience.

(Yes, I'd much rather use LaTeX)


Somebody tell them that Hugging Face, GitHub, GitLab, Codeberg, etc. exist.


In fairness, they did not use a ruler or eyes. Based on the excerpts you quote, they extracted the exact coordinates of the data from an SVG, which, if the SVG was created correctly, should at least give an unbiased dataset, perhaps with less precision than the source.


we did, and gave them a two-week grace period to respond, but they only responded to us after we published on arXiv

also, we didn't reconstruct the data using a ruler, you can automate that entire process so that it's much more reliable than that


Looks like you’re one of the authors.

It would be nice if you could post whether the actual data matches your reconstruction, now that you have it in hand. That would help us not worry about the data provenance and focus on the result you found.


we're not sure if the actual data exactly matches our reconstruction, but one of the authors pointed out to us that we can exactly reproduce their scaling law if we make the mistake they made when fitting it to the data

what they did was take the mean of the loss values across data points instead of summing them, and they used L-BFGS-B with the default tolerance settings, so the optimizer terminated early; we can reproduce their results with this same mistake

so our reconstruction appears to be good enough
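(For anyone curious, the pitfall is easy to demonstrate on a toy problem. This is an illustrative sketch with made-up data, not the actual scaling-law fit: averaging divides the objective and its gradient by the number of points, and because L-BFGS-B's default stopping tolerances (ftol ~ 2.2e-9, gtol = 1e-5) act on those shrunken values, the optimizer can stop after far less real progress.)

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = rng.uniform(1, 10, 1000)
    y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 1000)

    def sq_residuals(p):
        return (y - (p[0] * x + p[1])) ** 2

    # Same model, same data: summed loss vs. mean loss under the
    # default L-BFGS-B tolerances.
    fit_sum = minimize(lambda p: np.sum(sq_residuals(p)), [0.0, 0.0], method="L-BFGS-B")
    fit_mean = minimize(lambda p: np.mean(sq_residuals(p)), [0.0, 0.0], method="L-BFGS-B")
    print(fit_sum.x, fit_sum.nit)
    print(fit_mean.x, fit_mean.nit)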


I do that all the time using WebPlotDigitizer [1]. Works great.

[1] https://apps.automeris.io/wpd/


Seconded. When I first saw this, I thought it looked unintuitive and difficult to use, but when I tried it, it was very easy and I had the extracted data in a few minutes.


They said in one of the replies that they did ask, several times.


> Why not just emailed the original authors for the raw data?

Industry research labs, especially Google DeepMind, are notoriously closed up about their “proprietary” data. I’ve hit this wall multiple times in my own work in AI.


https://twitter.com/borgeaud_s/status/1780988694163321250 says they're going to open the data from the paper. Not sure why they didn't do it before, but good news.


I particularly like this second quote; I appreciate them taking the time to explain "what is a graph" in a scientific paper!



