Hacker News

I am actually quite surprised at the figure of 73% of research-related code packages not being updated after publication; I was expecting it to be higher.



Same. But it could be an issue with the sample: 213 packages over a span of 14 years is not a lot.

Also, a question. If you publish a paper with a repo, what would be the best way to handle the version in the paper matching the repo in the future?

An opinion: there is such a thing as software being 'done' and provided 'as is'. Software solves a need. Once that need is met, that's it.

There’s also this part that strikes me,

>Given a tangled mess of source code, I think I could reproduce the results in the associated paper (assuming the author was shipping the code associated with the paper; I have encountered cases where this was not true).

And it strikes me as weird. The main obstacle to reproducing results is usually the data, and depending on the dataset it can be very hard to get. To reproduce the code itself, I just need the paper.

The code may have bugs, may stop working, may be in a different language/framework. The source of truth is the paper. This is why the paper was published.


>The source of truth is the paper. This is why the paper was published.

Speaking as someone who's not the best at math, I find it easier to understand what a paper is saying after I run the code and see all the intermediate results.

When the code doesn't work, it takes me 20 times longer to digest a paper. They could get by with uploading only the code -- to me it's the shortest and most effective way to express the ideas in the paper.


>Speaking as someone who's not the best at math, I find it easier to understand what a paper is saying after I run the code and see all the intermediate results.

As long as you understand the paper after, that's okay.

> When the code doesn't work, it takes me 20 times longer to digest a paper.

What if the data isn't available? That's another issue. I see where you're coming from, but that's why the paper itself is the source of truth. Not the implementation.

Another case: what if the implementation makes assumptions about the data, or about the OS it's being run on? [0][1]
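The linked stories [0][1] describe a concrete case of this: analysis scripts assumed the filesystem returned files in a fixed order, but `glob` makes no ordering guarantee, so results differed across operating systems. A minimal sketch of the pitfall and the fix (file names here are illustrative):

```python
import glob
import os
import tempfile

# Create a throwaway directory with a few data files so the snippet is self-contained.
d = tempfile.mkdtemp()
for name in ("b.dat", "a.dat", "c.dat"):
    open(os.path.join(d, name), "w").close()

# glob.glob() returns matches in arbitrary, platform-dependent order;
# any computation that depends on file order must sort explicitly.
files = glob.glob(os.path.join(d, "*.dat"))   # order not guaranteed
files = sorted(files)                          # deterministic on every OS

print([os.path.basename(f) for f in files])    # -> ['a.dat', 'b.dat', 'c.dat']
```

The paper can state "files are processed in lexicographic order"; the unsorted version silently encodes an assumption the paper never mentions.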

> They could get by with uploading only the code -- to me it's the shortest and most effective way to express the ideas in the paper.

In my opinion, no. The math and the algorithm behind it are more important than any implementation, and better for longevity.

[0] https://science.slashdot.org/story/19/10/12/1926252/python-c...

[1] https://arstechnica.com/information-technology/2019/10/chemi...


> Also, a question. If you publish a paper with a repo, what would be the best way to handle the version in the paper matching the repo in the future?

You can include the hash of the commit used for your paper.


Oh, that's good. Or even a tag.
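Both suggestions fit in a couple of git commands. A sketch (the throwaway repo setup is only there to make the snippet self-contained; the tag name is illustrative):

```shell
# Set up a throwaway repo so the commands below run anywhere.
repo=$(mktemp -d) && cd "$repo" && git init -q
git -c user.email=a@b -c user.name=demo commit -q --allow-empty -m "results for paper"

# Option 1: record the commit hash to quote in the paper.
git rev-parse --short HEAD

# Option 2: create an annotated tag readers can check out later.
git tag -a paper-v1 -m "Code as used in the published paper"
git describe --tags        # -> paper-v1
```

An annotated tag is easier for readers to find than a raw hash, but the hash printed in the paper survives even if tags are later moved or deleted, so doing both is cheap insurance.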


> The source of truth is the paper.

Yes, although truth of the flimsiest kind. A lowly but wise code monkey once said "Talk is cheap. Show me the code."


Here's some code. Data is proprietary. There's no paper explaining the data, prep, steps to gather, caveats, assumptions, etc.

What now?


No need for the reductionist strawman. Some experiments cannot be reproduced because the data is proprietary. Those that can should be.


It turns out that maintaining a package is a lot of work, and the career benefit of maintaining said package after publishing the accompanying paper is really low.

- writing general-purpose software that works on multiple platforms and is bug-free is really, really hard. So you're just going to be inundated with complaints that it doesn't work on X

- maintaining software is lots of work. Dependencies change, etc.

- supporting and helping an endless number of noobs use your software is a major pita. "I don't know why it wouldn't compile on your system. Leave me alone."

- "oh that was just my grad work"

- it's hard to get money to pay for developing it further. Great when that happens, though.



