A few CS conferences have artifact evaluations, but most research in CS doesn't go through any sort of code review at all. No field is implementing the thing you're expecting.
It will be an uphill battle, no doubt, but I think the only alternative is to share the whole dataset and have reviewers re-implement the analysis to confirm the results. That would also be a huge improvement, but it puts a much bigger burden on reviewers.
Well, theoretically it already is, given that you normally have multiple authors and reviewers. It's just done poorly, the same way a code review can be done poorly.
Analyzing data in academia seems like a disaster. It's almost guaranteed to produce errors like this.
You have:
- people with no coding experience and, in some cases (especially in social sciences), a strong aversion to math
- code that isn't unit tested, so "Did it run correctly?" often gets softened into "Does this look plausible to me?" (see the sketch after this list)
- a strong incentive to end up with certain results
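To make the unit-testing point concrete, here is a minimal sketch in Python. The function, the genotype encoding, and the test values are hypothetical, not taken from anyone's real analysis; the point is only that pinning a small routine to a hand-checked case gives "Did it run correctly?" a definite answer.

```python
# Minimal sketch: check a stats routine against a tiny case you can work out
# by hand, so the answer is pass/fail instead of "looks plausible to me".

import math


def allele_frequency(genotypes):
    """Frequency of the alternate allele in a list of diploid genotypes,
    coded as 0, 1, or 2 copies of the alternate allele per individual."""
    return sum(genotypes) / (2 * len(genotypes))


def test_allele_frequency_hand_checked():
    # 4 individuals carrying 0+1+2+1 = 4 alt alleles out of 8 total -> 0.5
    assert math.isclose(allele_frequency([0, 1, 2, 1]), 0.5)


def test_allele_frequency_all_reference():
    # No alternate alleles at all -> frequency must be exactly 0
    assert allele_frequency([0, 0, 0, 0]) == 0.0
```

Dropped into a file pytest will pick up (e.g. a hypothetical test_freq.py), these tests either pass or fail; there's no step where a plausible-looking number gets waved through.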
I dated a quantitative geneticist for a while, and her coding education was almost nonexistent. She wrote code in R and essentially just changed lines until the output "looked right". The math was insanely complicated, so there was no way to verify the output directly: the code needed to be an exact match for the algorithm she had written out in mathematical notation, and there was essentially no chance that it was.
It got worse. She'd implement the algorithm in R and then end up with batch runs that would take, in some cases, years to finish. Naturally, that led to even more dubious hacks.
(For anyone curious, she's had a fairly decorated academic career under an acclaimed advisor who reviewed all of this code to some extent, and she's worked with most of the top genetics programs in the US.)