Although Decety’s paper had reported that they had controlled for country, they had accidentally not controlled for each country, but just treated it as a single continuous variable so that, for example “Canada” (coded as 2) was twice the “United States” (coded as 1).
I mean I don't even understand how this seemed like a normal thing to do?
The variable for Country should have been treated as a categorical variable, but was instead processed as a numeric variable.
This mistake would be downright trivial to make in R. Just declare that Country is a Factor (which is the built-in type for categorical variables), and then throw the data into a library whose attitude towards errors is to coerce everything to numbers until the warnings go away.
Background: Factors in R are the idiomatic way to work with categorical data, and they work somewhat like C-style enums except the variants come from the data rather than a declaration. So if you take a column of strings in a data frame and cast it to a Factor, it will generate a mapping where each distinct value gets an integer code (by default the levels are sorted, so the first level is coded as 1, the second as 2, etc.). Then it replaces the strings with their integer equivalents and saves the mapping off to the side.
I forget the exact rules (if there are rules, R is a bit lawless), but it's not very hard to peek under the hood at the underlying numeric representation. Many built-in operations "know" that Factors are different (e.g. regressing against a Factor will create dummy variables for each variant), but it's up to each library author how 'clever' they want to be.
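For illustration, here's a minimal R sketch with made-up data (the outcome, other_controls, and df names in the commented-out calls are hypothetical) showing how the hidden codes leak out once something coerces the factor to numeric:

    # Made-up example: a country column stored as strings
    country <- factor(c("United States", "Canada", "Canada", "United States"))

    levels(country)       # "Canada" "United States"  (levels are sorted by default)
    as.integer(country)   # 2 1 1 2 -- the integer codes hiding behind the factor

    # Correct use: lm() expands a factor into dummy variables on its own
    # lm(outcome ~ country + other_controls, data = df)

    # The failure mode: coerce first and the codes become a "measurement",
    # so one country is literally twice another as far as the model knows
    # lm(outcome ~ as.numeric(country) + other_controls, data = df)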
This makes the most sense to me. I don't work in the dataframes world but without this explanation it seemed like someone would have to go out of their way to make that error.
(To be fair, even strong typing won't save you if you don't use it. But fuuuuuk, what an error. I had mentally noted that paper and would have quoted from it.)
Yup, I'm all for extremely strong typing. In 40 years of writing code I can't say I've ever had any real trouble with strong typing other than when dealing with libraries that reinvent wheels. Weak typing, though--nuke it from orbit.
At risk of embarrassing myself statistically, what exactly happens when you do this?
I.e., if you're controlling for country, that means you're bucketing by country, and looking at each subset, right? So if country is represented by a non-discrete value... what exactly happens?
So let's pretend there are three types of trees we want to study: Oak, Maple, and Aspen, which we code as 0, 1, and 2 for reasons (there are some good reasons to do this).
Statistically, if you treat them as a continuous variable, the estimates you get will act like there's an ordering there, and give you the effect of a one-unit increase in "tree". So it will tell you the effect of Oak vs. Maple and of Maple vs. Aspen, assuming those two effects are the same size, and that Oak vs. Aspen is twice that.
This is...nonsense, for most categorical variables. They don't have a nice, ordinal stepping like that.
In practice, if you have n countries, you'll add n-1 binary variables to your regression equation. The first country is the reference level (all zeros); for the second country, set the first new variable to one and the rest to zero; and so on.
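Here's a rough sketch of the difference with the made-up tree data; model.matrix() just shows the columns the regression actually sees (note that R's own internal codes are 1-based, so they come out as 1, 2, 3 rather than 0, 1, 2):

    # Hypothetical tree data with an explicit level order: Oak, Maple, Aspen
    tree <- factor(c("Oak", "Maple", "Aspen", "Oak"),
                   levels = c("Oak", "Maple", "Aspen"))

    # Categorical treatment: n-1 dummy columns, Oak is the reference level
    model.matrix(~ tree)
    #   (Intercept) treeMaple treeAspen
    # 1           1         0         0
    # 2           1         1         0
    # 3           1         0         1
    # 4           1         0         0

    # "Continuous" treatment: a single column of codes, so the model is forced to
    # assume the Oak -> Maple step and the Maple -> Aspen step are the same size
    model.matrix(~ as.numeric(tree))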
A few CS conferences have artifact evaluations, but almost all research in CS doesn't have any sort of code review at all. No field is implementing the kind of review you're expecting.
It will be an uphill battle, no doubt, but I think the only alternative would be to share the whole dataset and have reviewers re-implement the analysis to confirm the results. That would also be a huge improvement, but it seems like a much bigger burden on reviewers.
Well theoretically it already is, given you normally have multiple authors and reviewers. It's just done poorly, just as a code review can be done poorly.
Analyzing data in academia seems like a disaster. It's almost guaranteed to produce errors like this.
You have:
- people with no coding experience and, in some cases (especially in social sciences), a strong aversion to math
- code that isn't unit tested, so answering the question, "Did it run correctly?" is often softened into, "Does this look plausible to me?"
- a strong incentive to end up with certain results
I dated a quantitative geneticist for a while, and her coding education was almost zero. She was writing code in R and essentially just changing lines until the output "looked right". It was insanely complicated math, so there was no way to make sure the output was good. The code had to be an exact match for the algorithm she had written out in mathematical notation, and there was essentially no chance of that.
It got worse. She'd write the algorithm in R and then end up with batches that would take, in some cases, years to finish running. Obviously she ended up with even more dubious hacks.
(For anyone curious, she's had a fairly decorated academic career under an acclaimed advisor who reviewed all of this code to some extent, and she's worked with most of the top genetics programs in the US.)
Simple...data representation is not the same as data meaning.
I teach an introductory stats course and we hammer this in. Categorical data are often represented as numbers or other short indicators for storage purposes. Typically, for multiple choice questions, the encoding follows the order of the choice options.
I not infrequently see an "average gender" because male = 0 and female = 1 (or vice versa) and someone generates a summary table without thinking.
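A tiny made-up illustration of how that slips through:

    # Made-up 0/1 coding: say 0 = male, 1 = female
    gender <- c(0, 1, 1, 0, 1)
    mean(gender)   # 0.6 -- just the share coded as 1, not a meaningful "average gender"

    # Stored as a labelled factor instead, the same summary refuses to play along
    gender_f <- factor(gender, labels = c("male", "female"))
    mean(gender_f)   # NA, with a warning that the argument is not numeric or logical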
The bigger issue here seems to be the use of ordinals in the data collection process. For instance, a lot of my CSVs don't have them, and R and pandas are perfectly capable of enumerating. Why do you even need to put ordinals in the dataset? Does Excel want this sort of thing or something?
You don't need to, people just do... I've kinda always assumed it connects to olden days and efficient storage and memory use. "Male" is a four-character string, 1 is an integer.
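For what it's worth, here's a small sketch with a made-up CSV showing the model never needed hand-assigned codes in the first place:

    # Category stored as plain strings in the file
    df <- read.csv(text = "species,height\nOak,20\nMaple,15\nAspen,12\nOak,22")

    # lm() expands a character/factor column into dummy variables by itself;
    # no numeric codes ever have to appear in the dataset
    lm(height ~ species, data = df)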
Agreed, but maybe you have to assume that the scientist knows very little about coding for data science, which is effectively what we're talking about here.
I think a major contributing factor to problems like this is people going into the soft/social sciences being more likely to be math/stats AND programming averse. Meanwhile, all sciences continue their long term trend towards applied math via programming. This leads to people using the math/stats via code without understanding very well what it is they are doing, and, naturally, the end result is lots and lots of mistakes.
Or that the material will be taught effectively. Or that the students are contextually prepared to understand the material at that point. Or any of a number of other things that go wrong when people suggest that education is the solution/root to a very hard problem :)