Although Decety’s paper had reported that they had controlled for country, they had accidentally not controlled for each country, but just treated it as a single continuous variable so that, for example “Canada” (coded as 2) was twice the “United States” (coded as 1).
I mean I don't even understand how this seemed like a normal thing to do?
The variable for Country should have been treated as a categorical variable, but was instead processed as a numeric variable.
This mistake would be downright trivial to make in R. Just declare that Country is a Factor (which is the built-in type for categorical variables), and then throw the data into a library whose attitude towards errors is to coerce everything to numbers until the warnings go away.
Background: Factors in R are the idiomatic way to work with categorical data, and they work somewhat like C-style enums except the variants come from the data rather than a declaration. So if you take a column of strings in a data frame and cast it to a Factor, it will generate a mapping where each distinct value gets an integer code (by default the levels are sorted, so the first level is coded as 1, the second as 2, etc.). Then it replaces the strings with their integer equivalents and saves the mapping off to the side.
I forget the exact rules (if there are rules, R is a bit lawless), but it's not very hard to peek under the hood at the underlying numeric representation. Many built-in operations "know" that Factors are different (e.g. regressing against a Factor will create dummy variables for each variant), but it's up to each library author how 'clever' they want to be.
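For illustration, here's a minimal R sketch with made-up data (the outcome, other_controls, and df names in the commented-out calls are hypothetical) showing how the hidden codes leak out once something coerces the factor to numeric:

    # Made-up example: a country column stored as strings
    country <- factor(c("United States", "Canada", "Canada", "United States"))

    levels(country)       # "Canada" "United States"  (levels are sorted by default)
    as.integer(country)   # 2 1 1 2 -- the integer codes hiding behind the factor

    # Correct use: lm() expands a factor into dummy variables on its own
    # lm(outcome ~ country + other_controls, data = df)

    # The failure mode: coerce first and the codes become a "measurement",
    # so one country is literally twice another as far as the model knows
    # lm(outcome ~ as.numeric(country) + other_controls, data = df)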
This makes the most sense to me. I don't work in the dataframes world but without this explanation it seemed like someone would have to go out of their way to make that error.
(To be fair, even strong typing won't save you if you don't use it. But fuuuuuk, what an error. I had mentally noted that paper and would have quoted from it.)
Yup, I'm all for extremely strong typing. In 40 years of writing code I can't say I've ever had any real trouble with strong typing other than when dealing with libraries that reinvent wheels. Weak typing, though--nuke it from orbit.
At risk of embarrassing myself statistically, what exactly happens when you do this?
I.e., if you're controlling for country, that means you're bucketing by country, and looking at each subset, right? So if country is represented by a non-discrete value... what exactly happens?
So let's pretend there are three types of trees we want to study: Oak, Maple, and Aspen, which we code as 0, 1, and 2 for reasons (there are some good reasons to do this).
Statistically, if you treat them as a continuous variable, the estimates you get will act like there's an ordering there, and give you the effect of a one-unit increase in "tree". So it will tell you the effect of Oak vs. Maple and of Maple vs. Aspen, assuming those two effects are the same size, and that Oak vs. Aspen is twice that.
This is...nonsense, for most categorical variables. They don't have a nice, ordinal stepping like that.
In practice, if you have n countries, you'll add n-1 binary variables to your regression equation. The first country is the reference level (all zeros); for the second country, set the first new variable to one and the rest to zero; and so on.
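Here's a rough sketch of the difference with the made-up tree data; model.matrix() just shows the columns the regression actually sees (note that R's own internal codes are 1-based, so they come out as 1, 2, 3 rather than 0, 1, 2):

    # Hypothetical tree data with an explicit level order: Oak, Maple, Aspen
    tree <- factor(c("Oak", "Maple", "Aspen", "Oak"),
                   levels = c("Oak", "Maple", "Aspen"))

    # Categorical treatment: n-1 dummy columns, Oak is the reference level
    model.matrix(~ tree)
    #   (Intercept) treeMaple treeAspen
    # 1           1         0         0
    # 2           1         1         0
    # 3           1         0         1
    # 4           1         0         0

    # "Continuous" treatment: a single column of codes, so the model is forced to
    # assume the Oak -> Maple step and the Maple -> Aspen step are the same size
    model.matrix(~ as.numeric(tree))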
A few CS conferences have artifact evaluations, but almost all research in CS doesn't have any sort of code review at all. No field is implementing the kind of review you're expecting.
It will be an uphill battle, no doubt, but I think the only alternative would be to share the whole dataset and have reviewers re-implement the analysis to confirm the results. That would also be a huge improvement, but it seems like a much bigger burden on reviewers.
Well theoretically it already is, given you normally have multiple authors and reviewers. It's just done poorly, just as a code review can be done poorly.
Analyzing data in academia seems like a disaster. It's almost guaranteed to produce errors like this.
You have:
- people with no coding experience and, in some cases (especially in social sciences), a strong aversion to math
- code that isn't unit tested, so answering the question, "Did it run correctly?" is often softened into, "Does this look plausible to me?"
- a strong incentive to end up with certain results
I dated a quantitative geneticist for a while, and her coding education was almost zero. She was writing code in R and essentially just changing lines until the output "looked right". It was insanely complicated math, so there was no way to make sure the output was good. The code had to be an exact match for the algorithm she had written out in mathematical notation, and there was essentially no chance of that.
It got worse. She'd write the algorithm in R and then end up with batches that would take, in some cases, years to finish running. Obviously she ended up with even more dubious hacks.
(For anyone curious, she's had a fairly decorated academic career under an acclaimed advisor who reviewed all of this code to some extent, and she's worked with most of the top genetics programs in the US.)
Simple...data representation is not the same as data meaning.
I teach an introductory stats course and we hammer this in. Categorical data are often represented as numbers or other short indicators for storage purposes. Typically, for multiple choice questions, the encoding follows the order of the choice options.
I not infrequently see an "average gender" because male = 0 and female = 1 (or vice versa) and someone generates a summary table without thinking.
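A tiny made-up illustration of how that slips through:

    # Made-up 0/1 coding: say 0 = male, 1 = female
    gender <- c(0, 1, 1, 0, 1)
    mean(gender)   # 0.6 -- just the share coded as 1, not a meaningful "average gender"

    # Stored as a labelled factor instead, the same summary refuses to play along
    gender_f <- factor(gender, labels = c("male", "female"))
    mean(gender_f)   # NA, with a warning that the argument is not numeric or logical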
The bigger issue here seems to be the use of ordinals in the data collection process. For instance, a lot of my CSVs don't have them, and R and pandas are perfectly capable of enumerating. Why do you even need to put ordinals in the dataset? Does Excel want this sort of thing or something?
You don't need to, people just do... I've kinda always assumed it connects to olden days and efficient storage and memory use. "Male" is a four-character string, 1 is an integer.
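For what it's worth, here's a small sketch with a made-up CSV showing the model never needed hand-assigned codes in the first place:

    # Category stored as plain strings in the file
    df <- read.csv(text = "species,height\nOak,20\nMaple,15\nAspen,12\nOak,22")

    # lm() expands a character/factor column into dummy variables by itself;
    # no numeric codes ever have to appear in the dataset
    lm(height ~ species, data = df)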
Agreed, but maybe you have to assume that the scientist knows very little about coding for data science, which is effectively what we're talking about here.
I think a major contributing factor to problems like this is people going into the soft/social sciences being more likely to be math/stats AND programming averse. Meanwhile, all sciences continue their long term trend towards applied math via programming. This leads to people using the math/stats via code without understanding very well what it is they are doing, and, naturally, the end result is lots and lots of mistakes.
Or that the material will be taught effectively. Or that the students are contextually prepared to understand the material at that point. Or any of a number of other things that go wrong when people suggest that education is the solution/root to a very hard problem :)