This paper is awesome because it transparently folds the analytical approach into the experiment being conducted.
There are two kinds of scientific study: those where you can run another experiment (ideally one that approaches the same question orthogonally) along with rigorous controls, and those where you can't.
The first type is much less likely to have results vary based on analytical technique (effectively, the second experiment is a new analytical technique). Of course it does happen sometimes, and sometimes the studies are wrong; still, more controls and more experiments are always better.
However, studies where you're limited by ethical or practical constraints (i.e. most experiments involving humans) don't have that luxury, and are therefore far more contingent on decisions made at the analysis stage. What's awesome about this paper is that it gets around this limitation by trying different analytical methods, each effectively a new "experiment", and seeing whether they all reach the same conclusion.
Interestingly, very few features were shared among a large fraction of the teams (only 2 features were used by more than 50% of teams), which suggests that the result holds no matter the method. A similar approach of open data and distributed analysis would be a great way to address some of the recent trouble with reproducibility in the broader scientific literature.
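To make the "did they agree?" check concrete, here is a minimal sketch in Python of tallying direction-of-effect agreement across independent analyses. The odds ratios below are invented placeholders, not the teams' actual estimates:

    # Hypothetical odds ratios reported by independent analysis teams
    # (illustrative values only, not taken from the paper).
    import statistics

    team_odds_ratios = [1.2, 1.4, 0.95, 1.3, 1.6, 1.1, 1.05, 1.25]

    positive = sum(1 for r in team_odds_ratios if r > 1.0)
    print(f"{positive}/{len(team_odds_ratios)} analyses estimate an effect "
          f"in the same direction (odds ratio > 1)")
    print("median odds ratio:", statistics.median(team_odds_ratios))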
The blog post is a great overview as well as useful context, thanks for sharing it.
TL:DR summary: Scientific results are highly contingent on subjective decisions at the analysis stage. Different (well-founded) data analysis techniques on a fairly simple and well-defined problem can give radically different results.
It's very interesting research -- a great real-life example supporting the models Scott Page et al. use for the value of cognitive diversity. The thrust of the blog post is about where crowdsourcing analysis can be helpful (as well as reasonable caveats about where it might not apply), which is certainly an interesting question. Obviously, there are a lot of other implications to this as well.
Off-topic, but you seem well positioned to answer: Why do you say "TL:DR" here when summarizing a short blog post that you enjoyed? Clearly the meaning has diverged from the original abbreviated insult of "Too long; didn't read", but I don't understand what people mean when they use it today. Why did you phrase it this way? Are you a native English speaker? If not intended to be derogatory, does the dissonance bother you?
tl;dr started off as a way of saying "this is too long and therefore requires too much effort for me to read it." That gave rise to people accompanying long reads with a "tl;dr version," usually a one- or two-sentence summary. Now that the latter is common and understood, people just write tl;dr and follow it with the summary, with the understanding that those who are unwilling to read the full version will read that instead.
Like 'justinlardinois, I just use "TL:DR summary" to mean a "short summary." I didn't mean it as derogatory, and it doesn't feel dissonant to me -- although given how you view it, I can see why it does to you. And yes, I'm a native English speaker.
How do you see it as derogatory? I'm a native English speaker and have never thought of it that way. I didn't click the link, but did appreciate his short summary – and upvoted him for it. ;)
I also appreciated the summary; my question was just about the phrasing. In its literal usage, it's saying that the article had nothing useful to say: http://knowyourmeme.com/memes/tldr.
That clearly wasn't the case here, so I was wondering why the author chose to use it. I realize that its meaning has changed over time, but I was wondering what meaning he (and others) intend when it is used.
'tl;dr' is often a troll response to a long post that someone has obviously spent a lot of time on. Bonus troll-points if the long post was in response to another troll.
Example:
poster1: only retards use vi, notepad rules
poster2: huge list of reasons why vi is better than notepad
I was quite peeved when I first saw a "tl;dr" comment concerning one of my blog posts. My thought was, and still is somewhat, "if you didn't read it, how can you say it was too long for what it needed to cover? Why do you feel the need to tell others that you have the attention span of a fly?"
We already have terms like "summary", "digest", and even "précis"; why create a new term imbued with snark?
But not always. You have to consider how it was used to tell whether it was meant with ill will; the term tl;dr on its own doesn't necessarily tell you enough.
> I don't understand what people mean when they use it today
By now, it is often used in a friendly way. There's a subconscious acknowledgement that long posts take people's time. The speaker can even be talking about his own work and put "tldr version: " at the top for everyone.
Reminds me of the idea (Robin Hanson's, I think?) to add an extra layer of blindness to studies: during peer review, take the original data, and write a separate paper with the opposite conclusion. Randomize which reviewers get which version. Your original paper is then only accepted if they reject the inverted version.
"The primary research question tested in the crowdsourced project was whether soccer players with dark skin tone are more likely than light skin toned players to receive red cards from referees."
This seems like a topic where one indeed typically winds up with a multitude of competing conclusions.
Among other factors, we have:
* Pre-existing beliefs on the part of researchers.
* Lack of sufficient data.
* Difficulty in defining hypotheses (is there a skin-tone cut-off, or should one look for degrees of skin tone and degrees of prejudice? Should one look at all referees or only some referees?).
Given this, I'd say it's a mistake to expect numeric data alone, at the level of complex social interactions, to be anything like clear or unambiguous. If studies on topics such as this are to have value, they have to involve careful arguments about data collection, data normalization/massaging, and only then data analysis and conclusions.
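To see how much one of those definitional choices can matter, here is a minimal sketch (Python with numpy/pandas/statsmodels, on synthetic data; the column names and the relationship baked into the data are invented for illustration) that fits the same question two ways, skin tone as a hard cut-off versus as a continuous rating:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic data: skin-tone ratings on a 0-1 scale and a rare red-card outcome.
    rng = np.random.default_rng(0)
    n = 5000
    tone = rng.uniform(0, 1, n)                      # continuous rating
    red_card = rng.binomial(1, 0.02 + 0.01 * tone)   # assumed weak relationship
    df = pd.DataFrame({"tone": tone, "red_card": red_card})
    df["dark"] = (df["tone"] > 0.5).astype(int)      # arbitrary cut-off

    # Same question, two operationalizations.
    m_cutoff = smf.logit("red_card ~ dark", data=df).fit(disp=0)
    m_degree = smf.logit("red_card ~ tone", data=df).fit(disp=0)

    print("cut-off coding, odds ratio:", np.exp(m_cutoff.params["dark"]))
    print("continuous coding, odds ratio per unit of tone:",
          np.exp(m_degree.params["tone"]))

The point is not the numbers, but that two defensible codings of the same raw variable already give you two different estimands.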
But a lot of the context comes from the prevalence of shoddy studies that assume you can throw data in a bucket and draw conclusions, further facilitated by having those conclusions echoed by the mainstream media or by the media of one's chosen ideology.
I understand how tempting it is in our age of big data and all that stuff to perceive this as some curious new phenomenon, but it really is not. This is precisely why we came up with criteria for "science" quite a while ago. And in fact, this whole experiment is pretty meaningless.
So, for starters: 29 students get the same question on a math/physics/chemistry exam and give 29 different answers. Breaking news? Obviously not. Either the question was outrageously badly worded (not such a rare thing, sadly), or the students didn't do very well and we've got at most 1 correct answer.
Basically, we've got the very same situation here, except our "students" were doing statistics, which is not really math and not really natural science. Which is why it is somehow "acceptable" to end up with results like these.
If we are doing math, whatever result we get must be backed up with a formally correct proof. That doesn't mean, of course, that 2 good students cannot get contradictory results, but then at least one of their proofs is faulty, which can be shown. And this is how we decide what's "correct".
If we are doing science (e.g. physics), our question must be formulated in such a way that it is verifiable by setting up an experiment. If the experiment doesn't give us what we expected, our theory is wrong. If it does, it might be correct.
Here, our original question was whether "players with dark skin tone are more likely than light skin toned players to receive red cards from referees", which is shit, not a scientific hypothesis. We can define "more likely" however we want. What we really want to know is whether, during the next N matches happening in what we can consider "the same environment", black athletes are going to get more red cards than white athletes. Which is quite obviously a bad idea for a study, because the number of trials we need is too big for such a loosely defined setting: not even 1 game will actually happen in an isolated environment, players will be different, referees will be different, and each game will change the "state" of our world. Somebody might even say that the whole culture has changed since we started the experiment, so obviously whatever the first dataset was, it's no longer relevant.
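For what it's worth, the "more likely" phrasing can at least be operationalized as a per-match rate comparison. Here is a minimal sketch (Python/numpy, with invented counts; the grouping and the numbers are purely illustrative) of testing whether one group's red-card rate exceeds the other's under a shared-rate null:

    import numpy as np

    # Invented counts: red cards and player-matches for two groups.
    cards_a, matches_a = 120, 40000
    cards_b, matches_b = 180, 40000

    rate_a, rate_b = cards_a / matches_a, cards_b / matches_b
    observed_diff = rate_b - rate_a

    # Parametric simulation under the null of a single shared per-match rate.
    rng = np.random.default_rng(0)
    pooled = (cards_a + cards_b) / (matches_a + matches_b)
    sims = 100_000
    sim_a = rng.poisson(pooled * matches_a, sims) / matches_a
    sim_b = rng.poisson(pooled * matches_b, sims) / matches_b
    p_value = np.mean((sim_b - sim_a) >= observed_diff)

    print(f"observed rates: {rate_a:.4f} vs {rate_b:.4f}, one-sided p ~= {p_value:.4f}")

None of this answers the "same environment" objection, of course; it only shows what a testable version of the claim could look like.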
Statistics is only a tool, not a "science", as some people might (incorrectly) assume. It is not the fault of the methods we apply that we get something like this, but rather of the discipline we apply them to. And "results" like these are why physics is accepted as a science, and sociology never really was.
In physics they can do experiments to get highly confident results. In medicine, economics, and any other science dealing with people, data is much harder to collect and there are harder ethical constraints involved. We could learn so much if we simply performed controlled experiments on the global economy, politics be damned! Or if we could just make the professionals play more soccer matches in a controlled setting (but just like a pro match in every other regard!). Or if we were more aggressive with human trials of drugs. But we can't. Scientists in some fields are stuck with data sets where they'll never get 5 sigma confidence. Does that mean they should stop using statistics? Hell no. There are still very useful things to be learned. It's just much harder to get right.
Your rant makes no sense. I flip a coin 100 times and it comes up tails 99 times. You are basically saying that asking "Is the coin more likely to come up tails" isn't a real scientific question. That's just silly.
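To make the coin question concrete: assuming scipy is available, it is a one-line hypothesis test. This is just a sketch of a standard binomial test, nothing specific to the paper:

    from scipy.stats import binomtest

    # 99 tails in 100 flips: how surprising under a fair coin?
    result = binomtest(k=99, n=100, p=0.5, alternative="greater")
    print(f"one-sided p-value: {result.pvalue:.3g}")  # vanishingly small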
Physics uses statistics all the time, e.g. detecting the Higgs boson at CERN. Do you have a formal proof that each time they fired the accelerator it was going to be i.i.d.?
I don't understand what it is that you want to say, even if we were to accept all of your premises. That other than math and physics we actually know nothing? That knowledge (even if partial) is meaningless unless it is as rigorous as the one we can attain in physics, the simplest of sciences? Obviously we can be less confident of results in the complex/intractable/inexact sciences than we can in the simple/tractable/exact ones. You want to call only the latter group "science" and the former something else? Fine. Does that mean we should completely ignore all results in disciplines which aren't science?
It makes it really hard to work out what's happening, especially if you want the result to match existing standards.
For a real-world example of this, see the deworming of schoolchildren.
People looking at the educational effects of deworming children reach different conclusions because some of them use a medical model and some of them use an economics model.
Hey, the null hypothesis is powerful and valuable. I, for one, am happy that all three types are well-represented; all three are healthy in moderation.
I also think that the quote fits in quite nicely here, it's not a wholesale rejection of statistics.
The quote itself isn't, but just posting a link to the Wikipedia article without explaining how they think it applies here is pretty much a wholesale rejection of statistics.