
My background is physics, and I used to think that, but now, working as a data scientist who has to deal with human data, I've come to think the opposite. The noise-to-signal ratio in human data is humongous, nothing is static, and there are a zillion moving parts. That means you need immense volumes of data to run well-powered experiments that are designed to get at real causal factors. And there are ethics involved, so you have to deal with potential early stopping in a statistically sound way. In some fields (like econ), you don't even really get to do experiments at all on most topics of interest, and have to rely on the observations available.

But it turns out that it's worth applying the scientific method to these fields, so what you're left with are tough choices. To deal with these problems the way we would in a physics experiment would be prohibitively expensive, in the literal sense of prohibitive. You have to come to terms with the fact that you can only afford to get enough data that there's a non-negligible possibility of being misled. It's worth doing science here, and we can, but it's just plain hard. I didn't appreciate that before I started having to deal with it. Don't blame the subject for some practitioners' failings.
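To make the sample-size point concrete, here's a rough back-of-the-envelope power calculation; the effect sizes are made-up illustrative numbers, not anything from a real study:

    # Approximate per-group sample size for a two-sided, two-sample comparison,
    # using the standard normal approximation. Illustrative numbers only.
    from scipy.stats import norm

    def n_per_group(effect_size, alpha=0.05, power=0.8):
        z_alpha = norm.ppf(1 - alpha / 2)
        z_power = norm.ppf(power)
        return 2 * ((z_alpha + z_power) / effect_size) ** 2

    print(round(n_per_group(1.0)))  # big, clean effect: roughly 16 per group
    print(round(n_per_group(0.1)))  # small, noisy human-scale effect: roughly 1570 per group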




I think part of the problem is with public interpretation of the results, often as (mis)assisted by the popsci press. Often a sober data scientist will look at a result and conclude "we have gained slight, if any, knowledge," while the press will be filled with headlines recommending completely reevaluating your life choices and/or the structure of society based on the result.


Thank you for pointing this out! I come from biology, and yes, you almost never get your model to fit the data as well as a physicist would with his - because there are just way too many factors involved! (In fact, I would probably treat too high a correlation in a publication as suspicious.) When you're working with living organisms, their sheer complexity and diversity are going to limit the statistical power of your results (which might be why the urge to p-hack can be so strong). This doesn't mean we're doing bad science; on the contrary, I would argue that we are simply doing harder science...


I think you're deluding yourself. If you can't get good data, accept that and do something else. Pretending that the best you can do must inherently be acceptable is an insidious idea that causes a lot of wasted effort.


I think you misread me completely. I am not saying "it's hard, but this is the best we can do, so we have to do it." I'm saying it's hard, but it turns out to be worth doing despite that. It just turns out that the scientific method and sound statistical reasoning from data are really effective tools for learning about things, even when you cannot make bulletproof true/false statements. Think of it in a Bayesian way: a small-to-medium experiment still moves your belief distribution, even though you remain relatively uncertain.
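A minimal sketch of that Bayesian point, with a made-up prior and made-up data, just to show an update that shifts beliefs without removing the uncertainty:

    # Beta-Binomial update: a modest experiment shifts the belief distribution
    # but leaves it fairly wide. Prior and data are invented for illustration.
    from scipy.stats import beta

    prior = beta(2, 2)                # weak prior on some success rate
    successes, trials = 14, 40        # a small-to-medium experiment
    posterior = beta(2 + successes, 2 + trials - successes)

    print(prior.mean(), prior.interval(0.9))          # before: 0.5, wide interval
    print(posterior.mean(), posterior.interval(0.9))  # after: shifted, still wide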


A huge range of things are worth doing without being science. Further, just looking at vast amounts of data is not the scientific method.

It's like using vast amounts of information to create a self-driving car vs. actually letting the car run on real roads. The first can only tell you whether it approximates reasonable driving; the second can tell you whether it avoids getting into dangerous situations. You can collect a lot of information on the US economy, but in the real world the Fed is actively trying to manage things, and you can't remove that factor from the data.


Where is anyone proposing just looking at vast amounts of data? With certain kinds of data you can still learn causal effects observationally, as in econometrics; that's the closest thing I can find in this discussion. I mean, you're totally right that naive data analysis is bad and that more data doesn't fix it, but nobody is advocating for doing that.
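For what it's worth, here is a minimal sketch of what "learning causal effects observationally" can look like, using an instrumental variable on simulated data; the data-generating process and all the numbers are invented for illustration:

    # Instrumental-variable (Wald) estimate on simulated observational data.
    # x is confounded by u, so naive OLS is biased; the instrument z recovers
    # the true effect of 2.0. Everything here is made up for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    z = rng.normal(size=n)                 # instrument: moves x, not y directly
    u = rng.normal(size=n)                 # unobserved confounder
    x = 0.8 * z + u + rng.normal(size=n)   # "treatment"
    y = 2.0 * x + u + rng.normal(size=n)   # outcome; true causal effect = 2.0

    ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # biased upward by u
    iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]   # close to 2.0

    print(f"OLS slope: {ols:.2f}, IV estimate: {iv:.2f}")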


You only have two choices: look at data you don't control, or data you do. The entire point of experiments is to narrow the range of uncontrolled data as much as possible. Looking at raw data does not help. Looking at huge datasets of minimally controlled experimental data does not help either.

Physicists, for example, can't change the age of the universe they are operating in. It's a rather large unknown, but not exactly an unknown unknown.

At the other end, people trust surveys of eating habits. I don't care if you send out a billion of those things; it's still bad data, in systematic and changing ways.

In between, most animal studies in mice are looking at disease analog X in a population of animals that are fat, minimally stimulated, and so on.


Define "good data". It really does depend on your field. One of our biology professors once told us that any r^2 value above 0.5 could be considered "not bad" on a linear regression in biology - I daresay physicists are used to rather nicer fits than these. But where you draw the line between "good" and "bad" data really is relative to your field of study, and science needs to be evaluated in that context.


While fields may consider X good enough, that does not mean it is good enough. One measure might be what percentage of published papers are junk, and by that measure many fields fail any reasonable standard.



