
My background is physics, and I used to think that, but now, working as a data scientist who has to deal with human data, I've come to think the opposite. The noise-to-signal ratio in human data is humongous, nothing is static, and there are a zillion moving parts. That means you need immense volumes of data to run well-powered experiments that are designed to get at real causal factors. And there are ethics involved, so you have to deal with potential early stopping in a statistically sound way. In some fields (like econ), you don't even really get to do experiments at all on most topics of interest, and have to rely on the observations available.

But it turns out that it's worth applying the scientific method to these fields, so what you're left with are tough choices. To deal with these problems the way we would in a physics experiment would be prohibitively expensive, in the literal sense of prohibitive. You have to come to terms with the fact that you can only afford to get enough data that there's a non-negligible possibility of being misled. It's worth doing science here, and we can, but it's just plain hard. I didn't appreciate that before I started having to deal with it. Don't blame the subject for some practitioners' failings.
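To make the sample-size point concrete, here's a rough back-of-the-envelope power calculation; the effect sizes are made-up illustrative numbers, not anything from a real study:

    # Approximate per-group sample size for a two-sided, two-sample comparison,
    # using the standard normal approximation. Illustrative numbers only.
    from scipy.stats import norm

    def n_per_group(effect_size, alpha=0.05, power=0.8):
        z_alpha = norm.ppf(1 - alpha / 2)
        z_power = norm.ppf(power)
        return 2 * ((z_alpha + z_power) / effect_size) ** 2

    print(round(n_per_group(1.0)))  # big, clean effect: roughly 16 per group
    print(round(n_per_group(0.1)))  # small, noisy human-scale effect: roughly 1570 per group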




I think part of the problem is with public interpretation of the results, often as (mis)assisted by the popsci press. Often a sober data scientist will look at a result and conclude "we have gained slight, if any, knowledge," while the press will be filled with headlines recommending completely reevaluating your life choices and/or the structure of society based on the result.


Thank you for pointing this out! I come from biology, and yes, you almost never get your model to fit the data as well as a physicist would with his - because there are just way too many factors involved! (In fact, I would probably treat too high a correlation in a publication as suspicious.) When you're working with living organisms, their sheer complexity and diversity are going to limit the statistical power of your results (which might be why the urge to p-hack can be so strong). This doesn't mean we're doing bad science; on the contrary, I would argue that we are simply doing harder science...


I think you're deluding yourself. If you can't get good data, accept that and do something else. Pretending that the best you can do must inherently be acceptable is an insidious idea that causes a lot of wasted effort.


I think you misread me completely. I am not saying "it's hard, but this is the best we can do, so we have to do it." I'm saying it's hard, but it turns out to be worth doing despite that. It just turns out that the scientific method and sound statistical reasoning from data are really effective tools for learning about things, even when you cannot make bulletproof true/false statements. Think of it in a Bayesian way: a small-to-medium experiment still moves your belief distribution, even though you remain relatively uncertain.
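A minimal sketch of that Bayesian point, with a made-up prior and made-up data, just to show an update that shifts beliefs without removing the uncertainty:

    # Beta-Binomial update: a modest experiment shifts the belief distribution
    # but leaves it fairly wide. Prior and data are invented for illustration.
    from scipy.stats import beta

    prior = beta(2, 2)                # weak prior on some success rate
    successes, trials = 14, 40        # a small-to-medium experiment
    posterior = beta(2 + successes, 2 + trials - successes)

    print(prior.mean(), prior.interval(0.9))          # before: 0.5, wide interval
    print(posterior.mean(), posterior.interval(0.9))  # after: shifted, still wide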


A huge range of things are worth doing without being science. Further, just looking at vast amounts of data is not the scientific method.

It's like using vast amounts of information to create a self-driving car vs. actually letting the car run on real roads. The first can only tell you whether it approximates reasonable driving; the second can tell you whether it avoids getting into dangerous situations. You can collect a lot of information on the US economy, but in the real world the Fed is actively trying to manage things, and you can't remove that factor from the data.


Where is anyone proposing just looking at vast amounts of data? With certain kinds of data you can still learn causal effects observationally, as in econometrics; that's the closest thing I can find in this discussion. I mean, you're totally right that naive data analysis is bad and that more data doesn't fix it, but nobody is advocating for doing that.
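For what it's worth, here is a minimal sketch of what "learning causal effects observationally" can look like, using an instrumental variable on simulated data; the data-generating process and all the numbers are invented for illustration:

    # Instrumental-variable (Wald) estimate on simulated observational data.
    # x is confounded by u, so naive OLS is biased; the instrument z recovers
    # the true effect of 2.0. Everything here is made up for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    z = rng.normal(size=n)                 # instrument: moves x, not y directly
    u = rng.normal(size=n)                 # unobserved confounder
    x = 0.8 * z + u + rng.normal(size=n)   # "treatment"
    y = 2.0 * x + u + rng.normal(size=n)   # outcome; true causal effect = 2.0

    ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # biased upward by u
    iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]   # close to 2.0

    print(f"OLS slope: {ols:.2f}, IV estimate: {iv:.2f}")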


You only have two choices: look at data you don't control, or data you do. The entire point of experiments is to narrow the range of uncontrolled data as much as possible. Looking at raw data does not help. Looking at huge datasets of minimally controlled experimental data does not help either.

Physicists, for example, can't change the age of the universe they are operating in. It's a rather large unknown, but not exactly an unknown unknown.

At the other end, people trust surveys of eating habits. I don't care if you send out a billion of those things; it's still bad data, in systematic and changing ways.

In between, most animal studies in mice are looking at disease analog X in a population of animals that are fat, minimally stimulated, and so on.


Define "good data". It really does depend on your field. One of our biology professors once told us that any r^2 value above 0.5 could be considered "not bad" on a linear regression in biology - I daresay physicists are used to rather nicer fits than these. But where you draw the line between "good" and "bad" data really is relative to your field of study, and science needs to be evaluated in that context.


While fields may consider X good enough, that does not mean it is good enough. One measure might be what percentage of published papers are junk, and by that measure many fields fail any reasonable standard.



