The real issue is that experiments that give "expected" results are not subject to this kind of scrutiny, so experiments are much less trustworthy than one would assume. It sometimes takes decades for errors in experiments to come out - e.g., from "Surely You're Joking, Mr. Feynman!":
Millikan measured the charge on an electron by an experiment with falling oil drops, and got an answer which we now know not to be quite right. It's a little bit off because he had the incorrect value for the viscosity of air. It's interesting to look at the history of measurements of the charge of an electron, after Millikan. If you plot them as a function of time, you find that one is a little bit bigger than Millikan's, and the next one's a little bit bigger than that, and the next one's a little bit bigger than that, until finally they settle down to a number which is higher.
Why didn't they discover the new number was higher right away? It's a thing that scientists are ashamed of - this history - because it's apparent that people did things like this: When they got a number that was too high above Millikan's, they thought something must be wrong - and they would look for and find a reason why something might be wrong. When they got a number close to Millikan's value they didn't look so hard. And so they eliminated the numbers that were too far off, and did other things like that...
Actually, most major experiments in particle physics these days (including the OPERA experiment) avoid this sort of confirmation bias by being run "blind." The scientists write all of their data reduction pipelines before taking any actual data and test their pipelines on simulated data. When they are confident that their pipeline is running as expected they run the experiment, put the data through their pipeline and publish the result, no matter how unexpected it is.
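Concretely, the workflow looks something like the sketch below. Everything in it is hypothetical - toy data and made-up function names, not OPERA's actual code - but it shows the essential ordering: freeze and validate the pipeline before looking at any real data, then run it unchanged and report whatever comes out.

    # Minimal sketch of a "blind" analysis workflow. All names and numbers
    # are made up for illustration; the point is the ordering of the steps.

    import hashlib


    def reduce_data(events):
        """Toy data-reduction step: average the 'delay_ns' field over all events."""
        return sum(e["delay_ns"] for e in events) / len(events)


    def validate_on_simulation(pipeline, simulated_events, injected_truth, tolerance):
        """Check that the frozen pipeline recovers a known, injected answer."""
        return abs(pipeline(simulated_events) - injected_truth) <= tolerance


    if __name__ == "__main__":
        # 1. Freeze the pipeline: record a hash of the analysis source so it
        #    can't be quietly tweaked once the real data are in hand.
        with open(__file__, "rb") as f:
            print("pipeline hash:", hashlib.sha256(f.read()).hexdigest())

        # 2. Validate on simulated data with a known injected truth (here 0 ns).
        simulated = [{"delay_ns": 0.1 * i} for i in range(-5, 6)]
        assert validate_on_simulation(reduce_data, simulated,
                                      injected_truth=0.0, tolerance=1e-6)

        # 3. Only now "open the box": run the untouched pipeline on the real
        #    data (a stand-in list here) and publish the result, expected or not.
        real_events = [{"delay_ns": 1.0 + 0.5 * i} for i in range(-3, 4)]  # stand-in
        print("measured mean delay (ns):", reduce_data(real_events))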
As the OPERA result showed, this approach has the problem that if you don't understand everything in your experiment perfectly (which is difficult in a very large, complicated experiment), you run the risk of embarrassing yourself by making some obvious-in-retrospect mistake and publishing an obviously absurd result. But in the long run that's not so bad a price to pay to avoid the sort of confirmation bias Feynman was talking about.
Physics is far ahead of other disciplines in this regard. Choosing your statistical test after you gather the data, selectively removing "outliers" after you gather the data, non-blind interpretation of pictures by humans who have a stake in the outcome, and only publishing statistically significant results are all par for the course in, e.g., neuroscience.
This paper [1] points out that commonly-used measures of statistical significance are downright meaningless when additional degrees of freedom are hidden in the way you describe.
It's even worse than that paper describes, and it's something every statistics 101 class worth its salt points out: if you are allowed to choose a statistical test after you've gathered your data, you can "prove" any conclusion you want with arbitrarily high confidence. Note that the paper does not list choosing the test before gathering the data as a requirement. The only way to do meaningful statistics is the way splat describes: specify exactly how you're going to analyze the data before you gather it, send the paper to a journal that decides whether to publish before the data exist, and then complete the paper by actually running the experiment and adding the data.
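To make the first point concrete, here is a toy simulation (plain Python, illustrative only). Both groups are pure noise, yet an analyst who is free to pick whichever test on the "menu" looks best after seeing the data declares significance noticeably more often than the nominal 5%, and the inflation keeps growing as the menu of tests, outlier-removal rules, and subgroups gets longer.

    import random

    random.seed(0)


    def perm_pvalue(a, b, stat, n_perm=200):
        """Two-sided permutation p-value for an arbitrary test statistic."""
        observed = abs(stat(a, b))
        pooled = a + b
        hits = 0
        for _ in range(n_perm):
            random.shuffle(pooled)
            pa, pb = pooled[:len(a)], pooled[len(a):]
            if abs(stat(pa, pb)) >= observed:
                hits += 1
        return (hits + 1) / (n_perm + 1)


    def mean(xs):
        return sum(xs) / len(xs)


    def median(xs):
        return sorted(xs)[len(xs) // 2]  # upper median; fine as a test statistic


    # The "menu" of tests an analyst might pick from after seeing the data,
    # including one that quietly drops the extreme points from each group.
    TESTS = {
        "means": lambda a, b: mean(a) - mean(b),
        "medians": lambda a, b: median(a) - median(b),
        "trimmed means": lambda a, b: mean(sorted(a)[1:-1]) - mean(sorted(b)[1:-1]),
    }

    n_experiments, alpha = 200, 0.05
    prereg_hits = 0   # one test fixed in advance
    cherry_hits = 0   # whichever test gives the smallest p, chosen afterwards

    for _ in range(n_experiments):
        # Both groups are pure noise: there is no real effect to find.
        a = [random.gauss(0, 1) for _ in range(15)]
        b = [random.gauss(0, 1) for _ in range(15)]
        pvals = {name: perm_pvalue(a, b, stat) for name, stat in TESTS.items()}
        prereg_hits += pvals["means"] < alpha
        cherry_hits += min(pvals.values()) < alpha

    print(f"pre-registered test:        {prereg_hits / n_experiments:.0%} 'significant'")
    print(f"test chosen after the data: {cherry_hits / n_experiments:.0%} 'significant'")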
>The scientists write all of their data reduction pipelines before taking any actual data and test their pipelines on simulated data. When they are confident that their pipeline is running as expected they run the experiment, put the data through their pipeline and publish the result, no matter how unexpected it is.
How does that help when the prediction matches the measurement but the experiment is flawed?
What you describe is more or less what Lakatos had in mind when he coined the term "research programme." [1]
In a nutshell, we know that science more or less progresses from one theory to another after the existing theory has been falsified (Popper), but this transition does not happen overnight, even after a falsifying result has been put forth. That is because science is really structured around a "hard core" hypothesis (say, in physics, that nothing moves faster than c) and various "auxiliary hypotheses" that bear the brunt of criticism before the core belief is attacked (for example, those that posit the integrity of the measurement instruments). And that is all that is going on here. There are many, many auxiliary theories that are going to be dismantled before physics transitions to something post-relativity. It's just the way the assumptions are structured.
While this may seem trivial, we see this playing out in the battle of classical against behavioral economics. Are the results of experimental psychology admissible in the criticism of certain economic models? Models don't claim to be perfect, after all...
Surely this is just a reflection of the fact that science is performed by humans with limited resources. I mean, there is something to be said for getting approximately the right result many times over.
And don't forget there is a check and balance in the form of meta-studies and statistics, which are sometimes uncannily good at exposing bias and flaws.
Finally - this highlights the dual nature of 'experimental' vs. 'theoretical' physics, and how well the two work in tandem.
Agreed, it's the nature of science. That's why theory is so important, and why physicists give so much weight to non-experimental features like "beauty" when judging theories. Experiments cannot be trusted unless explained by theory!
To me the real lesson of the Millikan experiment is that science moves slower than you think. It does take decades, sometimes, for people to notice the tiny but consistent discrepancy between their expectation and the data, but that's not necessarily a sign of failure; that's just the time scale of success.
When Millikan's experiment was new, people hadn't done it before, there was slop to work out, and the theory was miles behind. These things take time to converge on a consensus. Sometimes a long time. Also, money.
The lesson isn't that Millikan screwed up; everyone can screw up at some point. The lesson is that folks had a chance to show Millikan was wrong simply by rerunning his experiment and coming up with the correct result. But they didn't. They didn't have the integrity to challenge Millikan's result. They turned every dial they could to fudge the numbers enough to end up with a result near Millikan's.
How would such dynamics pan out for experiments that aren't amenable to the sort of polite, gradual adjustment that Millikan's was?
The real lesson is that this sort of thing (fudging numbers, lack of scientific integrity) is quite prevalent but most of the time goes unnoticed, because there isn't something as dramatic as the Millikan case to show it plainly.
A similar case is Hubble's measurement of the expanding Universe. The Universe is expanding, and Hubble's linear velocity = Hubble_constant*distance law is an excellent first approximation, but:
1. Hubble was actually measuring the local motion of nearby galaxies, the so-called Local Group. These galaxies are too close for Hubble expansion to matter; they're mostly just gravitating (see the rough numbers after this list).
2. The actual value for the Hubble constant he measured was way off, by something like 10x.
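As a rough, round-number illustration of point 1 (nothing here comes from Hubble's actual data; his originally published constant was roughly 500 km/s/Mpc against today's ~70):

    # Round-number sketch of why nearby galaxies mostly show local gravity
    # rather than cosmic expansion. Values are order-of-magnitude only.

    H0_MODERN = 70.0        # km/s per Mpc, roughly the modern consensus value
    H0_HUBBLE_1929 = 500.0  # km/s per Mpc, roughly Hubble's originally published value

    d = 1.0                 # Mpc, a typical Local Group distance scale
    v_peculiar = 300.0      # km/s, order of magnitude for galaxy motions within groups

    v_expansion = H0_MODERN * d
    print(f"Hubble-flow velocity at {d:.0f} Mpc: ~{v_expansion:.0f} km/s")
    print(f"Typical peculiar velocity:       ~{v_peculiar:.0f} km/s")
    # The expansion signal (~70 km/s) is swamped by gravitational motions
    # (~hundreds of km/s) at Local Group distances, so a nearby sample mostly
    # measures local dynamics, not the expansion.

    print(f"Hubble constant, 1929 vs today:  ~{H0_HUBBLE_1929:.0f} vs ~{H0_MODERN:.0f} km/s/Mpc")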
1. Hubble's draftsman made an error while preparing his famous diagram. From Figure 1 of his paper (http://adsabs.harvard.edu/abs/1931ApJ....74...43H), you'll notice that the vertical axis is in "km" rather than "km/sec."
2. Measurements of the Hubble constant over time: http://www.pnas.org/content/101/1/8/F2.large.jpg For some reason it's not plotted logarithmically, so it's hard to see the convergence toward ~70 km/s/Mpc in the last twenty years.
Interestingly, there was a long fight in the field of cosmology from ~1960 until ~1990 over whether the Hubble constant was 50 or 100. It wasn't until the past twenty years or so that the community began to agree on the now-accepted value of ~70 km/s/Mpc.