Quote from Freeman Dyson:

In desperation I asked Fermi whether he was not impressed by the agreement between our calculated numbers and his measured numbers. He replied, “How many arbitrary parameters did you use for your calculations?” I thought for a moment about our cut-off procedures and said, “Four.” He said, “I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”
That anecdote was worth it just for 'Johnny von Neumann' ;-)
I've generally heard that most scientists of the day considered von Neumann the smartest man in science (he'd have to be - he single-handedly revolutionized several branches of CS, Physics, Math, Economics, etc.). Somehow hearing him called 'Johnny' makes him much less intimidating though ...
Funny, but not as good as the econ paper I once came across extolling the virtues of a method which allowed one to "sidestep the issues associated with negative degrees of freedom" (or something to that effect). In other words, fitting a line to one data point :-)
Why yes indeed, although it's more an issue of the ratio of parameters to data points. Consider, for example: if you had 20 data points and used 20 parameters to fit them, you would get a model that fits all 20 data points perfectly! But the model would probably be useless for any new data points, because it has absorbed all the irrelevant aspects of the original 20 (their "randomness").
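(A quick toy illustration of that effect, using a made-up degree-19 polynomial fit in numpy; the data below are synthetic and have nothing to do with the paper under discussion:)

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 noisy observations of a simple linear trend
x_train = np.linspace(0.0, 1.0, 20)
y_train = 2.0 * x_train + rng.normal(scale=0.3, size=20)

# 20 parameters for 20 points: a degree-19 polynomial interpolates the noise exactly
overfit = np.polynomial.Polynomial.fit(x_train, y_train, deg=19)
# 2 parameters: a straight line captures the trend and ignores the noise
simple = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)

# Fresh data from the same underlying process
x_new = rng.uniform(0.0, 1.0, size=200)
y_new = 2.0 * x_new + rng.normal(scale=0.3, size=200)

print("20-param fit, error on the original 20 points:", np.mean((overfit(x_train) - y_train) ** 2))  # essentially 0
print("20-param fit, error on new points:", np.mean((overfit(x_new) - y_new) ** 2))                  # typically enormous
print(" 2-param fit, error on new points:", np.mean((simple(x_new) - y_new) ** 2))                   # close to the noise level
```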
It is probably a case of overfitting, but it could also be the case that they are fitting an existing model that really has 17 parameters. Without more context, it's a little hard to judge. If I remember, I'll look at the full paper tomorrow. It might make a nice example.
Probably, but all I see is a graph and the abstract. It also depends on how the model is being used in the paper. Unless I'm mistaken, Nature puts papers through peer review, which, for me personally anyway, means I'd want to actually read the whole paper before reaching for the torches and pitchforks.
That's pretty outrageous if the claim in the paper is true - this is a really common problem that I ran into as an undergraduate doing modeling, fitting bacterial conjugation rates to differential equations for predator-prey dynamics.
In general, it can even be OK to use more parameters than data points, if you use regularization properly (weight decay, for example). Other times, even significantly fewer parameters than data points can be wrong.
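(A minimal sketch of that idea, using plain ridge regression as the L2 / weight-decay regularizer; the data, dimensions, and regularization strength below are made up for illustration:)

```python
import numpy as np

rng = np.random.default_rng(1)

# More parameters than observations: 15 data points, 50 features
n, p = 15, 50
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:3] = [1.5, -2.0, 0.5]            # only a few features actually matter
y = X @ true_w + rng.normal(scale=0.1, size=n)

# Ridge / weight-decay solution: minimize ||Xw - y||^2 + lam * ||w||^2
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Unregularized least squares is underdetermined here; lstsq returns the minimum-norm interpolant
w_plain = np.linalg.lstsq(X, y, rcond=None)[0]

# Compare on fresh data from the same process
X_test = rng.normal(size=(200, p))
y_test = X_test @ true_w + rng.normal(scale=0.1, size=200)
print("ridge test MSE:", np.mean((X_test @ w_ridge - y_test) ** 2))
print("plain test MSE:", np.mean((X_test @ w_plain - y_test) ** 2))
```

How much the shrinkage helps, and how to pick lam, depends entirely on the problem, which is part of the objection in the reply below.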
That sounds pretty ad-hoc. Sure, you can throw an L1 or L2 regularizer on your objective function, but it should be well-motivated.
Should probably just use Gaussian process regression if you want to do inference over the space of all[1] functions in a principled (i.e. Bayesian) manner.
1. (or the space of all polynomial functions or something. I forget)
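(For the curious, here is a bare-bones sketch of GP regression with a squared-exponential kernel in plain numpy; the kernel choice, length scale, and noise level are arbitrary assumptions, and in practice you'd tune them or reach for a library such as scikit-learn:)

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.3, signal_var=1.0):
    """Squared-exponential (RBF) covariance between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

rng = np.random.default_rng(2)
x_train = rng.uniform(0.0, 1.0, size=8)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=8)
noise_var = 0.1 ** 2

x_test = np.linspace(0.0, 1.0, 100)

# Condition the Gaussian-process prior on the observed points
K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
K_s = rbf_kernel(x_test, x_train)
K_ss = rbf_kernel(x_test, x_test)

mean = K_s @ np.linalg.solve(K, y_train)              # posterior mean at the test inputs
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)          # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))       # pointwise uncertainty

print("posterior mean (first 5):", np.round(mean[:5], 3))
print("posterior std  (first 5):", np.round(std[:5], 3))
```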
I don't think there's any consensus in the statistics community on which of the many ways to do inference over "nearly all" functions is the best one; nonparametric regression is basically a whole field, and it's been pretty in flux over the past 10 years.