Probably because, when treating it as continuous, they only look for a linear relationship (a linear regression).

When categorizing in buckets, they probably do an ANOVA. This technique lets each category have its own average, exactly as measured, and asks the question: if I tell you the category, by how much is the variance of your data reduced? If the variance falls a lot (relative to what it was), it means there's a statistically significant relationship between the category and your variable.
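To make the "how much does the variance fall" question concrete, here's a minimal sketch of the one-way ANOVA bookkeeping in plain Python. The age buckets and the numbers are made up for illustration, and `variance_explained` is just a name I picked, not anything from the study.

```python
from statistics import mean


def variance_explained(groups):
    """Return (eta_squared, f_statistic) for a one-way ANOVA.

    groups: dict mapping a category label to a list of observations.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)

    # Total sum of squares: spread around the single overall mean.
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)

    # Within-group sum of squares: spread left over once each
    # observation is compared with its own category's mean.
    ss_within = sum(
        (v - mean(values)) ** 2
        for values in groups.values()
        for v in values
    )
    ss_between = ss_total - ss_within

    k = len(groups)      # number of categories
    n = len(all_values)  # number of observations

    # Fraction of the variance removed by knowing the category.
    eta_squared = ss_between / ss_total
    # F statistic: explained variance per parameter vs. leftover variance.
    f_statistic = (ss_between / (k - 1)) / (ss_within / (n - k))
    return eta_squared, f_statistic


# Hypothetical data: some outcome measured for doctors in four age buckets.
outcome_by_age_bucket = {
    "<40": [10.0, 11.0, 12.0],
    "40-49": [11.0, 12.0, 13.0],
    "50-59": [13.0, 14.0, 15.0],
    "60+": [15.0, 16.0, 17.0],
}
eta_sq, f_stat = variance_explained(outcome_by_age_bucket)
print(f"share of variance explained: {eta_sq:.2f}, F = {f_stat:.2f}")
```

Turning the F statistic into a p-value requires the F distribution with (k - 1, n - k) degrees of freedom, which I left out to keep this dependency-free.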

And, in their defense, they can't really go fishing for different continuous relationships once they have the data: every extra model they try is another chance at a false positive, and correcting for that costs them statistical power.
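A rough sketch of why trying several relationships after the fact is costly. It assumes the tests are independent and uses the conventional 0.05 threshold; both are simplifications on my part, and the function names are made up.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test threshold that keeps the overall false-positive rate near alpha."""
    return alpha / num_tests


for num_models in (1, 3, 10):
    print(
        num_models,
        round(family_wise_error_rate(num_models), 3),
        bonferroni_threshold(num_models),
    )
```

With ten candidate models the chance of at least one spurious "hit" is roughly 40%. Tightening the per-test threshold fixes that, but a stricter threshold means real effects are harder to detect, which is the loss of power.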

Of interest too is the number of adjustable parameters of the model:

If instead of four age categories you use, say, four hundred, you end up putting each doctor in their own category. That model reproduces the data perfectly, but with no variance left inside any category there is nothing left to test against, and you have achieved no insight at all.

Similarly when taking age as continuous: if instead of a straight line you fit a curve with four hundred free parameters, you overfit to the point of destroying any insight.
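Here's a toy version of that trade-off: a model with one parameter per observation against a two-parameter straight line. All numbers are invented, and falling back to the overall mean for an unseen age is my own assumption about how such a model would have to predict.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (two parameters)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - slope * x_bar, slope


def sum_squared_error(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical training data: one observation per distinct age.
train_ages = [30, 40, 50, 60]
train_outcomes = [10.0, 12.5, 13.5, 16.0]

# Model A: one category per observation (as many parameters as data points).
lookup = dict(zip(train_ages, train_outcomes))
overall_mean = mean(train_outcomes)


def per_point_model(age):
    # An unseen age has no category of its own; fall back to the overall mean.
    return lookup.get(age, overall_mean)


# Model B: a straight line (two parameters).
intercept, slope = fit_line(train_ages, train_outcomes)


def line_model(age):
    return intercept + slope * age


# Hypothetical held-out observations at ages not seen in training.
test_ages = [35, 55]
test_outcomes = [11.0, 15.0]

train_error_lookup = sum_squared_error(per_point_model, train_ages, train_outcomes)
train_error_line = sum_squared_error(line_model, train_ages, train_outcomes)
test_error_lookup = sum_squared_error(per_point_model, test_ages, test_outcomes)
test_error_line = sum_squared_error(line_model, test_ages, test_outcomes)

print("train:", train_error_lookup, train_error_line)
print("test: ", test_error_lookup, test_error_line)
```

The per-point model has zero error on the data it memorized and does far worse than the line on the held-out ages, because it learned nothing about how the outcome changes with age.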

So in that sense it's "unfair" to compare four age categories (four free parameters, one average per category) against the two free parameters of a linear regression (slope and intercept). And they would need to explain how they chose the age ranges for each category.




that's an awesome explanation. thanks.



