I always find it concerning when authors dichotomize a variable without an extre...

_bz2r · on May 20, 2017

great point. they mention the same thing here:

Physician age was modeled both as a continuous linear variable and as a categorical variable (in categories of <40, 40-49, 50-59, and ≥60) to allow for a potential non-linear relation with patient outcomes.

can you explain what it means? Specifically, how does making it categorical allow for it, and keeping it continuous prevent it?

Any ideas?

euyyn · on May 20, 2017

Probably because when they treat it as continuous they only fit a linear regression, and a straight line can't capture, say, a U-shaped relationship.

When categorizing in buckets, they probably do an ANOVA. This technique lets each category have its own average, exactly as measured, and asks the question: if I tell you the category, how much is the variance of your data reduced? If the variance falls a lot (relative to what it was), it means there's a statistically significant association between the category and your variable.
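To make the "how much is the variance reduced" question concrete, here is a rough sketch on made-up data. The U-shaped outcome and the `r_squared` helper are my own assumptions for illustration, not anything from the study; only the four age buckets come from the quoted passage.

```python
import numpy as np

# Synthetic, noise-free toy data (an assumption, not the study's data):
# outcomes are worst for the youngest and oldest doctors, best around 50.
age = np.arange(30, 70)
outcome = (age - 50) ** 2 / 100.0

def r_squared(y, fitted):
    """Fraction of the variance in y that the fitted values account for."""
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Continuous and linear: two free parameters (slope and intercept).
slope, intercept = np.polyfit(age, outcome, 1)
r2_linear = r_squared(outcome, slope * age + intercept)

# Categorical: one free mean per bucket (<40, 40-49, 50-59, >=60).
bucket = np.digitize(age, [40, 50, 60])  # bucket index 0..3
bucket_means = np.array([outcome[bucket == b].mean() for b in range(4)])
r2_buckets = r_squared(outcome, bucket_means[bucket])

print(f"variance explained by straight line: {r2_linear:.3f}")
print(f"variance explained by four buckets:  {r2_buckets:.3f}")
```

On this toy data the straight line explains almost none of the variance while the four bucket means explain most of it, which is the sense in which the categories "allow for" a non-linear relation. A real ANOVA would also turn that reduction into an F statistic and a p-value; this sketch stops at the variance ratio.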

And, in their defense, they can't really go fishing for different continuous relationships once they have the data, as every extra model they try raises the chance of a spurious finding (or, if they correct for that, costs them statistical power).

Of interest too is the number of adjustable parameters of the model:

If instead of four age categories you use, say, four hundred, you end up putting each doctor in their own category. That model fits the data it was built on as well as any model can, but you have achieved no insight at all.

Similarly when taking age as continuous: if instead of a straight line you fit a curve with four hundred free parameters, you overfit it to the point of destroying any insight.

So in that sense it's "unfair" to compare four age categories against the two free parameters of a linear regression. And there would need to be some explanation as to the age ranges they used for each category.
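The four-hundred-categories point can be sketched the same way. Everything here is hypothetical: the outcomes are pure random noise, the doctor count and the `category_means` helper are made up for illustration, and "new" outcomes stand in for data the model was not fitted on.

```python
import numpy as np

rng = np.random.default_rng(0)
n_doctors = 400

# Hypothetical data: outcomes are pure noise, so no grouping of doctors
# carries any real information about them.
fit_outcome = rng.normal(size=n_doctors)  # outcomes used to fit the model
new_outcome = rng.normal(size=n_doctors)  # fresh outcomes, same doctors

def r_squared(y, fitted):
    """Fraction of the variance in y that the fitted values account for."""
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def category_means(category, y):
    """Fitted value for each doctor: the mean outcome of their category."""
    means = np.array([y[category == c].mean() for c in range(category.max() + 1)])
    return means[category]

doctor = np.arange(n_doctors)
four_buckets = doctor // 100  # 4 categories of 100 doctors each
one_each = doctor             # 400 categories of 1 doctor each

r2_buckets_fit = r_squared(fit_outcome, category_means(four_buckets, fit_outcome))
r2_per_doctor_fit = r_squared(fit_outcome, category_means(one_each, fit_outcome))
r2_per_doctor_new = r_squared(new_outcome, category_means(one_each, fit_outcome))

print(f"4 categories, fitted data:   {r2_buckets_fit:.3f}")
print(f"400 categories, fitted data: {r2_per_doctor_fit:.3f}")
print(f"400 categories, new data:    {r2_per_doctor_new:.3f}")
```

With one category per doctor the fit on the original data is perfect even though there is nothing to find, and it falls apart on the fresh outcomes. Four categories barely move the needle on noise, which is what you'd want from a model with few free parameters.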

_bz2r · on May 20, 2017

that's an awesome explanation. thanks.

Antrikshy · on May 20, 2017

But sensationalism and page views!