It's also important that people understand what "control for" means in an observational study.
Nearly all observational studies that you read draw their conclusions from regression analysis (or something, such as an ANOVA, which can be trivially implemented as a regression analysis).
This just means that a linear model such as this is used:

outcome ~ b_coffee * drinks_coffee + b_income_level * income_level + ...
Then, unlike in machine learning, researchers look at the values of the coefficients and, most importantly, the confidence in those values. In a simplified example:

years_of_life ~ b_coffee * drinks_coffee + b_intercept
Here b_intercept represents life expectancy in general, and b_coffee (since drinks_coffee is binary here) represents how many years coffee adds to or subtracts from your life: if it's negative, drinking coffee reduces your life expectancy, and if it's positive, it increases it. In statistics we also look at how certain we are about b_coffee, in terms of standard error and p-values. For example, suppose b_coffee is 5, meaning coffee adds 5 years to your life, but the standard error of this estimate is 4 (i.e. roughly a 95% chance that the real impact is between -3 and 13 years). In this case the p-value for the coefficient will be too high to conclude "statistical significance".
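To make the coefficient-plus-standard-error idea concrete, here is a minimal sketch (numpy only, with made-up simulated data, not any real study) that fits years_of_life ~ drinks_coffee by ordinary least squares and reports b_coffee with its standard error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: 500 people, roughly half drink coffee.
# The true effect of coffee here is set to 0 years; noise has sd 10.
n = 500
drinks_coffee = rng.integers(0, 2, n)
years_of_life = 78 + 0.0 * drinks_coffee + rng.normal(0, 10, n)

# Ordinary least squares for: years_of_life ~ b_coffee * drinks_coffee + b_intercept
X = np.column_stack([drinks_coffee, np.ones(n)])
beta, *_ = np.linalg.lstsq(X, years_of_life, rcond=None)
b_coffee, b_intercept = beta

# Standard error of b_coffee from the usual OLS covariance formula
resid = years_of_life - X @ beta
sigma2 = resid @ resid / (n - 2)
cov = sigma2 * np.linalg.inv(X.T @ X)
se_coffee = np.sqrt(cov[0, 0])

print(f"b_coffee = {b_coffee:.2f} +/- {se_coffee:.2f}")
```

Since the simulated effect is zero, the fitted b_coffee should land within a couple of standard errors of zero, which is exactly the "not statistically significant" situation described above.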
But suppose the standard error is very small, like 1 year, so that we are virtually certain coffee improves life expectancy. To control for, say, college education, we just add a term for it to our model:

years_of_life ~ b_coffee * drinks_coffee + b_college * has_degree + b_intercept
The magic of regression analysis is that if, in fact, people who go to college live longer and people who go to college also drink more coffee, then our coefficient b_coffee will change in this new model to reflect that. If, for example, it were to become negative (with a low p-value), what we would conclude from this model is that coffee is in fact bad for you, college is good for you, and it just happens that a lot of people who go to college also drink coffee.
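The coefficient flip is easy to see in simulation. This sketch (numpy only, with hypothetical numbers: college adds 4 years and makes coffee drinking more likely, while coffee itself costs 1 year) fits both models and compares b_coffee:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical scenario: a degree raises life expectancy (+4 years) and
# raises the chance of drinking coffee; coffee itself is mildly harmful (-1).
has_degree = rng.integers(0, 2, n)
p_coffee = np.where(has_degree == 1, 0.8, 0.3)
drinks_coffee = (rng.random(n) < p_coffee).astype(float)
years_of_life = 76 - 1.0 * drinks_coffee + 4.0 * has_degree + rng.normal(0, 5, n)

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Model 1: years_of_life ~ b_coffee * drinks_coffee + b_intercept
b1 = ols(np.column_stack([drinks_coffee, np.ones(n)]), years_of_life)
# Model 2: add the control term b_college * has_degree
b2 = ols(np.column_stack([drinks_coffee, has_degree, np.ones(n)]), years_of_life)

print(f"coffee coefficient without control: {b1[0]:+.2f}")
print(f"coffee coefficient with control:    {b2[0]:+.2f}")
```

Without the control, b_coffee absorbs the college effect and comes out positive; once has_degree enters the model, b_coffee turns negative, matching the simulated truth.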
Just as even the most advanced AI is often just a lot of matrix multiplication with non-linear transforms, the vast majority of observational studies draw their conclusions from linear models. In practice, as you add more variables to your model, you tend to get results that are tricky to interpret. Regression analysis is a remarkably powerful tool, but it is important to remember, when you see these publications, that this is what is really happening, and there is a lot of room for subtlety when interpreting these models.
Yet all it takes is one missed factor and you've ended up with mere correlation.
As a simple question, what about people who have weak stomachs or hearts? My mother doesn't drink coffee because it "makes her heart beat too hard". How, with no actual medical data for "coffee makes my heart beat hard", do you control for that? Is that something to control for?
This is the "caffeine and healthy pregnancy" problem. We know women who consume less than ~200mg of caffeine tend to have healthier pregnancies, but if you can drink 5+ cups of coffee while pregnant and not get overtaken with nausea, that might indicate something is already wrong.
An important limitation which is often overlooked is that when you "control for" something by entering it into a regression, as you describe, you are only controlling for the linear effect of that thing.
It seems to me that this problem is totally fatal for large-scale epidemiological studies with many factors, of which many are sure to have nonlinear effects.
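This failure mode is simple to demonstrate. In the sketch below (numpy only, simulated data), a confounder z acts on the outcome purely through z**2 and also drives the exposure; "controlling for z" linearly leaves a large spurious exposure effect, while controlling for z**2 removes it:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Hypothetical confounder with a purely nonlinear (quadratic) effect:
# it drives the outcome through z**2 and also drives the "exposure".
z = rng.normal(0, 1, n)
exposure = (np.abs(z) > 1).astype(float)   # correlated with z**2, but not with z
outcome = 2.0 * z**2 + 0.0 * exposure + rng.normal(0, 1, n)

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

ones = np.ones(n)
# "Controlling for z" with a linear term does not remove the confounding...
b_linear = ols(np.column_stack([exposure, z, ones]), outcome)
# ...but controlling for the correct functional form (z**2) does.
b_quad = ols(np.column_stack([exposure, z, z**2, ones]), outcome)

print(f"exposure coefficient, linear control:    {b_linear[0]:+.2f}")
print(f"exposure coefficient, quadratic control: {b_quad[0]:+.2f}")
```

The true exposure effect is zero, yet the linearly-controlled model reports a large positive coefficient; only the model with the right functional form recovers it, which is exactly the limitation described above.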
Well put. I ultimately view observational studies as hypothesis generators that can spur research into more targeted questions all the way down to the biochemical level.