But if diagnosis are multimodal and rely upon large, multidimensional analysis o...

spywaregorilla · on Oct 21, 2022

I did mean samples (n size), not the number of features. But also, no, your point isn't right. If you have a ton of variables, you'll be better able to overfit your models to a training set (which is bad). However, that's not to say that a fairly basic toolkit can't help you avoid doing that even with a ton of variables. What really matters is the effect size of the variables you're adding. That is, whether or not they can actually help you predict the answer, distinctly from the other variables you have.

Stupid example: imagine trying to predict the answer of a function that is just the sum of 1,000,000 random variables. Obviously having all 1,000,000 variables will be helpful here, and the model will learn to sum them up.

In the real world, a lot of your variables either don't matter or are basically saying the same thing as some of your other variables, so you don't actually get a lot of value from trying to expand your feature set mindlessly.
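
A rough sketch of the "saying the same thing" point, on made-up synthetic data (the names `x1`, `x2`, `x3`, the noise levels, and the seed are all just assumptions for illustration). Once `x1` has been used to explain `y`, its near-duplicate `x2` has almost nothing left to add, while the independent `x3` still does:

```python
import random

def pearson(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

random.seed(0)  # arbitrary seed, only so the run is repeatable
n = 5000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [v + random.gauss(0, 0.1) for v in x1]  # near-duplicate of x1
x3 = [random.gauss(0, 1) for _ in range(n)]  # independent of x1
y = [a + c + random.gauss(0, 0.5) for a, c in zip(x1, x3)]

# Fit y on x1 alone with one-variable least squares, then keep the residual.
mean_x1, mean_y = sum(x1) / n, sum(y) / n
slope = (sum((a - mean_x1) * (b - mean_y) for a, b in zip(x1, y))
         / sum((a - mean_x1) ** 2 for a in x1))
intercept = mean_y - slope * mean_x1
residual = [b - (slope * a + intercept) for a, b in zip(x1, y)]

corr_x1_x2 = pearson(x1, x2)        # how redundant x2 is with x1
corr_res_x2 = pearson(residual, x2)  # what x2 adds beyond x1
corr_res_x3 = pearson(residual, x3)  # what x3 adds beyond x1

print(f"corr(x1, x2)       = {corr_x1_x2:.3f}")
print(f"corr(residual, x2) = {corr_res_x2:.3f}")
print(f"corr(residual, x3) = {corr_res_x3:.3f}")
```

The first number should come out close to 1, the second close to 0, and the third clearly positive.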

> if you increase the amount of datapoints you collect by half you would have to quadruple the amount of samples you have to retrieve any meaningful benefit, no?

I think you might be thinking about standard error, where you divide the standard deviation of your data by the sqrt of the number of samples. So quadrupling your sample size will cut the error in half.
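
A quick way to check the sqrt(n) relationship, as a sketch on simulated normal data (the helper names `standard_error` and `draw`, the sample sizes, and the seed are assumptions picked for illustration):

```python
import math
import random
import statistics

def standard_error(sample):
    """Estimated standard error of the mean: sample std dev / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

def draw(n):
    """Draw n values from a standard normal distribution."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]

random.seed(0)  # arbitrary seed, only so the run is repeatable

se_small = standard_error(draw(1_000))
se_large = standard_error(draw(4_000))  # four times as many samples
ratio = se_large / se_small

print(f"SE with n=1000: {se_small:.4f}")
print(f"SE with n=4000: {se_large:.4f}")
print(f"ratio: {ratio:.2f}")
```

The ratio should land near 0.5, since sqrt(4) = 2. It won't be exactly 0.5 because the standard deviation itself is estimated from each sample.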

tappio · on Oct 21, 2022

You are right, but I feel you misunderstood op.

I understood that op meant increasing the number of samples, not variables.

dirheist · on Oct 23, 2022

I think I did misunderstand, you are right. Increasing the number of samples will definitely increase the feasibility of the model. I was incorrect.