This is entirely unsurprising and has a very simple solution: keep adding more data. Our measurements of the accuracy of AI systems are only as good as the test data, and if the test data is too small, then the reported accuracies won't reflect the true accuracies of the model applied to wild data.
Basically, we need an accurate measure of whether the test data set is statistically representative of wild data. In healthcare, this means that the individuals who make up the test dataset must be statistically representative of the actual population (and there must also be enough samples).
An easy solution here is that any research that doesn't pass a population statistics test must declare up front that it is "not representative of real-world usage" or something.
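To make "population statistics test" a bit more concrete, here is a rough sketch of one possible check: a Pearson chi-square goodness-of-fit test on a single demographic variable. The age bands, population shares, and test-set counts are made up for illustration, and passing a check like this on one variable would not by itself show that a test set is representative.

```python
def chi_square_gof(observed_counts, population_shares):
    """Pearson chi-square statistic: sample counts vs. expected population shares."""
    total = sum(observed_counts.values())
    stat = 0.0
    for group, share in population_shares.items():
        expected = total * share
        observed = observed_counts.get(group, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical age-band shares in the target population (illustrative only).
population_shares = {"18-39": 0.35, "40-59": 0.33, "60-79": 0.25, "80+": 0.07}
# Hypothetical test set of 1,000 patients that skews young.
test_set_counts = {"18-39": 520, "40-59": 310, "60-79": 150, "80+": 20}

# Chi-square critical value for 3 degrees of freedom (4 groups - 1) at alpha = 0.05.
CRITICAL_VALUE = 7.815

stat = chi_square_gof(test_set_counts, population_shares)
looks_representative = stat <= CRITICAL_VALUE
print(f"chi-square = {stat:.1f}, critical value = {CRITICAL_VALUE}")
print("age mix consistent with population" if looks_representative
      else "age mix differs from population: flag as not representative")
```

In practice you would want to check several variables (and their combinations), which is where this gets much harder than the sketch suggests.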
> Here’s why: As researchers feed data into AI models, the models are expected to become more accurate, or at least not get worse. However, our work and the work of others has identified the opposite, where the reported accuracy in published models decreases with increasing data set size.
That's not a contradiction per se. It's easier to get spuriously high test scores with smaller datasets. It does not clearly demonstrate that the models are actually getting worse.
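A quick simulation illustrates the point. This is only a sketch under stated assumptions: a model with a fixed true accuracy of 0.80, 200 hypothetical studies per test-set size, and only the best-looking result getting "reported". None of these numbers come from the article.

```python
import random

def observed_accuracy(true_accuracy, n_test, rng):
    """Accuracy measured on one random test set of size n_test."""
    correct = sum(rng.random() < true_accuracy for _ in range(n_test))
    return correct / n_test

def best_reported(true_accuracy, n_test, n_studies, rng):
    """Highest accuracy seen across n_studies independent test sets."""
    return max(observed_accuracy(true_accuracy, n_test, rng)
               for _ in range(n_studies))

TRUE_ACCURACY = 0.80   # assumed: the model never changes
N_STUDIES = 200        # assumed number of studies per test-set size
rng = random.Random(0)

best_by_size = {}
for n_test in (25, 100, 400, 1600):
    best_by_size[n_test] = best_reported(TRUE_ACCURACY, n_test, N_STUDIES, rng)
    print(f"test size {n_test:>4}: best reported accuracy = {best_by_size[n_test]:.3f}")
```

The model is identical in every run, yet the best reported accuracy shrinks toward 0.80 as the test set grows, which is the same downward trend the article describes.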
But if diagnoses are multimodal and rely upon large, multidimensional analysis of symptoms/bloodwork/past medical history, wouldn't adding more dimensions just increase dimensional sparsity and decrease the number of useful conclusions you are able to draw from your variables?
It's been a long time since I learned about the curse of dimensionality, but if you increase the amount of datapoints you collect by half you would have to quadruple the amount of samples you have to retrieve any meaningful benefit, no?
I did mean samples (n size), not the number of features. But also, no, your point isn't right. If you have a ton of variables, you'll be better able to overfit your models to a training set (which is bad). However, that's not to say that a fairly basic toolkit can't help you avoid doing that even with a ton of variables. What really matters is the effect size of the variables you're adding, that is, whether or not they can actually help you predict the answer, distinctly from the other variables you have.
Stupid example: imagine trying to predict the answer of a function that is just the sum of 1,000,000 random variables. Obviously having all 1,000,000 variables will be helpful here, and the model will learn to sum them up.
In the real world, a lot of your variables either don't matter or are basically saying the same thing as some of your other variables so you don't actually get a lot of value from trying to expand your feature set mindlessly.
> if you increase the amount of datapoints you collect by half you would have to quadruple the amount of samples you have to retrieve any meaningful benefit, no?
I think you might be thinking about standard error, where you divide the standard deviation of your data by the square root of the number of samples. So quadrupling your sample size will cut the error in half?
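For reference, the standard error of the mean is the standard deviation divided by the square root of n. A tiny sketch, with a standard deviation of 10 picked arbitrarily for illustration:

```python
import math

def standard_error(std_dev, n_samples):
    """Standard error of the mean: std_dev / sqrt(n)."""
    return std_dev / math.sqrt(n_samples)

STD_DEV = 10.0  # arbitrary illustrative value

for n in (100, 400, 1600):
    print(f"n = {n:>4}: standard error = {standard_error(STD_DEV, n)}")
# n =  100: standard error = 1.0
# n =  400: standard error = 0.5
# n = 1600: standard error = 0.25
```

Each fourfold increase in sample size halves the standard error.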
If there's one thing I learned with biomedical data modeling and machine learning, it's that "it's complicated". For biomedical scenarios, getting more data is often not simple at all. This is especially the case for rare diseases. For areas like drug discovery, getting a single new data point (for example, the effect of a drug candidate in human clinical settings) may require a huge expenditure of time and money. Biomedical results are often plagued with confounding variables, hidden and invisible, and simply adding in more data without detection and consideration of these bias sources can be disastrous. For example, measurements from lab #1 may show persistent errors not present in lab #2, and simply adding in more data blindly from lab #1 can make for worse models.
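The lab #1 versus lab #2 point can be sketched with a toy simulation. Everything here is hypothetical: the "model" is just an estimate of a mean, the true value is 5.0, and lab #1 is assumed to read 0.5 too high on every measurement.

```python
import random
from statistics import mean

TRUE_VALUE = 5.0    # hypothetical true level of whatever is being measured
LAB1_OFFSET = 0.5   # hypothetical persistent error in lab #1
NOISE_SD = 1.0      # hypothetical measurement noise, same for both labs

def measure(n, offset, rng):
    """Simulate n noisy measurements from a lab with a fixed offset."""
    return [rng.gauss(TRUE_VALUE + offset, NOISE_SD) for _ in range(n)]

rng = random.Random(42)
lab2 = measure(200, 0.0, rng)  # unbiased lab, fixed sample

def pooled_error(n_lab1):
    """Absolute error of the pooled mean after blindly adding n_lab1 lab #1 samples."""
    lab1 = measure(n_lab1, LAB1_OFFSET, rng)
    return abs(mean(lab2 + lab1) - TRUE_VALUE)

errors = {n: pooled_error(n) for n in (0, 200, 2000)}
for n, err in errors.items():
    print(f"lab #1 samples added: {n:>4}  ->  error of pooled estimate: {err:.3f}")
```

More data from the biased lab pulls the pooled estimate toward the wrong answer instead of averaging the error away, and nothing in the pooled numbers alone tells you that it happened.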
My conclusion is that you really need domain knowledge to know if you're fooling yourself with your great-looking modeling results. There's no simple statistical test to tell you if your data is acceptable or not.
I think this is a key point - the training set is very important, because biases, over-curation, or wrong contexts will mean the model may perform very poorly for particular scenarios or demographics.
I can't find the reference now, but there was a radiology AI system that had a good detection rate for pneumothorax (air in the lining of the lung) on chest X-rays. This can be quite a serious condition, but it is easy to miss. It turned out that the training set contained a lot of 'treated' pneumothoraces. The outcome was correct: the patients did indeed have a pneumothorax, but they also had a chest drain in, which was helping the prediction.
Similar to asking what the demographics of the training set are is asking what the recorded outcome was and how the diagnosis was made. There is often no 'gold standard' of diagnosis, and some are made with varying degrees of confidence. Even a post-mortem can't find everything...
It should be "statistically representative" with respect to the true causes, and all other factors should be independent. Instead, ML models, and certainly large NNs, allow every bit of data that correlates a tiny bit to contribute.
Since we don't know what the true causes are, nor how to represent them in the data, adding more data might just as well not work.