Why is using a test set for evaluation a deadly sin?
As I understand it, you fit() on the training set, then do hyperparameter tuning on the validation set, and the best-tuned model is finally evaluated on the test set.
Now I'm still a little confused as to why we don't just fit() and then do the hyperparameter tuning directly on the test set (the best-tuned model wins, and there's no need for a separate third set). Why would calling predict() on a model cause it to update its weights and overfit?
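To make that concrete, the workflow I mean looks roughly like this minimal sketch (the dataset, the logistic regression model, and the tiny C grid are just placeholder choices, not anything prescribed here):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# One-off split into train / validation / test (roughly 60 / 20 / 20).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_C, best_val_acc = None, -np.inf
for C in [0.01, 0.1, 1.0, 10.0]:  # hyperparameter search...
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))  # ...scored on the validation set only
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# The test set is touched exactly once, after every tuning decision is frozen.
final_model = make_pipeline(StandardScaler(), LogisticRegression(C=best_C, max_iter=1000))
final_model.fit(X_train, y_train)
print("estimated generalization accuracy:", accuracy_score(y_test, final_model.predict(X_test)))
```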
It is not the ML model that is updated with information; it is the predictive modeler herself who is updated. She now finds parameters that make the model perform well on that specific test set. This gives you overly optimistic estimates of generalization performance (and thus unsound science; in business, it is better to report performance that is too low than too high, because a policy built on a model that is overfit like this can ruin a company or a life). For smarter approaches to this problem, see the research on reusable holdout sets.
I think the idea is that "calling predict and tuning on a model with the test set" is the "overfitting". It's not the usual overfitting we know in ML; it's as if the researcher is performing "descent" to find the best hyperparameters. The problem is that if we use the test set to find those hyperparameters, we'll have no idea how well the model does in the real world / in general. We'd need yet another set to figure that out, and we're back where we started.
It's a deadly sin because many (most?) researchers do not treat it as a test set but as a "validation set #2". Basically, you tune your hyperparameters (up to the random seed!) to fare better on the test set. So, as shown in the cited paper, the results are no longer generalization results.
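As a small illustration of what the last two comments describe (an entirely synthetic setup, assumed for the example): the features below are pure noise, so no model can genuinely beat 50% accuracy, yet selecting a configuration by its test-set score still produces a number that looks better than chance, and the gain evaporates on data nobody tuned against.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Labels are independent of the features, so 50% accuracy is the true ceiling.
X_train, y_train = rng.normal(size=(200, 20)), rng.integers(0, 2, 200)
X_test,  y_test  = rng.normal(size=(200, 20)), rng.integers(0, 2, 200)
X_fresh, y_fresh = rng.normal(size=(200, 20)), rng.integers(0, 2, 200)

best_model, best_test_acc = None, -1.0
for depth in range(1, 15):
    for seed in range(30):  # tuning "up to the random seed"
        m = DecisionTreeClassifier(max_depth=depth, max_features="sqrt",
                                   random_state=seed).fit(X_train, y_train)
        acc = accuracy_score(y_test, m.predict(X_test))  # selecting on the test set itself
        if acc > best_test_acc:
            best_model, best_test_acc = m, acc

print("accuracy on the test set we tuned against:", best_test_acc)  # typically well above 0.5
print("accuracy on untouched data:",
      accuracy_score(y_fresh, best_model.predict(X_fresh)))         # hovers around 0.5
```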
You could easily achieve perfect accuracy on the test set by just hardcoding the entire test set into your "model", where the entire model is: 1. See image, 2. Look up image in test set, 3. Read off answer.
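A toy version of that lookup "model" (purely illustrative) could look like the sketch below: it memorizes the test set, scores 100% there, and is reduced to guessing on anything it hasn't seen.

```python
import numpy as np

class TestSetLookup:
    """1. See input, 2. look it up in the memorized test set, 3. read off the answer."""

    def __init__(self, X_test, y_test):
        self.table = {x.tobytes(): label for x, label in zip(X_test, y_test)}

    def predict(self, X):
        # Fall back to a constant guess for anything not in the memorized table.
        return np.array([self.table.get(x.tobytes(), 0) for x in X])

rng = np.random.default_rng(0)
X_test, y_test = rng.normal(size=(100, 10)), rng.integers(0, 2, 100)
X_new,  y_new  = rng.normal(size=(100, 10)), rng.integers(0, 2, 100)

model = TestSetLookup(X_test, y_test)
print("test set accuracy:", (model.predict(X_test) == y_test).mean())  # 1.0
print("new data accuracy:", (model.predict(X_new) == y_new).mean())    # roughly 0.5
```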
It would be interesting to see whether someone could sneakily (Sokal-style) publish a paper like the following: "We took (popular model X) and augmented it with an additional lexicon of specific lookup data, and the result blows away all the competition. This is deeply profound and implies that built-in lexicons could be the key to true general intelligence!" (When in fact all they did was hard-code the test set, or part of it, into their model.) Then see how many popular outlets churn out sensational articles.