The problem is that outside the machine learning community, people don't hear "within-subjects analysis"; they hear (and are told) "better than human performance". Within the community I think you are right: people are working from a shared set of assumptions and have the same expectations about real-world performance (that the results will not transfer without massive negative deltas). But that is definitely not what the tens of thousands of web developers downloading scikit-learn or TensorFlow believe.