Top papers claim 99.5% accuracy. Since this is measured on a 10,000-sample test set, that works out to exactly 50 misclassified samples.
I'd love to see a table of them. Back in the day, papers claiming top performance on MNIST would simply enumerate all of the misclassified test samples, e.g. https://www.researchgate.net/figure/All-of-the-misclassified... from 2002. Newer papers have whittled this down to five misclassified digits, e.g. see §4.4 of https://arxiv.org/pdf/2001.09136v6.pdf for this particular flourish, though note it isn't quite the standard protocol: a sample is counted as correct if any model in the ensemble gets it right, not the majority.
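For what it's worth, enumerating the errors is a few lines of numpy once you have per-model predictions. Here's a minimal sketch of the two scoring protocols side by side; the `preds` and `labels` arrays are synthetic stand-ins (everything below is illustrative, not from either paper):

    # Sketch: enumerate misclassified test samples under two ensemble
    # scoring protocols. Synthetic data stands in for real model outputs.
    import numpy as np

    rng = np.random.default_rng(0)
    n_models, n_samples, n_classes = 5, 10_000, 10

    # Stand-in data: real code would load test labels and model predictions.
    labels = rng.integers(0, n_classes, size=n_samples)
    preds = np.where(
        rng.random((n_models, n_samples)) < 0.995,  # each model ~99.5% accurate
        labels,                                     # correct prediction
        rng.integers(0, n_classes, size=(n_models, n_samples)),  # an error
    )

    correct = preds == labels  # (n_models, n_samples) boolean

    # Standard protocol: majority vote across the ensemble.
    majority_correct = correct.sum(axis=0) > n_models // 2

    # The arXiv paper's protocol: correct if ANY ensemble member is right.
    any_correct = correct.any(axis=0)

    # The tables I'd love to see: indices of the misclassified samples.
    print("majority-vote errors:", np.flatnonzero(~majority_correct))
    print("any-model errors:    ", np.flatnonzero(~any_correct))

The "any model" protocol can only shrink the error set relative to majority vote, which is part of why five misclassified digits is such a flattering number.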
Update from 2024:
https://github.com/tysam-code/hlb-CIFAR10: "Train to 94% on CIFAR-10 in <6.3 seconds on a single A100. Or ~95.79% in ~110 seconds (or less!)"
https://paperswithcode.com/sota/image-classification-on-cifa... (99.5% SOTA)
so that's embarrassing :)