You'd think recognising a square (containing squares, and more squares inside those if need be) would be relatively simple and not require advanced machine learning or training, or at least not as much of it. The demo also suggests the sudoku needs to be scanned fairly accurately, much like a QR code.
I think he meant it for character recognition -- the post explains that they went for an in-house dataset (from sudoku magazines, as I understand it) instead of MNIST. They ran into some issues, found a way to solve them, and improved their training set. This allowed them to reach 98.6% accuracy, and after a few updates to the app, over 99%.
It's unclear whether Vision uses machine learning behind the scenes though. Their docs kind of imply that it uses Core ML under the hood (which makes sense given the other things it does, like face detection and object tracking).
The nice thing is it detects "projected rectangular regions" so even if the puzzle isn't aligned with the camera it still works.
I do wish I had more control though; it runs into trouble sometimes and there's not much I can do other than apply heuristics afterwards to determine whether I should throw out the sample or continue.
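As an illustration of that kind of after-the-fact filtering (not what the app actually does, and not part of the Vision API), here is a minimal Python sketch. It assumes the detector hands back four corner points in normalised (x, y) coordinates, listed in order around the quadrilateral. The function name `keep_sample` and both thresholds are hypothetical and would need tuning.

```python
from math import hypot


def cross(o, a, b):
    """Z component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def quad_area(quad):
    """Area of a simple quadrilateral via the shoelace formula."""
    total = 0.0
    for i in range(4):
        x1, y1 = quad[i]
        x2, y2 = quad[(i + 1) % 4]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def keep_sample(quad, min_area=0.05, max_side_ratio=2.0):
    """Return True if the detected quad plausibly outlines a sudoku grid.

    quad: four (x, y) corners in normalised image coordinates,
          in order around the shape (assumed, not guaranteed by any API).
    min_area, max_side_ratio: illustrative thresholds only.
    """
    # 1. Convexity: every turn must go the same way. This rejects
    #    self-intersecting "bow-tie" shapes and degenerate corners.
    turns = [cross(quad[i], quad[(i + 1) % 4], quad[(i + 2) % 4]) for i in range(4)]
    if not (all(t > 0 for t in turns) or all(t < 0 for t in turns)):
        return False

    # 2. Size: a very small quad is probably a single cell or noise.
    if quad_area(quad) < min_area:
        return False

    # 3. Shape: a projected square can be skewed, but its sides should
    #    not differ wildly in length.
    sides = [
        hypot(quad[(i + 1) % 4][0] - quad[i][0], quad[(i + 1) % 4][1] - quad[i][1])
        for i in range(4)
    ]
    return max(sides) <= max_side_ratio * min(sides)
```

A perspective-skewed trapezoid such as `[(0.25, 0.2), (0.75, 0.2), (0.9, 0.8), (0.1, 0.8)]` passes these checks, while a long thin sliver or a tiny quad is thrown out. Real filtering would likely also look at confidence scores or at stability across frames.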