Since the author spent so much time on the optical character recognition step, it's worth mentioning that you don't even really need OCR for this task.
You can just find the squares, group the characters in the squares into visual equivalence classes, assign each class an arbitrary number, solve the puzzle in terms of those numbers, then fill in each empty square with the (average?) image of the equivalence class it matches.
This would allow you to solve a Sudoku puzzle with letters or Wingdings instead of numbers, and the output font would naturally match that of the original puzzle.
If the app still wanted to support puzzles that had notes scribbled in them, it would need some kind of OCR to tell the difference between a "starter" cell with known-good data and the "puzzle" cells it needs to solve.
But treating the symbols in the starter cells as arbitrary is ingenious, imo!
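The relabeling step described above can be sketched in a few lines. This is a toy illustration (not from the article): it assumes the cells have already been extracted and that each glyph has been reduced to some hashable key, e.g. a perceptual hash of the cell image, so that identical-looking glyphs compare equal.

```python
def relabel(grid):
    """Map a grid of opaque glyph keys to arbitrary digits.

    grid: list of rows; None for empty cells, any hashable glyph key
    (e.g. a perceptual hash of the cell image) for filled ones.
    Returns (numeric grid with 0 for empties, mapping digit -> glyph key).
    The numeric grid can be solved by any ordinary Sudoku solver, and the
    solution rendered by pasting back the stored glyph image per digit.
    """
    glyph_to_digit = {}
    digit_to_glyph = {}
    numeric = []
    for row in grid:
        out = []
        for glyph in row:
            if glyph is None:
                out.append(0)  # 0 = empty, to be solved
            else:
                if glyph not in glyph_to_digit:
                    digit = len(glyph_to_digit) + 1  # next unused digit
                    glyph_to_digit[glyph] = digit
                    digit_to_glyph[digit] = glyph
                out.append(glyph_to_digit[glyph])
        numeric.append(out)
    return numeric, digit_to_glyph
```

Since the solver only ever sees the arbitrary digits, it genuinely doesn't matter whether the puzzle was printed with numerals, letters, or Wingdings.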
> By the time we launched the app it was trained on over a million images of Sudoku squares.
This is super cool, but I can't help but think that something is missing if it takes hundreds of thousands of example digits for a machine learning algorithm to differentiate them. It wouldn't take a human child that many. The available machine learning algos aren't using anywhere near the amount of information available.
FWIW the MNIST dataset they talked about (handwritten digits) only has 60,000 training samples - and that dataset surely has more variance than printed puzzles from magazines.
I think today's machine learning algorithms mimic evolution more than they mimic how humans actually learn. That said, I think current techniques could one day teach a machine how to learn, by training a fixed neural net with memory where the net itself becomes the learning algorithm.
You'd think recognising a square (containing squares, which in turn contain more squares) would be relatively simple and wouldn't require advanced machine learning or training, or at least not as much of it. The demo also suggests the sudoku needs to be scanned fairly accurately, similar to a QR code.
I think he meant for character recognition -- the post explains that they went with an in-house dataset (from sudoku magazines, as I understand it) instead of MNIST. They ran into some issues, found ways to solve them, and improved their training set. This allowed them to reach 98.6% accuracy, and, after a few updates to the app, over 99%.
It's unclear whether Vision uses machine learning behind the scenes, though their docs kind of imply it's built on CoreML (which makes sense given the other things it does, like face recognition and object tracking).
The nice thing is it detects "projected rectangular regions" so even if the puzzle isn't aligned with the camera it still works.
I do wish I had more control though; it runs into trouble sometimes and there's not much I can do other than apply heuristics afterwards to determine whether I should throw out the sample or continue.
Really enjoyed reading this. The process is explained really well, including all the fun rabbit holes, the unexpected pitfalls on launch day, and the technical steps taken to overcome them.
Interesting limitations to work around, such as vertical vs. horizontal planes, and focal length.
Not at all surprised they saw better performance with almost immediate payback by training models on their own $1,200 hardware than running in the cloud.
Very interesting they trained their own character recognition model and not only that but built their own custom crowd-sourced image labeling system complete with accuracy checks and review screens.
This whole thing (including the backend tools) took me about the equivalent of 1 month of full-time work (I was doing it mostly nights & weekends though since our games are what pay the bills).
I brought in one of my (excellent) designers from Hatchlings a couple days before launch to make the cool grid "scanning" animation and to do our branding and logo.
Only 70? I can see lots of work creating accurate (in size and in colors) 3D models of every item in their catalog that look good, and maybe even more discussing with management whether the current model accurately portrays the product.
I don’t know whether the functionality is present (last time I checked, the app wasn’t available in ‘my’ App Store), but integrating the app with their inventory system(s) and translating it also can’t be free.
I think, in a year or two, someone will build a crossword puzzle solver using AR, ML, and computer vision. Granted, it is more difficult, because we need to recognize letters, and solving crossword puzzles is much harder than solving sudoku. At the least, a crossword solver could offer word suggestions when it cannot solve the puzzle completely.
Crosswords are pretty tough, partially because many puzzles (such as the Tuesday, Thursday, and Sunday puzzles in the New York Times) have enough wordplay in their answers (not just the clues) that they break normal crossword rules, in a specific way that the solver has to determine. I think Dr. Fill (https://arxiv.org/abs/1401.4597) is still the state of the art.
I found this very interesting, regarding crowdsourcing the training data:
"After the first pass I had enough verified data that I was able to add an automatic accuracy checker into both tools for future data runs (it would periodically show the user known images and check their work to determine how much to trust their answers going forward)."
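The quoted scheme is essentially "gold questions" from crowdsourcing. A hypothetical sketch of how that trust tracking might look (the class name, gold rate, and smoothing are my own choices, not the author's):

```python
import random


class LabelerTrust:
    """Sketch of the quoted idea: periodically slip in images with known
    answers ("gold" samples), and track each labeler's accuracy on them
    to decide how much to trust their other answers going forward."""

    def __init__(self, gold_rate=0.1):
        self.gold_rate = gold_rate  # fraction of tasks that are gold
        self.correct = 0
        self.seen = 0

    def should_serve_gold(self):
        # Randomly decide whether the next task is a known image.
        return random.random() < self.gold_rate

    def record_gold(self, answer, truth):
        self.seen += 1
        if answer == truth:
            self.correct += 1

    def trust(self):
        # Laplace-smoothed accuracy, so new labelers start at 0.5
        # rather than being fully trusted or fully distrusted.
        return (self.correct + 1) / (self.seen + 2)
```

A label from a worker could then be weighted by `trust()` when votes disagree, which matches the "determine how much to trust their answers" idea in the quote.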
Does anyone know how the detection of sudokus on vertical planes can be achieved? Great article, especially on the crowdsourcing and machine vision fronts, but the author's explanation of this aspect left a lot to be desired.
Sorry, I was pretty hand-wavy with that because it was basically just trial and error until it worked sufficiently well.
The data I had available to mess with was the difference in width between the top of the puzzle and the bottom (with some trig you can determine its angle relative to the camera), and the projection matrix of the camera relative to the scene origin.
It's not perfect but it works better than having nothing at all.
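To illustrate the "with some trig" part: this is a toy pinhole-camera model, not the app's actual formula. It assumes a square puzzle of side s, rotated about a horizontal axis through its center and viewed head-on from an assumed distance d = k * s; under rotation about a horizontal axis the edges don't change physical length, so each edge's apparent width is inversely proportional to its depth.

```python
import math


def tilt_from_widths(w_top, w_bottom, dist_over_side=4.0):
    """Estimate a puzzle's tilt toward the camera from the apparent
    widths of its top and bottom edges (toy model, assumed geometry).

    With r = w_top / w_bottom and d = dist_over_side * s:
        r = (d + (s/2) * sin(theta)) / (d - (s/2) * sin(theta))
    which rearranges to:
        sin(theta) = 2 * (d/s) * (r - 1) / (r + 1)
    Returns the tilt angle in degrees; positive means the top edge is
    nearer the camera than the bottom edge.
    """
    r = w_top / w_bottom
    sin_theta = 2 * dist_over_side * (r - 1) / (r + 1)
    sin_theta = max(-1.0, min(1.0, sin_theta))  # clamp numerical overshoot
    return math.degrees(math.asin(sin_theta))
```

Equal widths give zero tilt; a wider top edge gives a positive angle. In practice the camera-to-puzzle distance isn't known from the widths alone, which is presumably where the projection matrix mentioned above comes in.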
Definitely cool, and I like the application of ARKit. But using ML to solve a sudoku seems like overkill. I remember writing a constraint-based solver as the first assignment for an undergrad-level AI course back in uni. Surely this implementation is less efficient? Someone let me know if I am wrong.
If the aim was to simply learn new tech, though, then I get it. I am just wary of ML being a hammer used on anything even remotely resembling a nail.
It says they used a "traditional recursive algorithm", probably referring to the backtracking solution. In my experience it's fast enough to not matter for this sort of application (the other things that are going on are 1000x more complex).
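The "traditional recursive algorithm" is presumably plain backtracking, which in its minimal form looks something like this (a generic sketch, not the app's code):

```python
def valid(grid, r, c, d):
    """Check that digit d can be placed at (r, c) without violating
    the row, column, or 3x3 box constraint."""
    if d in grid[r]:
        return False
    if any(grid[i][c] == d for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(grid[br + i][bc + j] != d for i in range(3) for j in range(3))


def solve(grid):
    """Recursive backtracking solver. grid is a 9x9 list of lists with
    0 for empty cells; solves in place and returns True on success."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in range(1, 10):
                    if valid(grid, r, c, d):
                        grid[r][c] = d
                        if solve(grid):
                            return True
                        grid[r][c] = 0  # undo and backtrack
                return False  # no digit fits this cell
    return True  # no empty cells left
```

For a single 9x9 puzzle this typically finishes in well under a second, which supports the point that the solver is negligible next to the computer-vision pipeline.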