
In the field of speech there is the concept of forced alignment: aligning an audio recording with a transcript by way of text-to-speech and pattern matching. It can certainly be used to detect some sorts of errors consistently, and various other errors with some probability. (I'm not at the point of doing this yet, but I'm early in building software for audiobook recording and plan to see what I can manage in error detection eventually. Unfortunately, from a little casual playing around, small errors are generally not well identified, because there are too many false positives.) You could probably adapt a lot of that to music without too much trouble, though it would certainly require changes and integration effort. My vague feeling is that this direction, rendering the score via MIDI and pattern-matching it against the recording, would be surprisingly effective, and I suspect better than trying to match the recording to the score more directly with just Fourier transforms, or than converting the audio to a score with something like this and comparing the two. That said, even with the forced-alignment adaptations, I imagine you'd still be using mostly Fourier transforms for feature detection.
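The pattern-matching step I have in mind is usually done with dynamic time warping (DTW): compute a frame-by-frame feature sequence (spectral or chroma frames, Fourier-based) for both the rendered MIDI and the recording, then find the lowest-cost monotonic alignment between the two. Here's a minimal sketch of that idea; the function name and toy one-dimensional "features" are mine, not from any actual alignment toolchain:

```python
import numpy as np

def dtw_align(ref, perf):
    """Dynamic time warping between two feature sequences.

    ref, perf: 2-D arrays of shape (frames, features), e.g. spectral or
    chroma frames from a rendered MIDI reference and the actual recording.
    Returns the total alignment cost and the warping path as a list of
    (ref_frame, perf_frame) pairs.
    """
    n, m = len(ref), len(perf)
    # Pairwise Euclidean distances between all frames.
    cost = np.linalg.norm(ref[:, None, :] - perf[None, :, :], axis=2)
    # Accumulated cost with the standard three-way recurrence.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # reference frame repeated
                acc[i, j - 1],      # performance frame repeated
                acc[i - 1, j - 1],  # frames advance together
            )
    # Backtrack to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return acc[n, m], path[::-1]

# Toy example: the "performance" holds one frame longer than the score,
# as if a note were sustained too long; the path shows where it drifts.
ref = np.array([[0.0], [1.0], [2.0], [3.0]])
perf = np.array([[0.0], [1.0], [1.0], [2.0], [3.0]])
total, path = dtw_align(ref, perf)
print(total)  # 0.0 — every frame still matched exactly
print(path)   # [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4)]
```

Where the path stalls on one axis (here, reference frame 1 matching two performance frames) is exactly the sort of place you'd flag as a candidate deviation; real errors like wrong notes would show up as regions of high frame distance along the path instead.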

(I’m a good programmer in general, but still decidedly a layman in audio matters. Take my thoughts here with a grain of salt.)



