For those interested in audio fingerprinting: there was a company called Echo Nest that open-sourced a full audio fingerprinting stack (server + client) called Echoprint [1], but they were bought by Spotify and there's been no development on Echoprint since.
I used it back in the day and I recall it working very well.
Shazam is, for me, the most magical app on my phone (maybe Maps is a close second), the thing from the future I would never have imagined being real if you had told the me from 20 years ago about it. It is sometimes so fast, under 5 seconds. Just wonderfully great. It makes me feel superhuman. It really is one of those extra-sensory powers we have recently gained with these universal gadgets in our pockets.
It really is a marvel in its simplicity and reliance on "older" concepts (frequency transforms, hashing, etc.) as opposed to any crazy ML (AFAIK) -- surprisingly not that cutting edge. And I'm not knocking that / there's nothing wrong with the age of a technique. If it's useful, it's useful. It's incredible. It's a perfect example of bringing together several previously somewhat disparate domains of math with just the right application and an Internet-connected smartphone. In that regard, it's a bit crazy to think that the math to enable Shazam has probably been around for...several decades? Just took a stroke of ingenuity to bring it all together.
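To make that concrete, here's a minimal sketch of the general "spectrogram peaks + hashing" idea in Python. This is not Shazam's actual implementation; the window size, hop, peak-picking rule, threshold, and fan-out below are all made-up illustrative choices:

    import numpy as np

    def fingerprint(samples, sample_rate, win=4096, hop=2048, fan_out=5):
        # Short-time Fourier transform via windowed FFTs.
        window = np.hanning(win)
        frames = [samples[i:i + win] * window
                  for i in range(0, len(samples) - win, hop)]
        spec = np.abs(np.fft.rfft(frames, axis=1))   # (time, freq) magnitudes

        # Crude peak picking: strongest bin per frame, if it stands out.
        peaks = []                                    # (frame_idx, freq_bin)
        for t, row in enumerate(spec):
            f = int(np.argmax(row))
            if row[f] > 10 * np.median(row):
                peaks.append((t, f))

        # Pair each peak with the next few peaks and hash the triple
        # (freq1, freq2, time delta) as a compact lookup key.
        hashes = []
        for i, (t1, f1) in enumerate(peaks):
            for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
                hashes.append((hash((f1, f2, t2 - t1)), t1))
        return hashes

Matching is then just a matter of building a big table of these keys for a catalog of tracks and seeing which track the keys from a query clip land in most consistently (with consistent time offsets). No ML required, just transforms and hash lookups.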
I just followed that link, which led to a page containing another "previous discussion" link, and then a couple more, and ended up back at a Slashdot article from Monday, August 28, 2000!!
Shazam has always been a pretty amazing product, especially when you consider that it was first released nearly 20 years ago. It has also been fairly poorly understood, so I am glad that people are taking the time to revisit and review the technology behind it.
"Good idea" is a tough subjective question to answer. Could you do better than FFT/DCT/frequency-based transform? Maybe/possibly given that music tends to be quite "harmonic." Taking a step back, when trying to classify, the best features are ones that separate your classes out better. Certain representations / changes thereof can lead to better separations depending on the the properties of your input source.
The cepstrum is good at providing an energy-compact/sparse representation of harmonic features. This is why it's used (/was used) a lot in speech recognition. Speech sounds tend to have harmonic properties (see: formants). Frequency-based transforms tell you how much of a frequency (a repeating/periodic signal) is present. If you have harmonics, those can sooort of be thought of as repeating patterns in the frequency domain. So taking a frequency transform of a frequency transform (which is, super loosely, a cepstrum) gets you a nice compact (separable) representation of inputs that tend to have harmonic features.
Most music tends to be pretty damn harmonic... so maybe?
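As a rough illustration of that "transform of a transform" idea (a toy example with made-up signal parameters, not anything from the article): the real cepstrum is the inverse FFT of the log magnitude spectrum, and for a strongly harmonic signal it shows a sharp peak at the quefrency matching the signal's period.

    import numpy as np

    sample_rate = 8000
    t = np.arange(0, 1.0, 1.0 / sample_rate)

    # A crudely "harmonic" signal: a 220 Hz fundamental plus a few overtones.
    f0 = 220.0
    signal = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 5))

    # Real cepstrum: FFT -> log magnitude -> inverse FFT.
    spectrum = np.fft.fft(signal)
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.ifft(log_mag).real

    # The peak quefrency (skipping the lowest bins, which carry the overall
    # spectral envelope) should sit near the period, i.e. sample_rate / f0.
    q = np.argmax(cepstrum[20:sample_rate // 2]) + 20
    print("estimated pitch: %.1f Hz" % (sample_rate / q))

The point is that the harmonic structure collapses into one compact peak, which is exactly the kind of separable feature you'd want for classification.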
There's also an argument to be made that if you have a big enough network of the right kind that's just taking in windowed time-domain data (almost surely involving recurrence), it might not be surprising to find some cepstral-like stuff naturally pop out.