I only have a cursory knowledge of DSP, but it sounds like this was implemented by using the 5-second clip as a matched filter kernel. That would mean cross-correlating the clip's samples against the entire song (equivalently, convolving with the time-reversed clip) to find where it matches. That should work very well for a single song, but I think it would be really tough to scale up to millions of songs. Is this how it was implemented?
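To make the question concrete, here is a rough sketch of the matched-filter idea I'm describing. This is only my guess at the approach, not the actual implementation. The function name `find_clip_offset` is made up for illustration, and I'm assuming both signals are mono sample arrays at the same sample rate.

```python
import numpy as np

def find_clip_offset(song: np.ndarray, clip: np.ndarray) -> int:
    """Return the sample offset in `song` where `clip` correlates best.

    Hypothetical helper: a plain, unnormalized matched filter.
    """
    # Zero-pad so the circular correlation computed by the FFT
    # does not wrap around for the lags we care about.
    n = len(song) + len(clip) - 1
    nfft = 1 << (n - 1).bit_length()  # next power of two >= n

    # Cross-correlation via the FFT: multiply the song's spectrum
    # by the complex conjugate of the clip's spectrum.
    spectrum = np.fft.rfft(song, nfft) * np.conj(np.fft.rfft(clip, nfft))
    corr = np.fft.irfft(spectrum, nfft)

    # Keep only lags where the clip lies entirely inside the song.
    valid = corr[: len(song) - len(clip) + 1]
    return int(np.argmax(valid))

# Synthetic example: random noise stands in for a song.
rng = np.random.default_rng(0)
song = rng.standard_normal(20_000)
clip = song[7_000:7_500] + 0.1 * rng.standard_normal(500)  # slightly noisy excerpt
offset = find_clip_offset(song, clip)
```

Even with the FFT, this costs on the order of N log N per song for a song of N samples, and it has to be repeated for every song in the catalog, which is the scaling problem I'm asking about. An unnormalized correlation like this can also be thrown off by loud passages, so a real system would presumably need some normalization at minimum.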