That could work, but I think denoising a song into a perfect match of the original recording is probably a very hard problem. It's hard enough that your model would still need to be robust to some deviation from the original track, so you would need to do what I said above anyway.
Generally it's much easier to generate noised pairs from clean input than it is to do the reverse, i.e. go record lots of noised inputs from the wild and match them to the original song. So the denoising problem you mention would be tougher still due to covariate shift: a denoiser trained on synthetic noise will likely see a different noise distribution in real recordings. I think the features you learn trying to fingerprint the song through noise will probably be a bit more robust, but I don't have a mathematical proof.
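To make the "generate noised pairs from clean input" idea concrete, here is a minimal sketch using NumPy. Everything in it is an assumption for illustration: the function names `add_noise_at_snr` and `make_noised_pairs` are hypothetical, the "song" is just a sine wave, and white Gaussian noise is a stand-in for whatever you'd actually hear in the wild (room echo, phone mics, crowd noise), which is exactly where the covariate shift would come from.

```python
import numpy as np


def add_noise_at_snr(clean, snr_db, rng):
    """Return `clean` plus white Gaussian noise at roughly `snr_db` dB SNR.

    Hypothetical helper: white noise is a simplifying assumption and is
    not representative of real-world recording conditions.
    """
    signal_power = np.mean(clean ** 2)
    # SNR (dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise


def make_noised_pairs(clean, snr_dbs, seed=0):
    """Build (noised, clean) training pairs, one per requested SNR level."""
    rng = np.random.default_rng(seed)
    return [(add_noise_at_snr(clean, snr_db, rng), clean) for snr_db in snr_dbs]


# Stand-in for a clean clip: one second of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
clean_clip = np.sin(2 * np.pi * 440 * t)

# Three pairs, from lightly noised (20 dB) to heavily noised (0 dB).
pairs = make_noised_pairs(clean_clip, snr_dbs=[20, 10, 0])
```

Each pair could feed either a denoiser (noised in, clean out) or a fingerprinting model (noised and clean should map to the same fingerprint). Going the other way, from real noisy recordings back to matched clean tracks, has no shortcut like this.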