I agree, the samples sound very natural. I do wonder, though, how similar they are to the data used for training, since it would be trivial to rearrange individual pieces of a large training set in ways that sound good (especially if a human cherry-picks the good samples for presentation afterwards).
What I'd really like to see, therefore, is a systematic comparison of the generated music to the training set, ideally using a measure of similarity.
A nice property of the model is that it is easy to compute exact log-likelihoods for both training data and unseen data, so one can actually measure the degree of overfitting (which is not true for many other types of generative models). Another nice property is that, based on these measurements, it seems to be extremely resilient to overfitting.
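To make that overfitting measurement concrete, here is a toy sketch of the idea. The model and data below are made up for illustration (a simple bigram model over a tiny symbol vocabulary stands in for WaveNet's per-sample softmax over quantised audio); the point is only that an autoregressive model assigns an exact, tractable log-likelihood to any sequence, so train and held-out likelihoods can be compared directly:

```python
import math
from collections import Counter, defaultdict

def fit_bigram(sequences, vocab_size, alpha=1.0):
    """Fit a first-order autoregressive (bigram) categorical model
    by counting transitions, with add-alpha smoothing."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1

    def log_prob(prev, cur):
        total = sum(counts[prev].values()) + alpha * vocab_size
        return math.log((counts[prev][cur] + alpha) / total)

    return log_prob

def avg_log_likelihood(sequences, log_prob):
    """Mean log-likelihood per step -- exact, no approximation needed,
    unlike e.g. GANs, which define no tractable likelihood at all."""
    total, n = 0.0, 0
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            total += log_prob(prev, cur)
            n += 1
    return total / n

# Hypothetical toy data: sequences of quantised values in {0, 1, 2}.
train = [[0, 1, 2, 1, 0, 1, 2], [1, 2, 1, 0, 1, 2, 1]]
held_out = [[0, 1, 2, 2, 1, 0, 0]]

lp = fit_bigram(train, vocab_size=3)
gap = avg_log_likelihood(train, lp) - avg_log_likelihood(held_out, lp)
print(f"train-vs-held-out likelihood gap: {gap:.3f} nats/step")
```

A small train-vs-held-out gap is exactly the "resilient to overfitting" claim above: the model is not just memorising its training sequences.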
Filtering out certain notes from a piano chord can be done by e.g. Melodyne, but that seems far from what's necessary to generate speech, so it would surprise me if WaveNet could do that.