Compared to the first NaturalSpeech[1], I'm hearing a lot of white noise in the background. Singing is pretty cool, but it feels like we need a few iterations before it can match the ground truth the way speech does.
Thanks for your interest in NaturalSpeech and NaturalSpeech 2!
NaturalSpeech focuses on synthesizing human-level, high-quality speech by training on a single-speaker dataset recorded in a studio.
NaturalSpeech 2 trains on 44K hours of multi-speaker in-the-wild data with more than 5K speakers, and focuses on synthesizing any speaker's voice in a zero-shot way given only a short speech prompt. When the speech prompt has background noise, NaturalSpeech 2 will mimic that noise as well. If you want a clean voice, just provide a clean speech prompt.
[1] https://speechresearch.github.io/naturalspeech/