Hacker News new | past | comments | ask | show | jobs | submit login

It will be more useful if it can narrate text along with those background effects.



You can already achieve that by combining models - use a dedicated speech synthesis model for the narration, then layer that over background effects from AudioGen.

Given that, I don't think AudioGen particularly needs to add full narration. That seems like a very different problem to me, likely requiring a completely different architecture.


What is the current state of the art speech synthesis model?


It was Nvidia's Tacotron2[0] but now I believe it's NaturalSpeech[1]

[0]https://paperswithcode.com/method/tacotron-2 [1]https://speechresearch.github.io/naturalspeech/


Thank you




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: