I absolutely love how good the voices are in Coqui's VCTK VITS model (109 of them!). I found it easy to install Coqui on WSL, and it uses CUDA and the GPU quite effectively. p236 (male) and p237 (female) are my picks, but holy cow, 109 quality voices still blows my mind. Crazy how you had to pay for good TTS just a year ago, and now it's commoditized. Hope you find this useful. Here is the command I use to start the server, followed by the Windows-side function I use to play the audio it returns:
CUDA_VISIBLE_DEVICES="0" python TTS/server/server.py --model_name tts_models/en/vctk/vits --use_cuda True
import threading
import winsound  # Windows-only standard-library module

# One permit, so only one clip can play at a time
semaphore = threading.Semaphore(1)

def play_sound(response):
    # Lesson learned: serialize calls to winsound.PlaySound() with a semaphore.
    # It fails with "Failed to play sound" if you try to play 2 clips at once.
    semaphore.acquire()
    try:
        winsound.PlaySound(response.content, winsound.SND_MEMORY | winsound.SND_NOSTOP)
    finally:
        # Always release the permit, even if PlaySound raises an exception
        semaphore.release()
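
For anyone wondering where the audio comes from, here is a rough sketch of requesting a clip from the server started above. It assumes the server is on its default port (5002, as far as I know) and serves audio from /api/tts with `text` and `speaker_id` query parameters, so check your own setup. `build_tts_url` and `fetch_wav` are hypothetical helper names, not part of Coqui. The `response.content` in the function above suggests the `requests` library; this sketch sticks to the standard library and returns the raw WAV bytes instead.

```python
import urllib.parse
import urllib.request

# Assumption: the Coqui demo server listens on port 5002 and exposes /api/tts.
DEFAULT_BASE_URL = "http://localhost:5002"


def build_tts_url(text, speaker_id, base_url=DEFAULT_BASE_URL):
    """Build the GET URL for one utterance (hypothetical helper)."""
    # urlencode escapes spaces and punctuation in the text for us
    query = urllib.parse.urlencode({"text": text, "speaker_id": speaker_id})
    return f"{base_url}/api/tts?{query}"


def fetch_wav(text, speaker_id="p236", base_url=DEFAULT_BASE_URL):
    """Return the synthesized WAV bytes for `text` (hypothetical helper)."""
    url = build_tts_url(text, speaker_id, base_url)
    with urllib.request.urlopen(url) as resp:
        return resp.read()
```

With the server running, something like `wav_bytes = fetch_wav("Hello there")` should give you bytes you can hand to `winsound.PlaySound(wav_bytes, winsound.SND_MEMORY)` on the Windows side, still behind the same semaphore.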