This is an amazing side project. Would you share some details about how you've hosted the models? Also, if possible, some training details: how long did it take you to train them? Did you do it on your local GPU, a cloud provider, or a free service like Colab?
I'm pretty serious, so I've put some money into it. And even more time.
I've got a 2x1080Ti setup I used locally back in the day, but it's really slow. I still train stuff on it, but only things I know will train successfully over a long run (e.g. the MelGAN model).
I use rented V100 GPUs to train the speaker models. They're quick and allow me to refine the datasets and parameters much more quickly than if I was doing all of it on my own box. Colabs are great and I could probably get along with them if I wasn't running so many experiments in parallel.
I can get reasonable results in a few hours on an 8xV100. Once I home in on a direction I like, I'll let it train for a few days. (The David Attenborough model is a result of this.)
I still have a ton of refinement to do. I'm also working on singing models, and these should be ready by the weekend.
I've thought about buying beefy GPUs at this point, as I've proven to myself this isn't just a temporary hobby. Cloud compute is expensive.
The models are hosted on Rust microservices (a frontend proxy that fans out to multiple model servers), deployed to a Kubernetes cluster. I'm planning to add more intelligence to the proxy and the individual model containers so they can scale independently.
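In case the shape of that proxy layer is useful to anyone: here's a minimal sketch of the fan-out pattern, not my actual code. It assumes axum + reqwest (my framework choices aren't stated above), and the speaker names, ports, and service hostnames are all made up for illustration. The proxy just routes each synthesis request to whichever model server hosts the requested speaker and relays the audio back.

```rust
// Assumed deps: axum = "0.6", reqwest = "0.11",
// tokio = { version = "1", features = ["full"] }
use std::{collections::HashMap, sync::Arc};

use axum::{
    body::Bytes,
    extract::{Path, State},
    http::StatusCode,
    routing::post,
    Router,
};

#[derive(Clone)]
struct AppState {
    // Maps a speaker name to the base URL of the model server hosting it.
    // In a k8s deployment these would be cluster-internal service addresses.
    backends: Arc<HashMap<&'static str, &'static str>>,
    client: reqwest::Client,
}

async fn synthesize(
    Path(speaker): Path<String>,
    State(state): State<AppState>,
    text: Bytes, // request body: the text to synthesize
) -> Result<Vec<u8>, (StatusCode, String)> {
    // Look up which model server hosts the requested speaker.
    let backend = state
        .backends
        .get(speaker.as_str())
        .ok_or_else(|| (StatusCode::NOT_FOUND, format!("unknown speaker: {speaker}")))?;

    // Forward the text to that server and relay its audio response.
    let resp = state
        .client
        .post(format!("{backend}/synthesize"))
        .body(text)
        .send()
        .await
        .map_err(|e| (StatusCode::BAD_GATEWAY, e.to_string()))?;

    let audio = resp
        .bytes()
        .await
        .map_err(|e| (StatusCode::BAD_GATEWAY, e.to_string()))?;
    Ok(audio.to_vec())
}

#[tokio::main]
async fn main() {
    // Hypothetical speaker -> model-server mapping. Each backend is its own
    // container, which is what makes independent scaling possible later.
    let backends: HashMap<_, _> = [
        ("attenborough", "http://attenborough-model:8000"),
        ("singer", "http://singer-model:8000"),
    ]
    .into_iter()
    .collect();

    let state = AppState {
        backends: Arc::new(backends),
        client: reqwest::Client::new(),
    };

    let app = Router::new()
        .route("/synthesize/:speaker", post(synthesize))
        .with_state(state);

    axum::Server::bind(&"0.0.0.0:8080".parse().unwrap())
        .serve(app.into_make_service())
        .await
        .unwrap();
}
```

The "more intelligence" part would live in that lookup step: instead of a static map, the proxy could watch per-backend queue depth or latency and make routing/scaling decisions from there.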