Your service looks interesting, but I think you need to be more transparent about your infrastructure. Which "inference providers" do you proxy to, and when? Who is hosting the GPU clusters?
Also, a privacy policy and ToS document are pretty important, even at this stage.
1. If Together.ai hosts the model, we proxy to them, since they're faster than us. We might switch to Fireworks for the Llama-3.1 models because they offer them at a lower cost; that's the main reason I didn't name specific inference providers, since we'll probably change and optimize that mix quite a bit. (Groq would also be interesting to try, since they're so fast.)
2. If the model isn't hosted anywhere else, which is the case for a lot of the Llama 3 finetunes, we run it on our own GPU clusters hosted on Fly.io. This will probably change in the future as well, since some models would really benefit from NVLink (which Fly doesn't currently support). There's a rough sketch of the routing logic below.
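Roughly, the routing boils down to something like this. This is a minimal sketch, not our actual implementation: the provider base URLs, the model-to-provider table, and the route_completion helper are all illustrative placeholders.

    # Sketch of the routing described above. Provider endpoints, model names,
    # and the self-hosted URL are assumptions for illustration only.
    import requests

    # Models a third-party provider already hosts get proxied there; everything
    # else (e.g. custom Llama 3 finetunes) falls back to our own Fly.io GPUs.
    PROVIDER_BASE_URLS = {
        "together": "https://api.together.xyz/v1",              # assumed endpoint
        "self_hosted": "https://inference.example.fly.dev/v1",  # hypothetical
    }

    PROXIED_MODELS = {
        # illustrative mapping, not our real catalogue
        "meta-llama/Meta-Llama-3.1-70B-Instruct": "together",
    }

    def route_completion(model: str, prompt: str, api_keys: dict) -> dict:
        """Pick an upstream for a completion request and forward it."""
        provider = PROXIED_MODELS.get(model, "self_hosted")
        base_url = PROVIDER_BASE_URLS[provider]
        resp = requests.post(
            f"{base_url}/completions",
            headers={"Authorization": f"Bearer {api_keys[provider]}"},
            json={"model": model, "prompt": prompt, "max_tokens": 256},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()

The real thing adds retries and provider-specific quirks, but the decision itself really is just "is this model hosted upstream or not".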