Your service looks interesting, but I think you need to be more transparent about your infrastructure. Which "inference providers" do you proxy to, and when? Who is hosting the GPU clusters?
Also, a privacy policy and ToS document are pretty important, even at this stage.
1. If Together.ai hosts the model, we proxy to them, since they're faster than us. We might switch to Fireworks for the Llama-3.1 models because they offer them at a lower cost; that's the main reason I didn't name specific inference providers, since we'll probably change and optimize that mix quite a bit. (Groq would also be interesting to try, since they're so fast.)
2. If the model isn't hosted anywhere else, which is the case for a lot of the Llama 3 finetunes, we run it on our own GPU clusters hosted on Fly.io. This will probably change in the future as well, since some models would really benefit from NVLink (which Fly doesn't currently support). There's a rough sketch of the routing logic below.
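Roughly, the routing boils down to something like this. This is a minimal sketch, not our actual implementation: the provider base URLs, the model-to-provider table, and the route_completion helper are all illustrative placeholders.

    # Sketch of the routing described above. Provider endpoints, model names,
    # and the self-hosted URL are assumptions for illustration only.
    import requests

    # Models a third-party provider already hosts get proxied there; everything
    # else (e.g. custom Llama 3 finetunes) falls back to our own Fly.io GPUs.
    PROVIDER_BASE_URLS = {
        "together": "https://api.together.xyz/v1",              # assumed endpoint
        "self_hosted": "https://inference.example.fly.dev/v1",  # hypothetical
    }

    PROXIED_MODELS = {
        # illustrative mapping, not our real catalogue
        "meta-llama/Meta-Llama-3.1-70B-Instruct": "together",
    }

    def route_completion(model: str, prompt: str, api_keys: dict) -> dict:
        """Pick an upstream for a completion request and forward it."""
        provider = PROXIED_MODELS.get(model, "self_hosted")
        base_url = PROVIDER_BASE_URLS[provider]
        resp = requests.post(
            f"{base_url}/completions",
            headers={"Authorization": f"Bearer {api_keys[provider]}"},
            json={"model": model, "prompt": prompt, "max_tokens": 256},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()

The real thing adds retries and provider-specific quirks, but the decision itself really is just "is this model hosted upstream or not".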