I hope this is only a slight tangent: since the authors talk about their model-serving throughput, I'm hoping to get a gut-check on my understanding of the state of the art in model serving.
The success of ChatGPT and my current work has had me thinking a lot about the "product" applications of large language models. I work at Pulumi on www.pulumi.com/ai; it's a GPT-3.5 and GPT-4 interface using retrieval augmented generation to generate Pulumi programs, and user experience is top of mind for me.
(Fingers crossed this doesn't hug our site to death here for the reasons I'm about to explain.)
To be blunt: I have found it surprisingly difficult to find the right tools to host models without dramatically worsening the UX. In theory we should be able to fine-tune a model against our own SDKs and synthetically generated code to improve the model's output and to guard against hallucination when retrieval fails. In practice, self-hosted model serving APIs have really poor time-to-first-token or even completely lack streaming behavior. It's a non-starter to build a product on something where a user has to sit and watch a spinner for a minute or more. I've been looking at the vLLM project with great interest, but haven't found much else.
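To make concrete what I mean by streaming: this is roughly the consumption pattern I'd want from a self-hosted model, sketched against vLLM's OpenAI-compatible server. The port, model name, and prompt below are placeholders for illustration, not anything we actually run:

```python
# Minimal sketch: streaming completions from a self-hosted vLLM
# OpenAI-compatible server (started with something like
#   python -m vllm.entrypoints.openai.api_server --model <your-model>).
# The base_url and model name are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.completions.create(
    model="my-finetuned-code-model",  # hypothetical fine-tuned checkpoint
    prompt="import pulumi\n# create an S3 bucket\n",
    max_tokens=256,
    stream=True,  # tokens arrive as they are generated
)
for chunk in stream:
    # Print each token as it arrives so the user sees progress
    # instead of staring at a spinner until the full completion lands.
    print(chunk.choices[0].text, end="", flush=True)
```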
---
For folks in MLops, deploying models with streaming APIs:
1. Is it mostly accurate that none of the model serving tools created prior to ChatGPT are great for streaming, interactive use cases?
2. How are you currently serving these models as an API and what upcoming tools are you exploring?
For the authors: How does your inference optimization compare to vLLM, or other tools using techniques such as continuous batching and paged attention?
IME the streaming API in text-generation-inference works fine in production (though some of the other solutions may be better). I've used it with Starcoder (15B), and the time-to-first-token and tokens-per-second both seem quite reasonable out of the box.
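For anyone who wants to try it, this is roughly the pattern I mean, using the `text-generation` client that ships alongside TGI. It assumes a TGI instance is already serving a model locally; the address and parameters are placeholders:

```python
# Minimal sketch: token streaming from a local text-generation-inference
# server (e.g. one serving Starcoder). Endpoint and parameters are
# placeholders; adjust for your own deployment.
from text_generation import Client

client = Client("http://127.0.0.1:8080")

for response in client.generate_stream("def fibonacci(n):", max_new_tokens=128):
    # Skip special tokens (EOS etc.) and emit text as it is generated.
    if not response.token.special:
        print(response.token.text, end="", flush=True)
```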