I hope this is only a slight tangent: since the authors talk about their model serving throughput, I'm hoping to get a gut check on my understanding of the state of the art in model serving.
The success of ChatGPT and my current work has had me thinking a lot about the "product" applications of large language models. I work at Pulumi on www.pulumi.com/ai; it's a GPT-3.5 and GPT-4 interface using retrieval augmented generation to generate Pulumi programs, and user experience is top of mind for me.
(Fingers crossed this doesn't hug our site to death here for the reasons I'm about to explain.)
To be blunt: I have found it surprisingly difficult to find the right tools to host models without dramatically worsening the UX. In theory we should be able to fine-tune a model against our own SDKs and synthetically generated code to improve the model's output and to guard against hallucination when retrieval fails. In practice, self-hosted model serving APIs have really poor time-to-first-token or even completely lack streaming behavior. It's a non-starter to build a product on something where a user has to sit and watch a spinner for a minute or more. I've been looking at the vLLM project with great interest, but haven't found much else.
---
For folks in MLops, deploying models with streaming APIs:
1. Is it mostly accurate that none of the model serving tools created prior to ChatGPT are great for streaming, interactive use cases?
2. How are you currently serving these models as an API and what upcoming tools are you exploring?
For the authors: How does your inference optimization compare to vLLM, or other tools using techniques such as continuous batching and paged attention?
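For concreteness, this is roughly how I sanity-check time-to-first-token today against the hosted streaming API we use in production (a minimal sketch assuming the pre-1.0 openai Python client; the model name and prompt are just placeholders). I'd point the same harness at any self-hosted server before putting it in front of users:

```python
import time
import openai  # pre-1.0 client, e.g. pip install "openai<1"

def measure_ttft(prompt: str, model: str = "gpt-3.5-turbo") -> float:
    """Return seconds until the first content token arrives from a streaming completion."""
    start = time.perf_counter()
    stream = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # yield chunks as tokens are generated
    )
    for chunk in stream:
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:  # skip the initial role-only chunk
            return time.perf_counter() - start
    return float("nan")

print(f"TTFT: {measure_ttft('Write a Pulumi program that creates an S3 bucket.'):.2f}s")
```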
IME the streaming API in text-generation-inference works fine in production. (Though some of the other solutions may be better). I've used it with Starcoder (15B) and the time-to-first-token / tokens per second all seem quite reasonable out of the box.
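For reference, consuming TGI's streaming endpoint looks roughly like this (a sketch assuming a server on localhost:8080 and the `/generate_stream` SSE route; double-check the request schema against the TGI docs for your version):

```python
import json
import time
import requests

def stream_tokens(prompt: str, url: str = "http://localhost:8080/generate_stream"):
    """Yield token strings from a text-generation-inference server as SSE events arrive."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 200}}
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line and line.startswith(b"data:"):
                event = json.loads(line[len(b"data:"):])
                yield event["token"]["text"]

start = time.perf_counter()
for i, tok in enumerate(stream_tokens("def fibonacci(n):")):
    if i == 0:
        print(f"[first token after {time.perf_counter() - start:.2f}s]")
    print(tok, end="", flush=True)
```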
* scored 18.9 on HumanEval (coding) where Llama2 7B scored 12.2
* was trained from the beginning with a 16k context using a modified RoPE, whereas many models are simply fine-tuned using RoPE to gain longer context windows after the base model has been trained at 4k.
Can anyone share ideas on how important the 2nd one is? Do LLMs benefit from large context windows using RoPE during pretraining?
That tweet had it backwards: more tokens in the tokenizer means the 16k-token context window typically fits even longer passages than LLaMA would at 16k.
There's a correction to that tweet: a larger vocab means fewer tokens for any given sequence (usually, assuming the extra vocab isn't just there to add other languages or character sets).
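If you want to see the effect yourself, a quick sketch: tokenize the same passage with two different tokenizers and compare the counts (the model IDs below are arbitrary examples with different vocab sizes, not a claim about Persimmon's tokenizer):

```python
from transformers import AutoTokenizer

text = "Rotary position embeddings let the model attend over long passages of text."

# Any two tokenizers with different vocab sizes will do for the comparison.
for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok(text, add_special_tokens=False)["input_ids"]
    print(f"{name}: vocab_size={tok.vocab_size}, tokens={len(ids)}")
```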
Awesome! I applaud everyone training new models and attempting different techniques!
I'm concerned about the current download's availability - it's two URLs to some object storage. I find that these go dark rather quickly for many different reasons (files getting moved accidentally, bandwidth limits, later deletion, etc.).
I'm curious if there's a reason it's not also hosted on Hugging Face? I'm not saying they're the best place, but redundancy is good, most models have entries there, they have a very good CDN, and it isn't as likely to go dark accidentally.
I applaud you guys for not including any nauseating gibberish in this press release or seemingly anywhere else on your website. It's like a breath of fresh air compared to every other AI-related resource I've seen recently. Please, keep it up.
This is the least detailed foundational model release I have seen. The Llama paper offers a lot more detail, like ablations, loss curves, etc. Falcon has data preparation details. Google's model release papers, like T5's, are some of the best and include many ablations.
I mean "I am become death, destroyer of worlds" bullshit about AI safety/ethics/etc that is included in every press release from Google/Meta/OpenAI and even much smaller players.
Yes, but in many cases the paper is written by many people, including ethicists who believe that and add it to the paper. It doesn't diminish the value of the people who actually made it work.
Ablations are important because they tell you why the model is better. Here the model is a similar size to Llama 7B and was trained on 1/3rd the dataset, yet the claim is that performance is better. That could be due to a lot of things, like ReLU-squared, a better dataset, or the 16k token context. We just don't know why it performed better.
1) In the results table, Llama2 base is being compared to Persimmon base and finetuned, and only the latter performs better. Would a comparison to Llama2-chat be possible/fair?
2) The Llama-2 numbers for MMLU in that table seem different from those in the HF leaderboard and the Llama-2 webpage presentation. Is it the 1-shot variant that is different or are these measurements not 100% standard and reproducible?
Llama2 chat performs worse and wasn't included for that reason.
The numbers are different because the measurement is different. The blog post explains that we sample from the models and expect answers rather than relying on perplexity measurements.
Could you share results with the standard way of benchmarking (accuracy of the top selection)? While the approach you guys took is reasonable, it would be more informative to see how much better/worse it is on the standard benchmark.
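For anyone unfamiliar, the "standard" scoring being asked about is likelihood-based: rank the answer options by the log-probability the model assigns them and take the top one, rather than sampling free-form text. A rough sketch of that (the model ID and question are placeholders; real harnesses such as lm-evaluation-harness also handle prompt formats, few-shot examples, and length normalization):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; substitute the model under evaluation
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

question = "Q: What is the capital of France?\nA:"
choices = [" Paris", " London", " Berlin", " Madrid"]

def answer_logprob(question: str, answer: str) -> float:
    """Sum of log-probs the model assigns to the answer tokens, conditioned on the question."""
    q_ids = tok(question, return_tensors="pt").input_ids
    a_ids = tok(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([q_ids, a_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i+1, so slice the positions that predict the answer.
    answer_logits = logits[0, q_ids.shape[1] - 1 : -1]
    logprobs = torch.log_softmax(answer_logits, dim=-1)
    return logprobs.gather(1, a_ids[0].unsqueeze(1)).sum().item()

scores = {c: answer_logprob(question, c) for c in choices}
print(max(scores, key=scores.get))
```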
The docker container fails installing flash-attn… but honestly a giant API container on top of a custom model generation framework loses all the benefits of Torch’s standard interfaces. It doesn’t really matter how optimized your model runtime is if it’s cemented into a synchronous monolith. The metric that should be optimized is time to first decoded token, because that is how speed is perceived by humans reading the output.
Can you share details of the build failure on the github? We'll try to help.
The inference code is shared as a proof of concept; it is not meant to be a production-ready deployment. Also worth noting that not all LLMs are used to produce text which is read by humans.
It’s funny you say production, because all of the errors I ran into suggest the container is expecting your production architecture.
My advice is to stream first, then make synchronous convenience wrappers on top of that. Also, lean on community standards for the PoC. I'm guessing your investors are interested in making this scale as cheaply as possible, but that is probably the least important feature for people evaluating your model's quality locally.
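Concretely, the shape I mean is something like this (a toy sketch; the function names and the fake decode loop are hypothetical, not part of the release):

```python
from typing import Iterator

def _decode_incrementally(prompt: str) -> Iterator[str]:
    # Stand-in for the real decode loop; a real implementation would yield
    # tokens from the model runtime as they are sampled.
    for piece in ("def", " hello", "():", "\n", "    return", " 1"):
        yield piece

def generate_stream(prompt: str) -> Iterator[str]:
    """The primary API: yield decoded tokens as soon as they exist."""
    yield from _decode_incrementally(prompt)

def generate(prompt: str) -> str:
    """Synchronous convenience wrapper built on top of the stream."""
    return "".join(generate_stream(prompt))

# Callers that care about perceived latency consume generate_stream();
# everyone else calls generate() and never notices the difference.
print(generate("write hello()"))
```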
> The standard practice for achieving fast inference is to rewrite the entire model inference loop in C++, as in FasterTransformer, and call out to special fused kernels in CUDA. But this means that any changes to the model require painfully reimplementing every feature twice: once in Python / PyTorch in the training code and again in C++ in the inference codebase. We found this process too cumbersome and error prone to iterate quickly on the model.
I am an AI novice, but why can't they automate this with AI? I thought the whole point of these tools was to automate tasks that are error-prone and require lots of attention to detail. Computers are great at that kind of stuff, so it's surprising they haven't applied AI techniques to automate parts of the AI pipeline, like converting code from Python to C++.
Automatic kernel fusion (compilation) is a very active field, and most major frameworks support some easy-to-use compilation (e.g. JAX's jit, or torch.compile, which iirc uses OpenAI's Triton under the hood). Often you can still do better than the compiler by writing fused kernels yourself (either in CUDA C++ or in something like Triton, i.e. Python that compiles down to CUDA), but compilers are getting pretty good.
edit: not sure why op is getting downvotes, this is a very reasonable question imo; maybe the characterization of kernel compilation as "AI" vs. just "software"?
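To make the "easy-to-use compilation" point concrete, here's a toy example where torch.compile can fuse a couple of elementwise ops without any hand-written CUDA (just an illustration, not Adept's code; the activation happens to be the ReLU-squared mentioned elsewhere in the thread):

```python
import torch

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    # relu(x) ** 2: two elementwise ops the compiler can fuse into one kernel
    return torch.relu(x) ** 2

compiled = torch.compile(squared_relu)  # traces and compiles on first call

x = torch.randn(1024, 1024)
assert torch.allclose(compiled(x), squared_relu(x))
```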
Both AI and compilers are just software, and right now the optimizers are written manually, which is kind of weird because the whole point of LLMs is to generate sequences of tokens that minimize some scalar-valued loss function. In the case of compilers, the input is high-level code in Python expressing tensor operations, and the output is whatever executes on GPUs as fast as possible: a combination of kernels formally equivalent to the tensor operations expressed in Python (or whatever higher-level language is used to write the tensor specifications being optimized for the task at hand). Everything in this loop has a well-defined input, a well-defined output, an associated scalar-valued metric (execution time), and even a normalization factor (output length, with shorter sequences being "better").
The whole thing seems obviously amenable to gradient-based optimization and data augmentation with synthetic code generators. It is surprising that no one is pursuing such approaches to improving the kernel compilation/fusion/optimization pipeline, because it is just another symbol game with much better defined metrics than natural language models.
I don't know why people downvote, but writing highly performant GPU code across multiple languages is still something only a few people with a lot of the right experience can do well, and while AI can help assist those people, it's not a problem that can be fully solved by an AI at this moment. Maybe in a few years, with a large feedback loop of iterating, testing, benchmarking, and repeating. I guess one day, but not now.
If the bottleneck is writing performant code, then it seems like that's the first thing AI companies should solve with AI. If that's solved, then building applications on top of that foundation is very easy. Are there any companies working on this problem?
HN keeps throttling after 2 comments. Maybe they need more AI to detect bots instead of whatever algorithm they are using at the moment.
Back to the question at hand. How exactly do AI companies plan to build AGI if they can not optimize the AI development loop with their tools and techniques?
I would have to guess it has something to do with that task not actually being suitable for language models at their current stage. Even if they could be trusted to perform the task, it's actually not that much work to just... write code to keep this kind of thing in sync. It's really, really not that much more work. You really don't even need to do it; both training and inference can be done within PyTorch or in C++.
If it were necessary for some reason... Running a language model to keep something like this in sync over long-term training and iteration would likely be more expensive than a developer's time AND would block the researcher in a verification loop on output which still probably needs to be checked by the developer (they could be the same person, which will just deepen the frustration they experience).
The use of a lot of garbage accounts in this thread and the lack of model details also look pretty shady...
I'm confused. If these tools aren't good enough for AI research, then why would they be good enough for consumer applications? If language models cannot help with the AI development loop, then the technology is not going to be useful for consumer use cases. Code can be very easily verified by linters and type systems so the problem of verification is much simpler than in consumer use cases without linters and type systems.
>Code can be very easily verified by linters and type systems so the problem of verification is much simpler than in consumer use cases without linters and type systems.
You are confusing (syntactic) validation with verification. Verifying code is an incredibly hard problem.
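A tiny (hypothetical) illustration of the difference: the function below passes linters and a type checker, so it validates, but it is not verified, because it does the wrong thing.

```python
def smallest(values: list[int]) -> int:
    """Return the smallest value in the list."""
    return max(values)  # type-checks and lints cleanly; semantically wrong

print(smallest([3, 1, 2]))  # prints 3, not 1
```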
You can get a lot of value out of these models even if they are not capable of AI development, because most people aren't doing things that are as complicated as AI development.
I don't think AI development is complicated. It's just a bunch of functions with parameters which are optimized with gradient descent. The AI development loop is extremely simple and most AI "research" is basically stacking standard tensor operations and seeing what works. It's surprising there is no company that is applying AI to AI development since it is essentially a game with symbols and very well defined measurable outcomes.
I don't have an opinion. I am legitimately surprised that very easy problems in AI research have not already been solved with some foundation model. Translating and optimizing code from one formal language to another seems like a very obvious application of AI, and yet most of the work is still done manually.
I don't think you read what I said, or you don't know what you're talking about. Language models are actively being researched to be useful, but as of right now _they are not ready, capable, or trustworthy enough_ to perform tasks without supervision, they're especially bad at complex code and code that doesn't have many examples in the internet corpuses such as optimized CUDA kernels...
>The model has 70k unused embeddings for multimodal extensions,
Could someone briefly explain what this means? multimodal as in picture, but if unused then presumably that part is somehow untrained...so it wouldn't know what to do with the picture?
Yes, it wouldn't know what to do with the picture unless you fine-tune the model (which is why they are permissively releasing it).
The embeddings form the vocabulary of the model. The vocabulary "namespace" has 70k empty slots so you could introduce your own tokens and train on top of that, where token = some patch of multimodal data.
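In Hugging Face terms, "introduce your own tokens" looks roughly like the sketch below (gpt2 is just a stand-in model so the snippet runs; with Persimmon's reserved slots you would map new tokens into the unused embedding rows rather than resizing):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One placeholder token per image patch position, for example.
new_tokens = [f"<img_patch_{i}>" for i in range(16)]
num_added = tok.add_tokens(new_tokens)
model.resize_token_embeddings(len(tok))  # gives the new tokens trainable embeddings

ids = tok("<img_patch_0> <img_patch_1> describe the image", return_tensors="pt")
print(num_added, ids["input_ids"].shape)
```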
Appreciate the release! Since you're hosting the downloads directly, I'd recommend throwing an integrity hash for each of the files alongside the download links so users can verify there wasn't any corruption in transfer.
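Something as simple as a published sha256 per file would do; users could then verify with a few lines (the filename and hash below are placeholders):

```python
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through sha256 so large weights don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "..."  # hash published next to the download link
actual = sha256sum("model_weights.tar")
print("OK" if actual == expected else f"mismatch: {actual}")
```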
Since it is coming from Adept, maybe they are building 8B models for UI automation; the inputs are usually large and low latency is required. It's basically a task of information extraction and UI action generation.
You can run them as-is for general-purpose inference, or you can fine-tune them and get improved performance for specific use cases.
It's safe to assume they're worse at every task than larger models, so I wouldn't look at use cases in terms of what tasks they can do compared to larger models.
But what's good about them is they're smaller so they can run on smaller and cheaper hardware. So an example would be to fine-tune and then run on some sort of local user device rather than in the cloud. This might become more practical in the future as hardware improves.
Say you had very vertically trained models, such that you had ~1000 separate LLMs trained on specialized data, and then other LLMs trained on which LLM is most likely to have the data you need: sort of like the way Wikipedia is interlinked, or hierarchical, or essentially like a DB index over nested LLMs. Performance would scale higher with many more highly focused models; at least that's my understanding of what's possible here.
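Roughly the control flow I'm imagining, as a toy sketch (the router and specialists here are placeholder callables standing in for real models):

```python
from typing import Callable, Dict

Specialist = Callable[[str], str]

specialists: Dict[str, Specialist] = {
    "kubernetes": lambda q: f"[k8s model answers] {q}",
    "networking": lambda q: f"[networking model answers] {q}",
    "databases":  lambda q: f"[db model answers] {q}",
}

def route(question: str) -> str:
    """Stand-in for the 'index' LLM that picks which specialist to query."""
    # A real router would itself be a small model scoring each specialist;
    # keyword matching here is only to show the control flow.
    for topic in specialists:
        if topic in question.lower():
            return topic
    return "databases"  # arbitrary fallback

def answer(question: str) -> str:
    return specialists[route(question)](question)

print(answer("How do I roll out a deployment on kubernetes?"))
```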