Persimmon-8B (adept.ai)
175 points by jgershen on Sept 7, 2023 | 56 comments



I hope this is only a slight tangent: since the authors talk about their model serving throughput, I'm hoping to get a gut check on my understanding of the state of the art in model serving.

The success of ChatGPT and my current work have had me thinking a lot about the "product" applications of large language models. I work at Pulumi on www.pulumi.com/ai; it's a GPT-3.5 and GPT-4 interface that uses retrieval-augmented generation to generate Pulumi programs, and user experience is top of mind for me.

(Fingers crossed this doesn't hug our site to death here for the reasons I'm about to explain.)

To be blunt: I have found it surprisingly difficult to find the right tools to host models without dramatically worsening the UX. In theory we should be able to fine-tune a model against our own SDKs and synthetically generated code to improve the model's output and to guard against hallucination when retrieval fails. In practice, self-hosted model serving APIs have really poor time-to-first-token or even completely lack streaming behavior. It's a non-starter to build a product on something where a user has to sit and watch a spinner for a minute or more. I've been looking at the vLLM project with great interest, but haven't found much else.

---

For folks in MLOps deploying models with streaming APIs:

1. Is it mostly accurate that none of the model serving tools created prior to ChatGPT are great for streaming, interactive use cases?

2. How are you currently serving these models as an API and what upcoming tools are you exploring?

For the authors: How does your inference optimization compare to vLLM, or other tools using techniques such as continuous batching and paged attention?


This is the best comparison I've found that benchmarks the current OSS inference solutions: https://hamel.dev/notes/llm/inference/03_inference.html

IME the streaming API in text-generation-inference works fine in production (though some of the other solutions may be better). I've used it with StarCoder (15B), and the time-to-first-token / tokens per second both seem quite reasonable out of the box.
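
For anyone curious what that looks like in practice, the Python client that ships alongside text-generation-inference exposes a token-by-token stream. A rough sketch from memory (check the TGI docs for the current signature; the server URL is just a placeholder):

    # pip install text-generation
    from text_generation import Client

    client = Client("http://127.0.0.1:8080")  # wherever your TGI container is listening

    # Tokens are printed as they arrive, so time-to-first-token is what the user actually feels.
    for response in client.generate_stream("def fibonacci(n):", max_new_tokens=64):
        if not response.token.special:
            print(response.token.text, end="", flush=True)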


Two important takeaways on the base model:

* scored 18.9 on HumanEval (coding), whereas Llama 2 7B scored 12.2

* was trained from the beginning with a 16k context using a modified RoPE, whereas many models are trained at 4k and only later fine-tuned with RoPE scaling to gain longer context windows.

Can anyone share ideas on how important the second one is? Do LLMs benefit from large context windows with RoPE during pretraining?
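
For reference, here's a minimal sketch of vanilla RoPE in PyTorch. It's not Adept's modified variant (the post doesn't spell out what the modification is), but the `base` argument is the knob that context-extension fine-tunes typically adjust, whereas Persimmon apparently just trained at 16k from the start:

    import torch

    def rope_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
        # One inverse frequency per pair of channels; raising `base` (or rescaling
        # positions) is the usual trick for stretching the usable context window.
        return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

    def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
        # x: (seq_len, head_dim) slice of queries or keys for a single head.
        inv_freq = rope_frequencies(x.shape[-1], base)
        angles = positions[:, None].float() * inv_freq[None, :]  # (seq_len, head_dim // 2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin  # rotate each channel pair by its position-dependent angle
        out[..., 1::2] = x1 * sin + x2 * cos
        return out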


phi-1 supposedly does 50.6 on HumanEval with 1.3B parameters. (Python only) https://arxiv.org/abs/2306.11644

Weights haven't been released, though.


phi-1 is a code-specific base model, with further finetuning on top of that. This is a general language base model, not really comparable.


no code or dataset either for phi-1.


It's not so much about benefit as it is a design goal: they want large context windows.

https://twitter.com/suchenzang/status/1699926157028897078?s=... notes some issues with directly comparing the 16k context number. The odd choice of tokenizer means it's effectively like a 10-12k model (ballpark, not calculated).


That tweet had it backwards: a larger tokenizer vocabulary means the 16k-token context window typically covers even longer passages than a 16k LLaMA context would.


There's a correction to that tweet: a larger vocab means fewer tokens for any given sequence (usually, assuming the extra vocabulary isn't just there to cover other languages or character sets).
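
A quick way to sanity-check the effective-context argument is to count tokens for the same passage under both tokenizers. A minimal sketch with transformers, where the checkpoint names are placeholders for whichever Llama 2 and Persimmon tokenizers you actually have access to:

    from transformers import AutoTokenizer

    passage = open("some_long_document.txt").read()  # any representative text

    # Placeholder names; substitute the tokenizers you have locally.
    for name in ["meta-llama/Llama-2-7b-hf", "adept/persimmon-8b-base"]:
        tok = AutoTokenizer.from_pretrained(name)
        n = len(tok(passage)["input_ids"])
        print(f"{name}: {n} tokens -> a 16k window covers ~{16_384 / n:.1f}x this passage")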


> scored 18.9 on HumanEval (coding) where Llama2 7B scored 12.2

The article claims 18.9 for the base model, but also claims 20.7 for the fine-tuned model.


Awesome! I applaud everyone training new models and attempting different techniques!

I'm concerned about the current download's availability; it's two URLs to some object storage. I find that these go dark rather quickly for many different reasons (accidentally moving the files, bandwidth limits, deleting them later, etc.).

I'm curious if there's a reason it's not also hosted on Hugging Face? I'm not saying they're the best place, but redundancy is good, most models have entries there, they have a very good CDN, and it isn't as likely to go dark accidentally.


If this model can be made to work as GGUF, TheBloke will probably have a set of quantizations in a day or two at most.


We're working on it!


I applaud you guys for not including any nauseating gibberish in this press release or seemingly anywhere else on your website. It's like a breath of fresh air compared to every other AI-related resource I've seen recently. Please, keep it up.


This is the least detailed foundation-model release I have seen. The Llama paper offers a lot more detail, like ablations and loss curves; Falcon has data preparation details; Google's model release papers, like T5, are some of the best and include many ablations.


I mean "I am become death, destroyer of worlds" bullshit about AI safety/ethics/etc that is included in every press release from Google/Meta/OpenAI and even much smaller players.


Yes, but in many cases the paper is written by many people, including ethicists who believe that and add it to the paper. It doesn't diminish the value of the people who actually made it work.


Why are ablations useful? Their release report seemed very informative to me without getting bogged down in jargon.


Ablations are important because they tell you why the model is better. Here the model is of similar size to Llama 7B and was trained on a third of the data, yet the claim is that performance is better. That could be due to a lot of things: the squared ReLU, a better dataset, the 16k context. We just don't know why it performed better.


Congrats on the release! Two questions.

1) In the results table, Llama2 base is being compared to Persimmon base and finetuned, and only the latter performs better. Would a comparison to Llama2-chat be possible/fair?

2) The Llama-2 numbers for MMLU in that table seem different from those in the HF leaderboard and the Llama-2 webpage presentation. Is it the 1-shot variant that is different or are these measurements not 100% standard and reproducible?


Llama2 chat performs worse and wasn't included for that reason.

The numbers are different because the measurement is different. The blog post explains that we sample from the models and expect answers rather than relying on perplexity measurements.
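
For anyone wondering what that distinction means in practice, here's a rough sketch (not our actual harness, just an illustration) of the two scoring styles for a multiple-choice question, assuming a generic Hugging Face-style causal LM and tokenizer:

    import torch

    def score_by_likelihood(model, tokenizer, question, choices):
        # Harness-style scoring: pick the choice whose text the model assigns the
        # highest length-normalised log-probability. (A real harness would score
        # only the answer tokens, not the whole prompt; simplified here.)
        scores = []
        for choice in choices:
            ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids
            with torch.no_grad():
                logits = model(ids).logits[:, :-1]
            logprobs = torch.log_softmax(logits, dim=-1)
            token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
            scores.append(token_lp.mean().item())
        return choices[int(torch.tensor(scores).argmax())]

    def score_by_generation(model, tokenizer, question, choices):
        # Roughly what the blog post describes: actually sample an answer from the
        # model and check whether it names the correct choice.
        ids = tokenizer(question, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=8)
        text = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        return next((c for c in choices if c.lower() in text.lower()), None)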


Could you share results with the standard way of benchmarking (accuracy of the top selection)? The approach you took is reasonable, but it would be more informative to see how much better/worse the model is on the standard benchmark.


Really cool! Honestly, I wish these releases would come with a demo (like on Replicate or Hugging Face).


The Docker container fails to install flash-attn… but honestly, a giant API container on top of a custom model generation framework loses all the benefits of Torch's standard interfaces. It doesn't really matter how optimized your model runtime is if it's cemented into a synchronous monolith. The metric that should be optimized is time to first decoded token, because that is how speed is perceived by humans reading the output.


Can you share details of the build failure on the github? We'll try to help.

The inference code is shared as a proof of concept; it is not meant to be a production-ready deployment. Also worth noting that not all LLMs are used to produce text which is read by humans.


https://github.com/persimmon-ai-labs/adept-inference/issues/...

It’s funny you say production, because all of the errors I ran into suggest the container is expecting your production architecture.

My advice is to stream first, then build synchronous convenience wrappers on top of that. Also, lean on community standards for a PoC. I'm guessing your investors are interested in making this scale as cheaply as possible, but that is probably the least important feature for people evaluating your model's quality locally.
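
Concretely, "stream first" can be as simple as exposing an async token iterator and deriving the blocking call from it. A hand-wavy sketch, where the backend call is a stand-in for whatever engine you actually serve with:

    import asyncio
    from typing import AsyncIterator

    async def stream_tokens(prompt: str) -> AsyncIterator[str]:
        # Stand-in for a real backend (vLLM's async engine, a TGI SSE stream, etc.).
        for token in ["Hello", ",", " world", "!"]:
            await asyncio.sleep(0.05)  # fake per-token decode latency
            yield token

    def generate_blocking(prompt: str) -> str:
        # The synchronous convenience wrapper: just drain the stream.
        async def _collect() -> str:
            return "".join([tok async for tok in stream_tokens(prompt)])
        return asyncio.run(_collect())

    if __name__ == "__main__":
        print(generate_blocking("Say hello"))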


> The standard practice for achieving fast inference is to rewrite the entire model inference loop in C++, as in FasterTransformer, and call out to special fused kernels in CUDA. But this means that any changes to the model require painfully reimplementing every feature twice: once in Python / PyTorch in the training code and again in C++ in the inference codebase. We found this process too cumbersome and error prone to iterate quickly on the model.

I am an AI novice, but why can't they automate this with AI? I thought the whole point of these tools was to automate tasks that are error-prone and require lots of attention to detail. Computers are great at that kind of stuff, so it's surprising they haven't applied AI techniques to automate parts of the AI pipeline, like converting code from Python to C++.


Automatic kernel fusion (compilation) is a very active field, and most major frameworks support some easy-to-use compilation (e.g. JAX's jit, or torch.compile, which IIRC uses OpenAI's Triton under the hood). Often you can still do better than the compiler by writing fused kernels yourself (either in CUDA C++ or in something like Triton, Python that compiles down to CUDA), but compilers are getting pretty good.
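
A minimal illustration of that with torch.compile (whether the compiler actually fuses anything here depends on your PyTorch version and backend):

    import torch

    class TinyMLP(torch.nn.Module):
        def __init__(self, d: int = 1024):
            super().__init__()
            self.up = torch.nn.Linear(d, 4 * d)
            self.down = torch.nn.Linear(4 * d, d)

        def forward(self, x):
            # relu(x) ** 2 is the squared-ReLU activation mentioned elsewhere in the
            # thread; a compiler can fuse such pointwise ops into fewer kernels.
            return self.down(torch.relu(self.up(x)) ** 2)

    model = TinyMLP()
    compiled = torch.compile(model)  # falls back to eager for anything it can't handle
    x = torch.randn(8, 1024)
    print(torch.allclose(model(x), compiled(x), atol=1e-5))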

edit: not sure why op is getting downvotes, this is a very reasonable question imo; maybe the characterization of kernel compilation as "AI" vs. just "software"?


Both AI and compilers are just software, and right now the optimizers are written manually, which is kind of weird, because the whole point of LLMs is to generate sequences of tokens that minimize some scalar-valued loss function. In the case of compilers, the input is some high-level code in Python expressing tensor operations, and the output is whatever combination of kernels executes as fast as possible on GPUs while remaining formally equivalent to the tensor operations expressed in Python (or whatever higher-level language is used to write the tensor specifications to be optimized for the task at hand). Everything in this loop has a well-defined input, a well-defined output, an associated scalar-valued metric (execution time), and even a normalization factor (output length, with shorter sequences being "better").

The whole thing seems obviously amenable to gradient-based optimization and data augmentation with synthetic code generators. It is surprising that no one is pursuing such approaches to improving the kernel compilation/fusion/optimization pipeline, because it is just another symbol game with much better-defined metrics than natural language models.


thanks for explaining pretty concisely w/out being rude :)


Can someone explain the down votes? What exactly is incorrect in OPs comment?


I don't know why people downvote, but writing highly performant GPU code across multiple languages is still something only a few people with a lot of the right experience can do well. While AI can help assist those people, it's not a problem that can be fully solved by an AI at this moment; maybe in a few years, with a large feedback loop of iterating, testing, benchmarking, and repeating. I guess one day, but not now.


If the bottleneck is writing performant code, then it seems like that's the first thing AI companies should solve with AI. If that's solved, then building applications on top of that foundation is very easy. Are there any companies working on this problem?


Did you just create three new accounts to ask this series of questions? These questions, and account names, all seem to share similar patterns.


HN keeps throttling after 2 comments. Maybe they need more AI to detect bots instead of whatever algorithm they are using at the moment.

Back to the question at hand. How exactly do AI companies plan to build AGI if they can not optimize the AI development loop with their tools and techniques?


I would have to guess it has something to do with that task not actually being suitable for language models at their current stage. Even if they could be trusted to perform the task, it's actually not that much work to just... write code to handle keeping this kind of thing in sync. It's really, really not that much more work. You don't even need to do it; both training and inference can be done within PyTorch or in C++.

If it were necessary for some reason... running a language model to keep something like this in sync over long-term training and iteration would likely be more expensive than a developer's time AND would block the researcher in a verification loop whose output still probably needs to be checked by the developer (they could be the same person, which would just deepen the frustration they experience).

The use of a lot of garbage accounts in this thread and lack of model details also looks pretty shady...


I'm confused. If these tools aren't good enough for AI research, then why would they be good enough for consumer applications? If language models cannot help with the AI development loop, then the technology is not going to be useful for consumer use cases. Code can be very easily verified by linters and type systems, so the problem of verification is much simpler than in consumer use cases without linters and type systems.


>Code can be very easily verified by linters and type systems so the problem of verification is much simpler than in consumer use cases without linters and type systems.

You are confusing (syntactic) validation with verification. Verifying code is an incredibly hard problem.

You can get a lot of value out of a model even if it is not capable of AI development, because most people aren't doing things that are as complicated as AI development.


I don't think AI development is complicated. It's just a bunch of functions with parameters which are optimized with gradient descent. The AI development loop is extremely simple and most AI "research" is basically stacking standard tensor operations and seeing what works. It's surprising there is no company that is applying AI to AI development since it is essentially a game with symbols and very well defined measurable outcomes.


> If language models can not help with the AI development loop then the technology is not going to be useful for consumer use cases.

It quite literally is useful for consumer use cases, though.

For example, one consumer use case popular with a lot of students right now is cheating on their homework.

It is right now being used for all sorts of consumer things like that.

Also, if you have an opinion you can just say what your opinion is. You don't have to hide it behind questions and alt accounts.

You can just have an opinion and say it.


I don't have an opinion. I am legitimately surprised that very easy problems in AI research have not already been solved with some foundation model. Translation and optimization of code from one formal language to another seems like a very obvious application of AI, and yet most of the work is still done manually.


I don't think you read what I said, or you don't know what you're talking about. Language models are actively being researched to be useful, but as of right now _they are not ready, capable, or trustworthy enough_ to perform tasks without supervision. They're especially bad at complex code and code that doesn't have many examples in internet corpora, such as optimized CUDA kernels...


>The model has 70k unused embeddings for multimodal extensions,

Could someone briefly explain what this means? Multimodal as in pictures, but if those embeddings are unused, then presumably that part is untrained... so it wouldn't know what to do with the picture?


Yes, it wouldn't know what to do with the picture unless you fine-tune the model (which is why they are permissively releasing it).

The embeddings form the vocabulary of the model. The vocabulary "namespace" has 70k empty slots so you could introduce your own tokens and train on top of that, where token = some patch of multimodal data.
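
A hand-wavy sketch of what "filling the empty slots" could look like. The sizes here are made up for illustration (and deliberately tiny) rather than the real Persimmon config, and I'm assuming the free slots sit at the top of the id range:

    import torch

    # Illustrative sizes only; the real model is much wider.
    vocab_size, d_model, unused = 262_144, 64, 70_000
    first_free_id = vocab_size - unused

    embedding = torch.nn.Embedding(vocab_size, d_model)

    # To extend to another modality, assign its discrete codes (e.g. image-patch codes
    # from a VQ-style encoder) to ids in the unused range and fine-tune on top.
    new_token_ids = torch.arange(first_free_id, first_free_id + 1024)  # 1024 hypothetical patch codes
    patch_embeddings = embedding(new_token_ids)  # rows that are currently untrained
    print(patch_embeddings.shape)                # torch.Size([1024, 64])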


Gotcha. Thanks for explaining


Appreciate the release! Since you're hosting the downloads directly, I'd recommend publishing an integrity hash for each of the files alongside the download links so users can verify there wasn't any corruption in transfer.
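
On the user side, verification is then just a hash comparison. A small sketch (the filename is hypothetical):

    import hashlib

    def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
        # Stream the file in chunks so multi-GB checkpoints don't need to fit in memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # Compare against the hash published next to the download link:
    # assert sha256_of("persimmon-8b-base.tar") == "<published sha256>"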


Looks like someone there is reading HN as they just did it!


Good shout. Will be fixed soon.


Good. Keep it going. Let's have more free, $0 AI models being released, since we all know it is the future and you can't compete with free.

The AI race to zero must be accelerated with $0 free models and less control from gatekeepers such as ClosedAI


Do you have any explanations on why this performed better than Llama 2?


They did several things differently from Llama 2.

From my understanding, you'd have to repeat the experiment isolating each variable to see what difference each one actually makes, no?


Since it is coming from Adept, maybe they are building 8B models for UI automation: the inputs are usually large and the required latency is low. It's basically a task of information extraction and UI action generation.


What kind of use cases do these sub 10B param models serve? Are they mostly useful for code completion?


You can run them as-is for general-purpose inference, or you can fine-tune them and get improved performance for specific use cases.

It's safe to assume they're worse at every task than larger models, so I wouldn't look at use cases in terms of what tasks they can do compared to larger models.

But what's good about them is they're smaller so they can run on smaller and cheaper hardware. So an example would be to fine-tune and then run on some sort of local user device rather than in the cloud. This might become more practical in the future as hardware improves.


Yeah, my point is more: are smaller models ever "smart" enough to perform useful tasks?

Perhaps for basic code completion and simple writing tasks?


Say you had very vertically trained models, such that you had something like 1,000 separate LLMs trained on specialized data, and then other LLMs trained on which LLM is most likely to have the data you need; sort of like the way Wikipedia is interlinked, or hierarchical, or essentially like a DB index over nested LLMs. Performance would scale higher with many more highly focused models; at least that's my understanding of what's possible here.



