vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention (vllm.ai)
295 points by wskwon on June 20, 2023 | 42 comments



This is really cool to see.

> Large: Takes up to 1.7GB for a single sequence in LLaMA-13B.

> Dynamic: Its size depends on the sequence length, which is highly variable and unpredictable. As a result, efficiently managing the KV cache presents a significant challenge. We find that existing systems waste 60% – 80% of memory due to fragmentation and over-reservation.

This mentions improvements in throughput, which is great, and it mentions memory savings. I'm a bit confused about how 80% of the memory could be wasted by the KV cache when the vast majority of the memory is usually holding the model itself?

How much memory savings does this translate to effectively for say a 30B 4bit model?
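As a rough sanity check, the KV cache arithmetic works out like this (a sketch using the published LLaMA dimensions; the 30B figures here are my own estimate, not from the post):

    # Back-of-the-envelope KV cache sizing in fp16 (2 bytes per element).
    # Each layer stores one key and one value vector of hidden_size per token.
    def kv_cache_bytes(n_layers, hidden_size, seq_len, bytes_per_elem=2):
        return 2 * n_layers * hidden_size * seq_len * bytes_per_elem

    # LLaMA-13B: 40 layers, hidden size 5120, 2048-token context
    print(kv_cache_bytes(40, 5120, 2048) / 1e9)   # ~1.7 GB, matching the quote above

    # LLaMA-30B: 60 layers, hidden size 6656
    print(kv_cache_bytes(60, 6656, 2048) / 1e9)   # ~3.3 GB per full-length sequence

Note that 4-bit quantization usually applies to the weights only; the KV cache is typically still kept in fp16, so the cache becomes an even larger fraction of total memory as the weights shrink.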


This really depends on what GPUs you use. If your GPUs have a very small amount of memory, vLLM will help more.

vLLM addresses the memory bottleneck for saving KV caches and hence increases the throughput.


Ion Stoica's lab continues to be a powerhouse of innovation. Previous successes of Stoica and his students include (but are certainly not limited to) Apache Spark, Ray, Apache Mesos and Alluxio.


I wonder how this compares to Flash Attention (https://github.com/HazyResearch/flash-attention), which is the other "memory aware" Attention project I'm aware of.

I guess FlashAttention is more about utilizing GPU SRAM correctly, whereas this is more about managing memory the way an OS/CPU does?


I think they are orthogonal.

FlashAttention is just another way to compute exact attention.

This work mainly concerns how to resolve memory fragmentation across different sequences.

You still need to compute attention as usual once you retrieve the needed keys/values.
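Conceptually it's like OS paging for the KV cache: each sequence keeps a block table mapping logical KV blocks to physical blocks that are allocated on demand, so blocks from different sequences can interleave without fragmenting memory. A minimal sketch of that bookkeeping (illustrative names and block size, not vLLM's actual data structures):

    BLOCK_SIZE = 16  # tokens per KV block (illustrative)

    class BlockAllocator:
        def __init__(self, num_physical_blocks):
            self.free = list(range(num_physical_blocks))

        def alloc(self):
            return self.free.pop()       # any free physical block will do

        def free_block(self, block_id):
            self.free.append(block_id)   # blocks are interchangeable, so no fragmentation

    class Sequence:
        def __init__(self, allocator):
            self.allocator = allocator
            self.block_table = []        # logical block index -> physical block id
            self.num_tokens = 0

        def append_token(self):
            # Allocate a new physical block only when the current one fills up,
            # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence.
            if self.num_tokens % BLOCK_SIZE == 0:
                self.block_table.append(self.allocator.alloc())
            self.num_tokens += 1

    alloc = BlockAllocator(num_physical_blocks=1024)
    seq = Sequence(alloc)
    for _ in range(40):                  # 40 generated tokens -> only 3 blocks allocated
        seq.append_token()
    print(seq.block_table)               # e.g. [1023, 1022, 1021]

The attention kernel then gathers keys/values through the block table, which is why it has to be block-aware, as noted above.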


Thanks for the explanation! I believe the two ideas are basically orthogonal. FlashAttention reduces memory read/writes, while PagedAttention reduces memory waste.


The ideas are orthogonal, and can be used (theoretically) at the same time.


I believe you can slightly change the FlashAttention kernel to implement this PagedAttention kernel, since both of them work on the key/value cache at the block level.


Reading between the lines, it sounds like some of the speedup comes from VRAM savings on an otherwise close to full GPU?

This is definitely cool and needed, but it might not be so dramatic running a 3-5 bit quant on a less full GPU.


Yes, vLLM focuses on maximizing throughput when the VRAM is fully utilized. Nevertheless, I believe users can still benefit from vLLM even if they don't utilize the memory to its full capacity, because vLLM also includes other optimizations orthogonal to the PagedAttention (e.g., optimized CUDA kernels).


Does this mean that GPT-4/65B-level performance is closer to running on, say, an M1/M2 with only 24+ gigabytes of RAM?


Nope. You will still need a proper GPU. You can't yet run large language models on tiny hardware like an m1/m2. Even the llama.cpp magic is only possible with very small models at beam size 1, which really limits the "creativity" of these models.


We run into this constantly with Willow[0] and the Willow Inference Server[1]. There seems to be a large gap in understanding with many users: they find it difficult to accept a fundamental reality, which is that GPUs are so physically different and so much better suited to many/most ML tasks that all the CPU tricks in the world cannot bring CPUs even close to GPU performance (while maintaining quality/functionality). I find this interesting because everyone seems to take it as obvious that integrated graphics vs discrete graphics for gaming aren't even close. Ditto for these tasks.

With the Willow Inference Server I'm constantly telling people: a six-year-old $100 Tesla P4/GTX 1070 walks all over even the best CPUs in the world for our primary task of speech-to-text/ASR, at dramatically lower cost and power usage. Seriously, a GTX 1070 is at least 5x faster than a Threadripper 5955WX. Our goal is to provide an open-source user experience equivalent to commercial voice assistants, and that is and will remain fundamentally impossible on CPU for the foreseeable future.

Slight tangent, but there are users in the space who seem to be under the impression that they can use their Raspberry Pi for voice assistant/speech recognition. It's not even close to a fair fight: with the same implementation and settings a GTX 1070 is roughly 90x (nearly two orders of magnitude) faster[2] than a Raspberry Pi. Yes, all-in, a machine with a GTX 1070 uses an order of magnitude more power (3W vs 30W) than a Raspberry Pi, but even in countries with the most expensive power in the world that works out to a $2-$3/mo cost difference, which I feel, at least, is a reasonable trade-off considering the dramatic difference in usability (the Raspberry Pi is essentially useless: waiting 10-30 seconds for a response makes pulling your phone out faster).

[0] - https://github.com/toverainc/willow

[1] - https://github.com/toverainc/willow-inference-server

[2] - https://github.com/toverainc/willow-inference-server/tree/wi...
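For context, the arithmetic behind that cost difference looks roughly like this (the electricity rate and 24/7 duty cycle are assumptions on my part, so treat it as a ballpark):

    # Ballpark monthly electricity cost difference (rate and duty cycle are assumptions).
    delta_watts = 30 - 3                          # GTX 1070 box vs Raspberry Pi, per above
    kwh_per_month = delta_watts * 24 * 30 / 1000  # ~19.4 kWh/month if it runs 24/7
    for rate_usd_per_kwh in (0.10, 0.15):         # assumed residential rates
        print(f"~${kwh_per_month * rate_usd_per_kwh:.2f}/mo")   # ~$1.94 and ~$2.92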


I am in fact running my own instance of the Willow Inference Server (née air-infer-api) against a Tesla P4 8GB gifted to me by our mutual friend Richard. It works wonderfully, up to (IIRC) 3 chunks of audio. We really need to implement streaming so I can use it to closed-caption videos without subtitles.

For others in this thread, if you haven't tried Willow yet, check it out, as it is an amazing leap forward and can actually run on some pretty small GPUs. LLMs are hogging the AI spotlight but you will struggle to run them on consumer hardware. Image and audio processing models are generally much smaller and more approachable.


Not really. vLLM optimizes the throughput of your LLM, but does not reduce the minimum required amount of resource to run your model.


But (in theory) llama.cpp could implement a similar approach to paging/memory management and see a speedup for 4-bit models on CPU?


Semi-related question: this page is full of little charts and diagrams. There are thousands of similar projects/sites/experiment sites with their own charts and diagrams. But it seems like there are always subtle-to-large differences in them that indicate they're made with totally different libraries.

Are there just thousands of homebrewed, non-standard chart and diagram builders out there? How does one even begin to pick a standard to whip out quickies like these? Google SEO makes it virtually impossible to get to substance.


I often see charts produced using matplotlib or plotly; often you can tell based on the colour schemes used. For example, the bar chart at the bottom of this post looks like it was made with plotly. I think the variance in the style of charts is largely due to the flexibility frameworks such as matplotlib provide: you can control basically every aspect of a chart and use any number of predefined or custom stylesheets to change the look and feel.
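If you just want a quick comparison chart in that familiar matplotlib look, the defaults get you most of the way there (a minimal sketch; the numbers are made up purely for illustration):

    import matplotlib.pyplot as plt

    # Quick bar chart in matplotlib's default style, the kind of
    # throughput comparison you see in posts like this one.
    systems = ["System A", "System B", "System C"]
    relative_throughput = [1.0, 2.0, 5.0]  # made-up numbers

    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(systems, relative_throughput, color=["#1f77b4", "#ff7f0e", "#2ca02c"])
    ax.set_ylabel("Relative throughput (x)")
    ax.set_title("Example serving throughput")
    fig.tight_layout()
    plt.show()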


We used matplotlib for the performance charts, and a free website to convert Google Slides to the animated GIFs.


Which "free website"?


The color scheme on these implies Google Drawing, but I don't know how they made them into animations - maybe just manually?


Google slides I think.


Now do the same for image classifiers. I tried a few of them; they're just horribly slow.

This is pretty outrageous considering the first robust image classifiers appeared around 2007.


Doesn't work on image classifiers, because there's no KV cache. Also, standard image classifiers can do 100-1000 images/sec without any optimizations.


It's not really fast if it makes intense use of a GPU.


...what?


Pretty cool stuff, and the results are amazing. Hoping we will see virtual memory get standardized in PyTorch or CUDA.


Is there a hosted demo available?

What are the use cases for which open-source models are equivalent to GPT-3.5?


You can think of LMSYS Vicuna: https://chat.lmsys.org as our hosted demo, as it actually uses vLLM as the backend.


I'm spoiled by 4-bit and unfortunately it doesn't appear to be supported here, so this isn't of much use to me, but it's awesome to see people working on the inference speed side of things regardless.


This approach to managing the KV cache can work with 4-bit. Imagine the speedup of PagedAttention with quantization...


yep, it is agonistic to 4-bit. You can deploy a 4-bit model and still use vllm + pagedattention to double or even triple your serving throughput.


If this were submitted as a new comment it would be at the top of the page.


You mean like, theoretically, in the future? Or you mean today?


Probably meant agnostic; agonistic implies the opposite.


oops typo


Now just waiting for a timing attack paper where you can see or guess someone else's conversation that is hosted in the same data center :-).

Or maybe you typically get dedicated machine time during inference?


vLLM has been adopted by LMSYS for serving Vicuna and Chatbot Arena.


Cool, I prefer the OpenAI-compatible API. Although this is not very technically difficult, it is really thoughtful, because it lets me freely use all the existing ChatGPT applications.
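For anyone curious what that looks like in practice, the idea is to point an existing OpenAI client at the vLLM server instead of api.openai.com, roughly along these lines (the launch command and model name here are just an example setup, so treat it as a sketch):

    import openai  # the 0.x OpenAI Python client

    # Assumes a vLLM OpenAI-compatible server is already running locally, e.g.:
    #   python -m vllm.entrypoints.openai.api_server --model lmsys/vicuna-13b-v1.3
    openai.api_key = "EMPTY"                      # placeholder; the local server doesn't need a real key
    openai.api_base = "http://localhost:8000/v1"  # point the client at vLLM

    completion = openai.Completion.create(
        model="lmsys/vicuna-13b-v1.3",
        prompt="San Francisco is a",
        max_tokens=32,
    )
    print(completion.choices[0].text)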


Thanks! Please try it out and share any feedback you might have.


I wonder if this sort of memory management could be added to PyTorch transformers as an under-the-hood optimization.


Parallelization, paging! What will those AI/ML PhDs think of next!



