Hacker News
Efficient Memory Management for Large Language Model Serving with PagedAttention (arxiv.org)
102 points by jmorgan on Sept 14, 2023 | 16 comments



Without understanding most of that paper, here's a question for someone who might know more: can PagedAttention make CPU inference faster too?


Only if the CPU is serving multiple users, maybe.

LLMs can't batch token generation for a single user. It's sequential: each token depends on the previous ones. In fact that's part of the paper: "dumb" batching leaves the GPU underutilized because responses aren't all the same length, so the batch ends up processing one sequence at a time toward the end.
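To put a rough number on the underutilization, here is a minimal sketch. It assumes a static batch that keeps running until its longest response finishes, and it counts a "slot" as one sequence position in one decode step. The function name and the lengths are made up for illustration.

```python
def static_batch_utilization(output_lengths):
    """Fraction of batch slots doing useful work when the whole batch
    runs until its longest sequence finishes (static batching)."""
    steps = max(output_lengths)                 # decode steps the batch runs for
    total_slots = steps * len(output_lengths)   # slots available over the run
    useful_slots = sum(output_lengths)          # slots that produce a real token
    return useful_slots / total_slots


# Three short responses batched with one long one (hypothetical lengths).
print(f"{static_batch_utilization([10, 20, 30, 100]):.2f}")  # 0.40
```

With these lengths, 60% of the slots are spent on sequences that have already finished.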


Just from the abstract, this is primarily for batched inference. Batched inference on GPUs gives an order-of-magnitude speed increase, so it's probably not something that usually makes sense to do on CPUs…


It's not only batching. It also works across different requests, as long as they share a prefix.

This basically enables KV cache reuse when prefixes match (from my shallow understanding of how the KV cache works).

I fail to see how this helps a locally deployed LLM, unless the chance of asking the same question, or questions with the same prefix, is high (like always starting with "please help me ...").
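A toy sketch of what prefix reuse could look like at the block level. This is an assumption-heavy illustration, not vLLM's actual code: the class name, the block size, and keying blocks by the full token prefix are all hypothetical choices.

```python
BLOCK_SIZE = 4  # tokens per KV block (hypothetical value)


class PrefixBlockCache:
    """Maps a token prefix to the physical block holding its last BLOCK_SIZE tokens."""

    def __init__(self):
        self.blocks = {}   # prefix tuple -> physical block id
        self.next_id = 0

    def lookup_or_allocate(self, tokens):
        """Return (block_ids, reused_count) for the full blocks of `tokens`."""
        block_ids, reused = [], 0
        full = len(tokens) - len(tokens) % BLOCK_SIZE
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            # A block's KV values depend on every token before it,
            # so the key is the whole prefix, not just the block's own tokens.
            key = tuple(tokens[:end])
            if key in self.blocks:
                reused += 1
            else:
                self.blocks[key] = self.next_id
                self.next_id += 1
            block_ids.append(self.blocks[key])
        return block_ids, reused


cache = PrefixBlockCache()
shared = list(range(8))  # stands in for a common 8-token prompt prefix
first, reused_first = cache.lookup_or_allocate(shared + [100, 101, 102, 103])
second, reused_second = cache.lookup_or_allocate(shared + [200, 201, 202, 203])
print(first, reused_first)    # [0, 1, 2] 0
print(second, reused_second)  # [0, 1, 3] 2
```

The second request only needs one new block; the two prefix blocks are shared with the first.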


You also have fine-tuned models for specific tasks that may see very similar inputs for a variety of outputs. Think an LLM trained on pulling out specific types of information, no matter where it was stored within the file. E.g. "find the date of the shipment for product# 5432" and then you pass in 10k json documents with a similar shape.


Yeah, but I was under the impression that, for the same prompt, implementations already share the KV cache. This area is so new that these obvious ideas might not be implemented as widely as I thought.


Maybe if you have a model with a large context window, you stuff a document in the prompt as a prefix, then ask a bunch of different questions about the document?


That would be pretty useful. I'm working on getting ChatGPT to classify a dataset, so I use the same big prompt for a bunch of different small texts and ask ChatGPT to generate the class label. Something like initializing the prompt state once sounds good. Basically you trade more memory usage for less processing time. Who knows, maybe OpenAI is doing such an optimization on their side.
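Back-of-the-envelope version of that trade-off, as a sketch. It only counts prompt tokens that have to be processed before generation starts; the function name and the token counts are hypothetical, and it says nothing about what any hosted API actually does.

```python
def prefill_tokens(prefix_len, suffix_lens, reuse_prefix):
    """Prompt tokens that must be processed across all requests."""
    if reuse_prefix:
        # The shared prefix is processed once and its KV cache stays in memory.
        return prefix_len + sum(suffix_lens)
    # Every request recomputes the shared prefix from scratch.
    return sum(prefix_len + s for s in suffix_lens)


# A 2000-token instruction prompt followed by 100 short texts of 50 tokens each.
texts = [50] * 100
print(prefill_tokens(2000, texts, reuse_prefix=False))  # 205000
print(prefill_tokens(2000, texts, reuse_prefix=True))   # 7000
```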


I might be wrong, but it looks like this could help with speculative decoding, which can already vastly improve inference speed?


Ah, I see. This isn't virtualizing the static weights but the variable-sized, data-dependent key-value caches. These caches are built up as you go through the sequence of tokens. Makes sense.

How doesn't paging worsen speed performance though? If you are making more trips to memory, then are you really just saving VRAM?

Also, I see that vLLM, which implements PagedAttention, also uses better scheduling? Wouldn't the speed improvements be coming from that instead? Don't put an expected short input and output in the same batch as a big input and big output?

What are the results of using only the sequence-length-based scheduling, without virtualization?


> How doesn't paging worsen speed performance though?

It does worsen the performance of the attention kernel, compared to kernels that take keys and values in a contiguous memory layout.

> Wouldn't the speed improvements be coming from that instead? Don't put an expected short input and output in the same batch as a big input and big output?

Actually it puts everything in the same batch. The reason for its high throughput is that sequences are removed from the batch as soon as they finish, and new sequences can be added to the batch on the fly if there is enough space in the KV cache. This is called continuous batching (https://www.anyscale.com/blog/continuous-batching-llm-infere...).
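A small simulation of the difference, as a sketch only. It assumes every active sequence emits one token per decode step, that each request needs at least one token, and that the only limit is a fixed number of batch slots (the KV cache space check is left out). The function name and the lengths are made up.

```python
from collections import deque


def decode_steps(output_lengths, batch_size, continuous):
    """Count decode steps needed to finish every request.

    static:     a batch is admitted together and runs until all members finish.
    continuous: a finished sequence frees its slot immediately for a waiting one.
    """
    waiting = deque(output_lengths)
    running = []  # remaining tokens for each active sequence
    steps = 0
    while waiting or running:
        # Static batching only admits new work once the batch has drained.
        if continuous or not running:
            while waiting and len(running) < batch_size:
                running.append(waiting.popleft())
        steps += 1                                   # one token per active sequence
        running = [r - 1 for r in running if r > 1]  # drop sequences that just finished
    return steps


lengths = [1, 4, 1, 4]
print(decode_steps(lengths, batch_size=2, continuous=False))  # 8
print(decode_steps(lengths, batch_size=2, continuous=True))   # 6
```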

Paged attention and the "virtualized" KV cache play an important role in an efficient implementation of continuous batching. Text generation in an LLM is a dynamic process, and it's not possible to predict how long the output will be when scheduling incoming requests. Therefore a dynamic approach is needed for KV cache allocation, even though it hurts the performance of attention.
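For anyone who wants the allocation idea in code, here is a toy sketch. It is not vLLM's implementation; the class, method names, and sizes are hypothetical. It only tracks which physical block each token lands in, not the actual key/value tensors.

```python
class PagedKVCache:
    """Toy block allocator: each sequence owns a block table mapping
    logical block index -> physical block id."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}                     # seq_id -> list of physical blocks
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve space for one more token; allocate a block only when needed."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n == len(table) * self.block_size:  # last block is full, or none yet
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def locate(self, seq_id, pos):
        """Translate a logical token position to (physical block, offset)."""
        return self.tables[seq_id][pos // self.block_size], pos % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("a")
print(len(cache.tables["a"]))  # 2
cache.release("a")
print(len(cache.free))         # 4
```

The `locate` step is the extra indirection the attention kernel pays for: a 20-token sequence holds 2 blocks instead of a reservation sized for the maximum output length, and the blocks go back to the pool the moment the sequence finishes.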


Thank you so much for the response! This is really good info!



[flagged]


Was this written by an LLM?


Is any post that is unusually helpful just assumed to be the product of an LLM now?


I think it's more about ChatGPT's style of dry, technical, grammatically eloquent summarization without any "hot takes" or questions like one would expect from human commenters on a forum.

Also, the intro feels like an out-of-the-blue textbook entry, not the start of a comment.

> Think of LLMs as massive libraries of information. Accessing and using this library efficiently, without wasting space or time, is crucial.

Of course, plenty of human users do this, and I don't think OP is actually an LLM, but some suspicion is reasonable these days.



