> Large: Takes up to 1.7GB for a single sequence in LLaMA-13B.
> Dynamic: Its size depends on the sequence length, which is highly variable and unpredictable. As a result, efficiently managing the KV cache presents a significant challenge. We find that existing systems waste 60% – 80% of memory due to fragmentation and over-reservation.
This mentions improvements for throughput which is great, and it mentions memory savings. I'm a bit confused how 80% of the memory could be wasted by the KV cache when the vast majority of the memory is usually holding the model itself?
How much memory savings does this translate to effectively for say a 30B 4bit model?
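Back-of-envelope, the "up to 1.7GB" figure is easy to reproduce. This is a sketch assuming the usual LLaMA-13B architecture numbers (40 layers, hidden size 5120, fp16 activations) and a 2048-token context; none of these constants come from the post itself.

```python
# Rough KV-cache size for LLaMA-13B (assumed architecture numbers).
n_layers = 40
hidden = 5120          # = 40 heads * 128 head_dim
bytes_per_value = 2    # fp16

# Each token stores one key vector and one value vector per layer.
kv_bytes_per_token = 2 * n_layers * hidden * bytes_per_value
print(kv_bytes_per_token)        # 819200 bytes, ~0.78 MiB per token

# A full 2048-token sequence:
kv_bytes_per_seq = kv_bytes_per_token * 2048
print(kv_bytes_per_seq / 1e9)    # ~1.68 GB, matching the quoted "up to 1.7GB"
```

So the cache can rival the weights' footprint once you batch many long sequences, which is why fragmentation in it matters so much. Note the cache size scales with the model's layer count and hidden size but not with weight quantization, so 4-bit weights alone don't shrink it.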
Ion Stoica's lab continues to be a powerhouse of innovation. Previous successes of Stoica and his students include (but are certainly not limited to) Apache Spark, Ray, Apache Mesos and Alluxio.
Thanks for the explanation! I believe the two ideas are basically orthogonal. FlashAttention reduces memory read/writes, while PagedAttention reduces memory waste.
I believe you could slightly modify the FlashAttention kernel to implement PagedAttention, since both of them operate on the key/value cache at the block level.
Yes, vLLM focuses on maximizing throughput when the VRAM is fully utilized. Nevertheless, I believe users can still benefit from vLLM even if they don't utilize the memory to its full capacity, because vLLM also includes other optimizations orthogonal to the PagedAttention (e.g., optimized CUDA kernels).
Nope. You will still need a proper GPU. You can't yet run large language models on tiny hardware like an m1/m2. Even the llama.cpp magic is only possible with very small models at beam size 1, which really limits the "creativity" of these models.
We run into this constantly with Willow[0] and the Willow Inference Server[1]. There seems to be a large gap in understanding among many users. They seem to find it difficult to accept a fundamental reality: GPUs are so physically different and so much better suited to many/most ML tasks that all the CPU tricks in the world cannot bring CPUs even close to GPU performance (while maintaining quality/functionality). I find this interesting because everyone seems to take it as obvious that integrated graphics aren't even close to discrete graphics for gaming. Ditto for these tasks.
With Willow Inference Server I'm constantly telling people: a six year old $100 Tesla P4/GTX 1070 walks all over even the best CPUs in the world for our primary task of speech to text/ASR - at dramatically lower cost and power usage. Seriously - a GTX 1070 is at least 5x faster than a Threadripper 5955WX. Our goal is to provide an open-source commercial voice assistant equivalent user experience and that is and will be fundamentally impossible for the foreseeable future on CPU.
Slight tangent, but there are users in this space who seem to be under the impression that they can use their Raspberry Pi for voice assistant/speech recognition. It's not even close to a fair fight: with the same implementation and settings, a GTX 1070 is roughly 90x (nearly two orders of magnitude) faster[2] than a Raspberry Pi... Yes, all-in, a machine with a GTX 1070 uses an order of magnitude more power (roughly 30W vs 3W) than a Raspberry Pi, but even in the countries with the most expensive power in the world that works out to a $2-$3/mo cost difference - which I feel, at least, is a reasonable trade-off considering the dramatic difference in usability (the Raspberry Pi is essentially useless - waiting 10-30 seconds for a response makes pulling your phone out faster).
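The monthly-cost claim above is easy to sanity-check. A minimal sketch, assuming ~30W average draw for the GTX 1070 box, ~3W for the Pi, and an electricity rate of $0.15/kWh (all three figures are my assumptions, not from the comment):

```python
# Monthly power-cost difference: GTX 1070 machine vs Raspberry Pi.
watts_gpu = 30          # assumed average draw, whole machine
watts_pi = 3            # assumed average draw
hours_per_month = 24 * 30

# Extra energy the GPU box consumes over a month, in kWh.
kwh_diff = (watts_gpu - watts_pi) * hours_per_month / 1000
print(kwh_diff)         # 19.44 kWh/month

price_per_kwh = 0.15    # assumed rate in USD
monthly_cost = kwh_diff * price_per_kwh
print(monthly_cost)     # ~$2.92/month, inside the quoted $2-$3 range
```

At a very high rate like $0.40/kWh the gap would be closer to $8/mo, so the exact figure depends heavily on the assumed duty cycle and local pricing.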
I am in fact running my own instance of the Willow Inference Server (née air-infer-api) against a Tesla P4 8GB gifted to me by our mutual friend Richard. It works wonderfully, up to IIRC 3 chunks of audio. We really need to implement streaming so I can use it to closed-caption videos without subtitles.
For others in this thread, if you haven't tried Willow yet, check it out, as it is an amazing leap forward and can actually run on some pretty small GPUs. LLMs are hogging the AI spotlight but you will struggle to run them on consumer hardware. Image and audio processing models are generally much smaller and more approachable.
Semi-related question: this page is full of little charts and diagrams. There are thousands of similar projects/sites/experiment sites with their own charts and diagrams. But it seems like there are always subtle-to-large differences in them that indicate they're made with totally different libraries.
Are there just thousands of homebrewed, non-standard chart and diagram builders out there? How does one even begin to pick a standard for whipping out quickies like these? Google SEO makes it virtually impossible to get to substance.
I often see charts produced using matplotlib or plotly - often you can tell based on the colour schemes used. For example, the bar chart at the bottom of this paper looks like it was made with plotly. I think the reason for such variance in the style of charts is largely due to the flexibility frameworks such as matplotlib provide: you can control basically every aspect of a chart and use any number of predefined or custom stylesheets to change the look and feel.
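For what it's worth, reproducing the "look" of one of these charts takes only a few lines. A minimal sketch using matplotlib with one of its built-in stylesheets; the labels and values here are placeholders, not data from the vLLM page:

```python
# Minimal bar chart with a predefined matplotlib stylesheet.
import matplotlib
matplotlib.use("Agg")            # render off-screen, no display needed
import matplotlib.pyplot as plt

plt.style.use("ggplot")          # swap for any style in plt.style.available
fig, ax = plt.subplots()
ax.bar(["A", "B", "C"], [3, 5, 8])   # placeholder data
ax.set_ylabel("Throughput (x)")
fig.savefig("throughput.png")
```

Swapping the stylesheet (or tweaking rcParams) is exactly what produces the subtle visual differences between projects, even when they use the same library.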
I'm spoiled by 4-bit and unfortunately it doesn't appear to be supported here, so this isn't of much use to me, but it's awesome to see people working on the inference-speed side of things regardless.
Cool, I like the OpenAI-compatible API. Although it's not very technically difficult, it's really convenient, because it lets me freely use all the existing ChatGPT applications.