GGUF, the Long Way Around (vickiboykis.com)
249 points by Tomte 9 months ago | 30 comments



I think llama.cpp has a ton of clone-and-own boilerplate, presumably from having grown so quickly (one of their .cu files is roughly over 10k lines at the moment, I think).

While I haven't seen the model storage and distribution format myself, the rewrite to GGUF for file storage seems to have been a big boost to the project. Thanks Phil! Cool stuff. He's a really nice guy to boot. Please say hi to him from Fern if you ever run into him. I mean it literally: make his life a hellish barrage of nonstop greetings from Fern.


Thank you for the reference to the CUDA file [1]. It's always nice to see how complex data structures are handled in GPUs. Does anyone have any idea what the bit patterns are for (starting at line 1529)?

[1] https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda...


Those have to do with dequantization. It involves table lookups and some adjustment math.
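
For a rough idea of the shape of that math, here is a simplified CPU-side sketch of Q4_0-style block dequantization (an illustration, not the actual CUDA kernel at that line):

  #include <stdint.h>

  #define QK4_0 32  /* weights per quantization block */

  /* One Q4_0-style block: a scale plus 32 packed 4-bit quants.
     (Sketch only; the real ggml struct stores the scale as fp16.) */
  typedef struct {
      float   d;              /* block scale */
      uint8_t qs[QK4_0 / 2];  /* two 4-bit quants per byte */
  } block_q4_0_sketch;

  /* Dequantize one block: unpack each nibble, re-center it around zero,
     and multiply by the block scale. */
  static void dequantize_block(const block_q4_0_sketch *b, float *y) {
      for (int j = 0; j < QK4_0 / 2; ++j) {
          const int x0 = (b->qs[j] & 0x0F) - 8;  /* low nibble  */
          const int x1 = (b->qs[j] >> 4)  - 8;   /* high nibble */
          y[j]             = x0 * b->d;
          y[j + QK4_0 / 2] = x1 * b->d;
      }
  }

Packing two weights per byte plus one fp16 scale per 32 weights is where the roughly 4.5 bits/weight figure for this quant type comes from.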


I honestly think that having a way to just use JSON (a.k.a. safetensors), msgpack, or some other lightweight metadata serializer is a better route than coming up with a new file format. That's also why I just use SQLite to serialize the metadata (and the tensor weights, though that part was an oversight).


GGUF is cleaner to read in languages that don't have a JSON parsing library, and it works with memory mapping in C. It's very appealing for minimal inference frameworks compared to other options.


safetensors can mmap too, because the tensor data are just offsets and you are free to align them however you want.

It is hard to keep metadata minimal; before long you will start to have many different "atoms" and end up with things that mov supports but mp4 doesn't, etc. (The mov format is generally well-defined and easy to parse, but since it's a binary format you have to write your own parser, which is not a pleasant experience.)

If you just want minimal dependencies, flatbuffers, capnproto, and JSON are all well-supported on many platforms.


mmap() requires that you map at page-aligned intervals that are congruent with the file offset. You can't just round down, because some GPU APIs like Metal require that the data pointers themselves be page-aligned too.
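
In code, that constraint looks roughly like this (a sketch with error handling omitted; map_at_offset is a hypothetical helper, not an llama.cpp function):

  #include <sys/mman.h>
  #include <unistd.h>

  /* Map a region of a file that starts at file_offset. mmap() only
     accepts page-aligned offsets, so round the offset down to a page
     boundary and remember the delta. */
  void *map_at_offset(int fd, off_t file_offset, size_t len) {
      const long  page    = sysconf(_SC_PAGESIZE);
      const off_t aligned = file_offset & ~(off_t)(page - 1);
      const off_t delta   = file_offset - aligned;

      void *base = mmap(NULL, len + delta, PROT_READ, MAP_PRIVATE, fd, aligned);
      if (base == MAP_FAILED) return NULL;

      /* Fine for CPU reads, but this pointer is only page-aligned when
         delta == 0; that is why a format that wants to hand tensor data
         straight to a backend like Metal needs to lay out every tensor at
         an aligned offset in the file to begin with. */
      return (char *)base + delta;
  }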


Yeah, safetensors separates metadata and tensor data. The metadata holds offset references into the tensor data, which you are free to define yourself. That way, you can create safetensors files in which the tensor data sits at page-aligned offsets.
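
For context, the layout being described is just a small JSON header in front of a flat byte buffer; reading the header itself doesn't even need a JSON parser (a minimal sketch):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* safetensors starts with an 8-byte little-endian length N, followed by
     N bytes of JSON mapping each tensor name to its dtype, shape, and
     [begin, end) byte offsets into the data section that follows. */
  char *read_safetensors_header(FILE *f, uint64_t *json_len) {
      uint8_t len_bytes[8];
      if (fread(len_bytes, 1, 8, f) != 8) return NULL;

      uint64_t n = 0;
      for (int i = 0; i < 8; ++i) n |= (uint64_t)len_bytes[i] << (8 * i);

      char *json = malloc(n + 1);
      if (!json || fread(json, 1, n, f) != n) { free(json); return NULL; }
      json[n] = '\0';

      *json_len = n;
      return json;  /* caller parses and frees; tensor data starts at 8 + n */
  }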


But usually AWQ gets recommended for GPU inference over GGUF.


By who? The only comparison I have seen is that it sucks vs. EXL2: https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacp...


The issue is that benchmarks for LLMs or model formats are tough to compare, as there are many factors at play. But beyond ooba's comparison, many other sources recommend GPTQ or AWQ for GPU inference because they give better quality at the same quant level (AWQ apparently takes more VRAM, though). Given how many models are available, I would take these tests with a grain of salt.


Haven't heard this. Was this a few months ago? A lot happens in this space over that time span.


I think a binary format is obviously the right answer here.


> GPT-Generated Unified Format

GG is Georgi Gerganov


Nothing like a good backronym


“No yapping” gave me a bit of a chuckle. A quick way to ask for a brief response, I guess.


It would be cool to also discuss how llamafiles[1] work, and how they differ from the GGUF files.

[1]: https://github.com/Mozilla-Ocho/llamafile


Cool. I was just learning about GGUF by creating my own parser for it based on the spec https://github.com/ggerganov/ggml/blob/master/docs/gguf.md (for educational purposes)
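
For anyone following along, the fixed part of the header is small enough to read directly (a sketch based on that spec; it assumes a little-endian host, which is also the on-disk default):

  #include <stdint.h>
  #include <stdio.h>

  /* The fixed-size fields at the start of a GGUF file, per the spec above.
     The variable-length metadata key/value pairs and tensor infos follow. */
  typedef struct {
      uint32_t magic;             /* 'GGUF' = 0x46554747 read little-endian */
      uint32_t version;           /* format version, e.g. 3 */
      uint64_t tensor_count;
      uint64_t metadata_kv_count;
  } gguf_header_sketch;

  int read_gguf_header(FILE *f, gguf_header_sketch *h) {
      if (fread(&h->magic, sizeof h->magic, 1, f) != 1) return -1;
      if (h->magic != 0x46554747u) return -1;  /* not a GGUF file */
      if (fread(&h->version, sizeof h->version, 1, f) != 1) return -1;
      if (fread(&h->tensor_count, sizeof h->tensor_count, 1, f) != 1) return -1;
      if (fread(&h->metadata_kv_count, sizeof h->metadata_kv_count, 1, f) != 1) return -1;
      return 0;
  }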


As LLMs have quite minor changes between architectures, would it make sense to just embed the model compiled to some sort of simple bytecode right in the GGUF file? Then, only implement specific new operations when researchers come up with a new model that gains enough traction to be of interest.


Not really. We've been down that road before. Embedding the computation graph in the file makes changes to the computation graph harder (you need to make sure they're backward compatible). This is OK in general (we have ONNX already), but once you have dynamic shapes, and the different optimizations you implement are actually tied to the computation graph, it is simply not optimal. (BTW, this is why PyTorch just embeds the code in the .pth file; that's much easier and more backward compatible than a static computation graph.)


Wait, why is embedding the graph in the file bad?

It enables a really clean separation between the core autodiff library and whatever backend you want to use to accelerate the graph computations: the backend can simply read the file and stay completely independent of the core implementation.

But also, if you just store the tensors in some arbitrary order and then store the indices of the order in which they have to be read and traversed, you can easily adjust the graph to add things like layer fusion or similar (I'm not really familiar with computation graph optimizations, tbh).

What would an alternative look like, anyway?


Yeah, but you want to avoid remote code execution:

https://www.bleepingcomputer.com/news/security/malicious-ai-...


The bytecode would not even need to be Turing-complete. Or maybe it could take inspiration from eBPF which gives some guarantees. What you posted is related to the design oversight of Python's pickle format.


I think ONNX does what you say.


It seems like a lot of innovation is around training, no? GGML (the library that reads GGUF format) supports these values for the required 'general.architecture':

  llama
  mpt
  gptneox
  gptj
  gpt2
  bloom
  falcon
  rwkv
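
For illustration, the reader ends up keying everything off that one string, roughly along these lines (a hypothetical sketch, not llama.cpp's actual table):

  #include <string.h>

  typedef enum { ARCH_LLAMA, ARCH_MPT, ARCH_GPTNEOX, ARCH_GPTJ, ARCH_GPT2,
                 ARCH_BLOOM, ARCH_FALCON, ARCH_RWKV, ARCH_UNKNOWN } arch_t;

  /* Map the GGUF general.architecture string to a known architecture. */
  arch_t arch_from_name(const char *name) {
      static const struct { const char *name; arch_t arch; } table[] = {
          { "llama",   ARCH_LLAMA },   { "mpt",    ARCH_MPT },
          { "gptneox", ARCH_GPTNEOX }, { "gptj",   ARCH_GPTJ },
          { "gpt2",    ARCH_GPT2 },    { "bloom",  ARCH_BLOOM },
          { "falcon",  ARCH_FALCON },  { "rwkv",   ARCH_RWKV },
      };
      for (size_t i = 0; i < sizeof table / sizeof table[0]; ++i)
          if (strcmp(name, table[i].name) == 0) return table[i].arch;
      return ARCH_UNKNOWN;  /* a new architecture means new reader code */
  }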


I've also been trying to figure out GGUF and the other model formats going around. I'm horrified to see there are no model architecture details in the file! As you say, it seems they hard-code the above architectures as constants. If a new hot model comes out, one would need to update the reader code (which has the new model architecture implemented). Am I understanding this right?

I'm also a bit confused by the quantization aspect. It's a pretty complex topic. GGML seems to use 16-bit, as per the article. If I were pushing it to 8-bit, I reckon I'd see no size improvement in the GGML file? The article says they encode quantization versions in that file. Where are they defined?


Why are you horrified?

In designing software, there's often a trade off between (i) generality / configurability, and (ii) performance.

llama.cpp is built for inference, not for training or model architecture research. It seems reasonable to optimize for performance, which is what ~100% of llama.cpp users care about.


GGUF files seem to be proliferating. I think some folks (like myself) make the incorrect assumption that the format has more portability/generalizability than it appears to have. Hence the horror!


I’ve been looking for a good resource on GGUF for the past week or so, the timing on this is awesome! Thanks!


This is an excellent deep dive! Love the depth here Vicki




