I do love kaitai and recently contributed a grammar, but both the title and the copy talk about writing.
> Reading and writing binary formats is hard... Kaitai Struct tries to make this job easier...
Kaitai cannot write data back out. [1] This is a major limitation for me. It would be nice to use it as a mutation engine for fuzzing, but without being able to write it back out, it is mostly just beneficial for analysis.
I might be missing something - why would this be a breakthrough? It sounds complicated to generate the interfaces, sure, but is there a theoretical problem blocking this, or just practical?
I've looked at this problem quite a bit over the years...I agree with you completely. there isn't anything fundamental here, just the normal cultural adoption issues, usability, etc. There may be some compilation/complexity issues around formats with variable length fields and self-description, but certainly less problematic than general purpose programming.
I really wish though that there were more traction here, as I really believe that we should be quite prepared to deal with bits and not protobufs by default. Nothing wrong with protobufs for quite a number of uses. I just don't know why people are so afraid of and/or biased against bit strings.
Through this thread I really feel like I’m missing something.
Are we not talking about writing binary files that conform to the spec?
Like, in the case of a GIF: simply writing valid garbage data should produce a file that presents as a valid GIF with noise for the image. Similarly: reading the file through the parser and writing it out unmodified should create an identical file (assuming no steganography).
You're right. The person saying this would be a breakthrough doesn't understand what this is doing.
There are already similarly declarative tools which can accomplish this. Haskell has binary parsing libraries which work similarly and give you both reading and writing capabilities.
I honestly don't get it either. The inverse of the read spec is the write spec. My guess, having not dug deeply, is that they don't distinguish between required and optional fields. That said, they should still be able to write what they have based on the read spec, though the result could potentially still be an invalid file.
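To make the "inverse of the read spec is the write spec" point concrete, here is a minimal Python sketch using only the standard `struct` module. It is not Kaitai's API; `HEADER_SPEC`, `read_record`, and `write_record` are hypothetical names, and the header layout is invented for illustration. It only covers fixed-size fields, which is the easy case.

```python
import struct

# Hypothetical spec: (field name, struct format code), little-endian.
HEADER_SPEC = [("magic", "4s"), ("width", "H"), ("height", "H"), ("flags", "B")]

def _fmt(spec):
    # Build one struct format string from the declarative field list.
    return "<" + "".join(code for _, code in spec)

def read_record(spec, data):
    """Parse bytes into a dict by walking the spec."""
    values = struct.unpack(_fmt(spec), data)
    return {name: value for (name, _), value in zip(spec, values)}

def write_record(spec, record):
    """Serialize a dict by walking the same spec in the same order."""
    return struct.pack(_fmt(spec), *(record[name] for name, _ in spec))

raw = b"DEMO" + bytes([0x40, 0x01, 0xF0, 0x00, 0x03])
header = read_record(HEADER_SPEC, raw)
# Reading and writing back unmodified reproduces the input bytes.
assert write_record(HEADER_SPEC, header) == raw
```

The hard cases raised elsewhere in the thread (length fields that depend on later data, checksums, optional fields) are exactly what this sketch leaves out.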
It's a really interesting idea! I'm also surprised there hasn't been much traction here. I've started a Rust library for this: https://github.com/sharksforarms/deku
It's a declarative bit-level symmetrical reader writer generator library.
Yeah I was also confused. I wrote a bidirectional parser/writer layer for yaml in a haskell program I had at work. The yaml structure mappings were all declarative in the code and even allowed documentation for the structures to be printed out. It's not that hard once you define the primitive bidirectional (higher-order) mapping function to go from a `Configurable a` to a `Configurable b`, the rest kind of unfolds from there.
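A rough Python transposition of that idea, applied to bytes instead of YAML. This is not the Haskell code described above; `Codec` and `bimap` are hypothetical names standing in for the `Configurable a` to `Configurable b` mapping.

```python
import struct
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")

@dataclass(frozen=True)
class Codec(Generic[A]):
    """A paired decoder and encoder for values of type A."""
    decode: Callable[[bytes], A]
    encode: Callable[[A], bytes]

    def bimap(self, forward: Callable[[A], B], backward: Callable[[B], A]) -> "Codec[B]":
        # Lift a Codec[A] to a Codec[B] given a mapping in both directions.
        return Codec(
            decode=lambda raw: forward(self.decode(raw)),
            encode=lambda value: self.encode(backward(value)),
        )

# Primitive codec: unsigned 16-bit big-endian integer.
u16 = Codec(
    decode=lambda raw: struct.unpack(">H", raw)[0],
    encode=lambda n: struct.pack(">H", n),
)

# Derived codec: the same two bytes presented as a labelled string.
port_label = u16.bimap(lambda n: f"port:{n}", lambda s: int(s.split(":", 1)[1]))
```

Once the primitive and the mapping function exist, every derived codec gets its writer for free, as long as `forward` and `backward` really are inverses of each other.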
Binary file formats can be vastly more convoluted than YAML. For example, consider roundtripping ZIP archives or PDF documents, or both at the same time (see https://www.alchemistowl.org/pocorgtfo/).
Err. I'm not sure what asn.1 is missing for this? I've seen lots of people use asn.1 exactly for this (i.e. writing a grammar for an arbitrary pre-existing binary format not readily described in asn.1).
asn.1 distinguishes between schema and encoding; there are many binary encodings, and you can technically devise a custom one that would let you describe the high-level structure with an asn.1 grammar and then lay out the actual bits with a custom encoding format so that it matches the pre-existing format you're writing the new serde for. This may work as many formats have this leveled approach. E.g. the lowest level of the spec may tell something about how to encode integers (all integers are 32-bit big endian, or varlen encoded ...), sequences (length-prefixed, or terminated by a sentinel).
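A small Python sketch of the schema/encoding split: one abstract value (a sequence of unsigned integers) laid out under two different low-level rule sets. This is not an asn.1 toolchain; the function names are hypothetical, and the varint is the common LEB128-style scheme, chosen here only as an example of "varlen encoded".

```python
import struct

def encode_length_prefixed(values):
    """One-byte count, then each integer as 32-bit big-endian."""
    return struct.pack(">B", len(values)) + b"".join(
        struct.pack(">I", v) for v in values
    )

def encode_varint(value):
    """Unsigned variable-length integer, 7 bits per byte, low bits first."""
    if value < 0:
        raise ValueError("only unsigned values are supported")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_sentinel_terminated(values, sentinel=0):
    """Varints followed by a sentinel; the sentinel cannot appear as data."""
    if sentinel in values:
        raise ValueError("sentinel value cannot be encoded as data")
    return b"".join(encode_varint(v) for v in values) + encode_varint(sentinel)

schema_value = [1, 300, 70000]  # same abstract value, two wire layouts
fixed = encode_length_prefixed(schema_value)
compact = encode_sentinel_terminated(schema_value)
```

The high-level description (`schema_value` is a sequence of integers) stays the same; only the bottom layer changes.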
Any chances you have some reference to what you saw?
Right I think we were working with PER (tagless) to match a complex multi-layer protocol. Can't share the example but let's say asn.1 helped generate saner code than the handrolled one AND allowed other languages to decode...
I understand what you're saying though. Memories of using this were... unpleasant. I think the 'best' alternative for me would be RecordFlux. Ada-like syntax, generation of AoRTE-provable SPARK code, and recently the expressivity of the tool has increased. And the whole thing is in python, easily extensible to build things from the type description: generators, fuzzers, advanced specific parsers (need only one field and want the control fields' positions and sizes), wireshark plugins, Postgres extensions...
I really like what they're doing there. Might be one of the low-effort (for the user!) lead-bullets for safer software.
It can, but it can get incredibly slow for large formats. I was using it to reverse engineer some binary game formats but the parser would take a couple of minutes to complete. I rewrote it using struct and that time dropped to a few seconds. Useful for probing an unknown format, but I prefer the 010 editor since it’s more interactive.
Ooh, exciting! I built a parser [1] for AIS messages [2], a quirky ship-to-ship protocol. My lower-level stuff always felt clumsy to me. I'll have to see if this cleans it up.
BinData (https://github.com/dmendel/bindata) is a Ruby gem for this, basically using a DSL in Ruby to declaratively define binary data formats that can be both read and written.
I wrote a sort-of adjacent library for Go at one point. I’m a bit stuck trying to figure out exactly what to do on 2.0 but it has a lot of Kaitai like features including an expression language for transforming things (on master version) and it supports writing structures out.
I have been working on Deku: a declarative binary reading and writing: bit-level, symmetric, serialization/deserialization library. https://github.com/sharksforarms/deku
and still there is no generator that creates efficient and partially streamed readers/writers for high-performance protocols or resource-constrained environments (as few pre-allocations as possible, zero-copy concepts, ..., streamed reading/writing, good inlining possibility, ...)
100% fixed formats with no self-reference, or for example checksums that sit in front of the checksummed data (no streaming possible)...
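To illustrate why a checksum placed in front of the data it covers blocks streaming, here is a small Python sketch. The two layouts and the function names are hypothetical; `zlib.crc32` stands in for whatever checksum a real format would use.

```python
import io
import struct
import zlib

def write_crc_first(out, chunks):
    """Checksum precedes the body: the whole body must be buffered first."""
    body = b"".join(chunks)
    out.write(struct.pack(">I", zlib.crc32(body)))
    out.write(body)

def write_crc_last(out, chunks):
    """Checksum follows the body: each chunk can be written as it arrives."""
    crc = 0
    for chunk in chunks:
        out.write(chunk)
        crc = zlib.crc32(chunk, crc)  # running checksum over what was written
    out.write(struct.pack(">I", crc))

chunks = [b"hello ", b"binary ", b"world"]
first, last = io.BytesIO(), io.BytesIO()
write_crc_first(first, chunks)
write_crc_last(last, chunks)
```

With a seekable output the writer could instead reserve four bytes and patch them afterwards, but that option disappears on a socket or pipe.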
I started working on an alternative that supported writing but didn't follow through since I didn't think many people were interested in Kaitai to begin with.
I love Kaitai, and use it extensively (outside my day job) for exploring binary file formats of things like DOS games and quilting patterns. About a year ago I wrote a blog entry about using it for scientific data: https://matthewturk.github.io/post/kaitai-struct-scientific-... .
Recently I learned about some pretty cool work using it with respect to high-energy physics:
i found the Kaitai toolset (IDE, compiler) to be useful in parsing or "deserialization" for proprietary financial protocols. if you are just reading data (especially in an environment with clients in different languages) it's a strong recommend, however lack of support for "serialization" means that you will still need to roll your own encoders. ultimately we created our own tooling/DSL for encoding as well as generating KSY files to generate clients as part of our builds
edit: i forgot to add - there is an issue for serialization on kaitai github repo for some time [1], with some interesting discussion around the implementation challenges
In ID3 tags there's such a thing as unsynchronization: the MP3 syncword is 11 bits set to one so for the tag data not to be mistaken for an MP3 frame it must not have a similar sequence of bits. The solution was to replace every 0xFF in tag data with 0xFF 0x00. Or not to replace, as this mistake may only be made by old players that do not understand ID3. So there's a special setting for this that may occur in two places: in the whole tag or in an individual frame within a tag.
The logic itself is simple: as you read data byte-by-byte you need to check if the previous byte was 0xFF and the frame or tag is marked as unsynchronized. Yet it's not that simple to describe this declaratively. I wonder if Kaitai can actually do this. From what I see Kaitai does have at least part of ID3 described, but it doesn't seem to actually do unsynchronization, as far as I can tell from the code.
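For reference, the imperative version of that logic is only a few lines. This Python sketch follows the simplified scheme as described in this comment (a zero byte after every 0xFF), not necessarily the exact rule in the ID3v2 specification; the function names are hypothetical.

```python
def unsynchronize(data: bytes) -> bytes:
    """Insert 0x00 after every 0xFF so no MP3 syncword can appear."""
    return data.replace(b"\xff", b"\xff\x00")

def resynchronize(data: bytes) -> bytes:
    """Undo unsynchronization: drop a 0x00 that directly follows 0xFF."""
    out = bytearray()
    prev_ff = False
    for byte in data:
        if prev_ff and byte == 0x00:
            prev_ff = False  # this zero was stuffing, skip it
            continue
        out.append(byte)
        prev_ff = byte == 0xFF
    return bytes(out)

payload = b"\x01\xff\xe0\xff\x00\xff"
encoded = unsynchronize(payload)
assert resynchronize(encoded) == payload
```

The difficulty for a declarative tool is that the decoded length differs from the stored length, and whether to apply the transform at all depends on a flag read earlier in the tag or frame header.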
The first thing that comes to my mind in this context is that "kaitai" means "want to write" (書いたい) in Japanese. It also has many other meanings: https://jisho.org/search/kaitai . Maybe the authors had disassembly (解体) in mind.
Thanks for catching my blind spot there. I think what happened was that I used the -te form (書いて) and stripped off -te, but didn't conjugate the verb properly starting from 書く.
Is there support for C in the pipeline? That would be nice, because there is lots of binary data manipulation in C even now, and it could use some quality-of-life updates.
I can see this being hugely useful: being able to generate code to read a wider range of formats instead of having to include/embed dozens of 3rd party libs to read the files. Could be a strong tool for commercial product dev that has legal or technical limitations on the use of 3rd party code.
supports unions and 4 popular type safe languages. The idea is that you'd write decorators in those languages to implement functionality similar to ksy or write a template to generate ksy and reuse Kaitai tool chain.
Protocol Buffers and Kaitai structs solve different problems, although they both deal with serialized data. With a protobuf, you don't really care about how the data that you're serializing into the binary buffer is represented in the binary format. All you care about is that your data can be serialized and deserialized. Conversely, Kaitai allows you to control and specify the representation of the serialized data. This allows you to specify arbitrary formats (like image formats, for example).
Essentially you start with the serialization format instead of starting with the deserialized data (or like how the title of this submission says, "describe the structure of data"). As such, you can somewhat describe protobufs [0] using Kaitai structs, but the converse is not necessarily true.
There's actually a section under their FAQ [1] with a more in-depth response to this.
Kaitai Struct can read any 'ol random data format while protobufs (I assume) are only able to read/write a specific protocol.
I was playing with it a while back along with wasm and got it to decode all the individual opcodes (along with the rest of the file) but it turned out to be really, really slow in the generated python version. C++ probably has much better performance but I haven't actually tested the difference.
If I understand it correctly, Kaitai generates the reader for you as well, while protobuf generates only the data container. Kaitai is more a sort of protobuf + grpc for reading data.
Big difference is Kaitai is designed to parse existing binary formats like PNG, JPEG, MIDI, WAV files.
Things like Protobuf can serialize/deserialize any data but it’s very opinionated about how to do this. You won’t convince it to work with existing file formats.
I think that I agree with you, and I'm obviously fond of clever DSL's cough mgmt config cough, but it's not clear what specifics you had in mind here. Share away if you're interested!
YAML has far too many features and doesn't let you structure data well for this use-case. There's no concise way to provide typings for its fields, for instance - note how every field requires two lines in the examples, far too verbose.
I don't really know what to tell you other than that YAML is not well-suited to this domain, and designing a domain-specific language which meets the particular needs of the system, and only just, would be better. You could base it off of C structs or Rust types or s-expressions or whatever else, the exact syntax isn't important so long as it allows you to concisely and precisely specify the semantics of the tool.
If it was something that looked like YAML, but wasn't, that'd be fine. YAML in its simplest form is a neat idea. Indentation is simple and obvious. Same as it doesn't matter that TOML looks a heck of a lot like INI. But using YAML itself is terrible, because YAML is a colossally bloated language that does crazy things at the wrong points and has a million syntaxes that step on each others' toes.
1. https://doc.kaitai.io/faq.html#writing