Kaitai: Describe the structure of data, not how you read or write it (kaitai.io)
245 points by eterps on Dec 5, 2020 | 65 comments



I do love kaitai and recently contributed a grammar, but both the title and the copy talk about writing.

> Reading and writing binary formats is hard... Kaitai Struct tries to make this job easier...

Kaitai cannot write data back out. [1] This is a major limitation for me. It would be nice to use it as a mutation engine for fuzzing, but without being able to write it back out, it is mostly just beneficial for analysis.

1. https://doc.kaitai.io/faq.html#writing


Yikes, huge limitation. Guess I won't be looking into this any further.


It would be a CS breakthrough if it could... You're asking a lot...


This breakthrough happened already. This is called bidirectional or invertible parsing. See it discussed here https://news.ycombinator.com/item?id=16392654

And this paper https://dl.acm.org/doi/10.1145/1863523.1863525 "Invertible syntax descriptions: unifying parsing and pretty printing"

And this Haskell library https://hackage.haskell.org/package/roundtrip among others


I might be missing something - why would this be a breakthrough? It sounds complicated to generate the interfaces, sure, but is there a theoretical problem blocking this, or just practical?


I've looked at this problem quite a bit over the years... I agree with you completely. There isn't anything fundamental here, just the normal cultural adoption issues, usability, etc. There may be some compilation/complexity issues around formats with variable-length fields and self-description, but they are certainly less problematic than general-purpose programming.

I really wish though that there were more traction here, as I really believe that we should be quite prepared to deal with bits and not protobufs by default. Nothing wrong with protobufs for quite a number of uses. I just don't know why people are so afraid of and/or biased against bit strings.


Through this thread I really feel like I’m missing something.

Are we not talking about writing binary files that conform to the spec?

Like, in the case of a GIF: simply writing valid garbage data should produce a file that presents as a valid GIF with noise for the image. Similarly: reading the file through the parser and writing it out unmodified should create an identical file (assuming no steganography).

Right?
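
To make the round-trip expectation concrete, here is a minimal Python sketch. The toy header layout and the `parse`/`build` names are invented for illustration; this is not Kaitai's API, just the property being described: reading a file and writing it back unmodified should give identical bytes.

```python
import struct

# Hypothetical toy format (an assumption for this sketch):
# 2-byte magic, u16 width, u16 height (little-endian), then raw payload bytes.
HEADER = struct.Struct("<2sHH")

def parse(data: bytes) -> dict:
    """Read the fixed header and keep everything after it as the payload."""
    magic, width, height = HEADER.unpack_from(data, 0)
    return {"magic": magic, "width": width, "height": height,
            "payload": data[HEADER.size:]}

def build(obj: dict) -> bytes:
    """Inverse of parse: serialize the same fields in the same order."""
    return HEADER.pack(obj["magic"], obj["width"], obj["height"]) + obj["payload"]

original = b"TF" + struct.pack("<HH", 3, 2) + bytes(range(6))
roundtripped = build(parse(original))
print(roundtripped == original)
```

The round trip is only byte-identical because this format has exactly one valid encoding per value; formats with padding, redundant length fields, or multiple encodings need extra care.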


You're right. The person saying this would be a breakthrough doesn't understand what this is doing.

There are already similarly declarative tools which can accomplish this. Haskell has binary parsing libraries which work similarly and give you both reading and writing capabilities.


I honestly don't get it either. The inverse of the read spec is the write spec. My guess, having not dug deeply, is that they don't distinguish between required and optional fields. That said, they should still be able to write what they have based on the read spec, though the result could still be an invalid file.


It's a really interesting idea! I'm also surprised there hasn't been much traction here. I've started a Rust library for this: https://github.com/sharksforarms/deku

It's a declarative bit-level symmetrical reader writer generator library.


Yeah I was also confused. I wrote a bidirectional parser/writer layer for yaml in a haskell program I had at work. The yaml structure mappings were all declarative in the code and even allowed documentation for the structures to be printed out. It's not that hard once you define the primitive bidirectional (higher-order) mapping function to go from a `Configurable a` to a `Configurable b`, the rest kind of unfolds from there.
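
A rough Python sketch of that idea, applied to binary data rather than YAML. Everything here (`Codec`, `uint`, `seq`, `imap`) is a hypothetical name made up for this example, not the Haskell code mentioned above: each codec carries both directions, and a higher-order mapping function turns a codec for one type into a codec for another.

```python
import struct
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass(frozen=True)
class Codec:
    decode: Callable[[bytes], Tuple[Any, bytes]]  # bytes -> (value, remaining bytes)
    encode: Callable[[Any], bytes]                # value -> bytes

def uint(fmt: str) -> Codec:
    """Primitive codec for one integer, described by a struct format string."""
    s = struct.Struct(fmt)
    return Codec(
        decode=lambda data: (s.unpack_from(data)[0], data[s.size:]),
        encode=lambda value: s.pack(value),
    )

def seq(*fields: Tuple[str, Codec]) -> Codec:
    """Named fields in order; the same description drives reading and writing."""
    def decode(data):
        out = {}
        for name, codec in fields:
            out[name], data = codec.decode(data)
        return out, data
    def encode(value):
        return b"".join(codec.encode(value[name]) for name, codec in fields)
    return Codec(decode, encode)

def imap(codec: Codec, forward: Callable, backward: Callable) -> Codec:
    """Map a codec for type A to a codec for type B, given both directions."""
    def decode(data):
        value, rest = codec.decode(data)
        return forward(value), rest
    return Codec(decode, lambda value: codec.encode(backward(value)))

point = seq(("x", uint(">H")), ("y", uint(">H")))
point_tuple = imap(point,
                   lambda d: (d["x"], d["y"]),
                   lambda t: {"x": t[0], "y": t[1]})
```

The caller has to supply `forward` and `backward` that really are inverses of each other; nothing in this sketch checks that.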


Binary file formats can be vastly more convoluted than YAML. For example, consider roundtripping ZIP archives or PDF documents, or both at the same time (see https://www.alchemistowl.org/pocorgtfo/).


How exactly would it be a cs breakthrough? Are you sure you understand what this does?


Are we talking about different things?

ASN.1 does more or less what I was hoping Kaitai can do. Where's the breakthrough?

I just want to be able to describe network protocols in some "language" and generate code that can serialize/deserialize it.


Kaitai is designed to describe arbitrary, preexisting binary formats. You can’t do that with ASN.1.


Err. I'm not sure what asn.1 is missing for this? I've seen lots of people use asn.1 exactly for this (i.e. writing a grammar for an arbitrary pre-existing binary format not readily described in asn.1).


asn.1 distinguishes between schema and encoding; there are many binary encodings, and you can technically devise a custom one that lets you describe the high-level structure with an asn.1 grammar and then lay out the actual bits with a custom encoding format, so that it matches the pre-existing format you're writing the new serde for. This may work, as many formats have this layered approach. E.g. the lowest level of the spec may say how to encode integers (all integers are 32-bit big endian, or varlen encoded ...) and sequences (length prefixed, or terminated by a sentinel).
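
A small Python sketch of that layering, under stated assumptions: the function names are invented, and the "varlen" integer is assumed to be a little-endian base-128 varint, which is only one of several possible schemes. The same high-level value (a list of integers) gets two different byte layouts depending on which low-level rules are plugged in.

```python
import struct

def encode_u32_be(n: int) -> bytes:
    """Fixed-width rule: every integer is 32-bit big-endian."""
    return struct.pack(">I", n)

def encode_varint(n: int) -> bytes:
    """Variable-length rule (assumed): 7 bits per byte, high bit = more bytes follow."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_length_prefixed(items, encode_int) -> bytes:
    """Sequence rule 1: element count first, then the elements."""
    return encode_int(len(items)) + b"".join(encode_int(i) for i in items)

def encode_sentinel_terminated(items, encode_int, sentinel=0) -> bytes:
    """Sequence rule 2: elements, then a sentinel that must not occur as an element."""
    if sentinel in items:
        raise ValueError("sentinel value cannot appear in the sequence")
    return b"".join(encode_int(i) for i in items) + encode_int(sentinel)

values = [1, 300]
print(encode_length_prefixed(values, encode_u32_be).hex())
print(encode_sentinel_terminated(values, encode_varint).hex())
```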

Any chances you have some reference to what you saw?


Right, I think we were working with PER (tagless) to match a complex multi-layer protocol. Can't share the example, but let's say asn.1 helped generate saner code than the handrolled one AND allowed other languages to decode...

I understand what you're saying though. Memories of using this were... unpleasant. I think the 'best' alternative for me would be RecordFlux. Ada-like syntax, generation of AoRTE-provable SPARK code, and recently the expressivity of the tool has increased. And the whole thing is in python, easily extensible to build things from the type description: generators, fuzzers, advanced specific parsers (need only one field and want the control fields' positions and sizes), wireshark plugins, Postgres extensions...

I really like what they're doing there. Might be one of the low-effort (for the user!) lead bullets for safer software.


I know, I was just using it as kind of an example of what I want and of a similar problem. Maybe not the right example.


I think I'm missing something, but rust's `serde`?


Is this a known theoretical CS problem?


Any other lib that also generates Encoder/Writer code?


The Construct library for Python can do it:

https://construct.readthedocs.io/en/latest/intro.html#exampl...

I’ve long searched for something better than Construct, but so far I have yet to find even an equal.


It can, but it can get incredibly slow for large formats. I was using it to reverse engineer some binary game formats but the parser would take a couple of minutes to complete. I rewrote it using struct and that time dropped to a few seconds. Useful for probing an unknown format, but I prefer the 010 editor since it’s more interactive.
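
For a sense of what the `struct`-based version of such a parser can look like, here is a minimal sketch. The record layout is made up for the example and has nothing to do with the game formats mentioned; the point is that one precompiled `struct.Struct` plus `iter_unpack` reads a whole array of fixed-size records with very little per-field Python overhead.

```python
import struct

# Hypothetical record layout (an assumption for this sketch):
# u16 id, u16 flags, f32 x, f32 y, little-endian, no padding.
RECORD = struct.Struct("<HHff")

def parse_records(buf: bytes) -> list:
    """Parse a buffer holding back-to-back fixed-size records.

    iter_unpack requires len(buf) to be a multiple of RECORD.size.
    """
    return [
        {"id": rec_id, "flags": flags, "x": x, "y": y}
        for rec_id, flags, x, y in RECORD.iter_unpack(buf)
    ]

blob = RECORD.pack(1, 0, 0.5, 2.0) + RECORD.pack(2, 3, -1.0, 4.0)
for record in parse_records(blob):
    print(record)
```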


Ooh, exciting! I built a parser [1] for AIS messages [2], a quirky ship-to-ship protocol. My lower-level stuff always felt clumsy to me. I'll have to see if this cleans it up.

[1] https://github.com/wpietri/simpleais [2] https://gpsd.gitlab.io/gpsd/AIVDM.html


BinData (https://github.com/dmendel/bindata) is a Ruby gem for this, basically using a DSL in Ruby to declaratively define binary data formats that can be both read and written.


I wrote a sort-of adjacent library for Go at one point. I’m a bit stuck trying to figure out exactly what to do on 2.0 but it has a lot of Kaitai like features including an expression language for transforming things (on master version) and it supports writing structures out.

https://github.com/go-restruct/restruct


I have been working on Deku: a declarative binary reading and writing: bit-level, symmetric, serialization/deserialization library. https://github.com/sharksforarms/deku


And still there is no generator that creates efficient, partially streamed readers/writers for high-performance protocols or resource-constrained environments (as few pre-allocations as possible, zero-copy concepts, streamed reading/writing, good inlining possibilities, ...)

100% fixed formats with no self-reference, or for example checksums that sit in front of the checksummed data (no streaming possible)...
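
A short Python sketch of why checksum placement matters for streaming. The framing (a bare CRC-32 stored as a big-endian u32) and the function names are assumptions made up for illustration. With the checksum at the end, each chunk can be written as soon as it arrives; with the checksum in front, the writer has to hold the whole body before it can emit the first byte.

```python
import io
import struct
import zlib

def write_checksum_last(chunks, out) -> None:
    """Streams: each chunk is written immediately, CRC is updated incrementally."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)
        out.write(chunk)
    out.write(struct.pack(">I", crc))

def write_checksum_first(chunks, out) -> None:
    """Cannot stream: the full body must be buffered to know the CRC up front."""
    body = b"".join(chunks)
    out.write(struct.pack(">I", zlib.crc32(body)))
    out.write(body)

streamed, buffered = io.BytesIO(), io.BytesIO()
write_checksum_last(iter([b"ab", b"cd"]), streamed)
write_checksum_first(iter([b"ab", b"cd"]), buffered)
print(streamed.getvalue().hex(), buffered.getvalue().hex())
```

If the output is seekable, a writer could instead reserve four bytes and patch them afterwards, but that does not help on a socket or pipe.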


Recordflux?


I guess the submission title is incorrect then:

>Kaitai: Describe the structure of data, not how you read or write it


Yep this is why I dropped it.

I started working on an alternative that supported writing but didn't follow through since I didn't think many people were interested in Kaitai to begin with.


Wow - thanks for the heads-up


I love Kaitai, and use it extensively (outside my day job) for exploring binary file formats of things like DOS games and quilting patterns. About a year ago I wrote a blog entry about using it for scientific data: https://matthewturk.github.io/post/kaitai-struct-scientific-... .

Recently I learned about some pretty cool work using it with respect to high-energy physics:

https://osf.io/2sner/


i found the Kaitai toolset (IDE, compiler) to be useful in parsing or "deserialization" for proprietary financial protocols. if you are just reading data (especially in an environment with clients in different languages) it's a strong recommend, however lack of support for "serialization" means that you will still need to roll your own encoders. ultimately we created our own tooling/DSL for encoding as well as generating KSY files to generate clients as part of our builds

edit: i forgot to add - there is an issue for serialization on kaitai github repo for some time [1], with some interesting discussion around the implementation challenges

1. https://github.com/kaitai-io/kaitai_struct/issues/27


This looks interesting, but is there an alternative for expressing the format? YAML is just plain awful.


I like the flatbuffer syntax as an IDL. It should be fairly trivial to write a tool to generate ksy

https://adsharma.github.io/flattools-11222020.html


In ID3 tags there's such a thing as unsynchronization: the MP3 syncword is 11 bits set to one, so for the tag data not to be mistaken for an MP3 frame it must not contain a similar sequence of bits. The solution was to replace every 0xFF in tag data with 0xFF 0x00. Or not to replace, as this mistake may only be made by old players that do not understand ID3. So there's a special setting for this that may occur in two places: in the whole tag or in an individual frame within a tag.

The logic itself is simple: as you read data byte-by-byte you need to check if the previous byte was 0xFF and the frame or tag is marked as unsynchronized. Yet it's not that simple to describe this declaratively. I wonder if Kaitai can actually do this. From what I see Kaitai does have at least part of ID3 described, but it doesn't seem to actually do unsynchronization, as far as I can tell from the code.
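
The imperative version of that logic, as a minimal Python sketch. It follows the simplified rule as described in this comment (every 0xFF becomes 0xFF 0x00); the actual ID3v2 specification is more selective about which 0xFF bytes get padded, so treat this as an illustration rather than a conforming implementation. The function names are made up.

```python
def unsynchronize(data: bytes) -> bytes:
    """Writer side (simplified rule from the comment): pad every 0xFF with 0x00."""
    return data.replace(b"\xff", b"\xff\x00")

def resynchronize(data: bytes) -> bytes:
    """Reader side: go byte by byte and drop a 0x00 that directly follows 0xFF."""
    out = bytearray()
    prev_was_ff = False
    for byte in data:
        if prev_was_ff and byte == 0x00:
            prev_was_ff = False  # padding byte: skip it
            continue
        out.append(byte)
        prev_was_ff = byte == 0xFF
    return bytes(out)

raw = b"\x01\xff\xe0\xff\x00\xff"
encoded = unsynchronize(raw)
print(encoded.hex())
print(resynchronize(encoded) == raw)
```

The reader needs one bit of state (`prev_was_ff`) carried across bytes, which is the part that is awkward to express in a purely declarative field list.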


The first thing that comes to my mind in this context is that "kaitai" means "want to write" (書いたい) in Japanese. It also has many other meanings: https://jisho.org/search/kaitai . Maybe the authors had disassembly (解体) in mind.


Just a correction to point away from want to write. It's 書きたい (kakitai) for want to write (and it's probably not want to buy 買いたい).


Thanks for catching my blind spot there. I think what happened was that I used the -te form (書いて) and stripped off -te, but didn't conjugate the verb properly starting from 書く.


I note that the Kaitai compiler is written in Scala and built using sbt, both of which are unfortunately not bootstrappable.

https://bootstrappable.org/projects/jvm-languages.html


Can Kaitai Struct store its metadata in a binary form that's parseable with Kaitai Struct?

Seems like there's an obvious bootstrapping task there that would then make the custom .ksy format irrelevant.


Is there support for C in the pipeline? That would be nice, because lots of binary data manipulation is done in C even now, and it could use some quality-of-life updates.


I recently started making a (hopefully) version-independent blender converter using Kaitai:

https://github.com/bjorn-ali-goransson/wz-pie-converter

(It's for 3D models for the FOSS game Warzone 2100)


I can see this being hugely useful: being able to generate code to read a wider range of formats instead of having to include/embed dozens of 3rd-party libs to read the files. It could be a strong tool for commercial product development where there are legal or technical limitations on the use of 3rd-party code.


Guess it is similar to Apache Daffodil.


No support for sum types? I didn't see even simple unions covered.


https://adsharma.github.io/flattools-11222020.html

supports unions and 4 popular type safe languages. The idea is that you'd write decorators in those languages to implement functionality similar to ksy or write a template to generate ksy and reuse Kaitai tool chain.


Do you have any use case for these?


This is very interesting, but just curious, when would you use something like this versus, say, protobufs?


Protocol Buffers and Kaitai structs solve different problems, although they both deal with serialized data. With a protobuf, you don't really care about how the data that you're serializing into the binary buffer is represented in the binary format. All you care about is that your data can be serialized and deserialized. Conversely, Kaitai allows you to control and specify the representation of the serialized data. This allows you to specify arbitrary formats (like image formats, for example).

Essentially you start with the serialization format instead of starting with the deserialized data (or like how the title of this submission says, "describe the structure of data"). As such, you can somewhat describe protobufs [0] using Kaitai structs, but the converse is not necessarily true.

There's actually a section under their FAQ [1] with a more in-depth response to this.

[0] https://formats.kaitai.io/google_protobuf/index.html

[1] https://doc.kaitai.io/faq.html#vs-protobuf


Kaitai Struct can read any ol' random data format while protobufs (I assume) are only able to read/write a specific protocol.

I was playing with it a while back along with wasm and got it to decode all the individual opcodes (along with the rest of the file) but it turned out to be really, really slow in the generated python version. C++ probably has much better performance but I haven't actually tested the difference.


Last time I looked at it, it generated sub-objects as pointers allocated with new, which was a bit meh.


You can use it to reverse engineer specifications for arbitrary binary data.


If I understand it correctly, Kaitai generates the reader for you as well, while protobuf generates only the data container. Kaitai is more a sort of protobuf + grpc for reading data.


Big difference is Kaitai is designed to parse existing binary formats like PNG, JPEG, MIDI, WAV files.

Things like Protobuf can serialize/deserialize any data, but it's very opinionated about how to do this. You won't convince it to work with existing file formats.
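
To illustrate what "parsing an existing format" means in practice, here is a hand-written Python sketch that reads the start of a PNG file: the 8-byte signature followed by the IHDR chunk (length, type, 13 bytes of data, CRC-32 over type plus data). This is the kind of fixed, externally defined layout that a format description would generate a reader for; the function name and returned dict are just choices made for this example.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# IHDR data: width, height (u32 big-endian), then bit depth, color type,
# compression method, filter method, interlace method (one byte each).
IHDR = struct.Struct(">IIBBBBB")

def read_png_header(data: bytes) -> dict:
    """Validate the signature and first chunk, return the basic image properties."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    length, chunk_type = struct.unpack_from(">I4s", data, 8)
    if chunk_type != b"IHDR" or length != IHDR.size:
        raise ValueError("first chunk is not a valid IHDR")
    body = data[16:16 + length]
    (crc,) = struct.unpack_from(">I", data, 16 + length)
    if zlib.crc32(chunk_type + body) != crc:
        raise ValueError("IHDR checksum mismatch")
    width, height, bit_depth, color_type, _comp, _filt, _interlace = IHDR.unpack(body)
    return {"width": width, "height": height,
            "bit_depth": bit_depth, "color_type": color_type}
```

Nothing about the byte layout is up to the programmer here, which is the opposite of the Protobuf situation where the library owns the wire format.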


Protobufs generate readers/writers for various languages (i.e. serialization/deserialization).

gRPC is an RPC framework that uses Protobufs.


If you try to speak a protocol which is not yours.


Seems like a bummer that the stream interfaces are blocking for C++ and java :/


The industry rediscovers DEC RMS?


This is a use-case for which a DSL is well-suited. They would be wise to abandon YAML.


I think that I agree with you, and I'm obviously fond of clever DSLs cough mgmt config cough, but it's not clear what specifics you had in mind here. Share away if you're interested!


YAML has far too many features and doesn't let you structure data well for this use-case. There's no concise way to provide typings for its fields, for instance - note how every field requires two lines in the examples, far too verbose.

I don't really know what to tell you other than that YAML is not well-suited to this domain, and designing a domain-specific language which meets the particular needs of the system, and only just, would be better. You could base it off of C structs or Rust types or s-expressions or whatever else, the exact syntax isn't important so long as it allows you to concisely and precisely specify the semantics of the tool.


If it was something that looked like YAML, but wasn't, that'd be fine. YAML in its simplest form is a neat idea. Indentation is simple and obvious. Same as it doesn't matter that TOML looks a heck of a lot like INI. But using YAML itself is terrible, because YAML is a colossally bloated language that does crazy things at the wrong points and has a million syntaxes that step on each others' toes.



