Kaitai: Describe the structure of data, not how you read or write it

carom · on Dec 5, 2020

I do love kaitai and recently contributed a grammar, but both the title and the copy talk about writing.

> Reading and writing binary formats is hard... Kaitai Struct tries to make this job easier...

Kaitai cannot write data back out. [1] This is a major limitation for me. It would be nice to use it as a mutation engine for fuzzing, but without being able to write it back out, it is mostly just beneficial for analysis.

1. https://doc.kaitai.io/faq.html#writing

kevinherron · on Dec 5, 2020

Yikes, huge limitation. Guess I won't be looking into this any further.

imtringued · on Dec 5, 2020

It would be a CS breakthrough if it could... You're asking a lot...

nextaccountic · on Dec 6, 2020

This breakthrough happened already. This is called bidirectional or invertible parsing. See it discussed here https://news.ycombinator.com/item?id=16392654

And this paper https://dl.acm.org/doi/10.1145/1863523.1863525 "Invertible syntax descriptions: unifying parsing and pretty printing"

And this Haskell library https://hackage.haskell.org/package/roundtrip among others

jacb · on Dec 5, 2020

I might be missing something - why would this be a breakthrough? It sounds complicated to generate the interfaces, sure, but is there a theoretical problem blocking this, or just practical?

convolvatron · on Dec 5, 2020

I've looked at this problem quite a bit over the years...I agree with you completely. there isn't anything fundamental here, just the normal cultural adoption issues, usability, etc. There may be some compilation/complexity issues around formats with variable length fields and self-description, but certainly less problematic than general purpose programming.

I really wish though that there were more traction here, as I really believe that we should be quite prepared to deal with bits and not protobufs by default. nothing wrong with protobufs for quite a number of uses. I just don't know why people are so afraid of and/or biased against bit strings.

joshspankit · on Dec 5, 2020

Through this thread I really feel like I’m missing something.

Are we not talking about writing binary files that conform to the spec?

Like, in the case of a GIF: simply writing valid garbage data should produce a file that presents as a valid GIF with noise for the image. Similarly: reading the file through the parser and writing it out unmodified should create an identical file (assuming no steganography).

Right?
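A minimal sketch of that round-trip property, using Python's standard `struct` module rather than Kaitai. It covers only the first 13 bytes of a GIF (signature plus logical screen descriptor); the names `GIF_PREFIX`, `FIELDS`, `read_prefix` and `write_prefix` are made up for illustration.

```python
import struct

# GIF signature/version (6 bytes) followed by the logical screen
# descriptor (7 bytes). "<" means little-endian with no padding.
GIF_PREFIX = struct.Struct("<6sHHBBB")
FIELDS = ("magic", "width", "height", "flags",
          "bg_color_index", "pixel_aspect_ratio")


def read_prefix(data: bytes) -> dict:
    """Parse the first 13 bytes of a GIF into named fields."""
    return dict(zip(FIELDS, GIF_PREFIX.unpack_from(data, 0)))


def write_prefix(fields: dict) -> bytes:
    """Serialize the same fields back, in the same order and widths."""
    return GIF_PREFIX.pack(*(fields[name] for name in FIELDS))


# Hand-built sample: 10x20 canvas, global color table flag set.
sample = b"GIF89a" + bytes([0x0A, 0x00, 0x14, 0x00, 0x80, 0x00, 0x00])
parsed = read_prefix(sample)

# Reading and writing back unmodified yields identical bytes.
assert write_prefix(parsed) == sample
```

Because one format string drives both directions, the writer cannot drift from the reader. Real formats get harder once fields depend on each other (lengths, offsets, checksums), which this sketch does not attempt.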

IceDane · on Dec 6, 2020

You're right. The person saying this would be a breakthrough doesn't understand what this is doing.

There are already similarly declarative tools which can accomplish this. Haskell has binary parsing libraries which work similarly and give you both reading and writing capabilities.

spullara · on Dec 5, 2020

I honestly don't get it either. The inverse of the read spec is the write spec. My guess, having not dug deeply, is that they don't distinguish between required and optional fields. That said, they should still be able to write what they have based on the read spec, though the result could still be an invalid file.

armsforsharks · on Dec 5, 2020

It's a really interesting idea! I'm also surprised there hasn't been much traction here. I've started a Rust library for this: https://github.com/sharksforarms/deku

It's a declarative bit-level symmetrical reader writer generator library.

afranchuk · on Dec 6, 2020

Yeah I was also confused. I wrote a bidirectional parser/writer layer for yaml in a haskell program I had at work. The yaml structure mappings were all declarative in the code and even allowed documentation for the structures to be printed out. It's not that hard once you define the primitive bidirectional (higher-order) mapping function to go from a `Configurable a` to a `Configurable b`, the rest kind of unfolds from there.
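A rough Python sketch of that idea, applied to bytes instead of YAML. This is not the Haskell code described above; `Codec`, `fixed`, `mapped` and `record` are hypothetical names, and `mapped` plays the role of the higher-order mapping function.

```python
import struct
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class Codec:
    """A pair of inverse functions over the same wire layout."""
    parse: Callable[[bytes, int], Tuple[Any, int]]  # (data, offset) -> (value, new offset)
    build: Callable[[Any], bytes]


def fixed(fmt: str) -> Codec:
    """Primitive codec for one struct-format field, e.g. '<H'."""
    s = struct.Struct(fmt)
    return Codec(
        parse=lambda data, off: (s.unpack_from(data, off)[0], off + s.size),
        build=lambda value: s.pack(value),
    )


def mapped(inner: Codec, decode: Callable, encode: Callable) -> Codec:
    """Lift an (a -> b, b -> a) pair over a codec of a, giving a codec of b."""
    def parse(data, off):
        value, off = inner.parse(data, off)
        return decode(value), off
    return Codec(parse=parse, build=lambda value: inner.build(encode(value)))


def record(*fields: Tuple[str, Codec]) -> Codec:
    """Sequence named codecs; the field list drives both directions."""
    def parse(data, off):
        out = {}
        for name, codec in fields:
            out[name], off = codec.parse(data, off)
        return out, off

    def build(value):
        return b"".join(codec.build(value[name]) for name, codec in fields)

    return Codec(parse=parse, build=build)


u8 = fixed("<B")
u16le = fixed("<H")
flag = mapped(u8, decode=bool, encode=int)  # a boolean stored in one byte

header = record(("version", u8), ("compressed", flag), ("length", u16le))

raw = bytes([2, 1, 0x34, 0x12])
value, end = header.parse(raw, 0)
assert header.build(value) == raw
```

Once the primitives and `mapped` exist, every composite codec gets its writer for free, which matches the "the rest kind of unfolds from there" observation.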

HelloNurse · on Dec 6, 2020

Binary file formats can be vastly more convoluted than YAML. For example, consider roundtripping ZIP archives or PDF documents, or both at the same time (see https://www.alchemistowl.org/pocorgtfo/).

IceDane · on Dec 6, 2020

How exactly would it be a cs breakthrough? Are you sure you understand what this does?

kevinherron · on Dec 5, 2020

Are we talking about different things?

ASN.1 does more or less what I was hoping Kaitai can do. Where's the breakthrough?

I just want to be able to describe network protocols in some "language" and generate code that can serialize/deserialize it.

comex · on Dec 5, 2020

Kaitai is designed to describe arbitrary, preexisting binary formats. You can’t do that with ASN.1.

touisteur · on Dec 6, 2020

Err. I'm not sure what asn.1 is missing for this? I've seen lots of people use asn.1 exactly for this (i.e. writing a grammar for an arbitrary pre-existing binary format not readily described in asn.1).

ithkuil · on Dec 6, 2020

asn.1 distinguishes between schema and encoding; there are many binary encodings and you can technically devise a custom one that would let you describe the high level structure with an asn.1 grammar and then lay out the actual bits with a custom encoding format so that it matches the pre-existing format you're writing the new serde for. This may work as many formats have this leveled approach. E.g. the lowest level of the spec may tell something about how to encode integers (all integers are 32-bit big endian, or varlen encoded ...), sequences (length prefixed, or terminated by a sentinel).
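To make the layering concrete, here is a small Python sketch (not ASN.1 tooling) in which the same list of integers is laid out four ways by swapping the integer rule and the sequence rule independently. All function names are hypothetical, and the varlen scheme is assumed to be unsigned LEB128.

```python
import struct
from typing import Callable, List, Tuple


# --- integer layer: two interchangeable encodings ---
def encode_u32be(n: int) -> bytes:
    return struct.pack(">I", n)


def decode_u32be(data: bytes, off: int) -> Tuple[int, int]:
    return struct.unpack_from(">I", data, off)[0], off + 4


def encode_varint(n: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, high bit = more follows."""
    if n < 0:
        raise ValueError("unsigned values only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, off: int) -> Tuple[int, int]:
    result = 0
    shift = 0
    while True:
        byte = data[off]
        off += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, off
        shift += 7


# --- sequence layer: parameterized by the integer layer ---
def encode_length_prefixed(items: List[int], enc: Callable) -> bytes:
    return enc(len(items)) + b"".join(enc(i) for i in items)


def decode_length_prefixed(data: bytes, off: int, dec: Callable):
    count, off = dec(data, off)
    items = []
    for _ in range(count):
        item, off = dec(data, off)
        items.append(item)
    return items, off


def encode_sentinel(items: List[int], enc: Callable, sentinel: int = 0) -> bytes:
    if sentinel in items:
        raise ValueError("sentinel value cannot appear in the data")
    return b"".join(enc(i) for i in items) + enc(sentinel)


def decode_sentinel(data: bytes, off: int, dec: Callable, sentinel: int = 0):
    items = []
    while True:
        item, off = dec(data, off)
        if item == sentinel:
            return items, off
        items.append(item)


values = [1, 300, 70000]
wire = encode_length_prefixed(values, encode_varint)
assert decode_length_prefixed(wire, 0, decode_varint) == (values, len(wire))
```

The schema here is just "a list of integers"; everything about the bytes lives in the two lower layers, which is the separation being described.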

Any chances you have some reference to what you saw?

touisteur · on Dec 6, 2020

Right I think we were working with PER (tagless) to match a complex multi-layer protocol. Can't share the example but let's say asn.1 helped generate saner code than the handrolled one AND allowed other languages to decode...

I understand what you're saying though. Memories of using this were... unpleasant. I think the 'best' alternative for me would be RecordFlux. Ada-like syntax, generation of AoRTE-provable SPARK code, and recently the expressivity of the tool has increased. And the whole thing is in python, easily extensible to build things from the type description: generators, fuzzers, advanced specific parsers (need only one field and want the control fields' positions and sizes), wireshark plugins, Postgres extensions...

I really like what they're doing there. Might be one of the low-effort (for the user!) lead bullets for safer software.

kevinherron · on Dec 5, 2020

I know, I was just using it as kind of an example of what I want and of a similar problem. Maybe not the right example.

OJFord · on Dec 6, 2020

I think I'm missing something, but rust's `serde`?

quiescant_dodo · on Dec 5, 2020

Is this a known theoretical CS problem?

floppy123 · on Dec 5, 2020

Any other lib that also generates Encoder/Writer code?

comex · on Dec 5, 2020

The Construct library for Python can do it:

https://construct.readthedocs.io/en/latest/intro.html#exampl...

I’ve long searched for something better than Construct, but so far I have yet to find even an equal.

attheicearcade · on Dec 5, 2020

It can, but it can get incredibly slow for large formats. I was using it to reverse engineer some binary game formats but the parser would take a couple of minutes to complete. I rewrote it using struct and that time dropped to a few seconds. Useful for probing an unknown format, but I prefer the 010 editor since it’s more interactive.

wpietri · on Dec 5, 2020

Ooh, exciting! I built a parser [1] for AIS messages [2], a quirky ship-to-ship protocol. My lower-level stuff always felt clumsy to me. I'll have to see if this cleans it up.

[1] https://github.com/wpietri/simpleais

[2] https://gpsd.gitlab.io/gpsd/AIVDM.html

tprynn · on Dec 6, 2020

BinData (https://github.com/dmendel/bindata) is a Ruby gem for this, basically using a DSL in Ruby to declaratively define binary data formats that can be both read and written.

jchw · on Dec 6, 2020

I wrote a sort-of adjacent library for Go at one point. I’m a bit stuck trying to figure out exactly what to do on 2.0 but it has a lot of Kaitai like features including an expression language for transforming things (on master version) and it supports writing structures out.

https://github.com/go-restruct/restruct

armsforsharks · on Dec 5, 2020

I have been working on Deku: a declarative binary reading and writing: bit-level, symmetric, serialization/deserialization library. https://github.com/sharksforarms/deku

floppy123 · on Dec 6, 2020

and still there is no generator that creates efficient and partially streamed readers/writers for high performance protocols or resource-constrained environments (as few preallocations as possible, zero copy concepts,..., streamed reading/writing, good inline possibility, ...)

100% fixed formats with no self-reference or, for example, checksums that sit in front of the checksummed data (no streaming possible)...
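A short Python sketch of why a leading checksum blocks streaming on the write side. Both record layouts are invented for illustration (CRC-32 via `zlib`, big-endian fields), and the output is assumed to be a non-seekable stream, so going back to patch a placeholder is not an option.

```python
import io
import struct
import zlib
from typing import BinaryIO, Iterable


def write_record_crc_first(out: BinaryIO, chunks: Iterable[bytes]) -> None:
    """Hypothetical layout: crc32 (u4be) | length (u4be) | payload.

    The CRC precedes the bytes it covers, so every chunk has to be
    buffered before the first byte of the record can be emitted.
    """
    payload = b"".join(chunks)  # full buffering is unavoidable here
    out.write(struct.pack(">II", zlib.crc32(payload), len(payload)))
    out.write(payload)


def write_record_crc_last(out: BinaryIO, chunks: Iterable[bytes]) -> None:
    """Hypothetical layout: payload | crc32 (u4be).

    Chunks go out as they arrive; only a running CRC is kept in memory.
    """
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)  # update the running checksum
        out.write(chunk)
    out.write(struct.pack(">I", crc))


first = io.BytesIO()
write_record_crc_first(first, [b"hello ", b"world"])

last = io.BytesIO()
write_record_crc_last(last, (c for c in [b"hello ", b"world"]))
```

The trailing-checksum writer accepts a generator and never holds more than one chunk, while the leading-checksum writer needs memory proportional to the payload. A generator that wants to emit streaming code has to detect which of the two cases a format falls into.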

touisteur · on Dec 6, 2020

Recordflux?

layoutIfNeeded · on Dec 6, 2020

I guess the submission title is incorrect then:

>Kaitai: Describe the structure of data, not how you read or write it

junon · on Dec 7, 2020

Yep this is why I dropped it.

I started working on an alternative that supported writing but didn't follow through since I didn't think many people were interested in Kaitai to begin with.

iab · on Dec 5, 2020

Wow - thanks for the heads-up

mturk · on Dec 5, 2020

I love Kaitai, and use it extensively (outside my day job) for exploring binary file formats of things like DOS games and quilting patterns. About a year ago I wrote a blog entry about using it for scientific data: https://matthewturk.github.io/post/kaitai-struct-scientific-... .

Recently I learned about some pretty cool work using it with respect to high-energy physics:

https://osf.io/2sner/

whiskypeters · on Dec 5, 2020

i found the Kaitai toolset (IDE, compiler) to be useful in parsing or "deserialization" for proprietary financial protocols. if you are just reading data (especially in an environment with clients in different languages) it's a strong recommend, however lack of support for "serialization" means that you will still need to roll your own encoders. ultimately we created our own tooling/DSL for encoding as well as generating KSY files to generate clients as part of our builds

edit: i forgot to add - there is an issue for serialization on kaitai github repo for some time [1], with some interesting discussion around the implementation challenges

1. https://github.com/kaitai-io/kaitai_struct/issues/27

beders · on Dec 6, 2020

This looks interesting, but is there an alternative for expressing the format? YAML is just plain awful.

adsharma · on Dec 7, 2020

I like the flatbuffer syntax as an IDL. It should be fairly trivial to write a tool to generate ksy

https://adsharma.github.io/flattools-11222020.html

Mikhail_Edoshin · on Dec 6, 2020

In ID3 tags there's such a thing as unsynchronization: the MP3 syncword is 11 bits set to one so for the tag data not to be mistaken for an MP3 frame it must not have a similar sequence of bits. The solution was to replace every 0xFF in tag data with 0xFF 0x00. Or not to replace, as this mistake may only be made by old players that do not understand ID3. So there's a special setting for this that may occur in two places: in the whole tag or in an individual frame within a tag.

The logic itself is simple: as you read data byte-by-byte you need to check if the previous byte was 0xFF and the frame or tag is marked as unsynchronized. Yet it's not that simple to describe this declaratively. I wonder if Kaitai can actually do this. From what I see Kaitai does have at least part of ID3 described, but it doesn't seem to actually do unsynchronization, as far as I can tell from the code.
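For reference, the imperative version of that logic is only a few lines of Python. This sketch follows the simplified scheme as described above (a 0x00 after every 0xFF); as far as I know the ID3v2 spec only requires stuffing in specific cases, and the sketch assumes the caller has already checked whichever flag marks the tag or frame as unsynchronized. The function names are hypothetical.

```python
def unsynchronize(data: bytes) -> bytes:
    """Insert 0x00 after every 0xFF so no 11-bit run of ones survives."""
    return data.replace(b"\xff", b"\xff\x00")


def resynchronize(data: bytes) -> bytes:
    """Undo it byte by byte: drop a 0x00 that directly follows a 0xFF."""
    out = bytearray()
    prev_was_ff = False
    for byte in data:
        if prev_was_ff and byte == 0x00:
            prev_was_ff = False  # the stuffed byte is consumed, not kept
            continue
        out.append(byte)
        prev_was_ff = byte == 0xFF
    return bytes(out)


original = b"\x01\xff\xe0\xff\x00\xff\xff\x02"
stuffed = unsynchronize(original)
assert resynchronize(stuffed) == original
```

The decoder carries one bit of state (`prev_was_ff`) across bytes, which is the part that is awkward to express as a purely positional field description.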

nayuki · on Dec 5, 2020

The first thing that comes to my mind in this context is that "kaitai" means "want to write" (書いたい) in Japanese. It also has many other meanings: https://jisho.org/search/kaitai . Maybe the authors had disassembly (解体) in mind.

denial · on Dec 6, 2020

Just a correction to point away from want to write. It's 書きたい (kakitai) for want to write (and it's probably not want to buy 買いたい).

nayuki · on Dec 7, 2020

Thanks for catching my blind spot there. I think what happened was that I used the -te form (書いて) and stripped off -te, but didn't conjugate the verb properly starting from 書く.

pabs3 · on Dec 6, 2020

I note that the Kaitai compiler is written in Scala and built using sbt, both of which are unfortunately not bootstrappable.

https://bootstrappable.org/projects/jvm-languages.html

aappleby · on Dec 5, 2020

Can Kaitai Struct store its metadata in a binary form that's parseable with Kaitai Struct?

Seems like there's an obvious bootstrapping task there that would then make the custom .ksy format irrelevant.

billfruit · on Dec 6, 2020

Is there support for C in the pipeline? That would be nice, because lots of binary data manipulation is done in C even now, and it could use some quality-of-life updates.

aliswe · on Dec 7, 2020

I recently started making a (hopefully) version-independent blender converter using Kaitai:

https://github.com/bjorn-ali-goransson/wz-pie-converter

(It's for 3d models for the FOSS game Warzone 2100)

zoom6628 · on Dec 6, 2020

I can see this being hugely useful for generating code to read a wider range of formats instead of having to include/embed dozens of 3rd party libs to read the files. It could be a strong tool for commercial product development with legal or technical limitations on the use of 3rd party code.

atbpaca · on Dec 5, 2020

Guess it is similar to Apache Daffodil.

karavelov · on Dec 5, 2020

No support for sum types? I didn't see even simple unions covered.

adsharma · on Dec 7, 2020

https://adsharma.github.io/flattools-11222020.html

supports unions and 4 popular type safe languages. The idea is that you'd write decorators in those languages to implement functionality similar to ksy or write a template to generate ksy and reuse Kaitai tool chain.

ta988 · on Dec 6, 2020

Do you have any use case for these?

nullspace · on Dec 5, 2020

This is very interesting, but just curious, when would you use something like this versus, say, protobufs?

karlding · on Dec 5, 2020

Protocol Buffers and Kaitai structs solve different problems, although they both deal with serialized data. With a protobuf, you don't really care about how the data that you're serializing into the binary buffer is represented in the binary format. All you care about is that your data can be serialized and deserialized. Conversely, Kaitai allows you to control and specify the representation of the serialized data. This allows you to specify arbitrary formats (like image formats, for example).

Essentially you start with the serialization format instead of starting with the deserialized data (or like how the title of this submission says, "describe the structure of data"). As such, you can somewhat describe protobufs [0] using Kaitai structs, but the converse is not necessarily true.

There's actually a section under their FAQ [1] with a more in-depth response to this.

[0] https://formats.kaitai.io/google_protobuf/index.html

[1] https://doc.kaitai.io/faq.html#vs-protobuf

UncleEntity · on Dec 5, 2020

Kaitai Struct can read any 'ol random data format while protobufs (I assume) are only able to read/write a specific protocol.

I was playing with it a while back along with wasm and got it to decode all the individual opcodes (along with the rest of the file) but it turned out to be really, really slow in the generated python version. C++ probably has much better performance but I haven't actually tested the difference.

tubs · on Dec 6, 2020

Last time I looked at it, it generated sub-objects as pointers allocated with new, which was a bit meh.

imtringued · on Dec 5, 2020

You can use it to reverse engineer specifications for arbitrary binary data.

vander_elst · on Dec 5, 2020

If I understand it correctly, Kaitai generates the reader for you as well, while protobuf generates only the data container. Kaitai is more a sort of protobuf + grpc for reading data.

jackjeff · on Dec 5, 2020

Big difference is Kaitai is designed to parse existing binary formats like PNG, JPEG, MIDI, WAV files.

Things like Protobuf can serialize/deserialize any data but it’s very opinionated about how to do this. You won’t convince it to work with existing file formats.

lima · on Dec 5, 2020

Protobufs generate readers/writers for various languages (i.e. serialization/deserialization).

gRPC is an RPC framework that uses Protobufs.

uerg · on Dec 5, 2020

If you try to speak a protocol which is not yours.

rockwotj · on Dec 6, 2020

Seems like a bummer that the stream interfaces are blocking for C++ and java :/

nickdothutton · on Dec 7, 2020

The industry rediscovers DEC RMS?

ddevault · on Dec 5, 2020

This is a use-case for which a DSL is well-suited. They would be wise to abandon YAML.

purpleidea · on Dec 6, 2020

I think that I agree with you, and I'm obviously fond of clever DSLs *cough* mgmt config *cough*, but it's not clear what specifics you had in mind here. Share away if you're interested!

ddevault · on Dec 6, 2020

YAML has far too many features and doesn't let you structure data well for this use-case. There's no concise way to provide typings for its fields, for instance - note how every field requires two lines in the examples, far too verbose.

I don't really know what to tell you other than that YAML is not well-suited to this domain, and designing a domain-specific language which meets the particular needs of the system, and only just, would be better. You could base it off of C structs or Rust types or s-expressions or whatever else, the exact syntax isn't important so long as it allows you to concisely and precisely specify the semantics of the tool.

pie_flavor · on Dec 6, 2020

If it was something that looked like YAML, but wasn't, that'd be fine. YAML in its simplest form is a neat idea. Indentation is simple and obvious. In the same way, it doesn't matter that TOML looks a heck of a lot like INI. But using YAML itself is terrible, because YAML is a colossally bloated language that does crazy things at the wrong points and has a million syntaxes that step on each other's toes.