Hacker News new | past | comments | ask | show | jobs | submit login
Interval Parsing Grammars for File Format Parsing (2023) [pdf] (acm.org)
63 points by vitplister 7 months ago | hide | past | favorite | 7 comments



I built a grammar to tackle these sorts of problems when I had trouble writing formal grammar notations for my binary data format. It's even got a syntax highlighter.

https://dogma-lang.org/

So far it's been able to describe 90% of what's out there. Some examples:

- 802.3 layer 2 Ethernet: https://github.com/kstenerud/dogma/blob/master/v1/examples/8...

- Microsoft ICO format: https://github.com/kstenerud/dogma/blob/master/v1/examples/i...

- Android Dex v39: https://github.com/kstenerud/dogma/blob/master/v1/examples/d...

- IPv4: https://github.com/kstenerud/dogma/blob/master/v1/examples/i...

- DNS query: https://github.com/kstenerud/dogma/blob/master/v1/examples/d...

- Microsoft Minidump: https://github.com/kstenerud/dogma/blob/master/v1/examples/m...

- Concise Binary Encoding: https://github.com/kstenerud/concise-encoding/blob/master/cb...

- Concise Text Encoding: https://github.com/kstenerud/concise-encoding/blob/master/ct...


The main feature of interval parsing appears to be that it can jump over content such that a later part in a file does not depend on knowing everything that comes before it. Has Dogma similar expressiveness?


Yes, the `offset` function does this by specifying a bit-offset to branch to. For example the ICO `dir_entry`, which is a directory list of icon resources in the file. https://github.com/kstenerud/dogma/blob/master/v1/examples/i... - It's using image_offset*8 because everything in an ICO file is a byte-offset (8 bits)

It's also needed to parse Minidump. For example https://github.com/kstenerud/dogma/blob/master/v1/examples/m... and https://github.com/kstenerud/dogma/blob/master/v1/examples/m...


Greg Morrisett is one of those people who is consistently involved in neat work. Typed x86, Cyclone, and SAFE architecture come to mind. It’s best to just all the papers of such people. I found a list of his:

https://www.cs.cornell.edu/~jgm/jgm.html


My searches for any existing publication of those code snippets didn't shake out, so I waited for the 5GB download of the docker .tar and pulled out the files that ended in .ipg. I'm very cognizant that's not the whole story, but that's what I had the energy to do for now. I really wanted to see the PDF one, since that's actually the heuristic I use for evaluating any such "I can describe binary files" framework because that file format is ... special

https://gist.github.com/mdaniel/cdf52de6a8aa8982d591da82b160...

  tar -xOf ipg-pldi-ae.tar e4a01dbc5b413d9709f0cf716cdb848725893b5f97e0870e09fd83e16839dfad/layer.tar \
  | tar -xvf - \
      home/opam/pldi-ae/IPG/spec/dns/ipg/dns.ipg \
      home/opam/pldi-ae/IPG/spec/elf/ipg/elf.ipg \
      home/opam/pldi-ae/IPG/spec/elf/ipg/readelf.ipg \
      home/opam/pldi-ae/IPG/spec/gif/ipg/gif.ipg \
      home/opam/pldi-ae/IPG/spec/ipv4/ipg/ipv4.ipg \
      home/opam/pldi-ae/IPG/spec/pe/ipg/pe.ipg \
      home/opam/pldi-ae/IPG/spec/zip/blackbox/unzip.ipg \
      home/opam/pldi-ae/IPG/spec/zip/ipg/zip.ipg
because Gist wouldn't let me use "/" in the filenames, I just replaced them with _ after killing the home/opam part; I left the rest so hopefully they'll show up in search results since pldi-ae and IPG are pretty distinct


BinaryLang (for Nim) has similar features[1]. I've written a very compact ELF parser with it[2]. Notice that the last struct has array elements that skip over content based on offsets specified in the header.

[1] https://github.com/sealmove/binarylang

[2] https://github.com/khaledh/elfdump/blob/master/elfparse.nim


Wow. The introduction itself blows my mind. I would've never thought of even trying to specify a formal grammar for binary file formats, let alone come up with an algorithm to handle context sensitive ones that arise in practice. Seems awesome, especially this bit:

> To the best of our knowledge, IPGs support all syntactic and parsing-based properties in common file formats and can reduce discrepancies between a file format specification and an implementation, as well as discrepancies between different implementations.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: