Construct-JS – A library for creating byte level data structures (github.com/francisrstokes)
163 points by FrancisStokes on March 18, 2019 | 50 comments



DataView [0] itself already has pretty useful methods for working with binary data. To encode/decode strings, use TextEncoder [1].

Both implementations are highly optimized and fast. DataView also supports 64-bit values using the new BigInt type [2].
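For instance, a minimal sketch of those built-ins (the values are made up for illustration):

  const buf = new ArrayBuffer(12);
  const view = new DataView(buf);
  view.setUint32(0, 0x04034b50, true);           // little-endian 32-bit write
  view.setBigUint64(4, 0x1122334455667788n);     // 64-bit write via BigInt
  const utf8 = new TextEncoder().encode('abc');  // string -> Uint8Array (UTF-8)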

Here [3] is a short example of how you can quickly build C-like structures in JavaScript.

[0] DataView: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

[1] TextEncoder: https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder

[2] BigInt: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

[3] C-like Structures: https://jsfiddle.net/5tz0hsjq/


construct-js uses DataView under the hood. I wrote construct-js to ease working with binary data by allowing you to name and reference sections, and to handle things like endianness easily in one place. With a DataView, I still need to calculate offsets into a structure manually, whereas construct-js has .getOffset(name) and .getDeepOffset(path). If performance is key, however, use the lowest level of course.
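For instance (the field names here are invented for illustration):

  const header = Struct('header')
    .field('signature', DoubleWord(0x04034b50))
    .field('version', DoubleWord(20));
  // with a raw DataView you would sum the field sizes yourself:
  header.getOffset('version'); // 4 -- byte offset of 'version' in the struct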


First off, great to see interest in binary data in JS :)

We looked into this problem years ago as part of our spreadsheet parser/writer library (open source version https://github.com/sheetjs/js-xlsx) and ultimately opted for lower-level functions that act on Buffers/ArrayBuffers/Arrays. Performance-wise, creating a new ArrayBuffer / Buffer for each field (which you are currently doing in https://github.com/francisrstokes/construct-js/blob/master/s...) is incredibly expensive. When we did performance tests years ago, allocating in 2KB blocks and manually orchestrating writes was nearly 10x faster than individual field-level allocations and concatenations both in the browser and in NodeJS.
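A rough sketch of the block-allocation approach (sizes and helper names are illustrative, not SheetJS's actual code):

  // one up-front 2KB block and manually orchestrated writes,
  // instead of one Buffer/ArrayBuffer allocation per field
  const block = new ArrayBuffer(2048);
  const view = new DataView(block);
  let offset = 0;
  const writeU32 = (value) => {
    view.setUint32(offset, value, true);
    offset += 4;
  };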

Ironically, the ZIP file example referenced in the README alludes to another pitfall in the approach: the DEFLATE algorithm used in compression requires unaligned bit writes. See section 5.5 of the current APPNOTE.TXT for more details: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT


Thanks! When it comes to performance I can understand choosing a more low-level approach. construct-js trades performance for declarativeness (though I suspect that I can do a lot of optimising under the hood).

As for DEFLATE, I'm working on an automatic bit-level structure at the moment which would allow for unaligned structures. Should be in the lib in a couple of days.
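For reference, unaligned bit writes boil down to something like this LSB-first bit-writer sketch (illustrative only, not the library's implementation):

  let acc = 0, nbits = 0;
  const out = [];
  function writeBits(value, count) {
    acc |= value << nbits;   // append bits LSB-first (assumes nbits + count <= 31)
    nbits += count;
    while (nbits >= 8) {     // emit whole bytes as they fill up
      out.push(acc & 0xff);
      acc >>>= 8;
      nbits -= 8;
    }
  }
  // (a final flush of any remaining bits is omitted for brevity)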


IIRC, when I last checked this (admittedly years ago), a single TypedArray object had 200 bytes of overhead - not counting the backing buffer for the array itself - due to the complexity of one backing buffer potentially being shared by multiple arrays, and so on. By comparison, a regular object had about a dozen bytes of overhead. Because of this, tiny typed arrays are almost always worse in terms of performance than objects or plain arrays. Especially if we are talking about "type stable" objects, that is: prototypical objects with a hidden class that does not change.

Things have probably improved in Buffer/DataView/TypedArray land thanks to the push for WASM, but the allocation overhead will still be a lot higher, and allocating them will involve a lot more work.


What's the point of such a "declarative" API?

  const localHeader = Struct('localHeader')
    .field('header', DoubleWord(0x04034b50))
    .field('sharedHeaderInfo', sharedHeaderInfo)
    .field('filename', filename);
There's already a construct for key-value pairs in JS; it's called an object:

  const localHeader = Struct('localHeader', {
    'header': DoubleWord(0x04034b50),
    'sharedHeaderInfo': sharedHeaderInfo,
    'filename': filename
  });


Although this is only a handful of lines [0], it's nice to design things well.

- By providing a 'builder' you add a clear API to the struct object's constructor. It doesn't make any more or fewer claims of capability than it needs to. By exposing the internal storage you break that level of abstraction.

- Even assuming that the JS object type _is_ ordered, the ADT of the Struct type may be subtly different to the ADT of the object. By coupling the two you may prevent future flexibility.

- To that point, the 'field' method _does_ do something with each invocation (forceEndianess) as it adds the field to the internal storage (see the sketch below). If you passed in a map object it would still need to traverse each key to do that work, which (knowing JS objects) might make it more complicated.
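A simplified sketch of that shape (everything below is invented for illustration, not construct-js's source, apart from the field/forceEndianess names):

  const forceEndianess = (component, endianess) => { /* flip bytes if needed */ };
  const struct = {
    fields: [],
    endianess: 'LE',
    field(name, component) {
      forceEndianess(component, this.endianess); // per-field work on each call
      this.fields.push([name, component]);       // ordered internal storage
      return this;                               // returning `this` enables chaining
    },
  };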

[0] https://github.com/francisrstokes/construct-js/blob/master/s...


You seem to assume that the passed-in object should also be used as the resulting object. This is false, as they can be decoupled.

In other words, it's still possible to pass in a separate js object instance containing the configuration.

Yet I'm not sure about the whole of your arguments, so I can't tell if it would actually make sense to wrap the default js builder pattern implementation (the javascript object initializer syntax aaaaaaaaaaab refers to).

However, I do find method builders often overused in APIs where a single builder method combined with builder classes would suffice and simplify.


Agreed, to an extent. But what if you want to extend the API to include the same field twice (for example)? The semantics of the input object would prevent that.

I'm reluctant to spend too much time arguing over a fine point. It could be done one way or the other. I was just addressing aaaaaaaaaaab's question.


Really well put!


Since order is important, the underlying storage is probably an array of K/V pairs. Object.entries on a dynamically generated object does not guarantee a particular order.


They could have used a Map (introduced in ES6 [1]) which maintains key insertion order. However, I'm not sure whether either the object or map interface would be significantly better or worse than the builder-style the author picked. It just doesn't seem to matter, IMO.

The interface that would be most flexible would probably be a list of alternating keys and values, e.g. ['fieldname1', datastruct(), 'fieldname2', datastruct(), ...], since then you could do all the list things to it.
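For example, a hypothetical Map-taking variant of the earlier Struct example might look like:

  const localHeader = Struct('localHeader', new Map([
    ['header', DoubleWord(0x04034b50)],
    ['sharedHeaderInfo', sharedHeaderInfo],
    ['filename', filename],
  ]));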

[1] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


The underlying structure is not the issue; it's the object literal that doesn't have its key order defined (in older standards), and therefore the interface as suggested would be broken.

In practice the order will be maintained in the major engines, but that's still not a good reason to rely on it.


I think the parent is suggesting taking a map instead of an object as the interface for constructing the Struct objects. Not using it as the underlying structure.


Map wouldn't really have brought a lot of clarity or control. You'd have to specify an array of array-pairs, which is noisy IMO, and then you wouldn't have the benefit of being able to programmatically/conditionally add fields.

I might consider a static `Struct.fromMap(name, mapObject)`, but as a complementary API.
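A hypothetical sketch of that complementary constructor, built on the existing chainable API:

  Struct.fromMap = (name, mapObject) => {
    let s = Struct(name);
    for (const [key, component] of mapObject) s = s.field(key, component);
    return s;
  };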


Please don't use "word" to mean 16 bits. In an era when machine words are generally 64 bits, we're not talking about an anachronism from the previous generation, but the one before that - in a language that is completely insulated from the actual machine architecture.

One thing I love about Rust is that it uses u16, u32, u64 etc for unsigned and i16, i32, i64 etc for signed, which is about perfect - clear, concise and future-proof. That would be perfect for this library.

https://en.wikipedia.org/wiki/Word_(computer_architecture)


>> One thing I love about Rust is that it uses u16, u32, u64 etc for unsigned and i16, i32, i64 etc for signed, which is about perfect

Yes, even good old C has uint16_t and int16_t for this. I use these exclusively for embedded work because we care about the size of everything. Also agree that Rust gets it right by using a single character with the size: u16, i16.

It's funny because C opted to leave the number of bits machine dependent in the name of portability, but that turns out to have the opposite effect.


> It's funny because C opted to leave the number of bits machine dependent in the name of portability, but that turns out to have the opposite effect.

That depends on what you consider the portable part. In the era of proprietary mainframes operating on proprietary datasets, data portability probably didn't matter as much as code portability to perform the same sort of operation on another machine.

C's size-less `int`, `float`, etc. allow the exact same code to compile using the native (high-speed) sizes of the platform without any editing, `ifdef`s, etc.

(Side note: That's what bothers me a lot about the language wars -- the features of a language are based on the trade-offs between legibility, performance, and the environment from the era they were intended to be used in. Often both sides of those spats fail to remember that.)


> good old C has uint16_t

Firstly, no, good old C doesn't. These things are a rather new addition (C99). By 1999 there were already decades of good old C which didn't have int16_t.

It is implementation-defined whether there is an int16_t; so C doesn't really have int16_t in the sense that it has int.

> It's funny because C opted to leave the number of bits machine dependent in the name of portability, but that turns out to have the opposite effect.

Is that so? This code will work nicely on an ancient Unix box with 16 bit int, or on a machine with 32 or even 64 bit int:

  #include <stdio.h>
  int main(void)
  {
    int i;
    char a[] = "abc";
    for (i = 0; i < sizeof a - 1; i++) /* - 1: don't print the terminating NUL */
      putchar(a[i]);
    putchar('\n');
    return 0;
  }
Write a convincing argument that we should change both ints here to int32_t or whatever for improved portability.


>> Firstly, no, good old C doesn't.

Yes, it does. The C99 standard has been around for 20 years. I started giving up on compilers that don't support it 10 years ago. I consider it a given for C.

>> It is implementation-defined whether there is an int16_t

No, it's not. That type is part of the C99 standard.

>> Is that so? This code will work nicely on an ancient Unix box with 16 bit int, or on a machine with 32 or even 64 bit int:

That's cool; your example only needs to count to 3, so any size integer will do. The problems arise when you go to 100,000 and that old 16-bit machine rolls over at 65536 but the newer ones don't. Other times someone (me) may want things to roll over at the 16-bit boundary, and we need to specify the size as int16_t rather than figure out whether int or short or char is that size for each architecture. (And yes, I know rollover is undefined behavior.)

>> Write a convincing argument that we should change both ints here to int32_t or whatever for improved portability.

In your example it doesn't matter. I'd argue that at least giving the size of your integers some thought every time is a good habit to get into so you don't write non-portable code in the cases where it does matter. You are free to argue that it's too much effort or something, but I'd invite you to argue that "i16" is more effort to type than "int".


> The problems arise when you go to 100,000 and that old 16bit machine rolls over at 65536 but the newer ones don't

There, we are running into the question of: on that system, can we define that array at all, if it has 100,000 characters.

> That type is part of the C99 standard.

Unfortunately, the standard isn't what translates and executes your code; that would be the implementation. The standard allows implementations not to provide int16_t if they have no type for which it can be a typedef name. A maximally portable program can use int16_t only if it has detected that it's present. If an int16_t typedef name is provided, then <stdint.h> will also define the macro INT16_MAX. We can thus have code conditional on the existence of int16_t via #ifdef. (Since we know it has to be nonzero if provided, we can use #if also.)

> Other times someone (me) may want things to roll over at the 16 bit boundary and we need to specify the size as int16_t rather than figure out if int or short or char is that size for each architecture. (and yes I know rollover is undefined behavior)

Someone who knows C knows that unsigned short is 16 bits wide or wider, as is unsigned int. A maximally portable wrap-around of a 16 bit counter, using either of those types, is achieved using: cntr = (cntr + 1) & 0xFFFF. That will work on a machine that has no 16 bit word, and whose compiler doesn't simulate the existence of one.

We can do it less portably if we rely on there being a uint16_t; then we can drop the & 0xFFFF. (It isn't undefined behavior if we use the unsigned type.) The existence of uint16_t in the standard encourages programmers to write less portable code; they reach for it instead of the portable version with the & 0xFFFF. Code relying on uint16_t, ironically, is not portable to the PDP-7 machines that Thompson and Ritchie originally worked with on Unix; those machines have 18 bit words.


I've always wondered if someone wrote a close-to-the-metal C compiler for the CDC 6400.

60 bit words, 18 and 60 bit registers, 6 bit characters.


I think that technically isn't allowed, since the C standard requires single-byte representations of 26 uppercase + 26 lowercase + 10 numeric + 29 symbol + 1 space = 92 characters (and assorted (most of them useless) control characters), which don't fit in 64 values of a 6-bit byte.

It would be interesting to see a CDC 6400 implementation of "like C, but we ignore the bit about character sets and also the part where the compiler is an evil genie trying to screw you over", though.


"word" for "16 bits" is not so much an anachronism as an Intel-Microsoftism.

In computing, "word" is understood to be machine dependent: "what is that machine's word size?"


I suspect the reason for this unfortunate naming has something to do with the fact that strings are encoded as UTF-16 in JavaScript. Having said that, I completely agree with you.


If you can live with compiling a declarative YAML description and only need to read the data, then there is also kaitai-struct [0]. One of the compilation targets is JavaScript. Their web IDE [1], which is quite nice, uses the JS target (it feels kind of like a simple version of SweetScape's 010 Editor). Other targets are C++, Java, PHP, Python, Ruby, and more.

I used it in the past to read a proprietary file format and it worked well, but they also have quite a few predefined formats in their gallery [2].

[0] http://kaitai.io/

[1] https://ide.kaitai.io/

[2] http://formats.kaitai.io/


What’s the difference between something like this and tools like protobuf?

Edit: this provides a good explanation http://doc.kaitai.io/faq.html#_google_protocol_buffers_asn_1...


Kaitai doesn't (currently) support writing out data, only reading it in, so this is fairly different.


Does this not have any relationship to Construct, the Python library for doing the same thing? https://construct.readthedocs.io/

Seems weird that it's not even mentioned. I thought it would be a port.


I believe I have the same thing in structurae [1]. I have a RecordArray [2] class for exactly this: you can define a struct (or record) and use an ArrayBuffer to hold a specified number of structs, although I went with extending DataView to do it all instead of creating a "wrapping" class. I wrote about it in the article introducing structurae [3].

[1] https://github.com/zandaqo/structurae

[2] https://github.com/zandaqo/structurae#RecordArray

[3] https://blog.usejournal.com/structurae-data-structures-for-h...


I was just thinking about some of the barriers that crypto implementations, for example, face when being ported to JavaScript. A library like this, especially for pedagogical code, eliminates a very important barrier, so, bravo!

The idea is a good one for another reason too: you can use this API to define a data structure which can be 'rendered' in different ways - which could be useful for porting data structures between Node, the DOM, and WASM.

Bravo!


Thanks!


Might be good to clarify the use cases in the README. It sounds like this is useful for reading and writing binary file formats. My initial thought was that it was intended for careful memory management for performance reasons, such as the packed arrays of structs you see in lower-level languages. But it seems there's probably no way to do that in pure JS without slowing things down significantly, which would mostly defeat the point.


Indeed.

Mnemonist [1] is a library which looks more like what you're looking for. There are some performance benchmarks here and there on the web comparing the performance of its data structures with others.

[1]: https://yomguithereal.github.io/mnemonist/


Talking about binary: what's the situation like today if you want to look under the hood of a webasm program? I haven't tried it myself, but I guess it's a lot harder to figure out what's going on compared to even minified JavaScript?


If you are the original developer, there is source map support, at least in Firefox, so you can step through the source code as with any normal debugger.

If you are not the original developer, most likely source maps are not available to you. Then wasm can be studied with existing reverse engineering tools, like IDA Pro, Binary Ninja, or radare2/Cutter. The last one is even FLOSS.


Assuming you're talking about WebAssembly, the Binaryen toolchain includes programs that convert wasm code into its textual representation (based on S-expressions).

https://github.com/WebAssembly/binaryen


I wrote a similar project many years ago: jParser. https://github.com/vjeux/jParser


Reminds me of a similar project, varstruct (https://github.com/varstruct/varstruct). Also, maybe you could consider the abstract-encoding interface specification (https://github.com/mafintosh/abstract-encoding), although it has a few flaws of its own.


Nice, and needed. Curious whether you also considered a sprintf-ish, template-based approach like Perl's pack() or Python's struct.pack().


This is way outside of my expertise, but I have an honest question: why do we need a separate .getDeep() when .get() could be written to do the same for any dot-delimited path?


Not the author but I assume it's to save the overhead of string parsing when you know you don't need it.
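i.e. a combined accessor would have to do something like this on every call (a sketch with hypothetical names, not the library's code):

  const getDeep = (struct, path) =>
    path.split('.').reduce((s, name) => s.get(name), struct);

  getDeep(localHeader, 'sharedHeaderInfo.crc32'); // hypothetical path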


Isn't JS unable to call C code? Might this be a step on the way to enabling that?


Not really, because C is a compiled language rather than an interpreted one, so you would have to distribute binaries. This library does not help with that.



Parse binary structs... with a language that only has doubles.



Nowadays JavaScript supports packed binary data natively. The only thing JavaScript lacks is a 64-bit integer type. All other numeric types can safely be represented in a double.
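For example:

  Number.MAX_SAFE_INTEGER;        // 9007199254740991, i.e. 2^53 - 1
  Number.isSafeInteger(2 ** 53);  // false -- beyond this you need BigInt
  2n ** 64n - 1n;                 // 18446744073709551615n, exact as a BigInt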

JITs can also use integer types internally when possible (small integer optimization).


As a frontend developer, do I need this? And why?

Or is this more for the heavy lifting and backend uses of JS?


The irony is that "byte-level binary data structures" is all that WASM is good for right now! So what's the problem with just using WASM?



