It seems like repeated keys would be super-easy to make more efficient in a binary format: You could write each key in stringish form once (the first time it was seen) and then refer to it with a numeric seen-key reference from there on.
If you want it to be compatible with indefinite streaming (which CBOR is), this raises another question: for how long are identifiers valid, and do they get invalidated or updated at some point in time? The header compression in HTTP/2 solves such a problem, but it also introduces quite a bit of additional complexity.
Right. There is definitely a tradeoff between requiring retention of the identifiers, which means keeping more state, and re-specifying them, which means sending more data. There are definitely more sophisticated ways to handle this (see: your HTTP/2 header example, which I think even includes value caching), but an easy way to choose a point on that tradeoff spectrum is to simply keep a fixed-size ring buffer and retain, say, the last 256 keys.
You definitely don't need a predefined schema. You simply make it a responsibility of both the reader and the writer to keep track of what stringy keys have been seen and in what order. You can then refer back into that ordered list of known names the next time a repeated name comes up.
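A minimal sketch of that shared bookkeeping (hypothetical names; no eviction policy shown, though a fixed-size ring buffer as mentioned above would cap the state): the reader and writer each run an identical table, and they stay in sync because both process keys in the same order.

```python
class KeyTable:
    """Tracks which stringy keys have been seen, in order. The writer and
    reader each keep their own instance; since both observe keys in the
    same order, the numeric back-references always agree."""

    def __init__(self):
        self.seen = {}   # key -> assigned id
        self.keys = []   # id -> key

    def encode(self, key):
        """First occurrence: send the literal string. After that: send a
        small integer reference into the ordered list of known names."""
        if key in self.seen:
            return ("ref", self.seen[key])
        self.seen[key] = len(self.keys)
        self.keys.append(key)
        return ("lit", key)

    def decode(self, token):
        kind, value = token
        if kind == "ref":
            return self.keys[value]
        self.encode(value)   # mirror the writer's bookkeeping
        return value


writer, reader = KeyTable(), KeyTable()
records = [{"name": "a", "size": 1}, {"name": "b", "size": 2}]
for rec in records:
    tokens = [writer.encode(k) for k in rec]
    # the second record's keys arrive as ("ref", 0) and ("ref", 1)
    assert [reader.decode(t) for t in tokens] == list(rec)
```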
The primary advantage of a schemaless binary encoding such as CBOR is that it lets you encode binary data directly, instead of double-encoding it as base64 inside JSON.
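A rough illustration of that overhead, using only the stdlib (the CBOR byte-string header is hand-encoded here per RFC 8949 to avoid a dependency; a real encoder would also wrap the map and key):

```python
import base64
import json
import os

payload = os.urandom(300)  # some binary blob

# JSON cannot carry raw bytes, so it must base64 them: ~33% size overhead,
# plus an extra decode pass on the reader side.
as_json = json.dumps({"data": base64.b64encode(payload).decode("ascii")})

# CBOR has a native byte-string type. For a 300-byte string the header is
# just 0x59 (major type 2, 2-byte length follows) plus the length itself:
cbor_bytes = bytes([0x59]) + len(payload).to_bytes(2, "big") + payload

print(len(as_json), len(cbor_bytes))  # the CBOR side is payload + 3 bytes
```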
CBOR is not about compression to make data smaller, but about machine readability. For example, CBOR uses a binary representation for integers, so that machines can read them directly without the string-to-integer conversion JSON requires.
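To make that concrete, here is a minimal sketch of CBOR's unsigned-integer encoding (RFC 8949 major type 0) — the reader pulls the value out with fixed-width byte arithmetic instead of scanning ASCII digits:

```python
def cbor_encode_uint(n):
    """Minimal CBOR unsigned-integer encoding (RFC 8949, major type 0).
    Values 0..23 fit in the initial byte; larger values get a one-byte
    prefix saying how many bytes of big-endian payload follow."""
    if n < 24:
        return bytes([n])
    if n < 0x100:
        return bytes([0x18, n])                         # uint8 follows
    if n < 0x10000:
        return bytes([0x19]) + n.to_bytes(2, "big")     # uint16 follows
    if n < 0x100000000:
        return bytes([0x1A]) + n.to_bytes(4, "big")     # uint32 follows
    return bytes([0x1B]) + n.to_bytes(8, "big")         # uint64 follows


# 1000000 is 7 ASCII digits in JSON but 5 bytes in CBOR, and decoding it
# is just a big-endian read rather than digit-by-digit parsing:
encoded = cbor_encode_uint(1000000)
assert encoded == bytes([0x1A]) + (1000000).to_bytes(4, "big")
assert int.from_bytes(encoded[1:], "big") == 1000000
```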
Cap'n Proto and ASN.1 require a schema. Gzip compression means the content has to be decompressed into JSON and then parsed into a native representation, which probably requires more memory and CPU than CBOR deserialization.
That’s right. In addition: CBOR can’t automatically compress field names, since those are strings which need to be fully serialized. gzip can compress them too, so it has a chance to trim the size of the data further down, in exchange for the additional cost of a second encoding pass.
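A quick stdlib sketch of why layering gzip on top helps here: repeated field names are exactly the kind of redundancy DEFLATE eats for breakfast, so the per-record key overhead largely disappears after compression.

```python
import gzip
import json

# 1000 records, each repeating the same two field names (sample data):
records = [{"timestamp": i, "temperature": 20.0 + i % 5} for i in range(1000)]

raw = json.dumps(records).encode()
packed = gzip.compress(raw)

# The compressed form is a small fraction of the raw JSON, because the
# repeated keys (and much of the values' structure) compress away.
print(len(raw), len(packed))
```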
Cap'n Proto, protobuf, and co. can replace field names with IDs as indicated by their schemas, and will thereby be the most space-efficient in general.
Because accessing a field in a binary format is essentially a seek(), while in a text format it is much more complicated. Try to build a mental model of what is required to parse a text-based format.
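A small sketch of the contrast, using a hypothetical fixed-layout record (a u32 id followed by a f64 price): the binary read is "jump to a known offset and reinterpret the bytes", while the text read has to scan characters, match quotes and braces, and convert digits along the way.

```python
import json
import struct

# Hypothetical fixed-layout binary record: u32 id at offset 0, f64 price
# at offset 4 (little-endian, no padding).
record = struct.pack("<Id", 42, 9.99)

# Binary: "seek" to byte 4 and read the price directly -- no scanning.
(price,) = struct.unpack_from("<d", record, 4)
assert price == 9.99

# Text: the parser must walk the whole document character by character
# before it can hand you the same value.
assert json.loads('{"id": 42, "price": 9.99}')["price"] == 9.99
```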
You can have a look at the results of performance testing involving these binary message formats vs. JSON.
If you take into consideration the total energy it takes to process an average daily dataset, then yes, I think it will be an order of magnitude more efficient (in terms of energy use).
Why would I want to do that?? :) I was using JSON and Avro in production, and it was pretty obvious that Avro just beats the shit out of JSON libraries, even state-of-the-art ones like Boon[1].
For me it does not really matter which binary message format you use; I don't think there is a huge difference between the libraries. There are some feature differences for sure, and maybe differences in library quality. If you are going for the most performance it is possible to achieve, you could look at SBE[2], developed by performance freaks.
* Cap'n Proto
* ASN.1
* gzip-compressed JSON
in various ways. (I don't know much about progress in serialization methods.)