
I'm curious how this compares to

* Cap'n Proto

* ASN.1

* gzip-compressed JSON

in various ways. (I don't know much about progress in serialization methods.)




I've always enjoyed the bencode[1] and netstring/tnetstring[2] formats too.

[1]: https://en.wikipedia.org/wiki/Bencode

[2]: http://web.archive.org/web/20140701085126/http://tnetstrings...


For what it's worth, I tested it on this gigantic json file I have in this app (yes I should probably not be using JSON here).

Raw JSON is 90 MB, CBOR was 80 MB. JSON+gzip takes it to 30 MB, and CBOR+gzip was 31 MB.

That being said, the schema has a lot of repeated keys, so that's why gzip helps a lot.


It seems like repeated keys would be super-easy to make more efficient in a binary format: You could write each key in stringish form once (the first time it was seen) and then refer to it with a numeric seen-key reference from there on.
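A rough sketch of that idea in Python (illustrative only; this is not the actual CBOR stringref wire format, and the function names are made up). The catch, as the replies below point out, is that both sides now have to keep the key table in sync:

    # Sketch of "write each key once, refer to it by index afterwards".
    # Not a real wire format, just the bookkeeping idea.

    def encode(records):
        seen = {}                      # key string -> index, in order of first use
        out = []
        for rec in records:
            row = []
            for key, value in rec.items():
                if key in seen:
                    row.append((seen[key], value))   # back-reference by index
                else:
                    seen[key] = len(seen)
                    row.append((key, value))         # literal key defines the next index
            out.append(row)
        return out

    def decode(rows):
        table = []                     # index -> key, rebuilt in the same order
        records = []
        for row in rows:
            rec = {}
            for key, value in row:
                if isinstance(key, int):             # back-reference
                    key = table[key]
                else:                                # first occurrence
                    table.append(key)
                rec[key] = value
            records.append(rec)
        return records

    data = [{"user_id": 1, "name": "a"}, {"user_id": 2, "name": "b"}]
    assert decode(encode(data)) == data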


If you want it to be infinitely streaming-compatible (which CBOR is), it raises another question: for how long are identifiers valid, and do they get invalidated or updated at some point? The header compression in HTTP/2 solves such a problem, but it also introduces quite a bit of additional complexity.


Right. There is definitely a tradeoff between requiring retention of the identifiers, which means keeping more state, and re-specifying them, which means sending more data. There are more sophisticated ways to handle this (see your HTTP/2 header example, which I think even includes value caching), but an easy way to choose a point on that tradeoff spectrum is to simply keep a fixed-size ring buffer and retain, say, the last 256 keys.


This is such a great idea, it got standardized as a CBOR extension: http://cbor.schmorp.de/stringref


Oh nice!


As mentioned above, for something like that to work you need a pre-defined schema, such as protobuf, which, yes, I know is what I should be using :)


You definitely don't need a predefined schema. You simply make it a responsibility of both the reader and the writer to keep track of what stringy keys have been seen and in what order. You can then refer back into that ordered list of known names the next time a repeated name comes up.


The primary advantage of a schemaless binary encoding such as CBOR is that it lets you encode binary data directly instead of using a double encoding of base64 in JSON.
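Rough illustration of that overhead using only the Python standard library (sizes are approximate):

    import base64
    import json
    import os

    payload = os.urandom(1024)             # 1 KiB of binary data

    # JSON has no byte-string type, so the bytes have to be base64-encoded first,
    # which inflates them by roughly a third before any JSON framing is added.
    as_json = json.dumps({"data": base64.b64encode(payload).decode("ascii")})

    print(len(payload))                     # 1024
    print(len(as_json))                     # ~1380, plus the cost of decoding twice

    # A binary format like CBOR can carry the 1024 bytes directly as a byte
    # string with only a few bytes of framing (e.g. cbor2.dumps({"data": payload})).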


> gzip-compressed JSON

CBOR is not about compression to make data smaller, but about machine readability. For example, CBOR uses a binary representation for integers so that machines can read them directly, without the string-to-integer conversion JSON requires.


Except that CBOR uses network order, so you cannot just interpret the bytes as an integer.
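A small sketch of how CBOR lays out unsigned integers (major type 0, per RFC 8949); the helper name here is made up:

    import struct

    def encode_uint(n):
        """CBOR major type 0: unsigned integer, big-endian (network order) payload."""
        if n < 24:
            return bytes([n])                            # value fits in the initial byte
        elif n < 2**8:
            return bytes([0x18]) + struct.pack(">B", n)  # 1-byte argument follows
        elif n < 2**16:
            return bytes([0x19]) + struct.pack(">H", n)  # 2-byte argument follows
        elif n < 2**32:
            return bytes([0x1A]) + struct.pack(">I", n)  # 4-byte argument follows
        else:
            return bytes([0x1B]) + struct.pack(">Q", n)  # 8-byte argument follows

    print(encode_uint(500).hex())   # "1901f4": one type byte + big-endian uint16
    # On a little-endian machine that 0x01f4 still needs a byte swap before it can
    # be used as a native int, which is the "network order" caveat above.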


Cap'n'proto and ASN.1 require a schema. Gzip compression means the content has to be decrypted into json and then parsed into a native representation, which probably requires more memory and cpu than cbor deserialization.


That’s right. In addition, CBOR can’t automatically compress field names, since those are strings that need to be fully serialized. gzip can compress them too, so it has a chance to trim the data size further in exchange for the additional cost of a second encoding pass. Cap'n Proto, protobuf and co. can replace field names with IDs as indicated by their schemas, and will thereby generally be the most space efficient.
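Back-of-the-envelope size comparison in Python (the tagged encoding below is simplified for illustration, not the real protobuf or Cap'n Proto wire format):

    import json
    import struct

    # Self-describing: the field name travels with every single record.
    record_json = json.dumps({"temperature_celsius": 21.5}).encode()

    # Schema-based (simplified): both sides agree out of band that field id 1
    # means "temperature_celsius", so only a one-byte tag plus the value is sent.
    record_tagged = struct.pack(">Bd", 1, 21.5)

    print(len(record_json))    # 29 bytes
    print(len(record_tagged))  # 9 bytes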


Not decrypted. Decoded or decompressed.


Oops, I meant decompressed.


Technically, you can use ASN.1 in a "schemaless" mode, but it's not very common.


Parsing JSON is probably an order of magnitude more inefficient than working with any binary message formats in terms of speed and energy.


...why? That's not at all obvious.


Because accessing a field in a binary format is basically a seek(), while in a text format it is much more complicated. Try to build a mental model of what is required to parse text-based formats.

You can have a look at the results of performance testing that compares these binary message formats with JSON:

http://zderadicka.eu/comparison-of-json-like-serializations-...

https://github.com/ludocode/schemaless-benchmarks#speed---de...

https://eng.uber.com/trip-data-squeeze/

http://ugorji.net/blog/benchmarking-serialization-in-go


Sorry, I read "order of magnitude more efficient" and was confused.


If you take into consideration the total energy it takes to process an average daily dataset, then yes, I think it will be an order of magnitude more efficient (in terms of energy use).


Try parsing a TCP/UDP header encoded in JSON vs. as it ‘normally’ is. See if you can do it at line rate :o)
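For illustration, a UDP header (8 bytes, RFC 768) unpacked directly vs. a hypothetical JSON encoding of the same fields:

    import json
    import struct

    # Real UDP header: source port, destination port, length, checksum,
    # four 16-bit big-endian fields. Reading it is one fixed-offset unpack.
    header = bytes.fromhex("1f900050 0020abcd")
    src, dst, length, checksum = struct.unpack("!HHHH", header)
    print(src, dst, length)            # 8080 80 32

    # A hypothetical JSON encoding of the same fields is several times larger
    # and has to be scanned character by character, converting digits to ints.
    as_json = json.dumps({"src": src, "dst": dst, "len": length, "csum": checksum})
    print(len(header), len(as_json))   # 8 vs ~50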


Why would I want to do that?? :) I was using JSON and Avro in production and it was pretty obvious that Avro just beats the shit out of JSON libraries, even though there are state of the art libraries like Boon[1].

1. https://github.com/boonproject/boon/wiki/Boon-JSON-in-five-m...

For me it does not really matter which binary message format you are using; I think there is not a huge difference between the different libraries. There are some feature differences for sure, and maybe differences in library quality. If you are going for the most performance that is possible to achieve, you could look at SBE[2], developed by performance freaks.

2. https://github.com/real-logic/simple-binary-encoding


> Why would I want to do that?? :)

because you _seem_ to imply that json parsing is faster/better than binary parsing...


>> Parsing JSON is probably an order of magnitude more ____inefficient___ than working with any binary message formats in terms of speed and energy.

Do I?


oh dear lord ! how dumb of me :o) apologies...



