It seems like repeated keys would be super-easy to make more efficient in a binary format: You could write each key in stringish form once (the first time it was seen) and then refer to it with a numeric seen-key reference from there on.
If you want it to be compatible with indefinite streaming (which CBOR is), this raises another question: for how long are identifiers valid, and do they get invalidated or updated at some point in time? The header compression in HTTP/2 solves such a problem, but it also introduces quite a bit of additional complexity.
Right. There is definitely a tradeoff between requiring retention of the identifiers, which means keeping more state, and re-specifying them, which means sending more data. There are definitely more sophisticated ways to handle this (see: your HTTP/2 header example, which I think even includes value caching), but an easy way to choose a point on that tradeoff spectrum is to simply keep a fixed-size ring buffer and retain, say, the last 256 keys.
You definitely don't need a predefined schema. You simply make it a responsibility of both the reader and the writer to keep track of what stringy keys have been seen and in what order. You can then refer back into that ordered list of known names the next time a repeated name comes up.
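A minimal sketch of that shared bookkeeping (hypothetical names; no eviction policy shown, though a fixed-size ring buffer as mentioned above would cap the state): the reader and writer each run an identical table, and they stay in sync because both process keys in the same order.

```python
class KeyTable:
    """Tracks which stringy keys have been seen, in order. The writer and
    reader each keep their own instance; since both observe keys in the
    same order, the numeric back-references always agree."""

    def __init__(self):
        self.seen = {}   # key -> assigned id
        self.keys = []   # id -> key

    def encode(self, key):
        """First occurrence: send the literal string. After that: send a
        small integer reference into the ordered list of known names."""
        if key in self.seen:
            return ("ref", self.seen[key])
        self.seen[key] = len(self.keys)
        self.keys.append(key)
        return ("lit", key)

    def decode(self, token):
        kind, value = token
        if kind == "ref":
            return self.keys[value]
        self.encode(value)   # mirror the writer's bookkeeping
        return value


writer, reader = KeyTable(), KeyTable()
records = [{"name": "a", "size": 1}, {"name": "b", "size": 2}]
for rec in records:
    tokens = [writer.encode(k) for k in rec]
    # the second record's keys arrive as ("ref", 0) and ("ref", 1)
    assert [reader.decode(t) for t in tokens] == list(rec)
```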
The primary advantage of a schemaless binary encoding such as CBOR is that it lets you encode binary data directly, instead of double-encoding it as base64 inside JSON.
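A rough illustration of that overhead, using only the stdlib (the CBOR byte-string header is hand-encoded here per RFC 8949 to avoid a dependency; a real encoder would also wrap the map and key):

```python
import base64
import json
import os

payload = os.urandom(300)  # some binary blob

# JSON cannot carry raw bytes, so it must base64 them: ~33% size overhead,
# plus an extra decode pass on the reader side.
as_json = json.dumps({"data": base64.b64encode(payload).decode("ascii")})

# CBOR has a native byte-string type. For a 300-byte string the header is
# just 0x59 (major type 2, 2-byte length follows) plus the length itself:
cbor_bytes = bytes([0x59]) + len(payload).to_bytes(2, "big") + payload

print(len(as_json), len(cbor_bytes))  # the CBOR side is payload + 3 bytes
```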
CBOR is not about compression to make data smaller, but about machine readability. For example, CBOR uses a binary representation for integers, so that machines can read them directly without the string-to-integer conversion JSON requires.
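To make that concrete, here is a minimal sketch of CBOR's unsigned-integer encoding (RFC 8949 major type 0) — the reader pulls the value out with fixed-width byte arithmetic instead of scanning ASCII digits:

```python
def cbor_encode_uint(n):
    """Minimal CBOR unsigned-integer encoding (RFC 8949, major type 0).
    Values 0..23 fit in the initial byte; larger values get a one-byte
    prefix saying how many bytes of big-endian payload follow."""
    if n < 24:
        return bytes([n])
    if n < 0x100:
        return bytes([0x18, n])                         # uint8 follows
    if n < 0x10000:
        return bytes([0x19]) + n.to_bytes(2, "big")     # uint16 follows
    if n < 0x100000000:
        return bytes([0x1A]) + n.to_bytes(4, "big")     # uint32 follows
    return bytes([0x1B]) + n.to_bytes(8, "big")         # uint64 follows


# 1000000 is 7 ASCII digits in JSON but 5 bytes in CBOR, and decoding it
# is just a big-endian read rather than digit-by-digit parsing:
encoded = cbor_encode_uint(1000000)
assert encoded == bytes([0x1A]) + (1000000).to_bytes(4, "big")
assert int.from_bytes(encoded[1:], "big") == 1000000
```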
Cap'n Proto and ASN.1 require a schema. Gzip compression means the content has to be decompressed into JSON and then parsed into a native representation, which probably requires more memory and CPU than CBOR deserialization.
That’s right. In addition: CBOR can’t automatically compress field names, since those are strings which need to be fully serialized. gzip can compress them too, so it has a chance to trim the size of the data further down, in exchange for the additional cost of a second encoding pass.
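A quick stdlib sketch of why layering gzip on top helps here: repeated field names are exactly the kind of redundancy DEFLATE eats for breakfast, so the per-record key overhead largely disappears after compression.

```python
import gzip
import json

# 1000 records, each repeating the same two field names (sample data):
records = [{"timestamp": i, "temperature": 20.0 + i % 5} for i in range(1000)]

raw = json.dumps(records).encode()
packed = gzip.compress(raw)

# The compressed form is a small fraction of the raw JSON, because the
# repeated keys (and much of the values' structure) compress away.
print(len(raw), len(packed))
```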
Cap'n Proto, protobuf, and co. can replace field names with IDs as indicated by their schemas, and will thereby be the most space-efficient in general.
Because accessing a field in a binary format is essentially a seek(), while in a text format it is much more complicated. Try to build a mental model of what is required to parse a text-based format.
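A small sketch of the contrast, using a hypothetical fixed-layout record (a u32 id followed by a f64 price): the binary read is "jump to a known offset and reinterpret the bytes", while the text read has to scan characters, match quotes and braces, and convert digits along the way.

```python
import json
import struct

# Hypothetical fixed-layout binary record: u32 id at offset 0, f64 price
# at offset 4 (little-endian, no padding).
record = struct.pack("<Id", 42, 9.99)

# Binary: "seek" to byte 4 and read the price directly -- no scanning.
(price,) = struct.unpack_from("<d", record, 4)
assert price == 9.99

# Text: the parser must walk the whole document character by character
# before it can hand you the same value.
assert json.loads('{"id": 42, "price": 9.99}')["price"] == 9.99
```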
You can have a look at the results of performance testing involving these binary message formats vs. JSON.
If you take into consideration the total energy it takes to process an average daily dataset, then yes, I think it will be an order of magnitude more efficient (in terms of energy use).
Why would I want to do that?? :) I was using JSON and Avro in production, and it was pretty obvious that Avro just beats the shit out of JSON libraries, even state-of-the-art ones like Boon[1].
For me it does not really matter which binary message format you use; I don't think there is a huge difference between the libraries. There are some feature differences for sure, and maybe differences in library quality. If you are going for the most performance it is possible to achieve, you could look at SBE[2], developed by performance freaks.
* Cap'n Proto
* ASN.1
* gzip-compressed JSON
in various ways. (I don't know much about progress in serialization methods.)