CBOR – Concise Binary Object Representation (cbor.io)
81 points by tosh on Aug 3, 2019 | hide | past | favorite | 71 comments



CBOR is MessagePack. At least cbor-ruby started with the MessagePack sources. The story is that Carsten took MessagePack, wrote a standard and added some things he wanted, and called it something else.

I wrote [1] a pretty comprehensive (and admittedly biased) critique of the CBOR standard years ago.

[1] https://news.ycombinator.com/item?id=14072598

Disclaimer: I wrote and maintain a MessagePack implementation.


Indeed, the RFC says so directly:

   CBOR was inspired by MessagePack.  MessagePack was developed and
   promoted by Sadayuki Furuhashi ("frsyuki").  This reference to
   MessagePack is solely for attribution; CBOR is not intended as a
   version of or replacement for MessagePack, as it has different design
   goals and requirements.
I fail to see what is wrong with having both CBOR and MessagePack, or with trying to bring a MessagePack-like thing to the IETF, if MessagePack's design (lack of extension points, etc.) was problematic for enabling future applications.

Also: CBOR is not one person. The CBOR working group, and the IETF in general, put forward the spec. If you object to how the process went down, your quarrel likely lies with IETF, not any individual.


CBOR isn't merely _inspired_ by MessagePack. cbor-ruby forked a Ruby MessagePack implementation. Its format is conceptually and fundamentally the same.

As for the rest of the process, that HN post I linked to has a basket of links at the end which are a good summary of what happened. I still (as I wrote before) won't characterize events, I think people should make up their own mind about it. But I think it's not biasing to reiterate:

> At least cbor-ruby started with the MessagePack sources. The story is that Carsten took MessagePack, wrote a standard and added some things he wanted, and called it something else.


In your opinion, what's the value of MessagePack? I worked with it when writing some Fluentd tools, and it was neat, but I didn't love it enough to switch from JSON on other projects. Maybe there's a killer feature I didn't know I needed?


At my work we recently went through a large exercise to decide on a common data storage format. The contenders were JSON, MessagePack, and Avro. MessagePack won because:

- Msgpack serialization and deserialization is very fast in many languages - often 100x faster than JSON

- Msgpack natively supports encoding binary data

- Msgpack has type extensions, making it trivial to represent common types in an efficient way (eg. IPv4 address, timestamps)

- Msgpack has good libraries available in many languages

If you do not care about those things (no binary data, no need for extended types, not performance critical) then JSON is just fine.
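To make the binary-data point above concrete, here's a minimal sketch of MessagePack's bin 8/16/32 formats (format bytes 0xc4/0xc5/0xc6 per the MessagePack spec), hand-rolled rather than using a library:

```python
import struct

def msgpack_encode_bin(data: bytes) -> bytes:
    """Encode a byte string using MessagePack's bin 8/16/32 formats."""
    n = len(data)
    if n < 2**8:
        return b"\xc4" + struct.pack(">B", n) + data   # bin 8
    elif n < 2**16:
        return b"\xc5" + struct.pack(">H", n) + data   # bin 16
    else:
        return b"\xc6" + struct.pack(">I", n) + data   # bin 32

# JSON has no binary type; the usual workaround is base64, which
# inflates the payload. MessagePack stores the raw bytes with only
# a 2-5 byte type/length header.
payload = bytes(range(16))
assert msgpack_encode_bin(payload) == b"\xc4\x10" + payload
```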


I'm curious why didn't you consider FlatBuffers as well.


FlatBuffers are not self-describing.

FlatBuffers, Protobuf, Cap'n Proto, etc., all require an external schema that you compile into a code chunk that you include in your program. Without this it is impossible to make sense of the data. In our case, the data is semi-structured and changes frequently. The prospect of maintaining a schema registry for all the data users and keeping everyone up to date and backwards compatible was enough of a burden that it was excluded.

Avro also uses schemas, but since the schema is embedded in the file it is self-describing so the reader does not need to do anything special to interpret the data. But Avro's C library is buggy and the python deserialization performance was terrible, so Avro was not selected.


In recent benchmarks I ran for a project, the performance of MessagePack blew JSON out of the water. Obviously, it may differ in your case, but generally, MessagePack is designed to be machine-readable and easy to parse, so performance will generally be better. Message sizes are generally smaller too.


(TL;DR: If you're looking for a low-overhead binary serialization format, MessagePack is great. It sits somewhere between "send JSON to a JS client" and "I have a fleet of backend services all sending data back and forth".)

MessagePack is marketed as "It's like JSON, but fast and small", but I don't view it as a JSON replacement at all. I think JSON is great for most projects, especially anything talking to a JS front end, and as soon as you move into something requiring more performance, you probably need schema/versioning guarantees that something like FlatBuffers or Cap'n Proto would give you.

I think MessagePack is a great on-disk binary format, and a pretty good network binary format if you don't have a complicated application architecture (e.g., a video game client/server, a chat client, etc.). It's way, way faster than JSON (which can be important for mobile, IoT, and embedded work). It's more compact than JSON if you want to avoid the overhead of gzipping/gunzipping everything. It compresses about as well as JSON+gzip if that's important to you. It's not confounded by things like canonicalization or attempts to add a schema. It's also amenable to streaming, something you... mostly can't do with JSON without going to a lot of trouble. It's also easy to implement -- which is important in some cases. Try implementing a JSON encoder/decoder!
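On the streaming point: MessagePack values are self-delimiting, so a stream is just values laid back to back, and a reader pulls them off one at a time with no extra framing. A minimal sketch covering only two of the format's types (positive fixint and uint 16):

```python
import io
import struct

def read_value(stream):
    """Decode one MessagePack value from a byte stream.
    Sketch only: handles positive fixint (0..127) and uint 16 (0xcd)."""
    first = stream.read(1)[0]
    if first < 0x80:            # positive fixint: the value is the byte itself
        return first
    if first == 0xcd:           # uint 16: two big-endian bytes follow
        return struct.unpack(">H", stream.read(2))[0]
    raise NotImplementedError("only a sketch")

# Three concatenated values: 7, 1000, 42 -- no delimiters needed.
stream = io.BytesIO(b"\x07\xcd\x03\xe8\x2a")
assert [read_value(stream) for _ in range(3)] == [7, 1000, 42]
```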


I did write a JSON encoder/decoder! I was just goofing around and decided to see how hard it was. Apparently I didn't get around to fleshing out the docs, so here's the tests: https://github.com/amorphid/json_momoa/blob/master/test/json...


Incredible name.....

Edit: to be a little more constructive, I read this a while ago and it scared me off implementing anything JSON related forever: http://seriot.ch/parsing_json.php#4.


I found it much more pleasant to parse JSON than YAML. At least there's a fairly specific RFC for JSON! YAML parsing felt like a whole bunch of "meh, we came up with this, but just sorta do what you want".

And on the topic of cute names, back when I was learning Elixir, I wrote a wrapper for an Erlang YAML parser. I called it Mark Yamill :)

https://hex.pm/packages/mark_yamill


It is odd to read in the RFC section 1.1 this sentence,

The format should use contemporary machine representations of data

...and then see in section 1.2,

All multi-byte values are encoded in network byte order (that is, most significant byte first, also known as "big-endian")

I know it's largely a choice of tradition, but it seems almost anachronistic to specify any new protocols as BE when LE is the overwhelming majority of machines today, and probably has been for at least the past two decades.
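To make the cost being discussed concrete, a small Python sketch of the two byte orders:

```python
import struct

value = 0x12345678
be = struct.pack(">I", value)   # network/big-endian, as CBOR mandates
le = struct.pack("<I", value)   # native order on most CPUs shipping today

assert be == b"\x12\x34\x56\x78"
assert le == b"\x78\x56\x34\x12"
# So on a little-endian host, decoding a CBOR multi-byte integer
# means a byte swap rather than a plain memory load.
assert struct.unpack(">I", be)[0] == value
```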


> All multi-byte values are encoded in network byte order

Head desk.


It's a mistake carried over from msgpack, which presumably chose BE for coolness reasons ("It's The Unix Way!").


It is also weird that "network byte order" is defined as BE. Is there any advantage to BE when data is transferred over the network?


If I remember correctly, BE helps in switching networks for routing packets. A packet would start with the big-endian destination address. If that address was hierarchical, i.e. most significant bytes signify the network (interface) it was in (like IP), then only a few of the first bits need to be processed in order to find which interface to direct the packet to. The transmission out therefore begins just after the few first bits are received.

E.g. for a switch connected to networks A: 0xa, B: 0x1, C: 0x3, only the first nibble (0x1) of a packet destined for 0x1234 needs to be processed before forwarding, saving some time compared to LE, where the entire address would arrive (as 0x34, then 0x12) and have to be processed in full to find out that it belongs to network 0x1.


Registers on anything today will be way wider than a few bytes, especially on network hardware.

The gate savings from endianness-specific circuits are close to zero in comparison to the many other things a typical logic block comes with today.


It's not intended to save gates but time. Bits are (were) serially encoded on the wire, so you could start switching sooner, before even having received the whole header.


I understand the rationale, but all that hardware today will probably take in more bits in a single clock cycle than it did back when this was a concern.

Today, you have to process way way more bytes in a single clock cycle anyways.

Internally, to a chip designer, almost all modern high-speed serial buses look way wider than a single byte, and all of their serialness is kept inside the transmitter/serdes/interface blocks without any external exposure.


No. In this context it simply means "network-standard." And the standard has always been big endian for network applications. Not because of any advantage, but simply because that was the standard.


It makes no difference when you look at the hardware, but an interesting side note is that Ethernet is natively little-endian at the bit level. That is, if you're sending a byte of data, the least significant bit is sent first (think of the pulses of laser light going down a cable). In fact this is true for pretty much all physical-layer protocols, as it makes computing the CRC vastly easier.

So if you use big-endian byte orders you're actually sending your bits all jumbled up (bits from the most significant byte first, then the least significant bit from within each byte first).



A recent discussion here on Latacora's "How (not) to sign a JSON object" [0] had me thinking of CBOR. Unlike JSON, MsgPack, protobufs, BSON, or any other commonly used data interchange format that I'm aware of, CBOR has a canonical representation (although with seeming ambiguity in float representation) [1].

Anyone have any thoughts on using canonical CBOR for object signing? Currently, I'm building a system with a content-addressable data store, and I'm particularly interested in data formats with a canonical form for this use-case.

[0] https://news.ycombinator.com/item?id=20516489

[1] https://tools.ietf.org/html/rfc7049#section-3.9


Generally, there isn't an efficient object model for CBOR (three really troublesome features are the use of arbitrary CBOR structures as map keys, negative numbers that fall outside the signed 64-bit range, and semantic tagging resulting in data being represented in an alternative form, e.g. a BigDecimal type rather than a binary array).

As a result, round-tripping through a CBOR implementation may still result in data structure changes. Depending on the type of change and any weaknesses in, say, the hashing algorithm, this could be a security issue.

On the flip side, you can just tag a byte array as CBOR data, and sign it. Unlike JSON, you don't need to perform an encoding/escaping to make one document safe to embed into another document.


You generally don't need canonical representation.


Shameless self-plug: JSON for Modern C++ (https://github.com/nlohmann/json) supports CBOR along MessagePack, UBJSON, and BSON, see https://github.com/nlohmann/json#binary-formats-bson-cbor-me....


I'm curious how this compares to

* Cap'n Proto

* ASN.1

* gzip-compressed JSON

in various ways. (I don't know much about progress in serialization methods.)


I've always enjoyed the bencode[1] and netstring/tnetstring[2] formats too.

[1]: https://en.wikipedia.org/wiki/Bencode

[2]: http://web.archive.org/web/20140701085126/http://tnetstrings...


For what it's worth, I tested it on this gigantic json file I have in this app (yes I should probably not be using JSON here).

Raw json is 90mb, cbor was 80mb. json+gzip takes it to 30mb and cbor+gzip was 31mb.

That being said, the schema has a lot of repeated keys, so that's why gzip helps a lot.
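The repeated-keys effect is easy to reproduce with the stdlib. A small sketch using hypothetical record fields:

```python
import gzip
import json

# Hypothetical records with the same three keys repeated over and over.
records = [{"timestamp": i, "user_id": i % 10, "event": "click"}
           for i in range(1000)]
raw = json.dumps(records).encode()
packed = gzip.compress(raw)

# The repeated key strings are exactly what DEFLATE handles best,
# which is why gzip narrows the JSON-vs-binary size gap so much.
assert len(packed) < len(raw) // 5
```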


It seems like repeated keys would be super-easy to make more efficient in a binary format: You could write each key in stringish form once (the first time it was seen) and then refer to it with a numeric seen-key reference from there on.


If you want it to be infinitely streaming-compatible (which CBOR is), it raises another question: for how long are identifiers valid, and do they get invalidated or updated at some point? The header compression in HTTP/2 solves such a problem, but also introduces quite a bit of additional complexity.


Right. There is definitely a tradeoff between requiring retention of the identifiers, which requires keeping more state, and re-specifying them, which requires sending more data. There are definitely more sophisticated ways to handle this (see: your HTTP2 header example, which I think even includes value caching), but an easy way to choose a point on that tradeoff spectrum is to simply keep a fixed-sized ringbuffer and retain, say, the last 256 keys.
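A toy sketch of that fixed-window idea (class name and sizes are illustrative only, not the wire format of any real protocol):

```python
class KeyTable:
    """Toy shared key window: reader and writer both keep the last
    `size` string keys and refer to repeats by index."""
    def __init__(self, size=256):
        self.size = size
        self.keys = []              # ordered, oldest first

    def encode_key(self, key):
        if key in self.keys:
            return ("ref", self.keys.index(key))
        if len(self.keys) == self.size:
            self.keys.pop(0)        # evict the oldest, ring-buffer style
        self.keys.append(key)
        return ("str", key)

table = KeyTable(size=2)
assert table.encode_key("name") == ("str", "name")
assert table.encode_key("name") == ("ref", 0)
assert table.encode_key("id") == ("str", "id")
assert table.encode_key("email") == ("str", "email")  # evicts "name"
assert table.encode_key("name") == ("str", "name")    # must be re-sent
```

Evicted keys simply get re-sent in full, so the window size bounds the state both sides must keep.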


This is such a great idea, it got standardized as a CBOR extension: http://cbor.schmorp.de/stringref


Oh nice!


As mentioned above, for something like this to work, you need a pre-defined schema, such as protobuf, which yes I know is what I should be using :)


You definitely don't need a predefined schema. You simply make it a responsibility of both the reader and the writer to keep track of what stringy keys have been seen and in what order. You can then refer back into that ordered list of known names the next time a repeated name comes up.


The primary advantage of a schemaless binary encoding such as CBOR is that it lets you encode binary data directly instead of using a double encoding of base64 in JSON.
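The base64 overhead is easy to quantify:

```python
import base64
import json

blob = bytes(range(256)) * 4                # 1 KiB of binary data
b64 = base64.b64encode(blob).decode()       # JSON can only carry it as text
wrapped = json.dumps({"data": b64})

assert len(b64) == 1368                     # base64 alone adds ~33%
# A binary format like CBOR or msgpack stores the 1024 raw bytes
# plus a small type/length header instead.
```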


> gzip-compressed JSON

CBOR is not about compression to make it smaller, but for machine readability. For example for integers CBOR uses binary representation so that machines can read it directly without converting from string to integer in JSON.
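For example, CBOR's unsigned integers (major type 0, RFC 7049 section 2.1) put small values directly in the initial byte and larger ones in 1/2/4/8 big-endian bytes after it. A minimal encoder sketch:

```python
import struct

def cbor_encode_uint(n: int) -> bytes:
    """Sketch of CBOR major type 0 (unsigned integer), RFC 7049 sec. 2.1."""
    if n < 24:
        return bytes([n])                      # value lives in the initial byte
    elif n < 2**8:
        return b"\x18" + struct.pack(">B", n)  # 1 following byte
    elif n < 2**16:
        return b"\x19" + struct.pack(">H", n)  # 2 following bytes
    elif n < 2**32:
        return b"\x1a" + struct.pack(">I", n)  # 4 following bytes
    else:
        return b"\x1b" + struct.pack(">Q", n)  # 8 following bytes

assert cbor_encode_uint(10) == b"\x0a"
assert cbor_encode_uint(100) == b"\x18\x64"
assert cbor_encode_uint(1000) == b"\x19\x03\xe8"
```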


Except that CBOR uses network order, so you cannot just interpret the bytes as an integer.


Cap'n'proto and ASN.1 require a schema. Gzip compression means the content has to be decrypted into json and then parsed into a native representation, which probably requires more memory and cpu than cbor deserialization.


That’s right. In addition: CBOR can’t automatically compress field names, since those are strings which need to get fully serialized. gzip can compress them too, so it has a chance to trim the size of the data further in exchange for the additional cost of a 2nd encoding. Cap'n Proto, protobuf, and co. can replace field names with IDs as indicated through schemas and will thereby be the most space-efficient in general.


Not decrypted. Decoded or decompressed.


Oops, I meant decompressed.


Technically, you can use ASN.1 in a "schemaless" mode, but it's not very common.


Parsing JSON is probably an order of magnitude more inefficient than working with any binary message formats in terms of speed and energy.


...why? That's not at all obvious.


Because accessing fields in a binary format is a seek(), while in a text format it is much more complicated. Try to build a mental model of what is required to parse text-based formats.

You can have a look to results of performance testing involving these binary message formats vs json.

http://zderadicka.eu/comparison-of-json-like-serializations-...

https://github.com/ludocode/schemaless-benchmarks#speed---de...

https://eng.uber.com/trip-data-squeeze/

http://ugorji.net/blog/benchmarking-serialization-in-go


Sorry, I read "order of magnitude more efficient" and was confused.


If you take into consideration the sum of energy it takes to process an average daily dataset, then yes I think it will be an order of magnitude more efficient (in terms of energy use).


try parsing a tcp/udp header encoded in json vs as it ‘normally’ is. see if you can do it at line rate :o)
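The parent's point in miniature: a UDP header is four fixed-position big-endian ("network order") 16-bit fields, so decoding it is a single fixed-offset unpack (field values below are made up):

```python
import struct

# Source port, destination port, length, checksum -- 8 bytes total.
header = struct.pack("!4H", 53, 33000, 40, 0x1C46)

src, dst, length, checksum = struct.unpack("!4H", header)
assert (src, dst, length, checksum) == (53, 33000, 40, 0x1C46)
# One fixed-offset unpack, feasible at line rate; the JSON equivalent
# needs a character-by-character parser before you even see the fields.
```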


Why would I want to do that?? :) I was using JSON and Avro in production and it was pretty obvious that Avro just beats the shit out of JSON libraries, even though there are state of the art libraries like Boon[1].

1. https://github.com/boonproject/boon/wiki/Boon-JSON-in-five-m...

For me it does not really matter which binary message format you are using; I think there is not a huge difference between the different libraries. There are some feature differences for sure, and library quality maybe. If you are going for the most performance possible, you could look at SBE[2], developed by performance freaks.

2. https://github.com/real-logic/simple-binary-encoding


> Why would I want to do that?? :)

because you _seem_ to imply that json parsing is faster/better than binary parsing...


>> Parsing JSON is probably an order of magnitude more ____inefficient___ than working with any binary message formats in terms of speed and energy.

Do I?


oh dear lord ! how dumb of me :o) apologies...


Neat. One little thing I gotta wonder about: was this name perchance a backronym for the author’s name Carsten Bormann (CBor)? Not that I’m complaining, but that does seem like an odd coincidence.

Also, how does this compare to JSON binary serialization, such as BSON?


Regarding your first question, yes. Source: I have been a student at University of Bremen and attended some of his lectures (I remember at least one titled "Physikalisch-technische Grundlagen Digitaler Medien").


For someone who just saw CBOR: CBOR is kind of a binary version of JSON, like MsgPack. CBOR is already mature, as it is used internally in web browsers by a web standard [1] (and yes, Signed HTTP Exchanges are controversial though...)

Because it is pretty much optimized for machine readability and lightness, one use case is microcontrollers for IoT, with CBOR + CoAP (kind of an HTTP for IoT), although I wouldn't say it is common yet, since CoAP needs IPv6.

[1]: https://wicg.github.io/webpackage/draft-yasskin-http-origin-...


I'd point towards WebAuthn[0] as a reason why browsers implement CBOR, since that's much more widely supported.

[0]: https://www.w3.org/TR/webauthn-1/#dependencies


I recently wanted to use something JSON-RPC'ish for communication between host PC and my microcontroller. I looked into CBOR/MessagePack as well as UBJSON.

I didn't find a CBOR/MessagePack nor UBJSON implementation that was microcontroller-friendly and easy to use.

In the end I ended up just using plain JSON. Easy to debug as you can easily see what's on the wire, easy to implement, relatively small code size.


Most CBOR implementations for MCUs tend to be a part of RTOSs, integrated and optimized for their own memory allocation libraries. But many of them are based on Intel's TinyCBOR: https://github.com/intel/tinycbor


You would probably reach for CoAP when your nodes are speaking 6LoWPAN but, there's nothing about CoAP that requires IPv6.


I've been working on a general purpose binary and text encoding format for some time now as a personal project. It supports time, decimal floats, binary blobs as first-class citizens, supports comments, and focuses on encoding the most commonly used values more compactly. The binary format [1] is almost done (I've come up with a more compact date representation that I'll be adding this weekend to replace smalltime in the C and go implementations), and the text format isn't far behind.

I almost had a better floating point compression format, but it turned out to be too complicated, and only works well for decimal floating point, so I'll probably not use it in CBE.

[1] https://github.com/kstenerud/concise-encoding/blob/master/cb...


Have you considered reserving some opcode space for future extensions?

This would make it easier for CBEv2 to add types without changing how a byte is interpreted.


Yes, I will be reserving space once I finish with the new date format.


I tried CBOR but honestly the JS libraries available are not yet of production quality and I had to abandon it and go back to messagepack.


Fun Fact: CBOR is both an acronym for the software as well as the initials of the author, Professor Carsten Bormann (https://www.informatik.uni-bremen.de/~cabo/)


COSE, which uses CBOR, is the JOSE (commonly known by its subset JWT) of small binary messaging. https://tools.ietf.org/html/rfc8152


It is, although COSE is significantly different to JOSE in many ways.


this has been here for a while... including the website. Question: is anyone actually/effectively using it in a commercial project?


BINC, MsgPack, ...



