CBOR – Concise Binary Object Representation (cbor.io)
81 points by tosh on Aug 3, 2019 | hide | past | favorite | 71 comments



CBOR is MessagePack. At least cbor-ruby started with the MessagePack sources. The story is that Carsten took MessagePack, wrote a standard and added some things he wanted, and called it something else.

I wrote [1] a pretty comprehensive (and admittedly biased) critique of the CBOR standard years ago.

[1] https://news.ycombinator.com/item?id=14072598

Disclaimer: I wrote and maintain a MessagePack implementation.


Indeed, the RFC says so directly:

   CBOR was inspired by MessagePack.  MessagePack was developed and
   promoted by Sadayuki Furuhashi ("frsyuki").  This reference to
   MessagePack is solely for attribution; CBOR is not intended as a
   version of or replacement for MessagePack, as it has different design
   goals and requirements.
I fail to see what is wrong with having both CBOR and MessagePack, or with trying to bring a MessagePack-like thing to the IETF, if MessagePack's design (lack of extension points, etc.) was problematic for enabling future applications.

Also: CBOR is not one person. The CBOR working group, and the IETF in general, put forward the spec. If you object to how the process went down, your quarrel likely lies with IETF, not any individual.


CBOR isn't merely _inspired_ by MessagePack. cbor-ruby forked a Ruby MessagePack implementation. Its format is conceptually and fundamentally the same.

As for the rest of the process, that HN post I linked to has a basket of links at the end which are a good summary of what happened. I still (as I wrote before) won't characterize events, I think people should make up their own mind about it. But I think it's not biasing to reiterate:

> At least cbor-ruby started with the MessagePack sources. The story is that Carsten took MessagePack, wrote a standard and added some things he wanted, and called it something else.


In your opinion, what's the value of MessagePack? I worked with it when writing some Fluentd tools, and it was neat, but I didn't love it enough to switch from JSON on other projects. Maybe there's a killer feature I didn't know I needed?


At my work we recently went through a large exercise to decide on a common data storage format. The contenders were JSON, MessagePack, and Avro. MessagePack won because:

- Msgpack serialization and deserialization is very fast in many languages - often 100x faster than JSON

- Msgpack natively supports encoding binary data

- Msgpack has type extensions, making it trivial to represent common types in an efficient way (eg. IPv4 address, timestamps)

- Msgpack has good libraries available in many languages

If you do not care about those things (no binary data, no need for extended types, not performance critical) then JSON is just fine.
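To make the binary-data point above concrete, here's a minimal sketch of MessagePack's bin 8/16/32 formats (format bytes 0xc4/0xc5/0xc6 per the MessagePack spec), hand-rolled rather than using a library:

```python
import struct

def msgpack_encode_bin(data: bytes) -> bytes:
    """Encode a byte string using MessagePack's bin 8/16/32 formats."""
    n = len(data)
    if n < 2**8:
        return b"\xc4" + struct.pack(">B", n) + data   # bin 8
    elif n < 2**16:
        return b"\xc5" + struct.pack(">H", n) + data   # bin 16
    else:
        return b"\xc6" + struct.pack(">I", n) + data   # bin 32

# JSON has no binary type; the usual workaround is base64, which
# inflates the payload. MessagePack stores the raw bytes with only
# a 2-5 byte type/length header.
payload = bytes(range(16))
assert msgpack_encode_bin(payload) == b"\xc4\x10" + payload
```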


I'm curious why didn't you consider FlatBuffers as well.


FlatBuffers are not self-describing.

FlatBuffers, Protobuf, Cap'n Proto, etc., all require an external schema that you compile into a code chunk that you include in your program. Without this it is impossible to make sense of the data. In our case, the data is semi-structured and changes frequently. The prospect of maintaining a schema registry for all the data users and keeping everyone up to date and backwards compatible was enough of a burden that it was excluded.

Avro also uses schemas, but since the schema is embedded in the file it is self-describing so the reader does not need to do anything special to interpret the data. But Avro's C library is buggy and the python deserialization performance was terrible, so Avro was not selected.


In recent benchmarks I ran for a project, the performance of MessagePack blew JSON out of the water. Obviously, it may differ in your case, but generally, MessagePack is designed to be machine-readable and easy to parse, so performance will generally be better. Message sizes are generally smaller too.


(TL;DR: If you're looking for a low-overhead binary serialization format, MessagePack is great. It sits somewhere between "send JSON to a JS client" and "I have a fleet of backend services all sending data back and forth".)

MessagePack is marketed as "It's like JSON, but fast and small", but I don't view it as a JSON replacement at all. I think JSON is great for most projects, especially anything talking to a JS front end, and as soon as you move into something requiring more performance, you probably need schema/versioning guarantees that something like FlatBuffers or Cap'n Proto would give you.

I think MessagePack is a great on-disk binary format, and a pretty good network binary format if you don't have a complicated application architecture (e.g., a video game client/server, a chat client, etc.). It's way, way faster than JSON (which can be important for mobile, IoT, and embedded work). It's more compact than JSON if you want to avoid the overhead of gzipping/gunzipping everything. It compresses about as well as JSON+gzip if that's important to you. It's not confounded by things like canonicalization or attempts to add a schema. It's also amenable to streaming, something you... mostly can't do with JSON without going to a lot of trouble. It's also easy to implement -- which is important in some cases. Try implementing a JSON encoder/decoder!
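On the streaming point: MessagePack values are self-delimiting, so a stream is just values laid back to back, and a reader pulls them off one at a time with no extra framing. A minimal sketch covering only two of the format's types (positive fixint and uint 16):

```python
import io
import struct

def read_value(stream):
    """Decode one MessagePack value from a byte stream.
    Sketch only: handles positive fixint (0..127) and uint 16 (0xcd)."""
    first = stream.read(1)[0]
    if first < 0x80:            # positive fixint: the value is the byte itself
        return first
    if first == 0xcd:           # uint 16: two big-endian bytes follow
        return struct.unpack(">H", stream.read(2))[0]
    raise NotImplementedError("only a sketch")

# Three concatenated values: 7, 1000, 42 -- no delimiters needed.
stream = io.BytesIO(b"\x07\xcd\x03\xe8\x2a")
assert [read_value(stream) for _ in range(3)] == [7, 1000, 42]
```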


I did write a JSON encoder/decoder! I was just goofing around and decided to see how hard it was. Apparently I didn't get around to fleshing out the docs, so here's the tests: https://github.com/amorphid/json_momoa/blob/master/test/json...


Incredible name.....

Edit: to be a little more constructive, I read this a while ago and it scared me off implementing anything JSON related forever: http://seriot.ch/parsing_json.php#4.


I found it much more pleasant to parse JSON than YAML. At least there's a fairly specific RFC for JSON! YAML parsing felt like a whole bunch of "meh, we came up with this, but just sorta do what you want".

And on the topic of cute names, back when I was learning Elixir, I wrote a wrapper for an Erlang YAML parser. I called it Mark Yamill :)

https://hex.pm/packages/mark_yamill


It is odd to read in the RFC section 1.1 this sentence,

The format should use contemporary machine representations of data

...and then see in section 1.2,

All multi-byte values are encoded in network byte order (that is, most significant byte first, also known as "big-endian")

I know it's largely a choice of tradition, but it seems almost anachronistic to specify any new protocols as BE when LE is the overwhelming majority of machines today, and probably has been for at least the past two decades.
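To make the cost being discussed concrete, a small Python sketch of the two byte orders:

```python
import struct

value = 0x12345678
be = struct.pack(">I", value)   # network/big-endian, as CBOR mandates
le = struct.pack("<I", value)   # native order on most CPUs shipping today

assert be == b"\x12\x34\x56\x78"
assert le == b"\x78\x56\x34\x12"
# So on a little-endian host, decoding a CBOR multi-byte integer
# means a byte swap rather than a plain memory load.
assert struct.unpack(">I", be)[0] == value
```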


> All multi-byte values are encoded in network byte order

Head desk.


It's a mistake carried over from msgpack, which presumably chose BE for coolness reasons ("It's The Unix Way!").


It is also weird that "network byte order" is defined as BE. Is there any advantage to BE when data is transferred over the network?


If I remember correctly, BE helps in switching networks for routing packets. A packet would start with the big-endian destination address. If that address was hierarchical, i.e. most significant bytes signify the network (interface) it was in (like IP), then only a few of the first bits need to be processed in order to find which interface to direct the packet to. The transmission out therefore begins just after the few first bits are received.

E.g. for a switch connected to networks A: 0xa, B: 0x1, C: 0x3, only the first nibble (0x1) of a packet destined for 0x1234 needs to be processed before forwarding, saving some time compared to LE, where the entire address would arrive (as 0x34, then 0x12) and have to be processed in full to find out that it belongs to network 0x1.


Registers on anything today will be way wider than a few bytes, especially on network hardware.

The gate savings from endianness-specific circuits are close to zero in comparison to the many other things a typical logic block comes with today.


It's not intended to save gates but time. Bits are (were) serially encoded on the wire, so you could start switching sooner, before even having received the whole header.


I understand the rationale, but all that hardware today will probably take in more bits in a single clock cycle than it did back when this was a concern.

Today, you have to process way way more bytes in a single clock cycle anyways.

Internally, to a chip designer, almost all modern high-speed serial buses look way wider than a single byte, and all of their serialness is kept inside the transmitter/serdes/interface blocks without any external exposure.


No. In this context it simply means "network-standard." And the standard has always been big endian for network applications. Not because of any advantage, but simply because that was the standard.


It makes no difference when you look at the hardware, but an interesting side note is that Ethernet is natively little-endian at the bit level. That is, if you're sending a byte of data, the least significant bit is sent first (think of the pulses of laser light going down a cable). In fact this is true for pretty much all physical-layer protocols, as it makes computing the CRC vastly easier.

So if you use big-endian byte orders you're actually sending your bits all jumbled up (bits from the most significant byte first, then the least significant bit from within each byte first).



A recent discussion here on Latacora's "How (not) to sign a JSON object" [0] had me thinking of CBOR. Unlike JSON, MsgPack, protobufs, BSON, or any other commonly used data interchange format that I'm aware of, CBOR has a canonical representation (although with seeming ambiguity in float representation) [1].

Anyone have any thoughts on using canonical CBOR for object signing? Currently, I'm building a system with a content-addressable data store, and I'm particularly interested in data formats with a canonical form for this use-case.

[0] https://news.ycombinator.com/item?id=20516489

[1] https://tools.ietf.org/html/rfc7049#section-3.9


Generally, there isn't an efficient object model for CBOR (three really troublesome features are the use of arbitrary CBOR structures as map keys, negative numbers that fall outside the signed 64-bit range, and semantic tagging resulting in data being represented in an alternative form, e.g. a BigDecimal type rather than a binary array).

As a result, round-tripping through a CBOR implementation may still result in data structure changes. Depending on the type of change and any weaknesses in, say, the hashing algorithm, this could be a security issue.

On the flip side, you can just tag a byte array as CBOR data, and sign it. Unlike JSON, you don't need to perform an encoding/escaping to make one document safe to embed into another document.


You generally don't need canonical representation.


Shameless self-plug: JSON for Modern C++ (https://github.com/nlohmann/json) supports CBOR along MessagePack, UBJSON, and BSON, see https://github.com/nlohmann/json#binary-formats-bson-cbor-me....


I'm curious how this compares to

* Cap'n Proto

* ASN.1

* gzip-compressed JSON

in various ways. (I don't know much about progress in serialization methods.)


I've always enjoyed the bencode[1] and netstring/tnetstring[2] formats too.

[1]: https://en.wikipedia.org/wiki/Bencode

[2]: http://web.archive.org/web/20140701085126/http://tnetstrings...


For what it's worth, I tested it on this gigantic json file I have in this app (yes I should probably not be using JSON here).

Raw json is 90mb, cbor was 80mb. json+gzip takes it to 30mb and cbor+gzip was 31mb.

That being said, the schema has a lot of repeated keys, so that's why gzip helps a lot.
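The repeated-keys effect is easy to reproduce with the stdlib. A small sketch using hypothetical record fields:

```python
import gzip
import json

# Hypothetical records with the same three keys repeated over and over.
records = [{"timestamp": i, "user_id": i % 10, "event": "click"}
           for i in range(1000)]
raw = json.dumps(records).encode()
packed = gzip.compress(raw)

# The repeated key strings are exactly what DEFLATE handles best,
# which is why gzip narrows the JSON-vs-binary size gap so much.
assert len(packed) < len(raw) // 5
```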


It seems like repeated keys would be super-easy to make more efficient in a binary format: You could write each key in stringish form once (the first time it was seen) and then refer to it with a numeric seen-key reference from there on.


If you want it to be infinitely streaming-compatible (which CBOR is), it raises another question: for how long are identifiers valid, and do they get invalidated or updated at some point? The header compression in HTTP/2 solves such a problem, but also introduces quite a bit of additional complexity.


Right. There is definitely a tradeoff between requiring retention of the identifiers, which requires keeping more state, and re-specifying them, which requires sending more data. There are definitely more sophisticated ways to handle this (see: your HTTP2 header example, which I think even includes value caching), but an easy way to choose a point on that tradeoff spectrum is to simply keep a fixed-sized ringbuffer and retain, say, the last 256 keys.
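A toy sketch of that fixed-window idea (class name and sizes are illustrative only, not the wire format of any real protocol):

```python
class KeyTable:
    """Toy shared key window: reader and writer both keep the last
    `size` string keys and refer to repeats by index."""
    def __init__(self, size=256):
        self.size = size
        self.keys = []              # ordered, oldest first

    def encode_key(self, key):
        if key in self.keys:
            return ("ref", self.keys.index(key))
        if len(self.keys) == self.size:
            self.keys.pop(0)        # evict the oldest, ring-buffer style
        self.keys.append(key)
        return ("str", key)

table = KeyTable(size=2)
assert table.encode_key("name") == ("str", "name")
assert table.encode_key("name") == ("ref", 0)
assert table.encode_key("id") == ("str", "id")
assert table.encode_key("email") == ("str", "email")  # evicts "name"
assert table.encode_key("name") == ("str", "name")    # must be re-sent
```

Evicted keys simply get re-sent in full, so the window size bounds the state both sides must keep.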


This is such a great idea, it got standardized as a CBOR extension: http://cbor.schmorp.de/stringref


Oh nice!


As mentioned above, for something like this to work, you need a pre-defined schema, such as protobuf, which yes I know is what I should be using :)


You definitely don't need a predefined schema. You simply make it a responsibility of both the reader and the writer to keep track of what stringy keys have been seen and in what order. You can then refer back into that ordered list of known names the next time a repeated name comes up.


The primary advantage of a schemaless binary encoding such as CBOR is that it lets you encode binary data directly instead of using a double encoding of base64 in JSON.
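The base64 overhead is easy to quantify:

```python
import base64
import json

blob = bytes(range(256)) * 4                # 1 KiB of binary data
b64 = base64.b64encode(blob).decode()       # JSON can only carry it as text
wrapped = json.dumps({"data": b64})

assert len(b64) == 1368                     # base64 alone adds ~33%
# A binary format like CBOR or msgpack stores the 1024 raw bytes
# plus a small type/length header instead.
```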


> gzip-compressed JSON

CBOR is not about compression to make it smaller, but for machine readability. For example for integers CBOR uses binary representation so that machines can read it directly without converting from string to integer in JSON.
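For example, CBOR's unsigned integers (major type 0, RFC 7049 section 2.1) put small values directly in the initial byte and larger ones in 1/2/4/8 big-endian bytes after it. A minimal encoder sketch:

```python
import struct

def cbor_encode_uint(n: int) -> bytes:
    """Sketch of CBOR major type 0 (unsigned integer), RFC 7049 sec. 2.1."""
    if n < 24:
        return bytes([n])                      # value lives in the initial byte
    elif n < 2**8:
        return b"\x18" + struct.pack(">B", n)  # 1 following byte
    elif n < 2**16:
        return b"\x19" + struct.pack(">H", n)  # 2 following bytes
    elif n < 2**32:
        return b"\x1a" + struct.pack(">I", n)  # 4 following bytes
    else:
        return b"\x1b" + struct.pack(">Q", n)  # 8 following bytes

assert cbor_encode_uint(10) == b"\x0a"
assert cbor_encode_uint(100) == b"\x18\x64"
assert cbor_encode_uint(1000) == b"\x19\x03\xe8"
```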


Except that CBOR uses network order, so you cannot just interpret the bytes as an integer.


Cap'n'proto and ASN.1 require a schema. Gzip compression means the content has to be decrypted into json and then parsed into a native representation, which probably requires more memory and cpu than cbor deserialization.


That’s right. In addition: CBOR can’t automatically compress field names, since those are strings which need to get fully serialized. gzip can compress them too, so it has a chance to trim the size of the data further in exchange for the additional cost of a 2nd encoding. Cap'n Proto, protobuf, and co. can replace field names with IDs as indicated through schemas and will thereby be the most space-efficient in general.


Not decrypted. Decoded or decompressed.


Oops, I meant decompressed.


Technically, you can use ASN.1 in a "schemaless" mode, but it's not very common.


Parsing JSON is probably an order of magnitude more inefficient than working with any binary message formats in terms of speed and energy.


...why? That's not at all obvious.


Because accessing fields in a binary format is a seek(), while in a text format it is much more complicated. Try to build a mental model of what is required to parse text-based formats.

You can have a look to results of performance testing involving these binary message formats vs json.

http://zderadicka.eu/comparison-of-json-like-serializations-...

https://github.com/ludocode/schemaless-benchmarks#speed---de...

https://eng.uber.com/trip-data-squeeze/

http://ugorji.net/blog/benchmarking-serialization-in-go


Sorry, I read "order of magnitude more efficient" and was confused.


If you take into consideration the sum of energy it takes to process an average daily dataset, then yes I think it will be an order of magnitude more efficient (in terms of energy use).


try parsing a tcp/udp header encoded in json vs as it ‘normally’ is. see if you can do it at line rate :o)
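The parent's point in miniature: a UDP header is four fixed-position big-endian ("network order") 16-bit fields, so decoding it is a single fixed-offset unpack (field values below are made up):

```python
import struct

# Source port, destination port, length, checksum -- 8 bytes total.
header = struct.pack("!4H", 53, 33000, 40, 0x1C46)

src, dst, length, checksum = struct.unpack("!4H", header)
assert (src, dst, length, checksum) == (53, 33000, 40, 0x1C46)
# One fixed-offset unpack, feasible at line rate; the JSON equivalent
# needs a character-by-character parser before you even see the fields.
```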


Why would I want to do that?? :) I was using JSON and Avro in production and it was pretty obvious that Avro just beats the shit out of JSON libraries, even though there are state of the art libraries like Boon[1].

1. https://github.com/boonproject/boon/wiki/Boon-JSON-in-five-m...

For me it does not really matter which binary message format you are using; I think there is not a huge difference between the different libraries. There are some feature differences for sure, and library quality maybe. If you are going for the most performance possible, you could look at SBE[2], developed by performance freaks.

2. https://github.com/real-logic/simple-binary-encoding


> Why would I want to do that?? :)

because you _seem_ to imply that json parsing is faster/better than binary parsing...


>> Parsing JSON is probably an order of magnitude more ____inefficient___ than working with any binary message formats in terms of speed and energy.

Do I?


oh dear lord ! how dumb of me :o) apologies...


Neat. One little thing I gotta wonder about: was this name perchance a backronym for the author’s name Carsten Bormann (CBor)? Not that I’m complaining, but that does seem like an odd coincidence.

Also, how does this compare to JSON binary serialization, such as BSON?


Regarding your first question, yes. Source: I have been a student at University of Bremen and attended some of his lectures (I remember at least one titled "Physikalisch-technische Grundlagen Digitaler Medien").


For someone who just saw CBOR: CBOR is kind of a binary version of JSON, like MsgPack. CBOR is already mature, as it is used internally in web browsers by a web standard [1] (and yes, Signed HTTP Exchanges are controversial though...)

Because it is pretty much optimized for machine readability and lightness, one use case is microcontrollers for IoT, with CBOR + CoAP (kind of an HTTP for IoT), although I wouldn't say it is common yet, since CoAP needs IPv6.

[1]: https://wicg.github.io/webpackage/draft-yasskin-http-origin-...


I'd point towards WebAuthn[0] as a reason why browsers implement CBOR, since that's much more widely supported.

[0]: https://www.w3.org/TR/webauthn-1/#dependencies


I recently wanted to use something JSON-RPC'ish for communication between host PC and my microcontroller. I looked into CBOR/MessagePack as well as UBJSON.

I didn't find a CBOR/MessagePack nor UBJSON implementation that was microcontroller-friendly and easy to use.

In the end I ended up just using plain JSON. Easy to debug as you can easily see what's on the wire, easy to implement, relatively small code size.


Most CBOR implementations for MCUs tend to be a part of RTOSs, integrated and optimized for their own memory allocation libraries. But many of them are based on Intel's TinyCBOR: https://github.com/intel/tinycbor


You would probably reach for CoAP when your nodes are speaking 6LoWPAN but, there's nothing about CoAP that requires IPv6.


I've been working on a general purpose binary and text encoding format for some time now as a personal project. It supports time, decimal floats, binary blobs as first-class citizens, supports comments, and focuses on encoding the most commonly used values more compactly. The binary format [1] is almost done (I've come up with a more compact date representation that I'll be adding this weekend to replace smalltime in the C and go implementations), and the text format isn't far behind.

I almost had a better floating point compression format, but it turned out to be too complicated, and only works well for decimal floating point, so I'll probably not use it in CBE.

[1] https://github.com/kstenerud/concise-encoding/blob/master/cb...


Have you considered reserving some opcode space for future extensions?

This would make it easier for CBEv2 to add types without changing how a byte is interpreted.


Yes, I will be reserving space once I finish with the new date format.


I tried CBOR but honestly the JS libraries available are not yet of production quality and I had to abandon it and go back to messagepack.


Fun Fact: CBOR is both an acronym for the software as well as the initials of the author, Professor Carsten Bormann (https://www.informatik.uni-bremen.de/~cabo/)


COSE, which uses CBOR, is the JOSE (commonly known by its subset JWT) of small binary messaging. https://tools.ietf.org/html/rfc8152


It is, although COSE is significantly different to JOSE in many ways.


this has been here for a while... including the website. Question: is anyone actually/effectively using it in a commercial project?


BINC, MsgPack, ...



