Amazon Ion – A richly-typed, self-describing, hierarchical serialization format

hliyan · on Nov 20, 2021

This reminded me of a tight-packed binary format we used in the trading systems domain almost 20 years ago. Instead of including metadata/field names in each message, it had a central message dictionary that every client/server would first download a copy from. Messages had only type IDs, followed by binary packed data in the correct field order. Because of microsecond latency requirements, we even avoided the serialization/deserialization process by making the memory format of the message and the wire format one and the same. The message class contained the same buffer that you would send/store. The GetInt(fieldID) method of the class simply points to the right place in the buffer and does a cast to int. Application logs contained these messages, rather than plain text. There was a special reader to read logs. Messages were exchanged over raw TCP. They contained their own application layer sequence number so that streams could resume after disconnection.

In that world, lantencies were so low that the response to your order submission would land in your front-end before you've had time to lift your finger off the enter key. I now work with web based systems. On days like this, I miss the old ways.

angstrom · on Nov 20, 2021

And to top it off you could fit the entire message into whatever the MTU of your network supported. Cap it at 1500 bytes and subtract the overhead for the frame headers and you get an extremely tight TCP/IP sequence stream that buffers through 16MB without needing to boil the ocean for a compound command sequence.

Having been in industry only 2 decades it amuses me how many times this gets rediscovered.

hliyan · on Nov 20, 2021

That just reminded me of the most mysterious scaling issue I ever faced. We had a message to disseminate market data for multiple markets (e.g. IBM: 100/100.12 @ NYSE, 101/102 @ NASDAQ etc.). The system performed admirably under load testing (think 50,000 messages per second). One day we onboarded a single new regional exchange and the whole market data load test collapsed. We searched high and low for days without success, until someone figured out that the new market addition had caused the market data message to exceed the Ethernet frame size for the first time. Problem was not at the application layer or the transport, it was data link layer fragmentation! Figuring that out felt like solving a murder mystery (I wasn't the one who figured it out though).

jkhdigital · on Nov 20, 2021

Classic example of a leaky abstraction, and the principle that implementation details inevitably become undocumented API behavior.

kabdib · on Nov 20, 2021

A lot of "transparent RPC" systems are like this. "It's just like a normal function call, it's sooo convenient" . . . until it isn't, because it involves the network hardware and configuration, routing environment, firewalls, equipment failure . . .

andylynch · on Nov 20, 2021

I’ve worked on systems like this too - the max packet size is very well documented. Then post trade it all gets turned into FIXML which somehow manages to be both more verbose and less readable.

angstrom · on Nov 20, 2021

Yeah, that's part of the trick for large listing responses to be spread across frames. Usually with some indicator like a "more flag" so the client can say "get me the next sequence by requesting the next index in the listup with the prior btree index. People do this all the time with large databases and it's a very similar use case.

vendiddy · on Nov 20, 2021

This was a fun back and forth to read!

cutemonster · on Nov 21, 2021

How did you solve the problem?

elcritch · on Nov 20, 2021

Ouch that’s rough. One nice bit of IPv6 is that it doesn’t allow fragmentation. It often much nicer to get no message or an error than subtly missing data.

depereo · on Nov 20, 2021

IPv6 does allow fragmentation.

elcritch · on Nov 20, 2021

Ah yah that’s right. I’m just learning more of ipv6 and get it mixed up. It appears what I had in my mind was about intermediate routers: “Unlike in IPv4, IPv6 routers (intermediate nodes) never fragment IPv6 packets.” (Wikipedia). To the previous point, it looks like ipv6 does require networks to send 1280 byte or smaller packets unfragmented.

kwertyoowiyop · on Nov 20, 2021

Every multiplayer game programmer from the 1990s agrees with you!

agumonkey · on Nov 20, 2021

Smells like engineering

armchairhacker · on Nov 20, 2021

I don't understand why serialization formats that separate structure and content aren't more popular.

Imagine a system every message is a UID or DID (https://www.w3.org/TR/did-core/) followed by raw binary data. The UID completely describes the shape of the rest of the message. You can also transmit messages to define new UIDs: these messages' UID is a shared global UID that everyone knows about.

Once a client learns a UID, messages are about as compact as possible. And the data defining UIDs can be much more descriptive than e.g. property names in JSON. You can send documentation and other excess data when defining the UID, because you don't have to worry about size, because you're only sending the UID once. And UIDs can reference other UIDs to reduce duplication.

NavinF · on Nov 20, 2021

You just described protobufs and all its successors.

See the “@0xdbb9ad1f14bf0b36” at the top of this capnproto file for example: https://capnproto.org/language.html

It’s a 64bit random number so it’ll never have unintentional collisions.

Also note that a capnp schema is natively represented as a capnp message. Pretty convenient for the “You can also transmit messages to define new UIDs” part of your scheme :)

nly · on Nov 20, 2021

Protobufs is a boring old tag-length-value format. It's kind of the worst of both worlds because it has no type information encoded in to it, meaning it's useless without the schema, while still having quite a bit of overhead.

Capn'Proto is more like a formalization of C structs in that new fields are only added to the end. If memory serves, on the wire there is no tag, type or length info (for fixed size field types), and everything is rooted at fixed offsets

kentonv · on Nov 20, 2021

Mostly right. Allow me to provide some wonky details.

Protobuf uses tag-type-values, i.e. each field is encoded with a tag specifying the field number and some basic type info before the value. The type info is only just enough information to be able to skip the field if you don't recognize it, e.g. it specifies "integer" vs. "byte blob". Some types (such as byte blob) also have a length, some (integer) do not. Nested messages are usually encoded as byte blobs with a length, but there's an alternate encoding where they have a start tag and an end tag instead ("start group" and "end group" are two of the basic types). On one hand, having a length for nested messages seems better because it means you can skip the message during deserialization if you aren't interested in it. On the other hand, it means that during serialization, you have to compute the length of the sub-message before actually serializing it, meaning the whole tree has to be traversed twice, which kind of sucks, especially when the message tree is larger than the L1/L2 cache. Ironically, most Protobuf decoders don't actually support skipping parsing of nested messages so the length that was so expensive to compute ends up being largely unused. Yet, most decoders only support length-delimited nested messages and therefore that's what everyone has to produce. Whoops.

Now on to Cap'n Proto. In a given Cap'n Proto "struct", there is a data section and a pointer section. Primitive types (integers, booleans, etc.) go into the data section. This is the part that looks like a C struct -- fields are identified solely by their offset from the start of the data section. Since new fields can be added over time, if you're reading old data, you may find the data section is too small. So, any fields that are out-of-bounds must be assumed to have default values. Fields that have complex variable-width types, like strings or nested structs, go into the pointer section. Each pointer is 64 bits, but does not work like a native pointer. Half of the pointer specifies an _offset_ of the pointed-to object, relative to the location of the pointer. The other half contains... type information! The pointer encodes enough information for you to know the basic size and shape of the destination object -- just enough information to make a copy of it even if you don't know the schema. This turns out to be super-important in practice for proxy servers and such that need to pass messages through without necessarily knowing the details of the application schema.

In short, both formats actually contain type information on the wire! But, not a full schema -- only the minimal information needed to deal with version skew and make copying possible without data loss.

nly · on Nov 20, 2021

I wouldn't call what protobuf encodes type information. If I recall all the group stuff is deprecated, so what's left basically boils down to 3 types: 32 bit values, 64 bit values and length prefixed values, which covers strings and sub-messages. Without the schema you can't even distinguish strings from sub-objects, as they are both length prefixed as you described.

Can you even distinguish floats and ints without a schema in protobufs? I don't remember.

I really enjoy capnproto, flatbuffers and Avro and bounce between them depending on the task at hand.

kentonv · on Nov 21, 2021

> I wouldn't call what protobuf encodes type information.

Well... I would call it type information, just not complete.

> 32 bit values, 64 bit values and length prefixed values

In protobuf, most integer types are actually encoded as varints, i.e. variable-width integers, not fixed 32-bit or 64-bit. varint encoding encodes 7 bits per byte, and uses the extra bit to indicate whether there are more bytes. (It's not a very good encoding, as it is very branch-heavy. Don't use this in new protocols.)

> Can you even distinguish floats and ints without a schema in protobufs?

You can't distinguish between float vs. fixed32. But int32 would be varint-encoded, while floats are always fixed-width, so you could distinguish between those. (You can't distinguish between int32, uint32, and sint32, though -- and sint32 in particular won't coerce "in the obvious way" to the others.)

The really unfortunate thing is you can't distinguish between strings vs. nested messages (of the length-delimited variety). So if you don't specifically know that something is a nested message then it's not safe to try parsing it...

infogulch · on Nov 20, 2021

Interesting. Ids in particular are described here: https://capnproto.org/language.html#unique-ids

I wonder if giving it a name based on the hash of the definition has been explored; like Unison [0] where all code is content addressable, but for just capnproto definitions. Is there a reason not to?

[0]: https://www.unisonweb.org

NavinF · on Nov 20, 2021

Capnp uses the name of your message, but not its full definition because that would make it impossible to extend protocols in a backwards compatible way. Without the ability to add new fields, making changes to your protocol would be impossible in large orgs.

garmaine · on Nov 20, 2021

> It’s a 64bit random number so it’ll never have unintentional collisions.

It'll have unintentional collisions if you ever generate more than 4 billion of these random numbers. That's not inconceivable.

NavinF · on Nov 20, 2021

Yes it is. Message schemas are made by humans. Most of these messages will be extended in a backwards compatible manner over the life of a project rather than replaced entirely so their IDs don’t change. That’s kinda the point of protobufs and its successors.

I’ve probably generated 100 IDs over my lifetime.

garmaine · on Nov 20, 2021

Which puts it on the same order of magnitude as the number of people on the planet. If every person alive generated a schema (or if 1/100th of all people generate 100 IDs each like you) then we'd have a small number of collisions. More likely you'd get large numbers of schema like that if there's a widespread application of a protocol compiler that generates new schema programmatically, e.g. to achieve domain separation, and then is applied at scale. I'm not saying that's likely, just that it is not, as is claimed, inconceivable.

kentonv · on Nov 20, 2021

It's only really a problem if you use the IDs in the same system. It's highly unlikely that you'd link 4B schemas into a single binary. And anyway, if you do have a conflict, you'll get a linker error.

Cap'n Proto type IDs are not really intended to be used in any sort of global database where you look up types by ID. Luckily no one really wants to do that anyway. In practice you always have a more restricted set of schemas you're interested in for your particular project.

(Plus if you actually created a global database, then you'd find out if there were any collisions...)

heavenlyblue · on Nov 20, 2021

If you have 4 billion of them generated there’s another 1/4billionth chance you’ll generate a duplicate.

On top of that you would not only need to generate the same ID, you would need to USE it in the same system where that is could have some semantics to not cause an error.

logicchains · on Nov 20, 2021

>It'll have unintentional collisions if you ever generate more than 4 billion of these random numbers.

If it's 64 bit, doesn't that mean you'd need to generate ~10000000000000000000000000000000000000000000000000000000000000000 (2^64) of those numbers to have a collision, not 2^32?

tomerv · on Nov 20, 2021

If you generate randomly then, due to the birthday paradox, after generating sqrt(N) values you have a reasonable chance of collision.

The birthday paradox is named after the non-intuitive fact that with just 32 people in a room you have > 50% of 2 people having a birthday on the same day of the year.

doo_daa · on Nov 20, 2021

I think it's 23 people in a room. The canonical example is people on a football (soccer) pitch. With 11 per side plus the referee there's a 50% chance that two will share the same birthday.

adwn · on Nov 20, 2021

> 32 people

Slight correction: only 23 people, actually. So in every second football ("soccer") game, you have two people on the field with the same birthday.

ratorx · on Nov 20, 2021

Does birthday paradox apply here? It’s about any pair of people having the same birthday, whereas in this case you need someone else with a specific birthday.

For example, if you generate 2 numbers and they are the same, but are different to the capnproto number, that’s a collision but doesn’t actually matter.

EDIT: It does apply, I misunderstood what the number was being used for.

elcritch · on Nov 20, 2021

It does apply, according to https://www.johndcook.com/blog/2017/01/10/probability-of-sec...

ratorx · on Nov 20, 2021

You’re right, I misunderstood what the magic number was being used for.

lozenge · on Nov 20, 2021

But if my application only uses 100 schemas, I only care about a collision if it's with one of those 100.

gpderetta · on Nov 20, 2021

You have a collision if any two schemas share the id, not if a specific schema collides with any of the others. So it is exactly like the birthday paradox.

heavenlyblue · on Nov 20, 2021

Yeah, but that collision probably doesn’t matter because there’s a bunch of other variables that need to come together for it to be an issue at all.

gpderetta · on Nov 20, 2021

If the schema id is the message id, in principle it could be an issue as the protocol on the wite would be ambiguous. Then again, you should be able to detect any collisions when you register a schema with the schema repo and deal with it at that time.

heavenlyblue · on Nov 20, 2021

I don’t understand your maths here: how is generating 4billion of them is any different from generating 3 billion except a slight raise in the probability measure?

remram · on Nov 20, 2021

When you reach the 4 billionth version of your protocol?

kentonv · on Nov 20, 2021

All versions of the same protocol have the same ID. That is the point of IDs -- to link together different versions of the protocol.

remram · on Nov 20, 2021

You're right! That makes collisions even less likely then.

boxfire · on Nov 20, 2021

MD5 is a 128 bit random number no one would ever have thought would collide. 64 bits is peanuts especially when message types are being defined dynamically

NavinF · on Nov 20, 2021

Dude that’s why I said “unintentional collisions”.

Of course you can get intentional collisions. The security model here assumes that anyone that wants to know your message’s ID can just ask.

Did you know that the Internet Protocol uses a 4-bit header to specify the format (v4 or v6) of the rest of the message? They should have used 128 bits. What a bunch of fools.

remram · on Nov 20, 2021

MD5 is safe against unintentional collisions.

dboreham · on Nov 20, 2021

This is protocol buffers + a global type registry. I worked on such a system.

pcarolan · on Nov 20, 2021

Is it public? Id love to learn more about it.

jenny91 · on Nov 20, 2021

If you read the protobuf source, you can see a bunch of places where you can hook in custom type-fetching code, e.g. in the google.protobuf.Any type.

After studying it a bit, I'm certain this is how it's used inside Google (might also be mentioned elsewhere).

All you'd really need to do is to compile all protos into a repository (you can spit out the binary descriptors from protoc), then fetch those and decode in the client.

It'd actually be quite straightforward to set up,

mtrovo · on Nov 20, 2021

I think the system OP is describing is a little bit more complex. You're not just describing message types, you also have message templates; a template declares a message type and a set of prefilled fields. You save data by just sending the subset of fields that are actually changing, which is a very good abstraction for market data. The template is hydrated on the protocol parsing layer so your code only has to deal with message types itself.

erenon · on Nov 20, 2021

We do something very similar in binlog: https://github.com/morganstanley/binlog

Serialization is platform-dependent (to make it a simple memcpy most of the time), and the schema is sent up front (but can be updated later, with in-bound messages at will). See the User Guide (http://binlog.org/UserGuide.html) and the Internals (http://binlog.org/Internals.html) for more.

mendigou · on Nov 20, 2021

This is exactly how it's done for spacecraft telemetry and telecommand too, but in this case it's to save bytes rather than processing time.

I also miss working on those systems.

oandrew · on Nov 20, 2021

Interesting. Confluent Avro + Schema registry + Kafka uses exactly the same approach - binary serialized Avro datums are prefixed with schema id which can be resolved via Schema registry

lordnacho · on Nov 20, 2021

Same here, I wrote an exchange core that did this using SBE. Basically you don't serialize in the classical sense, because you're simply taking whatever bytes are at your pointer and using them as some natural type. The internals of the exchange also simply used the same layout, so there was minimal copying and interpreting. On the way out it was the same, all you had to do was mask a few fields that you didn't want everyone to see and ship it onto the network.

Even an unoptimized version of this managed to get throughput in the 300K/s range.

Somehow it's the endpoint of my journey into serialization. Basically, avoid it if you need to be super fast. For most things though, it's useful to have something that you can read by eye, so if you're not in that HFT bracket it might be nicer to just use JSON or whatever.

porker · on Nov 20, 2021

Fab story, thank you! I understood up to "Messages were exchanged over raw TCP. They contained their own application layer sequence number so that streams could resume after disconnection." Can you go into more details about how the sequence number and resuming after disconnection worked?

mtrovo · on Nov 20, 2021

Server used a global sequence number for all messages they transmit. Clients are stateful so they know exactly what was the latest message they processed and would send that id when creating a new connection. This was very important as a lot of the message types used delta values, one of the most important ones being the order book. So in order to apply a new message you had to make sure that you're internal state was at the correct sequence id, failing to do so would make your state go bonkers, specially when you're talking about hundreds of messages being received per second. It's scary but you had a special message type that would send you a snapshot of the expected state with a sequence id that they correspond to. So your error handling code would fetch one of these and them ask for all the messages newer than that.

hliyan · on Nov 20, 2021

This is exactly right. It was almost always deltas in favor of snapshots. One of the downsides was that sometimes, debugging an issue required replaying the entire market up to the point of the crash/bug.

hliyan · on Nov 20, 2021

Pretty basic. The receiving process usually has an input thread that just puts the messages into a queue. Then a processing thread processes (maybe logic, maybe disk writes, maybe send) the messages and queues up periodic batch acks to the sender. The sender uses these acks to clear its own queue. The receiver persists the last acked sequence number, so that in case of a restart, it can tell upstream senders to restart sending messages from that point.

mianos · on Nov 20, 2021

The number of times people have "invented" ASN.1 now is ridiculous.

makotobestgirl · on Nov 20, 2021

Sounds like Google's flatbuffers [0], which indexes directly into a byte buffer using the field size prefix.

[0] https://google.github.io/flatbuffers/

nly · on Nov 20, 2021

What you're describing is exactly what still takes place in trading platforms, although a few i've seen now use SBE for consistency sake (it's very common on the market data side)

amitport · on Nov 20, 2021

Well that's just like using C structs. The best serialization protocol :).

nly · on Nov 20, 2021

Some finance software systems do that too. It tends to be a nightmare because people end up adding new message types just to add a single field

mtrovo · on Nov 20, 2021

Don't know if you're describing the original FIX itself with the TCP connection. On FAST FIX they got rid of the TCP connection and market data was sent over UDP using several parallel connections, data was reordered on the client side at consumption time and it only used a TCP connection to recover data when a sequence gap was found.

hliyan · on Nov 20, 2021

Actually, even FAST was too slow for us. This was a proprietary messaging middleware library. And this particular market data feed was the direct one into the matching engine itself. For the rest of the system, we used a sort of reliable multicast using UDP for the first transmission and TCP for missed messages. We initially tried out a Gossip/Epidemic protocol but that didn't work out too well.

FpUser · on Nov 20, 2021

I had exactly the same implementation except that type / version belonged to the whole message and would map to appropriate binary buffer in memory. No real de/serialization was needed.

I still use it in my UDP game servers, with added packet id if message exceeds max datagram length and has to be split

ericbarrett · on Nov 20, 2021

The one concern I'd have with this format is a length field getting corrupted in transit and causing an out-of-bounds memory access. The network protocols' checksums won't save you 100% of the time, especially if there's bad hardware in the loop. If every field is fixed length this is less of a concern, of course; you might get bad data but you won't get e.g. a string with length 64M.

hliyan · on Nov 20, 2021

In our system, if the message didn't unpack properly, the application would send a retransmit request with that message's sequence number. But in practice, this scenario never occurred because TCP already did this for us.

FpUser · on Nov 20, 2021

I do not remember it ever happening but being semi-paranoid I had length in 2 places - beginning and the end of the message.

mrlemke · on Nov 20, 2021

Very neat and similar to a project I am starting for packet radio. I went further with the dictionary concept so that it contains common data. This way, your message contains only a few dictionary "pointers" (integers in base 64). This makes it easier to fit messages in ASCII for 300 baud links.

ktzar · on Nov 20, 2021

Is it FIX messages? https://en.wikipedia.org/wiki/Financial_Information_eXchange It's a good idea, extensible (ranges available for banks to implement their own codes), and fast.

nly · on Nov 20, 2021

Old school texty FIX is incredibly slow. FAST FIX is faster but not fun to use. Largely SBE has won adoption on the market data side, with huge platforms like Euronext (biggest on Europe) using it.

mtrovo · on Nov 20, 2021

I stopped working in the area on the age of FAST FIX, which was extremely good for the time. Do you know what are the differences to SBE?

nly · on Nov 20, 2021

I guess I'm biased based on experience at the companies I've worked at but FAST never seemed to have good libraries or tooling

hliyan · on Nov 20, 2021

It was a proprietary messaging middleware library. We actually found even FAST FIX slow.

o_bender · on Nov 20, 2021

FAST FIX protocol is terrible performance-wise, its format requires multiple branching at every field parsing. Even "high-performance" libraries like mFAST are slow: I recently helped a client to optimize parsing for several messages and got 8x speed improvement over mFAST (which is a big deal in HFT space).

Aeolun · on Nov 20, 2021

> In that world, lantencies were so low that the response to your order submission would land in your front-end before you've had time to lift your finger off the enter key.

If the order submission process depends on the manual press on the enter key (+/- 50ms) is there any point to that though?

hliyan · on Nov 20, 2021

Despite all the algorithms we employed, the concept of a manual trade never went away. Also, when the front-end was taken out of the equation, the latencies were in the microsecond range. 50ms would be excruciatingly slow for an algorithm.

sodality2 · on Nov 20, 2021

It was probably used with high frequency trading so fully automated unless you happened to be testing it manually.

danachow · on Nov 20, 2021

OT but keyboard latency can and often is far below 50ms, more like 1ms. It seems to be a common misconception that denouncing mandates increased lag.

dan-robertson · on Nov 20, 2021

Is this a number that came from an actual benchmark or from some marketing material from a keyboard maker? I ask this because [1] finds latency (measured from touching the key to the usb packet arriving) of 15ms with the fastest keyboard and around 50ms with others, though apparently some manufacturers have since improved. Or are you talking about midi keyboards where I guess latency is more noticeable to users?

[1] https://danluu.com/keyboard-latency/

danachow · on Nov 20, 2021

From the countless review and small time YouTube channels that test these things regularly.

I think that post must be a few years out of date - and moreover by its own admission doesn’t even test hardly any “gaming” keyboards. There is a tremendous amount of competition in keyboards that has been building for the past 10 years.

Input latency is now a marketing thing like horsepower, and there are reasonably reputable [1] places and countless small time YouTube reviewers that test these things. It’s not like it is difficult to improve latency, and now that it is something that is competitively marketed it is delivered on.

[1] https://www.rtings.com/keyboard/tests/latency

Personally I think it’s a bit ridiculous. This fetishization with minimizing latency to now sub-ms levels doesn’t necessarily lead to better performance as many top level gamers do not use the lowest latency level keyboards. But that doesn’t change the fact that modern mainstream gaming keyboards can hit a latency far below 50ms.

dan-robertson · on Nov 20, 2021

The link I posted was 2017. The site you link gives quite different ratings. I assume partly it is different methodology (the site you link tries to account for key travel somehow and they do something with a display and try to account for display latency rather than using a logic analyzer), but I’m not really sure. For some keyboards in common:

- apple magic keyboard (? vs 2017) 15ms vs 27ms

- das keyboard (3 vs S professional/4 Professional) 25 vs 11/10ms

- razer ornata (chroma vs chroma/chroma 2) 35 vs 11.4/10.1ms

Interestingly it is not some simple uniform difference: the Apple keyboard does much worse in the rtings test, perhaps getting not much of a bonus from key travel compensation. But the das keyboard vs the razer that are 10ms apart on my link perform equally on rtings (but maybe I found the wrong model). I don’t have a good explanation for that discrepancy.

danachow · on Nov 21, 2021

I know it is 2017, but that is a very long time in the gaming/mech keyboard market. I remember just about 10 years ago when mech keyboards were a niche for weirdos and a few others that swore by their Model M's - you now can buy these in Walmart. The point about discrepancy is well taken* but I think the bigger point is on the rtings list the number of offerings that are an order of magnitude lower latency - such that the methodology used in your link is not even viable.

dan-robertson · on Nov 21, 2021

Why is using a high speed camera and a logic analyzer less viable than measuring the end to end latency and trying to subtract the computer part of it? Or are you suggesting that a solenoid should be used to press the key instead of a finger?

danachow · on Nov 22, 2021

The latter. Even the linked article points out the inherent limitations.

Aeolun · on Nov 20, 2021

I was thinking more of the time a human finger needs to push the button down.

formerly_proven · on Nov 20, 2021

That's because a lot of input hardware uses moronic debouncing.

secondcoming · on Nov 20, 2021

I assume you were using C++? I'm not sure what you describe is possible these days due to UB. At the very least just casting bytes received over the wire to a type is UB, so you technically need a memcpy() and hope that the compiler optimises it out.

hliyan · on Nov 20, 2021

Yes, it was C++. I was unfamiliar with the acronym "UB" so did a Google search. Does it mean "Undefined Behavior"? If I remember correctly, primitive types other than strings are memcpy'd. GetStr basically returned a char* to the right place in the buffer.

secondcoming · on Nov 20, 2021

Apologies, yes Undefined Behaviour

gpderetta · on Nov 21, 2021

It is not UB if it is correctly aligned.

It is UB if the underlying dynamic type is not compatible from the access type

As the bytes are coming from the kernel, you get to decide the dynamic type.

sattoshi · on Nov 20, 2021

Apache Thrift works on the same principle of separating structure from data.

atlgator · on Nov 20, 2021

We did the same in high fidelity flight simulators for a lot less money I'm sure.

clhodapp · on Nov 20, 2021

It's staggering to me that people keep making these "rich" data formats without sum types. At least to me, the "ors" are just as important as the "ands" in domain modeling. Apart from that, while you can always sort of fake it with a bunch of optional fields I believe that you kind of need a native encoding to a tagged union if you want to avoid bloating your messages.

seniorsassycat · on Nov 20, 2021

ion schema is a type system that can validate ion values and it supports sum types.

https://amzn.github.io/ion-schema/docs/spec.html#union

The ion data model doesn't describe a schema or type system. It's a data structure where values are of a known type. In the binary format values are preceded by a type id, in the text format the syntax declares the type - "" for string, {} for struct. The data model doesn't declare what types a value could have, only the type it does have.

joshlemer · on Nov 20, 2021

Doesn't it trivially have "sum types" since it's just arbitrary self-describing data? i.e. nobody is stopping you from passing around objects in such a way:

{a:1} {a:{b:2}} {a:4} {a:{b:4}}

There's no static type layer over top of this, so it's inherently up to interpretation and whatever type system you want to use to describe this data, to be able to express that the values of `a` can be of type `number | {b: number}`

valenterry · on Nov 20, 2021

> There's no static type layer over top of this

Yeah, that's the problem. I mean, hey, why json? We could just use unstructured plaintext for everything and now we are free to do everything. But obviously that has its own drawbacks.

Having built-in support for sumtypes means better and more ergonomic support from libraries, it means there is one standard and not different ways to encode things and it also means better performance and tooling.

joshlemer · on Nov 20, 2021

The point is that there's no reason to single out sumtypes here. Insofar as ions/json has support for arrays/objects/strings/numbers, it has exactly the same support for sumtypes, as in the example I showed above. Here is a list of "sumtype" `string | number | object`:

[{}, "hi", 1, 2, 3, "yo", {a: "bc"}]

valenterry · on Nov 20, 2021

No, that is not a sumtype, that's an array.

In the same sense "1e-12" is not a number, it's a string. Yes, it's a string that encodes a number in a certain notion, but for alle the tooling, the IDE, the libraries, etc. it will stay a string.

joshlemer · on Nov 20, 2021

What I mean is, it is an array of a sumtype `number | string | object`. So precisely, you could call it a `list<number | string | object>`

ImprobableTruth · on Nov 20, 2021

Sum types =/= union types. Sum types are also called 'tagged' or 'discriminable' unions because they have some way to discriminate between them. That is, if you have an element a of type A, a is not part of the sum type A + B because it's missing a tag.

[5,"hello",3] has the type list (int ∪ string), not list (int + string). You can emulate the latter by manually adding a tag, but native support is much preferable.

joshlemer · on Nov 20, 2021

I know the differences between untagged and tagged unions, I'm trying to provide a minimal example without distracting details but sure we can talk about tagged unions. Here is a list of tagged unions, so I once again point out that sum types are "supported" in JSON/ions just as much as any other data type:

    [
      {tag: "a", foo: 1},
      {tag: "b", bar: "hi", baz: 2},
      {tag: "a", foo: 3},
      {tag: "a", foo: 4},
      {tag: "a", foo: 5},
      {tag: "b", bar: "yo", baz: 6}
    ]

clhodapp · on Nov 20, 2021

By that reasoning, why have structs? You can just make an array of size-2 arrays that give the key and the value.

joshlemer · on Nov 22, 2021

Okay I'll admit, that's a good point.

dunefox · on Nov 20, 2021

That's a list union[number, string, object] or list[Any], not a sum type, no? This

`data X = A | B

[A, B, ...]`

Is a list containing a sum type: list[X]

joshlemer · on Nov 20, 2021

There is no such thing in JSON or Ions as defining this "X" schema somewhere. So I may as well say that your [A,B,...] is a list[Any].

Now, I wouldn't actually call it a list of any, I would say you proved my point for me. Your example is functionally the same as mine. I would give this example:

`[A, B, ...]`

and say that that is a list of sum types. You may say "no no no! Only now is it a list of sum types!":

`data X = A | B

[A, B, ...]`

But my point is that there is no JSON/Ion equivalent of your `data X = A | B`. Everyone in this comment tree is confusing the data itself with out-of-band schema over that data. "Sumtype" is nothing more than a fiction, or a schema. Saying that JSON/Ions don't support sumtypes is like saying JSON doesn't support "NonNegativeInteger" type. Sure it does! Here are some: 1, 2, 3, 10. What tooling or type system you use outside of the data itself to enforce constraints on the data types is orthogonal to the data format itself.

valenterry · on Nov 21, 2021

> But my point is that there is no JSON/Ion equivalent of your `data X = A | B

No one disagrees - it's just that we complain about this. We _want_ to have such an equivalent.

> Saying that JSON/Ions don't support sumtypes is like saying JSON doesn't support "NonNegativeInteger" type.

Correct. But your conclusion is wrong. You seem to assume that no one has a problem with the fact that JSON doesn't support a "NonNegativeInteger" type. But I at least would happily use a format that explicitly supports that.

I mean, look at ION. Json doesn't support the concept of (restricted) integers, but ION extends JSON and offers this type. That's great, because it means if a library reads an integer field, it can map it to an integer and knows that there are constraints.

This is a _very_ relevant issue. Many json libraries in the past have had bugs or could be ddos-ed by feeding them json with large numbers, since the json spec does not constrain the size of numbers.

In that sense, ION could have _also_ added support for "NonNegativeInteger" or sumtypes, or other specific types, but they haven't. And since sumtypes are very fundamental, we complain about it more than we would complain about the lack of "NonNegativeInteger".

quantumspandex · on Nov 20, 2021

His point was type support and standard way of doing things. Using your argument we just need string type to represent everything.

dastbe · on Nov 20, 2021

data interchange formats try to encode as little backwards incompatible information as possible. in this case, it would be the restriction that something is a sum type when it could have multiple fields set in the future. another example is protobuf moving to all fields being optional by default.

as for the wire format, a variant struct where you've only instantiated a single field will encode down to just about the minimum amount of information required.

nly · on Nov 20, 2021

Avro went the opposite way to most and just makes the concept of an optional field implementable via a union with null

Non union fields can even be upgraded to unions later

Personally I find the protobufs "everything is optional!" Behaviour fucking insane and awful to deal with, but it is true to the semantics of its underlying wire format.

valenterry · on Nov 20, 2021

That's not contradicting though.

One can always choose not to use (native) sumtypes if they are interested in extreme performance or compatibility.

But logically speaking, it is _good_ that it's a restriction that a sumtype can't just turn into a multiple-fields type. Because while my software (as the consumer) might still be able to deserialize it, the assumption that only one field is set would be broken and my logic would now potentially broken. Much better if that happens at deserialization time then later one when I find out that my data is incorrect/corrupt.

vlovich123 · on Nov 20, 2021

Have you looked at cap’n’proto. It does sum types in a very sane way.

yakkityyak · on Nov 20, 2021

You should look into https://cuelang.org

the_girabbit · on Nov 20, 2021

Genuine question—why would you need a sum type in a self-describing data format?

valenterry · on Nov 20, 2021

Well, there are already sumtypes, just only specific builtin ones, not custom ones. E.g. booleans are sumtypes (true | false). Everything else that is nullable is also a sumtype (e.g. number | null).

I think it should be pretty obvious how these are helpful and why they are needed no?

the_girabbit · on Nov 20, 2021

Yeah, but it’s a schema-less, self-describing data format. It’s not like a specific position in a data stream has a requirement to be a specific type.

I can see why sum types would be useful in a schema or for the elements of a collection that is required to be homogeneous (ie. List<Foo|Bar>).

For what use case would one use custom sum types in a schema-less data format?

jsolson · on Nov 20, 2021

Protobuf supports sum types in the higher-level generated descriptors and languages -- on the wire they're just encoded as, well... oneof a number of possible options.

ricardobeat · on Nov 20, 2021

Which results in very painful inconsistencies when you’re dealing with the same schema on different platforms.

xyzzy_plugh · on Nov 20, 2021

Are you referring to different language implementations/runtimes? I don't follow your point about inconsistencies.

vlovich123 · on Nov 20, 2021

Cap’n’proto has native sum types.

spenczar5 · on Nov 20, 2021

Others have mentioned Protobuf and Capnproto's support. Avro has them too, they're called Union.

It seems that sum types are the norm, actually.

clhodapp · on Nov 20, 2021

Those do now but I believe that all of them added support years after their initial versions

spenczar5 · on Nov 20, 2021

I think you're incorrect:

Avro had unions in version 1.0 [0], which is from 2012.

Capnproto had unions back in 2013 [1]. That's from the v0.1 days, or maybe even earlier.

Protobuf has had oneof support for about 7 years. They were added in version 2.6.0, from 2014-08-15 [2]. That's still 6 years after the initial public release in 2008, though, so this is maybe what you were thinking of? I don't know too many people who were using protobuf in those days outside of Google, though.

---

[0] https://avro.apache.org/docs/1.0.0/spec.html#Unions

[1] https://github.com/capnproto/capnproto/commit/eb8404a157e074...

[2] https://github.com/protocolbuffers/protobuf/blob/master/CHAN...

clhodapp · on Nov 20, 2021

Thanks for the references, friend!

And yes, I definitely am primarily thinking of protobuf, as I struggled with this back with version 2.5. I had the (apparently mistakenly) impression that Avro and Cap'n Proto (which I think actually first came out in this timeframe) were about on par.

jsnell · on Nov 20, 2021

Previous discussions:

https://news.ycombinator.com/item?id=11546098

https://news.ycombinator.com/item?id=23921610

dang · on Nov 20, 2021

Thanks! Macroexpanded:

Amazon Ion - https://news.ycombinator.com/item?id=23921610 - July 2020 (110 comments)

Amazon open-sources Ion – a binary and text interchangable, typed JSON-superset - https://news.ycombinator.com/item?id=11546098 - April 2016 (163 comments)

throwoutway · on Nov 20, 2021

What do you use for the macroexpansion? There are a hundred odd tasks like this that I need to create macros for!

dang · on Nov 20, 2021

I mean that metaphorically but I do have a bunch of keyboard shortcuts (in a browser extension) that make finding these, and formatting the comments, much faster.

setheron · on Nov 20, 2021

Wow I remember using Ion back at Amazon in 2012. I can’t remember but I think the order data warehouse was using it …

I also now remember back to using something that was akin to FaaS but wasn’t called that. I could give them a JAR of some code that would execute on some Ion data for the order data when it changed. Basically FaaS for an ETL pipeline…

Crazy how ahead of the times some companies were.

User23 · on Nov 20, 2021

That was a golden age for Amazon engineering. I assume they’re still great, but that stretch from 2004 to 2014 was some incredible advancement.

vineyardmike · on Nov 20, 2021

I wonder why it took 10+ years to share then?

timdorr · on Nov 20, 2021

Actually, it only took them 4-ish years: https://amzn.github.io/ion-docs/news/2016/04/21/amazon-open-...

n8ta · on Nov 20, 2021

I recently implemented a similar (simpler) format https://baremessages.org/ in ruby.

First thoughts are:

ION pros: - easy to skip around while reading a file - no need to write a schema - backed by amazon so major langs will have impls - good date support - better concatenation, probably better suited to logging than bare

ION cons - what's the text format even for?

BARE pros: - schemas keep things tightly versioned - smaller binaries (not self describing like ion) - simpler to implement so tons of devs have impl'ed for their favorite lang - better suited to small messages (think REST json api)

BARE cons: - no skip read - no date support

I might do an ion ruby implementation too, to really feel out the difference.

imiric · on Nov 20, 2021

> what's the text format even for?

Configuration files?

Not sure if that's an intended use case, but being more flexible than JSON and stricter than YAML seems ideal for configuration.

ozzythecat · on Nov 20, 2021

Ion text is helpful so you can convert ion binary to text for debugging:

the_girabbit · on Nov 20, 2021

Ion will be even better for (structured) logging if this proposal for templates ever happens. https://github.com/amzn/ion-docs/pull/104

Looks like no one’s even so much as commented on it in the last year, so it might have been abandoned.

jonwilsdon · on Nov 20, 2021

Disclosure: I manage the Ion and PartiQL teams at Amazon.

This proposal hasn’t been abandoned. We hope to post an update soon!

n8ta · on Nov 20, 2021

Ion is already a little too complex for my taste. It'd be a shame to see it go the same way as yaml where it's so complex that most major implementations are not safely interoperable.

seniorsassycat · on Nov 20, 2021

ion text is a good contender for JSON, YAML, TOML usecases. It's also a good way to present the binary to humans.

tlocke · on Nov 20, 2021

One problem with Ion is that it doesn't have a map type, but instead a struct type that allows duplicate keys. I created Zish https://github.com/tlocke/zish as a serialization format that addresses the shortcomings of JSON and Ion. Any comments / criticisms welcome.

silvestrov · on Nov 20, 2021

This is what JSON should have been extended to.

But Douglas Crockford just don't want to innovate anything, just like Gruber didn't want to make a proper specification of the Markdown format.

Sometimes people are keeping innovation back. Fortunately this did not happend with html.

The main thing missing from the text format is a magic and version number. At least the binary format has it.

kwertyoowiyop · on Nov 20, 2021

The dominance of JSON shows that Crockford made some good decisions, even though we may not agree with them on any given day.

seanclayton · on Nov 20, 2021

The dominance of JSON just shows that JS is dominant.

usrusr · on Nov 20, 2021

Even JS stopped parsing JSON as a subset JS a long time ago. JSON lineage has been irrelevant in terms of popularity transmission ever since people stopped doing var jsonobj = eval(jsonstring);

indymike · on Nov 20, 2021

> The dominance of JSON just shows that JS is dominant.

I don't know that's the case... I've used JSON in lots of non-JS languages because it just works, and errors rarely are caused by mismatches in how JSON behaves in language X and language Y. A lot of that is that it is simple, and rigid.

ramraj07 · on Nov 21, 2021

Simple, rigid and limited and everyone’s settled on it so you can’t do anything now.

seniorsassycat · on Nov 20, 2021

Ions text format is a nice JSON alternative while it's binary format is very compact and allows for efficient sparse parsing. Fields are prefixed with their length so you can skip over unneeded fields or structs while only creating objects for values you'll use.

syspec · on Nov 20, 2021

Surprised I have not heard of this before, I'd love something to come along and give JSON a kick in the pants.

I do think JSON is the defacto standard, and it really does get the job done, but for some more advances uses something like this could really shine.

Eelongate · on Nov 20, 2021

Did anything ever become of the lispy language that was being built using Ion as its homoiconic syntax? I'm afraid I can't recall what it was called. Fusion maybe?

swaranga · on Nov 21, 2021

I built a system in my previous team where clients can register "filters" described in Fusion. My system, which was a source of a lot of different notifications, would then run these filters and only send those notifications that passed the filters. It became very popular very quickly because of the easy on-boarding and the fact that clients not get only a fraction of the messages they were interested in. I just checked the Java implementation, it seems to be still active and get commits.

throwaway_sZntK · on Nov 20, 2021

Yeah, Fusion was the name. Last I heard, they discontinued it, saying essentially "If you really want a full Lisp, there's already Clojure." S-exps continued to be used in Ion for embedded 1-liners but they only supported a handful of operators, not a full language.

garmaine · on Nov 20, 2021

I hope you get an answer, because this sounds very intriguing but google is failing me in finding any references to it.

kayamon · on Nov 20, 2021

Dunno about that one but if you like that sort of thing, check out Rebol.

otabdeveloper4 · on Nov 20, 2021

Nice! This thing is actually sane and thought through. A first for serialization formats. They're usually a shitshow.

(Should have gone with 'rational' instead of 'decimal', though. Decimal will be too painful to implement accross languages and implementations. Java bias?)

sirk390 · on Nov 20, 2021

But decimal are way more useful as they can represent currency amounts. It would be strange to show a currency amount like "3/4" or "11/12". Personally, the two datatypes I have always been adding manually to json are datetimes and decimals (from python)

otabdeveloper4 · on Nov 20, 2021

A currency amount is just a rational number with "1000000" as a denominator.

This is the correct representation, and how Google or the blockchain do it.

sirk390 · on Nov 21, 2021

I don't thing there is a more "correct" representation. Representing it as a string is equaly correct. Blockchain or banking often represents money as an integer with the smallest divisible unit (cent, or satoshi), but it is not applicable here because there is no smallest divisible unit.

Zamicol · on Nov 20, 2021

Am I the only one that doesn't like base 64?

Hex for when efficiency isn't paramount.

Base 85 or BasE91 for when efficiency is more of a concern. http://base91.sourceforge.net/

Aeolun · on Nov 20, 2021

I like base64 because it’s the de-facto standard, and data size (in places where I’d use base64) isn’t a main concern for me.

stjohnswarts · on Nov 20, 2021

Yeah I get tired of reinvention of everything for tiny gains in size/performance.

ralusek · on Nov 20, 2021

I understand the case for Base91, but why hex over Base64? Base64 for readability and sticking to multiples of two, Base91 for maximum efficiency with readable ASCII.

Zamicol · on Nov 20, 2021

Base 64 is good at nothing and bad at some things.

- Hex is human readable, case insensitive, not that "inefficient", and always aligns to bytes.

- Base 85 and basE91 are efficient.

- Bitcoin uses Base58 because they thought base 64 was too human unreadable. Ethereum uses Hex.

- Base 256 (bytes) is efficient and the native language of computers.

Base 64 is not efficient, not human readable, and not easy to encode.

The biggest problem with base 64 is that base 64 is not base 64. Are you doing base 64 with padding? Are you doing base 64 with URL safe characters or URL unsafe characters? Are you following the standard RFC 4648 bucket encoding, or are you using iterative divide by radix? I think a great place where the cracks show is JOSE, where for things like thumbprints there's a ton of conversion steps (UTF-8 key -> base 64 -> ASCII bytes -> digest (bytes) -> base 64 thumbprint).

My personal advise for 90% of projects considering looking at base 64 should just use Hex or bytes. If needing human readability, use Hex. Otherwise use binary.

hackcasual · on Nov 20, 2021

You want to use hex whenever byte aligned data is going to be compressed. Base64 quadruples byte level symbols

petilon · on Nov 20, 2021

I like the fact that you can annotate objects as well, not just literals. So this is valid:

    animal: Tiger:: {
       gender: 'F',
       weight: 450
    }

This solves the inheritance problem, i.e., if you have multiple subclasses how do you know which type to deserialize as?

plandis · on Nov 20, 2021

I believe that this is exactly how Jackson serialized Ion handles subtype polymorphism.

oandrew · on Nov 20, 2021

So basically it's Amazon's version of Apache Avro. Avro supports binary/json serialization, schema evolution , logical types (e.g. timestamp) and other cool stuff.

https://avro.apache.org/docs/current/spec.html

joshka · on Nov 20, 2021

Avro didn't exist when Ion started development.

fnord77 · on Nov 20, 2021

I wanted to see what the differences are between Ion and Avro.

Unlike avro, ion doesn't require a schema.

whimsicalism · on Nov 20, 2021

... or thrift ... or protobuf

https://xkcd.com/927/

chromatin · on Nov 20, 2021

Check out Ilya Yaroshenko’s Ion library for D, part of the larger ‘mir’ library:

http://mir-ion.libmir.org/

https://github.com/libmir/mir-ion

hatf0 · on Nov 20, 2021

Weird to see the library I work on show up in HN —- Mir Ion is a pretty complicated library (and admittedly our documentation needs work — I’m working on that!), but I’m very proud of our work.

Some fun things about Mir Ion:

- We can fully deserialize Ion at compile-time (via D’s CTFE functionality)

- We’re one of the fastest JSON parsing libraries (and one of the most memory efficient too — we actually store all JSON data in memory as Ion data, which is vastly more efficient)

- We’re nearly 100% compliant to all of the upstream test cases (our main issue is that we’re often too lax on spec, and allow files that are invalid through)

- The entire library is (nearly) all `@nogc`, thanks to the Mir standard library

If anyone has any questions on Mir Ion, feel free to shoot me a line at harrison (at) 0xcc.pw

stevefan1999 · on Nov 20, 2021

How does that differs from the likes of MessagePack and CBOR?

programd · on Nov 20, 2021

"Zero and negative dates are not valid, so the earliest instant in time that can be represented as a timestamp is Jan 01, 0001"

That seems to be...a problem? How do you deal with archeological dates, of which there are many, in Ion?

pdpi · on Nov 20, 2021

That’s an interesting question. On the one hand, it feels weird that you can’t represent those dates at all.

On the other hand, representability of a given date becomes progressively less useful the further back in time you go, and stuff becomes really gnarly once you go past the Julian calendar in 45BC.

Also, simplifying to “no dates before Jan 1 0001” has very little impact on applications dealing with the modern-ish world (with “modern” generously defined as “anything after the collapse of the Roman Empire”), and I can only assume applications dealing with earlier times could do with a more specialised representation for dates anyway.

biztos · on Nov 20, 2021

Just to give one example, in Thailand right now it's the year 2564.

1 BC for some is not "-1" for everyone.

elteto · on Nov 20, 2021

What modern tech service (of the kind that would have use for Ion) is dealing with archaeological dates _at scale_? Honest question.

dorianmariefr · on Nov 20, 2021

Pretty neat, but isn't it like *two* formats: one binary and one textual?

OJFord · on Nov 20, 2021

Consider that binary, binary coded decimal, Gray code, hexadecimal, octal, etc. are all 'formats' expressing the same (numerical) idea.

You can't say the same of, for example, YAML & JSON, since the former (if not the latter?) has constructs unrepresentable in the other.

It's slightly confused because an application might 'serialise to' JSON or YAML or Ion equivalently - but really that's saying the application's data being serialised fits a model that's a subset of the intersection between those formats.

You could call Ion two, but it's more than that in that it's also a promise that they're 1:1 (err, and onto if you like) - their intersection is their union.

seniorsassycat · on Nov 20, 2021

Two representations of the same data structures.

Ion text is like JSON, in fact all JSON is valid ion text. Ion text has comments, trailing commas, dates, and unquoted keys. It's a really good alternative to JSON, YAML, or TOML.

Ion binary is compact and fast to parse. Values are length prefixed so the parser can skip over unneeded fields or structs, saving time parsing and memory allocated. Common string values, like struct keys and enum values, are given numeric ids and stored once in a header table.

ComputerGuru · on Nov 20, 2021

Do comments persist in binary serialization or is that a lossy one-way operation?

seniorsassycat · on Nov 20, 2021

I think the ion java library includes a AST parser that includes comments, but the ION data model doesn't. The binary format cannot include comments.

I think many text parsers are missing libraries that edit documents in place, preserving formatting and comments.

the_girabbit · on Nov 20, 2021

Comments don’t persist in binary. Like white space, they are explicitly not part of the data model.

echelon · on Nov 20, 2021

One data model, two serializations of it.

deschutes · on Nov 21, 2021

I don't like ion. The added features (over json) don't pull their weight. Symbols, annotations and the binary format all add significant complexity but don't make the format much better. As a consequence of the added complexity language support is poor.

For RPC the binary encoding compares poorly to external schema formats like protobuf. In this context binary ion is a poorly compressed text format.

I don't think the partial document read capability of the binary format is all that important, but I've never worked on an application that would benefit from it either.

mgamache · on Nov 20, 2021

msgpack is near the top for speed and size... Readability is nice. Are there other advantages?

https://msgpack.org/index.html

cma · on Nov 20, 2021

> Ion supports comments.

Thank god.. JSON for config files without comments is so awful.

hirundo · on Nov 20, 2021

It seems like an odd choice to make the type "metadata" a prefix to the value, rather than a separate field. It feels like overloading. What's the advantage?

re · on Nov 20, 2021

Not sure I understand exactly what "a separate field" would look like, but:

1. Considering that a goal of Ion is to be a strict superset of JSON, separate syntax ensures that any JSON value can be parsed without misinterpreting some field as an annotation--there are no reserved/"magic" field names.

2. Annotations can be applied to any type of value, not just objects, which are the only type that have fields.

indymike · on Nov 20, 2021

It tells you how to load the value and can be human readable for audit purposes. example: degrees::'celsius'::100

sirk390 · on Nov 20, 2021

timestamps and decimal are the two most useful additions compared to json. They would be nice to add to json if that is somehow possible.

nly · on Nov 20, 2021

JSON numbers, just like all human readable formats, are decimal... it's not like binary double values are printed out in to JSON in hex or base64

Sure 99% of decoders convert them to and from binary doubles, but that's purely an implementation choice.

indymike · on Nov 20, 2021

> JSON numbers, just like all human readable formats, are decimal...

All JSON numbers are implemented as integers or floating point, and as a result, have to be cast as a decimal (a decimal type is generally something that meets this specification: http://speleotrove.com/decimal/) when you import them.

Decimal types differ from floating point types in three ways: they are accurate, and they take into account rounding rules and precision. Decimal math is slower, can have greater precision and is better suited to domains where finite precision is needed. Floating point is faster, but is not as precise, so it's good for some scientific uses... or where perfect precision isn't important but speed is... say 3d graphics.

I've billed lots of hours over the years fixing code where a developer used floats where they should have used decimals. For example, if you are dealing with money, you probably want decimal. It's one of those problems like trying to parse email addresses with a regex or rolling your own crypto... it will kind a work until someone finds out it really doesn't (think accounting going, our numbers are off by random amounts, WTF?).

nly · on Nov 20, 2021

A binary double can hold any decimal value to 15 digits of precision, so as a serialisation format it's a bit of a non-issue... you just need to convert to decimal and round appropriately before doing any arithmetic where it matters.

And you're confusing JSON the format with typical implementations. Open a JSON file and you see decimal digits. There is no limit to the number of the digits in the grammar. Parsing these digits and converting them to binary doubles, for example, is actually slower than parsing them as decimals, because you have to do the latter anyway to accomplish the former. Almost all JSON libraries convert to binary (e.g. doubles) because of their ubiquitous hardware and software support...but some libraries like RapidJSON expose raw numeric strings out of the parser if you want to plug in a decimal library

indymike · on Nov 21, 2021

> And you're confusing JSON the format with typical implementations. Open a JSON file and you see decimal digits. There is no limit to the number of the digits in the grammar. Parsing these digits and converting them to binary doubles, for example, is actually slower than parsing them as decimals, because you have to do the latter anyway to accomplish the former.

JSON spec for numbers: integer or float (implemented as a double precision float). JSON libraries read numbers as double precision float because that is the correct type for JSON numbers, not for any other reason.

sirk390 · on Nov 21, 2021

If you only look at the json format it's true. But it would be way worse if some decoders returned decimals and some returned floats.

transfire · on Nov 20, 2021

`years::4`? I don't know. What not `4::years`?

Also, symbols converted to integers means the receiving end has to already know exactly what they are.

re · on Nov 20, 2021

Putting annotations before values is likely to be more useful for streaming parsers than putting them after. Imagine the case where the annotation represents a class that you want to deserialize a large object into.

sokoloff · on Nov 20, 2021

There is provision for encoding a local symbol table: https://amzn.github.io/ion-docs/docs/symbols.html#processing...

wisty · on Nov 20, 2021

I scanned the docs, and can't see what happens if you alter your data schema. Anyone know?

travisd · on Nov 20, 2021

Seems like you have to handle that yourself. The serialized data includes the type, so your app code might have to have logic a la “if type1: … else: …” after parsing it.