This reminded me of a tight-packed binary format we used in the trading systems domain almost 20 years ago. Instead of including metadata/field names in each message, it had a central message dictionary from which every client and server would first download a copy. Messages had only type IDs, followed by binary-packed data in the correct field order. Because of microsecond latency requirements, we even avoided the serialization/deserialization process by making the memory format of the message and the wire format one and the same. The message class contained the same buffer that you would send/store. The GetInt(fieldID) method of the class simply points to the right place in the buffer and does a cast to int. Application logs contained these messages, rather than plain text. There was a special reader to read logs. Messages were exchanged over raw TCP. They contained their own application-layer sequence number so that streams could resume after disconnection.
In that world, latencies were so low that the response to your order submission would land in your front-end before you'd had time to lift your finger off the enter key. I now work with web-based systems. On days like this, I miss the old ways.
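A minimal sketch of that zero-copy accessor idea, assuming little-endian fixed-offset fields; the offsets would come from the message dictionary, and the class shape here is invented for illustration:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <utility>
    #include <vector>

    // One buffer is both the in-memory and the wire representation; field
    // offsets would come from the central message dictionary.
    class Message {
    public:
        explicit Message(std::vector<uint8_t> buf) : buf_(std::move(buf)) {}

        // Read a 32-bit int that the dictionary says lives at `offset`.
        int32_t GetInt(std::size_t offset) const {
            int32_t v;
            std::memcpy(&v, buf_.data() + offset, sizeof v);  // see the UB discussion below
            return v;
        }

        // The same bytes go straight to send()/write() -- no serialization pass.
        const uint8_t* data() const { return buf_.data(); }
        std::size_t size() const { return buf_.size(); }

    private:
        std::vector<uint8_t> buf_;
    };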
And to top it off you could fit the entire message into whatever the MTU of your network supported. Cap it at 1500 bytes and subtract the overhead for the frame headers and you get an extremely tight TCP/IP sequence stream that buffers through 16MB without needing to boil the ocean for a compound command sequence.
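For example: 1500-byte Ethernet MTU - 20 bytes IPv4 header - 20 bytes TCP header = 1460 bytes of payload per segment (less if IP or TCP options such as timestamps are in use), so a message sized to that budget never spans frames.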
Having been in industry only 2 decades it amuses me how many times this gets rediscovered.
That just reminded me of the most mysterious scaling issue I ever faced. We had a message to disseminate market data for multiple markets (e.g. IBM: 100/100.12 @ NYSE, 101/102 @ NASDAQ etc.). The system performed admirably under load testing (think 50,000 messages per second). One day we onboarded a single new regional exchange and the whole market data load test collapsed. We searched high and low for days without success, until someone figured out that the new market addition had caused the market data message to exceed the Ethernet frame size for the first time. Problem was not at the application layer or the transport, it was data link layer fragmentation! Figuring that out felt like solving a murder mystery (I wasn't the one who figured it out though).
A lot of "transparent RPC" systems are like this. "It's just like a normal function call, it's sooo convenient" . . . until it isn't, because it involves the network hardware and configuration, routing environment, firewalls, equipment failure . . .
I’ve worked on systems like this too - the max packet size is very well documented.
Then post trade it all gets turned into FIXML which somehow manages to be both more verbose and less readable.
Yeah, that's part of the trick for large listing responses to be spread across frames. Usually with some indicator like a "more" flag, so the client can say "get me the next sequence" by requesting the next index in the listing using the prior b-tree index. People do this all the time with large databases and it's a very similar use case.
Ouch, that's rough. One nice bit of IPv6 is that it doesn't allow fragmentation. It's often much nicer to get no message or an error than subtly missing data.
Ah yeah, that's right. I'm just learning more about IPv6 and get it mixed up. It appears what I had in my mind was about intermediate routers: "Unlike in IPv4, IPv6 routers (intermediate nodes) never fragment IPv6 packets." (Wikipedia). To the previous point, it looks like IPv6 does require networks to deliver packets of 1280 bytes or smaller without fragmentation.
I don't understand why serialization formats that separate structure and content aren't more popular.
Imagine a system where every message is a UID or DID (https://www.w3.org/TR/did-core/) followed by raw binary data. The UID completely describes the shape of the rest of the message. You can also transmit messages to define new UIDs: these messages' UID is a shared global UID that everyone knows about.
Once a client learns a UID, messages are about as compact as possible. And the data defining UIDs can be much more descriptive than e.g. property names in JSON. You can send documentation and other excess data when defining the UID, because you don't have to worry about size, because you're only sending the UID once. And UIDs can reference other UIDs to reduce duplication.
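A rough sketch of that framing; the constant and function names are made up, and real code would pin down byte order and do length checks:

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hypothetical framing: an 8-byte UID, then raw payload bytes whose layout
    // the UID fully determines. A reserved, well-known UID marks "definition"
    // messages whose payload describes the shape (and docs) of a new UID.
    constexpr uint64_t kDefineUid = 0x0000000000000001ULL;  // invented value

    std::vector<uint8_t> Frame(uint64_t uid, const std::vector<uint8_t>& payload) {
        std::vector<uint8_t> out(sizeof uid + payload.size());
        std::memcpy(out.data(), &uid, sizeof uid);           // host byte order for brevity
        if (!payload.empty())
            std::memcpy(out.data() + sizeof uid, payload.data(), payload.size());
        return out;
    }

    uint64_t ReadUid(const uint8_t* msg) {                   // caller checks length >= 8
        uint64_t uid;
        std::memcpy(&uid, msg, sizeof uid);
        return uid;
    }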
It’s a 64bit random number so it’ll never have unintentional collisions.
Also note that a capnp schema is natively represented as a capnp message. Pretty convenient for the “You can also transmit messages to define new UIDs” part of your scheme :)
Protobuf is a boring old tag-length-value format. It's kind of the worst of both worlds because it has no type information encoded into it, meaning it's useless without the schema, while still having quite a bit of overhead.
Cap'n Proto is more like a formalization of C structs, in that new fields are only added to the end. If memory serves, on the wire there is no tag, type, or length info (for fixed-size field types), and everything is rooted at fixed offsets.
Mostly right. Allow me to provide some wonky details.
Protobuf uses a tag-type-value encoding, i.e. each field is encoded with a tag specifying the field number and some basic type info before the value. The type info is only just enough information to be able to skip the field if you don't recognize it, e.g. it specifies "integer" vs. "byte blob". Some types (such as byte blob) also have a length, some (integer) do not.

Nested messages are usually encoded as byte blobs with a length, but there's an alternate encoding where they have a start tag and an end tag instead ("start group" and "end group" are two of the basic types). On one hand, having a length for nested messages seems better because it means you can skip the message during deserialization if you aren't interested in it. On the other hand, it means that during serialization, you have to compute the length of the sub-message before actually serializing it, meaning the whole tree has to be traversed twice, which kind of sucks, especially when the message tree is larger than the L1/L2 cache.

Ironically, most Protobuf decoders don't actually support skipping parsing of nested messages, so the length that was so expensive to compute ends up being largely unused. Yet, most decoders only support length-delimited nested messages and therefore that's what everyone has to produce. Whoops.
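For reference, a tiny illustration of that field key: it is (field_number << 3) | wire_type, itself varint-encoded on the wire.

    #include <cstdio>

    // Protobuf field key: (field_number << 3) | wire_type.
    // Wire types: 0=varint, 1=64-bit, 2=length-delimited, 3=start group,
    // 4=end group (groups are deprecated), 5=32-bit.
    int main() {
        unsigned field_number = 4;
        unsigned wire_type = 2;   // length-delimited: string, bytes, or sub-message
        unsigned key = (field_number << 3) | wire_type;
        std::printf("encoded key: 0x%02x\n", key);                       // 0x22
        std::printf("decoded: field=%u wire_type=%u\n", key >> 3, key & 7u);
        return 0;
    }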
Now on to Cap'n Proto. In a given Cap'n Proto "struct", there is a data section and a pointer section. Primitive types (integers, booleans, etc.) go into the data section. This is the part that looks like a C struct -- fields are identified solely by their offset from the start of the data section. Since new fields can be added over time, if you're reading old data, you may find the data section is too small. So, any fields that are out-of-bounds must be assumed to have default values.

Fields that have complex variable-width types, like strings or nested structs, go into the pointer section. Each pointer is 64 bits, but does not work like a native pointer. Half of the pointer specifies an _offset_ of the pointed-to object, relative to the location of the pointer. The other half contains... type information! The pointer encodes enough information for you to know the basic size and shape of the destination object -- just enough information to make a copy of it even if you don't know the schema. This turns out to be super-important in practice for proxy servers and such that need to pass messages through without necessarily knowing the details of the application schema.
In short, both formats actually contain type information on the wire! But, not a full schema -- only the minimal information needed to deal with version skew and make copying possible without data loss.
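For the curious, a rough decode of a Cap'n Proto struct pointer as I read the published encoding spec; far pointers, bounds checks, and endianness handling are all ignored, so treat this as a sketch rather than a reference implementation.

    #include <cstdint>
    #include <cstdio>

    // Struct pointer layout: bits 0-1 = kind (0 for struct), bits 2-31 = signed
    // offset in 8-byte words from the end of the pointer, bits 32-47 = data
    // section size in words, bits 48-63 = pointer section size in words.
    void DescribeStructPointer(uint64_t p) {
        unsigned kind = static_cast<unsigned>(p & 3);
        // assumes two's-complement arithmetic right shift
        int32_t offsetWords = static_cast<int32_t>(static_cast<uint32_t>(p)) >> 2;
        unsigned dataWords  = static_cast<uint16_t>(p >> 32);
        unsigned ptrWords   = static_cast<uint16_t>(p >> 48);
        std::printf("kind=%u offset=%d data_words=%u ptr_words=%u\n",
                    kind, offsetWords, dataWords, ptrWords);
    }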
I wouldn't call what protobuf encodes type information. If I recall correctly, all the group stuff is deprecated, so what's left basically boils down to 3 types: 32-bit values, 64-bit values, and length-prefixed values, which covers strings and sub-messages. Without the schema you can't even distinguish strings from sub-objects, as they are both length-prefixed as you described.
Can you even distinguish floats and ints without a schema in protobufs? I don't remember.
I really enjoy capnproto, flatbuffers and Avro and bounce between them depending on the task at hand.
> I wouldn't call what protobuf encodes type information.
Well... I would call it type information, just not complete.
> 32 bit values, 64 bit values and length prefixed values
In protobuf, most integer types are actually encoded as varints, i.e. variable-width integers, not fixed 32-bit or 64-bit. varint encoding encodes 7 bits per byte, and uses the extra bit to indicate whether there are more bytes. (It's not a very good encoding, as it is very branch-heavy. Don't use this in new protocols.)
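A small sketch of that varint encoding; the data-dependent loop in the decoder is where the branch-heaviness comes from:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Protobuf-style varint: 7 payload bits per byte, low group first,
    // high bit set on every byte except the last.
    std::vector<uint8_t> EncodeVarint(uint64_t v) {
        std::vector<uint8_t> out;
        while (v >= 0x80) {
            out.push_back(static_cast<uint8_t>(v) | 0x80);
            v >>= 7;
        }
        out.push_back(static_cast<uint8_t>(v));
        return out;
    }

    // Returns bytes consumed, or 0 on truncated/overlong input.
    std::size_t DecodeVarint(const uint8_t* p, std::size_t len, uint64_t* out) {
        uint64_t v = 0;
        for (std::size_t i = 0; i < len && i < 10; ++i) {
            v |= static_cast<uint64_t>(p[i] & 0x7f) << (7 * i);
            if (!(p[i] & 0x80)) { *out = v; return i + 1; }
        }
        return 0;
    }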
> Can you even distinguish floats and ints without a schema in protobufs?
You can't distinguish between float vs. fixed32. But int32 would be varint-encoded, while floats are always fixed-width, so you could distinguish between those. (You can't distinguish between int32, uint32, and sint32, though -- and sint32 in particular won't coerce "in the obvious way" to the others.)
The really unfortunate thing is you can't distinguish between strings vs. nested messages (of the length-delimited variety). So if you don't specifically know that something is a nested message then it's not safe to try parsing it...
I wonder if giving it a name based on the hash of the definition has been explored; like Unison [0] where all code is content addressable, but for just capnproto definitions. Is there a reason not to?
Capnp uses the name of your message, but not its full definition because that would make it impossible to extend protocols in a backwards compatible way. Without the ability to add new fields, making changes to your protocol would be impossible in large orgs.
Yes it is. Message schemas are made by humans. Most of these messages will be extended in a backwards compatible manner over the life of a project rather than replaced entirely so their IDs don’t change. That’s kinda the point of protobufs and its successors.
Which puts it on the same order of magnitude as the number of people on the planet. If every person alive generated a schema (or if 1/100th of all people generate 100 IDs each like you) then we'd have a small number of collisions. More likely you'd get large numbers of schema like that if there's a widespread application of a protocol compiler that generates new schema programmatically, e.g. to achieve domain separation, and then is applied at scale. I'm not saying that's likely, just that it is not, as is claimed, inconceivable.
It's only really a problem if you use the IDs in the same system. It's highly unlikely that you'd link 4B schemas into a single binary. And anyway, if you do have a conflict, you'll get a linker error.
Cap'n Proto type IDs are not really intended to be used in any sort of global database where you look up types by ID. Luckily no one really wants to do that anyway. In practice you always have a more restricted set of schemas you're interested in for your particular project.
(Plus if you actually created a global database, then you'd find out if there were any collisions...)
If you have 4 billion of them already generated, there's roughly a 1-in-4-billion chance that the next one you generate is a duplicate.
On top of that, you would not only need to generate the same ID, you would need to USE it in the same system, where that ID could have some semantics and not just cause an error.
>It'll have unintentional collisions if you ever generate more than 4 billion of these random numbers.
If it's 64-bit, doesn't that mean you'd need to generate ~18 quintillion (2^64 ≈ 1.8 × 10^19) of those numbers to have a collision, not 2^32?
If you generate randomly then, due to the birthday paradox, after generating sqrt(N) values you have a reasonable chance of collision.
The birthday paradox is named after the non-intuitive fact that with just 32 people in a room you have > 50% of 2 people having a birthday on the same day of the year.
I think it's 23 people in a room. The canonical example is people on a football (soccer) pitch. With 11 per side plus the referee there's a 50% chance that two will share the same birthday.
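A quick back-of-the-envelope check of both numbers, using the exact product for the birthday case and the usual 1 - exp(-k^2 / 2N) approximation for random 64-bit IDs:

    #include <cmath>
    #include <cstdio>

    int main() {
        // 23 people, 365 equally likely birthdays: P(some pair shares a birthday).
        double all_distinct = 1.0;
        for (int i = 0; i < 23; ++i) all_distinct *= (365.0 - i) / 365.0;
        std::printf("23 people: %.1f%%\n", (1.0 - all_distinct) * 100.0);   // ~50.7%

        // k random 64-bit IDs: P(collision) ~= 1 - exp(-k^2 / (2N)), N = 2^64.
        double k = 4e9;                  // the "4 billion generated" case
        double N = std::ldexp(1.0, 64);  // 2^64
        std::printf("4e9 IDs:   %.0f%%\n",
                    (1.0 - std::exp(-k * k / (2.0 * N))) * 100.0);          // ~35%
        return 0;
    }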
Does birthday paradox apply here? It’s about any pair of people having the same birthday, whereas in this case you need someone else with a specific birthday.
For example, if you generate 2 numbers and they are the same, but are different to the capnproto number, that’s a collision but doesn’t actually matter.
EDIT: It does apply, I misunderstood what the number was being used for.
You have a collision if any two schemas share the id, not if a specific schema collides with any of the others. So it is exactly like the birthday paradox.
If the schema id is the message id, in principle it could be an issue, as the protocol on the wire would be ambiguous. Then again, you should be able to detect any collisions when you register a schema with the schema repo and deal with it at that time.
I don't understand your maths here: how is generating 4 billion of them any different from generating 3 billion, except for a slight rise in the probability?
MD5 is a 128-bit hash that no one would ever have thought would collide. 64 bits is peanuts, especially when message types are being defined dynamically.
Dude that’s why I said “unintentional collisions”.
Of course you can get intentional collisions. The security model here assumes that anyone that wants to know your message’s ID can just ask.
Did you know that the Internet Protocol uses a 4-bit header to specify the format (v4 or v6) of the rest of the message? They should have used 128 bits. What a bunch of fools.
If you read the protobuf source, you can see a bunch of places where you can hook in custom type-fetching code, e.g. in the google.protobuf.Any type.
After studying it a bit, I'm certain this is how it's used inside Google (might also be mentioned elsewhere).
All you'd really need to do is to compile all protos into a repository (you can spit out the binary descriptors from protoc), then fetch those and decode in the client.
I think the system OP is describing is a little bit more complex. You're not just describing message types, you also have message templates; a template declares a message type and a set of prefilled fields. You save data by just sending the subset of fields that are actually changing, which is a very good abstraction for market data. The template is hydrated on the protocol parsing layer so your code only has to deal with message types itself.
Serialization is platform-dependent (to make it a simple memcpy most of the time), and the schema is sent up front (but can be updated later, with in-bound messages at will). See the User Guide (http://binlog.org/UserGuide.html) and the Internals (http://binlog.org/Internals.html) for more.
Interesting. Confluent Avro + Schema registry + Kafka uses exactly the same approach - binary serialized Avro datums are prefixed with schema id which can be resolved via Schema registry
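For concreteness, the Confluent prefix is one magic byte (0) plus a 4-byte big-endian schema id before the Avro payload; a minimal reader might look like this (names invented):

    #include <cstddef>
    #include <cstdint>

    // Confluent wire format: 1 magic byte (0), a 4-byte big-endian schema id to
    // resolve against the Schema Registry, then the Avro-encoded datum.
    bool ReadSchemaId(const uint8_t* msg, std::size_t len, uint32_t* schema_id) {
        if (len < 5 || msg[0] != 0) return false;        // not the expected framing
        *schema_id = (uint32_t(msg[1]) << 24) | (uint32_t(msg[2]) << 16) |
                     (uint32_t(msg[3]) << 8)  |  uint32_t(msg[4]);
        return true;                                     // Avro payload starts at msg + 5
    }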
Same here, I wrote an exchange core that did this using SBE. Basically you don't serialize in the classical sense, because you're simply taking whatever bytes are at your pointer and using them as some natural type. The internals of the exchange also simply used the same layout, so there was minimal copying and interpreting. On the way out it was the same, all you had to do was mask a few fields that you didn't want everyone to see and ship it onto the network.
Even an unoptimized version of this managed to get throughput in the 300K/s range.
Somehow it's the endpoint of my journey into serialization. Basically, avoid it if you need to be super fast. For most things though, it's useful to have something that you can read by eye, so if you're not in that HFT bracket it might be nicer to just use JSON or whatever.
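A sketch of that egress step, with invented offsets: same buffer in and out, just blanking the fields that shouldn't leave.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Reuse the received bytes and blank the sensitive fields before shipping
    // the buffer back out. Offsets are made up for illustration.
    constexpr std::size_t kClientIdOffset = 16;
    constexpr std::size_t kClientIdSize   = 8;

    void MaskAndSend(std::vector<uint8_t>& msg,
                     void (*send)(const uint8_t*, std::size_t)) {
        if (msg.size() >= kClientIdOffset + kClientIdSize)
            std::memset(msg.data() + kClientIdOffset, 0, kClientIdSize);
        send(msg.data(), msg.size());   // same layout in memory and on the wire
    }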
Fab story, thank you! I understood up to
"Messages were exchanged over raw TCP. They contained their own application layer sequence number so that streams could resume after disconnection."
Can you go into more details about how the sequence number and resuming after disconnection worked?
The server used a global sequence number for all messages it transmitted. Clients are stateful, so they know exactly which message they processed last and send that id when creating a new connection. This was very important, as a lot of the message types used delta values, one of the most important ones being the order book. So in order to apply a new message you had to make sure that your internal state was at the correct sequence id; failing to do so would make your state go bonkers, especially when you're talking about hundreds of messages being received per second. It sounds scary, but there was a special message type that would send you a snapshot of the expected state, with the sequence id it corresponded to. So your error-handling code would fetch one of these and then ask for all the messages newer than that.
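A minimal sketch of that client-side recovery logic, with invented names; a real system would do far more bookkeeping:

    #include <cstdint>

    // Track the last sequence applied; on a gap, rebuild from a snapshot and
    // replay everything newer before consuming live deltas again.
    struct Delta { uint64_t seq; /* order book fields ... */ };

    class BookClient {
    public:
        void OnSnapshot(uint64_t snapshot_seq /*, snapshot state ... */) {
            last_applied_ = snapshot_seq;       // rebuild internal state here
        }
        void OnDelta(const Delta& d) {
            if (d.seq != last_applied_ + 1) {
                // Gap detected: applying this delta would corrupt the state,
                // so fetch a snapshot and the messages newer than it instead.
                RequestSnapshotAndReplayFrom(last_applied_);
                return;
            }
            last_applied_ = d.seq;              // apply the delta here
        }
    private:
        void RequestSnapshotAndReplayFrom(uint64_t /*seq*/) { /* ask upstream */ }
        uint64_t last_applied_ = 0;
    };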
This is exactly right. Deltas were almost always favored over snapshots. One of the downsides was that sometimes, debugging an issue required replaying the entire market up to the point of the crash/bug.
Pretty basic. The receiving process usually has an input thread that just puts the messages into a queue. Then a processing thread processes (maybe logic, maybe disk writes, maybe send) the messages and queues up periodic batch acks to the sender. The sender uses these acks to clear its own queue. The receiver persists the last acked sequence number, so that in case of a restart, it can tell upstream senders to restart sending messages from that point.
What you're describing is exactly what still takes place in trading platforms, although a few I've seen now use SBE for consistency's sake (it's very common on the market data side).
Don't know if you're describing the original FIX itself with the TCP connection. On FAST FIX they got rid of the TCP connection and market data was sent over UDP using several parallel connections, data was reordered on the client side at consumption time and it only used a TCP connection to recover data when a sequence gap was found.
Actually, even FAST was too slow for us. This was a proprietary messaging middleware library. And this particular market data feed was the direct one into the matching engine itself. For the rest of the system, we used a sort of reliable multicast using UDP for the first transmission and TCP for missed messages. We initially tried out a Gossip/Epidemic protocol but that didn't work out too well.
I had exactly the same implementation except that type / version belonged to the whole message and would map to appropriate binary buffer in memory. No real de/serialization was needed.
I still use it in my UDP game servers, with added packet id if message exceeds max datagram length and has to be split
The one concern I'd have with this format is a length field getting corrupted in transit and causing an out-of-bounds memory access. The network protocols' checksums won't save you 100% of the time, especially if there's bad hardware in the loop. If every field is fixed length this is less of a concern, of course; you might get bad data but you won't get e.g. a string with length 64M.
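One cheap mitigation is to treat every declared length as hostile and validate it before use; a sketch, with a made-up protocol ceiling:

    #include <cstddef>
    #include <cstdint>

    // Check a declared length against a per-protocol ceiling and against the
    // bytes actually received, so a flipped bit that survives the TCP checksum
    // fails a message-level check instead of triggering a 64MB read.
    constexpr uint32_t kMaxFieldLen = 1u << 20;   // invented limit

    bool FieldBoundsOk(uint32_t declared_len, std::size_t offset, std::size_t buf_len) {
        return declared_len <= kMaxFieldLen &&
               offset <= buf_len &&
               declared_len <= buf_len - offset;
    }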
In our system, if the message didn't unpack properly, the application would send a retransmit request with that message's sequence number. But in practice, this scenario never occurred because TCP already did this for us.
Very neat and similar to a project I am starting for packet radio. I went further with the dictionary concept so that it contains common data. This way, your message contains only a few dictionary "pointers" (integers in base 64). This makes it easier to fit messages in ASCII for 300 baud links.
Old-school texty FIX is incredibly slow. FAST FIX is faster but not fun to use. Largely SBE has won adoption on the market data side, with huge platforms like Euronext (biggest in Europe) using it.
FAST FIX protocol is terrible performance-wise, its format requires multiple branching at every field parsing. Even "high-performance" libraries like mFAST are slow: I recently helped a client to optimize parsing for several messages and got 8x speed improvement over mFAST (which is a big deal in HFT space).
> In that world, latencies were so low that the response to your order submission would land in your front-end before you'd had time to lift your finger off the enter key.
If the order submission process depends on the manual press on the enter key (+/- 50ms) is there any point to that though?
Despite all the algorithms we employed, the concept of a manual trade never went away. Also, when the front-end was taken out of the equation, the latencies were in the microsecond range. 50ms would be excruciatingly slow for an algorithm.
Is this a number that came from an actual benchmark or from some marketing material from a keyboard maker? I ask this because [1] finds latency (measured from touching the key to the usb packet arriving) of 15ms with the fastest keyboard and around 50ms with others, though apparently some manufacturers have since improved. Or are you talking about midi keyboards where I guess latency is more noticeable to users?
From the countless review sites and small-time YouTube channels that test these things regularly.
I think that post must be a few years out of date - and moreover, by its own admission, it hardly tests any “gaming” keyboards. There is a tremendous amount of competition in keyboards that has been building for the past 10 years.
Input latency is now a marketing thing like horsepower, and there are reasonably reputable [1] places and countless small time YouTube reviewers that test these things.
It’s not like it is difficult to improve latency, and now that it is something that is competitively marketed it is delivered on.
Personally I think it’s a bit ridiculous. This fetishization with minimizing latency to now sub-ms levels doesn’t necessarily lead to better performance as many top level gamers do not use the lowest latency level keyboards. But that doesn’t change the fact that modern mainstream gaming keyboards can hit a latency far below 50ms.
The link I posted was 2017. The site you link gives quite different ratings. I assume partly it is different methodology (the site you link tries to account for key travel somehow and they do something with a display and try to account for display latency rather than using a logic analyzer), but I’m not really sure. For some keyboards in common:
- apple magic keyboard (? vs 2017) 15ms vs 27ms
- das keyboard (3 vs S professional/4 Professional) 25 vs 11/10ms
- razer ornata (chroma vs chroma/chroma 2) 35 vs 11.4/10.1ms
Interestingly it is not some simple uniform difference: the Apple keyboard does much worse in the rtings test, perhaps getting not much of a bonus from key travel compensation. But the das keyboard vs the razer that are 10ms apart on my link perform equally on rtings (but maybe I found the wrong model). I don’t have a good explanation for that discrepancy.
I know it is 2017, but that is a very long time in the gaming/mech keyboard market. I remember just about 10 years ago when mech keyboards were a niche for weirdos and a few others that swore by their Model M's - you now can buy these in Walmart. The point about discrepancy is well taken* but I think the bigger point is on the rtings list the number of offerings that are an order of magnitude lower latency - such that the methodology used in your link is not even viable.
Why is using a high speed camera and a logic analyzer less viable than measuring the end to end latency and trying to subtract the computer part of it? Or are you suggesting that a solenoid should be used to press the key instead of a finger?
I assume you were using C++? I'm not sure what you describe is possible these days due to UB. At the very least just casting bytes received over the wire to a type is UB, so you technically need a memcpy() and hope that the compiler optimises it out.
Yes, it was C++. I was unfamiliar with the acronym "UB" so did a Google search. Does it mean "Undefined Behavior"? If I remember correctly, primitive types other than strings are memcpy'd. GetStr basically returned a char* to the right place in the buffer.
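For reference, the memcpy idiom being discussed looks like this; mainstream compilers turn it into a single load, so the zero-copy performance is preserved:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // The UB-safe spelling of "point into the buffer and read an int":
    // memcpy into a properly typed local.
    int32_t GetInt(const char* buf, std::size_t offset) {
        int32_t v;
        std::memcpy(&v, buf + offset, sizeof v);
        return v;
    }

    // This, by contrast, breaks alignment and strict-aliasing rules:
    //   return *reinterpret_cast<const int32_t*>(buf + offset);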
It's staggering to me that people keep making these "rich" data formats without sum types. At least to me, the "ors" are just as important as the "ands" in domain modeling. Apart from that, while you can always sort of fake it with a bunch of optional fields I believe that you kind of need a native encoding to a tagged union if you want to avoid bloating your messages.
The ion data model doesn't describe a schema or type system. It's a data structure where values are of a known type. In the binary format values are preceded by a type id, in the text format the syntax declares the type - "" for string, {} for struct. The data model doesn't declare what types a value could have, only the type it does have.
Doesn't it trivially have "sum types" since it's just arbitrary self-describing data? i.e. nobody is stopping you from passing around objects in such a way:
{a:1}
{a:{b:2}}
{a:4}
{a:{b:4}}
There's no static type layer over top of this, so it's inherently up to interpretation and whatever type system you want to use to describe this data, to be able to express that the values of `a` can be of type `number | {b: number}`
Yeah, that's the problem.
I mean, hey, why json? We could just use unstructured plaintext for everything and now we are free to do everything. But obviously that has its own drawbacks.
Having built-in support for sumtypes means better and more ergonomic support from libraries, it means there is one standard and not different ways to encode things and it also means better performance and tooling.
The point is that there's no reason to single out sumtypes here. Insofar as ions/json has support for arrays/objects/strings/numbers, it has exactly the same support for sumtypes, as in the example I showed above. Here is a list of "sumtype" `string | number | object`: `[1, "two", {"three": 3}]`
In the same sense "1e-12" is not a number, it's a string. Yes, it's a string that encodes a number in a certain notion, but for alle the tooling, the IDE, the libraries, etc. it will stay a string.
Sum types =/= union types. Sum types are also called 'tagged' or 'discriminable' unions because they have some way to discriminate between them. That is, if you have an element a of type A, a is not part of the sum type A + B because it's missing a tag.
[5,"hello",3] has the type list (int ∪ string), not list (int + string). You can emulate the latter by manually adding a tag, but native support is much preferable.
I know the differences between untagged and tagged unions; I'm trying to provide a minimal example without distracting details, but sure, we can talk about tagged unions. Here is a list of tagged unions, so I once again point out that sum types are "supported" in JSON/ions just as much as any other data type: `[{"tag": "int", "value": 5}, {"tag": "string", "value": "hello"}]`
There is no such thing in JSON or Ions as defining this "X" schema somewhere. So I may as well say that your [A,B,...] is a list[Any].
Now, I wouldn't actually call it a list of any, I would say you proved my point for me. Your example is functionally the same as mine. I would give this example:
`[A, B, ...]`
and say that that is a list of sum types. You may say "no no no! Only now is it a list of sum types!":
`data X = A | B
[A, B, ...]`
But my point is that there is no JSON/Ion equivalent of your `data X = A | B`. Everyone in this comment tree is confusing the data itself with out-of-band schema over that data. "Sumtype" is nothing more than a fiction, or a schema. Saying that JSON/Ions don't support sumtypes is like saying JSON doesn't support "NonNegativeInteger" type. Sure it does! Here are some: 1, 2, 3, 10. What tooling or type system you use outside of the data itself to enforce constraints on the data types is orthogonal to the data format itself.
> But my point is that there is no JSON/Ion equivalent of your `data X = A | B`
No one disagrees - it's just that we complain about this. We _want_ to have such an equivalent.
> Saying that JSON/Ions don't support sumtypes is like saying JSON doesn't support "NonNegativeInteger" type.
Correct. But your conclusion is wrong. You seem to assume that no one has a problem with the fact that JSON doesn't support a "NonNegativeInteger" type. But I at least would happily use a format that explicitly supports that.
I mean, look at ION. Json doesn't support the concept of (restricted) integers, but ION extends JSON and offers this type. That's great, because it means if a library reads an integer field, it can map it to an integer and knows that there are constraints.
This is a _very_ relevant issue. Many json libraries in the past have had bugs or could be ddos-ed by feeding them json with large numbers, since the json spec does not constrain the size of numbers.
In that sense, ION could have _also_ added support for "NonNegativeInteger" or sumtypes, or other specific types, but they haven't. And since sumtypes are very fundamental, we complain about it more than we would complain about the lack of "NonNegativeInteger".
data interchange formats try to encode as little backwards incompatible information as possible. in this case, it would be the restriction that something is a sum type when it could have multiple fields set in the future. another example is protobuf moving to all fields being optional by default.
as for the wire format, a variant struct where you've only instantiated a single field will encode down to just about the minimum amount of information required.
Avro went the opposite way to most and just makes the concept of an optional field implementable via a union with null
Non union fields can even be upgraded to unions later
Personally I find the protobufs "everything is optional!" behaviour fucking insane and awful to deal with, but it is true to the semantics of its underlying wire format.
One can always choose not to use (native) sumtypes if they are interested in extreme performance or compatibility.
But logically speaking, it is _good_ that it's a restriction that a sumtype can't just turn into a multiple-fields type. Because while my software (as the consumer) might still be able to deserialize it, the assumption that only one field is set would be broken and my logic would now potentially be broken too. Much better if that happens at deserialization time than later on, when I find out that my data is incorrect/corrupt.
Well, there are already sumtypes, just only specific builtin ones, not custom ones. E.g. booleans are sumtypes (true | false). Everything else that is nullable is also a sumtype (e.g. number | null).
I think it should be pretty obvious how these are helpful and why they are needed no?
Protobuf supports sum types in the higher-level generated descriptors and languages -- on the wire they're just encoded as, well... oneof a number of possible options.
Avro had unions in version 1.0 [0], which is from 2012.
Capnproto had unions back in 2013 [1]. That's from the v0.1 days, or maybe even earlier.
Protobuf has had oneof support for about 7 years. They were added in version 2.6.0, from 2014-08-15 [2]. That's still 6 years after the initial public release in 2008, though, so this is maybe what you were thinking of? I don't know too many people who were using protobuf in those days outside of Google, though.
And yes, I definitely am primarily thinking of protobuf, as I struggled with this back with version 2.5. I had the (apparently mistaken) impression that Avro and Cap'n Proto (which I think actually first came out in that timeframe) were about on par.
I mean that metaphorically but I do have a bunch of keyboard shortcuts (in a browser extension) that make finding these, and formatting the comments, much faster.
Wow I remember using Ion back at Amazon in 2012.
I can’t remember but I think the order data warehouse was using it …
I also now remember back to using something that was akin to FaaS but wasn’t called that.
I could give them a JAR of some code that would execute on some Ion data for the order data when it changed. Basically FaaS for an ETL pipeline…
ION pros:
- easy to skip around while reading a file
- no need to write a schema
- backed by amazon so major langs will have impls
- good date support
- better concatenation, probably better suited to logging than BARE
BARE pros:
- what's the text format even for?
ION cons:
- schemas keep things tightly versioned
- smaller binaries (not self describing like ion)
- simpler to implement so tons of devs have impl'ed for their favorite lang
- better suited to small messages (think REST json api)
BARE cons:
- no skip read
- no date support
I might do an ion ruby implementation too, to really feel out the difference.
Ion is already a little too complex for my taste. It'd be a shame to see it go the same way as yaml where it's so complex that most major implementations are not safely interoperable.
One problem with Ion is that it doesn't have a map type, but instead a struct type that allows duplicate keys. I created Zish https://github.com/tlocke/zish as a serialization format that addresses the shortcomings of JSON and Ion. Any comments / criticisms welcome.
Even JS stopped parsing JSON as a subset of JS a long time ago. JSON's lineage has been irrelevant in terms of popularity transmission ever since people stopped doing var jsonobj = eval(jsonstring);
> The dominance of JSON just shows that JS is dominant.
I don't know that's the case... I've used JSON in lots of non-JS languages because it just works, and errors rarely are caused by mismatches in how JSON behaves in language X and language Y. A lot of that is that it is simple, and rigid.
Ion's text format is a nice JSON alternative, while its binary format is very compact and allows for efficient sparse parsing. Fields are prefixed with their length so you can skip over unneeded fields or structs while only creating objects for values you'll use.
Did anything ever become of the lispy language that was being built using Ion as its homoiconic syntax? I'm afraid I can't recall what it was called. Fusion maybe?
I built a system in my previous team where clients could register "filters" described in Fusion. My system, which was a source of a lot of different notifications, would then run these filters and only send those notifications that passed them. It became very popular very quickly because of the easy on-boarding and the fact that clients now got only the fraction of the messages they were actually interested in. I just checked the Java implementation; it seems to be still active and getting commits.
Yeah, Fusion was the name. Last I heard, they discontinued it, saying essentially "If you really want a full Lisp, there's already Clojure." S-exps continued to be used in Ion for embedded 1-liners but they only supported a handful of operators, not a full language.
Nice!
This thing is actually sane and thought through. A first for serialization formats. They're usually a shitshow.
(Should have gone with 'rational' instead of 'decimal', though. Decimal will be too painful to implement across languages and implementations. Java bias?)
But decimals are way more useful, as they can represent currency amounts. It would be strange to show a currency amount like "3/4" or "11/12".
Personally, the two datatypes I have always been adding manually to json are datetimes and decimals (from python)
I don't think there is a more "correct" representation. Representing it as a string is equally correct. Blockchain or banking often represents money as an integer with the smallest divisible unit (cent, or satoshi), but that is not applicable here because there is no smallest divisible unit.
I understand the case for Base91, but why hex over Base64? Base64 for readability and sticking to multiples of two, Base91 for maximum efficiency with readable ASCII.
Base 64 is good at nothing and bad at some things.
- Hex is human readable, case insensitive, not that "inefficient", and always aligns to bytes.
- Base 85 and basE91 are efficient.
- Bitcoin uses Base58 because they thought base 64 was too human unreadable. Ethereum uses Hex.
- Base 256 (bytes) is efficient and the native language of computers.
Base 64 is not efficient, not human readable, and not easy to encode.
The biggest problem with base 64 is that base 64 is not base 64. Are you doing base 64 with padding? Are you doing base 64 with URL safe characters or URL unsafe characters? Are you following the standard RFC 4648 bucket encoding, or are you using iterative divide by radix? I think a great place where the cracks show is JOSE, where for things like thumbprints there's a ton of conversion steps (UTF-8 key -> base 64 -> ASCII bytes -> digest (bytes) -> base 64 thumbprint).
My personal advice for 90% of projects considering base 64: just use Hex or bytes. If you need human readability, use Hex. Otherwise use binary.
So basically it's Amazon's version of Apache Avro.
Avro supports binary/json serialization, schema evolution , logical types (e.g. timestamp) and other cool stuff.
Weird to see the library I work on show up on HN — Mir Ion is a pretty complicated library (and admittedly our documentation needs work — I’m working on that!), but I’m very proud of our work.
Some fun things about Mir Ion:
- We can fully deserialize Ion at compile-time (via D’s CTFE functionality)
- We’re one of the fastest JSON parsing libraries (and one of the most memory efficient too — we actually store all JSON data in memory as Ion data, which is vastly more efficient)
- We’re nearly 100% compliant to all of the upstream test cases (our main issue is that we’re often too lax on spec, and allow files that are invalid through)
- The entire library is (nearly) all `@nogc`, thanks to the Mir standard library
If anyone has any questions on Mir Ion, feel free to shoot me a line at harrison (at) 0xcc.pw
That’s an interesting question. On the one hand, it feels weird that you can’t represent those dates at all.
On the other hand, representability of a given date becomes progressively less useful the further back in time you go, and stuff becomes really gnarly once you go back past the Julian calendar in 45BC.
Also, simplifying to “no dates before Jan 1 0001” has very little impact on applications dealing with the modern-ish world (with “modern” generously defined as “anything after the collapse of the Roman Empire”), and I can only assume applications dealing with earlier times could do with a more specialised representation for dates anyway.
Consider that binary, binary coded decimal, Gray code, hexadecimal, octal, etc. are all 'formats' expressing the same (numerical) idea.
You can't say the same of, for example, YAML & JSON, since the former (if not the latter?) has constructs unrepresentable in the other.
It's slightly confused because an application might 'serialise to' JSON or YAML or Ion equivalently - but really that's saying the application's data being serialised fits a model that's a subset of the intersection between those formats.
You could call Ion two formats, but it's more than that, in that it's also a promise that they're 1:1 (err, and onto, if you like) - their intersection is their union.
Ion text is like JSON, in fact all JSON is valid ion text. Ion text has comments, trailing commas, dates, and unquoted keys. It's a really good alternative to JSON, YAML, or TOML.
Ion binary is compact and fast to parse. Values are length prefixed so the parser can skip over unneeded fields or structs, saving time parsing and memory allocated. Common string values, like struct keys and enum values, are given numeric ids and stored once in a header table.
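To illustrate why that helps, skipping a length-prefixed value is just pointer arithmetic; this sketch uses a simplified "type byte + 4-byte big-endian length" framing rather than Ion's actual type-descriptor encoding.

    #include <cstddef>
    #include <cstdint>

    // Skip one value without scanning its payload byte by byte.
    // Returns the position after the value, or nullptr on truncated input.
    const uint8_t* SkipValue(const uint8_t* p, const uint8_t* end) {
        if (end - p < 5) return nullptr;
        uint32_t len = (uint32_t(p[1]) << 24) | (uint32_t(p[2]) << 16) |
                       (uint32_t(p[3]) << 8)  |  uint32_t(p[4]);
        p += 5;
        return (static_cast<std::size_t>(end - p) >= len) ? p + len : nullptr;
    }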
I don't like ion. The added features (over json) don't pull their weight. Symbols, annotations and the binary format all add significant complexity but don't make the format much better. As a consequence of the added complexity language support is poor.
For RPC the binary encoding compares poorly to external schema formats like protobuf. In this context binary ion is a poorly compressed text format.
I don't think the partial document read capability of the binary format is all that important, but I've never worked on an application that would benefit from it either.
It seems like an odd choice to make the type "metadata" a prefix to the value, rather than a separate field. It feels like overloading. What's the advantage?
Not sure I understand exactly what "a separate field" would look like, but:
1. Considering that a goal of Ion is to be a strict superset of JSON, separate syntax ensures that any JSON value can be parsed without misinterpreting some field as an annotation--there are no reserved/"magic" field names.
2. Annotations can be applied to any type of value, not just objects, which are the only type that have fields.
> JSON numbers, just like all human readable formats, are decimal...
All JSON numbers are implemented as integers or floating point, and as a result, have to be cast as a decimal (a decimal type is generally something that meets this specification: http://speleotrove.com/decimal/) when you import them.
Decimal types differ from floating point types in three ways: they are accurate, and they take into account rounding rules and precision. Decimal math is slower, can have greater precision and is better suited to domains where finite precision is needed. Floating point is faster, but is not as precise, so it's good for some scientific uses... or where perfect precision isn't important but speed is... say 3d graphics.
I've billed lots of hours over the years fixing code where a developer used floats where they should have used decimals. For example, if you are dealing with money, you probably want decimal. It's one of those problems like trying to parse email addresses with a regex or rolling your own crypto... it will kinda work until someone finds out it really doesn't (think accounting going, "our numbers are off by random amounts, WTF?").
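A tiny demonstration of the failure mode, next to the exact fixed-point alternative that a decimal type effectively gives you for currency:

    #include <cstdio>

    int main() {
        // Ten 10-cent charges accumulated in binary floating point:
        double total = 0.0;
        for (int i = 0; i < 10; ++i) total += 0.10;
        std::printf("double: %.17f\n", total);   // 0.99999999999999989, not 1.0

        // The same charges in integer cents:
        long long cents = 0;
        for (int i = 0; i < 10; ++i) cents += 10;
        std::printf("cents:  %lld\n", cents);    // exactly 100
        return 0;
    }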
A binary double can hold any decimal value to 15 digits of precision, so as a serialisation format it's a bit of a non-issue... you just need to convert to decimal and round appropriately before doing any arithmetic where it matters.
And you're confusing JSON the format with typical implementations. Open a JSON file and you see decimal digits. There is no limit to the number of the digits in the grammar. Parsing these digits and converting them to binary doubles, for example, is actually slower than parsing them as decimals, because you have to do the latter anyway to accomplish the former. Almost all JSON libraries convert to binary (e.g. doubles) because of their ubiquitous hardware and software support...but some libraries like RapidJSON expose raw numeric strings out of the parser if you want to plug in a decimal library
> And you're confusing JSON the format with typical implementations. Open a JSON file and you see decimal digits. There is no limit to the number of the digits in the grammar. Parsing these digits and converting them to binary doubles, for example, is actually slower than parsing them as decimals, because you have to do the latter anyway to accomplish the former.
JSON spec for numbers: integer or float (implemented as a double precision float). JSON libraries read numbers as double precision float because that is the correct type for JSON numbers, not for any other reason.
Putting annotations before values is likely to be more useful for streaming parsers than putting them after. Imagine the case where the annotation represents a class that you want to deserialize a large object into.
Seems like you have to handle that yourself. The serialized data includes the type, so your app code might have to have logic a la “if type1: … else: …” after parsing it.
OK, so it's one of the more flexible ones (like those binary jsons) rather than something like protobuf. I guess that should have been obvious from "self-describing".
I feel like a lot of file formats came out of companies, but even protocol buffers isn't calling itself google protocol buffers. What is it with modern companies putting their name everywhere they can?
Disambiguation. There is one thing called protobufs. There are hundreds called "ion", a lot of which are more notable than an internal file format.
Edit: I was going to paste in a relevant quote from Zarf (i.e. Andrew Plotkin) on naming. Some of his most important programs have total nonsense names like "glulx", and the reasoning was that at least it would be easy to search for when the name is unique. But ironically, "Zarf" is so common a term that I can't find the quote.
It's funny, I didn't realize protobuf was a Google thing for a long time because of that. At least `protobuf` is a reasonably-specific search term. `ion` returns too much noise. Almost a good reason to name things weirder, like `iyon`. But then they'd get laughed at. EDIT: oh, it's a Tagalog name too, and a light company.
Parsing ion text should be similar to json, it has the same characteristics. All JSON is valid ion text so you can even parse JSON with an ION parser.
The binary parser is much faster. All fields are length-prefixed so a parser doesn't have to scan forward for the next syntax element.
The ion parsers (lexer? not sure of the right vocab) I've worked with have a `JSON.parse` equivalent that returns a fully realized object (a Map, Array, Int, etc.), but they also have a streaming parser that yields value by value. You can skip over values you don't need, and step over structs or into structs without creating a Map or Array. That can be much faster.
Disclosure: I manage the Ion and PartiQL teams at Amazon.
If you want to create an issue for it (the best repo is probably the ion-docs one: https://github.com/amzn/ion-docs/issues) that will help to show us there is demand for it. Providing information on your use case helps us prioritize.