This reminded me of a tight-packed binary format we used in the trading systems domain almost 20 years ago. Instead of including metadata/field names in each message, it had a central message dictionary from which every client/server would first download a copy. Messages carried only a type ID, followed by binary-packed data in the correct field order. Because of microsecond latency requirements, we even avoided the serialization/deserialization step by making the memory format of the message and the wire format one and the same. The message class contained the same buffer that you would send/store. The GetInt(fieldID) method of the class simply points to the right place in the buffer and casts to int. Application logs contained these messages rather than plain text, and there was a special reader for them. Messages were exchanged over raw TCP. They contained their own application layer sequence number so that streams could resume after disconnection.
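Roughly, the accessor trick looks like this (a minimal sketch, not the original code; the field offsets would really come from the downloaded dictionary, and all names here are made up):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // The buffer is both the in-memory representation and the wire format:
    // whatever bytes arrive on the socket are what the accessors read from.
    class MessageView {
    public:
        explicit MessageView(std::vector<char> buffer) : buf_(std::move(buffer)) {}

        // "GetInt(fieldID)": the dictionary maps a field ID to a fixed byte offset.
        // memcpy keeps this well-defined in modern C++; the compiler reduces it
        // to a single load.
        int32_t GetInt(std::size_t byteOffset) const {
            int32_t value;
            std::memcpy(&value, buf_.data() + byteOffset, sizeof(value));
            return value;
        }

        // Sending, storing, or logging the message is just writing buf_ out verbatim.
        const char* data() const { return buf_.data(); }
        std::size_t size() const { return buf_.size(); }

    private:
        std::vector<char> buf_;
    };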
In that world, latencies were so low that the response to your order submission would land in your front-end before you'd had time to lift your finger off the Enter key. I now work with web-based systems. On days like this, I miss the old ways.
And to top it off, you could fit the entire message into whatever the MTU of your network supported. Cap it at 1500 bytes, subtract the overhead for the frame headers, and you get an extremely tight TCP/IP sequence stream that buffers through 16 MB without needing to boil the ocean for a compound command sequence.
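(With a 1500-byte MTU, for example, subtracting 20 bytes each for the minimal IPv4 and TCP headers leaves about 1460 bytes of payload per segment.)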
Having been in industry only 2 decades it amuses me how many times this gets rediscovered.
That just reminded me of the most mysterious scaling issue I ever faced. We had a message to disseminate market data for multiple markets (e.g. IBM: 100/100.12 @ NYSE, 101/102 @ NASDAQ, etc.). The system performed admirably under load testing (think 50,000 messages per second). One day we onboarded a single new regional exchange and the whole market data load test collapsed. We searched high and low for days without success, until someone figured out that adding the new market had pushed the market data message past the Ethernet frame size for the first time. The problem wasn't at the application layer or the transport layer; it was fragmentation once the message exceeded the data link layer's frame size! Figuring that out felt like solving a murder mystery (I wasn't the one who figured it out though).
A lot of "transparent RPC" systems are like this. "It's just like a normal function call, it's sooo convenient" . . . until it isn't, because it involves the network hardware and configuration, routing environment, firewalls, equipment failure . . .
I’ve worked on systems like this too - the max packet size is very well documented.
Then post trade it all gets turned into FIXML which somehow manages to be both more verbose and less readable.
Yeah, that's part of the trick for spreading large listing responses across frames. Usually there's some indicator like a "more" flag, so the client can fetch the next chunk by passing back the last index it received (e.g. the prior B-tree index). People do this all the time with large databases and it's a very similar use case.
Ouch, that's rough. One nice bit of IPv6 is that it doesn't allow fragmentation. It's often much nicer to get no message, or an error, than subtly missing data.
Ah yeah, that's right. I'm just learning more of IPv6 and get it mixed up. It appears what I had in mind was about intermediate routers: "Unlike in IPv4, IPv6 routers (intermediate nodes) never fragment IPv6 packets." (Wikipedia). To the previous point, it looks like IPv6 does require networks to deliver 1280-byte or smaller packets unfragmented.
I don't understand why serialization formats that separate structure and content aren't more popular.
Imagine a system where every message is a UID or DID (https://www.w3.org/TR/did-core/) followed by raw binary data. The UID completely describes the shape of the rest of the message. You can also transmit messages to define new UIDs: these messages' UID is a shared global UID that everyone knows about.
Once a client learns a UID, messages are about as compact as possible. And the data defining UIDs can be much more descriptive than e.g. property names in JSON. You can send documentation and other excess data when defining the UID, because you don't have to worry about size, because you're only sending the UID once. And UIDs can reference other UIDs to reduce duplication.
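A minimal sketch of the framing (the well-known "define" UID and all names here are made up for illustration):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hypothetical well-known UID that every peer understands: the payload of
    // such a message defines the shape (and documentation) of a new UID.
    constexpr uint64_t kDefineUid = 0x1;

    struct Frame {
        uint64_t uid;                   // identifies the shape of the payload
        std::vector<uint8_t> payload;   // raw binary data in the agreed field order
    };

    std::vector<uint8_t> Encode(const Frame& f) {
        std::vector<uint8_t> out(sizeof(f.uid) + f.payload.size());
        std::memcpy(out.data(), &f.uid, sizeof(f.uid));   // endianness handling omitted
        if (!f.payload.empty())
            std::memcpy(out.data() + sizeof(f.uid), f.payload.data(), f.payload.size());
        return out;
    }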
It's a 64-bit random number, so it'll never have unintentional collisions.
Also note that a capnp schema is natively represented as a capnp message. Pretty convenient for the “You can also transmit messages to define new UIDs” part of your scheme :)
Protobuf is a boring old tag-length-value format. It's kind of the worst of both worlds, because it has no type information encoded into it, meaning it's useless without the schema, while still having quite a bit of overhead.
Cap'n Proto is more like a formalization of C structs, in that new fields are only added to the end. If memory serves, on the wire there is no tag, type, or length info (for fixed-size field types), and everything is rooted at fixed offsets.
Mostly right. Allow me to provide some wonky details.
Protobuf uses tag-type-values, i.e. each field is encoded with a tag specifying the field number and some basic type info before the value. The type info is only just enough information to be able to skip the field if you don't recognize it, e.g. it specifies "integer" vs. "byte blob". Some types (such as byte blob) also have a length, some (integer) do not. Nested messages are usually encoded as byte blobs with a length, but there's an alternate encoding where they have a start tag and an end tag instead ("start group" and "end group" are two of the basic types). On one hand, having a length for nested messages seems better because it means you can skip the message during deserialization if you aren't interested in it. On the other hand, it means that during serialization, you have to compute the length of the sub-message before actually serializing it, meaning the whole tree has to be traversed twice, which kind of sucks, especially when the message tree is larger than the L1/L2 cache. Ironically, most Protobuf decoders don't actually support skipping parsing of nested messages so the length that was so expensive to compute ends up being largely unused. Yet, most decoders only support length-delimited nested messages and therefore that's what everyone has to produce. Whoops.
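For reference, the field key and varint value look like this on the wire (a small sketch, not the real protobuf library code):

    #include <cstdint>
    #include <vector>

    // Wire types: 0 = varint, 1 = 64-bit, 2 = length-delimited, 5 = 32-bit
    // (3 and 4 are the start-group/end-group markers mentioned above).
    void PutVarint(std::vector<uint8_t>& out, uint64_t v) {
        while (v >= 0x80) {
            out.push_back(uint8_t(v) | 0x80);   // high bit set: more bytes follow
            v >>= 7;                            // 7 payload bits per byte
        }
        out.push_back(uint8_t(v));
    }

    void PutIntField(std::vector<uint8_t>& out, uint32_t fieldNumber, uint64_t value) {
        PutVarint(out, (uint64_t(fieldNumber) << 3) | 0);  // key = field number + wire type
        PutVarint(out, value);
    }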
Now on to Cap'n Proto. In a given Cap'n Proto "struct", there is a data section and a pointer section. Primitive types (integers, booleans, etc.) go into the data section. This is the part that looks like a C struct -- fields are identified solely by their offset from the start of the data section. Since new fields can be added over time, if you're reading old data, you may find the data section is too small. So, any fields that are out-of-bounds must be assumed to have default values. Fields that have complex variable-width types, like strings or nested structs, go into the pointer section. Each pointer is 64 bits, but does not work like a native pointer. Half of the pointer specifies an _offset_ of the pointed-to object, relative to the location of the pointer. The other half contains... type information! The pointer encodes enough information for you to know the basic size and shape of the destination object -- just enough information to make a copy of it even if you don't know the schema. This turns out to be super-important in practice for proxy servers and such that need to pass messages through without necessarily knowing the details of the application schema.
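If I'm reading the encoding spec right, decoding a struct pointer is just bit-slicing one 64-bit word (a sketch, not the real library code):

    #include <cstdint>

    struct StructPointer {
        int32_t  offsetWords;    // signed offset from the word after the pointer
        uint16_t dataWords;      // size of the data section, in 8-byte words
        uint16_t pointerWords;   // size of the pointer section, in 8-byte words
    };

    StructPointer DecodeStructPointer(uint64_t word) {
        // Low 2 bits are the pointer kind (0 = struct); the next 30 bits are the
        // signed offset. The top 32 bits give the size of both sections -- enough
        // to copy the target struct without knowing its schema.
        StructPointer p;
        p.offsetWords  = int32_t(uint32_t(word)) >> 2;   // arithmetic shift keeps the sign
        p.dataWords    = uint16_t(word >> 32);
        p.pointerWords = uint16_t(word >> 48);
        return p;
    }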
In short, both formats actually contain type information on the wire! But, not a full schema -- only the minimal information needed to deal with version skew and make copying possible without data loss.
I wouldn't call what protobuf encodes type information. If I recall correctly, all the group stuff is deprecated, so what's left basically boils down to 3 types: 32 bit values, 64 bit values and length prefixed values, which covers strings and sub-messages. Without the schema you can't even distinguish strings from sub-objects, as they are both length prefixed as you described.
Can you even distinguish floats and ints without a schema in protobufs? I don't remember.
I really enjoy capnproto, flatbuffers and Avro and bounce between them depending on the task at hand.
> I wouldn't call what protobuf encodes type information.
Well... I would call it type information, just not complete.
> 32 bit values, 64 bit values and length prefixed values
In protobuf, most integer types are actually encoded as varints, i.e. variable-width integers, not fixed 32-bit or 64-bit. varint encoding encodes 7 bits per byte, and uses the extra bit to indicate whether there are more bytes. (It's not a very good encoding, as it is very branch-heavy. Don't use this in new protocols.)
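Decoding shows where the branches come from: every byte has to be tested before you know whether another one follows (a sketch):

    #include <cstddef>
    #include <cstdint>

    // Returns the number of bytes consumed, or 0 on truncated/over-long input.
    std::size_t GetVarint(const uint8_t* in, std::size_t len, uint64_t& out) {
        out = 0;
        for (std::size_t i = 0; i < len && i < 10; ++i) {   // 64 bits fit in 10 bytes
            out |= uint64_t(in[i] & 0x7F) << (7 * i);
            if ((in[i] & 0x80) == 0)    // continuation bit clear: value complete
                return i + 1;
        }
        return 0;
    }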
> Can you even distinguish floats and ints without a schema in protobufs?
You can't distinguish between float vs. fixed32. But int32 would be varint-encoded, while floats are always fixed-width, so you could distinguish between those. (You can't distinguish between int32, uint32, and sint32, though -- and sint32 in particular won't coerce "in the obvious way" to the others.)
The really unfortunate thing is you can't distinguish between strings vs. nested messages (of the length-delimited variety). So if you don't specifically know that something is a nested message then it's not safe to try parsing it...
I wonder if giving it a name based on the hash of the definition has been explored; like Unison [0] where all code is content addressable, but for just capnproto definitions. Is there a reason not to?
Capnp uses the name of your message, but not its full definition because that would make it impossible to extend protocols in a backwards compatible way. Without the ability to add new fields, making changes to your protocol would be impossible in large orgs.
Yes it is. Message schemas are made by humans. Most of these messages will be extended in a backwards compatible manner over the life of a project rather than replaced entirely so their IDs don’t change. That’s kinda the point of protobufs and its successors.
Which puts it on the same order of magnitude as the number of people on the planet. If every person alive generated a schema (or if 1/100th of all people generate 100 IDs each like you) then we'd have a small number of collisions. More likely you'd get large numbers of schema like that if there's a widespread application of a protocol compiler that generates new schema programmatically, e.g. to achieve domain separation, and then is applied at scale. I'm not saying that's likely, just that it is not, as is claimed, inconceivable.
It's only really a problem if you use the IDs in the same system. It's highly unlikely that you'd link 4B schemas into a single binary. And anyway, if you do have a conflict, you'll get a linker error.
Cap'n Proto type IDs are not really intended to be used in any sort of global database where you look up types by ID. Luckily no one really wants to do that anyway. In practice you always have a more restricted set of schemas you're interested in for your particular project.
(Plus if you actually created a global database, then you'd find out if there were any collisions...)
If you already have 4 billion of them generated, there's about a 1-in-4-billion chance that each new one you generate duplicates an existing one.
On top of that, you would not only need to generate the same ID, you would need to USE it in the same system, where it could actually carry some semantics, for it to cause anything more than an error.
> It'll have unintentional collisions if you ever generate more than 4 billion of these random numbers.
If it's 64 bit, doesn't that mean you'd need to generate ~18,446,744,073,709,551,616 (2^64) of those numbers to have a collision, not 2^32?
If you generate randomly then, due to the birthday paradox, after generating sqrt(N) values you have a reasonable chance of collision.
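Concretely: with N = 2^64 possible IDs, the chance of at least one collision among n randomly generated IDs is roughly 1 - e^(-n^2 / 2N). At n = 2^32 (about 4.3 billion) that already works out to about 39%.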
The birthday paradox is named after the non-intuitive fact that with just 32 people in a room you have a >50% chance of two people having a birthday on the same day of the year.
I think it's 23 people in a room. The canonical example is people on a football (soccer) pitch. With 11 per side plus the referee there's a 50% chance that two will share the same birthday.
Does birthday paradox apply here? It’s about any pair of people having the same birthday, whereas in this case you need someone else with a specific birthday.
For example, if you generate 2 numbers and they are the same, but are different to the capnproto number, that’s a collision but doesn’t actually matter.
EDIT: It does apply, I misunderstood what the number was being used for.
You have a collision if any two schemas share the id, not if a specific schema collides with any of the others. So it is exactly like the birthday paradox.
If the schema id is the message id, in principle it could be an issue as the protocol on the wire would be ambiguous. Then again, you should be able to detect any collisions when you register a schema with the schema repo and deal with it at that time.
I don't understand your maths here: how is generating 4 billion of them any different from generating 3 billion, except a slight rise in the probability?
MD5 is a 128-bit number no one would ever have thought would collide. 64 bits is peanuts, especially when message types are being defined dynamically.
Dude that’s why I said “unintentional collisions”.
Of course you can get intentional collisions. The security model here assumes that anyone that wants to know your message’s ID can just ask.
Did you know that the Internet Protocol uses a 4-bit header to specify the format (v4 or v6) of the rest of the message? They should have used 128 bits. What a bunch of fools.
If you read the protobuf source, you can see a bunch of places where you can hook in custom type-fetching code, e.g. in the google.protobuf.Any type.
After studying it a bit, I'm certain this is how it's used inside Google (might also be mentioned elsewhere).
All you'd really need to do is to compile all protos into a repository (you can spit out the binary descriptors from protoc), then fetch those and decode in the client.
I think the system OP is describing is a little bit more complex. You're not just describing message types, you also have message templates; a template declares a message type and a set of prefilled fields. You save data by just sending the subset of fields that are actually changing, which is a very good abstraction for market data. The template is hydrated on the protocol parsing layer so your code only has to deal with message types itself.
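Roughly, hydration looks like this (a toy sketch; real systems key fields by template-defined offsets rather than a map, and all names here are made up):

    #include <cstdint>
    #include <map>

    using FieldId = uint16_t;
    using Fields  = std::map<FieldId, int64_t>;   // toy: every field is an integer

    struct Template {
        uint32_t messageType;
        Fields defaults;          // prefilled field values declared by the template
    };

    // Merge the delta received on the wire over the template's defaults, so the
    // application layer always sees a fully populated message.
    Fields Hydrate(const Template& tmpl, const Fields& delta) {
        Fields out = tmpl.defaults;
        for (const auto& [id, value] : delta)
            out[id] = value;      // only the fields that actually changed are sent
        return out;
    }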
Serialization is platform-dependent (to make it a simple memcpy most of the time), and the schema is sent up front (but can be updated later, with in-bound messages at will). See the User Guide (http://binlog.org/UserGuide.html) and the Internals (http://binlog.org/Internals.html) for more.
Interesting. Confluent Avro + Schema Registry + Kafka uses exactly the same approach - binary-serialized Avro datums are prefixed with a schema id which can be resolved via the Schema Registry.
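If I remember the framing right, it's a one-byte magic value, a 4-byte big-endian schema id, then the Avro payload; a parsing sketch:

    #include <cstddef>
    #include <cstdint>
    #include <optional>

    struct AvroEnvelope {
        uint32_t schemaId;        // look this up in the Schema Registry
        const uint8_t* payload;   // Avro-encoded datum
        std::size_t payloadLen;
    };

    std::optional<AvroEnvelope> ParseEnvelope(const uint8_t* buf, std::size_t len) {
        if (len < 5 || buf[0] != 0x00)    // magic byte must be zero
            return std::nullopt;
        uint32_t id = (uint32_t(buf[1]) << 24) | (uint32_t(buf[2]) << 16) |
                      (uint32_t(buf[3]) << 8)  |  uint32_t(buf[4]);
        return AvroEnvelope{id, buf + 5, len - 5};
    }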
Same here, I wrote an exchange core that did this using SBE. Basically you don't serialize in the classical sense, because you're simply taking whatever bytes are at your pointer and using them as some natural type. The internals of the exchange also simply used the same layout, so there was minimal copying and interpreting. On the way out it was the same, all you had to do was mask a few fields that you didn't want everyone to see and ship it onto the network.
Even an unoptimized version of this managed to get throughput in the 300K/s range.
Somehow it's the endpoint of my journey into serialization. Basically, avoid it if you need to be super fast. For most things though, it's useful to have something that you can read by eye, so if you're not in that HFT bracket it might be nicer to just use JSON or whatever.
Fab story, thank you! I understood up to
"Messages were exchanged over raw TCP. They contained their own application layer sequence number so that streams could resume after disconnection."
Can you go into more details about how the sequence number and resuming after disconnection worked?
The server used a global sequence number for all messages it transmitted. Clients are stateful, so they know exactly which message they processed last and would send that id when creating a new connection. This was very important, as a lot of the message types carried delta values, one of the most important being the order book. So in order to apply a new message you had to make sure that your internal state was at the correct sequence id; failing to do so would make your state go bonkers, especially when you're talking about hundreds of messages being received per second. It sounds scary, but you had a special message type that would send you a snapshot of the expected state along with the sequence id it corresponds to. So your error handling code would fetch one of these and then ask for all the messages newer than that.
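In rough C++, the client side of that looks something like this (a sketch; all names are made up):

    #include <cstdint>

    struct Delta    { uint64_t seq; /* field changes */ };
    struct Snapshot { uint64_t seq; /* full state    */ };

    class BookClient {
    public:
        void OnDelta(const Delta& d) {
            if (d.seq != lastSeq_ + 1) {   // gap: applying this would corrupt state
                RequestSnapshot();
                return;
            }
            Apply(d);
            lastSeq_ = d.seq;
        }

        void OnSnapshot(const Snapshot& s) {
            Load(s);                        // replace local state wholesale
            lastSeq_ = s.seq;
            RequestReplayAfter(lastSeq_);   // re-request everything newer
        }

    private:
        uint64_t lastSeq_ = 0;
        void Apply(const Delta&) { /* apply field deltas to the book */ }
        void Load(const Snapshot&) { /* install the snapshot */ }
        void RequestSnapshot() { /* ask for current state + its seq */ }
        void RequestReplayAfter(uint64_t) { /* ask for all messages after seq */ }
    };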
This is exactly right. It was almost always deltas rather than snapshots. One of the downsides was that debugging an issue sometimes required replaying the entire market up to the point of the crash/bug.
Pretty basic. The receiving process usually has an input thread that just puts the messages into a queue. Then a processing thread processes (maybe logic, maybe disk writes, maybe send) the messages and queues up periodic batch acks to the sender. The sender uses these acks to clear its own queue. The receiver persists the last acked sequence number, so that in case of a restart, it can tell upstream senders to restart sending messages from that point.
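The sender side of that ack scheme might look like this (a sketch; names are made up):

    #include <cstdint>
    #include <deque>

    struct Outgoing { uint64_t seq; /* payload */ };

    class Sender {
    public:
        void Send(Outgoing m) {
            queue_.push_back(m);            // keep until acked
            /* write to socket */
        }

        // Receiver acked everything up to and including ackedSeq.
        void OnAck(uint64_t ackedSeq) {
            while (!queue_.empty() && queue_.front().seq <= ackedSeq)
                queue_.pop_front();
        }

        // After the receiver restarts, replay everything it hasn't acked yet.
        void ResendFrom(uint64_t lastAckedSeq) {
            for (const auto& m : queue_)
                if (m.seq > lastAckedSeq) { /* rewrite to socket */ }
        }

    private:
        std::deque<Outgoing> queue_;
    };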
What you're describing is exactly what still takes place in trading platforms, although a few I've seen now use SBE for consistency's sake (it's very common on the market data side).
I don't know if you're describing the original FIX itself with the TCP connection. With FAST FIX they got rid of the TCP connection and market data was sent over UDP on several parallel streams; data was reordered on the client side at consumption time, and a TCP connection was only used to recover data when a sequence gap was found.
Actually, even FAST was too slow for us. This was a proprietary messaging middleware library. And this particular market data feed was the direct one into the matching engine itself. For the rest of the system, we used a sort of reliable multicast using UDP for the first transmission and TCP for missed messages. We initially tried out a Gossip/Epidemic protocol but that didn't work out too well.
I had exactly the same implementation, except that type/version belonged to the whole message and would map to the appropriate binary buffer in memory. No real de/serialization was needed.
I still use it in my UDP game servers, with an added packet id if the message exceeds the max datagram length and has to be split.
The one concern I'd have with this format is a length field getting corrupted in transit and causing an out-of-bounds memory access. The network protocols' checksums won't save you 100% of the time, especially if there's bad hardware in the loop. If every field is fixed length this is less of a concern, of course; you might get bad data but you won't get e.g. a string with length 64M.
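The usual defence is to clamp every length against what actually arrived before touching the bytes, e.g. (a sketch):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <optional>
    #include <string_view>

    // Reads a length-prefixed string field at offset, refusing any length
    // that points outside the received buffer.
    std::optional<std::string_view> ReadString(const uint8_t* buf, std::size_t len,
                                               std::size_t offset) {
        if (offset > len || len - offset < 4)
            return std::nullopt;                       // no room for the length prefix
        uint32_t n;
        std::memcpy(&n, buf + offset, sizeof(n));
        if (n > len - offset - 4)
            return std::nullopt;                       // corrupted or hostile length
        return std::string_view(reinterpret_cast<const char*>(buf) + offset + 4, n);
    }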
In our system, if the message didn't unpack properly, the application would send a retransmit request with that message's sequence number. But in practice, this scenario never occurred because TCP already did this for us.
Very neat and similar to a project I am starting for packet radio. I went further with the dictionary concept so that it contains common data. This way, your message contains only a few dictionary "pointers" (integers in base 64). This makes it easier to fit messages in ASCII for 300 baud links.
Old school texty FIX is incredibly slow. FAST FIX is faster but not fun to use. SBE has largely won adoption on the market data side, with huge platforms like Euronext (the biggest in Europe) using it.
The FAST FIX protocol is terrible performance-wise; its format requires multiple branches when parsing every field. Even "high-performance" libraries like mFAST are slow: I recently helped a client optimize parsing for several messages and got an 8x speed improvement over mFAST (which is a big deal in the HFT space).
> In that world, latencies were so low that the response to your order submission would land in your front-end before you'd had time to lift your finger off the Enter key.
If the order submission process depends on a manual press of the Enter key (+/- 50 ms), is there any point to that, though?
Despite all the algorithms we employed, the concept of a manual trade never went away. Also, when the front-end was taken out of the equation, the latencies were in the microsecond range. 50ms would be excruciatingly slow for an algorithm.
Is this a number that came from an actual benchmark or from some marketing material from a keyboard maker? I ask this because [1] finds latency (measured from touching the key to the usb packet arriving) of 15ms with the fastest keyboard and around 50ms with others, though apparently some manufacturers have since improved. Or are you talking about midi keyboards where I guess latency is more noticeable to users?
From the countless review and small time YouTube channels that test these things regularly.
I think that post must be a few years out of date - and moreover, by its own admission, it hardly tests any "gaming" keyboards. There is a tremendous amount of competition in keyboards that has been building for the past 10 years.
Input latency is now a marketing thing like horsepower, and there are reasonably reputable [1] places and countless small time YouTube reviewers that test these things.
It’s not like it is difficult to improve latency, and now that it is something that is competitively marketed it is delivered on.
Personally I think it's a bit ridiculous. This fetishization of minimizing latency to sub-ms levels doesn't necessarily lead to better performance, as many top-level gamers do not use the lowest-latency keyboards. But that doesn't change the fact that modern mainstream gaming keyboards can hit a latency far below 50ms.
The link I posted was 2017. The site you link gives quite different ratings. I assume partly it is different methodology (the site you link tries to account for key travel somehow and they do something with a display and try to account for display latency rather than using a logic analyzer), but I’m not really sure. For some keyboards in common:
- apple magic keyboard (? vs 2017) 15ms vs 27ms
- das keyboard (3 vs S professional/4 Professional) 25 vs 11/10ms
- razer ornata (chroma vs chroma/chroma 2) 35 vs 11.4/10.1ms
Interestingly it is not some simple uniform difference: the Apple keyboard does much worse in the rtings test, perhaps getting not much of a bonus from key travel compensation. But the das keyboard vs the razer that are 10ms apart on my link perform equally on rtings (but maybe I found the wrong model). I don’t have a good explanation for that discrepancy.
I know it is 2017, but that is a very long time in the gaming/mech keyboard market. I remember just about 10 years ago when mech keyboards were a niche for weirdos and a few others that swore by their Model M's - you can now buy these in Walmart. The point about the discrepancy is well taken, but I think the bigger point is that the rtings list has so many offerings with an order of magnitude lower latency that the methodology used in your link is not even viable.
Why is using a high speed camera and a logic analyzer less viable than measuring the end to end latency and trying to subtract the computer part of it? Or are you suggesting that a solenoid should be used to press the key instead of a finger?
I assume you were using C++? I'm not sure what you describe is possible these days due to UB. At the very least just casting bytes received over the wire to a type is UB, so you technically need a memcpy() and hope that the compiler optimises it out.
Yes, it was C++. I was unfamiliar with the acronym "UB" so did a Google search. Does it mean "Undefined Behavior"? If I remember correctly, primitive types other than strings are memcpy'd. GetStr basically returned a char* to the right place in the buffer.