This reminded me of a tight-packed binary format we used in the trading systems domain almost 20 years ago. Instead of including metadata/field names in each message, it had a central message dictionary from which every client/server would first download a copy. Messages carried only a type ID, followed by binary-packed data in the correct field order. Because of microsecond latency requirements, we even avoided the serialization/deserialization step by making the memory format of the message and the wire format one and the same. The message class contained the same buffer that you would send/store. The GetInt(fieldID) method of the class simply points to the right place in the buffer and casts to int. Application logs contained these messages rather than plain text, and there was a special reader for them. Messages were exchanged over raw TCP. They contained their own application layer sequence number so that streams could resume after disconnection.
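Roughly, the accessor trick looks like this (a minimal sketch, not the original code; the field offsets would really come from the downloaded dictionary, and all names here are made up):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // The buffer is both the in-memory representation and the wire format:
    // whatever bytes arrive on the socket are what the accessors read from.
    class MessageView {
    public:
        explicit MessageView(std::vector<char> buffer) : buf_(std::move(buffer)) {}

        // "GetInt(fieldID)": the dictionary maps a field ID to a fixed byte offset.
        // memcpy keeps this well-defined in modern C++; the compiler reduces it
        // to a single load.
        int32_t GetInt(std::size_t byteOffset) const {
            int32_t value;
            std::memcpy(&value, buf_.data() + byteOffset, sizeof(value));
            return value;
        }

        // Sending, storing, or logging the message is just writing buf_ out verbatim.
        const char* data() const { return buf_.data(); }
        std::size_t size() const { return buf_.size(); }

    private:
        std::vector<char> buf_;
    };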
In that world, latencies were so low that the response to your order submission would land in your front-end before you'd had time to lift your finger off the Enter key. I now work with web-based systems. On days like this, I miss the old ways.
And to top it off, you could fit the entire message into whatever the MTU of your network supported. Cap it at 1500 bytes, subtract the overhead for the frame headers, and you get an extremely tight TCP/IP sequence stream that buffers through 16 MB without needing to boil the ocean for a compound command sequence.
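(With a 1500-byte MTU, for example, subtracting 20 bytes each for the minimal IPv4 and TCP headers leaves about 1460 bytes of payload per segment.)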
Having been in industry only 2 decades it amuses me how many times this gets rediscovered.
That just reminded me of the most mysterious scaling issue I ever faced. We had a message to disseminate market data for multiple markets (e.g. IBM: 100/100.12 @ NYSE, 101/102 @ NASDAQ, etc.). The system performed admirably under load testing (think 50,000 messages per second). One day we onboarded a single new regional exchange and the whole market data load test collapsed. We searched high and low for days without success, until someone figured out that adding the new market had pushed the market data message past the Ethernet frame size for the first time. The problem wasn't at the application layer or the transport layer; it was fragmentation once the message exceeded the data link layer's frame size! Figuring that out felt like solving a murder mystery (I wasn't the one who figured it out though).
A lot of "transparent RPC" systems are like this. "It's just like a normal function call, it's sooo convenient" . . . until it isn't, because it involves the network hardware and configuration, routing environment, firewalls, equipment failure . . .
I’ve worked on systems like this too - the max packet size is very well documented.
Then post trade it all gets turned into FIXML which somehow manages to be both more verbose and less readable.
Yeah, that's part of the trick for spreading large listing responses across frames. Usually there's some indicator like a "more" flag, so the client can fetch the next chunk by passing back the last index it received (e.g. the prior B-tree index). People do this all the time with large databases and it's a very similar use case.
Ouch, that's rough. One nice bit of IPv6 is that it doesn't allow fragmentation. It's often much nicer to get no message, or an error, than subtly missing data.
Ah yeah, that's right. I'm just learning more of IPv6 and get it mixed up. It appears what I had in mind was about intermediate routers: "Unlike in IPv4, IPv6 routers (intermediate nodes) never fragment IPv6 packets." (Wikipedia). To the previous point, it looks like IPv6 does require networks to deliver 1280-byte or smaller packets unfragmented.
I don't understand why serialization formats that separate structure and content aren't more popular.
Imagine a system where every message is a UID or DID (https://www.w3.org/TR/did-core/) followed by raw binary data. The UID completely describes the shape of the rest of the message. You can also transmit messages to define new UIDs: these messages' UID is a shared global UID that everyone knows about.
Once a client learns a UID, messages are about as compact as possible. And the data defining UIDs can be much more descriptive than e.g. property names in JSON. You can send documentation and other excess data when defining the UID, because you don't have to worry about size, because you're only sending the UID once. And UIDs can reference other UIDs to reduce duplication.
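A minimal sketch of the framing (the well-known "define" UID and all names here are made up for illustration):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hypothetical well-known UID that every peer understands: the payload of
    // such a message defines the shape (and documentation) of a new UID.
    constexpr uint64_t kDefineUid = 0x1;

    struct Frame {
        uint64_t uid;                   // identifies the shape of the payload
        std::vector<uint8_t> payload;   // raw binary data in the agreed field order
    };

    std::vector<uint8_t> Encode(const Frame& f) {
        std::vector<uint8_t> out(sizeof(f.uid) + f.payload.size());
        std::memcpy(out.data(), &f.uid, sizeof(f.uid));   // endianness handling omitted
        if (!f.payload.empty())
            std::memcpy(out.data() + sizeof(f.uid), f.payload.data(), f.payload.size());
        return out;
    }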
It's a 64-bit random number, so it'll never have unintentional collisions.
Also note that a capnp schema is natively represented as a capnp message. Pretty convenient for the “You can also transmit messages to define new UIDs” part of your scheme :)
Protobuf is a boring old tag-length-value format. It's kind of the worst of both worlds, because it has no type information encoded into it, meaning it's useless without the schema, while still having quite a bit of overhead.
Cap'n Proto is more like a formalization of C structs, in that new fields are only added to the end. If memory serves, on the wire there is no tag, type, or length info (for fixed-size field types), and everything is rooted at fixed offsets.
Mostly right. Allow me to provide some wonky details.
Protobuf uses tag-type-values, i.e. each field is encoded with a tag specifying the field number and some basic type info before the value. The type info is only just enough information to be able to skip the field if you don't recognize it, e.g. it specifies "integer" vs. "byte blob". Some types (such as byte blob) also have a length, some (integer) do not. Nested messages are usually encoded as byte blobs with a length, but there's an alternate encoding where they have a start tag and an end tag instead ("start group" and "end group" are two of the basic types). On one hand, having a length for nested messages seems better because it means you can skip the message during deserialization if you aren't interested in it. On the other hand, it means that during serialization, you have to compute the length of the sub-message before actually serializing it, meaning the whole tree has to be traversed twice, which kind of sucks, especially when the message tree is larger than the L1/L2 cache. Ironically, most Protobuf decoders don't actually support skipping parsing of nested messages so the length that was so expensive to compute ends up being largely unused. Yet, most decoders only support length-delimited nested messages and therefore that's what everyone has to produce. Whoops.
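For reference, the field key and varint value look like this on the wire (a small sketch, not the real protobuf library code):

    #include <cstdint>
    #include <vector>

    // Wire types: 0 = varint, 1 = 64-bit, 2 = length-delimited, 5 = 32-bit
    // (3 and 4 are the start-group/end-group markers mentioned above).
    void PutVarint(std::vector<uint8_t>& out, uint64_t v) {
        while (v >= 0x80) {
            out.push_back(uint8_t(v) | 0x80);   // high bit set: more bytes follow
            v >>= 7;                            // 7 payload bits per byte
        }
        out.push_back(uint8_t(v));
    }

    void PutIntField(std::vector<uint8_t>& out, uint32_t fieldNumber, uint64_t value) {
        PutVarint(out, (uint64_t(fieldNumber) << 3) | 0);  // key = field number + wire type
        PutVarint(out, value);
    }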
Now on to Cap'n Proto. In a given Cap'n Proto "struct", there is a data section and a pointer section. Primitive types (integers, booleans, etc.) go into the data section. This is the part that looks like a C struct -- fields are identified solely by their offset from the start of the data section. Since new fields can be added over time, if you're reading old data, you may find the data section is too small. So, any fields that are out-of-bounds must be assumed to have default values. Fields that have complex variable-width types, like strings or nested structs, go into the pointer section. Each pointer is 64 bits, but does not work like a native pointer. Half of the pointer specifies an _offset_ of the pointed-to object, relative to the location of the pointer. The other half contains... type information! The pointer encodes enough information for you to know the basic size and shape of the destination object -- just enough information to make a copy of it even if you don't know the schema. This turns out to be super-important in practice for proxy servers and such that need to pass messages through without necessarily knowing the details of the application schema.
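If I'm reading the encoding spec right, decoding a struct pointer is just bit-slicing one 64-bit word (a sketch, not the real library code):

    #include <cstdint>

    struct StructPointer {
        int32_t  offsetWords;    // signed offset from the word after the pointer
        uint16_t dataWords;      // size of the data section, in 8-byte words
        uint16_t pointerWords;   // size of the pointer section, in 8-byte words
    };

    StructPointer DecodeStructPointer(uint64_t word) {
        // Low 2 bits are the pointer kind (0 = struct); the next 30 bits are the
        // signed offset. The top 32 bits give the size of both sections -- enough
        // to copy the target struct without knowing its schema.
        StructPointer p;
        p.offsetWords  = int32_t(uint32_t(word)) >> 2;   // arithmetic shift keeps the sign
        p.dataWords    = uint16_t(word >> 32);
        p.pointerWords = uint16_t(word >> 48);
        return p;
    }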
In short, both formats actually contain type information on the wire! But, not a full schema -- only the minimal information needed to deal with version skew and make copying possible without data loss.
I wouldn't call what protobuf encodes type information. If I recall correctly, all the group stuff is deprecated, so what's left basically boils down to 3 types: 32 bit values, 64 bit values and length prefixed values, which covers strings and sub-messages. Without the schema you can't even distinguish strings from sub-objects, as they are both length prefixed as you described.
Can you even distinguish floats and ints without a schema in protobufs? I don't remember.
I really enjoy capnproto, flatbuffers and Avro and bounce between them depending on the task at hand.
> I wouldn't call what protobuf encodes type information.
Well... I would call it type information, just not complete.
> 32 bit values, 64 bit values and length prefixed values
In protobuf, most integer types are actually encoded as varints, i.e. variable-width integers, not fixed 32-bit or 64-bit. varint encoding encodes 7 bits per byte, and uses the extra bit to indicate whether there are more bytes. (It's not a very good encoding, as it is very branch-heavy. Don't use this in new protocols.)
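Decoding shows where the branches come from: every byte has to be tested before you know whether another one follows (a sketch):

    #include <cstddef>
    #include <cstdint>

    // Returns the number of bytes consumed, or 0 on truncated/over-long input.
    std::size_t GetVarint(const uint8_t* in, std::size_t len, uint64_t& out) {
        out = 0;
        for (std::size_t i = 0; i < len && i < 10; ++i) {   // 64 bits fit in 10 bytes
            out |= uint64_t(in[i] & 0x7F) << (7 * i);
            if ((in[i] & 0x80) == 0)    // continuation bit clear: value complete
                return i + 1;
        }
        return 0;
    }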
> Can you even distinguish floats and ints without a schema in protobufs?
You can't distinguish between float vs. fixed32. But int32 would be varint-encoded, while floats are always fixed-width, so you could distinguish between those. (You can't distinguish between int32, uint32, and sint32, though -- and sint32 in particular won't coerce "in the obvious way" to the others.)
The really unfortunate thing is you can't distinguish between strings vs. nested messages (of the length-delimited variety). So if you don't specifically know that something is a nested message then it's not safe to try parsing it...
I wonder if giving it a name based on the hash of the definition has been explored; like Unison [0] where all code is content addressable, but for just capnproto definitions. Is there a reason not to?
Capnp uses the name of your message, but not its full definition because that would make it impossible to extend protocols in a backwards compatible way. Without the ability to add new fields, making changes to your protocol would be impossible in large orgs.
Yes it is. Message schemas are made by humans. Most of these messages will be extended in a backwards compatible manner over the life of a project rather than replaced entirely so their IDs don’t change. That’s kinda the point of protobufs and its successors.
Which puts it on the same order of magnitude as the number of people on the planet. If every person alive generated a schema (or if 1/100th of all people generate 100 IDs each like you) then we'd have a small number of collisions. More likely you'd get large numbers of schema like that if there's a widespread application of a protocol compiler that generates new schema programmatically, e.g. to achieve domain separation, and then is applied at scale. I'm not saying that's likely, just that it is not, as is claimed, inconceivable.
It's only really a problem if you use the IDs in the same system. It's highly unlikely that you'd link 4B schemas into a single binary. And anyway, if you do have a conflict, you'll get a linker error.
Cap'n Proto type IDs are not really intended to be used in any sort of global database where you look up types by ID. Luckily no one really wants to do that anyway. In practice you always have a more restricted set of schemas you're interested in for your particular project.
(Plus if you actually created a global database, then you'd find out if there were any collisions...)
If you already have 4 billion of them generated, there's about a 1-in-4-billion chance that each new one you generate duplicates an existing one.
On top of that, you would not only need to generate the same ID, you would need to USE it in the same system, where it could actually carry some semantics, for it to cause anything more than an error.
> It'll have unintentional collisions if you ever generate more than 4 billion of these random numbers.
If it's 64 bit, doesn't that mean you'd need to generate ~18,446,744,073,709,551,616 (2^64) of those numbers to have a collision, not 2^32?
If you generate randomly then, due to the birthday paradox, after generating sqrt(N) values you have a reasonable chance of collision.
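Concretely: with N = 2^64 possible IDs, the chance of at least one collision among n randomly generated IDs is roughly 1 - e^(-n^2 / 2N). At n = 2^32 (about 4.3 billion) that already works out to about 39%.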
The birthday paradox is named after the non-intuitive fact that with just 32 people in a room you have a >50% chance of two people having a birthday on the same day of the year.
I think it's 23 people in a room. The canonical example is people on a football (soccer) pitch. With 11 per side plus the referee there's a 50% chance that two will share the same birthday.
Does birthday paradox apply here? It’s about any pair of people having the same birthday, whereas in this case you need someone else with a specific birthday.
For example, if you generate 2 numbers and they are the same, but are different to the capnproto number, that’s a collision but doesn’t actually matter.
EDIT: It does apply, I misunderstood what the number was being used for.
You have a collision if any two schemas share the id, not if a specific schema collides with any of the others. So it is exactly like the birthday paradox.
If the schema id is the message id, in principle it could be an issue as the protocol on the wire would be ambiguous. Then again, you should be able to detect any collisions when you register a schema with the schema repo and deal with it at that time.
I don't understand your maths here: how is generating 4 billion of them any different from generating 3 billion, except a slight rise in the probability?
MD5 is a 128-bit number no one would ever have thought would collide. 64 bits is peanuts, especially when message types are being defined dynamically.
Dude that’s why I said “unintentional collisions”.
Of course you can get intentional collisions. The security model here assumes that anyone that wants to know your message’s ID can just ask.
Did you know that the Internet Protocol uses a 4-bit header to specify the format (v4 or v6) of the rest of the message? They should have used 128 bits. What a bunch of fools.
If you read the protobuf source, you can see a bunch of places where you can hook in custom type-fetching code, e.g. in the google.protobuf.Any type.
After studying it a bit, I'm certain this is how it's used inside Google (might also be mentioned elsewhere).
All you'd really need to do is to compile all protos into a repository (you can spit out the binary descriptors from protoc), then fetch those and decode in the client.
I think the system OP is describing is a little bit more complex. You're not just describing message types, you also have message templates; a template declares a message type and a set of prefilled fields. You save data by just sending the subset of fields that are actually changing, which is a very good abstraction for market data. The template is hydrated on the protocol parsing layer so your code only has to deal with message types itself.
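Roughly, hydration looks like this (a toy sketch; real systems key fields by template-defined offsets rather than a map, and all names here are made up):

    #include <cstdint>
    #include <map>

    using FieldId = uint16_t;
    using Fields  = std::map<FieldId, int64_t>;   // toy: every field is an integer

    struct Template {
        uint32_t messageType;
        Fields defaults;          // prefilled field values declared by the template
    };

    // Merge the delta received on the wire over the template's defaults, so the
    // application layer always sees a fully populated message.
    Fields Hydrate(const Template& tmpl, const Fields& delta) {
        Fields out = tmpl.defaults;
        for (const auto& [id, value] : delta)
            out[id] = value;      // only the fields that actually changed are sent
        return out;
    }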
Serialization is platform-dependent (to make it a simple memcpy most of the time), and the schema is sent up front (but can be updated later, with in-bound messages at will). See the User Guide (http://binlog.org/UserGuide.html) and the Internals (http://binlog.org/Internals.html) for more.
Interesting. Confluent Avro + Schema Registry + Kafka uses exactly the same approach - binary-serialized Avro datums are prefixed with a schema id which can be resolved via the Schema Registry.
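If I remember the framing right, it's a one-byte magic value, a 4-byte big-endian schema id, then the Avro payload; a parsing sketch:

    #include <cstddef>
    #include <cstdint>
    #include <optional>

    struct AvroEnvelope {
        uint32_t schemaId;        // look this up in the Schema Registry
        const uint8_t* payload;   // Avro-encoded datum
        std::size_t payloadLen;
    };

    std::optional<AvroEnvelope> ParseEnvelope(const uint8_t* buf, std::size_t len) {
        if (len < 5 || buf[0] != 0x00)    // magic byte must be zero
            return std::nullopt;
        uint32_t id = (uint32_t(buf[1]) << 24) | (uint32_t(buf[2]) << 16) |
                      (uint32_t(buf[3]) << 8)  |  uint32_t(buf[4]);
        return AvroEnvelope{id, buf + 5, len - 5};
    }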
Same here, I wrote an exchange core that did this using SBE. Basically you don't serialize in the classical sense, because you're simply taking whatever bytes are at your pointer and using them as some natural type. The internals of the exchange also simply used the same layout, so there was minimal copying and interpreting. On the way out it was the same, all you had to do was mask a few fields that you didn't want everyone to see and ship it onto the network.
Even an unoptimized version of this managed to get throughput in the 300K/s range.
Somehow it's the endpoint of my journey into serialization. Basically, avoid it if you need to be super fast. For most things though, it's useful to have something that you can read by eye, so if you're not in that HFT bracket it might be nicer to just use JSON or whatever.
Fab story, thank you! I understood up to
"Messages were exchanged over raw TCP. They contained their own application layer sequence number so that streams could resume after disconnection."
Can you go into more details about how the sequence number and resuming after disconnection worked?
The server used a global sequence number for all messages it transmitted. Clients are stateful, so they know exactly which message they processed last and would send that id when creating a new connection. This was very important, as a lot of the message types carried delta values, one of the most important being the order book. So in order to apply a new message you had to make sure that your internal state was at the correct sequence id; failing to do so would make your state go bonkers, especially when you're talking about hundreds of messages being received per second. It sounds scary, but you had a special message type that would send you a snapshot of the expected state along with the sequence id it corresponds to. So your error handling code would fetch one of these and then ask for all the messages newer than that.
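In rough C++, the client side of that looks something like this (a sketch; all names are made up):

    #include <cstdint>

    struct Delta    { uint64_t seq; /* field changes */ };
    struct Snapshot { uint64_t seq; /* full state    */ };

    class BookClient {
    public:
        void OnDelta(const Delta& d) {
            if (d.seq != lastSeq_ + 1) {   // gap: applying this would corrupt state
                RequestSnapshot();
                return;
            }
            Apply(d);
            lastSeq_ = d.seq;
        }

        void OnSnapshot(const Snapshot& s) {
            Load(s);                        // replace local state wholesale
            lastSeq_ = s.seq;
            RequestReplayAfter(lastSeq_);   // re-request everything newer
        }

    private:
        uint64_t lastSeq_ = 0;
        void Apply(const Delta&) { /* apply field deltas to the book */ }
        void Load(const Snapshot&) { /* install the snapshot */ }
        void RequestSnapshot() { /* ask for current state + its seq */ }
        void RequestReplayAfter(uint64_t) { /* ask for all messages after seq */ }
    };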
This is exactly right. It was almost always deltas rather than snapshots. One of the downsides was that debugging an issue sometimes required replaying the entire market up to the point of the crash/bug.
Pretty basic. The receiving process usually has an input thread that just puts the messages into a queue. Then a processing thread processes (maybe logic, maybe disk writes, maybe send) the messages and queues up periodic batch acks to the sender. The sender uses these acks to clear its own queue. The receiver persists the last acked sequence number, so that in case of a restart, it can tell upstream senders to restart sending messages from that point.
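The sender side of that ack scheme might look like this (a sketch; names are made up):

    #include <cstdint>
    #include <deque>

    struct Outgoing { uint64_t seq; /* payload */ };

    class Sender {
    public:
        void Send(Outgoing m) {
            queue_.push_back(m);            // keep until acked
            /* write to socket */
        }

        // Receiver acked everything up to and including ackedSeq.
        void OnAck(uint64_t ackedSeq) {
            while (!queue_.empty() && queue_.front().seq <= ackedSeq)
                queue_.pop_front();
        }

        // After the receiver restarts, replay everything it hasn't acked yet.
        void ResendFrom(uint64_t lastAckedSeq) {
            for (const auto& m : queue_)
                if (m.seq > lastAckedSeq) { /* rewrite to socket */ }
        }

    private:
        std::deque<Outgoing> queue_;
    };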
What you're describing is exactly what still takes place in trading platforms, although a few I've seen now use SBE for consistency's sake (it's very common on the market data side).
I don't know if you're describing the original FIX itself with the TCP connection. With FAST FIX they got rid of the TCP connection and market data was sent over UDP on several parallel streams; data was reordered on the client side at consumption time, and a TCP connection was only used to recover data when a sequence gap was found.
Actually, even FAST was too slow for us. This was a proprietary messaging middleware library. And this particular market data feed was the direct one into the matching engine itself. For the rest of the system, we used a sort of reliable multicast using UDP for the first transmission and TCP for missed messages. We initially tried out a Gossip/Epidemic protocol but that didn't work out too well.
I had exactly the same implementation, except that type/version belonged to the whole message and would map to the appropriate binary buffer in memory. No real de/serialization was needed.
I still use it in my UDP game servers, with an added packet id if the message exceeds the max datagram length and has to be split.
The one concern I'd have with this format is a length field getting corrupted in transit and causing an out-of-bounds memory access. The network protocols' checksums won't save you 100% of the time, especially if there's bad hardware in the loop. If every field is fixed length this is less of a concern, of course; you might get bad data but you won't get e.g. a string with length 64M.
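The usual defence is to clamp every length against what actually arrived before touching the bytes, e.g. (a sketch):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <optional>
    #include <string_view>

    // Reads a length-prefixed string field at offset, refusing any length
    // that points outside the received buffer.
    std::optional<std::string_view> ReadString(const uint8_t* buf, std::size_t len,
                                               std::size_t offset) {
        if (offset > len || len - offset < 4)
            return std::nullopt;                       // no room for the length prefix
        uint32_t n;
        std::memcpy(&n, buf + offset, sizeof(n));
        if (n > len - offset - 4)
            return std::nullopt;                       // corrupted or hostile length
        return std::string_view(reinterpret_cast<const char*>(buf) + offset + 4, n);
    }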
In our system, if the message didn't unpack properly, the application would send a retransmit request with that message's sequence number. But in practice, this scenario never occurred because TCP already did this for us.
Very neat and similar to a project I am starting for packet radio. I went further with the dictionary concept so that it contains common data. This way, your message contains only a few dictionary "pointers" (integers in base 64). This makes it easier to fit messages in ASCII for 300 baud links.
Old school texty FIX is incredibly slow. FAST FIX is faster but not fun to use. SBE has largely won adoption on the market data side, with huge platforms like Euronext (the biggest in Europe) using it.
The FAST FIX protocol is terrible performance-wise; its format requires multiple branches when parsing every field. Even "high-performance" libraries like mFAST are slow: I recently helped a client optimize parsing for several messages and got an 8x speed improvement over mFAST (which is a big deal in the HFT space).
> In that world, latencies were so low that the response to your order submission would land in your front-end before you'd had time to lift your finger off the Enter key.
If the order submission process depends on a manual press of the Enter key (+/- 50 ms), is there any point to that, though?
Despite all the algorithms we employed, the concept of a manual trade never went away. Also, when the front-end was taken out of the equation, the latencies were in the microsecond range. 50ms would be excruciatingly slow for an algorithm.
Is this a number that came from an actual benchmark or from some marketing material from a keyboard maker? I ask this because [1] finds latency (measured from touching the key to the usb packet arriving) of 15ms with the fastest keyboard and around 50ms with others, though apparently some manufacturers have since improved. Or are you talking about midi keyboards where I guess latency is more noticeable to users?
From the countless review and small time YouTube channels that test these things regularly.
I think that post must be a few years out of date - and moreover, by its own admission, it hardly tests any "gaming" keyboards. There is a tremendous amount of competition in keyboards that has been building for the past 10 years.
Input latency is now a marketing thing like horsepower, and there are reasonably reputable [1] places and countless small time YouTube reviewers that test these things.
It’s not like it is difficult to improve latency, and now that it is something that is competitively marketed it is delivered on.
Personally I think it's a bit ridiculous. This fetishization of minimizing latency to sub-ms levels doesn't necessarily lead to better performance, as many top-level gamers do not use the lowest-latency keyboards. But that doesn't change the fact that modern mainstream gaming keyboards can hit a latency far below 50ms.
The link I posted was 2017. The site you link gives quite different ratings. I assume partly it is different methodology (the site you link tries to account for key travel somehow and they do something with a display and try to account for display latency rather than using a logic analyzer), but I’m not really sure. For some keyboards in common:
- apple magic keyboard (? vs 2017) 15ms vs 27ms
- das keyboard (3 vs S professional/4 Professional) 25 vs 11/10ms
- razer ornata (chroma vs chroma/chroma 2) 35 vs 11.4/10.1ms
Interestingly it is not some simple uniform difference: the Apple keyboard does much worse in the rtings test, perhaps getting not much of a bonus from key travel compensation. But the das keyboard vs the razer that are 10ms apart on my link perform equally on rtings (but maybe I found the wrong model). I don’t have a good explanation for that discrepancy.
I know it is 2017, but that is a very long time in the gaming/mech keyboard market. I remember just about 10 years ago when mech keyboards were a niche for weirdos and a few others that swore by their Model M's - you can now buy these in Walmart. The point about the discrepancy is well taken, but I think the bigger point is that the rtings list has so many offerings with an order of magnitude lower latency that the methodology used in your link is not even viable.
Why is using a high speed camera and a logic analyzer less viable than measuring the end to end latency and trying to subtract the computer part of it? Or are you suggesting that a solenoid should be used to press the key instead of a finger?
I assume you were using C++? I'm not sure what you describe is possible these days due to UB. At the very least just casting bytes received over the wire to a type is UB, so you technically need a memcpy() and hope that the compiler optimises it out.
Yes, it was C++. I was unfamiliar with the acronym "UB" so did a Google search. Does it mean "Undefined Behavior"? If I remember correctly, primitive types other than strings are memcpy'd. GetStr basically returned a char* to the right place in the buffer.