Hacker News new | past | comments | ask | show | jobs | submit login
C++ patterns for low-latency applications including high-frequency trading (arxiv.org)
389 points by chris_overseas 6 months ago | hide | past | favorite | 231 comments



Fairly trivial base introduction to the subject.

In my experience teaching undergrads they mostly get this stuff already. Their CompArch class has taught them the basics of branch prediction, cache coherence, and instruction caches; the trivial elements of performance.

I'm somewhat surprised the piece doesn't deal at all with a classic performance killer, false sharing, although it seems mostly concerned with single-threaded latency. The total lack of "free" optimization tricks like fat LTO, PGO, or even the standardized hinting attributes ([[likely]], [[unlikely]]) for optimizing icache layout was also surprising.

Neither this piece, nor my undergraduates, deal with the more nitty-gritty elements of performance. These mostly get into the usage specifics of particular IO APIs, synchronization primitives, IPC mechanisms, and some of the more esoteric compiler builtins.

Besides all that, what the nascent low-latency programmer almost always lacks, and the hardest thing to instill in them, is a certain paranoia. A genuine fear, hate, and anger, towards unnecessary allocations, copies, and other performance killers. A creeping feeling that causes them to compulsively run the benchmarks through callgrind looking for calls into the object cache that miss and go to an allocator in the middle of the hot loop.

I think a formative moment for me was when I was writing a low-latency server and I realized that constructing a vector I/O operation ended up being overall slower than just copying the small objects I was dealing with into a contiguous buffer and performing a single write. There's no such thing as a free copy, and that includes fat pointers.


> Fairly trivial base introduction to the subject.

Might be, but low-latency C++, in spite of being a field on its own, is a desert of information.

The best resources available at the moment on low-latency C++ are a hand full of lectures from C++ conferences which left much to be desired.

Putting aside the temptation to grandstand, this document is an outstanding contribution to the field and perhaps the first authoritative reference on the subject. Vague claims that you can piece together similar info from other courses does not count as a contribution, and helps no one.


> this document is an outstanding contribution to the field and perhaps the first authoritative reference on the subject.

I don't know how you arrive at this conclusion. The document really is an introduction to the same basic performance techniques that have been covered over and over. Loop unrolling, inlining, and the other techniques have appeared in countless textbooks and blog posts already.

I was disappointed to read the paper because they spent so much time covering really basic micro techniques but then didn't cover any of the more complicated issues mentioned in the parent comment.

I don't understand why you'd think this is an "outstanding contribution to the field" when it's basically a recap of simple techniques that have been covered countless times in textbooks and other works already. This paper may seem profound if someone has never, ever read anything about performance optimization before, but it's likely mundane to anyone who has worked on performance before or even wondered what inlining or -Funroll-loops does while reading some other code.


> I don't know how you arrive at this conclusion.

I am well aware of this fact because I've researched the topic and I can state it without any degree if uncertainty. The only and resources there are scattered loose notes and presentations in conferences such as Timur Doulmer's work for C++ On Sea, and even so he's clear on how his work is mainly focused on real-time audio processing, which has different requirements than say HFT.

> The document really is an introduction to the same basic performance techniques that have been covered over and over. Loop unrolling, inlining, and the other techniques have appeared in countless textbooks and blog posts already.

Go ahead and cite the absolute best example you can come up from this incredible list of yours. The very top of your list will suffice to evaluate your whole list. I'll wait.


> Go ahead and cite the absolute best example you can come up from this incredible list of yours. The very top of your list will suffice to evaluate your whole list. I'll wait.

This could really do without the "incredible list of yours" and "I'll wait" snark. You don't need to be so condescending, but for the sake of helping others who are actually curious about the subject:

https://www.agner.org/optimize/ is a great place to start that I link to frequently.

It has significantly more breadth than the small collection of techniques in the paper linked in this post. It doesn't have the "HFT" buzzword attached, but then again the techniques in the linked paper above aren't unique to HFT either.

There are also several books on the subject, but given that the above link is a good place to start and the resources are freely available, that's where I'd recommend you look.


”covered countless times in textbooks”

What’s your favourite textbook on the subject?


it's an oldie but _Advanced Compiler Design and Implementation_ covers these concepts really well in a language-independent way.


I don't think so. For instance, one basic technique in low-latency C++ is running specific code periodically even with no-op parameters just to keep it warm in the cache.

It's one thing to claim that you can cherry pick some directly or indirectly pieces of information from an unrelated source. It's another entirely different thing to claim that unrelated source is an authoritative reference in an unrelated field, specially when no one recognizes it as such.


Interestingly you can find a lot of good tips in the optimization section of 'Realtime Collision Detection ' https://realtimecollisiondetection.net/


> A creeping feeling that causes them to compulsively run the benchmarks through callgrind

I'm happy I don't deal with such things these days, but I feel where the real paranoia always lies is the Heisenberg feeling of not even being able to even trust these things, the sneaky suspicion that the program is doing something different when I'm not measuring it.


Out of interest, do you have any literature that you'd recommend instead?


On the software side I don't think HFT is as special a space as this paper makes it out to be.[1] Each year at cppcon there's another half-dozen talks going in depth on different elements of performance that cover more ground collectively than any single paper will.

Similarly, there's an immense amount of formal literature and textbooks out of the game development space that can be very useful to newcomers looking for structural approaches to high performance compute and IO loops. Games care a lot about local and network latency, the problem spaces aren't that far apart (and writing games is a very fun way to learn).

I don't have specific recommendations for holistic introductions to the field. I learn new techniques primarily through building things, watching conference talks, reading source code of other low latency projects, and discussion with coworkers.

[1]: HFT is quite special on the hardware side, which is discussed in the paper. The NICs, network stacks, and extensive integration of FPGAs do heavily differentiate the industry and I don't want to insinuate otherwise.

You will not find a lot of SystemVerilog programmers at a typical video game studio.


As someone who does quant trading professionally and game development as a hobby, they both are performance sensitive, but they emphasize different kinds of performance. Trading is about minimizing latency while video games are about maximizing bandwidth.

Video games try to cram as much work as possible within about 16 milliseconds whereas for most trading algorithms 16 milliseconds is too slow to do anything, you want to process and produce a response to input within the span of microseconds, which is 3 orders of magnitude faster than a single frame in a video game.

The problem spaces really are quite distinct in this respect and a lot of techniques from one space don't really carry over to the other.


I'm curious how HFT relates to pro audio programming. The timescale is close to gaming (usually <10ms for desktop PC work), but you really care about the tail of the distribution. All the audio processing happens in a high-priority callback that's called for every audio frame and needs complete within the time budget, so you typically never allocate, do IO, or use any blocking synchronization primitives [1].

It's not hard real-time like you're going to crash your car, but if you miss your deadline it causes an unacceptable and audible glitch.

I've always been a bit surprised that Jane Street uses OCaml. I know they've put a lot of attention into the GC, but it still seems fundamentally indeterminate in a way that would make most audio devs nervous.

[1]: http://www.rossbencina.com/code/real-time-audio-programming-...


Audio has a lot of buffering behaviour that you wouldn't generally see in event-reactive HFT. Think of all the plugins that you know of that have non-zero latency, compressors with 'lookahead' etc. There are maybe some similarities where the logic is more complex (loop unrolling, SIMD and so on) but I feel like plugins are generally optimizing for throughput (CPU usage) and quality (oversampling etc) rather than purely latency in most cases.


I guess the difference I'm interested in is whether HFT tends to be "realtime", in the sense that there's a hard deadline that you need to hit every time.

Put another way, audio is operating on a much longer timescale, but cares a lot about worst-case latency (as well as throughput and quality). Is that true of HFT also, or are they more concerned with average latency? If computing a trade takes too long, can they just discard it, or is it catastrophic to miss that deadline?


I can't go into too much detail (I currently work at an HFT, but not on the HFT bits), but some of the FPGAs make decisions without even seeing a entire packet coming in on the wire. Latencies are more often measured in nanos than micros from what I've seen lately.


Tangentially, how do you like working for an HFT? I'm a low level software / FPGA developer and have thought about going into the HFT space. Is the work life balance as terrible as people say?


It's really going to come down to which firm you're at and who your manager is. I'm at a smaller shop currently, and it's great. I basically work 8-5 most days, sometimes a little later if processes don't come up cleanly for market open at 5P (CME basically operates 23 hours a day with different trading sessions).

There are firms that are more like sweatshops. I spent roughly a decade at Citadel; when I started, average tenure was thought to be about 18 months. Hours were brutal, but it largely depended on your manager. I had one (he was fired 6 months after I started) that insisted on 80 hour weeks, ass in seat, no work from home, not even on the weekends. The kicker was, my team didn't even have the work at the time to justify all of us pulling 80 hour weeks. I also spent months at times pulling 100 hour weeks. That is, if you slept, it was at your desk (there were cots/sleeping bags available, but I never saw anyone bother). Maybe go home every 2nd or 3rd day for a change of clothes (you could shower at the office). Then, I had other managers where it was more 8-5. But, I was always on call 24/7, even while on vacation. That was the truly rough part. I would say on a whole, the reputation of working for Citadel is worse than it deserves, but it really matters who your manager is. That said, not sure I know anyone left at Citadel anymore other than Ken G. (I've been gone over a decade now).

My best advice when you're interviewing people, ask pointed questions about work-life balance. Avoid asking overly broad questions like "How's the work/life balance?". Instead prefer: "When was the last time you took a 2 or 3 week vacation?"

edit: At my current role, I do sometimes work outside of 8-5 and on weekends, but its because I choose to, because I'm working on something interesting that excites me. I'm not expected to, and no one gives me a hard time if I don't.


Not OP but it varies wildly between companies. JS and Citadel are both top tier trading shops and they could not be more different when it come to wlb.

It's not hard to sniff out during the process though.


@vineyardlabs: JS - Jane Street


Ah, duh


Curious, what firm is "JS" in this context?


Yeah that's pretty much what I figured.


Yes, there is a lot of effort to understand, for example, the 99th percentile tick-to-trade latency as well as the average latency. For a lot of those worst-case timings there are things that are occasionally out of your control like the kernel, the network, the hardware. Then there are things always in your control like how you implement your software.

Most HFT algos are busy-spinning. You'll see a core pinned at 100% as it checks for input availability over and over. The vast majority of the time it is actually doing nothing, from an external point of view. When that tick appears that it needs to react to, it springs into action. It might react to some input in the range of a few hundred nano seconds but for literally billions of nanoseconds it's doing nothing at all.

Audio is processing a buffer of data (usually whatever your audio interface buffer is set to), then waits patiently until the next buffer arrives. There's a regular pulse to it. If your CPU isn't hurting, there's way more time between those buffers than are required to process them. The order of magnitude between work and 'not work' for HFT is huge compared to audio generally. HFT logic is simple, small and fast. Audio is hopefully complex if you paid good money for it. The best way to go fast is to do almost nothing.

In terms of the impact of not meeting a particular deadline, it's an opportunity cost in HFT. In audio, you get audio dropouts if you can't process it fast enough. In HFT, if your reaction time is slow, you're not at the front of the queue for that juicy trading opportunity you and everyone else just spotted, then you miss it entirely and someone else gets the prize. HFT is all about making thousands of small statistically favourable bets and being able to execute on them fast enough to realise the opportunity before the next fastest guy.


In audio, you can usually tolerate a few milliseconds of latency (more when mixing, less when recording). This means you can buffer your audio stream, for example in blocks of 64 or 128 samples.

As far as I know, these millisecond latencies are orders of magnitude higher than what would be tolerable in HFT. There the units are microseconds and nanoseconds.


As somebody who has done both, there's certainly a lot of similarities in how you write software.

I'm talking about real-time audio, though. There's also just audio playback (Pro Tools, e.g.) which tolerates much higher latency, but enables higher audio fidelity due to larger buffer sizes (e.g. in compressors to enable look-ahead) as well as more clever algorithms (which are often prone to latency spikes, such as a "standard" FFT, if you want a very simple example).

Both, finance and real-time audio, differ in their time scales, but more in the breadth of their time scales: real-time audio is often between 1–50 ms (yeah, we try to keep it below 10 ms, but that's tough to do in a complex setup. I've seen artists often compromise and increase buffer sizes (== latency) to accommodate for their complex virtual instruments and filters), finance in ns–ms (it's not HFT anymore at a certain point, sure, but those HFT architectures also have a layered response, with FPGAs being quick but simple and more complex strategies having higher latency; also, if you're e.g. asked for a quote on a bond, it's not a shame to take 500 ms to think about a price, because those trades are bigger but slower).

You also mentioned the consequences of missing a deadline. Well, indeed, I've seen this variance to be less of an issue in finance (we're only caring about xx µs, so I hesitate to call myself a HFT guy) than in real-time audio. A click in a live set is a no-go. A crash of your DAW is fatal, that's it with your live set. Nobody dies, but people will be very upset. I've seen real-time audio being obsessed with being reliable and keeping the variance in latency really small.

In finance, I've seen it more that people kind of accepted that variance was huge because there's some hidden allocations in a system. Hoping that those stop once the system is warmed up or that they rarely occur on a critical trade. As a side note, in finance I've even heard this well-known HFT firm: "Well, it's not the end of the world if your process crashes. If you can restart quick enough."

It's also that in finance, your data is more heterogenous: Real-time audio has audio (isochronous, homogenous buffer of floats), MIDI (asynchronous, formerly small events), parameter changes (asynchronous, often just a float) and config changes (changes the topology of your audio processing graph; enjoy to transition from A to B without causing an audible glitch). Finance has processing graphs that only process asynchronous data (heterogenous market data, telling you what's on the order book of an exchange; trade data for which trades you performed with your orders and quotes; symbol data when new securities appear or change (historically only with the start of the trading day); pre-computed steering data for algorithms and/or data from third-party data sources).

Thus, the processing graphs of both differ in the data that's transferred as well as in the structure and domain. In finance, there is no regular tick-based processing. You have to accommodate a very varying amount of data per time period, whereas in audio, it's a very regular pace and the amount of data is often the same (± activations of voices in instruments).

Real-time audio has plugins, though. And those, from experience, do whatever you can imagine. Plugins are great, from a creative perspective, but the craftsmanship is of very varying degree. I bet I'll even find a plugin that draws the UI from the real-time audio thread.

Any other experiences from the real-time realm?


Of course, time spans are different. But read the paper, it's dealing with incredibly basic techniques. When we're speaking on the level of "data should be in cache" and "cold code should be far away from the hot path" the fields are practically identical.

You're entirely correct that quant work is dealing with pure latency in situations where game engines might amortize throughput over multiple frames. But a question about "how do I get started in performance/latency work?" isn't usefully answered by "get a Hudson River Trading internship".


On that point we certainly agree.


"which is 3 orders of magnitude faster than a single frame in a video game."

You're right on the money. I worked in HFT for half a decade.

Back in the late 2000s you were on the cutting edge if you were writing really good C++ and had overclocked some CPUs to hell and back and then shoved them in a rack in a datacenter in new jersey. "To hell and back" means "they only crash every hour or two" (because while they're up, they're faster than your competition).

Around the mid 2010's it had moved entirely to FPGAs. Do something smart in software, then give the FPGA something dumb to wait for (e.g. use software to produce a model and wait for some kind of alpha to develop, then tell the FPGA "ok wait for X > Y and fire the order" was basically the name of the game. And it was often better to be faster than it was to be smarter, so really you just kept the FPGA as simple and fast as possible.

At that point, your hard realtime constraints for the language doing the smart stuff really dissolve. Compared to an FPGA getting an order out the door in some fraction of a mic, a CPU doing anything is slow. So you might as well use python, yknow?

People also play all kinds of different speed games. There's latency races around NJ and chicago where microseconds matter around price movements and C++ isn't really part of the picture in a competitive way anymore, but that's not to say someone isn't ekeing out a niche where they're doing something vaguely smarter faster than opponents. But these things, at scale, tend to be questions of - either your alpha is very short-lived (microseconds, and if it isn't organically microseconds, it will be competed until it is), or it is fundamentally longer-lived (seconds to minutes?) and you might as well use python and develop 100x faster.

The silly thing is, the really good trades that focus on short-term alphas (ms's and less) are generally obviously good trades and so you don't have to be super smart to realize it's a really fucking good trade, you had better be fast and go get it. So there's also this kind of built-in bias for being fast because if a trade is so marginally good that you needed a crazy smart and slow model to tell it apart from a bad trade, it's probably still a relatively crummy trade and all your smarts let you do is pick up a few extra pennies for your trouble.

I'll close by saying don't take my words as representative of the industry - the trades my firm specialized in was literally a dying breed and my understanding is that most of the big crazy-crazy-fast-microwave-network people have moved to doing order execution anyway, because most of the money in the crazy-crazy-fast game dried up as more firms switched to either DIYing the order execution or paying a former-HFT to do it for them.


Even if the low-latency logic moves to the FPGA, you still need your slow path to be reasonably fast, say about 100us, and absolutely never more than 1ms.

To my knowledge Python is not suitable, though I know some players embed scripting languages that are.


It depends entirely on what you've got on the FPGA and what's slow-path. My firm had order entry done exclusively by the FPGA and the non-FPGA code would have orders in the FPGA for sometimes tens of seconds before they would fire. The non-FPGA code was still C++, but the kind of everyday C++ where nobody has been wringing all the water out of it for speed gains. If we rewrote it all, it'd probably be python. But that's just us. Every firm has their own architecture and spin on how this all works. I'm sure somebody out there has C++ code that still has to be fast to talk to an FPGA. We didn't - you were either FPGA-fast or we didn't care.

edit: also, fwiw, "say about 100us, and absolutely never more than 1ms." things measured in that many microseconds are effectively not what my firm considered "low-latency". But like I said, there are lots of different speed games out there, so there's no one-size-fits-all here.


Your hardware is configured by your software. You pre-arm transactions to be released when a trigger occurs.

Releasing the transaction when the trigger occurs is the fast path, and what the hardware does. The slow path is what actually decides what should be armed and with what trigger.

If your software is not keeping up with information coming in, either you let that happen and trigger on stale data, or you automatically invalidate stale triggers and therefore never trigger.

100us is short enough to keep up with most markets, though obviously not fast enough to be competitive in the trigger to release loop.

But it's mostly an interesting magnitude to consider when comparing technologies; Python can't really do that scale. In practice a well-designed system in C++ doesn't struggle at all to stay well below that figure.


> obviously good trades

Are you able to expand with any examples of this?


The idea for aggressive orders in HFT is to buy before the market goes up, and sell before the market goes down.

HFT signals are about detecting those patterns right before they occur, but as early as the information is available globally. For example someone just bought huge amounts of X, driving the price up, and you know Y is positively correlated to X, so Y will go up as well, so you buy it before it does.

X and Y might be fungible, correlated mechanically or statistically.

You can also do it on Y=X as well; then what you need is the ability to statistically tell whether a trade will initiate/continue a trend or revert. You only need to be right most of the time.


There are really two schools of approach to this.

On one hand you have quantitatively driven strategies that try to predict either a price or direction based on various inputs. Here you’re mostly focused on predictive accuracy, and the challenge is in exiting the trade at the right time. This is where a lot of the speed comes into play (what is your predictive horizon, and can you act fast enough to take advantage of the current market prices?).

The other mode of trading tends to focus on structural mispricing in the market. An easy to understand example is an intermarket arbitrage trade where one market’s buyer or seller crosses prices with the opposite side of the market on another exchange. These events permit a trader to swoop in a capture the delta between the two order prices (provided they can get to both markets in time).

As easy opportunity has dried up (markets have grown more efficient as systems have gotten faster, and parties understanding of the market structure has improved) you see some blending of the two styles (this is where another commenter was talking about mixing a traditionally computed alpha with some hardware solution to generate the order), but both come with different technical challenges and performance requirements.


Isn't there challenges with slippage and managing the volumes while exiting? And isn't speed also about processing the data feed as fast as possible to time the exit decisions accurately?


Absolutely to both questions, with different answers depending on what style of strategy you’re running.

The market mechanics trades tend to have no recoverability if you miss the opportunity, so you’re often trading out in error and it’s a matter of trying to stem the loss on a position that you do not have an opinionated value signal on.

And there’s definitely an angle to inbound processing speed for both styles of trading, with differing levels of sensitivity depending on the time horizons you are attempting to predict or execute against. Using the example above again, detecting the arb opportunity and firing quickly is obviously paramount, but if you’re running a strategy where you have a 1 minute predictive time horizon sure, there’s some loss that can be associated with inefficiency if you aren’t moving quickly and someone else is firing at a similar signal, but generally speaking there’s enough differentiation in underlying alpha between you and any competitors that the sensitivity to absolute speed isn’t as prevalent as most people expect.

Basically it boils down to going fast enough to beat the competition, and if there isn’t any you have all the time in the world to make decisions and act on them.


This is true -- in the early 00's there was hardly any competition and we could take out both sides of a price cross with 100ms latency. Even after colocation we could still be competitive with over 4ms latency (plus the network). Trading technology has come a long way in 20 years.


If every other exchange is selling $AAPL at $100 and suddenly the top level of one exchange drops to $99, then if you just take out that order you basically gain a free dollar. Do this very fast and have pricing the product accurately and you will print tons of money.


It's not that simple. It could just be that exchange is the first one to drop to 99 but all others will as well.


Yeah I gather that is the expectation, but if you are the first to execute an order you will sell that order at the old 100 price before it lowers. You are fighting for making an order before the information spreads to the other bots. (Right?!)


That's different from what you hinted in your previous message.

"Making" has specific meaning as well, and I don't think it's what you're trying to say.


Latency isn't even as important in HFT as people claim. What's most important is deterministically staying into a reasonable enveloppe (even at the 99.9 percentile) to satisfy real-time requirements and not fall behind.

When it's really important, it's implemented in FPGA or with an ASIC.


This is close to the question I asked above, comparing to audio. If you're processing 32 samples at a time at 48kHz, you need to process a frame every 667us. If you only hit the deadline 99.9% of the time, that's a glitch more than once per second. You'd need a bunch more nines for an acceptable commercial product, so most audio devs just treat it as a hard realtime constraint.

I think the main split is whether you care more about average latency or worst-case latency (or some # of nines percentile).


In HFT you typically only measure latency for events that you did react on, not all events (at least, in the systems that I've built).

That's arguably a bit problematic as events you don't react to can cause you to fall behind, so it's useful to also track how far behind you are from the point where you start processing a packet (which requires hardware timestamping, unfortunately not present on many general-purpose NICs like what's in the cloud).

For a given market, you may have less than 1M samples a day (I've even seen some so-called HFT strategies or sub-strategies that only had 300 samples per day).

So overall, you'd have fewer events than 44.1kHz, and there are typically expected outliers at the beginning of the data before the system gets warmed up or recovers from bad initial parameters (which I suppose you could just ignore from your distribution, or you could try not to require a warmup).

But you're right, you probably want to look at quantiles closer to 1 as well. You're also looking at the max regardless.


What other low latency projects in public domain are worth learning from?


Whatever is relevant to the domain you're tackling. How you structurally approach latency in a web server is different than a 3D renderer, which is different than an audio stack, etc. All of those domains have great open source or source available code to learn from, but it wouldn't be useful for me to say "go read HAProxy" if you're never going to touch an HTTP packet. When I'm tackling a problem the first thing I do is research everything everyone else has done on the problem and read their code, benchmark it if possible, and steal all the good ideas.

The basic principles never change. Avoid copies, never allocate, keep the hot path local, and really, good codebases should be doing these things anyway. I don't code any differently when I'm writing a command line argument parser vs low-latency RPC servers. It's just a matter of how long I spend tweaking and trying to improve a specific section of code, how willing I am to throw out an entire abstraction I've been working on for perf.

In the domain of web stuff, effectively all the major load balancers are good to study. HAProxy, nginx, Envoy. Also anything antirez has ever touched.

Application servers are also interesting to study because there's many tricks to learn from a piece of software that needs to interface with something very slow but otherwise wants to be completely transparent in the call graph. FastWSGI is a good example.


I'd recommend: https://www.computerenhance.com

The author has a strong game development (engine and tooling) background and I have found it incredibly useful.

It also satisfies the requirement for "A genuine fear, hate, and anger, towards unnecessary allocations, copies, and other performance killers."


Very anecdotal but of the people I know in game studios who are tasked with engine work, and people who make a killing doing FPGA work for HFT firms, both camps shook their head at Casey’s HMH thing. Uniformly I do not know of a single professional developer of this sort of caliber who looked at HMH and thought it looks great. Quite the opposite. I think they found his approach and justifications unsound as it would instil awful practices of premature unfounded optimization and a disdain for normal library code in favour of hand-rolling your own half-baked implementations based on outdated trivia. I agree with them on the basis that HMH exposed an unprepared and inexperienced audience to something that has to be regarded with utmost care. For this, I refer to Jonathan Blow’s presentation of “a list is probably good enough” as an antidote. I think JB’s recommendations are more in line with actual practices, whereas Casey just raised red flags uniformly from here-and-now engine devs shipping multi platform games.


I have not watched Handmade Hero, so I can't offer any comments any their comments based on factual claims about the software. A few things though:

I am not linking to handmade hero, I'm linking to a separate project of his (his performance aware programming course) that is actually aimed at being an educational piece.

I lied, I will comment on one factual piece. "normal library code in favour of hand-rolling your own half-baked implementations based on outdated trivia." Yes, that is the whole point of the series (not the characterization as half-based and outdated trivia). The point was to show how to build a game (and its engine) from scratch to as big of a degree as possible. The avowing of library code is the point, to show what it takes to build engines rather than call a library so that the industry has more people who would even attempt doing such a thing.

Equally anecdotally, based on available online information, he worked for a long time on core technologies at RAD Game tools, a company which essentially every gamer, expect maybe pure mobile gamers, has purchased a game that used their technology. It may be possible that he acts (or acted in HMH) based on outdated trivia and favoured premature unfounded optimization, but I find it hard to believe based on the content of his I've engaged with and his track record.


Casey's a fundamentalist, in the religious extremist sense. Just as a theologian might have something to learn from the most extremely devout, so it is that there is valuable information in Casey's proselytizing.

But you should no more follow Casey in form and function than you would any other fundamentalist.


While gesturing at religion and saying he's a fundamentalist is an easy rhetorical technique, it doesn't add much. Yes good resources do exist, but they are few and far between. The number of new undergraduates from cs or software engineering who learned to think in the ways needed to make low latency applications are a rounding error based on my experience with them. Most forgot it, weren't taught it well, or never saw it at all. They are more worried about learning the syntax of the current hot framework rather than on anything even resembling "Avoid copies, never allocate, keep the hot path local".

The religious comparison is also a telling one given the state of the industry; for we aren't the theologian, we're the common folk looking for someone or something to follow in order to write better code.

Who are Casey's alternatives? Gesturing the cppcon as a learning resource has "read research papers to learn about a field" vibes. They can be highly informative and worth the effort, but not for beginners.

Who are Casey's contemporaries? If he's a fundamentalist then the atheists are nowhere to be seen by the beginners. Instead we have agile ceremony shamans, clean code bible missionaries, tdd doomsday cults, oop/rust/haskell/etc. zealots, a thriving megachurch industry of Lambda schools, and the mother of all cargo cults masquerading as web dev.


Agree there are not enough performance zealots. Modern architectures are almost magically fast if you get all of the tiny invisible things that kill your performance sorted out correctly. The fact that the knowledge of how to achieve this is closer to black magic lore than foundational educational knowledge is sad.

Performance is surprisingly often a feature. When things become instantly fast, instead of horribly slow, new opportunities are suddenly feasible for a lot of things.


> The fact that the knowledge of how to achieve this is closer to black magic lore than foundational educational knowledge is sad.

It is sad, doubly so since the things you need to get within an order of magnitude of what your hardware can do aren't the arcane assembly and "premature optimization" boogeymen students picture. Forget the 10x engineer, going from the 0.0001x engineer to the 0.1x would be a massive improvement and it's low hanging fruit.

They're simple things: have your code do less, learn what good performance should be, understand the hardware and internalize that the point is to program it, and use/build better tools (e.g. perhaps your programming model is fundamentally flawed if it essentially incentivizes things like the N+1 Selects Problem).

> Performance is surprisingly often a feature.

Performance is, unsurprisingly, often a missing feature in most software. Every day I need to boot up Teams I feel we stray further from Moore's light.


Totally agree on all points!


The whole point of Handmade Hero is to make everything from scratch as a learning exercise, so using third-party libraries would be counter-productive.


How I might approach it. Interested in feedback from people closer to the space.

First, split the load in to simple asset-specific data streams with a front-end FPGA for raw speed. Resist the temptation to actually execute here as the friction is too high for iteration, people, supply chain, etc. Input may be a FIX stream or similar, output is a series of asset-specific binary event streams along low-latency buses, split in to asset-specific segments of a scalable cluster of low-end MCUs. Second, get rid of the general purpose operating system assumption on your asset-specific MCU-based execution platform to enable faster turnaround using low-level code you can actually find people to write on hardware you can actually purchase. Third, profit? In such a setup you'd need to monitor the overall state with a general purpose OS based governor which could pause or change strategies by reprogramming the individual elements as required.

Just how low are the latencies involved? At a certain point you're better off paying to get the hardware closer to the core than bothering with engineering, right? I guess that heavily depends on the rules and available DCs / link infrastructure offered by the exchanges or pools in question. I would guess a number of profitable operations probably don't disclose which pools they connect to and make a business of front-running, regulations or terms of service be damned. In such cases, the relative network geographic latency between two points of execution is more powerful than the absolute latency to one.


The work I do is all in the single-to-low-double-digit microsecond range to give you an idea of timing constraints. I'm peripheral to HFT as a field though.

> First, split the load in to simple asset-specific data streams with a front-end FPGA for raw speed. Resist the temptation to actually execute here as the friction is too high for iteration, people, supply chain, etc.

This is largely incorrect, or more generously out-of-date, and it influences everything downstream of your explanation. Think of FPGAs as far more flexible GPUs and you're in the right arena. Input parsing and filtering are the obvious applications, but this is by no means the end state.

A wide variety of sanity checks and monitoring features are pushed to the FPGAs, fixed calculation tasks, and output generation. It is possible for the entire stack for some (or most, or all) transactions to be implemented at the FPGA layer. For such transactions the time magnitudes are mid-to-high triple digit nanoseconds. The stacks I've seen with my own two eyeballs talked to supervision algorithms over PCIe (which themselves must be not-slow, but not in the same realm as <10us work), but otherwise nothing crazy fancy. This is well covered in the older academic work on the subject [1], which is why I'm fairly certain its long out of date by now.

HRT has some public information on the pipeline they use for testing and verifying trading components implemented in HDL.[2] With the modern tooling, namely Verilator, development isn't significantly different than modern software development. If anything, SystemVerilog components are much easier to unit test than typical C++ code.

Beyond that it gets way too firm-specific to really comment on anything, and I'm certainly not the one to comment. There's maybe three dozen HFT firms in the entire United States? It's not a huge field with widely acknowledged industry norms.

[1]: https://ieeexplore.ieee.org/document/6299067

[2]: https://www.hudsonrivertrading.com/hrtbeat/verify-custom-har...


Doing anything other than the dumb trigger-release logic on FPGA is counter-productive IMHO.

You're already heavily constrained by placement in order to achieve the lowest latency; you can't afford to have logic that's too complicated.


> optimization tricks like fat LTO, PGO, or even the standardized hinting attributes ([[likely]], [[unlikely]]) for optimizing icache layout

If you do PGO, aren't hinting attributes counter-productive?

In fact, the common wisdom I mostly see compiler people express is that most of the time they're counter-productive even without PGO, and modern compilers trust their own analysis passes more than they trust these hints and will usually ignore them.

FWIW, the only times I've seen these hints in the wild were in places where the compiler could easily insert them, eg the null check after a malloc call.


I said "or even", if you're regularly using PGO they're irrelevant, but not everyone regularly uses PGO in a way that covers all their workloads.

The hinting attributes are exceptional for lone conditionals (not if/else trees) without obvious context to the compiler if it will frequently follow or skip the branch. Compilers are frequently conservative with such things and keep the code in the hot path.

The [[likely]] attribute then doesn't matter so much, but [[unlikely]] is absolutely respected and gets the code out of the hot path, especially with inlined into a large section. Godbolt is useful to verify this but obviously there's no substitute for benchmarking the performance impact.


> allocations, copies, and other performance killers

Please elaborate on those other performance killers.


My emphasis:

> The output of this test is a test statistic (t-statistic) and an associated p-value. The t-statistic, also known as the score, is the result of the unit-root test on the residuals. A more negative t-statistic suggests that the residuals are more likely to be stationary. The p-value provides a measure of the probability that the null hypothesis of the test (no cointegration) is true. The results of your test yielded a p-value of approximately 0.0149 and a t-statistic of -3.7684.

I think they used an LLM to write this bit.

It's also a really weird example. They look at correlation of once-a-day close prices over five years, and then write code to calculate the spread with 65 microsecond latency. That doesn't actually make any sense as something to do. And you wouldn't be calculating statistics on the spread in your inner loop. And 65 microseconds is far too slow for an inner loop. I suppose the point is just to exercise some optimisation techniques - but this is a rather unrepresentative thing to optimise!


I've got an implementation of a stock exchange that uses the LMAX disruptor pattern in C++ https://github.com/sneilan/stock-exchange

And a basic implementation of the LMAX disruptor as a couple C++ files https://github.com/sneilan/lmax-disruptor-tutorial

I've been looking to rebuild this in rust however. I reached the point where I implemented my own websocket protocol, authentication system, SSL etc. Then I realized that memory management and dependencies are a lot easier in rust. Especially for a one man software project.


It's not easy to get data structures like this right in C++. There are a couple of problems with your implementation of the queue. Memory accesses can be reordered by both the compiler and the CPU, so you should use std::atomic for your producer and consumer positions to get the barriers described in the original LMAX Disruptor paper. In the get method, you're returning a pointer to the element within the queue after bumping the consumer position (which frees the slot for the producer), so it can get overwritten while the user is accessing it. And then your producer and consumer positions will most likely end up in the same cache line, leading to false sharing.


>> In the get method, you're returning a pointer to the element within the queue after bumping the consumer position (which frees the slot for the producer), so it can get overwritten while the user is accessing it. And then your producer and consumer positions will most likely end up in the same cache line, leading to false sharing.

I did not realize this. Thank you so much for pointing this out. I'm going to take a look.

>> use std::atomic for your producer

Yes, it is hard to get these data structures right. I used Martin Fowler's description of the LMAX algorithm which did not mention atomic. https://martinfowler.com/articles/lmax.html I'll check out the paper.


I sincerely doubt the big HFT firms use anything of Fowler’s. Their optimizations are down to making their own hardware. LL is very context dependent and Amdahl’s law applies here.


I have absolutely no idea how this works in Java, but in C++, there are a few reasons you need std::atomic here:

1. You need to make sure that modifying the producer/consumer position is actually atomic. This may end up being the same instruction that the compiler would use for modifying a non-atomic variable, but that will depend on your target architecture and the size of the data type. Without std::atomic, it may also generate multiple instructions to implement that load/store or use an instruction which is non-atomic at the CPU level. See [1] for more information.

2. You're using positions for synchronization between the producer and consumer. When incrementing the reader position, you're basically freeing a slot for the producer, which means that you need to make sure all reads happen before you do it. When incrementing the producer position, you're indicating that the slot is ready to be consumed, so you need to make sure that all the stores to that slot happen before that. Things may go wrong here due to reordering by the compiler or by the CPU [2], so you need to instruct both that a certain memory ordering is required here. Reordering by the compiler can be prevented using a compiler-level memory barrier - asm volatile("" ::: "memory"). Depending on your CPU architecture, you may or may not need to add a memory barrier instruction as well to prevent reordering by the CPU at runtime. The good news is that std::atomic does all that for you if you pick the right memory ordering, and by default, it uses the strongest one (sequentially-consistent ordering). I think in this particular case you could relax the constraints a bit and use memory_order_acquire on the consumer side and memory_order_release on the producer side [3].

[1] https://preshing.com/20130618/atomic-vs-non-atomic-operation...

[2] https://en.wikipedia.org/wiki/Memory_ordering

[3] https://en.cppreference.com/w/cpp/atomic/memory_order


Fowler's implementation is written in Java which has a different memory model from C++. To see another example of Java memory model vs a different language, Jon Gjengset ports ConcurrentHashMap to Rust


Instead of this:

  T *item = &this->shared_mem_region
                 ->entities[this->shared_mem_region->consumer_position];
  this->shared_mem_region->consumer_position++;
  this->shared_mem_region->consumer_position %= this->slots;
you can do this.

  uint64_t mask = slot_count - 1;  // all 1's in binary

  item = &slots[ pos & mask ];

  pos ++;
i.e. you can replace a division / modulo with a bitwise AND, saving a bit of computation. This requires that the size of the ringbuffer is a power of two.

What's more, you get to use sequence numbers over the full range of e.g. uint64_t. Wraparound is automatic. You can easily subtract two sequence numbers, this will work without a problem even accounting for wraparound. And you won't have to deal with stupid problems like having to leave one empty slot in the buffer because you would otherwise not be able to discern a full buffer from an empty one.

Naturally, you'll still want to be careful that the window of "live" sequence numbers never exceeds the size of your ringbuffer "window".


I briefly looked over your stock exchange code:

- For memory management, consider switching to std::shared_ptr. It won't slow anything down and will put that concern to rest entirely.

- For sockets, there are FOSS libraries that will outperform your code and save you a ton of headaches dealing with caveats and annoyances. For example, your looping through FD_ISSET is slower than e.g. epoll or kqueue.

- For dependencies, C++ is definitely wilder than other languages. Dependencies are even harder to find than they are to manage. There's a lot of prospective library code, some of it hidden in little forgotten folds of the Internet. Finding it is basically a skill unto itself, one that can pay off handsomely.


When I did low latency everyone was offloading TCP to dedicated hardware.

They would shut down every single process on the server and bind the trading trading app to the CPUs during trading hours to ensure nothing interrupted.

Electrons travel slower than light so they would rent server space at the exchange so they had direct access to the exchange network and didn't have to transverse miles of cables to send their orders.

They would multicast their traffic and there were separate systems to receive the multicast, log packets, and write orders to to databases. There were redundant trading servers that would monitor the multicast traffic so that if they had to take over they would know all of the open positions and orders.

They did all of their testing against simulators - never against live data or even the exchange test systems. They had a petabyte of exchange data they could play back to verify their code worked and to see if tweaks to the algorithm yielding better or worse trading decisions over time.

A solid understanding of the underlying hardware was required, you would make sure network interfaces were arranged in a way they wouldn't cause contention on the PCI bus. You usually had separate interfaces for market data and orders.

All changes were done after exchange hours once trades had been submitted to the back office. The IT department was responsible for reimbursing traders for any losses caused by IT activity - there were shady traders who would look for IT problems and bank them up so they could blame a bad trade on them at some future time.


You don't need to shut down processes on the server. All you have to do is isolate CPU cores and move your workloads onto those cores. That's been a common practice in low latency networking for decades.


I'm not in HFT, but I wouldn't expect that to be enough.

Not only do you want to isolate cores, you want to isolate any shared cache between cores. You do not want your critical data ejected from the cache because a different core sharing the cache has decided it needs that cache. Which of course starts with knowing exactly what CPU you are using since different ones have different cache layouts.

You also don't want those other cores using up precious main memory or IO bandwidth at the moment you need it.


Just to add to your good points: since there's always a faster cache for your working set to not fit in, you can use memory streaming instructions to reduce cache pollution. Depending on the algorithm, increasing cache hit rates can give ridiculous speed-ups.


Correct. I was just pointing out to OP that moving processes is not worthwhile and isolation is how you'd do it


I’ve worked at a few firms and never heard of an IT budget for f-ups. Sounds like a toxic work environment.


Same. That sounds like a way to make that relationship between front office and back office as toxic and unproductive as possible.


Depends on how it's set up. You take a chunk of profits as well if things go well.


It's just business, no? Would you rather trade with a service that's liable for their mistakes or one that isn't?


Any good books/resources you can recommend to learn about the above architectures/techniques?


Some years ago I wrote a gist about HFT/HPC systems patterns (versus OPs C++ patterns) applied to dockerized Redis. Might be dated, but touches on core isolation/pinning, numa/cgroups, kernel bypass, with some links to go deeper. Nowadays I do it with Kubernetes and Nomad facilities, but same basic ideas:

https://gist.github.com/neomantra/3c9b89887d19be6fa5708bf401...


Nice; reminds me of the Redhat Performance Tuning and Real Time Low Latency Optimization guides.


A few episodes of Signals and Threads, a podcast from Jane Street, go into parts of it.


Thank You.


A great insightful comment, thank you!


I did not know std::shared_ptr would not slow things down. I've learned something new today! :)

Yes, I agree, epoll is a lot better than FD_ISSET.

Maybe I can keep moving with my C++ code but do people still trust C++ projects anymore? My ideal use case is a hobbyist who wants a toy stock exchange to run directly in AWS. I felt that C++ has a lot of bad publicity and if I want anyone to trust/try my code I would have to rebuild it in rust.


Here's how to maximize shared_ptr performance:

- In function signatures, use const references: foo(const std::shared_ptr<bar> &p). This will prevent unnecessary bumps of the refcount.

- If you have an inner loop copying a lot of pointers around, you can dereference the shared_ptr's to raw pointers. This is 100% safe provided that the shared_ptr continues to exist in the meantime. I would consider this an optimization and an edge case, though.

I would say people trust C++ projects at least as much as any other professional language - more so if you prove that you know what you're doing.


> In function signatures, use const references: foo(const std::shared_ptr<bar> &p). This will prevent unnecessary bumps of the refcount.

This advice doesn't seem quite right to me, and in my codebases I strictly forbid passing shared_ptr by const reference. If you don't need to share ownership of bar, then you do the following:

    foo(const bar&);
If you do need to share ownership of bar, then you do the following:

    foo(std::shared_ptr<bar>);
Why do we pass by value when sharing ownership? Because it allows for move semantics, so that you give the caller to option to make a copy, which bumps up the reference count, or to entirely avoid any copy whatsoever, which allows transfering ownership without bumping the reference count.

Having said that, shared_ptrs do have their uses but they are very very rare and almost all of our use cases do not expose shared_ptr's in the public API but rather use them as an implementation detail. We use them almost exclusively for things like immutable data structures, or copy-on-write semantics, or as a part of a lock-free data structure.


> If you don't need to share ownership of bar, then you do the following: > > foo(const bar&);

Exactly!

> This advice doesn't seem quite right to me, and in my codebases I strictly forbid passing shared_ptr by const reference

There is at least one use case I can think of: the function may copy the shared_ptr, but you want to avoid touching the reference count for the (frequent) case where it doesn't. This is an edge case, though, and personally I almost never do it.


Additionally: if you care about nullability semantics within your function, then you write foo(const bar*) and pass in bar_ptr.get(), and of course check that the value is != nullptr before dereferencing it.

Otherwise, I'm inclined to agree -- don't pass around smart pointers unless you're actually expressing ownership semantics. Atomics aren't free, ref-counting isn't free, but sometimes that genuinely is the correct abstraction for what you want to do.

One more point: shared ownership should not be used as a replacement for carefully considering your ownership model.

(For readers who might not be as familiar with ownership in the context of memory management: ownership is the notion that an object's lifetime is constrained to a given context (e.g. a scope or a different object -- for instance, a web server would typically own its listening sockets and any of its modules), and using that to provide guarantees that an object will be live in subcontexts. Exclusive ownership (often, in the form of unique_ptr) tends to make those guarantees easier to reason about, as shared ownership requires that you consider every live owning context in order to reason about when an object is destroyed. Circular reference? Congrats, you've introduced a memory leak; better break the cycle with weak_ptr.)


Let me preface by noting that I don't necessarily disagree with any part of what you wrote. However, there are design patterns that exceed the guardrails you're thinking of, and those are the patterns that benefit the most from shared_ptr.

Typically, they involve fine- to medium-grained objects, particularly those that have dynamic state (meaning by-value copies are not an option.)

An example might be a FlightAware-like system where each plane has a dynamically-updated position:

   class Plane { ... void UpdatePosition(const Pos &); Pos GetPosition() const; };
   using PlanePtr = std::shared_ptr<Plane>;
   using PlaneVec = std::vector<PlanePtr>;
   class Updater { ... PlaneVec mPlanes; };
   class View { ... PlaneVec mPlanes; };
Updater routinely calls UpdatePosition(), whereas View only calls const methods on Plane such as GetPosition(). There can be a View for, say, Delta flights and one for United. Let's simplify by assuming that planes are in the sky forever and don't get added or removed.

Destructing Updater doesn't affect Views and vice-versa. Everything is automatically thread-safe as long as the Pos accesses inside each Plane are thread-safe.

The key here is that Plane is fine-grained enough and inconsequential enough for lazy ownership to be ideal.


> Let's simplify by assuming that planes are in the sky forever and don't get added or removed.

If planes are around forever, wouldn't you be better off interning them? e.g. having a single global std::vector<Plane> (or std::array<Plane, N>) and passing around offsets in that array? And your PlaneVec would just be a glorified std::vector<size_t> (or int)? I don't see any value in maintaining a reference count if you're never intending to clean up these objects.

(The argument for using int here would be if you always have fewer than 2 billion planes, and so you can store a PlaneVec in less space. size_t is indistinguishable from Plane* in this context; you have the same amount of indirections either way.)

As I said, shared ownership has its uses, but most instances I've seen could have been replaced with a different model and would have been less painful to debug memory leaks and use-after-free.


> If planes are around forever, wouldn't you be better off interning them? e.g. having a single global std::vector<Plane> (or std::array<Plane, N>) and passing around offsets in that array? And your PlaneVec would just be a glorified std::vector<size_t> (or int)? I don't see any value in maintaining a reference count if you're never intending to clean up these objects.

Definitely not. :) I added that restriction just to sidestep the need to add and remove planes to/from existing Updaters and Views. Besides, you might have an Updater for US flights and one for European ones, and a View might span worldwide Delta flights or just US ones. Updaters and Views might come and go dynamically. The reference counting is key.

In this example, it doesn't matter when Planes get cleaned up, but it does matter that they do. A better alternative than the one you're proposing would be to just use vector<Plane *> and leak the planes, but that's crappy for different reasons (e.g. long-term memory usage and it would bar Plane from, say, RAII-owning its own log file.)


> This is an edge case, though, and personally I almost never do it.

My experience is the opposite. It has to do with the coarseness of the objects involved and the amount of inter-object links. We typically have a vast variety of classes. Many of them have shared_ptr members, resulting in rich graphs.

Many methods capture the shared_ptr parameters by copying them inside other objects. However, many methods just want to call a couple methods on the passed-in object, without capturing it. By standardizing on const shared_ptr &, all calls are alike, and callees can change over time (e.g. from not capturing to capturing.)


foo(const bar&) is ideal if you precisely wish to bar ownership. If (and in many kinds of projects, invariably it's more like when) you later decide to share ownership, or if nullptr is an option, then it's no good.

foo(std::shared_ptr<bar>) is copy-constructed as part of your function call (bumping the refcount) unless copy elision is both available and allowed. It's only ideal if you almost always pass newly instantiated objects.

Pass by const reference is the sweet spot. If you absolutely must minimize the refcount bumps, overload by const reference and by rvalue.

As for shared_ptrs being very rare, uh, no. We use them by the truckload. To each their own!


foo(const bar&) is ideal if you precisely wish to bar ownership.

What?

invariably it's more like when) you later decide to share ownership,

shared_ptr shouldn't even be necessary for keeping track of single threaded scope based ownership.

As for shared_ptrs being very rare, uh, no. We use them by the truckload. To each their own!

You might want to look into that, you shouldn't need to count references in single threaded scope based ownership. If you need something to last longer, make it's ownership higher scoped.

If something already works it works, but this is not necessary and is avoiding understanding the actual scope of variables.


> What?

The context you're missing is in your post's GP. That poster holds a std::shared_ptr<bar> (for whatever perfectly valid reason) and wishes to pass it to foo(). However, he declares it as foo(const bar &) because the callee does not need to share in the ownership of the shared_ptr. That means it gets called as foo(*p.get()).

> scope based ownership

That's the incorrect assumption that you came to. Obviously if bar is only used for stack-based scope variables, no shared_ptr is needed.


The context you're missing is in your post's GP.

I didn't miss any of that, that exactly what I thought it meant. I just don't know what you mean by precisely wish to bar ownership

That's the incorrect assumption that you came to.

Prove it. In a single threaded program with scope based ownership, that shared_ptr is going to be freed somewhere, so why not just have it exist in that scope as a unique_ptr so the ownership scope is clear?

Obviously if bar is only used for stack-based scope variables, no shared_ptr is needed.

Are you saying I'm wrong then saying the exact thing I just said?


Not sure if this is what GP was getting at, but in games a shared pointer (or similar custom resource handle) to something like a texture can be very useful - you want to share resources between objects in the scene so that you don't load a thing from disk when it's already in memory, but don't want to hold onto it forever to save RAM.


Very true, and in between threads and in other areas too, but this isn't scope based ownership, it's completely separate from the scope hierarchy of the program.

Scope based ownership would be a unique_ptr that frees the heap memory when it goes out of scope (if it isn't moved).


> I just don't know what you mean by precisely wish to bar ownership

If foo() doesn't need to share ownership now but may need to later, declaring it as foo(const std::shared_ptr<bar> &) instead of foo(const bar &) allows this change without revising any prototypes. However, if we precisely wish to prohibit shared ownership by foo(), we can do so by declaring it as foo(const bar &).

> Prove it. In a single threaded program with scope based ownership

The incorrect assumption that you came to is that we were talking about stack variables. But anyways, here's an example that's both scope-based and single threaded:

   std::vector<std::shared_ptr<bar> > v;
with an algorithm that selectively adds our pointer to the vector one or more times, and another that duplicates and removes elements according to some ongoing criteria. The object gets deleted when no pointer instances are left in the vector.

In practice most code is multithreaded (not that it matters) and most shared_ptrs are held inside other objects (not that it makes a difference either.)

> Are you saying I'm wrong then saying the exact thing I just said?

I'm saying you misunderstood and now I clarified again. I'm at the troll-detection threshold, so this is my last clarification. Take care!


If foo() doesn't need to share ownership now but may need to later,

This doesn't make sense. Why would a function in a more narrow scope need to take or share ownership? The variable passed will persist before and after the function call.

The incorrect assumption that you came to is that we were talking about stack variables.

I don't know what this means here. I said scope, you are saying stack. If lifetime is being dictated purely by scope, you don't need shared_ptr, you can just use a unique_ptr at the correct scope.

But anyways, here's an example that's both scope-based and single threaded: std::vector<std::shared_ptr<bar> > v;

This isn't scope based lifetime, this is lifetime based on some other criteria than going out of scope. I will show you with what you wrote:

and another that duplicates and removes elements according to some ongoing criteria.


> Why do we pass by value when sharing ownership? Because it allows for move semantics, so that you give the caller to option to make a copy, which bumps up the reference count, or to entirely avoid any copy whatsoever, which allows transfering ownership without bumping the reference count.

What if the callee sometimes wants to get a reference count and sometimes doesn't? In the latter case, your proposed signature forces an unnecessary pair of atomic reference count operations. But if you use

    foo(bar const&)
instead, then foo can't acquire a reference even when it wants to.

You could stick std::enable_shared_from_this` under `bar`. But `std::enable_shared_from_this` adds a machine word of memory, so you might not want to do that.

If you pass

    foo(shared_ptr<bar> const&)
you incur an extra pointer chase in the callee. Sure, you could write

    foo(bar const&, shared_ptr<bar> const&)
but then you burn an extra argument register. You can't win, can you?

You can win actually. Just use https://www.boost.org/doc/libs/1_85_0/libs/smart_ptr/doc/htm... or your favorite intrusive reference-counted smart pointer, not `std::shared_ptr`. If you do, you get the same capabilities that `std::enable_shared_from_this` grants but without any of the downsides.


> If you pass

   > foo(shared_ptr<bar> const&)
> you incur an extra pointer chase in the callee.

Actually this is usually not the case (assuming of course that caller is holding the original pointer in a shared_ptr<bar> which is the use case we were discussing.)

That shared_ptr<bar> instance is held either on the stack (with address FP + offset or SP + offset) or inside another object (typically 'this' + offset.) To call foo(const shared_ptr<bar> &), the compiler adds the base pointer and offset together, then passes the result of that addition - without dereferencing it.

So as it turns out, you may actually have one fewer pointer chase in the const shared_ptr<bar> & case. For example, if foo() is a virtual method and a specific implementation happens to ignore the parameter, neither the caller nor the callee ever dereference the pointer.

The one exception is if you've already resolved the underlying bar& for an unrelated reason in the caller.

I do agree that intrusive_ptr is nice (and we actually have one codebase that uses something very similar.) However shared_ptr has become the standard idiom, and boost can be a hard sell engineering-wise.


> That shared_ptr<bar> instance is held either on the stack (with address FP + offset or SP + offset) or inside another object (typically 'this' + offset.) To call foo(const shared_ptr<bar> &), the compiler adds the base pointer and offset together, then passes the result of that addition - without dereferencing it.

You're overthinking it. Think in cache lines. No matter what the instructions say, with all their fancy addressing modes, foo has to load two cache lines: one holding the shared_ptr and another holding the pointee data. If we instead passed bar* in a register, we'd need to grab only one cache line: the pointee's.

Sure. Maybe the caller already has a fully formed shared_ptr around somewhere but not in cache. Maybe foo often doesn't access the pointee. But how often does this situation arise?


> No matter what the instructions say, with all their fancy addressing modes, foo has to load two cache lines: one holding the shared_ptr and another holding the pointee data. If we instead passed bar* in a register, we'd need to grab only one cache line: the pointee's.

The cache doesn't make a difference here. To clarify: we start with a shared_ptr<bar> instance. It must get dereferenced to be used. It must either be dereferenced by the caller (the const bar & contract) or by the callee (the const shared_ptr<bar> & contract).

If the caller dereferences it, it might turn out to be superfluous if the callee wasn't actually going to use it. In this case const shared_ptr<bar> & is more efficient.

However, if the caller happened to have already dereferenced it prior to the call, one dereferencing would be avoided. In this case const bar & is more efficient.

> Sure. Maybe the caller already has a fully formed shared_ptr around somewhere but not in cache.

This is where our misunderstanding is. The caller starts out by only having a shared_ptr. Someone (caller or callee) has to dig the bar * out.


Sure. I've just only rarely encountered the situation you're describing. When it comes up, passing a shared_ptr by const reference is fine --- just so long as it's documented and the code explains "Look, I know this looks like a newbie error, but in context, it's actually an optimization. Here's why..." lest someone "optimize" away the deferred load win.


If you _maybe_ need to share ownership, the second is a little pessimistic - you always increase the ref count.


That is correct and I can see that being a justification for passing a const&, in fact the C++ Core Guidelines agree with you that such a scenario is the only acceptable reason for passing a shared_ptr by const&, although they encourage passing by value, or just passing a const T&.


Obviously the correct way is to accept a templated type and use perfect forwarding. /s /s /s /s /s


Bonus points for using static_cast<T &&>(p) instead of std::forward<T>(p) ;)


Reference counting definitely slows down tight loops if you are not careful.

The way to avoid that in low latency code is to break the abstraction and operate with the raw pointer in the few areas where this could be a bottleneck.

It is usually not a bottleneck if your code is decently exploiting ipc, an extra addition or subtraction easily gets executed while some other operation is waiting a cycle for some cpu resource.


That's not true. It does slow things down because it has an atomic access. How slow depends on the platform.

unique_ptr does not slow things down.


C++ might have a bad reputation, but in many fields the only alternative, in terms of ecosystem, tooling and tribal knowledge is C.

Between those two, I rather pick the "Typescript for C" one.


> I felt that C++ has a lot of bad publicity and if I want anyone to trust/try my code I would have to rebuild it in rust.

C++ gets bad publicity only from evangelists of the flavour of the month of self-described "successor of C++". They don't have a sales pitch beyond "C++ bad" and that's what they try to milk.

And yet the world runs on C++.


std::shared_ptr definitely slows things down. It's non-intrusive therefore requires a memory indirection.


The LMAX disruptor is a great data structure when you have threads bound to cores and most/all of them are uncontended. It has some terrible pathologies at the tails if you aren't using this pattern. Threads getting descheduled at bad times can really hurt.

SPSC ring buffers are going to be hard to beat for the system you are thinking of, and you can likely also implement work stealing using good old locks if you need it.


fun fact: the original LMAX was designed for and written in Java.

https://martinfowler.com/articles/lmax.html


I think it made sense at the time. From what I understand, you can make Java run as fast as C++ if you're careful with it and use JIT. However, I have never tried such a thing and my source is hearsay from friends who have worked in financial institutions. Then you get added benefit of the Java ecosystem.


From my hearsay, you absolutely can, given two things: fewer pointer-chasing data structures, and, most crucially, fewer or no allocations. Pre-allocate arrays of things you need, run ring buffers on them if you have to use a varying number of things.

A fun but practical approach which I again heard (second-hand) to be used, is just drowning your code in physical RAM, and switch the GC completely off. Have enough RAM to run a trading day, then reboot. The cost is trivial, compared to spending engineering hours on different approaches.


I worked in trading and we did the first one, in C++. We'd load all the instruments (stocks etc.) on startup to preallocate the "universe", and use ring buffers as queues. Instruments don't change during trading hours so restarting daily to pick up the new data is enough.

I saw a Java team do the second one in an order router (a system that connects to various exchanges and routes+translates orders for each exchange's requirements), and they wrote an interesting retrospective doc where they basically said it wasn't worth it - it caused a lot of trouble without giving a significant edge in performance. YMMV! That was around 2012.


I honestly don't know why the real time trading devs don't make their own OS/programming language for this. It's not like they don't have the money.


Making your own OS or language is hard, if you care about both performance and correctness.

But HFT people do a lot of OS-level hacking, squeezing the last bits of performance from the kernel where the kernel is needed, and/or running parts of the kernel stack (like networking) in userspace, avoiding context switches. CPU core pinning, offloading of everything possible to the network cards, etc, goes without saying.


That is basically what they do when deploying into FPGAs.


> fewer or no allocations

Quite fitting in a thread about HFT that has already referenced game development as a parallel universe of similar techniques.

In the feature phone era, mobile phone games were written in Java (well, a subset: Java EE). Practically all games followed the same memory management strategy. They allocated the memory they needed during the early initialisation, and then never again during the actual runtime of the game. That was the only way to retain even a semblance of performance.


So literally remove as much allocation as possible and then reboot each day. That makes a lot of sense to me!


All the java libs that you use can never do an allocation -- ever!. So you don't really get that much benefit to the java ecosystem (other than IDE's). You have to audit the code you use to make sure allocations never happen during the critical path.

Fifteen years ago, the USN's DDX software program learned this the hard way when they needed a hard real time requirement in the milliseconds.


In my experience: Allocation is OK, but garbage collection is bad.


I think back then GC defaulted running potentially at allocation.

shared_ptr is a much better solution for garbage collection. One I wish that java had implemented.


    > shared_ptr is a much better solution for garbage collection. One I wish that java had implemented.
I'm pretty sure there is a large body of (computer science) research work around the topic of deterministic (reference-counted) vs non-deterministic (non-reference counted) garbage collection. There are lots of pros and cons for both sides. Also, I find it interesting that Java, C#, and GoLang all chose non-deterministic GC, but Perl and Python use deterministic GC. (I'm not sure what Ruby does.)


I think CPython does reference counting for its memory management, it still has to run a GC since reference counting does not handle reference cycles.


I'll have to take a look at the code. Maybe I can integrate it into https://github.com/rburkholder/trade-frame


You say it's easier in Rust, but you still have a complete C++ implementation and not a Rust one. :)


It took me about a year to build all of this stuff in C++. So I imagine since I've had to learn rust, it will probably take me the same amount of time if I can save time with dependencies.


In my experience, once you know the problem really well, yes you're right.

If you are building a complex prototype from scratch, you'll usually spend more time fighting the Rust compiler than trying out alternate design decisions.


I do know the problem domain well. My second iteration in rust already almost has an order book implemented.

Writing code in rust however is very fun! (At least so far lol)


Better than C++? Have you experienced any problems in Rust that C++ instantly solved? Not because I'm pro/anti rust/C++ here, just curious.


Writing Rust is fun in the sense that solving a puzzle is fun.

Great when that's what you set out to do (rewrite it in Rust^TM) can get annoying otherwise (solving/researching a problem).


Linus said he wouldn't start Linux if Unix was ready at that time.


Don't know why so many downvotes, probably because I confused Unix with "GNU kernel".

Here's the source (Jan 29, 1992):

    If the GNU kernel had been ready last spring, I'd not have bothered to
    even start my project: the fact is that it wasn't and still isn't. Linux
    wins heavily on points of being available now.
https://groups.google.com/g/comp.os.minix/c/wlhw16QWltI/m/P8...


Minix....



This is an excellent slideshow.

The slide on measuring by having a fake server replaying order data, a second server calculating runtimes, the server under test, and a hardware switch to let you measure packet times is so delightfully hardcore.

I don't have any interest in working in finance, but it must be fun working on something so performance critical that buying a rack of hardware just for benchmarking is economically feasible.


Delightfully hardcore indeed!

But of course you don't have to buy a rack of servers for testing, you can rent it. Servers are a quickly depreciating asset, why invest in them?


Why would replaying data for testing be "Delightfully hardcore indeed!". That's how people program in general, they run the same data through their program.

Servers are a quickly depreciating asset, why invest in them?

I don't think they are a quickly depreciating asset compared to the price of renting, but you would want total control over them in this scenario anyway.


I thought that the hardcore part is taking the data from the switch to account for the network latency.


> Why would replaying data for testing be "Delightfully hardcore indeed!".

Replaying data isn't hardcore. Buying a dedicated server and running it through a dedicated switch just to gather precise timing info is.


You'd want it to be the exact same hardware as in production, for one.


The self driving space does this :)


I made a C++ logging library [1] that has many similarities to the LMAX disruptor. It appears to have found some use among the HFT community.

The original intent was to enable highly detailed logging without performance degradation for "post-mortem" debugging in production environments. I had coworkers who would refuse to include logging of certain important information for troubleshooting, because they were scared that it would impact performance. This put an end to that argument.

[1] https://github.com/mattiasflodin/reckless


> The noted efficiency in compile-time dispatch is due to decisions about function calls being made during the compilation phase. By bypassing the decision-making overhead present in runtime dispatch, programs can execute more swiftly, thus boosting performance.

The other benefit with compile-time dispatch is that when the compiler can statically determine which function is being called, it may be able to inline the called function's code directly at the callsite. That eliminates all of the function call overhead and may also enable further optimizations (dead code elimination, constant propagation, etc.).


> That eliminates all of the function call overhead and may also enable further optimizations (dead code elimination, constant propagation, etc.).

AFAIK, the speedup is almost never function call overhead. As you mention at the tail end, it's all about the compiler optimizations being able to see past the dynamic branch. Good JITs support polymorphic inlining. My (somewhat dated) experience for C++ is that PGO is the solve for this, but it's not widely used. Instead people tend to avoid dynamic dispatch altogether in performance sensitive code.

I think the more general moral of the story is to avoid all kinds of unnecessary dynamic branching in hot sections of code in any language unless you have strong/confidence your compiler/JIT is seeing through it.


The real performance depends on the runtime behavior of the machine as well as compiler optimizations. I thought this talk was very interesting on this subject.

https://youtu.be/i5MAXAxp_Tw


OTOH, it might be a net negative in latency if you're icache limited. Depends on the access pattern among other things, of course.


Yup, you always have to measure.

Though my impression is that compilers tend to be fairly conservative about inlining so that don't risk the inlining being a pessimization.


My experience has been that it's rather heuristic based. It's a clear win when you can immediately see far enough in advance to know that it'll also decrease the amount of generated code. You can spot trivial cases where this is true at the point of inlining. However, if you stopped there, you'd leave a ton of optimizations on the table. Further optimization (e.g. DCE) will often drastically reduce code size from the inlining, but it's hard to predict in relationship to a specific inlining decision.

So, statistics and heuristics.


In my experience it's the "force inline" directives that can make this terrible.

I had a coworker who loved "force inline". A symptom was stupidly long codegen times on MSVC.


Is there any good reason for high-frequency trading to exist? People often complain about bitcoin wasting energy, but oddly this gets a free pass despite this being a definite net negative to society as far as I can tell.


Bid/ask spreads are far narrower than they were previously. If you look at the profits of the HFT industry as a whole they aren't that large (low billions) and their dollar volume is in the trillions. Hard to argue that the industry is wildly prosocial but making spreads narrower does mean less money goes to middlemen.


> but making spreads narrower does mean less money goes to middlemen.

On individual trades. I would think you'd have to also argue that their high overall trading volume is somehow also a benefit to the broader market or at the very least that it does not outcompete the benefits of narrowing.


Someone is taking the other side of the trade. Presumably they have a reason for making that trade, so I don't see how higher volume makes people worse off. Probably some of those trades are wealth destroying (due to transaction costs) but it is destroying traders' and speculators' wealth, not some random person who can't afford it, since if you trade rarely your transaction costs are lower than before HFT became prominent.


Why do high spreads mean more money for middlemen?


When you buy stock, you generally by it from a "market maker", which is a middleman. When you sell, you sell to a market maker. Their business is to let you buy and sell when you want instead of waiting for a buyer/seller to show up. The spread is their profit source.


Wouldn’t the price movement overwhelm the spread if you sell more than a few days after you buy? I guess if spreads were huge it would matter more


The price movement does indeed overwhelm the spread. Half the time it goes up, half the time down.


>Half the time it goes up, half the time down.

This is not true and in fact when I hire quants or developers, I have to spend a surprising amount of time even teaching people with PhD's in statistics that the random nature of the stock market does not mean that it's a coin toss. It's surprising the number of people who should know better think trading is just about being right 51% of the time, or that typically stocks have a 50/50 chance of going up or down at any given moment...

What's closer to the truth is that stocks are actually quite predictable the overwhelming majority of the time, but a single mistake can end up costing you dearly. You can be right 95% of the time, and then lose everything you ever made in the remaining 5% of the time. A stock might go up 10 times in a row, and then on the 11th trial, it wipes out everything it made and then some.


Sorry about that, I didn't mean exactly half or anything like that.

Still, I don't feel that it's wrong: Even on rereading, my phrasing seems to address GP's misunderstanding in an immediately accessible way. Which is better, a complicated answer that leads to proper understanding (if you understand it) or a simple answer that solves the acute misunderstanding (and leads to a smaller misunderstanding)? Both kinds of answer have merit IMO.


Because it's not explicitly forbidden ?

I would argue that HFT is a rather small space, albeit pretty concentrated. It's several orders of magnitude smaller in terms of energy wasting than Bitcoin.

The only positive from HFT is liquidity and tighter spreads, but it also depends what people put into HFT definition. For example, Robinhood and free trading, probably wouldn't exist without it.

They are taking a part of the cake that previously went to brokers and banks. HFT is not in a business of screwing 'the little guy'.

From my perspective there is little to none negative to the society. If somebody is investing long term in the stock market, he couldn't care less about HFT.


Buying and quickly selling a stock requires a buyer and a seller, so it doesn't seem like an intrinsic property that real liquidity would be increased. Moreover, fast automated trading seems likely to increase volatility.

I tend to agree that for long term investment it probably doesn't make a huge difference except for possible cumulative effects of decreased liquidity, increased volatility, flash crashes, etc. Also possibly a loss of small investor confidence since the game seems even more rigged in a way that they cannot compete with.


Warren Buffett proposed that the stock market should be open less frequently, like once a quarter or similar. This would encourage long-term investing rather than reacting to speculation.

Regardless, there are no natural events that necessitate high-frequency trading. The underlying value of things rarely changes very quickly, and if it does it's not volatile, rather it's a firm transiton.


> Warren Buffett proposed that the stock market should be open less frequently

This will result in another market where deals will be made and then finalized on that 'official' when it opens. It's like with employee stock. You can sell it before you can...


> It's like with employee stock. You can sell it before you can...

I thought that this was explicitly forbidden in most SV employment contracts? "Thou shalt not offer your shares as collateral or (I forget the exact language) write or purchase any kind of derivative to hedge downside.' No buying PUTS! No selling CALLs! No stock-backed loans!

Or do people make secondary deals despite this, because, well, the company doesn't know, does it?


Not if we disallow it. We have laws in place to try and prevent a lot of natural actions of markets.


Instant biggest black market of all time, corruption skyrockets, only the richest people get fair deals on investments.


They'll get better than fair deals. They will almost exclusively own all the available pricing information.


AIUI the point of HFT isn't to trade frequently, but rather to change offer/bid prices in the smallest possible steps (small along both axes) while waiting for someone to accept the offer/bid.


Why shouldn’t people be allowed to do both? I don’t see much of an advantage to making the markets less agile.

It would be nice to be able to buy and sell stocks more than once a quarter, especially given plenty of events that do affect the perceived value of a company happen more frequently than that


The value of a stock usually doesn't change every millisecond. There doesn't seem to be a huge public need to pick winners and losers based on ping time.


most trading volume already happens at open or before close.


Seems like a testable hypothesis. Choose the four times a year that the stock is at a fair price, and buy when it goes below it and sell when it goes above it?


Non-bitcoin transactions are just a couple of entries in various databases. Mining bitcoin is intense number crunching.

HFT makes the financial markets a tiny bit more accurate by resolving inconsistencies (for example three pairs of currencies can get out of whack with one another) and obvious mispricings (for various definitions of "obvious")


That's a nice fairy tale that they probably tell their kids when asked, but what the profitable firms are doing at the cutting edge is inducing responses in the other guys' robots, in a phase that the antagonist controls, then trading against what they know is about to happen. It is literally market manipulation. A way to kill off this entire field of endeavor is to charge a tax on cancelled orders.


Exchanges put limits on cancellation rates as measured by a multiple of filled orders.

You have to allow strategies that can induce other strategies as by definition those also increase liquidity. It’s a difficult problem to explain to anyone except the very few people who can understand the extremely complicated feedback loops that result from bots fighting bots, however the regulators actually have access to counterparty tagged exchange event data and what is found when this is analyzed is that the net cost for liquidity that is extracted by market makers and short term traders from longer term participants is continuously decreasing not increasing. The system is becoming more and more efficient and not less. This is good for markets and the economy. There are also less people working in financial markets per capita than ever before, granted those who are might include a higher percentage of highly skilled and specialized and educated individuals than previously, which some might argue might be better used in some other industry, but that is rightfully not what the market wants.

There is absolutely no logical reason to “kill off this entire field” those sentiments are purely envy based reactions from those who don’t understand what is happening.


The other fundamental thing is that if you can’t sell something, you don’t really own it.

So wether pairs of people want to buy-and-hold or HFT their assets is really neither here nor there for uninvolved third parties.


Yep and what's worse is many hft firms aren't in the market-making business at all but actually REMOVE liquidity.


What's the mechanism for removing liquidity?


Adding liquidity means to place an order that sits on the book while removing liquidity means to execute against an order that's already resting on the book.


> What's the mechanism for removing liquidity?

Liquidity removal = market order

Liquidity providing = limit order (not immediately executable)


The order type is a red herring (you can take liquidity with a limit order).

The only difference between an order that removes liquidity and an order that adds liquidity is whether it executes upon arrival (removing liquidity) or rests on the order book on arrival (adding liquidity).


This comment is how to tell me you completely misunderstood CLOB market structure without telling me.


The exchanges already charge fees per order https://www.nyse.com/publicdocs/nyse/markets/nyse-arca/NYSE_...


If one outlawed / disincentivized hostile the bot behavior you described, there would still be the opportunity to do the good and profitable things I described.


These commenters have no clue what they are talking about. You don’t need to worry about hostile behaviors in the way they claim. These HN individuals are not domain educated and are spouting highly uninformed nonsense. Silly ideas like market orders take liquidity and limit orders provide it show an extremely rudimentary familiarity with only basic terminology of the field and with no understanding of the extremely dynamic and complex nature of modern capital markets. For instance they don’t realize that most longer term managers execute as much as 80% of their orders as passive limit orders, yet they are the ones that are actually liquidity demanders not suppliers.


> These HN individuals are not domain educated and are spouting highly uninformed nonsense.

Yeah, this is highly frustrating particularly for people like me who don't know anything about the domain i.e. HFT/Trading and would like to know more.

Can you recommend some good introductory books/resources ?


I might be biased for some almost personal reasons, however I think there is one book that anyone serious about the field should just read, period.

“Trades, Quotes and Prices”

Authors:

Jean-Philippe Bouchaud, Capital Fund Management, Paris, Julius Bonart, University College London, Jonathan Donier, Capital Fund Management, Martin Gould, CFM - Imperial Institute of Quantitative Finance.

I’m a practitioner and this book is foundational.


Thank You.

Just FYI for others; the full name of the book is "Trades, Quotes and Prices : Financial Markets Under the Microscope".

Looks like a very comprehensive book though somewhat advanced for me at my current level. Do you have a more beginner level book/resource recommendation to go with this where from i can get an overview and familiarize myself with the jargon?


How far have you tried to tell, and do you buy/sell stocks?

There's someone on the other side of your trade when you want to trade something. You're more likely than not choosing to interact with an HFT player at your price. If you're getting a better price, that's money that you get to keep.

*I'm going to disagree on "free pass" also. HFT is pretty often criticized here.


Generally gets attributed with:

- Increased liquidity. Ensures there's actually something to be traded available globally, and swiftly moves it to places where it's lacking.

- Tighter spreads, the difference between you buying and then selling again is lower. Which often is good for the "actual users" of the market.

- Global prices / less geographical differences in prices. Generally you can trust you get the right price no matter what venue you trade at, as any arbitrage opportunity has likely already been executed on.

- etc..


> Tighter spreads, the difference between you buying and then selling again is lower. Which often is good for the "actual users" of the market.

I just wanted to highlight this one in particular - the spread is tighter because HFTs eat the spread and reduce the error that market players can benefit from. The spread is disappearing because of rent-seeking from the HFTs.


What needs to be pointed out is that the rent and spread is the same thing in this equation. Before the rise of HFT actual people performed these functions, and then the spread/rent-seeking was a lot higher.


Yes: it put a much larger, more expensive, and less efficient part of wall street out of business. Before it was done with computers, it was done with lots of people doing the same job. What was that job? Well, if you want to go to a market and sell something, you generally would like for there to be someone there who's buying it. But it's not always the case that there's a buyer there right at the time who actually wants the item for their own use. The inverse is also true for a prospective buyer. Enter middle-men or market makers who just hang around the marketplace, learning roughly how much people will buy or sell a given good for, and buy it for slightly less than they can sell it later for. This is actually generally useful if you just want to buy or sell something.

Now, does this need to get towards milli-seconds or nano-seconds? No, this is just the equivalent of many of these middle-men racing to give you an offer. But it's (part of) how they compete with each other, and as they do so they squeeze the margins of the industry as a whole: In fact the profits of HFT firms have decreased as a percentage of the overall market and in absolute terms after the initial peak as they displaced the day traders doing the same thing.


> it's not always the case that there's a buyer there right at the time

This hits the nail on the head. For a trade to happen, counterparties need to meet in price and in time. A market place is useless if there is nobody around to buy or sell at the same time you do.

The core service market makers provide is not liquidity. It's immediacy: they offer (put up) liquidity in order to capture trades, but the value proposition for other traders - and the exchanges! - is that there is someone to take the other side of a trade when a non-MM entity wants to buy or sell instruments.

It took me a long time to understand what the difference is. And in order to make sure that there is sufficient liquidity in place, exchanges set up both contractual requirements and incentive structures for their market makers.


To go a step further, I don't think you should be required to talk to middlemen when buying stocks, yet here we are. The house wants its cut.


So who would sell it to you then? At any given time there's not very many actual natural buyer and sellers for a single security.


Central counterparty concept implemented by most exchanges is a valid service, as otherwise counterparty risk management would be a nightmare - an example of a useful “middleman”.


Yes, but you and I can't even talk to the exchange. We have to talk to one of many brokers that are allowed to talk to the exchange, and brokers do much more than just passing your orders to exchanges. For example, IIRC they can legally front-run your orders.


Brokers are required to give you NBBO or better. You could set up a direct connection to an exchange if you were willing to pay and were sufficiently motivated to deal with all of that


What exactly do you mean when you say this: > IIRC they can legally front-run your orders.

I doubt this is true, but there are definitions attached to front-running.


What I was actually thinking about was the infamous Payment for Order Flow, where your broker doesn't actually send your orders to an exchange, but gets paid to send them to a giant HFT firm which can do whatever it wants with no oversight, including using the information that you are about to place an order to inform their algorithms which may result in them putting out a similar order before yours.


FYI, this is false. These HFT firms are subject to the same regulation as other brokers, which is extensive. In particular, it's explicitly forbidden to front-run pending customer orders, or to share any information about them with the proprietary side of the firm. In my experience, these firms take these rules very seriously.


Under the carpet deals would make it less transparent. It would be hard to detect that top manager sold all his stock to competitors and now is making decisions in their favor.


It would be trivial (and vastly more equitable) to quantize trade times.


You mean like settling trades every 0.25 seconds or something like that? Wouldn't there be a queue of trades piling up every 0.25 seconds, incentivizing maximum speed anyway?


Usually the proposal is to randomize the processing of the queue. So, as long as your trades get in during the window, there's no advantage to getting in any earlier. In theory the window is so small as to not have any impact on liquidity but wide enough to basically shut down all HFT.


How would that work? Would you randomly select a single order posted, go by market participant (like randomly select some entities that posted a trade in this window), and would you allow prices to move during this window?


Just from an energy perspective, I'm pretty sure HFT uses many orders of magnitude less energy than bitcoin mining.


Why does it exist?

Because it's legal and profitable.

If you don't like it, try to convince regulators that it shouldn't be legal and provide a framework for criminalizing/fining it without unintended consequences, and then find a way to pay regulators more in bribes than the HFT shops do, even though their pockets are deeper than yours, and then things may change.

If that sounds impossible, that's another answer to your question


> a definite net negative to society as far as I can tell.

What did you examine to reach that conclusion? If high-frequency trading were positive for society, what would you expect to be different?

The reason for high-frequency trading to exist is that the sub-penny rule makes it illegal to compete on price so you have to compete on speed instead. Abolishing the sub-penny rule would mean high-frequency trading profits got competed-away to nothing, although frankly they're already pretty close. The whole industry is basically an irrelevant piece of plumbing anyway.


A net negative to society, but a positive for the wealthiest.


> A net negative to society, but a positive for the wealthiest.

No.

When your passive index fund manager rebalances every month because “NVDA is now overweighted in VTI, QQQ” the manager does not care about the bid/ask spread.

When VTI is $1.6 trillion, even a $0.01 difference in price translates to a loss $60 million for the passive 401k, IRA, investors.

HFT reduces the bid/ask spread, and “gives this $60 million back” to the passive investors for every $0.01 price difference, every month. Note that VTI mid price at time of writing is $272.49.


Just in case you are a pro developer, the whole thing is worth looking at:

https://github.com/CppCon/CppCon2017/tree/master/Presentatio...

and up


I am curious: why does this field use/used C++ instead of C for the logic? What benefits does C++ have over C in the domain? I am proficient in C/assembly but completely ignorant of the practices in HFT so please go easy on the explanations!


C++ is more expressive and allows much more abstraction than C. For a long time C++ was the only mainstream language that provided C-level performance as well as rich abstractions, which is why it became popular in fields that require complex domain modeling, like HFT, gamedev, and graphics. (Of course one can debate whether this expressivity is worth the enormous complexity of the language, but in practice people have empirically chosen C++.)


The structure and tone of this text reeks of LLM.


the irony being that if something should not be high frequency, it is trading


Anyone know of resources like this for Java?



Thanks! Looks like a great list of resources.


Very cool intro to the subject!




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: