Low Latency Optimization: Understanding Pages (Part 1) (hudsonrivertrading.com)
118 points by Jumptadel on Nov 29, 2022 | 57 comments



So many puzzling things here, from the brand new user account created to post this (a portmanteau of Jump Trading and Citadel), to the very minimal information presented (even my own article on it for my software covers about as much), to the people in the comments here conflating virtual memory with hard disk paging in spite of TFA, to red herring comments about RT scheduling, ...


Performance is a shiny toy that attracts and distracts many from understanding the fundamentals. Interestingly, understanding fundamentals is a pre-requisite to understanding performance!


I'd argue that understanding what happens on every single memory access qualifies as fundamental.


URL for your article?


https://www.chaoticafractals.com/manual/getting-started/enab...

Admittedly it's a bit terse, but at least it gives some steps you can use to enable it on Windows. It also benefits other software, such as 7zip. I need to update the page because these days the performance benefits are larger, due to the ever-widening divide between compute and memory speeds, CPUs having bigger large page TLBs, and additional optimisations...
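For anyone curious, on the Windows side it mostly boils down to one allocation flag once the "Lock pages in memory" privilege is granted; a minimal sketch (error handling kept short):

    // Minimal sketch: allocating one large page on Windows.
    // Assumes the account already holds SeLockMemoryPrivilege ("Lock pages in memory").
    #include <windows.h>
    #include <cstdio>

    int main() {
        // Large-page size is CPU/OS dependent (typically 2 MB on x86-64).
        SIZE_T large = GetLargePageMinimum();
        if (large == 0) {
            std::puts("large pages not supported");
            return 1;
        }
        void* p = VirtualAlloc(nullptr, large,
                               MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                               PAGE_READWRITE);
        if (!p) {
            // Fails with ERROR_PRIVILEGE_NOT_HELD if the privilege is missing.
            std::printf("VirtualAlloc failed: %lu\n", GetLastError());
            return 1;
        }
        // ... use p as backing storage for latency-sensitive data ...
        VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }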


I am biased, but I don't think it's fair to say that your article covers as much. There is more content in the post, written in a way that tries to be approachable. I'll certainly agree it's wordy, but it's hard to make content that's interesting to a wide audience.

Your article does cover how to use/enable them, while the post does not; that's meant for part 2.


Agreed, "about as much" is a bit unfair.


Low latency trading (sometimes referred to as HFT as well) focuses a lot on data locality: making sure the critical path, from market data coming in to a new order or cancel going out (tick-to-trade), operates in cache as much as possible and avoids memory accesses as much as possible. Put all the needed data together in as few cachelines as possible. To make sure that data actually is in cache when tick-to-trade runs, warming the caches between orders is sometimes employed too. This prevents those cachelines from being evicted, and involves sending fake orders that won't actually go out but that pull the needed cachelines back in before the real orders, so the caches are hot when they're needed.
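As a toy illustration (the struct and the warm function are made up, not any real system's API), the shape of it is roughly:

    #include <cstdint>

    // Keep everything the tick-to-trade path reads packed into a few cache lines.
    struct alignas(64) HotState {
        int64_t  best_bid;
        int64_t  best_ask;
        int32_t  position;
        int32_t  open_orders;
        uint64_t last_seq_no;
        // ... arranged so the hot fields share one or two 64-byte lines
    };

    // Between real orders, touch (or prefetch) the hot lines so they are
    // still resident when the next market-data tick arrives. Real setups
    // often run the full decision path in "dry run" mode instead.
    inline void warm(const HotState& hot) {
        __builtin_prefetch(&hot, /*rw=*/0, /*locality=*/3);
    }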


Are there any hardware/software systems/setups that can put guarantees around L1/L2/main memory usage? It seems like that would be a major boon, rather than just putting everything in tiny arrays and hoping for the best.


Intel's RDT stack has that in the form of CAT (Cache Allocation Technology) and MBA (Memory Bandwidth Allocation). Some more advanced HFT shops use that extensively

https://github.com/intel/intel-cmt-cat


CAT is indeed a good thing to look at, but there are some important caveats:

1) Unless you have a very small number of cores, it's not possible to reserve a cache slice for every program running (some slices are shared for things like DDIO).

2) It's still not possible to lock specific data in the cache, because any collision will replace the data.

3) The slices are kinda big, so it's hard to be properly fine-grained.

Basically, CAT just prevents other processes from stealing all the cache. It does that by reserving ways (as in the cache-associativity meaning).


1) Fully agreed, but most HFT apps (with the exception of really simple ones like market data feed handlers, which can easily fit their working set into L2 anyway) will be the only thing running on a host.

2) Mutual cache eviction by hash collisions is solvable with a number of tricks (although those methods are not easy and are often wasteful). The "DDIO slice" issue used to be a problem back when Intel used a ring topology for the LLC; these days it's built as a mesh, which minimizes that effect.

3) CAT doesn't recognize threads or processes. A COS (class of service) assigns cache ways per CPU core.

Recent micro-architectures like SKX or CLX have 11 ways of L3, and what often happens is that 1-2 ways get assigned to cpu0 for non-latency-critical workloads, while the rest are assigned to latency-sensitive, isolated cores, usually running a single user-space thread each.
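For concreteness, carving up the ways like that goes through the kernel's resctrl interface; a rough sketch (the masks and core list are illustrative and depend on your LLC geometry and isolation setup):

    // Rough sketch of reserving L3 ways via Linux resctrl (requires root and a
    // kernel with resctrl mounted at /sys/fs/resctrl). Masks are examples only.
    #include <fstream>
    #include <sys/stat.h>

    int main() {
        // Create a resource group for the latency-critical cores.
        mkdir("/sys/fs/resctrl/lowlat", 0755);

        // Give this group ways 2..10 of L3 on cache domain 0 (bitmask 0x7fc),
        // leaving ways 0..1 for everything else (set in the default group).
        std::ofstream("/sys/fs/resctrl/lowlat/schemata") << "L3:0=7fc\n";
        std::ofstream("/sys/fs/resctrl/schemata")        << "L3:0=003\n";

        // Associate the group with the isolated cores running the hot threads.
        std::ofstream("/sys/fs/resctrl/lowlat/cpus_list") << "2-7\n";
        return 0;
    }

The pqos tool from the intel-cmt-cat repo linked above automates essentially the same configuration.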


2) Agreed about the solvability and difficulty of avoiding cache collisions. DDIO must write its data somewhere in the L3 cache, and it ends up in the shareable slice. So either you're okay with sharing your cache, or you can't use those slices if you want exclusive access for your processes. That was my point.

3) CAT does not recognize processes, but resctrl does. Feels like we're kinda nitpicking here...

On your last point: agreed, that gives you 9ish usable slices, which is not very much depending on the number of cores. That was the point I was trying to make.


3) resctrl just uses COS under the hood. The same limitation applies.

> Yeah, that gives you 9ish usable slices which is not very much. Again that was my point

This is 9 ways that you can use for your latency-sensitive workloads exclusively. This is MUCH better than letting all that LLC get trashed by non-critical processes/threads. Typically after applying such partitioning we've observed a 15-20% speed up in our apps.

In my area shaving off a few micros that way is a huge deal and definitely worth spending a couple of minutes implementing.


Custom OS that runs in L3 only (cache as RAM).


I've been working for HFT firms since I moved to NYC over a decade ago. The article looks like a good summary of HugePage benefits (I'm a sysadmin, not a programmer, so I understand it on a topical level only).

What I do find fascinating is that HRT is actively blogging about this stuff. Ten years ago, everyone in the biz was super secretive and never made any public announcement about what we did - even stuff that I would take from HPE and RHEL low latency manuals (which were public knowledge). You never said anything publicly because protecting the "secret ingredients" of the trading system was paramount and any disclosure was one step towards breaking that barrier.

Now, I'm seeing HFT companies post articles like this and I'm thinking it has to be for recruiting. Why else would they do it?

Anyway, as a side note, if you liked this article, you'd also probably like this:

http://hackingnasdaq.blogspot.com/

It was one of my favorite reads because it was written by someone going through the journey of low-latency exploration, before everything was taken over by FPGAs.


> Now, I'm seeing HFT companies post articles like this and I'm thinking it has to be for recruiting. Why else would they do it?

I think you're right on the money here. All the secret sauce is in FPGA trading now, so there's nothing secret about sharing this info.


> everyone in the biz was super secretive [..] even stuff that I would take from [..] (which were public knowledge)

It's not just finance; other industries have this culture as well. I suspect it manifests in any environment that is perceived to be hyper-competitive: any perceived advantage, regardless of where it came from or how differentiated it is, gets held closely and over-weighted if proper metrics aren't in place to keep pushing for improvement.


This is ezpz optimization. Everyone and their dog knows about huge pages (or at least anyone I deem worthy).

It's for recruiting, clout, and also generally expressing the culture of the firm.


This, but unironically: I was at a top HFT firm, and admittedly a lot of the SWE talent we tried to get ended up going to HRT (comp being equal) because they present themselves as more tech-forward, due in part to articles like this. I'm guessing the author wrote this with good intentions of providing some insight into their process, but tech blogs are, at the end of the day, recruiting tools.

Also agreed that this is a pretty surface-level optimization; there's a reason why they are willing to talk about it. If you are doing true HFT with purely software trading, you will probably lose to more serious players using FPGAs, which, as OP mentioned, isn't exactly new.


I wasn't being ironic; I genuinely think this should be required knowledge for programmers who want to write native code.


If any of this is interesting to you, but you would like some deeper content to bite into, perhaps start here:

https://lmax-exchange.github.io/disruptor/disruptor.html

This technical paper sent me on a multi-year journey regarding one simple question: "If this stuff is fast enough for fintech, why can't we make everything work this way?" Handling millions of requests per second on 1 thread is well beyond the required performance envelope for most public "webscale" products.


The thing is, some tech companies do optimize to this degree (FB, Google), except it only really makes sense for companies that have:

A. Capital (human and money)

B. Bespoke internal tools from the server up

C. Insane scale

Google and meta put in lots of effort to have fast C++ code for their core infra, and they have several teams contributing to the LLVM project. Even places like Figma optimize to some extent because part of their business alpha is being performant and smooth. When you are at the scale of FB or G, optimizing the small things can lead to massive aggregate gains, and they have the eng talent/time to justify it.

At smaller companies though, as others have mentioned, iteration time and efficient dev spend are paramount. Optimizing for microsecond latency on your B2B SaaS product written in Python/React is most likely not part of your business case, and it is a waste of engineering time and effort to do this when you could be putting that time and money into new and better features. Most of the time, these very niche performance considerations are taken care of to a decent degree by off-the-shelf tools, maybe with a bit of tuning.


That's like asking why a motorbike isn't a car: they have the same goal, getting from A to B, but quite different considerations.

Most webscale products aren't written in performant languages and are, instead, optimized around fast feature generation and being easy (cheap) to hire for.

There's a reason laggy Electron apps are the norm: $$.


HFT architecture tends to be grossly inefficient from a throughput perspective, i.e. you'll usually be busy-waiting for new data.


This article is pretty thin but it's not wrong.

If you're interested in consistent low latency you do need to avoid TLB misses, and also page faults, cache contention, cache-coherency delays from the CC protocol (MOESI/MESI(F)) (i.e. making sure no other cores are accessing your memory), and mis-prediction, and that's after you have put all of your cores' threads into SCHED_FIFO. Using https://lttng.org/ can be really helpful in checking what's happening.
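For the scheduling/residency part (not the cache/TLB part), the bare-bones version looks roughly like this; the priority value is a placeholder and the calls need the appropriate capabilities/rlimits:

    // Minimal sketch: lock pages and move the current thread to SCHED_FIFO.
    // Requires CAP_SYS_NICE / CAP_IPC_LOCK (or root) and suitable rlimits.
    #include <sys/mman.h>
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    int main() {
        // Fault everything in now and keep it resident, so the hot path
        // never takes a page fault later.
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            std::perror("mlockall");

        // Real-time FIFO scheduling for this thread.
        sched_param sp{};
        sp.sched_priority = 80;  // placeholder; keep below kernel housekeeping threads
        int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
        if (rc != 0)
            std::fprintf(stderr, "pthread_setschedparam failed: %d\n", rc);

        // ... enter the busy-poll / tick-to-trade loop here ...
        return 0;
    }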


Again, I am biased. But the article explains memory translation in fairly simple terms and hammers on the main advantages of HPs (better use of the TLB, a simpler and smaller page table). It explains clearly how much memory the TLB can cover, what a page walk is and how much time it takes (before even loading the actual data), and the importance of the cache with respect to the page table. It shows some perf numbers for random vs. iterative memory accesses.
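To put rough numbers on the TLB-coverage point (entry counts are illustrative and vary by microarchitecture): a 1536-entry second-level TLB covers 1536 × 4 KB = 6 MB of address space with standard pages, versus 1536 × 2 MB = 3 GB with 2 MB huge pages, so a working set of a few hundred MB goes from near-constant page walks to essentially none.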

I don't think you'll find many articles that detail these points. Now, they might be trivial to you, and that's totally fair. But the goal is to address a wide audience. Additionally, the article doesn't address how to use HPs; that's for part 2.

Wrt the other points, I certainly agree they are important topics to explore. I would add that using perf is super important for easily accessing the perf counters.


Feels a bit blogspammy. Drepper's article is linked to for good reason


They don't want to release actually interesting optimization content but need something to fill the tech blog, maybe?

That said, huge pages are pretty basic table stakes for HFT software nowadays, so there's not much alpha to give away by really going into detail on them.


The OS page size is such a prevalent notion in software that it's shocking I was oblivious to it a year ago, before I got interested in database internals.

Something I found interesting: there's a recent presentation by Neumann about the Umbra DBMS where he fields a question about hugepages at the end. I recall him saying they don't use them.

I know Oracle and MySQL recommended Transparent Hugepages IIRC


Transparent huge pages are actually rarely recommended (I think that's a quite prevalent idiom in the database-systems world), because they can cause unexpected stalls and latency spikes at application runtime. This is possible because THP is an in-kernel implementation of system-wide huge-page support, and it's basically hard to get much control over the process.

OTOH, to make use of "normal" huge pages you have to allocate them up front, so it's not possible to run into THP-type issues.

That said, I doubt that enabling huge pages for complex database workloads that cannot run solely in memory will show any noticeable performance improvement. There's a lot of IO and memory R/W involved, and I think that is what overshadows the TLB-miss cost. What would be interesting, and what I haven't done so far, is to estimate the number of CPU cycles needed for a TLB miss.


You need to understand the bottleneck to determine whether or not huge pages are useful. THP requires the kernel to do additional work to give you the huge pages, which is usually more expensive than allocating huge pages through mmap directly.
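For reference, the explicit (non-THP) path is just an mmap flag; a minimal sketch, assuming the pool has already been reserved (e.g. via sysctl vm.nr_hugepages):

    // Minimal sketch: explicitly backed 2 MB huge pages via mmap(MAP_HUGETLB).
    // Assumes huge pages have been reserved up front, e.g. vm.nr_hugepages=512.
    #include <sys/mman.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        const size_t len = 16 * (2u << 20);  // 32 MB, a multiple of the 2 MB page size
        void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            std::perror("mmap(MAP_HUGETLB)");  // ENOMEM if the pool is empty
            return 1;
        }
        std::memset(p, 0, len);  // touch it up front; no faults later on the hot path
        munmap(p, len);
        return 0;
    }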


So useless.


I'm with you: zero lines of code presented for the most verbose description on the topic I've seen. No mention of other software that benefits from it, no actual latency graphs (it's somewhat implied by the throughput graph), only one CPU measured, ...


Lol, this is the very definition of interesting. They included source code here: https://github.com/hudson-trading/hrtbeat/blob/master/huge_m....


If you're doing truly low latency stuff you shouldn't be swapping at all; everything should be 100% resident in memory at all times. So "pages" are totally irrelevant to you. (You should also probably be using something like the PREEMPT_RT patchset, adjusting scheduling priorities, and trying your best to ensure that the CPU core(s) your app is running on aren't burdened by serving interrupts. Plus likely a lot of other stuff that I haven't touched on in this brief comment.)


Stock / near stock Linux is pretty close to fine for HFT.

You basically only interact with the kernel on init/shutdown or outside of the fast path, and do something like isolcpus to delegate the kernel and interrupt handling to some garbage cores and give you the rest to do what you want with.
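The app side of that split is just thread affinity onto the isolated cores; a minimal sketch (the core number is a placeholder and has to match your isolcpus= boot argument):

    // Sketch: pin the hot thread onto a core isolated with isolcpus=/nohz_full=.
    // Core 3 is a placeholder; match it to your boot parameters.
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);
        int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (rc != 0)
            std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        // ... busy-poll loop runs here, alone on core 3 ...
        return 0;
    }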


Your comment is correct but might cause readers to underestimate how annoying this tuning work is and how difficult it is to get everything into hugepages (executable memory and stack memory and shared libraries if applicable, not just specific heap allocations). We are trading a joke asset class on joke venues that have millisecond-scale jitter, so we can get away with using io_uring instead of kernel bypass networking.


The part about getting everything into hugepages sounds interesting. Any idea where I can find some resources on that? Most of what I was able to find only tells you how to do it for heap allocations.


It can be done by manually remapping the relevant sections upon application startup.

Perhaps [1] is a good resource to start with (page 7). Example code is here [2], and [3] runs some experiments with it.

[1] https://www.kernel.org/doc/ols/2006/ols2006v2-pages-83-90.pd...

[2] https://github.com/intel/iodlr/blob/master/large_page-c/exam...

[3] https://easyperf.net/blog/2022/09/01/Utilizing-Huge-Pages-Fo...
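For code specifically, [2] does the heavy lifting of copying .text into anonymous huge pages and remapping it in place. A much simpler (and weaker) variant is to just madvise the aligned interior of .text; whether the kernel will actually back file-mapped code with huge pages depends on kernel version/config, so treat this as a sketch:

    // Rough sketch: ask the kernel to back the executable's .text with THP.
    // Whether this works for file-backed code depends on the kernel (e.g.
    // CONFIG_READ_ONLY_THP_FOR_FS); [2] shows the heavier remapping approach.
    #include <sys/mman.h>
    #include <cstdint>
    #include <cstdio>

    // Symbols provided by the default GNU linker script (availability can vary).
    extern char __executable_start;
    extern char etext;

    int main() {
        constexpr uintptr_t kHuge = 2u << 20;  // 2 MB
        auto begin = reinterpret_cast<uintptr_t>(&__executable_start);
        auto end   = reinterpret_cast<uintptr_t>(&etext);

        // Only the 2 MB-aligned interior of the text mapping can get huge pages.
        uintptr_t lo = (begin + kHuge - 1) & ~(kHuge - 1);
        uintptr_t hi = end & ~(kHuge - 1);
        if (lo >= hi) {
            std::puts(".text smaller than one huge page; nothing to do");
            return 0;
        }
        if (madvise(reinterpret_cast<void*>(lo), hi - lo, MADV_HUGEPAGE) != 0)
            std::perror("madvise(MADV_HUGEPAGE)");
        return 0;
    }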


Thanks, cool stuff. Especially liblppreload.so described in [2] and [3]. I'll give it a try. Do you have any tips on how to achieve the same for the stack?


I haven't done this myself, but given that ELF does not have a dedicated .stack region, I guess you first have to find out what memory address range the process will use for the stack.

Finding out the beginning of the stack should be possible to deduce from memory addresses upon entering main().

Finding out the end of the stack depends on how big the stack is and whether it grows upwards or downwards. The former is dynamic in nature but usually capped at the system level on Linux; for the latter I'm not sure, but I think it grows downwards.


IMO the easiest way (but certainly not the only way) is to allocate a new stack and switch to it with makecontext(). The manpage has a full code example. You just need to change the stack alloc. This approach has a few drawbacks but is hard to beat in terms of simplicity.
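Rough shape of that approach, with the stack itself coming from explicit huge pages (error handling mostly elided; see the mmap example above for the reservation caveats):

    // Sketch: run the hot loop on a stack carved out of explicit huge pages,
    // switched to via getcontext/makecontext/swapcontext.
    #include <sys/mman.h>
    #include <ucontext.h>
    #include <cstdio>

    static ucontext_t g_main_ctx, g_hot_ctx;

    static void hot_loop() {
        // ... latency-critical work runs here, with its stack in huge pages ...
        // Returning from this function resumes g_main_ctx via uc_link.
    }

    int main() {
        const size_t stack_size = 4 * (2u << 20);  // 8 MB built from 2 MB pages
        void* stack = mmap(nullptr, stack_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (stack == MAP_FAILED) { std::perror("mmap"); return 1; }

        getcontext(&g_hot_ctx);
        g_hot_ctx.uc_stack.ss_sp   = stack;
        g_hot_ctx.uc_stack.ss_size = stack_size;
        g_hot_ctx.uc_link          = &g_main_ctx;  // where to go when hot_loop returns
        makecontext(&g_hot_ctx, hot_loop, 0);

        swapcontext(&g_main_ctx, &g_hot_ctx);  // jump onto the huge-page stack
        return 0;
    }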


Just wondering, how useful is it to get code and stack memory into hugepages? I thought you usually access them sequentially so it doesn't matter that much to put them in hugepages.


Code is not really accessed sequentially. Just imagine a function calling another function which sits in another translation unit. Depending on what the linker does, there's a good chance these won't sit near each other unless you explicitly optimized for that case. This is why source-code-level locality is also important: to minimize instruction-cache misses. It's also why you don't want to go mad making everything dynamically dispatched (e.g. virtual functions) unless you really need to.

EDIT: Putting the code segment into hugepages will relieve some of the pressure of virtual-address translation, which is otherwise higher with 4K pages. Whether this impacts runtime positively or stays neutral greatly depends on the workload, I think, and cannot be said up front.


Wrt code, look at the bench in the article. Even with sequential access, you can get a decent speedup using huge pages. But unless you have a good profile and are using PGO, code access likely won't be that sequential anyway. Like everything else, you'll need to measure to know exactly what benefit you might get. As a starting point, look at the iTLB load misses with perf stat -d.

Stack access is another story: it's usually local and sequential, so huge pages might not be that useful there.


The most benefit comes from the fact that you end up with a lot fewer TLB misses, since a single mapping covers a large chunk of memory. A predictable memory access pattern helps with cache misses thanks to hardware prefetch, but as far as I know, on most CPUs the hardware prefetcher won't run if it would cause a TLB miss.


Not really. Most HFTs would choose some sort of kernel bypass for critical path networking needs. (unless that's what you mean by near stock)


Nothing needs to be changed about your kernel to bypass it.

You can outright install the openonload drivers, preload to intercept epoll, and it literally just works.

Going to ef_vi can cut out ~1us, but that requires specifically targeting ef_vi, and has more operational / code-setup pain. Works on stock Linux all the same, though.


Wondering if there is any guide to programming and accurately measuring low latency stuff. I am working on some low level memory management code and would like to see the latency behavior, but I always get several microseconds of standard deviation (~10%) when I try to benchmark it.

I am pinning the cores and have disabled SMT and turbo boost, but haven't tried isolcpus because that requires rebooting the computer.


Really? With Google Bench or Criterion I've gotten pretty good resolution on microbenchmarks

The gold standard (in my experience) for latency measurement is setting up packet splitting, marking your outbound packets with some hash/id of the inbound packet, taking hardware timestamps of all of these on a dedicated host, and putting it all together after the fact. Ultimately, packet-in to packet-out is all that matters anyway.

You can actually get pretty solid internal timestamps (how long did I take to fully process event X) with TSC counters, but you then have a coordinated omission problem: https://www.programmingtalks.org/talk/how-not-to-measure-lat...
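The internal-timestamp part is usually just rdtscp around the handler; converting ticks to nanoseconds needs the (invariant) TSC frequency, which I'm leaving out of this sketch:

    // Sketch: cheap internal timestamps around an event handler using the TSC.
    // Converting ticks to ns requires the invariant-TSC frequency (not shown).
    #include <x86intrin.h>
    #include <cstdint>
    #include <cstdio>

    template <class F>
    uint64_t time_ticks(F&& handle_event) {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);   // waits for prior instructions to retire
        handle_event();
        uint64_t t1 = __rdtscp(&aux);
        return t1 - t0;
    }

    int main() {
        uint64_t ticks = time_ticks([] { /* process one synthetic event */ });
        std::printf("%llu ticks\n", static_cast<unsigned long long>(ticks));
        return 0;
    }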


Have you tried one of the links from the article: https://docs.kernel.org/admin-guide/kernel-per-CPU-kthreads....? Also try running "perf stat -d" on your run and see if anything pops out.


It looks like you don't have a good understanding of how virtual memory works and how in that space the hardware (TLB), the OS (page tables) and higher level software are intertwined.

Also, PREEMPT_RT is the worst option for low latency because it's about execution time guarantees and not speed specifically. If you're on PREEMPT_RT and give your critical thread highest prio, be prepared for some serious OS-level lock-ups.


> If you're on PREEMPT_RT and give your critical thread highest prio, be prepared for some serious OS-level lock-ups.

PREEMPT_RT includes priority inheritance, specifically to avoid this scenario. So your app should indeed be favored if you tune accordingly. What you also seem to be saying is that using PREEMPT_RT may lead to lower throughput, but that's not the same thing as latency.


Priority inheritance is a fundamental feature of all scheduling domains in the kernel and has nothing to do with the problem I described.


Pages still exist even if you disable swap. Maybe you could benefit from reading TFA?


I'm not entirely sure you understand the memory hierarchy: RAM is volatile, so paging has to happen, and keep in mind that reading memory from disk is a LOT slower than from main memory.



