It's not easy to get data structures like this right in C++. There are a few problems with your implementation of the queue.
Memory accesses can be reordered by both the compiler and the CPU, so you should use std::atomic for your producer and consumer positions to get the barriers described in the original LMAX Disruptor paper.
In the get method, you're returning a pointer to the element within the queue after bumping the consumer position (which frees the slot for the producer), so the element can get overwritten while the user is still accessing it.
And then your producer and consumer positions will most likely end up in the same cache line, leading to false sharing.
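For the atomics and false-sharing points, the position fields could look something like the sketch below, assuming 64-byte cache lines (the struct and member names are mine, not from your code):

    #include <atomic>
    #include <cstddef>

    // Each position is atomic and padded onto its own cache line, so the
    // producer and consumer cores don't invalidate each other's line on
    // every bump.
    struct Positions {
        alignas(64) std::atomic<std::size_t> producer{0};
        alignas(64) std::atomic<std::size_t> consumer{0};
    };
    static_assert(sizeof(Positions) >= 128, "positions on separate lines");

If you'd rather not hardcode 64, C++17 has std::hardware_destructive_interference_size in <new>.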
>> In the get method, you're returning a pointer to the element within the queue after bumping the consumer position (which frees the slot for the producer), so the element can get overwritten while the user is still accessing it. And then your producer and consumer positions will most likely end up in the same cache line, leading to false sharing.
I did not realize this. Thank you so much for pointing this out. I'm going to take a look.
>> use std::atomic for your producer
Yes, it is hard to get these data structures right. I used Martin Fowler's description of the LMAX algorithm, which did not mention atomics: https://martinfowler.com/articles/lmax.html
I'll check out the paper.
I sincerely doubt the big HFT firms use anything of Fowler’s. Their optimizations come down to making their own hardware. Low-latency work is very context dependent, and Amdahl’s law applies here.
I have absolutely no idea how this works in Java, but in C++, there are a few reasons you need std::atomic here:
1. You need to make sure that modifying the producer/consumer position is actually atomic. This may end up being the same instruction that the compiler would use for modifying a non-atomic variable, but that will depend on your target architecture and the size of the data type. Without std::atomic, it may also generate multiple instructions to implement that load/store or use an instruction which is non-atomic at the CPU level. See [1] for more information.
2. You're using the positions for synchronization between the producer and consumer. When incrementing the consumer position, you're essentially freeing a slot for the producer, so you need to make sure that all reads from that slot happen before the increment. When incrementing the producer position, you're indicating that the slot is ready to be consumed, so you need to make sure that all stores to that slot happen before it. Things may go wrong here due to reordering by the compiler or by the CPU [2], so you need to instruct both that a certain memory ordering is required. Reordering by the compiler can be prevented with a compiler-level memory barrier - asm volatile("" ::: "memory"). Depending on your CPU architecture, you may or may not also need a memory barrier instruction to prevent reordering by the CPU at runtime. The good news is that std::atomic does all of that for you if you pick the right memory ordering, and by default it uses the strongest one (sequentially consistent ordering). In this particular case, I think you could relax the constraints a bit and use memory_order_acquire on the consumer side and memory_order_release on the producer side [3]; see the sketch below.
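To make this concrete, here's a minimal single-producer/single-consumer sketch with those orderings. The class name and the power-of-two capacity are my assumptions, not something from the original code:

    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t N>
    class SpscQueue {
        static_assert((N & (N - 1)) == 0, "N must be a power of two");
        T slots_[N];
        alignas(64) std::atomic<std::size_t> head_{0};  // consumer position
        alignas(64) std::atomic<std::size_t> tail_{0};  // producer position

    public:
        bool push(const T& v) {
            std::size_t tail = tail_.load(std::memory_order_relaxed);
            // acquire: observe the consumer's reads before reusing its freed slot
            if (tail - head_.load(std::memory_order_acquire) == N)
                return false;  // full
            slots_[tail & (N - 1)] = v;
            // release: the store to the slot happens-before the position bump
            tail_.store(tail + 1, std::memory_order_release);
            return true;
        }

        bool pop(T& out) {
            std::size_t head = head_.load(std::memory_order_relaxed);
            // acquire: observe the producer's store to the slot before reading it
            if (head == tail_.load(std::memory_order_acquire))
                return false;  // empty
            out = slots_[head & (N - 1)];  // copy out *before* freeing the slot
            // release: the read of the slot happens-before the position bump
            head_.store(head + 1, std::memory_order_release);
            return true;
        }
    };

Note that pop copies the element out before bumping the consumer position, which also avoids the dangling-pointer problem with your get method.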
Fowler's implementation is written in Java, which has a different memory model from C++. For another example of the Java memory model contrasted with a different language's, see Jon Gjengset's port of ConcurrentHashMap to Rust.
Most of the benefit comes from the fact that you end up with far fewer TLB misses, since a single mapping covers a large chunk of memory. A predictable memory access pattern helps with cache misses thanks to hardware prefetching, but as far as I know, on most CPUs the hardware prefetcher won't run ahead if it would cause a TLB miss.
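As a concrete sketch, one way to get a buffer onto explicit 2 MiB hugepages on Linux is an anonymous MAP_HUGETLB mapping; this assumes hugepages have been reserved beforehand (e.g. via /proc/sys/vm/nr_hugepages):

    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
        // One 2 MiB hugepage; requires reserved hugepages, e.g.
        //   echo 128 | sudo tee /proc/sys/vm/nr_hugepages
        const std::size_t len = 2 * 1024 * 1024;
        void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) { std::perror("mmap(MAP_HUGETLB)"); return 1; }
        // ... place the ring buffer here: one TLB entry now covers 2 MiB ...
        munmap(p, len);
        return 0;
    }

If you can't reserve explicit hugepages, madvise(MADV_HUGEPAGE) on a regular mapping asks the kernel for transparent hugepages instead.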
The part about getting everything into hugepages sounds interesting. Any idea where I can find some resources on that? Most of what I was able to find only tells you how to do that for heap allocations.
Thanks, cool stuff. Especially liblppreload.so described in [2] and [3]. I'll give it a try. Do you have any tips on how to achieve the same for the stack?
I haven't done this myself, but given that ELF does not have a dedicated .stack section, I guess you first have to find out what memory address range will be used for the stack.
The beginning of the stack should be deducible from the addresses of local variables upon entering main().
Finding the end of the stack depends on how big the stack is and whether it grows upwards or downwards. The former is dynamic in nature but usually configured at the system level on Linux; as for the latter, I am not sure, but I think it grows downwards (it does on x86).
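One shortcut that should work on Linux: /proc/self/maps labels the main thread's stack mapping as [stack], which gives you both ends of the range without any guesswork:

    #include <cstdio>
    #include <cstring>

    int main() {
        std::FILE* f = std::fopen("/proc/self/maps", "r");
        if (!f) return 1;
        char line[512];
        while (std::fgets(line, sizeof line, f)) {
            // The main thread's stack mapping is tagged "[stack]", e.g.
            //   7ffc1a2b3000-7ffc1a2d4000 rw-p 00000000 00:00 0  [stack]
            if (std::strstr(line, "[stack]"))
                std::fputs(line, stdout);
        }
        std::fclose(f);
        return 0;
    }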
IMO the easiest way (but certainly not the only one) is to allocate a new stack and switch to it with makecontext(). The manpage has a full code example; you just need to change the stack allocation. This approach has a few drawbacks but is hard to beat in terms of simplicity.
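Roughly like this, combining the makecontext(3) manpage example with a hugepage-backed stack (a sketch, assuming 2 MiB hugepages are reserved; sizes and names are mine, and error handling is mostly trimmed):

    #include <ucontext.h>
    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdio>

    static ucontext_t main_ctx, worker_ctx;

    static void worker() {
        // Everything called from here runs on the hugepage-backed stack.
        std::puts("running on the new stack");
    }   // returning switches back to uc_link (main_ctx)

    int main() {
        const std::size_t stack_size = 8 * 1024 * 1024;  // multiple of 2 MiB
        void* stack = mmap(nullptr, stack_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (stack == MAP_FAILED) { std::perror("mmap"); return 1; }

        getcontext(&worker_ctx);         // initialize from the current context
        worker_ctx.uc_stack.ss_sp = stack;
        worker_ctx.uc_stack.ss_size = stack_size;
        worker_ctx.uc_link = &main_ctx;  // return here when worker() finishes
        makecontext(&worker_ctx, worker, 0);

        swapcontext(&main_ctx, &worker_ctx);  // run worker() on the new stack
        return 0;
    }

Anything called from worker() then grows its stack inside the hugepage mapping.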