Some light reading on lock-free programming (msdn.com)
42 points by ingve on Nov 27, 2014 | 7 comments



As I understand it, these techniques are oriented towards producer-consumer queues with exactly one producer and exactly one consumer. My software does IO in this manner. What is confusing is that these schemes don't scale beyond two cores, for the obvious reason that there is only one producer and one consumer.
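For reference, here's a minimal sketch of the kind of 1-producer/1-consumer ring these articles describe, using C++11 atomics (the class name and capacity are mine, just for illustration):

    #include <atomic>
    #include <cstddef>

    // One thread calls push(), a different single thread calls pop().
    template <typename T, size_t N>   // N must be a power of two
    class SpscQueue {
        T buf[N];
        std::atomic<size_t> head{0};  // advanced only by the consumer
        std::atomic<size_t> tail{0};  // advanced only by the producer
    public:
        bool push(const T& v) {       // producer side
            size_t t = tail.load(std::memory_order_relaxed);
            if (t - head.load(std::memory_order_acquire) == N)
                return false;         // full
            buf[t % N] = v;
            tail.store(t + 1, std::memory_order_release);  // publish the slot
            return true;
        }
        bool pop(T& out) {            // consumer side
            size_t h = head.load(std::memory_order_relaxed);
            if (tail.load(std::memory_order_acquire) == h)
                return false;         // empty
            out = buf[h % N];
            head.store(h + 1, std::memory_order_release);  // free the slot
            return true;
        }
    };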

I understand that I can shave off the few milliseconds of latency sometimes incurred when acquiring a contended std::mutex, and perhaps this is valuable for high-frequency trading, but I am not sure whether it will improve performance in any palpable way.

Does anybody have any good use cases for lock-free programming where the goal is to scale beyond a single producer-consumer scheme?


Check out this paper on the LMAX Disruptor for an example of a lockless pattern.

http://lmax-exchange.github.io/disruptor/files/Disruptor-1.0...

I posted it a while ago, but it didn't get any traction. I use it in a system with multiple consumers; I wouldn't call it HFT, but it's as close as most trading systems will get. :)

We use it to pass quotes to each algo that requires them, and to pass orders from high-level algos down to an execution layer that determines how to send each algo's order to market.
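If it helps, the core of the pattern can be sketched in a few lines. This is a toy single-producer, two-consumer sequence ring in the spirit of the paper, not the Disruptor's actual API (all the names here are made up):

    #include <algorithm>
    #include <atomic>
    #include <cstdint>

    constexpr size_t RING = 1024;               // power of two
    long slots[RING];                           // event storage
    std::atomic<int64_t> published{-1};         // highest sequence written
    std::atomic<int64_t> cursor[2] = {-1, -1};  // one cursor per consumer

    // Producer: claim the next sequence, never lapping the slowest consumer.
    void publish(long value) {
        int64_t seq = published.load(std::memory_order_relaxed) + 1;
        while (seq - std::min(cursor[0].load(std::memory_order_acquire),
                              cursor[1].load(std::memory_order_acquire))
               > (int64_t)RING) { /* spin: ring is full */ }
        slots[seq % RING] = value;
        published.store(seq, std::memory_order_release);
    }

    // Each consumer chases 'published' with its own cursor; no locks, and
    // the consumers never contend with each other.
    long consume(int id) {
        int64_t next = cursor[id].load(std::memory_order_relaxed) + 1;
        while (published.load(std::memory_order_acquire) < next) { /* spin */ }
        long v = slots[next % RING];
        cursor[id].store(next, std::memory_order_release);
        return v;
    }

The key idea is that the only shared state is a set of monotonically increasing sequence numbers, each written by exactly one thread.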


I once heard about a very large scale multiprocessor configuration with shared, cached virtual memory and/or disk storage. I don't know the specific terminology, but it was a cluster of DEC Alphas sharing a vast mapped memory space: not pure disk caching, but something more like object caching. As an analogy, imagine running all of Facebook across multiple servers with shared storage, where the cached objects are elements of Facebook pages, such as you and I sharing the same photos and simultaneously updating/tagging them. It is critical that we always retrieve consistent states, but it is not critical that my tagging happens strictly before your tagging.

Locks were not available, so locking was simply not an option. Instead they chose a lockless scheme: an "ethernet with memory-space collisions" analog of the normal "ethernet with wire-timing collisions".

The implementation allowed concurrent overlapping readers and writers into shared memory, but reading was done backward through memory while writing was done forward. Readers could thus tell (via a checksum) whether their read had been corrupted by an overlapping write, and if so cancel and retry. As long as there aren't a large number of collisions, it's a very fast system with no waiting. Again, imagine the intensity and criticality of Facebook object sharing; on a human timescale there might be a lot of concurrency and benefit to caching, but at processor speed there will not be much read/write contention.

There was more to the scheme which I don't exactly remember. For example, you can see that a writer could also tell whether its write was corrupted by a concurrent write by following it with a reread, but that's not what they did. I think there were two pools of shared storage, memory chunks and tables of handles to chunks, and the writers coordinated their lockless shared writes via the table of handles to the chunks, using the same forward/backward contention (or really corruption) detection.
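To make the read-backward/write-forward trick concrete, here is my reconstruction of the core idea. The record layout, FNV checksum, and retry bound are all assumptions, and real code would need atomic byte accesses to avoid formal data races:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    constexpr size_t LEN = 64;
    uint8_t record[LEN];                 // shared record, no lock protects it
    std::atomic<uint64_t> stored_sum{0}; // checksum published by the writer

    uint64_t checksum(const uint8_t* p, size_t n) {
        uint64_t h = 1469598103934665603ull;   // FNV-1a, for illustration
        for (size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 1099511628211ull; }
        return h;
    }

    // Writer: store the bytes front-to-back, then publish their checksum.
    void write_record(const uint8_t* src) {
        for (size_t i = 0; i < LEN; ++i)       // forward through memory
            record[i] = src[i];
        stored_sum.store(checksum(record, LEN), std::memory_order_release);
    }

    // Reader: copy back-to-front, then verify. A reader and a writer moving
    // in opposite directions cross at most once, so any overlap leaves a
    // torn mix of old and new bytes that fails the checksum, and we retry.
    bool read_record(uint8_t* dst) {
        for (int attempt = 0; attempt < 100; ++attempt) {
            uint64_t expect = stored_sum.load(std::memory_order_acquire);
            for (size_t i = LEN; i-- > 0; )    // backward through memory
                dst[i] = record[i];
            if (checksum(dst, LEN) == expect)
                return true;                   // consistent snapshot
        }
        return false;  // persistent contention; caller decides what to do
    }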


Lock-free techniques can actually hurt performance compared to mutexes. Their advantage is predictability. A lock will often be fast but will sometimes be really slow, when it has to wait for some other thread. A lock-free algorithm runs at about the same speed no matter what. This is important for realtime tasks such as audio recording and playback.


My understanding was that a lock-free algorithm could still wait for an unbounded time, and that what you're referring to are called wait-free algorithms.


You're right, but much depends on how much performance you need and how critical your realtime performance is. A lock-free algorithm may retry, but it will likely only retry once or twice. Your running time on any given operation may occasionally double or triple, but it won't hit you with six orders of magnitude like a mutex might if it's contended and the thread that holds the lock takes a page fault or something.

I'd probably want to go wait-free if I were writing software for a Mars lander, but lock-free is good enough for consumer audio tasks, where mutexes generally aren't.
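The distinction is easiest to see on a shared counter; a minimal sketch, nothing library-specific:

    #include <atomic>

    std::atomic<long> counter{0};

    // Wait-free: fetch_add completes in a bounded number of steps for every
    // thread, no matter what the other threads do.
    void add_wait_free(long delta) {
        counter.fetch_add(delta, std::memory_order_acq_rel);
    }

    // Lock-free: the CAS loop may retry, but only because some other thread
    // just succeeded; the system as a whole always makes progress, and in
    // practice you retry once or twice rather than stalling behind a
    // descheduled lock holder.
    void add_lock_free(long delta) {
        long old = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(old, old + delta,
                                              std::memory_order_acq_rel,
                                              std::memory_order_relaxed)) {
            // 'old' has been refreshed with the current value; just loop.
        }
    }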


Sharding can scale queues horizontally, just as it does with a database.
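A sketch of one way to do that, assuming some SPSC queue type with a push() method (e.g. the SpscQueue sketched earlier in the thread):

    #include <cstddef>
    #include <functional>

    // Shard a stream across K independent single-producer/single-consumer
    // queues. Queue is any type with bool push(const T&).
    template <typename Queue, size_t K>
    void produce(Queue (&shards)[K], long key, long value) {
        size_t shard = std::hash<long>{}(key) % K;  // fixed shard per key,
                                                    // preserving per-key order
        while (!shards[shard].push(value)) { /* full: spin, drop, or block */ }
    }

Each consumer thread then drains exactly one shard, so every queue stays strictly 1-producer/1-consumer while total throughput scales with the shard count.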





