C++11 threads, affinity and hyperthreading (thegreenplace.net)
137 points by nkurz on Jan 18, 2016 | hide | past | favorite | 12 comments



> So if you've wondered why hyper-threading is sometimes considered a trick played by CPU vendors, now you know. Since the two threads on a core share so much, they are not fully independent CPUs in the general sense.

> Caches isn't the only thing threads within a core share, however. They also share many of the core's execution resources, like the execution engine, system bus interface, instruction fetch and decode units, branch predictors and so on [3].

I would go even further than this... rather than starting from the idea "they sound separate, but apparently share a lot of stuff", you are probably better off working from the original goal of a hyperthread: sometimes the CPU blocks on something slow, like memory or a dependent register, and it has nothing it can easily reorder to do in the meantime; the idea of hyperthreading was to give the CPU something else to do when it got stuck. From that starting point we would never expect the two threads to have anything at all that they don't share: you are just filling in gaps in the execution of a single core by having the CPU itself implement something akin to a coroutine task scheduler. The issue I take with the mental approach in the article is that it still implies there is at least some parallel computation possible that is merely being limited by how much is shared, but the reality is that hyperthreading has more in common with "green threading" than with real concurrency: there is really only one thing there, pretending to be two things, entirely for the purpose of tricking legacy software into giving it something else to execute.


It's a bit better than green threading, since out-of-order engines are imperfect at extracting parallelism and a second thread can often fill in spare functional units that the first thread isn't using. There's also the danger that the extra cache pressure from two threads on a core will cause enough thrashing that using SMT decreases performance on net. So it's more complicated, but of course you're right that green threading is a much better first approximation than multiple cores. I suspect you already knew that but just didn't want to go into it.

One technique I've heard about is to schedule producer and consumer threads on the same core so that there's zero data transfer overhead and no extra d-cache pressure.


>This is fairly low-level thread control, but in this article I won't detour into higher-level constructs like task-based parallelism, leaving this to some future post.

That's fine, but just as an FYI to others: C++11 also provides some great tools for creating futures and promises that significantly simplify multithreaded programming. In fact it closely resembles the conveniences I came to enjoy working in Clojure, which is considered to have a great approach to threading.


Also, people should look at Intel's TBB library (https://www.threadingbuildingblocks.org/intel-tbb-tutorial); it has been really excellent in my experience using it at a respectably large scale.

edit: formatting


TBB is a little underwhelming in my opinion. Stuff like mutexes, condition variables and even atomics is pretty much standard nowadays (and even if you need portability, you can usually rely on glib, Qt or Boost); thread-safe collections are rarely the right solution, because too many fine-grained locks incur excessive overhead. The work-stealing queue is nice, but if you really want things to scale you need highly-parallel fast paths, with message passing on the slow paths. And this is where TBB falls a bit short of expectations.


I wonder what you find so simplifying about futures and promises? I use them heavily in one of my projects, and compared to old-school mutexes and condition variables they seem to be at the same abstraction level. They can cause very similar problems, like deadlocks or race conditions, with the additional "bonus" of callback hell if you don't use any higher-level libraries.


I just wish GCD was ported everywhere and not just to FreeBSD. It's much easier to work and think in terms of queues and groups than threads.


We've experimented a lot with thread affinity, and our conclusion is that more often than not playing with affinity brings no performance advantage and is very problematic to support across platforms.

For example, on Linux the mask type is cpu_set_t and on BSD it is cpuset_t. On OS X you have to use thread_policy_set and on Windows SetThreadAffinityMask, all with different semantics.

The other problem is that once you play with affinity you have to take care of every thread in your application, because if you leave one thread roaming across all cores your affinity approach is ruined.

Making sure that different steps of the same operation are processed in the same logical thread makes a bigger performance difference than playing with thread affinity.

Last but not least, the code in this article is incorrect. You must first call sched_getaffinity to learn which cores your program is allowed to run on, and only then call pthread_setaffinity_np.


It really made a difference back in Windows NT 3.51 and 4.0.

We used it to keep Apache threads pinned one per core on our servers back in those days.

The scheduler still wasn't optimized for dot-com-era loads, and the Apache threads kept jumping between cores.

Also, Apache itself wasn't that optimized for Windows either.


It was worth reading this article just to discover this lstopo utility


Also worth noting: lstopo -.txt


For Debian users: apt-get install hwloc-nox



