Make your program slower with threads (brooker.co.za)
203 points by bbgm on Dec 7, 2014 | 47 comments



Uh, you can't just drop in 'random_r' in place of 'random'. If you do that, you'll get the same sequence of random numbers in every thread, making the whole multithreading exercise useless!

You have to be sure to seed the generator differently in each thread, before calling 'random_r'.
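
Something along these lines would do it (a minimal sketch, not the article's code; the "base seed + thread index" scheme and the loop body are illustrative assumptions):

    #define _DEFAULT_SOURCE   /* for random_r/initstate_r on glibc */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 4

    static void *worker(void *arg)
    {
        unsigned int seed = (unsigned int)(uintptr_t)arg;   /* distinct per thread */
        struct random_data rd = {0};    /* must be zeroed before initstate_r */
        char statebuf[64];
        int32_t r;
        long odd = 0;

        /* Per-thread state, seeded differently in each thread. */
        initstate_r(seed, statebuf, sizeof(statebuf), &rd);
        for (long i = 0; i < 100000000L; i++) {
            random_r(&rd, &r);
            odd += r & 1;
        }
        printf("seed %u: %ld odd\n", seed, odd);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)(uintptr_t)(0x1234 + i));
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }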


Moreover, he probably improved the single-threaded version too, and should re-check the scaling curve. There should be `1 with random_r`, `2 with random_r`, and `3 with random_r` data points.


It's unlikely the single-threaded version was improved noticeably. Futexes are nearly zero-cost when not contended.


I left the random_r numbers with single threads out of the post for exactly this reason. Performance with random_r on a single thread was less than 3% faster, with a slightly higher variance (which I can't explain, to be honest).


Yes, absolutely. That can be done using the global state (i.e. calling rand() or random()), or is probably better done by pulling from the system entropy pool (via /dev/urandom on Linux).
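
For example, a small helper along these lines (hypothetical, not from the article) can pull a per-thread seed from /dev/urandom before handing it to initstate_r:

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical helper: read one seed from the system entropy pool.
     * Falls back to the clock on error; fine for a simulation seed, not for crypto. */
    static unsigned int urandom_seed(void)
    {
        unsigned int seed = 0;
        FILE *f = fopen("/dev/urandom", "rb");

        if (f == NULL || fread(&seed, sizeof(seed), 1, f) != 1)
            seed = (unsigned int)time(NULL);
        if (f != NULL)
            fclose(f);
        return seed;
    }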


Still no good; you can get spurious correlations between rng outputs on each core. You need something like L'Ecuyer RNG streams.


Well, using /dev/urandom is a vast improvement over doing nothing.

But your point is well taken. People doing serious Monte Carlo simulations should probably use a more sophisticated RNG.


It's just an example.


It should be kept in mind that standard memory (e.g. DDR3) doesn't actually support concurrent accesses, and at least on x86, one or two cores at most can easily saturate the full bandwidth. So any time you have shared data accessed by multiple threads, those accesses are effectively serialised, and that part does not perform any better than the single-threaded case; the biggest speedup occurs when, most of the time, the threads are running in their own caches and not sharing any data.


While you are kind of correct in that a memory channel doesn't do multiple concurrent transfers, single-threaded memory-bound workloads definitely leave significant main memory performance on the table in most cases. (And if not memory-bound, then of course a lot more still)

Main memory is high-latency, and single threads usually don't saturate the bandwidth because they are not issuing back to back pipelined requests to memory. The CPU has a single shared memory controller and it sees requests from all cores/SMTs. A desktop x86 might support 16 outstanding cache misses that might come from different cores or SMT "threads", and is designed to keep the memory channels busy and request pipelines full by scheduling in work from all the cores/SMTs, doing prefetching according to the big picture etc.

Even multiple threads accessing the same data doesn't actually look different from the DIMM point of view, all the x86 synchronization takes place inside the CPU caches and is invisible in the main memory traffic/requests. (This is true even on multi-socket systems, the caches just talk to each other using an inter-CPU bus).

Also, desktop/server systems typically have more than one independent memory channel; beefy servers have a whole lot of them.


> While you are kind of correct in that a memory channel doesn't do multiple concurrent transfers, single-threaded memory-bound workloads definitely leave significant main memory performance on the table in most cases.

You have to distinguish between memory-latency-bound and memory-throughput-bound. If your program has a simple, regular access pattern (which the CPU's prefetcher recognizes), then it's not that hard to saturate the memory bandwidth of a typical two-channel, DDR3-1600 system (25.6 GB/s peak throughput) with a single thread.


In order to get maximum throughput, you also might need to do SIMD loads (and stores).

On Haswell this means two AVX loads, 64 bytes per clock cycle (two 32-byte loads). I think with scalar code you get only 16 bytes per cycle (two 8-byte loads), and 32 bytes per cycle with SSE.

So a single Haswell core at 3 GHz can load about 178 GB/s!

On Ivy Bridge, SSE is enough to saturate the memory bus.

Regardless, that's quite a bit more than what the lower cache levels and DRAM can supply. It's not very hard to saturate the whole CPU-local memory bus with just one core.


> So a single Haswell core at 3 GHz can load about 178 GB/s! On Ivy Bridge, SSE is enough to saturate the memory bus.

I'd claim instead that it's surprisingly hard to saturate the memory bus with a single core[1]. Presuming your math is right, you've given the rate at which Haswell can load from L1 to registers if we ignore latency and concurrency. That is, the CPU can sustain this number only if the latency is low enough relative to the number of "requests in flight". In practice, latency and limited concurrency prevent a single core from saturating the memory bus except in the case of dual-channel memory, perfect prediction and prefetching, and no TLB misses.

More commonly, the latency of the read from memory combined with the limited number (10) of "line fill buffers" reduces the throughput from RAM. The formula for actual bandwidth is known as "Little's Law": Concurrency = Latency * Bandwidth. Considering its fundamental importance in gauging actual performance, it's strikingly rare to see it discussed in CS theory. John McCalpin, the author of the Stream benchmark, has an excellent explanation of how it applies to memory bandwidth: http://sites.utexas.edu/jdm4372/2010/11/03/optimizing-amd-op...

[1] Here's my write up of trying to do so on Sandy Bridge: https://software.intel.com/en-us/forums/topic/480004
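
As a hypothetical back-of-the-envelope (the figures below are assumptions for a generic part, not measurements from this thread): with roughly 80 ns of memory latency, 10 line fill buffers, and 64-byte cache lines, Little's Law caps a single core's demand-miss bandwidth at about 8 GB/s, well below the dual-channel peak mentioned above.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed figures, not measurements. */
        double latency_s  = 80e-9;   /* DRAM access latency */
        double in_flight  = 10.0;    /* line fill buffers (outstanding misses) per core */
        double line_bytes = 64.0;    /* cache line size */

        /* Little's Law: concurrency = latency * bandwidth,
         * so bandwidth = concurrency / latency. */
        double bw = in_flight * line_bytes / latency_s;
        printf("per-core demand bandwidth cap ~= %.1f GB/s\n", bw / 1e9);   /* ~8.0 */
        return 0;
    }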


Your write up was pretty interesting. Especially the instruction mixes, LFENCEs of all things helped?! Thanks.

I haven't tried to optimize fill/copy rate since the late nineties. Getting the last 10-20% appears to be rather tricky. Luckily, maximum bandwidth per core is seldom an issue.

Sometimes I've thought about designs where one core does memory prefetching ahead of a loosely synchronized second core doing the actual computation, to get maximal serial computational throughput. The idea is that the computational core would fetch data from the shared L2/L3 at all times, for those cases when the access patterns are not predictable by the hardware prefetcher, etc.


> LFENCEs of all things helped

I don't really understand what was happening here, but it was a definite positive effect. I think the issue is that the early preloads filled the queue, preventing the stores from happening. The bright side is that getting good performance on Haswell is easier, as it supports 2 loads and 1 store per cycle (although note that address generation is still a bottleneck unless you make sure that the store is using a "simple" address).

> I've thought about designs where one core does memory prefetching ahead of a loosely synchronized second core doing the actual computation

I'm interested in this approach too, but haven't really tried it out. "srean" suggested this also: https://news.ycombinator.com/item?id=8711626

I think it could work well if you know your access patterns in advance; the hard part would be keeping the cores in sync. In the absence of real-time synchronization, you'd need a really high-throughput queue to make this work. Interesting write-up and links here: http://www.infoq.com/articles/High-Performance-Java-Inter-Th... It's a much easier problem if you can batch up the preloads rather than doing them one at a time.

I have had success with a somewhat similar technique within a single core: issue a batch of prefetches, then do the processing on the previous data, then a batch of loads corresponding to the prefetches, then repeat. You can get a 2-3x improvement in throughput on random access if you are able to tolerate the slightly increased latency. The key is figuring out how to keep the maximum number of requests in flight at all times.
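
A minimal sketch of that batching pattern (the batch size, array names, and the summing "work" are illustrative assumptions, not the actual code):

    #include <stddef.h>

    #define BATCH 16   /* assumed; tune to the number of misses the core can keep in flight */

    static long sum_random_access(const long *data, const size_t *idx, size_t n)
    {
        long acc = 0;
        size_t i = 0;

        for (; i + 2 * BATCH <= n; i += BATCH) {
            /* Issue a batch of prefetches for the *next* batch of random locations. */
            for (size_t j = 0; j < BATCH; j++)
                __builtin_prefetch(&data[idx[i + BATCH + j]], 0, 0);

            /* Process the current batch; after the first pass its lines were
             * prefetched on the previous iteration, so many requests stay in flight. */
            for (size_t j = 0; j < BATCH; j++)
                acc += data[idx[i + j]];
        }

        for (; i < n; i++)   /* tail without prefetching */
            acc += data[idx[i]];

        return acc;
    }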


What do you mean when you say that standard memory doesn't support concurrent accesses? Assume you have 2 threads running on 2 (physical) cores on a single-socket machine. In what case would the data access be serialized? How much do you think this serialization would affect the latency of a read from memory?

For current Intel, all cores in a socket share L3 cache, but have independent L1 and L2. If the two are using the same address read-only, both will have the data in their own L1. If they are read-write, performance will drop, but there still won't be any accesses to RAM, and the data will stay in L3.

If you are talking about random access, there can be concurrent access on either 2, 3, or 4 "channels". This is physically partitioned, so sometimes data can truly be accessed in parallel and sometimes it can't. But the maximum wait time for a request to be held up is small relative to the total latency of a memory access.

The "two cores at most" part of the saturation statement is trivially untrue, although correct for many systems. On the Sandy Bridge system I just checked (quad-channel DDR3 1600), I did get a small increase on the Stream benchmark going from 2 to 3 physical cores: 15 GB/s Copy for one core, 23 GB/s for two, 25 GB/s for three.

But this overstates the problem somewhat, as perfectly predictable access with prefetch isn't particularly common. For random access, it is difficult to saturate RAM bandwidth unless you pay very close attention to the parallelization of your requests. For dependent access (each request depends on the previous) I'm doubtful that you'd be able to reach saturation on most processors.

I agree with your sentiment though --- it's tough to get the performance increase you'd hope for just by adding threads, and paying close attention to data access patterns is key. But I'd argue (and I'm sure you'd agree) that managing data access is crucial to single-threaded performance as well.


Sure, but assuming an Intel Haswell i7, L1 cache is 64K (two threads with hyper-threading) -- and the total of L1+L2+L3 is anywhere from 2 to 20 megabytes.

So if you "do little" with "much data", then that (might) matter, whereas if you "do much" with "little data" it might not.


I had restrained myself from responding with a 'this' to @userbinator's comment. What you say is true, but it is not trivial. This is not to say that you implied it to be trivial.

The key is that, with some thought and a change in approach, one can sometimes transform between the two regimes:

    "do little" with "much data"  -->    "do much" with "little data" __at_a_time__
Often it is not at all obvious how to do this. It is fairly common in discussions on parallelism for naysayers to appear, claim that most things/algorithms aren't parallelizable, and leave it at that.

Creativity lies in rethinking algorithms so that the transformation above applies. Sometimes this is reasonably easy, for example for matrix multiplication. It is quite detrimental to write it out as a loop over large rows x large columns; the speedup will be abominable. The solution is also fairly obvious here, which is why it is everyone's favorite example; in general, however, solutions are rarely that obvious. Here we only reordered the data access so that it touches the same things often; there was no fundamental change in the algorithm. These are the easy cases.
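
A minimal sketch of that reordering for matrix multiplication, i.e. classic loop tiling (the tile size is an assumed tuning knob, and this is an illustration rather than anyone's production code):

    #include <stddef.h>

    #define TILE 64   /* assumed; pick so a few TILE x TILE blocks fit in cache */

    /* Computes C += A * B for square n x n row-major matrices (C zero-initialized
     * for a plain product). Working block by block means each block of B and C is
     * reused from cache many times instead of being streamed from memory repeatedly. */
    static void matmul_blocked(const double *A, const double *B, double *C, size_t n)
    {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t jj = 0; jj < n; jj += TILE)
                    for (size_t i = ii; i < ii + TILE && i < n; i++)
                        for (size_t k = kk; k < kk + TILE && k < n; k++) {
                            double a = A[i * n + k];   /* reused across the whole j loop */
                            for (size_t j = jj; j < jj + TILE && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }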

Once I had to fight with my co-worker about how to structure code. I was arguing against building the operation as a huge matrix-times-vector and offloading it to architecture-specific BLAS. I lost that argument; it was not clear at the time which approach would be better. When the implementation was done, the speedup was missing in action. Large Mat x Vecs are just bad primitives to use on modern CPUs (too little computation per datum touched): one lesson learned. After all, BLAS cannot do magic; it's still going to run on the same CPU (unless your BLAS offloads the work to some other sub-architecture like a GPU, and even then bandwidth is a major concern).

Profiling showed that this was choking on memory bandwidth. The bus just cannot keep up with 1 core hammering it with read requests, let alone 4. At this point one could have washed one's hands of it and said that the algorithm is memory-bandwidth limited and that's the end of the story. But those limitations hold only if you have to stick with that particular algorithm or that specific order of data access. Usually those restrictions are artificial.

I had far better luck by breaking it up into pieces so that the next time around most of the data was pulled from cache. This required changing the algorithm a little bit, and also proving that the new algorithm converges to the same thing.

It still pays a significant cost due to synchronization on de-allocation. My next target is to switch to stack-based allocation/deallocation, because the allocation pattern for the most part follows a stack discipline in this application. For multi-threaded allocations that is the best kind of allocation pattern you can have. I wish alloca were included in POSIX and that it returned usable error values; otherwise it's pretty much a case of call and pray.

For compute-heavy stuff, hyperthreads are rarely useful. The bottleneck is typically floating-point operations, and the FPU becomes the scarce resource. Hyperthreading actually ends up hurting because all the threads want to do the same thing: some floating-point operation (ideal for a GPU, not for hyperthreading). Sometimes it is possible to make one hyperthread take care of streaming in the data and uncompressing it in memory while the other takes care of the floating-point operations. I have not tried this out myself, but it ought to work out (in theory :)


Thanks for this comment. That's really interesting. It seems like caching is becoming more and more important at every level of the system. Even GPUs are getting DRAM caches now. So "do much with little data at a time" is the way to go.

P.S. I work on Hadoop. "Do little with much data" is a pretty good high level description of what we do. Our single-node efficiency is not what I'd like it to be, but it's improving with next generation SQL and other execution engines.


Threads can share data in caches (depending on the core and cache level), which is especially true for parallel computing use cases. Hyper threading can run another thread on a core while a thread is stalled on a read, which makes reasoning about performance even more tricky.


I have some experience in this domain and I agree with the author: threads are not a performance cure-all. Many naïve implementations will actually hurt performance[1]. It takes careful thought, experimentation, measurement, and quite a bit of elbow grease to get the expected speedup.

The more general lesson is this: Just because you have very good reasons to believe a change will improve performance, that doesn't mean it will actually improve performance! Benchmark. Profile. Fix the bottleneck. Repeat. That's the only reliable way to make something faster.

1. http://geoff.greer.fm/2012/09/07/the-silver-searcher-adding-...


We're in a strange place with optimization. Hardware folks build very complex memory access paths (multiple caches/multiple accessors) to try and speed up the average case of unoptimized code. Then we try to trick that mechanism into performing for our particular computations. Maybe some programmable architecture would help here? Maybe not.


IOW, programmers assume hardware is simpler than it really is whereas hardware designers assume software is simpler than it really is.


"It's little more than a tight loop around a pseudo-random number generator." One locked against concurrent access.

Yes, put a lock and context switch in a tight inner loop, and your performance will suck.


I think the point is that he didn't know there was a lock. His closing argument seems to be that when you're writing multi-threaded code, you have to think not only about what your own code is doing, but also about what all the library and system calls you use are doing. Which is one of the many reasons multi-threaded code is harder to write than single-threaded.


Yes, I thought rand()'s self-locking was pretty common knowledge, but it is a good example to teach the perils of multithreading because its signature gives no indication that anything bad will happen. strtok() is even worse.

C++11's excellent <random> library represents the PRNG as an object like it should. Besides making the statefulness of PRNGs explicit, its API encourages correct usage - it would be more work to share a PRNG between threads.


> strtok() is even worse.

(Because it has global state.)

There are a lot of C standard library functions from the 1970s that should have been moved to a "deprecated" library around 1990 or so and phased out by now.


Throw zero-terminated strings away too, please. Scanning sequential bytes to find the terminator is very slow and has caused so many bugs.


>I think the point is that he didn't know there was a lock.

One could argue that if you decide to go concurrent for performance reasons it's your responsibility to check for implicit locking behavior. Otherwise you're just doing cargo cult programming.


That is the point of the article. Explaining what you should be doing is the purpose of any tutorial-style article.


I think this is why the actor model, or CSP, is really nice: it makes these things easier to reason about and to trace. In the actor model, you could dispatch the data to each thread and split up the processing.


I assume that you would want to run a tight loop in the actor? Otherwise, you will spend much of your time sending messages, especially for a relatively cheap computation such as computing the next number in a stream of pseudo-random numbers.

If the tight loop is in the actor (or in the process reading from a channel), you have the same problems again.

tl;dr: I don't see how the actor model helps here; it will either be slow or have the same problems (you still need to know which functions lock a data structure).


BTW, you can model a thread-safe random number generator without using a lock, replacing it with a CAS operation on a reference.

You still have problems this way, but at least it is non-blocking, in the sense that if the scheduler blocks any thread for any reason, the other active threads can still make progress. And at least in Java, for a number of threads less than or equal to the number of available cores, it will have better throughput, as locks these days are optimized for the common scenario of adding "just in case" synchronization (i.e. biased towards the thread that last acquired the lock), which hurts performance because of the extra management required.
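
A minimal sketch of the CAS approach in C11 atomics (the xorshift64 step is a stand-in generator chosen for brevity, not the article's PRNG): all threads share one atomic state word and advance it with compare-and-swap instead of a lock.

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint64_t rng_state = 88172645463325252ULL;   /* arbitrary nonzero seed */

    static uint64_t shared_rand_next(void)
    {
        uint64_t old = atomic_load_explicit(&rng_state, memory_order_relaxed);
        uint64_t next;

        do {
            /* xorshift64 step computed on a snapshot of the shared state. */
            next = old;
            next ^= next << 13;
            next ^= next >> 7;
            next ^= next << 17;
            /* On failure, 'old' is refreshed with the current state and we retry. */
        } while (!atomic_compare_exchange_weak_explicit(&rng_state, &old, next,
                                                        memory_order_relaxed,
                                                        memory_order_relaxed));
        return next;
    }

The cache line holding the state still bounces between cores, so throughput remains far below the per-thread-state version; the CAS only removes the blocking.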


While you are correct, your comment kind of misrepresents the article: perhaps a better title might have been "When the difference between 'thread-safe' and 're-entrant' matters"?


The author writes that this is a contrived case but I was doing Monte Carlo simulations 10 years ago, i.e. I wanted exactly this case.

In the end, I just let them run on one thread and could never work out how they could be 10x slower with threading - until today. Thanks.


This exact problem had me worried when I heard about OpenBSD's arc4random and how they've been promoting pervasive use of it. I haven't taken the time to look at how they solve the problem of contention while still maximizing the unpredictability of the random number generator's state by using it.


I don't think threads are all that popular near the OpenBSD group. Most daemons handle multiple connections asynchronously without threads, and if that's not enough, multiple processes may be used to process multiple requests simultaneously.

arc4random() is simply a locking wrapper for the underlying rng.


Aren't you supposed to use arc4random once to seed a high-quality pseudorandom generator? From what I understand, seeding a high-quality PRNG with good source of randomness is really all you need.


arc4random is their high-quality pseudorandom generator that's seeded from a good source of randomness (namely, the getentropy system call).


Yes, but like urandom the general intent (at least when performance is desired) is you use it to seed your own PRNG and avoid making system calls every time you need random data.


The author's multithreaded random_r version has the benefit of performance, but has the problem of, well, brokenness. Without locking the internal state of random_r(), it will be corrupted, and the results won't be meaningful.


Can you explain more? My understanding (which seems to be backed up by the implementation) is that I can call initstate_r in each thread to initialize thread-local state, then call random_r without synchronization from each thread, as long as I use that thread's thread-local state.

random_r's state is, from what I can see of the implementation, entirely passed in by the caller. Implementation here: https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/rand...


You are correct, and 'fche is mistaken. Each 'random_data' object is accessed only from a single thread.


random_r doesn't have internal state.


What about using OpenMP?

My recollection is that I wrote a similar program (statistical bootstrapping, essentially a loop around random()), and using OpenMP on a 4-core machine definitely produced a speedup (not 4x, but close).

Does OpenMP somehow sidestep this issue of shared access? (or is my memory wrong)


My first thought was that this was a silly question, that there is nothing magic about OpenMP, and that the throughput would be the same. But thinking more, this would be a good thing to benchmark.

The difference is that OpenMP will be using multiple processes instead of multiple threads. Since the bottleneck in this case is access to memory in common between the threads, and since different processes won't be sharing this memory, I think OpenMP would indeed have a 4x speedup on this problem on a 4 (physical) core machine if the runtime was long enough to offset the cost of launching the processes with OpenMP.

More generally, it would be worthwhile to benchmark the performance difference between multiple processes created with fork() and multiple threads created with pthread_create(). If there is no need to write to the same memory (as with a bootstrap), the shared-nothing, process-based approach is more likely to get a linear speedup and is usually easier to reason about.
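
A minimal sketch of the shared-nothing, fork()-based variant (hypothetical; the loop body and seeding are illustrative): each child gets its own copy of the C library's PRNG state, so there is no lock contention at all.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NPROCS 4

    int main(void)
    {
        for (int p = 0; p < NPROCS; p++) {
            if (fork() == 0) {                      /* child: private copy of everything */
                srandom((unsigned int)getpid());    /* distinct seed per process */
                long odd = 0;
                for (long i = 0; i < 100000000L; i++)
                    odd += random() & 1;
                printf("child %d: %ld odd\n", (int)getpid(), odd);
                _exit(0);
            }
        }
        for (int p = 0; p < NPROCS; p++)
            wait(NULL);   /* real code would collect results via a pipe or shared memory */
        return 0;
    }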


I don't really see the point of using multithreading for speedup if you have fewer than 10 cores. That example program might run faster with OpenCL.



