Sure, but assuming an Intel Haswell i7, L1 cache is 64K per core (shared by its two hyper-threads) -- and the total of L1+L2+L3 is anywhere from 2 to 20 megabytes.
So if you "do little" with "much data", then that (might) matter, whereas if you "do much" with "little data" it might not.
I had restrained myself from responding with a 'this' to @userbinator's comment. What you say is true, but it is not trivial. That's not to say that you implied it to be trivial.
The key is that, with some thought and a change in approach, one can sometimes transform between the two regimes:
"do little" with "much data" --> "do much" with "little data" __at_a_time__
Often it is not at all obvious how to do this. It is fairly common in discussions of parallelism for naysayers to show up, claim that most algorithms aren't parallelizable, and leave it at that.
Creativity lies in rethinking algorithms so that the transformation above applies. Sometimes this is reasonably easy, for example for matrix multiplication. Writing it out as a naive loop over large rows x large columns is quite detrimental; the speedup will be abominable. The solution is also fairly obvious here, which is why it is everyone's favorite example; in general, however, solutions are rarely that obvious. Note that all we did was reorder the data accesses so that the same data is touched often; there was no fundamental change to the algorithm. These are the easy cases.
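A rough sketch of the kind of reordering I mean (not from any real codebase; BLK is a made-up tuning knob you'd pick so that a few tiles fit in L1/L2):

    /* Naive i-j-k order streams entire rows/columns through the cache for
     * every output element. The blocked version below works on BLK x BLK
     * tiles, so the same data gets touched many times while it is still hot.
     * C is assumed to be zero-initialized by the caller.
     */
    #define BLK 64

    void matmul_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += BLK)
            for (int kk = 0; kk < n; kk += BLK)
                for (int jj = 0; jj < n; jj += BLK)
                    for (int i = ii; i < ii + BLK && i < n; i++)
                        for (int k = kk; k < kk + BLK && k < n; k++) {
                            double a = A[i * n + k];
                            for (int j = jj; j < jj + BLK && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }

Same arithmetic, same results; only the order in which memory is touched changes.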
Once I had to fight with a co-worker about how to structure some code. I was arguing against building the operation as a huge matrix times vector and offloading it to architecture-specific BLAS. I lost that argument; it was not clear at the time which approach would be better. When the implementation was done, the speedup was missing in action. Large Mat x Vecs are just bad primitives to use on modern CPUs (too little computation per data touched): one lesson learned. After all, BLAS cannot do magic; it's still going to run on the same CPU (unless your BLAS offloads the work to some other sub-architecture like a GPU, and even then bandwidth is a major concern).
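To put my own back-of-envelope numbers on "too little computation per data touched" (textbook flop counts, not measurements from that project):

    #include <stdio.h>

    /* Arithmetic intensity: floating-point operations per element touched.
     * Mat x Vec does ~2*n*n flops over ~n*n + 2*n elements,
     * Mat x Mat does ~2*n*n*n flops over ~3*n*n elements.
     */
    int main(void)
    {
        double n = 4096.0;
        printf("mat-vec: %.2f flops/element\n", 2 * n * n / (n * n + 2 * n));
        printf("mat-mat: %.2f flops/element\n", 2 * n * n * n / (3 * n * n));
        return 0;
    }

Roughly 2 flops per element touched for mat-vec versus a couple of thousand for mat-mat at that size; no library can hide a DRAM stream behind that little arithmetic.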
Profiling showed that this was choking on memory bandwidth. The bus just cannot keep up with 1 core hammering it with read requests, let alone 4. At this point one could have washed one's hands of it and said that the algorithm is memory-bandwidth limited and that's the end of the story. But those limits only hold if you have to stick with that particular algorithm or that specific order of data access. Usually those restrictions are artificial.
I had far better luck by breaking it up into pieces so that the next time around most of the data was pulled from cache.
This required changing the algorithm a little bit, and also proving that the new algorithm converges to the same thing.
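I can't share the actual algorithm, so the following is only the general shape of that transformation (CHUNK and the inner pass count K are made-up knobs; showing that the chunked order still converges to the same answer is exactly the part that takes the extra work):

    #include <stddef.h>

    /* Before: each of the K passes streams the whole array through DRAM.
     *     for (int pass = 0; pass < K; pass++)
     *         for (size_t i = 0; i < n; i++) x[i] = update(x[i]);
     *
     * After: run all K passes over one cache-sized chunk before moving on,
     * so passes 2..K mostly hit L1/L2 instead of the memory bus.
     */
    #define CHUNK (32 * 1024)   /* elements per chunk; pick to fit in cache */

    void sweep_chunked(double *x, size_t n, int K, double (*update)(double))
    {
        for (size_t start = 0; start < n; start += CHUNK) {
            size_t end = start + CHUNK < n ? start + CHUNK : n;
            for (int pass = 0; pass < K; pass++)
                for (size_t i = start; i < end; i++)
                    x[i] = update(x[i]);
        }
    }

For a purely element-wise update the two orders are trivially equivalent; the interesting (and hard) case is when elements depend on their neighbours, which is where the convergence proof comes in.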
It still pays a significant cost due to synchronization on de-allocation. My next target is to switch to stack-based allocation/de-allocation, because the allocation pattern for the most part follows a stack discipline in this application. For multi-threaded allocations, that is the best kind of allocation pattern you can have. I wish alloca were included in POSIX and returned usable error values; as it stands it's pretty much a case of call and pray.
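For what it's worth, what I mean by stack-discipline allocation is roughly the following (a minimal per-thread bump-allocator sketch; the names and the arena size are mine). Because each thread owns its arena, allocation and release involve no synchronization at all:

    #include <stddef.h>

    /* Minimal per-thread stack/bump allocator sketch. Allocation is a
     * pointer bump; "freeing" means restoring an earlier mark, which only
     * works because lifetimes follow a stack discipline.
     */
    #define ARENA_SIZE (1 << 20)

    static __thread unsigned char arena[ARENA_SIZE];
    static __thread size_t top = 0;

    void *arena_alloc(size_t bytes)
    {
        bytes = (bytes + 15) & ~(size_t)15;       /* keep 16-byte alignment */
        if (top + bytes > ARENA_SIZE)
            return NULL;                          /* arena exhausted        */
        void *p = &arena[top];
        top += bytes;
        return p;
    }

    size_t arena_mark(void)        { return top; }  /* remember the depth   */
    void   arena_release(size_t m) { top = m; }     /* pop back down to it  */

Unlike alloca, at least you get a NULL back when the arena runs out instead of a blown stack.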
For compute-heavy stuff, hyperthreads are rarely useful. The bottleneck is typically floating-point operations, and the FPU becomes the scarce resource. Hyperthreading actually ends up hurting, because all the threads want to do the same thing: some floating-point operation (ideal for a GPU, not for hyperthreading). Sometimes it is possible to make one hyperthread take care of streaming in the data and uncompressing it in memory while the other takes care of the floating-point operations. I have not tried this out myself, but it ought to work (in theory :)
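The shape would be something like a double buffer: one thread of the pair streams and decompresses the next block while its sibling does the floating-point work on the current one. A pthreads sketch, with stream_and_decompress() and do_flops() as placeholders for the real work (pinning the two threads to sibling hyperthreads is left out):

    #include <pthread.h>

    #define NBLOCKS 1000
    #define BLOCK   (1 << 16)

    static double buf[2][BLOCK];
    static pthread_barrier_t barrier;

    void stream_and_decompress(double *dst, int block);  /* placeholder */
    void do_flops(const double *src, int block);         /* placeholder */

    static void *producer(void *arg)
    {
        (void)arg;
        for (int b = 0; b < NBLOCKS; b++) {
            stream_and_decompress(buf[b & 1], b);  /* fill buffer for block b */
            pthread_barrier_wait(&barrier);        /* hand it to the sibling  */
        }
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        for (int b = 0; b < NBLOCKS; b++) {
            pthread_barrier_wait(&barrier);  /* wait until block b is ready   */
            do_flops(buf[b & 1], b);         /* crunch b while b+1 streams in */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_barrier_init(&barrier, NULL, 2);
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }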
Thanks for this comment. That's really interesting. It seems like caching is becoming more and more important at every level of the system. Even GPUs are getting DRAM caches now. So "do much with little data at a time" is the way to go.
P.S. I work on Hadoop. "Do little with much data" is a pretty good high level description of what we do. Our single-node efficiency is not what I'd like it to be, but it's improving with next generation SQL and other execution engines.
So if you "do little" with "much data", then that (might) matter, whereas if you "do much" with "little data" it might not.